- How are deleted accounts handled – both from the past and into the future? (As a historian I think they should – though this may be controversial – continue to be archived and preserved as part of a moment in time)
- Is it noted when an account changed ownership? (i.e. it was established and then later either lapsed and was taken by another party once Twitter released it, or was transferred to another person due to a court order, for example if it was deemed to have been infringing someone’s trademark)
- Are accounts which are set private part of this corpus of data? Twitter’s announcement notes that this was a donation of “public tweets”, but what about accounts whose status has changed over time? Are tweets which were made while an account was set private, but which was later set public, noted as having been sent when the account was private? Or conversely, are tweets preserved if they were public when they were sent, even if the account was later reset to be private?
- Does the corpus include the changelog of each account? i.e. could the “following” and “followers” of a given account be recreated at a particular point in time and later analyzed for changes over time? (Who someone was following and who was following them is not a minor matter at all for a lot of academic research questions, or even just for historical interest.)
- Are spammer accounts which were created, detected and deleted part of this corpus? On the one hand, their presence would complicate a lot of academic studies (they would inflate the numbers in a lot of studies, since many spammers spammed via retweets etc). On the other hand, studying false positives and the relative ratios of “spammer” accounts to “real” accounts would be pretty interesting, especially as Twitter’s ability to detect spammers got better. It would be useful to revisit moments in time (such as @aplusk & @cnn’s race to 1M followers) to analyze what percentage of their “followers” were spammers, what percentage were accounts that hadn’t yet seen much usage (and never did afterwards), etc.
- Will the corpus include Direct Messages? They are private, but they are still pretty crucial historical documents in many cases. The DMs sent to public officials, for example, could arguably already be required to be publicly disclosed.
- Will the corpus include elements of Twitter which are no longer part of Twitter? (for example people’s Track settings)
- Will there be an attempt by Twitter (or by the Library of Congress) to pull an Internet Archive move (or partner with them) to resolve:
  - Links to images, videos, music and other media?
  - URL resolution? Ideally this would archive the state of the page as it was when it was tweeted out (which may now be impossible to replicate), and at the very least it would record, when still possible, what each shortened URL resolves to.
- Will the corpus include people’s Avatar images (which have in many cases changed over time), their bios, URLs, Locations, Twitter website background and other settings? (Not just private/public, but have they linked a phone number to their account? Have they set anyone whom they are following to be delivered via SMS? etc)
- Will it archive Lists from the point when they were introduced? (and with Lists will it track how those lists were created over time?)
- Will the corpus note which accounts were blocking other accounts? (And, once Twitter rolled that feature out, when accounts were marked as being spam?) In some cases people who were not spammers were, I’m sure, marked as spam by a few users, and in some cases accounts may have been flagged and later reinstated. Will the corpus track things such as that?
- As Twitter added features (and changed others), will the corpus reflect those changes? (Retweets, for example, and more recently a lot of changes around Geo data, and very soon a whole lot more metadata for every Tweet)
- Will the corpus attempt to reflect other public faces of Twitter? For example, logs of searches which people performed on Twitter, who was on the “Suggested User List” at a given point in time, or what was shown to users as “trending topics” over time, etc.
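
To make the changelog question above a bit more concrete: if the corpus did include timestamped follow and unfollow events, recreating who an account was following at a given moment would be a simple replay of those events. The sketch below is purely illustrative. The `FollowEvent` record, its field names, the `following_at` helper and the sample accounts are all my own assumptions; nothing in the announcement says the archive stores data in this form, or stores it at all.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FollowEvent:
    """Hypothetical changelog record; not a documented archive format."""
    timestamp: datetime
    follower: str   # account performing the action
    followed: str   # account being followed or unfollowed
    action: str     # "follow" or "unfollow"


def following_at(events, account, moment):
    """Replay events in time order and return who `account` followed at `moment`."""
    following = set()
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.timestamp > moment:
            break  # everything after this point is in the future
        if event.follower != account:
            continue
        if event.action == "follow":
            following.add(event.followed)
        elif event.action == "unfollow":
            following.discard(event.followed)
    return following


# Made-up sample data, for illustration only.
events = [
    FollowEvent(datetime(2009, 1, 1), "alice", "bob", "follow"),
    FollowEvent(datetime(2009, 2, 1), "dave", "alice", "follow"),
    FollowEvent(datetime(2009, 3, 1), "alice", "carol", "follow"),
    FollowEvent(datetime(2009, 6, 1), "alice", "bob", "unfollow"),
]

print(sorted(following_at(events, "alice", datetime(2009, 4, 1))))  # ['bob', 'carol']
print(sorted(following_at(events, "alice", datetime(2009, 7, 1))))  # ['carol']
```

The same replay, keyed on `followed` instead of `follower`, would give an account’s followers at a point in time. Without the events themselves (or periodic snapshots), none of this can be reconstructed after the fact, which is why the question matters.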
Beyond preserving what is, I think we can all agree, a very real historical (and ongoing) document, I hope that this move is just the first of many. This archive should be widely available, at least so that it is preserved for the future, and it should, I hope, be made available to lots of academic researchers in the near future.
In their announcement Twitter notes that there will be a six-month delay in what is available, which I think is unfortunate, and that they are restricting it to “non-commercial researchers”, which I think is also unfortunate, as the line between commercial and non-commercial is never entirely clear. I also believe that there are many non-obvious uses for this corpus of data in a wide array of research fields. Beyond serving as a historical document, this corpus could help many fields of study such as linguistics, AI research and much more.