Searching for the Moon

Shannon Clark's rambles and conversations on food, geeks, San Francisco and occasionally economics

Posts Tagged ‘history’

13 questions for Twitter and the Library of Congress

Posted by shannonclark on April 15, 2010

Earlier this week Twitter announced that they had donated a copy of their entire corpus, dating back to the first Tweet, to the Library of Congress. The Library of Congress tweeted the announcement and wrote about it on the LOC Blog. Historians, social scientists and many other researchers will soon (if access is made available) have a truly unique corpus of data on a global scale about individual expression, reactions to real time events and much more. Yet there are a lot of questions and nuances.
A few that come immediately to mind.
  1. How are deleted accounts handled – both from the past and into the future? (As a historian I think they should – though this may be controversial – continue to be archived and preserved as part of a moment in time)
  2. Is it noted when an account changed ownership? (i.e. it was established and then later either lapsed and was taken by another party once Twitter released it, or was transferred to another person due to a court order – for example if it was deemed to have been infringing someone’s trademark)
  3. Are accounts which are set private part of this corpus of data? Their announcement notes that this was a donation of “public tweets” but what about accounts whose status has changed over time? Are tweets which were made while an account was private, but which was later set public, noted as having been sent when the account was private? Or conversely are public tweets preserved if they were public when they were sent even if the account is later reset to be private?
  4. Does the corpus include the changelog of each account? i.e. could the “following” and “followers” of a given account be recreated at a particular point in time & later analyzed for changes over time? (Who someone was following / who was following them is not a minor matter at all for a lot of academic research questions, or even just for historical interest.)
  5. Are spammer accounts which were created, detected & deleted part of this corpus? On the one hand their presence would complicate a lot of academic studies (they would inflate a lot of results, since many spammers spammed via retweets etc.), but on the other hand the false positives and the relative ratios of “spammer” accounts to “real” accounts would be pretty interesting to study. Especially as Twitter’s ability to detect spammers got better, it would be useful to revisit moments in time (such as @aplusk & @cnn’s race to 1M followers) to analyze what percentage of their “followers” were spammers, what percentage were accounts that hadn’t yet seen much usage and didn’t see much afterwards, etc.
  6. Will the corpus include Direct Messages? These are private, but they are still pretty crucial historical documents in many cases. The DMs to public officials, for example, could arguably already be required to be publicly disclosed.
  7. Will the corpus include elements of Twitter which are no longer part of Twitter (for example people’s Track settings)?
  8. Will there be an attempt by Twitter (or by the Library of Congress) to pull an Internet Archive move (or partner with them) to resolve:
    1. Links to images, videos, music and other media?
    2. URL resolution? (Both archiving the state of the page as it was when it was tweeted out, which may now be impossible to replicate, and especially at least resolving, when still possible, what a shortened URL resolves to.)
  9. Will the corpus include people’s Avatar images (which have in many cases changed over time), their bios, URLs, Locations and Twitter website background and other settings? (Not just private/public, but have they linked a phone number to their account? Have they set anyone whom they are following to be delivered via SMS? etc.)
  10. Will it archive Lists from the point when they were introduced? (and with Lists will it track how those lists were created over time?)
  11. Will the corpus include noting which accounts were blocking other accounts? (And, from when Twitter rolled that feature out, when accounts were marked as being spam?) In some cases people who were not spammers were marked as spam by a few users, I’m sure, and in some cases may have been flagged and later reinstated. Will the corpus track stuff such as that?
  12. As Twitter added features (and changed others) will the corpus reflect those changes? (Retweets, for example, and more recently a lot of changes around Geo data, and very soon a whole lot more metadata for every Tweet.)
  13. Will the corpus attempt to reflect other public faces of Twitter? For example logs of searches which people performed on Twitter, or who was on the “Suggested User List” at a given point in time, or what was shown to users as “trending topics” over time, etc.
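
To make question 4 a bit more concrete, here is a minimal sketch of how a follow changelog, if one were included, could be replayed to recreate an account's followers at a given moment. Everything in it is an assumption for illustration: the record layout (timestamp, follower, followed, action), the names `followers_at` and `SAMPLE_LOG`, and the sample data are hypothetical, and nothing in the announcement says the corpus contains such a log.

```python
from datetime import datetime

# Hypothetical changelog records: (timestamp, follower, followed, action).
# This layout is an assumption; the announcement describes no such format.
SAMPLE_LOG = [
    (datetime(2009, 1, 1), "alice", "cnn", "follow"),
    (datetime(2009, 2, 1), "bob", "cnn", "follow"),
    (datetime(2009, 3, 1), "alice", "cnn", "unfollow"),
    (datetime(2009, 3, 5), "carol", "aplusk", "follow"),
]


def followers_at(changelog, account, when):
    """Replay follow/unfollow events up to `when` and return the set of
    accounts following `account` at that moment."""
    followers = set()
    # Process events in chronological order.
    for timestamp, follower, followed, action in sorted(changelog, key=lambda e: e[0]):
        if timestamp > when:
            break  # everything after this point is in the "future"
        if followed != account:
            continue
        if action == "follow":
            followers.add(follower)
        elif action == "unfollow":
            followers.discard(follower)
    return followers


if __name__ == "__main__":
    # Who followed the (made-up) "cnn" account in mid-February 2009?
    print(sorted(followers_at(SAMPLE_LOG, "cnn", datetime(2009, 2, 15))))
```

Comparing the results for two different dates would then give the changes over time that a researcher might care about. Without some kind of event log like this, only the final state of each account's followers could be studied.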
Lots of questions – but mostly I’m very excited.

Beyond preserving what is, I think we can all agree, a very real historical (and ongoing) document, I hope that this move is just the first of many. This archive should be widely available, at least to be preserved for the future, and it should, I hope, be made available to lots of academic researchers in the near future.

In their announcement Twitter notes that there will be a 6 month delay in what is available, which I think is unfortunate, and they are restricting it to “non-commercial researchers”, which I think is also unfortunate as the line between commercial and non-commercial is never entirely clear. I also believe that there are many non-obvious uses for this corpus of data in a wide array of research fields. Beyond serving as a historical document, this corpus could help many fields of study such as linguistics, AI research and much more.

Posted in futureculture, internet, networks

Celebrating Ada Lovelace Day – thanks Mom

Posted by shannonclark on March 24, 2010

My mom taught me computer programming when I was 8.

Today (March 24th) is Ada Lovelace Day, a day to celebrate the impact of women in technology, especially computer programming. The idea is for people to blog about their favorite tech heroine.

For me, it is my Mom, Nancy Clark. Not only did I learn programming and flowcharting from her when I was 8 years old, doing the flowcharting homework she was assigning to the computer programming class she taught at a local college while I rode in the back seat of the family Volvo, but she was also in many important ways a computing pioneer.

She started programming in the late 1960s after graduating early from the University of California, Berkeley. She worked while she followed my father around the country. But her career was impressive. At Southern Pacific Railroad she was part of the team that “computerized” the whole railroad in the early 1970s, an initiative which led to great profitability. (See http://www.wprrhs.org/wphistory_80candles/wphistory_80candles.html for a history of Western Pacific and then Southern Pacific railroads; look in particular at the history in the early 1970s as the railroad computerized. That was the work of my mom.)

When my father took a position as a professor at Virginia Tech (where I was born a few years later) my mom took a position at the university helping to write the software which would run the entire university administration. This included a very early experiment in e-commerce where she attempted to tie the bookstore’s ordering systems to the campus class registration system so that the bookstore would order the right number of books for each class (at the time, however, this great early idea didn’t work very well).

Years later we moved to Chicago (after a few years in New York) where my mom, in an early example of a career path now familiar to millions, was an independent computer consultant. At times she held multiple jobs, working as a computer consultant for a few clients while also teaching computer science at a local college. But she was always there for my sister and me, and she set an example that women could have complex, technology-based careers. Careers which were challenging and intellectual.

In talking with my grandfather in the past few years I have learned that I am a third generation computer person. My grandfather, in the course of his career, worked with and deployed some of the most complex computer installations of the day. He was literally a bit of a rocket scientist (he was trained as an aerospace engineer, designed jets for many years and then for 20+ years worked for Aerospace Corporation where he headed up their work for the US government deploying & designing satellites, mostly spy satellites). His first use of computers came while working as an early employee at Rand Corporation, trying to mathematically model flight. Then years later at Aerospace Corporation he deployed pairs of IBM mainframes across the globe to track and find nuclear explosions.

But most of his work was classified, and though I’m sure some of his engineering focus rubbed off on my mom, mostly, as I understand it, his work was a mystery to my mom and my aunt (and my grandmother).

So the credit for my mom’s technical expertise and 30+ year career as a computer programmer and consultant lies entirely with her and her ongoing drive to educate herself and to learn new technologies, as well as to remain a master of the older systems she helped write and design.

She mostly worked in the less well known types of computer programming: business languages such as Focus, used by firms such as actuarial firms to manage large pension plans. But her work managed very complex systems and in many cases helped form the base upon which the modern, always-on Internet technology world is built.

So she is my heroine today (and always).

Thanks, Mom!

Posted in geeks, internet, personal