Searching for the Moon

Shannon Clark's rambles and conversations on food, geeks, San Francisco and occasionally economics

my next hack – hacking Netflix

Posted by shannonclark on October 2, 2006

With their permission that is

Netflix has just announced the Netflix Prize, which will award $1M for a 10% improvement on their recommendation engine. The contest is based on a dataset of over 100M movie ratings, which they are making available for research purposes to anyone who registers to participate.

When I founded JigZaw Inc in 2000 I embarked on many years of research into various aspects of machine learning and AI. My initial focus was on automated data acquisition: techniques for automating the understanding of data structures (especially in semi-structured data such as web pages), as well as techniques for extracting that data and converting it into “real” structured data. Beyond those techniques I also started a lot of research into data clustering methodologies, with my interest focusing mostly on some fairly complex ways of automatically clustering data into data-driven categories (including the possibility of overlapping categories). I was and am less interested in the “postal code” type of clustering, where the categories are known ahead of time, are fixed and are usually unitary, i.e. a specific element can be placed into one and only one category.

What I’m more interested in is the much harder problem of automatic, data-driven clustering: clusters that are properties of the dataset and arise naturally through the data analysis, not from a priori defined categories or cluster types.
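To make the distinction concrete, here is a toy sketch of clustering where neither the number of clusters nor their membership is fixed ahead of time, and where an item may land in more than one cluster. It is only an illustration, not one of the approaches I have in mind for the contest: the function name `overlapping_clusters`, the `radius` parameter and the sample points are all made up for this example, and the result depends on the order in which points are visited.

```python
from math import dist  # Euclidean distance, Python 3.8+


def overlapping_clusters(points, radius):
    """Group points into clusters that are allowed to overlap.

    A point joins every cluster whose seed lies within `radius`.
    A point that is near no existing seed starts a new cluster,
    so the number of clusters comes from the data, not from the caller.
    """
    seeds = []     # one seed point per cluster
    clusters = []  # clusters[i] holds the indices of the points in cluster i
    for idx, point in enumerate(points):
        joined = False
        for c, seed in enumerate(seeds):
            if dist(point, seed) <= radius:
                clusters[c].append(idx)
                joined = True
        if not joined:
            seeds.append(point)
            clusters.append([idx])
    return clusters


# Hypothetical 2-D feature vectors; the last one sits between two groups.
points = [(0.0, 0.0), (0.5, 0.0), (4.0, 0.0), (4.5, 0.0), (2.0, 0.0)]
clusters = overlapping_clusters(points, radius=2.5)
print(clusters)
```

With these made-up points the last item ends up in both clusters, which is exactly what a “postal code” style scheme cannot express.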

But there has always been a very real lack of serious datasets to test my theories upon, so I haven’t done much with them for many years.

Netflix’s announcement changes all of this. With a single, well thought out action, they have made a very large (and furthermore mostly real) dataset available to nearly anyone (if you live in certain countries or in Quebec you aren’t eligible to participate). I plan on registering, downloading the dataset and exploring it, even if I don’t seriously enter the competition.

Though, that said, I do think I have a number of approaches and techniques that would achieve very real and valid results.

But I do have a couple of procedural questions as well as some real concerns. First and foremost, while I applaud them for the very real steps they are taking to preserve users’ privacy, by modifying the data in a variety of ways they cloud the validity of the data and embed certain assumptions into the contest (some of which I had planned on questioning in a few of my approaches).

This is not all of those approaches, but for example, by modifying in some cases the date when a rating was made they change, in unknown ways, the temporal factors implicit in those ratings. One testable assumption might be that people who typically watch movies over the weekend (and rate/return them early in the week) have very real and measurable differences from people who primarily watch movies during the week and return them at any time. Some possibly calculable measures would also be worth testing, such as whether there is a correlation between how long someone kept a given movie and how positively or negatively they rated it. (I know from my own experience, when my ex-girlfriend had a Netflix subscription, that certain types of movies, often ones we felt we “had” to watch but generally didn’t really love, might sit unwatched for weeks or in a few cases many months.) Time might also be a proxy for other, more typical factors, such as the differences between a single mother renting for herself as well as for her young children and a single person renting mostly for weekend (or less commonly weekday) movie watching with a partner.
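As a rough sketch of how the weekend versus weekday assumption above might be checked, the snippet below splits ratings by the day of the week on which they were made and compares the average rating in each group. The record layout (user, movie, stars, date), the sample rows and the choice of Monday and Tuesday as “early in the week” are all assumptions of mine for illustration; I have not seen the actual file format yet, and any perturbed dates would blur exactly this kind of comparison.

```python
from datetime import date
from statistics import mean


def mean_rating_by_day_group(ratings):
    """Average star rating for early-week ratings versus all others.

    `ratings` is an iterable of (user_id, movie_id, stars, rating_date).
    Monday and Tuesday count as "early_week", a stand-in for people
    who watched over the weekend and rated shortly afterwards.
    """
    groups = {"early_week": [], "other": []}
    for _user, _movie, stars, day in ratings:
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        key = "early_week" if day.weekday() in (0, 1) else "other"
        groups[key].append(stars)
    # Skip empty groups so mean() is never called on an empty list.
    return {name: mean(values) for name, values in groups.items() if values}


# Made-up sample rows, not taken from the Netflix dataset.
sample = [
    (1, 10, 4, date(2006, 10, 2)),  # Monday
    (1, 11, 5, date(2006, 10, 3)),  # Tuesday
    (2, 10, 2, date(2006, 10, 5)),  # Thursday
    (2, 12, 3, date(2006, 10, 7)),  # Saturday
]
result = mean_rating_by_day_group(sample)
print(result)
```

A real test would of course need far more data and a proper significance check; this only shows the shape of the grouping.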

Anyway, I encourage the research inclined among you to check out Netflix’s announcement. As I announced a few weeks back on MeshForum, MeshForum is looking to work with companies on the creation and release of network datasets for general and broad research projects. Netflix’s model is a good one, though I also hope they make the full data collections available to anyone interested in research (and/or allow them to be mirrored in dataset archives such as the one MeshForum is looking to build). MeshForum’s mission is also to encourage companies to do more than a single, one-time release of data; rather, we’re looking to support and encourage companies to make large network datasets available on a regular and recurring basis (a delay of one to two quarters being perhaps a good basic model to consider).

If you are interested in working with me on this project please leave a comment with your contact information or feel free to contact me directly.

One Response to “my next hack – hacking Netflix”

  1. […] For the desktop system(s) I will be using this as my primary hub, a repository for lots of data, a testbed for visualization techniques, a development/staging server for some light programming (see my recent post on the Netflix Prize for one such use). I plan on running multiple OSes (either via dual boot or tools such as Parallels). In addition to research, writing and blogging from this system, I also anticipate doing some light audio recording/editing, some video editing, and much increased use of digital photos. And while it won’t be my primary use, I also do hope to catch up on the games I have missed playing for many years and perhaps dip into Second Life and WoW. […]
