my next hack – hacking Netflix
Posted by shannonclark on October 2, 2006
With their permission that is
Netflix has just announced the Netflix Prize, which will award $1M for a 10% improvement on their recommendation engine, based on a dataset of over 100M movie ratings that they are making available for research purposes to anyone who registers to participate.
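To make "a 10% improvement" concrete: the contest rules score predictions by root mean squared error (RMSE) against held-back ratings, so the target is a 10% reduction in that error relative to Netflix's own system. The sketch below shows only the arithmetic. The function names and the sample ratings are made up for illustration and are not taken from the Netflix data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def percent_improvement(baseline_rmse, new_rmse):
    """Relative reduction in error, as a percentage of the baseline error."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical ratings on a 1-5 scale (illustrative only).
actual = [4, 3, 5, 2, 1]
baseline = [3.5, 3.5, 3.5, 3.5, 3.5]   # e.g. "always guess the middle"
candidate = [3.9, 3.2, 4.4, 2.5, 1.8]  # some better-informed guesses

gain = percent_improvement(rmse(baseline, actual), rmse(candidate, actual))
print(f"improvement over baseline: {gain:.1f}%")
```

Because the measure is relative, how hard a 10% gain is depends entirely on how good the baseline already is.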
When I founded JigZaw Inc in 2000, I embarked on many years of research into various aspects of machine learning and AI. My initial focus was on automated data acquisition: techniques for automating the understanding of data structures (especially from semi-structured data such as web pages), as well as techniques for extracting that data and converting it into “real” structured data. Beyond those techniques, I also started a lot of research into data clustering methodologies and approaches, with my interest focusing mostly on some fairly complex ways of automatically clustering data into data-driven categories (including the possibility of overlapping categories). I was and am less interested in the “postal code” type of clustering, where the categories are known ahead of time, are fixed, and are usually unitary, i.e. a specific element can be placed into one and only one category.
What I’m more interested in is the much harder problem of automatic, data-driven clustering: clusters that are properties of the dataset and arise naturally through the data analysis, not from a priori defined categories or cluster types.
But there has always been a very real lack of serious datasets to test my theories on, so I haven’t done much with them for many years.
Netflix’s announcement changes all of this. With a single, well-thought-out action, they have made a very large (and, furthermore, mostly real) dataset available to nearly anyone (if you live in certain countries or in Quebec, you aren’t eligible to participate). I plan on registering, downloading the dataset, and exploring it, even if I don’t seriously enter the competition.
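As a toy illustration of the difference (not one of my own techniques, just a deliberately simple stand-in), here is a threshold-based clustering pass in which the number of clusters falls out of the data and a point can land in more than one cluster. The function name, the `radius` parameter, and the sample points are all assumptions made for this sketch.

```python
import math


def overlapping_clusters(points, radius):
    """Threshold clustering that allows overlapping membership.

    Pass 1: a point becomes a new seed if it is farther than `radius`
            from every seed chosen so far.
    Pass 2: every point joins every cluster whose seed is within `radius`.
    Returns one list of point indices per cluster.
    """
    seeds = []
    for p in points:
        if all(math.dist(p, s) > radius for s in seeds):
            seeds.append(p)
    return [
        [i for i, p in enumerate(points) if math.dist(p, s) <= radius]
        for s in seeds
    ]


# Hypothetical 2-D points: index 2 sits between the first two seeds.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
clusters = overlapping_clusters(points, radius=1.2)
print(clusters)
```

Nothing here says how many clusters to expect or what they mean; both come from the points and the chosen radius. The result still depends on the order of the points and on `radius`, which is one reason real approaches to this problem are a good deal harder.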
Though, that said, I do think I have a number of approaches and techniques that would achieve very real and valid results.
But I do have a couple of procedural questions as well as some real concerns. First and foremost, while I applaud them for the very real steps they are taking to preserve users’ privacy, by modifying the data in a variety of ways they cloud the validity of the data and embed certain assumptions into the contest (some of which I had planned on questioning in a few of my approaches).
This does not cover all of those approaches, but, for example, by modifying in some cases the date when a rating was made, they change in unknown ways the temporal factors implicit in those ratings. One testable assumption might be that people who typically watch movies over the weekend (and rate/return them early in the week) have very real and measurable differences from movie watchers who primarily watch movies during the week, returning them anytime. Some possibly calculable measures would also be worth testing, such as whether there is a correlation between how long someone kept a given movie and how positively/negatively they rated it. (I know from my own experience, when my ex-girlfriend had a Netflix subscription, that certain types of movies, often ones we felt we “had” to watch but generally didn’t really love, might sit unwatched for weeks or, in a few cases, many months.) Time might also be a proxy for other, more typical factors, such as the differences between a single mother renting for herself as well as for her young children and a single person renting mostly for weekend (or, less commonly, weekday) movie watching with a partner.
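A rough sketch of how those two checks might look, assuming ratings arrive as (customer, rating, date) records and that a "days kept" figure were available (I don't know that the released data includes one). The record layout, function names, Monday/Tuesday cutoff, and sample values are all hypothetical. If the dates have been perturbed, any result from a test like this has to be read with that in mind.

```python
import math
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_rating_by_customer_type(ratings, threshold=0.5):
    """Label a customer 'early_week' if more than `threshold` of their
    ratings fall on a Monday or Tuesday, then compare mean ratings."""
    per_customer = defaultdict(list)
    for customer, rating, day in ratings:
        per_customer[customer].append((rating, day.weekday() in (0, 1)))
    groups = {"early_week": [], "other": []}
    for rows in per_customer.values():
        share = sum(1 for _r, early in rows if early) / len(rows)
        key = "early_week" if share > threshold else "other"
        groups[key].extend(r for r, _early in rows)
    return {k: mean(v) for k, v in groups.items() if v}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical records: customer "a" rates on Mon/Tue, "b" later in the week.
sample = [
    ("a", 5, date(2006, 10, 2)),
    ("a", 4, date(2006, 10, 3)),
    ("b", 2, date(2006, 10, 5)),
    ("b", 3, date(2006, 10, 7)),
]
by_type = mean_rating_by_customer_type(sample)

# Hypothetical days-kept vs. rating pairs for the second question.
days_kept = [2, 5, 14, 40, 90]
stars = [5, 4, 4, 2, 2]
r = pearson(days_kept, stars)
print(by_type, round(r, 2))
```

A difference in group means or a nonzero correlation on a handful of rows proves nothing, of course; on the real dataset you would want a proper significance test before reading anything into it.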
Anyway, I encourage the research inclined among you to check out Netflix’s announcement. As I announced a few weeks back on MeshForum, MeshForum is looking to work with companies on the creation and release of network datasets for general and broad research projects. Netflix’s model is a good one, though I also hope they make the full data collections available to anyone interested in research (and/or allow them to be mirrored in dataset archives such as the one MeshForum is looking to build). MeshForum’s mission is also to encourage companies to do more than a single, one-time release of data; rather, we’re looking to support and encourage companies to make large network datasets available on a regular and recurring basis (one to two quarters delayed being perhaps a good basic model to consider).
If you are interested in working with me on this project please leave a comment with your contact information or feel free to contact me directly.