One of the techniques used in some sub-fields of Artificial Intelligence is training: one feeds one’s apparatus a set of known good associations (e.g. this word sounds like this and that like that, or this is the number 3 in forty different styles of handwriting), sees what answer it returns for each, and then corrects it based on that answer. This obviously requires a large corpus, and building such a corpus is time-consuming and error-prone. Imagine how much work would be involved in gathering several hundred thousand photographs and hand-identifying each person and feature in each photo. It would be mind-numbingly boring; no intelligent person would want to do it, and a low-paid data-entry clerk would likely make many mistakes. But that’s what one has to do.
Until now, that is. For there are now several social networking sites out there; these are ways for one to indicate who one is and who one’s friends are, and see who the friends of friends are and so forth. Many let one upload one’s own photos and indicate who’s in them and (this is key) where. There are also sites to trade and categorise URLs, photos and what-have-you.
Well, this is a researcher’s dream: tens or hundreds of thousands of motivated categorisers. Unlike a bored data-entry clerk, each of these users wants his data to be accurate, and will update it when he notices an error. So now Facebook and Flickr provide a huge photo corpus; Facebook even has photos keyed to unique users! del.icio.us provides tagged access to URLs (useful for grading comprehension of a web page). All these data are just begging to be mined.
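For anyone who hasn't met training before, here is a rough sketch of the feed, check and correct loop, using a perceptron (one of the simplest trainable things; my choice for illustration, nothing to do with any of the sites above). The four hand-written examples are a made-up stand-in for a tagged corpus, and the names predict and train are mine.

```python
# Toy sketch of training: show a known-good example, see what answer
# comes back, and nudge the weights whenever the answer is wrong.
# The data below is invented; a real corpus would be user-tagged items.

def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train(examples, epochs=20, rate=1.0):
    """Perceptron rule: adjust the weights only when the answer is wrong."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)  # 0 if correct
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

# Known good associations: (features, correct answer). Here, logical AND.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(examples)
results = [predict(weights, bias, features) for features, _ in examples]
print(results)
```

The point is only that the loop needs nothing beyond pairs of example and correct answer, which is exactly what all those motivated categorisers are producing.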
There are, of course, some privacy implications …