Hyperspatial text classification

Friday, 04 May 2007 by Robert A. Uhl in tech machine learning

Hyperspatial Text Classification

While reading the docs for CRM114 (a text classification engine; text classification can be used to determine if email is spam; if a log entry is important; or if a newspaper article is worth reading) I discovered that it supports a hyperspatial classifier. It’s a pretty neat idea: a document is broken into its component features (e.g. phrases and individual words; this step is pretty standard for classifiers); each feature is then hashed to a 32-bit integer value; the document is then considered to be a point in a 232-dimensional space — if a feature is present once, then the value of that dimension is one; if twice, then two and so forth.

So documents are points in this 4,294,967,296-dimensional space; what’s this buy? Well, imagine that every already-classified document is a star emitting light, and that an unknown-class document is a planet receiving light from all stars. One simply adds up the light each class sheds on the planet (nearer stars are brighter; those further away are dimmer); whichever class sheds the most light is the class of the document in question.

This sounds very complex, but it turns out to be very easy to represent and calculate. A document is represented by a sorted list of integers; each integer is the hash of a particular feature; only those features which are present are listed (this saves space since the vast majority of the 4,294,967,296 possible features are absent in any one document). To calculate the difference between two documents, just walk two indices along them, keeping track of features found in one, the other or both.

I’ve already got some working code which I’m training to recognise plain text versus HTML. We’ll see how good I can get it …