Hyperspatial Text Classification
While reading the docs for CRM114 (a text classification engine; text classification can be used to decide whether an email is spam, whether a log entry is important, or whether a newspaper article is worth reading), I discovered that it supports a hyperspatial classifier. It’s a pretty neat idea. A document is broken into its component features (e.g. phrases and individual words; this step is pretty standard for classifiers), and each feature is hashed to a 32-bit integer value. The document is then treated as a point in a 2^32-dimensional space, with one dimension per possible hash value: if a feature is present once, the value of that dimension is one; if twice, two; and so forth.
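To make the mapping concrete, here is a minimal Python sketch of the idea as described above. It is not CRM114's actual code: the tokenizer, the choice of adjacent word pairs as "phrases", the use of CRC32 as the 32-bit hash, and the names `features`, `hash32`, and `to_point` are all assumptions made for illustration. Since almost every one of the 2^32 coordinates is zero, the point is stored sparsely as a map from dimension index to count.

```python
import re
import zlib
from collections import Counter


def features(text):
    """Split text into words plus adjacent word pairs (an assumed feature set)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def hash32(feature):
    """Hash a feature to an integer in [0, 2**32). CRC32 is a stand-in hash."""
    return zlib.crc32(feature.encode("utf-8")) & 0xFFFFFFFF


def to_point(text):
    """Return the document as a sparse point: {dimension index: occurrence count}."""
    return Counter(hash32(f) for f in features(text))


point = to_point("spam spam eggs")
# The word "spam" occurs twice, so its dimension holds at least 2.
print(point[hash32("spam")])
```

Two different features can hash to the same dimension, in which case their counts simply add together; with 2^32 possible values this should be rare for ordinary documents.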