This week, if you remember, we are discussing Text Categorization with Support Vector Machines: Learning with Many Relevant Features - Joachims (ResearchIndex), which is a paper on Text Categorization (one of the most cited such papers on Google Scholar under the Text Categorization search).

Text Categorization is the problem of assigning one or more predefined categories to a piece of text. The Support Vector Machines approach is a supervised machine learning approach that attempts to learn a linear classifier (actually, it can be polynomial or other type using plugin functionality to the algorithm.) It is actually trying to find a “linear separating hyperplane (see guide for more info on the math.) For an implementation in Java (and other languages) check out http://www.csie.ntu.edu.tw/~cjlin/libsvm/. This site has many useful resources explaining the algorithm, how to do feature selection, etc. (see the guide.)  In two dimensional space, think of finding a line (or polynomial) that separates your good examples from your bad ones.
After introducing the topic of text categorization, the author discusses a bit on feature selection. A feature vector in the paper is a vector of distinct words and their document frequency, minus stop words. The author also throws out any words that occur less than 3 times. I understand the stop word removal piece, but I don’t get the less 3 times reasoning. My guess is it is because those terms don’t significantly contribute to the solution. Finally, features are scaled by their inverse doc. frequency. This all makes sense to me coming from IR (Info. Retrieval land.) Removing stopwords and doing length normalization are standard techniques for improving search results.
The cool thing Joachims points out about SVMs is that they “can be independent of the dimensionality of the feature space”. In other words, SVMs work pretty well with a lot of features or very few features.

In section 4, Joachims makes a very strong case for why SVMs are well suited for Text Categorization.  Summed up they are:

  1. Text can have many features (10k+ words in a collection)
  2. Most features are important
  3. Doc. vectors are sparse.  That is, most words in the collection do not occur in a particular document
  4. Most Text Cat. problems are linearly separable.  This is the key idea behind why they work.

The rest of the paper is a discussion of experiments and why SVMs are much better than the other popular approaches in use at the time.  Most notably, if you remember our discussion of Yang 97 from last week, SVMs beat up on kNN quite handily.
computer science, algorithms, support vector machines, text categorization, machine learning, supervised, libSVM

Popularity: 5% [?]

Technorati Tags: , , , , , ,