Whew, I think we’ve made it through our first paper, or we are about to anyway.  If you recall, we are working our way through Yang 97 and have so far made it through the first four sections, which are covered here.  That leaves us with the meat of the paper: the actual experiments.

I found sections 5 through 7 to be pretty straightforward.  The author runs a variety of classifiers on a number of different corpora.  kNN seems to hold up the best and seems to be the only approach that really scales.  Of course, keep in mind this was 1997, so this is probably no longer the case.  Perhaps in the future we can see if there have been any updates to this paper (or perhaps someone can leave a link doing just that).  Table 3 of the paper contains the guts of what you need to know about the experiments.

All in all, I think this paper is a nice soft introduction to the field of text categorization. The lessons I take from the paper are the following:

  1. Pay special attention to what is in the corpus that is being tested.  The inclusion/exclusion of unlabeled documents can make a big difference.  Try to use a standard corpus, if there is such a thing.
  2. Know what measures are being used when reading the paper and why people use those measures.
  3. Sometimes, simpler is better.  I know the field needs to advance, but sometimes a good solid algorithm like kNN is all you need.
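
To make the "simpler is better" point concrete, here is a minimal sketch of kNN text categorization.  To be clear, this is not the exact setup from the paper: the whitespace tokenizer, the raw term counts, the cosine similarity, the similarity-weighted vote, and the tiny `TRAIN` corpus are all assumptions I made up for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(train, query, k=3):
    """Label `query` by a similarity-weighted vote among its k nearest training docs."""
    q = vectorize(query)
    # Score every training document against the query, most similar first.
    scored = sorted(
        ((cosine(vectorize(text), q), label) for text, label in train),
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim  # closer neighbors count for more
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus of (document, category) pairs, invented for this example.
TRAIN = [
    ("wheat corn harvest prices rise", "grain"),
    ("corn exports and wheat futures", "grain"),
    ("crude oil barrel prices fall", "energy"),
    ("oil refinery output and crude supply", "energy"),
]

print(knn_classify(TRAIN, "wheat futures climb"))
print(knn_classify(TRAIN, "crude oil supply tight"))
```

There is no training step at all: the "model" is just the labeled documents, and all the work happens at query time.  That is a big part of why it is so easy to get running, although a real system would want term weighting and an index rather than a linear scan over every document.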

I think for next week, I’m going to continue with the theme of text categorization and try to dig into a specific algorithm to see how it works.  After that, I have been wanting to look into some graph theory algorithms used in IR and NLP.  As always, leave comments with suggestions on anything you think I just have to read lest I miss the boat completely.