So, hopefully everyone has read the paper (http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/) at least once. The first 4 sections are quite easy to follow, in my opinion, as they define the problem of text categorization and lay out the framework for the experiments. Digging into the details of the various implementations will be left as an exercise for the reader at this point (damn, I’ve always wanted to say that ever since wading through math proofs back in college).

In a nutshell, the author sets out to evaluate 14 different text categorization methods using a few different “standard” collections. The real effort here is to compare apples to apples, since prior research on these systems has used a variety of evaluation approaches, which prevents direct comparison. In section 3.2, it is important to note the discussion of how the corpora are used. I am sure that, as we go forward on many of these topics, we will come across these corpora again and again, especially the Reuters collection (heck, we even use it in Lucene for benchmarking).

Section 4, on performance measures, covers another topic that comes up in much of the literature, namely evaluation. Recall and precision are very common measures, and the paper does a good job of explaining how we derive recall, precision, the break-even point and the F measure from the truth table generated by a binary classifier. The recall and precision definitions are worth repeating here, I think, as we will see many variations of them when discussing IR, etc.:

recall = # categories found correct / total # of categories correct = the number we got right divided by the total number of right answers in the system
precision = # categories found correct / total categories found = the number we got right out of the number that we retrieved

For example, you could get perfect recall by assigning every possible category to every document, but your precision would not be very good.
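To make that trade-off concrete, here is a minimal Python sketch of the two definitions above, plus the balanced F measure (F1, the harmonic mean of precision and recall). The function name and the counts are my own illustration, not something taken from the paper: I assume 10 possible categories, a document that truly belongs to 2 of them, and a classifier that assigns all 10.

```python
def precision_recall_f1(found_correct, found_wrong, missed):
    """Compute precision, recall and F1 from raw counts.

    found_correct: categories assigned that were right
    found_wrong:   categories assigned that were wrong
    missed:        right categories that were never assigned
    (Hypothetical helper for illustration only.)
    """
    total_found = found_correct + found_wrong
    total_correct = found_correct + missed
    # Guard against division by zero when nothing was found or nothing is correct.
    precision = found_correct / total_found if total_found else 0.0
    recall = found_correct / total_correct if total_correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed toy case: assign all 10 categories to a document that has 2 true ones.
precision, recall, f1 = precision_recall_f1(found_correct=2, found_wrong=8, missed=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall comes out perfect because nothing was missed, while precision drops to 2 out of 10, which is exactly why the two numbers are usually reported together.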
Finally, section 4 ends with a discussion of which average (micro or macro) to use. The micro average was chosen because, according to the paper, it favors the more common categories. This, I think, makes sense given that one is probably more interested in how a system does on the common categories, unless of course you are really interested in the rare categories. :-) Perhaps someone with a better understanding can contribute some thoughts on this.
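For anyone who wants to see the difference between the two averages, here is a rough Python sketch under my own assumptions. The category names and counts are invented for illustration (one common category the classifier handles well, one rare category it handles badly) and the function names are hypothetical. Micro averaging pools the counts across categories before computing the measures; macro averaging computes the measures per category and then averages them.

```python
def precision_recall(found_correct, found_wrong, missed):
    """Precision and recall for one set of counts (hypothetical helper)."""
    total_found = found_correct + found_wrong
    total_correct = found_correct + missed
    precision = found_correct / total_found if total_found else 0.0
    recall = found_correct / total_correct if total_correct else 0.0
    return precision, recall


def micro_average(counts):
    """Pool the counts over all categories, then compute the measures once."""
    found_correct = sum(c[0] for c in counts.values())
    found_wrong = sum(c[1] for c in counts.values())
    missed = sum(c[2] for c in counts.values())
    return precision_recall(found_correct, found_wrong, missed)


def macro_average(counts):
    """Compute the measures per category, then take the unweighted mean."""
    per_category = [precision_recall(*c) for c in counts.values()]
    n = len(per_category)
    return (sum(p for p, _ in per_category) / n,
            sum(r for _, r in per_category) / n)


# Made-up counts: (found_correct, found_wrong, missed) per category.
counts = {
    "common": (90, 10, 10),  # many documents, classified well
    "rare": (1, 4, 9),       # few documents, classified poorly
}

micro = micro_average(counts)
macro = macro_average(counts)
print(f"micro: precision={micro[0]:.3f} recall={micro[1]:.3f}")
print(f"macro: precision={macro[0]:.3f} recall={macro[1]:.3f}")
```

With these numbers the micro average stays close to the common category's scores, while the macro average is pulled halfway down toward the rare category's scores, since every category counts equally there.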
