An Evaluation of Statistical Approaches to Text Categorization – Yang (ResearchIndex)
An Evaluation of Statistical Approaches to Text Categorization – Yang (ResearchIndex) is the first paper of the week that I am going to tackle.
Finding a first paper was harder than I thought. I originally intended to start with some of the seminal papers in IR, such as Salton’s work or Spärck Jones’s, but it seems many of these works are “locked up” in books that I don’t have access to at the moment. I guess I may need to renew my ACM membership to get access to more content, or see if the local library has them. In the long run, I will probably also allow chapters of books, but I first want to start with what is freely available on the web.
This paper, however, strikes me as a good intro to some of the statistical approaches to text categorization, so it seems like a good starting place for me. It comes recommended by Foundations of Statistical Natural Language Processing by Manning and Schütze (Chapter 16), which is the book we used when I took Dr. Liddy’s NLP course at Syracuse University a few years back.
Paper of the Week » Blog Archive » Discussion of Sections 1-4 of Yang 97 wrote,
[...] So, hopefully everyone has read the paper (http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/) at least once. The first 4 sections are quite easy to get, in my opinion, as they define the problem of text categorization and lay the framework for the experiments. Digging into the details of the various implementations will be left as an exercise for the reader at this point (damn, I’ve always wanted to say that ever since wading through math proofs back in college). In a nutshell, the author is setting out to evaluate 14 different text categorization methods using a few different “standard” collections. The real effort here is to try to compare apples to apples, since prior research on these systems has used a variety of approaches to evaluation, preventing direct comparison. In section 3.2 it is important to note the discussion of how the corpora are used. I am sure that, as we go forward on many of these topics, we will come across these corpora again and again, especially the Reuters collection (heck, we even use it in Lucene for benchmarking). Section 4, on performance measures, discusses another piece of information that occurs in much of the literature, namely the concept of evaluation. Recall and precision are very common measures, and the paper does a good job of showing how we derive recall, precision, break-even point and the F measure from the truth table generated by a binary classifier. The recall and precision definitions are worth repeating here, I think, as we will see many variations of this when discussing IR, etc.: [...]
Link | January 10th, 2007 at 6:31 pm
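For anyone who wants to play with the measures mentioned in the excerpt above, here is a minimal Python sketch of how recall, precision and the F measure fall out of a binary classifier's truth table. The class and function names (`TruthTable`, `recall`, `precision`, `f_measure`) and the counts in the usage lines are my own illustrative assumptions, not notation or data from the paper. The F measure here is the standard F-beta form, and the break-even point (where precision equals recall) is not computed.

```python
from dataclasses import dataclass


@dataclass
class TruthTable:
    """Counts for one category from a binary classifier (hypothetical naming)."""
    tp: int  # assigned to the category, and correct
    fp: int  # assigned to the category, but incorrect
    fn: int  # not assigned, but should have been
    tn: int  # not assigned, and correctly so


def recall(t: TruthTable) -> float:
    # Share of the truly relevant documents that the classifier found.
    relevant = t.tp + t.fn
    return t.tp / relevant if relevant else 0.0


def precision(t: TruthTable) -> float:
    # Share of the classifier's assignments that were correct.
    assigned = t.tp + t.fp
    return t.tp / assigned if assigned else 0.0


def f_measure(t: TruthTable, beta: float = 1.0) -> float:
    # Standard F-beta: beta = 1 weights precision and recall equally.
    p, r = precision(t), recall(t)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    table = TruthTable(tp=8, fp=2, fn=4, tn=86)
    print("recall:", recall(table))
    print("precision:", precision(table))
    print("F1:", f_measure(table))
```

Returning 0.0 when a denominator is zero is a convenience choice in this sketch; some evaluations treat those cases as undefined instead.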
Paper of the Week » Blog Archive » Discussion of Sections 5-7 of Yang 97 wrote,
[...] Whew, I think we’ve made it through our first paper, or we are about to anyway. If you recall, we are working our way through Yang 97 and had made it through the first 4 sections so far, which are covered here. This leaves us with the meat of the paper, I guess, which is the actual experiments. [...]
Link | January 12th, 2007 at 6:12 pm