We are reading “A Comparison of Event Models for Naive Bayes Text Classification” by McCallum and Nigam.

Text classification is the process of assigning a document to one or more categories (we looked at classification/categorization earlier when exploring Support Vector Machines, SVMs). My understanding of the difference between categorization and classification is that categorization has a set number of categories, whereas classification does not. At any rate, this paper compares two different classifiers that use a naive Bayes approach. The naive Bayes approach assumes that all attributes of the examples we are studying are independent of each other, given the class. Even though this is rarely true in the real world for text (after all, if I chose all the words of this post independently we would probably have gibberish), it turns out that it still works pretty well in practice.

But I digress… The two approaches being compared are the multi-variate Bernoulli model and the multinomial model. The Bernoulli model treats the document as the event and builds a vector of binary attributes based on whether or not each term occurs in the document. It DOES NOT take into account the number of times the word occurs. In the multinomial approach, word occurrences are the events and term frequency does matter. The next couple of sections after the introduction lay out the common ground between the two approaches as well as the differences. The differences come down to how the probabilities are calculated.
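To make the difference concrete, here is a minimal sketch of both event models in plain Python. It is my own illustration, not the authors' code: the toy documents, the class names, and the function names are all made up, and I am assuming simple add-one (Laplace) smoothing for both models.

```python
import math
from collections import Counter

# Toy training data, invented for illustration: (class label, tokenized document)
TRAIN = [
    ("sports", "ball game ball team".split()),
    ("sports", "team win game".split()),
    ("politics", "vote election vote vote".split()),
    ("politics", "election team debate".split()),
]
VOCAB = sorted({w for _, doc in TRAIN for w in doc})
CLASSES = sorted({c for c, _ in TRAIN})


def log_prior(c):
    """Log of the fraction of training documents labeled c."""
    return math.log(sum(1 for label, _ in TRAIN if label == c) / len(TRAIN))


def train_bernoulli():
    """P(word present | class), estimated from document counts (add-one smoothing)."""
    model = {}
    for c in CLASSES:
        docs = [set(doc) for label, doc in TRAIN if label == c]
        model[c] = {w: (1 + sum(w in d for d in docs)) / (2 + len(docs)) for w in VOCAB}
    return model


def train_multinomial():
    """P(word | class), estimated from word occurrence counts (add-one smoothing)."""
    model = {}
    for c in CLASSES:
        counts = Counter(w for label, doc in TRAIN if label == c for w in doc)
        total = sum(counts.values())
        model[c] = {w: (1 + counts[w]) / (len(VOCAB) + total) for w in VOCAB}
    return model


def classify_bernoulli(doc, model):
    """Every vocabulary word contributes, whether it is present or absent."""
    present = set(doc)

    def score(c):
        return log_prior(c) + sum(
            math.log(p if w in present else 1.0 - p) for w, p in model[c].items()
        )

    return max(CLASSES, key=score)


def classify_multinomial(doc, model):
    """Each token contributes once per occurrence; absent words are ignored."""

    def score(c):
        # Out-of-vocabulary tokens are skipped (a simplifying assumption).
        return log_prior(c) + sum(math.log(model[c][w]) for w in doc if w in model[c])

    return max(CLASSES, key=score)
```

On this toy data, a document like `"vote vote vote team game ball"` shows the contrast: the Bernoulli model only sees that four distinct words are present, while the multinomial model also sees that "vote" occurs three times, and the two models end up picking different classes.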

There is also some interesting discussion of feature selection using mutual information. Feature selection is a way of reducing the size of the vocabulary, which speeds things up without, hopefully, losing too much information. It is worth digging into a bit more if you have the time.
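As a rough sketch of the idea, the snippet below scores each word by the mutual information between the class label and the word's presence or absence in a document, then keeps the top-scoring words. The toy documents and function names are my own assumptions, and this is only the presence/absence variant; it is not meant to reproduce the paper's exact procedure.

```python
import math

# Toy labeled documents, invented for illustration
DOCS = [
    ("sports", "ball game ball team".split()),
    ("sports", "team win game".split()),
    ("politics", "vote election vote vote".split()),
    ("politics", "election team debate".split()),
]


def mutual_information(word, docs):
    """I(C; W) where W is 1 if `word` occurs in a document and 0 otherwise."""
    n = len(docs)
    mi = 0.0
    for c in sorted({label for label, _ in docs}):
        p_c = sum(1 for label, _ in docs if label == c) / n
        for present in (True, False):
            p_w = sum(1 for _, doc in docs if (word in doc) == present) / n
            joint = sum(
                1 for label, doc in docs if label == c and (word in doc) == present
            ) / n
            if joint > 0:  # a zero joint probability contributes nothing
                mi += joint * math.log(joint / (p_c * p_w))
    return mi


def top_k_words(docs, k):
    """Return the k vocabulary words with the highest mutual information."""
    vocab = sorted({w for _, doc in docs for w in doc})
    return sorted(vocab, key=lambda w: (-mutual_information(w, docs), w))[:k]
```

Here "game" and "election" each appear in every document of one class and in none of the other, so they score highest, while "team" shows up in both classes and scores lower.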

The next sections are where the rubber meets the road: the authors do a side-by-side comparison of the two approaches using five different collections. You can see the results on pages 5 and 6. Finally, the discussion of the results is on pages 6 and 7, with the bottom line being that the multinomial model seems to “be almost uniformly better than the multi-variate Bernoulli model.”

For those interested, Weka has tools for building Naive Bayes classifiers.
