Finally, a chance to finish up last week’s review on “Lingo” by Osinski, et. al.  I first came across Lingo via the Carrot Search and it’s associated Carrot clustering engine.  Mr. Weiss also chimes in on the Lucene mailing list from time to time when people ask about clustering Lucene results.

To start, the paper explains the background behind clustering, highlighting two important features that clustering engines need: 1.) Good clusters! 2) Human readable labels.  Section 3 elaborates by noting that most approaches try to assign labels after the cluster, while their approach is different.  They only assign documents to a good label.

To achieve this, they extract frequent phrases from the input, to use as good labels.  Then use Singular Value Decomposition (SVD) on the term-document matrix.  At last, they match up labels with the clusters.  Section 3.1 has pseudocode for the algorithm.  In preparation for clustering, they do stopword removal and stemming.  Interestingly, they propose using the stopwords as language indicators so they know what stemmer to use.  This is clever in that it should be quite fast in comparison to using n-grams.  I wonder how accurate this approach is compared with n-grams.

Section 3.2 delves into how frequent phrases are extracted.  This is important because they form the candidate labels to assign to a cluster. Section 3.3 then discusses how cluster labels are induced from the phrases.  It is a three step process: 1) build the term-document matrix, 2) discover abstract concepts, 3) match phrases and prune labels.  The matrix is built using standard TF-IDF calculations, with titles receiving a constant scaling factor to weight them higher.  SVD is then used to find the orthogonal basis of the matrix.  I am not totally familiar with this technique (dang,  it has been a long time since linear algebra class) but, I guess the gist of it is that the basis represents the abstract concepts of the input documents.  Finally, to match phrases and prune labels, they do a cosine calculation between the abstract concepts and the terms, which allows for the selection of the best label based on the cosine score.  Labels are then pruned by finding overlapping descriptions.

Next, documents are added to clusters using the classic Vector Space Model (VSM), assigning documents to a label if it exceeds the “Snippet Assignment Threshold”, a control parameter used in the algorithm.  They do not specify how the threshold is determined.

Section 4 is an example of the process in action and 5 and 6 cover evaluation and future work.  The evaluation section mostly refers to another paper for more in-depth results.  Having used Carrot, I am quite impressed with the results, but I didn’t do a formal evaluation.  The clusters are good and the library is quite fast.

Popularity: 5% [?]