Guest Contributor wanted for next 3 weeks

by grant.ingersoll in Algorithms, Artificial Intelligence, Computer Science, Information Retrieval, Machine Learning, Natural Language Processing (NLP), Question Answering, Text Categorization, disambiguation

If you have an interest in writing on artificial intelligence, clustering, information retrieval or computer science in general and are interested in reviewing one or more articles over the coming three weeks on this forum, please contact me by leaving a comment on this post.  All topics will be subject to my review for appropriateness, but I am open to most any article or publication in the Computer Science field.

Otherwise, I will be taking a brief hiatus from reviewing papers until I return from ApacheCon Europe in the early part of May where I am giving a talk and a training on Lucene.  I have several key deadlines over the next two weeks that must take higher priority, including the publication of a couple of articles that I have been working on.  I will post details on the publication on my Lucene blog when they are available.

Popularity: 11% [?]

POTW 4/8/07: “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections” by Cutting, et.al

by grant.ingersoll in Algorithms, Computer Science, Information Retrieval, clustering

This week’s paper is Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections by Cutting, et. al. This paper, from 1992, takes clustering out of the realm of search, as it was previously used, albeit indifferently, and proposes to use it in a document browsing scenario. In doing so, the authors propose a new information access approach called Scatter/Gather. The goal of the Scatter/Gather approach is to enablebetter browsing capabilities, not necessarily better search capabilities. The paper also promises to introduce two near linear time clustering algorithm.

Diving into section 2, the basic premise of the approach is that documents are “scattered” into small groups of documents and summaries are shown to the user.  Based on the groups the user looks at, they are then gathered back in and then this subcollection is done again.  Section 2.1 has a nice, easily understood example that is well worth the read.  Section 2.2 lays out the requirements needed to implement Scatter/Gather, which are:

  1. A good clustering algorithm that can cluster a lot of documents in a a very reasonable time.
  2. Summarization capabilities for describing the clusters.

Section 3, title Document Clustering, lays the groundwork for the rest of the paper by outlining much of the current terminology and strategies for document clustering.  It will be interesting, in coming weeks, to compare how far we have come in clustering since 1992.

To do clustering, one needs a notion of document similarity.  A common one is the cosine similarity function that is used in the Vector Space Model of IR as well.  Essentially, two documents can be represented as vectors in some vector space, and the cosine of the angle between the two vectors can be used as a measure of similarity.  See Section 4 for more information on this, as well as other definitions useful to the problem at hand.  I’m waving my hands here for the moment, as I want to see how the notion of the document profiles introduced here play out in later sections.

Section 5 takes us into the needs of a partitional clustering algorithm, which are: 1) find a set of centers, 2) assign documents to a center and 3) refine the partition.  Once complete, there is a set of disjoint document groups that cover all the documents in the corpus.  The two clustering algorithms introduced in this paper, Buckshot and Fractionation, are useful for finding the centers, Buckshot being the faster of the two, while Fractionation is more accurate.  Section 5.1 looks at how Buckshot and Fractionation find the initial centers.  Sections 5.2 then covers how documents are assigned to the center, the simplest approach being “Assign-to-Nearest” which assigns each document to the center that maximizes the similarity between the document and the center.  Section 5.3 then looks at the refinement process, which can be broken out into 3 parts, an iteration  of “Assign-to-Nearest”, the Split algorithm which divides clusters that are not well defined and the join algorithm which merges clusters that are similar.

Section 6 goes into a good amount of detail on how the work of Section 5 can be incorporated into the Scatter/Gather algorithm and how it works out in practice.  The rest of the paper makes conclusions about the work and why it is useful.  The Appendix A covers an in-depth session of the approach.

Intuitively, the approach and application makes sense, even though my knowledge of clustering leaves some me lacking a bit of understanding on the low level details.  I wonder if there are any current systems that make use of similar browsing approaches in real systems.

Popularity: 5% [?]

POTW: 4/8/07: “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections” by Cutting, et.al

by grant.ingersoll in Algorithms, Computer Science, Information Retrieval, Natural Language Processing (NLP), clustering

This week’s paper is “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections” by Cutting, Karger, Pedersen and Tukey.  This is one of Doug Cutting’s older works on clustering, pre Lucene fame.

Popularity: 4% [?]

POTW 3/26/07: “Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition” by Osinski, Stefanowski, and Weiss

by grant.ingersoll in Algorithms, Computer Science, Information Retrieval, Natural Language Processing (NLP), clustering

Finally, a chance to finish up last week’s review on “Lingo” by Osinski, et. al.  I first came across Lingo via the Carrot Search and it’s associated Carrot clustering engine.  Mr. Weiss also chimes in on the Lucene mailing list from time to time when people ask about clustering Lucene results.

To start, the paper explains the background behind clustering, highlighting two important features that clustering engines need: 1.) Good clusters! 2) Human readable labels.  Section 3 elaborates by noting that most approaches try to assign labels after the cluster, while their approach is different.  They only assign documents to a good label.

To achieve this, they extract frequent phrases from the input, to use as good labels.  Then use Singular Value Decomposition (SVD) on the term-document matrix.  At last, they match up labels with the clusters.  Section 3.1 has pseudocode for the algorithm.  In preparation for clustering, they do stopword removal and stemming.  Interestingly, they propose using the stopwords as language indicators so they know what stemmer to use.  This is clever in that it should be quite fast in comparison to using n-grams.  I wonder how accurate this approach is compared with n-grams.

Section 3.2 delves into how frequent phrases are extracted.  This is important because they form the candidate labels to assign to a cluster. Section 3.3 then discusses how cluster labels are induced from the phrases.  It is a three step process: 1) build the term-document matrix, 2) discover abstract concepts, 3) match phrases and prune labels.  The matrix is built using standard TF-IDF calculations, with titles receiving a constant scaling factor to weight them higher.  SVD is then used to find the orthogonal basis of the matrix.  I am not totally familiar with this technique (dang,  it has been a long time since linear algebra class) but, I guess the gist of it is that the basis represents the abstract concepts of the input documents.  Finally, to match phrases and prune labels, they do a cosine calculation between the abstract concepts and the terms, which allows for the selection of the best label based on the cosine score.  Labels are then pruned by finding overlapping descriptions.

Next, documents are added to clusters using the classic Vector Space Model (VSM), assigning documents to a label if it exceeds the “Snippet Assignment Threshold”, a control parameter used in the algorithm.  They do not specify how the threshold is determined.

Section 4 is an example of the process in action and 5 and 6 cover evaluation and future work.  The evaluation section mostly refers to another paper for more in-depth results.  Having used Carrot, I am quite impressed with the results, but I didn’t do a formal evaluation.  The clusters are good and the library is quite fast.

Popularity: 5% [?]

POTW 3/26/07: “Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition” by Osinski, Stefanowski, and Weiss

by grant.ingersoll in Computer Science

I have been swamped at work this week, and will continue this paper into next week.

Popularity: 2% [?]