This week’s paper is Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections by Cutting, et. al. This paper, from 1992, takes clustering out of the realm of search, as it was previously used, albeit indifferently, and proposes to use it in a document browsing scenario. In doing so, the authors propose a new information access approach called Scatter/Gather. The goal of the Scatter/Gather approach is to enablebetter browsing capabilities, not necessarily better search capabilities. The paper also promises to introduce two near linear time clustering algorithm.

Diving into section 2, the basic premise of the approach is that documents are “scattered” into small groups of documents and summaries are shown to the user.  Based on the groups the user looks at, they are then gathered back in and then this subcollection is done again.  Section 2.1 has a nice, easily understood example that is well worth the read.  Section 2.2 lays out the requirements needed to implement Scatter/Gather, which are:

  1. A good clustering algorithm that can cluster a lot of documents in a a very reasonable time.
  2. Summarization capabilities for describing the clusters.

Section 3, title Document Clustering, lays the groundwork for the rest of the paper by outlining much of the current terminology and strategies for document clustering.  It will be interesting, in coming weeks, to compare how far we have come in clustering since 1992.

To do clustering, one needs a notion of document similarity.  A common one is the cosine similarity function that is used in the Vector Space Model of IR as well.  Essentially, two documents can be represented as vectors in some vector space, and the cosine of the angle between the two vectors can be used as a measure of similarity.  See Section 4 for more information on this, as well as other definitions useful to the problem at hand.  I’m waving my hands here for the moment, as I want to see how the notion of the document profiles introduced here play out in later sections.

Section 5 takes us into the needs of a partitional clustering algorithm, which are: 1) find a set of centers, 2) assign documents to a center and 3) refine the partition.  Once complete, there is a set of disjoint document groups that cover all the documents in the corpus.  The two clustering algorithms introduced in this paper, Buckshot and Fractionation, are useful for finding the centers, Buckshot being the faster of the two, while Fractionation is more accurate.  Section 5.1 looks at how Buckshot and Fractionation find the initial centers.  Sections 5.2 then covers how documents are assigned to the center, the simplest approach being “Assign-to-Nearest” which assigns each document to the center that maximizes the similarity between the document and the center.  Section 5.3 then looks at the refinement process, which can be broken out into 3 parts, an iteration  of “Assign-to-Nearest”, the Split algorithm which divides clusters that are not well defined and the join algorithm which merges clusters that are similar.

Section 6 goes into a good amount of detail on how the work of Section 5 can be incorporated into the Scatter/Gather algorithm and how it works out in practice.  The rest of the paper makes conclusions about the work and why it is useful.  The Appendix A covers an in-depth session of the approach.

Intuitively, the approach and application makes sense, even though my knowledge of clustering leaves some me lacking a bit of understanding on the low level details.  I wonder if there are any current systems that make use of similar browsing approaches in real systems.