<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper of the Week &#187; clustering</title>
	<atom:link href="http://www.paperoftheweek.com/category/clustering/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.paperoftheweek.com</link>
	<description>Read. Learn. Discuss.</description>
	<lastBuildDate>Tue, 14 Aug 2007 01:35:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>POTW 5/21/07: Discussion of &#8220;A Study on Retrospective and On-Line Event Detection&#8221; by Yang, Pierce and Carbonell</title>
		<link>http://www.paperoftheweek.com/2007/06/02/potw-52107-discussion-of-a-study-on-retrospective-and-on-line-event-detection-by-yang-pierce-and-carbonell/</link>
		<comments>http://www.paperoftheweek.com/2007/06/02/potw-52107-discussion-of-a-study-on-retrospective-and-on-line-event-detection-by-yang-pierce-and-carbonell/#comments</comments>
		<pubDate>Sat, 02 Jun 2007 11:41:55 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[event detection]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/02/potw-52107-discussion-of-a-study-on-retrospective-and-on-line-event-detection-by-yang-pierce-and-carbonell/</guid>
		<description><![CDATA[Yang&#8217;s paper on on-line event detection (&#8220;A Study on Retrospective and On-Line Event Detection&#8221;) discusses the use of common text retrieval techniques to automatically detect events in news streams. Imagine that you are responsible for monitoring all the major news feeds in every single country your company does business in, in order to advise the CEO [...]]]></description>
			<content:encoded><![CDATA[<p>Yang&#8217;s paper on on-line event detection (&#8220;<a href="http://citeseer.ist.psu.edu/rd/0%2C51293%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/1982/http:zSzzSzwww.cs.cmu.eduzSz%7EyimingzSzpapers.yyzSzsigir98.pdf/yang98study.pdf">A Study on Retrospective and On-Line Event Detection</a>&#8221;) discusses the use of common text retrieval techniques to automatically detect events in news streams.</p>
<p>Imagine that you are responsible for monitoring all the major news feeds in every single country your company does business in, in order to advise the CEO on trends and events that affect your company.  Obviously, you can&#8217;t read all of them on a daily basis, and having a large staff to help would be costly.  This is exactly the kind of scenario that online event detection is meant to solve.  You need a program that can identify and organize news feeds as they come in, allowing you to see the key events (or even minor events) as they are reported, not days later.</p>
<p>This paper discusses the approach of a group of researchers at Carnegie Mellon University in the field of Topic Detection and Tracking.  After providing some background information on the topic, the authors dive into the details of their approach.  The task at hand was to analyze a corpus of documents in a temporal fashion to identify events and track them.  In other words, even though they had the whole corpus at the time, they had to pretend that they were receiving the news in chronological order, just as you and I do on a daily basis.  They could not look &#8220;into the future&#8221;, if you will, in doing their analysis.</p>
<p>Yang&#8217;s group attempts to solve this problem by using a clustering approach.  In fact, they modify Cutting&#8217;s Scatter/Gather approach, which we discussed <a href="http://www.paperoftheweek.com/2007/04/14/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal-2/">here</a>, and also use a single-pass, incremental approach.  They approach the problem in two ways.  First, they use the Scatter/Gather approach on a &#8220;retrospective&#8221; collection containing articles that appeared in the &#8220;simulated&#8221; past.  This is done to build up statistics about past events, figuring that new events will contain similar structures and statistics (TF/IDF, etc.), albeit with variations due to new names and events.  Section 3.1 discusses the representation of the clusters and Section 3.2 discusses their modifications of Cutting&#8217;s Scatter/Gather approach into what they call Group Average Clustering (GAC).</p>
<p>To solve the real-time, online problem, an incremental, single-pass approach was used.  To do this kind of thing, one needs to somehow estimate corpus statistics like IDF (inverse document frequency) in order to come up with the proper weights for the term vectors used in the clustering algorithm.  The CMU group solves this problem by initially estimating the IDF using the stats from the retrospective process and then updating it as new information becomes available in the real-time approach.</p>
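<p>As a rough illustration of that idea, here is a minimal sketch of an IDF table that is seeded from a retrospective corpus and then updated as each new story arrives.  The class name <code>IncrementalIdf</code>, the smoothing constants, and the TF weighting are my own assumptions, not the formulas from the paper.</p>

```python
import math
from collections import Counter


class IncrementalIdf:
    """Document-frequency table that grows as new stories arrive."""

    def __init__(self, seed_docs):
        # Seed the counts from a retrospective corpus (lists of tokens).
        self.num_docs = 0
        self.doc_freq = Counter()
        for tokens in seed_docs:
            self.add(tokens)

    def add(self, tokens):
        # Each document counts once per distinct term.
        self.num_docs += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        # Smoothed IDF (an assumed formula) so unseen terms get a finite weight.
        return math.log((self.num_docs + 1) / (self.doc_freq[term] + 0.5))

    def weights(self, tokens):
        # TF-IDF weights for one document, using only the statistics seen so far.
        tf = Counter(tokens)
        return {t: (1 + math.log(c)) * self.idf(t) for t, c in tf.items()}
```

<p>In use, one would call <code>weights()</code> on an incoming story first and <code>add()</code> afterwards, so the story is weighted only with statistics from its past.</p>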
<p>Much of the rest of the paper is about picking parameters and evaluation.  Section 4.3 has some interesting discussion of &#8220;Behavior Analysis&#8221; that is worth looking into.  The gist of it is that the GAC approach seemed to be good at identifying large news bursts, while the incremental approach is better at tracking long-lasting events.  In our scenario, you will most likely be interested in both kinds of events.  The key, of course, is having the ability to zoom in/out on the various news feeds and to set up alerts, etc. that help you manage the clusters.</p>
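<p>For readers who want something concrete, the single-pass idea can be sketched roughly as follows: each incoming story is compared with the existing clusters and either joins the closest one or starts a new one.  The threshold value, the dict-based sparse vectors, and the summed centroid are assumptions made for illustration; the paper&#8217;s actual algorithm has more parameters than this.</p>

```python
import math


def cosine(a, b):
    # a and b are sparse vectors stored as {term: weight} dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(stream, threshold=0.5):
    """Assign each incoming vector to the closest cluster or start a new one."""
    clusters = []  # each cluster: {"centroid": dict, "members": [indices]}
    for index, vec in enumerate(stream):
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(index)
            # Keep the sum of member vectors; cosine ignores vector length.
            for term, weight in vec.items():
                best["centroid"][term] = best["centroid"].get(term, 0.0) + weight
        else:
            # Nothing similar enough: treat the story as the start of a new event.
            clusters.append({"centroid": dict(vec), "members": [index]})
    return clusters
```

<p>Because documents are processed strictly in arrival order, the function never looks &#8220;into the future&#8221;, which is the constraint the on-line task imposes.</p>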
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/02/potw-52107-discussion-of-a-study-on-retrospective-and-on-line-event-detection-by-yang-pierce-and-carbonell/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 5/21/07: &#8220;A Study on Retrospective and On-Line Event Detection&#8221; by Yang, Pierce and Carbonell</title>
		<link>http://www.paperoftheweek.com/2007/05/21/potw-52107-a-study-on-retrospective-and-on-line-event-detection-by-yang-pierce-and-carbonell/</link>
		<comments>http://www.paperoftheweek.com/2007/05/21/potw-52107-a-study-on-retrospective-and-on-line-event-detection-by-yang-pierce-and-carbonell/#comments</comments>
		<pubDate>Tue, 22 May 2007 01:32:36 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[event detection]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/05/21/potw-52107-a-study-on-retrospective-and-on-line-event-detection-by-yang-pierce-and-carbonell/</guid>
		<description><![CDATA[Paper of the Week for May 20, 2007 is &#8220;A Study on Retrospective and On-Line Event Detection&#8221; by Yiming Yang, Tom Pierce and Jaime Carbonell.]]></description>
			<content:encoded><![CDATA[<p>Paper of the Week for May 20, 2007 is &#8220;<a href="http://citeseer.ist.psu.edu/rd/0%2C51293%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/1982/http:zSzzSzwww.cs.cmu.eduzSz%7EyimingzSzpapers.yyzSzsigir98.pdf/yang98study.pdf">A Study on Retrospective and On-Line Event Detection</a>&#8221; by Yiming Yang, Tom Pierce and Jaime Carbonell.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/05/21/potw-52107-a-study-on-retrospective-and-on-line-event-detection-by-yang-pierce-and-carbonell/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 4/8/07: &#8220;Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections&#8221; by Cutting, et.al</title>
		<link>http://www.paperoftheweek.com/2007/04/14/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal-2/</link>
		<comments>http://www.paperoftheweek.com/2007/04/14/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal-2/#comments</comments>
		<pubDate>Sat, 14 Apr 2007 20:53:16 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Information Retrieval]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/04/14/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal-2/</guid>
		<description><![CDATA[This week&#8217;s paper is Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections by Cutting et al. This paper, from 1992, takes clustering out of the realm of search, as it was previously used, albeit indifferently, and proposes to use it in a document browsing scenario. In doing so, the authors propose a new information [...]]]></description>
			<content:encoded><![CDATA[<p>This week&#8217;s paper is <a href="http://www.paperoftheweek.com/wp-admin/Scatter/Gather:%20A%20Cluster-based%20Approach%20to%20Browsing%20Large%20Document%20Collections">Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections</a> by Cutting et al.  This paper, from 1992, takes clustering out of the realm of search, as it was previously used, albeit indifferently, and proposes to use it in a document browsing scenario.  In doing so, the authors propose a new information access approach called <em>Scatter/Gather</em>.  The goal of the <em>Scatter/Gather</em> approach is to enable better browsing capabilities, not necessarily better search capabilities.  The paper also promises to introduce two near-linear-time clustering algorithms.</p>
<p>Diving into Section 2, the basic premise of the approach is that documents are &#8220;scattered&#8221; into small groups of documents and summaries of the groups are shown to the user.  The groups the user selects are then gathered back together, and the resulting subcollection is scattered again.  Section 2.1 has a nice, easily understood example that is well worth the read.  Section 2.2 lays out the requirements needed to implement <em>Scatter/Gather</em>, which are:</p>
<ol>
<li>A good clustering algorithm that can cluster a lot of documents in a very reasonable time.</li>
<li>Summarization capabilities for describing the clusters.</li>
</ol>
<p>Section 3, titled Document Clustering, lays the groundwork for the rest of the paper by outlining much of the then-current terminology and strategies for document clustering.  It will be interesting, in coming weeks, to compare how far we have come in clustering since 1992.</p>
<p>To do clustering, one needs a notion of document similarity.  A common one is the cosine similarity function that is used in the Vector Space Model of IR as well.  Essentially, two documents can be represented as vectors in some vector space, and the cosine of the angle between the two vectors can be used as a measure of similarity.  See Section 4 for more information on this, as well as other definitions useful to the problem at hand.  I&#8217;m waving my hands here for the moment, as I want to see how the notion of the document profiles introduced here plays out in later sections.</p>
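<p>To make the cosine measure concrete, here is a minimal sketch using raw term counts as the vector components.  The whitespace tokenizer and the unweighted counts are simplifying assumptions; the paper&#8217;s document profiles are more elaborate than this.</p>

```python
import math
from collections import Counter


def term_vector(text):
    # Bag-of-words vector with raw term counts; real systems usually weight terms.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(count * b[term] for term, count in a.items())
    length_a = math.sqrt(sum(c * c for c in a.values()))
    length_b = math.sqrt(sum(c * c for c in b.values()))
    if length_a == 0 or length_b == 0:
        return 0.0
    return dot / (length_a * length_b)
```

<p>Two documents that share three of their four terms, such as &#8220;clustering large document collections&#8221; and &#8220;browsing large document collections&#8221;, score 0.75 with this function, while documents with no terms in common score 0.</p>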
<p>Section 5 takes us into the needs of a partitional clustering algorithm, which are: 1) find a set of centers, 2) assign documents to a center and 3) refine the partition.  Once complete, there is a set of disjoint document groups that cover all the documents in the corpus.  The two clustering algorithms introduced in this paper, Buckshot and Fractionation, are useful for finding the centers; Buckshot is the faster of the two, while Fractionation is more accurate.  Section 5.1 looks at how Buckshot and Fractionation find the initial centers.  Section 5.2 then covers how documents are assigned to the centers, the simplest approach being &#8220;Assign-to-Nearest&#8221;, which assigns each document to the center that maximizes the similarity between the document and the center.  Section 5.3 then looks at the refinement process, which can be broken out into three parts: an iteration of &#8220;Assign-to-Nearest&#8221;, the Split algorithm, which divides clusters that are not well defined, and the Join algorithm, which merges clusters that are similar.</p>
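<p>The three steps can be sketched roughly as below.  This is a simplified, Buckshot-style illustration under my own assumptions: documents are dense lists of floats, the sample is clustered by merging the pair of groups with the most similar centroids (a stand-in for the group-average criterion), and refinement is a single extra pass of Assign-to-Nearest.  Split and Join are left out.</p>

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    # Component-wise mean of a group of dense vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def buckshot_centers(docs, k, rng):
    # Step 1: cluster a random sample of about sqrt(k * n) documents
    # agglomeratively until k groups remain; their centroids become the centers.
    sample_size = min(len(docs), max(k, math.ceil(math.sqrt(k * len(docs)))))
    groups = [[doc] for doc in rng.sample(docs, sample_size)]
    while len(groups) > k:
        # Merge the pair of groups whose centroids are most similar.
        pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
        i, j = max(pairs, key=lambda p: cosine(centroid(groups[p[0]]), centroid(groups[p[1]])))
        groups[i].extend(groups.pop(j))
    return [centroid(group) for group in groups]


def assign_to_nearest(docs, centers):
    # Step 2: each document joins the center it is most similar to.
    partition = [[] for _ in centers]
    for index, doc in enumerate(docs):
        best = max(range(len(centers)), key=lambda c: cosine(doc, centers[c]))
        partition[best].append(index)
    return partition


def refine(docs, partition):
    # Step 3 (simplest form): recompute the centers and reassign once.
    centers = [centroid([docs[i] for i in group]) for group in partition if group]
    return assign_to_nearest(docs, centers)
```

<p>The speed-up comes from running the expensive pairwise merging only on the small sample, while every other document is handled by the cheap Assign-to-Nearest pass.</p>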
<p>Section 6 goes into a good amount of detail on how the work of Section 5 can be incorporated into the <em>Scatter/Gather</em> algorithm and how it works out in practice.  The rest of the paper draws conclusions about the work and why it is useful.  Appendix A walks through an in-depth session with the approach.</p>
<p>Intuitively, the approach and application make sense, even though my knowledge of clustering leaves me lacking a bit of understanding of the low-level details.  I wonder if there are any current systems that make use of similar browsing approaches in practice.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/04/14/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW: 4/8/07: &#8220;Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections&#8221; by Cutting, et.al</title>
		<link>http://www.paperoftheweek.com/2007/04/09/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal/</link>
		<comments>http://www.paperoftheweek.com/2007/04/09/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal/#comments</comments>
		<pubDate>Mon, 09 Apr 2007 18:53:46 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/04/09/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal/</guid>
		<description><![CDATA[This week&#8217;s paper is &#8220;Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections&#8221; by Cutting, Karger, Pedersen and Tukey.  This is one of Doug Cutting&#8217;s older works on clustering, pre Lucene fame.]]></description>
			<content:encoded><![CDATA[<p>This week&#8217;s paper is &#8220;<a href="http://citeseer.ist.psu.edu/cutting92scattergather.html">Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections</a>&#8221; by Cutting, Karger, Pedersen and Tukey.  This is one of Doug Cutting&#8217;s older works on clustering, pre Lucene fame.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/04/09/potw-4807-scattergather-a-cluster-based-approach-to-browsing-large-document-collections-by-cutting-etal/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 3/26/07: &#8220;Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition&#8221; by Osinski, Stefanowski, and Weiss</title>
		<link>http://www.paperoftheweek.com/2007/04/06/potw-32607-lingo-search-results-clustering-algorithm-based-on-singular-value-decomposition-by-osinski-stefanowski-and-weiss-3/</link>
		<comments>http://www.paperoftheweek.com/2007/04/06/potw-32607-lingo-search-results-clustering-algorithm-based-on-singular-value-decomposition-by-osinski-stefanowski-and-weiss-3/#comments</comments>
		<pubDate>Fri, 06 Apr 2007 14:28:03 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/04/06/potw-32607-lingo-search-results-clustering-algorithm-based-on-singular-value-decomposition-by-osinski-stefanowski-and-weiss-3/</guid>
		<description><![CDATA[Finally, a chance to finish up last week&#8217;s review on &#8220;Lingo&#8221; by Osinski et al.  I first came across Lingo via Carrot Search and its associated Carrot clustering engine.  Mr. Weiss also chimes in on the Lucene mailing list from time to time when people ask about clustering Lucene results. To start, the paper [...]]]></description>
			<content:encoded><![CDATA[<p>Finally, a chance to finish up last week&#8217;s review on &#8220;<a href="http://www.cs.put.poznan.pl/dweiss/site/publications/download/iipwm-osinski-weiss-stefanowski-2004-lingo.pdf">Lingo</a>&#8221; by Osinski et al.  I first came across Lingo via <a href="http://www.carrotsearch.com">Carrot Search</a> and its associated Carrot clustering engine.  Mr. Weiss also chimes in on the <a href="http://lucene.apache.org/java/docs">Lucene</a> mailing list from time to time when people ask about clustering Lucene results.</p>
<p>To start, the paper explains the background behind clustering, highlighting two important features that clustering engines need: 1) good clusters and 2) human-readable labels.  Section 3 elaborates by noting that most approaches try to assign labels after clustering, while their approach is different: they only assign documents to a good label.</p>
<p>To achieve this, they extract frequent phrases from the input to use as good labels.  They then use Singular Value Decomposition (SVD) on the term-document matrix.  Finally, they match up labels with the clusters.  Section 3.1 has pseudocode for the algorithm.  In preparation for clustering, they do stopword removal and stemming.  Interestingly, they propose using the stopwords as language indicators so they know what stemmer to use.  This is clever in that it should be quite fast in comparison to using n-grams.  I wonder how accurate this approach is compared with n-grams.</p>
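<p>The stopword trick is simple enough to sketch.  The version below is my own guess at how such a check could look, not the paper&#8217;s code: the stopword lists are tiny, hand-picked examples, and the function name <code>guess_language</code> is hypothetical.</p>

```python
STOPWORDS = {
    # Tiny illustrative lists; a real system would use much larger ones.
    "english": {"the", "and", "of", "to", "in", "is", "that"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "von"},
    "polish": {"w", "nie", "na", "jest", "z", "oraz", "się"},
}


def guess_language(text, stopwords=STOPWORDS):
    # Count how many tokens match each language's stopword list.
    tokens = text.lower().split()
    hits = {lang: sum(1 for t in tokens if t in words) for lang, words in stopwords.items()}
    best = max(hits, key=hits.get)
    # Return None when no stopword matched at all.
    return best if hits[best] else None
```

<p>A single pass over the tokens with set lookups is all it takes, which is why this should be cheap next to an n-gram model; very short snippets with few stopwords are where I would expect it to struggle.</p>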
<p>Section 3.2 delves into how frequent phrases are extracted.  This is important because they form the candidate labels to assign to a cluster.  Section 3.3 then discusses how cluster labels are induced from the phrases.  It is a three-step process: 1) build the term-document matrix, 2) discover abstract concepts, 3) match phrases and prune labels.  The matrix is built using standard TF-IDF calculations, with titles receiving a constant scaling factor to weight them higher.  SVD is then used to find the orthogonal basis of the matrix.  I am not totally familiar with this technique (dang, it has been a long time since linear algebra class), but I guess the gist of it is that the basis represents the abstract concepts of the input documents.  Finally, to match phrases and prune labels, they do a cosine calculation between the abstract concepts and the terms, which allows for the selection of the best label based on the cosine score.  Labels are then pruned by finding overlapping descriptions.</p>
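<p>For my own benefit, here is how I would write the SVD and matching steps down.  The notation is my assumption: <code>A</code> is the term-document matrix, <code>U_k</code> holds the first <code>k</code> left singular vectors (the abstract concepts), and <code>P</code> holds the candidate phrases and terms expressed as columns in the same term space.</p>

```latex
A = U \Sigma V^{T}, \qquad
M = U_k^{T} P, \qquad
m_{ij} = \cos(u_i, p_j) \quad \text{when the columns of } U_k \text{ and } P \text{ have unit length}
```

<p>The columns of <code>U</code> are orthonormal, so only the columns of <code>P</code> need normalizing.  For each abstract concept (row <code>i</code> of <code>M</code>), the phrase with the largest entry would then be the best-scoring label candidate.</p>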
<p>Next, documents are added to clusters using the classic Vector Space Model (VSM), assigning a document to a label if its similarity to that label exceeds the &#8220;Snippet Assignment Threshold&#8221;, a control parameter used in the algorithm.  They do not specify how the threshold is determined.</p>
<p>Section 4 is an example of the process in action, and Sections 5 and 6 cover evaluation and future work.  The evaluation section mostly refers to another paper for more in-depth results.  Having used Carrot, I am quite impressed with the results, but I didn&#8217;t do a formal evaluation.  The clusters are good and the library is quite fast.</p>
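<p>A minimal sketch of that assignment step might look like the following.  The threshold of 0.2, the dict-based vectors, and the separate list of unassigned snippets are my assumptions for illustration; as noted, the paper does not say how the threshold should be chosen.</p>

```python
import math


def cosine(a, b):
    # a and b are sparse vectors stored as {term: weight} dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def assign_snippets(snippets, labels, threshold=0.2):
    """Attach each snippet to every label whose similarity clears the threshold."""
    clusters = {name: [] for name in labels}
    unassigned = []
    for index, snippet in enumerate(snippets):
        matched = False
        for name, label_vector in labels.items():
            if cosine(snippet, label_vector) >= threshold:
                clusters[name].append(index)
                matched = True
        if not matched:
            # No label was similar enough for this snippet.
            unassigned.append(index)
    return clusters, unassigned
```

<p>Note that a snippet can land in more than one cluster this way, and raising the threshold trades coverage for tighter clusters.</p>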
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/04/06/potw-32607-lingo-search-results-clustering-algorithm-based-on-singular-value-decomposition-by-osinski-stefanowski-and-weiss-3/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>POTW 3/26/07: &#8220;Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition&#8221; by Osinski, Stefanowski, and Weiss</title>
		<link>http://www.paperoftheweek.com/2007/03/26/potw-32607-lingo-search-results-clustering-algorithm-based-on-singular-value-decomposition-by-osinski-stefanowski-and-weiss/</link>
		<comments>http://www.paperoftheweek.com/2007/03/26/potw-32607-lingo-search-results-clustering-algorithm-based-on-singular-value-decomposition-by-osinski-stefanowski-and-weiss/#comments</comments>
		<pubDate>Mon, 26 Mar 2007 12:25:15 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/03/26/potw-32607-lingo-search-results-clustering-algorithm-based-on-singular-value-decomposition-by-osinski-stefanowski-and-weiss/</guid>
		<description><![CDATA[Paper of the Week for March 26 is &#8220;Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition&#8221; by Osinski et al.  Aah, back to matrices&#8230;]]></description>
			<content:encoded><![CDATA[<p>Paper of the Week for March 26 is &#8220;<a href="http://www.cs.put.poznan.pl/dweiss/site/publications/download/iipwm-osinski-weiss-stefanowski-2004-lingo.pdf">Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition</a>&#8221; by Osinski et al.  Aah, back to matrices&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/03/26/potw-32607-lingo-search-results-clustering-algorithm-based-on-singular-value-decomposition-by-osinski-stefanowski-and-weiss/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

