<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper of the Week &#187; Machine Learning</title>
	<atom:link href="http://www.paperoftheweek.com/category/computer-science/artificial-intelligence/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.paperoftheweek.com</link>
	<description>Read. Learn. Discuss.</description>
	<lastBuildDate>Tue, 14 Aug 2007 01:35:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>POTW 6/24/07: &#8220;Support-Vector Networks&#8221; by Cortes and Vapnik</title>
		<link>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/</link>
		<comments>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/#comments</comments>
		<pubDate>Mon, 25 Jun 2007 18:27:22 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[support vector machines]]></category>
		<category><![CDATA[SVM]]></category>
		<category><![CDATA[Text Categorization]]></category>
		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/</guid>
		<description><![CDATA[Long paper this week, but it is the original on Support Vector Machines: Support-Vector Networks by Cortes and Vapnik.  Given my schedule, I may spread this out over two weeks.]]></description>
			<content:encoded><![CDATA[<p>Long paper this week, but it is the original on Support Vector Machines: <a href="http://citeseer.ist.psu.edu/rd/0%2C500489%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/23317/http:zSzzSzwww.research.att.comzSz%7EcorinnazSzpaperszSzsupport.vector.pdf/cortes95supportvector.pdf">Support-Vector Networks</a> by Cortes and Vapnik.  Given my schedule, I may spread this out over two weeks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/11/07: Discussion of &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by Lewis and Gale</title>
		<link>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</link>
		<comments>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/#comments</comments>
		<pubDate>Mon, 18 Jun 2007 03:58:12 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[naive bayes]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</guid>
		<description><![CDATA[In &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David D. Lewis and William Gale, the authors put forth a new (at the time) method for training text classifiers using an approach they call &#8220;uncertainty sampling&#8221;. Section 1 outlines the problem of training, namely obtaining a good sample of text to be labeled for the trainer. [...]]]></description>
			<content:encoded><![CDATA[<p>In &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David D.<br />
Lewis and William Gale, the authors put forth a new (at the time)<br />
method for training text classifiers using an approach they call<br />
&#8220;uncertainty sampling&#8221;.</p>
<p>Section 1 outlines the problem of training, namely obtaining a good<br />
sample of text to be labeled for the trainer.  After disposing of<br />
several other methods of garnering samples (random, relevance<br />
feedback based), Lewis and Gale introduce an iterative approach for<br />
manually labeling examples.</p>
<p>Section 2 then discusses the benefits of &#8220;learning by query&#8221; in<br />
theory, namely the possibility of reducing the error rate very<br />
quickly in comparison to the number of queries required.</p>
<p>Figure 1 (described in section 3) outlines their basic approach,<br />
which relies on having a human judge some subset of examples that the<br />
currently used classifier is least certain about.  This process is<br />
iterated until the human feels satisfied with the results.  One<br />
caveat of this approach is that the classifier must not only predict<br />
the class, it must give a measurement of certainty for that class.</p>
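<p>To make the selection step concrete, here is a minimal sketch of my own (not code from the paper).  It assumes the classifier hands back, for each unlabeled example, a probability that the example belongs to the class, and it picks the examples whose probability is closest to 0.5 as the ones to send to the human judge.  The function name and the batch size parameter are hypothetical.</p>

```python
def select_for_labeling(probabilities, batch_size):
    """Return the indices of the examples the classifier is least sure about.

    probabilities: one estimated class-membership probability per
    unlabeled example (assumed to come from the current classifier).
    batch_size: how many examples to hand to the human judge this round.
    """
    # An estimate near 0.5 means the classifier cannot decide either way,
    # so rank the examples by their distance from 0.5, smallest first.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]


if __name__ == "__main__":
    estimates = [0.9, 0.52, 0.1, 0.45]
    print(select_for_labeling(estimates, 2))
```

<p>In a full loop, the judged examples would be added to the training set, the classifier retrained, and the selection repeated until the human is satisfied.</p>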
<p>Continuing on into section 4, we are introduced to how to build a<br />
classifier and use uncertainty sampling to train it.  Most of the<br />
section details the probability theory behind it, finishing up with<br />
how to do the sampling.  One thing I always wish for in these papers<br />
is concrete examples (maybe as an appendix or a reference) that work<br />
through the math on an actual toy problem.  Section 5 does just this,<br />
laying out an experiment and discussing the details, minus the math,<br />
which probably suits most people just fine.</p>
<p>Section 7 has an excellent discussion of the results, the pay dirt<br />
being that using this new method significantly reduces the number of<br />
examples required for training, at the cost of having a human in the<br />
loop.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/11/07: &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by Lewis and Gale</title>
		<link>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</link>
		<comments>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/#comments</comments>
		<pubDate>Mon, 11 Jun 2007 12:18:22 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[naive bayes]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</guid>
		<description><![CDATA[More on text classification: &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David Lewis and William Gale.  A little bit of an older paper, but still looks to be a good one.]]></description>
			<content:encoded><![CDATA[<p>More on text classification: &#8220;<a href="http://citeseer.ist.psu.edu/rd/52437760%2C100508%2C1%2C0.25%2CDownload/http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/508/http:zSzzSzwww.research.att.comzSz%7ElewiszSzpaperszSzlewis94c.pdf/lewis94sequential.pdf">A Sequential Algorithm for Training Text Classifiers</a>&#8221; by David Lewis and William Gale.  A little bit of an older paper, but still looks to be a good one.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/3/07: Discussion of &#8220;A Comparison of Event Models for Naive Bayes Text Classification&#8221; by Andrew McCallum and Kamal Nigam</title>
		<link>http://www.paperoftheweek.com/2007/06/09/potw-6307-discussion-of-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/</link>
		<comments>http://www.paperoftheweek.com/2007/06/09/potw-6307-discussion-of-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/#comments</comments>
		<pubDate>Sun, 10 Jun 2007 01:34:53 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[naive bayes]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/09/potw-6307-discussion-of-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/</guid>
		<description><![CDATA[We are reading &#8220;A Comparison of Event Models for Naive Bayes Text Classification&#8221; by McCallum and Nigam. Text classification is the process of assigning a document to one or more categories (we looked at classification/categorization earlier when exploring Support Vector Machines, SVMs).  My understanding of the difference between categorization and classification is that categorization has [...]]]></description>
			<content:encoded><![CDATA[<p>We are reading &#8220;<a href="http://citeseer.ist.psu.edu/rd/0%2C489994%2C1%2C0.25%2CDownload/http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/24415/http:zSzzSzlans.ece.utexas.eduzSzulgzSzpaperszSznigam-mccallum-bayes.pdf/mccallum98comparison.pdf">A Comparison of Event Models for Naive Bayes Text Classification</a>&#8221; by McCallum and Nigam.</p>
<p>Text classification is the process of assigning a document to one or more categories (we looked at classification/categorization earlier when exploring Support Vector Machines, SVMs).  My understanding of the difference between categorization and classification is that categorization has a set number of categories, whereas classification does not.  At any rate, this paper compares two different classifiers that use a <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">naive Bayes</a> approach.  The naive Bayes approach assumes that all attributes of the examples we are studying are independent of each other.  Even though this is rarely true in the real world for text (after all, if I chose all the words of this post independently we probably would have gibberish), it turns out that it still works pretty well in practice.  But I digress&#8230; The two approaches being compared are the Bernoulli model and the multinomial model.  The Bernoulli model uses the document as an event and builds a vector of binary attributes based on whether or not a term occurs in the document.  It DOES NOT take into account the number of times the word occurs.  In the multinomial approach, words are the event and term frequency does matter.  The next couple of sections after the introduction lay out the common ground between the two approaches as well as the differences.  The differences come down to how the probabilities are calculated.</p>
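<p>Here is a small sketch of my own (not code from the paper) showing how the two models see the same document, and how the per-class log likelihood would be computed in each case.  The function names are hypothetical, and the per-term probabilities are assumed to have been estimated (and smoothed) already.</p>

```python
import math
from collections import Counter


def bernoulli_features(tokens, vocabulary):
    """Binary vector: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def multinomial_features(tokens, vocabulary):
    """Count vector: how many times each term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def bernoulli_log_likelihood(binary_vector, term_probs):
    """Every vocabulary term contributes, whether it is present or absent."""
    return sum(math.log(p) if x else math.log(1.0 - p)
               for x, p in zip(binary_vector, term_probs))


def multinomial_log_likelihood(count_vector, term_probs):
    """Only occurring terms contribute, weighted by their counts.

    The multinomial coefficient is left out because it depends only on
    the document, so it is the same for every class.
    """
    return sum(n * math.log(p) for n, p in zip(count_vector, term_probs))


if __name__ == "__main__":
    vocabulary = ["bayes", "naive", "text"]
    tokens = "naive bayes is naive".split()
    print(bernoulli_features(tokens, vocabulary))
    print(multinomial_features(tokens, vocabulary))
```

<p>Notice that the repeated word only shows up in the count vector, and that absent terms only matter in the Bernoulli likelihood.</p>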
<p>There is some interesting discussion of feature selection (a way of reducing the size of the vocabulary, which speeds things up, without, hopefully, losing too much information) using mutual information that is worth digging into a bit more if you have the time.</p>
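<p>As a rough illustration of the idea (my own sketch, not the paper&#8217;s exact formulation), mutual information between a class and the presence or absence of a single term can be computed from a 2x2 table of document counts.  The function name and the argument layout are assumptions made for this example.</p>

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in nats) between class membership and term presence.

    n11: documents in the class that contain the term
    n10: documents in the class that do not contain the term
    n01: documents outside the class that contain the term
    n00: documents outside the class that do not contain the term
    """
    total = n11 + n10 + n01 + n00
    # Each cell is (joint count, class marginal count, term marginal count).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    score = 0.0
    for joint, class_count, term_count in cells:
        if joint == 0:
            continue  # an empty cell contributes nothing to the sum
        score += (joint / total) * math.log(
            (joint * total) / (class_count * term_count))
    return score


if __name__ == "__main__":
    # A term that perfectly separates two equally sized classes.
    print(mutual_information(5, 0, 0, 5))
    # A term that tells us nothing about the class.
    print(mutual_information(1, 1, 1, 1))
```

<p>Terms would then be ranked by this score and only the top scorers kept in the vocabulary.</p>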
<p>The next sections are where the rubber meets the road and the authors do a side-by-side comparison of the two approaches using 5 different collections.  You can see the results on pages 5 and 6.  Finally, the discussion of the results occurs on pages 6 and 7, with the bottom line being that the multinomial model seems to &#8220;be almost uniformly better than the multi-variate Bernoulli model.&#8221;</p>
<p>For those interested, <a href="http://www.cs.waikato.ac.nz/~ml/weka/index.html">Weka</a> has tools for building Naive Bayes classifiers.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/09/potw-6307-discussion-of-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 5/14/07: Discussion of &#8220;Discovering Trends in Text Databases&#8221; by Lent et al.</title>
		<link>http://www.paperoftheweek.com/2007/05/18/potw-51407-discussion-of-discovering-trends-in-text-databases-by-lent-et-al/</link>
		<comments>http://www.paperoftheweek.com/2007/05/18/potw-51407-discussion-of-discovering-trends-in-text-databases-by-lent-et-al/#comments</comments>
		<pubDate>Fri, 18 May 2007 13:28:33 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[trend analysis]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/05/18/potw-51407-discussion-of-discovering-trends-in-text-databases-by-lent-et-al/</guid>
		<description><![CDATA[This week&#8217;s paper, &#8220;Discovering Trends in Text Databases&#8221; by Lent is my first look at some text mining tools and applications. The paper discusses a method for identifying trends in databases. In this case, a trend is defined as &#8220;a specific subsequence of the history of a phrase that satisfies the users&#8217; query over the [...]]]></description>
			<content:encoded><![CDATA[<p>This week&#8217;s paper, &#8220;<a href="http://citeseer.ist.psu.edu/rd/52437760%2C29718%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/1451/http:zSzzSzwww.almaden.ibm.comzSzcszSzpeoplezSzragrawalzSzpaperszSzkdd97_trends.pdf/lent97discovering.pdf">Discovering Trends in Text Databases</a>&#8221; by Lent, is my first look at some text mining tools and applications.  The paper discusses a method for identifying trends in databases.  In this case, a trend is defined as &#8220;a specific subsequence of the history of a phrase that satisfies the users&#8217; query over the histories&#8221;.  Essentially, what the authors are doing is identifying phrases in timestamped text, which they can then match against users&#8217; queries concerning things like spikes in usage of particular phrases, etc.</p>
<p>After covering some related work about Latent Semantic Indexing (I suppose I should look into that some day), the authors delve into the methodology of identifying phrases and their histories.  There are 3 steps to the process: 1) identifying frequent phrases, 2) generating histories for the phrases and 3) identifying the phrases for a given trend.</p>
<p>Phrases in this paper go beyond the simple sequence of terms, introducing the notion of a &#8220;k-phrase&#8221;.   A k-phrase is essentially a nesting of phrases, and it can span sentences, etc. when appropriate.  For the histories, each word gets a transaction id and associated timestamps.  Then, given these bits of information, the authors use a shape query language to mine the phrases and histories.  The shape query language allows the user to specify that they are interested in when items are &#8220;spiking&#8221; or &#8220;trending downward&#8221;, etc.  There is a reference for the shape language in the paper.</p>
<p>Finally, the paper ends with a discussion of how IBM used the approach in a patent mining system to identify trends in patents from the US Patent Office.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/05/18/potw-51407-discussion-of-discovering-trends-in-text-databases-by-lent-et-al/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 5/14/07: &#8220;Discovering Trends in Text Databases&#8221; by Lent et al.</title>
		<link>http://www.paperoftheweek.com/2007/05/14/potw-51407-discovering-trends-in-text-databases-by-lent-et-al/</link>
		<comments>http://www.paperoftheweek.com/2007/05/14/potw-51407-discovering-trends-in-text-databases-by-lent-et-al/#comments</comments>
		<pubDate>Mon, 14 May 2007 13:37:37 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[trend analysis]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/05/14/potw-51407-discovering-trends-in-text-databases-by-lent-et-al/</guid>
		<description><![CDATA[Ah, good to be back!  This week&#8217;s paper is &#8220;Discovering Trends in Text Databases&#8221; by Brian Lent, Rakesh Agrawal and Ramakrishnan Srikant.]]></description>
			<content:encoded><![CDATA[<p>Ah, good to be back!  This week&#8217;s paper is &#8220;<a href="http://citeseer.ist.psu.edu/rd/52437760%2C29718%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/1451/http:zSzzSzwww.almaden.ibm.comzSzcszSzpeoplezSzragrawalzSzpaperszSzkdd97_trends.pdf/lent97discovering.pdf">Discovering Trends in Text Databases</a>&#8221; by Brian Lent, Rakesh Agrawal and Ramakrishnan Srikant.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/05/14/potw-51407-discovering-trends-in-text-databases-by-lent-et-al/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW to return next week</title>
		<link>http://www.paperoftheweek.com/2007/05/08/potw-to-return-next-week/</link>
		<comments>http://www.paperoftheweek.com/2007/05/08/potw-to-return-next-week/#comments</comments>
		<pubDate>Tue, 08 May 2007 12:39:57 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/05/08/potw-to-return-next-week/</guid>
		<description><![CDATA[I will be returning to writing next week after a great ApacheCon Europe conference last week.  My &#8220;Advanced Lucene&#8221; slides are available at http://www.cnlp.org/presentations/present.asp?show=conference Next week, I think I am going to start looking into things like event detection, etc.  However, I am also considering looking into some non-NLP areas related to data mining, so [...]]]></description>
			<content:encoded><![CDATA[<p>I will be returning to writing next week after a great <a href="http://www.eu.apachecon.com">ApacheCon Europe</a> conference last week.  My &#8220;Advanced Lucene&#8221; slides are available at <a href="http://www.cnlp.org/presentations/present.asp?show=conference">http://www.cnlp.org/presentations/present.asp?show=conference</a></p>
<p>Next week, I think I am going to start looking into things like event detection, etc.  However, I am also considering looking into some non-NLP areas related to data mining, so if you have a preference, let me know.  I am open to exploring many new ideas in the field.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/05/08/potw-to-return-next-week/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Guest Contributor wanted for next 3 weeks</title>
		<link>http://www.paperoftheweek.com/2007/04/15/guest-contributor-wanted-for-next-3-weeks/</link>
		<comments>http://www.paperoftheweek.com/2007/04/15/guest-contributor-wanted-for-next-3-weeks/#comments</comments>
		<pubDate>Sun, 15 Apr 2007 19:40:05 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[disambiguation]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Question Answering]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/04/15/guest-contributor-wanted-for-next-3-weeks/</guid>
		<description><![CDATA[If you have an interest in writing on artificial intelligence, clustering, information retrieval or computer science in general and are interested in reviewing one or more articles over the coming three weeks on this forum, please contact me by leaving a comment on this post.  All topics will be subject to my review for appropriateness, [...]]]></description>
			<content:encoded><![CDATA[<p>If you have an interest in writing on artificial intelligence, clustering, information retrieval or computer science in general and are interested in reviewing one or more articles over the coming three weeks on this forum, please contact me by leaving a comment on this post.  All topics will be subject to my review for appropriateness, but I am open to most any article or publication in the Computer Science field.</p>
<p>Otherwise, I will be taking a brief hiatus from reviewing papers until I return from <a href="http://www.eu.apachecon.com">ApacheCon Europe</a> in the early part of May where I am giving a talk and a training on Lucene.  I have several key deadlines over the next two weeks that must take higher priority, including the publication of a couple of articles that I have been working on.  I will post details on the publication on my <a href="http://lucene.grantingersoll.com/">Lucene blog</a> when they are available.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/04/15/guest-contributor-wanted-for-next-3-weeks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>POTW 3/4/07: Answer Mining by Combining Extraction Techniques with Abductive Reasoning by Harabagiu, et al.</title>
		<link>http://www.paperoftheweek.com/2007/03/09/potw-3407-answer-mining-by-combining-extraction-techniques-with-abductive-reasoning-by-harabagiu-et-al-2/</link>
		<comments>http://www.paperoftheweek.com/2007/03/09/potw-3407-answer-mining-by-combining-extraction-techniques-with-abductive-reasoning-by-harabagiu-et-al-2/#comments</comments>
		<pubDate>Fri, 09 Mar 2007 21:37:56 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Question Answering]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/03/09/potw-3407-answer-mining-by-combining-extraction-techniques-with-abductive-reasoning-by-harabagiu-et-al-2/</guid>
		<description><![CDATA[This week&#8217;s paper, &#8220;Answer Mining by Combining Extraction Techniques with Abductive Reasoning&#8221;, lays out, at a high level, the capabilities of the highest performing QA system at TREC 2003, namely Language Computer Corporation&#8217;s QA system.  The first section or two lay out the groundwork for the competition, much as was already done in the Voorhees [...]]]></description>
			<content:encoded><![CDATA[<p>This week&#8217;s paper, &#8220;<a href="http://trec.nist.gov/pubs/trec12/papers/lcc.qa.pdf">Answer Mining by Combining Extraction Techniques with Abductive Reasoning</a>&#8221;, lays out, at a high level, the capabilities of the highest performing QA system at TREC 2003, namely <a href="http://www.languagecomputer.com/">Language Computer Corporation&#8217;s</a> QA system.  The first section or two lay out the groundwork for the competition, much as was already done in the Voorhees paper from last week.  The real meat of the paper starts in the section titled &#8220;The architecture of the QA system&#8221;.</p>
<p>The system is divvied up into several components, as displayed in Figure 1 of the document.  They are the question processing unit, document processing, factoid answer processing, list answer processing and definition answer processing.  All documents are processed in the same way and passages are retrieved based on the keywords in the question.  Depending on the type of question, some passages are removed if they do not have the right answer type.  Passages having a higher number of expected answer types are favored for the list based approach.</p>
<p>For factoid questions, LCC used their CICERO LITE system to provide extractions, and answers were calculated based on the extractions and/or the expected answer types.  The extraction process had to identify a variety of semantic classes, such as quantity, date, people, city, etc.  The paper then discusses the types of questions that were answered by these approaches, as well as some special case scenarios related to manner of death (kind of morbid, but death is often a fascination of this kind of research, in my experience).  From here, there is a discussion of the role of theorem proving in the algorithm (see page 5), but the details are left to another paper (guess what will be the paper next week?).  I must admit, I don&#8217;t fully understand pages 5 and 6 just yet, mostly, I think, because I&#8217;m not familiar with the syntax they are using, so maybe reading the next paper will make it clearer.</p>
<p>Page 6 continues with discussion of finding answers for definition questions, which relies on a pattern matching approach to find answers based on 38 internally developed patterns, some of which are in Table 5 in the paper.</p>
<p>Page 7 finishes off with a discussion of their list based approach, which uses a threshold cutoff approach that determines the similarity between the first and last answers in the list, and all those in between, cutting off the answers when they reach a threshold.</p>
<p>The rest of the paper is performance evaluation and references.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/03/09/potw-3407-answer-mining-by-combining-extraction-techniques-with-abductive-reasoning-by-harabagiu-et-al-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 2/11/07: Discussion of sections 5-8 of Minkov et al.</title>
		<link>http://www.paperoftheweek.com/2007/02/16/potw-21107-discussion-of-sections-5-8-of-minkov-et-al/</link>
		<comments>http://www.paperoftheweek.com/2007/02/16/potw-21107-discussion-of-sections-5-8-of-minkov-et-al/#comments</comments>
		<pubDate>Fri, 16 Feb 2007 11:54:09 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[disambiguation]]></category>
		<category><![CDATA[email]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/16/potw-21107-discussion-of-sections-5-8-of-minkov-et-al/</guid>
		<description><![CDATA[POTW 2/11/07: Contextual Search and Name Disambiguation in Email Using Graphs Discussion of Sections 5 through 8 The remaining sections of this paper are discussions of two applications of the algorithms plus the body of related work and conclusions that can be drawn from the work. Section 5 gives the details on what corpora were [...]]]></description>
			<content:encoded><![CDATA[<p>POTW 2/11/07: <a href="http://www.cs.cmu.edu/%7Eeinat/sigir-06.pdf">Contextual Search and Name Disambiguation in Email Using Graphs</a></p>
<p><strong>Discussion of Sections 5 through 8</strong></p>
<p>The remaining sections of this paper are discussions of two applications of the algorithms plus the body of related work and conclusions that can be drawn from the work.  Section 5 gives the details on what corpora were used (Enron, plus an internal email set not publicly available) before proceeding to the task of name disambiguation.</p>
<p>Name disambiguation in email is the task of correlating the mention of a name in an email with the actual person.  While this is fairly straightforward in most cases for people reading their own email, it becomes difficult when reading others&#8217; email, since one may not know all the people the other person does.  Multiply this by a collection of emails filled with nicknames and initials and it is not hard to see why this is difficult.  The task is useful for establishing social networks as well as other applications.  One could easily imagine an automated system that retrieved relevant information about a person mentioned in an email (bio, address, phone, past conversations, etc.) and made it available to the reader for instant access.</p>
<p>The remaining parts of section 5 go into the details of applying the algorithm and the results that are achieved.   The interesting thing in my mind is that the graph often connects different types of nodes and associates different probabilities with the transitions from one node to the other.  Applying the graph walking strategy then leads to the desired results.  Suffice it to say, the new approach performs better than the baseline approach!</p>
<p>Identifying threads of emails is the second application the authors use to demonstrate their capabilities.  Threading is the problem of identifying one or more messages that are related to some chosen email.  Many email systems do a basic job of this by comparing subject lines, esp. those that use the &#8220;RE:&#8221; prefix.  However, we all know people often treat the subject line differently.  Furthermore, people tend to quote the previous messages in the thread differently.  Some use &#8220;&gt;&#8221;, while others use &#8220;|&#8221;, while still others use nothing at all.  Add in inline replies, which are especially common on mailing lists, and you see why the problem becomes difficult.  Section 5.4 lays out the graph walk approach and compares it to a TF-IDF IR approach, which, of course, it outperforms, especially when using a machine learning re-ranking approach.</p>
<p>The rest of the paper is on related work and conclusions.  I am glad the authors address the performance in terms of scalability in the conclusions section, as I had my doubts about how well the approach could perform on large amounts of data.  In fact, I find a lot of papers in the NLP realm fail to account for performance, so it is refreshing to see it addressed.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a>, <a href="http://technorati.com/tag/email" rel="tag">email</a>, <a href="http://technorati.com/tag/natural+language+processing" rel="tag">natural language processing</a>, <a href="http://technorati.com/tag/NLP" rel="tag">NLP</a>, <a href="http://technorati.com/tag/information+retrieval" rel="tag">information retrieval</a>, <a href="http://technorati.com/tag/IR" rel="tag">IR</a>, <a href="http://technorati.com/tag/threading" rel="tag">threading</a>, <a href="http://technorati.com/tag/name+disambiguation" rel="tag">name disambiguation</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/16/potw-21107-discussion-of-sections-5-8-of-minkov-et-al/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

