<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper of the Week &#187; Statistical Approach</title>
	<atom:link href="http://www.paperoftheweek.com/category/statistical-approach/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.paperoftheweek.com</link>
	<description>Read. Learn. Discuss.</description>
	<lastBuildDate>Tue, 14 Aug 2007 01:35:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>POTW 6/24/07: &#8220;Support-Vector Networks&#8221; by Cortes and Vapnik</title>
		<link>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/</link>
		<comments>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/#comments</comments>
		<pubDate>Mon, 25 Jun 2007 18:27:22 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[SVM]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[support vector machines]]></category>
		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/</guid>
		<description><![CDATA[Long paper this week, but it is the original on Support Vector Machines: Support-Vector Networks by Cortes and Vapnik.  Given my schedule, I may spread this out over two weeks.]]></description>
			<content:encoded><![CDATA[<p>Long paper this week, but it is the original on Support Vector Machines: <a href="http://citeseer.ist.psu.edu/rd/0%2C500489%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/23317/http:zSzzSzwww.research.att.comzSz%7EcorinnazSzpaperszSzsupport.vector.pdf/cortes95supportvector.pdf">Support-Vector Networks</a> by Cortes and Vapnik.  Given my schedule, I may spread this out over two weeks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/11/07: Discussion of &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by Lewis and Gale</title>
		<link>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</link>
		<comments>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/#comments</comments>
		<pubDate>Mon, 18 Jun 2007 03:58:12 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[naive bayes]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</guid>
		<description><![CDATA[In &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David D. Lewis and William Gale, the authors put forth a new (at the time) method for training text classifiers using an approach they call &#8220;uncertainty sampling.&#8221; Section 1 outlines the problem of training, namely obtaining a good sample of text to be labeled for the trainer. [...]]]></description>
			<content:encoded><![CDATA[<p>In &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David D.<br />
Lewis and William Gale, the authors put forth a new (at the time)<br />
method for training text classifiers using an approach they call<br />
&#8220;uncertainty sampling.&#8221;</p>
<p>Section 1 outlines the problem of training, namely obtaining a good<br />
sample of text to be labeled for the trainer.  After disposing of<br />
several other methods of garnering samples (random, relevance<br />
feedback based), Lewis and Gale introduce an iterative approach for<br />
manually labeling examples.</p>
<p>Section 2 then discusses the benefits of &#8220;learning by query&#8221; in<br />
theory, namely the possibility of reducing the error rate very<br />
quickly in comparison to the number of queries required.</p>
<p>Figure 1 (described in section 3) outlines their basic approach,<br />
which relies on having a human judge some subset of examples that the<br />
currently used classifier is least certain about.  This process is<br />
iterated until the human feels satisfied with the results.  One<br />
caveat of this approach is that the classifier must not only predict<br />
the class, it must give a measurement of certainty for that class.</p>
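<p>To make the loop concrete, here is a minimal Python sketch of uncertainty sampling. It is not the code from the paper: the function names, the <code>train</code> and <code>ask_human</code> callbacks, and the choice to measure uncertainty as closeness of the predicted probability to 0.5 are all assumptions made for illustration.</p>
<pre>
```python
def uncertainty(prob_positive):
    """Score in [0, 1]: 1.0 when the classifier says 0.5, 0.0 when it says 0 or 1."""
    return 1.0 - 2.0 * abs(prob_positive - 0.5)


def select_for_labeling(pool, predict_proba, batch_size):
    """Return the batch_size pool items the classifier is least certain about."""
    ranked = sorted(pool, key=lambda item: uncertainty(predict_proba(item)), reverse=True)
    return ranked[:batch_size]


def active_learn(pool, labeled, train, ask_human, rounds, batch_size):
    """Repeat: train on what is labeled, then have a human label the uncertain items.

    train(labeled) is a hypothetical callback returning a function that maps an
    item to P(class = positive); ask_human(item) returns the human's label.
    """
    pool = list(pool)
    labeled = list(labeled)
    for _ in range(rounds):
        if not pool:
            break
        predict_proba = train(labeled)
        for item in select_for_labeling(pool, predict_proba, batch_size):
            labeled.append((item, ask_human(item)))
            pool.remove(item)
    return labeled
```
</pre>
<p>Note that <code>select_for_labeling</code> only works because the classifier returns a probability rather than a bare class label, which is exactly the caveat mentioned above. In the paper the human decides when to stop; the fixed <code>rounds</code> count here is a simplification.</p>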
<p>Continuing on into section 4, we are introduced to how to build a<br />
classifier and use uncertainty sampling to train it.  Most of the<br />
section details the probability theory behind it, finishing up with<br />
how to do the sampling.  One thing I always wish for in these papers<br />
is a concrete example (maybe as an appendix or a reference) that works<br />
through the math on an actual toy problem.  Section 5 does just this,<br />
laying out an experiment and discussing the details, minus the math,<br />
which probably suits most people just fine.</p>
<p>Section 7 has an excellent discussion of the results, the pay dirt<br />
being that using this new method significantly reduces the number of<br />
examples required for training, at the cost of having a human in the<br />
loop.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/11/07: &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by Lewis and Gale</title>
		<link>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</link>
		<comments>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/#comments</comments>
		<pubDate>Mon, 11 Jun 2007 12:18:22 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[naive bayes]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</guid>
		<description><![CDATA[More on text classification: &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David Lewis and William Gale.  A little bit of an older paper, but still looks to be a good one.]]></description>
			<content:encoded><![CDATA[<p>More on text classification: &#8220;<a href="http://citeseer.ist.psu.edu/rd/52437760%2C100508%2C1%2C0.25%2CDownload/http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/508/http:zSzzSzwww.research.att.comzSz%7ElewiszSzpaperszSzlewis94c.pdf/lewis94sequential.pdf">A Sequential Algorithm for Training Text Classifiers</a>&#8221; by David Lewis and William Gale.  A little bit of an older paper, but still looks to be a good one.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/3/07: Discussion of &#8220;A Comparison of Event Models for Naive Bayes Text Classification&#8221; by Andrew McCallum and Kamal Nigam</title>
		<link>http://www.paperoftheweek.com/2007/06/09/potw-6307-discussion-of-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/</link>
		<comments>http://www.paperoftheweek.com/2007/06/09/potw-6307-discussion-of-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/#comments</comments>
		<pubDate>Sun, 10 Jun 2007 01:34:53 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[naive bayes]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/09/potw-6307-discussion-of-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/</guid>
		<description><![CDATA[We are reading &#8220;A Comparison of Event Models for Naive Bayes Text Classification&#8221; by McCallum and Nigam. Text classification is the process of assigning a document to one or more categories (we looked at classification/categorization earlier when exploring Support Vector Machines, SVMs).  My understanding of the difference between categorization and classification is that categorization has [...]]]></description>
			<content:encoded><![CDATA[<p>We are reading &#8220;<a href="http://citeseer.ist.psu.edu/rd/0%2C489994%2C1%2C0.25%2CDownload/http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/24415/http:zSzzSzlans.ece.utexas.eduzSzulgzSzpaperszSznigam-mccallum-bayes.pdf/mccallum98comparison.pdf">A Comparison of Event Models for Naive Bayes Text Classification</a>&#8221; by McCallum and Nigam.</p>
<p>Text classification is the process of assigning a document to one or more categories (we looked at classification/categorization earlier when exploring Support Vector Machines, SVMs).  My understanding of the difference between categorization and classification is that categorization has a set number of categories, whereas classification does not.  At any rate, this paper compares two different classifiers that use a <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">naive Bayes</a> approach.  The naive Bayes approach assumes that all attributes of the examples we are studying are independent of each other given the class.  Even though this is rarely true in the real world for text (after all, if I chose all the words of this post independently we probably would have gibberish), it turns out that it still works pretty well in practice.  But I digress&#8230; The two approaches that are being compared are the Bernoulli model and the multinomial model.  The Bernoulli model uses the document as an event and builds a vector of binary attributes based on whether a term occurs or not in the document.  It DOES NOT take into account the number of times the word occurs.  In the multinomial approach, words are the event and term frequency does matter.  The next couple of sections after the introduction lay out the common ground between the two approaches as well as the differences.  The differences come down to how the probabilities are calculated.</p>
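<p>Here is a small Python sketch of how the two event models differ when the probabilities are estimated. It is a toy written for this discussion, not the authors' implementation: the function names, the dictionary-based model, and the add-one (Laplace) smoothing are assumptions. Documents are lists of tokens.</p>
<pre>
```python
import math
from collections import Counter


def train(docs, labels, event_model):
    """Fit a naive Bayes model; event_model is 'bernoulli' or 'multinomial'."""
    vocab = sorted({w for doc in docs for w in doc})
    model = {"vocab": vocab, "event_model": event_model, "prior": {}, "cond": {}}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        model["prior"][c] = len(class_docs) / len(docs)
        if event_model == "bernoulli":
            # Count documents that contain the word; repeats inside a document are ignored.
            counts = Counter(w for doc in class_docs for w in set(doc))
            denom = len(class_docs) + 2
        else:
            # Count every occurrence of the word, so term frequency matters.
            counts = Counter(w for doc in class_docs for w in doc)
            denom = sum(counts.values()) + len(vocab)
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        model["cond"][c] = {w: (counts[w] + 1) / denom for w in vocab}
    return model


def log_score(model, doc, c):
    """Log of P(c) times the likelihood of doc under class c."""
    cond = model["cond"][c]
    score = math.log(model["prior"][c])
    if model["event_model"] == "bernoulli":
        present = set(doc)
        # Every vocabulary word contributes, whether it is present or absent.
        for w in model["vocab"]:
            score += math.log(cond[w] if w in present else 1.0 - cond[w])
    else:
        # Only the tokens in the document contribute, once per occurrence.
        for w in doc:
            if w in cond:
                score += math.log(cond[w])
    return score


def classify(model, doc):
    """Return the class with the highest log score."""
    return max(model["prior"], key=lambda c: log_score(model, doc, c))
```
</pre>
<p>The only differences between the two branches are what gets counted during training and which words contribute at scoring time, which matches the point above that the differences come down to how the probabilities are calculated.</p>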
<p>There is some interesting discussion of feature selection (a way of reducing the size of the vocabulary, which speeds things up, without, hopefully, losing too much information) using mutual information that is worth digging into a bit more if you have the time.</p>
<p>The next sections are where the rubber meets the road and the authors do a side-by-side comparison of the two approaches using 5 different collections.  You can see the results on pages 5 and 6.  Finally, the discussion of the results occurs on pages 6 and 7, with the bottom line being that the multinomial model seems to &#8220;be almost uniformly better than the multi-variate Bernoulli model.&#8221;</p>
<p>For those interested, <a href="http://www.cs.waikato.ac.nz/~ml/weka/index.html">Weka</a> has tools for building Naive Bayes classifiers.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/09/potw-6307-discussion-of-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/3/07: &#8220;A Comparison of Event Models for Naive Bayes Text Classification&#8221; by Andrew McCallum and Kamal Nigam</title>
		<link>http://www.paperoftheweek.com/2007/06/04/potw-6307-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/</link>
		<comments>http://www.paperoftheweek.com/2007/06/04/potw-6307-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/#comments</comments>
		<pubDate>Mon, 04 Jun 2007 12:42:56 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[naive bayes]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/04/potw-6307-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/</guid>
		<description><![CDATA[Paper of the week for the week of June 3, 2007 is &#8220;A Comparison of Event Models for Naive Bayes Text Classification&#8221; by Andrew McCallum and Kamal Nigam. This paper promises to shed some light on different ways of using Bayesian classifiers. It might be useful to do some background reading on naive Bayes starting [...]]]></description>
			<content:encoded><![CDATA[<p>Paper of the week for the week of June 3, 2007 is &#8220;<a href="http://citeseer.ist.psu.edu/rd/0%2C489994%2C1%2C0.25%2CDownload/http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/24415/http:zSzzSzlans.ece.utexas.eduzSzulgzSzpaperszSznigam-mccallum-bayes.pdf/mccallum98comparison.pdf">A Comparison of Event Models for Naive Bayes Text Classification</a>&#8221; by Andrew McCallum and Kamal Nigam.  This paper promises to shed some light on different ways of using Bayesian classifiers.  It might be useful to do some background reading on naive Bayes starting <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/04/potw-6307-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Joachims (SVMs)</title>
		<link>http://www.paperoftheweek.com/2007/01/19/discussion-of-joachims-svms/</link>
		<comments>http://www.paperoftheweek.com/2007/01/19/discussion-of-joachims-svms/#comments</comments>
		<pubDate>Fri, 19 Jan 2007 13:47:06 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/19/discussion-of-joachims-svms/</guid>
		<description><![CDATA[This week, if you remember, we are discussing Text Categorization with Support Vector Machines: Learning with Many Relevant Features &#8211; Joachims (ResearchIndex), which is a paper on Text Categorization (one of the most cited such papers on Google Scholar under the Text Categorization search). Text Categorization is the problem of assigning one or more predefined [...]]]></description>
			<content:encoded><![CDATA[<p>This week, if you remember, we are discussing <a href="http://citeseer.ist.psu.edu/141654.html">Text Categorization with Support Vector Machines: Learning with Many Relevant Features &#8211; Joachims (ResearchIndex)</a>, which is a paper on Text Categorization (one of the most cited such papers on Google Scholar under the Text Categorization search).</p>
<p>Text Categorization is the problem of assigning one or more predefined categories to a piece of text.  The Support Vector Machines approach is a supervised machine learning approach that attempts to learn a linear classifier (actually, it can be polynomial or another type by plugging a different kernel function into the algorithm).  It is actually trying to find a &#8220;linear separating hyperplane&#8221; (see the <a href="http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf">guide</a> for more info on the math).  For an implementation in Java (and other languages) check out <a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/">http://www.csie.ntu.edu.tw/~cjlin/libsvm/</a>.  This site has many useful resources explaining the algorithm, how to do feature selection, etc. (see the <a href="http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf">guide</a>).  In two dimensional space, think of finding a line (or polynomial curve) that separates your good examples from your bad ones.<br />
After introducing the topic of text categorization, the author discusses a bit on feature selection.  A feature vector in the paper is a vector of distinct words and their document frequency, minus stop words.  The author also throws out any words that occur fewer than 3 times.  I understand the stop word removal piece, but I don&#8217;t get the reasoning behind the fewer-than-3-times cutoff.  My guess is it is because those terms don&#8217;t significantly contribute to the solution.  Finally, features are scaled by their inverse document frequency.  This all makes sense to me coming from IR (Information Retrieval) land.  Removing stopwords and doing length normalization are standard techniques for improving search results.<br />
The cool thing Joachims points out about SVMs is that they &#8220;can be independent of the dimensionality of the feature space&#8221;.  In other words, SVMs work pretty well with a lot of features or very few features.</p>
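<p>The feature construction described above (drop stop words, drop words seen fewer than 3 times, scale by inverse document frequency) can be sketched in a few lines of Python. This is an illustration only: the function names are made up, the weight is assumed to be raw term count times <code>log(N / df)</code>, and the paper's exact weighting and normalization may differ. Documents are lists of tokens.</p>
<pre>
```python
import math
from collections import Counter


def build_vocabulary(docs, stop_words, min_count=3):
    """Keep words that are not stop words and occur at least min_count times overall."""
    totals = Counter(w for doc in docs for w in doc if w not in stop_words)
    return sorted(w for w, n in totals.items() if n >= min_count)


def tfidf_vectors(docs, vocab):
    """Return one sparse vector (a dict of word to weight) per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Store only the words that occur, since most vocabulary words are absent.
        vectors.append({w: tf[w] * idf[w] for w in vocab if tf[w]})
    return vectors
```
</pre>
<p>Storing each vector as a dictionary of only the words that occur reflects how sparse document vectors are when the vocabulary is large.</p>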
<p>In section 4, Joachims makes a very strong case for why SVMs are well suited for Text Categorization.  Summed up they are:</p>
<ol>
<li>Text can have many features (10k+ words in a collection)</li>
<li>Most features are important</li>
<li>Doc. vectors are sparse.  That is, most words in the collection do not occur in a particular document</li>
<li>Most Text Cat. problems are linearly separable.  This is the key idea behind why they work.</li>
</ol>
<p>The rest of the paper is a discussion of experiments and why SVMs are much better than the other popular approaches in use at the time.  Most notably, if you remember our discussion of <a href="http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/">Yang 97</a> from last week, SVMs beat up on kNN quite handily.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/19/discussion-of-joachims-svms/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Discussion of Sections 5-7 of Yang 97</title>
		<link>http://www.paperoftheweek.com/2007/01/12/discussion-of-sections-5-7-of-yang-97/</link>
		<comments>http://www.paperoftheweek.com/2007/01/12/discussion-of-sections-5-7-of-yang-97/#comments</comments>
		<pubDate>Sat, 13 Jan 2007 02:12:01 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/12/discussion-of-sections-5-7-of-yang-97/</guid>
		<description><![CDATA[Whew, I think we&#8217;ve made it through our first paper, or we are about to anyway.  If you recall, we are working our way through Yang 97 and had made it through the first 4 sections so far, which are covered here.  This leaves us with the meat of the paper, I guess, which is [...]]]></description>
			<content:encoded><![CDATA[<p>Whew, I think we&#8217;ve made it through our first paper, or we are about to anyway.  If you recall, we are working our way through <a href="http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/">Yang 97</a> and had made it through the first 4 sections so far, which are covered <a href="http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/">here</a>.  This leaves us with the meat of the paper, I guess, which is the actual experiments.</p>
<p>I found sections 5 through 7 to be pretty straightforward.  The author runs a variety of classifiers on a number of different corpora.  kNN seems to hold up the best and seems to be the only approach that really scales.  Of course, keep in mind this was 1997; this is probably no longer the case.  Perhaps we can see if there have been any updates to this paper in the future (or perhaps someone can leave a link doing just that).  Table 3 of the paper contains the guts of what you need to know about the experiments.</p>
<p>All in all, I think this paper is a nice soft introduction to the field of text categorization. The lessons I take from the paper are the following:</p>
<ol>
<li>Pay special attention to what is in the corpus that is being tested.  The inclusion/exclusion of unlabeled documents can make a big difference.  Try to use a standard corpus, if there is such a thing.</li>
<li>Know what measures are being used when reading the paper and why people use those measures.</li>
<li>Sometimes, simpler is better.  I know the fields need to advance, but sometimes a good solid algorithm like kNN is all you need.</li>
</ol>
<p>I think for next week, I&#8217;m going to continue on with the theme of text categorization and try to dig into a specific algorithm to see how it works.  After that, I have been wanting to look into some Graph Theory algorithms in use with IR and NLP.  As always, leave comments with suggestions on anything you think I just have to read lest I miss the boat completely.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/12/discussion-of-sections-5-7-of-yang-97/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Sections 1-4 of Yang 97</title>
		<link>http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/</link>
		<comments>http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/#comments</comments>
		<pubDate>Thu, 11 Jan 2007 02:31:24 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/</guid>
		<description><![CDATA[So, hopefully everyone has read the paper (http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/) at least once. The first 4 sections are quite easy to get, in my opinion, as they define the problem of text categorization and lay the framework for the experiments. Digging into the details of the various implementations will be left as an exercise for the reader [...]]]></description>
			<content:encoded><![CDATA[<p>So, hopefully everyone has read the paper (<a href="http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/">http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/</a>) at least once.  The first 4 sections are quite easy to get, in my opinion, as they define the problem of text categorization and lay the framework for the experiments.  Digging into the details of the various implementations will be left as an exercise for the reader at this point (damn, I&#8217;ve always wanted to say that ever since wading through math proofs back in college).  In a nutshell, the author is setting out to evaluate 14 different text categorization methods using a few different &#8220;standard&#8221; collections.  The real effort here is to try to compare apples to apples, since some of the prior research concerning these systems has used a variety of approaches to evaluation, preventing direct comparison.  In section 3.2 it is important to note the discussion of how the corpora are used.  I am sure, as we go forward on many of these topics, that we will come across these corpora again and again, especially the Reuters collection (heck, we even use it in <a href="http://lucene.apache.org/java/docs">Lucene</a> for benchmarking).  Section 4, on performance measures, discusses another piece of information that occurs in much of the literature, namely the concept of evaluation.  Recall and precision are very common measures and the paper does a good job of explaining how we derive recall, precision, break even point and the F measure from the truth table generated by a binary classifier. The recall and precision formulas are worth repeating here, I think, as we will see many variations of this when discussing IR, etc.:</p>
<p>recall = # categories found correct / total # of categories correct = the number we got right divided by the total number of right answers in the system<br />
precision = # categories found correct / total categories found = the number we got right out of the number that we retrieved</p>
<p>For example, you could have perfect recall by assigning every possible category to every document, but your precision would not be very good.<br />
Finally, section 4 ends with a discussion on what average (micro or macro) to use.  The micro average was chosen because, according to the paper, it favors the more common categories.  This, I think, makes sense given that one is probably more interested in how a system does on the common categories, unless of course you are really interested in the rare categories. <img src='http://www.paperoftheweek.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  Perhaps someone with a better understanding can contribute some thoughts on this.</p>
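<p>As a rough illustration of the measures above, here is a short Python sketch that computes precision and recall from per-category counts and then averages them both ways. The function names and the <code>(tp, fp, fn)</code> tuple layout (true positives, false positives, false negatives) are assumptions for this example, and the F measure shown is the balanced F1 variant.</p>
<pre>
```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (the F measure with equal weighting)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_average(counts):
    """Pool the counts of all categories first, then compute the measures once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall(tp, fp, fn)


def macro_average(counts):
    """Compute the measures per category, then take the unweighted mean."""
    pairs = [precision_recall(*c) for c in counts.values()]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```
</pre>
<p>With a common category at <code>(90, 10, 10)</code> and a rare one at <code>(1, 0, 9)</code>, the pooled counts are dominated by the common category, so the micro average stays close to that category's scores, while the macro average gives the rare category an equal say.</p>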
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An Evaluation of Statistical Approaches to Text Categorization &#8211; Yang (ResearchIndex)</title>
		<link>http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/</link>
		<comments>http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/#comments</comments>
		<pubDate>Mon, 08 Jan 2007 13:41:24 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/</guid>
		<description><![CDATA[An Evaluation of Statistical Approaches to Text Categorization &#8211; Yang (ResearchIndex) is the first paper of the week that I am going to tackle. Finding a first paper was harder than I thought. I originally intended to start with some of the seminal papers in IR, such as Salton&#8217;s work, or Sparck-Jones, but it seems [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://citeseer.ist.psu.edu/yang97evaluation.html">An Evaluation of Statistical Approaches to Text Categorization &#8211; Yang (ResearchIndex)</a> is the first paper of the week that I am going to tackle.<br />
Finding a first paper was harder than I thought.  I originally intended to start with some of the seminal papers in IR, such as Salton&#8217;s work, or Sparck-Jones, but it seems many of these works are &#8220;locked up&#8221; in books that I don&#8217;t have access to at the moment.  I guess I may need to renew my ACM membership in order to get access to more content or see if the local library has them.  In the long run, I think I will probably end up also allowing for chapters of books to be read, but I first want to start with what is freely available on the web.<br />
This paper, however, strikes me as a good intro to some of the statistical approaches to text categorization, so it seems it would be a good starting place for me.  It comes recommended from <a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&#038;location=http%3A%2F%2Fwww.amazon.com%2FFoundations-Statistical-Natural-Language-Processing%2Fdp%2F0262133601%2Fsr%3D8-1%2Fqid%3D1168263367%3Fie%3DUTF8%26s%3Dbooks&#038;tag=grantingersol-20&#038;linkCode=ur2&#038;camp=1789&#038;creative=9325">Foundations of Statistical Natural Language Processing</a> by Manning and Schutze (Chapter 16), which is the book we used when I took Dr. Liddy&#8217;s NLP course at Syracuse University a few years back.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
