<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper of the Week &#187; Text Categorization</title>
	<atom:link href="http://www.paperoftheweek.com/category/computer-science/natural-language-processing-nlp/text-categorization/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.paperoftheweek.com</link>
	<description>Read. Learn. Discuss.</description>
	<lastBuildDate>Tue, 14 Aug 2007 01:35:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>POTW 6/24/07: &#8220;Support-Vector Networks&#8221; by Cortes and Vapnik</title>
		<link>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/</link>
		<comments>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/#comments</comments>
		<pubDate>Mon, 25 Jun 2007 18:27:22 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[support vector machines]]></category>
		<category><![CDATA[SVM]]></category>
		<category><![CDATA[Text Categorization]]></category>
		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/</guid>
		<description><![CDATA[Long paper this week, but it is the original on Support Vector Machines: Support-Vector Networks by Cortes and Vapnik.  Given my schedule, I may spread this out over two weeks.]]></description>
			<content:encoded><![CDATA[<p>Long paper this week, but it is the original on Support Vector Machines: <a href="http://citeseer.ist.psu.edu/rd/0%2C500489%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/23317/http:zSzzSzwww.research.att.comzSz%7EcorinnazSzpaperszSzsupport.vector.pdf/cortes95supportvector.pdf">Support-Vector Networks</a> by Cortes and Vapnik.  Given my schedule, I may spread this out over two weeks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/25/potw-62407-support-vector-networks-by-cortes-and-vapnik/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/11/07: Discussion of &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by Lewis and Gale</title>
		<link>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</link>
		<comments>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/#comments</comments>
		<pubDate>Mon, 18 Jun 2007 03:58:12 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[naive bayes]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</guid>
		<description><![CDATA[In &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David D. Lewis and William Gale, the authors put forth a new (at the time) method for training text classifiers using an approach they call &#8220;uncertainty sampling&#8221;. Section 1 outlines the problem of training, namely obtaining a good sample of text to be labeled for the trainer. [...]]]></description>
			<content:encoded><![CDATA[<p>In &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David D.<br />
Lewis and William Gale, the authors put forth a new (at the time)<br />
method for training text classifiers using an approach they call<br />
&#8220;uncertainty sampling&#8221;.</p>
<p>Section 1 outlines the problem of training, namely obtaining a good<br />
sample of text to be labeled for the trainer.  After disposing of<br />
several other methods of garnering samples (random, relevance<br />
feedback based), Lewis and Gale introduce an iterative approach for<br />
manually labeling examples.</p>
<p>Section 2 then discusses the benefits of &#8220;learning by query&#8221; in<br />
theory, namely the possibility of reducing the error rate very<br />
quickly in comparison to the number of queries required.</p>
<p>Figure 1 (described in section 3) outlines their basic approach,<br />
which relies on having a human judge some subset of examples that the<br />
currently used classifier is least certain about.  This process is<br />
iterated until the human feels satisfied with the results.  One<br />
caveat of this approach is that the classifier must not only predict<br />
the class, it must give a measurement of certainty for that class.</p>
<p>Continuing on into section 4, we are introduced to how to build a<br />
classifier and use uncertainty sampling to train it.  Most of the<br />
section details the probability theory behind it, finishing up with<br />
how to do the sampling.  One thing I always wish for in these papers<br />
is concrete examples (maybe as an appendix or a reference) that work<br />
through the math on an actual toy problem.  Section 5 does just this,<br />
laying out an experiment and discussing the details, minus the math,<br />
which probably suits most people just fine.</p>
<p>Section 7 has an excellent discussion of the results, the pay dirt<br />
being that using this new method significantly reduces the number of<br />
examples required for training, at the cost of having a human in the<br />
loop.</p>
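<p>The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the classifier from the paper: the one-feature logistic model, the function names, and the <code>oracle</code> callback standing in for the human judge are all hypothetical.</p>

```python
import math


def train(labeled):
    """Fit a tiny one-feature logistic model by gradient ascent.

    Hypothetical stand-in for the paper's classifier; any model that
    outputs a probability for the class would do.
    """
    w, b = 0.0, 0.0
    for _ in range(500):
        for x, y in labeled:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w += 0.1 * (y - p) * x
            b += 0.1 * (y - p)
    return w, b


def predict_proba(model, x):
    """Estimated probability that x belongs to the positive class."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def most_uncertain(model, pool, k):
    """Return the k pool items whose probability is closest to 0.5."""
    return sorted(pool, key=lambda x: abs(predict_proba(model, x) - 0.5))[:k]


def uncertainty_sampling(seed, pool, oracle, rounds, k):
    """Repeatedly ask the oracle (the human) to label the least certain items."""
    labeled = list(seed)
    pool = list(pool)
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        for x in most_uncertain(model, pool, k):
            labeled.append((x, oracle(x)))  # the human judgment happens here
            pool.remove(x)
        model = train(labeled)  # retrain on the enlarged labeled set
    return model, labeled
```

<p>Note that the sketch only works because the model returns a probability, which is the caveat mentioned earlier: a classifier that predicts a bare class label gives the loop nothing to rank the unlabeled examples by.</p>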
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/17/potw-61107-discussion-of-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google&#8217;s initiatives in Artificial Intelligence</title>
		<link>http://www.paperoftheweek.com/2007/06/17/googles-initiatives-in-artificial-intelligence/</link>
		<comments>http://www.paperoftheweek.com/2007/06/17/googles-initiatives-in-artificial-intelligence/#comments</comments>
		<pubDate>Sun, 17 Jun 2007 11:42:03 +0000</pubDate>
		<dc:creator>Ian Parker</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Question Answering]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/17/googles-initiatives-in-artificial-intelligence/</guid>
		<description><![CDATA[Introduction Google&#8217;s earnings nearly doubled last year. http://news.com.com/Google+profit+nearly+doubles/2100-1030_3-6127658.html Unlike Microsoft, which gets its money from shifting boxes, Google relies on advertising to pay its way. There is a tremendous incentive to improve the quality of searching. The first reason is obvious. The better Google is perceived to perform as a search engine, the more people [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Introduction</strong><br />
Google&#8217;s earnings nearly doubled last year.<br />
<a href="http://news.com.com/Google+profit+nearly+doubles/2100-1030_3-6127658.html">http://news.com.com/Google+profit+nearly+doubles/2100-1030_3-6127658.html</a></p>
<p>Unlike Microsoft, which gets its money from shifting boxes, Google relies on advertising to pay its way. There is a tremendous incentive to improve the quality of searching. The first reason is obvious. The better Google is perceived to perform as a search engine, the more people will use Google for their searches and the greater the traffic for advertisers. The second reason is a little bit more sinister. Google gets paid according to the number of clicks made on an advertisement. As well as telling you the results of your search, Google also needs to put some ads your way. The share price of Google is closely linked to the perceived quality of search.</p>
<p><strong>The quest for AI</strong><br />
As one might expect, Google is deeply into AI. AI, one might argue, is essentially what the core business of Google depends on. Suppose we can take a web page, find out exactly what it is about, extract all the relevant facts and put them into a database, and then, on the prompting of a query from a user, marshal all the facts that are relevant to that enquiry. This is what an AI system looking at web pages would essentially do.</p>
<p><a href="http://news.com.com/2100-11395_3-6160372.html">http://news.com.com/2100-11395_3-6160372.html</a><br />
Google is talking about the size of the human genome and the size of AI. I think the arguments are a little bit misleading. I would prefer to look at what we would expect from AI. Suppose I were to show you a box and I told you that that box was &#8220;<em>intelligent</em>&#8221;. What would you expect? Well, Alan Turing devised what is now known as the Turing Test. He said that if the response of a computer to a conversation was indistinguishable from that of a human, it had passed the Turing Test.</p>
<p>On the subject of the Turing Test, Alan Turing envisaged a test which would distinguish between men and women and also would be psychic. Turing believed in ESP. Looking at Alice I am aghast: whenever I say something she always changes the subject. Hardly surprising in view of the Spanish! (<em>La estacion de resorte &#8211; El barco attravesta una cerradura</em>)</p>
<p>In other words, I would expect to be able to ask questions and get an intelligible response. I could engage in a conversation if I wanted greater depth. If the box claimed to speak Spanish, I would expect translations which showed an understanding of context. In fact, it could not produce an intelligible response without context. We would also want the answer to statistical questions, like how do people like BMW cars? What is the correlation between this and that? Can we deduce anything about cancer from the people who get it, their lifestyles, etc.?</p>
<p>We would also like to see some evidence of reasoning ability. Google is not committed specifically to reasoning. In a sense, reasoning comes after the ability to retrieve efficiently. This has been discussed by me and other people in &#8220;Creating Artificial Intelligence&#8221;<br />
<a href="http://groups.google.co.uk/group/creatingAI?hl=en">http://groups.google.co.uk/group/creatingAI?hl=en</a><br />
I have also written the following blogs.</p>
<p><a href="http://ipai.blogspot.com/">http://ipai.blogspot.com/</a><br />
<a href="http://ipai1.blogspot.com/">http://ipai1.blogspot.com/</a><br />
<a href="http://ipai2.blogspot.com/">http://ipai2.blogspot.com/</a></p>
<p>One thing to remember is that the ability to find facts is closely related to the ability to automatically construct wrappers. This is one of the main features of Web 3.0.</p>
<p>Let us now return to Google and what they are doing to produce a Web-based AI.</p>
<p><strong>Searching &#8211; The fundamentals</strong><br />
Search engines are basically databases. The information which is contained in the database has changed throughout the years. What the user needs to know about a Web page is :-<br />
1) What is it about?<br />
2) How is it rated, is it written by a crank or does it contain good and useful stuff?<br />
<a href="http://infolab.stanford.edu/~backrub/google.html">http://infolab.stanford.edu/~backrub/google.html</a><br />
This paper describes the main techniques used in search engines.<br />
Google became the primary search engine on the basis of what might be termed a citation index. Scientists have used this principle almost from the year dot. At the bottom of an academic paper are references and these references are &#8220;citations&#8221;. The &#8220;Science Citation Index&#8221; is an index of papers which cite a given paper. Now a paper which is frequently cited is generally regarded as being a good paper. Google does exactly the same thing with hyperlinks. Another signal is the number of times other people access a website.</p>
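<p>As a rough illustration of the citation-index idea, the sketch below simply counts how many distinct pages link to each page and ranks pages by that count. It is a deliberately naive assumption-laden toy: the function names and the <code>links</code> dictionary are hypothetical, and a raw count is not the weighted ranking a real search engine uses.</p>

```python
from collections import Counter


def inbound_link_counts(links):
    """Count, for every page, how many distinct other pages link to it.

    `links` maps each page to the list of pages it links to
    (a hypothetical toy representation of the link graph).
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each citing page once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


def rank_by_citations(links):
    """Order pages from most cited to least cited, ties broken by name."""
    counts = inbound_link_counts(links)
    pages = set(links) | set(counts)
    return sorted(pages, key=lambda p: (-counts[p], p))
```

<p>With this toy measure, a page linked from three different sites outranks a page linked from one, which is the same intuition as a frequently cited paper being regarded as a good paper.</p>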
<p><a href="http://209.85.163.132/papers/sawzall-sciprog.pdf">http://209.85.163.132/papers/sawzall-sciprog.pdf</a><br />
The Web is of course very large and Google has to find a way of dividing up the tasks. This paper is the key to the way in which Google does this. The database is far too large to place on a single machine, and is therefore stored on a number of servers. Sawzall is quite ingenious. A query is passed round from server to server, but while the query is in transit other queries are being worked on. Hence, although a query takes a few seconds to process on the network, the fact that other queries can be processed at the same time means that a high throughput is maintained. One quite important fact is that it is possible to discuss aggregations. That is to say, once websites are found with their keywords, a further search based on programs written in C++ can be performed.</p>
<p><a href="http://infolab.stanford.edu/pub/papers/google.pdf">http://infolab.stanford.edu/pub/papers/google.pdf</a><br />
This paper describes what Google was doing in 2000.</p>
<p>Google wants to know your surfing history. This will enable it to target both web pages and ads. Suppose I am a civil engineer and I enter &#8220;Bridge&#8221; as one of my search terms. A civil engineer is interested in &#8220;puente&#8221;, that is, the sort of bridge that crosses a river. If I am a card player I will be interested in the game of Bridge. A website containing &#8220;4 hearts&#8221; is about a card game.</p>
<p>Google also wants to target its advertisements. It wants to know what you think of a particular organization.<br />
<a href="http://ryanmcd.googlepages.com/sentimentACL07.pdf">http://ryanmcd.googlepages.com/sentimentACL07.pdf</a><br />
This paper does just that. It used a training set. There is of course one other thing. Advertisers like some sort of feedback on how they and their products are perceived. This paper attempts to achieve this and manages to achieve scores approximating to 80%.</p>
<p>It is not my aim to make moral judgements about Google. Google in fact, unlike Microsoft, has not broken the law. Indeed the Google code is mostly open source. How it is all put together is highly proprietary, but there are references to source code in all the papers. If you are bundling inaccessible code with an inferior operating system (Windows, as a sheer operating system, is inferior to Linux), a fine of x million Euros a day is appropriate. Google technology is immensely powerful and society will have to come to terms with it in some way.</p>
<p><strong>Google and Semantic Analysis</strong><br />
<a href="http://www2007.org/papers/paper342.pdf">http://www2007.org/papers/paper342.pdf</a><br />
This is a most remarkable paper. Let us dissect some of the terminology. It talks about &#8220;Vectors&#8221;. What are these &#8220;vectors&#8221;? They are all derived from Latent Semantic Analysis, or some other allied method. It talks about partially indexing the vectors (not storing the full vector). It takes queries and search results. It actually looks at what people have put into their query as keywords and the web pages they actually click on. An algorithm is developed for giving people exactly what they want. The paper makes great play of optimization for an inverted file search. Now an inverted file is a database file where the entries are indexed. Quite clearly if you are doing web based searches</p>
<p><a href="http://labs.google.com/papers/orkut-kdd2005.html">http://labs.google.com/papers/orkut-kdd2005.html</a></p>
<p>This paper is from 2007, so its results are not yet in &#8220;Google&#8221;. The methodology is amazingly powerful and could be applied in a variety of circumstances. Slightly chillingly, the &#8220;Orkut&#8221; set, which correlates friendship and personality and other similarities, is used. The paper can effectively find you matches and build you up a friendship network. Equally it can judge you by the friends that you have!</p>
<p>Potentially you could take El Cid and its English translation and match words up. Rather, you are not just matching words; you are matching vectors. An inverted file then gives the correct Spanish translation for an English vector and vice versa. This program will take any set of vector pairs and do a match.</p>
<p><strong>Translation</strong><br />
At present translation with Google Translate is rather poor.<br />
<em>El barco attravesta una cerradura </em>- The boat goes through a lock<br />
<em>La estacion de resorte </em>- The season of spring.</p>
<p><a href="http://www.stefanriezler.com/PAPERS/NAACL06.pdf">http://www.stefanriezler.com/PAPERS/NAACL06.pdf</a><br />
In 2006 Google recruited Stefan Riezler. This is interesting in that it indicates a direction in which Google is moving. Here is his CV<br />
<a href="http://www.stefanriezler.com/CV07.pdf">http://www.stefanriezler.com/CV07.pdf</a><br />
It is probably a pretty good summary of the way in which Google intend to go. One thing should be pointed out straight away, and that is that the Google NLP initiatives are based on strict parsing as their starting point. This contrasts with some versions of Latent Semantic Analysis where unparsed words are entered. Google looks at subjects, verbs, adjectives, adverbs, objects and possessives. Google is also interested in question and answer responses.</p>
<p><a href="http://www.cs.nyu.edu/~mohri/postscript/hbka.pdf">http://www.cs.nyu.edu/~mohri/postscript/hbka.pdf</a><br />
This paper is a review article about a very much related area of speech recognition. I think I should say straight away that the recognition of individual phonemes by computer is as good as, if not better than, that performed manually. The reason why human speech recognition is better than that of a machine is that humans recognize words in context. This in fact makes speech very similar to translation. I can illustrate this with words that have different meanings and spellings but the same phoneme structure: whether (<em>si</em>), weather (<em>tiempo</em>), hear (<em>oir</em>), here (<em>aqui</em>). One thing that is a little bit disappointing is that the speech and NLP groups in Google appear to be working independently.</p>
<p>Speech is in fact a far harder problem than translation or the discernment of meaning from text. This is because in translating from text you have fewer choices. The method used is Markov chains and the association of neighbouring words including grammar. Interestingly, in neither Riezler&#8217;s work nor this are words chosen on the basis of long range meaning. Let us say we have a medical paper and we could bias the search to medical terms. They do not seem to do this.</p>
<p>To produce the right words in speech you need an iterative annealing process. This means you may wish to change the phoneme, or word, assignment once other words have been found.</p>
<p><a href="http://www.stefanriezler.com/PAPERS/ACL07.pdf">http://www.stefanriezler.com/PAPERS/ACL07.pdf</a><br />
Suppose I am not looking for a website. Suppose I want to know a fact. &#8220;What is the velocity of light?&#8221;, &#8220;What is somebody&#8217;s address?&#8221;, &#8220;What is the turnover of company X?&#8221;. To answer a question the question needs to be parsed so that its meaning can be ascertained. We are here quite close to the Turing test.<br />
<a href="http://www.cs.cmu.edu/~acarlson/semisupervised/million-fact-aaai06.pdf">http://www.cs.cmu.edu/~acarlson/semisupervised/million-fact-aaai06.pdf</a><br />
<a href="http://www.cs.bell-labs.com/cm/cs/who/pfps/temp/web/www2007.org/papers/paper560.pdf">http://www.cs.bell-labs.com/cm/cs/who/pfps/temp/web/www2007.org/papers/paper560.pdf</a></p>
<p>This is the first stage of Google&#8217;s program. A database of, initially, a million facts will be gathered. These facts are going into a database which will be used to answer questions. This will of course be extended as time goes on.</p>
<p><strong>Head to Head with Microsoft</strong><br />
Google has a spreadsheet and a word processor. It also features desktop publishing.<br />
<a href="https://www.google.com/accounts/ServiceLogin?service=writely&amp;passive=true&amp;continue=http%3A%2F%2Fdocs.google.com%2F%3Fhl%3Den_GB&amp;hl=en_GB&amp;ltmpl=homepage&amp;nui=1&amp;utm_source=en_GB-more&amp;utm_medium=more&amp;utm_campaign=en_GB">https://www.google.com/accounts/ServiceLogin?service=writely&amp;passive=true&amp;continue=http%3A%2F%2Fdocs.google.com%2F%3Fhl%3Den_GB&amp;hl=en_GB&amp;ltmpl=homepage&amp;nui=1&amp;utm_source=en_GB-more&amp;utm_medium=more&amp;utm_campaign=en_GB</a><br />
There are advantages and disadvantages in using the Web for basic word processing and spreadsheets. The advantages are that the software is :-</p>
<p>1) Up to date.<br />
2) Will run on both Linux and Windows systems.<br />
3) Is free.<br />
4) There are facilities for work sharing.</p>
<p>http://labs.google.com/papers/gfs.html</p>
<p>5) Your work is backed up automatically.</p>
<p>The disadvantages are that you need to be connected to the Web to access your work. There are question marks over security, although to be fair Google is investing a considerable effort in this field. </p>
<p><a href="http://labs.google.com/papers.html">http://labs.google.com/papers.html</a><br />
This gives a list of Google papers. Note those on security. I have not mentioned them individually since my main thrust is AI.<br />
I feel that we should look at spellcheckers and how word processing and AI can be integrated. Often when we spell words wrong the spelling is valid but means something different. People will often misspell words that sound the same. This puts spellcheckers in the same position as translators, and on a Web spell-check the latest translator can be used. If I use a translator as a spellchecker I am one stage up on anything Microsoft has produced. If you are writing in Spanish, &#8220;<em>si</em>&#8221; and &#8220;<em>tiempo</em>&#8221; are never confused. In English a large number of people confuse &#8220;whether&#8221; and &#8220;weather&#8221;. Present day spellcheckers pass both.</p>
<p>There is one other point. If I want to write something learned, I want references. If I write on web software Google can suggest them to me. If I have Microsoft software on my own computer, it cannot do this.</p>
<p><strong>Conclusion</strong><br />
I started off this investigation rather skeptical. I came away from Google Translate distinctly unimpressed. &#8220;La estacion de resorte&#8221; &#8211; I did not know stations were elastic! Overall, though, I came away deeply impressed with the work which Google are doing and its scope. My criticism that the research on Natural Language should involve more interchange of information is perhaps rather carping, considering the difficulties involved in running a program on this scale.<br />
On the question of personal information I can see where Google is coming from. Let&#8217;s put it this way. If you meet a friend in the street and start a conversation, you will have remembered some of the &#8220;personal&#8221; information that they have told you. We can thus show that any machine capable of passing the Turing Test must store personal information; you need personal information stored if you are ever going to &#8220;talk to Google&#8221;. It is also vital for proper retrieval of information. The information you get must be relevant to you.<br />
<a href="http://michaelaltendorf.wordpress.com/2007/06/13/top-100-alternative-search-engines-from-readwrite-web/">http://michaelaltendorf.wordpress.com/2007/06/13/top-100-alternative-search-engines-from-readwrite-web/</a><br />
The whole point of search engine technology is to get relevant references and facts. This reference misses this point completely. If you need a 3D display your basic engine is lousy.</p>
<p>Google is now entering the world of facts rather than just websites. This could have some very interesting consequences in the future.<br />
There is one fact that society in the future will have to come to terms with. To become president of the United States you need television exposure. Television, telephones and the Internet are now becoming one. Who will choose the programs you watch? Why Google of course. This is a tremendous responsibility.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/17/googles-initiatives-in-artificial-intelligence/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>POTW 6/11/07: &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by Lewis and Gale</title>
		<link>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</link>
		<comments>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/#comments</comments>
		<pubDate>Mon, 11 Jun 2007 12:18:22 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[naive bayes]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/</guid>
		<description><![CDATA[More on text classification: &#8220;A Sequential Algorithm for Training Text Classifiers&#8221; by David Lewis and William Gale.  A little bit of an older paper, but still looks to be a good one.]]></description>
			<content:encoded><![CDATA[<p>More on text classification: &#8220;<a href="http://citeseer.ist.psu.edu/rd/52437760%2C100508%2C1%2C0.25%2CDownload/http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/508/http:zSzzSzwww.research.att.comzSz%7ElewiszSzpaperszSzlewis94c.pdf/lewis94sequential.pdf">A Sequential Algorithm for Training Text Classifiers</a>&#8221; by David Lewis and William Gale.  A little bit of an older paper, but still looks to be a good one.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/11/potw-61107-a-sequential-algorithm-for-training-text-classifiers-by-lewis-and-gale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 6/3/07: &#8220;A Comparison of Event Models for Naive Bayes Text Classification&#8221; by Andrew McCallum and Kamal Nigam</title>
		<link>http://www.paperoftheweek.com/2007/06/04/potw-6307-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/</link>
		<comments>http://www.paperoftheweek.com/2007/06/04/potw-6307-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/#comments</comments>
		<pubDate>Mon, 04 Jun 2007 12:42:56 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[naive bayes]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/06/04/potw-6307-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/</guid>
		<description><![CDATA[Paper of the week for the week of June 3, 2007 is &#8220;A Comparison of Event Models for Naive Bayes Text Classification&#8221; by Andrew McCallum and Kamal Nigam. This paper promises to shed some light on different ways of using Bayesian classifiers. It might be useful to do some background reading on naive Bayes starting [...]]]></description>
			<content:encoded><![CDATA[<p>Paper of the week for the week of June 3, 2007 is &#8220;<a href="http://citeseer.ist.psu.edu/rd/0%2C489994%2C1%2C0.25%2CDownload/http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/24415/http:zSzzSzlans.ece.utexas.eduzSzulgzSzpaperszSznigam-mccallum-bayes.pdf/mccallum98comparison.pdf">A Comparison of Event Models for Naive Bayes Text Classification</a>&#8221; by Andrew McCallum and Kamal Nigam.  This paper promises to shed some light on different ways of using Bayesian classifiers.  It might be useful to do some background reading on naive Bayes starting <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/06/04/potw-6307-a-comparison-of-event-models-for-naive-bayes-text-classification-by-andrew-mccallum-and-kamal-nigam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Guest Contributor wanted for next 3 weeks</title>
		<link>http://www.paperoftheweek.com/2007/04/15/guest-contributor-wanted-for-next-3-weeks/</link>
		<comments>http://www.paperoftheweek.com/2007/04/15/guest-contributor-wanted-for-next-3-weeks/#comments</comments>
		<pubDate>Sun, 15 Apr 2007 19:40:05 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[disambiguation]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Question Answering]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/04/15/guest-contributor-wanted-for-next-3-weeks/</guid>
		<description><![CDATA[If you have an interest in writing on artificial intelligence, clustering, information retrieval or computer science in general and are interested in reviewing one or more articles over the coming three weeks on this forum, please contact me by leaving a comment on this post.  All topics will be subject to my review for appropriateness, [...]]]></description>
			<content:encoded><![CDATA[<p>If you have an interest in writing on artificial intelligence, clustering, information retrieval or computer science in general and are interested in reviewing one or more articles over the coming three weeks on this forum, please contact me by leaving a comment on this post.  All topics will be subject to my review for appropriateness, but I am open to most any article or publication in the Computer Science field.</p>
<p>Otherwise, I will be taking a brief hiatus from reviewing papers until I return from <a href="http://www.eu.apachecon.com">ApacheCon Europe</a> in the early part of May where I am giving a talk and a training on Lucene.  I have several key deadlines over the next two weeks that must take higher priority, including the publication of a couple of articles that I have been working on.  I will post details on the publication on my <a href="http://lucene.grantingersoll.com/">Lucene blog</a> when they are available.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/04/15/guest-contributor-wanted-for-next-3-weeks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Discussion of Joachims (SVMs)</title>
		<link>http://www.paperoftheweek.com/2007/01/19/discussion-of-joachims-svms/</link>
		<comments>http://www.paperoftheweek.com/2007/01/19/discussion-of-joachims-svms/#comments</comments>
		<pubDate>Fri, 19 Jan 2007 13:47:06 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/19/discussion-of-joachims-svms/</guid>
		<description><![CDATA[This week, if you remember, we are discussing Text Categorization with Support Vector Machines: Learning with Many Relevant Features &#8211; Joachims (ResearchIndex), which is a paper on Text Categorization (one of the most cited such papers on Google Scholar under the Text Categorization search). Text Categorization is the problem of assigning one or more predefined [...]]]></description>
			<content:encoded><![CDATA[<p>This week, if you remember, we are discussing <a href="http://citeseer.ist.psu.edu/141654.html">Text Categorization with Support Vector Machines: Learning with Many Relevant Features &#8211; Joachims (ResearchIndex)</a>, which is a paper on Text Categorization (one of the most cited such papers on Google Scholar under the Text Categorization search).</p>
<p>Text Categorization is the problem of assigning one or more predefined categories to a piece of text.  The Support Vector Machines approach is a supervised machine learning approach that attempts to learn a linear classifier (it can also learn a polynomial or other non-linear classifier by plugging a different kernel function into the algorithm).  More precisely, it is trying to find a &#8220;linear separating hyperplane&#8221; (see the <a href="http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf">guide</a> for more info on the math).  For an implementation in Java (and other languages) check out <a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/">http://www.csie.ntu.edu.tw/~cjlin/libsvm/</a>.  This site has many useful resources explaining the algorithm, how to do feature selection, etc.  In two-dimensional space, think of finding a line (or curve) that separates your good examples from your bad ones.<br />
After introducing the topic of text categorization, the author spends some time on feature selection.  A feature vector in the paper is a vector of distinct words and how often each occurs in the document, minus stop words.  The author also throws out any words that occur fewer than 3 times.  I understand the stop word removal piece, but I don&#8217;t get the reasoning behind the 3-occurrence cutoff.  My guess is that such rare terms don&#8217;t significantly contribute to the solution.  Finally, features are scaled by their inverse document frequency.  This all makes sense to me coming from IR (Information Retrieval) land, where removing stopwords and doing length normalization are standard techniques for improving search results.<br />
The cool thing Joachims points out about SVMs is that they &#8220;can be independent of the dimensionality of the feature space&#8221;.  In other words, SVMs work pretty well whether there are a lot of features or very few.</p>
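<p>To make the feature vector idea concrete, here is a minimal Python sketch of the steps described above: drop stop words, drop words seen fewer than 3 times in the collection, weight by inverse document frequency, and normalize for length.  The stop list, the function names, and the exact IDF formula (log of collection size over document frequency) are my own assumptions for illustration, not code from the paper.</p>
<pre>
```python
import math
from collections import Counter

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def tokenize(text):
    # Lowercase, split on whitespace, and drop stop words.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocabulary(docs, min_count=3):
    # Keep only words that occur at least min_count times in the collection.
    totals = Counter(w for d in docs for w in tokenize(d))
    return sorted(w for w, c in totals.items() if c >= min_count)

def tfidf_vectors(docs, vocab):
    n = len(docs)
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: the number of documents that contain each word.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Sparse vector: only words that occur in this document are stored.
        vec = {w: tf[w] * math.log(n / df[w]) for w in vocab if tf[w]}
        # Length normalization: scale the vector to unit length.
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm else {})
    return vectors

docs = [
    "svm svm svm kernel",
    "text text text kernel",
    "kernel kernel margin",
]
vocab = build_vocabulary(docs)
vectors = tfidf_vectors(docs, vocab)
print(vocab)
print(vectors[0])
```
</pre>
<p>Note that a word appearing in every document gets an IDF of zero under this formula, so it ends up carrying no weight, which is the same intuition as removing stop words.</p>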
<p>In section 4, Joachims makes a very strong case for why SVMs are well suited for Text Categorization.  Summed up, his reasons are:</p>
<ol>
<li>Text can have many features (10k+ words in a collection)</li>
<li>Most features are important</li>
<li>Doc. vectors are sparse.  That is, most words in the collection do not occur in a particular document</li>
<li>Most Text Cat. problems are linearly separable.  This is the key idea behind why they work.</li>
</ol>
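<p>The sparseness and linear separability points can be sketched in a few lines of Python.  This is only the decision step of a linear classifier, with hand-picked weights that I made up for illustration; a real SVM (libSVM, for instance) would learn the weights and bias from labeled examples.</p>
<pre>
```python
def sparse_dot(weights, doc):
    # Only the words present in the document contribute, so the cost
    # depends on the document length, not on the vocabulary size.
    return sum(weights.get(word, 0.0) * value for word, value in doc.items())

def classify(weights, bias, doc):
    # A linear classifier assigns the category when the document falls on
    # the positive side of the hyperplane w . x + b = 0.
    return sparse_dot(weights, doc) + bias > 0

# Hypothetical weights for a "sports" category; not learned from data.
weights = {"goal": 1.2, "match": 0.8, "election": -1.5}
bias = -0.1

sports_doc = {"goal": 0.7, "match": 0.7}
politics_doc = {"election": 1.0}
print(classify(weights, bias, sports_doc))    # True
print(classify(weights, bias, politics_doc))  # False
```
</pre>
<p>Storing each document as a dictionary of its non-zero features is one simple way to take advantage of the sparseness mentioned in the list.</p>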
<p>The rest of the paper is a discussion of experiments and why SVMs are much better than the other popular approaches in use at the time.  Most notably, if you remember our discussion of <a href="http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/">Yang 97</a> from last week, SVMs beat up on kNN quite handily.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/computer+science" rel="tag">computer science</a>, <a href="http://technorati.com/tag/algorithms" rel="tag">algorithms</a>, <a href="http://technorati.com/tag/support+vector+machines" rel="tag">support vector machines</a>, <a href="http://technorati.com/tag/text+categorization" rel="tag">text categorization</a>, <a href="http://technorati.com/tag/machine+learning" rel="tag">machine learning</a>, <a href="http://technorati.com/tag/supervised" rel="tag">supervised</a>, <a href="http://technorati.com/tag/libSVM" rel="tag">libSVM</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/19/discussion-of-joachims-svms/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Text Categorization with Support Vector Machines: Learning with Many Relevant Features &#8211; Joachims (ResearchIndex)</title>
		<link>http://www.paperoftheweek.com/2007/01/15/text-categorization-with-support-vector-machines-learning-with-many-relevant-features-joachims-researchindex/</link>
		<comments>http://www.paperoftheweek.com/2007/01/15/text-categorization-with-support-vector-machines-learning-with-many-relevant-features-joachims-researchindex/#comments</comments>
		<pubDate>Mon, 15 Jan 2007 12:36:18 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/15/text-categorization-with-support-vector-machines-learning-with-many-relevant-features-joachims-researchindex/</guid>
		<description><![CDATA[Text Categorization with Support Vector Machines: Learning with Many Relevant Features &#8211; Joachims (ResearchIndex) Week 2, as promised, is another Text Categorization topic, and this one is pretty big, receiving over 1500 cites according to Google Scholar.  I know a little bit (little being the operative word) about SVMs (Support Vector Machines) so it will [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://citeseer.ist.psu.edu/141654.html">Text Categorization with Support Vector Machines: Learning with Many Relevant Features &#8211; Joachims (ResearchIndex)</a></p>
<p>Week 2, as promised, is another Text Categorization topic, and this one is pretty big, receiving over 1500 cites according to Google Scholar.  I know a little bit (little being the operative word) about SVMs (Support Vector Machines), so it will be interesting to see how my understanding of SVMs applies.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/15/text-categorization-with-support-vector-machines-learning-with-many-relevant-features-joachims-researchindex/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Sections 5-7 of Yang 97</title>
		<link>http://www.paperoftheweek.com/2007/01/12/discussion-of-sections-5-7-of-yang-97/</link>
		<comments>http://www.paperoftheweek.com/2007/01/12/discussion-of-sections-5-7-of-yang-97/#comments</comments>
		<pubDate>Sat, 13 Jan 2007 02:12:01 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/12/discussion-of-sections-5-7-of-yang-97/</guid>
		<description><![CDATA[Whew, I think we&#8217;ve made it through our first paper, or we are about to anyway.  If you recall, we are working our way through Yang 97 and had made it through the first 4 sections so far, which are covered here.  This leaves us with the meat of the paper, I guess, which is [...]]]></description>
			<content:encoded><![CDATA[<p>Whew, I think we&#8217;ve made it through our first paper, or we are about to anyway.  If you recall, we are working our way through<a href="http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/"> Yang 97</a> and had made it through the first 4 sections so far, which are covered <a href="http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/">here</a>.  This leaves us with the meat of the paper, I guess, which is the actual experiments.</p>
<p>I found sections 5 through 7 to be pretty straightforward.  The author runs a variety of classifiers on a number of different corpora.  kNN seems to hold up the best and seems to be the only approach that really scales.  Of course, keep in mind this was 1997, so this is probably no longer the case.  Perhaps we can see if there have been any updates to this paper in the future (or perhaps someone can leave a link doing just that).  Table 3 of the paper contains the guts of what you need to know about the experiments.</p>
<p>All in all, I think this paper is a nice soft introduction to the field of text categorization. The lessons I take from the paper are the following:</p>
<ol>
<li>Pay special attention to what is in the corpus that is being tested.  The inclusion/exclusion of unlabeled documents can make a big difference.  Try to use a standard corpus, if there is such a thing.</li>
<li>Know what measures are being used when reading the paper and why people use those measures.</li>
<li>Sometimes, simpler is better.  I know the fields need to advance, but sometimes a good solid algorithm like kNN is all you need.</li>
</ol>
<p>I think for next week, I&#8217;m going to continue on with the theme of text categorization and try to dig into a specific algorithm to see how it works.  After that, I have been wanting to look into some Graph Theory algorithms in use with IR and NLP.  As always, leave comments with suggestions on anything you think I just have to read lest I miss the boat completely.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/12/discussion-of-sections-5-7-of-yang-97/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Sections 1-4 of Yang 97</title>
		<link>http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/</link>
		<comments>http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/#comments</comments>
		<pubDate>Thu, 11 Jan 2007 02:31:24 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[Statistical Approach]]></category>
		<category><![CDATA[Text Categorization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/</guid>
		<description><![CDATA[So, hopefully everyone has read the paper (http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/) at least once. The first 4 sections are quite easy to get, in my opinion, as they define the problem of text categorization and lay the framework for the experiments. Digging into the details of the various implementations will be left as an exercise for the reader [...]]]></description>
			<content:encoded><![CDATA[<p>So, hopefully everyone has read the paper (<a href="http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/">http://www.paperoftheweek.com/2007/01/08/an-evaluation-of-statistical-approaches-to-text-categorization-yang-researchindex/</a>) at least once.  The first 4 sections are quite easy to get, in my opinion, as they define the problem of text categorization and lay the framework for the experiments.  Digging into the details of the various implementations will be left as an exercise for the reader at this point (damn, I&#8217;ve always wanted to say that ever since wading through math proofs back in college).  In a nutshell, the author is setting out to evaluate 14 different text categorization methods using a few different &#8220;standard&#8221; collections.  The real effort here is to try to compare apples to apples, since some of the prior research concerning these systems has used a variety of approaches to evaluation, preventing direct comparison.  In section 3.2 it is important to note the discussion of how the corpora are used.  I am sure, as we go forward on many of these topics that we will come across these corpora again and again, especially the Reuters collection (heck, we even use it in <a href="http://lucene.apache.org/java/docs">Lucene</a> for benchmarking.)  Section 4, on performance measures, discusses another piece of information that occurs in much of the literature, namely the concept of evaluation.  Recall and precision are very common measures and the paper does a good job of how we derive recall, precision, break even point and the F measure from the truth table generated by a binary classifier. The recall and precision methods are worth repeating here, I think, as we will see many variations of this when discussing IR, etc.:</p>
<p>recall = # categories found and correct / total # of correct categories = the number we got right divided by the total number of right answers in the system<br />
precision = # categories found and correct / total # of categories found = the number we got right divided by the number that we retrieved</p>
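<p>In terms of the truth table counts, the same definitions can be sketched in Python as follows.  The function and parameter names are mine, and I am assuming the usual balanced F measure (F1 = 2 * precision * recall / (precision + recall)); the counts in the example are made up.</p>
<pre>
```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    # true_pos:  categories we assigned that were correct
    # false_pos: categories we assigned that were wrong
    # false_neg: correct categories that we failed to assign
    found = true_pos + false_pos
    correct = true_pos + false_neg
    precision = true_pos / found if found else 0.0
    recall = true_pos / correct if correct else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 right, 2 wrong, 8 missed.
p, r, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=8)
print(p, r, round(f1, 3))
```
</pre>
<p>With these counts, precision is 0.8, recall is 0.5, and F1 lands between the two at about 0.615.</p>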
<p>For example, you could have perfect recall by assigning every possible category to every document, but your precision would not be very good.<br />
Finally, section 4 ends with a discussion on which average (micro or macro) to use.  The micro average was chosen because, according to the paper, it favors the more common categories.  This, I think, makes sense given that one is probably more interested in how a system does on the common categories, unless of course you are really interested in the rare categories. <img src='http://www.paperoftheweek.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  Perhaps someone with a better understanding can contribute some thoughts on this.</p>
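<p>As a rough illustration of why the micro average favors the common categories, here is a small Python sketch comparing the two averages for precision.  The category names and counts are invented for the example and are not taken from the paper.</p>
<pre>
```python
def precision(true_pos, false_pos):
    found = true_pos + false_pos
    return true_pos / found if found else 0.0

def macro_precision(counts):
    # Average the per-category scores: every category counts equally.
    return sum(precision(tp, fp) for tp, fp in counts.values()) / len(counts)

def micro_precision(counts):
    # Pool the decisions first: categories with more decisions dominate.
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    return precision(total_tp, total_fp)

# Hypothetical (true positive, false positive) counts per category:
# one common category handled well, one rare category handled badly.
counts = {"earnings": (900, 100), "cocoa": (1, 9)}
print(round(macro_precision(counts), 3))
print(round(micro_precision(counts), 3))
```
</pre>
<p>Here the macro average comes out at 0.5 because the rare category drags it down, while the micro average stays near 0.89 because the common category supplies almost all of the decisions.</p>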
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/10/discussion-of-sections-1-4-of-yang-97/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

