Latent Semantic Analysis

http://lsi.argreenhouse.com/lsi/papers/JASIS90.pdf

You are all no doubt familiar with Google and the way in which web pages are found using keywords. Unfortunately this method can break down. If for example you want to find references to the game of “bridge” you are (in English) swamped with references to civil engineering. I even suggested to the President of the Birminham Bridge Club that she translate the webpage into Spanish or French (puente, pont do not refer to a card game) so that this did not occur! The main search engines are all aware of his. The above paper gives the Microsoft solution embedded in live.com.
What is done is to take a number of words, find the correlation matrix and diagonalize it. This is all done using standard functions which are available in any reasonable statistics package. The approach in the paper is to take a web page and mark a word “1″ if it occurs in the page “0″ otherwise. The words used are those used in search terms. Prepositions and conjunctions are excluded.
The first 50-100 eigenvectors are taken. Eigenvectors are defined as being orthogonal (Σvijvjk=δik). Also VAV-1 is diagonal and the diagonal terms are the eigenvalues. These vectors are then used to index web pages. It has been claimed that all that this is doing is combining keywords. Not so! The vectors represent latent values. To show you what a latent value is we can use a second language. I have decided on Spanish for various reasons.
Suppose I want to go on a boating holiday and I wnt to know about the locks on the Grand Union Canal. Lock has two meanings. If I go to Google translate I get “El barco attravesta una cerradura” Now the difference between cerradura and éclusia is a latent value We can define éclusia as being part of a vector. If we talk about canals and boats we will get this vector and Microsoft’s “link.com” will fish it out for us. Articles about the game of Bridge are fished out in a similar way. The paper discusses competing techniques like clustering and hierarchical analysis. The surprising, or perhaps not so surprising, fact is that Latent Semantic Analysis beats all these other techniques. In case you think I am blowing Microsoft’s trumpet let me say that:-
1) The paper is written by Microsoft employees, so I am giving the Microsoft picture.
2) Google, Yahoo and all the major players are doing work in AI. Google, among other things, is working on a user profile so that every time I, Ian Parker accesses Google my profile is added to whatever I type in. This can be used for things other than aiding searches. Suppose I were to search out militant Islamic sites and bomb making sites. Should Google alert the authorities? Ethical dilemma.
Microsoft uses a Boolean view of websites. Other people using LSA are taking the number of times a word occurs and the paper acknowledges this.
This deals with finding websites. What about finding facts? Suppose I want to know the speed of light, the rate of inflation in the UK, the main address of a company, turnover, share capital of same company, price and availability of flights etc. I want something which will search out the websites for particular facts. Web 3.0 proposes to organize this information by using wrappers. Wrappers will take me to other websites as well as “wrapping” particular facts. Now Web 3.0 proposes that the user constructs wrappers himself/herself. This to me is a recipe for chaos.
To wrap automatically, or find any particular fact, we need to have local latent values. Finding these values is very akin to translation.
Translation
http://www.cs.princeton.edu/~tledbett/IndependentWork/StatisticalMachineTranslationReview.pdf is a article on the statistical methods of language translation. The basic method is to take a text which has been translated by a human and then construct statistics. The main technique is to match words in a sentence in language A and language B. This is done by finding maximum probability matching. The probability matching is done by looking at maximum probability. Word matching is not done by parsing. That is rather surprising. If I say “yu tomaré el testamento” there are two “wills” in English, only “testamento” matches, the other “will” does not match anything. Parsing, of course, sorts out auxiliary verbs. Basically the best consist of taking all the words in a sentence and weighting them.
as I have stated there is no parsing and no long range LSA. Canals and boats anywhere within a website or paragraph will enhance the probability of “éclusia”. statistical language translation methods involve a human translation. We can move onto methods that might enable languages to be translated using a dictionary and a corpus in the target language.
Compression as AI
Before diving into Matthew Mahoney’s paper I think that some comment should be made on the relationship between compression and AI. If you ask people what compression has to do with AI they will say “nothing”.
The paper takes a bit stream and seeks to predict the next bit given the previous bits. Now if we have N bits in total and we predict n correctly the number of combinations we can have is :-
N!  
n!(N-n)! and the number of bits we need in the compressed form is of course Log2(Expression) our maximal compressed form is (in effect) a single integer.
This expression is derived by taking combinations. Basically what we are doing to derive this expression, and the more general expression quoted later on is to take the number of possible arrangements of N and divide it by the number of ways of arranging n and the number of ways of arranging N-n
It is clear that the more we get right the smaller the number of combinations and hence the compressed size becomes smaller as our techniques of prediction improve. If you were to ask people whether or not intelligence was to do with the ability to make predictions, they would probably say that it was. It is all in the way the question is asked! Compression = Prediction in quite a rigorous way.
Translation from one Natural Language to another : -This represents an important AI task. It too is dependent on compression. Let us take three Spanish words :- Primavera, Resorte, Mamanthal They all translate into English as “Spring” but different springs. Getting a good translation is a matter of finding the right word. To translate we have to make a guess, in other words compress.
There are also intelligence tests. “[resorte, primavera, mamanthal] son flores” is a simply example of a typical verbal reasoning test. Psychologists are in fact (possibly unwittingly) using compression as their metric, just like Matthew Mahoney.
When is compression interesting?
If you take words (punctuation marks counting as words) you find by the basic statistical formulae that the number of arrangements is
N!
Пni! where ni is the number of times the ith word occurs in the text. N is the total number or words. Log2 of that expression is the minimum compressed length in bits. This is before we have started guessing words. The point (I have found it to be 24% on the Wikipedpia corpus) is an interesting one. Beyond that point we are talking about intelligent anticipation. “Son flores en primavera”.
http://www.cs.fit.edu/~mmahoney/compression/cs200516.pdf is the address of the paper.
One of the things which I particularly like about the concept of Context mixing is that you have a number of models. The weight given to these different models is constantly updated. If you have a bee in your bonnet about a particular method being the best, you can include it and see how it performs.
There are few really objective criteria of how well we are performing AI. The Turing Test is highly subjective. Slightly more objective is the Bleu criterion for language translation. This measures how a computer translation differs from a translation provided by a human. The difficulties are twofold. First of all there needs to be a human based translation. Second bleu will punish minor errors of style as severely as the gross errors quoted above. This leaves loss free compression as being the only effective objective criterion. The imperfect nature of this has been highlighted in AI user groups, but there is really nothing better. This should be borne in mind when we consider figures of merit for particular techniques. At the moment the context mixing algorithms which are used are very much “local” – no more than about two words.
How good is the program as it stands? 16.3MB or 16.3% on the Wikipedia corpus, 18.18% on the Calgary corpus. This is way under the figure for raw words and represents a considerable word guessing capability. There is about 5.92 characters per word, so this represents something of the order of 1 byte per word (just over for Calgary, just under for Wiki). This is about 4 bits better than raw word compression.
The reader may find this a little hard to visualize. It means that it is capable of picking one word out of a list of 16. If we were translating and the translation (in our target language) was being processed a 16 fold ability to make a choice would ensure, perhaps not a translation of human quality but something quite acceptable.
Another feature which is interesting about the paper is the generality of the approach. Speech and images are discussed. There is a problem. Text is compressed loss free. Speech and images need preliminary noise reduction, this is departing from strict loss free compression. MPEG and JPEG compress but they are not loss free. The degree of compression of the PEGs is not interesting from the AI stand point. In text there is a well defined threshold of interest. It might be slightly begging the question to take M/J PEG and try to compress it loss free. The initial compression is lossy, but it will be loss free from now on A similar situation pertains to speech. The best approach to date is http://www.bbn.com/Solutions_and_Technologies/Speech_Recognition/index.html you can see when you look at it that it is a kind of iterative translation (monolingual text of course). To distinguish between weather and whether is identical to what has been discussed above. El [si, tiempo] esta lluva y viento. The BBN approach assigns probabilities. In fact BBN seems to one of the few companies or organizations directly associating language and speech. Matthew is correct in stating that error free speech recognition is dependent on compression and language string, but I wonder whether the algorithms he implicitly proposes are the right ones.
There is one thing which I am not clear about
WRT replaces English words with a 1 to 3 byte code from a dictionary. There are additional transforms to model punctuation, capitalization, and end of lines.
Are the 1-3 byte codes the codes for the root or is each tense of a verb, singular or plural, separate words? The paper discusses bigrams for words. In the absence of parsing there will be a significant difference between English and Spanish (Spanish performing better). I will take -> yu tomaré In a non parsed bigram we have will/take which does not really help in terms of prediction. In the case of long range association (indexing web sites) parsing is unnecessary
Where do we go from here?
The paper is suggesting that research should focus on speeding up the algorithms as they are some 1000 times slower than the LZ algorithms used in such things as WinZip. I wonder. The purpose of high compression is not archiving, it is AI, and for this reason alongside the search for speed it might be profitable to look for higher compression. Foremost in this must be parsing and the association of different parts of speech. If I want to make an association between watch and spring (reloj, resorte) intelligence tests also ask for associations like reloj, agua flores with primavera, resorte, mamanthal We seem certain to get it right if we construct bigrams with parts of speech.
Conclusion
I have tried to convey to you the research that is being done in the area of text and language processing. I have tried to convey the fact that all these areas are tied together. In many respects it seems a pity that researchers cannot get together more. One fact that emerges is that a second language does manifest some of the hidden variables in your first language. Indeed the process of generating hidden variables, so necessary for translation, is important in automatically generating wrappers and finding facts like the speed of light.