http://www.cs.nyu.edu/~mohri/postscript/hbka.pdf

This paper deals with speech recognition. We will see that speech recognition is a very deep process in natural language, and that it is closely linked to translation.

Let me begin by asking some fairly simple questions.

1) Why is language important for AI?

Suppose I am confronted with “Alice” and I want to discuss, say, my boating holiday. I go through a series of locks at Merry Hill. It takes me an hour, and I moor my boat by the pub and have lunch there.

For Alice to comment, she needs an accurate internal representation of what I am saying. Alice is going to use this internal representation to find the most appropriate response. How do I know whether the internal representation is correct? Well, at one level you can say that if Alice gives an appropriate, intelligent response she has passed the Turing Test. Alternatively I can ask for a translation into a second language, Spanish say. Good Spanish will indicate an accurate internal representation, and conversely. In fact Google Translate provides me with “El barco attravesta una cerradura”, where cerradura is the lock on a door rather than a lock on a canal.

2) What is the connection between speech and language understanding?

The recognition of speech comes in two parts. First there is the recognition of individual phonemes and the building up of phonemes into words. Second, and extremely important, is the placing of words in context. If I were to say “El si esta lluva y viento” you see immediately that “si” is out of context. A correct word recognizer should be able to tell the difference between “whether” and “weather” even though the two words sound the same. It does this by means of context, exactly as we do when translating.
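To make the whether/weather point concrete, here is a minimal sketch of choosing between homophones by looking at the neighbouring words. The training sentences and the function names are my own assumptions for illustration, not anything taken from the paper, and a real recognizer would use far more text.

```python
from collections import Counter

# Toy training text (an assumption for illustration, not data from the paper).
CORPUS = [
    "the weather is wet and windy",
    "the weather was fine on the canal",
    "i do not know whether it will rain",
    "ask whether the pub is open",
]

def adjacent_pairs(sentences):
    """Count every pair of neighbouring words in the training sentences."""
    pairs = Counter()
    for sentence in sentences:
        words = sentence.split()
        pairs.update(zip(words, words[1:]))
    return pairs

def pick_homophone(candidates, previous_word, next_word, pairs):
    """Return the candidate seen most often next to the two neighbouring words."""
    def score(word):
        # A Counter returns 0 for pairs that never occurred in training.
        return pairs[(previous_word, word)] + pairs[(word, next_word)]
    return max(candidates, key=score)

PAIRS = adjacent_pairs(CORPUS)
```

Given the words on either side, pick_homophone(["whether", "weather"], "the", "is", PAIRS) selects the spelling that the training text has seen in that position. The sound alone cannot decide this; only the context can.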

This paper looks at Markov chains. In this model we have sequences of words, and each sequence is assigned a probability. Where we have a choice of words we fit in the most probable. The probabilities are determined using a training set. Of course Google, being a Web search engine, has no difficulty in obtaining a training set as large as required. A Markov chain essentially runs forwards. You have a “state” and this “state” is constantly modified by subsequent words. This contrasts with lattice techniques, where a number of surrounding words are taken.
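The idea of a forward-running “state” with probabilities learned from a training set can be sketched as a first-order word chain. This is only an illustration under my own assumptions: the three training sentences are invented, the probabilities are plain relative frequencies with no smoothing, and none of the names come from the paper.

```python
from collections import Counter, defaultdict

# Invented training set; a real system would use a very large corpus.
TRAINING = [
    "the boat goes through the lock",
    "the boat is by the pub",
    "the pub is by the lock",
]

START = "<s>"  # hypothetical marker for the state before the first word

def train_bigram_model(sentences):
    """Count, for each state (the previous word), which words followed it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        state = START
        for word in sentence.split():
            counts[state][word] += 1
            state = word  # the state is modified by each subsequent word
    return counts

def word_probability(model, state, word):
    """Relative frequency of `word` following `state` in the training set."""
    total = sum(model[state].values())
    return model[state][word] / total if total else 0.0

def sequence_probability(model, sentence):
    """Multiply the word probabilities while running forwards through the sentence."""
    probability = 1.0
    state = START
    for word in sentence.split():
        probability *= word_probability(model, state, word)
        state = word
    return probability

MODEL = train_bigram_model(TRAINING)
```

With this toy model, a word order that occurred in training gets a non-zero probability, while an order that never occurred, such as “the lock boat”, gets zero. Practical systems smooth the counts so that unseen sequences are merely unlikely rather than impossible.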

The paper talks about bigrams and n-grams. A bigram is a pair of adjacent words and an n-gram is a run of n words; in essence the neighbouring words pick out the different meanings of a word. The paper discusses in detail the lexical structure of bigrams and trigrams, but it does not tell us how to construct them. Clearly cerradura (a door lock) is going to behave differently from esclusa (a canal lock). In a Markov chain we move along, branching when required. As soon as a branch reaches a termination and/or a low probability we stop.
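The branching and stopping described above can be sketched as a small search. Every probability in the table below is invented for illustration, and the threshold is an arbitrary assumption; the paper's own decoder is a good deal more elaborate than this.

```python
# Hypothetical bigram probabilities, invented for illustration only.
BIGRAM = {
    ("<s>", "the"): 1.0,
    ("the", "weather"): 0.6,
    ("the", "whether"): 0.01,
    ("weather", "is"): 0.5,
    ("whether", "is"): 0.05,
    ("is", "wet"): 0.3,
    ("is", "whet"): 0.001,
}

def decode(candidate_slots, bigram, threshold=1e-4):
    """Extend every surviving branch one word at a time, dropping unlikely ones.

    candidate_slots holds, for each position, the words that sound plausible.
    """
    branches = [(("<s>",), 1.0)]
    for candidates in candidate_slots:
        extended = []
        for words, probability in branches:
            for word in candidates:
                p = probability * bigram.get((words[-1], word), 0.0)
                if p >= threshold:  # stop following low-probability branches
                    extended.append((words + (word,), p))
        branches = extended
    if not branches:
        return None  # every branch was abandoned
    best_words, _ = max(branches, key=lambda branch: branch[1])
    return list(best_words[1:])  # drop the start marker
```

Calling decode([["the"], ["weather", "whether"], ["is"], ["wet", "whet"]], BIGRAM) follows both spellings for a while and keeps the most probable complete branch. Raising the threshold prunes earlier, which is faster but risks discarding the right answer.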

Perplexity is defined as the effective number of choices at each word. It is a geometric mean, because the perplexity metric is associated with entropy (defined to be log(number of states)). With the vocabularies normally used we get a perplexity of about 150. Exact values are in the paper.
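The link between perplexity, the geometric mean and entropy can be checked with a few lines of code. This is a sketch of the standard definition, not code from the paper: the function takes the probability the model assigned to each word of a test text and returns the geometric mean of the reciprocals.

```python
import math

def perplexity(word_probabilities):
    """Geometric mean of 1/p over the probabilities the model gave each word.

    Equivalent to exp(entropy), with entropy measured as the average of -ln(p).
    """
    n = len(word_probabilities)
    average_log = sum(math.log(p) for p in word_probabilities) / n
    return math.exp(-average_log)
```

If a model treats 150 words as equally likely at every step, each word has probability 1/150 and the perplexity comes out at 150, which is why it can be read as a number of choices. For unequal probabilities such as 0.5 and 0.125 the result is 4, the geometric mean of 2 and 8.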

What are the results? On large vocabularies there is a 12% error rate. This, although disappointing, is nonetheless better than most speech recognition systems and as good as any. How could it be improved?

Well, similar hidden Markov models are now available as source code.
http://htk.eng.cam.ac.uk/
http://www.colloquial.com/carp/Publications/collinsACL04.pdf
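
For anyone who wants to measure an error rate on their own recordings, here is a minimal sketch. I am assuming the 12% figure is a word error rate, the usual metric for recognizers; the exact definition used is in the paper. The function counts the substitutions, deletions and insertions needed to turn the recognized text into the reference text, divided by the number of reference words.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous_row[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref)
```

Mistaking “whether” for “weather” in a six-word sentence counts as one substitution, giving an error rate of one in six.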

There is quite an important point here, and this concerns grammatical parsing. Markov chain methods do not explicitly parse. The lattice methods are basically about building a hypothesis which is then refined. I commented “!El barco esta caliente!” to “El barco attravesta una cerradura”. The methodology does indeed have an analogy to the annealing process. To start a lattice process you need an initial approximation. A 12% error rate is good enough to be an initial approximation.

Markov chains and bigrams are well-established techniques. Sphinx, developed by Carnegie Mellon University, uses Markov chain and bigram techniques, but with one important difference: it has grammar. As explained above, the effect of grammar is not so much to produce a better result as to provide a framework by means of which other languages can be added. Sphinx allows you to speak in one language and get text in another. It is no better in terms of basic accuracy than other Markov chain methods.

Conclusion

This, I think, demonstrates the latest developments in speech recognition research. The irony is that speech, like translation, depends on the recognition of words in context. Markov chains and lattices may be used to translate text from one language to another. Speech is in fact a far harder problem than translation, but much more effort is being put into it. Google Translate does not (as yet) differentiate between different contexts, although the speech research does. It is clear why. People are on the move with their mobile phones, texting with a small keypad is an extremely time-consuming process, and most people can speak faster than they can type. However, it would still be ironic if speech recognition came in before good translation.

In fact, if one could get a computer to recognize speech and take down speeches, a multilingual “Hansard” for the European Parliament would be a trivial extension.