Better late than never is what they say… I’ve been absolutely slammed this week trying to get out a proposal and finish up my slides for ApacheCon Europe. So, here goes on the POTW for this week. If you recall, we are reading “An Analysis of the AskMSR Question-Answering System” by Brill, Dumais and Banko.

Microsoft’s AskMSR system, as detailed in this paper, takes a different tack on the QA problem. Unlike most systems, which use sophisticated NLP techniques, AskMSR bets that somewhere on the web the exact answer is written in just the right way to answer the question. The authors tout their approach as much simpler and much more efficient. They also suggest it is complementary to the other approaches, even though in this paper they are trying to see just how far they can get with it alone.

AskMSR relies on the Internet as a massively redundant data source. Section 2 of the paper lays out the System Architecture, which can be broken down into four stages:

  1. Query Reformulation
    1. Takes the input query and rewrites it into a number of different possible answer phrasings. As a last resort, they also produce some less precise versions made up of the non-stopwords. No POS tagger or parser is used.
  2. n-gram mining
    1. Given the rewrites, they submit them to a search engine, get the page summaries and use n-grams to calculate candidate answers and score them.
  3. Filtering
    1. Given the n-grams, AskMSR then filters and reweights the answers based on answer type, as specified by a few in-house, handwritten filters. This is most likely one area where more advanced NLP techniques could be used to determine the answer type.
  4. n-gram tiling
    1. The tiling algorithm merges similar answers and creates longer answers from overlapping answers.
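
To make the mining and tiling steps a bit more concrete, here is a minimal Python sketch. It is my own simplification, not the paper's code: the snippets are made up, each n-gram is scored by the summed weight of the rewrites whose snippets contain it, and tiling greedily merges two candidates when the end of one overlaps the start of the other. The function names (`mine_ngrams`, `tile_pair`, `tile`), the weights, and the choice to sum scores on a merge are all assumptions for illustration.

```python
from collections import Counter


def mine_ngrams(snippets, max_n=3):
    """Score every 1..max_n-gram by the summed weight of the snippets containing it.

    `snippets` is a list of (text, rewrite_weight) pairs (hypothetical input shape).
    """
    scores = Counter()
    for text, weight in snippets:
        tokens = text.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        # Count each distinct n-gram once per snippet.
        for gram in seen:
            scores[gram] += weight
    return scores


def tile_pair(a, b):
    """Merge a and b if a suffix of a equals a prefix of b; otherwise return None."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None


def tile(candidates):
    """Greedily merge overlapping candidates until nothing more can be tiled.

    `candidates` maps token tuples to scores. A merged answer gets the sum of
    both scores (an assumption, not something taken from the paper).
    """
    items = sorted(candidates.items(), key=lambda kv: -kv[1])
    merged = True
    while merged:
        merged = False
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (a, score_a), (b, score_b) = items[i], items[j]
                combined = tile_pair(a, b) or tile_pair(b, a)
                if combined:
                    items[i] = (combined, score_a + score_b)
                    del items[j]
                    merged = True
                    break
            if merged:
                break
        items.sort(key=lambda kv: -kv[1])
    return items


# Made-up snippets: the first came from a precise rewrite (weight 5),
# the second from a loose bag-of-words query (weight 1).
snippets = [
    ("John Wilkes Booth killed Abraham Lincoln", 5),
    ("Lincoln was shot by John Wilkes Booth", 1),
]
scores = mine_ngrams(snippets)
booth_score = scores[("john", "wilkes", "booth")]  # 6: appears in both snippets

tiled = tile({("john", "wilkes"): 20, ("wilkes", "booth"): 15, ("lincoln",): 10})
# tiled[0] is (("john", "wilkes", "booth"), 35)
```

Note that this skips the filtering stage entirely, and running `tile` over every mined n-gram would chain together a lot of junk, which is presumably part of why filtering comes first.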

Section 3 then details the experiments that were run, how the system performed and how the different components contributed. Ironically, they use Google for their search engine, but my guess is their current system does not. Section 3.2 lays out how important each of the individual components is to the process. While query rewrite weighting is pretty important (an 11.2% drop in performance when all rewrites are weighted equally), n-gram filtering is the single most important part, contributing nearly 18% over the baseline. Tiling rounds things off by contributing approximately 14% to the process.

Section 4 has a discussion of how the various components contribute to errors. Interestingly, many of the errors are due to their projection of answers onto the TREC-9 corpus and wouldn’t really be an issue in a live system. Another interesting error source is the inability to query the search engine for numbers without including the number in the query:

“For example, a good rewrite for the query How many islands does Fiji have would be <<Fiji has <NUM> islands >> but we are unable to give this type of query to the search engine”

It seems to me, though, that this could be addressed by some type of bounded approach. For instance, why not just iterate through a range of numbers, within reason? I wonder, too, how good their potential answers would be if they just put in any number. I tried the Google query “Fiji number islands”: the top three results all contain the answer, and the third one, I think, would be handled by the n-gram approach.
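
Here is roughly what I mean by a bounded approach, as a small Python sketch. This is my idea, not anything from the paper: `expand_num_rewrite` is a hypothetical helper that fills the `<NUM>` slot with each candidate number and wraps the result in quotes as a phrase query. The candidate numbers are arbitrary placeholders.

```python
def expand_num_rewrite(template, candidates):
    """Fill the <NUM> slot with each candidate, returning quoted phrase queries."""
    if "<NUM>" not in template:
        raise ValueError("template has no <NUM> placeholder")
    return ['"' + template.replace("<NUM>", str(n)) + '"' for n in candidates]


# Arbitrary candidate numbers, purely for illustration.
queries = expand_num_rewrite("Fiji has <NUM> islands", [100, 200])
# queries == ['"Fiji has 100 islands"', '"Fiji has 200 islands"']
```

The obvious catch is cost: every candidate number is another trip to the search engine, so the range has to stay small for this to be practical.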

Section 5 is a good discussion on knowing when the AskMSR system does not know. They use a decision tree built on a number of factors, like query length, number of stopwords, etc. The thinking is: why bother trying to answer questions that you know have a very low likelihood of being answered correctly in the first place? For example, they know that they don’t do very well on “how” questions, so they could simply say they don’t know the answer or just return keyword search results for the question (in a live system). Of course, it all depends on how tolerant your users are of incorrect answers.
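
As a rough illustration of the idea, here is a hand-written stand-in for that decision tree in Python. To be clear, the paper learns its tree from data; the stopword list, the length threshold, and the rules below are invented by me just to show the shape of the thing.

```python
# A tiny, made-up stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "was", "does", "do", "did", "to", "for"}


def question_features(question):
    """Extract the kinds of features mentioned above: length, stopword count, question word."""
    tokens = question.lower().rstrip("?").split()
    return {
        "length": len(tokens),
        "stopwords": sum(t in STOPWORDS for t in tokens),
        "first_word": tokens[0] if tokens else "",
    }


def should_attempt(question, max_length=12):
    """Hand-coded rules standing in for a learned tree; the thresholds are invented."""
    features = question_features(question)
    if features["first_word"] == "how":  # "how" questions reportedly do poorly
        return False
    if features["length"] > max_length:
        return False
    return True


attempt_who = should_attempt("Who killed Abraham Lincoln?")  # True
attempt_how = should_attempt("How do birds fly?")            # False
```

In a live system, a `False` here is where you would fall back to plain keyword search results instead of risking a wrong answer.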

Well, that’s it for this week.  Haven’t decided on next week yet.  May do another QA paper, may look into some clustering algorithms.  Anyone have anything in particular they want to look at?
