NLP-Supported Full-Text Retrieval

This is my master’s thesis. It evaluates the usefulness of morphologic analysis in information retrieval systems, in particular for the retrieval of German-language documents. To this end I developed an experimental retrieval system called IRF/1, which the thesis also describes. If you want to know more, read the abstract below; if you want to know everything, read the complete thesis (PDF).

Abstract

The amount of information available in electronic form is growing exponentially, making it increasingly difficult to find the desired information. This is especially true of the World Wide Web, which has no central administration and thus no ordering scheme to help users find the information they need. Furthermore, most of the information is narrative, i.e., in the form of unstructured documents written in natural languages, as opposed to structured information stored in databases.

Information retrieval is primarily concerned with the storage and retrieval of unstructured information. Thus, along with the growth of the World Wide Web, information retrieval systems are gaining importance, since they are often the only way to find the few documents actually relevant to a specific question in the vast quantities of text available. Internet search engines like AltaVista or Lycos are very popular and commercially successful.

Although information retrieval systems mainly deal with natural language, linguistic methods are rarely used. Most systems only use stemming, i.e., the mechanical cutting off of inflectional and derivational suffixes to better match index terms to query terms. Since most research on information retrieval is done for English, which has a relatively weak morphology, this is seldom regarded as problematic. Some researchers even consider stemming completely unnecessary. There is, however, considerable evidence that stemming and more linguistically motivated methods do have a positive impact on retrieval performance for languages such as Dutch, German, Italian, or Slovene, which are morphologically richer than English. Even so, morphologic phenomena like compounds and stem changes are not handled by conventional stemmers. As German, for example, makes extensive use of these morphologic processes (consider compounds like Bundesverfassungsgericht, and stem changes as in Häuser, the plural of Haus), applying full morphologic analysis to the information retrieval task intuitively seems promising.
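To make the limitation concrete, here is a minimal sketch of a purely mechanical suffix stripper. The suffix list and the minimum stem length are assumptions chosen for illustration; this is not the stemmer used in the thesis, and real German stemmers apply richer rules.

```python
# Hypothetical suffix list for illustration only.
SUFFIXES = ("ern", "en", "er", "es", "e", "n", "s")


def naive_stem(word: str, min_stem_len: int = 4) -> str:
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    lower = word.lower()
    # Try longer suffixes first so "ern" wins over "n".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem_len:
            return lower[: -len(suffix)]
    return lower


for word in ("Haus", "Hauses", "Häuser", "Bundesverfassungsgericht"):
    print(word, "->", naive_stem(word))
```

Under these assumptions, Haus and Hauses both reduce to "haus", but Häuser reduces to "häus" because the umlaut is part of the stem, and the compound is left as a single term, so a query for Gericht would not match it.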

This thesis sets out to determine the usefulness of morphologic analysis in information retrieval systems, particularly for the retrieval of German-language documents. An experimental retrieval system called IRF/1, described in this thesis, was developed as a test bed. IRF/1 is used to compare the retrieval effectiveness of different text processing methods for a test collection of about 300 magazine articles. The evaluated methods are:

  1. stemming (as a baseline),
  2. base form reduction using morphologic analysis,
  3. same as (2) but compounds are split into the base forms of their constituents, and
  4. same as (3) but the base forms of compounds are kept along with their parts.
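The difference between the four methods can be sketched as the index terms each one would produce for a single word. The toy lexicon, the stand-in stemmer, and the function names below are hypothetical and only serve the illustration; they do not reproduce IRF/1 or the morphologic analyzer it uses.

```python
# Hypothetical toy lexicon: surface form -> (base form, base forms of constituents).
LEXICON = {
    "häuser": ("haus", ["haus"]),
    "bundesverfassungsgerichts": (
        "bundesverfassungsgericht",
        ["bund", "verfassung", "gericht"],
    ),
}


def stem(word: str) -> str:
    """Stand-in stemmer: strip one short suffix (assumed rules, not a real stemmer)."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def index_terms(word: str, method: int) -> list[str]:
    """Return the index terms for one word under methods 1 to 4 from the list above."""
    lower = word.lower()
    if method == 1:  # stemming (baseline)
        return [stem(lower)]
    # Unknown words fall back to themselves.
    base, parts = LEXICON.get(lower, (lower, [lower]))
    if method == 2:  # base form reduction
        return [base]
    if method == 3:  # compounds split into constituent base forms
        return list(parts)
    if method == 4:  # compound base form kept along with its parts
        return [base] + [p for p in parts if p != base]
    raise ValueError(f"unknown method: {method}")


for m in (1, 2, 3, 4):
    print(m, index_terms("Bundesverfassungsgerichts", m))
```

With this lexicon, only methods 3 and 4 produce the term "gericht" for the compound, which is what lets a query for Gericht reach a document that mentions only Bundesverfassungsgericht.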

Using the standard information retrieval measures of recall and precision, the comparison finds morphologic analysis to be generally more effective than stemming. While morphologic base form reduction provides only a relatively small improvement over stemming, decomposition of compounds results in a decisive increase in retrieval effectiveness for German.
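For readers unfamiliar with the two measures: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The sketch below computes both for a single query using the standard set-based definitions; the document identifiers are made up, and the thesis may report the measures in an averaged or interpolated form that is not shown here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four documents retrieved, three relevant, two in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example, two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents are found (recall about 0.67).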

It can be concluded that morphologic analysis with decomposition of compounds is a very promising approach to improving information retrieval for German and should be further investigated.