Entity Recognition and Classification

This is the suggested topic for the course Language Technology Project 2006.


A key resource for question answering systems is an entity recognition module. Usually such a module identifies named entities in text and classifies them as belonging to broad entity classes like person, location and organization. However, a more fine-grained classification is nearly always required, as the following example questions from the TREC-QA 2005 competition show:

  1. The submarine Kursk was part of which Russian fleet?
  2. What country did the winner of the Miss Universe 2000 competition represent?
  3. What was the nickname for the 1998 French soccer team?

A naive question answering system would search appropriate texts for the three key phrases Kursk, Miss Universe 2000 and 1998 French soccer team, try to find fleets, countries or nicknames in the neighborhood of these phrases, and consider these as candidate answers. The task of the entity recognition module is to identify members of these three entity classes (and many others) as such. However, this identification is a non-trivial task.
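The naive strategy described above can be sketched as follows. This is only an illustration under stated assumptions: the annotation format (token spans plus a class label), the function names, the window size and the example sentence are all invented here and are not part of the course material.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_token, end_token_exclusive, entity_class)
Annotation = Tuple[int, int, str]


def find_phrase(tokens: List[str], phrase: List[str]) -> List[int]:
    """Return every token index at which `phrase` starts."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def candidate_answers(tokens: List[str], annotations: List[Annotation],
                      key_phrase: List[str], target_class: str,
                      window: int = 10) -> List[str]:
    """Collect entities of `target_class` within `window` tokens of the key phrase."""
    candidates = []
    for start in find_phrase(tokens, key_phrase):
        end = start + len(key_phrase)
        for a_start, a_end, a_class in annotations:
            if a_class != target_class:
                continue
            # Distance in tokens between the annotated entity and the key phrase.
            if a_end <= start:
                distance = start - a_end
            elif a_start >= end:
                distance = a_start - end
            else:
                distance = 0  # the two spans overlap
            if distance <= window:
                candidates.append(" ".join(tokens[a_start:a_end]))
    return candidates


# Invented example sentence with hand-made entity annotations.
tokens = "The submarine Kursk belonged to the Northern Fleet of Russia".split()
annotations = [(2, 3, "submarine"), (6, 8, "fleet"), (9, 10, "country")]
print(candidate_answers(tokens, annotations, ["Kursk"], "fleet"))
```

The sketch assumes that the entity annotations already exist. Producing them for fine-grained classes such as fleet or nickname is exactly the job of the entity recognition module.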


Develop and build a system that automatically enriches plain text documents with entity class information that is useful for a question answering system. As a source for such entity class information, use Wikipedia, (eXtended) WordNet, arbitrary lists or hand-coded rules, or learn the classes from training data.

The system should be able to run in several modes, using one, several or all of the available resources. Annotations should be provided in stand-off XML. The proposed target language for the system is Dutch. However, if the consensus among the students is that English or another language should be chosen, such a switch is possible.
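As a rough illustration of selectable resources and stand-off output, the following sketch looks up phrases in two toy resources and writes the results as XML that refers to the text by character offsets instead of embedding tags in it. The resource names, their contents, and the element and attribute names are assumptions made for this example; the course does not prescribe an annotation schema here.

```python
import xml.etree.ElementTree as ET

# Hypothetical resources: each maps a phrase to an entity class.
RESOURCES = {
    "list": {"Kursk": "submarine", "Miss Universe": "competition"},
    "rules": {"Russia": "country"},
}


def annotate(text, modes):
    """Return sorted (start, end, class, source) tuples for the selected resources."""
    found = []
    for mode in modes:
        for phrase, entity_class in RESOURCES[mode].items():
            # Plain substring search; a real system would respect token boundaries.
            pos = text.find(phrase)
            while pos != -1:
                found.append((pos, pos + len(phrase), entity_class, mode))
                pos = text.find(phrase, pos + 1)
    return sorted(found)


def to_standoff_xml(annotations):
    """Serialize annotations as stand-off XML: offsets only, no copy of the text."""
    root = ET.Element("annotations")
    for start, end, entity_class, source in annotations:
        ET.SubElement(root, "entity", {
            "start": str(start),
            "end": str(end),
            "class": entity_class,
            "source": source,
        })
    return ET.tostring(root, encoding="unicode")


text = "The Kursk was built in Russia."
print(to_standoff_xml(annotate(text, ["list"])))           # one resource
print(to_standoff_xml(annotate(text, ["list", "rules"])))  # all resources
```

Because the offsets point into the unchanged source document, several modules can add annotations for the same text without interfering with each other.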


The system consists of several modules, each of which is typically built by two students. If the course attracts fewer than 12 students, one or more of these modules will be left out of the system. The preferences of the students will determine which parts are included.

For this course, relevant background knowledge consists of knowledge of natural language processing plus general programming experience. Some experience with machine learners is useful for students who want to work on the learning module. Having written XML processing software is a plus for the interface module.

The system will be evaluated with documents which will be supplied by the teacher. Important questions will be: "what is the precision of the assigned entity classes?", "how many different classes can be detected?" and "how many interesting phrases are classified?".
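The three evaluation questions could be turned into numbers along the following lines. This is one possible reading, not the official evaluation procedure: the data format (character spans mapped to class labels), the function name and the example annotations are assumptions.

```python
def evaluate(predicted, gold):
    """Score system output against a gold standard.

    Both arguments are dicts mapping (start, end) spans to entity classes.
    """
    # A prediction counts as correct if the gold standard has the same class
    # for exactly the same span.
    correct = sum(1 for span, cls in predicted.items() if gold.get(span) == cls)
    precision = correct / len(predicted) if predicted else 0.0
    return {
        "precision": precision,                           # share of assigned classes that are right
        "distinct_classes": len(set(predicted.values())),  # how many different classes were assigned
        "classified_phrases": len(predicted),             # how many phrases received a class
    }


# Invented example: two phrases classified, one of them wrongly.
predicted = {(4, 9): "submarine", (23, 29): "city"}
gold = {(4, 9): "submarine", (23, 29): "country"}
print(evaluate(predicted, gold))
```

Whether the number of classified phrases should also be related to the number of phrases in the gold standard (that is, recall) is left open here.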

Literature, tools and examples

Last update: January 09, 2006, erikt@science.uva.nl