20060102 erikt@science.uva.nl

tnt directory: data files for TnT tagger applied to Dutch NE tagging

sources: TnT:  http://www.coli.uni-saarland.de/~thorsten/tnt/
         data: http://www.cnts.ua.ac.be/conll2002/ner/


HOW TO EXPAND THE LEXICON

Add lines to the file train.lex.extra, one word per line, in the
format: "word 1 tag 1"
Example:

   echo "New 1 I-LOC 1" >> train.lex.extra
   echo "York 1 E-LOC 1" >> train.lex.extra

Then combine the original lexicon with the extra entries:

   cat train.lex.org train.lex.extra > train.lex

This strategy does not work well for ambiguous words, because a
lexicon entry carries no context. Add such words to the training text
as well, so that their context becomes available to the tagger.
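
A malformed line in train.lex.extra is easy to miss, so a small check
before combining the files may help. The sketch below is not part of
the distributed software: the function name check_lex_extra is made up
for illustration, and it assumes that the second and fourth field are
always numeric, as in the "word 1 tag 1" format above.

```shell
# Hypothetical helper (not in ../bin): print every line of the given
# file that does not consist of four fields "word number tag number",
# and return a non-zero status if any such line was found.
check_lex_extra() {
    awk 'NF != 4 || $2 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ {
             printf "line %d: %s\n", NR, $0
             bad = 1
         }
         END { exit bad + 0 }' "$1"
}
```

Possible usage, combining the files only if the check passes:

   check_lex_extra train.lex.extra &&
      cat train.lex.org train.lex.extra > train.lex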


HOW TO EXPAND THE TRAINING TEXT

Add tagged text to the file train.extra, one word and its tag per
line, separated by a single space, with an empty line between
sentences. TIP: separate punctuation marks from words, and let a
sentence start with a capital letter only if its first word is a name
(software: ../bin/recap).
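
The format described above can be checked mechanically before the text
is added to the training data. The following is only a sketch: the
function name check_train_extra is made up for illustration, and it
assumes that train.extra uses no tags other than the nine listed under
TAGS at the end of this file.

```shell
# Hypothetical helper (not in ../bin): print every non-empty line that
# is not "word tag" separated by exactly one space, or whose tag is not
# one of the nine tags from the TAGS section. Empty lines are accepted
# as sentence boundaries. Returns non-zero if a bad line was found.
check_train_extra() {
    awk '
        BEGIN {
            n = split("E-LOC E-MISC E-ORG E-PER I-LOC I-MISC I-ORG I-PER O", t, " ")
            for (i = 1; i <= n; i++) known[t[i]] = 1
        }
        $0 == "" { next }
        !/^[^ ]+ [^ ]+$/ || !($2 in known) {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }' "$1"
}
```

A file that passes the check could look like this (two sentences, the
first one starting with a name):

   Jan E-PER
   woont O
   in O
   New I-LOC
   York E-LOC
   . O

   hij O
   werkt O
   . O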

Then combine the files train.org and train.extra into the file train:

   cat train.org train.extra | ../bin/addpp > train

After this, retrain the tagger:

   ../bin/tnt-para train

This will create a new lexicon, which has to be saved:

   cp train.lex train.lex.org

Then the extra words need to be added to the lexicon:

   cat train.lex.org train.lex.extra > train.lex
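
The four commands above have to be run in this order every time
train.extra changes, so it can be convenient to wrap them in one shell
function. This is only a sketch: the function name retrain and its
optional argument (the directory holding addpp and tnt-para, with
../bin as default) are made up for illustration. The commands
themselves are the ones listed above, chained so that a failing step
stops the sequence.

```shell
# Hypothetical wrapper: run the retraining steps from this section in
# order, stopping at the first step that fails.
# Optional argument: directory with addpp and tnt-para (default ../bin).
retrain() {
    bin=${1:-../bin}
    # 1. combine original and extra training text
    cat train.org train.extra | "$bin/addpp" > train &&
    # 2. retrain the tagger (creates a new train.lex)
    "$bin/tnt-para" train &&
    # 3. save the newly created lexicon
    cp train.lex train.lex.org &&
    # 4. add the extra words to the lexicon
    cat train.lex.org train.lex.extra > train.lex
}
```

Possible usage, from within the tnt directory:

   retrain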


HOW TO EVALUATE

   ../bin/evaluate ../etc/000TEST

The starting score is overall FB1=69.95
20060213: improved to 70.38
20060221: improved to 70.66

TAGS

E-LOC  final word of location entity
E-MISC final word of miscellaneous entity
E-ORG  final word of organization entity
E-PER  final word of person entity
I-LOC  non-final word of location entity
I-MISC non-final word of miscellaneous entity
I-ORG  non-final word of organization entity
I-PER  non-final word of person entity
O      non-entity word

