20161025 data and software with LT4DH 2016 paper by Erik Tjong Kim Sang
erik.tjong.kim.sang@meertens.knaw.nl

Erik Tjong Kim Sang, "Finding Rising and Falling Words". Proceedings
of the COLING 2016 workshop "Language Technology Resources and Tools 
for Digital Humanities", ACL, Osaka, Japan, 2016.


FILES AND DIRECTORIES

bin: software
eval: manual evaluation results
gnuplot: graphs
magazine.txt: raw frequency counts per year of magazine words with total frequency >= 100
tweets.txt: raw frequency counts per month of tweet words with total frequency >= 10000


FILE FORMATS

directory eval 
Files contain one word per line. Words in files with name "positive" 
were classified as interesting candidates by the human annotator, 
words in files with name "negative" were classified as uninteresting.

file magazine.txt
The file contains one word per line followed by absolute frequencies
per year for 1837-2009 (excluding years 1945, 2003, 2004 and 2008),
separated by TABs. The first line with label CORPUSSIZE contains the
total number of words per year. Only words with a total frequency of 
100 or more have been included.
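
As a minimal sketch of how this format can be used, the shell function
below divides the yearly counts of a word by the CORPUSSIZE counts to
obtain relative frequencies. The function name relfreq is hypothetical
and not part of the software in bin. It assumes that the CORPUSSIZE
line precedes the word line and that both lines have the same number
of TAB-separated fields.

```shell
# relfreq WORD FILE: print the relative frequencies of WORD, TAB-separated.
# Hypothetical helper for illustration, not part of the distributed software.
relfreq() {
  awk -F '\t' -v w="$1" '
    # store the corpus size of every column
    $1 == "CORPUSSIZE" { for (i = 2; i <= NF; i++) total[i] = $i; next }
    # divide each count of the requested word by the corpus size
    $1 == w {
      for (i = 2; i <= NF; i++) {
        r = (total[i] > 0) ? $i / total[i] : 0   # guard against empty columns
        printf "%s%.8f", (i > 2 ? "\t" : ""), r
      }
      printf "\n"
    }
  ' "$2"
}
```

Example use: relfreq terecht magazine.txt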

file tweets.txt
The file contains one word per line followed by absolute frequencies
per month for January 2011-August 2016 (68 months), separated by 
TABs. The first line with label CORPUSSIZE contains the total number 
of words per month. Only words with a total frequency of 10000 or 
more have been included.
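
If the description above holds, every line of tweets.txt has 69
TAB-separated fields (one label plus 68 monthly counts). The sketch
below checks this. The function name checkcols is hypothetical and not
part of the software in bin.

```shell
# checkcols FILE N: report lines without exactly N TAB-separated fields.
# Hypothetical helper for illustration, not part of the distributed software.
# The exit status is 1 if at least one such line was found, otherwise 0.
checkcols() {
  awk -F '\t' -v n="$2" '
    NF != n { printf "line %d (%s): %d fields\n", NR, $1, NF; bad++ }
    END { exit (bad > 0) }
  ' "$1"
}
```

Example use: checkcols tweets.txt 69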


RANKING WORD LISTS

The program bin/r can be used for ranking the word lists, in 
combination with the standard program sort from Linux and Mac OS X. 
For example:

magazine text, smoothing factor 1 (-s1), delta scores (-k1)
bin/r -s1 < magazine.txt | sort -nr -k1

tweets, smoothing factor 5 (-s5), correlation coefficients (-k3)
bin/r -s5 < tweets.txt | sort -nr -k3

magazine text, smoothing factor 11 (-s11), window size 10 (-m10), correlation coefficients (top: -k3; bottom -k7)
bin/r -s11 -m10 < magazine.txt | sort -nr -k3
bin/r -s11 -m10 < magazine.txt | sort -nr -k7

Note: windowing results (with option -m) may take several minutes 
to compute.
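
The ranked lists can be long. As a rough sketch, the shell function
below prints only the first and last lines of a ranked list read from
standard input. The function name topbottom is hypothetical and not
part of the software in bin. It keeps the complete list in memory.

```shell
# topbottom N: print the first N and the last N lines of standard input.
# Hypothetical helper for illustration, not part of the distributed software.
topbottom() {
  awk -v n="$1" '
    { line[NR] = $0 }
    END {
      for (i = 1; i <= n && i <= NR; i++) print line[i]
      print "..."
      start = NR - n + 1
      if (start <= n) start = n + 1   # do not print lines twice in short lists
      for (i = start; i <= NR; i++) print line[i]
    }
  '
}
```

Example use: bin/r -s1 < magazine.txt | sort -nr -k1 | topbottom 10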


EVALUATING WORD LISTS

The program bin/eval can be used for evaluating the top and bottom
100 (or any other number) of the ranked lists generated in the 
previous section. For example:

magazine text, smoothing factor 1 (-s1), delta scores (-k1)
bin/r -s1 < magazine.txt | sort -nr -k1 | bin/eval eval/magazine 100

tweets, smoothing factor 5 (-s5), correlation coefficients (-k3)
bin/r -s5 < tweets.txt | sort -nr -k3 | bin/eval eval/tweets 100

magazine text, smoothing factor 11 (-s11), window size 10 (-m10), correlation coefficients (top: -k3; bottom -k7)
bin/r -s11 -m10 < magazine.txt | sort -nr -k3 | bin/eval eval/magazine 100
bin/r -s11 -m10 < magazine.txt | sort -nr -k7 | bin/eval eval/magazine 100

Note: windowing results (with option -m) may take several minutes 
to compute.

If bin/eval reports unknown words, their graphs need to be inspected
and the words need to be added to one of the files 
magazine/positive.txt, magazine/negative.txt, tweets/positive.txt or
tweets/negative.txt based on the source and the assessment.


CREATING GRAPHS

Graphs can be created with the bin/plotword programs, which rely on 
the program gnuplot. Gnuplot is not installed by default on most 
systems, so you may need to install it on your computer first.

plot the graph for the word "terecht" based on magazine text: slow version
bin/plotword.magazine terecht < magazine.txt

plot the graph for the word "terecht" based on magazine text: fast version (grep preselects the CORPUSSIZE line and the lines containing the word)
W=terecht; grep -e CORPUSSIZE -e $W magazine.txt | bin/plotword.magazine $W

The graphs are also available as WORD.png (for example terecht.png)
in the directory gnuplot/magazine or gnuplot/tweets.

