User:DCausse/Term Stats With Cirrus Dump

Dump data

It is now possible to dump the content of a cirrus index.

Data can be dumped with the dumpIndex maintenance script. For example, on deployment-bastion.deployment-prep.eqiad.wmflabs you can use the following commands to dump the simplewiki content and general indices:

mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki simplewiki --indexType general | gzip -c > dump-simplewiki-general.gz
mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki simplewiki --indexType content | gzip -c > dump-simplewiki-content.gz
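
The dump is written in Elasticsearch bulk format: each document takes two lines, an action line followed by the source line, which is why the import step below feeds pairs of lines to the _bulk API. To peek at the first document of a dump:

zcat dump-simplewiki-content.gz | head -n 2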

In order to rebuild the index locally you will need to dump the mapping and the settings:

curl http://deployment-elastic06:9200/simplewiki_content_first/_mapping/ > simplewiki_content_mapping.json
curl http://deployment-elastic06:9200/simplewiki_general_first/_mapping/ > simplewiki_general_mapping.json

curl http://deployment-elastic06:9200/simplewiki_general_first/_settings/ > simplewiki_general_settings.json
curl http://deployment-elastic06:9200/simplewiki_content_first/_settings/ > simplewiki_content_settings.json
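
As a quick sanity check (jq is needed for the import steps below anyway), each downloaded file should contain a single top-level key named after the source index:

jq 'keys' simplewiki_content_settings.json
jq 'keys' simplewiki_content_mapping.json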

Import data

On another host or on your local machine you can recreate the same index:

You need to install the proper Elasticsearch plugins (an install sketch follows the list):

  • analysis-icu
  • experimental-highlighter-elasticsearch-plugin
  • extra
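
How plugins are installed depends on your Elasticsearch version. As a rough sketch for an ES 1.x install (the analysis-icu version and the zip paths below are placeholders to adapt to your setup):

bin/plugin --install elasticsearch/elasticsearch-analysis-icu/2.7.0
# experimental-highlighter and extra can be installed from a downloaded or locally built zip
bin/plugin --url file:///path/to/experimental-highlighter-elasticsearch-plugin.zip --install experimental-highlighter
bin/plugin --url file:///path/to/extra.zip --install extra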

Create the index with the same settings as the original (you need to install jq and curl):

jq -c '.simplewiki_content_first' < simplewiki_content_settings.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test' --data @-
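
You can check that the settings (analyzers in particular) were applied by reading them back from the new index:

curl 'http://localhost:9200/simplewiki_content_test/_settings?pretty'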

Then apply the mappings:

jq -c '{"page": .simplewiki_content_first.mappings.page}' < simplewiki_content_mapping.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test/_mapping/page' --data @-
jq -c '{"namespace": .simplewiki_content_first.mappings.namespace}' < simplewiki_content_mapping.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test/_mapping/namespace' --data @-

Import the data (you need to install GNU parallel):

zcat dump-simplewiki-content.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/simplewiki_content_test/_bulk --data-binary @- > /dev/null'
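
Once the import finishes, refresh the index and compare its document count with the source index to make sure nothing was dropped:

curl -XPOST 'http://localhost:9200/simplewiki_content_test/_refresh'
curl 'http://localhost:9200/simplewiki_content_test/_count?pretty'
# compare with the original index
curl 'http://deployment-elastic06:9200/simplewiki_content_first/_count?pretty'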

You can follow the same steps to import simplewiki_general by replacing every reference to simplewiki_content with simplewiki_general.

Dump term stats

To dump term stats I use a plugin named elasticsearch-index-termlist. The upstream plugin does not support ES 1.6 and has problems with some fields, so use my fork here: https://github.com/nomoa/elasticsearch-index-termlist . If you just need the jar built for ES 1.6, grab it here: https://drive.google.com/file/d/0Bzo2vOqfrXhJZU1DTDdyanRkQUU/view?usp=sharing

Once the plugin is installed you can extract term stats with:

curl -XGET 'http://localhost:9200/simplewiki_content_test/_termlist?field=title.prefix'  > terms_title_prefix.json

TODO: add a list of the available fields.
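
In the meantime the candidate fields can be read from the dumped mapping; a rough jq sketch (subfields such as title.prefix appear under the parent field's "fields" entry):

jq -r '.simplewiki_content_first.mappings.page.properties | keys[]' simplewiki_content_mapping.json
# subfields of a given field, e.g. title
jq -r '.simplewiki_content_first.mappings.page.properties.title.fields | keys[]' simplewiki_content_mapping.json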

Convert JSON to CSV

jq -r '.terms[] | [.term,.totalfreq,.docfreq] | @csv' < terms_title_prefix.json  > terms_title_prefix.csv
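
A quick look at the CSV confirms the expected shape, one term per line with its total frequency and document frequency:

head -n 5 terms_title_prefix.csv
wc -l terms_title_prefix.csv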

Now you can use R to inspect the data:

# install.packages("stringi")
# install.packages("data.table")

library(stringi)
library(data.table)

# Read input: V1 = term, V2 = totalFreq (always 0 for this field), V3 = docFreq
dat <- read.csv('/plat/cirrus-dump/terms_title_prefix.csv', header=FALSE, sep=",");

# calculate the prefix length
dat$tlength <- stri_length(dat$V1);

# Reorder the dataframe on prefix length and term freq
dat <- dat[order(dat$tlength, -dat$V3),]

# Create a data.table with prefixes of length from 1 to 10
all <- as.data.table(dat[dat$tlength<=10,]);
# Keep only data that have less than 10 chars and have doc freq > 10
highfreqs <- as.data.table(dat[dat$tlength<10 & dat$V3>10,])
# Keep only data that have less than 10 chars and have doc freq <= 10
lowfreqs <- as.data.table(dat[dat$tlength<10 & dat$V3<=10,])

# Rank terms by frequency within each length group
all[,order:=rank(-V3,ties.method="first"),by=tlength]
highfreqs[,order:=rank(-V3,ties.method="first"),by=tlength]
lowfreqs[,order:=rank(-V3,ties.method="first"),by=tlength]

# Get the number of terms by length
counts <- table(all$tlength);

highfreqsCounts <- table(highfreqs$tlength);
lowfreqsCounts <- table(lowfreqs$tlength);

# Number of terms by length with doc freq > 10
plot(highfreqsCounts)

# Number of terms by length with doc freq <= 10
plot(lowfreqsCounts)

# Rough frequency distribution by term length:
# each dot is a term; terms of length 1 are spread over x in [0,1], terms of length 2 over [1,2], and so on
plot(all$tlength + (all$order / counts[all$tlength]), all$V3, log="y", cex=0.002)