User:DCausse/Term Stats With Cirrus Dump
Dump data
It is now possible to dump the content of a Cirrus index.
Data can be dumped with the dumpIndex maintenance script. For example, on deployment-bastion.deployment-prep.eqiad.wmflabs you can use the following commands to dump the simplewiki content and general indices:
mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki simplewiki --indexType general | gzip -c > dump-simplewiki-general.gz
mwscript extensions/CirrusSearch/maintenance/dumpIndex.php --wiki simplewiki --indexType content | gzip -c > dump-simplewiki-content.gz
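The dump is written in the Elasticsearch bulk format (pairs of action and document lines), which is why it can later be fed straight into the _bulk API. A quick sanity check on the output:
# Peek at the first pair: an index action line followed by a document line
zcat dump-simplewiki-general.gz | head -n 2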
In order to rebuild the index locally you will need to dump the mappings and the settings:
curl http://deployment-elastic06:9200/simplewiki_content_first/_mapping/ > simplewiki_content_mapping.json
curl http://deployment-elastic06:9200/simplewiki_general_first/_mapping/ > simplewiki_general_mapping.json
curl http://deployment-elastic06:9200/simplewiki_general_first/_settings/ > simplewiki_general_settings.json
curl http://deployment-elastic06:9200/simplewiki_content_first/_settings/ > simplewiki_content_settings.json
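If you already have jq installed you can check that each file is valid JSON and keyed by the index name:
# The only top-level key should be the real index name, e.g. simplewiki_content_first
jq 'keys' < simplewiki_content_settings.json
jq 'keys' < simplewiki_content_mapping.json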
Import data
On another host or your local machine you can recreate the same index.
You need to install the proper Elasticsearch plugins (an install sketch follows the list):
- analysis-icu
- experimental-highlighter-elasticsearch-plugin
- extra
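With the ES 1.x plugin manager the installation looks roughly like this (a sketch: the analysis-icu version and the Wikimedia plugin zip paths are assumptions, adjust them to your Elasticsearch release and setup):
# analysis-icu from the official repository; the version must match your ES release
bin/plugin --install elasticsearch/elasticsearch-analysis-icu/2.6.0
# Wikimedia plugins, installed from locally downloaded zips (paths are placeholders)
bin/plugin --install experimental-highlighter-elasticsearch-plugin --url file:///path/to/experimental-highlighter-elasticsearch-plugin.zip
bin/plugin --install extra --url file:///path/to/extra.zip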
Create the index with the same settings as the original (you need to install jq and curl):
jq -c '.simplewiki_content_first' < simplewiki_content_settings.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test' --data @-
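To see exactly what will be sent as the index-creation body, you can run the jq filter on its own:
# The body is the dumped settings object of the original index
jq '.simplewiki_content_first' < simplewiki_content_settings.json | head -n 20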
And the mappings:
jq -c '{"page": .simplewiki_content_first.mappings.page}' < simplewiki_content_mapping.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test/_mapping/page' --data @- jq -c '{"namespace": .simplewiki_content_first.mappings.namespace}' < simplewiki_content_mapping.json | curl -XPUT 'http://localhost:9200/simplewiki_content_test/_mapping/namespace' --data @-
Import the data (you need to install GNU parallel):
zcat dump-simplewiki-content.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/simplewiki_content_test/_bulk --data-binary @- > /dev/null'
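Once parallel finishes, refresh the index and compare the document count with the original as a quick sanity check:
# Make the freshly indexed documents visible, then count them
curl -XPOST 'http://localhost:9200/simplewiki_content_test/_refresh'
curl 'http://localhost:9200/simplewiki_content_test/_count?pretty'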
You can follow the same steps to import simplewiki_general by changing all references to simplewiki_content to simplewiki_general.
Dump term stats
To dump term stats I use a plugin named elasticsearch-index-termlist. The upstream plugin does not support ES 1.6 and has problems with some fields, so use my fork here: https://github.com/nomoa/elasticsearch-index-termlist . If you just need the jar built for ES 1.6, grab it here: https://drive.google.com/file/d/0Bzo2vOqfrXhJZU1DTDdyanRkQUU/view?usp=sharing
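One way to install the jar on ES 1.x (an assumption based on how 1.x loads JVM plugins; the paths are placeholders for a typical Debian/Ubuntu layout) is to drop it into its own directory under plugins/ and restart the node:
mkdir -p /usr/share/elasticsearch/plugins/index-termlist
cp elasticsearch-index-termlist-*.jar /usr/share/elasticsearch/plugins/index-termlist/
# Restart Elasticsearch so the plugin is loaded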
Once the plugin is installed you can extract term stats with:
curl -XGET 'http://localhost:9200/simplewiki_content_test/_termlist?field=title.prefix' > terms_title_prefix.json
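The response is a JSON document with a terms array; you can peek at the first entries with jq:
# Each entry carries the term and its frequencies
jq '.terms[0:3]' < terms_title_prefix.json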
TODO: add a list of the available fields.
Convert JSON to CSV
jq -r '.terms[] | [.term,.totalfreq,.docfreq] | @csv' < terms_title_prefix.json > terms_title_prefix.csv
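For a first look at the data you can sort the CSV by document frequency (note that this naive sort mis-splits quoted terms that contain embedded commas):
# Top 5 prefixes by doc freq (third column)
sort -t, -k3 -rn terms_title_prefix.csv | head -n 5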
Now you can use R to inspect the data:
# install.packages("stringi")
# install.packages("data.table")
library(stringi)
library(data.table)

# Read input, V1: the term, V2: totalFreq (always 0 for this field), V3: docFreq
dat <- read.csv('/plat/cirrus-dump/terms_title_prefix.csv', header=FALSE, sep=",");

# Calculate the prefix length
dat$tlength <- stri_length(dat$V1);

# Reorder the data frame on prefix length and term freq
dat <- dat[order(dat$tlength, -dat$V3),]

# Create a data.table with prefixes of length from 1 to 10
all <- as.data.table(dat[dat$tlength<=10,]);

# Keep only terms that have fewer than 10 chars and a doc freq > 10
highfreqs <- as.data.table(dat[dat$tlength<10 & dat$V3>10,])

# Keep only terms that have fewer than 10 chars and a doc freq <= 10
lowfreqs <- as.data.table(dat[dat$tlength<10 & dat$V3<=10,])

# Rank each term on freq, grouped by length
all[,order:=rank(-V3,ties.method="first"),by=tlength]
highfreqs[,order:=rank(-V3,ties.method="first"),by=tlength]
lowfreqs[,order:=rank(-V3,ties.method="first"),by=tlength]

# Get the number of terms by length
counts <- table(all$tlength);
highfreqsCounts <- table(highfreqs$tlength);
lowfreqsCounts <- table(lowfreqs$tlength);

# Number of terms by length that have more than 10 results
plot(highfreqsCounts)

# Number of terms by length that have less than 10 results
plot(lowfreqsCounts)

# Rough freq distribution by term length: each dot is a term, terms of length 1
# are spread over x in [0,1], terms of length 2 over x in [1,2], and so on
plot(all$tlength + (all$order / counts[all$tlength]), all$V3, log="y", cex=0.002)