User:Santhosh/OpusMT Setup
Opus MT setup
Note 1: OpusMT can be run with its Docker setup; see https://github.com/Helsinki-NLP/OPUS-MT. The steps listed here install everything manually on a fresh machine, without Docker.
Note 2: On the stat machines, downloading anything from the internet requires a proxy. Add this to ~/.profile:
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
Marian NMT
Steps to compile Marian NMT in preparation for Opus MT. See https://marian-nmt.github.io/
Prerequisites:
- Install cmake. Download a binary release from https://cmake.org/download/, such as cmake-3.16.5-Linux-x86_64.sh. Run it, then copy the bin and share folders to ~/.local:
cp -rf bin/cmake ~/.local/bin/
cp -rf share ~/.local/
Add ~/.local/bin to PATH:
export PATH=~/.local/bin:$PATH
- Before downloading and compiling Marian, follow the instructions at https://marian-nmt.github.io/docs/ to install Intel MKL or, alternatively, OpenBLAS.
Download Marian (fetch a release from GitHub with git) and compile it:
cmake . -DCOMPILE_SERVER=on -DCOMPILE_CPU=on -DCOMPILE_CUDA=off -DUSE_STATIC_LIBS=on -DUSE_SENTENCEPIECE=off
Use all 32 CPU cores while compiling; otherwise the build takes a long time:
make -j32
Copy all binaries to ~/.local/bin/
cp marian-* ~/.local/bin/
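Before moving on, it can help to confirm that the copied binaries are actually reachable through PATH. The following is a minimal sketch, not part of the original setup: the function name missing_binaries is hypothetical, and the list of binary names is an assumption (marian-server is expected because of -DCOMPILE_SERVER=on), so adjust it to whatever the build produced.

```python
import shutil

# Assumed binary names; adjust to whatever the build actually produced.
EXPECTED_BINARIES = ["marian-decoder", "marian-server"]


def missing_binaries(names=EXPECTED_BINARIES):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    print("all found" if not missing else "not on PATH: " + ", ".join(missing))
```

If a name is reported as missing, check that ~/.local/bin is on PATH in the current session.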
Opus MT
Get Opus MT: clone the master branch of https://github.com/Helsinki-NLP/Opus-MT. Change to the Opus-MT directory and create a models directory with one subdirectory per language pair.
Example:
santhosh@stat1008:~$ ls ./Opus-MT/models/
en-es en-fi en-ml en-mr
In each language-pair directory, download the model from https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models and unzip the model zip file. The contents should look like:
santhosh@stat1008:~$ ls -l ./Opus-MT/models/en-ml/
total 581756
-rw-rw---- 1 santhosh wikidev       274 Mar 2 11:20 decoder.yml
-rw-r----- 1 santhosh wikidev     18652 Mar 2 11:20 LICENSE
-rw-rw-r-- 1 santhosh wikidev 285443946 Mar 2 11:26 opus+bt-2020-03-02.zip
-rw-rw-r-- 1 santhosh wikidev 306340921 Mar 2 10:47 opus+bt.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz
-rw-rw-r-- 1 santhosh wikidev     83144 Mar 2 11:06 opus+bt.spm32k-spm32k.transformer-align.train1.log
-rw-rw-r-- 1 santhosh wikidev      6759 Mar 2 10:57 opus+bt.spm32k-spm32k.transformer-align.valid1.log
-rw-rw-r-- 1 santhosh wikidev   1616586 Mar 1 22:41 opus+bt.spm32k-spm32k.vocab.yml
-rwxrwx--- 1 santhosh wikidev        80 Mar 2 11:20 postprocess.sh
-rwxrwx--- 1 santhosh wikidev       844 Mar 2 11:20 preprocess.sh
-rw-rw---- 1 santhosh wikidev       625 Mar 2 11:20 README.md
-rw-rw---- 1 santhosh wikidev    818441 Mar 2 11:20 source.spm
-rw-rw---- 1 santhosh wikidev         0 Mar 2 11:20 source.tcmodel
-rw-rw---- 1 santhosh wikidev   1358287 Mar 2 11:20 target.spm
Repeat this for every language pair. The models are referenced from the services.json file.
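With several language pairs, an incomplete unzip is easy to miss. The sketch below checks each directory under models for the files seen in the en-ml listing. It is only an illustration: the function names are hypothetical, and the list of required files is an assumption derived from that one listing, so other models may need a different list.

```python
from pathlib import Path

# Fixed file names taken from the en-ml listing; other models may differ.
REQUIRED_FILES = ["decoder.yml", "source.spm", "target.spm", "preprocess.sh", "postprocess.sh"]
# Patterns for files whose names vary from model to model.
REQUIRED_PATTERNS = ["*.npz", "*.vocab.yml"]
# Location used in the examples on this page.
MODELS_ROOT = "./Opus-MT/models"


def missing_model_files(model_dir):
    """Return the required names and patterns that have no match in model_dir."""
    model_dir = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]
    missing += [pattern for pattern in REQUIRED_PATTERNS if not any(model_dir.glob(pattern))]
    return missing


def check_models(models_root):
    """Map each language-pair directory under models_root to its missing files."""
    root = Path(models_root)
    if not root.is_dir():
        return {}
    return {d.name: missing_model_files(d) for d in sorted(root.iterdir()) if d.is_dir()}


if __name__ == "__main__":
    for pair, missing in check_models(MODELS_ROOT).items():
        print(pair, "OK" if not missing else "missing: " + ", ".join(missing))
```

Run it from the directory that contains Opus-MT, or change MODELS_ROOT to the actual path.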
In the Opus MT folder, create a virtual environment, install the requirements, and start the server:
python -m venv ENV_DIR
source ENV_DIR/bin/activate
pip install -r requirements.txt
python server.py
This will start the OpusMT server.
You may run it under a screen session, or just use another session to communicate with the web server. First, make sure that translation is working:
curl --noproxy "*" --request POST --header "Content-Type: application/json" --data '{"from":"en","to":"fi","source":"water"}' http://localhost:8888/api/translate
This should print the translation of the word "water" in Finnish.
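The same request can be sent from Python, which is handy for scripted checks. This is a minimal sketch using only the standard library. The function names build_request and translate are hypothetical, the URL and JSON fields are copied from the curl example, and the response is returned as raw text because the response format is not described here.

```python
import json
import urllib.request

# Endpoint used by the curl example.
API_URL = "http://localhost:8888/api/translate"


def build_request(source, from_lang="en", to_lang="fi", url=API_URL):
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps({"from": from_lang, "to": to_lang, "source": source})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(source, from_lang="en", to_lang="fi", timeout=60):
    """Send the request directly and return the raw response text."""
    # An empty ProxyHandler ignores http_proxy/https_proxy, like curl --noproxy "*".
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = build_request(source, from_lang, to_lang)
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, print(translate("water")) should show the same response as the curl command.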
Now, let us try load testing.
Performance testing
Make sure the Apache benchmark utility, ab, is available.
Prepare a post data file English.json with the content:
{"from":"en","to":"fi","source":"Wail al-Shehri (1973–2001) was one of five hijackers of American Airlines Flight 11, which was flown into the North Tower of the World Trade Center as part of the September 11 attacks (memorial pictured). He and his younger brother Waleed joined an Al-Qaeda training camp in Afghanistan in March 2000. They were chosen, along with other Saudis, to participate in the attacks. Shehri returned to Saudi Arabia in October 2000 to obtain a clean passport and went back to Afghanistan before arriving in the United States in early June 2001. He stayed in motels in the Boynton Beach area of south Florida. On September 5, 2001, Shehri traveled to Boston and checked into a motel with his brother. Six days later, he arrived early in the morning at Boston's Logan International Airport and boarded American Airlines Flight 11. Shehri, his brother and three other hijackers deliberately crashed the airliner into the North Tower at 8:46 a.m."}
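If the test text changes, writing the file by hand risks invalid JSON (unescaped quotes, stray line breaks). A small sketch that generates the file instead is shown here. The function name write_post_data is hypothetical, and the field names follow the JSON shown for English.json.

```python
import json


def write_post_data(path, source, from_lang="en", to_lang="fi"):
    """Write a translate request body to path as valid JSON and return it."""
    payload = {"from": from_lang, "to": to_lang, "source": source}
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps characters such as the en dash readable in the file.
        json.dump(payload, handle, ensure_ascii=False)
    return payload
```

For example, write_post_data("English.json", text) writes the POST data file, where text is a Python string holding the paragraph to translate.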
Make sure the system is able to translate this text. Post the file rather than pasting the JSON inline, because the apostrophe in "Boston's" would end a single-quoted shell string:
curl --noproxy "*" --request POST --header "Content-Type: application/json" --data @English.json http://localhost:8888/api/translate
Now run the translation 10 times at concurrency level 1:
ab -n 10 -c 1 -p English.json -T "application/json" http://localhost:8888/api/translate
Then run 100 requests at concurrency level 10:
ab -n 100 -c 10 -p English.json -T "application/json" http://localhost:8888/api/translate
Compare the performance with the earlier results posted at https://phabricator.wikimedia.org/T247245
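To make runs easier to compare, the headline figures can be pulled out of a saved ab report. The sketch below assumes ab's usual plain-text report lines ("Failed requests:", "Requests per second:", and "Time per request: ... [ms] (mean)"). The function name parse_ab_output and the result keys are hypothetical.

```python
import re

# Regular expressions for the report lines of interest (assumed ab report format).
PATTERNS = {
    "failed_requests": r"^Failed requests:\s+(\d+)",
    "requests_per_second": r"^Requests per second:\s+([\d.]+)",
    # Matches only the per-request mean, not the "across all concurrent requests" line.
    "ms_per_request": r"^Time per request:\s+([\d.]+) \[ms\] \(mean\)\s*$",
}


def parse_ab_output(text):
    """Return the figures found in an ab report as a dict of floats."""
    results = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            results[key] = float(match.group(1))
    return results
```

Redirect each ab run to a file (for example by appending > ab-c10.txt to the command), then pass the file contents to parse_ab_output.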