User:Santhosh/OpusMT Setup

From Wikitech

Opus MT setup

Note 1: OpusMT can be run with its Docker setup; see https://github.com/Helsinki-NLP/OPUS-MT. The steps listed here install everything manually on a fresh machine, without Docker.

Note 2: On the stat machines, downloading anything from the internet requires a proxy. Add this to ~/.profile:

export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080

Marian NMT

Steps to compile Marian NMT, which Opus MT needs. See https://marian-nmt.github.io/

Prerequisites:

  • Install cmake. Download a binary release from https://cmake.org/download/, for example cmake-3.16.5-Linux-x86_64.sh. Run it, then copy the cmake binary and the share folder to ~/.local:
cp -rf bin/cmake ~/.local/bin/
cp -rf share ~/.local/

Add ~/.local/bin to PATH:

export PATH=~/.local/bin:$PATH

Download Marian (clone the repository from GitHub and check out a release), then configure the build from the source directory:

cmake . -DCOMPILE_SERVER=on -DCOMPILE_CPU=on -DCOMPILE_CUDA=off -DUSE_STATIC_LIBS=on -DUSE_SENTENCEPIECE=off

Use all 32 CPU cores while compiling; otherwise it will take a long time.

make -j32

Copy all binaries to ~/.local/bin/

cp marian-* ~/.local/bin/
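
Before moving on, it can help to confirm that the copied tools are actually found through PATH. The function below is a minimal sketch; check_tools is a hypothetical helper name, not something shipped with Marian or Opus MT, and the binary names in the usage note are assumptions about what this build produces.

```shell
# Hypothetical helper (not part of Marian or Opus MT): report which of the
# given command names cannot be found on PATH. Returns 1 if any is missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, check_tools cmake marian-server marian-decoder should print nothing if the steps above worked.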

Opus MT

Get Opus MT by cloning the master branch of https://github.com/Helsinki-NLP/Opus-MT. Change to the Opus-MT directory and create a models directory with one subdirectory per language pair.

Example:

santhosh@stat1008:~$ ls ./Opus-MT/models/
en-es  en-fi  en-ml  en-mr

In each language-pair directory, download the model from https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models and unzip the model zip file. The contents should look like:

santhosh@stat1008:~$ ls  -l ./Opus-MT/models/en-ml/
total 581756
-rw-rw---- 1 santhosh wikidev       274 Mar  2 11:20 decoder.yml
-rw-r----- 1 santhosh wikidev     18652 Mar  2 11:20 LICENSE
-rw-rw-r-- 1 santhosh wikidev 285443946 Mar  2 11:26 opus+bt-2020-03-02.zip
-rw-rw-r-- 1 santhosh wikidev 306340921 Mar  2 10:47 opus+bt.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz
-rw-rw-r-- 1 santhosh wikidev     83144 Mar  2 11:06 opus+bt.spm32k-spm32k.transformer-align.train1.log
-rw-rw-r-- 1 santhosh wikidev      6759 Mar  2 10:57 opus+bt.spm32k-spm32k.transformer-align.valid1.log
-rw-rw-r-- 1 santhosh wikidev   1616586 Mar  1 22:41 opus+bt.spm32k-spm32k.vocab.yml
-rwxrwx--- 1 santhosh wikidev        80 Mar  2 11:20 postprocess.sh
-rwxrwx--- 1 santhosh wikidev       844 Mar  2 11:20 preprocess.sh
-rw-rw---- 1 santhosh wikidev       625 Mar  2 11:20 README.md
-rw-rw---- 1 santhosh wikidev    818441 Mar  2 11:20 source.spm
-rw-rw---- 1 santhosh wikidev         0 Mar  2 11:20 source.tcmodel
-rw-rw---- 1 santhosh wikidev   1358287 Mar  2 11:20 target.spm

Repeat this for every language pair. The models are referenced from the services.json file.
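
With several language pairs it is easy to leave one directory half unpacked. The function below is a rough sketch of a check; check_models is a hypothetical helper, and the list of files it looks for (decoder.yml, source.spm, target.spm) is an assumption taken from the listing above, not a statement of what the server strictly requires.

```shell
# Hypothetical helper: list files missing from each language-pair directory.
# The file names checked are an assumption based on the en-ml listing.
# Usage: check_models ./Opus-MT/models
check_models() {
    models_dir="$1"
    rc=0
    for pair_dir in "$models_dir"/*/; do
        # Skip the unexpanded glob when the models directory is empty.
        [ -d "$pair_dir" ] || continue
        for f in decoder.yml source.spm target.spm; do
            if [ ! -f "$pair_dir$f" ]; then
                echo "missing: $pair_dir$f"
                rc=1
            fi
        done
    done
    return "$rc"
}
```

The function prints one line per missing file and returns a non-zero status if anything is missing.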

In the Opus-MT folder, create a virtual environment, install the requirements and start the server:

python -m venv ENV_DIR
source ENV_DIR/bin/activate
pip install -r requirements.txt 
python server.py 

This will start the OpusMT server.

You may run it under a screen session, or just use another session to communicate with the web server. First, make sure that translation is working:

curl --noproxy "*"  --request POST --header "Content-Type: application/json" --data '{"from":"en","to":"fi","source":"water"}'  http://localhost:8888/api/translate

This should print the translation of the word "water" in Finnish.

Now let us try load testing.

Performance testing

Make sure the Apache benchmark utility, ab, is available.

Prepare a post data file English.json with the content:

{"from":"en","to":"fi","source":"Wail al-Shehri (1973–2001) was one of five hijackers of American Airlines Flight 11, which was flown into the North Tower of the World Trade Center as part of the September 11 attacks (memorial pictured). He and his younger brother Waleed joined an Al-Qaeda training camp in Afghanistan in March 2000. They were chosen, along with other Saudis, to participate in the attacks. Shehri returned to Saudi Arabia in October 2000 to obtain a clean passport and went back to Afghanistan before arriving in the United States in early June 2001. He stayed in motels in the Boynton Beach area of south Florida. On September 5, 2001, Shehri traveled to Boston and checked into a motel with his brother. Six days later, he arrived early in the morning at Boston's Logan International Airport and boarded American Airlines Flight 11. Shehri, his brother and three other hijackers deliberately crashed the airliner into the North Tower at 8:46 a.m."}
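
Writing the JSON by hand is error-prone once the text contains quotes. As a sketch, the file could instead be generated with the json module of python3, which takes care of escaping; make_payload is a hypothetical helper name and article.txt in the usage note is a hypothetical input file.

```shell
# Hypothetical helper: print a JSON request body for /api/translate.
# Arguments: source language, target language, text to translate.
# python3's json module handles the escaping of quotes and non-ASCII text.
make_payload() {
    python3 -c '
import json, sys
src, dst, text = sys.argv[1:4]
print(json.dumps({"from": src, "to": dst, "source": text}))
' "$1" "$2" "$3"
}
```

For example, make_payload en fi "$(cat article.txt)" > English.json writes a request body for whatever text article.txt holds.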

Make sure the system is able to translate this. Pass the file with --data @English.json rather than pasting the text inline, because the apostrophe in "Boston's" would end a single-quoted shell argument early:

curl --noproxy "*" --request POST --header "Content-Type: application/json" --data @English.json http://localhost:8888/api/translate

Now run the translation 10 times at concurrency level 1:

ab -n 10 -c 1 -p English.json -T  "application/json"  http://localhost:8888/api/translate

Run 100 requests at concurrency level 10:

ab -n 100 -c 10 -p English.json -T  "application/json"  http://localhost:8888/api/translate
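
To compare several concurrency levels in one go, the two runs above can be wrapped in a loop that keeps only the "Requests per second" line of the ab report. This is a sketch under assumptions: extract_rps and bench_levels are hypothetical helper names, and the choice of 10 requests per concurrent client simply mirrors the two runs above.

```shell
# Print the "Requests per second" figure from an ab report read on stdin.
extract_rps() {
    awk -F: '/^Requests per second/ { split($2, parts, " "); print parts[1] }'
}

# Hypothetical wrapper: run ab once per given concurrency level, sending
# 10 requests per concurrent client (an assumption), and print a summary line.
bench_levels() {
    for c in "$@"; do
        rps=$(ab -n $((c * 10)) -c "$c" -p English.json -T "application/json" \
            http://localhost:8888/api/translate 2>/dev/null | extract_rps)
        echo "concurrency=$c requests_per_second=$rps"
    done
}
```

Calling bench_levels 1 10 repeats the two runs above and prints one summary line per level.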

Compare the performance with the earlier results posted at https://phabricator.wikimedia.org/T247245