Nova Resource:Wikisource/Wikimedia OCR

From Wikitech

This page documents how to set up the Wikimedia OCR tool that is used by the Wikisource extension.

Web server

Install and configure Apache and PHP.

sudo apt -y install php php-bcmath php-common php-cli php-fpm php-gd php-json php-xml php-intl php-curl apache2 libapache2-mod-php

Create the web server configuration file at /etc/apache2/sites-available/wikimediaocr.conf with the following:

<VirtualHost *:80>
        DocumentRoot /var/www/tool/public
        ServerName ocr.wmcloud.org
        
        php_value memory_limit 512M

        # Requests with these user agents are denied.
        SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot|sqlmap)" bad_bot=yes

        CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
        CustomLog ${APACHE_LOG_DIR}/denied.log combined expr=(reqenv('bad_bot')=='yes')
        ErrorLog ${APACHE_LOG_DIR}/error.log

        <Directory /var/www/tool/public/>
             Options Indexes FollowSymLinks
             AllowOverride All
             Require all granted
             DirectoryIndex index.php
             RewriteEngine On
             RewriteRule ^index\.php$ - [L]
             RewriteCond %{REQUEST_FILENAME} !-f
             RewriteCond %{REQUEST_FILENAME} !-d
             RewriteRule . /index.php [L]
        </Directory>

        <Directory /var/www/tool/>
                Options Indexes FollowSymLinks
                AllowOverride None
                Require all granted
                Deny from env=bad_bot
        </Directory>

        ErrorDocument 403 "Access denied"
        RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
        RewriteRule .* - [R=403,L]
        RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
        RewriteRule .* - [R=403,L]
        
        RewriteEngine On
        RewriteCond %{HTTP:X-Forwarded-Proto} !https
        RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
</VirtualHost>
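
The SetEnvIfNoCase rule above marks a request as bad_bot when its User-Agent matches any alternative in the pattern, ignoring case. As a rough illustration only, the following sketch tests sample User-Agent strings against a deliberately abbreviated subset of that pattern. The function name is_denied_agent is hypothetical, and grep -E is used here in place of Apache's own regex engine, which behaves the same for plain alternations like these.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a User-Agent string would match a
# SUBSET of the deny pattern from the virtual host configuration.
# The pattern is abbreviated on purpose; it is not the full list.
BAD_BOT_PATTERN='(Googlebot|bingbot|python-requests|libcurl|AhrefsBot|sqlmap)'

is_denied_agent() {
    # -E: extended regex, -i: ignore case (like SetEnvIfNoCase), -q: quiet
    printf '%s\n' "$1" | grep -Eiq "$BAD_BOT_PATTERN"
}

# Try one crawler-style agent and one browser-style agent.
for agent in "Mozilla/5.0 (compatible; bingbot/2.0)" \
             "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"; do
    if is_denied_agent "$agent"; then
        echo "denied:  $agent"
    else
        echo "allowed: $agent"
    fi
done
```

To check a new entry before adding it to the real configuration, append it to BAD_BOT_PATTERN in a copy of this script and run it against the agent strings you expect to block.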

Set PHP configuration in /etc/php/7.3/mods-available/wikimediaocr.ini (adjust 7.3 to match the installed PHP version):

max_execution_time = 60

Enable it with sudo phpenmod wikimediaocr.

Enable the required Apache modules and the site configuration, and disable the unused default site:

sudo a2enmod php7.3 rewrite
sudo a2ensite wikimediaocr
sudo a2dissite 000-default
sudo apache2ctl graceful

Tool

Install dependencies:

sudo apt install git composer npm

Manually update Composer to the latest version (follow the Composer installation instructions and copy the new version to /usr/bin/composer). The packaged version is too old, but we install it anyway to pull in all of its dependencies.

Clone the repository, first removing the html/ directory created by Apache.

cd /var/www && sudo rm -rf html
sudo git clone https://github.com/wikimedia/wikimedia-ocr.git tool
cd /var/www/tool

Create .env.local with the relevant values (see the Google OCR section below), then install the dependencies and build the assets:

# "composer update" is required for newer versions.
sudo composer update
sudo composer install --no-dev -o
sudo npm install
# "npm audit fix" is required to get newer code with fixes for security issues.
sudo npm audit fix
sudo npm run build

Restore ownership of all application files to www-data:

sudo chown -R www-data:www-data .

Add a cron job that updates the app whenever there is a new tagged release. Run sudo crontab -e -u www-data and add:

MAILTO=tools.ocr@tools.wmflabs.org
*/10 * * * * /var/www/tool/vendor/wikimedia/toolforge-bundle/bin/deploy.sh prod /var/www/tool
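
The */10 schedule runs the deploy script every ten minutes. A minimal sketch of how one might confirm that the job is installed is shown below. The function name has_deploy_job is hypothetical, and the demonstration feeds it inline sample text rather than the real crontab.

```shell
#!/bin/sh
# Hypothetical check: succeeds if the crontab text on stdin contains an
# active (non-comment) line that runs deploy.sh for /var/www/tool.
has_deploy_job() {
    grep -v '^[[:space:]]*#' \
        | grep -q 'toolforge-bundle/bin/deploy\.sh prod /var/www/tool'
}

# Demonstration with inline sample text instead of the real crontab.
sample='MAILTO=tools.ocr@tools.wmflabs.org
*/10 * * * * /var/www/tool/vendor/wikimedia/toolforge-bundle/bin/deploy.sh prod /var/www/tool'

if printf '%s\n' "$sample" | has_deploy_job; then
    echo "deploy job present"
else
    echo "deploy job missing"
fi
```

On the server, the same function can read the live crontab: sudo crontab -l -u www-data | has_deploy_job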

Tesseract

The only configuration for Tesseract is to install it with all available OCR models (languages and scripts):

sudo apt install tesseract-ocr-all

At the time of writing, Tesseract 5 is the stable version, and it is the version that command installs.
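
After installing, tesseract --list-langs prints the available models, one per line after a header line. The sketch below is one way to confirm that particular models are present. The function name missing_langs and the language codes are illustrative assumptions, and the sample text stands in for real command output.

```shell
#!/bin/sh
# Hypothetical helper: reads `tesseract --list-langs` output on stdin and
# prints each requested language code that is NOT in the list.
missing_langs() {
    # Skip the header line ("List of available languages ...").
    available=$(tail -n +2)
    for lang in "$@"; do
        # -F: fixed string, -x: match the whole line, -q: quiet
        printf '%s\n' "$available" | grep -Fxq "$lang" || echo "$lang"
    done
}

# Demonstration with sample output; the codes below are only examples.
sample='List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):
eng
fra
osd'
printf '%s\n' "$sample" | missing_langs eng deu fra    # prints: deu
```

Against a real installation, assuming the list is written to standard output, the call would be: tesseract --list-langs | missing_langs eng fra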

Latest Tesseract

If for some reason the very latest Tesseract is required, it can also be installed from source.

  1. Install the required packages:
    sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
    
  2. Optionally install the man pages:
    sudo apt-get install --no-install-recommends asciidoc docbook-xsl xsltproc
    
  3. Clone the Tesseract repo (home directory is fine):
    git clone https://github.com/tesseract-ocr/tesseract.git
    
  4. cd tesseract and check out the latest tag for Tesseract 5, which at the time of writing is 5.3.2:
    git checkout 5.3.2
    
  5. Build from source:
    ./autogen.sh
    ./configure
    make
    
  6. Remove the old Tesseract package, if present, with sudo apt purge tesseract-ocr. This also removes the trained data files, which are re-added in a later step. Because make is the slowest step, remove the old package only after the build has finished, to minimize downtime of the tool. Note that parallelization (make -j8) doesn't seem to make any difference.
  7. Install the new version:
    sudo make install
    sudo ldconfig
    
  8. Clone the trained data files:
    cd ~
    git clone https://github.com/tesseract-ocr/tessdata_fast.git
    
  9. Copy them to /usr/local/share/tessdata:
    sudo cp tessdata_fast/*.traineddata /usr/local/share/tessdata
    
  10. Make sure all is well by running the check_tesseract script:
    cd /var/www/tool/
    ./check_tesseract.sh
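
As a complement to the copy step above, the following sketch lists any trained data files that did not make it into the target directory. The function name missing_copies is hypothetical, and the demonstration uses throwaway directories rather than the real tessdata paths.

```shell
#!/bin/sh
# Hypothetical helper: prints the name of every *.traineddata file that is
# present in the source directory ($1) but absent from the target ($2).
missing_copies() {
    for f in "$1"/*.traineddata; do
        [ -e "$f" ] || continue            # the glob matched nothing
        name=$(basename "$f")
        [ -e "$2/$name" ] || echo "$name"
    done
}

# Demonstration with temporary directories.
src=$(mktemp -d)
dst=$(mktemp -d)
touch "$src/eng.traineddata" "$src/fra.traineddata" "$dst/eng.traineddata"
missing_copies "$src" "$dst"    # prints: fra.traineddata
rm -rf "$src" "$dst"
```

For the directories used in the steps above, the call would be: missing_copies ~/tessdata_fast /usr/local/share/tessdata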
    

Upgrading Tesseract 5 to git master or another branch

Assuming Tesseract 5 is already installed and you only need to upgrade it to a newer version, follow these simplified steps:

  1. cd to the tesseract directory in your home directory (see above for cloning).
  2. Check out the version you want to upgrade to, e.g. git checkout master && git pull for git master.
  3. Run sudo make clean to remove previously compiled files (this is probably not required, but it runs quickly and guards against stale build artifacts).
  4. Then follow the normal installation steps:
    ./autogen.sh
    ./configure
    make -j8
    sudo make install
    sudo ldconfig
    
  5. Check the output of tesseract --version: if you upgraded to git master, it should contain a commit hash that matches the latest commit on the master branch.
  6. Run the check_tesseract script for the final checks:
    cd /var/www/tool/
    ./check_tesseract.sh
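
The version check in the steps above can be scripted roughly as follows. This sketch assumes that a build from git embeds a git describe style suffix ("-g" followed by an abbreviated hash) in the first line of the version output; that format is an assumption, and the function names and sample values are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: extracts the abbreviated commit hash from a version
# string such as "tesseract 5.3.2-12-g1a2b3c4". ASSUMPTION: the build embeds
# a "-g<hash>" suffix; a tagged release has none and yields no output.
version_commit() {
    printf '%s\n' "$1" | sed -n '1s/.*-g\([0-9a-f]\{4,\}\)$/\1/p'
}

# Succeeds if the abbreviated hash in the version string ($1) is a prefix
# of the full commit hash ($2).
commit_matches() {
    short=$(version_commit "$1")
    [ -n "$short" ] || return 1
    case "$2" in
        "$short"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstration with made-up values (not real Tesseract commits).
if commit_matches "tesseract 5.3.2-12-g1a2b3c4" \
                  "1a2b3c4d5e6f708192a3b4c5d6e7f8091a2b3c4d"; then
    echo "version matches commit"
fi
```

On the server, the two arguments could come from tesseract --version | head -n 1 and from git rev-parse origin/master run inside the tesseract checkout.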
    

Google OCR

Add the php-bcmath package (it is also part of the web server install command above, so this may be a no-op):

sudo apt install php-bcmath

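Download the Google Cloud Vision API keyfile to your local system (see CONTRIBUTING.md for info on obtaining a keyfile), use scp to copy it to your home directory on the VPS instance, then log in to the instance and move it to /var/www/:

# On your local system:
scp keyfile.json username@ocr-prod01.wikisource.eqiad1.wikimedia.cloud:/home/username
# On the VPS instance, from your home directory:
sudo mv keyfile.json /var/www/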

Make sure the .env.local file points to the keyfile:

APP_GOOGLE_KEYFILE=/var/www/keyfile.json
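
A minimal sketch for checking that setting is shown below. It reads APP_GOOGLE_KEYFILE from a dotenv-style file and tests that the named file is readable. The function names env_value and keyfile_ok are hypothetical, only simple KEY=value lines are handled, and the demonstration runs in a temporary directory so nothing under /var/www is touched.

```shell
#!/bin/sh
# Hypothetical helper: prints the value assigned to a key in a dotenv-style
# file (simple KEY=value lines only; quoting and interpolation are ignored).
env_value() {
    # $1 = file, $2 = key; the last matching assignment wins.
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Succeeds if APP_GOOGLE_KEYFILE in the given env file names a readable file.
keyfile_ok() {
    path=$(env_value "$1" APP_GOOGLE_KEYFILE)
    [ -n "$path" ] && [ -r "$path" ]
}

# Demonstration with throwaway files.
tmp=$(mktemp -d)
printf '{}\n' > "$tmp/keyfile.json"
printf 'APP_GOOGLE_KEYFILE=%s\n' "$tmp/keyfile.json" > "$tmp/.env.local"
if keyfile_ok "$tmp/.env.local"; then
    echo "keyfile setting looks valid"
fi
rm -rf "$tmp"
```

For the real file, call keyfile_ok /var/www/tool/.env.local, ideally as the www-data user so that the readability test reflects what the web server sees.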

You may also need to restart Apache:

sudo service apache2 restart

Transkribus

Wikimedia OCR also offers Transkribus as an OCR engine. The documentation for that is still missing.