Nova Resource:Wikisource/Wikimedia OCR
This page documents how to set up the Wikimedia OCR tool that is used by the Wikisource extension.
Web server
Install and configure Apache and PHP.
sudo apt -y install php php-bcmath php-common php-cli php-fpm php-gd php-json php-xml php-intl php-curl apache2 libapache2-mod-php
Create the web server configuration file at /etc/apache2/sites-available/wikimediaocr.conf with the following:
<VirtualHost *:80>
DocumentRoot /var/www/tool/public
ServerName ocr.wmcloud.org
php_value memory_limit 512M
# Requests with these user agents are denied.
SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot|sqlmap)" bad_bot=yes
CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
CustomLog ${APACHE_LOG_DIR}/denied.log combined expr=(reqenv('bad_bot')=='yes')
ErrorLog ${APACHE_LOG_DIR}/error.log
<Directory /var/www/tool/public/>
Options Indexes FollowSymLinks
AllowOverride All
Require all granted
DirectoryIndex index.php
RewriteEngine On
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</Directory>
<Directory /var/www/tool/>
Options Indexes FollowSymLinks
AllowOverride None
Require all granted
Deny from env=bad_bot
</Directory>
ErrorDocument 403 "Access denied"
RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
RewriteRule .* - [R=403,L]
RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
RewriteRule .* - [R=403,L]
RewriteEngine On
RewriteCond %{HTTP:X-Forwarded-Proto} !https
RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
</VirtualHost>
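Before relying on the User-Agent filter, it can help to see how a given client string fares against it. The following is a minimal sketch, not part of the tool: DENIED_AGENTS is a shortened subset of the pattern above, is_denied_agent is a hypothetical helper name, and grep -E stands in for Apache's regex engine (the two should agree for simple alternations like these).

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the tool. DENIED_AGENTS is a shortened
# subset of the User-Agent pattern used in the configuration above.
DENIED_AGENTS='(Googlebot|bingbot|python-requests|libcurl|sqlmap)'

# Returns 0 if the given User-Agent string matches the subset,
# case-insensitively (as SetEnvIfNoCase does), and 1 otherwise.
is_denied_agent() {
    printf '%s\n' "$1" | grep -Eiq "$DENIED_AGENTS"
}

# Try one scripted client and one ordinary browser string.
for ua in 'python-requests/2.31.0' 'Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0'; do
    if is_denied_agent "$ua"; then
        echo "denied:  $ua"
    else
        echo "allowed: $ua"
    fi
done
```

To test the full list, the whole pattern from the SetEnvIfNoCase line can be pasted into DENIED_AGENTS, keeping in mind that escapes may behave slightly differently between the two regex flavours.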
Set the PHP configuration in /etc/php/8.2/mods-available/wikimediaocr.ini:
max_execution_time = 60;
Then enable it with:
sudo phpenmod wikimediaocr
Enable the required Apache modules and the site configuration, disable the default site (which isn't used), and gracefully restart Apache:
sudo a2enmod php8.2 rewrite
sudo a2ensite wikimediaocr
sudo a2dissite 000-default
sudo apache2ctl graceful
Tool
Install dependencies:
sudo apt install git composer npm
Clone the repository, first removing the html/ directory created by Apache:
cd /var/www && sudo rm -rf html
sudo git clone https://github.com/wikimedia/wikimedia-ocr.git tool
cd /var/www/tool
Create .env.local with the relevant values (see the sections below), then install the dependencies and build the assets:
sudo composer install --no-dev --optimize-autoloader
sudo npm install
# "npm audit fix" is required to get newer code with fixes for security issues.
sudo npm audit fix
sudo npm run build
Change ownership of all application files to www-data:
sudo chown -R www-data:www-data .
Add a cron job that updates the app whenever there is a new tagged release. Run sudo crontab -e -u www-data, then add:
MAILTO=tools.ocr@tools.wmflabs.org
*/10 * * * * /var/www/tool/vendor/wikimedia/toolforge-bundle/bin/deploy.sh prod /var/www/tool
Tesseract
The only configuration for Tesseract is to install it with all available OCR models (languages and scripts):
sudo apt install tesseract-ocr-all
At the time of writing, Tesseract 5 is the stable version, and it is the version that command installs.
Latest Tesseract
If for some reason the very latest Tesseract is required, it can also be installed from source.
- Install the required packages:
sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
- Optionally install the man pages:
sudo apt-get install --no-install-recommends asciidoc docbook-xsl xsltproc
- Clone the Tesseract repo (home directory is fine):
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
Then check out the latest tag for Tesseract 5, which at the time of writing is 5.3.2:
git checkout 5.3.2
- Build from source:
./autogen.sh
./configure
make
- Now remove the old Tesseract package, if present:
sudo apt purge tesseract-ocr
This also removes the trained data files, which are re-added in a later step. The make step above takes the longest, so do not remove the old Tesseract until it has finished; this minimizes downtime of the tool. Note that parallelization (make -j8) doesn't seem to make any difference.
- Install the new version:
sudo make install
sudo ldconfig
- Clone the trained data files:
cd ~
git clone https://github.com/tesseract-ocr/tessdata_fast.git
- Copy them to /usr/local/share/tessdata:
sudo cp tessdata_fast/*.traineddata /usr/local/share/tessdata
- Make sure all is well by running the check_tesseract script:
cd /var/www/tool/
./check_tesseract.sh
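As an extra sanity check on the copy step, the number of model files in the clone can be compared with the number in the install location. This is a minimal sketch, not something the tool provides: count_models is a hypothetical helper, and the source path assumes tessdata_fast was cloned into the home directory as described above.

```shell
#!/bin/sh
# Hypothetical helper: prints the number of *.traineddata files directly
# inside the directory given as $1 (subdirectories are not searched).
count_models() {
    find "$1" -maxdepth 1 -type f -name '*.traineddata' | wc -l | tr -d ' '
}

# Compare the clone with the install location, if both exist.
src="$HOME/tessdata_fast"
dest=/usr/local/share/tessdata
if [ -d "$src" ] && [ -d "$dest" ]; then
    echo "source: $(count_models "$src"), installed: $(count_models "$dest")"
fi
```

A lower installed count than the source count would suggest that the copy was incomplete.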
Upgrading Tesseract 5 to git master or another branch
Assuming Tesseract 5 is already installed and you only need to upgrade it to a newer version, follow these simplified steps:
- cd to the tesseract directory in your home directory (see above for cloning).
- Check out the version you want to upgrade to, e.g. for git master:
git checkout master && git pull
- Clear any previously compiled files (this is probably not required, but it runs quickly and guards against stale build output):
sudo make clean
- Then follow the normal installation steps:
./autogen.sh
./configure
make -j8
sudo make install
sudo ldconfig
- Check the output of tesseract --version: if you upgraded to git master, it should contain a commit number, which should match the latest commit on the master branch.
- Run the check_tesseract script for the final checks:
cd /var/www/tool/
./check_tesseract.sh
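The version check above can be scripted. The sketch below assumes the first line of the tesseract --version output has the form "tesseract <version>"; tesseract_version is a hypothetical helper name, and stderr is merged in because some older releases printed the version there.

```shell
#!/bin/sh
# Hypothetical helper: reads `tesseract --version` output on stdin and prints
# the version string, assuming the first line is "tesseract <version>".
tesseract_version() {
    awk 'NR == 1 { print $2 }'
}

# Only query the binary if it is installed.
if command -v tesseract >/dev/null 2>&1; then
    echo "Installed Tesseract: $(tesseract --version 2>&1 | tesseract_version)"
fi
```

For a build from git master, the printed string can then be compared by eye with the latest commit on the master branch.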
Google OCR
Add the php-bcmath package:
sudo apt install php-bcmath
Download the Google Cloud Vision API keyfile to your local system (see CONTRIBUTING.md for info on obtaining a keyfile), then use scp to copy it to the VPS instance:
scp keyfile.json username@ocr-prod01.wikisource.eqiad1.wikimedia.cloud:/home/username
Then, on the VPS instance, move it into place:
sudo mv keyfile.json /var/www/
Make sure the .env.local file points to the right place:
APP_GOOGLE_KEYFILE=/var/www/keyfile.json
You may also need to restart Apache:
sudo service apache2 restart
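To confirm that the configured path actually leads to a file, something like the following may be useful. It is a sketch under assumptions: google_keyfile_path is a hypothetical helper, the env file location follows the clone path used earlier on this page, and the readability test runs as the current user rather than as www-data.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of APP_GOOGLE_KEYFILE from the env
# file given as $1 (the last assignment wins; quotes are not stripped).
google_keyfile_path() {
    sed -n 's/^APP_GOOGLE_KEYFILE=//p' "$1" | tail -n 1
}

# Report whether the configured keyfile is readable by the current user.
env_file=/var/www/tool/.env.local
if [ -f "$env_file" ]; then
    keyfile=$(google_keyfile_path "$env_file")
    if [ -r "$keyfile" ]; then
        echo "keyfile readable: $keyfile"
    else
        echo "keyfile missing or unreadable: $keyfile"
    fi
fi
```

Since Apache runs as www-data, running the same readability test through sudo -u www-data gives a more faithful answer.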
Transkribus
Wikimedia OCR also offers Transkribus as an OCR engine.
To configure it, set the following two environment variables in .env.local:
APP_TRANSKRIBUS_USERNAME=
APP_TRANSKRIBUS_PASSWORD=
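Because both variables are shown empty above, it is easy to leave one unset. The sketch below lists whichever of the two is missing or empty; unset_transkribus_vars is a hypothetical helper, and the env file location assumes the clone path used earlier on this page.

```shell
#!/bin/sh
# Hypothetical helper: prints each Transkribus variable that is missing from,
# or empty in, the env file given as $1.
unset_transkribus_vars() {
    for name in APP_TRANSKRIBUS_USERNAME APP_TRANSKRIBUS_PASSWORD; do
        # A variable counts as set only if a character follows the "=".
        grep -q "^${name}=." "$1" || echo "$name"
    done
}

# Prints nothing when both variables have values.
env_file=/var/www/tool/.env.local
if [ -f "$env_file" ]; then
    unset_transkribus_vars "$env_file"
fi
```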