Nova Resource:Wikisource/Wikisource Export

From Wikitech

We have two VPS instances and two Toolforge tools (one each for prod and test). The Toolforge tools exist because WS Export used to be hosted on Toolforge, and the VPS instances still use those tools' databases and email addresses.

Creating a new instance

Create a new g2.cores4.ram8.disk80 instance running on the latest Debian (or a g3.cores1.ram1.disk20 for the test instance). Once the instance has been spawned, SSH in and follow these steps:

  1. Install PHP and Apache, along with some dependencies:
    sudo apt update
    sudo apt -y upgrade
    sudo apt -y install php php-mysql php-sqlite3 php-intl php-zip apache2 php-fpm mariadb-client calibre php-curl php-xml php-dom php-mbstring cron
    
  2. We use the Debian-packaged version of Calibre, even though the Calibre developers recommend against it because it can be out of date; it's been working fine for us. Calibre can fail to clean up its temp files in some situations, so we also add the following in /etc/cron.daily/calibre-cleanup:
    #!/bin/bash
    
    find /tmp /ws-export/var/calibre-temp -path '*calibre*' -user www-data -mtime +1 -exec rm -r {} \;
    
  3. Install some fonts. Most of these are available in the Debian repositories, but the Mukta family must be installed manually to maintain backwards compatibility (these fonts used to be packaged with the tool's code), and Amiri is not available elsewhere. Indigo is likewise downloaded from GitHub.
    sudo apt -y install fontconfig fonts-freefont-ttf fonts-linuxlibertine fonts-dejavu-core fonts-gubbi fonts-opendyslexic fonts-noto fonts-noto-cjk
    wget "https://fonts.google.com/download?family=Mukta" -O Mukta.zip
    wget "https://fonts.google.com/download?family=Mukta%20Mahee" -O MuktaMahee.zip
    wget "https://fonts.google.com/download?family=Mukta%20Malar" -O MuktaMalar.zip
    wget "https://fonts.google.com/download?family=Mukta%20Vaani" -O MuktaVaani.zip
    sudo unzip Mukta.zip -d /usr/local/share/fonts/Mukta
    sudo unzip MuktaMahee.zip -d /usr/local/share/fonts/MuktaMahee
    sudo unzip MuktaMalar.zip -d /usr/local/share/fonts/MuktaMalar
    sudo unzip MuktaVaani.zip -d /usr/local/share/fonts/MuktaVaani
    wget https://github.com/aliftype/amiri/archive/refs/tags/1.000.zip -O Amiri.zip
    sudo unzip -j Amiri.zip "amiri-1.000/fonts/*" -d /usr/local/share/fonts/Amiri
    wget https://github.com/TiroTypeworks/Indigo/archive/refs/heads/main.zip
    unzip main.zip
    sudo cp -r Indigo-main/fonts /usr/local/share/fonts/Indigo
    
    sudo fc-cache -v
    
  4. Install composer by following these instructions (we don't include them here because you must validate the download), but make sure to install to the /usr/local/bin directory and with the filename composer, e.g.:
    sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer
    
  5. Clone the repository, first removing the html directory created by Apache.
    cd /var/www && sudo rm -rf html
    sudo git clone https://github.com/wikimedia/ws-export.git tool
    cd /var/www/tool
    
  6. Become the root user with sudo su root
  7. Add a block storage filesystem at /ws-export/ with a directory in it symlinked from the tool's var/ directory:
    mkdir /ws-export/var
    chown -R www-data:www-data /ws-export/var
    ln -s /ws-export/var /var/www/tool/var
    
  8. Run sudo composer install --no-dev -o
  9. Copy .env to .env.local and edit the environment variables in it.
  10. Make sure that all the files in the repo are owned by www-data.
    sudo chown -R www-data:www-data .
    
  11. Create the web server configuration file at /etc/apache2/sites-available/wsexport.conf with the following:
    <VirtualHost *:80>
            ServerName wsexport.wmflabs.org
            Redirect / https://ws-export.wmcloud.org/
    </VirtualHost>
    <VirtualHost *:80>
            DocumentRoot /var/www/tool/public
            ServerName ws-export.wmcloud.org
            
            <Proxy "fcgi://localhost">
                ProxySet retry=0 disablereuse=on
            </Proxy>
    
            # Requests with these user agents are denied and logged at ${APACHE_LOG_DIR}/denied.log
            SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot|sqlmap|Applebot|MauiBot|PetalBot|FacebookBot|UMichBot|LinuxGetUrl|; MSIE (6|7|8)\.0;|deepnoc)" bad_bot=yes
    
            # Google Cloud, Amazon AWS among other webhost blocks. Requests are logged at ${APACHE_LOG_DIR}/denied.log
            SetEnvIfExpr "%{HTTP:X-Forwarded-For} -ipmatch '136.243.220.214' || %{HTTP:X-Forwarded-For} -ipmatch '89.58.41.156' || %{HTTP:X-Forwarded-For} -ipmatch '107.178.192.0/18' || %{HTTP:X-Forwarded-For} -ipmatch '15.236.0.0/14' || %{HTTP:X-Forwarded-For} -ipmatch '162.250.188.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '18.32.0.0/11' || %{HTTP:X-Forwarded-For} -ipmatch '18.64.0.0/10' || %{HTTP:X-Forwarded-For} -ipmatch '216.218.128.0/17' || %{HTTP:X-Forwarded-For} -ipmatch '3.0.0.0/9' || %{HTTP:X-Forwarded-For} -ipmatch '34.192.0.0/10' || %{HTTP:X-Forwarded-For} -ipmatch '34.64.0.0/10' || %{HTTP:X-Forwarded-For} -ipmatch '35.152.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '35.160.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '35.184.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '35.176.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '35.192.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '35.208.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '35.224.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '35.240.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '52.0.0.0/10' || %{HTTP:X-Forwarded-For} -ipmatch '52.64.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '54.144.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '54.160.0.0/11' || %{HTTP:X-Forwarded-For} -ipmatch '54.192.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '54.208.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '54.216.0.0/14' || %{HTTP:X-Forwarded-For} -ipmatch '54.220.0.0/15' || %{HTTP:X-Forwarded-For} -ipmatch '54.224.0.0/11' || %{HTTP:X-Forwarded-For} -ipmatch '54.64.0.0/11' || %{HTTP:X-Forwarded-For} -ipmatch '45.145.128.0/24' || %{HTTP:X-Forwarded-For} -ipmatch '45.145.130.0/23' || %{HTTP:X-Forwarded-For} -ipmatch '5.133.192.0/19' || %{HTTP:X-Forwarded-For} -ipmatch '194.99.24.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '93.177.116.0/23' || %{HTTP:X-Forwarded-For} -ipmatch '92.38.128.0/20' || %{HTTP:X-Forwarded-For} -ipmatch '185.88.101.0/24' || %{HTTP:X-Forwarded-For} -ipmatch '193.56.72.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '185.88.36.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '193.233.136.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '193.56.64.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '88.218.66.0/23'" bad_bot=yes
    
            # Calibre env vars: https://manual.calibre-ebook.com/customize.html#id1
            SetEnv CALIBRE_CONFIG_DIRECTORY /tmp/calibre-config
            SetEnv CALIBRE_TEMP_DIR /var/www/tool/var/calibre-temp
    
            LogFormat "%{X-Forwarded-For}i %t \"%r\" %>s \"%{Referer}i\" \"%{User-Agent}i\"" wsexport
    
            CustomLog ${APACHE_LOG_DIR}/access.log wsexport expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
            CustomLog ${APACHE_LOG_DIR}/denied.log wsexport expr=(reqenv('bad_bot')=='yes')
            ErrorLog ${APACHE_LOG_DIR}/error.log
    
            ScriptAlias /tool "/var/www/tool/public"
            Redirect /wikisource-fr-good.atom /opds/fr/Bon_pour_export.xml
            Redirect /opds/fr.xml /opds/fr/Bon_pour_export.xml
    
            <Location /fpm-status>
                    SetHandler "proxy:unix:/run/php/php-fpm.sock|fcgi://localhost"
            </Location>
    
            <Directory /var/www/tool/public/>
                 Options Indexes FollowSymLinks
                 AllowOverride All
                 Require all granted
                 DirectoryIndex index.php book.php
                 # Rewrite URLs for Symfony:
                 RewriteEngine On
                 RewriteRule ^index\.php$ - [L]
                 RewriteCond %{REQUEST_FILENAME} !-f
                 RewriteCond %{REQUEST_FILENAME} !-d
                 RewriteCond %{REQUEST_URI} !^/fpm-status
                 RewriteRule .* /index.php [L]
    
                 <FilesMatch ".+\.php$">
                    SetHandler "proxy:unix:/run/php/php-fpm.sock|fcgi://localhost"
                 </FilesMatch>
            </Directory>
    
            <Directory /var/www/tool/>
                    Options Indexes FollowSymLinks
                    AllowOverride None
                    Require all granted
                    Deny from env=bad_bot
                    <Files "robots.txt">
                            # Allow bots to find out that they're not allowed
                            Allow from all
                    </Files>
            </Directory>
    
            ErrorDocument 403 "Access denied. If you are human and were wrongfully affected by this block, please contact tools.wsexport@tools.wmflabs.org"
            RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
            RewriteRule .* - [R=403,L]
            RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
            RewriteRule .* - [R=403,L]
            
            RewriteEngine On
            RewriteCond %{HTTP:X-Forwarded-Proto} !https
            RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
    </VirtualHost>
    
  12. Enable/disable the needed Apache modules, and enable the web server configuration.
    sudo a2dismod mpm_event
    sudo a2enmod proxy_fcgi
    sudo a2dissite 000-default
    sudo a2ensite wsexport
    sudo service apache2 reload
    
  13. (Re)start Apache:
    sudo service apache2 restart
    
    Moving forward, you should use sudo service apache2 graceful to restart the server.
  14. Set PHP configuration in /etc/php/8.2/mods-available/wsexport.ini:
    max_execution_time = 60
    memory_limit=512M
    error_log=/ws-export/var/log/php-error.log
    
    And enable it with sudo phpenmod wsexport
  15. Replace /etc/php/8.2/fpm/pool.d/www.conf with:
    [www]
    user = www-data
    group = www-data
    listen = /run/php/php8.2-fpm.sock
    listen.owner = www-data
    listen.group = www-data
    pm = dynamic
    pm.max_children = 10
    pm.start_servers = 2
    pm.min_spare_servers = 1
    pm.max_spare_servers = 3
    request_terminate_timeout = 120
    
  16. Set a global PHP memory limit by creating /etc/systemd/system/php8.2-fpm.service.d/limit.conf with:
    [Service]
    MemoryMax=85%
    OOMPolicy=continue
    Restart=on-failure
    
  17. Load the limit file and restart PHP-FPM:
    sudo systemctl daemon-reload
    sudo systemctl restart php8.2-fpm
    
  18. Add a cronjob to prune the cache twice a day:
    00 1,13 * * * /usr/local/bin/wsexport-prune-cache.sh
    
    Where the script is the following:
    #!/bin/bash
    df /ws-export/
    /usr/bin/php /var/www/tool/bin/console cache:pool:prune
    df /ws-export/
    
  19. Set up annual log dump files by running the following script weekly. It is located at /etc/cron.weekly/wsexport-dump-logs; note that you have to put the tool's DB credentials into /etc/mysql/conf.d/wsexport.cnf:
    #!/bin/bash
    YEAR="$1"
    if [ -z "$YEAR" ]; then
      YEAR=$( date +%Y )
    fi
    LOGDIR=/var/www/tool/public/logs
    echo "Dumping logs of $YEAR to $LOGDIR"
    mysqldump --defaults-file=/etc/mysql/conf.d/wsexport.cnf \
            --host=tools.db.svc.wikimedia.cloud \
            s52561__wsexport_p books_generated \
            --where="YEAR(time) = $YEAR" \
            | gzip -c > "$LOGDIR/$YEAR.sql.gz"
    chown -R www-data:www-data "$LOGDIR"
    ls -l "$LOGDIR"
    
    You should also create a symlink to make these logs public at ws-export.wmcloud.org/logs:
    ln -s /ws-export/wsexport_logs /var/www/tool/public/logs
    
  20. Add log rotation to Symfony's logs by creating the file /etc/logrotate.d/symfony with:
    /var/www/tool/var/log/*.log {
            su www-data www-data
            daily
            missingok
            rotate 14
            compress
            delaycompress
            notifempty
            create 640 root adm 
            sharedscripts
            postrotate
                    if /etc/init.d/apache2 status > /dev/null ; then \
                        /etc/init.d/apache2 reload > /dev/null; \
                    fi;
            endscript
            prerotate
                    if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
                            run-parts /etc/logrotate.d/httpd-prerotate; \
                    fi; \
            endscript
    }
    
    You can check that it works by running it directly:
    sudo logrotate -f /etc/logrotate.d/symfony
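
Once the steps above are done, it can be useful to confirm that the manually installed fonts from step 3 are visible to fontconfig. The following is a minimal sketch, not part of the official setup: missing_fonts is a hypothetical helper name, and the family names passed to it in the usage line are assumptions based on the download URLs in step 3, so check the fc-list output for the exact names on your instance.

```shell
#!/bin/bash
# Print every expected font family that fontconfig does not list.
# Reads `fc-list : family` output on stdin; the expected family names
# are given as arguments (hypothetical helper, not part of ws-export).
missing_fonts() {
  local installed family
  # fc-list may print several comma-separated names per line;
  # put one name per line so whole-line matching works.
  installed=$(tr ',' '\n')
  for family in "$@"; do
    # -x: whole-line match, -F: literal string, -i: ignore case.
    printf '%s\n' "$installed" | grep -qixF -- "$family" || echo "$family"
  done
}
```

Usage, which prints nothing when every family is found:

    fc-list : family | missing_fonts "Mukta" "Mukta Mahee" "Mukta Malar" "Mukta Vaani" "Amiri"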
    

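The SetEnvIfExpr rule in step 11 uses -ipmatch to compare the X-Forwarded-For address against a list of single addresses and CIDR ranges. When investigating an entry in denied.log, it can help to check by hand whether an address falls inside one of those ranges. The sketch below reproduces the prefix comparison in shell for IPv4 only; the function names ip_to_int and ip_in_cidr are hypothetical and are not part of Apache or of the tool.

```shell
#!/bin/bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  # Split on dots inside a subshell so the IFS change does not leak.
  ( IFS=.; set -- $1; echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 )) )
}

# Return 0 if IPv4 address $1 is inside CIDR block $2.
# A bare address without a /prefix is treated as a /32.
ip_in_cidr() {
  local net bits mask
  net=${2%/*}
  case $2 in
    */*) bits=${2#*/} ;;
    *)   bits=32 ;;
  esac
  if [ "$bits" -eq 0 ]; then
    mask=0
  else
    # Keep the top $bits bits of a 32-bit word.
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  fi
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
```

For example, 54.161.2.3 lies inside the 54.160.0.0/11 range from the block list, so the following prints "blocked":

    ip_in_cidr 54.161.2.3 54.160.0.0/11 && echo blocked
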
crontab summary

Crontab for www-data:

# OPDS exports.
@daily php /var/www/tool/bin/console app:opds -q -l en --category=Ready_for_export
@daily php /var/www/tool/bin/console app:opds -q -l fr --category=Bon_pour_export

# Prune cache.
00 1,7,13,19 * * * /usr/local/bin/wsexport-prune-cache.sh > /dev/null
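
The crontab above can also be installed without opening an editor. A minimal sketch, assuming a hypothetical helper named wsexport_crontab that simply prints the entries listed above:

```shell
#!/bin/bash
# Print the www-data crontab shown in the summary above.
# (Hypothetical helper; the entries are copied verbatim from this page.)
wsexport_crontab() {
  cat <<'EOF'
# OPDS exports.
@daily php /var/www/tool/bin/console app:opds -q -l en --category=Ready_for_export
@daily php /var/www/tool/bin/console app:opds -q -l fr --category=Bon_pour_export

# Prune cache.
00 1,7,13,19 * * * /usr/local/bin/wsexport-prune-cache.sh > /dev/null
EOF
}
```

Piping the output to crontab installs it for www-data. Note that this replaces any existing crontab for that user, so review the current one with sudo crontab -u www-data -l first:

    wsexport_crontab | sudo crontab -u www-data -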