Nova Resource:Wikisource/Wikisource Export
URL: https://ws-export.wmcloud.org
Staging URL: https://ws-export-test.wmcloud.org
Source: https://github.com/wikimedia/ws-export
License: GPL-2.0
We have two VPS instances and two Toolforge tools (one of each for prod and one for test). The Toolforge tools exist because WS Export used to be hosted on Toolforge, and the VPS instances still use those tools' databases and email addresses.
Creating a new instance
Create a new g2.cores4.ram8.disk80 instance running on the latest Debian (or a g3.cores1.ram1.disk20 for the test instance). Once the instance has been spawned, SSH in and follow these steps:
- Install PHP and Apache, along with some dependencies:
sudo apt update
sudo apt -y upgrade
sudo apt -y install php php-mysql php-sqlite3 php-intl php-zip apache2 php-fpm mariadb-client calibre php-curl php-xml php-dom php-mbstring cron
- We use the packaged version of Calibre, even though its developers recommend against it because it can be out of date; it's been working fine for us. Note that Calibre can fail to clean up its temp files in some situations, so we also add the following in /etc/cron.daily/calibre-cleanup:
#!/bin/bash
find /tmp /ws-export/var/calibre-temp -path '*calibre*' -user www-data -mtime +1 -exec rm -r {} \;
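Before trusting a cleanup job that runs rm -r, it can help to see what it would match. The following is a minimal sketch of a dry run, assuming the same find tests as the cron script; calibre_cleanup_preview is a hypothetical helper name, not something that ships with WS Export. Note that find ignores fractional days, so -mtime +1 only matches entries last modified at least two full days ago.

```shell
#!/bin/bash
# Dry run of the daily cleanup: print the matching paths instead of deleting them.
# calibre_cleanup_preview is a hypothetical helper, not part of WS Export.
# Usage: calibre_cleanup_preview OWNER DIR...
calibre_cleanup_preview() {
    local owner="$1"
    shift
    # Same tests as the cron script, with -print in place of -exec rm.
    find "$@" -path '*calibre*' -user "$owner" -mtime +1 -print
}

# Example, using the owner and directories from the cron script:
# calibre_cleanup_preview www-data /tmp /ws-export/var/calibre-temp
```

If the list looks right, the cron script itself should remove the same paths on its next run.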
- Install some fonts. Mostly these are available in the Debian repositories, but the Mukta family must be installed manually to maintain backwards compatibility (these used to be packaged with the tool's code), and Amiri is not available elsewhere.
sudo apt -y install fontconfig fonts-freefont-ttf fonts-linuxlibertine fonts-dejavu-core fonts-gubbi fonts-opendyslexic fonts-noto fonts-noto-cjk fonts-smc-manjari fonts-smc-gayathri fonts-smc-raghumalayalamsans
wget https://fonts.google.com/download?family=Mukta -O Mukta.zip
wget https://fonts.google.com/download?family=Mukta%20Mahee -O MuktaMahee.zip
wget https://fonts.google.com/download?family=Mukta%20Malar -O MuktaMalar.zip
wget https://fonts.google.com/download?family=Mukta%20Vaani -O MuktaVaani.zip
sudo unzip Mukta.zip -d /usr/local/share/fonts/Mukta
sudo unzip MuktaMahee.zip -d /usr/local/share/fonts/MuktaMahee
sudo unzip MuktaMalar.zip -d /usr/local/share/fonts/MuktaMalar
sudo unzip MuktaVaani.zip -d /usr/local/share/fonts/MuktaVaani
wget https://github.com/aliftype/amiri/archive/refs/tags/1.000.zip -O Amiri.zip
sudo unzip -j Amiri.zip "amiri-1.000/fonts/*" -d /usr/local/share/fonts/Amiri
wget https://github.com/TiroTypeworks/Indigo/archive/refs/heads/main.zip
unzip main.zip
sudo cp -r Indigo-main/fonts /usr/local/share/fonts/Indigo
sudo fc-cache -v
- Install composer by following these instructions (we don't include them here because you must validate the download), but make sure to install to the /usr/local/bin directory and with the filename composer, e.g.:
sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer
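Validating the download means comparing the installer's SHA-384 digest against the value Composer publishes. The sketch below shows one way to do that comparison in shell, assuming sha384sum from GNU coreutils is available; verify_sha384 is a hypothetical helper name, and the expected digest is deliberately not included here because it changes with each installer release.

```shell
#!/bin/bash
# verify_sha384 is a hypothetical helper: it succeeds only if FILE's SHA-384
# digest equals EXPECTED (a lowercase hex string).
# Usage: verify_sha384 FILE EXPECTED
verify_sha384() {
    local file="$1" expected="$2" actual
    # sha384sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha384sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Example (the expected value must come from Composer's own download page):
# verify_sha384 composer-setup.php "<published sha384>" && echo "Installer verified"
```

Only run composer-setup.php if the check succeeds.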
- Clone the repository, first removing the html directory created by Apache.
cd /var/www && sudo rm -rf html
sudo git clone https://github.com/wikimedia/ws-export.git tool
cd /var/www/tool
- Become the root user with
sudo su root
- Add a block storage filesystem at /ws-export/ with a directory in it symlinked from the tool's var/ directory:
mkdir /ws-export/var
chown -R www-data:www-data /ws-export/var
ln -s /ws-export/var /var/www/tool/var
- Run
sudo composer install --no-dev -o
- Copy .env to .env.local and edit the environment variables in it.
- Make sure that all the files in the repo are owned by www-data.
sudo chown -R www-data:www-data .
- Create the web server configuration file at /etc/apache2/sites-available/wsexport.conf with the following:
<VirtualHost *:80>
    ServerName wsexport.wmflabs.org
    Redirect / https://ws-export.wmcloud.org/
</VirtualHost>

<VirtualHost *:80>
    DocumentRoot /var/www/tool/public
    ServerName ws-export.wmcloud.org

    <Proxy "fcgi://localhost">
        ProxySet retry=0 disablereuse=on
    </Proxy>

    # Requests with these user agents are denied and logged at ${APACHE_LOG_DIR}/denied.log
    SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot|sqlmap|Applebot|MauiBot|PetalBot|FacebookBot|UMichBot|LinuxGetUrl|; MSIE (6|7|8)\.0;|deepnoc)" bad_bot=yes

    # Google Cloud, Amazon AWS among other webhost blocks. Requests are logged at ${APACHE_LOG_DIR}/denied.log
    SetEnvIfExpr "%{HTTP:X-Forwarded-For} -ipmatch '136.243.220.214' || %{HTTP:X-Forwarded-For} -ipmatch '89.58.41.156' || %{HTTP:X-Forwarded-For} -ipmatch '107.178.192.0/18' || %{HTTP:X-Forwarded-For} -ipmatch '15.236.0.0/14' || %{HTTP:X-Forwarded-For} -ipmatch '162.250.188.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '18.32.0.0/11' || %{HTTP:X-Forwarded-For} -ipmatch '18.64.0.0/10' || %{HTTP:X-Forwarded-For} -ipmatch '216.218.128.0/17' || %{HTTP:X-Forwarded-For} -ipmatch '3.0.0.0/9' || %{HTTP:X-Forwarded-For} -ipmatch '34.192.0.0/10' || %{HTTP:X-Forwarded-For} -ipmatch '34.64.0.0/10' || %{HTTP:X-Forwarded-For} -ipmatch '35.152.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '35.160.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '35.184.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '35.176.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '35.192.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '35.208.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '35.224.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '35.240.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '52.0.0.0/10' || %{HTTP:X-Forwarded-For} -ipmatch '52.64.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '54.144.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '54.160.0.0/11' || %{HTTP:X-Forwarded-For} -ipmatch '54.192.0.0/12' || %{HTTP:X-Forwarded-For} -ipmatch '54.208.0.0/13' || %{HTTP:X-Forwarded-For} -ipmatch '54.216.0.0/14' || %{HTTP:X-Forwarded-For} -ipmatch '54.220.0.0/15' || %{HTTP:X-Forwarded-For} -ipmatch '54.224.0.0/11' || %{HTTP:X-Forwarded-For} -ipmatch '54.64.0.0/11' || %{HTTP:X-Forwarded-For} -ipmatch '45.145.128.0/24' || %{HTTP:X-Forwarded-For} -ipmatch '45.145.130.0/23' || %{HTTP:X-Forwarded-For} -ipmatch '5.133.192.0/19' || %{HTTP:X-Forwarded-For} -ipmatch '194.99.24.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '93.177.116.0/23' || %{HTTP:X-Forwarded-For} -ipmatch '92.38.128.0/20' || %{HTTP:X-Forwarded-For} -ipmatch '185.88.101.0/24' || %{HTTP:X-Forwarded-For} -ipmatch '193.56.72.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '185.88.36.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '193.233.136.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '193.56.64.0/22' || %{HTTP:X-Forwarded-For} -ipmatch '88.218.66.0/23'" bad_bot=yes

    # Calibre env vars: https://manual.calibre-ebook.com/customize.html#id1
    SetEnv CALIBRE_CONFIG_DIRECTORY /tmp/calibre-config
    SetEnv CALIBRE_TEMP_DIR /var/www/tool/var/calibre-temp

    LogFormat "%{X-Forwarded-For}i %t \"%r\" %>s \"%{Referer}i\" \"%{User-Agent}i\"" wsexport
    CustomLog ${APACHE_LOG_DIR}/access.log wsexport expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
    CustomLog ${APACHE_LOG_DIR}/denied.log wsexport expr=(reqenv('bad_bot')=='yes')
    ErrorLog ${APACHE_LOG_DIR}/error.log

    ScriptAlias /tool "/var/www/tool/public"

    Redirect /wikisource-fr-good.atom /opds/fr/Bon_pour_export.xml
    Redirect /opds/fr.xml /opds/fr/Bon_pour_export.xml

    <Location /fpm-status>
        SetHandler "proxy:unix:/run/php/php-fpm.sock|fcgi://localhost"
    </Location>

    <Directory /var/www/tool/public/>
        Options Indexes FollowSymLinks
        AllowOverride All
        Require all granted
        DirectoryIndex index.php book.php

        # Rewrite URLs for Symfony:
        RewriteEngine On
        RewriteRule ^index\.php$ - [L]
        RewriteCond %{REQUEST_FILENAME} !-f
        RewriteCond %{REQUEST_FILENAME} !-d
        RewriteCond %{REQUEST_URI} !^/fpm-status
        RewriteRule .* /index.php [L]

        <FilesMatch ".+\.php$">
            SetHandler "proxy:unix:/run/php/php-fpm.sock|fcgi://localhost"
        </FilesMatch>
    </Directory>

    <Directory /var/www/tool/>
        Options Indexes FollowSymLinks
        AllowOverride None
        Require all granted
        Deny from env=bad_bot
        <Files "robots.txt">
            # Allow bots to find out that they're not allowed
            Allow from all
        </Files>
    </Directory>

    ErrorDocument 403 "Access denied. If you are human and were wrongfully affected by this block, please contact tools.wsexport@tools.wmflabs.org"

    RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
    RewriteRule .* - [R=403,L]

    RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
    RewriteRule .* - [R=403,L]

    RewriteEngine On
    RewriteCond %{HTTP:X-Forwarded-Proto} !https
    RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
</VirtualHost>
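When someone reports being wrongly blocked, it is useful to check whether their address falls inside one of the ranges listed in the SetEnvIfExpr rule. The following is a rough shell sketch of IPv4 CIDR matching, intended only as an approximation of what Apache's -ipmatch operator decides; ip_to_int and ip_in_cidr are hypothetical helper names, and the code assumes well-formed dotted-quad input without leading zeros.

```shell
#!/bin/bash
# Hypothetical helpers approximating an IPv4 "-ipmatch" test.

# Convert a dotted-quad address to a 32-bit integer.
ip_to_int() {
    local old_ifs="$IFS"
    IFS=.
    # Intentionally unquoted: split the address into its four octets.
    set -- $1
    IFS="$old_ifs"
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if IP lies inside NET/BITS; a bare address is treated as a /32.
# Usage: ip_in_cidr IP NET[/BITS]
ip_in_cidr() {
    local ip="$1" net bits mask
    case "$2" in
        */*) net="${2%/*}"; bits="${2#*/}" ;;
        *)   net="$2"; bits=32 ;;
    esac
    # Build the netmask, e.g. /13 gives 0xFFF80000.
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
    [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Example: is this address covered by the 35.184.0.0/13 entry?
# ip_in_cidr 35.190.1.2 35.184.0.0/13 && echo "blocked range"
```

A match here suggests the request would have been tagged bad_bot; the denied.log entry for the request remains the authoritative record.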
- Enable/disable the needed Apache modules, and enable the web server configuration.
sudo a2dismod mpm_event
sudo a2enmod proxy_fcgi
sudo a2dissite 000-default
sudo a2ensite wsexport
sudo service apache2 reload
- (Re)start Apache:
sudo service apache2 restart
- Moving forward, you should use
sudo service apache2 graceful
to restart the server.
- Set PHP configuration in /etc/php/8.2/mods-available/wsexport.ini:
max_execution_time = 60
memory_limit=512M
error_log=/ws-export/var/log/php-error.log
And enable it with:
sudo phpenmod wsexport
- Replace /etc/php/8.2/fpm/pool.d/www.conf with:
[www]
user = www-data
group = www-data
listen = /run/php/php8.2-fpm.sock
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 10
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 3
request_terminate_timeout = 120
- Set a global PHP memory limit by creating /etc/systemd/system/php8.2-fpm.service.d/limit.conf with:
[Service]
MemoryMax=85%
OOMPolicy=continue
Restart=on-failure
- Load the limit file and restart PHP-FPM:
sudo systemctl daemon-reload
sudo systemctl restart php8.2-fpm
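The PHP settings above interact: memory_limit bounds each request, pm.max_children bounds how many requests run at once, and MemoryMax caps the PHP-FPM service as a whole. The sketch below is a back-of-the-envelope check of how those numbers relate, assuming the production flavor's 8 GB of RAM is roughly 8192 MB; the helper names are hypothetical, and real usage will differ because memory_limit covers script allocations only, not the whole worker process or the Calibre processes it spawns.

```shell
#!/bin/bash
# Hypothetical helpers for a rough memory budget.

# Worst case if every FPM child reaches memory_limit at the same time.
# Usage: fpm_worst_case_mb MAX_CHILDREN MEMORY_LIMIT_MB
fpm_worst_case_mb() {
    echo $(( $1 * $2 ))
}

# Size in MB of a percentage-based systemd MemoryMax.
# Usage: memory_cap_mb TOTAL_RAM_MB PERCENT
memory_cap_mb() {
    echo $(( $1 * $2 / 100 ))
}

# Example with the values from this page (8192 MB is an assumption):
# fpm_worst_case_mb 10 512
# memory_cap_mb 8192 85
```

With these inputs the per-request limits sum to less than the service cap, which leaves some headroom under MemoryMax for everything else in the service.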
- Add a cronjob to prune the cache twice a day:
00 1,13 * * * /usr/local/bin/wsexport-prune-cache.sh
Where the script is the following:
#!/bin/bash
df /ws-export/
/usr/bin/php /var/www/tool/bin/console cache:pool:prune
df /ws-export/
- Set up annual log dump files by running the following weekly (it's located at /etc/cron.weekly/wsexport-dump-logs, and note that you have to put the tool's DB credentials into /etc/mysql/conf.d/wsexport.cnf):
#!/bin/bash

YEAR="$1"
if [ -z "$YEAR" ]; then
    YEAR=$( date +%Y )
fi

LOGDIR=/var/www/tool/public/logs

echo "Dumping logs of $YEAR to $LOGDIR"

mysqldump --defaults-file=/etc/mysql/conf.d/wsexport.cnf \
    --host=tools.db.svc.wikimedia.cloud \
    s52561__wsexport_p books_generated \
    --where="YEAR(time) = $YEAR" \
    | gzip -c > $LOGDIR/$YEAR.sql.gz

chown -R www-data:www-data $LOGDIR
ls -l $LOGDIR
You should also create a symlink to make these logs public at ws-export.wmcloud.org/logs:
ln -s /ws-export/wsexport_logs /var/www/tool/public/logs
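Because the dump script takes an optional year as its first argument and falls back to the current year, it can also be run by hand to backfill earlier years. A minimal sketch of generating the list of years follows; years_to_dump is a hypothetical helper name, and the start year in the example is arbitrary.

```shell
#!/bin/bash
# years_to_dump is a hypothetical helper: print one year per line from FIRST
# to LAST, defaulting LAST to the current year like the dump script does.
# Usage: years_to_dump FIRST [LAST]
years_to_dump() {
    local first="$1" last="${2:-$(date +%Y)}"
    seq "$first" "$last"
}

# Example backfill (2021 is an arbitrary starting year):
# for year in $(years_to_dump 2021); do
#     sudo /etc/cron.weekly/wsexport-dump-logs "$year"
# done
```

Each run overwrites that year's .sql.gz file, so repeating a year is harmless.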
- Add log rotation to Symfony's logs by creating the file /etc/logrotate.d/symfony with:
/var/www/tool/var/log/*.log {
    su www-data www-data
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 640 root adm
    sharedscripts
    postrotate
        if /etc/init.d/apache2 status > /dev/null ; then \
            /etc/init.d/apache2 reload > /dev/null; \
        fi;
    endscript
    prerotate
        if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
            run-parts /etc/logrotate.d/httpd-prerotate; \
        fi; \
    endscript
}
You can check that it works by running it directly:
$ sudo logrotate -f /etc/logrotate.d/symfony
crontab summary
Crontab for www-data:
MAILTO=tools.wsexport@tools.wmflabs.org
# OPDS exports.
@daily php /var/www/tool/bin/console app:opds -q -l en --category=Ready_for_export
@daily php /var/www/tool/bin/console app:opds -q -l fr --category=Bon_pour_export
# Prune cache.
00 1,7,13,19 * * * /usr/local/bin/wsexport-prune-cache.sh > /dev/null