Parsoid: Difference between revisions

From Wikitech
GWicke (talk | contribs)
Revision as of 22:15, 4 November 2013

Parsoid is a service that converts between wikitext and HTML. The HTML contains additional metadata that allows it to be converted back ("round-tripped") to wikitext. VisualEditor fetches the HTML for a given page from Parsoid, edits it, then delivers the modified HTML to Parsoid, which converts it back to wikitext. Parsoid is a stateless HTTP server running on port 8000.

Monitoring

  • Parsoid eqiad cluster in Ganglia (lists only the worker machines). The Varnish hosts are cp1045 and cp1058.
  • Nagios has service checks for HTTP on port 8000 on both the individual backends and on the LVS service IP, and on port 80 on cp1045 and cp1058 and their service IP.
  • pybal health-checks all backends every second and depools boxes that are down, as long as the percentage of depooled boxes does not exceed 50%. To watch these health checks and depools/repools happen in real time, run ssh parsoid.svc.eqiad.wmnet (this will drop you into either lvs1003 or lvs1006, depending on which is active), then tail -f /var/log/pybal.log | grep parsoid
    • pybal also manages the Varnish hosts in the same way; they're at parsoidcache.svc.eqiad.wmnet
  • There is very rudimentary logging in /var/lib/parsoid/nohup.out on each Parsoid node. This log is truncated on each restart.
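
The 50% depool limit described above can be illustrated with a small Python sketch. This is a simplified model, not pybal's actual code: the function name pooled_hosts, the host names, and the choice to evaluate the limit after counting the host about to be depooled are all assumptions made for illustration.

```python
from typing import Dict, List


def pooled_hosts(health: Dict[str, bool], max_depooled_percent: int = 50) -> List[str]:
    """Return the hosts that stay pooled after applying the depool limit.

    health maps a host name to True (health check passed) or False (failed).
    A failed host is depooled only if doing so keeps the share of depooled
    hosts at or below max_depooled_percent (assumed rule, see lead-in).
    """
    total = len(health)
    depooled = 0
    pooled = []
    for host, is_up in health.items():
        # Integer arithmetic avoids floating-point rounding at the boundary.
        within_limit = (depooled + 1) * 100 <= max_depooled_percent * total
        if not is_up and within_limit:
            depooled += 1
        else:
            # Healthy hosts stay pooled, and so do failed hosts once the
            # limit is reached.
            pooled.append(host)
    return pooled


# Hypothetical host names following the wtp10NN pattern.
checks = {"wtp1001": True, "wtp1002": False, "wtp1003": False, "wtp1004": False}
print(pooled_hosts(checks))
```

With three of four hosts failing, only two can be depooled under this model, so the third failed host stays in the pool rather than leaving a single backend to take all traffic.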

When something goes wrong

Roan and Gabriel know the most about the Parsoid infrastructure. Email them or, if it is urgent, call them about issues you can't solve.

Reverting a Parsoid deployment

Code

ssh tin
cd /srv/deployment/parsoid/Parsoid
git deploy revert # pick the last good deployed version

Config and modules

ssh tin
cd /srv/deployment/parsoid/config
git deploy revert # pick the last good deployed version

If git deploy revert fails:

git deploy start
git reset --hard <desired changeset>
git deploy --force sync

Deploying changes

Parsoid and its configuration are deployed (separately) using git-deploy. A deployment has three steps: run git deploy start, make whichever changes you need to the git clone (such as pulling, changing branches, or committing live hacks), then run git deploy sync. The sync command pushes the new state to all backends and restarts them.

Pre-deploy checks

  • Perform manual VisualEditor editing tests, including with non-ASCII content, to catch encoding issues

Deploying the latest version of Parsoid

catrope@fenari$ ssh tin
catrope@tin$ cd /srv/deployment/parsoid/Parsoid
catrope@tin$ git deploy start
catrope@tin$ git pull
catrope@tin$ git deploy sync

Changing the Parsoid configuration

catrope@fenari$ ssh tin
catrope@tin$ cd /srv/deployment/parsoid/config
catrope@tin$ git deploy start
catrope@tin$ vim localsettings.js
[make your changes]
catrope@tin$ git commit -a
catrope@tin$ git deploy sync

Post-deploy checks

  • Test VE editing on enwiki and on non-Latin wikis

Misc stuff

  • Restart parsoid hosts via salt, in batches of 5: salt -b 5 -G 'deployment_target:parsoid' parsoid.restart_parsoid parsoid
  • To abort a deployment after running git deploy start but before git deploy sync, run git deploy abort.
  • There is a lock file preventing multiple deployments on the same code base from being active at the same time. If git deploy start complains about this lock, you can run git deploy abort to make it go away (if you know this isn't a legitimate warning due to someone else actively deploying).
  • If the sync step complains you didn't change anything, you can run git deploy --force sync (note order of arguments!) to make it sync anyway.
  • To change which hosts are pooled or change their weights, edit /home/wikipedia/common/docroot/noc/pybal/eqiad/parsoid as root on fenari

Data flow

Parsoid runs entirely on an internal subnet, so requests to it are proxied through the ve-parsoid API module. This module is implemented in extensions/VisualEditor/ApiVisualEditor.php and is invoked with a POST request to /w/api.php?action=ve-parsoid. The API module then sends a request to Parsoid, either GET /$prefix/$pagename to get the HTML for a page, or POST /$prefix/$pagename to submit HTML and get wikitext back. Parsoid itself also issues requests to /w/api.php to get the wikitext of the requested page and to do template expansion.

Once the ve-parsoid API module receives a response from Parsoid, it either relays it back to the client (when requesting HTML), or saves the returned wikitext to the page (when submitting HTML).

                (POST /w/api.php?action=ve-parsoid)          (GET /en/Barack_Obama?oldid=1234)           (requests for page content and template expansions)
Client browser ------------------------------------------> API ---------------------------->  Parsoid -----------------------------------------------------> API
    ^                                                      | ^                                 |   ^                                                          |
    |                  (response)                          | |      (HTML)                     |   |                   (responses)                            |
    +------------------------------------------------------+ +---------------------------------+   +----------------------------------------------------------+


                (POST /w/api.php?action=ve-parsoid)          (POST /en/Barack_Obama; oldid=1234)
Client browser ------------------------------------------> API ---------------------------->  Parsoid
                                                           | ^                                 |
                                               (save page) | |      (wikitext)                 |
                                                           | +---------------------------------+
                                                           |
                                                        Database
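
The two request shapes in the diagrams above can be sketched in Python as follows. This is an illustration only, not code from ApiVisualEditor.php: the function names parsoid_path and parsoid_request are hypothetical, and percent-encoding the prefix and page name (including any slashes) is an assumption about what Parsoid accepts. The body of the POST request, which carries the HTML and the oldid, is not specified on this page and is therefore left out.

```python
from typing import Optional, Tuple
from urllib.parse import quote, urlencode


def parsoid_path(prefix: str, pagename: str, oldid: Optional[int] = None) -> str:
    """Build /$prefix/$pagename, optionally with ?oldid=N appended."""
    # Percent-encode each segment as UTF-8 (assumed encoding, see lead-in).
    path = "/{}/{}".format(quote(prefix, safe=""), quote(pagename, safe=""))
    if oldid is not None:
        path += "?" + urlencode({"oldid": oldid})
    return path


def parsoid_request(target: str, prefix: str, pagename: str,
                    oldid: Optional[int] = None) -> Tuple[str, str]:
    """Return (HTTP method, path) for the conversion that yields `target`."""
    if target == "html":
        # Fetch the HTML for a page revision.
        return ("GET", parsoid_path(prefix, pagename, oldid))
    if target == "wikitext":
        # Submit edited HTML and get wikitext back; body omitted here.
        return ("POST", parsoid_path(prefix, pagename))
    raise ValueError("target must be 'html' or 'wikitext'")


print(parsoid_request("html", "en", "Barack_Obama", 1234))
print(parsoid_request("wikitext", "en", "Barack_Obama"))
print(parsoid_path("de", "Zürich"))
```

The last call shows how a non-ASCII title is encoded under these assumptions, which is the kind of input the pre-deploy and post-deploy checks are meant to exercise.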

Caching and load balancing

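Parsoid is load balanced using LVS. The assigned service IPs, as shown in the diagram below, are 10.2.2.29 (port 80, in front of the Varnish caches) and 10.2.2.28 (port 8000, in front of the Parsoid backends).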

The parsoidcache LVS balances two front-end Varnishes running on cp1045 / cp1058 (see parsoid-frontend.inc.vcl.erb). These front ends only hash requests to the backends (see parsoid-backend.inc.vcl.erb). Cache misses are then forwarded to the LVS in front of the Parsoid backends.

       10.2.2.29:80  {cp1045,cp1058}:80      10.2.2.28:8000          wtp10NN:8000
MW API  -> LVS -----> Varnish ---------------> LVS  ---------------------> Parsoid

All request URLs include the oldid as a query parameter. The Parsoid PHP extension sends update requests to the front-end LVS IP on edits, template updates and visibility changes. The Parsoid backends perform additional requests with 'Cache-Control: only-if-cached' to the caches and reuse cached HTML to speed up serialization and re-rendering of pages. For example, expansions of templates, extensions and images are reused after an edit without performing API requests for them. See this document for more detail.
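
A cache-only lookup of the kind described above could be prepared as in the following Python sketch. It is an illustration, not Parsoid's own code: the function name cache_only_request is hypothetical, and the http scheme and example URL are assumptions assembled from the service IP and path shown in the diagrams. The request is only constructed, not sent.

```python
from urllib.request import Request


def cache_only_request(url: str) -> Request:
    """Build a GET request that asks caches to answer only from storage."""
    return Request(
        url,
        headers={"Cache-Control": "only-if-cached"},
        method="GET",
    )


# Example URL assumed from the diagrams: front-end LVS IP plus page path.
req = cache_only_request("http://10.2.2.29/en/Barack_Obama?oldid=1234")
print(req.get_method(), req.full_url)
# urllib normalizes header names to "Cache-control" internally.
print(req.get_header("Cache-control"))
```

In standard HTTP caching, a cache that cannot satisfy an only-if-cached request from its stored responses answers with 504 (Gateway Timeout) instead of contacting the origin. How the Varnish configuration here responds to such a miss is not covered on this page.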