From Wikitech
Jump to navigation Jump to search

Zotero is a Node.js service which is run alongside the Citoid service.


See: mw:Citoid/Updating Zotero


An image show zotero request flow
An image show zotero request flow

These are directions for deploying zotero, a Node.js service. More detailed but more general directions for nodejs services are at Migrating_from_scap-helm#Code_deployment/configuration_changes.

Locate build candidate

From gerrit, locate the candidate build. PipelineBot will post a message with the build name, i.e.

Mar 15 9:22 PM

Patch Set 4:

Wikimedia Pipeline
Image Build SUCCESS


 2019-03-15-211530-candidate, 950e3b4468f2f84d3bb2b0343c

'2019-03-15-211530-candidate' is the name of the build.

Add change via gerrit

  1. Clone deployment-charts repo.
  2. vi values.yaml
  image: wikimedia/mediawiki-services-zotero
    cpu: 10
    memory: 4Gi
      port: 1969
  port: 1969
    cpu: 200m
    memory: 200Mi
  version: 2019-01-17-114541-candidate-change-me
  1. Make a CR to change the version value for all three servers (staging, codfw, and eqiad), and after a successful review, merge it.
  2. After merge, log into a deployment server, there is a cronjob (1 minute) that will update the /srv/deployment-charts directory with the contents from git.

Log into deployment server

Ssh into the deploy machine.

ssh deployment.eqiad.wmnet

Navigate to /srv/deployment-charts/helmfile.d/services/

Zotero runs in both of the core data centers: Eqiad and Codfw. There is also a staging server. You can test out changes on the staging server first.

Staging server

cd /srv/deployment-charts/helmfile.d/services/zotero
helmfile -e staging -i apply

Helfile apply (may take awhile)

This checks status again so you can see if all instances have been restarted yet.

Verify Zotero is running on staging with a curl request:

curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/web

curl -d '9791029801297' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/search

Production server

helmfile -e codfw -i apply

repeat for eqiad

helmfile -e eqiad -i apply


Verify Zotero is running with a curl request:

curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.svc.codfw.wmnet:4969/web

curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.svc.eqiad.wmnet:4969/web

curl -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.discovery.wmnet:4969/web


Logs have been disabled in 827891af due to them being actively harmful (aside from useless) to our environment. If you need to chase down a request that caused an issue, Citoid logs might be helpful as, aside from monitoring, Citoid is the only service talking to zotero. Citoid logs are in Logstash


Zotero is single threaded and a nodejs app. That means it has an event queue but whenever it is doing some CPU intensive (like parsing a large PDF) it is unable to serve the queue. It also does not offer any endpoint that can be used for a readinessProbe. That effectively means that when a given replica is serving a large request it can't serve anything else AND it is not depooled from the rotation, which means that requests will still head its way. In the majority of cases, it's ok. Citoid is usually ok with zotero requests timing out and it falls back to its internal parser. Plus, by the time zotero ends up with the large request it will serve node's event queue which USUALLY (judging from CPU graphs) will be way cheaper CPU wise.

So what happens when we get paged is one of the following scenarios:

All majority replicas get some repeated request (a user submits the url they want cited multiple times) and end up being CPU pegged and unable to service requests, including monitoring which times out 3 times and raises the alert. A replica gets a large request and becomes unable to serve more requests for an amount of time, monitoring requests flow its' way in a real unlucky situation, timeout and raise the alert. Something in the spectrum between those 2 extremes.

An HTTP GET endpoint in zotero like /healthz would effectively mitigate the above and render most of this note moot. However we have never invested in that.

Rolling back changes

If you need to roll back a change because something went wrong:

  1. Revert the git commit to the deployment-charts repo
  2. Merge the revert (with review if needed)
  3. Wait one minute for the cron job to pull the change to the deployment server
  4. execute ENV=<staging,eqiad,codfw> kube_env zotero $ENV; helmfile -e $ENV diff to see what you'll be changing
  5. execute helmfile -e $ENV apply

Rolling restart

A rolling restart of Zotero is sometimes necessary to resolve production issues. To run a rolling restart, run this on the deployment host, where $ENV is one of staging, eqiad, or codfw:

helmfile -e ${ENV?} -f /srv/deployment-charts/helmfile.d/services/zotero/helmfile.yaml --state-values-set roll_restart=1 sync

Rolling back in an emergency

This is discouraged but is noted here for completeness. Only use it in really truly big emergencies. If you are wondering whether a situation qualifies as truly big emergency, it almost certainly is not. Reach out to more senior SREs and the service owner first

If you can't wait the one minute, or the cron job to update from git fails etc. then it is possible to manually roll back using helm.

  1. Find the revision to roll back to
    1. kube_env zotero <staging,eqiad,codfw>; helm history <production> --tiller-namespace YOUR_SERVICE_NAMESPACE
    2. Find the revision to roll back to
    3. e.g. perhaps the penultimate one
      REVISION        UPDATED                         STATUS          CHART           DESCRIPTION     
      1               Tue Jun 18 08:39:20 2019        SUPERSEDED      termbox-0.0.2   Install complete
      2               Wed Jun 19 08:20:42 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
      3               Wed Jun 19 10:33:34 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
      4               Tue Jul  9 14:21:39 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
  2. Rollback with: kube_env zotero <staging,eqiad,codfw>; helm rollback <production> 3 --tiller-namespace YOUR_SERVICE_NAMESPACE