SLO/Citoid
Status: approved
Organizational
Service
- Citoid is a stateless Node.js service running in Kubernetes. It is publicly accessible behind restbase. Images
- Zotero is a stateless Node.js service running in Kubernetes. It is not publicly available and is reachable only through citoid. It is a soft dependency: citoid continues to operate without it, though with slightly degraded results. Images
- Requests to citoid consist of a format and a search term. The search term may be a URL, DOI, ISBN, or some other identifier. Citoid returns a response containing metadata about the website, article, or book. The service typically returns JSON (unless the requested format is bibtex). The linked wiki pages contain sample requests for verifying that it is working correctly.
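The request shape described above (a format plus a search term) can be sketched as follows. This is a minimal illustration, not the service's client code: the base URL is an assumption (the public REST path is not stated on this page), the format name "mediawiki" is only an example, and build_citoid_url and expects_json are hypothetical helper names.

```python
from urllib.parse import quote

# Assumption: public REST endpoint; this page does not state the path.
BASE_URL = "https://en.wikipedia.org/api/rest_v1/data/citation"


def build_citoid_url(fmt: str, search: str, base_url: str = BASE_URL) -> str:
    """Return a request URL for the given format and search term."""
    # The search term may itself be a URL or a DOI, so every reserved
    # character (including "/" and ":") is percent-encoded.
    return f"{base_url}/{quote(fmt, safe='')}/{quote(search, safe='')}"


def expects_json(fmt: str) -> bool:
    """Per the description above, every format except bibtex returns JSON."""
    return fmt.lower() != "bibtex"
```

For example, build_citoid_url("mediawiki", "10.1000/182") produces a URL ending in /mediawiki/10.1000%2F182. Encoding the whole search term avoids the kind of malformed requests that incorrectly encoded DOIs can cause.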
Teams
- Editing
- Service Ops SRE
Architectural
Environmental dependencies
- wikikube cluster - hard dependency
Service dependencies
- url-downloader - hard dependency. Url-downloader failures have historically caused both citoid and zotero to time out. If only the url-downloader instance serving citoid is malfunctioning, citoid may continue to work, but with considerable lag while it waits for the zotero response; most of these requests may in fact ultimately time out.
Client-facing
Clients
- MediaWiki Citoid Extension
  - WMF-owned wikis
  - external wikis
  - note: we cannot currently distinguish between these at the service level
- On-wiki bots and third-party users of the Toolforge backends:
  - reFill - ToolForge url | Documentation
  - CitationBot - ToolForge url | Documentation
- Various unidentified third parties
Request Classes
- WMF users versus third party users?
- Extension versus toolforge served bots?
Service Level Indicators (SLIs)
- Latency SLI, percentile: The 95th* percentile request latency, as measured at the server side.
- Latency SLI, acceptable fraction: The percentage of all requests that complete within 20 seconds, measured at the server side.
- Service availability SLI: The fraction of all requests that return a non-5xx response. Citoid only returns 5xx responses if there is downtime or a bug (internal server error), so this measures availability. Citoid should never return 5xx responses during normal operation.
- Success ratio SLI: The fraction of all requests that return a 200 response. This is important for users, as a decreasing fraction of 200s seriously reduces the usefulness of the tool. However, it is often out of our control, as third-party websites may be, e.g., blocking us or otherwise inaccessible (in which case we return a 404), so the target ratio is generous.
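As a rough illustration of how the four SLIs above could be computed from server-side request records, here is a minimal sketch. The Request record and function names are hypothetical, and the percentile uses the nearest-rank method, which is an assumption (the production metrics pipeline may aggregate differently, for example from histogram buckets).

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned by citoid
    latency_s: float   # server-side latency in seconds


def availability(reqs):
    """Fraction of requests that did not return a 5xx status."""
    return sum(1 for r in reqs if not 500 <= r.status <= 599) / len(reqs)


def success_ratio(reqs):
    """Fraction of requests that returned exactly 200."""
    return sum(1 for r in reqs if r.status == 200) / len(reqs)


def latency_acceptable_fraction(reqs, threshold_s=20.0):
    """Fraction of requests that completed within the threshold."""
    return sum(1 for r in reqs if r.latency_s <= threshold_s) / len(reqs)


def latency_percentile(reqs, pct=95):
    """Nearest-rank percentile of request latency, in seconds."""
    ordered = sorted(r.latency_s for r in reqs)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]
```

Note that a 404 counts against the success ratio but not against availability, which is why the two SLIs can diverge widely when third-party sites block requests.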
Operational
Monitoring
Citoid is monitored by an OpenAPI (Swagger) probe. There is only a probe for the citoid service; it also checks that zotero is working, so if zotero fails, the citoid probe fails. It does not check whether citoid can reach external IPs. Open question: OpenAPI (Swagger) docs were added to zotero a few years ago so that it could be probed directly; is it probed elsewhere?
Troubleshooting
We don't have many SREs and developers familiar with the service, which may affect how quickly an issue can be responded to. There is little documentation on the internals of Citoid (or Zotero) and no automatically generated docs (e.g. from docstrings) as there are for other projects. On the plus side, because these services are deployed manually rather than automatically, most issues with deploys are discovered in staging before they reach production. TODO: review for accuracy.
Deployment
Citoid and zotero are deployed on Kubernetes with helmfile. Changing the image and/or helmfile requires a change to the deployment charts. Reverting or merging a change and deploying typically takes less than 10 minutes, but necessarily longer than 1 minute, as the cronjob that updates helmfile runs once a minute. Note: review for accuracy.
Service Level Objectives
Realistic targets
The Success Ratio SLO covers the fraction of requests that correctly return citations. Historically, this has been very low, in some cases 80% or lower, and it has gotten worse over time. This is in large part because many external sites have infrastructure to protect themselves from scrapers and bots, which, in effect, we sometimes are. (Though our primary use case is to scrape citation metadata upon a user request, our service is sometimes used by bots; see Clients.) Sometimes it is also due to outright blocking of our User-Agent and IP. Getting unblocked requires talking to the site in question and isn't always successful. Also, some websites load their metadata with JavaScript; as we are not a headless browser, we can't interpret JS. Currently a success rate of 85% seems achievable, but 90% does not.
Service availability has historically been fairly high. We experience less than one outage per quarter, and these are typically not long (less than 2 hours). However, because of time zone differences and the small number of SREs available to deal with the problem, in a worst-case scenario an outage could take 12 hours to fix.
Ideal targets
An ideal target for the Success Ratio SLO would be 95%. Even in the ideal case, some failures are unavoidable. Sometimes users enter broken links because they pasted only part of the URL. Some links are dead to begin with; we recently added archive.org support to look for them there, but this only works part of the time. Sometimes bots are misconfigured; on one occasion a bot submitted a batch of DOIs with incorrect URL encoding. Additionally, some users request metadata from formats we can't extract metadata from, e.g. PDFs. However, the success ratio has never been this high.
An ideal target for the Service Availability SLO is 99.9%. This implies that most problems are fixed in under 2 hours.
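The arithmetic behind availability targets and outage durations can be sketched as below. This is a simplified, time-based approximation: the availability SLI is request-based, so the numbers only match if traffic is roughly uniform over the window. The 90-day quarter and both function names are assumptions made for illustration.

```python
QUARTER_HOURS = 90 * 24  # assumption: a 90-day reporting window


def downtime_budget_hours(target, window_hours=QUARTER_HOURS):
    """Hours of full outage a target allows within the window."""
    return (1.0 - target) * window_hours


def availability_after_outage(outage_hours, window_hours=QUARTER_HOURS):
    """Availability over the window if the only downtime is one outage."""
    return 1.0 - outage_hours / window_hours
```

Under these assumptions, a 99.9% target allows about 2.16 hours of downtime per quarter, which matches the "under 2 hours" expectation, while a single worst-case 12-hour outage would leave availability at roughly 99.44% for that quarter.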
Reconciliation
Success Ratio SLO: 85%
Service Availability SLO: 99.5% (draft)
Latency SLI, acceptable fraction: 90% (draft)
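A minimal sketch of checking measured values against the reconciled targets above. The dictionary keys and function names are hypothetical, and the measured values in the usage note are invented for illustration only, not real data.

```python
# Targets from the Reconciliation list above; two of them are still drafts.
TARGETS = {
    "success_ratio": 0.85,
    "availability": 0.995,        # draft
    "latency_within_20s": 0.90,   # draft
}


def evaluate(measured, targets=TARGETS):
    """Map each SLO name to True if the measured value meets its target."""
    return {name: measured[name] >= target for name, target in targets.items()}


def budget_remaining(measured, target):
    """Fraction of the error budget left; negative means it is overspent."""
    allowed = 1.0 - target
    spent = 1.0 - measured
    return 1.0 - spent / allowed
```

For example, with an illustrative measured availability of 0.9975 against the 0.995 target, budget_remaining reports that about half of the error budget is still unspent.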