Tool:Deputy

Toolforge tools
Deputy Dispatch
Website https://deputy.toolforge.org
Description Bulk data processor for Deputy users
Keywords copyright, data processing, api, javascript, nodejs, typescript
Author(s) Chlod Alejandro
Maintainer(s) Chlod
Source code https://github.com/ChlodAlejandro/deputy-dispatch
License Apache License 2.0
Issues https://github.com/ChlodAlejandro/deputy-dispatch/issues

Dispatch (or Deputy Dispatch) is a Node.js + Express webserver that exposes API endpoints which process large masses of data from Wikimedia wikis for easier consumption by Deputy. It is meant to centralize and optimize the gathering and processing of bulk data, so that the many users of Deputy do not individually make taxing requests on Wikimedia servers.

Dispatch makes its requests while logged in to a user account, but does not make any edits. It only reads data from the Wikimedia servers, and being logged in allows it to query more than an anonymous user would be able to.

Usage

Dispatch is primarily used through Deputy. Deputy has been built to work cross-wiki and integrates with Dispatch to support every Wikimedia wiki, with an out-of-the-box configuration that can handle simple copyright management tasks on the wiki.

The Dispatch API can also be used directly. Documentation for the API is automatically generated, and can be found here.

Asynchronous jobs

Some tasks done by Dispatch take a long time to run. Though these usually finish in under 3 minutes, timeouts or network issues may interrupt a connection that is held open for that long. For this reason, tasks which take a while to execute must be run through asynchronous job requests. An initial request is sent to Dispatch (using POST), which returns a job ID. The progress of the job can then be polled using a GET to the /{id}/progress sub-path of that endpoint. Lastly, once the job completes, its result can be accessed with a GET to the /{id} sub-path of that endpoint.

Note that attempting to access the result before the job has completed results in a 409 Conflict HTTP error. The result is usually cached for an hour before being discarded. Refer to the documentation for the task information schema.
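The flow above (POST to start, poll /{id}/progress, then GET /{id}) can be sketched in TypeScript as follows. This is only an illustration, not code from Deputy or Dispatch: the function name runDispatchJob, the JSON request body, the polling defaults, and the getJobId and isFinished callbacks are all assumptions, since the actual request and response schemas are defined in the generated API documentation. The fetch function is passed in so the sketch can be exercised without a network connection.

```typescript
// Minimal fetch-like signature, so a fake can be injected for testing.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string }
) => Promise<{ status: number; ok: boolean; json(): Promise<unknown> }>;

interface DispatchJobOptions {
  /** Full URL of an asynchronous Dispatch endpoint (placeholder; see the API docs). */
  endpoint: string;
  /** Payload for the initial POST (assumed to be JSON; shape depends on the task). */
  payload: unknown;
  /** Extracts the job ID from the POST response (schema assumed, not confirmed). */
  getJobId: (started: unknown) => string;
  /** Decides from a progress response whether the job is done (schema assumed). */
  isFinished: (progress: unknown) => boolean;
  /** Fetch implementation to use, e.g. the global fetch. */
  fetchFn: FetchLike;
  pollIntervalMs?: number;
  maxPolls?: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function runDispatchJob(opts: DispatchJobOptions): Promise<unknown> {
  const { endpoint, fetchFn, pollIntervalMs = 5000, maxPolls = 60 } = opts;

  // 1. Start the job with a POST; the response carries the job ID.
  const started = await fetchFn(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(opts.payload),
  });
  if (!started.ok) {
    throw new Error(`Job request failed: HTTP ${started.status}`);
  }
  const id = encodeURIComponent(opts.getJobId(await started.json()));

  // 2. Poll /{id}/progress until the job reports completion.
  for (let poll = 0; poll < maxPolls; poll++) {
    const progress = await fetchFn(`${endpoint}/${id}/progress`);
    if (!progress.ok) {
      throw new Error(`Progress check failed: HTTP ${progress.status}`);
    }
    if (opts.isFinished(await progress.json())) {
      // 3. Fetch the result from /{id}.
      const result = await fetchFn(`${endpoint}/${id}`);
      // 409 Conflict means the result was requested too early; keep polling.
      if (result.status !== 409) {
        if (!result.ok) {
          throw new Error(`Result fetch failed: HTTP ${result.status}`);
        }
        return result.json();
      }
    }
    await sleep(pollIntervalMs);
  }
  throw new Error(`Job ${id} did not finish after ${maxPolls} polls`);
}
```

On Node.js 18 or later, the built-in fetch can be passed as fetchFn. Because results are usually discarded after about an hour, a caller should fetch the result soon after the job finishes rather than storing the job ID for later.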

Deployment

The deputy tool uses a standard Node.js web service to operate. As of February 19, 2024, this tool is being deployed using the Toolforge Build Service.

Deployments are not automatic. As the Wikimedia GitLab instance develops, this may change in the future. For now, the following steps are used to deploy new versions of the tool.

  1. [me@tools-sgebastion-XX] become deputy
    • That's pretty obvious already.
  2. [tools.deputy@tools-sgebastion-XX] toolforge build start https://github.com/ChlodAlejandro/deputy-dispatch
    • Trigger a build on the Toolforge Build Service. This downloads the latest version of the repository (on main) and performs all necessary build steps.
  3. [tools.deputy@tools-sgebastion-XX] toolforge webservice restart
    • Restart the webservice. If service.manifest was deleted or the service must be started from scratch, use the following command instead:
      [tools.deputy@tools-sgebastion-XX] toolforge webservice --backend=kubernetes buildservice start
  4. [tools.deputy@tools-sgebastion-XX] toolforge webservice logs -f
    • Verify that the tool is up and running.

For deployment issues, you can email wiki at chlod.net or use Special:EmailUser/Chlod Alejandro. If you both break and fix Dispatch (and you're not User:Chlod Alejandro), you get a complimentary chocolate chip cookie.

Required environment variables

  • TOOLFORGE set to 1. This informs Dispatch that it's running on Toolforge.
  • DISPATCH_SELF_OAUTH_ACCESS_TOKEN set to an owner-only Meta-Wiki OAuth application token.
  • TOOL_TOOLSDB_USER and TOOL_TOOLSDB_PASSWORD (provided by Toolforge)
  • TOOL_REPLICA_USER and TOOL_REPLICA_PASSWORD (provided by Toolforge)
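As a quick sanity check before starting the service, the variables listed above can be verified with a small TypeScript helper. This is a hypothetical sketch, not part of Dispatch: the function name checkDispatchEnv and the wording of its messages are made up for illustration, and Dispatch itself may validate its configuration differently.

```typescript
type Env = Record<string, string | undefined>;

// Variables named in the list above; each must be set to a non-empty value.
const REQUIRED_VARIABLES = [
  "TOOLFORGE",
  "DISPATCH_SELF_OAUTH_ACCESS_TOKEN",
  "TOOL_TOOLSDB_USER",
  "TOOL_TOOLSDB_PASSWORD",
  "TOOL_REPLICA_USER",
  "TOOL_REPLICA_PASSWORD",
];

/** Returns a list of problems; an empty list means nothing is missing. */
function checkDispatchEnv(env: Env): string[] {
  const problems: string[] = [];
  for (const name of REQUIRED_VARIABLES) {
    if (!env[name]) {
      problems.push(`${name} is not set`);
    }
  }
  // TOOLFORGE must be exactly "1" to signal a Toolforge deployment.
  if (env.TOOLFORGE && env.TOOLFORGE !== "1") {
    problems.push('TOOLFORGE must be set to "1"');
  }
  return problems;
}
```

In a Node.js process this would be called as checkDispatchEnv(process.env), logging or exiting if the returned list is not empty.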

Debug logs

webservice logs provides human-readable logs, but only for log levels INFO and higher, and it does not include extra data in a machine-readable form. Debug logs are available as a file in the tool's working directory. Dispatch can run with or without a Toolforge NFS mount, and the location of the log files depends on which of the two is used. When Dispatch is NFS-mounted, logs are stored in $TOOL_DATA_DIR/.logs.

When Dispatch is not NFS-mounted, logs are created inside the container and destroyed when the pod is terminated. toolforge webservice shell creates a new pod, which is not what you want here. Instead, access the pod that is running the webservice:

  1. [tools.deputy@tools-sgebastion-XX] kubectl get pods
    • Determine the name of the pod running the webservice. It'll have the pattern deputy-*.
  2. [tools.deputy@tools-sgebastion-XX] kubectl exec -ti <POD> -- tail .logs/dispatch.log -f
    • Print out the tail of the log with -f (follow).

You can adapt this based on your needs, such as dumping the entire log into a file if you need to take a closer look.

Every log file is in Bunyan JSONL format, with one JSON record per line. You can use any Bunyan-compatible log reader to get detailed information, which is significantly better than staring at JSON until your eyes burn out.
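If no Bunyan reader is at hand, the records are simple enough to filter with a few lines of TypeScript. The sketch below relies only on Bunyan's standard numeric levels (trace 10, debug 20, info 30, warn 40, error 50, fatal 60) and its core level, time, and msg fields. The function names readBunyanLog and formatRecord are made up for illustration, and any additional fields that Dispatch writes to its records are not assumed here.

```typescript
// Bunyan's standard numeric log levels.
const BUNYAN_LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };
type LevelName = keyof typeof BUNYAN_LEVELS;

// Reverse lookup used when printing a record.
const LEVEL_LABELS: Record<number, string> = {
  10: "TRACE", 20: "DEBUG", 30: "INFO", 40: "WARN", 50: "ERROR", 60: "FATAL",
};

interface BunyanRecord {
  level: number;
  time: string;
  msg: string;
  [field: string]: unknown; // any extra fields attached by the application
}

/** Parses JSONL text and keeps records at or above the given level. */
function readBunyanLog(jsonl: string, minLevel: LevelName = "debug"): BunyanRecord[] {
  const records: BunyanRecord[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (parsed === null || typeof parsed !== "object") continue;
    const record = parsed as BunyanRecord;
    if (typeof record.level === "number" && record.level >= BUNYAN_LEVELS[minLevel]) {
      records.push(record);
    }
  }
  return records;
}

/** Renders one record as a single human-readable line. */
function formatRecord(record: BunyanRecord): string {
  const label = LEVEL_LABELS[record.level] || String(record.level);
  return `[${record.time}] ${label}: ${record.msg}`;
}
```

The input would typically be the contents of a dumped log file; mapping formatRecord over the result of readBunyanLog(text, "warn") prints only warnings and worse.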

Development

Instructions for setting up a development environment for Dispatch can be found at https://github.com/ChlodAlejandro/deputy-dispatch#contributing. Note that you will need a Toolforge account, because you'll need access to the Wiki Replicas. Running Dispatch without properly setting up the database connection information will cause any request or job that requires the databases to fail.

External links