Wikifunctions/Runbook

From Wikitech

This is a list of runbooks for the Abstract Wikipedia Team, particularly the Wikifunctions services, covering step-by-step lists of what to do when things need doing, especially when things go wrong.

How to disable execution in an emergency

Prevent logged-out users from running functions

  1. In mediawiki-config, change core-Permissions.php to set 'wikilambda-execute' => false in the '*' (logged-out users) block of groupOverrides => '+wikifunctions', push to Gerrit, and get it deployed like any MW config change.

Prevent all users from running functions

  1. In mediawiki-config, change core-Permissions.php to set 'wikilambda-execute' => false in both the '*' (logged-out users) and the 'user' (logged-in users) blocks of groupOverrides => '+wikifunctions', push to Gerrit, and get it deployed like any MW config change.

Prevent service from accessing Wikidata

  1. In deployment-charts, change values-main-orchestrator.yaml to set "useWikidata": false, push to Gerrit, and get it deployed like any helm chart change.

Back-end services

What services?

  • function-orchestrator, a service to co-ordinate function requests.
  • function-evaluator, a service to execute user-written code.

Deploy a config update to the orchestrator

  1. Make a change to the Wikifunctions services helm values override in the deployment-charts repo in Gerrit, make a commit, and land it with a colleague or by yourself.
  2. Shell into production deployment server (ssh deployment.eqiad.wmnet) and go to our service directory (cd /srv/deployment-charts/helmfile.d/services/wikifunctions)
  3. Check that the new change to deployment-charts git repo has made it automatically to the server (git log)
    1. Be sure you can see the correct latest commit(s) via git status
    2. Sometimes you may need to communicate with other teams or team members to check the status of their updates. Most if not all communication surrounding deploys takes place on IRC, though moving this to a different platform is under discussion.
  4. [Cautionary step] In general you might want to first deploy function-orchestrator changes before function-evaluator. Doing both in parallel adds more risk and should generally be avoided. Repeat from this step once you've successfully deployed the first change, if you have others.
  5. Run this command to validate that the helm chart applies and the diff looks correct. If it shows no diff, wait a little for the chart museum to catch up, then try again:
    helmfile -e staging -i apply --context 5
  6. Make a simple request via curl to check that the orchestrator performs as expected, e.g.:
    curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z801","Z801K1":"foo"},"doValidate":false}' --header "Content-type: application/json" -w "\n"
    … should output a JSON blob starting with {"Z1K1":"Z22","Z22K1":"foo",… (call just to the orchestrator)
    curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":{"Z1K1":"Z8","Z8K1":["Z17",{"Z1K1":"Z17","Z17K1":"Z6","Z17K2":{"Z1K1":"Z6","Z6K1":"Z400K1"},"Z17K3":{"Z1K1":"Z12","Z12K1":["Z11"]}},{"Z1K1":"Z17","Z17K1":"Z6","Z17K2":{"Z1K1":"Z6","Z6K1":"Z400K2"},"Z17K3":{"Z1K1":"Z12","Z12K1":["Z11"]}}],"Z8K2":"Z1","Z8K3":["Z20"],"Z8K4":["Z14",{"Z1K1":"Z14","Z14K1":"Z400","Z14K3":{"Z1K1":"Z16","Z16K1":{"Z1K1":"Z61","Z61K1":"python"},"Z16K2": "def Z400(Z400K1,Z400K2):\n\treturn str(int(Z400K1) + int(Z400K2))"}}],"Z8K5":"Z400"},"Z400K1":"5","Z400K2":"8"},"doValidate":false}' --header "Content-type: application/json" -w "\n"
    … should output a JSON blob starting with {"Z1K1":"Z22","Z22K1":"13",… (call to the Python evaluator via the orchestrator)
    curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":{"Z1K1":"Z8","Z8K1":["Z17",{"Z1K1":"Z17","Z17K1":"Z6","Z17K2":{"Z1K1":"Z6","Z6K1":"Z400K1"},"Z17K3":{"Z1K1":"Z12","Z12K1":["Z11"]}},{"Z1K1":"Z17","Z17K1":"Z6","Z17K2":{"Z1K1":"Z6","Z6K1":"Z400K2"},"Z17K3":{"Z1K1":"Z12","Z12K1":["Z11"]}}],"Z8K2":"Z1","Z8K3":["Z20"],"Z8K4":["Z14",{"Z1K1":"Z14","Z14K1":"Z400","Z14K3":{"Z1K1":"Z16","Z16K1":"Z600","Z16K2":"function Z400( Z400K1,Z400K2 ) { return (parseInt(Z400K1) + parseInt(Z400K2)).toString(); }"}}],"Z8K5":"Z400"},"Z400K1":"15","Z400K2":"18"},"doValidate":false}' --header "Content-type: application/json" -w "\n"
    … should output a JSON blob starting with {"Z1K1":"Z22","Z22K1":"33",… (call to the JavaScript evaluator via the orchestrator)
    curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z10000","Z10000K1":"foo","Z10000K2":"bar"},"doValidate":false}' --header "Content-type: application/json" -w "\n"
    … should output a JSON blob starting with {"Z1K1":"Z22","Z22K1":"foobar",… (call to the evaluator via the orchestrator, making a call to the wiki)
    curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z6825","Z6825K1":{"Z1K1":"Z6095","Z6095K1":"L1"}},"doValidate":false}' --header "Content-type: application/json" -w "\n"
    … should output a JSON blob starting with {"Z1K1":"Z22","Z22K1":{"Z1K1":"Z6005",… (call to dereference a Wikidata Lexeme)
  7. Check that you can see the logs triggered from the above requests in LogStash.
    1. You can tell where the log is coming from by observing the ‘host’ label (rather than ‘kubernetes.host’, because Staging uses the same host, ‘eqiad’, as Production).
  8. Run this to deploy the update to the Texas datacentre (the change is now live for some users):
    helmfile -e codfw -i apply --context 5
  9. Run this to deploy the update to the Virginia datacentre (the change is now live for all users)
    helmfile -e eqiad -i apply --context 5
    (There are two data centers used: ‘codfw’ and ‘eqiad’, and the Foundation rotates between them every six months.)
  10. Monitor production for a bit, and revert if needed
    • Wikifunctions services grafana dashboard
    • For the above curl commands, you can replace wikifunctions.k8s-staging.discovery.wmnet with wikifunctions.discovery.wmnet:
      curl https://wikifunctions.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z801","Z801K1":"foo"},"doValidate":false}' --header "Content-type: application/json" -w "\n"
      curl https://wikifunctions.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":{"Z1K1":"Z8","Z8K1":["Z17",{"Z1K1":"Z17","Z17K1":"Z6","Z17K2":{"Z1K1":"Z6","Z6K1":"Z400K1"},"Z17K3":{"Z1K1":"Z12","Z12K1":["Z11"]}},{"Z1K1":"Z17","Z17K1":"Z6","Z17K2":{"Z1K1":"Z6","Z6K1":"Z400K2"},"Z17K3":{"Z1K1":"Z12","Z12K1":["Z11"]}}],"Z8K2":"Z1","Z8K3":["Z20"],"Z8K4":["Z14",{"Z1K1":"Z14","Z14K1":"Z400","Z14K3":{"Z1K1":"Z16","Z16K1":{"Z1K1":"Z61","Z61K1":"python"},"Z16K2": "def Z400(Z400K1,Z400K2):\n\treturn str(int(Z400K1) + int(Z400K2))"}}],"Z8K5":"Z400"},"Z400K1":"5","Z400K2":"8"},"doValidate":false}' --header "Content-type: application/json" -w "\n"
      curl https://wikifunctions.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":{"Z1K1":"Z8","Z8K1":["Z17",{"Z1K1":"Z17","Z17K1":"Z6","Z17K2":{"Z1K1":"Z6","Z6K1":"Z400K1"},"Z17K3":{"Z1K1":"Z12","Z12K1":["Z11"]}},{"Z1K1":"Z17","Z17K1":"Z6","Z17K2":{"Z1K1":"Z6","Z6K1":"Z400K2"},"Z17K3":{"Z1K1":"Z12","Z12K1":["Z11"]}}],"Z8K2":"Z1","Z8K3":["Z20"],"Z8K4":["Z14",{"Z1K1":"Z14","Z14K1":"Z400","Z14K3":{"Z1K1":"Z16","Z16K1":"Z600","Z16K2":"function Z400( Z400K1,Z400K2 ) { return (parseInt(Z400K1) + parseInt(Z400K2)).toString(); }"}}],"Z8K5":"Z400"},"Z400K1":"15","Z400K2":"18"},"doValidate":false}' --header "Content-type: application/json" -w "\n"
      curl https://wikifunctions.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z10000","Z10000K1":"foo","Z10000K2":"bar"},"doValidate":false}' --header "Content-type: application/json" -w "\n"
      curl https://wikifunctions.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z6825","Z6825K1":{"Z1K1":"Z6095","Z6095K1":"L1"}},"doValidate":false}' --header "Content-type: application/json" -w "\n"
  11. Check again that you can see the logs triggered from the above Production requests in LogStash.
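
The curl checks in steps 6 and 10 each compare the start of the response with an expected prefix. A minimal shell sketch of that comparison follows. The function names starts_with and smoke_check are hypothetical helpers and not part of any Wikifunctions tooling; the URL, payload, and prefix in the commented usage are the ones from step 6.

```shell
# Hypothetical helpers; not part of any Wikifunctions tooling.

# Succeeds when the response (arg 1) begins with the expected prefix (arg 2).
starts_with() {
  case "$1" in
    "$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# POSTs a payload (arg 2) to an evaluate URL (arg 1) and compares the start of
# the response with the expected prefix (arg 3).
smoke_check() {
  local response
  response=$(curl -s "$1" -X POST --data "$2" --header "Content-type: application/json")
  if starts_with "$response" "$3"; then
    echo "OK"
  else
    echo "UNEXPECTED: $response"
    return 1
  fi
}

# Example, to be run from a deployment server:
# smoke_check 'https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/' \
#   '{"zobject":{"Z1K1":"Z7","Z7K1":"Z801","Z801K1":"foo"},"doValidate":false}' \
#   '{"Z1K1":"Z22","Z22K1":"foo"'
```

A prefix match only confirms the shape of the start of the response, so it complements rather than replaces reading the output and the logs.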

Deploy a new version of the orchestrator

  1. Make a change to the orchestrator repo in GitLab, make a Merge Request, and wait for it to be landed by a colleague (example)
  2. In your local clone of the deployment-charts repo in Gerrit, start a commit based on master, and find the latest commits for our code.
    git checkout -B wikifunctions-deploy origin/master && git log helmfile.d/services/wikifunctions/
  3. Take the hash of the latest update to the function-orchestrator (in this case, 4f41ae5) and in your local clone of the function-orchestrator get the list of commits:
    git fetch && git log --topo-order --no-merges --reverse --oneline 4f41ae5..origin/main
  4. Also use this to get a list of tasks that will be affected by this deploy:
    git fetch && git log --topo-order --no-merges --reverse 4f41ae5..origin/main | grep Bug: | sort | uniq
  5. On your Merge Request's page on GitLab, find out what the published container was tagged as: find the post-merge pipeline run, and within that the "publish-images" stage; click through to the job output, and then scroll down to the bottom, where there should be a line like pushing manifest for docker-registry.discovery.wmnet/repos/abstract-wiki/wikifunctions/function-orchestrator:2024-11-13-145636@sha256:… Here, 2024-11-13-145636 is the image tag.
  6. Edit helmfile.d/services/wikifunctions/values-main-orchestrator.yaml to change the version: value to the new image tag.
  7. Add your change to your git stack:
    git add -p helmfile.d/services/wikifunctions/values-main-orchestrator.yaml
  8. Commit your git stack with git commit, using the above details, e.g.:
    wikifunctions: Upgrade orchestrator from 2024-10-15-192817 to 2024-11-13-145636
    
    2f9ab91 db: Show the full URL, not just the initial value, in logs
    5047011 Include nested senses inside fetched lexemes
    0c5c623 pass header details for tracing from evaluator
    add6191 Add referencePreCache, ensuring that each ZID will be resolved exactly once per orchestrator invocation.
    4719da9 Update function-schemata sub-module to HEAD (f2c043c)
    60b4c4d Increase orchestrator rate limit to 300
    
    Bug: T356144
    Bug: T367120
    Bug: T375922
    Bug: T375944
    Bug: T376060
    Bug: T376826
    Bug: T377380
    Bug: T377547
    Bug: T377797
    Bug: T377851
    Bug: T378678
    Bug: T379098
    
  9. Push your commit for review
    git review
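
Steps 5 and 8 above involve reading an image tag out of a log line and writing a commit subject in a fixed pattern. The sketch below shows one way to do both with plain shell string handling. The function names are hypothetical, and the log line format is assumed to match the example in step 5 (tag between the last colon and the @sha256 digest).

```shell
# Hypothetical helper: pull the image tag out of a "pushing manifest" log line.
# Assumes the line ends in <image>:<tag>@sha256:<digest>, as in step 5.
image_tag_from_log_line() {
  local ref
  ref=${1%%@*}        # drop "@sha256:..." and everything after it
  echo "${ref##*:}"   # keep what follows the last remaining colon
}

# Hypothetical helper: build the commit subject in the pattern shown in step 8.
# Arguments: service name, old image tag, new image tag.
upgrade_subject() {
  echo "wikifunctions: Upgrade $1 from $2 to $3"
}

# Example:
# image_tag_from_log_line 'pushing manifest for …/function-orchestrator:2024-11-13-145636@sha256:…'
# upgrade_subject orchestrator 2024-10-15-192817 2024-11-13-145636
```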

Disable an evaluator from being called

  1. Make a config change to the orchestrator's helm values as above, changing in values-main-orchestrator.yaml the ORCHESTRATOR_CONFIG value to remove the evaluator from the map of known evaluators.
    • If the evaluator you are removing is the only one assigned to that language, you are disabling evaluation in that language.

Add an evaluator to be called

  1. To add a new evaluator instance, edit helmfile.yaml to add a new entry in the releases section, and add a new values-foo-evaluator.yaml file like the others but pointed at the appropriate image and version. Deploy this, and ensure the new release deploys successfully.
  2. Make a config change to the orchestrator's helm chart as above, changing in values-main-orchestrator.yaml the ORCHESTRATOR_CONFIG value to add the evaluator to the map of known evaluators for the appropriate languages.
    • If the evaluator you are adding is the only one assigned to that language, remember that you are enabling evaluation in that language.

Deploy a config update to an evaluator

Note: There is intentionally very little configurability of the evaluators themselves.

  1. Identify which evaluator release you're updating (JavaScript, Python, etc.) or if it's for all evaluators.
  2. Find the values-*-evaluator.yaml file for the release you wish to adjust, alter it to add/update/remove the relevant config values, and deploy as above (example)

Deploy a new version of an evaluator

  1. Make a change to the evaluator repo in GitLab, make a Merge Request, and wait for it to be landed by a colleague (example)
  2. Make a config change to the Wikifunctions service helm values as above, changing in one or all of the values-*-evaluator.yaml files the version value of the docker image to the newly-created docker-registry tag from step 1. You may wish to explain in the commit what changes are being deployed, for ease of tracking later. (example)

Wikifunctions.org wiki

Add or edit pre-defined Objects in production

When to do this?

  • If any new pre-defined Objects have been added in the latest function-schemata updates.
  • If existing pre-defined Objects have been edited in the latest function-schemata updates.

Where to find the objects to update?

  • A list of the Objects created and edited in function-schemata since the last update should be kept in the AW Chores page.
  • You can also gather this list by analyzing each schemata change in the latest sub-module update applied in production (see the list of latest merged schemata updates).

How to run the script?

  1. Shell into a deployment server
  2. To add new Objects:
    1. If you only need to add one Object, run the script with the --zid <ZID> flag, e.g.:
    mwscript-k8s -f -- extensions/WikiLambda/maintenance/loadPreDefinedObject.php --wiki=wikifunctionswiki --zid Z1234
    2. If you need to add a few Objects within a range, run the script with the --from <ZID> --to <ZID> flags, e.g.:
    mwscript-k8s -f -- extensions/WikiLambda/maintenance/loadPreDefinedObject.php --wiki=wikifunctionswiki --from Z1234 --to Z1237
  3. To edit existing Objects:
    1. Use the --zid or --to and --from flags as explained above.
    2. Add the --merge flag to merge the function-schemata latest version with the currently stored Object in production.
      E.g.: to apply latest changes to the built-ins from Z6000 to Z6006, run: mwscript-k8s -f -- extensions/WikiLambda/maintenance/loadPreDefinedObject.php --wiki=wikifunctionswiki --merge --from Z6000 --to Z6006
      E.g.: to apply latest changes to the built-in Z1234, run: mwscript-k8s -f -- extensions/WikiLambda/maintenance/loadPreDefinedObject.php --wiki=wikifunctionswiki --merge --zid Z1234
    3. The --merge flag might find conflicts; the script will show information of the conflict and request action:
      1. If the conflict flags an intended change in function-schemata, enter y (yes).
      2. If the conflict is unrelated or you have any doubt, enter n (no) to keep the current version and discuss the conflict with the team.

For more detailed documentation on loadPreDefinedObject.php, see the WikiLambda README.md file.
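
The flag combinations described above (--zid for a single Object, --from and --to for a range, and an optional --merge) can be summarised as a small command builder. This is only a sketch: load_objects_cmd is a hypothetical helper that prints the command for review instead of running it, and it uses only the flags listed in this section.

```shell
# Hypothetical helper: print (not run) the loadPreDefinedObject.php command.
# Usage: load_objects_cmd [--merge] ZID          (single Object)
#        load_objects_cmd [--merge] FROM_ZID TO_ZID   (range of Objects)
load_objects_cmd() {
  local merge=""
  if [ "${1:-}" = "--merge" ]; then
    merge=" --merge"
    shift
  fi
  local base="mwscript-k8s -f -- extensions/WikiLambda/maintenance/loadPreDefinedObject.php --wiki=wikifunctionswiki"
  if [ -n "${2:-}" ]; then
    echo "${base}${merge} --from $1 --to $2"
  else
    echo "${base}${merge} --zid $1"
  fi
}

# Example:
# load_objects_cmd Z1234
# load_objects_cmd --merge Z6000 Z6006
```

Printing the command first leaves the decision to run it, and any answers to --merge conflict prompts, with the person on the deployment server.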

Re-run the secondary data updates for a kind of Object in production

To refresh the secondary data tables (e.g. labels) for a Type, such as when we fix a bug that means they might have been corrupted or partially-missing:

  1. Shell into a deployment server
  2. Do a dry-run to check: mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --zType <TYPE ZID> --report --verbose --dryRun, e.g.:
    mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --zType Z60 --report --verbose --dryRun
  3. … then do the real run: mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --zType <TYPE ZID> --report --verbose, e.g.:
    mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --zType Z60 --report --verbose

How to monitor usage

This section is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.

How to debug an issue

This section is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.

How to inspect Wikifunctions with Kubernetes tools

  1. Log in to a deployment server and run kube_env wikifunctions $CLUSTER where $CLUSTER is probably staging unless you are very brave.
  2. Now you can run kubectl commands like kubectl get pods.
  3. For example, to get logs for the function-orchestrator, you can run kubectl logs `kubectl get pods | grep orchestrator | awk '{print $1}'` function-orchestrator-main-orchestrator.
  4. For another example, to read the Prometheus metrics for a pod, you can get the IP via kubectl get pods -o wide and then curl <IP>:9100, to see that the expected metrics are being set.
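
The log command in step 3 selects pod names by filtering the output of kubectl get pods. A minimal sketch of that filter as a reusable function follows; orchestrator_pods is a hypothetical name, and the sketch assumes the usual kubectl get pods table layout with the pod name in the first column.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# names (first column) of pods whose line mentions "orchestrator".
orchestrator_pods() {
  awk '/orchestrator/ {print $1}'
}

# Example, after running kube_env as in step 1 (picks the first matching pod):
# kubectl logs "$(kubectl get pods | orchestrator_pods | head -n 1)" function-orchestrator-main-orchestrator
```

Taking only the first match with head -n 1 is a choice made for this sketch, so that kubectl logs receives a single pod name when several orchestrator pods are running.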

How to poke the orchestrator/evaluator

  1. Log in to a deployment server.
  2. The endpoint for the orchestrator can be found at https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/.
  3. Example commands:
    1. to provoke the orchestrator directly: curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z801","Z801K1":"foo"},"doValidate":false}' --header "Content-type: application/json"
    2. to call the evaluator via the orchestrator: curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{ "Z1K1": "Z7", "Z7K1": { "Z1K1": "Z8", "Z8K1": [ "Z17", { "Z1K1": "Z17", "Z17K1": "Z6", "Z17K2": { "Z1K1": "Z6", "Z6K1": "Z400K1" }, "Z17K3": { "Z1K1": "Z12", "Z12K1": [ "Z11" ] } }, { "Z1K1": "Z17", "Z17K1": "Z6", "Z17K2": { "Z1K1": "Z6", "Z6K1": "Z400K2" }, "Z17K3": { "Z1K1": "Z12", "Z12K1": [ "Z11" ] } } ], "Z8K2": "Z1", "Z8K3": [ "Z20" ], "Z8K4": [ "Z14", { "Z1K1": "Z14", "Z14K1": "Z400", "Z14K3": { "Z1K1": "Z16", "Z16K1": { "Z1K1": "Z61", "Z61K1": "javascript" }, "Z16K2": "function Z400( Z400K1, Z400K2 ) { return (parseInt(Z400K1) + parseInt(Z400K2)).toString(); }" } } ], "Z8K5": "Z400" }, "Z400K1": "5", "Z400K2": "8" } ,"doValidate":false}' --header "Content-type: application/json"
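
The first example request above is a fixed JSON body with one variable string. As a sketch, it can be generated by a small function; echo_payload is a hypothetical name, and the argument is inserted verbatim, so this only works for strings without quotes or backslashes.

```shell
# Hypothetical helper: build the request body for the Z801 call shown above.
# The argument is inserted verbatim (no JSON escaping), so keep it to plain text.
echo_payload() {
  printf '{"zobject":{"Z1K1":"Z7","Z7K1":"Z801","Z801K1":"%s"},"doValidate":false}' "$1"
}

# Example, to be run from a deployment server:
# curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST \
#   --data "$(echo_payload foo)" --header "Content-type: application/json"
```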

Things that might go wrong

  1. environment variables
    • Environment variables set in the microservice images are ignored by Kubernetes. If you add/delete/modify an environment variable in a container image, you must also update the corresponding configuration when deploying that version of the image.

Beta Cluster

  • Our Beta Cluster production imitation runs on deployment-docker-wikifunctions01 using the role::beta::docker_services hack to run them directly in docker (no kubernetes), so it's not entirely prod-like
  • If you're an admin member of the deployment-prep project, you should be able to do almost everything needed in Horizon
  • To debug, shell in via ssh deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud
    • To trigger an immediate puppet update rather than waiting for the cron, run sudo -i puppet agent -tv
    • To see what services are running, run sudo docker ps
    • To inspect logs from one of the services, run e.g. sudo docker logs function-evaluator-py.service
    • To run a test from the CLI, you can use the above sample commands but changing out the URL for https://wikifunctions-orchestrator-beta.wmflabs.org/1/v1/evaluate, e.g.:
      curl https://wikifunctions-orchestrator-beta.wmflabs.org/1/v1/evaluate -X POST --data '{"zobject":{ "Z1K1": "Z7", "Z7K1": { "Z1K1": "Z8", "Z8K1": [ "Z17", { "Z1K1": "Z17", "Z17K1": "Z6", "Z17K2": { "Z1K1": "Z6", "Z6K1": "Z400K1" }, "Z17K3": { "Z1K1": "Z12", "Z12K1": [ "Z11" ] } }, { "Z1K1": "Z17", "Z17K1": "Z6", "Z17K2": { "Z1K1": "Z6", "Z6K1": "Z400K2" }, "Z17K3": { "Z1K1": "Z12", "Z12K1": [ "Z11" ] } } ], "Z8K2": "Z1", "Z8K3": [ "Z20" ], "Z8K4": [ "Z14", { "Z1K1": "Z14", "Z14K1": "Z400", "Z14K3": { "Z1K1": "Z16", "Z16K1": "Z600", "Z16K2": "function Z400( Z400K1, Z400K2 ) { return (parseInt(Z400K1) + parseInt(Z400K2)).toString(); }" } } ], "Z8K5": "Z400" }, "Z400K1": "15", "Z400K2": "18" },"doValidate":false}' --header "Content-type: application/json" -w "\n"
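
The same sample requests are used against staging, production, and Beta Cluster, with only the URL changing. A sketch of that mapping as a lookup function is shown here; evaluate_url and the environment labels staging, production, and beta are hypothetical names chosen for this example, while the URLs are the ones given on this page.

```shell
# Hypothetical helper: map an environment label to the evaluate endpoint
# listed on this page. The labels themselves are made up for this sketch.
evaluate_url() {
  case "$1" in
    staging)    echo "https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/" ;;
    production) echo "https://wikifunctions.discovery.wmnet:30443/1/v1/evaluate/" ;;
    beta)       echo "https://wikifunctions-orchestrator-beta.wmflabs.org/1/v1/evaluate" ;;
    *)          echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Example:
# curl "$(evaluate_url beta)" -X POST \
#   --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z801","Z801K1":"foo"},"doValidate":false}' \
#   --header "Content-type: application/json" -w "\n"
```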