Data Platform/Systems/AQS
The Analytics Query Service (AQS) is a set of services that provide public-facing APIs that serve analytics data from both a Cassandra and a Druid backend. It is how Wikistats (stats.wikimedia.org) gets data to visualize.
- Maintained by mw:Data Platform Engineering/Data Products
- Issue tracker: Phabricator (Report an issue)
Services
- Device Analytics
- Repository: generated-data-platform/aqs/device-analytics
- API specification: swagger.json
- Data source: Cassandra
- Geo Analytics
- Repository: generated-data-platform/aqs/geo-analytics
- API specification: swagger.json
- Data source: Cassandra
- Media Analytics
- Repository: generated-data-platform/aqs/media-analytics
- API specification: swagger.json
- Data source: Cassandra
- Page Analytics
- Repository: generated-data-platform/aqs/page-analytics
- API specification: swagger.json
- Data source: Cassandra
- Edit Analytics
- Repository: generated-data-platform/aqs/edit-analytics
- API specification: swagger.json
- Data source: Druid
- Editor Analytics
- Repository: generated-data-platform/aqs/editor-analytics
- API specification: swagger.json
- Data source: Druid
- Commons Impact Metrics
- Repository: generated-data-platform/aqs/commons-impact-analytics (GitLab)
- API specification: swagger.json
- Data source: Cassandra
- Common functionality
- AQS Assist: generated-data-platform/aqs/aqsassist
- service-lib-golang: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/servicelib-golang
- Test environments
- Cassandra test env: generated-data-platform/aqs/aqs-docker-cassandra-test-env
- Druid test env: generated-data-platform/aqs/aqs-docker-druid-test-env
- QA
- QA test suite: generated-data-platform/aqs/aqs_tests
- User documentation site
- Repository: generated-data-platform/aqs/analytics-api
- Live link: doc.wikimedia.org/analytics-api
Monitoring
Grafana dashboards:
- Device Analytics dashboard
- Media Analytics dashboard
- Geo Analytics dashboard
- Page Analytics dashboard
- Editor Analytics dashboard
- Edit Analytics dashboard
- Commons Impact Analytics dashboard
- Cassandra: https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1
- Druid: https://grafana.wikimedia.org/d/000000538/druid
Project overview
Epic task in Phabricator | API Platform Team workboard | AQS2.0 workboard
- Done Implement the new, stand-alone AQS service(s)
- Done Deploy to k8s
- Done Switch RESTBase from proxying requests to the old AQS service to proxying them to the new k8s-based one
- Phase 1: Unique devices endpoint (Device Analytics service)
- Phase 2: Pageviews and legacy endpoints (Page Analytics service), editors/by-country endpoint (Geo Analytics service), and media requests endpoints (Media Analytics service)
- Phase 3: Edited pages, edits, and bytes difference endpoints (Edit Analytics service) and editors and registered users endpoints (Editor Analytics service)
- In progress Deprecate the http://{project}/api/rest_v1/metrics resources
- Eventually phase out the RESTBase /metrics hierarchy
Communicating user-facing changes
When making a user-facing change to an AQS API:
- Consult the stability policy for versioning guidance.
- Update the service's API spec and the documentation site.
- Update the combined changelog (source).
- Send an announcement to the analytics mailing list.
Running a service
AQS 2.0 consists of several repositories. Some correspond to individual services that expose APIs. Others correspond to cross-service common functionality or test environments.
Setup
You will need:
- Go (the language), which can be downloaded from https://go.dev/doc/install
- Make (the utility), which can be installed in various ways depending on your platform. For example, one way to install it on Mac is https://formulae.brew.sh/formula/make
Go (aka "golang") is an opinionated language in various ways. Among these is that you're probably much better off keeping your Go code under your "GOPATH" rather than wherever you may be used to keeping code. (There are, of course, always ways for savvy developers to cheat the system. If you choose to do that, any consequences are on you.) On my Mac, I cloned all the AQS 2.0 repositories under ~/go/src/.
Start a service
The various service README files contain details about running that particular service. But the summary is that you'll need to open several command-line (aka "terminal") windows/tabs and run commands in each. The following describes how to run the "pageviews" service. Other services operate similarly.
- In one terminal, navigate to <GOPATH>/aqs-docker-test-env
- Run "make startup", wait for it to say "Startup complete", then leave it running
- In another terminal, also in <GOPATH>/aqs-docker-test-env, run "make bootstrap" and wait for it to complete
- Navigate (either in that terminal or a different one) to <GOPATH>/pageviews
- Run "make"
- Run "./pageviews" (and leave it running)
- In another terminal, navigate to <GOPATH>/pageviews and run "make test"
- In your browser, visit http://localhost:8080/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Banana/daily/20190101/20190102
We haven't started the Druid-based endpoint(s) yet, but the process will likely be similar, with perhaps some differences in how to launch the test environment.
Tips and troubleshooting
Because Go is an opinionated language, it may refuse to run over seemingly small things, such as whitespace. If you see something like this:
goimports: format errors detected
You can execute this to see what Go is unhappy about:
goimports -d *.go
And this to automatically fix it:
goimports -w *.go
Our services depend on several packages, including our own “aqsassist”, which is in active development. This means you may sometimes need to update dependencies for your local service to run. You can update all dependencies via:
go get .
or update specific dependencies via something like:
go get gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqsassist
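Because aqsassist is in active development, you may sometimes want to test a service against a local aqsassist checkout before the change is pushed. A Go replace directive in the service's go.mod does that. This is a sketch: the relative path is an example (adjust it to wherever you cloned aqsassist), and the directive should not be committed:

```
// In the service's go.mod, for local development only — do not commit:
replace gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqsassist => ../aqsassist
```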
Local Testing
This section explains how to prepare and run the two local test environments (Cassandra and Druid) so that all the AQS services can be tested with the AQS QA test suite. Following the instructions below, you can run two docker compose projects, one per test environment, covering all the Cassandra-based and Druid-based services.
The steps described here apply to almost all the services and to both test environments.
Currently, two test environments can be built and run, ready to be used by QA engineers to run the AQS test suite:
- aqs-docker-cassandra-test-env-qa: A Cassandra test environment with the following services/containers (available here):
- cassandra-qa: A Docker container with a Cassandra database already populated with some sample data
- device-analytics-qa: A Docker container running the service, listening on port 8090
- geo-analytics-qa: A Docker container running the service, listening on port 8091
- media-analytics-qa: A Docker container running the service, listening on port 8092
- page-analytics-qa: A Docker container running the service, listening on port 8093
- aqs-docker-druid-test-env-qa: A Druid test environment with the following services/containers (available here):
- druid-qa: A Docker container with a Druid database already populated with some sample data
- editor-analytics-qa: A Docker container running the service, listening on port 8094
- edit-analytics-qa: A Docker container running the service, listening on port 8095
Quick start (for QA engineers)
This quick start guide shows how to build, start, and populate the docker-compose projects for both testing environments (Cassandra and Druid):
- Clone all service repositories that belong to the test env you want to run
- Inside every service project, run make docker_qa to create the service Docker image (the image is named after the service: geo-analytics, editor-analytics, and so on)
- For the Cassandra test env:
- Clone the aqs-docker-cassandra-test-env repository
- Inside the test-env project, run make startup-qa to create the new docker-compose project. Before creating it, you can change the host port each service is mapped to (by default the service container's port 8080 is mapped to a host port such as 8091); this is useful if you want to run all the services at the same time to test them together
- Inside the test-env project, run make bootstrap-qa to populate Cassandra in the new docker-compose project (it takes around 15 minutes to fully populate the database)
- Check your Docker Dashboard to be sure that a docker compose project called aqs-docker-test-env-qa has been created with all its services running
- For Druid test env:
- Clone the aqs-docker-druid-test-env repository
- Go to the aqs-docker-druid-test-env-build folder and run docker build -t aqs-docker-druid-test-env . (this creates a Druid image already populated with the sample data)
- Go to the aqs-docker-druid-test-env-run folder and run
make startup-qa
- Check your Docker Dashboard to be sure that a docker compose project called aqs-docker-druid-test-env-qa has been created with all its services running
- Try making a request (for instance, http://localhost:8091/metrics/editors/by-country/en.wikipedia/100..-edits/2018/11) to check that everything is working (adjust the port and the request path for the service you want to try)
- The following are the ports where each service will be listening:
- aqs-docker-test-env-qa:
- device-analytics: 8090
- geo-analytics: 8091
- media-analytics: 8092
- page-analytics: 8093
- aqs-docker-druid-test-env-qa:
- editor-analytics: 8094
- edit-analytics: 8095
Full guide (in case you want to customize these environments)
This full guide describes all the steps needed to create a docker compose project composed of the geo-analytics service and the Cassandra test env, as a sample. The service-related steps apply to any other AQS service you want to run using Docker, and both test environments (Cassandra and Druid) can be extended following the same pattern to add more services to the final docker compose project.
Service config changes
These steps modify the config.yaml file in geo-analytics
Needed changes:
- config.yaml file: We need to change the Cassandra hostname and the service listen_address so that the new AQS test env runs properly via Docker.
You must change the cassandra host to "cassandra". This is the name of the Cassandra service we will define in the docker compose config file for the QA test environment (cassandra + geo); it is needed so that the service container (geo-analytics in this case) can connect to the cassandra one. Take the opportunity to change the listen_address property to 0.0.0.0 to accept connections from outside the service container (so you can run the QA test suite from your host):
. . .
listen_address: 0.0.0.0
. . .
cassandra:
hosts: cassandra
. . .
Create the service image
These steps describe how it was done for the geo-analytics service. The same steps can be applied to any other AQS service.
Needed changes:
- blubber file: https://gerrit.wikimedia.org/r/plugins/gitiles/generated-data-platform/aqs/geo-analytics/+/refs/heads/main/.pipeline/blubber.yaml
- deployment config file: https://gerrit.wikimedia.org/r/plugins/gitiles/generated-data-platform/aqs/geo-analytics/+/refs/heads/main/.pipeline/config.yaml
- New target in the Makefile:
docker_qa: ## create a docker container to run QA test via Docker
curl -s "https://blubberoid.wikimedia.org/v1/production" -H 'content-type: application/yaml' --data-binary @".pipeline/blubber.yaml" > Dockerfile
docker build -t geo-analytics .
Needed steps
- If not available, put the blubber file into the .pipeline folder (change the entry_point according to the service where you are putting this file)
- If not available, put the config file into the .pipeline folder (this file is the same for all the services)
- Add the new target to the Makefile
- Now you can build the service image with the following command:
make docker_qa
Build the QA test environment (for cassandra-based services)
These steps must be run in the aqs-docker-test-env project folder.
Needed changes:
All changes described here are available in the aqs-docker-cassandra-test-env repository.
- New file docker-compose-qa.yml: https://gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqs-docker-cassandra-test-env/-/blob/main/docker-compose-qa.yml (all the services can be added to this file to create an AQS full-services test env).
This file defines a new docker-compose project composed of a Cassandra engine and a sample service (geo-analytics in this case). If you take a look at this file, you will see that the service-specific part can be customized to add any additional Cassandra-based service you want to include in this dockerized env. At the moment, some of these services are already included:
One compose service defines the cassandra container. It also gets a healthcheck property so that docker compose knows when Cassandra is available; that way, dependent service containers only start once the database is up (avoiding failures from trying to connect to it before it is available):
cassandra:
image: cassandra:3.11
container_name: cassandra-qa
ports:
- "9042:9042"
volumes:
- .:/env
networks:
- network1
healthcheck:
test: ["CMD-SHELL", "[ $$(nodetool statusgossip) = running ]"]
interval: 20s
timeout: 10s
retries: 5
Add another compose service for each AQS service you want to include in this test environment (in this case, we are adding geo-analytics):
geo-analytics:
image: geo-analytics
container_name: geo-analytics-qa
ports:
- "8091:8080"
networks:
- network1
depends_on:
cassandra:
condition: service_healthy
- New targets in the Makefile:
bootstrap-qa: schema-qa load-qa
schema-qa:
docker exec -it cassandra-qa cqlsh -f /env/schema.cql
load-qa:
docker exec -it cassandra-qa cqlsh -f /env/test_data.cql --cqlshrc=/env/cqlshrc
startup-qa:
docker-compose -f docker-compose-qa.yml up -d
Needed steps:
Once you have customized both files according to the specified changes (docker-compose-qa.yml and Makefile), you can build and run the new dockerized test environment:
- Create the docker-compose project:
make startup-qa
- Populate the cassandra container
make bootstrap-qa
The QA test env and the included services are now running as Docker containers within a docker compose project named aqs-docker-test-env-qa.
Each service listens on a different port according to its configuration. For example, in this case geo-analytics will be listening on port 8091.
Build and run the QA test environment (for druid-based services)
These steps must be run in the aqs-docker-druid-test-env project folder.
Needed changes:
All changes described here are available in the aqs-docker-druid-test-env repository.
- New file docker-compose-qa.yml (to be added to the aqs-docker-druid-test-env-run folder): https://gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqs-docker-druid-test-env/-/blob/main/aqs-docker-druid-test-env-run/docker-compose-qa.yml (all the services can be added to this file to create an AQS full-services test env for Druid).
This file defines a new docker-compose project composed of a Druid engine and a sample service (editor-analytics in this case). If you take a look at this file, you will see that the service-specific part can be customized to add any additional Druid-based service you want to include in this dockerized env. At the moment, some of these services are already included:
One compose service defines the Druid container. It also gets a healthcheck property so that docker compose knows when Druid is available; that way, dependent service containers only start once the database is up (avoiding failures from trying to connect to it before it is available):
druid:
image: bpirkle/aqs-docker-druid-test-env:latest
container_name: druid-qa
ports:
- "8888:8888"
- "8082:8082"
networks:
- network1
healthcheck:
interval: 10s
retries: 9
timeout: 90s
test:
- CMD-SHELL
- nc -z 127.0.0.1 8888
Add another compose service for each AQS service you want to include in this test environment (in this case, we are adding editor-analytics):
editor-analytics:
image: editor-analytics
container_name: editor-analytics-qa
ports:
- "8094:8080"
networks:
- network1
depends_on:
druid:
condition: service_healthy
- New targets in the Makefile (inside the aqs-docker-druid-test-env-build folder):
startup-qa:
docker-compose -f docker-compose-qa.yml up -d
shutdown-qa:
docker-compose -f docker-compose-qa.yml down
Needed steps:
- Once you have customized both files according to the specified changes (docker-compose-qa.yml and Makefile), you can build the new test environment Docker image (from the aqs-docker-druid-test-env-build folder):
docker build -t aqs-docker-druid-test-env .
- After creating the image, you can start the docker-compose project (from the aqs-docker-druid-test-env-run folder):
make startup-qa
The QA test env and the included services are now running as Docker containers within a docker compose project named aqs-docker-druid-test-env-qa.
Each service listens on a different port according to its configuration. For example, in this case editor-analytics will be listening on port 8094.
Demos
Notes for developers
- Developers will need to keep an additional config-devel.yaml that differs only in the cassandra host ("localhost", to be able to connect to the test env while developing) and in listen_address ("localhost"). We'll have to do something similar with the Druid-based services. This is the way to run the service using an alternative config file:
make clean build && ./geo-analytics --config config-devel.yaml
config.yaml | config-devel.yaml
---|---
listen_address: 0.0.0.0 | listen_address: localhost
cassandra.hosts: - cassandra | cassandra.hosts: - localhost
- The Dockerfile (created by blubber) and this config-devel.yaml should be added to .gitignore
- A .dockerignore file should be added to the repo to avoid reusing an already-built service binary when creating the service Docker image (the prebuilt binary mustn't be included in the dockerized environment, because it has to be built inside the right Docker container)
To keep in mind
- We need to change the host of the database (to the name of the compose service, cassandra) to be able to connect from the service container to the cassandra one
- This is harmless: when deploying to production, the config file will be replaced automatically. We can use cassandra in the default config file and create config-devel.yaml with "localhost" for development
- We need to allow remote connections (listen on 0.0.0.0:8080 instead of localhost:8080) so that our host can connect to the service (Postman, curl, . . .)
- 0.0.0.0 can be the default value, so we use it in the default config file for all the services. It will be replaced automatically by the right one when deploying to production
- In the end we need to keep two Cassandra containers: one for development and one for QA (a docker compose with two services)
- This isn't really a problem, because developers usually don't start the QA one and QA engineers don't use the development one
- We can keep both test envs at the same time (aqs-docker-test-env and aqs-docker-test-env-qa). The only thing to keep in mind is that we cannot run both at the same time, because they listen on the same port.
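Putting the two columns of the table above together, a minimal config-devel.yaml might look like the sketch below. Only listen_address and cassandra.hosts differ from config.yaml; any other fields should be copied from your service's config.yaml as-is:

```yaml
# config-devel.yaml — local development only (keep it in .gitignore)
listen_address: localhost
cassandra:
  hosts:
    - localhost
```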
Adding a Wiki
Data generated by a wiki goes through two main data pipelines before it ends up in AQS. Readership data flows from our Varnish caches, through Kafka, into private and public datasets. The public pageviews data ends up in the Pageview API, served by AQS from Cassandra. Editing data flows more slowly, through monthly whole-history loads from MediaWiki database replicas. This ends up in Druid as mediawiki_history_reduced and is also served by AQS. (TODO: linkify all this)
To add a new wiki, you need to edit the include lists for these two pipelines:
Developing a New Endpoint
This is roughly how writing a new AQS endpoint goes. Note that some endpoints may use Druid rather than Cassandra, so that process may be different:
- Development
- Write the Oozie job to move data from Hadoop to Cassandra; Verify the output is correct by outputting to plain JSON/test Hive table; The Oozie job will be unable to load into Wikimedia Cloud Cassandra instances (You just have to hope that the loading works)
- Write the AQS endpoint, which includes the table schema spec and unit tests
- Testing
- Tunnel to any of the aqs-test Cassandra instances (i.e., aqs-test1001.analytics.eqiad1.wikimedia.cloud, ..., aqs-test1010.analytics.eqiad1.wikimedia.cloud)
- Create a keyspace in this cloud Cassandra instance of the same name as is detailed in the Oozie job properties file; Insert some data into the table
- From any of the aqs-test machines, run your version of AQS with the new endpoint, pointing to one of the Cassandra instances (e.g., 172.16.4.205); Test by manually running queries against your local instance (i.e., localhost:7231)
- Productionize
- Deploy the Oozie job, but don't run it
- Deploy AQS to point to the new data; Once it is running, it will automatically create the relevant keyspace in the production Cassandra instance
- Manually run a few queries against the new AQS endpoint (i.e., aqs1010; localhost:7232), and ensure that they all respond with 404 (because no data is loaded into Cassandra yet)
- Run the Oozie job to load the data into Cassandra
- Manually run a few queries against the new AQS endpoint, and ensure that they all respond with the proper responses, as the data should now be loaded into Cassandra
- Submit a pull request for the restbase repository on GitHub with the schema of the new endpoint added; Now the endpoint should be publicly accessible!
- Update all relevant documentation
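For the testing steps above, creating a keyspace and inserting sample data via cqlsh might look like the following sketch. The keyspace, table, and column names here are illustrative only — use the ones detailed in your Oozie job's properties file:

```sql
-- Run inside cqlsh on the aqs-test Cassandra instance
CREATE KEYSPACE IF NOT EXISTS "local_group_default_T_my_new_metric"
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS "local_group_default_T_my_new_metric".data (
    project     text,
    article     text,
    granularity text,
    "timestamp" text,
    views       int,
    PRIMARY KEY ((project, article, granularity), "timestamp")
);

INSERT INTO "local_group_default_T_my_new_metric".data
    (project, article, granularity, "timestamp", views)
VALUES ('en.wikipedia', 'Banana', 'daily', '2019010100', 1234);
```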
Deployment
This section assumes that the microservice has been deployed and provides instructions for subsequent releases.
Prerequisites
- Ensure access: Confirm you are part of the deployment group. If not, see the process for filing a production access request.
- Read documentation: Review the deployment pipeline documentation.
Quick guide
If you've previously deployed an AQS service, you can use this quick reference guide to run all the necessary commands after merging the deployment patch (sample patch). Otherwise, see the step-by-step guide.
# Do the following once you have prepared the change and the change has been reviewed and merged
ssh deployment.[eqiad|codfw].wmnet
cd /srv/deployment-charts
git log -n 1
# Check that the change has been pulled
cd helmfile.d/services/your-service-name
# Check changes and deploy to staging
helmfile -e staging diff
helmfile -e staging apply
# Check changes and deploy to production (eqiad)
helmfile -e eqiad diff
helmfile -e eqiad apply
# Check changes and deploy to production (codfw)
helmfile -e codfw diff
helmfile -e codfw apply
Step-by-step instructions
1. Prepare a deployment patch
Clone the deployment-charts repository and prepare a new patch changing the version you want to deploy in the right values file for the service you want to deploy.
Every service has a couple of values files in the helmfile.d/services/your-service-name folder:
- values.yaml: The file we use to deploy to production. Always deploy to staging first to check that everything is working correctly.
- values-staging.yaml: The file we use to deploy only to staging, to test something you are working on that you don't want to deploy to production yet. After testing, you will have to revert the change to this values-staging.yaml file.
The image version is the only thing we have to change to deploy a new version of any service. The version format differs depending on the service you want to deploy:
- AQS services: These are the services that reside in Wikimedia Gerrit. They use the format YYYY-MM-DD-HHMMSS-production (e.g., 2024-06-05-094107-production) to identify the Docker image. This image name can be found in the pipeline output in Gerrit after merging the deployment patch. Keep in mind that we always use the production variant.
- CIM: The Commons Impact Metrics service uses a version tag like v1.0.1, which should coincide with the tag associated with the version you are deploying. You can also find the right version in the pipeline output after creating the tag, in the build-and-publish-production-image step. For example (source):
. . .
#23 pushing layers 3.6s done
#23 pushing manifest for docker-registry.discovery.wmnet/repos/generated-data-platform/aqs/commons-impact-analytics:v1.0.5@sha256:b1d92fb52b56a6ba7252c6c1b0628bc707c39c6bcd325632c41da10e1261b2c6
#23 pushing manifest for docker-registry.discovery.wmnet/repos/generated-data-platform/aqs/commons-impact-analytics:v1.0.5@sha256:b1d92fb52b56a6ba7252c6c1b0628bc707c39c6bcd325632c41da10e1261b2c6 0.5s done
#23 pushing layers 0.2s done
. . .
You can always check the Wikimedia Docker Registry to find any image and version you need. However, keep in mind that this registry updates every 4 hours, so any newly created image won’t appear there until that time has passed.
Open a patch for the change and get a +2 approval to merge it. Here's a sample change for geo-analytics. Once your patch has been merged, you are ready to deploy the service.
2. Access the deployment server
The first step to deploy your service is to access the deployment server, where the deployment-charts repository is automatically pulled at /srv/deployment-charts/.
- SSH into the deployment server. For example: ssh deployment.eqiad.wmnet or ssh deployment.codfw.wmnet
- Once you have entered the deployment server, go to /srv/deployment-charts where the repository is automatically pulled.
cd /srv/deployment-charts/
- Once there, check the last change to be sure that your merged patch has been pulled already (it may take a few seconds until your change is pulled):
git log -n 1
If you can see the patch you have recently pushed, your change is ready to be deployed.
3. Deploy to staging
Before deploying, verify the changes that are going to be deployed:
- Go to the service's folder:
cd helmfile.d/services/your-service-name/
- Check which changes are ready to be deployed (they appear in green):
helmfile -e staging diff
If you are ok with the changes, you can deploy to the staging environment:
helmfile -e staging apply
4. Verify the service on staging
Before deploying to production, make a service request to check if the service works properly in staging. Choose any request you want to test and keep in mind that the base URL can be different depending on the service you want to test:
- device-analytics
https://staging.svc.eqiad.wmnet:4972
curl https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/all-wikipedia-projects/all-sites/monthly/20231001/2023110100
- media-analytics
https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
curl https://media-analytics.k8s-staging.discovery.wmnet:30443/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/20210101/20230220
- page-analytics
https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
curl https://page-analytics.k8s-staging.discovery.wmnet:30443/metrics/pageviews/aggregate/all-projects/all-access/all-agents/hourly/2021010100/2021010216
- geo-analytics
https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
curl https://geo-analytics.k8s-staging.discovery.wmnet:30443/metrics/editors/by-country/ru.wikipedia/5..99-edits/2018/01
- edit-analytics
https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
curl https://edit-analytics.k8s-staging.discovery.wmnet:30443/metrics/edits/aggregate/en.wikipedia/user/content/daily/20220301/20220302
- editor-analytics
https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
curl https://editor-analytics.k8s-staging.discovery.wmnet:30443/metrics/editors/aggregate/ab.wikipedia/name-bot/all-page-types/all-activity-levels/monthly/20210302/20220901
- commons-analytics
https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
curl https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443/metrics/commons-analytics/category-metrics-snapshot/UNESCO/20231101/20231120
Keep in mind that you cannot directly access these URLs. Instead, you must run the curl command from a stat machine (such as stat1008.eqiad.wmnet). If something isn't working as expected, see #Troubleshooting deployment.
5. Deploy to production
Once you have deployed to staging and verified that the service is running correctly, you are ready to deploy to production (both the eqiad and codfw servers). The following commands assume you are in the same folder from which you deployed to staging:
# Check changes and deploy to production (eqiad)
helmfile -e eqiad diff
helmfile -e eqiad apply
# Check changes and deploy to production (codfw)
helmfile -e codfw diff
helmfile -e codfw apply
6. Verify the service in production
To verify the service in production, try one of the following production API requests. The base URL is the same for all services: https://wikimedia.org/api/rest_v1/metrics.
- device-analytics
https://wikimedia.org/api/rest_v1/metrics/unique-devices/all-wikipedia-projects/all-sites/monthly/20231001/2023110100
- media-analytics
https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/20210101/20230220
- page-analytics
https://wikimedia.org/api/rest_v1/metrics/pageviews/top-by-country/en.wikipedia/all-access/2020/12
- geo-analytics
https://wikimedia.org/api/rest_v1/metrics/editors/by-country/en.wikipedia/5..99-edits/2023/02
- edit-analytics
https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia/user/content/daily/20220301/20220302
- editor-analytics
https://wikimedia.org/api/rest_v1/metrics/editors/aggregate/ab.wikipedia/name-bot/all-page-types/all-activity-levels/monthly/20210302/20220901
- commons-analytics
https://wikimedia.org/api/rest_v1/metrics/commons-analytics/category-metrics-snapshot/Gallica/20240101/20240501
Troubleshooting deployment
If something isn't working as expected, verify whether the service has been deployed properly. Here are a few steps you can take to troubleshoot:
Check pods
Enter the Kubernetes environment (staging, eqiad or codfw):
kube-env your-service-name eqiad
Check the pods to be sure that new pods have been created and started recently (example output is shown below the command):
kubectl get pods
NAME READY STATUS RESTARTS AGE
aqs-http-gateway-main-7cfb86cb9c-dj5jl 2/2 Running 0 3s
aqs-http-gateway-main-7cfb86cb9c-gtth5 2/2 Running 0 3s
aqs-http-gateway-main-7cfb86cb9c-hgv6l 2/2 Running 0 3s
aqs-http-gateway-main-7cfb86cb9c-pwr7b 2/2 Running 0 3s
Check logs
If something is not working properly, check the pod's logs:
kubectl logs aqs-http-gateway-main-7cfb86cb9c-dj5jl -c aqs-http-gateway-main
{"@timestamp":"2024-06-20T21:07:55Z","message":"CASSANDRA_USERNAME and CASSANDRA_PASSWORD env vars unset, using values from configuration file","log":{"level":"WARNING"},"service":{"name":"geo-analytics"}}
{"@timestamp":"2024-06-20T21:07:55Z","message":"initializing service geo-analytics (Go version: 7dd406b, Build host: buildkitsandbox, Item: 2024-06-05T11:04:57:UTC","log":{"level":"INFO"},"service":{"name":"geo-analytics"}}
. . .
. . .
Data filter before Cassandra load
Some data used by AQS comes from Cassandra. We use Airflow + Spark + HQL to feed the Cassandra tables.
On HDFS, we maintain a disallowed-articles table, `wmf.disallowed_cassandra_articles`, to filter out sensitive pages we don't want to appear among a wiki's top-viewed articles.
Some attacks aim to manipulate the number of views per article, for example to push traffic to a third-party site or to place an offensive word in the top list, which millions of users view.
The disallowed list is applied when loading these Cassandra tables:
* pageview_per_article_daily
* pageview_top_articles_daily
* pageview_top_percountry_daily
* pageview_top_articles_monthly
To update this list of disallowed articles:
- update the TSV at `static_data/cassandra/disallowed_cassandra_articles.tsv` in analytics/refinery
- prepare a patch and deploy it
In an emergency, you can edit the file directly by running the following (note that you still need a Gerrit patch, as the next deployment of analytics/refinery will override your change):
# Fetch the disallowed list
ssh an-launcher1002.eqiad.wmnet
export TSV_FILENAME=disallowed_cassandra_articles.tsv
export TSV_HDFS_PATH="/wmf/refinery/current/static_data/cassandra/${TSV_FILENAME}"
hdfs dfs -cat $TSV_HDFS_PATH > $TSV_FILENAME
# Add or remove some entries (beware, tabs are expected between columns, not spaces)
vim $TSV_FILENAME
# Push the file back to HDFS
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -put -f $TSV_FILENAME $TSV_HDFS_PATH
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod +r $TSV_HDFS_PATH
API documentation
To create API reference docs that are reliable and easy to update, AQS 2.0 services use:
- swag to generate an OpenAPI specification based on a mix of code annotations and the code itself. For more information about swag, see the API documentation page on mediawiki.org.
- Vitepress to generate a static website for the documentation at doc.wikimedia.org/analytics-api. See the repository at gitlab:repos/generated-data-platform/aqs/analytics-api.
- RapiDoc to integrate with Vitepress and display the API specs in HTML with an interactive sandbox.
Documentation monitoring
The AQS documentation site uses Matomo to collect analytics data. To view the data, follow the instructions on wiki to log in to Matomo, and select "AQS documentation" from the "websites" dropdown.
Updating the docs
When making a change to an AQS service, you must run these commands locally to update the API spec before submitting the change. As of 2023, there is no integration to update the spec automatically, so each patch must include the corresponding changes to the API spec when needed. Remember that the docs rely on code annotations, so make sure to keep the annotations up to date with any code changes.
1. Install swag:
go install github.com/swaggo/swag/cmd/swag@latest
2. Generate the spec:
make docs
Swag outputs the spec in YAML and JSON formats to a /docs directory.
Reading the specification
You can view the spec using the API spec reader.
Setting up docs for a new service
To set up API docs for a new AQS service:
- Annotate main.go (example, style guide): Anywhere in main.go, add annotations to document general information about the API.
- Annotate handler (example, style guide): Add annotations to any code file to document an endpoint. Endpoint annotations should be stored as close as possible to the code they describe. The block of endpoint annotations must end on a line immediately preceding a function.
- Annotate entity (example, style guide): Swag automatically gets information about the response format from the struct. To complete the schema in the docs, add these elements to the struct definition:
  - an example value within the JSON encoding definition, using the syntax `example:"example value"`. Note that these examples will be used in the sandbox and should return valid data.
  - a description of the attribute as an inline comment
- Add a `make docs` command: Add these lines to the service's Makefile:
docs: ## creates openapi spec (requires swag)
	swag init || (echo "Hint: If you haven't installed swag, run 'go install github.com/swaggo/swag/cmd/swag@latest', then re-run 'make docs'."; exit 1)
- Generate the spec: Run `make docs` to generate the spec.
- Commit the spec files: After generating the spec, commit the /docs directory to the source code repository. Since spec generation does not currently run in CI, the API spec must be stored in the repository in order to be served by the API spec endpoint.
- Add an endpoint to serve the API spec (example: main.go, handler, test): To make the docs publicly available, add an endpoint that serves the docs/swagger.json file via `service-name/api-spec.json`, for example `device-analytics/api-spec.json`.
- Route the spec endpoint: As part of setting up routing for the new service in the REST Gateway, ensure that the spec endpoint is served at `service-name/api-spec.json`, for example `device-analytics/api-spec.json`.
- Add reference documentation to the AQS docs site: Submit a merge request to analytics-api that:
- Adds a new Markdown file under /reference using the #API reference template.
- Adds the new page to the navigation in the config file under API reference.
- Adds an entry to the changelog announcing that the new endpoints are available.
- Follow the instructions in the README to preview the site locally. Note that the API reference docs won't appear until the API spec endpoint has been routed and deployed.
API reference template
To use this template, replace everything in square brackets (removing the brackets themselves). For an example, see the Page Analytics reference page.
---
title: [Service name]
description: Reference docs and sandbox for [service name]
---
# [Service name]
[Description of what data the service provides and any limitations users should be aware of, such as the date the data is available from. Use h2 headings to separate limitation sections if necessary.]
## [Endpoint name]
<SpecView specUrl="[spec endpoint URL]" path="get [endpoint path, not including https://wikimedia.org/api/rest_v1/metrics/]"/>
[Repeat the h2 section above for each endpoint. If necessary, group the endpoints into groups, using an h2 heading for the group name and h3 headings for the endpoints.]
Spec validation
API specs are validated against the schemas in aqs tests.
Decision records
- phab:T361887: User documentation approach
- phab:T361889: OpenAPI spec viewer
Historical
AQS was originally developed as a single service with an API proxied through RESTBase. As part of the goal to deprecate RESTBase, the /metrics endpoints served by AQS 1.0 were migrated to a set of services that do not depend on RESTBase. This project was referred to as AQS 2.0. See phab:T263489 for the original proposal.