Data Engineering/Systems/AQS
The Analytics Query Service (AQS) is a public-facing API that serves analytics data from both a Cassandra and a Druid backend. It is how Wikistats (stats.wikimedia.org) gets the data it visualizes.
- Maintained by mw:Data_Platform_Engineering/Data_Products
- Issue tracker: Phabricator (Report an issue)
Adding a Wiki
Data generated by a wiki goes through two main data pipelines before it ends up in AQS. Readership data flows from our varnish caches, through Kafka, into private and public datasets. The public pageviews data ends up in the Pageview API, served by AQS from Cassandra. Editing data flows more slowly, through monthly whole-history loads from the mediawiki database replicas. This ends up in Druid as mediawiki_history_reduced and is also served by AQS. (TODO: linkify all this)
To add a new wiki, you need to edit the include lists for these two pipelines:
Hosted API
- Analytics/AQS/Pageviews
- Analytics/AQS/Devices Analytics
- Analytics/AQS/Wikistats 2
- Analytics/AQS/Mediarequests
- Analytics/AQS/Geoeditors
More up-to-date info at: https://wikimedia.org/api/rest_v1/?doc#/
Scaling: Settings, Failover and Capacity Projections
Monitoring
Grafana dashboards:
- Cassandra: https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1
- Druid: https://grafana.wikimedia.org/d/000000538/druid
- PageViews: https://grafana.wikimedia.org/dashboard/db/pageviews [Broken Link T211982]
- ElasticSearch: https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase => Query: analytics.wikimedia.org
Throttling
2016-05-26
Summary: throttling is enforced at the restbase/AQS layer, so requests that are served by varnish are not throttled. This is an important point. It means that the throughput of the API on the top endpoints is very high, because the same data is requested over and over; on those endpoints we mostly serve "daily top" data. Throttling is applied per (IP/endpoint/second), and a client that exceeds the limit receives a 429 response code to its http request.
At the time of this writing, throttling is set to trigger at a given number of requests per (IP/endpoint/second), and thus far we are only logging when limits are breached; we are not enforcing throttling quite yet. Why? Because if we get more than 30 concurrent requests in Cassandra at any one time, Cassandra lookups time out. This will likely no longer be true after we finish our work on scaling the storage layer of the API.
Ticket in which we discussed throttling: [1]
Throttling limits breached are logged in to: https://logstash.wikimedia.org/#/dashboard/temp/AVTsUtpi_LTxu7wlBfI-
Config for throttling is at: https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml
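For client authors, a minimal sketch of coping with the 429 responses described above. This is an illustration, not official client guidance; the endpoint URL and retry limits here are assumptions:

```shell
# Hedged sketch: exponential backoff when AQS throttles with a 429.
# The URL and the 5-attempt cap are illustrative choices.

backoff_delay() {
  # Delay in seconds for attempt N: 2^N (1 << N)
  echo $((1 << $1))
}

fetch_with_backoff() {
  local url="$1" attempt=0 code
  while [ "$attempt" -lt 5 ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" != "429" ]; then
      echo "$code"
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example call (not run here):
# fetch_with_backoff 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2016/05/01'
```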
2016-09-21
After the scaling work and our load tests, throttling limits were bumped up to 400 requests per second. See: Analytics/AQS/Scaling#Load_testing
Developing a New Endpoint
This is roughly how writing a new AQS endpoint goes. Note that some endpoints may use Druid rather than Cassandra, so that process may be different:
- Development
- Write the Oozie job to move data from Hadoop to Cassandra; verify the output is correct by writing to plain JSON or a test Hive table. The Oozie job will be unable to load into Wikimedia Cloud Cassandra instances (you just have to hope that the loading works)
- Write the AQS endpoint, which includes the table schema spec and unit tests
- Testing
- Tunnel to any of the aqs-test Cassandra instances (i.e., aqs-test1001.analytics.eqiad1.wikimedia.cloud, ..., aqs-test1010.analytics.eqiad1.wikimedia.cloud)
- Create a keyspace in this cloud Cassandra instance of the same name as is detailed in the Oozie job properties file; Insert some data into the table
- From any of the aqs-test machines, run your version of AQS with the new endpoint, pointing to one of the Cassandra instances (e.g., 172.16.4.205); Test by manually running queries against your local instance (i.e., localhost:7231)
- Productionize
- Deploy the Oozie job, but don't run it
- Deploy AQS to point to the new data; Once it is running, it will automatically create the relevant keyspace in the production Cassandra instance
- Manually run a few queries against the new AQS endpoint (i.e., aqs1010; localhost:7232), and ensure that they all respond with 404 (because no data is loaded into Cassandra yet)
- Run the Oozie job to load the data into Cassandra
- Manually run a few queries against the new AQS endpoint, and ensure that they all respond with the proper responses, as the data should now be loaded into Cassandra
- Submit a pull request for the restbase repository on GitHub with the schema of the new endpoint added; Now the endpoint should be publicly accessible!
- Update all relevant documentation
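The "expect 404 before the load, then proper responses after" checks above can be sketched as a small helper. The hostname/port match the steps above, but the endpoint path is hypothetical:

```shell
# Hedged sketch of the pre/post-load endpoint checks described above.
# aqs1010 / localhost:7232 as in the steps; the endpoint path is illustrative.

aqs_status() {
  # Print only the HTTP status code for an AQS URL
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

check_endpoint() {
  # $1: URL, $2: expected status ("404" before the Oozie load, "200" after)
  local code
  code=$(aqs_status "$1")
  if [ "$code" = "$2" ]; then
    echo "OK: $1 -> $code"
  else
    echo "FAIL: $1 -> $code (expected $2)" >&2
    return 1
  fi
}

# Before loading data (expect 404):
# check_endpoint 'http://localhost:7232/analytics.wikimedia.org/v1/my-new-endpoint/params' 404
# After the Oozie job has run (expect 200):
# check_endpoint 'http://localhost:7232/analytics.wikimedia.org/v1/my-new-endpoint/params' 200
```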
Deployment
This step-by-step guide covers deploying to both staging (beta) and production. Watch out for specific differences between beta and prod in each step of this section.
Step 0: Testing AQS locally
With cassandra
Testing your change in our staging environment (beta) requires either having a stable patch merged in AQS and deployed through scap, or a lot of git black magic and messing around that you shouldn't do. A good solution for quick testing is setting up your own mini AQS on your local machine, where you can make changes to the APIs instantly, update dependencies, load data, and so on, without switching between machines or sending gerrit patches.
- Install Zookeeper (brew install zookeeper on Mac).
- Install Cassandra (brew install cassandra@2.2 on Mac; be aware that without the @2.2, brew will install version 3, which we don't have yet in production).
- Make sure you're using the right Java version (8). Cassandra will complain a lot about Java 9 and 10, so make sure that your JAVA_HOME environment variable points to your Java 8 installation (/usr/libexec/java_home -V will show the versions currently installed). To do that, set export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
- Once the Cassandra service is running, start AQS by running the server with the default config provided in the repo: ./server.js -c config.example.wikimedia.yaml
- To load data or make changes in Cassandra, run cqlsh
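Assuming a Mac with Homebrew (as in the steps above), the whole setup condenses to something like the following sketch. This is not verified on every setup; adjust paths as needed:

```shell
# Hedged sketch of the local AQS setup steps above (macOS/Homebrew assumed).
setup_local_aqs() {
  brew install zookeeper
  brew install cassandra@2.2                           # @2.2 matters: plain "cassandra" installs v3
  export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)    # Cassandra 2.2 wants Java 8
  brew services start cassandra@2.2
  # From a checkout of the AQS source repo:
  ./server.js -c config.example.wikimedia.yaml
}
```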
With druid
The easiest way to test AQS Druid integration is to use the production Druid cluster (AQS can only query Druid, so no data loss is possible). Follow these steps:
- Start an SSH tunnel between your machine and the druid-public broker: ssh -N druid1010.eqiad.wmnet -L 8082:druid-public-broker.svc.eqiad.wmnet:8082
- Start AQS locally with the appropriate configuration, as suggested in this gerrit patch (WARNING: update the datasources: mediawiki_history setting to the correct value for testing).
- You should now be able to send druid-oriented queries to your local AQS (for instance: http://localhost:7231/analytics.wikimedia.org/v1/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/20180101/20190101)
Step 1: Update the AQS deploy repository
Note: Be aware that this process requires having Docker installed as an instantiation of docker is done when building.
Note: Even if you're deploying to staging (beta), the code you want to deploy should be merged to master. Otherwise, the whole deployment process won't work.
- If it's the first time you deploy:
  - Get the deploy repository: git clone ssh://$USER@gerrit.wikimedia.org:29418/analytics/aqs/deploy .
  - Make sure the AQS source git repo has the deploy.dir config variable set (see Services/FirstDeployment#Local Git).
  - Run npm install in the source repository and make sure that no error is returned. Do the same with npm test.
  - Are you deploying a new endpoint? You need to add a bit of code to the fake data script that matches the x-amples definition in AQS's v1 yaml; otherwise endpoint checks will fail on deployment. An alternative is to set x-monitor to false, in which case your new endpoints won't get checked (tip: while this fixes the deploy, not testing the endpoint is not advised).
- Then (regardless of whether it's the first time or not):
  - Make sure both the aqs-deploy and aqs repositories are on master, have the latest changes, and are clean, with submodules updated.
  - Follow Services/Deployment#Preparing_the_Deploy_Repository (basically, run ./server.js build --deploy-repo --force --review -c config.test.yaml in the source folder).
  - Check that src's sha1 in the review corresponds to the code you want to deploy.
  - Merge the newly created change to the aqs deploy repo's master.
Issues with "src" path
Remove src path from deploy repo. (We're not sure why this was added to the docs, we should discuss and explain or remove.)
Issues with git review
The build uses git review only if you pass it the --review param; omit it and it will not try to submit a patch: it will create the commit but not push it. Sometimes the build hangs. In this case, check the sync-repo branch of the deploy repository; it should have the commit in there, and that can be pushed to gerrit. It's ok to kill the build if it's been hanging for a while.
NPM vulnerabilities
Whenever possible, it is convenient to run npm audit and make sure that no dependencies pose a threat to the service. Most vulnerabilities will be solved by upgrading packages, but in some cases they will correspond to a second- or third-level dependency that can only be upgraded by forcing versions in package-lock.json. Forcing versions can be avoided if you are certain that the code carrying the vulnerability will not be run by AQS (task T207945 is an example of this). If this is not the case, you can enforce the new version by editing package-lock.json and making sure that the version change doesn't break tests.
See note about hoek npm vulnerability here: https://phabricator.wikimedia.org/T206474
NPM has more information about dealing with vulnerabilities.
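A routine audit pass might look like the following sketch. Whether `npm audit fix` is safe depends on the dependency, so always re-run the tests afterwards:

```shell
# Hedged sketch of a routine dependency audit, run in the AQS source repo.
audit_aqs_deps() {
  npm audit          # list known vulnerabilities in the dependency tree
  npm audit fix      # apply semver-compatible upgrades where possible
  npm test           # make sure the upgrades didn't break anything
}
```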
Step 2: Deploy using scap
- Tell the #wikimedia-analytics and #wikimedia-operations IRC channels that you are deploying (use !log for instance).
- Ssh into the deployment machine that suits your needs:
  - For staging (beta) use: deployment-deploy01.deployment-prep.eqiad1.wikimedia.cloud
  - For production use: deployment.eqiad.wmnet
- Execute scap:
cd /srv/deployment/analytics/aqs/deploy
git pull
git submodule update --init
scap deploy -e aqs "YOUR DEPLOYMENT MESSAGE"
- [optional] To see more detailed error logs during deployment, run scap deploy-log from /srv/deployment/analytics/aqs/deploy while you deploy.
Note: after T156049, scap will deploy only to aqs1010 (or deployment-aqs01 in the case of beta) as a first step (canary), and it will ask for confirmation before proceeding to the rest of the cluster. After that, it will deploy to one host at a time, serially. You can tell scap to ask for confirmation after each host or not, but telling it to proceed to all the other hosts (after the canary) will not cause a deployment to all of them at the same time, since the previously mentioned constraint still holds. Each host will be depooled from the load balancer before the aqs restart, and repooled after it.
Step 3: Test
Staging (beta)
Beta thus far has just a modest dataset: pageviews for the Barack Obama page in 2016 from es.wikipedia, en.wikipedia and de.wikipedia.
You can run some queries like the following to see that aqs is running well:
wget http://localhost:7232/analytics.wikimedia.org/v1/pageviews/
curl http://localhost:7232/analytics.wikimedia.org/v1/pageviews/per-article/de.wikipedia/all-access/all-agents/Barack_Obama/daily/2016010100/2016020200
Should return daily records
curl http://localhost:7232/analytics.wikimedia.org/v1/pageviews/per-article/de.wikipedia/all-access/all-agents/Barack_Obama/monthly/2016010100/2016020200
Should return monthly records
curl http://localhost:7232/analytics.wikimedia.org/v1/pageviews/aggregate/en.wikipedia/all-access/all-agents/daily/2015100100/2016103100
Should return aggregate data for en.wikipedia, if any
curl http://localhost:7232/analytics.wikimedia.org/v1/pageviews/aggregate/es.wikipedia/all-access/all-agents/monthly/2015100100/2016103100
Should return monthly aggregate data for es.wikipedia
Production
From (one of) the deployed machines, run /srv/deployment/analytics/aqs/deploy/test/test_local_aqs_urls.sh.
Troubleshooting Deployment
Issues with deployment to labs deploy
We had to ssh through the keyholder proxy:
SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service deployment-aqs01.deployment-prep.eqiad1.wikimedia.cloud
Issues with scap
- Depool machine
- Delete deployment directory
- Run puppet
- Try to deploy again.
Check deploy logs:
scap deploy-log -v
Check AQS logs:
sudo journalctl -u aqs
Journalctl might not have a lot of information, since by default Restbase is configured to push logs to logstash. To disable this behavior, remove the following from the AQS configuration file under /etc:
logging:
  name: aqs
  level: warn
  streams:
    # XXX: Use gelf-stream -> logstash
    - type: gelf
      host: localhost
      port: 12201
Manual AQS restart:
sudo systemctl restart aqs
Administration
Cassandra CLI
Cqlsh is a python-based CLI for executing Cassandra Query Language commands. To start cqlsh in beta (password is public, this is labs):
cqlsh -u cassandra -p cassandra 172.16.4.205
Load data into cassandra test keyspace
Creating a keyspace in production requires talking with Data Persistence. However, to test, we can log in with the aqs_testing user and create tables in the aqs_testing keyspace. You can find the credentials in this file on HDFS: /user/analytics/aqs_testing_password.txt. From the launcher machine: sudo -u analytics kerberos-run-command analytics hdfs dfs -cat /user/analytics/aqs_testing_password.txt
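For example, creating a throwaway table and a row in the aqs_testing keyspace might look like the sketch below. The table and column names are illustrative only, not a real AQS schema, and the host is one of the aqs-test instances mentioned earlier:

```shell
# Hedged sketch: create a test table and insert a row in the aqs_testing
# keyspace. Table/column names are illustrative; host is an aqs-test instance.
load_test_row() {
  cqlsh -u aqs_testing -p "$(cat aqs_testing_password.txt)" \
    aqs-test1001.analytics.eqiad1.wikimedia.cloud <<'CQL'
CREATE TABLE IF NOT EXISTS aqs_testing.example_pageviews (
    project text,
    article text,
    dt      text,
    views   bigint,
    PRIMARY KEY ((project, article), dt)
);
INSERT INTO aqs_testing.example_pageviews (project, article, dt, views)
VALUES ('en.wikipedia', 'Barack_Obama', '2016010100', 12345);
CQL
}
```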
(NOTE: removed section on loading data in beta, as that's no longer relevant)
Restbase status
On the host to check live requests:
elukey@aqs1003:~$ sudo httpry -i eth0 tcp
Check Restbase status:
elukey@aqs1003:~$ systemctl status aqs
● aqs.service - "aqs service"
Loaded: loaded (/lib/systemd/system/aqs.service; enabled)
Active: active (running) since Tue 2016-05-17 15:45:58 UTC; 1 day 21h ago
Main PID: 25226 (firejail)
CGroup: /system.slice/aqs.service
├─25226 /usr/bin/firejail --blacklist=root --blacklist=/home/* --tmpfs=/tmp --caps --seccomp /usr/bin/nodejs src/server.js -c /etc/aqs/config.yaml
├─25227 /usr/bin/nodejs src/server.js -c /etc/aqs/config.yaml
├─25254 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml
[...]
├─25493 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml
└─25504 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml
Cassandra status
Check Cassandra cluster status (UN == Up Normal):
# Please note the -a suffix, there is also another instance that can be inspected using -b
elukey@aqs1004:~$ nodetool-a status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.64.48.148 1.6 TB 256 24.5% ec437eff-af17-4863-b6ff-42f87ea86557 rack3
UN 10.64.0.213 1.8 TB 256 23.1% c1fcaa1e-fc38-4597-8794-a37d9831df74 rack1
UN 10.64.48.149 1.58 TB 256 24.4% 4d24db1d-fc2a-4ec9-9d43-3952d480ff7e rack3
UN 10.64.16.74 1.67 TB 256 24.9% 7d8443d7-3b81-401a-a46e-d15316d69a56 rack2
UN 10.64.48.122 1.81 TB 256 26.1% c1e9333f-2e7a-48cc-a0cc-c4db53930c22 rack3
UN 10.64.48.123 1.68 TB 256 25.0% a4cade35-16cd-427c-ac77-a51d3d12c3a3 rack3
UN 10.64.32.189 1.59 TB 256 24.1% de1f9797-9ee0-472f-9713-e9bc3c8a1949 rack2
UN 10.64.0.237 1.83 TB 256 26.8% 7f25e4fb-e1b5-4ae3-916d-446f94f4cca9 rack1
UN 10.64.32.190 1.64 TB 256 24.5% 38b46448-a547-4a4f-9e96-35a0e28ee796 rack2
UN 10.64.16.78 1.82 TB 256 26.6% ab0da954-4db8-4e68-8a84-7bca0cf7e8c4 rack2
UN 10.64.0.126 1.77 TB 256 25.7% a6c7480a-7f94-4488-a925-0cff98c5841a rack1
UN 10.64.0.127 1.61 TB 256 24.3% ed33d9e1-a654-4ca6-a232-bf97f32206ba rack1
elukey@aqs1004:~$ nodetool-a info
ID : a6c7480a-7f94-4488-a925-0cff98c5841a
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 1.77 TB
Generation No : 1606380190
Uptime (seconds) : 684961
Heap Memory (MB) : 8855.14 / 16384.00
Off Heap Memory (MB) : 3331.05
Data Center : eqiad
Rack : rack1
Exceptions : 0
Key Cache : entries 977802, size 400 MB, capacity 400 MB, 29643458 hits, 58426030 requests, 0.507 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 200 MB, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token : (invoke with -T/--tokens to see all 256 tokens)
Cassandra logs
The most useful one is /var/log/cassandra/system.log, which becomes system-a.log and system-b.log on aqs100[456], since we have two cassandra instances running:
elukey@aqs1004:/var/log/cassandra$ ls
gc-a.log.0.current gc-b.log.0.current system-a.log system-b.log system.log
Network Configuration
The AQS IPs are deployed in the Production network, while the Hadoop IPs are running in the Analytics network. The traffic flow is guarded by ACLs on switches/routers that need to be updated if you need to connect new AQS IPs to the Analytics network. For example, this is the error that we were getting from analytics1* hosts while trying to upload data to the aqs1010-a.eqiad.wmnet Cassandra instance:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: aqs1004-a.eqiad.wmnet/10.64.0.126:9042 (com.datastax.driver.core.TransportException: [aqs1004-a.eqiad.wmnet/10.64.0.126:9042] Cannot connect))
To solve the issue ops extended the existing ACL for aqs100[123].eqiad.wmnet to allow all the Cassandra Instances IPs too.
Deploy new History snapshot for Wikistats Backend
As of Q4 2018, every snapshot of mediawiki history we load into Druid is a new datasource named after the snapshot, for example "mediawiki-2021-10". AQS will not serve this data until told to do so (this is so we can easily roll back to a prior snapshot). First, check for the existence of a new snapshot on HDFS with:
hdfs dfs -ls /wmf/data/wmf/mediawiki/history_reduced/
Then, to enable a new snapshot, you need to change the hiera config for AQS that points to the active snapshot. See this patch for an example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/875364. Once merged and applied, you'll need to restart the aqs servers for it to take effect; the safest way is to ask an SRE to run the following from one of the cluster management hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet):
# Ensure that the new config is deployed on all hosts
elukey@cumin1001:~$ sudo cumin 'A:aqs' 'run-puppet-agent'
# Roll restart the aqs nodejs daemon
elukey@cumin1001:~$ sudo cookbook sre.aqs.roll-restart-reboot --query aqs --reason "Restart after deployment" restart_daemons
A quick note on caching: after deploying a new snapshot, you can check the data by hitting AQS directly with curl. Real people out in the world won't see the new data until they also clear their cache or it expires (14400 seconds / 4 hours).
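A sketch of such a check, comparing what an aqs host serves directly against what the public (cached) REST API serves. The specific endpoint path is illustrative; the direct port 7232 matches the production checks earlier in this page:

```shell
# Hedged sketch: compare fresh AQS data (direct) vs the public REST API
# (cached for up to 4 hours). The endpoint below is illustrative.
direct() {
  # Run from an aqs host: bypasses the caches entirely
  curl -s "http://localhost:7232/analytics.wikimedia.org/v1/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/20180101/20190101"
}
cached() {
  # What the outside world sees, possibly stale until the cache expires
  curl -s "https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/20180101/20190101"
}
# diff <(direct) <(cached)   # differences disappear once the cache expires
```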
Useful commands
Password
See: /etc/aqs/config.yaml
See table schema:
cassandra@cqlsh> describe table "local_group_default_T_pageviews_per_article_flat".data
Add fake data to Cassandra after wiping the cluster
cqlsh -u cassandra -p cassandra aqs1004-a -f /srv/deployment/analytics/aqs/deploy/scripts/insert_monitoring_fake_data.cql
This command ensures that no AQS-related alarm will fire.
Data filter before Cassandra load
Some data served by AQS comes from Cassandra. We use Airflow+Spark+HQL to feed the tables in Cassandra.
On HDFS, we have implemented a disallowed-articles table, wmf.disallowed_cassandra_articles, to filter out sensitive pages we don't want to appear in the top-viewed articles of a wiki.
Some attacks aim at manipulating the number of views per article; for example, the goal could be pushing traffic to a third-party site, or adding an offensive word to the top list, which millions of users view.
The disallowed list is used when loading these Cassandra tables:
* pageview_per_article_daily
* pageview_top_articles_daily
* pageview_top_percountry_daily
* pageview_top_articles_monthly
To update this list of disallowed articles:
- update the TSV in analytics/refinery: static_data/cassandra/disallowed_cassandra_articles.tsv
- prepare a patch and deploy it
For emergency procedures, you could run the following (Note that you still need a Gerrit patch as the next deployment of analytics/refinery will override your change):
# Fetch the disallowed list
ssh an-launcher1002.eqiad.wmnet
export TSV_FILENAME=disallowed_cassandra_articles.tsv
export TSV_HDFS_PATH="/wmf/refinery/current/static_data/cassandra/${TSV_FILENAME}"
hdfs dfs -cat $TSV_HDFS_PATH > $TSV_FILENAME
# Add or remove some entries (beware, tabs are expected between columns, not spaces)
vim $TSV_FILENAME
# Push the file back to HDFS
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -put -f $TSV_FILENAME $TSV_HDFS_PATH
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod +r $TSV_HDFS_PATH