Data Platform/Systems/AQS

From Wikitech

The Analytics Query Service (AQS) is a set of services that provide public-facing APIs that serve analytics data from both a Cassandra and a Druid backend. It is how Wikistats (stats.wikimedia.org) gets data to visualize.

Services

  • Device Analytics
    • Repository: generated-data-platform/aqs/device-analytics
    • API specification: swagger.json
    • Data source: Cassandra
  • Geo Analytics
    • Repository: generated-data-platform/aqs/geo-analytics
    • API specification: swagger.json
    • Data source: Cassandra
  • Media Analytics
    • Repository: generated-data-platform/aqs/media-analytics
    • API specification: swagger.json
    • Data source: Cassandra
  • Page Analytics
    • Repository: generated-data-platform/aqs/page-analytics
    • API specification: swagger.json
    • Data source: Cassandra
  • Edit Analytics
    • Repository: generated-data-platform/aqs/edit-analytics
    • API specification: swagger.json
    • Data source: Druid
  • Editor Analytics
    • Repository: generated-data-platform/aqs/editor-analytics
    • API specification: swagger.json
    • Data source: Druid
  • Commons Impact Metrics
    • Repository: generated-data-platform/aqs/commons-impact-analytics (GitLab)
    • API specification: swagger.json
    • Data source: Cassandra
  • Common functionality
    • AQS Assist: generated-data-platform/aqs/aqsassist
    • service-lib-golang: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/servicelib-golang
  • Test environments
    • Cassandra test env: generated-data-platform/aqs/aqs-docker-cassandra-test-env
    • Druid test env: generated-data-platform/aqs/aqs-docker-druid-test-env
  • QA
    • QA test suite: generated-data-platform/aqs/aqs_tests
  • User documentation site
    • Repository: generated-data-platform/aqs/analytics-api
    • Live link: doc.wikimedia.org/analytics-api

Monitoring

Grafana dashboards:

Project overview

Epic task in Phabricator | API Platform Team workboard | AQS2.0 workboard

  1. Done: Implement the new, stand-alone AQS service(s)
  2. Done: Deploy to k8s
  3. Done: Switch RESTBase from proxying requests to the old AQS service to proxying them to the new k8s-based one
    1. Phase 1: Unique devices endpoint (Device Analytics service)
    2. Phase 2: Pageviews and legacy endpoints (Page Analytics service), editors/by-country endpoint (Geo Analytics service), and media requests endpoints (Media Analytics service)
    3. Phase 3: Edited pages, edits, and bytes difference endpoints (Edit Analytics service) and editors and registered users endpoints (Editor Analytics service)
  4. In progress: Deprecate the http://{project}/api/rest_v1/metrics resources
  5. Eventually phase out the RESTBase /metrics hierarchy

Communicating user-facing changes

When making a user-facing change to an AQS API:

  1. Consult the stability policy for versioning guidance.
  2. Update the service's API spec and the documentation site.
  3. Update the combined changelog (source).
  4. Send an announcement to the analytics mailing list.

Running a service

AQS 2.0 consists of several repositories. Some correspond to individual services that expose APIs. Others correspond to cross-service common functionality or test environments.

Setup

You will need:

Go (aka "golang") is an opinionated language in various ways. Among these is that you're probably much better off keeping your Go code under your "GOPATH" rather than wherever you may be used to keeping code. (There are, of course, always ways for savvy developers to cheat the system. If you choose to do that, any consequences are on you.) On my Mac, I cloned all the AQS 2.0 repositories under ~/go/src/.

Start a service

Each service's README file contains details about running that particular service. In summary, you'll need to open several command-line (aka "terminal") windows or tabs and run commands in each. The following describes how to run the "pageviews" service. Other services operate similarly.

  • In one terminal, navigate to <GOPATH>/aqs-docker-test-env
  • Run "make startup", wait for it to say "Startup complete", then leave it running
  • In another terminal, also in <GOPATH>/aqs-docker-test-env, run "make bootstrap" and wait for it to complete
  • Navigate (either in that terminal or a different one) to <GOPATH>/pageviews
  • Run "make"
  • Run "./pageviews" (and leave it running)
  • In another terminal, navigate to <GOPATH>/pageviews and run "make test"
  • In your browser, visit http://localhost:8080/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Banana/daily/20190101/20190102

We haven't started the Druid-based endpoint(s) yet, but the process will likely be similar, with perhaps some differences in how to launch the test environment.

Tips and troubleshooting

Because Go is an opinionated language, it may refuse to run over seemingly small things, such as whitespace. If you see something like this:

   goimports: format errors detected

You can execute this to see what Go is unhappy about:

   goimports -d *.go

And this to automatically fix it:

   goimports -w *.go

Our services depend on several packages, including our own “aqsassist”, which is in active development. This means you may sometimes need to update dependencies for your local service to run. You can update all dependencies via:

   go get .

or update specific dependencies via something like:

   go get gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqsassist

Local Testing

This section explains how to prepare and run both local test environments (Cassandra and Druid) so that all the AQS services can be tested with the AQS QA test suite. Each environment runs as its own docker compose project containing the database and the services that depend on it, so the two projects together cover all Cassandra-based and Druid-based services.

The steps explained here are ready for almost all the services and for both test environments.

Currently, two test environments can be built and run by QA engineers to run the AQS test suite:

  • aqs-docker-cassandra-test-env-qa: A Cassandra test environment with the following services/containers: (available here)
    • cassandra-qa: A docker container with a Cassandra database already populated with some sample data
    • device-analytics-qa: A docker container that runs the service listening on port 8090
    • geo-analytics-qa: A docker container that runs the service listening on port 8091
    • media-analytics-qa: A docker container that runs the service listening on port 8092
    • page-analytics-qa: A docker container that runs the service listening on port 8093
  • aqs-docker-druid-test-env-qa: A Druid test environment with the following services/containers: (available here)
    • druid-qa: A docker container with a Druid database already populated with some sample data
    • editor-analytics-qa: A docker container that runs the service listening on port 8094
    • edit-analytics-qa: A docker container that runs the service listening on port 8095

Quick start (for QA engineers)

This quick start guide shows how to build, start, and populate the docker-compose project for both test environments, Cassandra and Druid:

  • Clone all service repositories that belong to the test env you want to run
  • Inside every service project, run make docker_qa to create the service docker image (it is named after the service: geo-analytics, editor-analytics, and so on)
  • For Cassandra test env:
    • Clone the aqs-docker-cassandra-test-env repository
    • Before creating the project, you can change the host port on which each service listens. For example, the mapping "8091:8080" in the docker-compose file exposes the default port of the service container, 8080, on port 8091 of the host. This is useful if you want to run and test all the services at the same time
    • Inside the test-env project, run make startup-qa to create the new docker-compose project
    • Inside the test-env project, run make bootstrap-qa to populate Cassandra in the new docker-compose project (it takes around 15 minutes to fully populate the database)
    • Take a look at your Docker Dashboard to be sure that a docker compose project called aqs-docker-test-env-qa has been created with all its services running
  • For Druid test env:
    • Clone the aqs-docker-druid-test-env repository
    • Go to the aqs-docker-druid-test-env-build folder and run docker build -t aqs-docker-druid-test-env . (this creates a Druid image already populated with the sample data)
    • Go to the aqs-docker-druid-test-env-run folder and run make startup-qa
    • Take a look at your Docker Dashboard to be sure that a docker compose project called aqs-docker-druid-test-env-qa has been created with all its services running
  • Try making a request (for instance to http://localhost:8091/metrics/editors/by-country/en.wikipedia/100..-edits/2018/11) to check that everything is working (change the port and the request according to the service you want to try)
  • The following are the ports where each service will be listening:
    • aqs-docker-test-env-qa:
      • device-analytics: 8090
      • geo-analytics: 8091
      • media-analytics: 8092
      • page-analytics: 8093
    • aqs-docker-druid-test-env-qa:
      • editor-analytics: 8094
      • edit-analytics: 8095
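
The port assignments above can be captured in a small shell helper, which is convenient when scripting requests against several services. This is only a sketch: the function names qa_port and qa_base_url are illustrative and are not part of any AQS repository. The ports are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any AQS repo): map a service name to the
# host port used by the QA docker-compose projects, as listed above.
qa_port() {
  case "$1" in
    device-analytics) echo 8090 ;;
    geo-analytics)    echo 8091 ;;
    media-analytics)  echo 8092 ;;
    page-analytics)   echo 8093 ;;
    editor-analytics) echo 8094 ;;
    edit-analytics)   echo 8095 ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Print the local base URL for a service, e.g. http://localhost:8091
qa_base_url() {
  port=$(qa_port "$1") || return 1
  echo "http://localhost:${port}"
}

# List every service together with its local base URL.
for svc in device-analytics geo-analytics media-analytics page-analytics \
           editor-analytics edit-analytics; do
  echo "$svc $(qa_base_url "$svc")"
done
```

With these functions loaded, the sample request above becomes curl "$(qa_base_url geo-analytics)/metrics/editors/by-country/en.wikipedia/100..-edits/2018/11".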

Full guide (in case you want to customize these environments)

This full guide describes all the steps needed to create a docker compose project, using the geo-analytics service and the Cassandra test env as an example. The service-related steps can be applied to any other AQS service to run it using docker. Both test environments (Cassandra and Druid) can be extended with more services by following the same pattern.

Service config changes

These steps modify the config.yaml file in geo-analytics.

Needed changes:

  • config.yaml file: Change the Cassandra hostname and the service listen_address so that the service runs properly in the new AQS test env via docker.

Change the Cassandra host to "cassandra". This is the name of the service defined in the docker compose config file used to run the QA test environment (cassandra + geo), and it is what allows the service container (geo-analytics in this case) to connect to the Cassandra one. Also change the listen_address property to 0.0.0.0 to accept connections from outside the service container, so that you can run the QA test suite from your host.

 . . .
 listen_address: 0.0.0.0
 . . .
 cassandra:
   hosts: cassandra
 . . .

Create the service image

These steps describe how it was done for the geo-analytics service. The same steps apply to any other AQS service.

Needed changes:

 docker_qa: ## create a docker container to run QA test via Docker
  curl -s "https://blubberoid.wikimedia.org/v1/production" -H 'content-type: application/yaml' --data-binary @".pipeline/blubber.yaml" > Dockerfile
  docker build -t geo-analytics .

Needed steps

  • If not available, put the blubber file into the .pipeline folder (change the entry_point according to the service where you are putting this file)
  • If not available, put the config file into the .pipeline folder (this file is the same for all the services)
  • Add the new target to the Makefile
  • Now you can build the service image with the following command:
make docker_qa

Build the QA test environment (for cassandra-based services)

These steps must be run in the aqs-docker-test-env project folder.

Needed changes:

All changes described here are available in the aqs-docker-cassandra-test-env

The docker-compose-qa.yml file defines a new docker-compose project composed of a Cassandra engine and a sample service (geo-analytics in this case). If you take a look at this file, you will see that the service-specific part can be customized to add any other Cassandra-based service you want to include in this dockerized env. Currently, some of these services are already included:

A service has to be added to define the Cassandra container. For this specific service we also add a healthcheck property that defines how to tell when Cassandra is available. That way, service containers start only once the database is available, which avoids failures caused by connecting to it too early:

 cassandra:
    image: cassandra:3.11
    container_name: cassandra-qa
    ports:
      - "9042:9042"
    volumes:
      - .:/env
    networks:
      - network1
    healthcheck:
      test: ["CMD-SHELL", "[ $$(nodetool statusgossip) = running ]"]
      interval: 20s
      timeout: 10s
      retries: 5

And another one for each service you want to add to this test environment (in this case we are adding geo-analytics):

 geo-analytics:
  image: geo-analytics
  container_name: geo-analytics-qa
  ports:
   - "8091:8080"
  networks:
   - network1
  depends_on:
      cassandra:
        condition: service_healthy

  • New targets in the Makefile:

 bootstrap-qa: schema-qa load-qa
 schema-qa:
    docker exec -it cassandra-qa cqlsh -f /env/schema.cql
 load-qa:
    docker exec -it cassandra-qa cqlsh -f /env/test_data.cql --cqlshrc=/env/cqlshrc
 startup-qa:
    docker-compose -f docker-compose-qa.yml up -d

Needed steps:

Once you have customized both files according to the specified changes (docker-compose-qa.yml and Makefile), you can build and run the new dockerized test environment:

  • Create the docker-compose project:
 make startup-qa
  • Populate the cassandra container
make bootstrap-qa

The QA test env and the included services are now running as docker containers within a docker compose project. The compose project is named aqs-docker-test-env-qa.

Each service will be listening on a different port according to the service configuration. For example, in this case geo-analytics will be listening on the port 8091.
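
Since the service containers start only after the Cassandra healthcheck passes, it can help to wait for that state before running make bootstrap-qa or sending requests. The following is a minimal sketch and is not part of the test-env repository: wait_healthy is a hypothetical helper that polls the health status Docker reports for a container, and the retry count and delay are arbitrary defaults.

```shell
#!/bin/sh
# Hypothetical helper: poll a container's health status until it is "healthy".
# Usage: wait_healthy <container> [max_tries] [delay_seconds]
wait_healthy() {
  container=$1
  tries=${2:-30}
  delay=${3:-10}
  while [ "$tries" -gt 0 ]; do
    # .State.Health.Status is only set for containers that define a healthcheck.
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  echo "$container did not become healthy" >&2
  return 1
}

# Example (commented out so that loading this file has no side effects):
# wait_healthy cassandra-qa && make bootstrap-qa
```

The container name cassandra-qa comes from the compose definition above. The same function could be pointed at any other container that defines a healthcheck.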

Build and run the QA test environment (for druid-based services)

These steps must be run in the aqs-docker-druid-test-env project folder.

Needed changes:

All changes described here are available in the aqs-docker-druid-test-env

The docker-compose-qa.yml file defines a new docker-compose project composed of a Druid engine and a sample service (editor-analytics in this case). If you take a look at this file, you will see that the service-specific part can be customized to add any other Druid-based service you want to include in this dockerized env. Currently, some of these services are already included:

A service has to be added to define the Druid container. For this specific service we also add a healthcheck property that defines how to tell when Druid is available. That way, service containers start only once the database is available, which avoids failures caused by connecting to it too early:

druid:
 image: bpirkle/aqs-docker-druid-test-env:latest
 container_name: druid-qa
 ports:
  - "8888:8888"
  - "8082:8082"
 networks:
  - network1
 healthcheck:
  interval: 10s
  retries: 9
  timeout: 90s
  test:
   - CMD-SHELL
   - nc -z 127.0.0.1 8888

And another one for each service you want to add to this test environment (in this case we are adding editor-analytics):

 editor-analytics:
  image: editor-analytics
  container_name: editor-analytics-qa
  ports:
   - "8094:8080"
  networks:
   - network1
  depends_on:
      druid:
        condition: service_healthy

  • New targets in the Makefile (inside the aqs-docker-druid-test-env-build folder):

startup-qa:
	docker-compose -f docker-compose-qa.yml up -d
shutdown-qa:
	docker-compose -f docker-compose-qa.yml down

Needed steps:

  • Once you have customized both files according to the specified changes (docker-compose-qa.yml and Makefile), you can build the new test environment docker image (from the aqs-docker-druid-test-env-build folder):
docker build -t aqs-docker-druid-test-env .
  • After creating the image, you can start the docker-compose project (from the aqs-docker-druid-test-env-run folder):
 make startup-qa

The QA test env and the included services are now running as docker containers within a docker compose project. The compose project is named aqs-docker-druid-test-env-qa.

Each service will be listening on a different port according to the service configuration. For example, in this case editor-analytics will be listening on the port 8094.

Demos

Notes for developers

  • Developers need to keep an additional config-devel.yaml file that differs from config.yaml only in the Cassandra host and the listen_address, both set to "localhost", so that a locally built service can connect to the test env while developing (something similar will be needed for Druid-based services). To run the service using an alternative config file:
 make clean build && ./geo-analytics --config config-devel.yaml
  • The two files differ as follows:

 Setting           config.yaml    config-devel.yaml
 listen_address    0.0.0.0        localhost
 cassandra.hosts   - cassandra    - localhost

  • The Dockerfile (created by blubber) and config-devel.yaml should be added to .gitignore
  • A .dockerignore file should be added to the repo so that an already built service binary is not used when creating the service docker image (the binary must not be copied into the dockerized environment because it has to be built inside the right docker container)

To keep in mind

  • We need to change the host of the database (to the name of the service, cassandra) so that the service container can connect to the Cassandra one
    • This doesn't affect production because the config file is replaced automatically on deployment. We can use cassandra in the default config file and create config-devel.yaml with "localhost" for development
  • We need to allow remote connections (listen on 0.0.0.0:8080 instead of localhost:8080) so that our host can connect to the service (Postman, curl, . . .)
    • 0.0.0.0 can be the default value in the config file for all the services. That value is automatically replaced by the right one when deploying to production
  • In the end we need to keep two Cassandra containers: the one for developing and the one for QA (a docker compose with two services)
    • This is not really a problem because developers usually don't start the QA one and QA engineers don't use the development one
    • We can keep both test envs installed (aqs-docker-test-env and aqs-docker-test-env-qa), but we cannot run both at the same time because they listen on the same port

Adding a Wiki

Data generated by a wiki goes through two main data pipelines before it ends up in AQS. Readership data flows from our Varnish caches, through Kafka, into private and public datasets. The public pageviews data ends up in the Pageview API, served by AQS from Cassandra. Editing data flows more slowly, through monthly whole-history loads from MediaWiki database replicas. This ends up in Druid as mediawiki_history_reduced and is also served by AQS.

To add a new wiki, you need to edit the include lists for these two pipelines:

Developing a New Endpoint

This information is outdated. See phab:T356748.

This is roughly how writing a new AQS endpoint goes. Note that some endpoints may use Druid rather than Cassandra, so that process may be different:

  • Development
    • Write the Oozie job to move data from Hadoop to Cassandra; Verify the output is correct by outputting to plain JSON/test Hive table; The Oozie job will be unable to load into Wikimedia Cloud Cassandra instances (You just have to hope that the loading works)
    • Write the AQS endpoint, which includes the table schema spec and unit tests
  • Testing
    • Tunnel to any of the aqs-test Cassandra instances (i.e., aqs-test1001.analytics.eqiad1.wikimedia.cloud, ..., aqs-test1010.analytics.eqiad1.wikimedia.cloud)
    • Create a keyspace in this cloud Cassandra instance of the same name as is detailed in the Oozie job properties file; Insert some data into the table
    • From any of the aqs-test machines, run your version of AQS with the new endpoint, pointing to one of the Cassandra instances (e.g., 172.16.4.205); Test by manually running queries against your local instance (i.e., localhost:7231)
  • Productionize
    • Deploy the Oozie job, but don't run it
    • Deploy AQS to point to the new data; Once it is running, it will automatically create the relevant keyspace in the production Cassandra instance
    • Manually run a few queries against the new AQS endpoint (i.e., aqs1010; localhost:7232), and ensure that they all respond with 404 (because no data is loaded into Cassandra yet)
    • Run the Oozie job to load the data into Cassandra
    • Manually run a few queries against the new AQS endpoint, and ensure that they all respond with the proper responses, as the data should now be loaded into Cassandra
    • Submit a pull request for the restbase repository on GitHub with the schema of the new endpoint added; Now the endpoint should be publicly accessible!
    • Update all relevant documentation

Deployment

This section assumes that the microservice has been deployed and provides instructions for subsequent releases.

Prerequisites

  1. Ensure access: Confirm you are part of the deployment group. If not, see the process for filing a production access request.
  2. Read documentation: Review the deployment pipeline documentation.

Quick guide

If you've previously deployed an AQS service, you can use this quick reference guide to run all the necessary commands after merging the deployment patch (sample patch). Otherwise, see the step-by-step guide.

# Do the following once you have prepared the change and the change has been reviewed and merged
ssh deployment.eqiad.wmnet  # or: ssh deployment.codfw.wmnet
cd /srv/deployment-charts
git log -n 1
# Check that the change has been pulled
cd helmfile.d/services/your-service-name
# Check changes and deploy to staging
helmfile -e staging diff
helmfile -e staging apply
# Check changes and deploy to production (eqiad)
helmfile -e eqiad diff
helmfile -e eqiad apply
# Check changes and deploy to production (codfw)
helmfile -e codfw diff
helmfile -e codfw apply

Step-by-step instructions

1. Prepare a deployment patch

Clone the deployment-charts repository and prepare a new patch that changes the image version in the appropriate values file for the service you want to deploy.

Every service has a couple of values files in the helmfile.d/services/your-service-name folder:

  • values.yaml: The file we will use to deploy to production. Always deploy to staging first to check that everything is working correctly.
  • values-staging.yaml: The file we will use to deploy only to staging to test something you are working on but you don't want to deploy to production yet. After testing, you will have to revert the change to this values-staging.yaml file.

The image version is the only thing we have to change to deploy a new version of any service. The version format depends on the service you want to deploy:

  • AQS Services: These are the services that reside in Wikimedia Gerrit. They use the format YYYY-MM-DD-HHMMSS-production (e.g., 2024-06-05-094107-production) to identify the Docker image. This image name can be found in the pipeline output in Gerrit after merging the deployment patch. Keep in mind we always use the production variant.
  • CIM: The Commons Impact Metrics service uses a version tag like v1.0.1 which should coincide with the tag associated with the version you are deploying. You can also find the right version in the pipeline output after creating the tag in the build-and-publish-production-image step. For example (source):
    . . .
    #23 pushing layers 3.6s done
    #23 pushing manifest for docker-registry.discovery.wmnet/repos/generated-data-platform/aqs/commons-impact-analytics:v1.0.5@sha256:b1d92fb52b56a6ba7252c6c1b0628bc707c39c6bcd325632c41da10e1261b2c6
    #23 pushing manifest for docker-registry.discovery.wmnet/repos/generated-data-platform/aqs/commons-impact-analytics:v1.0.5@sha256:b1d92fb52b56a6ba7252c6c1b0628bc707c39c6bcd325632c41da10e1261b2c6 0.5s done
    #23 pushing layers 0.2s done
    . . .
    

You can always check the Wikimedia Docker Registry to find any image and version you need. However, keep in mind that this registry updates every 4 hours, so any newly created image won’t appear there until that time has passed.
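
As a quick sanity check before editing the values file, the two version formats described above can be told apart with a pattern match. This sketch is illustrative only: image_tag_kind is a hypothetical function name, and the patterns are inferred from the examples on this page (2024-06-05-094107-production and v1.0.5), not from a published specification.

```shell
#!/bin/sh
# Hypothetical helper: classify an image version string by its format.
# The patterns are assumptions inferred from the examples on this page.
image_tag_kind() {
  if printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}-production$'; then
    echo "timestamp"     # format used by the Gerrit-hosted AQS services
  elif printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "version-tag"   # format used by Commons Impact Metrics
  else
    echo "unknown"
    return 1
  fi
}

image_tag_kind 2024-06-05-094107-production   # prints "timestamp"
image_tag_kind v1.0.5                         # prints "version-tag"
```

A non-zero return for unrecognized strings makes the function usable as a guard in a script, for example to stop before committing a values file with a mistyped version.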

Open a patch for the change and get a +2 approval to merge it. Here's a sample change for geo-analytics. Once your patch has been merged, you are ready to deploy the service.

2. Access the deployment server

The first step to deploy your service is to access the deployment server where the deployment-charts repository is automatically pulled at /srv/deployment-charts/.

  1. SSH into the deployment server. For example: ssh deployment.eqiad.wmnet or ssh deployment.codfw.wmnet
  2. Once you have entered the deployment server, go to /srv/deployment-charts where the repository is automatically pulled.
    cd /srv/deployment-charts/
    
  3. Once there, check the last change to be sure that your merged patch has been pulled already (it may take a few seconds until your change is pulled):
    git log -n 1
    

If you can see the patch you have recently pushed, your change is ready to be deployed.

3. Deploy to staging

Before deploying, verify the changes that are going to be deployed:

  1. Go to the service's folder:
    cd helmfile.d/services/your-service-name/
    
  2. Check which changes are ready to be deployed (they appear in green):
    helmfile -e staging diff
    

If you are ok with the changes, you can deploy to the staging environment:

helmfile -e staging apply

4. Verify the service on staging

Before deploying to production, make a service request to check if the service works properly in staging. Choose any request you want to test and keep in mind that the base URL can be different depending on the service you want to test:

  • device-analytics
    • Base URL: https://staging.svc.eqiad.wmnet:4972
    • Example: curl https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/all-wikipedia-projects/all-sites/monthly/20231001/2023110100
  • media-analytics
    • Base URL: https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
    • Example: curl https://media-analytics.k8s-staging.discovery.wmnet:30443/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/20210101/20230220
  • page-analytics
    • Base URL: https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
    • Example: curl https://page-analytics.k8s-staging.discovery.wmnet:30443/metrics/pageviews/aggregate/all-projects/all-access/all-agents/hourly/2021010100/2021010216
  • geo-analytics
    • Base URL: https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
    • Example: curl https://geo-analytics.k8s-staging.discovery.wmnet:30443/metrics/editors/by-country/ru.wikipedia/5..99-edits/2018/01
  • edit-analytics
    • Base URL: https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
    • Example: curl https://edit-analytics.k8s-staging.discovery.wmnet:30443/metrics/edits/aggregate/en.wikipedia/user/content/daily/20220301/20220302
  • editor-analytics
    • Base URL: https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
    • Example: curl https://editor-analytics.k8s-staging.discovery.wmnet:30443/metrics/editors/aggregate/ab.wikipedia/name-bot/all-page-types/all-activity-levels/monthly/20210302/20220901
  • commons-analytics
    • Base URL: https://SERVICE-NAME.k8s-staging.discovery.wmnet:30443
    • Example: curl https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443/metrics/commons-analytics/category-metrics-snapshot/UNESCO/20231101/20231120

Keep in mind that you cannot directly access these URLs. Instead, you must run the curl command from a stat machine (such as stat1008.eqiad.wmnet). If something isn't working as expected, see the "Troubleshooting deployment" section.
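
The staging base URLs follow a simple pattern, with device-analytics as the exception. As a sketch, a shell function can return the right base URL for a given service. staging_base_url is a hypothetical name, and the hosts and ports are copied from the examples above rather than looked up from any service registry.

```shell
#!/bin/sh
# Hypothetical helper: print the staging base URL for an AQS service,
# following the examples above.
staging_base_url() {
  case "$1" in
    device-analytics)
      # device-analytics is listed with a different host and port.
      echo "https://staging.svc.eqiad.wmnet:4972" ;;
    media-analytics|page-analytics|geo-analytics|edit-analytics|editor-analytics|commons-impact-analytics)
      echo "https://$1.k8s-staging.discovery.wmnet:30443" ;;
    *)
      echo "unknown service: $1" >&2
      return 1 ;;
  esac
}

staging_base_url geo-analytics
# prints https://geo-analytics.k8s-staging.discovery.wmnet:30443
```

From a stat machine, a verification request could then be written as curl "$(staging_base_url geo-analytics)/metrics/editors/by-country/ru.wikipedia/5..99-edits/2018/01".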

5. Deploy to production

Once you have deployed to staging and verified that the service is running correctly, you are ready to deploy to production (both the eqiad and codfw servers). The following commands assume that you are still in the folder from which you deployed to the staging environment:

# Check changes and deploy to production (eqiad)
helmfile -e eqiad diff
helmfile -e eqiad apply
# Check changes and deploy to production (codfw)
helmfile -e codfw diff
helmfile -e codfw apply

6. Verify the service in production

To verify the service in production, try one of the following production API requests. The base URL is the same for all services: https://wikimedia.org/api/rest_v1/metrics.

  • device-analytics: https://wikimedia.org/api/rest_v1/metrics/unique-devices/all-wikipedia-projects/all-sites/monthly/20231001/2023110100
  • media-analytics: https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/20210101/20230220
  • page-analytics: https://wikimedia.org/api/rest_v1/metrics/pageviews/top-by-country/en.wikipedia/all-access/2020/12
  • geo-analytics: https://wikimedia.org/api/rest_v1/metrics/editors/by-country/en.wikipedia/5..99-edits/2023/02
  • edit-analytics: https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia/user/content/daily/20220301/20220302
  • editor-analytics: https://wikimedia.org/api/rest_v1/metrics/editors/aggregate/ab.wikipedia/name-bot/all-page-types/all-activity-levels/monthly/20210302/20220901
  • commons-analytics: https://wikimedia.org/api/rest_v1/metrics/commons-analytics/category-metrics-snapshot/Gallica/20240101/20240501

Troubleshooting deployment

If something isn't working as expected, verify whether the service has been deployed properly. Here are a few steps you can take to troubleshoot:

Check pods

Enter the Kubernetes environment (staging, eqiad or codfw):

kube-env your-service-name eqiad

Check pods to be sure that a new pod has been created and started recently (output is added below the command):

kubectl get pods

NAME                                     READY   STATUS    RESTARTS   AGE
aqs-http-gateway-main-7cfb86cb9c-dj5jl   2/2     Running   0           3s
aqs-http-gateway-main-7cfb86cb9c-gtth5   2/2     Running   0           3s
aqs-http-gateway-main-7cfb86cb9c-hgv6l   2/2     Running   0           3s
aqs-http-gateway-main-7cfb86cb9c-pwr7b   2/2     Running   0           3s

Check logs

If something is not working properly, check the pod's logs:

kubectl logs aqs-http-gateway-main-7cfb86cb9c-dj5jl -c aqs-http-gateway-main

{"@timestamp":"2024-06-20T21:07:55Z","message":"CASSANDRA_USERNAME and CASSANDRA_PASSWORD env vars unset, using values from configuration file","log":{"level":"WARNING"},"service":{"name":"geo-analytics"}}
{"@timestamp":"2024-06-20T21:07:55Z","message":"initializing service geo-analytics (Go version: 7dd406b, Build host: buildkitsandbox, Item: 2024-06-05T11:04:57:UTC","log":{"level":"INFO"},"service":{"name":"geo-analytics"}}
. . .
. . .

Data filter before Cassandra load

Some data used by AQS comes from Cassandra. We are using Airflow+Spark+HQL to feed the tables on Cassandra.

On HDFS, we maintain a table, `wmf.disallowed_cassandra_articles`, that is used to filter out sensitive pages we don't want to appear in the top-viewed articles of a wiki.

This matters because some attacks aim to manipulate the number of views per article. For example, the goal could be to push traffic to a third-party site or to get an offensive word into the top list, which millions of users view.

The filter is applied when loading the following Cassandra tables:

  • pageview_per_article_daily
  • pageview_top_articles_daily
  • pageview_top_percountry_daily
  • pageview_top_articles_monthly

To update this list of disallowed articles:

  • update the TSV in analytics/refinery: `static_data/cassandra/disallowed_cassandra_articles.tsv`
  • prepare a patch and deploy it

For emergency procedures, you could run the following (Note that you still need a Gerrit patch as the next deployment of analytics/refinery will override your change):

# Log in to the launcher host and run the remaining commands there
ssh an-launcher1002.eqiad.wmnet
# Fetch the disallowed list
export TSV_FILENAME=disallowed_cassandra_articles.tsv
export TSV_HDFS_PATH="/wmf/refinery/current/static_data/cassandra/${TSV_FILENAME}"
hdfs dfs -cat "$TSV_HDFS_PATH" > "$TSV_FILENAME"
# Add or remove some entries (beware, tabs are expected between columns, not spaces)
vim "$TSV_FILENAME"
# Push the file back to HDFS and make it readable
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -put -f "$TSV_FILENAME" "$TSV_HDFS_PATH"
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod +r "$TSV_HDFS_PATH"

API documentation

To create API reference docs that are reliable and easy to update, AQS 2.0 services use:

Documentation monitoring

The AQS documentation site uses Matomo to collect analytics data. To view the data, follow the instructions on the wiki to log in to Matomo, and select "AQS documentation" from the "websites" dropdown.

Updating the docs

When making a change to an AQS service, you must run these commands locally to update the API spec before submitting the change. As of 2023, there is no integration to update the spec automatically, so each patch must include the corresponding changes to the API spec when needed. Remember that the docs rely on code annotations, so make sure to keep the annotations up to date with any code changes.

1. Install swag:

go install github.com/swaggo/swag/cmd/swag@latest

2. Generate the spec:

make docs

Swag outputs the spec in YAML and JSON formats to a /docs directory.

Reading the specification

You can view the spec using the API spec reader.

Setting up docs for a new service

To set up API docs for a new AQS service:

  1. Annotate main.go (example, style guide): Anywhere in main.go, add annotations to document general information about the API.
  2. Annotate handler (example, style guide): Add annotations to any code file to document an endpoint. Endpoint annotations should be stored as close as possible to the code they describe. The block of endpoint annotations must end on a line immediately preceding a function.
  3. Annotate entity (example, style guide): Swag automatically gets information about the response format from the struct. To complete the schema in the docs, add these elements to the struct definition:
    1. an example value within the JSON encoding definition using the syntax example:"example value". Note that these examples will be used in the sandbox and should return valid data.
    2. a description of the attribute as an inline comment
  4. Add a make docs command: Add these lines to the service's Makefile:
    docs:  ## creates openapi spec (requires swag)
    	swag init || (echo "Hint: If you haven't installed swag, run 'go install github.com/swaggo/swag/cmd/swag@latest', then re-run 'make docs'."; exit 1)
    
  5. Generate the spec: Run make docs to generate the spec.
  6. Commit the spec files: After generating the spec, commit the /docs directory to the source code repository. Since spec generation does not currently run in CI, the API spec must be stored in the repository in order to be served by the API spec endpoint.
  7. Add an endpoint to serve the API spec (example: main.go, handler, test): To make the docs publicly available, add an endpoint that serves the docs/swagger.json file via service-name/api-spec.json, for example device-analytics/api-spec.json.
  8. Route the spec endpoint: As part of setting up routing for the new service in the REST Gateway, ensure that the spec endpoint is served at service-name/api-spec.json, for example device-analytics/api-spec.json.
  9. Add reference documentation to the AQS docs site: Submit a merge request to analytics-api that:
    1. Adds a new Markdown file under /reference using the #API reference template.
    2. Adds the new page to the navigation in the config file under API reference.
    3. Adds an entry to the changelog announcing that the new endpoints are available.
    To preview the site locally, follow the instructions in the README. Note that the API reference docs won't appear until the API spec endpoint has been routed and deployed.

API reference template

To use this template, replace everything in square brackets, and remove the square brackets. For an example, see the page analytics reference page.

---
title: [Service name]
description: Reference docs and sandbox for [service name]
---

# [Service name]

[Description of what data the service provides and any limitations users should be aware of, such as the date the data is available from. Use h2 headings to separate limitation sections if necessary.]

## [Endpoint name]

<SpecView specUrl="[spec endpoint URL]" path="get [endpoint path, not including https://wikimedia.org/api/rest_v1/metrics/]"/>

[Repeat the h2 section above for each endpoint. If necessary, group the endpoints into groups, using an h2 heading for the group name and h3 headings for the endpoints.]

Spec validation

API specs are validated against the schemas in aqs tests.

Decision records

Historical

AQS was originally developed as a single service with an API proxied through RESTBase. As part of the effort to deprecate RESTBase, the /metrics endpoints served by AQS 1.0 were migrated to a set of services that do not depend on RESTBase. This project was referred to as AQS 2.0. See phab:T263489 for the original proposal.

Scaling for AQS 1.0