Data Engineering/Systems/AQS
The Analytics Query Service (AQS) is a public-facing API that serves analytics data from both a Cassandra and a Druid backend. It is how Wikistats (stats.wikimedia.org) gets the data it visualizes.
- Maintained by mw:Data_Platform_Engineering/Data_Products
- Issue tracker: Phabricator (Report an issue)
Adding a Wiki
Data generated by a wiki goes through two main data pipelines before it ends up in AQS. Readership data flows from our varnish caches, through Kafka, into private and public datasets. The public pageviews data ends up in the Pageview API, served by AQS from Cassandra. Editing data flows more slowly, through monthly whole-history loads from the mediawiki database replicas. This ends up in Druid as mediawiki_history_reduced and is also served by AQS. (TODO: linkify all this)
To add a new wiki, you need to edit the include lists for these two pipelines:
Hosted API
- Analytics/AQS/Pageviews
- Analytics/AQS/Devices Analytics
- Analytics/AQS/Wikistats 2
- Analytics/AQS/Mediarequests
- Analytics/AQS/Geoeditors
More up-to-date info at: https://wikimedia.org/api/rest_v1/?doc#/
Scaling: Settings, Failover and Capacity Projections
Monitoring
Grafana dashboards:
- Cassandra: https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1
- Druid: https://grafana.wikimedia.org/d/000000538/druid
- PageViews: https://grafana.wikimedia.org/dashboard/db/pageviews [Broken Link T211982]
- ElasticSearch: https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase => Query: analytics.wikimedia.org
Throttling
2016-05-26
Summary: throttling is enforced at the restbase/AQS layer, so requests that are served by varnish are not throttled. This is an important point. It means that the throughput of the API on the top endpoints is very high, because the same data is requested over and over; on those endpoints we mostly serve "daily top" data. Throttling is applied per (IP/endpoint/second), and a client that exceeds the limit receives a 429 response code to its http request.
At the time of this writing, throttling is set to trigger at a given number of requests per (IP/endpoint/second), and thus far we are only logging when limits are breached; we are not enforcing throttling quite yet. Why? Because if we get more than 30 concurrent requests in Cassandra at any one time, Cassandra lookups time out. This will likely no longer be true after we finish our work on scaling the storage layer of the API.
Ticket in which we discussed throttling: [1]
Throttling limits breached are logged in to: https://logstash.wikimedia.org/#/dashboard/temp/AVTsUtpi_LTxu7wlBfI-
Config for throttling is at: https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml
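For client authors, a minimal sketch of coping with the 429 responses described above. This is an illustration, not official client guidance; the endpoint URL and retry limits here are assumptions:

```shell
# Hedged sketch: exponential backoff when AQS throttles with a 429.
# The URL and the 5-attempt cap are illustrative choices.

backoff_delay() {
  # Delay in seconds for attempt N: 2^N (1 << N)
  echo $((1 << $1))
}

fetch_with_backoff() {
  local url="$1" attempt=0 code
  while [ "$attempt" -lt 5 ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" != "429" ]; then
      echo "$code"
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example call (not run here):
# fetch_with_backoff 'https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2016/05/01'
```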
2016-09-21
After the scaling work and our load tests, throttling limits were bumped up to 400 requests per second. See: Analytics/AQS/Scaling#Load_testing
Developing a New Endpoint
This is roughly how writing a new AQS endpoint goes. Note that some endpoints may use Druid rather than Cassandra, so that process may be different:
- Development
- Write the Oozie job to move data from Hadoop to Cassandra; verify the output is correct by writing to plain JSON or a test Hive table. The Oozie job will be unable to load into Wikimedia Cloud Cassandra instances (you just have to hope that the loading works)
- Write the AQS endpoint, which includes the table schema spec and unit tests
- Testing
- Tunnel to any of the aqs-test Cassandra instances (i.e., aqs-test1001.analytics.eqiad1.wikimedia.cloud, ..., aqs-test1010.analytics.eqiad1.wikimedia.cloud)
- Create a keyspace in this cloud Cassandra instance of the same name as is detailed in the Oozie job properties file; Insert some data into the table
- From any of the aqs-test machines, run your version of AQS with the new endpoint, pointing to one of the Cassandra instances (e.g., 172.16.4.205); Test by manually running queries against your local instance (i.e., localhost:7231)
- Productionize
- Deploy the Oozie job, but don't run it
- Deploy AQS to point to the new data; Once it is running, it will automatically create the relevant keyspace in the production Cassandra instance
- Manually run a few queries against the new AQS endpoint (i.e., aqs1010; localhost:7232), and ensure that they all respond with 404 (because no data is loaded into Cassandra yet)
- Run the Oozie job to load the data into Cassandra
- Manually run a few queries against the new AQS endpoint, and ensure that they all respond with the proper responses, as the data should now be loaded into Cassandra
- Submit a pull request for the restbase repository on GitHub with the schema of the new endpoint added; Now the endpoint should be publicly accessible!
- Update all relevant documentation
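The "expect 404 before the load, then proper responses after" checks above can be sketched as a small helper. The hostname/port match the steps above, but the endpoint path is hypothetical:

```shell
# Hedged sketch of the pre/post-load endpoint checks described above.
# aqs1010 / localhost:7232 as in the steps; the endpoint path is illustrative.

aqs_status() {
  # Print only the HTTP status code for an AQS URL
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

check_endpoint() {
  # $1: URL, $2: expected status ("404" before the Oozie load, "200" after)
  local code
  code=$(aqs_status "$1")
  if [ "$code" = "$2" ]; then
    echo "OK: $1 -> $code"
  else
    echo "FAIL: $1 -> $code (expected $2)" >&2
    return 1
  fi
}

# Before loading data (expect 404):
# check_endpoint 'http://localhost:7232/analytics.wikimedia.org/v1/my-new-endpoint/params' 404
# After the Oozie job has run (expect 200):
# check_endpoint 'http://localhost:7232/analytics.wikimedia.org/v1/my-new-endpoint/params' 200
```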
Deployment
This step-by-step guide covers deploying to both staging (beta) and production. Watch out for specific differences between beta and prod in each step of this section.
Step 0: Testing AQS locally
With cassandra
Testing your change in our staging environment (beta) requires either having a stable patch merged in AQS and deployed through scap, or a lot of git black magic and messing around that you shouldn't do. A good solution for quick testing is setting up your own mini AQS on your local machine, where you can make changes to the APIs instantly, update dependencies, load data, and so on, without switching between machines or sending gerrit patches.
- Install Zookeeper (brew install zookeeper on Mac).
- Install Cassandra (brew install cassandra@2.2 on Mac; be aware that without the @2.2, brew will install version 3, which we don't have yet in production).
- Make sure you're using the right Java version (8). Cassandra will complain a lot about Java 9 and 10, so make sure that your JAVA_HOME environment variable points to your Java 8 installation (/usr/libexec/java_home -V will show the versions currently installed). To do that, set export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
- Once the Cassandra service is running, start AQS by running the server with the default config provided in the repo: ./server.js -c config.example.wikimedia.yaml
- To load data or make changes in Cassandra, run cqlsh
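Assuming a Mac with Homebrew (as in the steps above), the whole setup condenses to something like the following sketch. This is not verified on every setup; adjust paths as needed:

```shell
# Hedged sketch of the local AQS setup steps above (macOS/Homebrew assumed).
setup_local_aqs() {
  brew install zookeeper
  brew install cassandra@2.2                           # @2.2 matters: plain "cassandra" installs v3
  export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)    # Cassandra 2.2 wants Java 8
  brew services start cassandra@2.2
  # From a checkout of the AQS source repo:
  ./server.js -c config.example.wikimedia.yaml
}
```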
With druid
The easiest way to test AQS Druid integration is to use the production Druid cluster (AQS can only query Druid, so no data loss is possible). Follow these steps:
- Start an SSH tunnel between your machine and the druid-public broker: ssh -N druid1010.eqiad.wmnet -L 8082:druid-public-broker.svc.eqiad.wmnet:8082
- Start AQS locally with the appropriate configuration, as suggested in this gerrit patch (WARNING: update the datasources: mediawiki_history setting to the correct value for testing).
- You should now be able to send druid-oriented queries to your local AQS (for instance: http://localhost:7231/analytics.wikimedia.org/v1/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/20180101/20190101)
Step 1: Update the AQS deploy repository
Note: Be aware that this process requires having Docker installed as an instantiation of docker is done when building.
Note: Even if you're deploying to staging (beta), the code you want to deploy should be merged to master. Otherwise, the whole deployment process won't work.
- If it's the first time you deploy:
  - Get the deploy repository: git clone ssh://$USER@gerrit.wikimedia.org:29418/analytics/aqs/deploy .
  - Make sure the AQS source git repo has the deploy.dir config variable set (see Services/FirstDeployment#Local Git).
  - Run npm install in the source repository and make sure that no error is returned. Do the same with npm test.
  - Are you deploying a new endpoint? You need to add a bit of code to the fake data script that matches the x-amples definition in AQS's v1 yaml; otherwise endpoint checks will fail on deployment. An alternative is to set x-monitor to false, in which case your new endpoints won't get checked (tip: while this fixes the deploy, not testing the endpoint is not advised).
- Then (regardless of whether it's the first time or not):
  - Make sure both the aqs-deploy and aqs repositories are on master, have the latest changes, and are clean, with submodules updated.
  - Follow Services/Deployment#Preparing_the_Deploy_Repository (basically, run ./server.js build --deploy-repo --force --review -c config.test.yaml in the source folder).
  - Check that src's sha1 in the review corresponds to the code you want to deploy.
  - Merge the newly created change to the aqs deploy repo's master.
Issues with "src" path
Remove src path from deploy repo. (We're not sure why this was added to the docs, we should discuss and explain or remove.)
Issues with git review
The build uses git review only if you pass it the --review param; omit it and it will not try to submit a patch: it will create the commit but not push it. Sometimes the build hangs. In this case, check the sync-repo branch of the deploy repository; it should have the commit in there, and that can be pushed to gerrit. It's ok to kill the build if it's been hanging for a while.
NPM vulnerabilities
Whenever possible, it is convenient to run npm audit and make sure that no dependencies pose a threat to the service. Most vulnerabilities will be solved by upgrading packages, but in some cases they will correspond to a second- or third-level dependency that can only be upgraded by forcing versions in package-lock.json. Forcing versions can be avoided if you are certain that the code carrying the vulnerability will not be run by AQS (task T207945 is an example of this). If this is not the case, you can enforce the new version by editing package-lock.json and making sure that the version change doesn't break tests.
See note about hoek npm vulnerability here: https://phabricator.wikimedia.org/T206474
NPM has more information about dealing with vulnerabilities.
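A routine audit pass might look like the following sketch. Whether `npm audit fix` is safe depends on the dependency, so always re-run the tests afterwards:

```shell
# Hedged sketch of a routine dependency audit, run in the AQS source repo.
audit_aqs_deps() {
  npm audit          # list known vulnerabilities in the dependency tree
  npm audit fix      # apply semver-compatible upgrades where possible
  npm test           # make sure the upgrades didn't break anything
}
```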
Step 2: Deploy using scap
- Tell the #wikimedia-analytics and #wikimedia-operations IRC channels that you are deploying (use !log for instance).
- Ssh into the deployment machine that suits your needs:
  - For staging (beta) use: deployment-deploy01.deployment-prep.eqiad1.wikimedia.cloud
  - For production use: deployment.eqiad.wmnet
- Execute scap:
cd /srv/deployment/analytics/aqs/deploy
git pull
git submodule update --init
scap deploy -e aqs "YOUR DEPLOYMENT MESSAGE"
- [optional] To see more detailed error logs during deployment, run scap deploy-log from /srv/deployment/analytics/aqs/deploy while you deploy.
Note: after T156049, scap will deploy only to aqs1010 (or deployment-aqs01 in the case of beta) as a first step (canary), and it will ask for confirmation before proceeding to the rest of the cluster. After that, it will deploy to one host at a time, serially. You can tell scap to ask for confirmation after each host or not, but telling it to proceed to all the other hosts (after the canary) will not cause a deployment to all of them at the same time, since the previously mentioned constraint still holds. Each host will be depooled from the load balancer before the aqs restart, and repooled after it.
Step 3: Test
Staging (beta)
Beta thus far has just a modest dataset: pageviews for the Barack Obama page in 2016 from es.wikipedia, en.wikipedia and de.wikipedia.
You can run some queries like the following to see that aqs is running well:
wget http://localhost:7232/analytics.wikimedia.org/v1/pageviews/
curl http://localhost:7232/analytics.wikimedia.org/v1/pageviews/per-article/de.wikipedia/all-access/all-agents/Barack_Obama/daily/2016010100/2016020200
Should return daily records
curl http://localhost:7232/analytics.wikimedia.org/v1/pageviews/per-article/de.wikipedia/all-access/all-agents/Barack_Obama/monthly/2016010100/2016020200
Should return monthly records
curl http://localhost:7232/analytics.wikimedia.org/v1/pageviews/aggregate/en.wikipedia/all-access/all-agents/daily/2015100100/2016103100
Should return aggregate data for en.wikipedia, if any
curl http://localhost:7232/analytics.wikimedia.org/v1/pageviews/aggregate/es.wikipedia/all-access/all-agents/monthly/2015100100/2016103100
Should return monthly aggregate data for es.wikipedia
Production
From (one of) the deployed machines, run /srv/deployment/analytics/aqs/deploy/test/test_local_aqs_urls.sh.
Troubleshooting Deployment
Issues with deployment to labs deploy
We had to ssh through the keyholder proxy:
SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service deployment-aqs01.deployment-prep.eqiad1.wikimedia.cloud
Issues with scap
- Depool machine
- Delete deployment directory
- Run puppet
- Try to deploy again.
Check deploy logs:
scap deploy-log -v
Check AQS logs:
sudo journalctl -u aqs
Journalctl might not have a lot of information, since by default Restbase is configured to push logs to logstash. To disable this behavior, remove the following from the AQS configuration file under /etc:
logging:
  name: aqs
  level: warn
  streams:
    # XXX: Use gelf-stream -> logstash
    - type: gelf
      host: localhost
      port: 12201
Manual AQS restart:
sudo systemctl restart aqs
Administration
Cassandra CLI
Cqlsh is a python-based CLI for executing Cassandra Query Language commands. To start cqlsh in beta (password is public, this is labs):
cqlsh -u cassandra -p cassandra 172.16.4.205
Load data into cassandra test keyspace
Creating a keyspace in production requires talking with Data Persistence. However, to test, we can log in with the aqs_testing user and create tables in the aqs_testing keyspace. You can find the credentials in this file on HDFS: /user/analytics/aqs_testing_password.txt. From the launcher machine: sudo -u analytics kerberos-run-command analytics hdfs dfs -cat /user/analytics/aqs_testing_password.txt
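For example, creating a throwaway table and a row in the aqs_testing keyspace might look like the sketch below. The table and column names are illustrative only, not a real AQS schema, and the host is one of the aqs-test instances mentioned earlier:

```shell
# Hedged sketch: create a test table and insert a row in the aqs_testing
# keyspace. Table/column names are illustrative; host is an aqs-test instance.
load_test_row() {
  cqlsh -u aqs_testing -p "$(cat aqs_testing_password.txt)" \
    aqs-test1001.analytics.eqiad1.wikimedia.cloud <<'CQL'
CREATE TABLE IF NOT EXISTS aqs_testing.example_pageviews (
    project text,
    article text,
    dt      text,
    views   bigint,
    PRIMARY KEY ((project, article), dt)
);
INSERT INTO aqs_testing.example_pageviews (project, article, dt, views)
VALUES ('en.wikipedia', 'Barack_Obama', '2016010100', 12345);
CQL
}
```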
(NOTE: removed section on loading data in beta, as that's no longer relevant)
Restbase status
On the host to check live requests:
elukey@aqs1003:~$ sudo httpry -i eth0 tcp
Check Restbase status:
elukey@aqs1003:~$ systemctl status aqs
● aqs.service - "aqs service"
Loaded: loaded (/lib/systemd/system/aqs.service; enabled)
Active: active (running) since Tue 2016-05-17 15:45:58 UTC; 1 day 21h ago
Main PID: 25226 (firejail)
CGroup: /system.slice/aqs.service
├─25226 /usr/bin/firejail --blacklist=root --blacklist=/home/* --tmpfs=/tmp --caps --seccomp /usr/bin/nodejs src/server.js -c /etc/aqs/config.yaml
├─25227 /usr/bin/nodejs src/server.js -c /etc/aqs/config.yaml
├─25254 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml
[...]
├─25493 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml
└─25504 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml
Cassandra status
Check Cassandra cluster status (UN == Up Normal):
# Please note the -a suffix, there is also another instance that can be inspected using -b
elukey@aqs1004:~$ nodetool-a status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.64.48.148 1.6 TB 256 24.5% ec437eff-af17-4863-b6ff-42f87ea86557 rack3
UN 10.64.0.213 1.8 TB 256 23.1% c1fcaa1e-fc38-4597-8794-a37d9831df74 rack1
UN 10.64.48.149 1.58 TB 256 24.4% 4d24db1d-fc2a-4ec9-9d43-3952d480ff7e rack3
UN 10.64.16.74 1.67 TB 256 24.9% 7d8443d7-3b81-401a-a46e-d15316d69a56 rack2
UN 10.64.48.122 1.81 TB 256 26.1% c1e9333f-2e7a-48cc-a0cc-c4db53930c22 rack3
UN 10.64.48.123 1.68 TB 256 25.0% a4cade35-16cd-427c-ac77-a51d3d12c3a3 rack3
UN 10.64.32.189 1.59 TB 256 24.1% de1f9797-9ee0-472f-9713-e9bc3c8a1949 rack2
UN 10.64.0.237 1.83 TB 256 26.8% 7f25e4fb-e1b5-4ae3-916d-446f94f4cca9 rack1
UN 10.64.32.190 1.64 TB 256 24.5% 38b46448-a547-4a4f-9e96-35a0e28ee796 rack2
UN 10.64.16.78 1.82 TB 256 26.6% ab0da954-4db8-4e68-8a84-7bca0cf7e8c4 rack2
UN 10.64.0.126 1.77 TB 256 25.7% a6c7480a-7f94-4488-a925-0cff98c5841a rack1
UN 10.64.0.127 1.61 TB 256 24.3% ed33d9e1-a654-4ca6-a232-bf97f32206ba rack1
elukey@aqs1004:~$ nodetool-a info
ID : a6c7480a-7f94-4488-a925-0cff98c5841a
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 1.77 TB
Generation No : 1606380190
Uptime (seconds) : 684961
Heap Memory (MB) : 8855.14 / 16384.00
Off Heap Memory (MB) : 3331.05
Data Center : eqiad
Rack : rack1
Exceptions : 0
Key Cache : entries 977802, size 400 MB, capacity 400 MB, 29643458 hits, 58426030 requests, 0.507 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 200 MB, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token : (invoke with -T/--tokens to see all 256 tokens)
Cassandra logs
The most useful one is /var/log/cassandra/system.log, which becomes system-a.log and system-b.log on aqs100[456], since we have two cassandra instances running:
elukey@aqs1004:/var/log/cassandra$ ls
gc-a.log.0.current gc-b.log.0.current system-a.log system-b.log system.log
Network Configuration
The AQS IPs are deployed in the Production network, while the Hadoop IPs are running in the Analytics network. The traffic flow is guarded by ACLs on switches/routers that need to be updated if you need to connect new AQS IPs to the Analytics network. For example, this is the error that we were getting from analytics1* hosts while trying to upload data to the aqs1010-a.eqiad.wmnet Cassandra instance:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: aqs1004-a.eqiad.wmnet/10.64.0.126:9042 (com.datastax.driver.core.TransportException: [aqs1004-a.eqiad.wmnet/10.64.0.126:9042] Cannot connect))
To solve the issue ops extended the existing ACL for aqs100[123].eqiad.wmnet to allow all the Cassandra Instances IPs too.
Deploy new History snapshot for Wikistats Backend
As of Q4 2018, every snapshot of mediawiki history we load into Druid is a new datasource named after the snapshot, for example "mediawiki-2021-10". AQS will not serve this data until told to do so (this is so we can easily roll back to a prior snapshot). First, check for the existence of a new snapshot on HDFS with:
hdfs dfs -ls /wmf/data/wmf/mediawiki/history_reduced/
Then, to enable a new snapshot, you need to change the hiera config for AQS that points to the active snapshot. See this patch for an example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/875364. Once merged and applied, you'll need to restart the aqs servers for it to take effect; the safest way is to ask an SRE to run the following from one of the cluster management hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet):
# Ensure that the new config is deployed on all hosts
elukey@cumin1001:~$ sudo cumin 'A:aqs' 'run-puppet-agent'
# Roll restart the aqs nodejs daemon
elukey@cumin1001:~$ sudo cookbook sre.aqs.roll-restart-reboot --query aqs --reason "Restart after deployment" restart_daemons
A quick note on caching: after deploying a new snapshot, you can check the data by hitting AQS directly with curl. Real people out in the world won't see the new data until they also clear their cache or it expires (14400 seconds / 4 hours).
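A sketch of such a check, comparing what an aqs host serves directly against what the public (cached) REST API serves. The specific endpoint path is illustrative; the direct port 7232 matches the production checks earlier in this page:

```shell
# Hedged sketch: compare fresh AQS data (direct) vs the public REST API
# (cached for up to 4 hours). The endpoint below is illustrative.
direct() {
  # Run from an aqs host: bypasses the caches entirely
  curl -s "http://localhost:7232/analytics.wikimedia.org/v1/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/20180101/20190101"
}
cached() {
  # What the outside world sees, possibly stale until the cache expires
  curl -s "https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/20180101/20190101"
}
# diff <(direct) <(cached)   # differences disappear once the cache expires
```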
Useful commands
Password
See: /etc/aqs/config.yaml
See table schema:
cassandra@cqlsh> describe table "local_group_default_T_pageviews_per_article_flat".data
Add fake data to Cassandra after wiping the cluster
cqlsh -u cassandra -p cassandra aqs1004-a -f /srv/deployment/analytics/aqs/deploy/scripts/insert_monitoring_fake_data.cql
This command ensures that no AQS-related alarm will fire.
Data filter before Cassandra load
Some data served by AQS comes from Cassandra. We use Airflow+Spark+HQL to feed the tables in Cassandra.
On HDFS, we have implemented a disallowed-articles table, wmf.disallowed_cassandra_articles, to filter out sensitive pages we don't want to appear in the top-viewed articles of a wiki.
Some attacks aim at manipulating the number of views per article; for example, the goal could be pushing traffic to a third-party site, or adding an offensive word to the top list, which millions of users view.
The disallowed list is used when loading these Cassandra tables:
* pageview_per_article_daily
* pageview_top_articles_daily
* pageview_top_percountry_daily
* pageview_top_articles_monthly
To update this list of disallowed articles:
- update the TSV in analytics/refinery: static_data/cassandra/disallowed_cassandra_articles.tsv
- prepare a patch and deploy it
For emergency procedures, you could run the following (Note that you still need a Gerrit patch as the next deployment of analytics/refinery will override your change):
# Fetch the disallowed list
ssh an-launcher1002.eqiad.wmnet
export TSV_FILENAME=disallowed_cassandra_articles.tsv
export TSV_HDFS_PATH="/wmf/refinery/current/static_data/cassandra/${TSV_FILENAME}"
hdfs dfs -cat $TSV_HDFS_PATH > $TSV_FILENAME
# Add or remove some entries (beware, tabs are expected between columns, not spaces)
vim $TSV_FILENAME
# Push the file back to HDFS
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -put -f $TSV_FILENAME $TSV_HDFS_PATH
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod +r $TSV_HDFS_PATH