User:AKhatun/WDQS representative test queries

Representative test query collection

WDQS is now in maintenance mode. We intend to check its health at regular intervals. To do this, we collect a small subset of SPARQL queries that represents all the SPARQL queries that come into WDQS. We run this small subset at regular intervals and check how the queries perform: whether they take too long to respond (even though they are typically fast queries), whether they fail, etc. If we keep running the same queries after choosing them, the results may get cached and thus give us misleading information about the actual health of WDQS. To prevent caching, we can run the same queries with slight variations, such as changing the items in the queries (see the sketch below).
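A minimal sketch of this variation idea in Python; the query template and item pool here are illustrative assumptions, not the actual test queries:

```python
import random

# Illustrative sketch: run the same query shape with a randomly chosen
# item so repeated health-check runs are not served from a cache.
# The template and item pool are assumptions, not the real test set.
QUERY_TEMPLATE = """
SELECT ?label WHERE {{
  wd:{item} rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}}
"""

# Items of comparable "size" so the variants stay in the same time class.
ITEM_POOL = ["Q42", "Q64", "Q84", "Q90", "Q1490"]

def make_test_query():
    """Return the fixed query shape with a randomly substituted item."""
    return QUERY_TEMPLATE.format(item=random.choice(ITEM_POOL))

print(make_test_query())
```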

A Jupyter notebook contains the code and documentation for selecting a small set of representative queries that can run at definite time intervals as a health check for WDQS: Notebook

Feature selection

Which features should be represented in the small set of test queries?

  • (considered) response time
  • (considered) query structure (found from operator list)
  • (not required) query length
  • (not required) Should we make sure various services are used, even if they aren't in the *most* frequent queries?
    • It turns out we don't need to ensure all services are incorporated in the test queries.
  • (not required) Should we make sure to have some complex paths? (The larger the number of paths, the more complex the query is likely to be, since a single complex path breaks down into all its individual path components.)
    • Also no. We want to check whether WDQS is alive, not whether it can resolve complex paths. So this is not required.
  • Should we make sure the test queries have various kinds of expressions? (This will be somewhat incorporated in opList)
  • What happens when most queries are from a single UA or a bot? After query selection we need to make sure the UAs/bots behind the selected queries are varied.
  • The idea is to select a few query *types* using the opList. For each query type we generate multiple queries by changing certain items.
    • It turns out we don't need to work hard to generate queries: there are thousands of queries per query type, so we can easily select 50 or so of them per query type.
  • We need to make sure that changing the items within the queries doesn't drastically change the time distribution of the selected set of queries, nor the distributions of the other features (see the sketch below).
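One simple way to sanity-check that last point, sketched here with made-up timings; the median-based comparison and the tolerance value are assumptions, not the notebook's method:

```python
import statistics

def distribution_shifted(original_times, variant_times, tolerance=0.25):
    """Return True if the median response time of the item-swapped
    variants differs from the original median by more than `tolerance`
    (as a fraction of the original median)."""
    orig_median = statistics.median(original_times)
    var_median = statistics.median(variant_times)
    return abs(var_median - orig_median) / orig_median > tolerance

# Made-up response times in seconds for one query type:
original = [0.12, 0.15, 0.11, 0.14, 0.13]
variants = [0.13, 0.16, 0.12, 0.15, 0.14]
print(distribution_shifted(original, variants))  # False: timings agree
```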

Automated query selection

Manually selecting query types (not feasible, abandoned)

  • Queries are grouped by query_time_class and the opList (operator list), and the number of distinct queries per group is counted (a grouping sketch follows this list).
  • For each group, we look at simple yet frequently occurring queries. Simple queries are those that don't have too many operators in the opList. But we also look for diverse types of queries: some have project, filters, table, order by, paths, etc.
  • We select time groups that take up to 1 second. The next group takes 1s to 10s; those queries may take too long to serve as test queries. So we choose the first three time groups only.
  • Next, we look at and select individual queries from each group of query types.
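A minimal sketch of the grouping and counting step with pandas; the DataFrame and its column names (query, query_time_class, opList) are assumptions based on the description above, not the actual notebook code:

```python
import pandas as pd

# Toy stand-in for the parsed query log; opList is flattened to a string
# here so it can serve as a groupby key.
df = pd.DataFrame({
    "query": ["SELECT ... #1", "SELECT ... #2", "ASK ..."],
    "query_time_class": ["<10ms", "<10ms", "10ms-100ms"],
    "opList": ["Project,Filter", "Project,Filter", "Ask"],
})

# Count distinct queries per (time class, operator list) group,
# most frequent groups first.
group_counts = (
    df.groupby(["query_time_class", "opList"])["query"]
      .nunique()
      .sort_values(ascending=False)
      .reset_index(name="distinct_queries")
)
print(group_counts)
```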

Automatic process to select queries

The drawback of picking query types individually is that it is time-consuming, and selecting query types is not trivial: there is no guarantee that we will choose a good set of query types. To do this relatively automatically:

  • We can select the first 10-15 query types for each query_time_class.
  • We can then remove any query type that seems too slow to run (e.g. those with too many joins). We may even decide to keep those queries, but select fewer of them.
  • For each query type group, we can save a set of 50 queries. To run test queries, we can select 1 or 2 from the pool of 50 queries in each query type group. This also alleviates the need to 'generate' queries by changing items in them, because we already have a large pool of queries to choose from in each query type group (see the sketch after this list).
  • Something to keep in mind: we are running the automation on 1 month's data. We could run it on a longer period to get better representation. We can also re-run the script at intervals, such as every 6 months, and save a new set of queries to test with.
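Putting the steps above together, a minimal sketch of the automated selection; the column names (query, query_time_class, opList) and the defaults (15 types per class, pools of 50, 2 queries drawn per run) follow the description above but are assumptions, not the actual script:

```python
import random
import pandas as pd

def select_query_pools(df, types_per_class=15, pool_size=50):
    """For each time class, keep the most frequent query types (opLists)
    and save a pool of up to `pool_size` distinct queries per type."""
    pools = {}
    for time_class, class_df in df.groupby("query_time_class"):
        top_types = class_df["opList"].value_counts().head(types_per_class).index
        for op_list in top_types:
            queries = class_df.loc[class_df["opList"] == op_list, "query"]
            pools[(time_class, op_list)] = (
                queries.drop_duplicates().head(pool_size).tolist()
            )
    return pools

def draw_test_queries(pools, per_group=2, seed=None):
    """Draw 1-2 queries per query type group for one health-check run,
    so repeated runs don't keep reusing (and caching) the same queries."""
    rng = random.Random(seed)
    return {
        key: rng.sample(pool, min(per_group, len(pool)))
        for key, pool in pools.items()
    }
```

Each health-check run would then call draw_test_queries on the saved pools (with a fresh seed) so that successive runs exercise different concrete queries from the same query type groups.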