User:AKhatun/WDQS Triples Analysis

From Wikitech

The following analysis covers SPARQL queries sent to the public cluster of the Wikidata Query Service (WDQS). The data is from 10 May 2021. It is part of a broader analysis of WDQS intended to help scale the service. This page presents the analysis of triples extracted from the SPARQL queries. Earlier analyses examined the queries without explicitly extracting triples from them (see User:Joal/WDQS_Queries_Analysis and User:Joal/WDQS_Traffic_Analysis).
Phabricator Task: T282129
Jupyter Notebook of Analysis

Structure of a SPARQL query

A triple consists of a Subject, a Predicate and an Object. Each of these is called a Node. A node can be one of several types: Variable, URI, Literal, Path or Blank Node.

The SPARQL queries taken from WDQS are first processed to extract the triples, among other things. Not all queries are processed successfully, because some contain uncommon prefixes (mostly indicating that the query uses something other than WDQS in the backend, such as mwapi). Around 92% of queries are parsed and processed correctly.

An example SPARQL query is shown below, along with the triples extracted from it.

The Query:

SELECT * WHERE
    {
      ?s wdt:P31/wdt:P279 <_:bn>;
         skos:altLabel "alias"@en.
    }

The extracted triples:

[
  TripleInfo(
    NodeInfo(NODE_VAR,s),              // subject
    NodeInfo(PATH,wdt:P31/wdt:P279),   // predicate
    NodeInfo(NODE_BLANK,bn)            // object
  ),
  TripleInfo(
    NodeInfo(NODE_VAR,s),              // subject
    NodeInfo(NODE_URI,skos:altLabel),  // predicate
    NodeInfo(NODE_LITERAL,alias@en)    // object
  )
]
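
To make the shape of this output concrete, the extracted triples can be modeled with two small record types. This is only an illustrative Python sketch: the field names `node_type`, `node_value`, `subject`, `predicate` and `obj` are assumptions, not names taken from the actual extraction code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    # One of "NODE_VAR", "NODE_URI", "NODE_LITERAL", "PATH", "NODE_BLANK".
    node_type: str
    # Variable name, URI, literal value, path expression or blank node label.
    node_value: str


@dataclass(frozen=True)
class TripleInfo:
    subject: NodeInfo
    predicate: NodeInfo
    obj: NodeInfo  # "obj" avoids shadowing the built-in name "object"


# The two triples extracted from the example query above.
example_triples = [
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo("PATH", "wdt:P31/wdt:P279"),
        NodeInfo("NODE_BLANK", "bn"),
    ),
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo("NODE_URI", "skos:altLabel"),
        NodeInfo("NODE_LITERAL", "alias@en"),
    ),
]
```

Frozen dataclasses are hashable, so records like these can be counted directly with `collections.Counter`.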

Node Analysis

Combining the nodes from all subjects, predicates and objects, the count of each node type is shown below.

Combined Node Type Distribution

NodeType      Count     Count %
NODE_URI      48256552  53.37
NODE_VAR      29871236  33.04
NODE_LITERAL  11110653  12.29
PATH          1172584   1.29
NODE_BLANK    2         0.00

However, the node type distribution differs by position. For example, subjects are usually URIs and objects are usually variables or literals. The node types of subjects, predicates and objects are therefore also counted separately. Percentages are computed column-wise, that is, within each of subject, predicate and object.

Node Type Distribution

NodeType      Subject            Predicate          Object
              count     percent  count     percent  count     percent
NODE_URI      16104919  53.43    27011280  89.63    5140353   17.06
NODE_VAR      14032060  46.56    1953145   6.48     13886031  46.07
NODE_LITERAL  30        0.00     0         0.00     11110623  36.87
PATH          0         0.00     1172584   3.89     0         0.00
NODE_BLANK    0         0.00     0         0.00     2         0.00
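
The combined and per-position tables are related: adding the subject, predicate and object counts for a node type gives its combined count, and each position column sums to the total number of triples. The sketch below checks this with the counts copied from the table above. The variable names and the helper `column_percent` are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

# Per-position counts copied from the "Node Type Distribution" table.
subject_counts = Counter(
    {"NODE_URI": 16104919, "NODE_VAR": 14032060, "NODE_LITERAL": 30}
)
predicate_counts = Counter(
    {"NODE_URI": 27011280, "NODE_VAR": 1953145, "PATH": 1172584}
)
object_counts = Counter(
    {
        "NODE_URI": 5140353,
        "NODE_VAR": 13886031,
        "NODE_LITERAL": 11110623,
        "NODE_BLANK": 2,
    }
)

# Adding Counters sums the counts key by key.
combined_counts = subject_counts + predicate_counts + object_counts

# Every triple has exactly one subject, so this is the number of triples.
total_triples = sum(subject_counts.values())


def column_percent(counts):
    """Return each count as a percentage of its own column total."""
    total = sum(counts.values())
    return {node_type: 100 * count / total for node_type, count in counts.items()}


for node_type, count in combined_counts.most_common():
    print(node_type, count)
```

Here `combined_counts` reproduces the counts of the combined table, and `column_percent(subject_counts)` computes the column-wise percentages for the subject position.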

Node Values Distribution

The value of a node is the variable name for a Variable node, the URI for a URI node, the path expression for a Path node, the literal value for a Literal node, and so on. Node values can be analyzed in several ways:

  • Top Values for Sub/Pred/Obj nodes
  • Top Values for each type of node (URI, literal, Var)
  • Top values of each type of node, for each of Sub/Pred/Obj. Subjects tend to have more URIs, Objects tend to have more variables etc.

These analyses are shown briefly in the Jupyter Notebook.
They are based only on the data from 10 May 2021.
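
As a rough illustration of the third kind of analysis listed above (top values of each node type, for each of subject, predicate and object), the sketch below groups node values by position and type and keeps the most frequent ones. It assumes each triple is a 3-tuple of `(node_type, node_value)` pairs. This representation and the function name `top_values` are hypothetical and are not taken from the notebook.

```python
from collections import Counter, defaultdict

POSITIONS = ("subject", "predicate", "object")


def top_values(triples, n=10):
    """Return the n most common values for every (position, node type) pair."""
    counters = defaultdict(Counter)
    for triple in triples:
        for position, (node_type, value) in zip(POSITIONS, triple):
            counters[(position, node_type)][value] += 1
    return {key: counter.most_common(n) for key, counter in counters.items()}


# The two triples from the example query earlier on this page.
sample_triples = [
    (("NODE_VAR", "s"), ("PATH", "wdt:P31/wdt:P279"), ("NODE_BLANK", "bn")),
    (("NODE_VAR", "s"), ("NODE_URI", "skos:altLabel"), ("NODE_LITERAL", "alias@en")),
]

top = top_values(sample_triples)
for key, values in top.items():
    print(key, values)
```

Dropping the position from the key gives the top values per node type, and dropping the node type gives the top values per position.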

Triple Analysis

Triples distribution based on node types

A triple is made up of a Subject, a Predicate and an Object. As mentioned, each node is classified into a type such as Variable or URI. Replacing each node in a triple with its type gives a triple of the form (NODE_URI, PATH, NODE_LITERAL), for example. The distribution of triples in this form gives some insight into which kinds of triples are used most. Using more URIs or Literals means the person writing the SPARQL knows exactly what to ask for, whereas more Variables mean searching a larger portion of the graph. Of course, this also depends on the size of the subgraphs in question, so more analysis is needed to get deeper information from this.

Subject       Predicate  Object        Count    Count %
NODE_URI      NODE_URI   NODE_LITERAL  8268924  27.44
NODE_VAR      NODE_URI   NODE_VAR      7443036  24.70
NODE_URI      NODE_URI   NODE_VAR      4997654  16.58
NODE_URI      NODE_URI   NODE_URI      2508824  8.32
NODE_VAR      NODE_URI   NODE_LITERAL  2113018  7.01
NODE_VAR      NODE_URI   NODE_URI      1679795  5.57
NODE_VAR      NODE_VAR   NODE_VAR      943751   3.13
NODE_VAR      PATH       NODE_URI      862532   2.86
NODE_VAR      NODE_VAR   NODE_LITERAL  721584   2.39
NODE_VAR      PATH       NODE_VAR      216456   0.72
NODE_URI      NODE_VAR   NODE_VAR      204853   0.68
NODE_URI      PATH       NODE_VAR      80251    0.27
NODE_VAR      NODE_VAR   NODE_URI      44789    0.15
NODE_URI      NODE_VAR   NODE_URI      38165    0.13
NODE_VAR      PATH       NODE_LITERAL  7097     0.02
NODE_URI      PATH       NODE_URI      6248     0.02
NODE_LITERAL  NODE_URI   NODE_VAR      29       0.00
NODE_VAR      NODE_VAR   NODE_BLANK    2        0.00
NODE_LITERAL  NODE_VAR   NODE_VAR      1        0.00
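
The replacement of each node by its type, as described in this section, can be sketched as follows. The example assumes triples are 3-tuples of `(node_type, node_value)` pairs. This representation, the sample data and the function name `type_signature` are illustrative assumptions.

```python
from collections import Counter


def type_signature(triple):
    """Replace each node of a triple with its node type."""
    return tuple(node_type for node_type, _value in triple)


# Hypothetical sample triples, not taken from the analyzed data.
sample_triples = [
    (("NODE_VAR", "item"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")),
    (("NODE_VAR", "item"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "label")),
    (("NODE_VAR", "x"), ("NODE_URI", "wdt:P569"), ("NODE_VAR", "born")),
]

signature_counts = Counter(type_signature(t) for t in sample_triples)
total = sum(signature_counts.values())

# Print each signature with its count and share of all triples.
for signature, count in signature_counts.most_common():
    print(signature, count, f"{100 * count / total:.2f}")
```

Because the signature discards node values, triples that differ only in their variable names, URIs or literals fall into the same row.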

Triples distribution based on node values

While the node type distribution gives us some information, it does not tell us where in the Wikidata graph people search most, or which services they use most. URIs, Paths and Literals provide this information. Variables and blank nodes are better left obfuscated, since any variable name can be used to write the same SPARQL query. The distribution of triples with variable and blank nodes obfuscated is therefore shown below.

Top 50 triples based on values
Subject Predicate Object Count
bd:serviceParam wikibase:language en 3717731
NODE_VAR rdfs:label NODE_VAR 1390180
NODE_VAR wdt:P279 NODE_VAR 1245462
gas:program gas:out1 NODE_VAR 1242919
gas:program gas:out NODE_VAR 1242919
gas:program gas:traversalDirection Forward 1242594
gas:program gas:gasClass com.bigdata.rdf.graph.analytics.SSSP 1242387
gas:program gas:linkType wdt:P279 1242332
gas:program gas:maxIterations 3^^http://www.w3.org/2001/XMLSchema#integer 1242307
NODE_VAR NODE_VAR NODE_VAR 943751
NODE_VAR <http://www.wikidata.org/prop/direct/P31>/(<http://www.wikidata.org/prop/direct/P279>)* wd:Q16521 677313
NODE_VAR schema:about NODE_VAR 584918
bd:serviceParam wikibase:language [AUTO_LANGUAGE],en 555352
NODE_VAR schema:isPartOf https://en.wikipedia.org/ 312901
NODE_VAR wdt:P569 NODE_VAR 289123
NODE_VAR wdt:P570 NODE_VAR 283221
NODE_VAR wdt:P1630 NODE_VAR 251418
NODE_VAR wikibase:propertyType NODE_VAR 248225
NODE_VAR schema:name NODE_VAR 210927
NODE_VAR wdt:P31 NODE_VAR 207363
NODE_VAR wdt:P18 NODE_VAR 150968
NODE_VAR pq:P6552 NODE_VAR 136415
NODE_VAR p:P2002 NODE_VAR 136376
NODE_VAR rdf:type wikibase:Property 120507
NODE_VAR wikibase:claim NODE_VAR 82803
NODE_VAR wdt:P856 NODE_VAR 79010
NODE_VAR wikibase:statementProperty NODE_VAR 78692
hint:Query hint:optimizer None 68602
NODE_VAR schema:inLanguage en 65903
NODE_VAR skos:altLabel NODE_VAR 64923
NODE_VAR wdt:P577 NODE_VAR 61687
NODE_VAR pq:P1545 NODE_VAR 55542
NODE_VAR schema:isPartOf https://sv.wikipedia.org/ 55440
NODE_VAR wdt:P282 wd:Q8229 50172
http://www.wikidata.org schema:dateModified NODE_VAR 49241
NODE_VAR wdt:P21 NODE_VAR 48106
NODE_VAR wdt:P50 NODE_VAR 46583
NODE_VAR schema:description NODE_VAR 46269
NODE_VAR wikibase:propertyType wikibase:ExternalId 44854
NODE_VAR wdt:P31 wd:Q5 43986
NODE_VAR p:P179 NODE_VAR 42521
NODE_VAR wdt:P300 NODE_VAR 42304
bd:serviceParam wikibase:language fr,en,it,sp,de 41919
NODE_VAR wdt:P227 NODE_VAR 39279
NODE_VAR wdt:P136 NODE_VAR 39038
NODE_VAR wdt:P27 NODE_VAR 38119
NODE_VAR ps:P179 NODE_VAR 36276
NODE_VAR wdt:P19 NODE_VAR 33985
NODE_VAR wdt:P1843 NODE_VAR 33459
NODE_VAR wdt:P106 NODE_VAR 32173
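
The obfuscation used for this table can be sketched as follows: variable and blank nodes are replaced by their node type, while URIs, paths and literals keep their value. As in the table, triples that differ only in variable names then count as the same triple. The `(node_type, node_value)` representation, the sample data and the function name `obfuscate` are assumptions made for illustration.

```python
from collections import Counter

# Node types whose values are replaced by the type name.
OBFUSCATED_TYPES = {"NODE_VAR", "NODE_BLANK"}


def obfuscate(triple):
    """Keep URI, path and literal values; hide variable and blank node names."""
    return tuple(
        node_type if node_type in OBFUSCATED_TYPES else value
        for node_type, value in triple
    )


# Hypothetical triples: the first two differ only in their variable names.
sample_triples = [
    (("NODE_VAR", "item"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "label")),
    (("NODE_VAR", "x"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "y")),
    (("NODE_URI", "bd:serviceParam"), ("NODE_URI", "wikibase:language"), ("NODE_LITERAL", "en")),
]

value_counts = Counter(obfuscate(t) for t in sample_triples)
for triple, count in value_counts.most_common(50):
    print(*triple, count)
```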