Tool:Wikibase Unicorn

Toolforge tools
Website https://wikibase-unicorn.toolforge.org/
Description Unicorn graph query language for wikibase
Keywords graph, search, wikibase, wikidata
Maintainer(s) Ebernhardson
Source code https://github.com/ebernhardson/wikibase-unicorn
License MIT License

Wikibase Unicorn is a minimal implementation of the Unicorn graph query language (pdf). It provides graph search over Wikibase-enabled wikis by performing recursive queries against the Cloud Elastic replicas.

This is nowhere near as expressive as SPARQL, but it may be easier to read and write. It also scales horizontally: more servers allow a bigger graph. The relevant question is whether this provides enough functionality to fill a niche.

Experimental Status

This project is experimental and subject to significant change. The current implementation level is proof of concept: it can demonstrate the functionality, but error handling is minimal and correctness is not strongly guaranteed.

Examples

Hospital owners. More concretely, the owned-by edge of every instance of hospital, or of every instance of a sub-class of hospital:

(extract P127=
         (or P31=Q16917
             (apply P31= P279=Q16917)))

SPARQL (approximate) equivalent:

SELECT ?owner ?ownerLabel
WHERE 
{
  {
    SELECT ?owner (count(distinct ?sitelink) as ?sitelinks)
    WHERE
    {
      ?hospital wdt:P31/wdt:P279* wd:Q16917 .
      ?hospital wdt:P127 ?owner .
      OPTIONAL { ?sitelink schema:about ?owner }
    }
    GROUP BY ?owner
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?sitelinks)

Implemented Operators

Operator | Example S-expression | Description
(none) | P31=Q16917 | Instance of hospital
term | (term P31=Q16917) | Instance of hospital
and | (and P31=Q16917 P31=Q1774898) | Instance of hospital and clinic
or | (or P31=Q16917 P31=Q1774898) | Instance of hospital or clinic
difference | (difference P31=Q16917 P31=Q1774898) | Instance of hospital and not clinic
apply | (apply P31= P279=Q16917) | Instance of sub-class of hospital
extract | (extract P127= P31=Q16917) | Owner of instance of hospital

Instance-of is P31, subclass-of is P279, owned-by is P127. Hospital is Q16917, clinic is Q1774898.
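
To make the S-expression syntax concrete, here is a minimal sketch of a reader that turns a query string into nested tuples. It is not the tool's own parser: the names tokenize, parse and _read are hypothetical, and it assumes that terms never contain whitespace or parentheses.

```python
from typing import List, Tuple, Union

# A parsed query is either a bare term such as "P31=Q16917"
# or a tuple of the form (operator, argument, ...).
Node = Union[str, Tuple["Node", ...]]


def tokenize(text: str) -> List[str]:
    """Split a query into "(", ")" and term tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def _read(tokens: List[str], pos: int) -> Tuple[Node, int]:
    """Read one node starting at tokens[pos]; return it and the next position."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    token = tokens[pos]
    if token == ")":
        raise ValueError("unexpected ')'")
    if token != "(":
        return token, pos + 1  # bare term or operator name
    items = []
    pos += 1
    while pos < len(tokens) and tokens[pos] != ")":
        item, pos = _read(tokens, pos)
        items.append(item)
    if pos >= len(tokens):
        raise ValueError("missing ')'")
    return tuple(items), pos + 1  # skip the closing parenthesis


def parse(text: str) -> Node:
    """Parse a whole query string into a single node."""
    tokens = tokenize(text)
    node, pos = _read(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return node
```

With this sketch, parse("(apply P31= P279=Q16917)") yields the tuple ("apply", "P31=", "P279=Q16917"), and a bare term such as "P31=Q16917" is returned unchanged.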

Output Formats

The /search?q={query} endpoint chooses its response format from the HTTP Accept header. Supported content types are application/json and text/html.
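
As an illustration of content negotiation, the following sketch builds a request for the endpoint with an explicit Accept header, using only the Python standard library. The helper names build_search_request and search are hypothetical, and the structure of the JSON response is not documented here, so the body is returned as plain text rather than interpreted. Decoding as UTF-8 is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the tool's website URL plus the /search path.
BASE_URL = "https://wikibase-unicorn.toolforge.org/search"


def build_search_request(query: str, accept: str = "application/json") -> Request:
    """Build a GET request whose Accept header selects the response format."""
    url = BASE_URL + "?" + urlencode({"q": query})
    return Request(url, headers={"Accept": accept})


def search(query: str, accept: str = "application/json") -> str:
    """Send the request and return the raw response body (assumed UTF-8)."""
    with urlopen(build_search_request(query, accept), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling search("(extract P127= P31=Q16917)") would request JSON, while passing accept="text/html" would request the HTML rendering of the same query.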

How does it work?

For Wikibase-enabled wikis, CirrusSearch maintains a field called statement_keywords, which contains a filtered set of the graph edges, each in the form P1=Q1. The provided Unicorn query is transformed into equivalent Elasticsearch queries, and edges in the graph are followed by performing sequential Elasticsearch queries.

Because there are execution boundaries between stages, and Elasticsearch can only accept 1024 conditions in a single search request, results from Wikibase Unicorn are only guaranteed to be complete when the truncated metric reported after all search results is zero. Results are truncated based on the number of sitelinks: the pages with the lowest number of sitelinks are removed.

Per the linked paper, truncation is typically not an issue for user-facing (top N) queries as long as inner-query sorting is doing a good job. Inner-query sorting in this implementation is likely sub-par (it also sorts by sitelink_count).
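
The staged evaluation and truncation described above can be sketched with an in-memory stand-in for the search index. This is a simplified model under stated assumptions, not the tool's actual code: TOY_INDEX, its item ids and sitelink counts, and the functions search_term, keep_top and evaluate are all hypothetical, queries are given as nested tuples rather than S-expression strings, and each set lookup stands in for one Elasticsearch request.

```python
from typing import Dict, List, Set, Tuple, Union

MAX_CLAUSES = 1024  # per-request condition limit quoted in the text

# Hypothetical toy index: page id -> (statement_keywords, sitelink count).
Index = Dict[str, Tuple[Set[str], int]]
Query = Union[str, tuple]

TOY_INDEX: Index = {
    "Q100": ({"P31=Q16917", "P127=Q900"}, 5),  # a hospital
    "Q200": ({"P279=Q16917"}, 3),              # a sub-class of hospital
    "Q201": ({"P279=Q16917"}, 1),              # another sub-class
    "Q300": ({"P31=Q200", "P127=Q901"}, 2),    # instance of Q200
    "Q301": ({"P31=Q201", "P127=Q902"}, 0),    # instance of Q201
}


def search_term(index: Index, term: str) -> Set[str]:
    """Stand-in for one search request filtering on statement_keywords."""
    return {page for page, (keywords, _) in index.items() if term in keywords}


def keep_top(index: Index, ids: Set[str], limit: int) -> Tuple[List[str], int]:
    """Keep the `limit` ids with the most sitelinks; count the ones dropped."""
    ranked = sorted(ids, key=lambda page: (-index.get(page, (set(), 0))[1], page))
    return ranked[:limit], max(0, len(ranked) - limit)


def evaluate(query: Query, index: Index, limit: int = MAX_CLAUSES) -> Tuple[Set[str], int]:
    """Return (matching ids, number of ids truncated between stages)."""
    if isinstance(query, str):
        return search_term(index, query), 0
    op, *args = query
    if op == "term":
        return evaluate(args[0], index, limit)
    if op in ("and", "or", "difference"):
        parts = [evaluate(arg, index, limit) for arg in args]
        sets = [ids for ids, _ in parts]
        truncated = sum(count for _, count in parts)
        if op == "and":
            return set.intersection(*sets), truncated
        if op == "or":
            return set.union(*sets), truncated
        return sets[0].difference(*sets[1:]), truncated
    if op not in ("apply", "extract"):
        raise ValueError(f"unknown operator: {op}")
    prefix, inner = args
    inner_ids, truncated = evaluate(inner, index, limit)
    if op == "apply":
        # Follow the edge backwards: search for prefix + id for each inner id.
        # Only `limit` ids fit into one request, so the rest are dropped.
        kept, dropped = keep_top(index, inner_ids, limit)
        found: Set[str] = set()
        for target in kept:
            found |= search_term(index, prefix + target)
        return found, truncated + dropped
    # extract: read the edge targets off the pages matched by the inner query.
    found = set()
    for page in inner_ids:
        for keyword in index.get(page, (set(), 0))[0]:
            if keyword.startswith(prefix):
                found.add(keyword[len(prefix):])
    return found, truncated
```

Evaluating the hospital owners query, ("extract", "P127=", ("or", "P31=Q16917", ("apply", "P31=", "P279=Q16917"))), against TOY_INDEX returns all three owners with a truncated count of zero. Lowering limit to 1 drops the sub-class with fewer sitelinks before the apply stage, so one owner goes missing and the truncated count becomes 1, which is the situation in which completeness is no longer guaranteed.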