In-memory graph database for Wikipedia category structure.
CatGraph is a custom graph database. It provides developers of tools and other software fast access to the Wikipedia category structure, for example:
- Finding all pages in a category with a user-defined search depth inside subcategories.
- Finding "root nodes", that is, categories not contained in any other category; for example, !Hauptkategorie in dewiki.
- Finding cycles in the category structure, i.e. categories containing one of their parent categories (beta).
- Set operations on search results: intersection ("In A and in B"), difference ("In A, but not in B").
The complete category graph of a Wikipedia language version is held in RAM. The provided data consists solely of page_ids that can be used as parameters for SQL queries.
Both queries are run with search depth 6, i.e. the subcategories and their subcategories, ... up to the sixth subcategory are searched. The resulting 'Biology' subgraph contains (as I'm writing this) 381187 pages, the 'People' subgraph contains 1852741 pages. The intersection, that is, the pages contained in both subgraphs, consists of 86042 pages.
$ time curl "http://sylvester.wmflabs.org:8090/enwiki/traverse-successors+692675+6+&&+traverse-successors+691008+6" >/dev/null % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 705k 0 705k 0 0 67236 0 --:--:-- 0:00:10 --:--:-- 170k real 0m10.759s user 0m0.012s sys 0m0.020s
Currently Running Instances
A list of the Wiki graphs currently running on Wikimedia Labs is here (alternative: single json file mapping graphs to hostnames). The name of the graph indicates the wiki instance, e.g. dewiki for the German Wikipedia. The suffix indicates the namespace, e.g. "enwiki_ns14" contains the category namespace of the English Wikipedia. Graphs without a suffix contain all namespaces.
Instances are reimported completely by a cron job in regular intervals for the time being. You can see the timestamp of the last update of a graph here. A script that continuously transfers the running changes is also in the works.
It is planned to make to make all existing language versions available.
Tools using CatGraph
- DeepCat Gadget extending the Wikipedia search to search for and in subcategories or category intersections using CatGraph.
- The Render Article List Generator uses CatGraph for searching the category tree.
- CatCycle finds cycles in the category structure, that is, categories containing one of their supercategories. It can also display root nodes and the shortest path between categories.
- Merlissimo started working on a CatGraph interface for Merlbot.
Resources for Developers
- catgraph-jsonp: A JSONP interface to CatGraph.
- catgraph-client-php & catgraph-client-python: Libraries for talking to CatGraph in in PHP & Python.
- This could be used to execute search engine-style query strings with CatGraph.
- This simple script shows the usage of gpClient with PHP (code listing).
The core components are implemented in C++. Some optimizations have been done to increase performance. A custom mmap-based block allocator keeps allocations fast and memory usage near the theoretical minimum for an adjacency list.
Graphcore implements the graph database. A graphcore process contains a Wikipedia language instance or a subset of its namespaces. The basic node type is an unsigned integer, which maps to the MediaWiki page_id table field. Graphcore communicates using a simple command language over the stdin/stdout streams.
Graphserv handles TCP connections and multiplexes data between clients and Graphcore instances. Basic access protection (read/write/admin levels, password authentication) is provided.
Server admin log
- 19:46 mutante: - gptest1.catgraph manually stopping ganglia (T115330)
- 19:45 mutante: - gptest1.catgraph many puppet errors due to failed mysql-server-5.5 install , broken dpkg/puppet
- 17:01 Ryan_Lane: rebooting sylvester to apply nfs homedirs