Nova Resource:Catgraph

Resource Type: project
Project Name: catgraph
Monitoring: nagf



In-memory graph database for Wikipedia category structure.

CatGraph is a custom graph database. It provides developers of tools and other software fast access to the Wikipedia category structure, for example:

  • Finding all pages in a category with a user-defined search depth inside subcategories.
  • Finding "root nodes", that is, categories not contained in any other category; for example, !Hauptkategorie in dewiki.
  • Finding cycles in the category structure, i.e. categories containing one of their parent categories (beta).
  • Set operations on search results: intersection ("In A and in B"), difference ("In A, but not in B").
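The operations listed above can be illustrated with a small in-memory sketch. This is not CatGraph's implementation or its command language: the toy graph, the page ids and the function names `traverse_successors` and `find_roots` are hypothetical and exist only for illustration.

```python
from collections import deque

# Toy category graph: parent page_id -> ids of its direct members
# (subcategories and pages). All ids are made up for this sketch.
CHILDREN = {
    1: [2, 3],   # category 1 contains 2 and 3
    2: [4, 5],
    3: [5, 6],
    7: [5, 8],   # a second, unrelated top-level category
}


def traverse_successors(children, start, depth):
    """Return all ids reachable from `start` in at most `depth` steps."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not descend below the requested search depth
        for child in children.get(node, ()):
            if child not in seen:  # the seen set also guards against cycles
                seen.add(child)
                queue.append((child, dist + 1))
    seen.discard(start)
    return seen


def find_roots(children):
    """Return ids that have members but are not a member of anything."""
    contained = {c for members in children.values() for c in members}
    return set(children) - contained


in_a = traverse_successors(CHILDREN, 1, 2)
in_b = traverse_successors(CHILDREN, 7, 2)
print(sorted(in_a & in_b))           # intersection: in A and in B
print(sorted(in_a - in_b))           # difference: in A, but not in B
print(sorted(find_roots(CHILDREN)))  # root nodes of the toy graph
```

Because results are plain sets of page ids, intersection and difference map directly onto ordinary set operators.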

The complete category graph of a Wikipedia language version is held in RAM. Query results consist solely of page_ids, which can then be used as parameters for SQL queries.

CatGraph is being developed by Johannes Kroll for Wikimedia Deutschland e.V. The original specification was written by Daniel Kinzler.

Example Query

The following query returns the pages in enwiki that are contained in both the Biology and People categories.

Both subqueries are run with search depth 6, i.e. subcategories are followed up to six levels deep. At the time of writing, the resulting 'Biology' subgraph contains 381187 pages and the 'People' subgraph contains 1852741 pages. The intersection, that is, the set of pages contained in both subgraphs, consists of 86042 pages.

$ time curl "" >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  705k    0  705k    0     0  67236      0 --:--:--  0:00:10 --:--:--  170k

real   0m10.759s
user   0m0.012s
sys    0m0.020s

Currently Running Instances

A list of the wiki graphs currently running on Wikimedia Labs is available online (alternatively as a single JSON file mapping graphs to hostnames). The name of a graph indicates the wiki it covers, e.g. dewiki for the German Wikipedia. The suffix indicates the namespace, e.g. "enwiki_ns14" contains the category namespace of the English Wikipedia. Graphs without a suffix contain all namespaces.
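The naming convention can be handled with a few lines of code. The helper below is a hypothetical sketch, not part of CatGraph or gpClient; it assumes that the namespace suffix always has the form `_ns<number>`, which is generalized from the single "enwiki_ns14" example.

```python
def parse_graph_name(name):
    """Split a graph name into (wiki, namespace).

    namespace is None for graphs without a suffix, i.e. graphs that
    contain all namespaces. The "_ns<number>" suffix format is an
    assumption based on the example "enwiki_ns14".
    """
    wiki, sep, suffix = name.rpartition("_ns")
    if sep and suffix.isdigit():
        return wiki, int(suffix)
    return name, None


print(parse_graph_name("enwiki_ns14"))
print(parse_graph_name("dewiki"))
```

Namespace 14 is the category namespace in MediaWiki, so the first call describes a graph that holds only category pages.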

For the time being, instances are reimported completely by a cron job at regular intervals. The timestamp of the last update of each graph is also published. A script that continuously transfers ongoing changes is in the works.

It is planned to make all existing language versions available.

Tools using CatGraph

  • The Render Article List Generator uses CatGraph for searching the category tree.
  • CatCycle finds cycles in the category structure, that is, categories containing one of their supercategories. It can also display root nodes and the shortest path between categories.
  • Merlissimo started working on a CatGraph interface for Merlbot.
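The two graph problems mentioned for CatCycle, cycle detection and shortest paths, can be sketched with standard algorithms. The code below is not CatCycle's or CatGraph's implementation: it uses a textbook breadth-first search and an iterative depth-first search on a made-up graph, and the function names are hypothetical.

```python
from collections import deque


def shortest_path(children, source, target):
    """Breadth-first search; returns the list of ids on a shortest path."""
    prev = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk the predecessor links back
                path.append(node)
                node = prev[node]
            return path[::-1]
        for child in children.get(node, ()):
            if child not in prev:
                prev[child] = node
                queue.append(child)
    return None  # target is not reachable from source


def find_cycle(children):
    """Iterative depth-first search; returns one cycle or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    for root in children:
        if state.get(root, WHITE) != WHITE:
            continue
        state[root] = GREY
        path = [root]
        stack = [(root, iter(children.get(root, ())))]
        while stack:
            node, members = stack[-1]
            for child in members:
                colour = state.get(child, WHITE)
                if colour == GREY:  # child is an ancestor: cycle found
                    return path[path.index(child):] + [child]
                if colour == WHITE:
                    state[child] = GREY
                    path.append(child)
                    stack.append((child, iter(children.get(child, ()))))
                    break
            else:  # all members handled, leave this node
                state[node] = BLACK
                stack.pop()
                path.pop()
    return None


cyclic = {1: [2], 2: [3], 3: [1, 4]}  # made-up ids; 3 contains its ancestor 1
print(find_cycle(cyclic))
print(shortest_path(cyclic, 1, 4))
```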

Resources for Developers

  • gpClient: Libraries for talking to CatGraph in PHP and Python.
  • This could be used to execute search engine-style query strings with CatGraph.
  • This simple script shows the usage of gpClient with PHP (code listing).

Core Components

The core components are implemented in C++ and optimized for performance: a custom mmap-based block allocator keeps allocations fast and memory usage near the theoretical minimum for an adjacency list.

Graphcore implements the graph database. A graphcore process holds the graph of one Wikipedia language version or of a subset of its namespaces. The basic node type is an unsigned integer, which corresponds to the page_id field of the MediaWiki page table. Graphcore communicates using a simple command language over the stdin/stdout streams.
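To give a feeling for why unsigned integer node ids keep memory usage low, here is a minimal sketch of a compact arc store. It does not reproduce Graphcore's actual C++ data layout or its allocator; the class name `ArcStore` and the sorted-array layout are assumptions made for illustration.

```python
from array import array
from bisect import bisect_left, bisect_right


class ArcStore:
    """Stores (parent, child) arcs as two parallel arrays of unsigned ints.

    Each arc costs two machine integers and no per-node Python objects.
    The arcs are kept sorted by parent so that the members of a node
    form one contiguous slice that can be located by binary search.
    """

    def __init__(self, arcs):
        pairs = sorted(set(arcs))  # drop duplicate arcs, sort by parent
        self._parents = array("I", (p for p, _ in pairs))
        self._children = array("I", (c for _, c in pairs))

    def successors(self, parent):
        """Return the direct members of `parent` as a list of ids."""
        lo = bisect_left(self._parents, parent)
        hi = bisect_right(self._parents, parent)
        return list(self._children[lo:hi])


# Made-up page ids; the duplicate arc (1, 2) is stored only once.
store = ArcStore([(1, 2), (1, 3), (2, 4), (1, 2)])
print(store.successors(1))
print(store.successors(9))
```

A read-only sorted array is the simplest way to show the idea; a store that must accept updates while running would need a different structure.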

Graphserv handles TCP connections and multiplexes data between clients and Graphcore instances. Basic access protection (read/write/admin levels, password authentication) is provided.


Server admin log

June 4

  • 17:01 Ryan_Lane: rebooting sylvester to apply nfs homedirs

Instances for this project

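  • cg-puppetmaster (resource I-000003ba.eqiad.wmflabs)
    Instance Type: m1.small; Project: catgraph; Image Id: ubuntu-12.04-precise
    FQDN: i-000003ba.eqiad.wmflabs; Public IP: none listed; Launch Time: 27 May 2014 14:27:39
    Puppet Class: base; Modification date: 24 February 2015 07:59:18
    Number of CPUs: 1; RAM Size: 2,048; Amount of Storage: 20
  • fishbone (resource I-0000075e.eqiad.wmflabs)
    Instance Type: m1.medium; Project: catgraph; Image Id: ubuntu-14.04-trusty
    FQDN: i-0000075e.eqiad.wmflabs; Public IP: none listed; Launch Time: 4 December 2014 23:23:08
    Puppet Class: base; Modification date: 28 January 2015 06:32:35
    Number of CPUs: 2; RAM Size: 4,096; Amount of Storage: 40
  • sylvester (resource I-00000184.eqiad.wmflabs)
    Instance Type: pmtpa-4; Project: catgraph; Image Id: ubuntu-12.04-precise (deprecated 12-16-2013)
    FQDN: i-00000184.eqiad.wmflabs; Public IP: none listed; Launch Time: 17 March 2014 20:49:59
    Puppet Class: base; Modification date: 22 January 2015 15:29:57
    Number of CPUs: 4; RAM Size: 8,192; Amount of Storage: 95
  • gptest1 (resource I-00000604.eqiad.wmflabs)
    Instance Type: m1.small; Project: catgraph; Image Id: ubuntu-14.04-trusty
    FQDN: i-00000604.eqiad.wmflabs; Public IP: none listed; Launch Time: 19 September 2014 12:53:34
    Puppet Class: base; Modification date: 19 September 2014 13:55:36
    Number of CPUs: 1; RAM Size: 2,048; Amount of Storage: 20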