Nova Resource:Catgraph

From Wikitech
Jump to: navigation, search
Project Name catgraph



In-memory graph database for Wikipedia category structure.

CatGraph is a custom graph database. It provides developers of tools and other software fast access to the Wikipedia category structure, for example:

  • Finding all pages in a category with a user-defined search depth inside subcategories.
  • Finding "root nodes", that is, categories not contained in any other category; for example, !Hauptkategorie in dewiki.
  • Finding cycles in the category structure, i.e. categories containing one of their parent categories (beta).
  • Set operations on search results: intersection ("In A and in B"), difference ("In A, but not in B").

The complete category graph of a Wikipedia language version is held in RAM. The provided data consists solely of page_ids that can be used as parameters for SQL queries.

CatGraph is being developed by Johannes Kroll for Wikimedia Deutschland e.V. The original specification was written by Daniel Kinzler.

Example Query

The following query displays a list of the subset of pages in enwiki that are contained in both the Biology and People categories.

Both queries are run with search depth 6, i.e. the subcategories and their subcategories, ... up to the sixth subcategory are searched. The resulting 'Biology' subgraph contains (as I'm writing this) 381187 pages, the 'People' subgraph contains 1852741 pages. The intersection, that is, the pages contained in both subgraphs, consists of 86042 pages.

$ time curl "" >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  705k    0  705k    0     0  67236      0 --:--:--  0:00:10 --:--:--  170k

real   0m10.759s
user   0m0.012s
sys    0m0.020s

Currently Running Instances

Instances are reimported by a cron job in regular intervals for the time being. You can find the status of the running instances here. All Wikipedia languages and Commons should be available.

Tools using CatGraph

  • DeepCat Gadget extending the Wikipedia search to search for and in subcategories or category intersections using CatGraph.
  • The Render Article List Generator uses CatGraph for searching the category tree.
  • CatCycle finds cycles in the category structure, that is, categories containing one of their supercategories. It can also display root nodes and the shortest path between categories.
  • Merlissimo started working on a CatGraph interface for Merlbot.

Resources for Developers

Core Components

The core components are implemented in C++. Some optimizations have been done to increase performance. A custom mmap-based block allocator keeps allocations fast and memory usage near the theoretical minimum for an adjacency list.

Graphcore implements the graph database. A graphcore process contains a Wikipedia language instance or a subset of its namespaces. The basic node type is an unsigned integer, which maps to the MediaWiki page_id table field. Graphcore communicates using a simple command language over the stdin/stdout streams.

Graphserv handles TCP connections and multiplexes data between clients and Graphcore instances. Basic access protection (read/write/admin levels, password authentication) is provided.


Edit documentation

Server admin log


  • 19:46 mutante: - gptest1.catgraph manually stopping ganglia (T115330)
  • 19:45 mutante: - gptest1.catgraph many puppet errors due to failed mysql-server-5.5 install , broken dpkg/puppet

June 4

  • 17:01 Ryan_Lane: rebooting sylvester to apply nfs homedirs