Nova Resource:Catgraph


Resource Type: project
Project Name: catgraph
Monitoring: nagf, grafana
Admins:
Members:

Catgraph

Description

In-memory graph database for Wikipedia category structure.


CatGraph is a custom graph database. It provides developers of tools and other software fast access to the Wikipedia category structure, for example:

  • Finding all pages in a category with a user-defined search depth inside subcategories.
  • Finding "root nodes", that is, categories not contained in any other category; for example, !Hauptkategorie in dewiki.
  • Finding cycles in the category structure, i.e. categories containing one of their parent categories (beta).
  • Set operations on search results: intersection ("In A and in B"), difference ("In A, but not in B").
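
The operations listed above can be illustrated on a toy graph. The following is a minimal Python sketch, not CatGraph's implementation: the function names, the dictionary-based graph, and the page_ids are made up for illustration, and it assumes that a traversal does not report the start node itself.

```python
from collections import deque

def traverse_successors(graph, start, depth):
    """Return every node reachable from `start` within `depth` edges.

    Assumption: the start node itself is not part of the result.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # search depth reached, do not descend further
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                frontier.append((child, dist + 1))
    seen.discard(start)
    return seen

def root_nodes(graph):
    """Return the nodes that are not contained in any other node."""
    contained = {child for children in graph.values() for child in children}
    return set(graph) - contained

# Toy graph with made-up page_ids: 1 -> 2 -> 4, 1 -> 3, 5 -> 3, 5 -> 6
toy_graph = {1: [2, 3], 2: [4], 3: [], 4: [], 5: [3, 6], 6: []}

in_a = traverse_successors(toy_graph, 1, 2)
in_b = traverse_successors(toy_graph, 5, 2)
in_both = in_a & in_b        # intersection: "In A and in B"
only_in_a = in_a - in_b      # difference: "In A, but not in B"
```

Because the results are plain sets of integers, intersection and difference map directly onto set operators.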

The complete category graph of a Wikipedia language version is held in RAM. The provided data consists solely of page_ids that can be used as parameters for SQL queries.

CatGraph is being developed by Johannes Kroll for Wikimedia Deutschland e.V. The original specification was written by Daniel Kinzler.

Example Query

The following query returns the enwiki pages that are contained in both the Biology and People categories.

Both traversals are run with search depth 6, i.e. subcategories are followed up to six levels deep. At the time of writing, the 'Biology' subgraph contains 381187 pages and the 'People' subgraph contains 1852741 pages. The intersection, that is, the pages contained in both subgraphs, consists of 86042 pages.

$ time curl "http://sylvester.wmflabs.org:8090/enwiki/traverse-successors+692675+6+&&+traverse-successors+691008+6" >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  705k    0  705k    0     0  67236      0 --:--:--  0:00:10 --:--:--  170k

real   0m10.759s
user   0m0.012s
sys    0m0.020s
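
A query URL of this shape could be assembled programmatically. The sketch below only builds the string and does not contact the server. The helper names are hypothetical, and it assumes that "+" simply separates tokens in the URL, as the example suggests.

```python
# Host and port are taken from the example query above.
BASE_URL = "http://sylvester.wmflabs.org:8090"

def traverse(page_id, depth):
    """Build a traverse-successors command for one category page_id."""
    return f"traverse-successors+{page_id}+{depth}"

def intersect(left, right):
    """Combine two commands with the && (intersection) operator."""
    return f"{left}+&&+{right}"

def query_url(graph_name, command):
    """Build the full URL for a command against a named graph."""
    return f"{BASE_URL}/{graph_name}/{command}"

url = query_url("enwiki", intersect(traverse(692675, 6), traverse(691008, 6)))
print(url)
```

The resulting string can be passed to curl or to any HTTP client.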

Currently Running Instances

A list of the wiki graphs currently running on Wikimedia Labs is here (alternative: a single JSON file mapping graphs to hostnames). The name of a graph indicates the wiki instance, e.g. dewiki for the German Wikipedia. The suffix indicates the namespace, e.g. "enwiki_ns14" contains the category namespace of the English Wikipedia. Graphs without a suffix contain all namespaces.
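
The naming convention can be decoded mechanically. This is a small sketch with a hypothetical helper name, assuming that the suffix always has the form "_ns" followed by a namespace number, as in the example above.

```python
import re

# Wiki name, optionally followed by "_ns<number>".
_GRAPH_NAME = re.compile(r"^(?P<wiki>[a-z0-9_]+?)(?:_ns(?P<ns>\d+))?$")

def parse_graph_name(name):
    """Split a graph name into (wiki, namespace).

    The namespace is None when the graph contains all namespaces.
    """
    match = _GRAPH_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected graph name: {name!r}")
    ns = match.group("ns")
    return match.group("wiki"), (int(ns) if ns is not None else None)

print(parse_graph_name("enwiki_ns14"))
print(parse_graph_name("dewiki"))
```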

For the time being, instances are reimported completely by a cron job at regular intervals. You can see the timestamp of the last update of a graph here. A script that continuously transfers the running changes is also in the works.

It is planned to make all existing language versions available.

Tools using CatGraph

  • The DeepCat gadget extends the Wikipedia search so that it can search for and within subcategories or category intersections using CatGraph.
  • The Render Article List Generator uses CatGraph for searching the category tree.
  • CatCycle finds cycles in the category structure, that is, categories containing one of their supercategories. It can also display root nodes and the shortest path between categories.
  • Merlissimo started working on a CatGraph interface for Merlbot.
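
The kind of analysis described for CatCycle (cycles and shortest paths) can be sketched on a toy graph. This is an illustration with standard textbook techniques (depth-first search for cycles, breadth-first search for shortest paths), not the code of CatCycle or CatGraph. The function names and page_ids are made up.

```python
from collections import deque

def find_cycle(graph):
    """Return one cycle as a list of nodes (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    for root in graph:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        path = [root]
        stack = [(root, iter(graph[root]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                state = color.get(child, WHITE)
                if state == GREY:  # child is one of its own ancestors
                    return path[path.index(child):] + [child]
                if state == WHITE:
                    color[child] = GREY
                    path.append(child)
                    stack.append((child, iter(graph.get(child, ()))))
                    break
            else:  # all children handled, leave this node
                color[node] = BLACK
                stack.pop()
                path.pop()
    return None

def shortest_path(graph, source, target):
    """Return a shortest path from source to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for child in graph.get(node, ()):
            if child not in parents:
                parents[child] = node
                queue.append(child)
    return None

cyclic = {1: [2], 2: [3], 3: [1, 4], 4: []}
acyclic = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
print(find_cycle(cyclic))
print(shortest_path(acyclic, 1, 5))
```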

Resources for Developers

Core Components

The core components are implemented in C++ and optimized for performance. A custom mmap-based block allocator keeps allocations fast and memory usage near the theoretical minimum for an adjacency list.
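
To give an idea of how compact an adjacency list of integers can be, here is a sketch of a packed layout with two flat arrays of unsigned integers. It is not the C++ block allocator described above but a different, static technique (often called compressed sparse row), written in Python for brevity. It assumes dense node ids from 0 to node_count - 1, which real page_ids are not.

```python
from array import array

def build_compact(edges, node_count):
    """Pack (parent, child) edges into an offsets array and a targets array."""
    offsets = array("I", [0] * (node_count + 1))
    for parent, _child in edges:
        offsets[parent + 1] += 1          # count successors per node
    for i in range(node_count):
        offsets[i + 1] += offsets[i]      # prefix sums give slice starts
    targets = array("I", [0] * len(edges))
    cursor = offsets[:node_count]         # next free slot per node (a copy)
    for parent, child in edges:
        targets[cursor[parent]] = child
        cursor[parent] += 1
    return offsets, targets

def successors(offsets, targets, node):
    """Return the successors of `node` as a slice of the targets array."""
    return targets[offsets[node]:offsets[node + 1]]

edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
offsets, targets = build_compact(edges, 4)
print(list(successors(offsets, targets, 0)))
```

In this layout the graph costs one integer per edge plus one per node, which is why it is a useful reference point for memory usage. Unlike a graph database, it cannot be updated in place.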

Graphcore implements the graph database. A Graphcore process contains a Wikipedia language instance or a subset of its namespaces. The basic node type is an unsigned integer, which corresponds to the page_id field of the MediaWiki page table. Graphcore communicates using a simple command language over the stdin/stdout streams.

Graphserv handles TCP connections and multiplexes data between clients and Graphcore instances. Basic access protection (read/write/admin levels, password authentication) is provided.

Server admin log

2016-04-27

  • 19:46 mutante: - gptest1.catgraph manually stopping ganglia (T115330)
  • 19:45 mutante: - gptest1.catgraph many puppet errors due to failed mysql-server-5.5 install , broken dpkg/puppet

June 4

  • 17:01 Ryan_Lane: rebooting sylvester to apply nfs homedirs


Instances for this project

FQDN                              Instance Type  Image Id           Public IP  Number of CPUs  RAM Size  Amount of Storage  Modification date
Fridolin.catgraph.eqiad.wmflabs   -              debian-8.2-jessie  -          2               4,096     40                 25 April 2016 15:01:48
Fishbone.catgraph.eqiad.wmflabs   -              -                  -          4               8,192     80                 23 March 2016 22:02:36
Sylvester.catgraph.eqiad.wmflabs  -              -                  -          4               8,192     80                 23 March 2016 22:02:33