RCStream

From Wikitech

RCStream is a simple server daemon that broadcasts "recent changes" events from MediaWiki wikis using the Socket.IO 0.9 protocol. The main instance used to run at stream.wikimedia.org/rc, broadcasting changes from all public wikis in the Wikimedia production cluster.

stream.wikimedia.org/rc provided a live data stream of edits on Wikimedia wikis that anyone could tap and use to power editor tools and web apps, create beautiful visualisations, inform research, and extend MediaWiki.

RCStream subscribes to the RCFeed from MediaWiki wikis. Web developers can open the stream using JavaScript. App developers can use a suitable Socket.IO client library for their platform.

API

RCStream provides a simple API for subscribing to RCFeeds of MediaWiki wikis. After connecting, you emit a 'subscribe' event, specifying the wikis you wish to subscribe to. The subscription can use any of the following formats:

  • a single hostname, such as nl.wikipedia.org.
  • an array of hostnames.
  • hostnames matching a wildcard pattern such as *.wikivoyage.org or nl.*.
  • all wikis, by subscribing to the special topic name *.
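
As a rough illustration, the four subscription forms above can be written as plain Python values of the kind passed to emit('subscribe', ...). The matches() helper is hypothetical and not part of RCStream; it assumes wildcards behave like shell-style globs, which this page does not confirm.

```python
import fnmatch

# The four subscription forms described above, as plain values.
SINGLE = 'nl.wikipedia.org'
SEVERAL = ['nl.wikipedia.org', 'commons.wikimedia.org']
WILDCARD = '*.wikivoyage.org'
EVERYTHING = '*'


def matches(subscription, server_name):
    """Return True if server_name is covered by the subscription.

    Hypothetical helper: assumes shell-style glob semantics for '*'.
    """
    if isinstance(subscription, str):
        patterns = [subscription]
    else:
        patterns = list(subscription)
    return any(fnmatch.fnmatchcase(server_name, p) for p in patterns)
```

A helper like this can be useful on the client side for checking which hostnames a pattern is expected to cover before subscribing.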

You then receive 'change' events whose data is an RCFeed structure containing the type of change, the title of the page, the new revision number, etc.
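
A minimal sketch of handling such a payload follows. The describe_change() function and the sample data are made up for illustration; the field names ('type', 'title', 'user', and a nested 'revision' with a 'new' key) are assumptions about the RCFeed JSON structure and should be checked against real events.

```python
def describe_change(change):
    """Build a one-line summary from a 'change' event payload.

    Missing fields fall back to placeholders, so partial payloads
    (for example log events without a revision) do not raise.
    """
    revision = change.get('revision') or {}
    new_rev = revision.get('new')
    summary = '[%s] %s by %s' % (
        change.get('type', 'unknown'),
        change.get('title', '(no title)'),
        change.get('user', '(unknown user)'),
    )
    if new_rev is not None:
        summary += ' (revision %s)' % new_rev
    return summary


# Made-up sample payload, for illustration only.
sample = {
    'type': 'edit',
    'title': 'Example page',
    'user': 'ExampleUser',
    'revision': {'old': 100, 'new': 101},
}
print(describe_change(sample))
# prints: [edit] Example page by ExampleUser (revision 101)
```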

The Socket.IO server uses the /rc namespace. It also implements an /rcstream_status endpoint that exposes internal state about connected clients and queue size that may help when debugging.

Consumers

  • Cocytus. Tracks citations on Wikipedia.
  • Meteor DDP. Proxies change events combined with page content and diff from the API.
  • Datasift
  • Demo on CodePen.io. Example listener using JavaScript in the browser.
  • Various researchers (per wiki-research-l, December 2014)
  • Pywikibot

Client example

As of writing (January 2015), RCStream implements version 0.9 of the Socket.IO protocol, not 1.0 (phab:T68232). See also socket.io 0.9 and socket.io-client 0.9 on GitHub for more information.

JavaScript

// Requires socket.io-client 0.9.x:
// browser code can load a minified Socket.IO JavaScript library;
// standalone code can install via 'npm install socket.io-client@0.9.1'.

var io = require( 'socket.io-client' );
var socket = io.connect( 'https://stream.wikimedia.org/rc' );

socket.on( 'connect', function () {
    socket.emit( 'subscribe', 'commons.wikimedia.org' );
} );

socket.on( 'change', function ( data ) {
    console.log( data.title );
} );

Python

Install dependencies:
pip install socketIO_client==0.5.6
Get stream of events:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import socketIO_client

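class WikiNamespace(socketIO_client.BaseNamespace):
    def on_change(self, change):
        # Called for every 'change' event; change is the RCFeed data as a dict.
        print('%(user)s edited %(title)s' % change)

    def on_connect(self):
        # Subscribe once the connection to the namespace is established.
        self.emit('subscribe', 'commons.wikimedia.org')


socketIO = socketIO_client.SocketIO('https://stream.wikimedia.org')
# Attach the handler class to the /rc namespace used by RCStream.
socketIO.define(WikiNamespace, '/rc')

# Block and process incoming events until interrupted.
socketIO.wait()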

Wikimedia deployment

The RCFeed of Wikimedia wikis is configured using $wgRCFeeds. The JSON formatter is used with the Redis engine.

The RCStream servers are rcs1001 and rcs1002 (see also puppet node and puppet role). Each backend server runs multiple instances of RCStream, as well as a Redis instance that receives RCFeed messages from the MediaWiki app servers. Each server exposes its RCStream backends through a local Nginx reverse proxy.

An LVS load balancer (stream-lb) is situated in front of the backend servers, exposed as stream.wikimedia.org.

The RCStream server also responds at https://stream.wikimedia.org/rcstream_status with a simple text message; check this if you do not receive any events.
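
The status check can be scripted. The sketch below uses only the Python 3 standard library; status_url() and fetch_status() are hypothetical helper names, and the format of the returned text is not specified here, so it is returned as-is rather than parsed.

```python
import urllib.request


def status_url(base='https://stream.wikimedia.org'):
    """Return the status endpoint URL for a given RCStream base URL."""
    return base.rstrip('/') + '/rcstream_status'


def fetch_status(base='https://stream.wikimedia.org', timeout=10):
    """Fetch the status text; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(status_url(base), timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')
```

Calling print(fetch_status()) prints whatever the server returns. A connection error at this step suggests the problem lies with the service or the network rather than with the subscription.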

Beta Cluster

The Beta Cluster has a simplified setup: a single VM instance running the rcstream Puppet role, exposed as http://stream.wmflabs.org.

See also

Source code: rcstream