XmlRcs

From Wikitech

XmlRcs is a transforming proxy for Event Platform/EventStreams (Wikimedia's recent changes feed) that exposes data as XML instead of JSON, using a lightweight TCP connection with a few simple commands. It runs as a volunteer-maintained service in the Wm-bot Cloud VPS project.

Rationale

XmlRcs simplifies access to EventStreams for applications that, for whatever reason, cannot use the Server-Sent Events protocol or JSON.

Wikimedia has had the IRC feed for a long time. While there are numerous problems with it (e.g. the complex data format with IRC color codes, wikitext notation, and embedding of localised interface messages to encode data), the underlying communication protocol (IRC) is relatively easy to implement in any programming language.

This IRC feed, however, has been deprecated and replaced with RCStream, which was in turn deprecated and replaced with EventStreams, which is supposed to be a more stable platform that makes it easy for programmers to retrieve events from Wikimedia sites in real time. While it may be a better platform in many ways, it does add complexity to the stack. It adds a dependency on third-party technologies, such as WebSockets and JSON. While JSON is an easy data format to decode, WebSockets is quite new and lacks good implementations for popular programming languages and frameworks (such as .NET or Qt). In JavaScript or Python, RCStream's WebSocket can be used directly, but it is hard for developers working in lower-level languages like C or C++.

XmlRcs intends to solve this problem. It introduces a simple and lightweight TCP protocol, using XML packets to encode the event data.

How it works

Flow of XmlRcs

This service is another layer behind the WebSockets server. It is implemented as a Python daemon that converts the WebSockets and JSON input into raw data and puts it in Redis. A C++ daemon then retrieves the data from Redis and acts as a server to which clients can connect and subscribe to various feeds.

By default the daemon listens on TCP port 8822 and runs on the server wm-bot.wm-bot.wmcloud.org. Example usage:

telnet wm-bot.wm-bot.wmcloud.org 8822
Trying 208.80.155.196...
Connected to wm-bot.wm-bot.wmcloud.org.
Escape character is '^]'.
S en.wikipedia.org
<ok></ok>
<edit wiki="enwiki" server_name="en.wikipedia.org" revid="642587049" oldid="625934858" summary="cat" title="Dunbar Douglas, 4th Earl of Selkirk" namespace="0" user="Brendandh" bot="False" patrolled="False" minor="False" type="edit" length_new="4485" length_old="4446" timestamp="1421317382"></edit>
<edit wiki="enwiki" server_name="en.wikipedia.org" revid="642587048" oldid="638351579" summary="Added source and Explanation of how JMB past papers were used to examine present grade inlfation in the British education system." title="Joint Matriculation Board" namespace="0" user="85.3.139.236" bot="False" patrolled="False" minor="False" type="edit" length_new="4990" length_old="4735" timestamp="1421317382"></edit>
<edit wiki="enwiki" server_name="en.wikipedia.org" revid="642587050" oldid="631962647" summary="Added charts section." title="Pacifica (The Presets album)" namespace="0" user="Ss112" bot="False" patrolled="False" minor="False" type="edit" length_new="7697" length_old="6946" timestamp="1421317382"></edit>
exit
Connection closed by foreign host.

As you can see, you only need to connect to port 8822 over TCP and subscribe using simple commands. The output is a stream of XML nodes that contain information about edits.
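
The telnet session above can also be reproduced from code with a plain TCP socket. The following is a minimal Python sketch, not an official client. The helper names (send_command, read_node, subscribe) are made up for illustration, and the sketch assumes that the server answers the first S command with a single reply line before any events arrive and that the stream is UTF-8 encoded.

```python
import socket

HOST = "wm-bot.wm-bot.wmcloud.org"  # server named on this page
PORT = 8822                         # default XmlRcs port


def send_command(sock, command):
    """Send one plain-text command terminated by a newline."""
    sock.sendall((command + "\n").encode("utf-8"))


def read_node(reader):
    """Return the next one-line XML node, or None once the server closes."""
    line = reader.readline()
    return line.rstrip("\r\n") if line else None


def subscribe(sock, reader, wiki):
    """Send 'S <wiki>' and return the server's reply line (assumed to be next)."""
    send_command(sock, "S " + wiki)
    return read_node(reader)


# Typical use (needs network access, so it is left as a comment):
#   sock = socket.create_connection((HOST, PORT))
#   reader = sock.makefile("r", encoding="utf-8", newline="\n")
#   print(subscribe(sock, reader, "en.wikipedia.org"))
#   print(read_node(reader))   # next node sent by the server
#   send_command(sock, "exit")
```

The reader is created once and reused, because a buffered reader may already hold lines that arrived together with the reply.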

Commands

Every command is plain text terminated by a newline.

S

Subscribe to a feed, syntax: S <hostname of wiki>

Example: S en.wikipedia.org

You can use the magic word "all" to subscribe to all wikis.

Response: "<ok></ok>" on success, "<error>reason</error>" on error

D

Remove a subscription, syntax D <hostname of wiki>

Example: D en.wikipedia.org

Using the magic word "all" removes the subscription to "all wikis", but if you were also subscribed to individual wikis, those subscriptions remain.

Response: "<ok></ok>" on success, "<error>reason</error>" on error

clear

Removes all subscriptions.

Response: "<ok></ok>" on success, "<error>reason</error>" on error
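
The way S, D and clear interact (in particular that D all only drops the "all" subscription) can be modelled on the client side. The class below is a hypothetical sketch based only on the descriptions above; it is an assumption about the server's behaviour, not a copy of its implementation, and the names Subscriptions, apply and wants are invented here.

```python
class Subscriptions:
    """Client-side model of the S, D and clear commands (assumed semantics)."""

    def __init__(self):
        self.hosts = set()

    def apply(self, line):
        """Apply one command line; return True if it was S, D or clear."""
        parts = line.strip().split(None, 1)
        if not parts:
            return False
        if parts[0] == "clear" and len(parts) == 1:
            self.hosts.clear()
            return True
        if parts[0] in ("S", "D") and len(parts) == 2:
            host = parts[1].strip()
            if parts[0] == "S":
                self.hosts.add(host)      # "all" is stored like any hostname
            else:
                self.hosts.discard(host)  # D all leaves other wikis in place
            return True
        return False

    def wants(self, server_name):
        """Would an event from server_name be delivered under this model?"""
        return "all" in self.hosts or server_name in self.hosts
```

A client could keep such an object next to its connection to remember which subscriptions to restore after a reconnect.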

stat

Display various system information

ping

Check whether the connection is alive.

Response: "<pong></pong>"

exit

Close the connection

Important: when you receive the XML node "ping", you are supposed to reply with the raw text "pong". If you fail to do that, you may be randomly disconnected.

Output

At the moment the daemon always responds in XML. Each XML node occupies exactly one line, terminated by a newline.
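
Because a single read from a TCP socket may return half a node or several nodes at once, a client that reads raw bytes has to reassemble lines itself. The following Python sketch shows one way to do that; the LineBuffer class is hypothetical and assumes the stream is UTF-8 encoded.

```python
class LineBuffer:
    """Collects raw bytes and returns complete newline-terminated lines."""

    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        """Add bytes from recv(); return the complete lines as strings."""
        self._pending += chunk
        # Everything before the last newline is complete; the rest waits.
        *complete, self._pending = self._pending.split(b"\n")
        return [line.decode("utf-8").rstrip("\r") for line in complete]
```

Each returned string is then exactly one XML node and can be handed to an XML parser.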

error

Example:

meh
<error>Unknown: meh</error>

Non-critical error message

fatal

Example:

<fatal>Redis server is down</fatal>

Critical error which implies that the XmlRcs daemon has become defunct. This error should be very rare.

warning

Example:

<warning>restarting daemon</warning>

Warning message informing clients about a server event

ok

Example

S this.is.a.test
<ok>S this.is.a.test</ok>
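
The error, fatal, warning and ok nodes all carry their message as text content, so a client can handle them with one small helper. This is a sketch only; read_status and FatalServerError are names invented for this example, and raising an exception on fatal is a design choice of the sketch rather than something the protocol requires.

```python
import xml.etree.ElementTree as ET


class FatalServerError(Exception):
    """Raised when the daemon reports a <fatal> node."""


def read_status(line):
    """Return (tag, text) for a one-line node; raise on <fatal>."""
    node = ET.fromstring(line)
    text = node.text or ""  # <ok></ok> has no text content
    if node.tag == "fatal":
        raise FatalServerError(text)
    return node.tag, text
```

For example, read_status("<error>Unknown: meh</error>") returns the pair ("error", "Unknown: meh").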

ping

Example

<ping></ping>

The daemon sends these messages at random intervals to verify that the client is still online. If you fail to reply with pong, you may be disconnected within a minute (note: the reply does not need to be "pong"; the last_response time is reset on any input).
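
Answering pings can be folded into the client's read loop. The helper below is a minimal sketch; reply_if_ping is a hypothetical name, and send stands for whatever function the client uses to write text to the socket.

```python
def reply_if_ping(line, send):
    """Send a pong line when `line` is a ping node; return True if it was."""
    if line.strip() == "<ping></ping>":
        send("pong\n")
        return True
    return False
```

A read loop would call this for every incoming line and skip further processing when it returns True.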

edit

Information about a wiki edit. Example:

<edit wiki="wikidatawiki" server_name="www.wikidata.org" revid="188428371" oldid="188099357" summary="/* wbcreateclaim-create:1| */ Property:P361: Q18770801" title="Q17467648" namespace="0" user="RobotMichiel1972" bot="True" patrolled="True" minor="False" type="edit" length_new="5168" length_old="4758" timestamp="1421402947"></edit> 
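  • wiki: short name of the wiki (enwiki)
  • server_name: FQDN of the server (en.wikipedia.org)
  • revid: revision ID (54635262)
  • oldid: previous revision ID (5635323)
  • summary: edit summary
  • title: name of the page
  • namespace: namespace number of the page (0)
  • user: name of the user
  • bot: whether the edit was made by a bot (True or False)
  • patrolled: whether the edit is patrolled (True or False)
  • minor: whether the edit is marked as minor (True or False)
  • type: type of edit (edit is a regular edit, new is a new page)
  • length_new: page size after the edit
  • length_old: page size before the edit
  • timestamp: time of the edit as a Unix timestamp (1421402947)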
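
Since every edit is a single XML element whose data lives in attributes, it can be decoded with a standard XML parser. The following Python sketch converts one line into a dictionary with typed values. The function name parse_edit and the choice of which attributes to treat as integers or booleans are assumptions drawn from the example above, not a documented schema.

```python
import xml.etree.ElementTree as ET

# Attributes that look numeric or boolean in the example above (assumed).
INT_FIELDS = ("revid", "oldid", "namespace", "length_new", "length_old", "timestamp")
BOOL_FIELDS = ("bot", "patrolled", "minor")


def parse_edit(line):
    """Parse one <edit> line into a dict, or return None for other nodes."""
    node = ET.fromstring(line)
    if node.tag != "edit":
        return None
    edit = dict(node.attrib)
    for name in INT_FIELDS:
        value = edit.get(name, "")
        if value.lstrip("-").isdigit():  # leave unexpected values as strings
            edit[name] = int(value)
    for name in BOOL_FIELDS:
        if name in edit:
            edit[name] = edit[name] == "True"
    return edit
```

With the result in hand, the size change of an edit is simply edit["length_new"] - edit["length_old"].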

Maintainer info

The whole service lives on the instance xmlrcs2.huggle.eqiad1.wikimedia.cloud. It consists of 3 components, which always need to be started in this order:

  • redis server (started by init.d)
  • xmlrcsd (server daemon for XmlRcs, systemd service: xmlrcsd)
  • EventStream to redis daemon (systemd service: es2r)
cold start process
$ ssh wm-bot.wm-bot.wmcloud.org
$ sudo service redis restart
$ sudo service xmlrcs restart
$ sudo service es2r restart

C# Library

There is a C# library: https://github.com/huggle/XMLRCS/tree/master/clients/c%23/XmlRcs

You can download a precompiled .dll from the "releases" page.

Launching an instance

To launch a new instance of XmlRcs, you first need to compile xmlrcsd to get an executable. To do this, cd into <xmlrcs source code directory>/src/xmlrcsd and build with CMake: run cmake . and then make. (Note: do not use GCC, G++ or clang, these compilers are not supported.) This produces an xmlrcsd executable. You then need to have Redis running, start the new executable, and start <xmlrcs source code dir>/src/es2r/es2r.py.