
Wikidata Query Service/Streaming Updater/Public Update Stream

From Wikitech

The WDQS Streaming Updater produces a stream of RDF updates in order to keep the WMF triple stores up to date. Because this stream might have value outside of the WMF infrastructure, it is now publicly available via the Event_Platform/EventStreams_HTTP_Service.
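The EventStreams HTTP service exposes the stream as Server-Sent Events (SSE). A minimal sketch of turning SSE text lines into JSON events, assuming only `data:` fields matter and leaving the endpoint URL and stream name out of scope:

```python
import json

def parse_sse(lines):
    """Yield JSON-decoded events from an iterable of SSE text lines.

    Naive sketch: only `data:` fields are handled; consecutive `data:`
    lines are joined, and a blank line terminates an SSE event.
    """
    data_parts = []
    for line in lines:
        if line.startswith("data:"):
            data_parts.append(line[5:].lstrip())
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []

raw = [
    'data: {"operation": "diff", "sequence": 0, "sequence_length": 1}',
    "",
]
events = list(parse_sse(raw))
```

A real client would additionally track the `id:` field for resumption, which the EventStreams service supports.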

Available Streams

Events

Events are encoded as JSON documents following the mediawiki/wikibase/entity/rdf_change schema.
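An abridged, illustrative example of such a document, restricted to the fields discussed on this page (placeholder values and field nesting are assumptions; the mediawiki/wikibase/entity/rdf_change schema is authoritative):

```json
{
  "meta": {
    "id": "00000000-0000-0000-0000-000000000000",
    "dt": "2024-01-01T00:00:00Z",
    "request_id": "00000000-0000-0000-0000-000000000000"
  },
  "operation": "diff",
  "sequence": 0,
  "sequence_length": 1,
  "mime_type": "text/turtle",
  "rdf_added_data": "…",
  "rdf_deleted_data": "…",
  "rdf_linked_shared_data": "…",
  "rdf_unlinked_shared_data": "…"
}
```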

The event may happen for different reasons, but in most cases it is triggered by a change to the Wikibase entity. For these cases the operation field is set to:

  • import: when an entity is created or restored
  • diff: when an entity is edited (new revision)
  • delete: when an entity is deleted

In rarer cases the event may be emitted for operational purposes, when inconsistencies have been discovered. For these cases the operation field may be set to:

  • reconcile: tells the consumer that the data it previously received for this entity might no longer be trusted, and that it should consider using the data provided by this event instead.
  • delete: may also be emitted for similar purposes, when the system discovers that prior events may have wrongly instructed consumers to keep data related to an entity that in fact no longer exists.

This schema is designed to be self-sufficient for keeping an RDF representation of a Wikibase instance (i.e. a triple store) up to date: it contains all the data necessary to do so without having to call any Wikibase API.

The RDF data itself is encoded in 4 fields:

  • rdf_added_data: the triples that must be added
  • rdf_deleted_data: the triples that must be deleted
  • rdf_linked_shared_data: the triples that might be shared with other entities and could be added. The stream does not know whether this data has already been transferred, so adding it again might not be necessary.
  • rdf_unlinked_shared_data: triples that might be used by other entities but are no longer linked from this entity. It is up to the client to determine whether these triples are worth keeping. Keeping them might lead to "orphaned" triples in the graph. A client may decide that orphans in the graph are an acceptable trade-off, or may treat them by deleting them on the fly once they are no longer used. This generally includes:
    • complex values
    • references
    • site links
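Assuming the Turtle payloads have already been parsed into sets of triples (a real client should use a proper RDF parser; opaque tuples stand in for triples here), applying the four data fields of a diff to an in-memory store might be sketched as:

```python
def apply_diff(store, added, deleted, linked_shared, unlinked_shared,
               keep_orphans=True):
    """Apply one `diff` event to `store`, a set of triples.

    Triples are represented as hashable 3-tuples for illustration;
    a real client would first parse the Turtle payloads.
    """
    store -= set(deleted)
    store |= set(added)
    # Shared data may already be present; set union makes re-adding a no-op.
    store |= set(linked_shared)
    if not keep_orphans:
        # Naive policy: drop unlinked shared triples outright. A client
        # tracking usage counts could instead delete only unused ones.
        store -= set(unlinked_shared)
    return store

store = {("Q1", "p", "o1"), ("Q1", "p", "o2")}
apply_diff(store,
           added={("Q1", "p", "o3")},
           deleted={("Q1", "p", "o1")},
           linked_shared={("v:abc", "x", "y")},
           unlinked_shared=set())
```

The `keep_orphans` flag reflects the trade-off described above: accepting orphaned shared triples keeps the client simple, at the cost of some graph growth.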

The RDF data is encoded using the format defined in the mime_type field. For now only "text/turtle" is expected to be produced; other formats may be used in the future if they prove more space efficient.

These data fields are populated differently depending on the type of operation performed:

  • import: rdf_added_data and rdf_linked_shared_data are populated
  • diff: all these four fields are populated
  • delete: none of them; the client must be able to retrieve all the triples owned by the given entity on its own.
  • reconcile: only rdf_added_data is used
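The per-operation handling above can be sketched as a dispatcher over an in-memory set of (subject, predicate, object) triples. Approximating "triples owned by the entity" by subject equality is an illustrative simplification; how ownership is tracked is up to the client:

```python
def handle_event(store, event, entity_iri):
    """Apply one event to `store`, a set of (s, p, o) triples.

    Assumes the Turtle payloads were already parsed into sets of triples.
    Entity ownership is approximated by subject == entity_iri, which is
    an illustrative simplification.
    """
    op = event["operation"]
    if op == "import":
        store |= event.get("rdf_added_data", set())
        store |= event.get("rdf_linked_shared_data", set())
    elif op == "diff":
        store -= event.get("rdf_deleted_data", set())
        store |= event.get("rdf_added_data", set())
        store |= event.get("rdf_linked_shared_data", set())
        # rdf_unlinked_shared_data is left to client policy (orphans).
    elif op == "delete":
        # No RDF fields: the client must drop the entity's triples itself.
        store -= {t for t in store if t[0] == entity_iri}
    elif op == "reconcile":
        # Replace whatever was held for this entity with rdf_added_data.
        store -= {t for t in store if t[0] == entity_iri}
        store |= event.get("rdf_added_data", set())
    return store

store = {("Q5", "p", "old")}
handle_event(store, {"operation": "reconcile",
                     "rdf_added_data": {("Q5", "p", "new")}}, "Q5")
```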

RDF

The fields containing the RDF content hold data compliant with mw:Wikibase/Indexing/RDF_Dump_Format, but with the same differences seen in WDQS:

  • rdf:type statements are omitted for performance reasons
  • data (wdata:) nodes are not emitted
  • only rdfs:label is present (duplicated schema:name & skos:prefLabel are omitted)
  • SomeValue snaks are not encoded as blank nodes but as skolem IRIs (mw:Wikidata_Query_Service/Blank_Node_Skolemization)

Event reconstruction

The events may be stored in a messaging system that is not designed to store large payloads (e.g. Kafka). Since the data of a Wikibase entity may be large compared to what such a system accepts, an event may be split into multiple consecutive chunks. Events that are split are identified using two fields:

  • sequence: the index of the chunk, starting from 0
  • sequence_length: the number of chunks the event was split into

The client must reconstruct the message before processing it. Chunks related to the same event will have the same data in all their fields except for:

  • the fields containing the rdf content
  • the meta.dt field
  • the meta.id field

To reconstruct the event, the client must parse the content of each RDF content field of every chunk into a list of RDF statements, and append, per field, the statements parsed from all the chunks. When consuming from a stream, chunks are expected to appear consecutively; the meta.request_id field can be used to verify that the chunks are consistent. A stream client must be able to buffer all the chunks of a given event but is not required to buffer multiple events. In other words, a stream consumer can simply buffer chunks until the following condition holds:

  • sequence + 1 == sequence_length.
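For a stream consumer, the buffering rule above can be sketched as a generator. Here each chunk's RDF payload is collected into a per-field list; a real client would parse each payload into RDF statements and append the statement lists, as described above:

```python
RDF_FIELDS = ("rdf_added_data", "rdf_deleted_data",
              "rdf_linked_shared_data", "rdf_unlinked_shared_data")

def reassemble(chunks):
    """Yield complete events from an in-order stream of chunks.

    Chunks of one event are expected to be consecutive; meta.request_id
    is used as a consistency check on the buffered chunks.
    """
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        # Sanity check: all buffered chunks belong to the same event.
        assert chunk["meta"]["request_id"] == buffer[0]["meta"]["request_id"]
        if chunk["sequence"] + 1 == chunk["sequence_length"]:
            event = dict(buffer[0])  # shared fields are identical per chunk
            for f in RDF_FIELDS:
                event[f] = [c.get(f, "") for c in buffer]
            yield event
            buffer = []

chunks = [
    {"meta": {"request_id": "r1"}, "sequence": 0, "sequence_length": 2,
     "rdf_added_data": "a0"},
    {"meta": {"request_id": "r1"}, "sequence": 1, "sequence_length": 2,
     "rdf_added_data": "a1"},
]
events = list(reassemble(chunks))
```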

When using this data from batch storage, the client is expected to reconstruct the events by grouping on meta.request_id for messages having sequence_length > 1.
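In a batch setting this grouping can be sketched by sorting on (meta.request_id, sequence); only chunked messages need grouping, and the assumption that meta.request_id identifies one event carries over from the streaming case:

```python
from itertools import groupby

def reconstruct_batch(messages, rdf_fields=("rdf_added_data",)):
    """Group chunked messages by meta.request_id and reassemble them.

    RDF payloads are collected into per-field lists for illustration; a
    real client would parse and concatenate the RDF statements instead.
    """
    singles = [m for m in messages if m["sequence_length"] == 1]
    chunked = sorted((m for m in messages if m["sequence_length"] > 1),
                     key=lambda m: (m["meta"]["request_id"], m["sequence"]))
    events = list(singles)
    for _, group in groupby(chunked, key=lambda m: m["meta"]["request_id"]):
        parts = list(group)
        event = dict(parts[0])  # shared fields are identical per chunk
        for f in rdf_fields:
            event[f] = [p.get(f, "") for p in parts]
        events.append(event)
    return events

messages = [
    {"meta": {"request_id": "r2"}, "sequence": 0, "sequence_length": 1,
     "rdf_added_data": "s"},
    {"meta": {"request_id": "r1"}, "sequence": 1, "sequence_length": 2,
     "rdf_added_data": "x1"},
    {"meta": {"request_id": "r1"}, "sequence": 0, "sequence_length": 2,
     "rdf_added_data": "x0"},
]
events = reconstruct_batch(messages)
```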

Event Ordering

Ordering is guaranteed on a per entity basis:

  • two consecutive edits happening on the same entity are guaranteed to appear one after another in the right order
  • two consecutive edits happening on two different entities are not guaranteed to appear in the same order they happened in the wikibase instance

This is generally not a problem, but may pose interesting challenges for site links, which are currently modeled as a separate, well-identified subject. In other words, two entities might possibly link to the same site link while the stream is being consumed. This is one of the reasons the site links portion of the graph is considered shared linked data; the other reason is historical, see phab:T44325.