WMDE/Wikidata/Dispatching

Overview

  • Changes on Wikidata are buffered in the wb_changes table.
  • A DispatchChanges job is queued on Wikidata after an edit to an entity that has at least one wiki subscribed to it.
  • The DispatchChanges job queues an EntityChangeNotification job into the job queue of each wiki subscribed to that entity.
    • These EntityChangeNotification jobs get the whole change(s) as a parameter.
  • When the EntityChangeNotification job runs on a client wiki, the ChangeHandler handles the change. This includes, for example, purging caches, refreshing links and injecting RC records.

Full docs: https://doc.wikimedia.org/Wikibase/master/php/docs_topics_change-propagation.html

The process runs on wikidatawiki, testwikidatawiki and beta.

Occasional stuck changes

Rarely, a change gets stuck in `wb_changes` for reasons that are not yet known. This can be resolved by running the ResubmitChanges.php maintenance script. See, for example, task T294008.

Control

Stop dispatching

$repoSettings['localClientDatabases'] controls which wikis may get EntityChangeNotification jobs queued. Set this to an empty list to stop new EntityChangeNotification jobs from being queued, except for the client wiki that is also its own repo, i.e. wikidatawiki.

Monitoring

Dispatching state on the repo

  • TODO: add relevant logs on any servers, useful kafkacat commands, etc.

How it actually works

Data Storage

When someone edits an Entity in Wikidata, a new row is added to the wb_changes table in wikidatawiki if and only if that Entity has at least one client wiki subscribed to it. Here's an example:

MariaDB [wikidatawiki_p]> select * from wb_changes limit 1\G
*************************** 1. row ***************************
         change_id: 1014161077
       change_type: wikibase-item~update
       change_time: 20190924171504
  change_object_id: Q1
change_revision_id: 1019310059
    change_user_id: 142191
       change_info: {"compactDiff":"{\"arrayFormatVersion\":1,\"labelChanges\":[],\"descriptionChanges\":[\"el\",\"eo\",\"en\",\"zh\",\"sr-ec\",\"wuu\",\"vi\",\"sr-el\",\"it\",\"zh-hk\",\"ar\",\"pt-br\",\"tg-cyrl\",\"cs\",\"et\",\"gl\",\"id\",\"es\",\"en-gb\",\"ru\",\"he\",\"nl\",\"pt\",\"zh-tw\",\"nb\",\"tr\",\"zh-cn\",\"tl\",\"th\",\"ro\",\"ca\",\"pl\",\"fr\",\"bg\",\"ast\",\"zh-sg\",\"bn\",\"de\",\"zh-my\",\"ko\",\"da\",\"fi\",\"zh-mo\",\"hu\",\"ja\",\"en-ca\",\"ka\",\"nn\",\"zh-hans\",\"sr\",\"sq\",\"nan\",\"oc\",\"sv\",\"zh-hant\",\"sk\",\"uk\",\"yue\"],\"statementChanges\":[],\"siteLinkChanges\":[],\"otherChanges\":false}","metadata":{"page_id":68145928,"parent_id":1019293753,"comment":"\/* wbeditentity-update:0| *\/ Bot: - Add descriptions:(58 langs).","rev_id":1019310059,"user_text":"Mr.Ibrahembot","central_user_id":15992302,"bot":1}}

The change_info column holds the compact serialization of the change, which is used later to dispatch the change. The DispatchChanges jobs trim this table as they handle its entries, so it usually contains fewer than 10 rows and almost always fewer than 100.
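
The change_info value is JSON whose compactDiff field is itself a JSON-encoded string, as the escaped quotes in the example row show. A minimal Python sketch of decoding it, using a shortened copy of that row (the helper name decode_change_info is hypothetical and not part of Wikibase):

```python
import json

# Shortened copy of the change_info value from the example row
# (only three of the 58 description languages are kept, and most
# metadata fields are dropped).
change_info = (
    '{"compactDiff":"{\\"arrayFormatVersion\\":1,\\"labelChanges\\":[],'
    '\\"descriptionChanges\\":[\\"el\\",\\"eo\\",\\"en\\"],'
    '\\"statementChanges\\":[],\\"siteLinkChanges\\":[],\\"otherChanges\\":false}",'
    '"metadata":{"page_id":68145928,"rev_id":1019310059,"bot":1}}'
)


def decode_change_info(raw):
    """Return (compact_diff, metadata) as plain dicts."""
    info = json.loads(raw)
    # compactDiff is a JSON string nested inside JSON, so decode it again.
    compact_diff = json.loads(info["compactDiff"])
    return compact_diff, info["metadata"]


compact_diff, metadata = decode_change_info(change_info)
print(compact_diff["descriptionChanges"])  # ['el', 'eo', 'en']
print(metadata["rev_id"])                  # 1019310059
```

The second json.loads call is the part that is easy to miss when inspecting rows by hand.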

Wikidata (and other repos like Commons) keep track of the client wikis that subscribe to their entities. This is stored in the wb_changes_subscription table:

MariaDB [wikidatawiki_p]> select * from wb_changes_subscription where cs_entity_id like 'Q%' limit 5;
+-----------+--------------+------------------+
| cs_row_id | cs_entity_id | cs_subscriber_id |
+-----------+--------------+------------------+
| 100946988 | Q1           | afwiki           |
|  57021716 | Q2           | alswiki          |
| 116682143 | Q2           | amwiki           |
|  57060845 | Q2           | anwiki           |
|  57107362 | Q2           | arcwiki          |
+-----------+--------------+------------------+

Client wikis themselves keep track of exactly which parts of Wikidata (= repo) entities they use, and on which pages, in a table called wbc_entity_usage (note that this table exists on all wikis, not only on Wikidata).

This is an example from Afrikaans Wikipedia:

MariaDB [afwiki_p]> select * from wbc_entity_usage where eu_entity_id = 'Q1';
+-----------+--------------+-----------+------------+
| eu_row_id | eu_entity_id | eu_aspect | eu_page_id |
+-----------+--------------+-----------+------------+
|    398481 | Q1           | C         |      39420 |
|    929039 | Q1           | L.af      |      70835 |
|    132666 | Q1           | O         |      39420 |
|    115881 | Q1           | S         |      39420 |
|    398482 | Q1           | T         |      39420 |
|    929040 | Q1           | T         |      70835 |
+-----------+--------------+-----------+------------+

"C" means statements (a.k.a. "claims"), "L.af" means the label in Afrikaans, "O" means "Other" (currently aliases), "S" means sitelinks (to show them in the sidebar), and "T" means title. Note that "Q1" is used on two different pages, with different aspects on each. An item can be used on millions of pages in a client wiki.
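
To make the aspect codes concrete, here is a small Python sketch that decodes the eu_aspect values from the afwiki example and groups them by page. The function names and the ASPECT_NAMES mapping are illustrative only, and the mapping covers just the codes described here:

```python
from collections import defaultdict

# Aspect codes as described in the text; anything else is reported as unknown.
ASPECT_NAMES = {
    "C": "statements",
    "L": "label",
    "O": "other",
    "S": "sitelinks",
    "T": "title",
}


def describe_aspect(aspect):
    """Turn 'L.af' into 'label (af)' and 'C' into 'statements'."""
    key, _, modifier = aspect.partition(".")
    name = ASPECT_NAMES.get(key, "unknown")
    return f"{name} ({modifier})" if modifier else name


def aspects_by_page(rows):
    """Group (eu_row_id, eu_entity_id, eu_aspect, eu_page_id) rows by page."""
    pages = defaultdict(list)
    for _row_id, _entity_id, aspect, page_id in rows:
        pages[page_id].append(describe_aspect(aspect))
    return dict(pages)


# The rows from the afwiki example.
rows = [
    (398481, "Q1", "C", 39420),
    (929039, "Q1", "L.af", 70835),
    (132666, "Q1", "O", 39420),
    (115881, "Q1", "S", 39420),
    (398482, "Q1", "T", 39420),
    (929040, "Q1", "T", 70835),
]

usage = aspects_by_page(rows)
print(usage[39420])  # ['statements', 'other', 'sitelinks', 'title']
print(usage[70835])  # ['label (af)', 'title']
```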

The workflow using an example

Note: Change dispatching is only responsible for triggering a refresh (and injecting rows into RecentChanges). Fetching the actual data happens elsewhere in the code (see WikiPageUpdater in the change-propagation docs).

The actual dispatching happens in the DispatchChangesJob.

1. The DispatchChanges job gets created with only the Entity ID as a parameter. It queries the wb_changes table for all changes with that Entity ID. If multiple changes were made to that Entity in quick succession, it might pick up several of them. If it finds any changes, it then queries for all client wikis subscribed to that Entity ID.

Let's assume only afwiki is subscribed to that particular Entity. It then queues an EntityChangeNotification job for afwiki. That job gets all the change(s) data directly as a parameter.

After queueing this job for afwiki, the DispatchChanges job deletes from the wb_changes table the rows it had read.
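
Step 1 can be modelled with plain in-memory structures. The Python sketch below is not the Wikibase implementation: the function name, the dict and list stand-ins for the tables and job queues, and the choice to delete rows even when no wiki is subscribed are all assumptions made for illustration.

```python
def dispatch_changes(entity_id, wb_changes, subscriptions, job_queues):
    """Model one DispatchChanges run; return the number of wikis notified.

    wb_changes:    list of dicts with 'change_id' and 'change_object_id'
    subscriptions: dict mapping entity id -> set of subscriber wiki ids
    job_queues:    dict mapping wiki id -> list of queued jobs
    """
    # Pick up every buffered change for this entity (there may be several).
    changes = [c for c in wb_changes if c["change_object_id"] == entity_id]
    if not changes:
        return 0

    # Queue one notification job per subscribed wiki, carrying the changes.
    subscribers = sorted(subscriptions.get(entity_id, set()))
    for wiki in subscribers:
        job_queues.setdefault(wiki, []).append(
            {"type": "EntityChangeNotification", "changes": list(changes)}
        )

    # Remove the handled rows from the buffer.
    handled = {c["change_id"] for c in changes}
    wb_changes[:] = [c for c in wb_changes if c["change_id"] not in handled]
    return len(subscribers)


# Hypothetical data: two quick edits to Q1, one unrelated edit to Q2.
wb_changes = [
    {"change_id": 1, "change_object_id": "Q1"},
    {"change_id": 2, "change_object_id": "Q1"},
    {"change_id": 3, "change_object_id": "Q2"},
]
subscriptions = {"Q1": {"afwiki"}}
job_queues = {}

notified = dispatch_changes("Q1", wb_changes, subscriptions, job_queues)
print(notified)                                  # 1
print(len(job_queues["afwiki"][0]["changes"]))   # 2
print([c["change_id"] for c in wb_changes])      # [3]
```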

2. Then, at afwiki, the EntityChangeNotificationJob runs with those changes (i.e. telling afwiki "Hey, the Chinese label of Q3180666 and the Hindi description of Q469681 have changed"). The job checks that data against the aspects afwiki actually uses, via the wbc_entity_usage table in afwiki:

MariaDB [afwiki_p]> select * from wbc_entity_usage where eu_entity_id in ('Q3180666', 'Q469681') limit 5;
+-----------+--------------+-----------+------------+
| eu_row_id | eu_entity_id | eu_aspect | eu_page_id |
+-----------+--------------+-----------+------------+
|    872799 | Q3180666     | C.P1015   |     224030 |
|    872807 | Q3180666     | C.P1048   |     224030 |
|    872798 | Q3180666     | C.P1053   |     224030 |
|    872815 | Q3180666     | C.P1157   |     224030 |
|    872813 | Q3180666     | C.P1222   |     224030 |
+-----------+--------------+-----------+------------+

This avoids triggering a refreshLink or InjectRCrecord job when none of the aspects that the wiki uses have actually changed.

For example, if a page in afwiki only uses the label in English and the aliases in Persian have changed, no action is needed.

If there are one or more matches, the dispatcher triggers jobs to refresh the affected page(s) so they use the new data, and injects rows into the client's recentchanges table.
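
The matching step can be sketched in Python as follows. The matching rule used here (same aspect letter, and the modifiers either agree or one side has none) is a simplifying assumption for illustration, not the exact rule Wikibase applies. The function names and page ID 70835 for the label usage are likewise hypothetical.

```python
def split_aspect(aspect):
    """Split 'C.P1015' into ('C', 'P1015') and 'T' into ('T', '')."""
    key, _, modifier = aspect.partition(".")
    return key, modifier


def aspect_matches(used, changed):
    """Assumed rule: same letter, and modifiers agree or one side has none."""
    used_key, used_mod = split_aspect(used)
    changed_key, changed_mod = split_aspect(changed)
    if used_key != changed_key:
        return False
    return not used_mod or not changed_mod or used_mod == changed_mod


def pages_to_refresh(usage_rows, changed):
    """Return the page ids whose used aspects overlap the changed aspects.

    usage_rows: iterable of (eu_entity_id, eu_aspect, eu_page_id)
    changed:    dict mapping entity id -> list of changed aspects
    """
    pages = set()
    for entity_id, aspect, page_id in usage_rows:
        if any(aspect_matches(aspect, c) for c in changed.get(entity_id, [])):
            pages.add(page_id)
    return pages


usage_rows = [
    ("Q3180666", "C.P1015", 224030),
    ("Q1", "L.en", 70835),
]

# Aliases ("O") changed, but the page only uses the English label: no action.
print(pages_to_refresh(usage_rows, {"Q1": ["O"]}))              # set()
# A statement that the page uses changed: the page needs a refresh.
print(pages_to_refresh(usage_rows, {"Q3180666": ["C.P1015"]}))  # {224030}
```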

Note: There used to be an aspect called "X", meaning "All", which basically said "notify for any change on the given item", but it is deprecated now.