News/Actor storage changes on the Wiki Replicas

From Wikitech

The Actor table was introduced to MediaWiki by T167246 and related work on the Actor migration project. This page is about related changes that have been made to the Wiki Replica views to be compatible with the MediaWiki operational database schema. These changes to the Wiki Replicas were first announced on 2019-05-17 via the cloud-announce mailing list and then made live on 2019-06-03.

What changed?

*_user and *_user_text fields have been removed from the Wiki Replica views. This change was made to keep the replicas ahead of changes being made to the production databases in the near future.

An example error message that could be caused by this change is:

ERROR 1054 (42S22): Unknown column 'ar_user' in 'field list'

The fields that were dropped and their replacements (see T223406 for more details):

Table Columns removed Columns added
archive ar_user, ar_user_text ar_actor
filearchive fa_user, fa_user_text fa_actor
image img_user, img_user_text img_actor
ipblocks ipb_by, ipb_by_text ipb_by_actor
logging log_user, log_user_text log_actor
oldimage oi_user, oi_user_text oi_actor
recentchanges rc_user, rc_user_text rc_actor
revision rev_user, rev_user_text rev_actor

The same changes were made to the associated _userindex views such as revision_userindex.

Tables with names ending in _compat will be effectively unchanged, but the hidden joins that preserve the legacy *_user and *_user_text fields will make queries involving those tables slower than direct use of the actor table.

Changes to the revision_userindex view were made to optimize usage of the rev_actor field in T221339. These changes have the side effect of reducing the speed of queries using the revision_user and revision_user_text fields.

What should I do?

The best solution

Every table that had user and user_text type fields has a similar actor field. That field is the id to find in the actor table.

Example: ar_user and ar_user_text from the archive table would be replaced with a join on the actor table using the ar_actor field as actor_id or a separate query, which can be faster in some cases, to obtain the needed actor_name associated with a given archive action.

The list of actor fields follows:

Table Actor field
archive ar_actor
filearchive fa_actor
image img_actor
ipblocks ipb_by_actor
logging log_actor
oldimage oi_actor
recentchanges rc_actor
revision rev_actor

What about a quick solution if the refactor is too time consuming at the moment

Tables affected by the actor change also have a _compat view (for example archive_compat) which have the actor table joins already present in the view's code and will present just like the affected table did before the changes took place in the MediaWiki codebase. The _compat views will suffer a performance penalty because of those same joins. In some cases, where performance isn't the most important factor, this may be the quickest way to fix a tool or application. Use of the _compat views should not be considered a permanent solution for a long lived tool. These views will very likely be removed in the future.

The actor table seems really slow--so does comment

The views for actor and comment are burdened by subqueries against 8 other tables in order to ensure that only the correct rows are available. This is being worked on here T215445, but since that's work toward a long-term solution, WMCS has put up a set of sub-views of these tables to make things faster for individual queries (phab:T224850). Each subview only makes a subquery against a single table. This way, if you are looking for an actor row from the logging table, then you can query against the actor_logging view. This view will be more efficient for that case. The actor_logging view will, naturally, not have any rows available that might be referenced instead in the revision table or the archive table.

Example query using the new views (in this case actor_logging), provided by MusikAnimal (talk contribs):

 SELECT * FROM logging_userindex
 JOIN actor_logging ON actor_id = log_actor
 WHERE actor_name = 'MusikAnimal'

The specialized views that can be used in this case are:

  • actor_filearchive
  • actor_image
  • actor_ipblocks
  • actor_logging
  • actor_oldimage
  • actor_protected_titles
  • actor_recentchanges
  • actor_revision
  • comment_filearchive
  • comment_image
  • comment_ipblocks
  • comment_logging
  • comment_oldimage
  • comment_protected_titles
  • comment_recentchanges
  • comment_revision

Advanced use cases of specialized views

There are other more clever ways to exploit the fact that these views include a single subquery against the external table in the second part of the name to remove joins or where clauses from your own queries, but we suggest experimenting to ensure you are getting the same results if you try this. An example of this is, if you were just trying to see if user "Foo" had ever made edits to a wiki, you could reasonably get that from SELECT 1 FROM actor_revision WHERE actor_name = 'Foo' because any rows with that actor name would be rows that are visible in the revision table. This would likely be faster than a join query for that particular piece of information.

See also