The primary goal of the multi-project indices work is to take multiple wiki projects in the same language (enwiki, enwikisource, enwiktionary, enwikivoyage, etc.) and place them into a single Elasticsearch index.
Why / How
CirrusSearch uses MediaWiki page IDs as document IDs within Elasticsearch. If we want to store multiple projects within the same index, we need to change the ID to something more distinct. The initial proposal is to prefix each document ID with the wikiId of the wiki it belongs to. Internally Elasticsearch treats all document IDs as strings, so no changes are required server side. Within CirrusSearch we need to track down all usages of page IDs and classify each as either a document ID or a page ID. We then need to verify that, as we pass these around, we convert page IDs to doc IDs (and back) wherever necessary. Phan can provide some help here: by consistently annotating pageIds as integers and docIds as strings, Phan will flag cases where the wrong type is passed. This is not perfect and won't catch every mistake, but it will help in at least some cases.
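The proposed ID scheme can be sketched roughly as follows. This is an illustrative Python sketch, not the CirrusSearch implementation; the function names and the `|` separator are assumptions, and the actual prefix format would be decided during implementation:

```python
# Sketch of the proposed scheme: document IDs are page IDs prefixed
# with the owning wiki's ID. The names and the "|" separator are
# illustrative assumptions, not CirrusSearch's actual API.

def page_id_to_doc_id(wiki_id: str, page_id: int) -> str:
    """Build the Elasticsearch document ID for a page."""
    return f"{wiki_id}|{page_id}"

def doc_id_to_page_id(doc_id: str) -> tuple[str, int]:
    """Recover (wiki_id, page_id) from a document ID."""
    wiki_id, _, page_id = doc_id.rpartition("|")
    return wiki_id, int(page_id)
```

Keeping the page ID typed as an integer and the document ID as a string is what lets static analysis (Phan, in CirrusSearch's case) flag accidental mixing of the two.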
To support deployment, forceSearchIndex.php will also need a few changes. Some of these may also need to be applied to updateOneSearchIndex.php:
- Need to be able to specify $wgCirrusSearchMultiWikiIndices from the command line to allow a reindex to run before deploying the config change to the wiki itself. Note that this script is used in combination with the job queue, so this option needs to propagate into those jobs.
- Need to be able to specify the index base name of the target index; currently the script assumes the source and destination share the same base name. Instead we will be copying from, for example, testwiki_content to test_content.
- Must not make any alias changes or index deletions when $wgCirrusSearchMultiWikiIndices is true. An additional script must be worked up to promote an index to production usage (make the necessary alias changes and, if requested, delete the old index).
- Must be able to reindex into an existing index. For example, if testwiki creates and populates test_content, then test2wiki must index into the existing test_content index.
- To help prevent mistakes, a special flag should be required when $wgCirrusSearchMultiWikiIndices is true so the operator knows the process is different. This may be phased out in the future, but seems like a good failsafe initially to prevent partial indexing.
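One way to picture the source/target split described above is as an Elasticsearch `_reindex` request that copies documents from the per-wiki index into the shared index while rewriting document IDs in flight. This is a hedged sketch, not the plan of record (the plan uses forceSearchIndex.php); the `|` separator and prefix scheme are assumptions:

```python
# Sketch: build an Elasticsearch _reindex request body that copies
# testwiki_content into the shared test_content index, prefixing each
# document ID with the source wiki's ID. The "|" separator is an
# illustrative assumption.

def build_reindex_body(source_index: str, target_index: str, wiki_id: str) -> dict:
    return {
        "source": {"index": source_index},
        "dest": {"index": target_index},
        # The reindex script runs once per document; _reindex allows
        # modifying ctx._id to change the destination document ID.
        "script": {
            "lang": "painless",
            "source": "ctx._id = params.wikiId + '|' + ctx._id",
            "params": {"wikiId": wiki_id},
        },
    }
```

Because the destination index already exists when the second wiki runs this, the copy is additive, which is exactly why the "reindex into an existing index" requirement above matters.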
Changing all of the IDs used is going to make deployment of the change a bit difficult. The first deployment will create a single multi-wiki index for testwiki and test2wiki. This process additionally needs to be tested in labs (and the beta cluster) prior to deployment. Automating this process will come later. Below is the proposed plan:
- Note down the current time in GMT, to use for backfilling updates after the switchover
- Use forceSearchIndex.php to do an in-place reindex from the current live index into the new shared index for all wikis that share the index. Be sure to use the options described above.
- Deploy mediawiki-config change reducing $wgCirrusSearchDropDelayedJobsAfter to 0
- Disable indexing by freezing all writes to the cluster and ensure the update queues drain. Freezing the indices needs to be combined with reducing $wgCirrusSearchDropDelayedJobsAfter to ensure the old updates, containing old document IDs, are fully dropped from the job queue.
- Wait a bit and monitor the job queue to ensure we really got rid of all the old updates.
- Deploy mediawiki-config change setting new $wgCirrusSearchIndexBaseName and $wgCirrusSearchMultiWikiIndices values
- Re-enable indexing
- Use forceSearchIndex.php with --from/--to settings on all wikis sharing the index to backfill updates into the new index
- If we somehow still managed to get old document IDs into the new index, a second run of in-place reindexing should convert those as well.
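The backfill bookkeeping in the plan above (note the GMT cutoff before the initial copy, then replay everything after it once indexing is re-enabled) can be sketched as follows. The helper name and the exact timestamp format accepted by forceSearchIndex.php's --from/--to are assumptions:

```python
# Sketch of the backfill window used in the switchover plan: record a
# GMT cutoff before the initial copy, then after the switchover replay
# updates from that cutoff up to now. This only computes the window;
# the actual replay is done by forceSearchIndex.php --from/--to.
from datetime import datetime, timezone

def backfill_window(cutoff: datetime) -> tuple[str, str]:
    """Return (from, to) timestamps covering edits made since cutoff."""
    now = datetime.now(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # assumed format; the script's parser may accept others
    return cutoff.strftime(fmt), now.strftime(fmt)
```

The key property is that the window opens *before* the in-place copy starts, so any edit that raced with the copy is replayed rather than lost.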
This plan should be made more explicit, with exact command lines / automation, once the necessary changes to forceSearchIndex.php are agreed upon and made.
Alternate Deployment (may have fewer error cases)
Deploy in two phases. First, ship the change handling prefixes without actually merging the indices. This will still have some complexities, but it won't intermingle the prefixing with the change to have multiple wikis in each index. At a later point, reindex multiple wikis into a single index.
It is likely that at some point in the future we will need to reindex, either for analyzer changes or something else. The current in-place reindexing process should work regardless of the number of wikis in the index, but we should double-check. The full reindex, which re-parses wikitext pages and inserts the results into the index, will need to be adjusted as well:
- When doing a full parse reindex, indices must not be automatically promoted, as we need to wait until all wikis that are part of that index have completed
- A separate script will be needed to promote an index to production status (re-alias and delete the old version)
- Documentation at Search will need to be updated to take this into consideration
- Likely moar