Search Platform/Weekly Updates/2024-08-30
Summary
Some post-deployment cleanup and minor issues remain on the Search Update Pipeline (SUP), but this work is ramping down. Peripheral to the new SUP is a rework of how weighted tags are ingested. We've seen multiple failures in the existing weighted tags pipeline, and the new SUP allows us to simplify the workflow and better support features like AddLink and Image Recommendations.
New SPARQL endpoints for the WDQS graph split are ready. We will do an additional round of testing and announce them more broadly next Tuesday.
Our work on language harmonization is approaching completion. After about a year of work, we have now applied everything we've learned to all of the languages we support!
What we've accomplished
WDQS graph splitting
- https://query-main.wikidata.org and https://query-scholarly.wikidata.org are up! Next will be some odds and ends, testing, and a communication.
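For anyone who wants to poke at the new endpoints during testing, here is a minimal sketch of building a query URL for each one. It assumes the endpoints expose the same `/sparql` path and `query` / `format` parameters as the existing WDQS endpoint; that is an assumption, not something confirmed above. The helper name `build_query_url` and the sample query are illustrative only.

```python
from urllib.parse import urlencode

# Assumption: both hosts serve SPARQL at /sparql, like the existing WDQS endpoint.
MAIN_ENDPOINT = "https://query-main.wikidata.org/sparql"
SCHOLARLY_ENDPOINT = "https://query-scholarly.wikidata.org/sparql"

# Illustrative query only: three items that are instances of "human".
SAMPLE_QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 3"


def build_query_url(endpoint: str, sparql: str) -> str:
    """Return a GET URL that asks the endpoint to run `sparql` and reply in JSON."""
    return endpoint + "?" + urlencode({"query": sparql, "format": "json"})


main_url = build_query_url(MAIN_ENDPOINT, SAMPLE_QUERY)
scholarly_url = build_query_url(SCHOLARLY_ENDPOINT, SAMPLE_QUERY)
```

Running the same query against both URLs is a quick way to see which side of the split a given item landed on.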
Search Update Pipeline / Private Wikis
- Fixed redirect handling (duplicate namespace prefixes) https://phabricator.wikimedia.org/T372446
- Started working on the CirrusSearch extension to support writing weighted tags via the page_change_weighted_tags stream in addition to the job queue. https://phabricator.wikimedia.org/T372904
Improve multilingual zero-results rate
- Additional review of the results for newly configured icu_folding looked good. Speed tests are acceptable: 2% to 10% slower, depending on the complexity of the analysis chain the filter is being added to. Most of these languages were using the default, so the baseline cost (the denominator) is smaller than for analyzers with stemmers or other expensive components, which makes the relative slowdown look larger. Some new baseline regression test corpora for newly involved languages have been deployed and are running daily.
- Currently breaking up patches. I decided to include a patch with baselines for new test fixtures, as previously suggested. I've refactored a fair number of tests and fixtures, and decided I need some documentation (just a markdown README) so others—including my future self—can more easily figure out what's going on.
- I finished configuring ICU folding for Chinese, Indonesian / Malay, Khmer, Korean, and Polish. I added ICU folding to languages that already have customizations in our code: Mirandese, and the Turkic languages Azerbaijani, Crimean Tatar, Gagauz, Kazakh, and Tatar.
- I started working down the list of languages I have, sorted by volume of unique queries: Vietnamese, Igbo, Swahili, Tagalog, Slovenian, Georgian, Tamil, Uzbek, and Albanian are also all done now.
- Fun Fact: This adds support for customized ICU folding to 20 new languages, and expands ICU folding coverage to the languages of the top 50 Wikipedias and 61 of the top 90 in my list (by unique query volume), and to the languages of 65 Wikipedias overall.
- Next up: Finish documenting details for these languages, do another review of the results, do some speed tests, upload some patches, finish creating new regression test corpora, and update how my tools calculate ZRR (as mentioned last time).
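As a rough illustration of what "customized ICU folding" means at the analyzer level, here is a sketch in generic Elasticsearch-style settings, written as Python dicts. It is not the actual CirrusSearch configuration. The `icu_folding` token filter and its `unicode_set_filter` option come from the Elasticsearch ICU analysis plugin; the helper `with_icu_folding`, the filter name `custom_icu_folding`, and the preserved characters in the example are all hypothetical.

```python
def with_icu_folding(analyzer: dict, preserved_chars: str = "") -> tuple[dict, dict]:
    """Append an ICU folding filter to an analyzer definition.

    Returns (updated_analyzer, filter_definitions). If `preserved_chars` is
    given, those characters are excluded from folding via unicode_set_filter.
    """
    folding = {"type": "icu_folding"}
    if preserved_chars:
        # A UnicodeSet that matches everything except the preserved characters.
        folding["unicode_set_filter"] = "[^" + preserved_chars + "]"

    updated = dict(analyzer)
    # Folding goes last so it sees the output of any earlier filters.
    updated["filter"] = list(analyzer.get("filter", [])) + ["custom_icu_folding"]
    return updated, {"custom_icu_folding": folding}


# Hypothetical default-style analyzer with no language-specific processing.
default_analyzer = {"type": "custom", "tokenizer": "standard", "filter": ["lowercase"]}

# Hypothetical exemption list; the real per-language exceptions live in the patches.
analyzer, filters = with_icu_folding(default_analyzer, preserved_chars="åäö")
```

The exemption list is the "customized" part: letters a language treats as distinct stay unfolded, while everything else is normalized.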
Search Metrics
- The source of the big line fluctuations on the graphs (https://superset.wikimedia.org/superset/dashboard/search/) seems to boil down to automated traffic making abnormal numbers of pageviews, which skews the denominator in search metrics. Created a new Phabricator task to make further investigation easier: https://phabricator.wikimedia.org/T372932
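To make the denominator effect concrete, here is a small worked example with made-up numbers. It assumes a metric of the form "searches per pageview"; the actual dashboard metrics may be defined differently, and every figure below is hypothetical.

```python
def per_pageview_rate(search_events: int, pageviews: int) -> float:
    """Ratio metric with pageviews as the denominator."""
    return search_events / pageviews


# Hypothetical traffic for one day.
human_searches = 1_000
human_pageviews = 20_000
automated_pageviews = 30_000  # burst of automated traffic that makes no searches

baseline_rate = per_pageview_rate(human_searches, human_pageviews)
skewed_rate = per_pageview_rate(human_searches, human_pageviews + automated_pageviews)

# Human behavior is identical in both cases; only the denominator changed.
relative_drop = 1 - skewed_rate / baseline_rate
```

With these numbers the rate falls from 5% to 2%, a 60% drop on the graph with no change in how people actually search.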
Misc
- Special:Search intitle search has weird redirect - https://phabricator.wikimedia.org/T372446