Search Platform/Weekly Updates/2024-01-12

Summary

We've started query analysis on the WDQS Graph Split. The first numbers look good, with more than 95% of queries returning coherent results on both the full graph and the split graph. This first analysis covers only pywikibot, but because pywikibot is the basis for multiple tools and workflows, the numbers may be representative of general WDQS use.

We're still working on improving performance and stability of the Search Update Pipeline. Some of the improvements might be replicated to other projects, notably the use of compaction on Kafka.

What we've accomplished

WDQS Graph Splitting

  • An analyst extracted a first set of test queries from logs, covering 5 important tools (pywikibot, WikidataIntegrator, Listeria, Mix'n'match, SPARQLWrapper)
  • First rough analysis of 10K queries from pywikibot:
    • 99% execute successfully on both the split and the full graph
    • of those, 94% return the same results
    • Identified "AuthorBot" as potentially problematic; we will contact the author
    • More analysis is needed, but these numbers look promising
  • Removed throttling from the test servers so that large numbers of test queries can be run - https://phabricator.wikimedia.org/T354555
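
The kind of comparison described above can be sketched as follows. This is a minimal illustration, not the analysis code the team used. It assumes each query outcome is either a list of result rows (dicts mapping variable names to string values) or None when the query failed. The names normalize, compare and summarize are hypothetical.

```python
from collections import Counter


def normalize(rows):
    """Order-insensitive view of a result set: a multiset of rows."""
    return Counter(frozenset(row.items()) for row in rows)


def compare(full, split):
    """Classify one query from its outcome on the full and the split graph.

    Each outcome is a list of rows (dicts), or None if the query failed.
    """
    if full is None or split is None:
        return "failed"
    return "same" if normalize(full) == normalize(split) else "different"


def summarize(outcomes):
    """Count how many queries ran on both graphs and how many agreed."""
    counts = Counter(compare(full, split) for full, split in outcomes)
    return {
        "total": len(outcomes),
        "executed_on_both": counts["same"] + counts["different"],
        "same_results": counts["same"],
    }
```

Comparing rows as a multiset ignores row order, which is usually what is wanted when a query has no ORDER BY clause. Queries that use LIMIT without ordering can still differ legitimately between runs, so a "different" result here is only a candidate for manual review.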

Search Update Pipeline

  • Tracked down Envoy performance issues
  • Fixed throttled throughput by no longer preserving the processing order of rerenders
  • Added gzip support
  • Enabled page_rerender event emission for 80% of wikis
  • Processed page_rerender and the resulting update_pipeline events without writing to Elasticsearch (using a devnull consumer/indexer) to test fetching from mw-api
  • Configured Kafka topics to be more space efficient (partitions + compaction) and enabled parallel processing of multiple partitions
  • Started work on supporting private wikis via jobs that are executed in a blocking manner
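
As background on why compaction saves space, here is a small sketch of what Kafka log compaction does in general: for each key, only the most recent record is retained. It is a simplified model, not the pipeline's configuration. The function name compact and the keys in the example are hypothetical, and real Kafka additionally handles tombstones and only compacts closed log segments.

```python
def compact(records):
    """Keep only the latest record per key, in the log order of the survivors.

    `records` is a list of (key, value) pairs in log order.
    """
    latest = {}
    for offset, (key, value) in enumerate(records):
        # A later record with the same key replaces the earlier one.
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


log = [("page:1", "rev1"), ("page:2", "rev1"), ("page:1", "rev2")]
compacted = compact(log)
```

In this model, repeated updates to the same page collapse into one record, so a topic with many updates for the same key needs less storage and less processing on replay.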