Search Platform/Weekly Updates/2023-11-17

Summary

The year-end vacation season and deployment freeze are coming, so work is expected to slow down somewhat until January. We're still on track to deliver what we expected.

What we've accomplished

Improve multilingual zero-results rate

  • Finished heuristics for merging Type and Script attributes (e.g., <ALPHANUM>/Latin + <NUM> = <ALPHANUM>/Latin; Latin + Cyrillic = Unknown; etc.). Abandoned the idea of making Script attribute merging more configurable (e.g., keep first, keep last, count characters), so every mixed-script token gets "Unknown" (we're limited to ICU script types, otherwise I'd go with "Mixed") - https://phabricator.wikimedia.org/T332337
  • Lots of thinking about the configurability of which scripts & types to merge (e.g., don't merge <EMOJI> types; only merge <ALPHANUM> types; don't merge CJK scripts, etc.). Still thinking about a "numbers only" option (because the current behavior is an error with respect to UAX #29) - https://phabricator.wikimedia.org/T332337
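
The merging heuristics above can be illustrated with a minimal sketch. This is not the actual implementation from T332337; the `Token` class, the function names, the treatment of "Common"/"Inherited" as neutral scripts, and the keep-first fallback for other type combinations are all assumptions made for illustration. Only the two example outcomes (<ALPHANUM>/Latin + <NUM> = <ALPHANUM>/Latin; Latin + Cyrillic = Unknown) come from the notes above.

```python
from dataclasses import dataclass

# Assumption: digits and punctuation carry a neutral script that should not
# count as a "mix" when merged with a real script such as Latin.
NEUTRAL_SCRIPTS = {"Common", "Inherited"}


@dataclass(frozen=True)
class Token:
    """Hypothetical token with the two attributes discussed above."""
    text: str
    type: str    # e.g. "<ALPHANUM>", "<NUM>"
    script: str  # e.g. "Latin", "Cyrillic", "Common"


def merge_type(a: str, b: str) -> str:
    """Merge two Type attributes."""
    if a == b:
        return a
    # From the notes: <ALPHANUM> + <NUM> = <ALPHANUM>.
    if {a, b} == {"<ALPHANUM>", "<NUM>"}:
        return "<ALPHANUM>"
    # Assumption: for any other combination, keep the first type.
    return a


def merge_script(a: str, b: str) -> str:
    """Merge two Script attributes; any real mix becomes "Unknown"."""
    if a == b:
        return a
    if a in NEUTRAL_SCRIPTS:
        return b
    if b in NEUTRAL_SCRIPTS:
        return a
    # From the notes: every mixed-script token gets "Unknown".
    return "Unknown"


def merge_tokens(a: Token, b: Token) -> Token:
    """Concatenate two adjacent tokens and merge their attributes."""
    return Token(
        text=a.text + b.text,
        type=merge_type(a.type, b.type),
        script=merge_script(a.script, b.script),
    )


latin_plus_num = merge_tokens(
    Token("abc", "<ALPHANUM>", "Latin"), Token("123", "<NUM>", "Common")
)
latin_plus_cyrillic = merge_tokens(
    Token("abc", "<ALPHANUM>", "Latin"), Token("где", "<ALPHANUM>", "Cyrillic")
)
print(latin_plus_num)
print(latin_plus_cyrillic)
```

Under these assumptions, the first merge keeps <ALPHANUM>/Latin and the second keeps <ALPHANUM> but falls back to the "Unknown" script, matching the examples in the first bullet.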

WDQS graph splitting

Search Update Pipeline

  • Starting a backfill test to validate functional correctness and to confirm that the load on backend systems is appropriate - https://phabricator.wikimedia.org/T350826
  • There are open questions about failure modes. Currently, some failures related to bad input data require manual intervention to recover, and robust automated recovery isn't trivial. That said, SUP has been running with production data for multiple days without issues, so data-related failures appear to be at least somewhat rare.
  • Helm charts created and validated by deployment - https://phabricator.wikimedia.org/T326328
  • Improved the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315

Misc