Search Platform/Weekly Updates/2025-01-17

From Wikitech

Ongoing work

Language Stuff: Kuromoji/Sudachi

WDQS graph splitting

MLR Improvements

  • Updated the documentation at https://phabricator.wikimedia.org/T383048 to cover the general approach and what has been done so far
  • Agreed with the team on next steps: I'll train a model with "vanilla" xgboost (instead of the full mjolnir pipeline) and use the "exact match" heuristic as a baseline for identifying "easy queries". No blockers, but today I've mostly been focused on airflow / mjolnir troubleshooting
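
As a rough illustration of what an "exact match" baseline could look like, here is a minimal Python sketch. The update does not spell out the heuristic, so this assumes it means "the normalized query equals a normalized page title". The names `normalize`, `build_title_index` and `is_easy_query` are hypothetical and do not come from mjolnir.

```python
import unicodedata
from collections.abc import Iterable


def normalize(text: str) -> str:
    """Unicode-normalize, case-fold, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def build_title_index(titles: Iterable[str]) -> set[str]:
    """Pre-normalize page titles once so that lookups are cheap."""
    return {normalize(title) for title in titles}


def is_easy_query(query: str, title_index: set[str]) -> bool:
    """Assumed heuristic: a query is 'easy' if it exactly matches a title
    after normalization. This is an illustration, not mjolnir's logic."""
    return normalize(query) in title_index


if __name__ == "__main__":
    index = build_title_index(["Albert Einstein", "Berlin"])
    print(is_easy_query("  albert   EINSTEIN ", index))
    print(is_easy_query("einstein relativity", index))
```

Queries flagged this way could then be set aside as the baseline bucket, with the xgboost model evaluated on the remainder. How the two are actually combined is left to the task linked above.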

Misc / Operations

  • Airflow issues after the k8s migration
    • While we ironed out quite a few issues, airflow-search is not yet fully operational
    • Investigated and resolved name resolution issues that resulted in pods getting killed (http://phabricator.wikimedia.org/T383651)
      • this triggered cascading issues whereby airflow would lose track of skein, resulting in multiple instances of skein and spark jobs being spawned per task
    • KubernetesPodExecutor: WIP on both SRE and SWE fronts to resume execution of refinery script (unblocks the drop_data_daily dag). The aim is to test a solution early next week
    • Fixed a skein/spark memory unit mismatch bug that resulted in spark jobs being erroneously killed (http://phabricator.wikimedia.org/T383589)
    • Fixed a race condition in mjolnir's use of refinery jars (http://phabricator.wikimedia.org/T383870)
    • MRs in flight to reduce dataset size and tweak popularity_score
  • Helping out the Design System Team with a question on CSS hyphenation for languages with loooooong words. German compounds are just the tip of the iceberg! (That must be "deutschcompoundennounzericenbergertippe" or something, right?)
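
The skein/spark memory unit bug mentioned above belongs to a general class of problem: two systems reading the same number in different units. The Python sketch below illustrates that class only and is not the actual fix, whose details are in T383589. The helper `to_mib` and its unit table are hypothetical. It shows that "4 GB" (decimal) and "4 GiB" (binary) differ by roughly 7%, which is enough for a container to be killed for exceeding its allocation.

```python
import math
import re

# Bytes per unit. Decimal (kB, MB, GB) and binary (KiB, MiB, GiB) differ.
_UNIT_BYTES = {
    "b": 1,
    "kb": 10**3, "mb": 10**6, "gb": 10**9, "tb": 10**12,
    "kib": 2**10, "mib": 2**20, "gib": 2**30, "tib": 2**40,
}

_MEMORY_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def to_mib(spec: str) -> int:
    """Convert a memory string with an explicit unit to whole MiB, rounding up.

    A bare number is rejected on purpose: guessing its unit is exactly
    how this kind of mismatch creeps in.
    """
    match = _MEMORY_RE.match(spec)
    if match is None:
        raise ValueError(f"memory spec needs a number and a unit: {spec!r}")
    value, unit = float(match.group(1)), match.group(2).lower()
    if unit not in _UNIT_BYTES:
        raise ValueError(f"unknown memory unit in {spec!r}")
    return math.ceil(value * _UNIT_BYTES[unit] / 2**20)


if __name__ == "__main__":
    print(to_mib("4 GiB"))  # 4096
    print(to_mib("4 GB"))   # 3815
```

Normalizing to a single unit at the boundary between the two systems, and refusing unit-less values, keeps the requested and enforced limits in agreement.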