Search Platform/Weekly Updates/2025-01-17
Ongoing work
Language Stuff: Kuromoji/Sudachi
- Sorting through my Sudachi notes and working on the write-up. In the spirit of sharing early and often (though it is not that early or that often), my Kuromoji notes ("Japan Eset OK en I Zat Ion") are on MediaWiki. [TJ] T318269 https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Japan_Eset_OK_en_I_Zat_Ion%E2%80%94Kuromoji,_ICU,_and_Sudachi
WDQS graph splitting
- Attempted to deploy config changes to use the internal split graph endpoints [SM][DC]
- succeeded for the campaign extension (https://phabricator.wikimedia.org/T377956)
- failed for WikibaseQualityConstraints (https://phabricator.wikimedia.org/T374021)
MLR Improvements
- Updated documentation on https://phabricator.wikimedia.org/T383048 covering the general approach and what has been done so far
- Agreed with the team on next steps. I'll train a model with "vanilla" xgboost (instead of the full mjolnir pipeline). I'll use the "exact match" heuristic as a baseline for identifying "easy queries". No blockers, but today I've been mostly focused on airflow / mjolnir troubleshooting
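For illustration only, here is a minimal sketch of what an "exact match" baseline could look like. It assumes that a query counts as "easy" when it equals a candidate page title after light normalization. The function names, the normalization steps, and the use of titles as the comparison target are all hypothetical and are not taken from mjolnir or the task above.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC, case-fold, and collapse whitespace (hypothetical normalization)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def is_easy_query(query: str, titles: list[str]) -> bool:
    """Flag a query as "easy" if it exactly matches any candidate title once normalized."""
    normalized_query = normalize(query)
    return any(normalized_query == normalize(title) for title in titles)
```

A baseline like this is cheap to compute, so it could be used to label queries before comparing against a trained model; the actual heuristic may differ.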
Misc / Operations
- Airflow issues after the k8s migration
- While we ironed out quite a few issues, airflow-search is not yet fully operational
- Investigated and resolved name resolution issues that resulted in pods getting killed (http://phabricator.wikimedia.org/T383651)
- this triggered cascading issues whereby airflow would lose track of skein, resulting in multiple instances of skein and spark jobs being spawned per task
- KubernetesPodExecutor: WIP on both SRE and SWE fronts to resume execution of the refinery script (unblocks the drop_data_daily DAG). The aim is to test a solution early next week
- Fixed a skein/spark memory unit mismatch bug that resulted in spark jobs being erroneously killed (http://phabricator.wikimedia.org/T383589)
- Fixed a race condition in mjolnir's use of refinery jars (http://phabricator.wikimedia.org/T383870)
- MRs in flight to reduce dataset size and tweak popularity_score
- Helping out the Design System Team with a question on CSS hyphenation for languages with loooooong words. German compounds are just the tip of the iceberg! (That must be "deutschcompoundennounzericenbergertippe" or something, right?)