Search Platform/Weekly Updates/2025-03-28
In Progress
Elasticsearch -> OpenSearch migration
- T386868 Port Sudachi to OpenSearch 1.x
- T389119 Upgrade wmf_opensearch_search_plugins .deb and restart opensearch
- T387028 Decide on a new name for Elastic hosts/Rename hosts during reimage
Explore Search Abandonment
Misc
- T388549 Vector Search PoC
- I documented the PoC and experiments at https://gitlab.wikimedia.org/gmodena/vector_search. This includes PoC code for indexing, querying (embeddings and morelike), and evaluating results with an LLM judge. Sample query and evaluation results are available in the repo. I might still do some touch-ups between tonight and tomorrow.
Done
MLR Improvements
- T385972 Deploy and test new MLR models - New models have been deployed for a while (since mid-February), but documentation and post-deployment analysis are now completed. In particular, see:
- https://people.wikimedia.org/~gmodena/search/mlr/ab/2025-02/ for AB test results
- https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/blob/main/ab-test/T377128-AB-Test-Metrics.ipynb for the Notebook used for this test
- https://phabricator.wikimedia.org/T385972#10589233 for additional comments on the test results
- T360536 Increase retention of training data - While this has no direct impact on our search, it enables us to train MLR models on wikis with less traffic, once the additional training data has been collected.
Operations / Misc
- T388352 PHP Notice: Trying to access array offset on value of type null (via CirrusSearch on Special:Version)
- T379002 Consider resharding cebwiki_content
- T388372 Increase retention of Wikidata RDF Stream (Kafka and/or Hadoop) - we are now able to reimport data on WDQS even if Wikidata dumps fail for up to 3 months (this requires some additional work to modify the data reload process, but at least data isn't lost)
- T389895 cirrusSearchElasticaWrite job failures in quibble - in support of Campaigns team