Search Platform/Weekly Updates/2024-02-09

Summary

Further investigation of failed queries on the WDQS main graph shows that most come from a few sources, which gives us some confidence that we can improve the situation significantly by focusing on a small number of use cases.

Other projects are moving along nicely.

What we've accomplished

Improve multilingual zero-results rate

  • ICU token repair corpus is built and daily diffs are running. Reviewing diffs from enabling the ICU tokenizer. Mostly looks good, but there are a few things to track down. (Malayalam has the most unusual results and I'm having a little trouble figuring out what's going on: diffs from my regression test set aren't reproducing easily in focused testing. I'll get to the bottom of it eventually.)

WDQS graph splitting

Misc

  • Investigated, restarted, and backfilled a failed data pipeline. https://phabricator.wikimedia.org/T356030
  • We participated in a Unicode Consortium meeting about the Foundation's membership. Nothing concrete yet, but a lot of goodwill and promises to do introductions and work together in the future. This is especially timely with our current work on ICU token repair.