Search Platform/Weekly Updates/2024-11-15
Ongoing work
Language Stuff: Kuromoji
T318269 Test and analyze Kuromoji Japanese language analyzer
- Two of my three volunteers have made some progress on reviewing tokenization by Kuromoji and the ICU tokenizer. Thomas suggested the Sudachi tokenizer, which has Elasticsearch and OpenSearch plugins available, so I spent a little time trying to get it to work. It threw errors as soon as I installed it, so I experimented a bit and asked David for some help, and he got to the next sticking point: the dictionary has to be installed manually in a not-really-local place. (To be fair, that was right there in the docs. I should not try to install new software on a Friday afternoon before a 3-day weekend.)
- While Sudachi installation was stalled and I was waiting on the volunteer reviewers, I took the opportunity to work on committing my last couple of years' worth of minor upgrades to my analysis analysis tools. The code was easy enough, but there have been so many upgrades to the analyzers themselves that I need to work through the docs again to make sure everything is up to date, because right now it is not up to date at all.
- Up next: submit a patch with the analysis analysis tools and then get back to trying to get Sudachi working while I wait for the reviewers.
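For readers curious what comparing two tokenizers can look like in practice, here is a minimal sketch. It is not the analysis analysis tools themselves. It assumes the Elasticsearch/OpenSearch `_analyze` API is used with the `kuromoji_tokenizer` and `icu_tokenizer` names provided by the respective plugins; the helper names (`analyze_request`, `token_spans`, `diff_tokenizations`) are hypothetical, and the sample responses are hand-written for illustration, not real analyzer output.

```python
import json


def analyze_request(tokenizer: str, text: str) -> str:
    """Build a JSON body for a POST to the _analyze endpoint."""
    return json.dumps({"tokenizer": tokenizer, "text": text}, ensure_ascii=False)


def token_spans(response: dict) -> set:
    """Reduce an _analyze response to a set of (start, end, token) tuples."""
    return {
        (t["start_offset"], t["end_offset"], t["token"])
        for t in response["tokens"]
    }


def diff_tokenizations(resp_a: dict, resp_b: dict):
    """Return the tokens found only in A and the tokens found only in B."""
    a, b = token_spans(resp_a), token_spans(resp_b)
    return sorted(a - b), sorted(b - a)


# Hand-written sample responses (assumed shapes, not real analyzer output).
sample_a = {"tokens": [
    {"token": "東京", "start_offset": 0, "end_offset": 2},
    {"token": "都", "start_offset": 2, "end_offset": 3},
]}
sample_b = {"tokens": [
    {"token": "東京都", "start_offset": 0, "end_offset": 3},
]}

only_a, only_b = diff_tokenizations(sample_a, sample_b)
```

Comparing on character offsets as well as token text means that two tokenizers splitting the same string at different points show up as a difference even when some of the token strings happen to match.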
What we've accomplished
Search Update Pipeline / Weighted tags
- We can now produce events from Spark. This is also in support of the Dumps 2.0 work. T374341 Add support for Spark producers in Event Platform
Search backend replacement
- We are confident that running a mixed Elasticsearch / OpenSearch cluster during migration will not cause issues: T379938 Evaluate mixed-cluster behavior of elasticsearch + opensearch