Search Platform/Weekly Updates/2024-12-06
Ongoing work
Language Stuff: Kuromoji
- David helped me read the Sudachi documentation (which was in English... it was a Friday, I dunno what to say) and I got it installed and running. It has some serious quirks, the most blatant of which is that it has a lot of multi-word English (and some French) tokens in its dictionary, so it tokenizes "prêt-à-porter" and "Organization for Economic Cooperation and Development" as one token each. More generally, it can generate some weird tokens on Latin script. Rumor has it that it is also very slow (which I need to check). If it were stupendous at parsing Japanese it might be worth hacking around the quirks, but I think it's almost time to put it on hold until I finish analyzing the reviews for Kuromoji and the ICU tokenizer.
- Two of my three volunteers have made some progress on reviewing tokenization by Kuromoji and the ICU tokenizer. Thomas suggested the Sudachi tokenizer, which has Elasticsearch and OpenSearch plugins available, so I spent a little time trying to get it to work. It threw errors as soon as I installed it, so I messed around a bit and asked David for some help, and he got to the next sticking point (the dictionary has to be installed manually in a not really local place... though to be fair it was right there in the docs; I should not try to install new software on a Friday afternoon before a 3-day weekend).
- I took the lack of progress on the Sudachi installation and the wait for the volunteer reviewers as an opportunity to work on committing my last couple of years' worth of minor upgrades to my analysis analysis tools. The code was easy enough, but there have been so many upgrades to the analyzers themselves that I need to work through the docs again to make sure everything is up to date, because it is not currently up to date at all.
What we've accomplished
WDQS Graph Split
Elasticsearch to OpenSearch migration
Search Update Pipeline / Weighted tags
Migrate Archiva to GitLab
Misc / Operations
- T380343 Move oozie/util/swift/upload/ out of the refinery oozie folder (moving utility scripts to a central location now that we are removing the last pieces of Oozie)
- T359062 Assess Wikidata dump import hardware - a blog post will follow, tracked in T373338 Blazegraph import of Wikidata - tech blog post
- T379045 mjolnir fails with: Partition not found in table 'labeled_query_page' database 'mjolnir' (issue with the Machine Learning Ranking pipeline)
- T374628 Investigate why rdf-streaming-updater is unable to recover after replacing kafka-main@codfw nodes