
Search Platform/Weekly Updates/2024-12-06

From Wikitech

Ongoing work

Language Stuff: Kuromoji

  • David helped me read the Sudachi documentation (which was in English... it was a Friday, I dunno what to say) and I got it installed and running. It has some serious quirks, the most blatant of which is that it has a lot of multi-word English (and some French) tokens in its dictionary, so it tokenizes "prêt-à-porter" and "Organization for Economic Cooperation and Development" as one token each. More generally, it can generate some weird tokens on Latin-script text. Rumor has it that it is also very slow (which I need to check). If it were stupendous at parsing Japanese, it might be worth hacking around the quirks, but I think it's almost time to put it on hold until I finish analyzing the reviews for Kuromoji and the ICU tokenizer.
  • 2 of my 3 volunteers have made some progress on reviewing tokenization by Kuromoji and the ICU tokenizer. Thomas suggested the Sudachi tokenizer, which has Elasticsearch and OpenSearch plugins available, so I spent a little time trying to get it to work. It threw errors as soon as I installed it, so I messed around a bit and asked David for some help, and he got to the next sticking point (the dictionary has to be installed manually in a not really local place... though to be fair it was right there in the docs; I should not try to install new software on a Friday afternoon before a 3-day weekend).
  • I took the lack of progress in Sudachi installation and waiting on the volunteer reviewers as an opportunity to work on committing my last couple of years' worth of minor upgrades to my analysis analysis tools. The code was easy enough, but there have been so many upgrades to the analyzers themselves that I need to work through the docs again to make sure everything is up to date, because it is not currently up to date at all.
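
For anyone who wants to reproduce the multi-word token quirk, here is a rough sketch of one way to check for it. It assumes the tokenizers are reachable through the standard `_analyze` endpoint under the names `kuromoji_tokenizer`, `icu_tokenizer`, and `sudachi_tokenizer` (check these against your installed plugins). The helper names and the sample response are made up for illustration and are not output from a real cluster.

```python
import json
import re

# Tokenizer names as registered by the analysis plugins (assumed installed;
# verify the names against your own cluster before relying on them).
TOKENIZERS = ["kuromoji_tokenizer", "icu_tokenizer", "sudachi_tokenizer"]


def analyze_body(tokenizer, text):
    """Build the JSON body for a POST to the _analyze endpoint."""
    return {"tokenizer": tokenizer, "text": text}


def multiword_tokens(response):
    """Return tokens that contain more than one whitespace- or hyphen-separated part.

    `response` is a dict shaped like an _analyze response:
    {"tokens": [{"token": "..."}, ...]}
    """
    flagged = []
    for entry in response.get("tokens", []):
        parts = [p for p in re.split(r"[\s\-]+", entry["token"]) if p]
        if len(parts) > 1:
            flagged.append(entry["token"])
    return flagged


if __name__ == "__main__":
    text = "prêt-à-porter"
    for name in TOKENIZERS:
        # Print the request bodies; sending them is left to your HTTP client.
        print(json.dumps(analyze_body(name, text), ensure_ascii=False))

    # Hypothetical response, hand-written to mimic the behavior described above.
    sample = {"tokens": [{"token": "prêt-à-porter"}, {"token": "東京"}]}
    print(multiword_tokens(sample))
```

Running the same text through each tokenizer and comparing the flagged tokens gives a quick, if crude, view of where one tokenizer keeps a phrase whole and the others split it.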

What we've accomplished

WDQS Graph Split

Elasticsearch to OpenSearch migration

Search Update Pipeline / Weighted tags

Migrate Archiva to Gitlab

Misc / Operations