Search Platform/Weekly Updates/2024-03-01

Summary

Search Update Pipeline: 100% of writes for Cloudelastic now go through the new SUP. Operational concerns have been resolved, and we will start deploying to production indices soon. The target of migrating 90% of update traffic to the SUP by the end of the quarter is likely to slip into Q4 by a few weeks.

WDQS Graph Splitting: We are making progress in better understanding the Scholia use cases and helping them move forward, both by reviewing some Scholia queries as a proof of concept that SPARQL federation is a viable alternative and by providing documentation on how to rewrite queries to use the graph split. The complexity is not yet under control, from either a technical or a change-management standpoint. Scholia is also exploring alternative solutions, including running their own query service with a different RDF backend.
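
To make the federation approach more concrete, here is a minimal sketch of what a rewritten query could look like once the scholarly triples live in a separate subgraph: labels come from the main subgraph while article triples are fetched through a SERVICE clause. The endpoint URLs, the example property (P50, author) and item (Q42), and the script itself are illustrative assumptions, not the actual split configuration or any reviewed Scholia query.

  # Minimal sketch of a federated query against a hypothetical graph split.
  # Endpoint URLs are placeholders, not the real split endpoints.
  import requests

  MAIN_ENDPOINT = "https://query-main.example.org/sparql"            # assumption
  SCHOLARLY_ENDPOINT = "https://query-scholarly.example.org/sparql"  # assumption

  QUERY = """
  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?article ?articleLabel WHERE {
    # Scholarly-article triples live in the other subgraph, so we reach
    # them with SPARQL federation.
    SERVICE <%s> {
      ?article wdt:P50 wd:Q42 .        # P50 = author; Q42 only as an example
    }
    # Labels are still available in the main subgraph queried directly.
    ?article rdfs:label ?articleLabel .
    FILTER(LANG(?articleLabel) = "en")
  }
  LIMIT 10
  """ % SCHOLARLY_ENDPOINT

  resp = requests.get(
      MAIN_ENDPOINT,
      params={"query": QUERY},
      headers={
          "Accept": "application/sparql-results+json",
          "User-Agent": "graph-split-federation-example/0.1 (demo)",
      },
      timeout=60,
  )
  resp.raise_for_status()
  for row in resp.json()["results"]["bindings"]:
      print(row["article"]["value"], row["articleLabel"]["value"])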

We are experimenting with different hardware configurations to better understand whether we can speed up loading the graph by throwing hardware at the problem.

What we've accomplished

WDQS graph splitting

  • Started a list of people to contact who might help with, or be affected by, the split; Adam contacted one who might join the Scholia/WikiCite group, and Luca will take care of the others.
  • Discussion with Scholia/WikiCite:
  • Working on federated query examples [DC]
  • Potential hardware performance improvements:
    • AWS Neptune serverless completed the full latest-all.nt.bz2 import in a total of 63 hours. AWS Neptune with a provisioned high-power server (1.5 TB RAM, 96 vCPUs) showed a speed increase of perhaps 60% over the serverless option, but was stopped to avoid further costs after an initial import of latest-all.nt.bz2 reached about 1.3B records.
    • https://phabricator.wikimedia.org/T358727 has been opened to test the import on a server that already has an NVMe drive.
    • AWS-based import speed tests. For example, AWS Neptune serverless with a maximum of 128 NCUs (an NCU is said to be 2 GB of RAM plus some attendant CPU) processed 7,750,230,000 records in the window 26-February-2024 2:21:11 PM CT to 27-February-2024 1:46 PM CT; this import is ongoing. It is using the latest-all.nt.bz2 file from 16-February-2024 and, as of this writing, is utilizing about 70% of allocated CPU. An AWS Neptune import of latest-lexemes.nt.bz2 processed 163,715,491 records in 2,142 seconds; CPU utilization did not appear to peak during that import, staying around 50%. EC2-based imports seem able to approach the speed of an i7-8700 desktop gaming computer with 64 GB of RAM and an attached NVMe drive, but so far they have not shown a clearly faster import; they simply confirm that NVMe-based disks and faster CPUs both play a role in import speed, which is unsurprising but worth validating. (A sketch of a typical Neptune bulk-load call follows this list.)
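
For context on how timings like the above are produced, here is a minimal sketch of how an AWS Neptune bulk load of an N-Triples dump staged in S3 is typically started and then polled for record counts. The cluster endpoint, S3 bucket, and IAM role are placeholders, not the configuration used in these tests.

  # Minimal sketch: start an AWS Neptune bulk load of an N-Triples dump from S3
  # and poll its status. Endpoint, bucket, and IAM role are placeholders.
  import time
  import requests

  LOADER_URL = "https://my-neptune-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/loader"  # placeholder

  payload = {
      "source": "s3://my-dump-bucket/wikidata/latest-all.nt.bz2",        # placeholder bucket
      "format": "ntriples",
      "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",  # placeholder role
      "region": "us-east-1",
      "failOnError": "FALSE",          # keep loading past individual bad records
      "parallelism": "OVERSUBSCRIBE",  # use all available load threads
  }

  resp = requests.post(LOADER_URL, json=payload, timeout=30)
  resp.raise_for_status()
  load_id = resp.json()["payload"]["loadId"]

  # Poll the loader until the job leaves the in-progress states; the status
  # payload reports how many records have been processed so far.
  while True:
      status = requests.get(f"{LOADER_URL}/{load_id}", timeout=30).json()
      overall = status["payload"]["overallStatus"]
      print(overall["status"], overall.get("totalRecords"))
      if overall["status"] not in ("LOAD_NOT_STARTED", "LOAD_IN_PROGRESS"):
          break
      time.sleep(300)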

Search Update Pipeline

Improve multilingual zero-results rate

  • Built my regression test set for the dotted I (İ) fix task (https://phabricator.wikimedia.org/T358495) and did a quick test. The fix only does good things as long as we keep it away from languages that actually use dotted İ and dotless ı (İ/i and I/ı; this font is terrible for telling them apart!). The next step is configuring it efficiently everywhere, while looking at removing it from configs that use icu_folding (which makes it redundant) and at how best to do İ/i and I/ı lowercasing for the languages that need it (Turkish lowercasing or a quick mapping?). A sketch of both options follows below.
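
As a sketch of the two options for languages that need İ/i and I/ı handling, the settings below contrast the built-in Turkish-aware lowercase token filter with a quick char_filter mapping applied before standard lowercasing. The index and analyzer names are made up for illustration and are not the actual CirrusSearch configuration.

  # Sketch: two ways to get Turkish-style İ->i and I->ı lowercasing in
  # Elasticsearch. Index and analyzer names are illustrative only.
  import requests

  ES = "http://localhost:9200"  # placeholder cluster

  settings = {
      "settings": {
          "analysis": {
              "char_filter": {
                  # Option 2: a quick character mapping applied before the
                  # standard lowercase filter.
                  "turkish_i_mapping": {
                      "type": "mapping",
                      "mappings": ["İ=>i", "I=>ı"],
                  }
              },
              "filter": {
                  # Option 1: the language-aware lowercase token filter.
                  "turkish_lowercase": {"type": "lowercase", "language": "turkish"}
              },
              "analyzer": {
                  "with_turkish_lowercase": {
                      "type": "custom",
                      "tokenizer": "standard",
                      "filter": ["turkish_lowercase"],
                  },
                  "with_quick_mapping": {
                      "type": "custom",
                      "char_filter": ["turkish_i_mapping"],
                      "tokenizer": "standard",
                      "filter": ["lowercase"],
                  },
              },
          }
      }
  }

  requests.put(f"{ES}/dotted-i-test", json=settings).raise_for_status()

  # Both analyzers should turn "İstanbul" into "istanbul" and "Irmak" into "ırmak".
  for analyzer in ("with_turkish_lowercase", "with_quick_mapping"):
      r = requests.post(
          f"{ES}/dotted-i-test/_analyze",
          json={"analyzer": analyzer, "text": "İstanbul Irmak"},
      )
      print(analyzer, [t["token"] for t in r.json()["tokens"]])

Which option fits better likely depends on how it composes with the rest of the analysis chain, for example in configs that already use icu_folding.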