Search Platform/Weekly Updates/2024-01-19

Summary

We've addressed all known performance issues on the Search Update Pipeline and are now generating re-render events for all public wikis. Some more work is required around private wikis, but that's not blocking the rollout.

Query analysis for the WDQS Graph Split is showing some results and raising more questions. Some clients seem to have no issues (Pywikibot, SPARQLWrapper), while others show queries returning different results (Listeria, MixNMatch, WikidataIntegrator, ...). More investigation is required to understand whether this is a limitation of our testing strategy or of the graph split itself.

What we've accomplished

Search Update Pipeline

WDQS graph splitting

  • Some progress on analyzing differences in SPARQL query results - https://phabricator.wikimedia.org/T355040
    • Our query logs contain more than SELECT queries, so the SPARQL client used to collect the data has to be adapted to support the other query forms (ASK, CONSTRUCT, DESCRIBE) (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/991622)
    • Some queries fail because of response size. Bumping the limit to 16M did not resolve the problem, so I might stop here and simply ignore these queries
    • Very low numbers for Listeria and MixNMatch (34% and 17% identical results respectively). Their average result sizes are 1.6k and 8k, which might partly explain why identical results are hard to get; more investigation is needed to understand the cause
    • Mediocre numbers for WikidataIntegrator: 88% identical despite a very small average result size of 8; more investigation needed
    • Pywikibot and SPARQLWrapper are good at 99.4% for both
  • Expose 3 new dedicated WDQS endpoints: DNS entries, SSL certificate and microsite configuration are ready, but not yet working. Investigation required, but we're almost there - https://phabricator.wikimedia.org/T351650
  • Spark job to export the split graph from HDFS is completed, with appropriate tests - https://phabricator.wikimedia.org/T350106
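
To make the "identical" percentages in the query analysis above more concrete, here is a minimal sketch of how a per-client identical-result rate could be computed. It is not the comparison code used for T355040. It assumes each sample carries the `results.bindings` list of a SPARQL JSON response from the full graph and from the split graph, and it treats two results as identical when they contain the same rows regardless of order. The names `canonical` and `identical_rate` are hypothetical.

```python
from collections import Counter, defaultdict


def canonical(bindings):
    """Turn a `results.bindings` list into an order-insensitive multiset of rows.

    Each row maps a variable name to an RDF term dict (type, value, ...).
    """
    return Counter(
        frozenset((var, tuple(sorted(term.items()))) for var, term in row.items())
        for row in bindings
    )


def identical_rate(samples):
    """Return {client: fraction of queries with identical results}.

    `samples` is an iterable of (client, full_graph_bindings, split_graph_bindings).
    This input shape is an assumption made for the sketch.
    """
    same = defaultdict(int)
    total = defaultdict(int)
    for client, full, split in samples:
        total[client] += 1
        if canonical(full) == canonical(split):
            same[client] += 1
    return {client: same[client] / total[client] for client in total}


# Toy data, not real measurements.
row_a = {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"}}
row_b = {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"}}
rates = identical_rate([
    ("client-x", [row_a, row_b], [row_b, row_a]),  # same rows, different order
    ("client-x", [row_a], [row_b]),                # different rows
    ("client-y", [], []),                          # both empty
])
```

A strict comparison like this can understate agreement: a query without ORDER BY combined with LIMIT may legitimately return different rows on each run, and blank node labels are not stable across responses, so large result sets are more likely to be counted as different.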

Operations