Search Platform/Accountability

From Wikitech

High level

  • Search expertise: We understand how Search works. If you want to change something around Search, from simple UI changes to complex new features, please talk to us first. Search is more complex than you probably think, with ramifications for performance, data collection, and metrics.
  • Wikidata Query Service (WDQS): We operate Wikidata Query Service. We know how it works, its strengths, and its weaknesses. We are the technical owners of the service. The overall Wikidata product vision is owned by WMDE.
  • Wikimedia Commons Query Service (WCQS): As with WDQS, we own the technical operations of WCQS. The product vision is not owned by anyone at the moment.

Technical components

Search

Servers

  • elastic*: production Elasticsearch cluster, backend for Search. Incidentally also hosts indices for APIFeatureUsage and Toolhub.
  • relforge*: non-production hosts, used to validate relevance work.
  • cloudelastic*: exposes a copy of the Search indices, for use from Toolforge / WMCS. The main use case is https://global-search.toolforge.org/.
  • searchloader*: hosts for the Mjolnir daemon, which transfers data between the production Search clusters and the analytics network.

Other components

  • Analytics jobs: We run a number of jobs to populate the Search indices. We are responsible for the jobs themselves and for deploying them, but not for the underlying infrastructure (Airflow, Hadoop, etc.).
  • Search Update Pipeline: We own all data ingestion into the Search indices, including page mutations (creations, edits, deletions) and document enrichment (ORES topics, add link, image recommendation). The Search Platform team is also responsible for defining the format in which additional data sources are ingested into the update pipeline.

Query Services

Servers

  • wdqs*: Wikidata Query Service (internal and public-facing clusters, plus 2 test servers)
  • wcqs*: Wikimedia Commons Query Service

Other components

  • Update Pipeline: We operate an update pipeline based on Flink and running on the Wikikube k8s cluster. We are accountable for the update pipeline itself, but the underlying Flink and k8s infrastructure is owned by other teams.

Misc

For historical reasons, the Search Platform team owns a few components that are not strictly related to its mission:

Servers

  • apifeatureusage*: Logstash servers used to route APIFeatureUsage traffic to Elasticsearch indices.
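
The host-name patterns in the Servers lists on this page can be collected into a small lookup, which may help when deciding whether a given host falls under the Search Platform team. The following is only a minimal sketch: the patterns and descriptions are copied from this page, while the function name `component_for` and the host names in the usage example are hypothetical and not taken from any existing tool or inventory.

```python
from fnmatch import fnmatchcase

# Host-name patterns and short descriptions, copied from the Servers lists
# on this page. The dictionary itself is an illustration, not an official
# inventory.
HOST_PATTERNS = {
    "elastic*": "production Elasticsearch cluster, backend for Search",
    "relforge*": "non-production hosts used to validate relevance work",
    "cloudelastic*": "copy of the Search indices for Toolforge / WMCS",
    "searchloader*": "Mjolnir daemon (Search clusters <-> analytics network)",
    "wdqs*": "Wikidata Query Service",
    "wcqs*": "Wikimedia Commons Query Service",
    "apifeatureusage*": "Logstash servers routing APIFeatureUsage traffic",
}


def component_for(hostname):
    """Return (pattern, description) for a host name, or None if no match.

    Hypothetical helper: only the part before the first dot is compared,
    so fully qualified names are accepted as well.
    """
    short_name = hostname.split(".", 1)[0].lower()
    for pattern, description in HOST_PATTERNS.items():
        # fnmatchcase matches the whole string, so "elastic*" does not
        # accidentally match "cloudelastic..." hosts.
        if fnmatchcase(short_name, pattern):
            return pattern, description
    return None


if __name__ == "__main__":
    # The host names below are made up for illustration.
    for host in ("elastic1001", "cloudelastic1001.example.org", "mw1001"):
        print(host, "->", component_for(host))
```

A host that matches none of the patterns returns `None`, which here simply means it is not one of the server groups listed on this page.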

Other components

  • Geodata: MediaWiki extension.

Code repositories