SLO/OpenSearch IPoid

From Wikitech

Status: draft

Organizational

Service

OpenSearch IPoid provides an API for retrieving reputation data for IP addresses (e.g., identifying VPNs, proxies, or residential IPs). It serves as the storage and retrieval backend for the IPReputation extension, which is also used by mw:Extension:IPInfo and mw:Extension:WikimediaEvents. The service uses OpenSearch to index and query data imported from Spur.

Teams

  • Service Owner: Product Safety and Integrity (PSI) – responsible for the data ingestion logic in Airflow, the IPReputation and IPInfo extensions, and feature maintenance.
  • Infrastructure Owner: Search Platform (Data Platform SRE) – responsible for the underlying OpenSearch cluster on Kubernetes (dse-k8s).

Architectural

Environmental dependencies

The service runs as an OpenSearch cluster within the dse-k8s Kubernetes environment (currently in eqiad and codfw).

  • Namespace: opensearch-ipoid
  • Storage: TBD

Service dependencies

  • Spur Data Feed (Hard Dependency): The service relies on daily data updates from Spur. If ingestion fails, existing data remains readable but becomes stale. A daily cleanup job removes entries older than 7 days, so a broken feed must be fixed within one week before all data becomes unavailable.
  • OpenSearch Operator: Manages the lifecycle of the OpenSearch cluster in Kubernetes.
  • Airflow: Orchestrates the jobs that fetch data from Spur and index it into OpenSearch.
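
The 7-day retention described above could be expressed as an OpenSearch delete-by-query. A minimal sketch, assuming a hypothetical index name (`ipoid`) and timestamp field (`last_updated`), since the actual schema is not documented here:

```python
import json

# Hypothetical names -- the real index and field names may differ.
INDEX = "ipoid"
RETENTION_DAYS = 7

def cleanup_query(retention_days: int = RETENTION_DAYS) -> dict:
    """Build a _delete_by_query body removing records older than the retention window."""
    return {
        "query": {
            "range": {
                "last_updated": {"lt": f"now-{retention_days}d"}
            }
        }
    }

# This body would be POSTed to /{INDEX}/_delete_by_query on the cluster.
print(json.dumps(cleanup_query(), indent=2))
```

Using OpenSearch date math (`now-7d`) keeps the cutoff computation on the cluster side, so the job does not depend on the caller's clock.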

Client-facing

Clients

  • MediaWiki (IPReputation Extension): The primary consumer. MediaWiki makes synchronous requests to OpenSearch IPoid when privileged users view user information or Special:IPInfo.
  • SRE/Admin Tools: Occasional ad-hoc queries for investigation.

Request Classes

  • Query: Read-only lookups for IP reputation data.
  • Ingestion: Write requests during the daily data update window.
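
A Query-class request is a simple read-only lookup. A minimal sketch of such a lookup as an OpenSearch query body, assuming a hypothetical `ip` keyword field (the real mapping is not documented here):

```python
import json

def ip_lookup_query(ip: str) -> dict:
    """Build a _search body for an exact-match lookup on a hypothetical "ip" field."""
    return {"query": {"term": {"ip": ip}}}

# This body would be sent to /{index}/_search; 192.0.2.1 is a documentation address.
print(json.dumps(ip_lookup_query("192.0.2.1"), indent=2))
```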

Service Level Indicators (SLIs)

  • Availability: The percentage of search queries that return a successful (2xx) HTTP status code; 4xx client errors are excluded from the total.
  • Latency: The time taken to serve a search query, measured at the load balancer or ingress.
    • Suggested PromQL query: avg_over_time(probe_duration_seconds{job="probes/service", module=~"http_opensearch-.*"}[2m]) * 1000
  • Freshness: The age of the most recent data record. (Indicator: Time since last successful ingestion job).
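
The freshness indicator could be measured directly from the data: a max aggregation over an ingestion timestamp, then the age of that timestamp. A sketch, again assuming a hypothetical `last_updated` field:

```python
from datetime import datetime, timezone

def freshness_agg() -> dict:
    """_search body: max over a hypothetical ingestion-timestamp field, no hits."""
    return {"size": 0, "aggs": {"latest": {"max": {"field": "last_updated"}}}}

def data_age_hours(latest_epoch_ms: int, now=None) -> float:
    """Hours elapsed since the newest record, given the aggregation's epoch-ms value."""
    now = now or datetime.now(timezone.utc)
    latest = datetime.fromtimestamp(latest_epoch_ms / 1000, tz=timezone.utc)
    return (now - latest).total_seconds() / 3600
```

An alert would fire when the computed age exceeds the freshness target (48 hours for the realistic SLO below).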

Operational

Monitoring

The service is monitored via the standard Prometheus/Grafana stack for dse-k8s and OpenSearch.

  • Dashboards: TBD
  • Alerts: TBD

Troubleshooting

  • Connectivity issues: Check dse-k8s ingress logs and Network Policies.
  • Data issues: Verify Airflow job status and OpenSearch index stats (_cat/indices).
  • Cluster health: Standard OpenSearch diagnostic commands (_cluster/health, _cat/nodes).
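
The `_cluster/health` check above could be wrapped in a small triage helper that maps cluster status to an operator hint. A sketch only; the action hints are assumptions, not documented runbook steps:

```python
def classify_cluster_health(health: dict) -> str:
    """Map _cluster/health JSON to a triage hint (sketch, not an official runbook)."""
    status = health.get("status", "unknown")
    if status == "green":
        return "ok"
    if status == "yellow":
        return "degraded: unassigned replicas, check _cat/shards"
    if status == "red":
        return "critical: primary shards missing, escalate to Search Platform"
    return "unknown: could not parse cluster health"

# Example: a yellow cluster with unassigned replica shards.
sample = {"status": "yellow", "unassigned_shards": 2}
print(classify_cluster_health(sample))
```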

Deployment

  • Infrastructure: Deployed via Helm charts and the OpenSearch Operator, managed by the Search Platform team.
  • Configuration: Updates are applied via Kubernetes manifests in the relevant deployment repository.
  • Data Updates: Automated daily via Airflow pipelines.

Service Level Objectives

Realistic targets

  • Availability: 99.5% (the service is used in anti-abuse workflows).
  • Latency: 95% of requests served in < 300 ms.
  • Freshness: Data is no older than 48 hours.

Ideal targets

  • Availability: 99.9%
  • Latency: 99% of requests served in < 200ms.
  • Freshness: Data is no older than 24 hours.
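
To make the availability targets concrete, they can be translated into monthly downtime budgets (a standard SLO calculation, shown here over a 30-day window):

```python
def monthly_downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Allowed minutes of downtime per `days`-day window at a given availability."""
    return days * 24 * 60 * (1 - availability_pct / 100)

# 99.5% (realistic) allows ~216 minutes/month; 99.9% (ideal) allows ~43 minutes/month.
print(round(monthly_downtime_budget_minutes(99.5), 1))
print(round(monthly_downtime_budget_minutes(99.9), 1))
```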

Reconciliation