SLO/OpenSearch IPoid
Status: draft
Organizational
Service
OpenSearch IPoid provides an API for retrieving reputation data for IP addresses (e.g., identifying VPNs, proxies, or residential IPs). It serves as the storage and retrieval backend for the IPReputation extension, which is also used by mw:Extension:IPInfo and mw:Extension:WikimediaEvents. The service uses OpenSearch to index and query data imported from Spur.
Teams
- Service Owner: Product Safety and Integrity (PSI) – responsible for the data ingestion logic in Airflow, the IPReputation and IPInfo extensions, and feature maintenance.
- Infrastructure Owner: Search Platform (Data Platform SRE) – responsible for the underlying OpenSearch cluster on Kubernetes (dse-k8s).
Architectural
Environmental dependencies
The service runs as an OpenSearch cluster within the dse-k8s Kubernetes environment (currently in eqiad and codfw).
- Namespace: opensearch-ipoid
- Storage: TBD
Service dependencies
- Spur Data Feed (Hard Dependency): The service relies on daily updates from Spur. If ingestion fails, data becomes stale, though read availability remains intact. A daily cleanup job removes entries older than 7 days, so a broken feed leaves roughly one week to restore ingestion before all data has been purged.
- OpenSearch Operator: Manages the lifecycle of the OpenSearch cluster in Kubernetes.
- Airflow: Orchestrates the jobs that fetch data from Spur and index it into OpenSearch.
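The interaction between a failed Spur feed and the 7-day cleanup can be sketched as follows (a minimal illustration of the retention window; the function and timestamps are hypothetical, not part of the production cleanup job):

```python
from datetime import datetime, timedelta, timezone

# Entries older than this are removed by the daily cleanup job.
RETENTION = timedelta(days=7)

def days_until_empty(last_successful_ingestion: datetime, now: datetime) -> float:
    """If ingestion stopped at `last_successful_ingestion`, estimate how many
    days remain before the cleanup job has purged every record."""
    deadline = last_successful_ingestion + RETENTION
    return max((deadline - now).total_seconds() / 86400, 0.0)

now = datetime(2024, 6, 10, tzinfo=timezone.utc)
last_ok = datetime(2024, 6, 8, tzinfo=timezone.utc)  # feed broke two days ago
print(days_until_empty(last_ok, now))  # → 5.0
```

In other words, a feed failure is not an immediate outage, but it starts a one-week countdown that should be treated as an ingestion incident.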
Client-facing
Clients
- MediaWiki (IPReputation Extension): The primary consumer. MediaWiki makes synchronous requests to OpenSearch IPoid when privileged users view user information or Special:IPInfo.
- SRE/Admin Tools: Occasional ad-hoc queries for investigation.
Request Classes
- Query: Read-only lookups for IP reputation data.
- Ingestion: Write requests during the daily data update window.
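As a rough illustration of the read path, a reputation lookup is essentially an exact-match query against an OpenSearch index. The field name (`ip`) and query shape below are assumptions for illustration, not the production schema:

```python
import json

def ip_lookup_query(ip: str) -> dict:
    """Build an OpenSearch query body for an exact-match IP reputation lookup.
    The `ip` field name is assumed; check the actual index mapping."""
    return {"query": {"term": {"ip": ip}}, "size": 1}

# A client would POST this body to /<index>/_search on the cluster.
body = ip_lookup_query("203.0.113.7")
print(json.dumps(body))
```

Ingestion-class requests are bulk writes with the opposite profile: large, bursty, and confined to the daily update window.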
Service Level Indicators (SLIs)
- Availability: The percentage of search queries that return a successful (2xx) HTTP status code; 4xx client errors are excluded from the calculation.
- Latency: The time taken to serve a search query, measured at the load balancer or ingress.
- Suggested promQL query:
avg_over_time(probe_duration_seconds{job="probes/service", module=~"http_opensearch-.*"}[2m]) * 1000
- Freshness: The age of the most recent data record. (Indicator: Time since last successful ingestion job).
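The freshness indicator can be expressed as a simple age check against the last successful ingestion (a sketch; how the ingestion timestamp is obtained, e.g. from Airflow task state or an index field, is left open):

```python
from datetime import datetime, timedelta, timezone

# The "realistic" freshness objective from the SLO section below.
FRESHNESS_TARGET = timedelta(hours=48)

def is_fresh(last_ingestion: datetime, now: datetime) -> bool:
    """True if the most recent successful ingestion is within the target window."""
    return (now - last_ingestion) <= FRESHNESS_TARGET

now = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(hours=36), now))  # → True
print(is_fresh(now - timedelta(hours=60), now))  # → False
```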
Operational
Monitoring
The service is monitored via the standard Prometheus/Grafana stack for dse-k8s and OpenSearch.
- Dashboards: TBD
- Alerts: TBD
Troubleshooting
- Connectivity issues: Check dse-k8s ingress logs and Network Policies.
- Data issues: Verify Airflow job status and OpenSearch index stats (_cat/indices).
- Cluster health: Standard OpenSearch diagnostic commands (_cluster/health, _cat/nodes).
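The `_cluster/health` endpoint returns JSON with a `status` field of `green`, `yellow`, or `red` (standard OpenSearch behavior). A small sketch of interpreting that response; the triage messages are illustrative suggestions, not an official runbook:

```python
import json

def triage(health_json: str) -> str:
    """Map the _cluster/health `status` field to a rough operator action."""
    status = json.loads(health_json)["status"]
    return {
        "green": "all shards allocated; no action needed",
        "yellow": "replica shards unassigned; check node count and disk",
        "red": "primary shards unassigned; escalate to the infrastructure owner",
    }[status]

# e.g. the JSON returned by: curl -s https://<cluster>/_cluster/health
print(triage('{"status": "yellow", "number_of_nodes": 3}'))
```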
Deployment
- Infrastructure: Deployed via Helm charts and the OpenSearch Operator, managed by the Search Platform team.
- Configuration: Updates are applied via Kubernetes manifests in the relevant deployment repository.
- Data Updates: Automated daily via Airflow pipelines.
Service Level Objectives
Realistic targets
- Availability: 99.5% (The service is used in anti-abuse workflows).
- Latency: 95% of requests served in < 300ms.
- Freshness: Data is no older than 48 hours.
Ideal targets
- Availability: 99.9%
- Latency: 99% of requests served in < 200ms.
- Freshness: Data is no older than 24 hours.
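The availability targets above imply concrete error budgets; as a quick check, assuming a 30-day window:

```python
def monthly_error_budget_minutes(availability: float, days: int = 30) -> float:
    """Downtime allowed per window at the given availability target."""
    return (1 - availability) * days * 24 * 60

print(round(monthly_error_budget_minutes(0.995), 1))  # → 216.0 (about 3.6 hours)
print(round(monthly_error_budget_minutes(0.999), 1))  # → 43.2 minutes
```

So moving from the realistic to the ideal availability target shrinks the monthly downtime budget by roughly a factor of five.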