SLO/Image Suggestions
Status: Approved
Organizational
Service
- Airflow DAGs made of Spark jobs form the data processing backbone that generates image suggestions from Wiki dumps
- the Data Gateway provides persistence based on the Analytics Query Service (AQS) Cassandra cluster and an internal API that serves suggestions
- CirrusSearch provides persistence based on Elastic that tells whether suggestions are available for a given article of a given Wikipedia
- the GrowthExperiments MediaWiki extension provides an Action API endpoint that allows desktop and mobile clients to access suggestions on Wikipedias

Teams
| Team | Components |
|---|---|
| Former Structured Content engineers | Data pipelines as Airflow DAGs of Spark jobs |
| Data Engineering | Data Lake (upstream data dependencies), Airflow |
| Data Persistence | Data Gateway (AQS Cassandra cluster and internal API) |
| Search Platform | Wiki search indices updates |
| Growth | Client API via GrowthExperiments Extension |
| Mobile Apps | Client APIs via Suggested edits on Android and iOS |
Architectural
Environmental dependencies
- platform_eng Airflow instance
- Data Gateway
- Wiki indices
- GrowthExperiments
- Suggested edits on Android
- Suggested edits on iOS
Service dependencies
Hard Dependencies:
- Without the Wiki dumps, there is no Wikidata data to work with to kick-start the process.
- Without the Airflow pipelines, no image suggestions data is fed into the Hive tables, which in turn leaves OpenSearch without any image suggestions.
- Without CirrusSearch, the Action API can't provide image suggestions, since it relies on querying its indices for available image suggestions across article topics.
Client-facing
Clients
- Add_Image - More information on how endpoints for Mobile and Web clients work
Service Level Indicators (SLIs)
- Availability SLI for GrowthExperiments ImageSuggestions Action API: The percentage of all image suggestions requests receiving a non-error response, defined as "HTTP status code 200". The SLO covers the following API endpoints:
  - ApiQueryImageSuggestionData, which allows adding image-suggestions data to responses from the query Action API endpoint
  - AddImageFeedbackHandler, a REST route to record the user's decision on the recommendations for a given page
- Uptime SLI for Pipeline Uptime: Availability of the Airflow Jobs that handle the Image Suggestions Generation.
- Freshness SLI for Image Suggestions Pipeline: Time from article edit to updated suggestions
- Availability of Error-free mediawiki_content_current_v1: The percentage of error-free input that doesn't lead to a break in the pipeline.
- Uptime SLI for OpenSearch: Availability of the OpenSearch platform.
- Uptime SLI for Analytics Stack: Availability of the Analytics Stack.
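As an illustration of the availability SLI defined above, the following is a minimal sketch of how the percentage of HTTP 200 responses could be computed from a batch of status codes. The function name `availability_sli` and the sample data are hypothetical and are not taken from the actual monitoring setup.

```python
def availability_sli(status_codes):
    """Return the percentage of requests answered with HTTP 200.

    Hypothetical helper: `status_codes` is any iterable of integer HTTP
    status codes collected for image suggestions requests.
    Returns None when there were no requests, since the ratio is undefined.
    """
    codes = list(status_codes)
    if not codes:
        return None
    ok = sum(1 for code in codes if code == 200)
    return 100.0 * ok / len(codes)


# Made-up sample: 97 successful requests and 3 failures.
sample = [200] * 97 + [500, 503, 429]
sli = availability_sli(sample)
meets_target = sli is not None and sli >= 95.0  # 95% is the target in this SLO
```

Returning `None` for an empty window keeps "no traffic" distinct from "0% availability", which matters if the same window is also covered by a no-requests alert.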
Operational
Monitoring
- Growth's current monitoring sends an alert if 5 minutes pass without any requests for Article Level Image Suggestions (ALIS): https://grafana.wikimedia.org/alerting/grafana/dencr2n4bt9fkc/view?tab=details
- Runtime and failure rate of the following Airflow DAGs:
Troubleshooting
- See runbook.
- See Image Recommendations Guide
Deployment
Done in GitLab and on deployment machines, see docs.
Service Level Objectives
| Metric | Teams | Measurement / SLI | Target SLO |
|---|---|---|---|
| Client API Uptime | Mobile Apps, Growth | Image suggestions requests getting an HTTP 200 response | ≥ 95%(1) |
| Pipelines Freshness | Former Structured Content engineers | Time from article edit to updated suggestions | ≤ 1 week(2) |
| Pipelines Uptime | Former Structured Content engineers | Successful Spark jobs | ≥ 95% |
| mediawiki_content_current_v1 Uptime | Data Engineering | Dataset availability | ≥ 95% |
| mediawiki_content_current_v1 Data completeness | Data Engineering | Consistent dataset that doesn't cause pipelines failure | ≥ 95% |
| CirrusSearch Uptime | Search Platform | Availability of CirrusSearch Platform | ≥ 95% |
| Analytics Stack | Data Engineering | Availability of the Analytics Stack | ≥ 95% |
(1) A 95% availability SLO implies up to 4.5 days of downtime per quarter.
(2) Image suggestions are generated weekly, so the integration of user feedback (which translates into an article edit if a given suggestion is accepted) follows the same schedule.
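The downtime figure in note (1) can be reproduced with a short calculation. This is a sketch that assumes a quarter of 90 days, which is an assumption consistent with the 4.5-day figure rather than something stated on this page; the function name is hypothetical.

```python
def allowed_downtime_days(slo_percent, window_days):
    """Downtime budget, in days, implied by an availability target.

    Hypothetical helper: slo_percent is the target (e.g. 95),
    window_days is the length of the reporting window in days.
    """
    return (100 - slo_percent) * window_days / 100


QUARTER_DAYS = 90  # assumption: a quarter is approximated as 90 days
budget = allowed_downtime_days(95, QUARTER_DAYS)
```

With these inputs, `budget` evaluates to 4.5 days, matching note (1). The same function can be applied to other windows, for example 30 days for a monthly view.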