
SLO/Image Suggestions

From Wikitech

Status: Approved

Organizational

Service

Image Suggestions general architecture

Teams

Team | Components
Former Structured Content engineers | Data pipelines as Airflow DAGs of Spark jobs
Data Engineering | Data Lake (upstream data dependencies), Airflow
Data Persistence | Data Gateway (AQS Cassandra cluster and internal API)
Search Platform | Wiki search indices updates
Growth | Client API via GrowthExperiments Extension
Mobile Apps | Client APIs via Suggested edits on Android and iOS

Architectural

Environmental dependencies

Service dependencies

Hard Dependencies:

  • Without the Wiki Dumps, there is no Wikidata content to kick-start the process.
  • Without the Airflow pipelines, no image suggestions data is fed into the Hive tables, which in turn leaves OpenSearch without any image suggestions.
  • Without CirrusSearch, the Action API can't provide image suggestions, since it relies on querying its indices for available image suggestions across article topics.

Client-facing

Clients

  • Add_Image - More information on how endpoints for Mobile and Web clients work

Service Level Indicators (SLIs)

  • Availability SLI for GrowthExperiments ImageSuggestions Action API: The percentage of all image suggestions requests that receive a non-error response, defined as HTTP status code 200. The SLO covers the following API endpoints:
    • ApiQueryImageSuggestionData, which allows adding image-suggestions data to responses from the query Action API endpoint.
    • AddImageFeedbackHandler, a REST route that records the user's decision on the recommendations for a given page.
  • Uptime SLI for Pipelines: Availability of the Airflow jobs that generate image suggestions.
  • Freshness SLI for Image Suggestions Pipeline: Time from article edit to updated suggestions
  • Availability of error-free mediawiki_content_current_v1: The percentage of input data that is error-free, i.e. that does not break the pipeline.
  • Uptime SLI for OpenSearch: Availability of the OpenSearch platform.
  • Uptime SLI for Analytics Stack: Availability of the Analytics Stack.
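
The availability SLI above is a simple ratio, which can be illustrated with a minimal sketch. The function name `availability_sli` and the sample status codes are hypothetical, chosen for illustration only; this is not how the production metric is collected, just the arithmetic behind the definition "percentage of requests receiving HTTP 200".

```python
from collections import Counter
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Return the percentage of requests answered with HTTP 200.

    Any other status code counts against availability, following the
    SLI definition above (non-error response == HTTP 200).
    """
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests observed in the measurement window")
    return 100.0 * counts[200] / total


# Hypothetical sample window: 97 successful requests and 3 failures.
sample = [200] * 97 + [500, 503, 429]
sli = availability_sli(sample)
print(f"availability SLI: {sli:.1f}%")
print("meets 95% target:", sli >= 95.0)
```

The same ratio could be applied to the pipeline uptime indicator by counting successful job runs instead of HTTP 200 responses.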

Operational

Monitoring

Troubleshooting

Deployment

Deployment is done in GitLab and on deployment machines; see the docs.

Service Level Objectives

Metric | Teams | Measurement / SLI | Target SLO
Client API Uptime | Mobile Apps, Growth | Image suggestions requests getting an HTTP 200 response | ≥ 95% (1)
Pipelines Freshness | Former Structured Content engineers | Time from article edit to updated suggestions | ≤ 1 week (2)
Pipelines Uptime | Former Structured Content engineers | Successful Spark jobs | ≥ 95%
mediawiki_content_current_v1 Uptime | Data Engineering | Dataset availability | ≥ 95%
mediawiki_content_current_v1 Data completeness | Data Engineering | Consistent dataset that doesn't cause pipeline failures | ≥ 95%
CirrusSearch Uptime | Search Platform | Availability of CirrusSearch Platform | ≥ 95%
Analytics Stack | Data Engineering | Availability of the Analytics Stack | ≥ 95%

(1) A 95% availability SLO allows up to 4.5 days of downtime per quarter (5% of a roughly 90-day quarter).

(2) Image suggestions are generated weekly, so the integration of user feedback (which translates into an article edit if a given suggestion is accepted) follows the same schedule.
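
The arithmetic behind the two footnotes can be checked with a short sketch. It assumes a quarter of 90 days, and the names `allowed_downtime` and `meets_freshness` as well as the sample dates are hypothetical, introduced only for illustration.

```python
from datetime import datetime, timedelta

# Assumption: a quarter is approximated as 90 days.
QUARTER = timedelta(days=90)
# Freshness target from the SLO table: at most one week.
FRESHNESS_TARGET = timedelta(weeks=1)


def allowed_downtime(slo_percent: float, window: timedelta = QUARTER) -> timedelta:
    """Error budget: the share of the window that may be spent unavailable."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return window * ((100 - slo_percent) / 100)


def meets_freshness(edited_at: datetime,
                    suggestions_updated_at: datetime,
                    target: timedelta = FRESHNESS_TARGET) -> bool:
    """True if suggestions were refreshed within the target after the edit."""
    return suggestions_updated_at - edited_at <= target


# Footnote (1): downtime allowed by a 95% SLO over a 90-day quarter, in days.
budget = allowed_downtime(95)
print(budget.total_seconds() / 86400)

# Footnote (2): hypothetical edit and refresh timestamps.
edit_time = datetime(2024, 1, 1)
print(meets_freshness(edit_time, datetime(2024, 1, 6)))   # refreshed after 5 days
print(meets_freshness(edit_time, datetime(2024, 1, 10)))  # refreshed after 9 days
```

Because suggestions are regenerated weekly, an edit made just after a pipeline run waits close to the full week, so the one-week target leaves little slack for a delayed or failed run.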