MariaDB/Zarcillo

Zarcillo

https://zarcillo.wikimedia.org/ is a tool that displays information about Wikimedia MariaDB shards and servers. Unlike Orchestrator, it is publicly accessible, although many functions are available only to SREs.

Zarcillo is being developed as part of T384810 and its related tasks.

The service provides a web UI at https://zarcillo.wikimedia.org/.

It runs on Kubernetes and provides:

  • A web UI and API to display database status
  • Depooling and repooling of MariaDB replicas based on health and role transitions
  • Help with switching instances between the master, candidate-master, and replica roles
  • Lock coordination to avoid conflicting maintenance and failover activities
  • Tracking of the source and validity of the instance status “ground truth”

Source code: Zarcillo

Main tracking ticket: T384810

See Architecture documentation for more details.

Grafana dashboards:


Web UI

The web UI is published at zarcillo.wikimedia.org.

After logging in, it provides the pages listed below.

Most tables are sortable. Some headings can be hotlinked (e.g. https://zarcillo.wikimedia.org/ui/weights#es2).

Instances dashboard

See https://zarcillo.wikimedia.org/ui/instances

Displays "raw" contents from:


Locks

See https://zarcillo.wikimedia.org/ui/locks

Lists locks by instance and user.

Also provides:

  • Form to acquire new locks
  • Button to release locks
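The acquire and release actions can be pictured with a toy in-memory model. This is only a sketch: it assumes at most one holder per instance, and the class name, method names, and the instance and user names used with it are hypothetical rather than taken from Zarcillo.

```python
class LockRegistry:
    """Toy in-memory model: at most one lock holder per instance (assumption)."""

    def __init__(self) -> None:
        self._holders: dict[str, str] = {}  # instance -> user holding the lock

    def acquire(self, instance: str, user: str) -> bool:
        """Return True if `user` holds the lock on `instance` after the call."""
        return self._holders.setdefault(instance, user) == user

    def release(self, instance: str, user: str) -> bool:
        """Only the current holder may release; return True on success."""
        if self._holders.get(instance) != user:
            return False
        del self._holders[instance]
        return True

    def held_by(self, user: str) -> list[str]:
        """List the instances currently locked by `user`, sorted by name."""
        return sorted(i for i, u in self._holders.items() if u == user)
```

With this model, a second user trying to lock an instance that is already locked gets `False` back and has to wait until the holder releases it.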


Hosts dashboard

See https://zarcillo.wikimedia.org/ui/hosts

Shows database hosts including:

  • DC
  • Location (datacenter)
  • Sections served (with a sleepy icon 😴 when not pooled)
  • Kernel version
  • MariaDB version
  • Active locks
  • Alarms (Icinga / Alertmanager)

The Filter button highlights hosts running old kernel or MariaDB versions.

It also allows adding hosts from the UI.
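Highlighting old kernel or MariaDB versions requires a numeric comparison of versions, because plain string comparison would rank "10.6" above "10.11". The following is a minimal sketch of such a check; the function names and the sample version strings are assumptions, not Zarcillo's actual code.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading dotted numeric part, e.g. '6.1.0-28-amd64' -> (6, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_older(version: str, minimum: str) -> bool:
    """True if `version` sorts strictly before `minimum` when compared numerically."""
    return parse_version(version) < parse_version(minimum)
```

For example, `is_older("10.6.17", "10.11.0")` is true, whereas comparing the same two strings directly would give the opposite answer.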


Schema change summary

See https://zarcillo.wikimedia.org/ui/schema_change

Shows current and past schema changes tracked by schema_change_helper.py (see PR 42).

Hovering on checkboxes shows who ran the helper and when.

The icons represent:

  • No icon: schema change never started.
  • Hourglass (⏳): schema change pending during the current auto_schema run. The host has not been depooled yet.
  • Spinner: depooling, the schema change, or repooling is ongoing on the instance.
  • Checkmark (✅): schema change completed.
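The four icon states above form a simple progression. A sketch of how they could be modelled and summarised per schema change follows; the enum and function names are hypothetical and do not come from schema_change_helper.py.

```python
from collections import Counter
from enum import Enum


class SchemaChangeState(Enum):
    NOT_STARTED = "no icon"    # schema change never started
    PENDING = "hourglass"      # queued in the current run, host not yet depooled
    IN_PROGRESS = "spinner"    # depooling, schema change, or repooling ongoing
    COMPLETED = "checkmark"    # schema change completed


def progress(states: list[SchemaChangeState]) -> tuple[int, int]:
    """Return (completed, total) for the instances of one schema change."""
    counts = Counter(states)
    return counts[SchemaChangeState.COMPLETED], len(states)
```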

Sections dashboard

See https://zarcillo.wikimedia.org/ui/sections

Primary dashboard showing all sections and their hosts.

For each host it shows:

  • Hostname, instance port, role
  • Replication lag (from heartbeat table via Prometheus metrics mysql_heartbeat_now_timestamp_seconds and mysql_heartbeat_stored_timestamp_seconds)
  • Host uptime (node_boot_time_seconds)
  • Tags (pooled, preferred candidate, alarms, etc.)
  • Candidate score (CS)
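The lag and uptime columns follow directly from the Prometheus metrics named above: lag is the difference between the two heartbeat timestamps, and uptime is the current time minus the boot time. A minimal sketch, in which the function names and the clamping of negative values to zero are assumptions:

```python
def replication_lag_seconds(now_ts: float, stored_ts: float) -> float:
    """Lag from mysql_heartbeat_now_timestamp_seconds minus
    mysql_heartbeat_stored_timestamp_seconds, clamped at zero (assumption)."""
    return max(0.0, now_ts - stored_ts)


def uptime_seconds(current_ts: float, boot_ts: float) -> float:
    """Host uptime: current Unix time minus node_boot_time_seconds."""
    return current_ts - boot_ts
```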

Zarcillo computes a candidate score (CS) for each replica host. The score is used in master switchover decisions.

The score is based on:

  • Existing alarms (none is better)
  • Replication lag (lower is better)
  • Uptime (higher is better)
  • Kernel version (newer is better)
  • MariaDB version (newer is better)

This supports safer switchover operations and rolling upgrades.
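The exact scoring formula is not described here. As an illustration only, the criteria listed above can be combined into a lexicographic ranking, where each criterion breaks ties left by the previous one. The dataclass, its field names, and the ordering of the criteria are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    alarms: int                # number of active alarms (none is better)
    lag_seconds: float         # replication lag (lower is better)
    uptime_seconds: float      # host uptime (higher is better)
    kernel: tuple[int, ...]    # parsed kernel version (newer is better)
    mariadb: tuple[int, ...]   # parsed MariaDB version (newer is better)


def ranking_key(replica: Replica) -> tuple:
    """Larger key means better candidate; 'lower is better' fields are negated."""
    return (-replica.alarms, -replica.lag_seconds, replica.uptime_seconds,
            replica.kernel, replica.mariadb)


def rank_candidates(replicas: list[Replica]) -> list[Replica]:
    """Return replicas ordered from best to worst candidate."""
    return sorted(replicas, key=ranking_key, reverse=True)
```

A weighted numeric score would be an equally plausible design; the lexicographic form is used here only because it needs no invented weights.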


Weights dashboard

See https://zarcillo.wikimedia.org/ui/weights

Shows instance weights grouped by section, hostname, and groups.

The “std” column highlights the weights that were standardized in 2025.

Flags differences between eqiad and codfw in the “diff” column.

Candidate planner

See https://zarcillo.wikimedia.org/ui/planner

Uses z3 to compute candidate locations and database movements to minimize the risk of rack failures impacting multiple masters or candidates.

Added in https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/merge_requests/20 per T371362. It is currently at the prototype stage.
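The planner's goal can be illustrated on a tiny input without a solver. The sketch below does not use z3: it substitutes an exhaustive search that places each master or candidate in a rack so that as few as possible share a rack and, among such placements, as few as possible have to move. The function name, the cost definition, and the host and rack names are all hypothetical.

```python
from collections import Counter
from itertools import product


def plan_racks(current: dict[str, str], racks: list[str]) -> dict[str, str]:
    """Brute-force placement of critical hosts (masters and candidates).

    Minimises (hosts sharing a rack, number of moves), in that order.
    Only practical for a handful of hosts; a solver scales much better.
    """
    hosts = sorted(current)
    best_cost, best_plan = None, dict(current)
    for assignment in product(racks, repeat=len(hosts)):
        per_rack = Counter(assignment)
        # Every host beyond the first in a rack is exposed to the same failure.
        shared = sum(count - 1 for count in per_rack.values())
        moves = sum(rack != current[host] for host, rack in zip(hosts, assignment))
        cost = (shared, moves)
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(hosts, assignment))
    return best_plan
```

Given two critical hosts in one rack and a spare empty rack, the search returns a plan that moves exactly one of them.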

Clone dashboard

See https://zarcillo.wikimedia.org/ui/host_clone_events

Shows the status of MariaDB clone cookbook runs, as per T417608.

API and documentation

See https://zarcillo.wikimedia.org/apidocs

OpenAPI/Swagger documentation for all API and UI endpoints.

Notes:

  • /api returns JSON
  • /content is for HTMX building blocks
  • /healthz is for Kubernetes health checks
  • /metrics exposes Prometheus metrics
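The /metrics output uses the Prometheus text exposition format, which is easy to inspect by hand. The following is a minimal parsing sketch; it assumes samples carry no trailing timestamp, and the metric names in the usage example are made up rather than taken from Zarcillo.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Map each series (metric name plus labels) to its sample value.

    Skips blank lines and '# HELP' / '# TYPE' comments. Assumes no
    trailing timestamps, so the value is the last field on the line.
    """
    samples: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical payload, for illustration only.
EXAMPLE = """\
# HELP example_requests_total Hypothetical counter.
# TYPE example_requests_total counter
example_requests_total{path="/healthz"} 12
example_up 1
"""
```

Calling `parse_metrics(EXAMPLE)` yields two series, keyed by the full series string including any labels.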

Development

You can use Just and the related justfile.

List available commands with:

$ just -l
Available recipes:
    copy_prod_tables_from_db1215_to_preprod
    deploy_prod_once                # Deploy current container
    deploy_prod_polling             # Poll/deploy on changes
    fetch_logs                      # Fetch production logs
    fetch_logs_and_follow           # Fetch raw production logs
    generate_html_docs              # Requires asciidoctor
    generate_run_local_container
    import_prod_tables_from_db1215_to_localdev
    ingest_puppet_hiera_data        # Import Hiera data
    ingest_puppet_role_data         # Import Role data
    kube_get_pods                   # List K8s pods
    local_test
    local_test_automation
    log_on_local_mariadb
    log_on_preprod_mariadb
    log_on_prod_mariadb
    run_ci_podman_devel
    run_ci_podman_prod              # Run CI container locally
    setup_local_dev_pod
    setup_local_mariadb
    setup_local_testbed
    zap_local_dev_pod

setup_local_testbed sets up dedicated containers and populates MariaDB locally. It requires a local copy or symlink of service-template/generate_local_podman_container.py.

For deploying, see the deploy_prod* recipes. They need deploy.json, deployer.py, and setup_service.py in the ./local directory.