Portal:Toolforge/Admin/Runbooks/HarborDown

From Wikitech

This alert fires when the Prometheus host is not able to complete an HTTPS request to the harbor instance.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

The alert tells you which project (tools, toolsbeta, ...) it is failing for, and the URL of the harbor instance that fails.

Note that this request goes through the proxies, so the issue might be there rather than in harbor itself.
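To tell a proxy problem from a harbor problem, it can help to repeat a similar HTTPS request by hand. The following is a minimal sketch, not the actual Prometheus check: classify_probe and probe_harbor are hypothetical helper names, and the URL to pass is the one shown in the alert.

```shell
# Hypothetical helper: interpret the result of a manual HTTPS probe.
# $1 = curl exit code, $2 = HTTP status code reported by curl (000 if none).
classify_probe() {
    probe_rc=$1
    probe_code=$2
    if [ "$probe_rc" -ne 0 ]; then
        echo "no HTTP response (curl exit $probe_rc): check DNS, TLS and the proxies"
    elif [ "$probe_code" -ge 500 ]; then
        echo "HTTP $probe_code: a proxy or harbor answered with a server error"
    elif [ "$probe_code" -ge 200 ] && [ "$probe_code" -lt 400 ]; then
        echo "HTTP $probe_code: harbor is reachable through the proxies"
    else
        echo "HTTP $probe_code: unexpected response, inspect it manually"
    fi
}

# Hypothetical wrapper: $1 = URL of the harbor instance, taken from the alert.
probe_harbor() {
    http_code=$(curl --silent --output /dev/null --max-time 10 \
        --write-out '%{http_code}' "$1")
    # $? still holds the exit status of the curl command substitution above.
    classify_probe "$?" "$http_code"
}
```

Calling probe_harbor with the URL from the alert prints a single line summarizing the result. Running it both from outside and from the harbor instance itself may help narrow down whether the proxies are involved.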

Debugging

  • You can ssh to the harbor instance directly and check whether, and how, it is running (use the instance of the project the alert is from):
    me@local$ ssh tools-harbor-1.tools.eqiad1.wikimedia.cloud
    me@tools-harbor-1$ sudo -i
    root@tools-harbor-1:~# cd /srv/ops/harbor/
    root@tools-harbor-1:/srv/ops/harbor# docker-compose ps
          Name                     Command                  State                          Ports                    
    ----------------------------------------------------------------------------------------------------------------
    harbor-core         /harbor/entrypoint.sh            Up (healthy)                                               
    harbor-exporter     /harbor/entrypoint.sh            Up                                                         
    harbor-jobservice   /harbor/entrypoint.sh            Up (healthy)                                               
    harbor-log          /bin/sh -c /usr/local/bin/ ...   Up (healthy)   127.0.0.1:1514->10514/tcp                   
    harbor-portal       nginx -g daemon off;             Up (healthy)                                               
    nginx               nginx -g daemon off;             Up (healthy)   0.0.0.0:80->8080/tcp, 0.0.0.0:9090->9090/tcp
    redis               redis-server /etc/redis.conf     Up (healthy)                                               
    registry            /home/harbor/entrypoint.sh       Up (healthy)                                               
    registryctl         /home/harbor/start.sh            Up (healthy)
    
  • You can try to restart it with docker-compose restart, or start it again with docker-compose up -d.
  • You can also check the logs of each component with docker logs harbor-portal, where harbor-portal is the name of the component.
  • Logs may also be found in /var/log/harbor/
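When many containers are listed, a small filter can point out the ones that need attention. This is a sketch that assumes the table layout shown in the docker-compose ps output above (two header lines, service name in the first column). unhealthy_services is a hypothetical helper name, and other docker-compose versions may print a different layout.

```shell
# Print the names of services that are not "Up", or that report "(unhealthy)".
# Reads `docker-compose ps` table output on stdin and skips the two header lines.
unhealthy_services() {
    awk 'NR > 2 && NF > 0 {
        is_up = ($0 ~ / Up / || $0 ~ / Up$/)
        if (!is_up || $0 ~ /\(unhealthy\)/) print $1
    }'
}
```

For example, from /srv/ops/harbor on the harbor instance: docker-compose ps | unhealthy_services. No output means every service is "Up" and none reports "(unhealthy)". A plain "Up" with no health status, as harbor-exporter shows above, is not flagged.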

Common issues

Add new issues here when you encounter them!

Trove DB out of space

This shows up as login errors in the UI, which then appear in the logs as DB errors like:


root@toolsbeta-harbor-1:/srv/ops/harbor# docker logs --tail 1000 -f harbor-core
...
2023-08-07T07:23:21Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /service/notifications/tasks/41 request in transaction failed: failed to connect to `host=ttg4ncgzifw.svc.trove.eqiad1.wikimedia.cloud user=harbor database=harbor`: dial error (dial tcp 172.16.5.95:5432: connect: connection refused)"}]}
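To see quickly whether such errors are present and which DB host they point at, the log can be filtered. This is a sketch that assumes the message format of the log line above ("failed to connect to" followed by host=...). db_connect_failures is a hypothetical helper name.

```shell
# Count "failed to connect" errors per database host.
# Reads harbor-core log lines on stdin.
db_connect_failures() {
    grep 'failed to connect to' \
        | grep -o 'host=[^ ]*' \
        | sed 's/^host=//' \
        | sort \
        | uniq -c
}
```

For example: docker logs --tail 1000 harbor-core 2>&1 | db_connect_failures. Each output line is a count followed by a host name. Empty output means no such errors were found in the inspected lines.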

You can verify this by logging into the Trove DB instance and checking the disk space:

(guest-agent-venv) root@harbordb:/root# df -h                                                                                                                 
...                                                                                                                                         
/dev/sdb        4.9G  4.7G     0 100% /var/lib/postgresql                                      
...

(guest-agent-venv) root@harbordb:/var/lib/postgresql/data/pgdata# du -hs *                                                                                                                                                                                                                                                                                                 
...
4.6G    pg_wal
...

In this case pg_wal was using too much space, but that should have been fixed.
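The disk check above can also be reduced to a one-line filter. This is a sketch that assumes the POSIX output format of df -P (capacity in the fifth column, mount point in the sixth, no spaces in mount points). full_filesystems is a hypothetical helper name, and the 90% default threshold is an arbitrary choice.

```shell
# Print mount points whose usage is at or above a threshold percentage.
# Reads `df -P` output on stdin; $1 = threshold (defaults to 90).
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign from the capacity column
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}
```

For example, on the DB instance: df -P | full_filesystems 90. With the situation shown above, this would list /var/lib/postgresql at 100%.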

Old incidents

  • T354714 - Trove DB filled disk and caused toolforge-build to fail as a result