Portal:Toolforge/Admin/Runbooks/HarborComponentDown

From Wikitech

This alert fires when one of the Harbor components is down or is not reporting itself as up.

This might make Harbor unusable by Toolforge and prevent any image from being pulled or pushed. As a result, any run that needs a non-cached image may fail to start, and building a new image for a tool may fail.

It might also prevent some of the Toolforge system components from being deployed.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

The alert tells you which project (tools, toolsbeta, ...) is affected, the Harbor instance, and the component that is failing, or it carries a message saying that none of the components is reporting anything.

If no component is reporting anything, it might be an issue with Prometheus/Alertmanager.

Debugging

You can ssh to the Harbor instance directly and check whether and how it is running (use the instance in the project the alert is from):

$ ssh tools-harbor-1.tools.eqiad1.wikimedia.cloud
tools-harbor-1$ sudo -i
root@tools-harbor-1:~# cd /srv/ops/harbor/
root@tools-harbor-1:/srv/ops/harbor# docker-compose ps
      Name                     Command                  State                          Ports                    
----------------------------------------------------------------------------------------------------------------
harbor-core         /harbor/entrypoint.sh            Up (healthy)                                               
harbor-exporter     /harbor/entrypoint.sh            Up                                                         
harbor-jobservice   /harbor/entrypoint.sh            Up (healthy)                                               
harbor-log          /bin/sh -c /usr/local/bin/ ...   Up (healthy)   127.0.0.1:1514->10514/tcp                   
harbor-portal       nginx -g daemon off;             Up (healthy)                                               
nginx               nginx -g daemon off;             Up (healthy)   0.0.0.0:80->8080/tcp, 0.0.0.0:9090->9090/tcp
redis               redis-server /etc/redis.conf     Up (healthy)                                               
registry            /home/harbor/entrypoint.sh       Up (healthy)                                               
registryctl         /home/harbor/start.sh            Up (healthy)
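With this many components it can help to filter the listing down to the ones that need attention. The following is a minimal sketch, not part of the official tooling: `unhealthy_components` is a hypothetical helper name, and it assumes the column layout shown above, where a healthy row contains `Up` in the State column and does not contain `(unhealthy)`.

```shell
# Hypothetical helper: reads `docker-compose ps` output on stdin and prints
# the name of every component that is not "Up" or is flagged "(unhealthy)".
# Assumes the table layout shown above (header line, dashed separator, rows).
unhealthy_components() {
  awk '
    # Skip the header line, the dashed separator and empty lines.
    $1 == "Name" || /^-+$/ || NF == 0 { next }
    # Report rows that are not Up, or that are Up but marked unhealthy.
    $0 !~ / Up( |$)/ || /\(unhealthy\)/ { print $1 }
  '
}
```

On the Harbor instance it could be used as `docker-compose ps | unhealthy_components` from `/srv/ops/harbor`. Empty output means every listed component is reported as up.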

If a component is down, you can try to restart it or start it again with docker-compose restart and docker-compose up -d.
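When only one component is failing, restarting just that component may be enough. The sketch below is illustrative only: `restart_service` and the `COMPOSE` variable are hypothetical names, and the service name you pass is an assumption, because Compose service names can differ from the container names shown by `docker-compose ps`. Running `docker-compose config --services` lists the service names defined in the compose file.

```shell
# Command used to talk to Compose; override it (for example COMPOSE=echo)
# to see what would be run without touching any container.
COMPOSE="${COMPOSE:-docker-compose}"

# Hypothetical helper: restart one Compose service, then show its state.
# $1: Compose service name, as listed by `docker-compose config --services`.
restart_service() {
  $COMPOSE restart "$1" && $COMPOSE ps "$1"
}
```

For example, `restart_service core` would restart a service named `core`, if the compose file defines one, and then print its status line. Run it from `/srv/ops/harbor` as root, like the commands above.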

You can also check the logs of each component with docker logs followed by the name of the component, for example docker logs harbor-portal.
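To look at several components at once without scrolling through their full history, the log output can be limited by time and line count. This is a minimal sketch: `recent_logs` and the `DOCKER` variable are hypothetical names, and the limit of 50 lines is an arbitrary choice. It relies only on the `--since` and `--tail` options of `docker logs`.

```shell
# Command used to talk to Docker; override it (for example DOCKER=echo)
# to see what would be run.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: print the most recent log lines of several containers.
# $1: how far back to look (for example 30m); remaining args: container names.
recent_logs() {
  since="$1"
  shift
  for name in "$@"; do
    echo "=== $name ==="
    # docker logs replays the container's stderr on stderr, so merge it.
    $DOCKER logs --since "$since" --tail 50 "$name" 2>&1
  done
}
```

For example, `recent_logs 30m harbor-core registry` prints up to 50 lines from the last 30 minutes for each of the two containers.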

Common issues

Add new issues here when you encounter them!

Issue 1

...

Related information

Old incidents

  • T354714 - Trove DB filled disk and caused toolforge-build to fail as a result