Phabricator
Phabricator is an open-source software development platform. In Wikimedia, Phabricator is used for project management, software bug reporting, and feature requests. See mw:Phabricator for more details on end user usage.
phabricator.wikimedia.org runs on phab1004 in eqiad.
The Phabricator install relies on db1183 (m3 eqiad master), with several replicas. Database access is routed through dbproxy1003, a.k.a. m3-master.
A disaster recovery plan for phabricator.wikimedia.org is at Phabricator/Disaster Recovery.
Metrics are on https://grafana.wikimedia.org/d/000000587/phabricator.
Since 2023-08-23, we actually use the Phorge fork of Phabricator (T333885), but we have not (yet?) started to update references to the old software name.
Operations Projects Workflows
The operations-specific projects on Phabricator[1] include:
Project | Description |
---|---|
SRE | General SRE Team Project |
Labs | Labs Team Project |
DC-Ops | Data center Team Project |
domains | Domain support/changing/issues |
hardware requests | Server Allocation Requests |
procurement | Vendor & Procurement Tasks. Direct ordering of SSL certificates. |
network | Network Requests |
Ops Access Requests | Access requests to any Operations systems |
ops-codfw | Onsite queue for codfw |
ops-eqdfw | Onsite queue for eqdfw |
ops-eqiad | Onsite queue for eqiad |
ops-eqord | Onsite queue for eqord |
ops-esams | Onsite queue for esams |
ops-ulsfo | Onsite queue for ulsfo |
DBA | Database administration requests |
Operations Software Development | Software development projects |
Hardware Request Stage
- User requests hardware via Operations_requests#Hardware_Requests
- Rob reviews the hardware-requests project for all tasks that are assigned to him, or unassigned.
- Tasks assigned to others are not reviewed as often, as they are awaiting input from the assignee. If they are left neglected by the assignee long term, they will likely be rejected, or have the hardware-requests project removed from the task.
- If the system specification meets an on-site spare, system allocation may proceed.
- This allocation step is typically processed by Rob and approved by Mark. (It involves a general overview of the roadmap and system procurement planning.)
- If the system specifications require an order of hardware, the following occurs:
- An RT procurement queue ticket is created for each set of vendor quotes.
- Example: A caching system at this time could be Dell or HP, so we create two RT tickets, one for each vendor to provide quotes for the system specification in question.
- Quotes are generated and reviewed by Rob, Mark, and the requestors for the hardware.
- Quotes are approved for purchase by Mark/Damon/Lila (escalation dependent on overall cost) and are typically placed by Rob (for US ordering) or Mark (for EU ordering).
- The hardware-requests task will have the system details noted (hostname/asset tag) and the task will be linked to the system setup task.
- These are kept separate for easy future search history on hardware allocations; thus it's nice to leave a task with the hardware-request in said project.
Hardware/Server Setup / Deployment Stage Workflow
- A new phabricator task is created in the operations project.
- This task is the primary tracking task for the setup and deployment of the server.
- Task should include the following (base template):
- System Deployment Steps:
[] - mgmt dns entries created/updated (both asset tag & hostname) [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
[] - system bios and mgmt setup and tested [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
[] - network switch setup (port description & vlan) [link sub-task for network configuration here, sub-task should include the network project]
[] - production dns entries created/updated (just hostname, no asset tag entry) [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
[] - install-server module updated (dhcp and netboot/partitioning) [done via this task when on-site subtasks complete]
[] - install OS (note jessie or trusty) [done via this task when network sub-task(s) complete]
[] - service implementation [done via this task post puppet acceptance]
- The main task is basically for all the software setup, and the sub-tasks are for the specific on-site or networking tasks.
- Many times, the network task isn't created, as the person doing the software work can also do the network configuration.
Misc. Production Virtual Machine Requests Workflow
- User (usually an SRE, unless you really know what you are doing) requests a virtual machine via Operations_requests#Virtual_Machine_Requests_.28Production.29
- SRE reviews all tasks tagged with the vm-requests project for obvious mistakes.
- Tasks assigned to others are not reviewed as often, as they are awaiting input from the assignee. If they are left neglected by the assignee long term, they will likely be rejected, or have the vm-requests project removed from the task.
- If you are the SRE reviewing the request, or are evaluating your own request, the docs on what to look for are at: Ganeti#Verify_cluster_resource_availability
- If the system specifications meet all requirements for approval/allocation of a production virtual machine, the machine can be created. The creation should be undertaken by the SRE that filed the request to increase familiarity with the platform.
Administrative Commands
- All Phabricator documentation refers to scripts in the phabricator bin directory. On our setup, that is:
/srv/phab/phabricator/bin/
Dump the entire database
Write the entire contents of Phabricator's databases to disk, compressed. Note that /srv/dumps is not the right path to use because it is synced to public; substitute a non-public directory for it in the command below.
cd /srv/phab/phabricator
sudo ./bin/storage dump --output /srv/dumps/phabricator_db_$(date +%Y%m%d%H%M%S).sql.gz --compress
Remove a repo
First you need the repo's callsign. This is an all-uppercase identifier that, prefixed with 'r', is used in URLs and elsewhere in Phabricator to refer to the repo. For example, Puppet's is OPUP. SSH to phab100N, then:
cd /srv/phab/phabricator
sudo ./bin/remove destroy rFOO
Remove a file
First you need the file's ID prefixed with 'F'. SSH to phab100N, then:
cd /srv/phab/phabricator
sudo ./bin/remove destroy Fxxxxxxxx
Ban a user
Members of the #acl*userdisable Phabricator project can ban a user via https://phab-ban.toolforge.org/
Delete a user
This is not recommended if the account has already been active! Deleting a user can be needed when a user entered a wrong email address in the registration form and now cannot verify their address to finish account creation. First SSH to phab100N. Then:
cd /srv/phab/phabricator
sudo ./bin/remove destroy @AccountNameOfThatUser
Removing Two Factor Authentication
- Please note that removal of 2FA is a serious request, and all too easily socially engineered. All requests of this nature should be treated with the same degree of security and confirmation as ssh key changes. The user guidelines require one month between the paste of the user committed identity hash on the wiki user page and the reset request, or verification via a video call.
- When copying the text phrase from a Phabricator Paste, make sure to use View Raw File and save the file, to avoid issues with line breaks via copy&paste. (Potentially also check with a hex editor that no additional byte such as 0x0A has been added.) Afterwards, run cat file | sha512sum (or whatever algorithm was used, e.g. could also be sha3sum -a 512 or such).
- Once confirmed, the actual command is quite simple, run on the phabricator host:
sudo /srv/phab/phabricator/bin/auth strip --all-types --user <username>
- You will be prompted with a yes or no to remove the multi-authentication types on the user.
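The hash check described above can also be sketched in Python. This is a minimal illustration, not part of the documented procedure: the function names are hypothetical, and it assumes the committed hash is a hex digest (SHA-512 by default, with `sha3_512` as the assumed counterpart of `sha3sum -a 512`). Hashing the raw bytes of the saved file makes a stray trailing newline show up as a mismatch.

```python
import hashlib


def digest_of_bytes(data: bytes, algorithm: str = "sha512") -> str:
    """Return the hex digest of the raw bytes, exactly as they were saved."""
    return hashlib.new(algorithm, data).hexdigest()


def matches_committed_hash(data: bytes, committed_hex: str,
                           algorithm: str = "sha512") -> bool:
    """Compare the digest of the saved phrase with the hash on the wiki user page."""
    return digest_of_bytes(data, algorithm) == committed_hex.strip().lower()


def ends_with_newline(data: bytes) -> bool:
    """Flag a trailing 0x0A byte, which changes the digest if it was not in the original."""
    return data.endswith(b"\n")
```

To use it, read the file saved via View Raw File in binary mode (for example `open("phrase.txt", "rb").read()`, where the file name is a placeholder) and pass the bytes together with the hash from the user page.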
Revoking a Conduit token
Users can do this themselves with the big red "Terminate Tokens" button in Settings > Conduit API Tokens. If it needs to be forced for some reason, you can do it from a phabricator server:
ssh phab1004.eqiad.wmnet
sudo /srv/phab/phabricator/bin/auth revoke --type conduit --from @<username>
Revoking a user's sessions
This invalidates any active sessions and forces the user to log in again.
ssh phab1004
sudo /srv/phab/phabricator/bin/auth revoke --type session --from @<username>
Revoking a user's ssh keys
This invalidates any authorized ssh keys that the user has configured in phabricator.
ssh phab1004
sudo /srv/phab/phabricator/bin/auth revoke --type ssh --from @<username>
Rebuild phabricator search index
Warning: This takes a really long time, probably more than 8 hours. The service stays online during the reindex, but search quality will be degraded.
ssh phab1004
sudo /srv/phab/phabricator/bin/search init
sudo /srv/phab/phabricator/bin/search index --all --force --background
Revert all activity of a given user
Caution: This removes most of the user's activity from Phabricator and it is a destructive operation. This should only be done when cleaning up vandalism from an account which has no legitimate activity. If the account had real contributions prior to being compromised, then another solution is needed to avoid deleting the legitimate contributions along with the spam.
This procedure will attempt to undo all edits made by a given user. If you add the --delete argument it will also remove all traces of the corresponding transactions from the phabricator activity log. This should be successful in all cases, with one limitation: any field which has been edited by someone after the vandal's edit will be treated as an edit conflict, and the field will be left alone to avoid potentially overwriting useful edits by other users.
How it works: The rollback script simply replays the edit transactions in reverse, from newest to oldest. Each transaction in Phabricator stores the field name, the old value and the new value. To revert a user's activity, the script does the following:
- For each task edited by the vandal user:
- For each transaction made by the vandal user (newest to oldest):
- If the transaction's "new" value matches the field's current value, then the transaction's "old" value is applied to the field.
- After all transactions have been replayed, if any field was changed then the record is saved back to the database.
- Finally, if --delete was also specified, then all the replayed transactions are also deleted to clean up the history of activity.
ssh phab1004
sudo /srv/phab/libext/misc/bin/rollback execute --delete --user <username>
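The reverse replay described above can be illustrated with a small in-memory model. This is only a sketch of the logic, assuming a task is a plain dict of fields and a transaction is a (field, old, new) record; the `Transaction` class and `rollback_task` function are hypothetical names and do not reflect the real script's data model. Saving to the database and the --delete step are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    field: str   # name of the edited field
    old: object  # value before the edit
    new: object  # value after the edit


def rollback_task(task: dict, transactions: list) -> tuple:
    """Replay one user's transactions on a task in reverse order.

    `transactions` is ordered oldest to newest. Returns (changed, conflicts),
    where `conflicts` lists transactions skipped because the field no longer
    holds the value the transaction wrote.
    """
    changed = False
    conflicts = []
    for txn in reversed(transactions):  # newest to oldest
        if task.get(txn.field) == txn.new:
            # The field still holds the vandal's value: restore the old one.
            task[txn.field] = txn.old
            changed = True
        else:
            # Someone else edited the field afterwards: leave it alone.
            conflicts.append(txn)
    return changed, conflicts
```

In this model, a field that another user changed after the vandal's edit ends up in `conflicts` and keeps its current value, which mirrors the edit-conflict limitation mentioned earlier.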
Converting a parent project into a subproject
There is no such script anymore as it led to database corruption; see phab:T342275. Thus this is manual work now.
Run a bulk job silently (suppressing notification spam)
First set up a bulk job in phabricator's GUI, then get the bulk job id and run the make-silent command below, specifying your bulk job id. Finally, start the job in the GUI and it will run without sending notifications.
ssh phab1004
sudo /srv/phab/phabricator/bin/bulk make-silent --id <bulkid>
See also mw:Phabricator/Help#Batch edits for more information and guidance.
read-only mode / restarting mariadb
To put phabricator into read-only mode, which allows it to continue serving requests during a master database restart, do the following on the active phabricator server:
ssh phab1004
sudo /srv/phab/phabricator/bin/config set cluster.read-only true
# restart database server
sudo /srv/phab/phabricator/bin/config set cluster.read-only false
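One way to avoid forgetting the final step is to wrap the maintenance in a helper that always switches read-only mode back off. The following is a minimal sketch meant to run on the Phabricator host itself; the function names are hypothetical, and the `maintenance` callable stands in for whatever restarts the database server.

```python
import subprocess

CONFIG_BIN = "/srv/phab/phabricator/bin/config"  # path as given on this page


def set_read_only(enabled: bool, run=subprocess.run) -> None:
    """Run `bin/config set cluster.read-only true|false` via sudo."""
    value = "true" if enabled else "false"
    run(["sudo", CONFIG_BIN, "set", "cluster.read-only", value], check=True)


def with_read_only(maintenance, run=subprocess.run) -> None:
    """Enable read-only mode, run the maintenance step, then disable it again."""
    set_read_only(True, run)
    try:
        maintenance()  # hypothetical callable, e.g. wait for the DB restart
    finally:
        # Runs even if the maintenance step raises an exception.
        set_read_only(False, run)
```

Whether read-only mode should really be lifted when the maintenance step fails is a judgment call; the `finally` clause here simply guarantees that the host is not left read-only by accident.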
Disabling a Herald rule
Herald rules can be disabled via:
ssh phab1004
sudo /srv/phab/phabricator/bin/herald rule --disable --rule <rulenumber>
Check on a Phabricator user
To check whether a Phabricator user is who they say they are, there is a script that gets their email address, and whether it is verified, from the SQL database:
chk_phuser <Phabricator username>
For example:
ssh phab1004
sudo chk_phuser Dzahn
Unlocking edit permissions on a task
ssh phab1004
sudo /srv/phab/phabricator/bin/policy unlock --edit YourPhabUserName T12345678
Unlocking edit permissions on random objects
First get the internal PHID of the object to unlock, for example via Conduit by passing {"ids":[12345678]} as constraints.
ssh phab1004
sudo /srv/phab/phabricator/bin/policy unlock --edit YourPhabUserName PHIDofObject
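As a rough sketch of the Conduit lookup, the following builds the form-encoded body for a `*.search` call and pulls the PHIDs out of a decoded JSON response. It assumes the usual Conduit conventions (an `api.token` parameter, PHP-style `constraints[ids][N]` keys, and results under `result.data`); the function names are hypothetical, and which search method to call (for instance `maniphest.search` for tasks) depends on the object type.

```python
from urllib.parse import urlencode


def conduit_search_body(api_token: str, ids: list) -> str:
    """Build a form-encoded body carrying {"ids": [...]} as constraints."""
    params = [("api.token", api_token)]
    params += [(f"constraints[ids][{i}]", str(v)) for i, v in enumerate(ids)]
    return urlencode(params)


def phids_from_result(response: dict) -> dict:
    """Map object id to PHID from a decoded Conduit search response."""
    return {item["id"]: item["phid"] for item in response["result"]["data"]}
```

The body would be sent as a POST request to the API endpoint of the chosen search method, and the resulting PHID is what the unlock command takes as its last argument.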
Mail debugging
See Phabricator/Mail debugging.
Ban an IP address
See Phabricator/Ban IP address
Rate Limiting
Access to Phabricator is restricted by rate limiting rules in requestctl. This rate limiting was enabled in May 2024 due to a high level of scraping and abusive traffic (see T362401). Users affected by the rate limiting will temporarily see an "HTTP 429 - too many requests" response.
Normal traffic from legitimate users shouldn't be affected in most cases. To avoid triggering the rate limit, the following can help:
- Keep the number of requests per second low, especially when using the API, scripts, or curl.
- Use a unique, non-shared IP address (avoid cloud networks, VPNs or proxies).
- Set a proper, application-specific User-Agent header.
- Try again later once the limit resets.
SREs can adjust the rate limiting settings in private-puppet/requestctl. The relevant configuration for Phabricator can be found under cache-text/phabricator*.
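For script authors, the client-side advice above might look like the following sketch: it sets an application-specific User-Agent and backs off when it receives a 429. The User-Agent string, function names, and delay values are all hypothetical, and whether the response carries a `Retry-After` header is an assumption; if it does not, the sketch falls back to exponential backoff.

```python
import time
import urllib.error
import urllib.request

# Hypothetical identifier: replace with your tool's name and contact details.
USER_AGENT = "example-phab-report/0.1 (you@example.org)"


def backoff_delay(attempt, retry_after=None, base_delay=2.0):
    """Use Retry-After (in seconds) if present, else back off exponentially."""
    if retry_after is not None and retry_after.isdigit():
        return float(retry_after)
    return base_delay * (2 ** attempt)


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_attempts=5):
    """Fetch a URL, waiting and retrying when the server answers HTTP 429."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for attempt in range(max_attempts):
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on other errors, or once the attempts are used up.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            header = err.headers.get("Retry-After") if err.headers else None
            sleep(backoff_delay(attempt, header))
```

The `opener` and `sleep` parameters exist only so the retry logic can be exercised without network access or real waiting.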
Network Architecture
Phabricator is currently hosted on phab1004.eqiad.wmnet / phab2002.codfw.wmnet.
The full path of traffic from the public internet through to the database is as follows:
cache_text esams -> cache_text codfw -> cache_text eqiad -> phab1004 -> dbproxy1003 -> db1183
Fixing Common Problems
PhutilMissingSymbolException
Some Phabricator applications may throw exceptions like Failed to load class or interface "Phabricator*". This can sometimes be resolved by running arc liberate inside of /srv/phab/phabricator, which will update the library map as in this commit.
Phabricator is intermittently down or slow
Check the Phabricator Dashboard on Grafana.
Check the logs in /var/log/apache2/phabricator_error.log (or in Logstash application logs and Logstash apache logs for a more readable format).
Check the host in Icinga for more failed checks (e.g. PHD should be running).
Check the status of the phd process (sudo systemctl status phd).
Do not run the aphlict server using websockets proxied through the same Apache that also serves the main Phabricator.
See Phabricator/Slowness for more info.
Pybal alerts for git-ssh after rebooting Phabricator servers
If a Phabricator server had to be rebooted for any reason you might get Icinga pybal alerts. Pybal will alert that backends for the git-ssh.wikimedia.org service are down but pooled. Example:
<+icinga-wm> PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled
The reason for this is a race condition where the ssh-phab service is started before the additional IPv6 IP is added to the interface. The service WILL be running so it might not be obvious why it's considered down. This is because it will be listening only on IPv4 while pybal / lvs servers are going to try and use IPv6.
The server has multiple IPs, 2 on loopback for LVS and 4 on en01.
The fix in this case is to manually restart the service with
systemctl restart ssh-phab
You'll have to wait a few minutes and then the pybal Icinga alerts will recover.
Failure Scenarios / Failover
Simple failure of the phabricator server
A simple failure of the phabricator server, e.g. a disk failure or other hardware failure on phab1001.
Take a look at a previous fail-over ticket at T238956.
Code changes needed for the actual fail-over can be seen at the topic branch phab-buster. Decommissioning of the previous server can be seen at the topic branch phab1003-decom.
Additionally the etherpad Phabricator-migration-20191203 was used.
Steps to fail-over an existing Phabricator server to a new server
If there are 2 existing servers, just follow the steps. If the existing prod server died, assume "old_server" means the warm standby in the other data center. If the standby server died see the section below.
- install a new server and add the role::phabricator puppet class on it, run puppet agent
- rsync /srv/repos from old_server to new_server, run it with --delete as well and ensure both sides have the same size. (rsyncd / ferm rules for this are already puppetized on all servers)
- verify code in /srv/phab is up to date and both servers are on the same git tag (if not use scap to deploy to new server / run 'scap pull' on it)
- switch the "phabricator dumps host" to the new server (code change)
- (optional) put phab on new_server in maintenance mode (phab admin action)
- set downtimes for both servers in Icinga
- change the "phabricator_server" setting to the new server name (code change)
- (changing the "active server" setting is not needed anymore, setup has been simplified)
- switch the discovery record in DNS to the new server (code change). The TTL is 300 seconds by default for all discovery records. It does not need to be changed, but be aware there might be a 5 minute window where clients could get the old server.
- switch the config for varnish to the new server (code change)
- switch the mail destination on mx to the new server (code change)
- using systemctl, restart the "ssh-phab" service on the new server to make it listen on IPv6
- using conftool, depool the "vcs" service on the old server, change conftool data to use the new server (code change) and pool it
- (if reimage script failed in the past and you have ongoing Icinga alerts about pybal and the vcs server): delete stale confd files on puppetmaster to clear Icinga alerts about confd template compilation failing
- make the "phd" service run on the new server to avoid breakage of repos (code change)
- verify things work and remove Icinga downtimes
- (a few days later) decom the old server following the usual decom steps and as outlined in the phab1003-decom branch linked above
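For the rsync step above, one simple way to compare the two sides is to total the file sizes under /srv/repos on each host and compare the numbers. The helper below is a hypothetical sketch that would be run separately on the old and the new server; it counts regular file contents only and skips symlinks, so it is a sanity check and not a substitute for rsync's own verification.

```python
import os


def tree_size(root: str) -> int:
    """Return the total size in bytes of all regular files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # do not follow or count symlinks
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # Path taken from the steps above; run on both servers and compare.
    print(tree_size("/srv/repos"))
```

Equal totals do not prove the trees are identical, but a mismatch is a quick signal that the rsync with --delete should be run again.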
Steps to re-create a warm standby server
If the non-active server died and you want to re-create it under a new host name:
- install a new server and add the role::phabricator puppet class on it, run puppet agent
- rsync /srv/repos from the prod server to the new_server, run it with --delete as well and ensure both sides have the same size. (rsyncd / ferm rules for this are already puppetized on all servers)
- verify code in /srv/phab is up to date and both servers are on the same git tag (if not use scap to deploy to new server / run 'scap pull' on it)
- Add the new host name to the list of "phabricator_servers" in Hiera in hieradata/role/common/phabricator.yaml.
- using systemctl, restart the "ssh-phab" service on the new server to make it listen on IPv6
- using conftool, depool the "vcs" service on the old server, change conftool data to use the new server (code change) and pool it
- You do NOT have to worry about the phd service running, it's only needed on the active server.
Complete data center failover
Complete data center failover, e.g. some major event takes down eqiad and we need to fail over to codfw.
How to make codfw master writable
root@cumin1001:~# mysql --skip-ssl -hm3-master.codfw.wmnet
Master database failure
The master database fails and we need to fail over to a slave, promoting that slave to become the new master.
If the master goes down, the proxy would automatically fail over to the existing slave (which is read-only); an admin would then need to set read_only=OFF on it.
- T190572 - Prepare a disaster recovery plan for failing over Phabricator
- Phabricator/Disaster Recovery