Phabricator/Meeting Notes/2019-01-23

From Wikitech

Phabricator Upgrade Planning Meeting - 2019/01/23

attendees: Daniel, Mukunda, thcipriani

TODO: dump these notes on wikitech and ping attendees

Raw meeting notes

  • Daniel: phab1001 is on jessie (needs to be stretch), want to get off mod_php, phab1002 (former image scaler) cf line 10; moving to phab1002 changes OS, webserver setup, and physical server
  • Mukunda: and phab1002 has 1/2 the ram as well -- which leaves less room for cache and might hurt performance
  • Daniel: Hardware first and then software, but that doesn't seem to gain much, plus we need to give back phab1001; the way puppet is set up, it seems like all the software is bundled together (i.e., the OS and webserver setup change together [if on stretch, don't use mod_php, etc.])
  • Mukunda: 15 GB free on phab1001, 50 GB used, 26 GB cache; 32 GB doesn't seem like it's enough
  • Daniel: we may need to loop in DCOps (robh or chris) to get a 64GB machine, although they probably don't have these servers just lying around
  • Daniel: if we got a 64GB replacement (phab1003) on stretch: how do we test that everything is working properly on the new server
  • QUESTION: is the plan to get a 64GB phab1003?
  • TODO: broach the topic of getting a 64GB replacement -- Daniel to ping Rob

  • Mukunda: how do we route our personal machines to the new servers for testing?
  • Daniel: /etc/hosts hacking and using ssh -D to route through a prod host, with a SOCKS5 proxy configured in the browser (see the sketch after this list)
  • Mukunda: what about a custom header?
  • TODO: ask Chase about a custom header to route to the new phab -- Mukunda to ping Chase
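
A minimal sketch of the /etc/hosts + ssh -D approach, assuming you test from a laptop; the IP and bastion hostname below are placeholders, not the real values:

    # point phabricator.wikimedia.org at the candidate host (placeholder IP)
    echo '10.64.0.100  phabricator.wikimedia.org' | sudo tee -a /etc/hosts

    # open a SOCKS5 proxy through a production host (placeholder hostname)
    ssh -N -D 8080 bast1002.wikimedia.org

    # then configure the browser to use the SOCKS5 proxy at localhost:8080,
    # with DNS resolved locally so the /etc/hosts override takes effect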

  • Daniel: routing is much stricter now, but maybe we can open a firewall port and use apache-fast-test on deploy1001 to test against phab100{1,2,3}
  • Mukunda: that would be good enough to test that phabricator works
  • Daniel: you should be able to open it in your browser by modifying /etc/hosts and using ssh port forwarding with a SOCKS5 proxy
  • Mukunda: headers would be preferable
  • Daniel: we may need to involve the traffic team and varnish for that, and old methods may not work
  • RESOLVED: should use apache-fast-test on the deployment host (example below)
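
A rough sketch of the apache-fast-test idea, assuming the usual urls-file-plus-host-list invocation and that the firewall port discussed above is open; the URL paths and backend hostnames are examples:

    # on deploy1001: a few URLs to spot-check, one per line
    printf '%s\n' \
      https://phabricator.wikimedia.org/ \
      https://phabricator.wikimedia.org/maniphest/ > phab-urls.txt

    # fetch each URL from each candidate backend and compare the responses
    apache-fast-test phab-urls.txt phab1001.eqiad.wmnet phab1002.eqiad.wmnet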

  • Mukunda: a scap deployment check could run automatic tests using apache-fast-test, although we're not using that in production yet and it needs a bit more finagling
  • Daniel: we need a fallback, correct? We can't cut over to stretch without a fallback?
  • Mukunda: correct, we will need a fallback; I would be less nervous if we were only upgrading to stretch. We needed php7.1 at some point for phab
  • Daniel: we could be on php7.2
  • Mukunda: I think that would be fine
  • Daniel: from the sury repo
  • Mukunda: phab supports 7.1 or newer from 2017 onwards
  • Daniel: we can do either 7.1 or 7.2
  • Mukunda: let's do 7.2
  • RESOLVED: php7.2 for phab (install sketch below)
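
A rough sketch of pulling php7.2 from the sury repo onto stretch; the exact extension list phab needs isn't covered here, and in production the packages would more likely be mirrored into an internal apt repo rather than fetched directly:

    # add the sury PHP repo and its signing key
    curl -sSLo /etc/apt/trusted.gpg.d/php-sury.gpg https://packages.sury.org/php/apt.gpg
    echo 'deb https://packages.sury.org/php/ stretch main' > /etc/apt/sources.list.d/php-sury.list
    apt-get update

    # php-fpm instead of mod_php, per the discussion above
    apt-get install php7.2-fpm php7.2-cli php7.2-mysql php7.2-curl php7.2-mbstring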

  • Daniel: this is within eqiad, but we haven't talked at all about multi-datacenter or HA
  • Mukunda: had a conversation with bblack about this; we could switch over within a short window of time
  • Daniel: I worry about how to switch over: there is no DB in the failover datacenter, so it cannot run PHD; the misc cluster isn't provided there yet, so we're blocked on DBA time
  • Mukunda: we have a database dump backup, but we would need a slave or something to fail over to
  • Daniel: we should follow up on the lack of a database. Last week, when the DB went down and affected gerrit and phab, Faidon brought up ensuring HA for gerrit and phab
  • Mukunda: agreed, it's important to have HA for those services, but not necessarily easy to achieve
  • Tyler: HA for gerrit needs investigation into the H2 db as well as looking at the replacement for the HA plugin that keeps indexes/caches in sync
  • Daniel: IIRC there was some syncing done by the replication service in gerrit to keep a slave up-to-date
  • TODO: investigate gerrit HA plugin replacement since reviewdb is no longer a thing -- thcipriani

  • Mukunda: looks like there are database slaves in codfw for m3; I think the proxy does not allow us to connect to these
  • Daniel: it may be that the proxy is just not set up yet
  • Mukunda: that is the emergency response: set up a proxy in a panic; we have some assurance we could bring up phab if we had to
  • Daniel: warm standby has been our strategy for gerrit and phab
  • Mukunda: since it's untested and not configured, it's not quite a warm standby
  • Daniel: should we ping DBA folks about the proxy?
  • Mukunda: getting an ETA for that would be good
  • TODO: ping DBA at all-hands in person

Collection of related tickets (and notes)

  • "To avoid downtime for Phab users and avoid risk that something goes wrong with the stretch upgrade, let's instead bring up phab1002 with stretch, test, switch over and then upgrade phab1001 and go back." as requested by Moritz:  https://phabricator.wikimedia.org/T190568#4230710
    • Do it
  • "reimage both phab1001 and phab2001 to stretch" https://phabricator.wikimedia.org/T190568
    • Do it
    • Mukunda: Could do phab2001 as a way to test stretch
    • Daniel: if I upgrade phab2001 to stretch today, would we be in a situation where we have to fail over to an upgraded server?
    • Mukunda: I don't think there's anything inherent to phabricator that we couldn't fix in that situation
    • RESOLVED: upgrade phab2001 anytime
  • T137928 Deploy phabricator to phab2001.codfw.wmnet https://phabricator.wikimedia.org/T137928
    • We don't have a database proxy in Dallas; ask DBAs at all-hands about what it'll take to get the m3 cluster working in codfw
  • Switch phab production to codfw https://phabricator.wikimedia.org/T164810
    • thcipriani: is that task still valid?
    • Daniel: is this still the plan as part of the next switchover test?
    • Mukunda: yes, that's my goal, although you may want phab for the next big datacenter switch
    • Daniel: that is probably worth it as a test to ensure we can failover
    • Daniel: Jaime's comments on this task confirm our thinking about m3 https://phabricator.wikimedia.org/T164810#4078510