Phabricator/Meeting Notes/2019-01-23

From Wikitech

Phabricator Upgrade Planning Meeting - 2019/01/23

attendees: Daniel, Mukunda, thcipriani

TODO: dump these notes on wikitech and ping attendees

Raw meeting notes

  • Daniel: phab1001 is on jessie (needs to be stretch), want to get off mod_php, phab1002 (former image scaler) cf line 10; moving to phab1002 changes OS, webserver setup, and physical server
  • Mukunda: and phab1002 has 1/2 the ram as well -- which leaves less room for cache and might hurt performance
  • Daniel: Hardware first and then software, but that doesn't seem to gain much, plus we need to give back phab1001; the way puppet is set up, it seems like all the software is bundled together (i.e., the OS and webserver setup change together [if on stretch, don't use mod_php, etc.])
  • Mukunda: 15 GB free on phab1001, 50 GB used, 26 GB cache; 32 GB doesn't seem like it's enough
  • Daniel: we may need to loop in DCOps (robh or chris) to get a 64GB machine, although they probably don't have these servers just lying around
  • Daniel: if we got a 64GB replacement (phab1003) on stretch: how do we test that everything is working properly on the new server
  • QUESTION: is the plan to get a 64GB phab1003?
  • TODO: broach the topic of getting a 64GB replacement -- Daniel to ping Rob

  • Mukunda: how do we route our personal machines to the new servers for testing?
  • Daniel: /etc/hosts hacking and using ssh -D to route through a prod host, with a SOCKS5 proxy configured in the browser (see the sketch after this list)
  • Mukunda: what about a custom header?
  • TODO: ask Chase about a custom header to route to the new phab -- Mukunda to ping Chase
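
A minimal sketch of the /etc/hosts + ssh -D approach, assuming you test from a laptop; the IP and bastion hostname below are placeholders, not the real values:

    # point phabricator.wikimedia.org at the candidate host (placeholder IP)
    echo '10.64.0.100  phabricator.wikimedia.org' | sudo tee -a /etc/hosts

    # open a SOCKS5 proxy through a production host (placeholder hostname)
    ssh -N -D 8080 bast1002.wikimedia.org

    # then configure the browser to use the SOCKS5 proxy at localhost:8080,
    # with DNS resolved locally so the /etc/hosts override takes effect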

  • Daniel: routing is much stricter now, but maybe we can open a firewall port and use apache-fast-test on deploy1001 to test against phab100{1,2,3}
  • Mukunda: that would be good enough to test that phabricator works
  • Daniel: you should be able to open it in your browser by modifying /etc/hosts and using ssh port forwarding with a SOCKS5 proxy
  • Mukunda: headers would be preferable
  • Daniel: we may need to involve the traffic team and varnish for that, and old methods may not work
  • RESOLVED: should use apache-fast-test on the deployment host (example below)
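
A rough sketch of the apache-fast-test idea, assuming the usual urls-file-plus-host-list invocation and that the firewall port discussed above is open; the URL paths and backend hostnames are examples:

    # on deploy1001: a few URLs to spot-check, one per line
    printf '%s\n' \
      https://phabricator.wikimedia.org/ \
      https://phabricator.wikimedia.org/maniphest/ > phab-urls.txt

    # fetch each URL from each candidate backend and compare the responses
    apache-fast-test phab-urls.txt phab1001.eqiad.wmnet phab1002.eqiad.wmnet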

  • Mukunda: a scap deployment check could run automatic tests using apache-fast-test, although we're not using that in production yet and it needs a bit more finagling
  • Daniel: we need a fallback, correct? We can't cut over to stretch without a fallback?
  • Mukunda: correct, we will need a fallback; I would be less nervous if we were only upgrading to stretch. We needed php7.1 at some point for phab
  • Daniel: we could be on php7.2
  • Mukunda: I think that would be fine
  • Daniel: from the sury repo
  • Mukunda: phab supports 7.1 or newer from 2017 onwards
  • Daniel: we can do either 7.1 or 7.2
  • Mukunda: let's do 7.2
  • RESOLVED: php7.2 for phab (install sketch below)
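
A rough sketch of pulling php7.2 from the sury repo onto stretch; the exact extension list phab needs isn't covered here, and in production the packages would more likely be mirrored into an internal apt repo rather than fetched directly:

    # add the sury PHP repo and its signing key
    curl -sSLo /etc/apt/trusted.gpg.d/php-sury.gpg https://packages.sury.org/php/apt.gpg
    echo 'deb https://packages.sury.org/php/ stretch main' > /etc/apt/sources.list.d/php-sury.list
    apt-get update

    # php-fpm instead of mod_php, per the discussion above
    apt-get install php7.2-fpm php7.2-cli php7.2-mysql php7.2-curl php7.2-mbstring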

  • Daniel: this is within eqiad, but we haven't talked at all about multi-datacenter or HA
  • Mukunda: had a conversation with bblack about this; we could switch over within a short window of time
  • Daniel: I worry about how to switch over: there is no DB in the failover datacenter, so it cannot run PHD; the misc cluster isn't provided there yet, so we're blocked on DBA time
  • Mukunda: we have a database dump backup, but we would need a slave or something to fail over to
  • Daniel: we should follow up on the lack of a database. Last week, when the DB went down and affected gerrit and phab, Faidon brought up ensuring HA for gerrit and phab
  • Mukunda: agreed, it's important to have HA for those services, but not necessarily easy to achieve
  • Tyler: HA for gerrit needs investigation into the H2 db as well as looking at the replacement for the HA plugin that keeps indexes/caches in sync
  • Daniel: IIRC there was some syncing done by the replication service in gerrit to keep a slave up-to-date
  • TODO: investigate gerrit HA plugin replacement since reviewdb is no longer a thing -- thcipriani

  • Mukunda: looks like there are database slaves in codfw for m3; I think the proxy does not allow us to connect to these
  • Daniel: it may be that the proxy is just not set up yet
  • Mukunda: that is the emergency response: set up a proxy in a panic; we have some assurance we could bring up phab if we had to
  • Daniel: warm standby has been our strategy for gerrit and phab
  • Mukunda: since it's untested and not configured, it's not quite a warm standby
  • Daniel: should we ping DBA folks about the proxy?
  • Mukunda: getting an ETA for that would be good
  • TODO: ping DBA at all-hands in person

Collection of related tickets (and notes)

  • "To avoid downtime for Phab users and avoid risk that something goes wrong with the stretch upgrade, let's instead bring up phab1002 with stretch, test, switch over and then upgrade phab1001 and go back." as requested by Moritz:  https://phabricator.wikimedia.org/T190568#4230710
    • Do it
  • "reimage both phab1001 and phab2001 to stretch" https://phabricator.wikimedia.org/T190568
    • Do it
    • Mukunda: Could do phab2001 as a way to test stretch
    • Daniel: if I upgrade phab2001 to stretch today, would we be in a situation where we have to fail over to an upgraded server?
    • Mukunda: I don't think there's anything inherent to phabricator that we couldn't fix in that situation
    • RESOLVED: upgrade phab2001 anytime
  • T137928 Deploy phabricator to phab2001.codfw.wmnet https://phabricator.wikimedia.org/T137928
    • We don't have a database proxy in Dallas; ask DBAs at all-hands about what it'll take to get the m3 cluster working in codfw
  • Switch phab production to codfw https://phabricator.wikimedia.org/T164810
    • thcipriani: is that task still valid?
    • Daniel: is this still the plan as part of the next switchover test?
    • Mukunda: yes, that's my goal, although you may want phab for the next big datacenter switch
    • Daniel: that is probably worth it as a test to ensure we can failover
    • Daniel: Jaime's comments on this task confirm our thinking about m3 https://phabricator.wikimedia.org/T164810#4078510