Deployments/Features Process/General Feedback

The Good News

Overall, all teams interviewed were excited to hear we wanted to move to a 1-week deploy cycle. As most teams are already deploying at least weekly, the change won't affect them much in a negative way. It might, in fact, make it easier for them to stay compatible with what is deployed (there will be less drift from when the first wmfXX is deployed to the last).

People were satisfied with the amount of interaction and support Chris and Željko have been providing; we can always do more, and it would be welcomed.

The currently unused "lightning deploy" windows could be used alongside the weekly deploy schedule for deploying small changes. This should reduce the number of one-off deploys, which happen frequently now.

The Bad News

(long term) We have a way to go before continuous integration/testing provides enough useful feedback that developers trust it enough to go directly to production.

We need to get more of the teams better trained on deployments if they are to continue doing their own deployments (likely). Making the various logs that are produced (machine side) during a deploy more human readable, and removing noisy messages that do not indicate anything important, should be a priority. I'm not sure how much of this will be mitigated by the git-deploy migration.
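As a rough illustration of the log-cleanup idea above, here is a minimal sketch of a filter that drops noisy lines from deploy output. The noise patterns, the sample lines, and the function name are all hypothetical; the actual deploy logs are not shown in these notes, so the patterns would need to be replaced with whatever messages teams agree are safe to ignore.

```python
import re

# Hypothetical noise patterns (assumptions for illustration only).
NOISE_PATTERNS = [
    re.compile(r"^\s*$"),        # blank lines
    re.compile(r"\bDEBUG\b"),    # debug-level chatter
]


def filter_deploy_log(lines):
    """Return the lines that match none of the noise patterns."""
    kept = []
    for line in lines:
        if any(pattern.search(line) for pattern in NOISE_PATTERNS):
            continue  # skip lines considered noise
        kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    # Made-up sample input; host names and messages are not from a real deploy.
    sample = [
        "mw1001: sync ok\n",
        "\n",
        "DEBUG rsync chatter\n",
        "mw1002: sync FAILED\n",
    ]
    for line in filter_deploy_log(sample):
        print(line)
```

Keeping the patterns in one list means the decision about what counts as noise stays visible and reviewable, rather than being buried in the deploy tooling.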

We need a lot more communication, documentation, and help for people to get onto and start using betalabs. There is a barrier there that would be "easily" surmountable if we had an active advocate (both for betalabs, and to betalabs for what features are still needed on it).

Mobile has multiple testing sites, and that might be hard to deal with on betalabs in its current form. However, the closer betalabs gets to acting like production, the more likely it is to be useful for Mobile.

Caching is still really opaque to many (including me in some places). This might need to be explained better by someone who understands all of the layers (I mostly still understand the ops-related layers, since they drew a graph for me). In some deploys, problems simply disappear after 10 or so minutes of waiting, apparently because of caching behavior that nobody has explained.

Teams are having differing experiences with a dedicated deploy branch. Mobile finds it useful and not too burdensome, while E2 is having issues with merging back into master.

Next Steps

For specifics on Greg's responsibilities, see the TODO page.

  • Deployment Practices / Increasing Deployment Sanity
    • Move more one-off deployments into the "Lightning Deployments" window
    • git-deploy in use on production for wmf deployments
    • Determine whether the current use of betalabs actually helps developers test their work
      • Is the update interval too fast?
  • Development Practices
    • Determine a process for development and deployments happening at a faster cadence
      • specifically, sketch out use of branches or other methods to increase stability of the deployed code at any given point in time
  • QA
    • Provide more automated ways of doing browser integration testing pre-commit
  • Documentation / Training
    • Create targeted resources for production server overview
    • Clear documentation on distinction between syncd
    • Train devs on how to triage problems during deployments (ssh'ing into various servers as needed)
      • This, ideally, will be removed as NOT NEEDED if the deploy process with git-deploy addresses the common issues