Talk:IRC

From Wikitech

Summary of Change Process

Current Proposal

As of 15 Apr 2019, the SRE team has completed a proposal and is now implementing it (where it is SRE-only) or validating it (where it affects other people).

The SRE team identified problems with IRC channels several years ago. In the January 2019 off-site, many changes were discussed. This was followed up by mailing list discussion, meeting discussion, and wiki talk. This summary is a proposal generally agreed by the SRE team; many details are captured on the rest of the page.

Overview

Channel | Membership | Purpose | Current status | Intention | Logging | Bots | NDA only | Notes
--- | --- | --- | --- | --- | --- | --- | --- | ---
#wikimedia-operations | public | feed of all public bot announcements | existing | No change. | yes | yes | no | Document what it includes? (Gerrit, Phab, SAL, ...?)
#wikimedia-sre | public | SRE routine communication (of wide interest) and incident response | new | Create; and move normal SRE discussion here | yes | no | no |
#wikimedia-sre-private | SRE team only | non-public discussion (personnel, travel, expense reports, etc); watercooler conversation | new | Create (and move sensitive SRE discussion here from #mediawiki_security) | no | no | yes | What will we do if we have incidents that cannot be public but where we need to work with people outside SRE? I think that probably will be norm rather than the exception, in those cases. :)
#wikimedia-sre-mgmt | SRE line and program managers | SRE Management urgent, non-public messaging | new | Create | no | no | yes |
#wikimedia-databases | public | subteam/topic-specific discussion | existing | no change | yes | no | no |
#wikimedia-dcops | public | subteam/topic-specific discussion | existing | no change | yes | no | no |
#wikimedia-netops | public | ? | existing | delete | yes | no | no |
#wikimedia-serviceops | public | subteam/topic-specific discussion | existing | no change | yes | yes | no | how obvious is the meaning of "serviceops"?
#wikimedia-sre-foundations | public | subteam/topic-specific discussion | new | create | yes | no | no | This is for infrastructure foundations. "foundations" in title is confusing, however. "infra"?
#wikimedia-traffic | public | subteam/topic-specific discussion | existing | no change | yes | no | no |
#wikimedia-clinic | secret | topic-specific discussion regarding sre clinic duty | existing | ? | ? | ? | ? |
#wikimedia-pipeline | public | discussing dependencies between Release Engineering and SRE | existing | no change | yes | no | no | Pipeline is more an interlock channel between RelEng and SRE. Not necessarily entirely in SRE scope to change.
#mediawiki_security | legacy, invite-only, attendees unclear | deprecated | existing | TBD | ? | ? | ? |


Proposed changes to process

❏ Write an SOP (Standard Operating Procedure) for how to create new channels (see also m:IRC/Instructions#New_channel_setup)

❏ There should be a Phabricator template for requesting new IRC channels, that includes:

  • direction to check the list of existing channels first
  • fields for:
    • name
    • membership
    • purpose
    • whether or not it is logged
    • whether or not bots are allowed (other than logging)
    • whether all members are subject to WMF NDA or not
    • why none of the existing channels is suitable
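
To make the field list above concrete, here is a minimal sketch of how such a request could be represented and sanity-checked before filing. It is only an illustration: the names `ChannelRequest`, `validate`, and `EXISTING_CHANNELS` are hypothetical and not part of any Phabricator or IRC tooling, and the channel set is a sample copied from the overview table rather than an authoritative list.

```python
from dataclasses import dataclass

# Sample of channel names taken from the overview table on this page.
# Assumption: a real check would consult the maintained channel list instead.
EXISTING_CHANNELS = {
    "#wikimedia-operations",
    "#wikimedia-databases",
    "#wikimedia-dcops",
    "#wikimedia-serviceops",
    "#wikimedia-traffic",
}


@dataclass
class ChannelRequest:
    """One field per item in the proposed request template (hypothetical type)."""
    name: str
    membership: str
    purpose: str
    logged: bool
    bots_allowed: bool  # bots other than the logging bot
    nda_only: bool      # are all members subject to the WMF NDA?
    why_not_existing: str


def validate(request):
    """Return a list of human-readable problems; an empty list means it looks complete."""
    problems = []
    if not request.name.startswith("#"):
        problems.append("name must start with '#'")
    if request.name in EXISTING_CHANNELS:
        problems.append("channel already exists")
    # Free-text fields must actually be filled in.
    for field_name in ("membership", "purpose", "why_not_existing"):
        if not getattr(request, field_name).strip():
            problems.append(field_name + " must not be empty")
    return problems
```

For example, a request that reuses an existing channel name and leaves the justification blank would come back with two problems, which mirrors the "check the list of existing channels first" direction in the template.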

Channel by Channel discussion

#wikimedia-operations

Summary

This is a public channel for bot announcements.

  • DECISION 1: Should this be renamed?
    • OPTION: Stay status quo
      • Keep it because, after the channel cleanup, it won't be associated with any one team or subteam
    • OPTION: Rename to #wikimedia-sre-feed
      • Title more clearly matches the intent for a bot feed channel.
  • QUESTION: What is the full list of bots that report to this channel?
    • Would be helpful documentation for making changes or disaster recovery.

Discussion

So I have not attended previous meetings, so I am not aware of already taken decisions, so I apologize in advance if I reopen things that had already been agreed. I am also not aware of what other teams that use "our" channels have said.

The only thing on the proposal that I am not 100% sold on is the removal of #wikimedia-operations. I think the channel could be kept as is (the name itself is not that important), but instead of "banning" everything except bots, I would merge bots + software deployments + incident response. I understand that the reason for conversations/incident response is because in a huge outage response conversations are difficult, but 1) those are not that common 2) maybe the problem is some kind of alerting actionable rather than changing the log (e.g. sre-feed has everything, -operations is rate-limited) 3) you can still go to -sre if necessary. I think there is value in deployers seeing errors that are not just mediawiki, and it forces coordination between teams. Also it makes a single point of entry for reporting issues as a second level/knowledgeable support channel (#wikimedia-tech being the first level). -operations can be left unrenamed because it would be a generic "operations by SRE and releng (maybe others)". I am ok with shifting general conversations to -sre or anywhere else. Above all, I would like releng and others to participate in the discussion, and I don't think -operations is just "ours" anymore.

I apologize again if this has already been discussed and decided on, I just wanted to present the "coordination with other teams/bad things ongoing" perspective. We asked for deployments to happen on -operations some time ago, for a reason. If you remove bots from there, I almost guarantee that deployers will not check them/care about them, and start deploying in the middle of a big outage. Maybe it has been planned to be "solved" in other ways, but it is not reflected here, sorry again.

For the rest of the channels I don't have strong opinions, I only have some general (and personal) thoughts - I don't want too many channels, and I want a way for informal discussions, but not a big deal, if it is not part of the final proposal. -- Jcrespo 19:51, 5 March 2019 (UTC)Reply

PS: As Manuel says above, we have bots on -databases. This is very useful for us because it has a very low traffic, but #DBA is usually not added together with #operations, so it won't appear anywhere else.

I don't understand where you get the notion of "removal of #wikimedia-operations" from; the proposal on the page is to keep it, and at maximum (maybe) it would be renamed to #wikimedia-sre-feed. Most general discussions would no longer take place there, but it will still be the place for bot notifications/errors. Both #wikimedia-sre and -operations (or -sre-feed) would be fully public and open to anyone. So what exactly are the issues you're seeing?
Right now I'm thinking, perhaps we should keep the name #wikimedia-operations instead of rename it, precisely because it is no longer associated with a specific team, and that could make it more inviting to others interested in "general site operations". -- mark (talk) 11:30, 6 March 2019 (UTC)Reply

I would like to keep it as it is. So, I vote for no renaming. Marostegui (talk) 15:19, 8 March 2019 (UTC)Reply

Renaming #wikimedia-operations is really not an option in my opinion. It's too widely known as our public point of contact. Also, it's not just a feed, and won't be even in this proposal AIUI Giuseppe Lavagetto (talk) 07:01, 9 March 2019 (UTC)Reply

+1 to keep it with current name and move out general discussions, but we should keep deployments (jouncebot and coordination) and !log actions there. Volans (talk) 12:35, 13 March 2019 (UTC)Reply

+1 to keep it with current name but move out as much discussion as is possible. I find it almost impossible to follow during working hours, and never even attempt to read scrollback from off-hours. CDanis 12:54, 15 March 2019 (UTC)Reply

+1 to keep it Filippo Giunchedi (talk) 10:38, 14 March 2019 (UTC)Reply

+1 to keep it effie mouzeli (talk) 17:17, 14 March 2019 (UTC)Reply

#wikimedia-sre

Summary

  • DECISION 2: Should this channel be created?
    • Yes
    • No
  • DECISION 3: What would belong on this channel?
    • OPTION: all non-private technical discussion within SRE by default
      • Private: security vuln; private data
      • Provide people outside SRE with a single place to find the SRE team as a whole
        • without the need to know which sub-team is in charge of what
        • and ask questions / report issues / coordinate with SRE, including clinic duty related topics
      • coordinate actions during outages, both within and outside of SRE
    • OPTION: non-subteam-specific, non-private technical discussion.
    • OPTION: Use for public conversation during incident response

Summary updated JAufrecht (talk) 21:56, 7 March 2019 (UTC)Reply


Discussion

I'm not entirely sure I see the need for the #wikimedia-sre channel, and I'm definitely sure we don't need a clinic-duty related channel.

We already have a ton of channels, let's try to keep them reasonable.

Giuseppe Lavagetto (talk) 06:57, 27 February 2019 (UTC)Reply

  • I agree with the #wikimedia-sre channel confusion. Right now we tend not to use #wikimedia-operations for discussions which is understandable as sometimes they are mixed with private conversation, but we don't even normally use the public one during outages (which I would love to use), so I don't know if we would be using this new channel for discussions, or we'd keep using the private channel for that like we do now. Marostegui (talk) 08:49, 27 February 2019 (UTC)Reply


The #wikimedia-sre channel was proposed in last year's SRE offsite, before some more recent channel creation. The idea was, and I think still is, to have a common centralized place where:
  • have all technical discussion within SRE by default, unless they involve private/sensitive topics.
  • people from other teams/outside the Foundation can find the SRE team as a whole (without the need to know which sub-team is in charge of what) and ask questions / report issues / coordinate with SRE, including clinic duty related topics.
  • coordinate actions during outages, both within and outside of SRE
  • it would probably make sense to allow !log from this channel (possibly having the !logs relayed to the #wikimedia-operations channel too)
I'm finding it quite confusing to follow all the existing and new sub-team channels lately as our responsibilities are not (yet?) aligned with the sub-teams, so a discussion on topic X should include/notify people from different sub-teams that might miss it if it happens on a sub-team channel, and a central public discussion channel seems to fit the role here. I'm totally aware that not all discussions can happen on the same channel but we need something more flexible than just sub-teams channels. One could argue why not topic-channels, and the thing will diverge pretty soon. Also IRC capabilities are a bit limited in this area as channel creation is "costly".
Volans (talk) 10:32, 27 February 2019 (UTC)Reply
+1 Alexandros Kosiaris (talk) 12:10, 27 February 2019 (UTC)Reply
I agree with you, besides (somewhat) the first point: While I still think #wikimedia-sre should be an important channel for lots of technical discussions of wide interest to many in the SRE team, I don't think it should "have all technical discussion within SRE by default". Experience has shown that we've simply grown too big for that, and it would lead to too many conflicting conversations all the time, leading people to find other venues to talk, such as private discussions. Also keep in mind that we've had some separate sub team channels (more or less) for quite a long time already, so this need isn't entirely new either. -- mark (talk) 15:06, 5 March 2019 (UTC)Reply
I hope I've updated the proposal for this channel to reflect the above. -- mark (talk) 16:46, 7 March 2019 (UTC)Reply

If we really need to add this channel, let's reserve it for incident response discussions and general, all-sre discussions (like some of the ones I've seen flowing through the security channel) Giuseppe Lavagetto (talk) 07:07, 9 March 2019 (UTC)Reply

If this isn't a rename of the -operations channel, I don't think we need it. Yes, things get noisy when there are outages and the bots and alerts are all going off. We've talked about having a separate feed for the bots for years and never reached agreement. The subteam channels were created to move most discussion out of the main channel, with the understanding that even volunteers who don't chat with us every day will find the right place to go to bring up their topic, we ought to do the same. -- ariel (talk) 09:09, 13 March 2019 (UTC)Reply

#wikimedia-sre-private

Summary

Some incidents can be discussed within a regular SRE private space, but should they be? What about incidents that include people outside of SRE, probably the majority use case?

  • DECISION 3.5: Should there be a #wikimedia-sre-private channel for SRE watercooler talk?
    • OPTION: Yes
    • OPTION: No
  • DECISION 4: What is the scope of extra-SRE incident response?
    • OPTION: WMF staff outside of SRE
    • OPTION: previous + anyone invited with NDA
    • OPTION: previous + non-NDA people on invite
  • DECISION 5: Where can we have non-public extra-SRE incident discussion?
    • OPTION: #wikimedia-sre-private
    • OPTION: A dedicated channel such as #wikimedia-sre-incident


Discussion

I don't think that the "purpose" of this channel in the table is correct (non-public incident response details). When non-public incident response happens it involves almost all the time other teams/people too and an SRE-only channel would not fit the requirement. I think that this kind of discussions could stay in the existing #mediawiki_security channel (eventually renamed) or equivalent, but must have a broader participation than just SRE.

My understanding of the scope of the #wikimedia-sre-private is to be a place where:

  • socialize within the SRE team between each other in a private space
  • discuss non-technical private topics (travel, expense reports, etc.)
  • discuss sensitive technical topics that are not incident response (new security releases of software, help debugging something that might refer to private data, etc.) Preceding unsigned comment added by Volans (talk • contribs)
I have the same understanding. Non-public stuff, but not necessarily incident response related. Alexandros Kosiaris (talk) 12:11, 27 February 2019 (UTC)Reply
+1 CDanis 20:11, 28 February 2019 (UTC)Reply
I think that's correct, and indeed not accurately depicted in the plan currently. As well as for sensitive SRE-internal discussions that can't happen in public, it can be a watercooler channel, and thereby take that function from #mediawiki_security for a lot of cases. Watercooler conversations are important, and they can't all happen in public/in a logged channel, as people will feel the need to watch what they're saying. -- mark (talk) 15:02, 5 March 2019 (UTC)Reply
The page now explicitly states that this is for watercooler conversations also. -- mark (talk) 16:47, 7 March 2019 (UTC)Reply

#wikimedia-sre-mgmt

  • DECISION 6: Should this channel be created?
    • Yes
    • No

#wikimedia-databases


#wikimedia-dcops

No questions.

#wikimedia-netops

  • What is it used for?
  • DECISION 8: Should it be closed?
    • OPTION: Yes.
      • I think we should close -netops and use -traffic for that. -netops is pretty dead anyway, -traffic is not too busy for this to be a problem.
    • OPTION: no

#wikimedia-serviceops

  • DECISION 9: Is the name clear enough?
    • OPTION: Yes
    • OPTION: No

#wikimedia-sre-foundations

  • DECISION 10: Should this channel be created?
    • Yes
    • No
  • DECISION 11: Would the name be confusing?
    • OPTION: Yes
      • Doesn't match the team name, which is "Infrastructure Foundations"
      • The word "Foundations" is overloaded
      • Could use #wikimedia-sre-infra?
    • OPTION: No
      • "wikimedia-sre-foundations" is sufficiently separate from "Wikimedia Foundation"

#wikimedia-traffic

No changes proposed; no questions.

#wikimedia-clinic

Summary

What is it for, how should it change, is it logged, does it have bots?

  • Was created accidentally.
  • Is used for discussion, and requests for help, specific to non-emergency clinic duty.
  • DECISION 12: Should we keep #wikimedia-clinic?
    • OPTION: Keep
      • Improves experience for new clinicians
      • signal separation
        • questions are not lost in a noisy channel
        • doesn't contend with higher-priority problems.
        • easier to track if questions are answered
        • clinic traffic could be spammy on general sre channel.
        • low, on-topic traffic gives more time to reflect
        • Clinic duty is a quasi-subteam so should have a subteam channel for the same reasons as other subteam channels.
    • OPTION: Delete
      • was created by accident; should not be permanent until after discussion of clinic duty itself
      • Could be merged with #wikimedia-sre to simplify

Discussion

#wikimedia-clinic was accidental. It was created when 2 people were on duty (me and matt) so as to sync better with Moritz who was helping. After that, whenever another of the new people was on duty, they joined to get help/advice. I suggest we keep it until clinic duty is better defined and documented. effie mouzeli (talk) 08:07, 27 February 2019 (UTC)Reply

We risk making the "accident" permanent though. We haven't even started any serious discussion about what is to happen to clinic duty. My take would be to remove it and subsume the functionality that it offered to #wikimedia-sre (which I suggest above that we keep it) Alexandros Kosiaris (talk) 12:10, 27 February 2019 (UTC)Reply
For being an accident, the clinic channel has been very helpful. It has made clinic duty less intimidating. Cwhite (talk) 21:24, 27 February 2019 (UTC)Reply
I also found the w-clinic channel very useful while on duty, and as this is/was my first duty I have spammed it quite a lot, which may add a bit too much noise to a general channel such as w-sre Jbond (talk) 12:55, 5 March 2019 (UTC)Reply
I as well find #wikimedia-clinic quite useful for a couple reasons. 1) Questions are not easily lost into a noisy channel. The dedicated channel helps ensure questions may be asked freely (without contention against operational emergencies, etc.) answers are found, and it also gives other participants time to reflect on the questions and answers written. 2) Clinic duty in many ways is a group within itself, in our case a rotating virtual group. Herron (talk) 18:26, 7 March 2019 (UTC)Reply

What I see in the discussion above is the need for a quiet channel where questions can be asked. That should hopefully be the #-sre channel above. I don't have any skin in the discussion, but this seems to me like an excessive proliferation of channels, again. Giuseppe Lavagetto (talk) 07:10, 9 March 2019 (UTC)Reply

A separate channel for this seems like overkill to me. -- ariel (talk) 09:10, 13 March 2019 (UTC)Reply

If we decide to keep this channel, joining it will be completely optional. Reading the discussion here and from hanging in the channel, it looks like the channel is a big help to some of us. effie mouzeli (talk) 16:51, 14 March 2019 (UTC)Reply

I expect that some of these opinions are controversial but I'm going to proceed anyway:

  • I don't see proliferation of channels as a problem in itself; rather I think the problems are it's too high-overhead to create new channels/temporary channels properly, the lack of one central place to do fulltext search of logged channels, and the lack of a mechanism to log NDA'd channels with logs accessible just for NDA users.
  • I would guess that some of the issue that folks see with channel 'proliferation' has to do with the feeling that it's already too hard for one person to follow everything that's going on within SRE. Is this impression correct? Anyway, I do agree it's already too hard, but I additionally think that's probably just unrealistic as a goal at this point -- certainly unrealistic to be able to follow what's going on across the whole team minute-by-minute/hour-by-hour.
  • Smaller channels like #wikimedia-clinic give new folks a space to ask 'silly' questions without the feeling like all of their teammates/the whole world is watching. This is important IMO.

CDanis 12:49, 15 March 2019 (UTC)Reply

#wikimedia-pipeline

No changes proposed, no questions.

#mediawiki_security

Summary

  • DECISION 13: Where should SRE private & watercooler discussion happen?
    • OPTION stay in #mediawiki_security
      • status quo option
    • move to new #wikimedia-sre-private
      • #mediawiki_security is not an appropriate channel
        • Misleading name to have SRE discussions
        • Not a good watercooler space
          • criteria for membership is/was effectively cryptic
          • not a fully NDAed channel
  • DECISION 14: What belongs in #wikimedia-sre-private
    • OPTION: Practical, private SRE topics
      • E.g. travel, expense reports, budget questions, etc. Non-emergency.
    • OPTION: 2A + Private, non-emergency technical topics
      • Sensitive technical topics that are not incident response
        • e.g. new security releases of software, help debugging something that might refer to private data
    • OPTION: 2B + Watercooler
      • Personal, non-public discussion within SRE team
  • DECISION 15: What should happen to #mediawiki_security?
    • OPTION 3A: pass on the problem
      • Not clear SRE owns it
      • Should be decided by Engineering leadership jointly

Summary updated JAufrecht (talk) 21:49, 7 March 2019 (UTC)Reply

Discussion

That channel has been around forever, it's mostly used by SREs but not only by them. By removing our main watercooler discussions, it would be left to casual tech or non-tech interaction between teams.

I think it will still be useful, but if it won't be, we should revisit its utility once we've moved our non-public conversations away. It's also not our decision to make, IMHO. Giuseppe Lavagetto (talk) 06:51, 27 February 2019 (UTC)Reply

+1 Volans (talk) 10:09, 27 February 2019 (UTC)Reply
+1 Alexandros Kosiaris (talk) 12:06, 27 February 2019 (UTC)Reply
+1 effie mouzeli (talk) 18:56, 27 February 2019 (UTC)Reply
+1. I also recall the importance of watercooler channels, and possibly having one just for SRE, coming up in our talks...? CDanis 17:29, 28 February 2019 (UTC)Reply
The problem statement here is: a) this is a channel with "mediawiki" and "security" in its name, and the conversations there usually have nothing to do with either; b) the criteria for being there are entirely subjective and even its mere existence wasn't acknowledged for a long time, so it felt and still feels like a "cabal" channel; c) no one really owns the channel, and SRE are the de facto gatekeepers for it, which in combination with (b) makes for some awkward conversations and unclear decision-making whenever we talk about who to invite, accept in or remove from this channel (and this also explains why SREs are overrepresented in it). As SRE, we should move our watercooler conversations to our own spaces with clear and explicit rules like this proposal attempts to do. Past that, the future of #mediawiki_security is something that engineering leadership can jointly decide. I do see the need for casual tech/non-tech interactions between teams, though, but if we are to address that we should probably think of a different venue, one that is clearly named for its purpose and inclusive across our engineering staff. faidon (talk) 14:20, 6 March 2019 (UTC)Reply
I've updated the proposal page to state that we'll deprecate it in favor of other channels, for now. Ownership of this channel is indeed a bit unclear, founded by a volunteer way back, and not even in normal #wikimedia- namespace. -- mark (talk) 16:44, 7 March 2019 (UTC)Reply
This channel was around before there was an ops team. While it has been used regularly to discuss things that don't belong in a public logged channel, from water-cooler chatter to investigations of DDOSes, it is, or was, not limited to use only by engineers, nor only to people on staff. No existing or proposed channel is a good substitute. I do think that any new replacement channel(s) -- or this one, if kept -- should have a clearly defined charter. -- ariel (talk) 09:01, 13 March 2019 (UTC)Reply

I think this channel has a very important use case, cross-team coordination during sensitive events, that is not covered by any of the other channels in the proposal. By sensitive events I mean security vulnerabilities, potential attacks, debug/incident response that involves private data. Given that the other security channel is for the Security team, that one doesn't seem a good fit for this use case either. I don't mind re-organizing it, clearly stating who "owns" it and who should be part of it and potentially renaming it, but I think that a private cross-team incident response channel is needed. Volans (talk) 12:46, 13 March 2019 (UTC)Reply

Totally agree Marostegui (talk) 06:13, 14 March 2019 (UTC)Reply
A point of clarification - the security team recently launched our own private team channel, separate from #wikimedia-security. I personally wouldn't consider #wikimedia-security to be "just for the security team", at least not anymore. Our watercooler and private chatter happens mostly in the team channel now. I'm not sure how relevant #wikimedia-security really is anymore, especially if #mediawiki_security (or a similar replacement channel) is going to exist. SBassett (talk) 13:46, 20 March 2019 (UTC)Reply

Other

  • Is there a single person who has access to implement all of the changes we agree on?

regarding -sre, -sre-private, and _security

I think the concerns raised regarding the -sre-private channel and the current proposal (that being where incident response happens) are correct (iow: incident response can't be in an SRE-only channel). There's _security and #wikimedia-security already. I wonder if renaming _security to something different would suffice (and make NDA-only) for the incident response purpose (and leave #wikimedia-security for their team related things). Or, maybe more correctly(?), should non-security incident response be in -sre and security related incident response be in #wikimedia-security? (since there won't be the logspam in -sre as it'd be in -sre-feed) Greg Grossmeier (talk) 22:23, 19 March 2019 (UTC)Reply


SRE Decisions from March 2019

1: #wikimedia-operations is not renamed.

2: #wikimedia-sre is created.

3: What would belong on #wikimedia-sre? Public SRE team discussion and public incident response—experimental.

3.5: Should there be a #wikimedia-sre-private channel for SRE watercooler talk? Yes, for non-public SRE discussion (e.g. travel, expense reports, non-emergent security news, discussion including private data, watercooler talk)

4: Where can we have non-public extra-SRE incident discussion? #mediawiki_security

5: What is the scope of extra-SRE incident response? Continue to use #mediawiki_security as is, because the issue is too big for SRE to change unilaterally.

6: Should #wikimedia-sre-mgmt channel be created? Yes.

7: Do we want bots in #wikimedia-databases? Yes, no change.

8: Should #wikimedia-netops be closed? Yes. Already done.

9: Is the name #wikimedia-serviceops clear enough? Sure, why not. Not all SRE subteams have SRE in the name, so maybe true consistency would take too many changes, so leave it.

10: Should #wikimedia-sre-foundations be created? YES, already done.

11: Should it be called something else, like #wikimedia-sre-infrastructure? No.

12: Should we keep #wikimedia-clinic? Yes.

13: Where should SRE private & watercooler discussion happen? #wikimedia-sre-private.

14: removed; duplicate with 3.5.

15: What should happen to #mediawiki_security? No change for now other than moving SRE private discussion (e.g. water-cooler) to #wikimedia-sre-private.