SRE project wishlist

From Wikitech
This page may be outdated or contain incorrect details. Please update it if you can.

This page is meant to be used to coordinate projects within the Ops team. The focus of this page should be coordinating non-geographically based sprints. The ideal length of a project is 1-2 weeks. If a project's scope grows beyond this, it will probably require more iterations, if not a larger discussion within the Ops team.

Please feel free to add your projects. Please include at least some of the following: description/motivation/dependencies, spec, links to Bugzilla/RT tickets, duration, start and end dates, interested parties. This page is meant to supplement bug trackers, not replace them.

Volunteers are also welcome! Please feel free to contact other people working on these projects and help out!

Active projects

Netflow collector

  • Team: Joel (contractor)
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: Nov 2013-

Set up a NetFlow collector or two and point sampled NetFlow version 9 or IPFIX from all routers to those (or to a multicast group that the collectors will listen to). The goals are to detect DoS or DDoS attacks more effectively, to get per-AS traffic statistics, and to inform peering & routing decisions.

pmacct is an excellent piece of software for this purpose, although the less complex nfdump could also be used.
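As a rough illustration of the two goals above (per-AS statistics and DoS detection), here is a minimal Python sketch. It does not use pmacct or nfdump; the flow record layout (`src_as`, `dst_ip`, `bytes`, `packets`), the function names and the threshold are all assumptions made for the example. Counters from sampled flows are multiplied by the sampling rate to estimate real traffic.

```python
from collections import defaultdict


def per_as_bytes(flows, sampling_rate=1):
    """Sum estimated bytes per source AS; sampled counters are scaled up."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow["src_as"]] += flow["bytes"] * sampling_rate
    return dict(totals)


def suspected_targets(flows, pps_threshold, interval_seconds, sampling_rate=1):
    """Return destination IPs whose estimated packet rate exceeds the threshold."""
    packets = defaultdict(int)
    for flow in flows:
        packets[flow["dst_ip"]] += flow["packets"] * sampling_rate
    return sorted(
        ip for ip, count in packets.items()
        if count / interval_seconds > pps_threshold
    )


# Hypothetical flow records covering a 60-second window, sampled 1-in-1000.
flows = [
    {"src_as": 64500, "dst_ip": "192.0.2.10", "bytes": 1500, "packets": 1},
    {"src_as": 64501, "dst_ip": "192.0.2.10", "bytes": 600, "packets": 10},
    {"src_as": 64500, "dst_ip": "192.0.2.20", "bytes": 400, "packets": 1},
]
```

A real collector would feed records exported by the routers into functions like these. A fixed packets-per-second threshold is only the simplest possible detector.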

Performance monitoring

  • Team: TBD
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: TBD

Expand statistics collection from the production site and collect more performance metrics from various pieces of infrastructure (Varnish, Ceph, Swift etc.). Expand the deployment of Graphite and possibly accompany it with software such as statsd. Integrate performance trend lines/forecasts with alerting.
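To make "trend lines/forecasts with alerting" more concrete, here is a minimal Python sketch that fits a least-squares line through (timestamp, value) samples and estimates how long until the line crosses a limit. The function names and the linear model are assumptions for illustration; this is not tied to Graphite or statsd, and real metrics may need a different model.

```python
def linear_trend(samples):
    """Least-squares fit of (timestamp, value) pairs; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def seconds_until(samples, limit):
    """Seconds after the last sample until the trend reaches limit.

    Returns 0.0 if the trend is already at or above the limit, and None if
    the trend is flat or falling and so never reaches it.
    """
    slope, intercept = linear_trend(samples)
    last_t = samples[-1][0]
    current = slope * last_t + intercept
    if current >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - intercept) / slope - last_t
```

An alert could then fire when, say, a disk is forecast to fill within a few days, rather than only when it is already nearly full.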

Backburner

Gerrit repo creation through wikitech

This project is to allow Gerrit repository creation through wikitech for MediaWiki extensions and Labs repositories. Maintainers for repos would be individuals, Labs projects, or service groups within Labs projects.

Basic monitoring & alerting

  • Team: TBD
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: TBD

The project is about adding more Nagios checks across the board. We currently average 4.4 checks per host, while our goal should be closer to 50 checks per host. We lack checks for very fundamental problems (such as a full disk). A sprint to collect all those needs and then add checks in puppet across the infrastructure should happen, and possibly be repeated in the future.

The same applies to a lesser extent to Ganglia and the metrics that are being collected there.
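As an example of the kind of fundamental check mentioned above, here is a minimal sketch of a disk-full check in Python. It follows the usual Nagios plugin convention of returning 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN) together with one line of output. The thresholds and function names are assumptions, and existing plugins may well be a better choice in practice.

```python
import shutil

# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path, warn=80.0, crit=90.0):
    """Return (status, message) for the filesystem containing path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path}: filesystem reports zero size"
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent, warn, crit)
    label = ("OK", "WARNING", "CRITICAL")[status]
    return status, f"DISK {label} - {path} is {used_percent:.1f}% full"
```

To use this as a plugin, a small wrapper would print the message and exit with the status code, e.g. `status, message = check_disk("/")` followed by `print(message)` and `sys.exit(status)`.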

Scaling of monitoring infrastructure

  • Team: TBD
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: TBD

Nagios (and Icinga) currently complain about the maximum number of checks being reached. Additionally, we currently lack the infrastructure to do per-DC monitoring and to distinguish signal from noise when, e.g., we lose an entire datacenter. The infrastructure will need to scale up, especially if we intend to add more checks (see above); possibilities include expanding the use of passive checks, using multiple Nagios boxes in a hierarchy, using check_mk or using something like mod-gearman.
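For the passive checks option, results are submitted to Nagios as external commands instead of being polled. The Python sketch below formats a `PROCESS_SERVICE_CHECK_RESULT` command line. The function name and the validation rules are assumptions for illustration, and how the line is delivered (for instance, written to the file named by the `command_file` setting) depends on the installation.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format a Nagios external command that submits a passive service result."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    for field in (host, service):
        # Semicolons separate fields and each command is a single line.
        if ";" in field or "\n" in field:
            raise ValueError(f"invalid character in field: {field!r}")
    timestamp = int(time.time() if now is None else now)
    single_line_output = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{single_line_output}"
    )
```

For example, `passive_service_result("db1", "Disk space", 2, "DISK CRITICAL")` produces a line reporting a critical result for the hypothetical host `db1`.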

Logging infrastructure

  • Team: TBD; interested so far: Faidon, Bryan Davis (for logstash)
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: TBD

The project is about organizing misc (i.e. not HTTP access logs) log handling across the Wikimedia infrastructure. Most logs currently go to user.log or some other centralized location, sometimes not being logrotated at all or being inconsistently logrotated. Puppetized rsyslog definitions based on process names or facilities should be provided that redirect services to well-known locations across the infrastructure; puppetized logrotate definitions per such file should also be provided, so as to have consistent retention of such logs.

A centralized rsyslog (for ops) should be installed to collect those and archive them. A blind log collector box for security purposes might also be a good idea.

Fancier log collection tools (such as logstash/kibana, graylog2 or a combination of the two) could be installed to provide easily searchable logs and log trends, to facilitate monitoring and troubleshooting.

Network-based security

  • Team: Faidon (for ferm), Leslie (VLANs)
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: TBD

The project was discussed during the Feb 2013 SF Hackathon. The results of that discussion were:

  • Investigate further the split of VLANs into core (wiki production cluster) and non-core (misc services, both internal & public-facing) and protect them from each other via router ACLs;
  • Gradually reduce the public IP perimeter by moving more servers (e.g. squids) to private IPs, including more of esams once we get the transatlantic MPLS link;
  • Firewall SSH on most of the IP perimeter;
  • Expand the use of host-based firewalls and use ferm & a puppet module.

User account refactoring

  • Team: Faidon, Ryan
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: TBD

The project was discussed during the Feb 2013 SF Hackathon. The results of the discussion were:

  • Move LDAP out of virt* boxes into separate boxes and make them first-class citizens
  • Use LDAP as the single truth of account information and groups
  • Use separate LDAP groups per realm, and production access (rename wmf group)
  • Use LDAP ACIs to restrict e.g. what labsconsole does
  • Write a script that dumps LDAP into a puppet manifest that gets checked out into git
  • Do not add an LDAP runtime dependency on production systems (shell)
  • Use per-account primary gid and supplementary gids for roles
  • Use groups/roles to give non-ops access to specific boxes
  • Have sudo definitions in puppet, not LDAP; use sudo by group/role membership
  • User accounts should never be renamed
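The idea of dumping LDAP into a puppet manifest could be sketched as follows. This Python example skips the LDAP query and starts from already-fetched account dictionaries. The dictionary keys and function names are assumptions, while `ensure`, `uid`, `gid`, `groups` and `shell` are attributes of Puppet's `user` resource type. Output is sorted so that successive dumps produce small, reviewable diffs in git.

```python
def quoted(value):
    """Single-quote a value for a manifest, rejecting characters we do not escape."""
    text = str(value)
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported character in {text!r}")
    return f"'{text}'"


def puppet_user(account):
    """Render one account dictionary as a Puppet user resource."""
    groups = ", ".join(quoted(g) for g in sorted(account.get("groups", [])))
    lines = [
        f"user {{ {quoted(account['name'])}:",
        "  ensure => present,",
        f"  uid    => {int(account['uid'])},",
        f"  gid    => {int(account['gid'])},",
        f"  groups => [{groups}],",
        f"  shell  => {quoted(account['shell'])},",
        "}",
    ]
    return "\n".join(lines)


def render_manifest(accounts):
    """Render all accounts, sorted by name so the output is deterministic."""
    ordered = sorted(accounts, key=lambda account: account["name"])
    return "\n\n".join(puppet_user(account) for account in ordered) + "\n"
```

Because the manifest is generated ahead of time and checked into git, production systems would not need to reach LDAP at runtime, in line with the list above.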

Expand use of AppArmor

  • Team: TBD
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: TBD

Create AppArmor profiles for all core components of the infrastructure, starting from image scalers, application servers and caching proxies.

Security Updates

  • Team: Leslie and Ryan
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: TBD

The goal is to handle patch management and security updates in a more organized fashion than we currently do. This was discussed extensively during the Feb 2013 SF Hackathon, and the preliminary ideas that came out of the discussion were:

  • Continue using servermon to track updates;
  • Have some automated way of upgrading non-service-affecting packages, possibly by forking unattended-upgrades and adding some kind of whitelist/blacklist functionality and perhaps even adding a policy-rc.d hook;
  • Formulate a process, to be applied later across the team, with some kind of rotation & schedule for regularly updating packages, rebooting machines for new kernels, rolling out security updates into custom packages etc.
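The whitelist/blacklist idea above could look something like the following Python sketch. It is independent of unattended-upgrades. The function name, the use of shell-style glob patterns and the rule that the blacklist wins over the whitelist are all assumptions for illustration.

```python
from fnmatch import fnmatchcase


def select_upgrades(pending, whitelist, blacklist):
    """Split pending package names into (automatic, manual) lists.

    A package is upgraded automatically only if it matches a whitelist
    pattern and no blacklist pattern; everything else is left for a human.
    """
    automatic, manual = [], []
    for package in pending:
        blocked = any(fnmatchcase(package, pattern) for pattern in blacklist)
        allowed = any(fnmatchcase(package, pattern) for pattern in whitelist)
        if allowed and not blocked:
            automatic.append(package)
        else:
            manual.append(package)
    return automatic, manual
```

With a whitelist of `["lib*", "tzdata"]` and a blacklist of `["linux-image-*", "varnish"]`, library and timezone updates would be applied automatically, while kernels and Varnish would wait for the scheduled rotation.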

mediawiki Labs project

  • Team: Ryan + TBD
  • Duration: TBD
  • Master bugzilla ticket: TBD
  • Dates: TBD

/mediawiki Labs project

Finished projects

DNS

  • Team: faidon, mark
  • Duration: 3 weeks
  • Master RT ticket: 4547 and others.
  • Dates: completed, Aug 20th 2013

The project is to create the next generation of the DNS infrastructure. The goals of this are:

  • To upgrade DNS servers to modern software (precise among others)
  • To use MaxMind databases for GeoIP lookups instead of the outdated DNSBL list;
  • To provide support for IPv6 GeoIP and hence be able to add AAAA records to our NS records;
  • To support the draft edns-client-subnet extension and hence provide better geolocation to Google Public DNS and OpenDNS;
  • To support more granular GeoIP than per-country-based, to e.g. be able to direct US per coast or state;
  • To support more failure scenarios instead of ${site}-down and hence to be able to scale to more datacenters;
  • To move zones from Subversion to Git and to use Gerrit for handling changesets and enable normal ops review processes and contributions from volunteers;
  • To support linting and provide more resiliency to the DNS infrastructure from e.g. typos.
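As a small illustration of the linting goal, the Python sketch below checks that A and AAAA records hold valid addresses of the right IP version. It works on already-parsed (name, type, value) tuples, which is an assumption made to keep the example short. It does not parse zone files and does not describe the checks actually deployed.

```python
import ipaddress


def lint_records(records):
    """Return a list of problems found in (name, type, value) record tuples."""
    problems = []
    for name, rtype, value in records:
        if rtype not in ("A", "AAAA"):
            continue  # Only address records are checked in this sketch.
        wanted_version = 4 if rtype == "A" else 6
        try:
            address = ipaddress.ip_address(value)
        except ValueError:
            problems.append(f"{name}: {value!r} is not a valid IP address")
            continue
        if address.version != wanted_version:
            problems.append(
                f"{name}: {rtype} record holds an IPv{address.version} address"
            )
    return problems
```

A check like this could run as part of review in Gerrit, so that a typo in an address is caught before the change is merged.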

Backup infrastructure

  • Team: Alex
  • Duration: 3 weeks
  • Master RT ticket: 5389
  • Dates: TBD

Design a new-generation backup architecture and data retention plan, and recommend hardware or service procurement and a transition plan.

  • Design and propose a backup architecture (Bacula-based? in-house?)
  • Define data retention plans
  • Hardware re-use (NetApps?) or new? Tape drives perhaps

Switch "text" to Varnish

  • Team: mark
  • Duration: TBD
  • Master RT ticket: TBD
  • Dates: May-Aug 2013

Text is one of the few services that haven't migrated to Varnish. The project is about collecting the missing bits for supporting that, such as support for (X-)Vary-Options and communicating to Platform the requirements from their side for this to happen. The project largely depends on hardware procurement and hence might be stalled from that side.