Help:Labs labs labs/future

From Wikitech

Preamble

We currently operate our cloud infrastructure in a provider-tenant model that has outgrown the view of "Labs" as solely a testbed for eventual production use cases. Important community tools and resources that staff and volunteers need on a consistent basis to do their work (such as inline CI testing) are now managed by OpenStack. The "Labs" moniker in practice covers a few separate user classes. We want to identify them formally so that we can provide better service to those who need a high level of reliability and stability, and acknowledge those who do not.

Now

Tool Labs

  • We host tools and bots (Tool Labs) in a specific project using shared VMs, Sun Grid Engine, and heavy reliance on NFS.
  • Tool Labs users have limited isolation within their VMs.
    • Many tools within Tool Labs do not publish their source code outside of the running instance.
    • Hard-coded passwords and private information (such as OAuth tokens) make publishing the source difficult.
  • We now have a small but active backlog of tools seeking migration into "production", in the belief that this is the primary way to achieve reliability and support.

VMs

  • We encourage users to maintain a private Puppet master in projects where out-of-production-band Puppet code needs to be tested.
    • This leaves pockets of VMs disconnected from the main Puppet master for orchestration and updates.
    • This includes Operations-managed meta-projects like Tool Labs.
  • We have a "beta" environment and a young, incomplete "staging" environment that is meant as an official pre-production testing ground.
  • Members of the community run highly visible projects within a Labs project, such as mwoffliner for generating ZIM files, alongside ephemeral test projects whose VMs are seldom used and require no definite level of service.
  • We do not differentiate between one-off testing projects and persistent, important Beta-esque projects for resourcing or support.

OpenStack Infrastructure

  • We have one broadcast domain for all VMs, with minimal security segmentation based on a deprecated network service.
  • Historically, Labs has had poor uptime and sporadic service outages that have affected tenants.
    • NFS especially is used as a cure-all for: large storage needs, file/code distribution, backups, object storage (Swift replacement), and persistence.

General

The "Labs" name as a descriptor has been severely overloaded. At various times it can refer to: the dedicated team, the OpenStack infrastructure, services run by Operations in support of VMs, software running within a specific project, the Tool Labs infrastructure that runs within OpenStack, specific tools and functionality within Tool Labs, the custom OpenStack interface running on wikitech, and specific hardware dedicated by Operations to supporting OpenStack functionality (which will become more confusing as we adopt hardware dedicated to specific projects).

Future

  • Identify user classes to allocate resources as appropriate during migrations and resource contention
  • Commit to specific SLAs for tenants that require them
  • Provide 'secret' distribution mechanisms for intra- and inter-project use cases
  • Re-frame the conversation about where serious tools need to run with VPS and Tools as options
  • Describe an opinionated runtime model for SLA'd projects including repeatability (Puppet) and visibility (point of contact)
  • Deploy a Neutron hybrid flat and isolated network model
  • Deploy a managed container environment to replace the current Tool Labs as "Tools"
  • Set operational expectations in Tools regarding code availability and running state
  • Replace our aging custom dashboard with Horizon (upstream dashboard)
  • Reduce or replace NFS as a requirement everywhere possible, with the intention of deprecating it
  • Define testing models for pre-production Puppet code and service deployment that do not rely on a proliferation of puppet::master::self
  • Provide a development environment for container-deployed services that mirrors local development where necessary
  • Re-brand where appropriate for descriptiveness, with "Lab VPS" as the best-effort VM class for miscellaneous and short-lived testing

Defining Classes of Service

Env: Development
Need: Develop my tool (local)
What is it? We provide a way to bootstrap development that will integrate with deployment as a container within TOOLS.
What to expect: Drop into #wikimedia-cloud for community and WMF support. We use the same tools and have probably encountered the same problems. IRC and mailing list support.

Env: TEST VPS
Need: Hosted resources for experimentation and research
What is it? Hosted projects that provide root on full VMs with best-effort uptime and reliability. An environment to prototype a new service or set up a replica to troubleshoot a complicated issue.
What to expect: We provide quota-enforced project workspace for VMs. LAB VPS projects that aim for availability and sustainability should integrate into the VPS environment to ensure resourcing. LAB VPS projects and instances will be audited periodically for activity and may be reintegrated into the resource pool in coordination with the project owner or if the project owner is unavailable.

Env: TOOLS
Need: WMF runs my tool or bot
What is it? Managed and hosted container environment with an SLA. Community-maintained external tools supporting Wikimedia projects and their users. A WMF- and volunteer-managed environment where containers are run in a highly available state. We provide logging and alerting infrastructure for tool owners. Tools can be developed locally or remotely.
What to expect: Deployed containers are managed entirely by WMF. Tools can send emails and serve web pages. Tools can also use shared TOOLS/VPS infrastructure for access to database replicas and other shared resources. The Hosting Team will provide support and, where appropriate, backup facilities.

Env: VPS
Need: Reliable cloud hosting
What is it? Hosted VPS projects that provide root on full VMs with an SLA. Long-lived projects that are part of the expected Wikimedia ecosystem, like Staging. The Tools hosting environment itself is a VPS project. These projects have uptime requirements greater than those for experimentation or development. The VPS environment is a platform where the Hosting Team strives to ensure repeatability and availability.
What to expect: VPS projects are allocated resources first when there is contention with LAB VPS. VPS projects are held to a standard of repeatability and data integrity. The Hosting Team will provide backup facilities where appropriate.

Env: Prod
Need: WMF Production
What is it? For access to sensitive data, large resource sourcing, or if the determination is made by the WMF Operations team.
What to expect: Access to production is more complicated and nuanced. All access is limited, and services that need to run in this context do so with heavy coordination with Operations.
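
The rule that VPS projects are allocated resources first when there is contention with LAB VPS can be illustrated with a small sketch. This is only one possible reading of the policy, not a description of any existing scheduler: the `Request` type, the project names, the vCPU unit, and the numeric ranks are all hypothetical.

```python
from dataclasses import dataclass

# Lower rank is served first. The class names come from the text above;
# the numeric ranks are an illustrative assumption.
PRIORITY = {"VPS": 0, "LAB VPS": 1}


@dataclass
class Request:
    project: str        # hypothetical project name
    service_class: str  # "VPS" or "LAB VPS"
    vcpus: int          # requested capacity in hypothetical vCPU units


def allocate(requests, capacity):
    """Grant requests in priority order until capacity runs out.

    Returns a dict mapping project name to the vCPUs actually granted.
    """
    granted = {}
    # sorted() is stable, so requests within one class keep arrival order.
    for req in sorted(requests, key=lambda r: PRIORITY[r.service_class]):
        share = min(req.vcpus, capacity)
        granted[req.project] = share
        capacity -= share
    return granted
```

With 10 vCPUs available and two requests of 8 each, the VPS project would receive its full request and the LAB VPS project the remaining 2, regardless of which request arrived first.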

Development

  • A bootstrapped local Tool container development model.
  • Git push deploy

LAB VPS

  • Deployed on flat network for generic bastion usage
  • Will be reaped and cleaned up periodically based on activity
  • Best effort support and stability
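
As a rough sketch of how activity-based reaping might nominate projects for cleanup, assuming a per-project "last activity" timestamp is available. The 90-day threshold, the function name, and the project names in the usage note are hypothetical. The function only lists candidates, since actual reclamation is meant to happen in coordination with the project owner.

```python
from datetime import datetime, timedelta

# Hypothetical idle period after which a project is flagged for review.
IDLE_THRESHOLD = timedelta(days=90)


def reap_candidates(last_activity, now, threshold=IDLE_THRESHOLD):
    """Return the names of projects idle for longer than `threshold`.

    `last_activity` maps project name to the datetime of its most recent
    activity. Names are returned sorted so that reports are stable.
    """
    return sorted(
        name for name, seen in last_activity.items() if now - seen > threshold
    )
```

For example, evaluated on 1 June 2016, a project last active on 1 January 2016 would be flagged, while one active on 20 May 2016 would not.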

TOOLS

  • Tools are expected to be in Git (do we allow Mercurial?) (we need to provide a secret distribution mechanism)
  • Tools are expected to follow the process for documenting container requirements (requirements.txt equivalent?)
  • Tools must use a reverse proxy and are not assigned public IPs
  • Tools are expected to use below x resources (this is a possible thing to add at some point)
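
Publishing tool source in revision control only works if credentials live outside the code. The secret distribution mechanism is still to be defined, so the following is a sketch under one assumption: that secrets reach the container as environment variables. The variable name `TOOL_OAUTH_TOKEN` and the helper `load_secret` are hypothetical.

```python
import os


def load_secret(name, environ=os.environ):
    """Return the secret stored under `name`, or raise if it is missing.

    Reading from the environment (an assumed delivery mechanism) keeps
    passwords and OAuth tokens out of the published source tree.
    """
    try:
        return environ[name]
    except KeyError:
        # Fail loudly instead of falling back to a hard-coded default.
        raise RuntimeError(f"secret {name!r} is not configured") from None
```

A tool would then call something like `load_secret("TOOL_OAUTH_TOKEN")` at startup and refuse to run if the secret is absent.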

VPS

  • Provide templated BCaaS (Beta Cluster as a Service)
  • VPS configuration needs to be done via Puppet
  • Do you need an external IP?
  • Do you need resources outside of the usual quota, such as physical hardware?

Glossary:

TOOLS - New environment for tools and bots to run as containers.

VPS - Virtual Private Server

SLA - Service Level Agreement

Integration with Community Tech

Community Tech = CT

  • Qualitative value judgements on Labs resourcing?
    • Proposal to integrate with grant making and resource allocation for hardware and increased Labs resources, for community oversight.
  • Things CT wishes we were doing better now in Labs?
    • Reliability in Tool Labs, and availability of code in revision control for maintenance and contribution.
  • Does CT rely on things that run in Tool Labs? What is the interaction there?
    • They do and they run several of their own tools.
  • How will a reorg of Labs tenants (or a priority based ordering) affect CT?
    • Seems good, but special care should be taken to make clear that VPS VMs are not tied to staff or volunteer status; they reflect a level of service available to either equally.
  • How can we work better together on this?
    • Continue to sync up going forward.