SRE/Infrastructure naming conventions

From Wikitech

This page documents the naming conventions of servers, routers, and data center sites.

Our servers currently fall broadly into two categories:

  • Clustered servers: These use numeral sequences with a descriptive prefix (see #Networking and #Servers). For example: db1001.
  • Miscellaneous servers: These used unique hostnames (see #Miscellaneous servers). For example: helium. This naming convention is deprecated and not used for new hosts, but some older miscellaneous-named hosts still exist.

Name reuse

Historically, we did not reuse names of past servers for new servers. For example, after db1001 is decommissioned, no other server will be named db1001. Ganeti VMs sometimes reuse hostnames, but bare metal typically will not.

The notable exception is networking gear, whose names are determined by rack location. For example, the access switch in Eqiad rack A8 is named asw-a8-eqiad. If it is replaced, the new switch takes the same name.

All hardware in the datacenter space is tracked in Netbox, which can be used to check for existing hostnames of both hardware and Ganeti instances.
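The no-reuse policy above can be sketched as a small helper that picks the next free name for a bare-metal host. This is only an illustration: `next_hostname` and its parameters are hypothetical, the list of used names is assumed to come from wherever past and present hostnames are recorded (for example an export from Netbox), and the assumption that numbering starts at the first number after the range start is inferred from examples such as db1001.

```python
import re
from typing import Iterable


def next_hostname(prefix: str, range_start: int, used_names: Iterable[str]) -> str:
    """Return the next hostname for `prefix` that has never been used.

    `range_start` is the first number of the datacenter's numeral range
    (for example 1000 for eqiad). Hypothetical helper, not an existing tool.
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    used = set()
    for name in used_names:
        match = pattern.fullmatch(name)
        if match:
            number = int(match.group(1))
            # Only consider numbers inside this datacenter's 1000-wide range.
            if range_start <= number < range_start + 1000:
                used.add(number)
    # Continue past the highest number ever used instead of filling gaps,
    # so that names of decommissioned hosts are never handed out again.
    candidate = max(used, default=range_start) + 1
    if candidate >= range_start + 1000:
        raise ValueError(f"numeral range exhausted for prefix {prefix!r}")
    return f"{prefix}{candidate}"


# db1001 may be long decommissioned, but its name is still skipped.
print(next_hostname("db", 1000, ["db1001", "db1003", "db2001", "dbproxy1005"]))
```

Because gaps are never filled, the helper needs the names of decommissioned hosts as well as live ones.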

Data centers

Data center names consist of the vendor's initials (at the time of lease signing) followed by the IATA code of a nearby major airport.

For example, our Dallas site is named codfw: the vendor is CyrusOne, and DFW is the large nearby airport. (Technically, Love Field airport is closer but less well known.)

DC     Vendor          Airport Code
codfw  CyrusOne        DFW
drmrs  Digital Realty  MRS
eqdfw  Equinix         DFW
eqiad  Equinix         IAD
eqord  Equinix         ORD
eqsin  Equinix         SIN
esams  EvoSwitch       AMS
knams  Kennisnet       AMS
ulsfo  United Layer    SFO
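As a rough sketch of this scheme, the helpers below compose and decompose a site name. The function names are hypothetical. The two-letter length of the vendor prefix is an observation from the table above, not a stated rule, and the initials themselves have to be supplied by hand because they are not derived mechanically from the vendor name (compare "dr" for Digital Realty with "eq" for Equinix).

```python
from typing import Tuple


def site_code(vendor_initials: str, iata: str) -> str:
    """Join vendor initials and an IATA airport code into a site name."""
    # Assumption: two-letter initials, as in every row of the table above.
    if len(vendor_initials) != 2 or not vendor_initials.isalpha():
        raise ValueError(f"expected two-letter vendor initials: {vendor_initials!r}")
    if len(iata) != 3 or not iata.isalpha():
        raise ValueError(f"expected a three-letter IATA code: {iata!r}")
    return (vendor_initials + iata).lower()


def split_site_code(site: str) -> Tuple[str, str]:
    """Split a site name into (vendor initials, IATA code)."""
    if len(site) != 5 or not site.isalpha():
        raise ValueError(f"not a five-letter site name: {site!r}")
    return site[:2], site[2:].upper()


print(site_code("co", "DFW"))    # codfw
print(split_site_code("drmrs"))  # ('dr', 'MRS')
```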

Networking

Naming for network equipment is based on role and location.

This also applies to: power distribution units, serial console servers, and other networking infrastructure.

Name prefix  Role                             Example
asw          access switch                    asw-a1-eqiad
cr           core router                      cr1-eqiad
mr           management router                mr1-eqiad
lsw          leaf switch                      lsw1-e1-eqiad
ssw          spine switch                     ssw1-e1-eqiad
msw          management switch                msw1-eqiad & msw-b2-eqiad
pfw          payments firewall                pfw1-eqiad
ps1 / ps2    power strips/distribution units  ps1-b3-eqiad
scs          serial console server            scs-a8-eqiad
fasw         Fundraising access switch        fasw-c-codfw
cloudsw      Cloud L3 switches                cloudsw1-c8-eqiad
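The examples in the table suggest a common shape of role, optional unit number, optional rack, and site. The parser below is a minimal sketch built on that inferred pattern. The regular expression and the `parse_device_name` helper are assumptions for illustration, not an official grammar, and real device names may include forms this pattern does not cover.

```python
import re

# Assumed pattern, inferred from the examples above:
#   <role>[unit number][-<rack>]-<site>
_DEVICE_RE = re.compile(
    r"(?P<role>[a-z]+)(?P<unit>\d+)?(?:-(?P<rack>[a-z]\d*))?-(?P<site>[a-z]+)"
)


def parse_device_name(name: str) -> dict:
    """Split a network device name into role, unit, rack, and site."""
    match = _DEVICE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised device name: {name!r}")
    # Groups that are absent from the name come back as None.
    return match.groupdict()


for example in ("asw-a1-eqiad", "cr1-eqiad", "lsw1-e1-eqiad", "fasw-c-codfw"):
    print(example, parse_device_name(example))
```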

OpenStack deployments

[Datacenter Site][numeric identifier](optional dev suffix to indicate non-external non-customer facing deployments) - [r (if region)][letter for AZ]

  • Current Eqiad/Codfw deployments will not fully meet these standards until rebuilt: [eqiad0 (deployment), eqiad (region), nova (AZ)]
Deployment  Region       Availability Zone
eqiad0      eqiad0-r     eqiad0-rb
eqiad1      eqiad1-r     eqiad1-rb
codfw0dev   codfw0dev-r  codfw0dev-rb
codfw1dev   codfw1dev-r  codfw1dev-rb
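The way the three names build on each other can be sketched as follows. `openstack_names` is a hypothetical helper written only to mirror the pattern and the table rows above.

```python
def openstack_names(site: str, number: int, az_letter: str, dev: bool = False) -> dict:
    """Derive deployment, region, and availability zone names."""
    # Deployment: site, numeric identifier, and optional "dev" suffix.
    deployment = f"{site}{number}{'dev' if dev else ''}"
    # Region: the deployment name plus "-r".
    region = f"{deployment}-r"
    # Availability zone: the region name plus a single letter.
    zone = f"{region}{az_letter}"
    return {"deployment": deployment, "region": region, "availability_zone": zone}


print(openstack_names("codfw", 1, "b", dev=True))
print(openstack_names("eqiad", 1, "b"))
```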

Disks

  • Arrays must use the Storage array device role in Netbox.
  • Naming follows one of two conventions:
      • Array attached to a single host:
          • hostname_of_host_system-arrayN
          • Example: ms2001-array1, ms2001-array2
          • All arrays get a number, even if there is only a single array.
          • Example: dataset1001-array1
      • Array attached to multiple hosts:
          • Labs uses this for labstore, where each shelf connects to two different hosts, so the older single-host naming scheme fails.
          • servicehostgroup-arrayN-site
          • Example: labstore-array1-codfw, labstore-array2-codfw
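A minimal sketch of the two array naming conventions, using a hypothetical `array_name` helper. It assumes array numbering starts at 1, which matches the examples but is not stated explicitly.

```python
from typing import Optional


def array_name(owner: str, index: int, site: Optional[str] = None) -> str:
    """Name a storage array.

    `owner` is the hostname for a single-host array, or the service host
    group for an array shared by several hosts. `site` is given only in
    the multi-host case.
    """
    if index < 1:
        # Assumption: numbering starts at 1, as in the examples.
        raise ValueError("array numbers start at 1")
    base = f"{owner}-array{index}"
    return base if site is None else f"{base}-{site}"


print(array_name("ms2001", 1))                  # ms2001-array1
print(array_name("labstore", 2, site="codfw"))  # labstore-array2-codfw
```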

Kubernetes

Any cluster that is not the main wikikube cluster should follow these conventions:

  • Cluster name: <identifier>-k8s (ex: dse-k8s, aux-k8s)
  • Control plane service name: <identifier>-k8s-ctrl
  • Ingress service name: <identifier>-k8s-ingress [-ro|-rw] for active/active or active/passive
  • Hostnames for control plane: <identifier>-k8s-ctrlXXXX.$site.wmnet
  • Hostnames for kubelets: <identifier>-k8s-workerXXXX.$site.wmnet
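These conventions can be expressed as a pair of small functions. The helper names are hypothetical. Reading XXXX as a four-digit host number, and treating -ro / -rw as an optional suffix appended to the ingress service name, are assumptions based on the list above.

```python
from typing import Optional


def k8s_service_names(identifier: str, ingress_mode: Optional[str] = None) -> dict:
    """Build the cluster, control plane, and ingress service names."""
    if ingress_mode not in (None, "ro", "rw"):
        raise ValueError(f"ingress mode must be 'ro', 'rw', or None: {ingress_mode!r}")
    cluster = f"{identifier}-k8s"
    ingress = f"{cluster}-ingress"
    if ingress_mode is not None:
        ingress = f"{ingress}-{ingress_mode}"
    return {
        "cluster": cluster,
        "control_plane_service": f"{cluster}-ctrl",
        "ingress_service": ingress,
    }


def k8s_hostname(identifier: str, role: str, number: int, site: str) -> str:
    """Build the hostname of a control plane ('ctrl') or kubelet ('worker') host."""
    if role not in ("ctrl", "worker"):
        raise ValueError(f"role must be 'ctrl' or 'worker': {role!r}")
    return f"{identifier}-k8s-{role}{number:04d}.{site}.wmnet"


print(k8s_service_names("dse"))
print(k8s_hostname("dse", "worker", 1001, "eqiad"))
```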

Servers

Any system that runs in a dedicated services cluster with other machines is named after its role or service task. As a rule, we name hosts after the service, not just the software package. Servers within a group are numbered according to the datacenter they are located in.

Data center                     Numeral range  Example
pmtpa / sdtpa (decommissioned)  1-999          cp7
eqiad                           1000-1999      db1001
codfw                           2000-2999      mw2187
esams / knams                   3000-3999      cp3031
ulsfo                           4000-4999      bast4001
eqsin                           5000-5999      dns5001
drmrs                           6000-6999      cp6011

When adding a new datacenter, make sure to update operations/puppet.git's /typos file which checks hostnames.
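The numeral ranges in the table above make it possible to work out a clustered host's datacenter from its name alone. The lookup below is a minimal sketch of that idea. The helper names and the `NUMERAL_RANGES` list are hypothetical and simply copy the table; this is not the check implemented in operations/puppet.git.

```python
import re
from typing import Tuple

# (first number, last number, datacenter), copied from the table above.
NUMERAL_RANGES = [
    (1, 999, "pmtpa / sdtpa"),
    (1000, 1999, "eqiad"),
    (2000, 2999, "codfw"),
    (3000, 3999, "esams / knams"),
    (4000, 4999, "ulsfo"),
    (5000, 5999, "eqsin"),
    (6000, 6999, "drmrs"),
]


def split_hostname(hostname: str) -> Tuple[str, int]:
    """Split a clustered hostname into (prefix, number)."""
    # The prefix may itself contain digits (e.g. "dse-k8s-worker"), so only
    # the trailing run of digits is treated as the host number.
    match = re.fullmatch(r"(?P<prefix>.*\D)(?P<number>\d+)", hostname)
    if match is None:
        raise ValueError(f"not a clustered hostname: {hostname!r}")
    return match.group("prefix"), int(match.group("number"))


def datacenter_for(hostname: str) -> str:
    """Return the datacenter implied by the hostname's number."""
    _, number = split_hostname(hostname)
    for low, high, datacenter in NUMERAL_RANGES:
        if low <= number <= high:
            return datacenter
    raise ValueError(f"{hostname!r} is outside every known numeral range")


for host in ("db1001", "mw2187", "cp3031", "dse-k8s-worker1001"):
    print(host, "->", datacenter_for(host))
```

Miscellaneous-named hosts such as helium carry no number, so `split_hostname` rejects them rather than guessing.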

Name prefix Description Status Points of contact
acmechief ACME certificate manager In use Traffic
acmechief-test ACME certificate manager staging environment In use Traffic
alert Alerting host (Icinga / Alertmanager) In use Observability
amssq esams caching server No longer used (deprecated)
amslvs esams LVS No longer used (deprecated)
analytics analytics nodes (Hadoop, Hive, Impala, and various other things) Being replaced by an-worker Data Engineering SREs
analytics-master analytics master nodes Being replaced by an-master Data Engineering SREs
analytics-tool virtual machines in production (Ganeti) running analytics tools/websites Being replaced by an-tool Data Engineering SREs
an-coord analytics coordination node In use Data Engineering SREs
an-db analytics postgresql database cluster In use Data Engineering SREs
an-master analytics master node In use, replacing analytics-master Data Engineering SREs
an-mariadb analytics-meta mariadb databases In use Data Engineering SREs
an-tool analytics tools node In use Data Engineering SREs
an-test-(coord/master/worker) analytics hadoop test cluster nodes In use Data Engineering SREs
an-worker analytics worker node In use, replacing analyticsNNNN Data Engineering SREs
an-scheduler analytics job scheduler node In use Data Engineering SREs
an-airflow analytics job scheduler node dedicated to the Discovery team In use Data Engineering SREs
aphlict notification server for Phabricator In use Service Operations
apt Advanced Package Tool Repository (Debian APT repo) In use Infrastructure Foundations
aqs Analytics Query Service In use Data Engineering SREs
archiva Archiva Artifact Repository In use Data Engineering SREs
auth Authentication server In use Infrastructure Foundations
authdns Authoritative DNS (gdnsd) In use Traffic
backup Backup hosts In use Data Persistence
backupmon Backup monitoring hosts In use Data Persistence
bast bastion host In use Infrastructure Foundations
censorship Censorship monitoring databases and scripts No longer used (deprecated)
centrallog Centralized syslog In use Observability
cephosd Ceph servers for use with Data Engineering and similar storage requirements In use Data Engineering SREs
certcentral Central certificates service No longer used (deprecated)
chartmuseum Helm Chart repository ChartMuseum In use Service Operations
cloud*-dev Any cloud role + '-dev' = internal deployment (PoC, Staging, etc) In use WMCS
cloudbackup Backup storage system for WMCS In use WMCS
cloudcephmon Ceph monitor and manager daemon for WMCS In use WMCS
cloudcephosd Ceph object storage data nodes for WMCS In use WMCS
cloudceph Converged Ceph object storage and monitor nodes for WMCS (only used for testing) No longer used
cloudcontrol OpenStack deployment controller for WMCS In use WMCS
clouddb Wiki replica servers for WMCS In use WMCS, with support from DBAs
cloudelastic Replication of ElasticSearch for WMCS In use WMCS
cloudgw Cloud gateway server for WMCS In use WMCS
cloudmetrics Monitoring server for WMCS In use WMCS
cloudnet Network gateway for tenants of WMCS (Neutron l3) In use WMCS
cloudservices Misc OpenStack components (Designate) for WMCS In use WMCS
cloudstore Storage system for WMCS In use WMCS
cloudvirt OpenStack Hypervisor (libvirtd + KVM) for WMCS In use WMCS
cloudvirtan OpenStack Hypervisor (libvirtd + KVM) for WMCS (dedicated to Analytics) No longer used
cloudvirt-wdqs OpenStack Hypervisor (libvirtd + KVM) for WMCS (dedicated to WDQS) WMCS
cloudweb WMCS management websites (wikitech, horizon, striker) In use WMCS
conf Configuration system host (etcd, zookeeper...) In use Service Operations
config-master host running the config-master site In use Infrastructure Foundations
contint Continuous Integration In use Service Operations
cp Cache proxy (Varnish) In use Traffic
cumin Cluster management (cumin/spicerack/debdeploy/etc...) In use Infrastructure Foundations
datahubsearch DataHub OpenSearch Cluster - used for Data Catalog MVP In use Data Engineering SREs
dataset dataset dumps storage No longer used (deprecated)
db Database host In use Data Persistence
dbmonitor Database monitoring In use Data Persistence
dborch Database orchestration (MySQL Orchestrator) In use Data Persistence
dbprov Database backup generation and data provisioning In use Data Persistence
dbproxy Database proxy In use Data Persistence
dbstore Database analytics In use Data Engineering SREs & Data Persistence
debmonitor Debian packages monitoring In use Infrastructure Foundations
deploy Deployment hosts In use Service Operations
dns DNS recursors In use Infrastructure Foundations
doc Documentation server (CI) In use Service Operations (Supportive Services) & Release Engineering
doh Wikidough Anycasted In use Traffic
an-druid Druid Cluster (Analytics). Due to naming legacy, druid100[1-3] are also in this cluster. In use Data Engineering SREs
druid Druid Cluster (Public) In use Data Engineering SREs
dse-k8s-etcd etcd server for the kubernetes cluster of Data Science and Engineering In use Data Engineering SREs
dse-k8s-ctrl control plane server for the kubernetes cluster of Data Science and Engineering In use Data Engineering SREs
dse-k8s-worker worker node for the kubernetes cluster of Data Science and Engineering In use Data Engineering SREs
dumpsdata dataset generation fileset serving to snapshot hosts In use Platform Engineering
durum Check service for Wikidough In use Traffic
elastic elasticsearch servers In use Search Platform SREs
es Database host for MediaWiki external storage (wiki content, compressed) In use Data Persistence
etcd Etcd server In use Service Operations
etherpad Etherpad server In use Service Operations
eventlog EventLogging host In use Data Engineering SREs
flink-zk Dedicated zookeeper cluster for Flink In use (testing) Data Platform SREs
flowspec Network controller In use (testing) Infrastructure Foundations
fr* Fundraising servers, e.g. frdb, frlog, frpm (puppetmaster) In use fr-tech SREs
ganeti Ganeti Virtualization Cluster In use Infrastructure Foundations
ganeti-test Ganeti Virtualization Cluster (test setup) In use Infrastructure Foundations
gerrit Gerrit code review (gerrit1001 in eqiad is currently used) In use (deprecated) Service Operations & Release Engineering
gitlab Gitlab servers In use (phab:T274459) Service Operations
grafana Grafana server In use Observability
graphite Graphite server In use Observability
icinga Icinga servers In use Observability
idp Identity provider (Apereo CAS) In use Infrastructure Foundations
install Installation server In use Infrastructure Foundations
kafka Kafka brokers No longer used Data Engineering SREs & Infrastructure Foundations
kafka-main Kafka brokers In use Data Engineering SREs & Infrastructure Foundations
kafka-jumbo Large general purpose Kafka cluster In use Data Engineering SREs & Infrastructure Foundations
kafka-logging Logging/o11y Kafka cluster In use Observability
kafkamon Kafka monitoring (VMs) In use Data Engineering SREs & Infrastructure Foundations
karapace DataHub Schema Registry server (standalone) - Used for the Data Catalog MVP In use Data Engineering SREs
knsq knams squid No longer used (deprecated)
krb Kerberos KDC/Kadmin In use Infrastructure Foundations & Data Engineering SREs
kubernetes Kubernetes cluster (k8s) In use Service Operations
kubestage Kubernetes staging cluster In use Service Operations
kubestagetcd Etcd cluster for the Kubernetes staging cluster In use Service Operations
kubetcd Etcd cluster for the Kubernetes cluster In use Service Operations
lab labs virtual node No longer used (deprecated)
labcontrol Controller node for WMCS (aka "labs") No longer used (deprecated)
labnet Networking host for WMCS No longer used (deprecated)
labnodepool Dedicated WMCS host for Nodepool (CI) No longer used (deprecated)
labpuppetmaster Puppetmasters for WMCS No longer used (deprecated)
labsdb Replication of production databases for WMCS No longer used (deprecated)
labservices Services for WMCS No longer used (deprecated)
labstore Disk storage for WMCS In use (deprecated) WMCS
labtest* Test hosts for WMCS No longer used (deprecated)
labvirt Virtualization node for WMCS No longer used (deprecated)
labweb Management websites for WMCS No longer used (deprecated)
lists Mailing lists running Mailman In use Legoktm and Ladsgroup
logging-hd Logging Cluster - OpenSearch data node (hdd class) Planned Observability
logging-sd Logging Cluster - OpenSearch data node (ssd class) Planned Observability
logging-fe Logging Cluster - OpenSearch/OpenSearch-Dashboards/Logstash node Planned Observability
logstash opensearch/logstash/opensearch-dashboards node In use Observability
lvs LVS load balancer In use Traffic
maps Maps cluster In use Content Transform Team and hnowlan
maps-test maps test cluster No longer used (deprecated)
mc memcached server for mediawiki In use Service Operations
mc-gp memcached gutter pool server for mediawiki In use Service Operations
mc-wf memcached servers for wikifunctions In use Service Operations
ml-staging Machine learning staging env etcd and control plane machines In use ML team
ml-serve Machine learning serving cluster (ml-serve-ctrl* are VMs for k8s control plane) In use ML team
ml-cache Machine learning caching nodes In use ML team
mirror public mirror, e.g. Debian mirror, Ubuntu mirror In use Infrastructure Foundations
miscweb miscellaneous web server In use Service Operations
ms media storage No longer used (deprecated) Data Persistence (Media Storage)
ms-backup media storage backup generation (workers) In use Data Persistence (Media Storage)
ms-be media storage backend In use Data Persistence (Media Storage)
ms-fe media storage frontend In use Data Persistence (Media Storage)
mw MediaWiki application server (MediaWiki PHP webservers, api, jobrunners, videoscalers) In use Service Operations
mwdebug MediaWiki application server for debugging and deployment staging (Ganeti VMs) In use Service Operations
mwlog MediaWiki logging host In use Service Operations
mwmaint MediaWiki maintenance host (formerly "terbium") In use Service Operations
mx Mail relays In use Infrastructure Foundations
nas NAS boxes (NetApp) Unused
netflow Network visibility In use Infrastructure Foundations
netmon Network monitor (librenms, rancid, etc) In use Infrastructure Foundations
netbox Netbox front-end instances In use Infrastructure Foundations
netbox-dev Netbox test instances In use Infrastructure Foundations
netboxdb Netbox back-end database instances In use Infrastructure Foundations
notebook Jupyterhub experimental server Unused
nfs NFS server Unused
peek Security Team workflow and project management tooling In use Security Team
ocg offline content generator (PDF) No longer used (deprecated)
ores ORES cluster In use Machine Learning SREs
orespoolcounter ORES PoolCounter In use Machine Learning SREs
oresrdb ORES Redis systems No longer used (deprecated)
pc Parser cache database In use SRE Data Persistence (DBAs), with support from Platform and Performance
pdf PDF Collections No longer used (deprecated)
people peopleweb (people.wikimedia.org) In use Service Operations & Infrastructure Foundations
parse parsoid Soon in use Service Operations
phab Phabricator host (currently iridium is eqiad phab host) In use Service Operations
ping Ping offload server In use Infrastructure Foundations
planet Planet server In use (mistake) Service Operations
pki PKI Server (CFSSL) In use Infrastructure Foundations
pki-root PKI Root CA Server (CFSSL) In use Infrastructure Foundations
poolcounter PoolCounter cluster In use Service Operations
prometheus Prometheus cluster In use Observability
proton Proton cluster No longer used (deprecated)
puppetboard PuppetDB Web UI In use Service Operations
puppetdb PuppetDB cluster In use Service Operations
puppetmaster Puppet masters In use Infrastructure Foundations
puppetserver Puppet Servers In use Infrastructure Foundations
pybal-test PyBal testing and development In use Traffic
rbf Redis Bloom Filter server Unused
rcs Obsolete:RCStream server (recent changes stream) No longer used (deprecated)
rdb Redis server In use Service Operations
registry Docker registries In use Service Operations
releases Software Releases In use Service Operations
relforge Discovery's Relevance Forge (see discovery/relevanceForge.git, T131184) In use Search Platform SREs
restbase RESTBase server In use Service Operations
rpki RPKI#Validation In use Infrastructure Foundations
sca Service Cluster A - Includes various services No longer used (deprecated)
scb Service Cluster B - Includes various services. It's effectively the next generation of the sca cluster above No longer used (deprecated)
schema Event Schemas HTTP server In use Data Engineering SREs & Service Operations
search-loader Analytics to Elastic Search model data loader In use Search Platform SREs
sessionstore Cassandra cluster for sessionstore In use Data Persistence
snapshot Data dump processing node In use Platform Engineering
sq squid server No longer used (deprecated)
srv apache server No longer used (deprecated)
stat statistics computation hosts (see Analytics/Data access) In use Data Engineering SREs
storage storage host No longer used (deprecated)
stewards special hosts for wiki stewards (see T344164) In use SRE collaboration services
testreduce parsoid visual diff testing In use Service Operations
thanos-be Prometheus long term storage backend In use Observability
thanos-fe Prometheus long term storage frontend In use Observability
thumbor Thumbor In use Service Operations (& Performance)
tmh MediaWiki videoscaler (TimedMediaHandler). See T105009 and T115950. No longer used (deprecated)
torrelay Tor relay No longer used (deprecated)
urldownloader url-downloader In use (added in T224551) Service Operations
virt labs virtualization nodes No longer used (deprecated)
wcqs wikicommons query service In use Search Platform SREs
wdqs wikidata query service In use Search Platform SREs
webperf webperf metrics (performance team). See T179036. In use Performance & Service Operations
wtp wiki-text processor node (parsoid) In use Service Operations
xhgui A graphical interface for PHP debug profiles. See Performance/Runbook/XHGui service. In use Performance & Service Operations
dragonfly-supernode Supernode for Dragonfly P2P network (distributing docker images) (T286054) In use Service Operations

Miscellaneous servers

Historically, we used per-datacenter naming schemes for any one-off or single host. This included any software that wasn't load balanced across multiple machines, or general task machines that could cluster (to an extent) but required work by ops engineers to do so.

Instead of being named for their purpose, these hosts were named according to a naming convention for their datacenter:

  • Hosts in eqiad were named for chemical elements, in order of increasing atomic number.
  • Hosts in codfw were named for stars. Stars in the Orion constellation were reserved for fundraising (Alnilam, Alnitak, Bellatrix, Betelgeuse, Heka, Meissa, Mintaka, Nair Al Saif, Rigel, Saiph, Tabit, Thabit).
  • Hosts in esams or knams were named for notable Dutch people.

These naming schemes are deprecated in favour of the specialized cluster names above. Even if you're certain that the foobar service will only ever use a single host, you should name that host "foobar1001" (or 2001, 3001, etc. as appropriate to the datacenter).

One-off names were easy to come up with, especially for machines that did more than one kind of thing and were hard to describe with a single name, but they were also opaque. Engineers had to know that the eqiad MediaWiki maintenance host was "terbium" and the codfw package-build host was "deneb". Naming these machines "mwmaint1001" and "build2001" is easier for sleepy oncallers to remember in an emergency, and friendlier to new hires who have to learn all the names at once.

Some older hosts in production still use these naming schemes, but new hosts should not use them.