Data Platform/Systems/Coordinator
The coordinator hosts
The Analytics team relies on two multi-purpose nodes, an-coord1003 and an-coord1004, to run critical services like Hive and Presto.
In T257412 we started a major effort to reduce single points of failure (SPOFs) in the Analytics infrastructure, since there was only one coordinator host (an-coord1001) and its hostname was hardcoded almost everywhere (puppet, refinery, etc.). This made the failover procedure complex and slow to execute, so the Analytics team set out to simplify it. The final goal is a quick and easy failover procedure (on the order of minutes) for all the services, but reaching it will take time. This document tracks the status of the failover procedures and gives an overview of how things run on the coordinator nodes.
Services
This is the list of services broken down by node (as of January 2024):
an-coord1003
Hive Metastore (port 9083)
Hive Server 2 (port 10000)
Presto coordinator
an-coord1004
Hive Metastore (port 9083)
Hive Server 2 (port 10000)
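As a rough illustration of the layout above, the following sketch probes the listed Hive ports on each coordinator. It is not part of any official tooling: the `eqiad.wmnet` domain suffix for the coordinator hostnames is an assumption borrowed from the Hive alias used elsewhere on this page, the function names are hypothetical, and Presto is left out because this page does not state its port.

```python
import socket

# Service layout copied from the list above (as of January 2024).
# Presto is omitted because this page does not state its port.
SERVICES = {
    "an-coord1003": {"hive-metastore": 9083, "hive-server2": 10000},
    "an-coord1004": {"hive-metastore": 9083, "hive-server2": 10000},
}

# Assumption: the coordinators live under the same domain as the Hive alias.
DOMAIN = "eqiad.wmnet"


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(services=SERVICES, domain=DOMAIN, check=tcp_open):
    """Return a dict mapping (node, service name) to True/False."""
    return {
        (node, name): check(f"{node}.{domain}", port)
        for node, ports in services.items()
        for name, port in ports.items()
    }


if __name__ == "__main__":
    for (node, name), ok in sorted(probe().items()):
        print(f"{node} {name}: {'open' if ok else 'unreachable'}")
```

An open port only shows that something is listening; it does not prove that the metastore or Hive Server 2 is healthy, so treat this as a first-pass check.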
Hive
The Hive servers (port 10000) on each coordinator use their local metastore (port 9083) when needed (this is standard for Hive). The metastores all fetch and store Hive sessions/tokens in the same MariaDB database, in this case an-mariadb1001. See Data Engineering/Systems/Analytics Meta for more information on that database and how we are working to achieve high availability for it.
analytics-hive.eqiad.wmnet is a DNS CNAME that points to only one of the coordinators, and it is configured in all the Hive clients. This means that only one coordinator is active at any given time, while the other one is in a "standby" state.
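To see which coordinator is currently active, one option is to resolve the alias and look at the canonical name that comes back. The sketch below assumes the system resolver reports the CNAME target as the primary hostname, which is typical but depends on the resolver configuration; the function names are hypothetical.

```python
import socket

HIVE_ALIAS = "analytics-hive.eqiad.wmnet"


def active_coordinator(alias=HIVE_ALIAS, resolve=socket.gethostbyname_ex):
    """Return the canonical hostname that the alias currently resolves to."""
    # gethostbyname_ex returns (primary hostname, alias list, address list).
    canonical, _aliases, _addresses = resolve(alias)
    return canonical.rstrip(".").lower()


def is_active(host, alias=HIVE_ALIAS, resolve=socket.gethostbyname_ex):
    """Return True if `host` is the coordinator behind the alias."""
    return active_coordinator(alias, resolve) == host.rstrip(".").lower()


if __name__ == "__main__":
    print(f"{HIVE_ALIAS} currently points to {active_coordinator()}")
```

The `resolve` parameter exists only so the lookup can be replaced in tests or on hosts without access to the internal DNS.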
How does it work from the Kerberos point of view? The keytab and principal that Hive uses are hive/analytics-hive.eqiad.wmnet@WIKIMEDIA on both coordinator nodes, so they share the same credentials and either node can answer the same queries.
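Because the principal is tied to the alias rather than to a node hostname, a client connection string never needs to name a specific coordinator. A minimal sketch, using the standard HiveServer2 JDBC URL form with a `principal` parameter; the `default` database and the helper names are assumptions for illustration:

```python
HIVE_ALIAS = "analytics-hive.eqiad.wmnet"
HIVE_PORT = 10000
REALM = "WIKIMEDIA"


def hive_principal(service_host=HIVE_ALIAS, realm=REALM):
    """Build the Kerberos service principal shared by both coordinators."""
    return f"hive/{service_host}@{realm}"


def hive_jdbc_url(database="default"):
    """Build a HiveServer2 JDBC URL that targets the alias, not a node."""
    return (
        f"jdbc:hive2://{HIVE_ALIAS}:{HIVE_PORT}/{database}"
        f";principal={hive_principal()}"
    )


if __name__ == "__main__":
    print(hive_jdbc_url())
```

Since neither the URL nor the principal mentions an-coord1003 or an-coord1004, the same string keeps working whichever node the alias points to.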
With a simple change of the analytics-hive.eqiad.wmnet CNAME in the DNS, it is possible to transparently change the coordinator that serves Hive traffic (already tested, it works) without changing any job config!