User:Razzi/Debugging eventlogging to druid network flows internal hourly.service


In IRC I saw this alert today:

PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers

SSHing into an-launcher1002 showed this error in the unit's logs:

Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 ERROR DataFrameToDruid: Druid ingestion task index_hadoop_network_flows_internal_lggaghgk_2022-02-18T22:00:35.639Z for network_flows_internal failed
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO HiveToDruid: Done.
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkContext: Invoking stop() from shutdown hook
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkUI: Stopped Spark web UI at http://an-launcher1002.eqiad.wmnet:4041
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Interrupting monitor thread
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Shutting down all executors
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: (serviceOption=None,
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]:  services=List(),
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]:  started=false)
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO MemoryStore: MemoryStore cleared
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO BlockManager: BlockManager stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO BlockManagerMaster: BlockManagerMaster stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkContext: Successfully stopped SparkContext
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Shutdown hook called
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-fcccd681-efb8-4816-9a57-f8a66dc0b7db
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-f21f647b-6de0-4473-94de-46767d4f8fc8
Feb 18 22:20:39 an-launcher1002 systemd[1]: eventlogging_to_druid_network_flows_internal_hourly.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 22:20:39 an-launcher1002 systemd[1]: eventlogging_to_druid_network_flows_internal_hourly.service: Failed with result 'exit-code'.
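
Most of the output above is routine Spark shutdown logging; the only line that explains the failure is the ERROR from DataFrameToDruid, which names the Druid ingestion task that failed. As a rough sketch (not part of the job itself), something like the following could pull the ERROR lines and the task ID out of saved log lines. The function name `find_errors` and both regular expressions are my own assumptions, based only on the single sample above, so the patterns may need adjusting for other messages.

```python
import re

# Spark log4j portion of a journal line: "yy/MM/dd HH:mm:ss LEVEL Logger: message".
# The journal prefix (date, host, unit[pid]) is skipped by using search().
SPARK_LINE = re.compile(
    r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} "
    r"(?P<level>[A-Z]+) (?P<logger>[^:\s]+): (?P<message>.*)$"
)

# Assumed message shape, taken from the one failure seen on this page.
TASK_FAILED = re.compile(
    r"Druid ingestion task (?P<task>\S+) for (?P<datasource>\S+) failed"
)


def find_errors(lines):
    """Return one dict per ERROR line, with the Druid task ID when present."""
    errors = []
    for line in lines:
        match = SPARK_LINE.search(line.rstrip("\n"))
        if not match or match.group("level") != "ERROR":
            continue  # skip INFO lines and non-Spark lines (e.g. systemd's own)
        entry = {
            "logger": match.group("logger"),
            "message": match.group("message"),
            "task": None,
            "datasource": None,
        }
        task = TASK_FAILED.search(entry["message"])
        if task:
            entry["task"] = task.group("task")
            entry["datasource"] = task.group("datasource")
        errors.append(entry)
    return errors
```

The lines can come from any iterable of strings, for example a file holding the saved log output. The extracted task ID is what to look up on the Druid side to see why the ingestion itself failed.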

Unfortunately, the unit has been repeatedly recovering and then failing again. Root cause TBD.