Event Platform/Stream Processing/Flink/FailureScenarios
Known/Observed Flink Failure Scenarios
Kubernetes Operator
The Flink Kubernetes Operator runs as an HA pair. We have observed a scenario in which the active master loses sync with its resources. When this happens, API calls that update resources (that is, writes) are either ignored or hang indefinitely.
Workaround
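Deleting the Flink Operator's active master container forces a failover, which fixes the issue. To find the active master, inspect the operator's lease:
kubectl -n flink-operator get lease flink-operator-lease -o yaml
The spec.holderIdentity field in the output names the current lease holder.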
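The workaround can be scripted roughly as follows. This is a minimal sketch, not a tested runbook step: it assumes the lease's spec.holderIdentity value is the name of the leader pod (worth confirming against the `-o yaml` output first), and the function names `find_operator_leader` and `failover_operator` are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: the lease's holderIdentity equals the leader's pod name.

NAMESPACE="${NAMESPACE:-flink-operator}"
LEASE="${LEASE:-flink-operator-lease}"

# Print the identity of the current lease holder (the active master).
find_operator_leader() {
  kubectl -n "$NAMESPACE" get lease "$LEASE" \
    -o jsonpath='{.spec.holderIdentity}'
}

# Delete the leader pod so that the standby can take over the lease.
failover_operator() {
  local leader
  leader="$(find_operator_leader)" || return 1
  if [ -z "$leader" ]; then
    echo "no holderIdentity found on lease $LEASE" >&2
    return 1
  fi
  echo "deleting leader pod: $leader"
  kubectl -n "$NAMESPACE" delete pod "$leader"
}
```

After sourcing the file, run `find_operator_leader` on its own to check which pod would be deleted, then `failover_operator` to trigger the failover. Re-running `find_operator_leader` afterwards should show whether the lease has moved to the other replica.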