Analytics/Systems/Cluster/Workflow management tools study
We Analytics team identified the need of updating our main scheduler/task manager system: Oozie. Oozie is quite simple and robust, and has worked OK so far. However it is and old software that is not improving any more, has a diminishing community and sparse documentation; it has significant limitations like the lack of dynamic execution flows (i.e. looping all files in a directory) that we need for refining dynamic databases; also, the developer needs to write lots of boilerplate XML code for even a simple job; we Analytics are not satisfied with its UI (Hue) neither.
In parallel to that, we observed that we are maintaining other scheduling (or data presence sensors) systems, like: Refine, reportupdater and systemd timers. Refine implements the dynamic data presence sensors that Oozie does not support. Reportupdater tries to reduce all the boilerplate code from Oozie and its learning curve, so that users that are not super technical can still create jobs easily. And in some cases, for simplicity, we're using systemd-timers for jobs (deletion scripts) that could be better handled by a higher level scheduling system that can i.e. be monitored in a UI.
We want to find a new scheduling (and data presence sensor) system, that is able to replace all our current Oozie jobs, and if possible, unify all Refine jobs, all reportupdater jobs, and all systemd-timers that would make sense to be handled by a higher level system. We'd like a more modern software with a growing community and good docs, that can be used by different types of users (more or less technical). The new system should ideally match Oozie in robustness and ease of maintenance. With such a scheduler, we would reduce the maintenance of our jobs, improve the control over them and make it easier for collaborators to create their own jobs. Also, a more modern system would give our job pipeline a longer life and more state-of-the-art capabilities (i.e. kubernetes support).
These are the candidates that we evaluated. Please, add more as we do more studies.
Apache Airflow is an open-source workflow management platform created by AirBnB, that became an Apache Software Foundation project in January 2019. It is written in Python, and workflows are created via Python scripts (configuration as code). Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. DAGs can be run either on a defined schedule (e.g. hourly or daily) or based on external event triggers (e.g. a file appearing in Hive). [summarized from https://en.wikipedia.org/wiki/Apache_Airflow]
We did a POC to test Airflow's capability of implementing dynamic workflows, and to get a grasp of how is it to develop and manage jobs with Airflow. The descriptions below and comparison chart reflect what we learned during the POC. You can find the code and read more technical details at https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/597623/.
In Airflow all jobs are written in Python, a programming language, as opposed of a markup language like Oozie's XML. This gives Airflow lots of power and flexibility. An example of that are dynamic workflows, i.e. one can iterate over the files in a directory at DAG definition time. Also, Python is an easy language to learn, and the majority of data analysts and researchers are already familiar with it. And finally, Python is easier to abstract and compartmentalize, thus making DAG code way shorter than i.e. Oozie's XML.
Support and extensibility
Airflow is probably the workflow management solution out there with most extensive out of the box support for related tools and technology, i.e.: Hadoop, Hive, Presto, Spark, Cassandra, kubernetes, Druid, Sqoop, Docker, Slack, Jenkins, etc. There are many operators and sensors within the standard Airflow install (i.e. HiveToDruid), plus there are many other connectors developed by the community. Also, you can build your own custom plugins. For the POC, we built a Refine plugin, and it was pretty straight forward.
Airflow has 3 main components: the meta-store (JDBC database), where it stores information and state about DAGs and tasks; the scheduler, a process that periodically reads DAGs and schedules tasks; and the web server, which serves the Airflow UI and also controls DAG and task events that come from it. Additionally, you can configure Airflow to use Celery as a task execution engine (4th component). Some Airflow community members say these several-moving-pieces system can be more costly to set up and monitor, especially if you're using Celery.
DAGs: interpretation in 2 steps
Something to take into account is that DAG coding has its gotchas; like understanding the interpretation of the code in 2 steps: DAG definition time and task execution time. A DAG is interpreted every N seconds by Airflow, so that dynamic DAGs are updated accordingly; or changed to DAG code are also updated. But no tasks are run at that moment. When a task within a DAG fulfills its execution conditions and is triggered, then some parts of the DAG spec code are used to execute that task. Note that both DAG and task spec belong to the same code file. So, there's still a learning curve for people that have never written an Airflow DAG.
Documentation and community
Apache Airflow has a decent documentation. It is not super detailed though (2020-06-23), so sometimes you can find yourself looking for several sources to confirm what you want to do, or even browsing Airflow code to be sure what a parameter does. Airflow seems to have the biggest community within open-source workflow management software options. There are lots of questions answered online and case studies in blogs, etc. During the POC we found that some of our questions were only partially answered. Maybe it's still a young Apache project, and it will improve in the near future.
Airflow's UI has a lot of diagrams, icons and buttons, and it can be intimidating at first. But it gives you decent control over your config, DAGs and tasks. You can do things like: switching DAGs on/off, killing/re-running/clearing/marking-success-of tasks or task sequences, look at (driver) logs, getting the state of each task in a live execution diagram, checking the rendered final parameters of the task, etc. The main improvement vs Hue, is probably usability, visual aids and the detail of information that is displayed about DAGs and tasks (the actual things you can do are mostly equivalent). Finally, big con of the Airflow UI: it does not auto-refresh, and it's a bit clunky when manually refreshing (i.e. takes a bit until a running task displays as running in the UI).
Airflow comes with Kerberos support out of the box. But so far it seems it only support one user. So if we wanted to use Airflow with both the analytics user and the hdfs user (or the product-analytics user), that could potentially be a problem. Or if you want to test a new job with your own user to make sure you don't overwrite production files, then that would also be an issue (unless we setup a separate Airflow testing instance). This limitation is significant and should be fixed in case we choose to use Airflow.
PyArrow concurrency issues
During the POC we've had problems with dynamic DAGs using PyArrow to access HDFS. Some tasks would get stuck at running state right after calling an HDFS ls operation using PyArrow library. It seems there's a mutex deadlock when combining parallel tasks using a PyArrow HDFS client. After lots of troubleshooting we decided to move on and write this issue here as a note to be considered. There are still ways to workaround this issue, like trying to use the Celery executor, or using the command line to query HDFS (poor). But definitely something else to solve if we decide to use Airflow.
Candidate comparison vs Oozie
It seems to me that Airflow will be able to do all we need it to do (Refine, regular schedules, data presence jobs, better maintainability, better UI, unify all our workflows, etc.). And Airflow will probably be maintained for a long time, and have a nice community. But it also seems that until we get there, we'll have to solve a set of non-trivial issues (concurrency, Kerberos, proper setup, and potentially more...), in addition to the actual migration of all jobs and Airflow's learning curve. It is a bit discouraging, and I think we should consider other options before we decide. If other options aren't better, then let's use Airflow. Mforns (talk) 13:40, 23 June 2020 (UTC)