Analytics/Archive/Oozie/Administration

Oozie has now been deprecated and removed from our systems. The information below is retained only for historical purposes.

How to restart Oozie production jobs

Job Restart Checklist

If you restart an Oozie job, chances are that you've made some changes to the code.

Based on previous real-life experiences with broken jobs, please make sure that:

  • Your schema change(s) are applied to the hive table(s)
  • The various paths used in your .properties file have been updated from testing to production (this should be taken care of at code review time, but who knows)
  • The jar versions you use exist and are deployed

THANKS!

When the .properties file has changed

You need to kill the existing job and spawn a new one.

Here is the procedure to follow (so as not to forget any step ;)

  1. Finding the new start_time. One important thing before killing the old job is to check the start_time the new job should use.
    • If the job to restart is a coordinator, pick the time of the last finished run plus one coordinator frequency unit (each coordinator defines its own frequency unit; we usually use hour, day, or month, and seldom week).
    • If the job to restart is a bundle, apply the coordinator method to every coordinator the bundle runs, and pick the oldest time you find.
      This check is needed for two reasons. First, jobs take time to finish, so when you kill a job, the chances that you kill a currently running job are almost 100%, and you should rerun it. Second, our oozie jobs depend on data being present, and it is natural for jobs to wait for some time before their data is available (for the previous job to finish, for instance). Both in-flight and waiting jobs are present in a coordinator's job queue, and once the coordinator is killed, it is more difficult to know which of the jobs actually finished. For this reason, checking for finished jobs before killing the parent coordinator/bundle is best practice (see the status-check sketch after the submit command below).
  2. Kill the existing job
    • If you use Hue, click the kill button (left panel, Manage section at the bottom, red button).
    • If you prefer CLI:
      Find the job id (something like 0014015-161020124223818-oozie-oozi-C, for instance - notice that coordinator ids end with C while bundle ids end with B).
      Run oozie job -kill <job_id>
  3. Restart a new replacement job. While Hue provides a way to define/run oozie jobs, we do it with files and the CLI, and the two don't collaborate well. So you'll have to go with the CLI :)
    The two values that change in a production job oozie command in our environment are start_time and the path of the .properties file to run (this file actually defines which job will be started).
    The path of the .properties file to use is most probably under /srv/deployment/analytics/refinery/oozie. Also, notice there is no = sign between -config and the .properties file path.
# Submit the replacement job as the analytics user. refinery_directory is
# resolved to the most recently deployed refinery folder on HDFS.
sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL \
  -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/$(date +"%Y")* | tail -n 1 | awk '{print $NF}') \
  -Dqueue_name=production \
  -Doozie_launcher_queue_name=production \
  -Dstart_time=<YOUR_START_TIME> \
  -config <YOUR_PROPERTIES_PATH> \
  -run
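
For step 1 above, you can check a coordinator's recent runs from the CLI before killing it. A minimal sketch, assuming the standard oozie CLI options (-info, -len); the coordinator id below is only an example:

# List the most recent actions of a coordinator, with their statuses and
# nominal times, to find the last SUCCEEDED run before killing the job.
sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL \
  -info 0014015-161020124223818-oozie-oozi-C -len 30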


Gotchas when restarting webrequest_load bundle

Be careful with times: if you are restarting the webrequest_load bundle, you need to figure out the last hour that finished for all partitions. For example: small partitions are waiting on hour 20, but big ones have not finished 19 - so 19 is the right hour to rerun. See the example below:

sudo -u analytics kerberos-run-command analytics oozie job \
-Dstart_time=2016-12-19T00:00Z \
-Dqueue_name=production \
-Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/$(date +"%Y")* | tail -n 1 | awk '{print $NF}') \
-oozie $OOZIE_URL -run -config /srv/deployment/analytics/refinery/oozie/webrequest/load/bundle.properties
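
One hedged way to find that last fully-finished hour is to inspect the bundle's coordinators from the CLI (the ids below are placeholders):

# List the per-source coordinators inside the webrequest_load bundle.
sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -info <BUNDLE_ID>
# For each coordinator, show recent actions and their statuses.
sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -info <COORD_ID> -len 30

The earliest "last SUCCEEDED hour" across all coordinators is the start_time to use.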

If the .properties file has not changed - Generic restart

There is a script that allows restarting individual jobs or the full refinery job set without configuration changes, except for the new refinery folder and a new start date. This script is useful when critical parts of the cluster need to be restarted (oozie/hive server, namenode), and/or when jobs change without configuration changes (an HQL update, for instance).

You shouldn't use this script if your changes involve .properties files (configuration changes); the restart wouldn't take them into account.

Individual job restart

  1. Check that the changes to be applied look correct (dry-run mode by default): sudo -u analytics kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-oozie-rerun -v job JOB_ID
  2. Apply the changes (job killed, new job started): sudo -u analytics kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-oozie-rerun -n job JOB_ID

Refinery full restart

The refinery full restart is based on the job-naming convention and the oozie job paths:

  1. In the oozie folder, look recursively through directories for bundle.properties or coordinator.properties files (top-level jobs)
  2. For each file found, build a job name from the folder path and file name: replace / with -, bundle.properties becomes -bundle, and coordinator.properties becomes -coord. For instance, webrequest/load/bundle.properties leads to the webrequest-load-bundle job name, and aqs/hourly/coordinator.properties leads to the aqs-hourly-coord job name (see the sketch after this list).
  3. Reconcile this job list with the list of currently running oozie jobs (by name)
  4. Apply the individual job restart process to each of these jobs
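
As an illustration, here is a minimal shell sketch of that naming convention. It is not the script's actual implementation, just a way to reproduce the mapping:

# From the oozie folder, list the top-level job files and derive job names:
# strip the leading ./, turn the file name into a -bundle/-coord suffix,
# then replace the remaining / with -.
cd /srv/deployment/analytics/refinery/oozie
find . -name bundle.properties -o -name coordinator.properties \
  | sed -e 's|^\./||' \
        -e 's|/bundle\.properties$|-bundle|' \
        -e 's|/coordinator\.properties$|-coord|' \
        -e 's|/|-|g'
# e.g. webrequest/load/bundle.properties -> webrequest-load-bundle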

To make it happen:

  1. Check that the changes to be applied look correct (dry-run mode by default): sudo -u analytics kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-oozie-rerun -v
  2. Apply the changes (jobs killed, new jobs started): sudo -u analytics kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-oozie-rerun -n

If nothing changed, a job just failed in production - Simple restart

The simplest kind of restart is restarting a job that failed and sent an error email. These don't need code changes, config changes, or new start times. They can be restarted by simply finding the highest-level oozie entity that failed (bundle, coordinator, or workflow), selecting the failed instance, and clicking Rerun, like this:

  1. Select the instance of the oozie job you want to restart.
  2. Click Rerun at the top and Submit without changing the configuration.
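
If you prefer the CLI, a hedged equivalent for a failed coordinator action is the rerun sub-command (the id and action number are placeholders; the action number is the one shown as FAILED or KILLED in oozie job -info):

# Rerun a specific action of a coordinator without changing its configuration.
sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL \
  -rerun <COORD_ID> -action <ACTION_NUMBER>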


Hard killing a workflow

It has happened that a workflow became unmanageable through regular oozie commands: it restarted automatically when the oozie server was restarted, and was not killed when asked to, whether through Hue or the command line.

The solution found was to manually delete the workflow data from the oozie database (a solution named The Hammer by Luca). The following procedure should only be done by an ops person, or under the supervision of an ops person.

  • Take a snapshot of the oozie database to ensure possible recovery (better safe than sorry - see the mysqldump sketch after this list)
  • Connect to the oozie mysql database
    mysql -u oozie -p --database oozie
    
  • Check the data to be deleted
    -- Number of jobs having your workflow id - should be 1
    SELECT COUNT(1) FROM WF_JOBS WHERE id = 'YOUR_WORKFLOW_ID';
    -- Number of actions managed by your workflow - should be relatively small, like in the 10s
    -- NOTE: Mind the % after the workflow id
    SELECT COUNT(1) FROM WF_ACTIONS WHERE id LIKE 'YOUR_WORKFLOW_ID%';
    
  • Delete the data
    DELETE FROM WF_JOBS WHERE id = 'YOUR_WORKFLOW_ID';
    DELETE FROM WF_ACTIONS WHERE id LIKE 'YOUR_WORKFLOW_ID%';
    
  • Disconnect from the database (in case you'd forgotten :)
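
For the snapshot step above, a minimal sketch using mysqldump (credentials and host are assumptions; adapt to the actual oozie database setup):

# Dump the whole oozie database to a local file before deleting any rows.
mysqldump -u oozie -p oozie > oozie_backup_$(date +%Y%m%d).sql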