Discovery/Analytics

Discovery uses the Analytics Cluster to support CirrusSearch. The source code is in the wikimedia/discovery/analytics repository.

How to prepare a deploy

If any python dependencies have been updated, the artifacts/ directory needs to be updated as well. On your local machine:

  1. git rm artifacts/*.whl
  2. bin/make_wheels.sh
  3. bin/upload_wheels.py artifacts/*.whl
  4. git add artifacts/*.whl

The make_wheels script (step 2 above) checks each of the oozie tasks for a requirements.txt. If present, all necessary wheels are collected into artifacts/ and a requirements-frozen.txt containing exact versions is emitted.
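
A quick way to see which oozie tasks declare python dependencies is a plain find over the repository checkout (not part of the repo tooling; this assumes the tasks live under oozie/ as the property paths below suggest):

 find oozie -name requirements.txt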

The upload_wheels.py script syncs the wheels to archiva. If a wheel already exists in archiva for the same version but with a different sha1, your local version will be replaced. Wheel upload requires setting up maven with WMF's Archiva configuration.
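
To see the checksums of your local wheels before uploading (plain sha1sum, not part of the repo tooling; useful when comparing against what archiva already holds):

 sha1sum artifacts/*.whl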

How to deploy

  1. Ssh to deployment.eqiad.wmnet
  2. Run:
    cd /srv/deployment/wikimedia/discovery/analytics
    git checkout master
    git pull --ff-only
    scap deploy 'a logged description of deployment'

    This part brings the discovery analytics code from gerrit to stat1007. It additionally builds the necessary virtualenvs.
  3. Ssh into stat1007
  4. Run sudo -u analytics-search /srv/deployment/wikimedia/discovery/analytics/bin/discovery-deploy-to-hdfs --verbose --no-dry-run

    This part brings the discovery analytics code to the HDFS (but it does not resubmit Oozie jobs).

How to deploy Oozie production jobs

Oozie jobs are deployed from stat1007. The following environment variables are used to kick off all jobs:

  • REFINERY_VERSION should be set to the concrete, 'deployed' version of refinery that you want to deploy from. Like 2015-01-05T17.59.18Z--7bb7f07. (Do not use current there, or your job is likely to break when someone deploys refinery afresh).
  • DISCOVERY_VERSION should be set to the concrete, 'deployed' version of discovery analytics that you want to deploy from. Like 2016-01-22T20.19.59Z--e00dbef. (Do not use current there, or your job is likely to break when someone deploys discovery analytics afresh).
  • PROPERTIES_FILE should be set to the properties file that you want to deploy; relative to the refinery root. Like oozie/popularity_score/bundle.properties.
  • START_TIME should denote the time the job should run the first time. Like 2016-01-05T11:00Z. This should be coordinated between both the popularity_score and transfer_to_es jobs so that they are asking for the same days. Generally you want to set this to the next day the job should run.

kill existing coordinator / bundle

Make sure you kill any existing coordinator / bundle before deploying a new one.

The simplest way is to search for analytics-search in https://hue.wikimedia.org/oozie/list_oozie_coordinators/ and https://hue.wikimedia.org/oozie/list_oozie_bundles/. Jobs running under your own user can be killed from the Hue interface. For jobs running under another user, such as analytics-search, you may need to copy the appropriate id from hue and run the following on stat1007 (after replacing the last argument with the id):

 sudo -u analytics-search oozie job -oozie $OOZIE_URL -kill 0000520-160202151345641-oozie-oozi-C
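
If you are unsure you copied the right id, you can inspect it first; oozie job -info is read-only (the id below is the same example id as above):

 oozie job -oozie $OOZIE_URL -info 0000520-160202151345641-oozie-oozi-C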

popularity_score

 export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | grep -v -- -dirty | tail -n 1 | sed 's/^.*\///')
 export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | grep -v -- -dirty | tail -n 1 | sed 's/^.*\///')
 export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
 export START_TIME=2016-01-05T11:00Z
 
 cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
 sudo -u analytics-search oozie job \
   -oozie $OOZIE_URL \
   -run \
   -config $PROPERTIES_FILE \
   -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
   -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
   -D queue_name=production \
   -D start_time=$START_TIME

transfer_to_es

This job runs a single coordinator to transfer popularity score from hdfs to kafka, to be loaded into the elasticsearch cluster by the mjolnir bulk daemon on the production side.

 export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | grep -v -- -dirty | tail -n 1 | sed 's/^.*\///')
 export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | grep -v -- -dirty | tail -n 1 | sed 's/^.*\///')
 export PROPERTIES_FILE=oozie/transfer_to_es/coordinator.properties
 export START_TIME=2016-01-05T11:00Z
 
 cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
 sudo -u analytics-search oozie job \
   -oozie $OOZIE_URL \
   -run \
   -config $PROPERTIES_FILE \
   -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
   -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
   -D queue_name=production \
   -D start_time=$START_TIME

query_clicks

This job has 2 coordinators, an hourly one and a daily one. They are deployed independently. The analytics-search user does not currently have the right access to webrequest data; that will need to be fixed before a production deployment.

DANGER: Double check that the namespace_map_snapshot_id still exists. If the snapshot has been deleted then the hourly jobs will not have any results.
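
One way to double check is to list the available snapshots. This assumes the snapshot ids correspond to partitions of the wmf_raw.mediawiki_project_namespace_map hive table (the table name is an assumption; adjust it if the job reads a different one):

 hive -e 'SHOW PARTITIONS wmf_raw.mediawiki_project_namespace_map'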

 export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | grep -v -- -dirty | tail -n 1 | sed 's/^.*\///')
 export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | grep -v -- -dirty | tail -n 1 | sed 's/^.*\///')
 export START_TIME=2016-01-05T11:00Z
 
 cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
 for PROPERTIES_TYPE in hourly daily; do
   export PROPERTIES_FILE="oozie/query_clicks/${PROPERTIES_TYPE}/coordinator.properties"
   sudo -u analytics-search oozie job \
     -oozie $OOZIE_URL \
     -run \
     -config $PROPERTIES_FILE \
     -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
     -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
     -D queue_name=default \
     -D start_time=$START_TIME \
     -D namespace_map_snapshot_id=2019-05
 done

Oozie Test Deployments

There is no hadoop cluster in beta cluster or labs, so changes have to be tested in production. When submitting a job please ensure you override all appropriate values so the production data paths and tables are not affected. After testing your job be sure to kill it (the correct one!) from hue. Note that most of the time you won't need to do a full test through oozie; you can instead call the script directly with spark-submit.
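
A direct test run might look something like the sketch below; the script path and arguments are hypothetical and need to be adapted to the job under test:

 spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --queue default \
     path/to/the_job_script.py --some-arg some-value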

deploy test code to hdfs

 git clone http://gerrit.wikimedia.org/r/wikimedia/discovery/analytics ~/discovery-analytics
 <copy some command from the gerrit ui to pull down and checkout your patch>
 ~/discovery-analytics/bin/discovery-deploy-to-hdfs --base hdfs:///user/$USER/discovery-analytics --verbose --no-dry-run
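
To confirm the test code landed where expected, a quick listing with the standard hdfs CLI:

 hdfs dfs -ls /user/$USER/discovery-analytics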

popularity_score

 export DISCOVERY_VERSION=current
 export REFINERY_VERSION=current
 export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
 cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
 oozie job -oozie $OOZIE_URL \
           -run \
           -config $PROPERTIES_FILE \
           -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
           -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
           -D start_time=2016-01-22T00:00Z \
           -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
           -D popularity_score_table=$USER.discovery_popularity_score

transfer_to_es

 export DISCOVERY_VERSION=current
 export REFINERY_VERSION=current
 export PROPERTIES_FILE=oozie/transfer_to_es/coordinator.properties
 export START_TIME=2016-02-22T11:00Z
 cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
 oozie job -oozie $OOZIE_URL \
           -run \
           -config $PROPERTIES_FILE \
           -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
           -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
           -D start_time=$START_TIME \
           -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
           -D popularity_score_table=$USER.discovery_popularity_score

Debugging

Web Based

Hue

Active oozie bundles, coordinators, and workflows can be seen in hue. Hue uses Wikimedia LDAP for authentication. All production jobs for discovery are owned by the analytics-search user. Hue is only useful for viewing the state of things; actual manipulations need to be done by sudo'ing to the analytics-search user on stat1007.eqiad.wmnet and using the CLI tools.

Container killed?

To find the logs about a killed container, specifically for a task that runs a hive script:

  • Find the Coordinator
  • Click on failed task
  • Click 'actions'
  • Click the external id of the hive task
  • Click the 'logs' button under 'Attempts' (not in the left sidebar)
  • Select 'syslog'
  • Search in the page for 'beyond'. You may find a log line like:

2017-03-01 00:05:57,469 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1488294419903_1313_m_000000_0: Container [pid=2002,containerID=container_e42_1488294419903_1313_01_000003] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 5.0 GB of 2.1 GB virtual memory used. Killing container.

Yarn

Yarn shows the status of currently running jobs on the hadoop cluster. If you are in the wmf LDAP group (all WMF employees/contractors) you can log in directly to yarn.wikimedia.org. Yarn is also accessible over a SOCKS proxy to analytics1001.eqiad.wmnet.
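
One way to set up such a proxy, assuming you have shell access through a production bastion (substitute whichever bastion host you normally connect through):

 ssh -N -D 8080 <bastion host>
 # then point your browser's SOCKS proxy at localhost:8080 and browse to the yarn UI on analytics1001.eqiad.wmnet:8088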

Currently running jobs are visible via the running applications page.

CLI

CLI commands are typically run by ssh'ing into stat1007.eqiad.wmnet.

Oozie

The oozie CLI command can be used to get info about currently running bundles, coordinators, and workflows, although it is often easier to get that information from #Hue. Use oozie help for more detailed info; here are a few useful commands. Get the appropriate oozie id from hue.

Re-run a failed workflow
 sudo -u analytics-search oozie job -oozie $OOZIE_URL -rerun 0000612-160202151345641-oozie-oozi-W
Show info about running job
 oozie job -info 0000612-160202151345641-oozie-oozi-W

This will output something like the following

 Job ID : 0000612-160202151345641-oozie-oozi-W
 ------------------------------------------------------------------------------------------------------------------------------------
 Workflow Name : discovery-transfer_to_es-discovery.popularity_score-2016,2,1->http://elastic1017.eqiad.wmnet:9200-wf
 App Path      : hdfs://analytics-hadoop/wmf/discovery/2016-02-02T21.16.44Z--2b630f1/oozie/transfer_to_es/workflow.xml
 Status        : RUNNING
 Run           : 0
 User          : analytics-search
 Group         : -
 Created       : 2016-02-02 23:01 GMT
 Started       : 2016-02-02 23:01 GMT
 Last Modified : 2016-02-03 03:36 GMT
 Ended         : -
 CoordAction ID: 0000611-160202151345641-oozie-oozi-C@1
 Actions
 ------------------------------------------------------------------------------------------------------------------------------------
 ID                                                                            Status    Ext ID                 Ext Status Err Code  
 ------------------------------------------------------------------------------------------------------------------------------------
 0000612-160202151345641-oozie-oozi-W@:start:                                  OK        -                      OK         -         
 ------------------------------------------------------------------------------------------------------------------------------------
 0000612-160202151345641-oozie-oozi-W@transfer                                 RUNNING   job_1454006297742_16752RUNNING    -         
 ------------------------------------------------------------------------------------------------------------------------------------

You can continue down the rabbit hole to find more information. In this case the above oozie workflow kicked off a transfer job, which matches an element of the related workflow.xml

 oozie job -info 0000612-160202151345641-oozie-oozi-W@transfer

This will output something like the following

 ID : 0000612-160202151345641-oozie-oozi-W@transfer
 ------------------------------------------------------------------------------------------------------------------------------------
 Console URL       : http://analytics1001.eqiad.wmnet:8088/proxy/application_1454006297742_16752/
 Error Code        : -
 Error Message     : -
 External ID       : job_1454006297742_16752
 External Status   : RUNNING
 Name              : transfer
 Retries           : 0
 Tracker URI       : resourcemanager.analytics.eqiad.wmnet:8032
 Type              : spark
 Started           : 2016-02-02 23:01 GMT
 Status            : RUNNING
 Ended             : -
 ------------------------------------------------------------------------------------------------------------------------------------

Of particular interest is the Console URL, and the related application id (application_1454006297742_16752). This id can be used with the yarn command to retrieve logs of the application that was run. Note though that two separate jobs are created: the @transfer job shown above is an oozie runner, which for spark jobs does little actual work and mostly just kicks off another job to run the spark application.
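
The same application id can also be inspected directly with the standard yarn CLI:

 yarn application -status application_1454006297742_16752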

List jobs by user

Don't attempt to use the oozie jobs -filter ... command; it will stall out oozie. Instead use hue and the filters available there.

Yarn

Yarn is the actual job runner.

List running spark jobs
 yarn application -appTypes SPARK -list
Fetch application logs

The yarn application id can be used to fetch application logs:

 sudo -u analytics-search yarn logs -applicationId application_1454006297742_16752 | less

Remember that each spark job kicked off by oozie has two application ids. The one reported by oozie job -info is the oozie job runner. The id of the sub job that actually runs the spark application isn't as easy to find. Your best bet is often to poke around in the yarn web ui to find the job with the right name, such as Discovery Transfer To http://elastic1017.eqiad.wmnet:9200, or to grep the application listing as shown below.
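
A rough CLI alternative is to grep the list of running spark applications (the same listing command shown above) for part of the job name:

 yarn application -appTypes SPARK -list | grep -i transfer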

Yarn manager logs

When an executor is lost, it's typically because it ran over its memory allocation. You can verify this by looking at the yarn manager logs for the machine that lost an executor. To fetch the logs, pull them down from stat1007 and grep them for the numeric portion of the application id. For an executor lost on analytics1056 for application_1454006297742_27819:

   curl -s http://analytics1056.eqiad.wmnet:8042/logs/yarn-yarn-nodemanager-analytics1056.log | grep 1454006297742_27819 | grep WARN

It might output something like this:

   2016-02-07 06:05:06,480 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_e12_1454006297742_27819_01_000020
   2016-02-07 06:26:57,151 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_e12_1454006297742_27819_01_000011 has processes older than 1 iteration running over the configured limit. Limit=5368709120, current usage = 5420392448
   2016-02-07 06:26:57,152 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=112313,containerID=container_e12_1454006297742_27819_01_000011] is running beyond physical memory limits. Current usage: 5.0 GB of 5 GB physical memory used; 5.7 GB of 10.5 GB virtual memory used. Killing container.
   2016-02-07 06:26:57,160 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e12_1454006297742_27819_01_000011 is : 143