Analytics/Systems/Exporting from HDFS to Swift

From Wikitech

When generating regular datasets via jobs scheduled in Oozie, you may want to export those datasets for production use outside of the Analytics Cluster. As of 2019-08, there are two options:

  1. On a stat box, copy the files from HDFS to /srv/published/datasets. Data in /srv/published/datasets eventually becomes accessible to the world at https://analytics.wikimedia.org/datasets/. This approach cannot be scheduled via Oozie itself. See also Analytics/Ad_hoc_datasets.
  2. Use analytics/refinery's oozie/util/swift/upload workflow to upload a dataset directory from HDFS to the production Swift object store. Once in Swift, the dataset can be downloaded by a service or job outside of the Analytics Cluster.

Oozie Swift Upload

To schedule Oozie jobs that upload your regular datasets to Swift, include the oozie/util/swift/upload sub-workflow as an action in your main workflow.xml. As an example, see the swift_upload action in Discovery Analytics' esbulk job and its related configuration properties.
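
As a rough illustration of what such an action can look like, the fragment below is a minimal sketch of an Oozie sub-workflow action. The sub-workflow element structure is standard Oozie. The action name, the transition targets, and every ${...} variable and property name shown here are hypothetical placeholders, not the upload workflow's actual parameters. Check the workflow.xml in oozie/util/swift/upload for the parameters it really expects.

```xml
<!-- Sketch only: the names below are illustrative assumptions, not the real parameter list. -->
<action name="swift_upload">
    <sub-workflow>
        <!-- Hypothetical variable holding the HDFS path to the upload workflow.xml -->
        <app-path>${swift_upload_workflow_file}</app-path>
        <!-- Pass the parent workflow's configuration down to the sub-workflow -->
        <propagate-configuration/>
        <configuration>
            <!-- Hypothetical property: the HDFS dataset directory to upload -->
            <property>
                <name>source_directory</name>
                <value>${dataset_output_directory}</value>
            </property>
            <!-- Hypothetical property: the Swift container that receives the dataset -->
            <property>
                <name>swift_container</name>
                <value>${swift_container}</value>
            </property>
        </configuration>
    </sub-workflow>
    <!-- Assumes the parent workflow defines nodes named "end" and "kill" -->
    <ok to="end"/>
    <error to="kill"/>
</action>
```

Placing the action after the step that writes the dataset means the upload only runs once the data has been generated successfully.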