Data Engineering/Exporting from HDFS to Swift

When you generate recurring datasets with jobs scheduled in Oozie, you may want to export them for production use outside of the Analytics Cluster. As of 2019-08, there are two options:

  1. On a stat box, pull files from HDFS into /srv/published/datasets. Data placed in /srv/published/datasets eventually becomes accessible to the world at https://analytics.wikimedia.org/datasets/. This option cannot be scheduled via Oozie itself. See also Analytics/Ad_hoc_datasets.
  2. Use analytics/refinery's oozie/util/swift/upload workflow to upload a dataset directory from HDFS to the production Swift object store. Once in Swift, the dataset can be downloaded by a service or job outside of the Analytics Cluster.
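For option 1, the manual pull on a stat box could be scripted roughly as follows. This is only a sketch: it assumes the standard `hdfs dfs -get` command is available on the stat box, and the HDFS source path and dataset subdirectory used in the usage note are hypothetical placeholders, not real dataset locations. Analytics/Ad_hoc_datasets remains the authoritative description of the process.

```python
import posixpath
import subprocess

# Local directory on the stat box whose contents are eventually published.
PUBLISHED_ROOT = "/srv/published/datasets"


def build_hdfs_get_command(hdfs_source, local_subdir):
    """Return the argv list that copies hdfs_source into the published tree.

    local_subdir is a hypothetical name for the dataset's directory under
    PUBLISHED_ROOT; leading and trailing slashes are stripped so the result
    always stays inside PUBLISHED_ROOT.
    """
    destination = posixpath.join(PUBLISHED_ROOT, local_subdir.strip("/"))
    return ["hdfs", "dfs", "-get", hdfs_source, destination]


def pull_dataset(hdfs_source, local_subdir):
    """Run the copy; raises CalledProcessError if hdfs exits non-zero."""
    subprocess.run(build_hdfs_get_command(hdfs_source, local_subdir), check=True)
```

Calling `pull_dataset("/path/in/hdfs/my_dataset", "my_dataset")` (both arguments are placeholders) would run `hdfs dfs -get` with `/srv/published/datasets/my_dataset` as the destination. Because Oozie cannot schedule this, it would have to be run by hand or by some other scheduler on the stat box.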

Oozie Swift Upload

To schedule Oozie jobs that upload your recurring datasets to Swift, include the oozie/util/swift/upload sub-workflow as an action in your main workflow.xml. As an example, see the swift_upload action of Discovery Analytics' esbulk job and its related configuration properties.

Swift access credentials are set up on a per-request basis and are usually rendered as an environment variable file for use with the swift Python client. See https://phabricator.wikimedia.org/T296945 for an example of such a request.
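A consumer outside the Analytics Cluster might load such a credentials file along these lines. This is a minimal sketch using only the Python standard library. It assumes the file contains simple `KEY=value` lines (optionally prefixed with `export`). The variable names `ST_AUTH`, `ST_USER` and `ST_KEY` are the ones the swift command-line client reads for v1 auth, but whether a given credentials file uses them is an assumption, and every value shown is a placeholder.

```python
import shlex


def parse_env_file(text):
    """Parse KEY=value lines into a dict, skipping blanks and # comments."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Tolerate shell-style "export KEY=value" lines.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignore
        # shlex removes surrounding quotes the way a shell would.
        parts = shlex.split(value)
        env[key.strip()] = parts[0] if parts else ""
    return env


# Placeholder content only; real files are issued per request.
SAMPLE = """\
# hypothetical credentials file
ST_AUTH=https://swift.example.org/auth/v1.0
export ST_USER=example_account:example_user
ST_KEY='not-a-real-key'
"""

credentials = parse_env_file(SAMPLE)
```

The resulting dictionary can be merged into `os.environ` before invoking the swift client, or its values passed explicitly to whatever client library the consuming service uses.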