Data Engineering/Evaluations/2021 data catalog selection/Rubric/Atlas


Core Service and Dependency Setup

Atlas 2.2.0 was downloaded as a tarball and compiled on an-test-coord1001 without root privileges.

Later, we also tried HEAD of the GitHub repository, which identifies itself as version 3.0.0-SNAPSHOT.

We built the project with Maven, selecting the BerkeleyDB and Apache Solr profile, which automatically built and started the Solr and ZooKeeper dependencies on the same host.

The daemons were started with bin/atlas_start.sh, after which the web service was available on port 21000.
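
A quick way to confirm that the web service has come up is to test whether something is accepting TCP connections on port 21000. The following is a minimal sketch rather than part of the evaluation itself: the function name is_port_open and the host "localhost" are assumptions for illustration, and it checks only that the port is open, not that Atlas is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For example, calling is_port_open("localhost", 21000) on the host running the daemons should return True once the service is listening.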

Ingestion Configuration

The key ingestion elements that we wanted to get working were the Hive Hook and Bridge components, along with the utility script import_hive.sh.

The hook and bridge keep the metadata in Atlas synchronized in real time whenever data is changed in Hive.

The import_hive.sh script performs a one-off import of metadata for existing Hive data.

The bulk of the time spent on the evaluation went into trying to get the import script working.

Progress Status

Getting the import script to talk to the Hive Metastore using an individual's existing Kerberos session took some time. Unfortunately, once this was working we discovered that the Hive integration in Atlas requires Hive version 3.1.0 or later, whereas we currently run Hive version 2.3.6.

Therefore, we could proceed with Atlas and its Hive integration in only two ways:

  • Upgrade our existing Hive services, along with the underlying Hadoop services
  • Downgrade Atlas to version 1.2.0

The upgrade option was the more appealing of the two, but it would require a great deal of work before we could continue with the evaluation.
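
The version constraint described above could be checked up front, before investing time in authentication and connectivity. The sketch below assumes plain dotted numeric version strings such as "2.3.6"; the names parse_version, meets_minimum and MIN_HIVE_VERSION are hypothetical and do not come from Atlas or Hive.

```python
from typing import Tuple

# Minimum Hive version required by the Atlas Hive integration, as found above.
MIN_HIVE_VERSION = "3.1.0"


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.3.6' into a tuple of integers."""
    parts = [int(part) for part in text.strip().split(".")]
    # Pad to three components so that '3.1' compares equal to '3.1.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, required: str = MIN_HIVE_VERSION) -> bool:
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)
```

With the versions reported here, meets_minimum("2.3.6") is False, which is the mismatch that halted the prototype. Version strings carrying suffixes such as "-SNAPSHOT" are not handled by this sketch.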

Perceptions

In some ways, Atlas looked like a good fit for this requirement. The real-time integration with Hive would have been particularly useful, and other projects, such as Amundsen, can build upon it further.

However, there are also a number of ways in which the project did not impress.

  • The community did not respond to our requests for assistance on the mailing list.
  • The unmodified 2.2.0 tarball contained errors that prevented it from building, suggesting a lack of release quality control.
  • The monolithic nature of the project makes it difficult to address a single component, such as the Hive connector.

Outcome

We ceased working on the prototype once the requirement for Hive 3.1 became clear.