User:Milimetric/Zarchive/Gobblin

From Wikitech

Totally untested right now, just notes from when I set this up:

If using a stat box to build, put this in ~/.gradle/gradle.properties:

org.gradle.daemon=true
systemProp.https.proxyHost=webproxy.eqiad.wmnet
systemProp.https.proxyPort=8080
systemProp.http.proxyHost=webproxy.eqiad.wmnet
systemProp.http.proxyPort=8080
systemProp.http.nonProxyHosts=localhost|127.0.0.1
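
If ~/.gradle/gradle.properties already exists, the settings above can be appended one key at a time instead of overwriting the file. This is only a sketch: add_prop is a hypothetical helper name, not part of Gradle, and it assumes the file, if present, ends with a newline.

```shell
# Hypothetical helper: append key=value to a properties file
# only when that exact key is not already defined in it.
add_prop() {
  file="$1"; key="$2"; value="$3"
  mkdir -p "$(dirname "$file")"
  touch "$file"
  # awk compares the text before the first '=' literally, so dots in keys are not wildcards
  if ! awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$file"; then
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example (commented out so that sourcing this snippet changes nothing):
# add_prop ~/.gradle/gradle.properties systemProp.https.proxyHost webproxy.eqiad.wmnet
# add_prop ~/.gradle/gradle.properties systemProp.https.proxyPort 8080
```

Existing values are left alone, so a key that is already set to something else has to be edited by hand.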

Clone and customize the gobblin flavor:

git clone https://github.com/apache/gobblin.git ~/gobblin
vim ~/gobblin/gobblin-distribution/gobblin-flavor-custom-kafka1.gradle
# make it look like this:
dependencies {
  compile project(':gobblin-example')
  compile project(':gobblin-modules:gobblin-crypto-provider')
  compile project(':gobblin-modules:gobblin-kafka-08')
  compile project(':gobblin-modules:gobblin-kafka-1')
  compile project(':gobblin-modules:google-ingestion')
  compile project(':gobblin-modules:gobblin-elasticsearch')
}
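# run the build from the repo root (gradlew resolves the project from the current directory):
cd ~/gobblin && ./gradlew -PgobblinFlavor=custom-kafka1 -PhadoopVersion=2.10.1 -x test -x rat -x javadoc build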
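
Before copying anything around, it may help to confirm that the build actually produced a distribution tarball. A minimal sketch, where newest_tarball is a hypothetical helper and the directory is passed in rather than assumed:

```shell
# Hypothetical helper: print the most recently modified *.tar.gz in a directory,
# or return non-zero if there is none.
newest_tarball() {
  dir="$1"
  newest=$(ls -t "$dir"/*.tar.gz 2>/dev/null | head -n 1)
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example (commented out): peek at the first few entries of the tarball
# tar -tzf "$(newest_tarball ~/gobblin/build/gobblin-distribution/distributions)" | head
```

If the function prints nothing, the build most likely failed or wrote its output somewhere else.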

Then extract the build on a stat box, make a pull file, and run:

# copy the build to a stat box and extract it:
(s)cp ~/gobblin/build/gobblin-distribution/distributions/apache-gobblin-incubating-bin-0.15.0.tar.gz <to a stat box somewhere>
<go there>
tar -xvf apache-gobblin-incubating-bin-0.15.0.tar.gz
cd gobblin-dist
# make a setup_gobblin file that looks like this:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF=/etc/hadoop/conf
# load it into the current shell, then run the job:
source setup_gobblin
./bin/gobblin.sh cli run oneShot -appConf file:///your_pull_file.pull
 # example pull file that works on the page-move topic:
 # add/change any classes you want to use as extractors etc. for testing

 job.name=GobblinKafkaTest
 job.group=GobblinKafka
 job.description=Gobblin test for Kafka
 job.lock.enabled=false

 kafka.brokers=kafka-jumbo1001.eqiad.wmnet:9093
 topic.whitelist=eqiad.mediawiki.page-move

 source.class=org.apache.gobblin.source.extractor.extract.kafka.MilimetricKafkaDeserializerSource
 kafka.deserializer.type=BYTE_ARRAY
 gobblin.kafka.consumerClient.class=org.apache.gobblin.kafka.client.Kafka1ConsumerClient$Factory
 source.kafka.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
 source.kafka.security.protocol=SSL
 source.kafka.ssl.cipher.suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384

 # Other ssl settings
 #source.kafka.ssl.ca.location=/var/lib/puppet/ssl/certs/ca.pem
 #source.kafka.ssl.curves.list=P-256
 #source.kafka.ssl.sigalgs.list=ECDSA+SHA256
 #source.kafka.ssl.ca.location=/etc/ssl/certs/Puppet_Internal_CA.pem

 extract.namespace=org.wikimedia.analytics.test

 writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
 simple.writer.delimiter=\n
 writer.file.path.type=tablename
 writer.destination.type=HDFS
 writer.output.format=txt

 data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher

 metrics.reporting.file.enabled=true
 metrics.log.dir=/user/milimetric/test_gobblin/metrics
 metrics.reporting.file.suffix=txt

 bootstrap.with.offset=earliest

 fs.uri=hdfs://analytics-hadoop/
 writer.fs.uri=hdfs://analytics-hadoop/
 state.store.fs.uri=hdfs://analytics-hadoop/

 mr.job.max.mappers=2
 mr.job.root.dir=/user/milimetric/test_gobblin/working

 state.store.dir=/user/milimetric/test_gobblin/state-store
 task.data.root.dir=/user/milimetric/test_gobblin/task-data
 data.publisher.final.dir=/user/milimetric/test_gobblin/GobblinKafkaTest
 data.publisher.metadata.output.dir=/user/milimetric/test_gobblin/task-metadata

 # the run prints a weird jobFailure log towards the end of its output; I searched but the cause wasn't obvious: https://github.com/apache/gobblin/search?q=onJobFailure
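
A typo in a pull file key tends to show up only once the job is running, so a quick check beforehand can save a round trip. The sketch below is plain shell and awk, not part of Gobblin. pull_get and pull_check are hypothetical helper names, and the parsing only covers simple key=value lines like the ones in the example above, not the full Java properties syntax.

```shell
# Print the value of a key from a pull file.
# Skips blank lines and '#' comments, tolerates leading whitespace,
# and returns non-zero if the key is absent.
pull_get() {
  awk -v k="$2" '
    /^[ \t]*#/ { next }
    /^[ \t]*$/ { next }
    {
      line = $0
      sub(/^[ \t]+/, "", line)
      i = index(line, "=")
      if (i > 0 && substr(line, 1, i - 1) == k) {
        print substr(line, i + 1)
        found = 1
        exit
      }
    }
    END { exit !found }
  ' "$1"
}

# Report every listed key that is missing; return non-zero if any is.
pull_check() {
  file="$1"; shift
  missing=0
  for key in "$@"; do
    if ! pull_get "$file" "$key" >/dev/null; then
      echo "missing: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (commented out):
# pull_check your_pull_file.pull job.name kafka.brokers topic.whitelist source.class
# hdfs dfs -ls "$(pull_get your_pull_file.pull data.publisher.final.dir)"
```

Which keys count as required is a guess here; adjust the list passed to pull_check to match the source and writer classes in use.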