User:Gmodena/Event Platform/Refine integration

  1. All streams are ingested by default via Gobblin under an event_default job/tag declared in EventStreamConfig (ESC); eventlogging streams declare their own job name. To disable ingestion for a stream, declare an empty/null analytics consumer block or set consumers.analytics_hadoop_ingestion: false (see the ESC sketch after this list).
  2. ESC consumer settings are used to configure Gobblin ingestion. Gobblin job files live under gobblin/jobs in the refinery repo, and ESC and Gobblin job names match by convention: e.g. the ESC event_default job is managed by gobblin/jobs/event_default.pull in refinery (see the pull-file sketch after this list).
  3. Refine job configs are (partially) declared in puppet and deployed to an-launcher1002 under /etc/refinery/refine/. refine_event.properties configures the refine process for streams tagged with the event_default job name (see the properties sketch after this list).
  4. Refine loads data from e.g. /wmf/data/raw/event/, a path declared in the matching Gobblin pull job.
  5. Refine job properties (e.g. refine_event.properties) declare the output database and which streams (directories loaded into /wmf/data/raw/event/) to exclude from the refine process (via the table_exclude_regex property).
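
The snippets below are rough sketches, not copies of production config. First, a minimal sketch of what a stream's consumer settings in ESC might look like, using the key names from the notes above; the stream name mediawiki.my_stream and its schema title are made up, and the real stream configs are declared in the mediawiki-config repo:

 # Hypothetical stream entry; key names follow the conventions described above.
 mediawiki.my_stream:
   schema_title: my/schema
   consumers:
     analytics_hadoop_ingestion: false   # opt this stream out of Gobblin ingestion
 # Alternatively, an empty/null analytics consumer block has the same effect.
 # Streams that declare nothing fall under the default event_default job/tag.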
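Next, a sketch of the kind of Gobblin job file the naming convention points at. The property names (job.name, topic.whitelist, data.publisher.final.dir) are standard Gobblin Kafka-ingestion properties, but the values here are illustrative; the authoritative file is gobblin/jobs/event_default.pull in the refinery repo:

 # gobblin/jobs/event_default.pull (illustrative values only)
 job.name=event_default
 job.group=gobblin
 # Kafka topics to pull; in practice derived from the streams tagged event_default in ESC
 topic.whitelist=eventgate-analytics.*
 # Raw data lands under this directory; Refine later reads from here (see below)
 data.publisher.final.dir=/wmf/data/raw/event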
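Finally, a sketch of a Refine properties file such as refine_event.properties. Only table_exclude_regex comes from the notes above; the other property names (input_path, output_path, output_database) are assumptions about the Refine job's configuration and should be checked against the deployed file under /etc/refinery/refine/ on an-launcher1002:

 # /etc/refinery/refine/refine_event.properties (sketch; property names approximate)
 # Raw input written by the event_default Gobblin job
 input_path           = /wmf/data/raw/event
 # Hive database and path that refined tables are written to
 output_database      = event
 output_path          = /wmf/data/event
 # Streams (directories under input_path) to skip during refinement
 table_exclude_regex  = ^(some_excluded_stream|another_excluded_stream)$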