Analytics/Systems/Cluster/Kafka/Kafka Udp2log


Analytics is in the process of verifying that the new varnishkafka setup is as reliable as we want it to be. Once we feel confident in varnishkafka+kafka as a varnishncsa+udp2log replacement, we will deploy varnishkafka on production mobile varnish servers. When we do this, Ops would like to disable the existing varnishncsa instance there. Ops has said that there can be a short time period (1 or 2 months) where both varnishncsa and varnishkafka are left running, but that there should be an explicit date set for when varnishncsa will be disabled.

Disabling the mobile varnishncsa -> udp2log instances will remove mobile webrequest data from the 'firehose' (the canonical datastream that contains all webrequests from all cache servers). The udp2log firehose stream feeds several downstream consumers of mobile data. Analytics would like to find a solution that allows varnishncsa to be disabled, but also keeps the udp2log firehose intact for a longer time period.

Consumers of Mobile Webrequest Logs

  1. mobile-sampled-100.tsv.log - Ad hoc research.
  2. zero.tsv.log - Ad hoc research.
  3. sampled-1000.tsv.log - Ad hoc research, and used by Erik Zachte in wikistats to generate mobile stats on
  4. Fundraising - Fundraising banners are shown to mobile users, so Fundraising collects statistics that include mobile data.
  5. webstatscollector - Generates hosted pagecount statistics. Webstatscollector provides project-level counts for the mobile sites of the Wikimedia projects.

In order to continue supporting these consumers, we will need to either maintain the complete firehose stream, or find another way of generating these statistics in Hadoop or elsewhere. Ideally, the existing statistics generated from udp2log would not disappear right away, as they can serve as a baseline for the new statistics generated by Hadoop before the old ones are disabled.


A. UDP output support in varnishkafka

Magnus could build in support for additional outputs in varnishkafka. varnishkafka would emit logs to both Kafka and udp2log. This is probably the best solution, as it satisfies the requirement of turning off varnishncsa, and it leaves the udp2log firehose intact without additional complexity. The disadvantage is that network traffic between datacenters would be duplicated, once for udp2log and once for Kafka.

B. Keep varnishncsa on

This has the advantage of requiring no changes to any consumers. However, Ops would prefer that this option not be considered, as the whole point of the Kafka project is to replace udp2log, and running an extra process on frontend caches is not ideal.

C. Consume logs from Kafka and pipe back into the udp2log firehose

This has the advantage of requiring no changes to any consumers. However, it is a bit hacky and inefficient. This process would consume all mobile JSON data from Kafka and transform it back into the TSV format expected by the existing udp2log consumers. This is inefficient because a single process (possibly threaded) would be responsible for reading all of the mobile logs and transforming them, which introduces another SPOF into the already fragile udp2log setup.
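The transformation step in option C can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patchset mentioned below: the field names and their order in `FIELDS` are hypothetical (the udp2log line layout is not specified on this page), and the Kafka consumer itself is left out, so `forward` simply takes any iterable of JSON lines.

```python
import json
import socket

# Hypothetical field order. The real udp2log TSV layout is not given here.
FIELDS = ["hostname", "sequence", "dt", "ip", "uri_path", "user_agent"]


def json_to_tsv(line, fields=FIELDS, missing="-"):
    """Turn one JSON log record into a single tab-separated line."""
    record = json.loads(line)
    cells = []
    for name in fields:
        value = str(record.get(name, missing))
        # Tabs or newlines inside a value would break the TSV framing.
        cells.append(value.replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)


def forward(lines, host, port):
    """Send each transformed record as one UDP datagram to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            payload = (json_to_tsv(line) + "\n").encode("utf-8")
            sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

Because every record passes through this one loop, the sketch also makes the SPOF concern concrete: if the process stalls or dies, nothing reaches the udp2log consumers.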

There is already a first gerrit patchset in for this process:

D. Generate consumer statistics in Hadoop

This is feasible in the short term for consumers 1. and 2.

3. (wikistats) is a large complicated project and recreating the work that Erik Zachte has done in Hadoop is not likely to be successful, especially in the short term.

4. (Fundraising) would need special coordination, and as their stats are not under Analytics' purview, this is more difficult than it sounds.

5. (webstatscollector) is a stream processor, and requires that the logs flow through it in real time. (The hourly aggregations are done on current time, not on the webrequest's timestamp.)
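The consequence of aggregating on current time can be shown with a small sketch. This is not webstatscollector's code; the function names and the sample timestamps are made up for illustration. It counts the same events per hour twice, once keyed by the time each record arrives at the aggregator and once keyed by the request's own timestamp.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts):
    """Truncate a datetime to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def count_by_hour(events, use_arrival_time):
    """events: iterable of (request_time, arrival_time) datetime pairs."""
    counts = Counter()
    for request_time, arrival_time in events:
        ts = arrival_time if use_arrival_time else request_time
        counts[hour_bucket(ts)] += 1
    return counts


# Made-up sample: the first request happens just before 11:00 but is
# only processed just after 11:00.
events = [
    (datetime(2013, 11, 1, 10, 59, 58), datetime(2013, 11, 1, 11, 0, 3)),
    (datetime(2013, 11, 1, 11, 5, 0), datetime(2013, 11, 1, 11, 5, 1)),
]

by_arrival = count_by_hour(events, use_arrival_time=True)
by_request = count_by_hour(events, use_arrival_time=False)
```

With arrival-time bucketing both events land in the 11:00 hour, while request-time bucketing splits them across 10:00 and 11:00. Logs replayed later from Hadoop into an aggregator that buckets on arrival time would all be counted in the hour of the replay, which is why this consumer needs the logs in real time.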

E. Translate collected logs from Hadoop into their eventual locations and formats

This would mean running a process to dump logs out of HDFS back into the *.tsv.log files for Consumers 1. and 2. (mobile and zero).

3. (wikistats) is more complicated, as the process would have to combine the mobile logs from Hadoop with the existing sampled-1000 logs.

4. (Fundraising) would need special coordination, and as their stats are not under Analytics' purview, this is more difficult than it sounds.

5. (webstatscollector) This solution would not work, as webstatscollector is a real-time stream processor.

Not recommended.

F. Pipe logs from Kafka into the downstream consumer locations

This is similar to C, except logs would be consumed directly from Kafka into the *.tsv.log files. This solution would still require extra support for somehow piping streaming data into webstatscollector and Fundraising with something similar to D.

Not recommended.

G. Hybrid

Many of the above approaches could be combined. For example, we could consume from Kafka into the static *.tsv.log files, but also pipe data from Kafka directly into webstatscollector, bypassing the udp2log firehose altogether. However, this would be more systems and code to maintain, whereas there are other complete solutions available.

Not recommended.

H. udp2log Kafka Consumer

Ah! Even better! If we deploy varnishkafka on all hosts and deprecate varnishncsa and the udp firehose stream, then all webrequest data would be available to consume from Kafka. We could modify udp2log to be able to consume from Kafka and transform the data there back into the .tsv format that downstream consumers expect. This would be transparent to downstream consumers, and also keeps us from having to duplicate the same data across the network.

Analytics preferences

Analytics would prefer a solution that keeps the udp2log firehose intact. Solution H. is the preferred one, but requires some coding work to modify udp2log.


Check it out! kafkatee!