Analytics/Archive/Data/Webrequests sampled

From Wikitech
This page contains historical information. It may be outdated or unreliable.

The Requests stream holds request logs from all caches.

This stream is owned by the Analytics Team.

Availability

NOTES

  • As of 2015-11, webrequest udp2log instances have been turned off. The /a/squid/archive data is no longer generated.
  • By December 2016, all data in /a/squid/archive had been removed.

stat1002.eqiad.wmnet /a/squid/archive/sampled

The stream is available in Cache log format with a sampling rate of 1:1000 as gzipped files at /a/squid/archive/sampled/sampled-1000.tsv.log-*.gz on stat1002.eqiad.wmnet (using udp2log as backend).

The date in the file name does not mean that all logs of that day are in that file. Instead, each file contains logs from ~06:30 of the previous day until ~06:30 of the day in the file name. For example, /a/squid/archive/sampled/sampled-1000.tsv.log-20130930.gz contains data from ~2013-09-29T06:30:00 until ~2013-09-30T06:30:00.
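The mapping from file name to covered time span can be sketched as follows. This is a minimal illustration, not a tool that existed on stat1002: the function name `approximate_window` is hypothetical, and the code assumes the file name always ends in `-YYYYMMDD.gz` as in the example above. The boundaries are approximate, since the text only gives ~06:30.

```python
from datetime import datetime, timedelta
import re

# Assumption: the file name ends in "-YYYYMMDD.gz", as in the example above.
_NAME_RE = re.compile(r"sampled-1000\.tsv\.log-(\d{8})\.gz$")


def approximate_window(path):
    """Return the approximate (start, end) covered by a udp2log-backed file."""
    match = _NAME_RE.search(path)
    if match is None:
        raise ValueError(f"unrecognized file name: {path}")
    day = datetime.strptime(match.group(1), "%Y%m%d")
    end = day + timedelta(hours=6, minutes=30)  # ~06:30 on the named day
    start = end - timedelta(days=1)             # ~06:30 on the previous day
    return start, end


start, end = approximate_window(
    "/a/squid/archive/sampled/sampled-1000.tsv.log-20130930.gz"
)
print(start.isoformat(), end.isoformat())
# prints: 2013-09-29T06:30:00 2013-09-30T06:30:00
```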

Statistics for 2013-12-01–2013-12-07
Avg. size / gzipped file: 714 MiB
Avg. size / uncompressed file: 3502 MiB
Avg. lines / uncompressed file: 8782 K
Avg. lines / second: 102
Avg. requests / second: 102 K
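The per-second figures follow from the per-file line count and the 1:1000 sampling rate. The short check below assumes that each file covers roughly 24 hours and that "K" means 1000; the variable names are illustrative only.

```python
SAMPLING_RATE = 1000            # 1:1000, as stated for this stream
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: one file covers ~24 hours

avg_lines_per_file = 8_782_000  # "8782 K" from the statistics; K taken as 1000

# Sampled log lines written per second.
lines_per_second = avg_lines_per_file / SECONDS_PER_DAY

# Each sampled line stands for about SAMPLING_RATE real requests.
requests_per_second = lines_per_second * SAMPLING_RATE

print(round(lines_per_second))            # prints 102
print(round(requests_per_second / 1000))  # prints 102 (in thousands)
```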

This stream gets used for:

  • ad hoc research

Events and known problems since 2013-09-01

Date from Date until Bug Details
Inherent * The stream may suffer from packet drop on udp2log. This should be <5%.
* “zero” markers got set not only for wikipedia, but also for sister projects (wiktionary, ...)
* Lines that would be longer than ~8K get chopped off at that border (no newline gets added). (Affects <1 line/day on average)
* bug 60315 The stream does not contain the SSL requests that come to the SSL terminators, only the requests that the terminators forward.
* 2013-09-26 bug 53806 Until around 2013-09-26 ~22:57, traffic from the mobile varnishes might have arrived with a garbled client IP.
2013-09-26 2013-10-01 bug 54779 No “mf-m” markers in stream between 2013-09-26 ~22:32 and 2013-10-01 ~14:30.
2013-12-17 n/a Squids got phased out (Last entry in Squid log format is on 2013-12-17T15:45:32.764)
2013-12-18 n/a bug 58889 Increase in zero=470-01 (Grameenphone Bangladesh) tagged traffic, due to advertising by the carrier.
2014-02-05 2014-02-25 bug 60955 The gz files with filenames 20140206 and 20140208 on stat1002 were missing/bad between 2014-02-06 and 2014-02-20.

The file 20140207 had extra data from 2014-02-05 and 2014-02-06 that was removed on 2014-02-22.

2014-03-21 2014-03-21 bug 62922 Sometimes zero tags are doubled like “zero=250-99;zero=250-99”. The first occurrence is on 2014-03-21T00:23:13. Last occurrence is on 2014-03-21T17:07:35.
2014-05-22 2014-06-24 bug 66833 Some zero tags did not have trailing characters stripped (like “zero=404-01b” instead of “zero=404-01”). Last occurrence is on 2014-06-24T14:18:24.
2014-07-09 09:00 2014-07-10 09:00 bug 68199 Traffic has been rerouted from ulsfo to eqiad for ULSFO floor move. No data has been lost, but host column may show eqiad caches for traffic that could be expected to go to ulsfo.
2014-07-25 ~14:00 2014-07-25 ~17:00 bug 69112 Carrier 250-99 was not properly zero tagged.
2014-07-29 01:35 2014-07-29 01:42 bug 68796 All of esams missing between 2014-07-29T01:35:45 and 2014-07-29T01:42:00 due to flapping network link (~11% of total zero traffic around that time)
2014-07-30 ~00:54 2014-08-04 ~21:00 bug 69112 Carrier 250-99 was not properly zero tagged, and some of the carrier's requests came with zero=ON instead.
2014-10-08 22:20 2014-10-08 23:20 bug 71879 ULSFO having connectivity issues leading to partial message loss
2014-10-20 13:06 2014-10-20 13:27 bug 72306 ULSFO connectivity issues causing packet loss between 6% and 47% for ulsfo caches.
2014-10-21 ~10:30 2014-10-21 ~11:43 bug 72355 ULSFO connectivity issues causing packet loss for ulsfo caches.
2014-11-30 ~03:50 2014-11-30 ~10:13 task T76334 No data while analytics infrastructure suffered eqiad network issues.
2015-01-13 ~22:20 2015-01-13 ~23:18 task T86973 No data due to firewall problems
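
Anyone working with data from the affected periods may want to clean up the doubled zero tags (bug 62922) and the unstripped trailing characters (bug 66833) before counting by carrier. The sketch below is one possible approach, not something the stream's tooling did. The helper `normalize_zero_tags` is hypothetical, and the code assumes that tags are separated by ";" and that carrier codes have the form NNN-NN, as in the examples listed above.

```python
import re

# Assumption: a carrier code looks like "NNN-NN" (e.g. 470-01, 250-99).
# Anything after the code, up to the next ";", is treated as debris.
_ZERO_RE = re.compile(r"zero=(\d{3}-\d{2})[^;]*")


def normalize_zero_tags(field):
    """Strip trailing characters from zero tags and drop repeated tags."""
    seen = []
    for part in field.split(";"):
        part = part.strip()
        match = _ZERO_RE.fullmatch(part)
        if match:
            # Keep only the carrier code, e.g. "zero=404-01b" -> "zero=404-01".
            part = "zero=" + match.group(1)
        # Skip empty parts and exact repeats (this applies to any tag,
        # not only zero tags).
        if part and part not in seen:
            seen.append(part)
    return ";".join(seen)


print(normalize_zero_tags("zero=250-99;zero=250-99"))  # prints zero=250-99
print(normalize_zero_tags("zero=404-01b"))             # prints zero=404-01
```

Values such as zero=ON do not match the assumed carrier-code pattern and pass through unchanged, so they can still be identified separately.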

stat1002.eqiad.wmnet /a/log/webrequest/archive/sampled

The stream is available in Cache log format with a sampling rate of 1:1000 as gzipped files at /a/log/webrequest/archive/sampled/sampled-1000.tsv.log-*.gz on stat1002.eqiad.wmnet (using kafka as backend).

Each file covers the full day of the date in the file name.
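
Because each file here covers one calendar day, selecting files for a date range is a one-to-one mapping. The sketch below assumes that the `*` in the file pattern above is a YYYYMMDD date, as in the udp2log-backed archive; this is not confirmed for this directory. The function name `files_for_range` is hypothetical.

```python
from datetime import date, timedelta

ARCHIVE_DIR = "/a/log/webrequest/archive/sampled"


def files_for_range(first_day, last_day):
    """List the expected file path for each day from first_day to last_day."""
    paths = []
    day = first_day
    while day <= last_day:
        # Assumption: the file name suffix is the day formatted as YYYYMMDD.
        paths.append(f"{ARCHIVE_DIR}/sampled-1000.tsv.log-{day:%Y%m%d}.gz")
        day += timedelta(days=1)
    return paths


for path in files_for_range(date(2015, 1, 31), date(2015, 2, 1)):
    print(path)
```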

Events and known problems since 2015-01-01

Date from Date until Bug Details