Cache log format

From Wikitech
Jump to: navigation, search

The Wikimedia cache log format is a custom format which is used on webserver caches to provide logs in a uniform format.

Format used in cache config files

In general, missing fields are not empty, but get represented by -.

Nginx Format

See squid_combined format of templates/nginx/nginx.conf.erb of operations/puppet. As of 2014-01-25, we use

$hostname $udplog_sequence $udplog_time $request_time $remote_addr -/$status $bytes_sent $request_method $scheme://$host$request_uri NONE/$proxy_host $content_type $http_referer $http_x_forwarded_for $http_user_agent $http_accept_language $sent_http_x_analytics

Documentation of the format specifiers is available in the nginx documentation.

Squid Format

Setting the custom format for the logs relies on the logformat directive available in Squid 2.6. The %sn format code is custom to the Wikimedia Foundation squids. For more information about our squid log setup, see Squid logging.

$hostname %sn %{%Y-%m-%dT%H:%M:%S}tg.%03tu %tr %>a %Ss/%03Hs %<st %rm %ru %Sh/%<A %mt %{Referer}>h %{X-Forwarded-For}>h %{User-Agent}>h %{X-CS}>h

Documentation of the format specifiers is available in squid's logformat documentation.

Squids have been phased out, and the last line in Squid log format got logged on 2013-12-17T15:45:32.764.

Varnishncsa Format

See LOG_FORMAT of modules/varnish/files/varnishncsa.default of operations/puppet. As of 2013-09-30, we use

%l %n %t %{Varnish:time_firstbyte}x %h %{Varnish:handling}x/%s %b %m http://%{Host}i%U%q - %{Content-Type}o %{Referer}i %{X-Forwarded-For}i %{User-agent}i %{Accept-Language}i %{X-Analytics}o

Documentation of the format specifiers is available in the varnishncsa documentation.

If a header is set multiple times (like two User-Agents) for a request, varnishncsa only logs the last one.

Invalid UTF-8 characters are copied byte-wise. So a line need not be a valid UTF-8 string.

Varnishkafka Format

See varnishkafka::instance's format of manifests/role/cache.pp of operations/puppet. As of 2015-01-10, we use

%{fake_tag0@hostname?${::fqdn}}x %{@sequence!num?0}n %{%FT%T@dt}t %{Varnish:time_firstbyte@time_firstbyte!num?0.0}x %{@ip}h %{Varnish:handling@cache_status}x %{@http_status}s %{@response_size!num?0}b %{@http_method}m %{Host@uri_host}i %{@uri_path}U %{@uri_query}q %{Content-Type@content_type}o %{Referer@referer}i %{X-Forwarded-For@x_forwarded_for}i %{User-Agent@user_agent}i %{Accept-Language@accept_language}i %{X-Analytics@x_analytics}o %{Range@range}i

Invalid UTF-8 characters get replaced by 0xEF 0xBF 0xBD (i.e.: Unicode Replacement Character (Codepoint U+FFFD) in UTF-8).

Log entry

Log entries within a log file are separated by a newline character.

Log fields within a log entry are separated by tab character.

As some fields may (and sometimes do) contain tabs (e.g.: “User-Agent” sometimes comes with tabs), they may screw up column numbering. So make sure machinery to read those logs can handle that gracefully.

Fields

Field# in csvs JSON key Description Details Avg. len in bytes for November 2013
1 hostname Hostname of the cache 22
2 sequence Sequence number Per host, this integer number should be strictly increasing by 1 per request. However, it overruns. It may be a negative number. 10
3 dt time stamp The current time in ISO 8601 format, according to the cache's clock
Cache Format Example
Nginx YYYY-MM-DDThh:mm:ss.sss 2013-09-27T06:27:25.310
Squid YYYY-MM-DDThh:mm:ss.sss 2013-09-27T06:27:25.310
Varnishncsa YYYY-MM-DDThh:mm:ss 2013-09-27T06:27:25
Varnishkafka YYYY-MM-DDThh:mm:ss 2013-09-27T06:27:25
19
4 time_firstbyte Request service time
Cache Unit Has fractional part? Example
Nginx seconds yes 0.048
Squid milliseconds no 48
Varnishncsa seconds yes 0.048412345
Varnishkafka seconds yes 0.048412345
10
5 ip Client IP You might want to override this IP with X-Forwarded-For header (see below). 13
6 n/a Cache request status, HTTP status code caching information from this field isn't very useful regardless of if its coming from squid or varnish, on account of being logged from the frontend routing instance, despite most cache hits come from the backend. The field is mostly just useful for the status code, though varnish states like pass may be of interest at times as well.

Squid and varnish use different texts to represent the same situations. For example:

Cache Cache hit from memory combined with “OK” response Cache miss combined with “Found” response
Nginx -/200 -/302
Squid TCP_MEM_HIT/200 TCP_MISS/302
Varnishncsa hit/200 miss/302

For November 2013, we saw 54.05% hits, 38.61% misses, 4.88% passes, 2.42% - for cache results, and for HTTP status codes we saw 82.82% 200, 8.12% 304, 7.72% 302, 0.49% 301, 0.39% 404, and 0.28% 206.

See the cache_status, and http_status fields for the JSON representation of this field.

8
n/a cache_status Cache request status (see notes in the “Cache request status, HTTP status code” row)
n/a http_status Http status code (see notes in the “Cache request status, HTTP status code” row)
7 response_size Reply size
Cache Meaning
Nginx including HTTP headers
Squid including HTTP headers
Varnishncsa excluding HTTP headers
Varnishkafka excluding HTTP headers

Content-Encoding affects this number. So if the payload gets for example gzipped, it contains the size of the gzipped payload, not the uncompressed one.

While in varnishncsa format, both 0, and - are to be expected, varnishkafka represents both codes as 0.

3
8 http_method Request method GET, POST, etc

For November 2013, we saw 99.14% GET, 0.60% POST, 0.22% HEAD, and 0.04% other request methods.

3
9 n/a URL See the uri_host, uri_path, and uri_query fields for the JSON representation of this field 112
n/a uri_host Host of the requested URL This field is the plain host without protocol.
n/a uri_path Path of the requested URL
n/a uri_query Query of the requested URL This field includes the leading ?.
10 n/a Cache hierarchy status, peer IP 2
11 content_type MIME content type 14
12 referer Referer header 57
13 x_forwarded_for X-Forwarded-For header This field should only be respected if the client IP (see above) is a known good IP (i.e.: in $wgSquidServersNoPurge of wmf-config/squid.php of operations/mediawiki-config). 3
14 user_agent User-Agent header This field may (but need not be) URL encoded. As of September 2013, about half of the lines are URL encoded for the Request stream. 108
15 accept_language Accept_Language 16
16 x_analytics X-Analytics General purpose header for instrumentation, which received it's own page at X-Analytics. 1
n/a range Range The Range header of the request.

Example

The client IP address has been replaced with 1.2.3.4 to protect the innocent:

Nginx example

ssl1001 129960997 2014-01-21T08:36:33.097 0.426 1.2.3.4 -/200 12324 GET https://en.wikipedia.org/w/index.php?title=Kirschkuchen&action=edit&section=8 NONE/wikimedia - https://en.wikipedia.org/wiki/Kirschkuchen - Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36 en-US,en;q=0.8 -

Squid example

sq18.wikimedia.org 1715898 2010-12-01T21:57:22.331 0 1.2.3.4 TCP_MEM_HIT/200 13208 GET http://en.wikipedia.org/wiki/Main_Page NONE/- text/html - - Mozilla/4.0%20(compatible;%20MSIE%206.0;%20Windows%20NT%205.1;%20.NET%20CLR%201.1.4322) en-US -

Varnishncsa example

cp1048.eqiad.wmnet 8883921154 2013-09-26T06:28:16 0.001308203 1.2.3.4 hit/200 52362 GET http://en.wikipedia.org/wiki/Free_software - text/html http://en.wikipedia.org/wiki/Freedom 4.3.2.1 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36 en-US,en;q=0.8 -

Varnishkafka example

{"hostname":"cp1067.eqiad.wmnet","sequence":396385264,"dt":"2015-01-10T13:45:21","time_firstbyte":0.000140428,"ip":"2.3.4.5","cache_status":"hit","http_status":"200","response_size":27564,"http_method":"GET","uri_host":"en.wikipedia.org","uri_path":"/wiki/Foo","uri_query":"","content_type":"text/html; charset=UTF-8","referer":"http://en.wikipedia.org/wiki/Bar","x_forwarded_for":"-","user_agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko","accept_language":"en-US","x_analytics":"php=hhvm","range":"-"}

Usage

This format is used by the following streams:

So if you (plan to) change/update the format, please notify the owners of those streams.

Issues

Some issues that you should be aware of when you use these files.

A small number of requests log the host twice, for example:

 http://en.m.wikipedia.orghttp://en.m.wikipedia.org/favicon.ico

Why does this happens and why this is not a problem is explained is this bug: [1]