Obsolete:Analytics/Data Lake/Traffic/Cirrus
Idealized schema
This page documents an idealized schema for the Cirrus search requests table.
dt                 string    Timestamp at cache in ISO 8601, e.g. "2015-07-25 07:53:52". First field in the existing tarballs.
hostname           string    Source node hostname, e.g. "mw1168". Second field in the existing tarballs.
source             string    The wiki it came from, e.g. "enwiki". Third field in the existing tarballs.
target_index       string    The target index, e.g. "enwiki_content". In some rare cases multiple indexes can be requested; can we have an array of strings here?
ip                 string    IP of packet at cache. This will need to be extracted and passed through.
x_forwarded_for    string    The x_forwarded_for field. Will need to be extracted and passed through.
search_query       string    The actual search query.
user_agent         string    The user agent of the request.
search_type        string    The type of search request it was: "full text", "prefix" or NULL. We actually probably don't want the maintenance tasks in here, do we?
total_time         int       Total time taken.
es_time            int       Elasticsearch time taken.
total_results      int       Total results found.
returned_results   int       Number of results returned.
result_index       int       Index of returned results.
search_suggestion  string    The search suggestion provided; NULL if none.
executor_id        int       A temporary unique ID identifying the executor, allowing us to group chains of queries as a single success or failure.
is_api             boolean   A flag identifying whether the request was from the API (true) or web (false).
year               int       Unpadded year of request
month              int       Unpadded month of request
day                int       Unpadded day of request
hour               int       Unpadded hour of request

# Partition Information
# col_name         data_type comment
is_api             boolean   A flag identifying whether the request was from the API (true) or web (false).
year               int       Unpadded year of request
month              int       Unpadded month of request
day                int       Unpadded day of request
hour               int       Unpadded hour of request
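The year, month, day and hour partition columns are described as unpadded values taken from the request time. As a rough illustration only, the sketch below derives them from a dt string, assuming dt always follows the "2015-07-25 07:53:52" layout given as the example. The function name partition_fields is hypothetical and not part of any existing job.

```python
from datetime import datetime


def partition_fields(dt: str) -> dict:
    """Derive unpadded year/month/day/hour values from a dt string.

    Assumption: dt uses the "YYYY-MM-DD HH:MM:SS" layout shown in the
    schema example; other layouts raise ValueError.
    """
    parsed = datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")
    # Integer attributes are unpadded by construction (7, not "07").
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
    }


print(partition_fields("2015-07-25 07:53:52"))
# {'year': 2015, 'month': 7, 'day': 25, 'hour': 7}
```

Returning integers matches the int partition types in the idealized schema; a table that stores the partitions as strings would need str() applied to each value.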
Current schema
For more details, refer to the CirrusSearchRequestSet Avro schema specification.
ts                 int                  Timestamp of the request, in Unix time
wikiid             string               The wiki making the request, e.g. "enwiki"
source             string               Where the request came from, typically web, api or cli
identity           string               MD5(UA + XFF + Optional String)
ip                 string               IP of packet at cache
useragent          string               The user agent of the request.
backendusertests   array<string>        Lists A/B tests the user is enrolled in.
payload            map<string,string>
requests           array<
                     struct<
                       query:string,
                       querytype:string,
                       indices:array<string>,
                       tookms:int,
                       elastictookms:int,
                       limit:int,
                       hitstotal:int,
                       hitsreturned:int,
                       hitsoffset:int,
                       namespaces:array<int>,
                       suggestion:string,
                       suggestionrequested:boolean,
                       payload:map<string,string>
                     >                  # /struct
                   >                    # /array
year               string               Unpadded year of request
month              string               Unpadded month of request
day                string               Unpadded day of request
hour               string               Unpadded hour of request

# Partition Information
# col_name         data_type            comment
year               string               Unpadded year of request
month              string               Unpadded month of request
day                string               Unpadded day of request
hour               string               Unpadded hour of request
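Because the current schema nests every backend request inside the requests array, per-query analysis usually starts by flattening a row into one record per request. The sketch below shows one possible way to do that on a plain Python dict. The function name flatten_requests and all sample values are hypothetical, and the choice of which top-level fields to copy onto each row is an assumption made for illustration.

```python
def flatten_requests(record: dict) -> list:
    """Return one flat dict per entry in record["requests"].

    Assumption: record is a dict keyed by the column names of the
    current schema; missing keys are treated as None.
    """
    # Top-level fields repeated on every flattened row (illustrative choice).
    shared = {key: record.get(key) for key in ("ts", "identity", "useragent")}
    rows = []
    for req in record.get("requests") or []:
        row = dict(shared)
        row.update({
            "query": req.get("query"),
            "querytype": req.get("querytype"),
            "indices": req.get("indices") or [],
            "tookms": req.get("tookms"),
            "elastictookms": req.get("elastictookms"),
            "hitstotal": req.get("hitstotal"),
            "hitsreturned": req.get("hitsreturned"),
        })
        rows.append(row)
    return rows


# Hypothetical record with made-up values, for illustration only.
sample = {
    "ts": 1437810832,
    "identity": "placeholder-hash",
    "useragent": "ExampleAgent/1.0",
    "requests": [
        {"query": "foo", "querytype": "prefix", "indices": ["enwiki_content"],
         "tookms": 12, "elastictookms": 8, "hitstotal": 40, "hitsreturned": 10},
        {"query": "foo bar", "querytype": "full_text", "indices": ["enwiki_content"],
         "tookms": 95, "elastictookms": 80, "hitstotal": 3, "hitsreturned": 3},
    ],
}

for row in flatten_requests(sample):
    print(row["query"], row["querytype"], row["tookms"])
```

Each flattened row roughly corresponds to one row of the idealized schema, with tookms and elastictookms playing the roles of total_time and es_time.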