Analytics/Data Lake/Traffic/Cirrus

Idealized schema

This page documents an idealized schema for the Cirrus search requests table.

dt                      string                  Timestamp at cache in ISO 8601, e.g. "2015-07-25 07:53:52". First field in the existing tarballs.
hostname                string                  Source node hostname, e.g. "mw1168". Second field in the existing tarballs.
source                  string                  The wiki it came from, e.g. "enwiki". Third field in the existing tarballs.
target_index            string                  The target index, e.g. "enwiki_content". In rare cases multiple indexes are requested; open question: can this be an array<string>?
ip                      string                  IP of packet at cache. Will need to be extracted and passed through.
x_forwarded_for         string                  The X-Forwarded-For field. Will need to be extracted and passed through.
search_query            string                  The actual search query.
user_agent              string                  The user agent of the request.
search_type             string                  The type of search request: "full text", "prefix" or NULL. Maintenance tasks probably should not be included here.
total_time              int                     Total time taken.
es_time                 int                     Elasticsearch time taken.
total_results           int                     Total results found.
returned_results        int                     Number of results returned.
result_index            int                     Index of returned results.
search_suggestion       string                  The search suggestion provided; NULL if none.
executor_id             int                     A temporary unique ID identifying the executor, allowing chains of queries to be grouped as a single success or failure.
is_api                  boolean                 A flag identifying whether the request was from the API (true) or web (false).
year                    int                     Unpadded year of request
month                   int                     Unpadded month of request
day                     int                     Unpadded day of request
hour                    int                     Unpadded hour of request

# Partition Information
# col_name              data_type               comment

is_api                  boolean                 A flag identifying whether the request was from the API (true) or web (false).
year                    int                     Unpadded year of request
month                   int                     Unpadded month of request
day                     int                     Unpadded day of request
hour                    int                     Unpadded hour of request
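
The partition columns above are unpadded integers, so a request at 07:53 on 2015-07-25 belongs to month 7 and hour 7, not "07". The following is a minimal Python sketch of that derivation, assuming the dt format shown in the example value. The helper names partition_values and partition_spec, and the key=value path layout, are hypothetical illustrations and not part of any existing job.

```python
from datetime import datetime


def partition_values(dt: str) -> dict:
    """Derive unpadded partition values from a 'YYYY-MM-DD HH:MM:SS' string."""
    ts = datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")
    # datetime attributes are plain ints, so they are unpadded by construction.
    return {"year": ts.year, "month": ts.month, "day": ts.day, "hour": ts.hour}


def partition_spec(dt: str, is_api: bool) -> str:
    """Build a key=value partition path (assumed layout, for illustration only)."""
    values = {"is_api": str(is_api).lower(), **partition_values(dt)}
    return "/".join(f"{key}={value}" for key, value in values.items())


print(partition_spec("2015-07-25 07:53:52", False))
# is_api=false/year=2015/month=7/day=25/hour=7
```

Keeping the values as integers rather than zero-padded strings means range filters such as month >= 7 compare numerically.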

Current schema

For more details, refer to the CirrusSearchRequestSet Avro schema specification.

ts                  int                 Unix timestamp of the request (seconds since the epoch)
wikiid              string              The wiki making the request, e.g. "enwiki"
source              string              Where the request came from, typically "web", "api" or "cli"
identity            string              Hash identifying the requestor: MD5(user agent + X-Forwarded-For + optional string)
ip                  string              IP of packet at cache
useragent           string              The user agent of the request.
backendusertests    array<string>       Lists A/B tests the user is enrolled in.
payload             map<string,string>
requests            array<
                      struct<
                        query:string,
                        querytype:string,
                        indices:array<string>,
                        tookms:int,
                        elastictookms:int,
                        limit:int,
                        hitstotal:int,
                        hitsreturned:int,
                        hitsoffset:int,
                        namespaces:array<int>,
                        suggestion:string,
                        suggestionrequested:boolean,
                        payload:map<string,string>
                      > # /struct
                    > # /array
year                string              Unpadded year of request
month               string              Unpadded month of request
day                 string              Unpadded day of request
hour                string              Unpadded hour of request                  

# Partition Information
# col_name          data_type           comment             
year                string              Unpadded year of request
month               string              Unpadded month of request
day                 string              Unpadded day of request
hour                string              Unpadded hour of request
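
Unlike the idealized schema, the current schema nests the individual searches inside the requests array, so one row can hold several queries. The following is a minimal Python sketch of flattening such a row into one record per request, assuming the row has already been loaded as a plain dict. The function name flatten_requests, the position key, and all values in sample_row are invented for illustration.

```python
from typing import Any, Dict, Iterator


def flatten_requests(row: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per entry in row['requests']."""
    # Carry a couple of row-level fields onto every flattened record.
    shared = {key: row.get(key) for key in ("ts", "identity")}
    for position, request in enumerate(row.get("requests") or []):
        yield {**shared, "position": position, **request}


# Made-up sample row; only the field names come from the schema above.
sample_row = {
    "ts": 1437810832,
    "identity": "0123456789abcdef0123456789abcdef",
    "requests": [
        {"query": "exam", "querytype": "prefix", "hitstotal": 12, "hitsreturned": 10},
        {"query": "example", "querytype": "full text", "hitstotal": 3, "hitsreturned": 3},
    ],
}

flat = list(flatten_requests(sample_row))
for record in flat:
    print(record["position"], record["querytype"], record["hitstotal"])
```

Each flattened record then resembles a row of the idealized schema, with position recording the order of the request within its set.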