Analytics/Data Lake/Edits/Wikidata entity


The wmf.wikidata_entity table (available in Hive) is a conversion of the Wikidata entity JSON dumps to Parquet. In Wikibase (the data model underlying Wikidata), entity information is stored as JSON, so the dumps come in that format. The conversion is needed because JSON makes extensive use of maps (objects with variable keys), which are awkward to query; since the underlying data model is actually well defined, Parquet can represent it with an explicit schema.

New full dumps are generated and copied to the analytics Hadoop cluster every week, then converted to Parquet and added as a new snapshot to the Hive table.

Current Schema

$ hive --database wmf

hive (wmf)> describe wikidata_entity;
OK
col_name	data_type	comment
id                  	string              	The id of the entity, P31 or Q32753077 for instance
typ                 	string              	The type of the entity, property or item for instance
labels              	map<string,string>  	The language/label map of the entity
descriptions        	map<string,string>  	The language/description map of the entity
aliases             	map<string,array<string>>	The language/List-of-aliases map of the entity
claims              	array<struct<id:string,mainSnak:struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>,typ:string,rank:string,qualifiers:array<struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>>,references:array<struct<snaks:array<struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>>,order:array<string>,hash:string>>>>	The claim array of the entity
sitelinks           	array<struct<site:string,title:string,badges:array<string>,url:string>>	The siteLinks array of the entity
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)
# Partition Information	 	 
# col_name            	data_type           	comment             	 	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)
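
The labels, descriptions and aliases fields are map types, so individual languages can be addressed directly with Hive's bracket syntax. A minimal sketch (Q42 and the snapshot date are example values, assuming that snapshot exists):

```sql
-- English label and description of Douglas Adams (Q42)
SELECT
    labels['en'] AS label_en,
    descriptions['en'] AS description_en
FROM wmf.wikidata_entity
WHERE snapshot = '2020-02-10'
  AND id = 'Q42';
```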

Notice the snapshot field. It is a Hive partition, an explicit mapping to the weekly imports in HDFS. You must include a predicate on this partition in the WHERE clause of your queries (even if it is just snapshot > '0'). Partitions allow Hive to reduce the amount of data it must parse and process before returning results. For example, if you are only interested in the 2020-01-20 snapshot, add WHERE snapshot = '2020-01-20'; Hive will then only process data for partitions matching that predicate. You may use partition fields like any normal field, even though their values are not actually stored in the data files.
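
For instance, to see which snapshots are available and then count the entities in one of them (the snapshot date below is an example, assuming it exists on the cluster):

```sql
-- List the available weekly snapshots (Hive partitions)
SHOW PARTITIONS wmf.wikidata_entity;

-- Restrict to a single snapshot so Hive prunes all other partitions
SELECT COUNT(*) AS entity_count
FROM wmf.wikidata_entity
WHERE snapshot = '2020-01-20';
```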

Sample queries

-- Top 10 entities by number of sitelinks
SELECT
    id,
    COUNT(1) AS sitelink_number
FROM wmf.wikidata_entity LATERAL VIEW explode(sitelinks) t AS sitelink
WHERE snapshot = '2020-02-10'
GROUP BY id
ORDER BY sitelink_number DESC
LIMIT 10;

-- Items with a reference URL (P854) pointing to fullfact.org
SELECT id, claim.id, referenceSnak.dataValue.value
FROM wmf.wikidata_entity
LATERAL VIEW explode(claims) t AS claim
LATERAL VIEW OUTER explode(claim.references) t AS reference
LATERAL VIEW OUTER explode(reference.snaks) t AS referenceSnak
WHERE snapshot = '2020-03-02'
  AND typ = 'item'
  AND referenceSnak.property = 'P854'
  AND referenceSnak.dataValue.value RLIKE 'https?://(.*\\.)?fullfact\\.org';
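
Claims can also be filtered on their main snak. Note that dataValue.value is typed as a string in the schema (it holds the JSON-encoded value), so matching an entity id with LIKE as below is a pragmatic sketch rather than an exact parse; P31/Q5 (instance of human) are example values:

```sql
-- A sample of items claimed to be instances of (P31) human (Q5)
SELECT id
FROM wmf.wikidata_entity
LATERAL VIEW explode(claims) t AS claim
WHERE snapshot = '2020-03-02'
  AND typ = 'item'
  AND claim.mainSnak.property = 'P31'
  AND claim.mainSnak.dataValue.value LIKE '%"id":"Q5"%'
LIMIT 10;
```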

Changes and known problems since 2020-02

Date from	Task	Details
2020-02	T209655	Table is created and the first imports and conversions are automated.
