Data Platform/Data Lake/Content/Wikidata entity

The wmf.wikidata_entity table (available in Hive) is a conversion of the Wikidata entity JSON dumps into Parquet. In Wikibase (the underlying data model of Wikidata), entity information is stored in JSON, so the dumps are produced in that format. Converting to Parquet provides a better data format, because the JSON makes extensive use of maps (objects with variable keys), which are harder to work with than explicit Parquet structs when the data model is actually well-defined.

New full dumps are generated and copied to the Analytics Hadoop cluster every week, then converted to Parquet and added as a new snapshot in the Hive table.
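
Because claims are stored as a typed array of structs rather than raw JSON, individual snaks can be addressed directly in SQL. Below is a minimal Presto sketch of this (it assumes the 2025-07-07 snapshot used in the examples further down is still available), reading the main snak of every P31 (instance of) claim on Q42:

SELECT
    c.mainSnak.property AS property,
    c.mainSnak.dataValue.value AS value
FROM wmf.wikidata_entity e
CROSS JOIN UNNEST(e.claims) AS c   -- one row per claim
WHERE e.snapshot = '2025-07-07'    -- always filter on the snapshot partition
  AND e.id = 'Q42'
  AND c.mainSnak.property = 'P31';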

Current Schema

$ hive --database wmf

hive (wmf)> describe wikidata_entity;
OK
col_name	data_type	comment
id                  	string              	The id of the entity, P31 or Q32753077 for instance
typ                 	string              	The type of the entity, property or item for instance
datatype            	string              	The data type of the entity when a property
labels              	map<string,string>  	The language/label map of the entity
descriptions        	map<string,string>  	The language/description map of the entity
aliases             	map<string,array<string>>	The language/List-of-aliases map of the entity
claims              	array<struct<id:string,mainSnak:struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>,typ:string,rank:string,qualifiers:array<struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>>,qualifiersOrder:array<string>,references:array<struct<snaks:array<struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>>,snaksOrder:array<string>,hash:string>>>>	The claim array of the entity
sitelinks           	array<struct<site:string,title:string,badges:array<string>,url:string>>	The siteLinks array of the entity
lastrevid           	bigint              	The latest revision id of the entity
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             	 	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)

Notice the snapshot field. It is a Hive partition, an explicit mapping to the weekly imports in HDFS. You must include a predicate on this partition in the WHERE clause of your queries (even if it is just snapshot > '0'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns results. For example, if you are only interested in the 2020-01-20 snapshot, you should add where snapshot = '2020-01-20'. This instructs Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
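
To find out which snapshots are currently available, you can list the partitions of the table; a minimal Hive sketch:

-- List the available weekly snapshots (one partition per import)
SHOW PARTITIONS wmf.wikidata_entity;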

Sample queries

Superset Presto (presto_analytics_hive)

Labels from Item IDs

SELECT 
    id, 
    labels['en'] AS en_label
FROM wmf.wikidata_entity
WHERE id IN ('Q42', 'Q64', 'Q1')
  AND snapshot = '2025-07-07';

Items with the most sitelinks

SELECT
    id,
    COUNT(*) AS sitelink_number
FROM wmf.wikidata_entity,
UNNEST(transform(sitelinks, x -> x.title)) AS t(sitelink)
WHERE snapshot = '2025-06-30'
GROUP BY id
ORDER BY sitelink_number DESC
LIMIT 10;
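
Since sitelinks is a plain array column, the same top-10 list can be computed without the UNNEST by taking the array length; a minimal sketch of that variant using Presto's standard cardinality() function:

SELECT
    id,
    cardinality(sitelinks) AS sitelink_number  -- array length instead of exploding rows
FROM wmf.wikidata_entity
WHERE snapshot = '2025-06-30'
  AND sitelinks IS NOT NULL                    -- skip entities without any sitelinks
ORDER BY sitelink_number DESC
LIMIT 10;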

Items with statements that have a fullfact.org URL as the reference URL

SELECT 
  e.id,
  c.id as claim_id,
  rs.datavalue.value as reference_value
FROM wmf.wikidata_entity e
CROSS JOIN UNNEST(e.claims) AS c
CROSS JOIN UNNEST(c.references) AS r  
CROSS JOIN UNNEST(r.snaks) AS rs
WHERE e.snapshot = '2025-06-30'
  AND e.typ = 'item'
  AND rs.property = 'P854'
  AND rs.datavalue.value LIKE '%fullfact.org%'
LIMIT 1;

The query has a LIMIT 1 clause because the full result currently exceeds the 60-second timeout in Superset.

Hive

Items with the most sitelinks

SELECT
    id,
    COUNT(1) as sitelink_number
FROM wmf.wikidata_entity LATERAL VIEW explode(sitelinks) t AS sitelink
WHERE snapshot='2020-02-10'
GROUP BY id
ORDER BY sitelink_number DESC
LIMIT 10;
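
As with the Presto variant above, the count can also be taken directly from the array length; a minimal Hive sketch using the built-in size() function:

SELECT
    id,
    size(sitelinks) AS sitelink_number  -- array length instead of exploding rows
FROM wmf.wikidata_entity
WHERE snapshot='2020-02-10'
  AND sitelinks IS NOT NULL             -- size() returns -1 for NULL arrays
ORDER BY sitelink_number DESC
LIMIT 10;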

Items with statements that have a fullfact.org URL as the reference URL

SELECT id, claim.id AS claim_id, referenceSnak.datavalue.value AS reference_value
FROM wmf.wikidata_entity
LATERAL VIEW explode(claims) t AS claim
LATERAL VIEW OUTER explode(claim.references) t AS reference
LATERAL VIEW OUTER explode(reference.snaks) t AS referenceSnak
WHERE snapshot='2020-03-02'
AND typ = 'item'
AND referenceSnak.property = 'P854'
AND referenceSnak.datavalue.value RLIKE 'https?:\/\/(.*\.)?fullfact\.org';

Changes and known problems since 2020-02

Date from	Task	Details
2020-02	task T209655	Table is created and first imports and conversions are automated.

See also