Data Platform/Data Lake/Content/Wikidata entity
The wmf.wikidata_entity table (available in Hive) is a conversion of the Wikidata entity JSON dumps into Parquet. In Wikibase (the data model underlying Wikidata), entity information is stored as JSON, so the dumps are produced in that format. A conversion to a better-structured Parquet format is needed because JSON makes extensive use of maps (objects with variable keys), which are less convenient to query in Parquet when the data model is actually well defined.
New full dumps are generated and copied to the Analytics Hadoop cluster every week, then converted to Parquet and added as a new snapshot to the Hive table.
Current Schema
$ hive --database wmf
hive (wmf)> describe wikidata_entity;
OK
col_name        data_type       comment
id              string          The id of the entity, P31 or Q32753077 for instance
typ             string          The type of the entity, property or item for instance
datatype        string          The data type of the entity when a property
labels          map<string,string>      The language/label map of the entity
descriptions    map<string,string>      The language/description map of the entity
aliases         map<string,array<string>>       The language/List-of-aliases map of the entity
claims          array<struct<id:string,mainSnak:struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>,typ:string,rank:string,qualifiers:array<struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>>,qualifiersOrder:array<string>,references:array<struct<snaks:array<struct<typ:string,property:string,dataType:string,dataValue:struct<typ:string,value:string>,hash:string>>,snaksOrder:array<string>,hash:string>>>>     The claim array of the entity
sitelinks       array<struct<site:string,title:string,badges:array<string>,url:string>>        The siteLinks array of the entity
lastrevid       bigint          The latest revision id of the entity
snapshot        string          Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)

# Partition Information
# col_name      data_type       comment
snapshot        string          Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)
Notice the snapshot field. It is a Hive partition: an explicit mapping to the weekly imports in HDFS. You must include a predicate on this partition in the WHERE clause of your queries (even if it is just snapshot > '0'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns results. For example, if you are only interested in the 2020-01-20 snapshot, add WHERE snapshot = '2020-01-20'. This instructs Hive to process only the data in partitions matching that predicate. You may use partition fields as you would any normal field, even though the values are not actually stored in the data files.
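For instance (a minimal sketch, where the snapshot date is only an example), you can list the available snapshots and then restrict a query to a single one:

SHOW PARTITIONS wmf.wikidata_entity;

-- Count the entities in a single weekly snapshot
SELECT COUNT(1) AS entity_count
FROM wmf.wikidata_entity
WHERE snapshot = '2020-02-10';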
Sample queries
Hive
-- Top 10 entities by number of sitelinks
SELECT
id,
COUNT(1) as sitelink_number
FROM wmf.wikidata_entity LATERAL VIEW explode(sitelinks) t AS sitelink
WHERE snapshot='2020-02-10'
GROUP BY id
ORDER BY sitelink_number DESC
LIMIT 10;
-- Claims of items whose references include a reference URL (P854) pointing to fullfact.org
SELECT id, claim.id, referenceSnak.datavalue.value
FROM wmf.wikidata_entity
LATERAL VIEW explode(claims) t AS claim
LATERAL VIEW OUTER explode(claim.references) t as reference
LATERAL VIEW OUTER explode(reference.snaks) t as referenceSnak
WHERE snapshot='2020-03-02'
AND typ = 'item'
AND referenceSnak.property = 'P854'
AND referenceSnak.datavalue.value RLIKE 'https?:\/\/(.*\.)?fullfact\.org';
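The labels, descriptions and aliases maps can be queried directly by language key. As a small sketch (the snapshot value and language key are only examples), the following returns ten items together with their English label:

-- English labels of ten items
SELECT
  id,
  labels['en'] AS label_en
FROM wmf.wikidata_entity
WHERE snapshot = '2020-03-02'
  AND typ = 'item'
  AND labels['en'] IS NOT NULL
LIMIT 10;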
Changes and known problems since 2020-02
Date from | Task | Details
---|---|---
2020-02 | task T209655 | Table is created and first imports and conversions are automated.
See also
- The code that imports and converts the dumps:
- https://github.com/wikimedia/puppet/blob/1f8c823057b72d177187ee508a2e4b8068006516/modules/profile/manifests/analytics/refinery/job/import_wikidata_entities_dumps.pp
- https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/structureddata/jsonparse/JsonDumpConverter.scala
- The item_page_link Hive table generated from this dataset. It contains links between Wikipedia pages and Wikidata items, using wiki_db and page_id as references instead of site links (see the sketch below).
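As a sketch of how the two tables could be combined (the table name wmf.wikidata_item_page_link and its item_id column are assumptions based on the description above; check the actual schema with describe before use), the following joins entities with the pages that link to them:

-- Join entities with their per-wiki pages (table and column names assumed, verify with describe)
SELECT
  e.id,
  e.labels['en'] AS label_en,
  l.wiki_db,
  l.page_id
FROM wmf.wikidata_entity e
JOIN wmf.wikidata_item_page_link l  -- assumed table name
  ON e.id = l.item_id               -- assumed column name
WHERE e.snapshot = '2020-03-02'
  AND l.snapshot = '2020-03-02'
LIMIT 10;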