Analytics/Data Lake/Edits/Wikidata item page link

From Wikitech
Jump to navigation Jump to search

The wmf.wikidata_item_page_link table (available on Hive) contains links between a wikidata item and its related wikipedia pages in various languages. It is generated by joining wmf.wikidata_entity site-links to reconstructed localized-namespaced-page-titles from wmf.mediawiki_page_history and wmf_raw.mediawiki_project_namespace_map.

The dataset is fully regenerated every week, following the wmf.wikidata_entity dataset snapshot release.

WARNING: The wmf.mediawiki_page_history dataset being released every month, there is a divergence between sitelinks available and pages available as the month pass. This means links correctness is better when using snapshots early in the month than later in the month.

Current Schema

$ hive --database wmf

hive (wmf)> describe wikidata_item_page_link;
OK
col_name	data_type	comment
item_id             	string              	The wikidata item_id (Q32753077 for instance)
wiki_db             	string              	The db project of the page the wikidata item links to
page_id             	bigint              	The id of the page the wikidata item links to
page_title          	string              	The title of the page the wikidata item links to
page_namespace      	int                 	The namespace of the page the wikidata item links to
page_title_localized_namespace	string              	The title with localized namespace header of the page the wikidata item links to
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment              	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)

Notice the snapshot field. It is a Hive partition, an explicit mapping to weekly import in HDFS. You must include this partition predicate in the where clause of your queries (even if it is just snapshot > '0'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. For example, if are only interested in the 2020-01-20 snaphsot, you should add where snapshot = '2020-01-20'. This will instruct Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.


And entry (like, say, Q1) that has a presence in a number of wikis will have one entry per wiki on this table.

Sample queries

SELECT
    wiki_db,
    page_id,
    COUNT(1) as items_numbers
FROM wmf.wikidata_item_page_link
WHERE snapshot='2020-02-10'
GROUP BY wiki_db, page_id
ORDER BY items_numbers DESC
LIMIT 10;

Changes and known problems since 2020-02

Date from Task Details
2020-02 task T244707 Table is created with first automated snapshots.

See also