Wikidata Identifier Landscape


Code Repository: https://github.com/wikimedia/analytics-wmde-WD-WD_identifierLandscape

The Wikidata Identifier Landscape is an RStudio {Shiny} dashboard developed by WMDE in response to the Phabricator task "Analyze and visualize the identifier landscape of Wikidata" (Phab T204440).

The goals of the dashboard are to

  • represent, visually and with exact numbers, the extent to which the Wikidata external identifiers overlap,
  • report how many statements exist for each Wikidata external identifier,
  • provide an overview of the topic areas that the identifiers represent, and
  • show what type of resource they link to.

This technical documentation provides (1) an overview of the essential operations performed to produce the data sets used on the dashboard, and (2) a summary of the visualization approach taken to represent the full complexity of the relations among the Wikidata external identifiers. The dashboard user documentation is provided on Meta as well as on the dashboard itself.

Overview

The system consists of two independent components: (a) the update engine, currently run on stat1007 and the WMF Analytics Cluster, and (b) the CloudVPS component that hosts the dashboard itself.

Each run of the update engine produces public data sets that are downloaded and lightly processed on the client side when the dashboard loads.

Wikidata Identifier Landscape - Overview (Technical)

The Update Engine

The update engine comprises two scripts:

- WD_IdentifierLandscape_Data.R, the governing R script, and

- WD_IdentifierLandscape_Data.py, the PySpark script used to pre-process the WD dump copy in HDFS.

Order of operations:

1. Fetch all external identifiers and the Wikidata classes they belong to from WDQS [R].

2. Run WD_IdentifierLandscape_Data.py to collect item-property (i.e. item-identifier) pairs from the WD JSON dump copy in HDFS [PySpark].

3. Run {data.table} cleaning and re-structuring operations on the resulting data set; produce binary contingency data with xtabs() and crossprod(), and compute the Jaccard distance matrix with sim2() from {text2vec} [R].

4. Run the {Rtsne} implementation of t-distributed stochastic neighbor embedding (t-SNE) to reduce the distance matrix to two dimensions [R].

5. Fetch property usage statistics.

6. Publish crucial statistics and datasets to https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/Wikidata/WD_External_Identifiers/
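
Step 3 above can be illustrated with a small, self-contained sketch. The production code is written in R ({data.table}, xtabs(), crossprod(), {text2vec}); the Python below is only an assumed, simplified equivalent run on toy data, and every function name and sample pair in it is hypothetical. It builds a binary item x identifier matrix, derives the identifier co-occurrence counts as the cross-product of that matrix with itself, and converts the counts to Jaccard distances.

```python
from itertools import product


def binary_contingency(pairs):
    """Build a binary item x identifier matrix from (item, identifier) pairs."""
    items = sorted({item for item, _ in pairs})
    identifiers = sorted({ident for _, ident in pairs})
    row = {item: i for i, item in enumerate(items)}
    col = {ident: j for j, ident in enumerate(identifiers)}
    matrix = [[0] * len(identifiers) for _ in items]
    for item, ident in pairs:
        matrix[row[item]][col[ident]] = 1  # repeated statements count once
    return identifiers, matrix


def cooccurrence(matrix):
    """Cross-product X^T X: entry [a][b] counts items carrying both identifiers."""
    n = len(matrix[0]) if matrix else 0
    return [[sum(r[a] * r[b] for r in matrix) for b in range(n)]
            for a in range(n)]


def jaccard_distance(cooc):
    """Jaccard distance, 1 - |A and B| / |A or B|, from co-occurrence counts."""
    n = len(cooc)
    dist = [[0.0] * n for _ in range(n)]
    for a, b in product(range(n), repeat=2):
        # The diagonal holds each identifier's own usage count.
        union = cooc[a][a] + cooc[b][b] - cooc[a][b]
        dist[a][b] = 1.0 - cooc[a][b] / union if union else 0.0
    return dist


# Toy data (hypothetical): Q1 has P214 and P227, Q2 has P214 only,
# Q3 has P227 and P244.
pairs = [("Q1", "P214"), ("Q1", "P227"), ("Q2", "P214"),
         ("Q3", "P227"), ("Q3", "P244")]
identifiers, matrix = binary_contingency(pairs)
cooc = cooccurrence(matrix)
dist = jaccard_distance(cooc)
```

The dense matrix is the reason this step is memory-bound: with several thousand identifiers and tens of millions of items, the item x identifier table must fit in memory before the cross-product can be taken.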

Note. The current solution uses PySpark to extract the item-identifier pairs from the dump and R {data.table} procedures to produce the final data set, and then computes the binary contingencies and the Jaccard distances in R (in-memory computation on stat1007). This is motivated by the limit on the number of unique values that Apache Spark's crosstab() can handle, whereas a large number of categories must be cross-tabulated here. Future solutions might employ techniques such as self-joins (auto-joins) to keep all data pre-processing in Spark and avoid in-memory processing on the number crunchers.
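
The auto-join idea mentioned in the note can be sketched without any dense matrix: joining the item-identifier table with itself on the item column yields one row per pair of identifiers that share an item, and counting those rows gives the co-occurrences. The plain-Python sketch below only imitates such a join on toy data. It is not Spark code and not the project's implementation, and the function names and sample pairs are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_by_self_join(pairs):
    """Count identifier pairs sharing an item, as a self-join on item would."""
    by_item = defaultdict(set)
    for item, ident in pairs:
        by_item[item].add(ident)  # de-duplicate repeated statements per item

    usage = Counter()   # number of items per identifier
    shared = Counter()  # number of items per unordered identifier pair
    for idents in by_item.values():
        usage.update(idents)
        # Each unordered pair within one item is one matching row of the join.
        shared.update(combinations(sorted(idents), 2))
    return usage, shared


def jaccard_distance(usage, shared, a, b):
    """Jaccard distance between two known identifiers from the sparse counts."""
    if a == b:
        return 0.0
    both = shared[tuple(sorted((a, b)))]  # 0 if the pair never co-occurs
    return 1.0 - both / (usage[a] + usage[b] - both)


# Toy data (hypothetical item-identifier pairs).
pairs = [("Q1", "P214"), ("Q1", "P227"), ("Q2", "P214"),
         ("Q3", "P227"), ("Q3", "P244")]
usage, shared = cooccurrence_by_self_join(pairs)
```

Only pairs that actually co-occur are stored, so the cost grows with the number of identifiers per item rather than with the square of the number of identifiers.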

Public Datasets

The public data sets are hosted at: https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/Wikidata/WD_External_Identifiers/

  • WD_ExtIdentifiers_UpdateInfo.csv – The timestamp of the latest update. The dashboard will be updated manually until the WD JSON dump copy in HDFS is productionized (Phab T209655).
  • WD_ExternalIdentifiers_Co-Occurence.csv – A symmetric identifier x identifier co-occurrence matrix.
  • WD_ExternalIdentifiers_DataFrame.csv – A list of all external identifiers with (a) their P numbers, (b) their labels, (c) the classes to which they belong (in the sense of P31), and (d) the labels of those classes.
  • WD_ExternalIdentifiers_JaccardDistance.csv – A symmetric identifier x identifier Jaccard distance matrix.
  • WD_ExternalIdentifiers_Stats.csv – Essential statistics on WD external identifier usage.
  • WD_ExternalIdentifiers_Usage.csv – Essentially the same data set as WD_ExternalIdentifiers_DataFrame.csv, except that it also includes the identifier usage statistics.
  • WD_ExternalIdentifiers_tsneMap.csv – The coordinates of the 2D t-SNE solution.

The Dashboard (CloudVPS Component)

Following a series of experiments with {ggraph}, {rbokeh}, and {plotly}, {plotly} was chosen for both the semantic maps (2D t-SNE) and the network visualizations, with {igraph} data structures in the background and the {igraph} implementation of the Fruchterman-Reingold algorithm used to lay out large semantic networks. Over 1,000 nodes (WD external identifiers) are visualized, and every effort was made to keep the visualization readable rather than to focus merely on aesthetics.

{DT} is used to produce searchable tables with exact data on identifier usage across WD items and on identifier overlap. Simple SPARQL queries are generated and run against WDQS to fetch exemplars for each external identifier.
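
A query of the kind described above might look like the following sketch. The exact queries issued by the dashboard (which is written in R) are not reproduced here: the query text, the function names, the User-Agent string, and the default of five exemplars are assumptions made for illustration. The endpoint is the public WDQS SPARQL endpoint, and P214 (VIAF ID) serves only as an example property.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def exemplar_query(property_id, limit=5):
    """Build a SPARQL query returning a few items that use the identifier."""
    if not re.fullmatch(r"P[1-9][0-9]*", property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return (
        "SELECT ?item ?value WHERE { "
        f"?item wdt:{property_id} ?value . "
        f"}} LIMIT {int(limit)}"
    )


def exemplar_url(property_id, limit=5):
    """URL that asks WDQS for the query results in SPARQL JSON format."""
    params = {"query": exemplar_query(property_id, limit), "format": "json"}
    return f"{WDQS_ENDPOINT}?{urlencode(params)}"


def fetch_exemplars(property_id, limit=5):
    """Run the query (needs network access); return (item URI, value) pairs."""
    request = Request(
        exemplar_url(property_id, limit),
        # Hypothetical User-Agent; WDQS asks clients to identify themselves.
        headers={"User-Agent": "identifier-landscape-example/0.1"},
    )
    with urlopen(request, timeout=30) as response:
        bindings = json.load(response)["results"]["bindings"]
    return [(b["item"]["value"], b["value"]["value"]) for b in bindings]
```

Calling fetch_exemplars("P214") would return up to five item URIs together with their VIAF IDs. Validating the property ID before interpolating it keeps arbitrary text out of the query string.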

Wikidata Identifier Landscape