Wikidata Concepts Monitor


The Wikidata Concepts Monitor (WDCM) is a system that analyzes and visualizes Wikidata usage across the Wikimedia projects [WDCM Wikidata Project Page|WDCM Dashboards|Gerrit|Diffusion].

WDCM is developed and maintained (mainly) by Goran S. Milovanovic, Data Scientist, WMDE; any suggestions, contributions, and questions are welcome and should be directed to him.

Introduction

This page presents the technical documentation and the important aspects of the system design of the Wikidata Concepts Monitor (WDCM). The WDCM data product is a set of Shiny dashboards, fully developed in R, that provide analytical insight into Wikidata usage across its client projects. In deployment, WDCM runs on the open source version of the RStudio Shiny Server. The WDCM dashboards are hosted on the wikidataconcepts Labs instance and rely on a MariaDB back-end that supports their immediate functionality; however, the WDCM system as a whole also depends on numerous ETL procedures that are run from production (stat1004 and stat1005) and supported by Apache Sqoop and Hadoop, as well as on a set of SPARQL queries that extract pre-defined sets of Wikidata items for further analyses. This document explains the modular design of WDCM and documents the critical procedures; the public code repository that holds the respective procedures is found on Gerrit and Diffusion.

Note: the WDCM Dashboards user manuals are found in the Description section on the respective dashboards (WDCM Overview, WDCM Usage, WDCM Semantics, WDCM Geo).

The WDCM System Operation Workflow: an overview of the WDCM monthly update.

General Approach to the Study of Wikidata usage

While Wikidata itself is a semantic ontology with pre-defined and evolving normative rules of description and inference, Wikidata usage is essentially a social, behavioral phenomenon, suitable for study by means of machine learning in the field of distributional semantics: the analysis and modeling of statistical patterns of occurrence and co-occurrence of Wikidata item and property usage across the client projects (e.g. enwiki, frwiki, ruwiki, etc.). WDCM thus employs various statistical models (e.g. topic modeling, clustering, dimensionality reduction) in an attempt to describe the observable Wikidata usage statistics and provide insights from them, all beyond providing elementary descriptive statistics of Wikidata usage.

Wikidata usage patterns

The “golden line” that connects the reasoning behind all WDCM functions can be described non-technically in the following way. Imagine observing the number of times each item in a set of N particular Wikidata items was used across some project (enwiki, for example). Imagine having the same data for other projects as well: for example, if 200 projects are under analysis, then we have 200 counts for each of the N items in the set, and the data can be described by an N x 200 matrix (items x projects). Each column of counts, representing the frequency of occurrence of all Wikidata entities under consideration in one of the 200 projects, is a vector that represents a particular Wikidata usage pattern. By inspecting and statistically modeling the usage pattern matrix (a matrix that encompasses all such usage patterns across the projects) or the derived covariance/correlation matrix, many insights into the similarities between Wikimedia projects (or, more precisely, the similarities between their usage patterns) can be found.
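The items x projects matrix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `usage_matrix` and `usage_pattern` are hypothetical, the counts are invented, and WDCM itself performs this work in R and HiveQL.

```python
from collections import defaultdict

def usage_matrix(counts):
    """Build an items x projects matrix from (item, project, count) records.

    (item, project) pairs that never occur are filled with 0.
    """
    items = sorted({item for item, _, _ in counts})
    projects = sorted({project for _, project, _ in counts})
    cell = defaultdict(int)
    for item, project, n in counts:
        cell[(item, project)] += n
    matrix = [[cell[(item, project)] for project in projects] for item in items]
    return items, projects, matrix

def usage_pattern(matrix, projects, project):
    """Return one column of the matrix: the usage pattern of a single project."""
    j = projects.index(project)
    return [row[j] for row in matrix]

# Hypothetical counts, invented for illustration only.
records = [
    ("Q5", "enwiki", 120), ("Q5", "dewiki", 80),
    ("Q42", "enwiki", 7), ("Q64", "dewiki", 15),
]
items, projects, matrix = usage_matrix(records)
enwiki_pattern = usage_pattern(matrix, projects, "enwiki")
```

Each column of `matrix` is one project's usage pattern; each row is the pattern of one item across projects.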

In order to provide an illustration for this logic behind all structural WDCM analyses, the following figure presents the Wikidata usage patterns across 14 items categories for some of the largest Wikipedia projects:

WDCM Usage Patterns.

The horizontal axis lists the 14 categories of Wikidata items that are currently tracked by the WDCM system. The vertical axis represents the logarithm of the count of how many times the items from a particular category have been used on a particular Wikipedia. The logarithmic scale is used only to prevent the overcrowding of data points in the plot. Each line connecting the data points presents one Wikidata usage pattern. In this setting, each usage pattern is a characteristic of one particular Wikipedia. However, we could imagine the same plot "transposed", so that each line would represent a category of items and not a project. We would thus obtain the category-specific usage patterns.

The following explanation is a brute simplification of what WDCM does; however, it may serve as a conceptual introduction to the inner workings of this system. From the viewpoint of Wikidata usage in the 14 semantic categories presented in the picture, any Wikipedia project can be described as a vector of 14 numbers, each number standing for the count of how many times items from the respective category have been used in the Wikipedia under consideration. The lines connecting the data points for a particular project in the plot represent exactly those counts (precisely: their logarithms). How can we use this information to assess the similarity in Wikidata usage between any two Wikipedias? The simplest possible approach is to compute the correlation between the respective usage patterns. The following table presents a correlation matrix in which rows and columns stand for projects. The matrix is populated by correlation coefficients (we have used Spearman's ρ coefficient). These coefficients range from -1 (perfect negative correlation) to +1 (perfect positive correlation). A value of zero would mean that there is no monotonic association between the two usage patterns.

          cebwiki  dewiki  enwiki  frwiki  itwiki  ruwiki  tawiki  zhwiki
cebwiki      1.00    0.75    0.82    0.80    0.68    0.73    0.75    0.85
dewiki       0.75    1.00    0.90    0.96    0.92    0.89    0.96    0.88
enwiki       0.82    0.90    1.00    0.92    0.93    0.75    0.88    0.86
frwiki       0.80    0.96    0.92    1.00    0.95    0.87    0.93    0.94
itwiki       0.68    0.92    0.93    0.95    1.00    0.77    0.91    0.86
ruwiki       0.73    0.89    0.75    0.87    0.77    1.00    0.90    0.76
tawiki       0.75    0.96    0.88    0.93    0.91    0.90    1.00    0.83
zhwiki       0.85    0.88    0.86    0.94    0.86    0.76    0.83    1.00

As we can see from this correlation matrix, all diagonal elements, which represent the correlations of the usage patterns of particular projects with themselves, contain ones, as expected: the Wikidata usage pattern of a particular project is maximally self-similar. Looking into any other cell reveals a positive number less than one. Each one represents the correlation between the usage patterns of the Wikipedias in the respective row and column of the matrix. Thus, we can say that dewiki and enwiki (having a correlation of 0.90) are more similar with respect to how they use Wikidata items from the 14 categories under consideration than, say, dewiki and cebwiki (having a correlation of 0.75).
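For readers who want to see how such a matrix can be computed, here is a minimal sketch of Spearman's ρ in plain Python: rank each usage pattern (giving tied values their average rank) and take the Pearson correlation of the ranks. The project names and counts below are hypothetical and are not the data behind the table above; the function names are chosen for illustration.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def correlation_matrix(patterns):
    """Pairwise Spearman correlations for a dict of name -> usage pattern."""
    names = sorted(patterns)
    return names, [[spearman(patterns[a], patterns[b]) for b in names] for a in names]

# Hypothetical usage patterns over five categories, for illustration only.
patterns = {
    "wiki_a": [10, 200, 35, 4000, 7],
    "wiki_b": [12, 180, 50, 3500, 9],    # same rank order as wiki_a
    "wiki_c": [4000, 35, 200, 10, 7],    # a different rank order
}
names, rho = correlation_matrix(patterns)
```

Because Spearman's ρ depends only on ranks, it gives the same result whether it is computed on the counts or on their logarithms (for positive counts).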

Now let's transpose the usage patterns and re-calculate the correlation matrix:

(columns, in order) Architectural Structure; Astronomical Object; Book; Chemical Entities; Event; Gene; Geographical Object; Human; Organization; Scientific Article; Taxon; Thoroughfare; Wikimedia; Work Of Art
Architectural Structure 1.00 0.50 0.79 0.93 0.67 0.93 -0.45 0.81 0.95 0.98 -0.10 0.60 0.86 0.67
Astronomical Object 0.50 1.00 0.90 0.62 0.90 0.62 0.05 0.74 0.55 0.57 0.19 0.81 0.50 0.90
Book 0.79 0.90 1.00 0.81 0.95 0.81 -0.14 0.83 0.79 0.81 0.10 0.76 0.74 0.95
Chemical Entities 0.93 0.62 0.81 1.00 0.76 1.00 -0.38 0.93 0.88 0.95 0.05 0.76 0.93 0.76
Event 0.67 0.90 0.95 0.76 1.00 0.76 -0.17 0.83 0.64 0.69 0.02 0.71 0.69 1.00
Gene 0.93 0.62 0.81 1.00 0.76 1.00 -0.38 0.93 0.88 0.95 0.05 0.76 0.93 0.76
Geographical Object -0.45 0.05 -0.14 -0.38 -0.17 -0.38 1.00 -0.40 -0.24 -0.29 0.43 -0.05 -0.26 -0.17
Human 0.81 0.74 0.83 0.93 0.83 0.93 -0.40 1.00 0.79 0.83 0.14 0.88 0.86 0.83
Organization 0.95 0.55 0.79 0.88 0.64 0.88 -0.24 0.79 1.00 0.98 -0.02 0.67 0.81 0.64
Scientific Article 0.98 0.57 0.81 0.95 0.69 0.95 -0.29 0.83 0.98 1.00 0.00 0.69 0.88 0.69
Taxon -0.10 0.19 0.10 0.05 0.02 0.05 0.43 0.14 -0.02 0.00 1.00 0.48 0.29 0.02
Thoroughfare 0.60 0.81 0.76 0.76 0.71 0.76 -0.05 0.88 0.67 0.69 0.48 1.00 0.71 0.71
Wikimedia 0.86 0.50 0.74 0.93 0.69 0.93 -0.26 0.86 0.81 0.88 0.29 0.71 1.00 0.69
Work Of Art 0.67 0.90 0.95 0.76 1.00 0.76 -0.17 0.83 0.64 0.69 0.02 0.71 0.69 1.00

We have now used the Wikidata usage patterns of particular categories of items across the projects to compute the correlations. Again, all categories are maximally self-similar with respect to how they are used across the projects: look at the diagonal elements. However, we can say that, for example, the way Architectural Structures are used across the Wikipedias under consideration is much more similar to the way Chemical Entities are used (a correlation of 0.93) than to the way the Taxon category is used (a negative correlation of -0.10).
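The "transposition" used here simply swaps the roles of rows and columns, so that each category, rather than each project, owns a usage pattern. A minimal Python sketch, with invented counts and a hypothetical `transpose` helper:

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

# Hypothetical counts, invented for illustration: rows are projects,
# columns are item categories.
projects = ["enwiki", "dewiki"]
categories = ["Human", "Taxon", "Book"]
by_project = [
    [900, 40, 300],   # enwiki
    [700, 55, 210],   # dewiki
]

# After transposition, each row is the usage pattern of one category
# across the projects.
by_category = transpose(by_project)
category_patterns = dict(zip(categories, by_category))
```

The category-specific patterns in `category_patterns` can then be correlated with each other in exactly the same way as the project-specific ones.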

The variety of usage patterns

We can imagine computing the usage patterns across many different variables. For example, we did not have to compare the Wikipedias by how much they make use of Wikidata items from particular item categories; we could have asked instead: how much do they use particular items? In that case we would have to deal with usage patterns that encompass many millions of elements, and not only the fourteen elements that represent the aggregate counts in particular categories. Had we included all Wikimedia projects that have client-side Wikidata tracking enabled, we would have to deal with more than 800 usage patterns, one for every project, and not with only eight as in this example. We could have picked only items from a single Wikidata category, for example all instances of Human (Q5), and then computed the correlations between the usage patterns across all of them; we would again have to deal with usage patterns several million elements long. Every time we change the definition of a usage pattern, we change the goals of the analysis, and this is the first thing to keep in mind when learning about WDCM. We can analyze the similarity between the Wikidata usage patterns for different projects from the viewpoint of only some Wikidata items, or from the viewpoint of a complete category of Wikidata items, or we can analyze only a subset of projects. On many levels of analysis, WDCM changes these "perspectives" of analysis to illustrate the ways in which Wikimedia projects make use of Wikidata in as much detail as possible.

The second thing to keep in mind at this point is that this example is, once again, a brute oversimplification of our methodology. WDCM uses a much more advanced mathematical model to assess the similarities in usage patterns than the correlation matrices that were used in this example. We will later use a few words here and there to provide a conceptual introduction to the methodology used to assess project and category similarity in Wikidata usage, but an interested reader who wants to go under the hood will certainly have to do some reading first. Don't worry, we will list the recommended readings too.

In essence, the technology and mathematics behind WDCM rely on the same set of practical tools and ideas that support the development of semantic search engines and recommendation systems, only applied to a specific dataset that encompasses the usage patterns for tens of millions of Wikidata entities across its client projects.

Motivation

The data obtained in this way, and analyzed properly, allow for inferences about how different communities use Wikidata to build their specific projects, or about the ways in which semantically related collections of entities are used across some set of projects. By knowing this, it becomes possible to develop suggestions on what cooperation among the communities would be fruitful and mutually beneficial in terms of enhancing the Wikidata usage on the respective projects. On the other hand, communities that are focused on some particular semantic topics, categories (sets), sub-ontologies, etc. can advance by recognizing the similarity in their approaches and efforts. Thus, a whole new level of collaborative development around Wikipedia could be achieved. This goal motivates the development of the WDCM system, beyond the obvious possibility to assess data of fundamental scientific importance - for cognitive and data scientists, sociologists of knowledge, AI engineers, ontologists, pure enthusiasts, and many others.

WDCM is designed to answer questions like the following:

  • How much are the particular classes of Wikidata items used across the Wikimedia projects?
  • What are the most frequently used Wikidata items in particular Wikimedia projects or from particular Wikidata sets of items?
  • How can we categorize the Wikimedia projects in respect to the characteristic patterns of Wikidata usage that we discover in them?
  • What Wikimedia projects are similar in respect to how they use Wikidata, overall and from the perspective of some particular sets of items?
  • How is the Wikidata usage of the geolocalized items (such as those relevant for the GLAM initiatives) spatially distributed?

Definitions

Wikidata usage

Wikidata usage analytics

Wikidata usage analytics means: all important and interesting statistics, summaries of statistical models and tests, visualizations, and reports on how Wikidata is used across the Wikimedia projects. The end goal of WDCM is to deliver consistent, high quality Wikidata usage analytics.

Wikidata usage (statistics)

Consider a set of sister projects (e.g. enwiki, dewiki, frwiki, zhwiki, ruwiki, etc; from the viewpoint of Wikidata usage, we also call them: client projects). Statistical count data that represent the frequency of usage of particular Wikidata entities over any given set of client projects are considered to be Wikidata usage (statistics) in the context of WDCM.

[AN IMPORTANT] NOTE on the Wikidata usage definition

The following discussion relies on the understanding of the Wikibase Schema, especially the wbc_entity_usage table schema (a more thorough explanation of Wikidata item usage tracking in the wbc_entity_usage tables is provided on Phabricator). The methodological discussion of the development of Wikidata usage tracking in relation to this schema is also found on Phabricator.

A strict, working, operational definition of Wikidata usage data is still under development. The problem with its development is of a technical nature and related to the current logic of the wbc_entity_usage table schema. This table is found on the MariaDB replicas in the database of any project that has client-side Wikidata usage tracking enabled.

The “S”, “T”, “O”, and “X” usage aspects

The problematic field in the current wbc_entity_usage schema is eu_aspect. With its current definition, this field makes it possible to select in a non-redundant way only the “S”, “O”, and “T” entity usage aspects; meaning: on any given sister project that maintains client-side Wikidata usage tracking, an “S”, “O”, or “T” occurrence of any given Wikidata entity signals one and only one entity usage in the respective aspect on that project (i.e. these aspects are non-overlapping in their registration of Wikidata usage). However, while “S”, “O”, and “T” do not overlap with each other, they may overlap with the “X” usage aspect. Excluding the “X” aspect from the definition is not possible either: ignoring it implies that the majority of relevant usage, e.g. usage in infoboxes, would not be tracked (accessing statement data via Lua is typically tracked as “X”).

The “L” aspects problem: tracking the fallback mechanism

The “L” aspects, usually modified by a specific language modifier (e.g. “L.de”, “L.en”, and similar), cannot currently be counted in a non-redundant way. This is a consequence of the way the wbc_entity_usage table is populated when the language fallback mechanism is triggered. To explain the language fallback mechanism briefly: let the language fallback chain for a particular language be “L.de-ch” → “L.de” → “L.en”. That implies the following: if the usage of an item label in Swiss German (“L.de-ch”) is attempted and no label in Swiss German is found, an attempt to use the German label (“L.de”) is made, and if that attempt fails too, an attempt at the English label (“L.en”) is made in the end. However, if the language fallback mechanism is triggered on a particular entity usage occasion, all L aspects in that fallback chain will be registered in the wbc_entity_usage table as if they were used simultaneously. From the viewpoint of Wikidata usage, it would be interesting to track (a) the attempted, i.e. the user-intended, L aspect, or (at least) (b) the actually used L aspect for a given entity usage. However, the current design of the wbc_entity_usage table does not provide for an assessment of either of these possibilities.
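The difference between the attempted, the actually used, and the registered L aspects can be made concrete with a small Python sketch. It is a toy model of the behavior described above, not the Wikibase implementation: `resolve_label` is a hypothetical helper, and the rule that only the attempted aspect is registered when no fallback occurs is an assumption made for the illustration.

```python
def resolve_label(labels, chain):
    """Walk a language fallback chain and return (attempted, used, recorded).

    attempted: the language that was asked for (first element of the chain)
    used:      the first language in the chain that has a label, or None
    recorded:  the L aspects registered for this usage. Following the text,
               the whole chain is registered once fallback is triggered;
               registering only the attempted aspect otherwise is an
               assumption of this sketch.
    """
    attempted = chain[0]
    used = next((lang for lang in chain if lang in labels), None)
    if used == attempted:
        recorded = ["L." + attempted]
    else:
        recorded = ["L." + lang for lang in chain]
    return attempted, used, recorded

# An item with German and English labels but no Swiss German label.
labels = {"de": "Berlin", "en": "Berlin"}
chain = ["de-ch", "de", "en"]
attempted, used, recorded = resolve_label(labels, chain)
```

In this example the registered aspects are `L.de-ch`, `L.de`, and `L.en`, so neither the attempted language (`de-ch`) nor the used one (`de`) can be recovered from the registered aspects alone.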

Finally, there are other uncertainties related to the current design of the wbc_entity_usage table. For example, imagine an editor action that results in a presence of a particular item, with a sitelink, instantiating a label in a particular language at the same time. How many item usage counts do we have: one, two, or more (one “S” aspect count for the sitelink, and at least another for a specific “L” aspect count)?

In conclusion, if Wikidata usage statistics are to encompass all the different ways in which an item usage could be defined, by mapping onto all possible editor actions that instantiate a particular item on a particular page, the design of the wbc_entity_usage table would have to undergo a thorough revision, or a new Wikidata usage tracking mechanism would have to be developed from scratch. The wbc_entity_usage table was never designed for analytical purposes in the first place; however, it is the only source for Wikidata usage statistics that we can currently rely on.

A proposal for an initial solution:

- [NOTE] This is the current Wikidata usage definition in the context of WDCM.

From the existing wbc_entity_usage schema, it seems possible to rely on the following definition. For the initial version of the WDCM system, use a simplified definition of Wikidata usage that excludes the cases of multiple usage of an item per page; in effect:

  • count on how many pages a particular Wikidata item occurs in a project;
  • take that as a Wikidata usage per-project statistic;
  • ignore usage aspects completely until a proper tracking of usage per-page is enabled in the future.

By "proper tracking of usage per-page" the following is meant:

  • a methodology that counts exactly how many usage cases of a particular item there are on a particular page in a particular project.
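The simplified definition above amounts to counting distinct pages per item per project while ignoring the usage aspect. A minimal Python sketch, assuming rows shaped like the (eu_entity_id, eu_aspect, eu_page_id, wiki_db) fields described on this page; the function name and the sample rows are hypothetical:

```python
from collections import defaultdict

def pages_per_item(rows):
    """Count, per (project, item), the number of distinct pages using the item.

    Each row is (eu_entity_id, eu_aspect, eu_page_id, wiki_db). The aspect is
    deliberately ignored, so several aspects of the same item on the same page
    count as a single usage.
    """
    pages = defaultdict(set)
    for entity_id, _aspect, page_id, wiki_db in rows:
        pages[(wiki_db, entity_id)].add(page_id)
    return {key: len(page_ids) for key, page_ids in pages.items()}

# Hypothetical rows, for illustration only.
rows = [
    ("Q5", "S", 1, "enwiki"),
    ("Q5", "X", 1, "enwiki"),      # same page, different aspect: not recounted
    ("Q5", "L.en", 2, "enwiki"),
    ("Q42", "T", 1, "enwiki"),
    ("Q5", "S", 9, "dewiki"),
]
usage = pages_per_item(rows)
```

Here Q5 counts as used on two enwiki pages, although three rows mention it.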

WDCM Taxonomy

The WDCM Taxonomy presents a human choice of specific categories and items from the Wikidata ontology that are submitted to WDCM for analytics.

Currently, only one WDCM Taxonomy is specified (Lydia Pintscher, 05/03/2017, Berlin).

The fact that the WDCM relies on a specific choice of taxonomy implies that not all Wikidata items are necessarily tracked and analyzed by the system.

Users of WDCM can specify any imaginable taxonomy that presents a proper subset of Wikidata; no components of the WDCM system are dependent upon any characteristics of some particular choice of taxonomy.

Once defined, the WDCM taxonomy is translated into a set of (typically very simple) SPARQL queries that are used to collect the respective item IDs; only collected items will be tracked and analyzed by the system.
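As an illustration of how a taxonomy entry might be translated into a SPARQL query, the sketch below builds a query string for one class. The query shape (instances of the class or of any of its subclasses, via `wdt:P31/wdt:P279*`) is a common Wikidata pattern and an assumption here; the queries WDCM actually uses may differ. The `category_query` helper is hypothetical, and only the Human (Q5) mapping is taken from this page.

```python
def category_query(class_qid):
    """Build a SPARQL query selecting the items that belong to one class.

    Assumption: membership means 'instance of the class or of any subclass',
    expressed with the P31/P279* property path.
    """
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError("expected a Wikidata item ID such as 'Q5'")
    return "SELECT ?item WHERE { ?item wdt:P31/wdt:P279* wd:%s . }" % class_qid

# One taxonomy entry; Human (Q5) is the example mentioned on this page.
taxonomy = {"Human": "Q5"}
queries = {name: category_query(qid) for name, qid in taxonomy.items()}
```

Validating the item ID before interpolating it keeps malformed taxonomy entries from producing a broken query.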

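The 14 currently encompassed item categories are: Architectural Structure, Astronomical Object, Book, Chemical Entities, Event, Gene, Geographical Object, Human, Organization, Scientific Article, Taxon, Thoroughfare, Wikimedia, and Work Of Art.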

The WDCM Taxonomy is still undergoing refinement. Ideally, category overlap would be avoided completely; this is not yet the case, and it is questionable whether it is possible as a general solution at all, given the structure of Wikidata. The following directed graph shows the current item categories in the WDCM taxonomy and the network of P279 (Subclass Of) relations in which they play a role. The structure was obtained by a recursive search through the P279 paths, starting from entity (Q35120) and descending from it to a depth of 4 (searching for sub-classes of sub-classes of sub-classes, etc.). Some item categories from the WDCM Taxonomy are not found even at sub-class depth 4 from entity, although entity is the necessary target item of P279 (on a recursive path from anything, of course). Note: the only cycle in the graph is Entity → Entity.
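A depth-limited search of this kind can be sketched as a breadth-first traversal. The graph below is a toy with placeholder class names (they are not real Wikidata classes), and `subclasses_within_depth` is a hypothetical helper; the real search would fetch the P279 relations from Wikidata.

```python
from collections import deque

def subclasses_within_depth(children, root, max_depth):
    """Breadth-first search down subclass links, stopping at max_depth.

    children maps a class to the classes that are its direct subclasses.
    Returns a dict: class -> depth at which it was first reached.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not descend below the depth limit
        for child in children.get(node, ()):
            if child not in depth:  # also guards against cycles
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth

# Toy subclass graph with placeholder names, including a self-loop on the root.
children = {
    "entity": ["entity", "A", "B"],
    "A": ["C"],
    "C": ["D"],
    "D": ["E"],
    "E": ["F"],   # F sits at depth 5 and is therefore not reached
}
reached = subclasses_within_depth(children, "entity", 4)
```

A class such as `F` in the toy graph, which lies deeper than the limit, is simply absent from the result, which mirrors the categories that are not found at depth 4.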

WDCM Taxonomy, P279 structure down to depth 4 from entity (Q35120).

WDCM Data Schemata

The WDCM data schemata encompass three components:

  • HDFS, Big Data component (Production, Analytics Cluster)
  • The RDBS Component (Cloud VPS, MariaDB)
  • Cloud VPS Instance Local Data Component

The first, (1) the HDFS Big Data component, is produced by (A) an R-orchestrated Apache Sqoop cycle, which transfers many wbc_entity_usage tables from MariaDB in production to a single Hive table in Hadoop on the Analytics Cluster, and (B) an ETL and machine learning cycle that uses R packages and HiveQL interchangeably to produce the data sets for (2) the RDBS component on tools.labsdb and (3) the Local Data Component, stored as .csv files on the wikidataconcepts.eqiad.wmflabs instance.

The details of this process are given below (WDCM System Operation Workflow). All WDCM public data sets are produced from the HDFS Big Data component and are publicly available from https://analytics.wikimedia.org/datasets/wdcm/; the WDCM Dashboards operate on the very same files.

HDFS, Big Data component (Production)

This component encompasses the following two Hive tables, currently in the goransm Hadoop database:

wdcm_clients_wb_entity_usage

This table is the result of the WDCM_Sqoop_Clients.R script, which runs a regular weekly Apache Sqoop update to collect the data from all client projects that maintain the wbc_entity_usage table, which means that they have client-side Wikidata usage tracking enabled. This Hive table presents the raw WDCM data set; it is simply the product of sqooping many big MariaDB tables, which are not suited for analytics queries, into Hadoop. This Hive table is used only to produce the wdcm_maintable in Hive, which is then used in all WDCM pre-processing operations.

goransm.wdcm_clients_wb_entity_usage
col_name      data_type  comment
eu_row_id     bigint     row identifier
eu_entity_id  string     Wikidata item ID, e.g. Q5
eu_aspect     string     eu_aspect, see wbc_entity_usage schema
eu_page_id    bigint     the ID of the page where the item is used
wiki_db       string     partition; the project database, e.g. "enwiki", "dewiki"

wdcm_maintable

The main WDCM data set, a Hive table produced by the WDCM_Engine_goransm.R script from the wdcm_clients_wb_entity_usage table. All WDCM data sets are produced from this Hive table by the WDCM_Engine_goransm.R script.

goransm.wdcm_maintable
col_name      data_type  comment
eu_entity_id  string     Wikidata item ID, e.g. Q5
eu_project    string     the project database, e.g. "enwiki", "dewiki"
eu_count      bigint     the WDCM statistic: how many different pages make use of this eu_entity_id in this eu_project?
category      string     partition; to which category of Wikidata items from the WDCM Taxonomy does this eu_entity_id belong?

The RDBS component (Cloud VPS)

This component is supported by MariaDB, with many SQL tables in the u16664__wdcm_p database on tools.labsdb. All these tables are produced by the WDCM_Process.R script that is run from the wikidataconcepts.eqiad.wmflabs Cloud VPS (i.e. Labs) instance. The same instance serves the WDCM front-end. All tables are currently produced by the {RMySQL} package; however, that will change in the future (by performing direct system() calls to MariaDB).

wdcm2_category: the WDCM usage statistics aggregated per WDCM semantic category.

wdcm2_category data type
category text
eu_count bigint(20)

wdcm2_category_item100: the 100 most frequently used Wikidata items per WDCM semantic category, including item labels.

wdcm2_category_item100 data type
eu_entity_id varchar(255)
eu_count int(11)
category varchar(255)
eu_label varchar(255)
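The selection behind a table like this one is a "top N per group" operation. A minimal Python sketch, with a hypothetical `top_items_per_category` helper and invented counts (the real tables are produced by the WDCM R scripts):

```python
import heapq
from collections import defaultdict

def top_items_per_category(rows, n=100):
    """Return, per category, the n most used items as (item, count) pairs.

    Each row is (eu_entity_id, category, eu_count). Ties on the count are
    broken by item ID, which is an arbitrary choice made for this sketch.
    """
    by_category = defaultdict(list)
    for entity_id, category, count in rows:
        by_category[category].append((count, entity_id))
    return {
        category: [(entity_id, count)
                   for count, entity_id in heapq.nlargest(n, pairs)]
        for category, pairs in by_category.items()
    }

# Hypothetical rows, for illustration only.
rows = [
    ("Q1", "Human", 5), ("Q2", "Human", 9), ("Q3", "Human", 1),
    ("Q4", "Taxon", 3),
]
top2 = top_items_per_category(rows, n=2)
```

With `n=100` the same function would yield the content of a per-category top-100 table, minus the item labels.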

wdcm2_category_project_2dmap: the t-SNE 2D reduction of the WDCM Semantic Category x Wikimedia Projects matrices. Representation is by category.

wdcm2_category_project_2dmap data type
D1 double
D2 double
category text

wdcm2_itemtopic_<category_name>: generic, as many tables as there are WDCM semantic categories. The Wikidata Item x Topics matrices obtained from Latent Dirichlet Allocation across the 5,000 most frequently used items per WDCM semantic category. Each category receives one table, for example: wdcm2_itemtopic_Taxon. The number of topic fields depends on the particular LDA topic model. Generated initially by the WDCM_Engine_goransm.R script; only copied from Production and stored as SQL tables.

wdcm2_itemtopic_<category> data type
eu_entity_id text
topic1 double
topic2 double
topic3 double
... double
topicN double
eu_label text

wdcm2_project: the WDCM total usage statistics per Wikimedia project.

wdcm2_project data type
eu_project text
eu_count bigint(20)
projectype text

wdcm2_project_category: the Wikimedia Project x WDCM Semantic Category usage statistics cross-tabulation.

wdcm2_project_category data types
eu_project varchar(255)
category varchar(37)
eu_count int(11)
projecttype varchar(255)

wdcm2_project_category_2dmap: the t-SNE 2D reduction of the WDCM Semantic Category x Wikimedia Projects matrices. Representation is by project.

wdcm2_project_category_2dmap data type
D1 double
D2 double
projects text
projecttype text

wdcm2_project_category_item100: the 100 most frequently used Wikidata items per WDCM semantic category per Wikimedia project.

wdcm2_project_category_item100 data type
eu_project varchar(255)
category varchar(255)
eu_entity_id varchar(37)
eu_count int(11)
projecttype varchar(255)
eu_label varchar(255)

wdcm2_project_item100: the 100 most frequently used Wikidata items per project, with item labels included.

wdcm2_project_item100 data type
eu_project varchar(255)
eu_entity_id varchar(37)
eu_count int(11)
projecttype varchar(255)
eu_label varchar(255)

wdcm2_projects_2dmaps: the t-SNE 2D reductions of the Wikimedia Projects x Topics matrices. These matrices are obtained in WDCM_Engine_goransm.R from Latent Dirichlet Allocation, performed over the Wikidata Item x Wikimedia Project matrices across the 5,000 most frequently used items per WDCM semantic category. Generated initially by the WDCM_Engine_goransm.R script; only copied from Production and stored as SQL tables. These are the most precise WDCM descriptions of usage pattern structures across the Wikimedia projects.

Cloud VPS Instance Local Data Component

Sometimes, for smaller data sets, it is more efficient to load the data with data.table::fread() than to fetch it from MariaDB. Thus the WDCM system stores some of the data sets locally as .csv files, and the dashboards load them directly from the respective directories.

The following .csv tables are produced in production (stat1005) from WDCM_Engine_goransm.R and then copied to the wikidataconcepts.eqiad.wmflabs instance where they are loaded directly from R to support the WDCM Dashboards:

wdcm2_projecttopic_<category_name>.csv

  • the Wikimedia Projects x Topics matrices obtained from Latent Dirichlet Allocation across the 5,000 most frequently used items per WDCM semantic category

wdcm2_visNetworkNodes_project_<category_name>.csv

wdcm2_visNetworkEdges_project_<category_name>.csv

Besides these data frames, a specific set of files used by the WDCM Geo Dashboard is also stored locally on the Cloud VPS wikidataconcepts.eqiad.wmflabs instance:

wdcm_geoitem_<geolocalized_item_category>.csv

WDCM System Operation Workflow

The following schema represents the WDCM System Operation Workflow. We will proceed by explaining component by component.

WDCM System Operation Workflow
  • The first phase is performed by the WDCM_Sqoop_Clients.R script, run on a regular weekly schedule from stat1004. This R script orchestrates Apache Sqoop operations to (1) transfer the many (currently more than 800) wbc_entity_usage SQL tables from the m2 MariaDB replica in order to produce (2) the Hadoop/Hive wdcm_clients_wb_entity_usage table in the goransm database, partitioned by project database (wiki_db, see the table schema above), where they can be processed. The first phase typically takes 7-8 hours to complete.
  • The second phase (3 - 6) is performed by the WDCM_Engine_goransm.R from stat1005, run on a regular monthly basis. This script presents, in many respects, the central WDCM ETL and computing engine:
    • in the first step, this script will (3) load the current WDCM Taxonomy to determine what Wikidata item classes need to be fetched from Wikidata via the SPARQL endpoint;
    • in the second step, (4) many millions of Wikidata items are selected and their IDs fetched from the SPARQL endpoint, in order to determine what information to search for in the previously produced wdcm_clients_wb_entity_usage Hive table;
    • in the third step, (5) many HiveQL cycles of batch processing are performed in order to aggregate the WDCM statistics per Wikimedia project and per Wikidata item to produce the wdcm_maintable, partitioned by WDCM semantic category (see the table schema above);
    • in the fourth step, (6a) ETL steps are performed in HiveQL over the wdcm_maintable to produce various aggregate WDCM data sets;
      • in the fifth step, Wikimedia Project x Wikidata items matrices - one matrix per WDCM semantic category - are submitted to topic modeling by the R {maptpx} MAP estimation of the Latent Dirichlet Allocation model; while {maptpx} allows for a rapid estimation of the LDA topic models in comparison to other algorithms, this step will soon be replaced by running LDA from Apache Spark in order to gain additional processing efficiency and to enable cross-validation procedures;
      • in the sixth step, Wikimedia project x Semantic topics matrices are submitted to an {Rtsne} implementation of the t-Distributed Stochastic Neighbor Embedding 2D dimensionality reduction to support visualizations on the WDCM Dashboards, and the coordinates of the respective 2D representations are stored;
      • in the seventh step, data frames for network visualizations with R {visNetwork} are prepared and stored;
    • in the final step (6b) all data sets are made public from /srv/published-datasets/wdcm on stat1005, which is mapped to https://analytics.wikimedia.org/datasets/wdcm/ for open access. The second phase - the operation of the WDCM_Engine_goransm.R script from stat1005 - typically takes 32 - 33 hours to complete.
  • The third and the final phase begins when (7) the regular hourly scheduled check for timestamp changes in the WDCM public data sets from the wikidataconcepts.eqiad.wmflabs Cloud VPS instance determines that a new WDCM update is ready and (8) starts the execution of the WDCM_Process.R script:
    • This script (9a) populates the SQL tables in the u16664__wdcm_p MariaDB database and (9b) copies some of the WDCM public data sets into local directories, which in turn support the WDCM Shiny Dashboards directly. This operation typically takes around 20 minutes to complete. Item labels are fetched from the wb_terms table (Wikibase/Schema/wb_terms) in this phase only, in order to avoid overloading the SPARQL endpoint in the previous phases.
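The hourly check in step (7) boils down to comparing the timestamps of the published data sets with the ones seen at the last update. A minimal Python sketch of that comparison; the file names, the timestamps, and the `changed_datasets` helper are hypothetical, and the way WDCM actually stores and compares timestamps is not documented here.

```python
def changed_datasets(last_seen, published):
    """Return the names of data sets that are new or have a newer timestamp.

    Both arguments map a data set name to a Unix timestamp.
    """
    return sorted(
        name for name, stamp in published.items()
        if name not in last_seen or stamp > last_seen[name]
    )

# Hypothetical file names and timestamps, for illustration only.
last_seen = {"a.csv": 1000, "b.csv": 1000}
published = {"a.csv": 1000, "b.csv": 2000, "c.csv": 1500}

to_refresh = changed_datasets(last_seen, published)
run_process = bool(to_refresh)  # True would trigger the processing script
```

Only when `run_process` is true would a scheduler go on to start the processing script (step 8).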

The process illustrated here does not encompass the WDCM engine update of the WDCM Geo Dashboard, which is run by a separate (WDCM_EngineGeo_goransm.R) script from stat1005, relying on the same WDCM Hive tables as the main engine update.

WDCM Dashboards

The Dashboards module is a set of RStudio Shiny dashboards that serve the Wikidata usage analytics to its end-users.

This set of Shiny dashboards relies on the WDCM Database (u16664__wdcm_p on tools.labsdb) to serve Wikidata usage analytics; the database is obtained directly from the WDCM_Process.R script.

Currently, the WDCM System runs five dashboards:

  • WDCM Overview, providing an elementary overview - the "big picture" - of Wikidata usage;
  • WDCM Usage, providing detailed usage statistics;
  • WDCM Semantics, providing insights from the topic models derived from the usage data;
  • WDCM Geo, providing interactive maps of geolocalized Wikidata items alongside the respective WDCM usage statistics;
  • WDCM (S)itelinks, providing detailed insights into the structure of Wikidata usage across a selection of Wikipedia projects.

All WDCM Dashboards are documented in their respective Description sections. A walk-through with illustrative usage examples is provided on the WDCM Project page.

WDCM (S)itelinks Dashboard

The WDCM (S)itelinks dashboard

  • analyzes only the sitelinks usage aspect from the wbc_entity_usage table (in the Wikibase schema), and
  • takes into account only mature Wikipedia projects (in terms of Wikidata usage)

in order to obtain and present an overview of the structure of Wikidata (S)itelinks usage across the Wikipedias that is as broad and as clear as possible.

The dashboard's update engine is (currently) run from stat1007 and encompasses (a) R and HiveQL orchestration from R to obtain the necessary data from the wdcm_clients_wb_entity_usage table, and (b) {maptpx} topic modeling and various other R packages to produce the data sets that are used to obtain the visualizations of the Wikidata usage structure on the dashboard itself. The dashboard is developed in RStudio Shiny and runs on the Shiny Server from the wikidataconcepts.eqiad.wmflabs CloudVPS instance.

Dashboard Update Engine: schedule, wrangling, and modeling procedures

Updates are run at 00:00 UTC on the 2nd, 8th, 15th, 21st, and 28th of each month, in each case one day after the completion of the WDCM_Sqoop_Clients.R runs from stat1004 on the 1st, 7th, 14th, 20th, and 27th of the month. Thus there are five dashboard updates each month. Each update engine run takes approximately eight to nine hours to complete, with machine learning procedures (LDA) accounting for a large fraction of the runtime.

Filtering out Wikidata item use cases. A sitelink usage of a Wikidata item on a project with the Wikibase Client extension installed is recorded in the client's wbc_entity_usage table when "... a client page [...] is connected to an item via an incoming sitelink, but does not access any data of the item directly".

As in any WDCM dashboard, not all Wikidata classes are considered; only the following WDCM semantic classes are taken into account:

  • Human (human (Q5))
  • Work of Art (work of art (Q838948))
  • Scientific Article (scientific article (Q13442814))
  • Book (book (Q571))
  • Geographical Object (geographical object (Q618123))
  • Organization (encompassing company (Q783794), club (Q988108), and organization (Q43229))
  • Architectural Structure (encompassing monument (Q4989906) and building (Q41176))
  • Gene (gene (Q7187))
  • Chemical Entities (encompassing chemical element (Q11344), chemical compound (Q11173), and chemical substance (Q79529))
  • Astronomical Object (astronomical object (Q6999))
  • Taxon (taxon (Q16521))
  • Event (event (Q1656682))
  • Thoroughfare (thoroughfare (Q83620))

Filtering out projects and semantic categories. In order to obtain a comprehensible picture of Wikidata items sitelink usage, we apply the following set of criteria for project and semantic category retention in the analyses:

  1. a formal check is first performed to filter out all projects that are not present in the List of Wikipedias;
  2. the total Wikidata usage per project is computed by summing up all sitelink item use cases per project, and then only projects with above median total Wikidata usage are considered;
  3. only projects that make use of at least 10 WDCM semantic categories listed above are kept;
  4. only semantic categories with 100 or more items that are currently used across all selected projects are retained;
  5. only items that are used in at least 10% of the selected projects are kept from each semantic category;
  6. only the 1000 most frequently used items are kept for all purposes of category-specific topic modeling.
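
Criteria 2 - 6 above can be pictured as successive filters over a long table of sitelink usage counts. The following R sketch is an illustration only, not the update engine's code: the function name `filter_usage()`, the column names, and the toy data are hypothetical, and criterion 4 is interpreted here as counting the distinct items in use across the selected projects. The defaults mirror the thresholds in the list; the toy call relaxes them so that a tiny data set survives.

```r
# - hypothetical sketch of selection criteria 2 - 6
# - usage: data frame with columns project, category, item, count
filter_usage <- function(usage, min_categories = 10, min_items = 100,
                         min_project_share = 0.1, top_n = 1000) {
  # - criterion 2: projects with above-median total usage
  totals <- tapply(usage$count, usage$project, sum)
  usage <- usage[usage$project %in% names(totals)[totals > median(totals)], ]
  # - criterion 3: projects using at least min_categories categories
  n_cat <- tapply(usage$category, usage$project, function(x) length(unique(x)))
  usage <- usage[usage$project %in% names(n_cat)[n_cat >= min_categories], ]
  # - criterion 4: categories with at least min_items distinct items in use
  n_items <- tapply(usage$item, usage$category, function(x) length(unique(x)))
  usage <- usage[usage$category %in% names(n_items)[n_items >= min_items], ]
  # - criterion 5: items used in at least min_project_share of the projects
  n_projects <- length(unique(usage$project))
  key <- paste(usage$category, usage$item)
  share <- tapply(usage$project, key, function(x) length(unique(x))) / n_projects
  usage <- usage[key %in% names(share)[share >= min_project_share], ]
  # - criterion 6: the top_n most frequently used items per category
  kept <- lapply(split(usage, usage$category), function(d) {
    freq <- tapply(d$count, d$item, sum)
    top <- names(freq)[order(freq, decreasing = TRUE)]
    top <- top[seq_len(min(top_n, length(top)))]
    d[d$item %in% top, ]
  })
  do.call(rbind, kept)
}

# - toy data (made-up projects, items, and counts)
usage <- data.frame(
  project  = c("awiki", "awiki", "awiki", "bwiki", "bwiki", "bwiki", "cwiki", "dwiki"),
  category = c("Human", "Human", "Book", "Human", "Human", "Book", "Human", "Book"),
  item     = c("Q1", "Q2", "Q3", "Q1", "Q2", "Q3", "Q1", "Q3"),
  count    = c(10, 5, 4, 8, 1, 6, 2, 1),
  stringsAsFactors = FALSE
)
selected <- filter_usage(usage, min_categories = 2, min_items = 1,
                         min_project_share = 0.5, top_n = 1)
```

In the toy call, cwiki and dwiki are dropped by the median criterion, and only the single most used item per category is retained.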

The sixth selection criterion was introduced following numerous experimental studies in the application of Latent Dirichlet Allocation to topic modeling of Wikidata classes. These studies have confirmed the following: due to the highly skewed, Zipfian distribution of Wikidata item usage, selecting a small fraction of items from the full term-document (i.e. item-project, in this case) matrix results in topic models of higher interpretability due to the elimination of statistical noise. Given that the goal of the WDCM (S)itelinks dashboard is to inform about the structure of Wikidata (S)itelinks usage in the most comprehensible way, this criterion is essential.

Topic Modeling and Model Selection. Coherence-based criteria are used to determine the best topic model in each semantic category. The R package {maptpx} is used for rapid estimation of LDA topic models in each semantic category under consideration. A range of models encompassing two to 20 topics is considered for each category's term-document (i.e. item-project) matrix, and each model's estimation is replicated five times; optimizations are run in parallel on stat1007 (currently using 30 cores with 64 GB of RAM and tol = .01). After the LDA models have been obtained, a coherence-based measure based on Normalized Pairwise Mutual Information (NMPI; a version of similar measures discussed in Exploring the Space of Topic Coherence Measures) is used to determine the most interpretable model (R code follows):

 # - topicCoherence_tdm() - compute topic coherence
 # - for a full topic model
 topicCoherence_tdm <- function(tdm, theta, M, normalized = TRUE) {

   # - tdm: a term-document matrix (columns = terms, rows = documents)
   # - theta: the num(terms) by num(topics) matrix of estimated
   # - topic-phrase probabilities (row names = terms)
   # - M: number of top topic terms to use to compute coherence

   # - constant to add to joint probabilities
   # - (avoid log(0))
   epsilon <- 1e-12

   # - select top term subsets from each topic
   topTerms <- apply(theta, 2, function(x) {
     names(sort(x, decreasing = TRUE)[1:M])
   })

   # - compute topic coherences
   nmpi <- apply(topTerms, 2, function(x) {

     # - compute Normalized Pairwise Mutual Information (NMPI)
     wT <- which(colnames(tdm) %in% x)
     # - term probabilities
     pT <- colSums(tdm[, wT])/sum(tdm)
     # - term joint probabilities
     terms <- colnames(tdm)[wT]
     bigS <- sum(tdm)
     jpT <- lapply(terms, function(y) {
       cmpTerms <- setdiff(terms, y)
       p <- lapply(cmpTerms, function(z) {
         mp <- sum(apply(tdm[, c(y, z)], 1, min))/bigS
         names(mp) <- z
         return(mp)
       })
       p <- unlist(p)
       return(p)
     })
     names(jpT) <- terms
     # - produce all pairs from terms
     pairTerms <- combn(terms, 2)
     # - compute topic NMPI
     n_mpi <- vector(mode = "numeric", length = ncol(pairTerms))
     for (i in seq_len(ncol(pairTerms))) {
       p1 <- pT[which(names(pT) %in% pairTerms[1, i])]
       p2 <- pT[which(names(pT) %in% pairTerms[2, i])]
       p12 <- jpT[[which(names(jpT) %in% names(p1))]][names(p2)]
       p12 <- p12 + epsilon
       # - if NMPI is required (default; normalized = TRUE)
       if (isTRUE(normalized)) {
         n_mpi[i] <- log2(p12/(p1*p2))/(-log2(p12))
       } else {
         # - if PMI is required (normalized = FALSE)
         n_mpi[i] <- log2(p12/(p1*p2))
       }
     }
     # - aggregate in topic
     n_mpi <- mean(n_mpi)
     return(n_mpi)
   })

   # - aggregate across topics:
   # - full topic model coherence
   return(mean(nmpi))
 }

M = 15 items are used to compute topic coherence measures. The model with the best aggregate topic coherence is selected. Empirically, this procedure results in the selection of a larger number of topics than would result from the application of statistical decision criteria (e.g. perplexity), and provides topics of prima facie higher interpretability.
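
To make the quantity computed inside topicCoherence_tdm() concrete, the following self-contained R sketch evaluates the normalized measure for single pairs of items on a toy item-project matrix. As in the function above, the joint probability of two items is taken as the sum of row-wise minima of their columns divided by the matrix total. The helper `nmpi_pair()` and all counts are made up for illustration.

```r
# - toy item-project matrix: rows = projects (documents),
# - columns = items (terms); counts are made up
tdm <- matrix(c(4, 2, 0,
                2, 2, 0,
                0, 0, 6),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("awiki", "bwiki", "cwiki"),
                              c("Q1", "Q2", "Q3")))
total <- sum(tdm)
# - marginal item probabilities
p <- colSums(tdm) / total

# - hypothetical helper: normalized measure for one pair of items
nmpi_pair <- function(a, b, epsilon = 1e-12) {
  # - joint probability: row-wise minimum of the two columns
  p12 <- sum(pmin(tdm[, a], tdm[, b])) / total + epsilon
  log2(p12 / (p[[a]] * p[[b]])) / (-log2(p12))
}

co_used <- nmpi_pair("Q1", "Q2")   # items used on the same projects
disjoint <- nmpi_pair("Q1", "Q3")  # items never used on the same project
```

Items that tend to be used on the same projects receive a positive score, while items that are never used together receive a score close to -1; a topic's coherence is the mean of such scores over all pairs of its top M items.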

Topic Annotation from Wikidata classes. Once the topic modeling phase is finished, we select the M = 15 most important items (in general, the same number of items that is used to compute the topic coherence measures) from each semantic category's items-topics matrix and access Wikidata to fetch all classes of which they are instances (P31). To compute the relative importance of these Wikidata classes in each topic of a particular semantic category, we first produce a binary (index) items x classes matrix. Then, only the weights of the selected M = 15 items are extracted from the category's item-topics matrix, and the two matrices (the index matrix and the obtained subset of the item-topics matrix) are multiplied; the columns of the resulting classes x topics matrix are normalized, and the obtained results are considered as a measure of the importance of a particular Wikidata class for a particular topic in the given semantic category.
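
The matrix product described above can be sketched in a few lines of R. This is an illustration under stated assumptions rather than the engine's code: the item weights and class memberships are made up, only three items are used instead of M = 15, and the column normalization is assumed to be a simple division by the column sum.

```r
# - toy item-topic weights for the selected top items (made-up values)
item_topics <- matrix(c(0.5, 0.1,
                        0.3, 0.2,
                        0.2, 0.7),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(c("Q1", "Q2", "Q3"),
                                      c("topic1", "topic2")))
# - binary items x classes index matrix:
# - 1 if the item is an instance of (P31) the class
item_classes <- matrix(c(1, 0,
                         1, 1,
                         0, 1),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(c("Q1", "Q2", "Q3"),
                                       c("classA", "classB")))
# - classes x topics matrix: sum of item weights per class and topic
class_topics <- t(item_classes) %*% item_topics
# - normalize each topic (column) to sum to one (assumed normalization)
class_topics <- sweep(class_topics, 2, colSums(class_topics), "/")
```

Each column of `class_topics` can then be sorted in decreasing order to list the classes that best describe the respective topic.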

Published Datasets. All WDCM (S)itelinks published data sets are available from https://analytics.wikimedia.org/datasets/wdcm/WDCM_Sitelinks/ and described in the README.txt file in the same directory. Some of these published data sets are picked up by a regular update procedure that scans the update timestamp in this directory every hour, and are transferred to the wikidataconcepts.eqiad.wmflabs CloudVPS instance, where they are used by the Shiny dashboard itself.

Dashboard Functionality

Note: "semantic topic" and "semantic theme" are used interchangeably in this section.

The dashboard is organized in two Views:

  • Category View, and
  • Wiki View.

Category View

In the Category view we select one WDCM semantic category (e.g. Architectural Structure, Geographical Object, Human, Event, etc.) and the dashboard produces a set of analytical results on Wikidata usage from that category. The choice of the category is always made on the Category View: Category tab (the first tab under the Category View), and the choice made there applies to all tabs under the Category View.

Category View: Category

Each row in the table stands for one semantic theme that describes the selected category of Wikidata items. The most important Wikidata classes that describe a particular semantic theme are listed in the Classes column, in decreasing order of importance in each theme. The Diversity score, expressed in percent units, tells us how well "diversified" the given semantic theme is. A semantic theme can be focused on some Wikidata items and classes while other items or classes might be relatively unimportant to it. The higher the diversity score for a given semantic theme, the larger the number of items and classes that play an important role in it. In order to understand a particular theme, you need to inspect which classes are more important in it. Later, you will observe how the Wikidata classes can be used to describe each Wikipedia with respect to where it focuses its interests in the scope of a given semantic theme and category (hint: see Wiki View:Topics).

Category View: Themes: Items

The chart represents the most important items in the selected semantic theme for the respective category. The vertical axis represents the item weight (0 - 1) in the given semantic theme: higher weights indicate more important items. In order to understand the meaning of the selected topic, look at the most important items and ask yourself: what principle holds them together?

Category View: Distribution: Items

The chart represents the distribution of item weight in all semantic themes in the selected semantic category. The horizontal axis represents Item Weight (which is a probability measure, thus ranging from 0 to 1), while the vertical axis stands for the number of items of a given weight. Roughly speaking, the more spread-out the distribution in a given theme, the more diversified the semantics that it describes (i.e. a larger number of different Wikidata items play a significant role in it; the theme is "less focused").

Category View: Themes: Projects

The chart represents the top 50 Wikipedias in which the selected semantic theme in the respective category plays an important role. Each Wikipedia receives an importance score in each semantic theme of a particular category of Wikidata items. The vertical axis represents the importance score (0 - 1, i.e. how important the respective theme is in a given Wikipedia).

Category View: Distribution: Projects

The chart represents the distribution of the importance score for a selected semantic theme across Wikipedias. Each Wikipedia receives an importance score in each semantic theme of a particular category of Wikidata items. The more spread out the distribution of the importance score, the larger the number of Wikipedias in which the respective semantic theme plays an important role. The horizontal axis represents the importance score (0 - 1), while the vertical axis stands for the count of Wikipedias with the respective score.

Category View: Items: Graph

The graph represents the structure of similarity across the most important items in the selected category. The similarity between any two items is computed from their weights across all semantic themes in the category. Each item in the graph points towards the three most similar items to it: the width of the line that connects them corresponds to how similar they are. Items receiving a lot of incoming links are quite interesting, as they act as "hubs" in the similarity structure of the whole category: they are rather illustrative of the category's semantics in general.
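
The construction of such a graph can be sketched in R as follows. The similarity measure is not specified above, so cosine similarity between the rows of an item-theme weight matrix is assumed here purely for illustration; the weights are made up. The edge columns `from`, `to`, and `width` follow the edge data frame layout commonly used with {visNetwork}, which is also an assumption about the dashboard's internals.

```r
# - toy item-theme weights (rows = items, columns = semantic themes)
w <- matrix(c(0.9, 0.1,
              0.8, 0.2,
              0.5, 0.5,
              0.2, 0.8,
              0.1, 0.9),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("Q", 1:5), c("theme1", "theme2")))
# - cosine similarity between items (assumed measure)
norms <- sqrt(rowSums(w^2))
sim <- (w %*% t(w)) / (norms %o% norms)
# - an item should not point to itself
diag(sim) <- -Inf
# - edge list: each item points to its k most similar items
k <- 3
edges <- do.call(rbind, lapply(rownames(sim), function(i) {
  top <- names(sort(sim[i, ], decreasing = TRUE))[1:k]
  data.frame(from = i, to = top, width = sim[i, top],
             stringsAsFactors = FALSE, row.names = NULL)
}))
```

Counting how often each item appears in the `to` column then identifies the "hubs" described above.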

Category View: Items: Hierarchy

We first look at (1) how similar the items from the selected category are, then (2) trace how the items form small groups (i.e. clusters) with respect to their mutual similarity, and then (3) how these small groups tend to join to form progressively larger groups of similar items. Do not forget that the similarity between items here is not guided only by what you or anyone else would claim to know about them, but also by how the editor community chooses to use these items across various Wikipedias! For example, if two manifestly unrelated items are frequently used across the same set of Wikipedias, they will be recognized as similar in that respect.

Category View: Projects: Graph

The graph represents the structure of similarity across the Wikipedias in the selected category. The similarity between any two Wikipedias is computed from their importance scores across all semantic themes in the category. Each Wikipedia in the graph points towards the three most similar Wikipedias to it: the width of the line that connects them corresponds to how similar they are. Wikipedias receiving a lot of incoming links act as "attractors" in the similarity structure of the whole category: they are rather representative of the category as such.

Category View: Projects: Hierarchy

We look at (1) how similar the Wikipedias in the selected category are, then (2) trace how they first form small groups of Wikipedias (i.e. clusters) with respect to their mutual similarity, and then (3) how these small groups tend to join to form progressively larger groups of similar Wikipedias. In other words, similar Wikipedias are found under the same branches of the tree spawned by this hierarchical representation of similarity.
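
The three steps above correspond to agglomerative hierarchical clustering. A minimal sketch in base R, assuming Euclidean distances between the importance score profiles and complete linkage (neither choice is confirmed by this documentation); the project names and scores are made up:

```r
# - toy importance scores (rows = Wikipedias, columns = semantic themes)
scores <- matrix(c(0.9, 0.1,
                   0.8, 0.2,
                   0.6, 0.4,
                   0.2, 0.8,
                   0.1, 0.9),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("awiki", "bwiki", "cwiki", "dwiki", "ewiki"),
                                 c("theme1", "theme2")))
# - (1) pairwise distances, (2) and (3) successive merging of the closest groups
hc <- hclust(dist(scores), method = "complete")
# - cut the tree into two groups of similar Wikipedias
groups <- cutree(hc, k = 2)
```

Calling `plot(hc)` draws the corresponding tree: in this toy example awiki and bwiki merge first, cwiki joins them next, and dwiki and ewiki form the second branch.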

Wiki View

In the Wiki view we select one Wikipedia (e.g. enwiki, ruwiki, dewiki, frwiki, cewiki, etc.) and the dashboard produces a set of analytical results on its Wikidata usage. The choice of Wikipedia is always made on the Wiki View: Wikipedia tab (the first tab under the Wiki View), and the choice made there applies to all tabs under the Wiki View.

Wiki View: Wikipedia

Four charts are generated upon the selection of Wikipedia:

  1. Category Distribution in Wiki. This pie chart presents the distribution of item usage across all considered WDCM semantic categories in the selected project.
  2. Local Semantic Neighbourhood. This graph presents the selected Wikipedia alongside the ten most similar Wikipedias to it. Similarity was computed by inspecting a large number of Wikidata items from all item classes under consideration and registering what items are used across different Wikipedias. Each Wikipedia points towards the three most similar Wikipedias to it. NOTE. This is the local similarity neighbourhood only; a full similarity graph can be obtained from the Wiki:Similarity tab.
  3. Category Usage Profiles. The chart represents the usage of different Wikidata classes in the selected Wikipedia and the ten most similar Wikipedias to it. The vertical axis, representing the count of items used from the respective classes on the horizontal axis, is provided on a logarithmic scale. The data points of the selected Wikipedia are labeled by exact counts.
  4. Wikipedia Similarity Profile. The histogram represents the distribution of similarity between the selected Wikipedia and all other Wikipedias on this dashboard. The coefficient used is the Jaccard distance, which ranges from 0 (high similarity) to 1 (low similarity). Similarity is binned into ten categories on the horizontal axis, while the count of Wikipedias found in each bin is given on the vertical axis. The more the histogram's mass is concentrated towards the left, the higher the number of Wikipedias similar to the selected one.
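
As an illustration of the coefficient used in the Wikipedia Similarity Profile, the following R sketch computes the Jaccard distance (one minus the ratio of shared items to all items used by either project) on a toy binary usage matrix. The project names, items, and the helper `jaccard_distance()` are made up; base R's `dist(method = "binary")` computes the same quantity for all pairs of rows.

```r
# - toy binary usage matrix: rows = Wikipedias, columns = items;
# - 1 = the item is used in the project
used <- matrix(c(1, 1, 1, 0,
                 1, 1, 0, 0,
                 0, 0, 1, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("awiki", "bwiki", "cwiki"),
                               paste0("Q", 1:4)))
# - hypothetical helper: Jaccard distance between two binary vectors
jaccard_distance <- function(x, y) {
  1 - sum(x & y) / sum(x | y)
}
d_ab <- jaccard_distance(used["awiki", ], used["bwiki", ])  # 1 - 2/3
d_ac <- jaccard_distance(used["awiki", ], used["cwiki", ])  # 1 - 1/4
# - all pairwise distances at once
d_all <- dist(used, method = "binary")
```

Here awiki is closer to bwiki (two of three items shared) than to cwiki (one of four items shared), while bwiki and cwiki share no items and are at the maximal distance of 1.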

Wiki View: Wiki:Similarity

The graph represents the similarity structure across all Wikipedias that can be compared to the selected one. We first select all Wikipedias that make use of the same semantic categories as the selected one and then inspect how many times each of the 10,000 most frequently used Wikidata items in each semantic category was used in every comparable Wikipedia. From these data we derive a similarity measure that describes the pairwise similarity among Wikipedias.

Each Wikipedia in the graph points towards the three most similar Wikipedias to it: the width of the line that connects them corresponds to how similar they are.

Wiki View: Wiki:Topics

The chart represents the importance score of the selected Wikipedia in each semantic theme (themes are represented on the horizontal axes of the plots), in each semantic class. Here we can start building an understanding of "what a particular Wikipedia is about": we might first study each semantic theme in each semantic class (in Category View: Category) to understand what the semantic themes represent, and then get back here to see in which semantic themes of particular classes the selected Wikipedia is well represented. While the horizontal axes represent a large number of semantic themes, not every WDCM semantic category (the categories are represented on different panels) encompasses that many topics; take a look at Category View: Category to find out how many semantic themes there are in a particular class. Data points for the themes that do not exist in a particular class, or have an importance score of zero, are not labeled.

WDCM Code Repository

All WDCM code is hosted on Gerrit and distributed on GitHub.

WDCM Puppetization

The WDCM is currently undergoing puppetization:

  • On the statboxes: see https://phabricator.wikimedia.org/T171258
    • NOTE: you might have noticed that all WDCM Engine scripts mentioned in this technical documentation have a _goransm suffix. The reason is that, because of the current setup on the statboxes (stat1004, stat1005), the analytics-wmde user is not able to run HiveQL scripts as an automated user, which is currently preventing the puppetization of the WDCM system in production. Phabricator tickets have been opened in that respect and the problem will be resolved soon.
  • On the wikidataconcepts labs instance: see https://phabricator.wikimedia.org/T171258