Wikidata Concepts Monitor

The Wikidata Concepts Monitor (WDCM) is a system that analyzes and visualizes Wikidata usage across the Wikimedia projects [WDCM Wikidata Project Page|WDCM Dashboards|Gerrit|Diffusion].

WDCM is developed and maintained (mainly) by Goran S. Milovanovic, Data Scientist, WMDE; any suggestions, contributions, and questions are welcome and should be directed to him.

Introduction

This page presents the technical documentation and important aspects of the system design of the Wikidata Concepts Monitor (WDCM). The WDCM data product is a set of Shiny dashboards, fully developed in R and Pyspark, that provide analytical insight into Wikidata usage across its client projects. In deployment, WDCM resides on the open source version of the RStudio Shiny Server. The WDCM dashboards are hosted on a WMF CloudVPS instance and are fully client-side dependent: they collect the relevant datasets from public sources. However, the WDCM system as a whole also depends on numerous ETL procedures that are run from production (stat1004 and stat1007) and supported by Apache Sqoop and Spark, as well as on a set of SPARQL queries and Blazegraph GAS programs that extract pre-defined sets of Wikidata items for further analyses. This document explains the modular design of WDCM and documents the critical procedures; the respective code is available from public repositories on Gerrit and GitHub.

Note: the WDCM Dashboards user manuals/documentation are found on the respective dashboards themselves (WDCM Overview, WDCM Usage, WDCM Semantics, WDCM Geo).

General Approach to the Study of Wikidata usage

While Wikidata itself is a semantic ontology with pre-defined and evolving normative rules of description and inference, Wikidata usage is essentially a social, behavioral phenomenon, suitable for study by means of machine learning in the field of distributional semantics: the analysis and modeling of statistical patterns of occurrence and co-occurrence of Wikidata item and property usage across the client projects (e.g. enwiki, frwiki, ruwiki, etc.). WDCM thus employs various statistical models (e.g. topic modeling, clustering, dimensionality reduction) in an attempt to describe and provide insights from the observable Wikidata usage statistics, all beyond providing elementary descriptive statistics of Wikidata usage.

Wikidata usage patterns

The “golden line” that connects the reasoning behind all WDCM functions can be non-technically described in the following way. Imagine observing the number of times each item in a set of N particular Wikidata items was used across some project (enwiki, for example). Imagine having the same data for other projects as well: for example, if 200 projects are under analysis, then we have 200 counts for each of the N items in the set, and the data can be described by an N x 200 matrix (items x projects). Each column of counts, representing the frequency of occurrence of all Wikidata entities under consideration in one of the 200 projects, is a vector that represents a particular Wikidata usage pattern. By inspecting and statistically modeling the usage pattern matrix (a matrix that encompasses all such usage patterns across the projects) or the derived covariance/correlation matrix, many insights into the similarities between Wikimedia projects (or, more precisely, the similarities between their usage patterns) can be found.

In order to illustrate the logic behind all structural WDCM analyses, the following figure presents the Wikidata usage patterns across 14 item categories for some of the largest Wikipedia projects:

The horizontal axis lists the 14 categories of Wikidata items that are currently tracked by the WDCM system. The vertical axis represents the logarithm of the count of how many times the items from a particular category have been used on a particular Wikipedia. The logarithmic scale is used only to prevent overcrowding of the data points in the plot. Each line connecting the data points presents one Wikidata usage pattern. In this setting, each usage pattern is a characteristic of one particular Wikipedia. However, we could imagine the same plot "transposed", so that each line would stand for a category of items and not for a project. We would thus obtain the category-specific usage patterns.

The following explanation is a brute simplification of what WDCM does; however, it may serve as a useful conceptual introduction to the inner workings of this system. From the viewpoint of Wikidata usage in the 14 semantic categories presented in the picture, any Wikipedia project can be described as a vector of 14 numbers, each number standing for the count of how many times items from the respective category have been used in the Wikipedia under consideration. The lines connecting the data points for a particular project in the plot represent exactly those counts (precisely: their logarithms). How can we use this information to assess the similarity in Wikidata usage between any two Wikipedias? The simplest possible approach is to compute the correlation between the respective usage patterns. The following table presents a correlation matrix in which rows and columns stand for projects. The matrix is populated by correlation coefficients (we have used Spearman's ρ coefficient). These coefficients range from -1 (absolute negative correlation) to +1 (absolute positive correlation). A value of zero would mean that there is no monotonic relationship between the two usage patterns.

	cebwiki	dewiki	enwiki	frwiki	itwiki	ruwiki	tawiki	zhwiki
cebwiki	1.00	0.75	0.82	0.80	0.68	0.73	0.75	0.85
dewiki	0.75	1.00	0.90	0.96	0.92	0.89	0.96	0.88
enwiki	0.82	0.90	1.00	0.92	0.93	0.75	0.88	0.86
frwiki	0.80	0.96	0.92	1.00	0.95	0.87	0.93	0.94
itwiki	0.68	0.92	0.93	0.95	1.00	0.77	0.91	0.86
ruwiki	0.73	0.89	0.75	0.87	0.77	1.00	0.90	0.76
tawiki	0.75	0.96	0.88	0.93	0.91	0.90	1.00	0.83
zhwiki	0.85	0.88	0.86	0.94	0.86	0.76	0.83	1.00

As we can see from this correlation matrix, all diagonal elements, which represent the correlations of each project's usage pattern with itself, contain ones, as expected: the Wikidata usage pattern of a particular project is maximally self-similar. Looking into any other cell reveals a positive number less than one. Each one represents the correlation between the usage patterns of the Wikipedias in the respective row and column of the matrix. Thus, we can say that dewiki and enwiki (having a correlation of 0.90) are more similar in respect to how they use Wikidata items from the 14 categories under consideration than, say, dewiki and cebwiki (having a correlation of 0.75).
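The computation behind such a matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the project names and counts below are made up, and the helper functions (`average_ranks`, `pearson`, `spearman`, `correlation_matrix`) are hypothetical names, not part of the WDCM code base. Spearman's ρ is computed here as the Pearson correlation of the ranks.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(patterns):
    """Map every (project, project) pair to the rho of their usage patterns."""
    names = sorted(patterns)
    return {(a, b): spearman(patterns[a], patterns[b]) for a in names for b in names}


# Made-up usage counts over five categories for three hypothetical projects
usage = {
    "projectA": [120, 45, 300, 8, 60],
    "projectB": [100, 50, 280, 5, 70],
    "projectC": [10, 400, 20, 90, 15],
}
rho = correlation_matrix(usage)
```

Because Spearman's ρ depends only on ranks, it is the same whether it is computed on the raw counts or on their logarithms. In this toy data, projectA and projectB order the categories identically and so reach ρ = 1, while projectA and projectC give ρ = -0.6.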

Now let's transpose the usage patterns and re-calculate the correlation matrix:

	Architectural Structure	Astronomical Object	Book	Chemical Entities	Event	Gene	Geographical Object	Human	Organization	Scientific Article	Taxon	Thoroughfare	Wikimedia	Work Of Art
Architectural Structure	1.00	0.50	0.79	0.93	0.67	0.93	-0.45	0.81	0.95	0.98	-0.10	0.60	0.86	0.67
Astronomical Object	0.50	1.00	0.90	0.62	0.90	0.62	0.05	0.74	0.55	0.57	0.19	0.81	0.50	0.90
Book	0.79	0.90	1.00	0.81	0.95	0.81	-0.14	0.83	0.79	0.81	0.10	0.76	0.74	0.95
Chemical Entities	0.93	0.62	0.81	1.00	0.76	1.00	-0.38	0.93	0.88	0.95	0.05	0.76	0.93	0.76
Event	0.67	0.90	0.95	0.76	1.00	0.76	-0.17	0.83	0.64	0.69	0.02	0.71	0.69	1.00
Gene	0.93	0.62	0.81	1.00	0.76	1.00	-0.38	0.93	0.88	0.95	0.05	0.76	0.93	0.76
Geographical Object	-0.45	0.05	-0.14	-0.38	-0.17	-0.38	1.00	-0.40	-0.24	-0.29	0.43	-0.05	-0.26	-0.17
Human	0.81	0.74	0.83	0.93	0.83	0.93	-0.40	1.00	0.79	0.83	0.14	0.88	0.86	0.83
Organization	0.95	0.55	0.79	0.88	0.64	0.88	-0.24	0.79	1.00	0.98	-0.02	0.67	0.81	0.64
Scientific Article	0.98	0.57	0.81	0.95	0.69	0.95	-0.29	0.83	0.98	1.00	0.00	0.69	0.88	0.69
Taxon	-0.10	0.19	0.10	0.05	0.02	0.05	0.43	0.14	-0.02	0.00	1.00	0.48	0.29	0.02
Thoroughfare	0.60	0.81	0.76	0.76	0.71	0.76	-0.05	0.88	0.67	0.69	0.48	1.00	0.71	0.71
Wikimedia	0.86	0.50	0.74	0.93	0.69	0.93	-0.26	0.86	0.81	0.88	0.29	0.71	1.00	0.69
Work Of Art	0.67	0.90	0.95	0.76	1.00	0.76	-0.17	0.83	0.64	0.69	0.02	0.71	0.69	1.00

We have now used the Wikidata usage patterns of particular categories of items across the projects to compute the correlations. Again, all categories are maximally self-similar in respect to how they are used across the projects: look at the diagonal elements. However, we can say that, for example, the way Architectural Structures are used across the Wikipedias under consideration is more similar to the way Chemical Entities are used (a correlation of 0.93) than to the way the Taxon category is used (a negative correlation of -0.10).

The variety of usage patterns

We can imagine computing the usage patterns across many different variables. For example, we did not have to compare the Wikipedias by how much they make use of Wikidata items from particular item categories; we could instead ask: how much do they use particular items? In that case we would have to deal with usage patterns that encompass many millions of elements, and not only the fourteen elements that represent the aggregate counts in particular categories. Had we included all Wikimedia projects that have client-side Wikidata tracking enabled, we would have to deal with more than 800 usage patterns, one for every project, and not with only eight as in this example. We could have picked only items from a single Wikidata category, for example all instances of human (Q5), and then computed the correlations between the usage patterns across all of them; we would again have to deal with usage patterns several million elements long. Every time we change the definition of a usage pattern, we are changing the goals of the analysis, and this is the first thing to keep in mind when learning about WDCM. We can analyze the similarity between the Wikidata usage patterns of different projects from the viewpoint of only some Wikidata items, or from the viewpoint of a complete category of Wikidata items, or we can analyze only a subset of projects. On many levels of analysis, WDCM changes these "perspectives" to illustrate the ways in which Wikimedia projects make use of Wikidata in as much detail as possible.

The second thing to keep in mind at this point is that this example is, once again, a brute oversimplification of our methodology. WDCM uses a much more advanced mathematical model to assess the similarities in usage patterns than the correlation matrices that were used in this example. We will later provide a brief conceptual introduction to the methodology used to assess project and category similarity in Wikidata usage, but an interested reader who wants to go under the hood will certainly have to do some reading first. Don't worry, we will list the recommended readings too.

In essence, the technology and mathematics behind WDCM rely on the same set of practical tools and ideas that support the development of semantic search engines and recommendation systems, only applied to a specific dataset that encompasses the usage patterns for tens of millions of Wikidata entities across its client projects.

Motivation

The data obtained in this way, and analyzed properly, allow for the inferences about how different communities use Wikidata to build their specific projects, or about the ways in which semantically related collections of entities are used across some set of projects. By knowing this, it becomes possible to develop suggestions on what cooperation among the communities would be fruitful and mutually beneficial in terms of enhancing the Wikidata usage on the respective projects. On the other hand, communities that are focused on some particular semantic topics, categories (sets), sub-ontologies, etc. can advance by recognizing the similarity in their approaches and efforts. Thus, a whole new level of collaborative development around Wikipedia could be achieved. This goal motivates the development of the WDCM system, beyond the obvious possibility to assess data of fundamental scientific importance - for cognitive and data scientists, sociologists of knowledge, AI engineers, ontologists, pure enthusiasts, and many others.

WDCM is designed to answer questions like the following:

How much are the particular classes of Wikidata items used across the Wikimedia projects?
What are the most frequently used Wikidata items in particular Wikimedia projects or from particular Wikidata sets of items?
How can we categorize the Wikimedia projects in respect to the characteristic patterns of Wikidata usage that we discover in them?
What Wikimedia projects are similar in respect to how they use Wikidata, overall and from the perspective of some particular sets of items?
How is the Wikidata usage of the geolocalized items (such as those relevant for the GLAM initiatives) spatially distributed?

Definitions

Wikidata usage

Wikidata usage analytics

By Wikidata usage analytics it is meant: all important and interesting statistics, summaries of statistical models and tests, visualizations, and reports on how Wikidata is used across the Wikimedia projects. The end goal of WDCM is to deliver consistent, high quality Wikidata usage analytics.

Wikidata usage (statistics)

Consider a set of sister projects (e.g. enwiki, dewiki, frwiki, zhwiki, ruwiki, etc; from the viewpoint of Wikidata usage, we also call them: client projects). Statistical count data that represent the frequency of usage of particular Wikidata entities over any given set of client projects are considered to be Wikidata usage (statistics) in the context of WDCM.

[AN IMPORTANT] NOTE on the Wikidata usage definition

The following discussion relies on the understanding of the Wikibase Schema, especially the wbc_entity_usage table schema (a more thorough explanation of Wikidata item usage tracking in the wbc_entity_usage tables is provided on Phabricator). The methodological discussion of the development of Wikidata usage tracking in relation to this schema is also found on Phabricator.

A strict, working, operational definition of Wikidata usage data is still under development. The problem with its development is of a technical nature and related to the current logic of the wbc_entity_usage table schema. This table is found on the MariaDB replicas in the database of every project that has client-side Wikidata usage tracking enabled.

The “S”, “T”, “O”, and “X” usage aspects

The problematic field in the current wbc_entity_usage schema is eu_aspect. With its current definition, this field makes it possible to select in a non-redundant way only the “S”, “O”, and “T” entity usage aspects. That is, on any sister project that maintains client-side Wikidata usage tracking, each “S”, “O”, or “T” occurrence of a given Wikidata entity signals one and only one entity usage in the respective aspect (i.e. these aspects are non-overlapping in their registration of Wikidata usage). However, while “S”, “O”, and “T” do not overlap with one another, they may overlap with the “X” usage aspect. Excluding the “X” aspect from the definition is not possible either: ignoring it implies that the majority of relevant usage, e.g. usage in infoboxes, would not be tracked (accessing statement data via Lua is typically tracked as “X”).

The “L” aspects problem: tracking the fallback mechanism

The “L” aspects, usually modified by a specific language modifier (e.g. “L.de”, “L.en”, and similar), cannot currently be counted in a non-redundant way. This is a consequence of the way the wbc_entity_usage table is produced in respect to the possible triggering of the language fallback mechanism. To explain the language fallback mechanism briefly: let the language fallback chain for a particular language be “L.de-ch” → “L.de” → “L.en”. This implies the following: if the usage of an item label in Swiss German (“L.de-ch”) was attempted and no label in Swiss German was found, an attempt to use the German label (“L.de”) would be made, and finally an attempt to use the English label (“L.en”) if the previous attempt fails. However, if the language fallback mechanism is triggered on a particular entity usage occasion, all L aspects in that fallback chain will be registered in the wbc_entity_usage table as if they were used simultaneously. From the viewpoint of Wikidata usage, it would be interesting to track (a) the attempted (i.e. the user intended) L aspect, or at least (b) the actually used L aspect for a given entity usage. However, the current design of the wbc_entity_usage table does not provide for an assessment of either of these possibilities.
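The counting problem described above can be made concrete with a small Python sketch. It is not the Wikibase implementation: `resolve_label`, `recorded_aspects`, and the example labels are hypothetical and only model the behavior as this section describes it.

```python
def resolve_label(fallback_chain, labels):
    """Return the first language in the chain for which a label exists, else None."""
    for lang in fallback_chain:
        if lang in labels:
            return lang
    return None


def recorded_aspects(fallback_chain):
    """Aspects registered when the fallback is triggered, as described above:
    every language in the chain is recorded, not only the one actually used."""
    return ["L." + lang for lang in fallback_chain]


chain = ["de-ch", "de", "en"]               # the fallback chain from the example above
labels = {"de": "Berlin", "en": "Berlin"}   # hypothetical item without a Swiss German label

intended = chain[0]                         # (a) the label the user intended
used = resolve_label(chain, labels)         # (b) the label actually used
recorded = recorded_aspects(chain)          # what the usage table ends up containing
```

Here the intended aspect is "L.de-ch" and the used aspect is "L.de", but neither can be recovered from `recorded`, which lists all three aspects of the chain.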

Finally, there are other uncertainties related to the current design of the wbc_entity_usage table. For example, imagine an editor action that results in the presence of a particular item on a page, with a sitelink, while at the same time instantiating a label in a particular language. How many item usage counts do we have: one, two, or more (one “S” aspect count for the sitelink, and at least one more for a specific “L” aspect)?

In conclusion, if Wikidata usage statistics are to encompass all the different ways in which an item usage could be defined, by mapping onto all possible editor actions that instantiate a particular item on a particular page, the design of the wbc_entity_usage table would have to undergo a thorough revision, or a new Wikidata usage tracking mechanism would have to be developed from scratch. The wbc_entity_usage table was never designed for analytical purposes in the first place; however, it is the only source of Wikidata re-use statistics that we can currently rely on.

A proposal for an initial solution:

- [NOTE] This is the current Wikidata usage definition in the context of WDCM.

From the existing wbc_entity_usage schema, it seems possible to rely on the following definition. For the initial version of the WDCM system, use a simplified definition of Wikidata usage that excludes the cases of multiple usage of an item per page; in effect:

count on how many pages a particular Wikidata item occurs in a project;
take that as a Wikidata usage per-project statistic;
ignore usage aspects completely until a proper tracking of usage per-page is enabled in the future.

By "proper tracking of usage per-page" the following is meant:

a methodology that counts exactly how many usage cases of a particular item there are on a particular page in a particular project.
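The simplified definition above (count the pages on which an item occurs in a project, ignoring usage aspects) can be sketched as follows. The rows are made up, the function name `usage_per_project` is hypothetical, and the field names follow the Hive table described in the WDCM Data Model section; the production ETL runs in Pyspark and is not reproduced here.

```python
from collections import defaultdict


def usage_per_project(rows):
    """Count distinct pages per (project, item), ignoring the usage aspect."""
    pages = defaultdict(set)
    for row in rows:
        pages[(row["wiki_db"], row["eu_entity_id"])].add(row["eu_page_id"])
    return {key: len(page_ids) for key, page_ids in pages.items()}


# Made-up rows: Q5 is registered twice on page 1 of enwiki under different aspects
rows = [
    {"wiki_db": "enwiki", "eu_entity_id": "Q5", "eu_aspect": "S", "eu_page_id": 1},
    {"wiki_db": "enwiki", "eu_entity_id": "Q5", "eu_aspect": "X", "eu_page_id": 1},
    {"wiki_db": "enwiki", "eu_entity_id": "Q5", "eu_aspect": "L.en", "eu_page_id": 2},
    {"wiki_db": "dewiki", "eu_entity_id": "Q5", "eu_aspect": "L.de", "eu_page_id": 7},
]
counts = usage_per_project(rows)
```

In this toy input Q5 is counted twice for enwiki (pages 1 and 2) even though three rows mention it, because the two aspects registered on page 1 collapse into a single page-level usage.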

WDCM Taxonomy

The WDCM Taxonomy presents a human choice of specific categories and items from the Wikidata ontology that are submitted to WDCM for analytics.

Currently, only one WDCM Taxonomy is specified (Lydia Pintscher, 05/03/2017, Berlin).

The fact that the WDCM relies on a specific choice of taxonomy implies that not all Wikidata items are necessarily tracked and analyzed by the system.

Users of WDCM can specify any imaginable taxonomy that presents a proper subset of Wikidata; no components of the WDCM system are dependent upon any characteristics of some particular choice of taxonomy.

Once defined, the WDCM taxonomy is translated into a set of (typically very simple) SPARQL queries that are used to collect the respective item IDs; only collected items will be tracked and analyzed by the system.
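As a hedged illustration of this translation step, the Python sketch below turns a taxonomy (category name mapped to Wikidata class IDs) into one SPARQL query per category. The query shape, "instance of" (P31) followed by any number of "subclass of" (P279) steps, is an assumption made for this example and is not taken from the WDCM code; `category_query` and `taxonomy` are hypothetical names.

```python
def category_query(class_ids):
    """Build a SPARQL query selecting items that are instances of any of the
    given classes or of their subclasses (assumed query shape)."""
    values = " ".join("wd:" + qid for qid in class_ids)
    return (
        "SELECT DISTINCT ?item WHERE { "
        "VALUES ?class { " + values + " } "
        "?item wdt:P31/wdt:P279* ?class . }"
    )


# Two categories from the current WDCM Taxonomy, written as a plain mapping
taxonomy = {
    "Human": ["Q5"],
    "Organization": ["Q783794", "Q988108", "Q43229"],
}
queries = {name: category_query(ids) for name, ids in taxonomy.items()}
```

Keeping the taxonomy as plain data and generating the queries from it reflects the point made above: no other component needs to know which categories were chosen.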

The 14 currently encompassed item categories are:

Human (human (Q5))
Wikimedia Internal (encompassing Wikimedia category (Q4167836), Wikimedia disambiguation page (Q4167410), and Wikimedia template (Q11266439))
Work of Art (work of art (Q838948))
Scientific Article (scientific article (Q13442814))
Book (book (Q571))
Geographical Object (geographical object (Q618123))
Organization (encompassing company (Q783794), club (Q988108), and organization (Q43229))
Architectural Structure (encompassing monument (Q4989906) and building (Q41176))
Gene (gene (Q7187))
Chemical Entities (encompassing chemical element (Q11344), chemical compound (Q11173), and chemical substance (Q79529))
Astronomical Object (astronomical object (Q6999))
Taxon (taxon (Q16521))
Event (event (Q1656682))
Thoroughfare (thoroughfare (Q83620))

The WDCM Taxonomy is still undergoing refinement. Ideally, category overlap would be avoided completely; this is not yet the case, and it is questionable whether it is possible as a general solution at all given the structure of Wikidata. The following directed graph shows the current item categories in the WDCM taxonomy and the network of P279 (Subclass Of) relations in which they play a role. The structure was obtained by a recursive search through the P279 paths, starting from entity (Q35120) and descending to a depth of 4 (searching for sub-classes of sub-classes of sub-classes, etc.). Some item categories from the WDCM Taxonomy are not found even at sub-class depth 4 from entity, although entity is the necessary target item of P279 (on a recursive path from anything, of course). Note: the only cycle in the graph is Entity → Entity.
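A depth-limited search of this kind can be sketched in Python as below. The sketch uses a breadth-first traversal over a made-up subclass graph rather than live P279 data, so the node names, the `children` mapping, and the function `subclasses_within_depth` are all hypothetical.

```python
from collections import deque


def subclasses_within_depth(children, root, max_depth):
    """Breadth-first search downward from root; returns node -> depth for
    every node reachable in at most max_depth subclass steps."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for child in children.get(node, []):
            if child not in depth:  # first visit is the shortest path
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


# Made-up graph: parent class -> direct subclasses
children = {
    "entity": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "d": ["e"],
    "e": ["f"],
}
reached = subclasses_within_depth(children, "entity", 4)
```

In this toy graph "e" is found at depth 4 while "f" is not found at all, which mirrors the remark that some taxonomy categories lie deeper than four subclass steps below entity.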

WDCM Data Model

The WDCM data model encompasses two components:

HDFS, Big Data component (Production, Analytics Cluster)
Cloud VPS Instance Local Data Component

The first, (1) the HDFS Big Data component, is produced by (A) an R-orchestrated Apache Sqoop cycle, which transfers many wbc_entity_usage tables from MariaDB in production to a single Hive table in Hadoop on the Analytics Cluster, and (B) ETL (Apache Spark, Pyspark) and machine learning (R) cycles, which produce the WDCM public data sets for (2) the Local Data Component; these data sets are fetched by the client-side dependent dashboards for visualizations and further analytics. The details of this process are given below (WDCM System Operation Workflow).

HDFS, Big Data component (Production)

This component encompasses the following HiveQL table, currently in the goransm Hadoop database:

`wdcm_clients_wb_entity_usage`

This table is the result of the WDCM_Sqoop_Clients.R script, which runs a regular weekly Apache Sqoop update to collect the data from all client projects that maintain the wbc_entity_usage table, i.e. all projects that have client-side Wikidata usage tracking enabled. This Hive table presents the raw WDCM data set; it is simply the product of sqooping many big MariaDB tables, which are not suited for analytics queries, into Hadoop. This Hive table is used to produce all other data aggregates for publication and machine learning procedures; the datasets are produced from this Hive table in Apache Spark (wdcmModule_ETL.py Pyspark script).

goransm.wdcm_clients_wb_entity_usage
col_name	data_type	comment
eu_row_id	bigint	row identifier
eu_entity_id	string	Wikidata item ID, e.g. Q5
eu_aspect	string	eu_aspect, see wbc_entity_usage schema
eu_page_id	bigint	the ID of the page where the item is used
wiki_db	string	partition; the project database, e.g. "enwiki", "dewiki".

Cloud VPS Instance Local Data Component/Public Datasets

All data aggregates and datasets for machine learning procedures are publicly available from: https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/wdcm/

The two directories, etl and ml, encompass the data aggregates (as presented to the end-users via WDCM dashboards) and machine learning datasets, respectively.

The datasets that support any particular WDCM dashboard are downloaded to the client from this public repository during runtime.

WDCM System Operation Workflow

The following schema represents the WDCM System Operation Workflow. We will proceed by explaining component by component.

The first phase (1, 2 in the diagram) is performed by the WDCM_Sqoop_Clients.R script, run on a regular weekly schedule from stat1004. This R script orchestrates Apache Sqoop operations to (1) transfer the many (currently more than 800) wbc_entity_usage SQL tables from the m2 MariaDB replica in order to produce (2) the Hadoop/Hive wdcm_clients_wb_entity_usage table, partitioned by WDCM semantic category (elements of the WDCM Taxonomy), in the goransm database where they can be processed. The first phase typically takes 5 to 6 hours to complete.

All remaining steps in the operational workflow are orchestrated from the wdcmModule_Orchestra.R script - running on a regular update schedule from stat1007 - in the following way:
- the second phase (3, 4 in the diagram) is performed by the wdcmModule_CollectItems.R module:
  - in the first step, this module will (3) load the current WDCM Taxonomy to determine what Wikidata item classes need to be fetched from Wikidata via the SPARQL endpoint;
  - in the second step, (4) many millions of Wikidata items are selected and their IDs fetched from the SPARQL endpoint, in order to determine what information to search for in the previously produced wdcm_clients_wb_entity_usage Hive table;
- in the third phase (5 in the diagram), ETL steps are performed by the wdcmModule_ETL.py Pyspark module to produce various aggregate WDCM data sets, including the Wikimedia Project x Wikidata items matrices (analogous to the usual Document-Term Matrix in NLP, except that Wikimedia projects represent "documents" and items represent "terms");
- in the fourth phase (6 in the diagram), the wdcmModule_ML.R module takes over the Wikimedia Project x Wikidata items matrices - one matrix per WDCM semantic category - and submits them to topic modeling by the R {maptpx} MAP estimation of the Latent Dirichlet Allocation model; {maptpx} allows for rapid estimation of LDA topic models compared to other algorithms (this step will soon be replaced by running LDA from Apache Spark or Python Gensim in order to gain additional processing efficiency and enable cross-validation procedures);
- the resulting Wikimedia project x Semantic topics matrices are then submitted to an {Rtsne} implementation of the t-Distributed Stochastic Neighbor Embedding 2D dimensionality reduction to support visualizations on the WDCM Dashboards, and the coordinates of the respective 2D representations are stored;
- the data frames for network visualizations with R {visNetwork} are prepared and stored;
- in the final phase (7a, 7b in the diagram) all data sets are made public on https://analytics.wikimedia.org/datasets/wmde-analytics-engineering/wdcm/ for open access.
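The Wikimedia Project x Wikidata items matrix mentioned in the third phase can be illustrated with a minimal Python sketch. The counts are made up and `project_item_matrix` is a hypothetical helper; the production step is performed in Pyspark by wdcmModule_ETL.py and is not reproduced here.

```python
def project_item_matrix(counts):
    """Turn {(project, item): usage count} into a dense matrix with projects
    as rows ("documents") and items as columns ("terms")."""
    projects = sorted({project for project, _ in counts})
    items = sorted({item for _, item in counts})
    matrix = [[counts.get((p, i), 0) for i in items] for p in projects]
    return projects, items, matrix


# Made-up per-project usage counts for two items
counts = {
    ("enwiki", "Q5"): 10,
    ("enwiki", "Q42"): 3,
    ("dewiki", "Q5"): 7,
}
projects, items, matrix = project_item_matrix(counts)
```

Item and project pairs that never occur are filled with zero, just as absent terms are in a document-term matrix. A dense list of lists is only workable for a toy example; with millions of items a sparse representation would be needed.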

The process illustrated here does not encompass the WDCM engine update of the WDCM Geo Dashboard, which is run by a separate (WDCM_EngineGeo_goransm.R) script from stat1007, relying on the same WDCM Hive table as the main engine update.

WDCM Dashboards

The Dashboards module is a set of RStudio Shiny dashboards that serve the Wikidata usage analytics to its end-users.

List of WDCM Dashboards

Currently, the WDCM System runs six dashboards:

WDCM Overview, providing an elementary overview - the "big picture" - of Wikidata usage;
WDCM Usage, providing detailed usage statistics;
WDCM Semantics, providing insights from the topic models derived from the usage data;
WDCM Geo, providing interactive maps of geolocalized Wikidata items alongside the respective WDCM usage statistics;
WDCM (S)itelinks, providing detailed insights into the structure of Wikidata (S)itelink usage aspect across a selection of Wikipedia projects;
WDCM (T)itles, providing detailed insights into the structure of Wikidata (T)itle usage aspect across a selection of Wikipedia projects.

Most of the WDCM Dashboards are documented in their respective Description sections. A walk-through with illustrative usage examples is provided on the WDCM Project page.

WDCM (S)itelinks Dashboard

The WDCM (S)itelinks dashboard

analyzes only the sitelinks usage aspect from the wbc_entity_usage table (in the Wikibase schema), and
takes into account only mature Wikipedia projects (in terms of Wikidata usage)

in order to obtain and present an overview, as broad and as clear as possible, of the structure of the Wikidata (S)itelink usage aspect across the Wikipedias.

The dashboard's update engine is (currently) run from stat1007 and encompasses (a) R and HiveQL orchestration from R to obtain the necessary data from the wdcm_clients_wb_entity_usage table, and (b) {maptpx} topic modeling and various other R packages to produce the data sets that are used to obtain the visualizations of the Wikidata usage structure on the dashboard itself. The dashboard is developed in RStudio Shiny and runs on the Shiny Server from the wikidataconcepts.eqiad.wmflabs CloudVPS instance.

Dashboard Update Engine: schedule, wrangling, and modeling procedures

Updates are run at 00:00 UTC on the 2nd, 8th, 15th, 21st, and 28th of each month, each one day after the WDCM_Sqoop_Clients.R runs from stat1004 on the 1st, 7th, 14th, 20th, and 27th of the month. Thus we have five dashboard updates each month. Each update engine run takes approximately eight to nine hours to complete, with machine learning procedures (LDA) accounting for a large fraction of the runtime.

Filtering out Wikidata item use cases. A sitelink usage of a Wikidata item on a project with the Wikibase Client extension installed is recorded in the client's wbc_entity_usage table when "... a client page [...] is connected to an item via an incoming sitelink, but does not access any data of the item directly".

We do not consider all Wikidata classes (as in any WDCM dashboard); only the following WDCM semantic classes are considered:

Human (human (Q5))
Work of Art (work of art (Q838948))
Scientific Article (scientific article (Q13442814))
Book (book (Q571))
Geographical Object (geographical object (Q618123))
Organization (encompassing company (Q783794), club (Q988108), and organization (Q43229))
Architectural Structure (encompassing monument (Q4989906) and building (Q41176))
Gene (gene (Q7187))
Chemical Entities (encompassing chemical element (Q11344), chemical compound (Q11173), and chemical substance (Q79529))
Astronomical Object (astronomical object (Q6999))
Taxon (taxon (Q16521))
Event (event (Q1656682))
Thoroughfare (thoroughfare (Q83620))

Filtering out projects and semantic categories. In order to obtain a comprehensible picture of Wikidata items sitelink usage, we apply the following set of criteria for project and semantic category retention in the analyses:

a formal check is first performed to filter out all projects that are not present in the List of Wikipedias;
the total Wikidata usage per project is computed by summing up all sitelink item use cases per project, and then only projects with above median total Wikidata usage are considered;
only projects that make use of at least 10 WDCM semantic categories listed above are kept;
only semantic categories with 100 or more items that are currently used across all selected projects are retained;
only items that are used in at least 10% of the selected projects are kept from each semantic category;
only the 1000 most frequently used items are kept for all purposes of category-specific topic modeling.

The sixth selection criterion was introduced following numerous experimental studies in the application of Latent Dirichlet Allocation to topic modeling of Wikidata classes. These studies have confirmed the following: due to the highly skewed, Zipfian distribution of Wikidata item usage, selecting a small fraction of items from the full term-document (i.e. item-project, in this case) matrix results in topic models of higher interpretability due to the elimination of statistical noise. Given that the goal of the WDCM (S)itelinks dashboard is to inform about the structure of Wikidata (S)itelinks usage in the most comprehensible way, the introduction of this criterion is essential.
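The six retention criteria can be sketched as a single Python function. This is an illustration under stated assumptions, not the dashboard's update engine (which is written in R): the nested input layout, the function name `select_for_modeling`, and the parameter names are hypothetical, and criterion four is read here as "at least 100 distinct items in use across the selected projects".

```python
from statistics import median


def select_for_modeling(usage, wikipedias, min_categories=10, min_items=100,
                        min_project_share=0.10, top_n=1000):
    """usage[project][category][item] -> sitelink usage count (assumed layout).
    Returns the retained projects and, per retained category, the retained items."""
    # 1. keep only projects present in the list of Wikipedias
    usage = {p: cats for p, cats in usage.items() if p in wikipedias}
    if not usage:
        return [], {}
    # 2. keep projects with above-median total usage
    totals = {p: sum(sum(items.values()) for items in cats.values())
              for p, cats in usage.items()}
    cutoff = median(totals.values())
    projects = [p for p in usage if totals[p] > cutoff]
    # 3. keep projects that use at least min_categories categories
    projects = [p for p in projects
                if sum(1 for items in usage[p].values() if items) >= min_categories]
    selected = {}
    for category in sorted({c for p in projects for c in usage[p]}):
        item_total, item_projects = {}, {}
        for p in projects:
            for item, n in usage[p].get(category, {}).items():
                if n > 0:
                    item_total[item] = item_total.get(item, 0) + n
                    item_projects[item] = item_projects.get(item, 0) + 1
        # 4. drop categories with fewer than min_items items in use
        if len(item_total) < min_items:
            continue
        # 5. keep items used in at least min_project_share of the projects
        kept = [i for i in item_total
                if item_projects[i] / len(projects) >= min_project_share]
        # 6. keep the top_n most frequently used items (ties broken by item ID)
        kept.sort(key=lambda i: (-item_total[i], i))
        selected[category] = kept[:top_n]
    return projects, selected


# Made-up data with small thresholds so that every criterion has an effect
toy = {
    "awiki": {"Human": {"Q1": 50, "Q2": 30}, "Book": {"Q3": 5}},
    "bwiki": {"Human": {"Q1": 40, "Q4": 1}, "Book": {"Q3": 9}},
    "cwiki": {"Human": {"Q1": 2}},
    "dwiki": {"Human": {"Q2": 1}, "Book": {"Q3": 1}},
    "xwikibooks": {"Human": {"Q1": 999}},
}
kept_projects, kept_items = select_for_modeling(
    toy, wikipedias={"awiki", "bwiki", "cwiki", "dwiki"},
    min_categories=2, min_items=2, min_project_share=0.5, top_n=2)
```

With these toy thresholds only awiki and bwiki survive the project filters, the Book category is dropped for having a single item, and Q1 and Q2 are retained for Human.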

Topic Modeling and Model Selection. Coherence-based criteria are used to determine the best topic model in each semantic category. The R package {maptpx} is used for rapid estimation of LDA topic models in each semantic category under consideration. A range of models encompassing two to 20 topics is considered for each category's term-document (i.e. item-project) matrix, and each model's estimation is replicated five times; optimizations are run in parallel on stat1007 (currently we use 30 cores with 64 GB of RAM and tol = .01). After the LDA models have been obtained, a coherence measure based on Normalized Pointwise Mutual Information (NPMI, written NMPI in the code; a version of similar measures discussed in Exploring the Space of Topic Coherence Measures) is used to determine the most interpretable model (R code follows):

# - topicCoherence_tdm() - compute topic coherence 
# - for a full topic model
topicCoherence_tdm <- function(tdm, theta, M, normalized = T) {
  
  # - tdm: a term-document matrix (columns = terms, rows = documents)
  # - theta: The num(terms) by num(topics) matrix of estimated topic-phrase probabilities
  # - M: number of top topic terms to use to compute coherence
  
  # - constant to add to joint probabilities
  # - (avoid log(0))
  epsilon <- 1e-12
  
  # - select top term subsets from each topic
  topTerms <- apply(theta, 2, function(x) {
    names(sort(x, decreasing = T)[1:M])
  })
  
  # - compute topic coherences
  nmpi <- apply(topTerms, 2, function(x) {
    
    # - compute Normalized Pairwise Mutual Information (NMPI)
    wT <- which(colnames(tdm) %in% x)
    # - term probabilities
    pT <- colSums(tdm[, wT])/sum(tdm)
    # - term joint probabilities
    terms <- colnames(tdm)[wT]
    bigS <- sum(tdm)
    jpT <- lapply(terms, function(y) {
      cmpTerms <- setdiff(terms, y)
      p <- lapply(cmpTerms, function(z) {
        mp <- sum(apply(tdm[, c(y, z)], 1, min))/bigS
        names(mp) <- z
        return(mp)
      })
      p <- unlist(p)
      return(p)
    })
    names(jpT) <- terms
    # - produce all pairs from terms
    pairTerms <- combn(terms, 2)
    # - compute topic NMPI
    n_mpi <- vector(mode = "numeric", length = dim(pairTerms)[2])
    for (i in 1:dim(pairTerms)[2]) {
      p1 <- pT[which(names(pT) %in% pairTerms[1, i])]
      p2 <- pT[which(names(pT) %in% pairTerms[2, i])]
      p12 <- jpT[[which(names(jpT) %in% names(p1))]][names(p2)]
      p12 <- p12 + epsilon
      # - if NMPI is required (default; normalized = T)
      if (normalized == T) {
        n_mpi[i] <- log2(p12/(p1*p2))/(-log2(p12)) 
      } else {
        # - if PMI is required (normalized = F)
        n_mpi[i] <- log2(p12/(p1*p2))
      }
    }
    # - aggregate in topic
    n_mpi <- mean(n_mpi)
    return(n_mpi)
  })
  
  # - aggregate across topics:
  # - full topic model coherence
  return(mean(nmpi))
}

The M = 15 top items of each topic are used to compute the topic coherence measures, and the model with the best aggregate topic coherence is selected. Empirically, this procedure selects a larger number of topics than statistical decision criteria (e.g. perplexity) would, and provides topics of prima facie higher interpretability.
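
To make the pairwise quantity that topicCoherence_tdm() averages more concrete, the following toy example evaluates it for a single pair of items. The matrix, item IDs, and project names are invented for illustration; the joint share is computed as in the function above (sum of row-wise minima over the grand total), and the epsilon constant is left out because the joint share is non-zero here.

```r
# Toy document-by-term count matrix (rows = projects, columns = items).
# Item IDs and counts are invented for illustration only.
tdm <- matrix(c(4, 2,
                0, 3,
                1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("wikiA", "wikiB", "wikiC"), c("Q1", "Q2")))

total <- sum(tdm)                    # grand total of all counts: 11
p1 <- sum(tdm[, "Q1"]) / total       # marginal share of Q1
p2 <- sum(tdm[, "Q2"]) / total       # marginal share of Q2

# Joint share, as in the function above: row-wise minimum of the two counts.
p12 <- sum(pmin(tdm[, "Q1"], tdm[, "Q2"])) / total

pmi  <- log2(p12 / (p1 * p2))        # pointwise mutual information
npmi <- pmi / (-log2(p12))           # PMI normalized by -log2 of the joint share
```

A positive value indicates that the two items co-occur across projects more often than their marginal shares alone would suggest.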

Topic Annotation from Wikidata classes. Once the topic modeling phase is finished, we select the M = 15 most important items (in general, the same number of items used to compute the topic coherence measures) from each semantic category's item-topics matrix and access Wikidata to fetch all classes of which they are instances (P31). To compute the relative importance of these Wikidata classes in each topic of a particular semantic category, we first produce a binary (index) items x classes matrix. Then only the weights of the selected M = 15 items are extracted from the category's item-topics matrix, and the two matrices (the index matrix and the obtained subset of the item-topics matrix) are multiplied; the columns of the resulting classes x topics matrix are normalized, and the results are taken as a measure of the importance of a particular Wikidata class for a particular topic in the given semantic category.
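
The matrix product described above can be sketched as follows. This is an illustration only: the item IDs, class names, and weights are invented, M is reduced to 3, and normalizing each column to sum to one is an assumption, since the text does not say which normalization is applied.

```r
# Toy inputs; item IDs, class IDs, and weights are invented.
# itemTopic: items x topics weights for the top-M items (M = 3 here).
itemTopic <- matrix(c(0.5, 0.1,
                      0.3, 0.2,
                      0.2, 0.7),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Q1", "Q2", "Q3"), c("topic1", "topic2")))

# itemClass: binary items x classes index (1 = item is an instance of class).
itemClass <- matrix(c(1, 0,
                      1, 1,
                      0, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Q1", "Q2", "Q3"), c("classA", "classB")))

# classes x topics: sum the weights of the items that belong to each class.
classTopic <- t(itemClass) %*% itemTopic

# Normalize each column (topic) so that class importances sum to 1
# (assumed normalization).
classTopic <- sweep(classTopic, 2, colSums(classTopic), "/")
```

Each column of `classTopic` can then be sorted in decreasing order to list the classes of a topic by importance.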

Published Datasets. All WDCM (S)itelinks published data sets are available from https://analytics.wikimedia.org/datasets/wdcm/WDCM_Sitelinks/ and are described in the README.txt file in the same directory. Some of these published data sets are picked up by a regular update procedure that scans the update timestamp in this directory every hour, and are transferred to the wikidataconcepts.eqiad.wmflabs CloudVPS instance, where they are used by the Shiny dashboard itself.

Dashboard Functionality

Note: "semantic topic" and "semantic theme" are used interchangeably in this section.

The dashboard is organized in two Views:

Category View, and
Wiki View.

Category View

In the Category view we select one WDCM semantic category (e.g. Architectural Structure, Geographical Object, Human, Event, etc.) and the dashboard produces a set of analytical results on Wikidata usage from that category. The choice of the category is always made on the Category View: Category tab (the first tab under the Category View), and the choice made there applies to all tabs under the Category View.

Category View: Category

Each row in the table stands for one semantic theme that describes the selected category of Wikidata items. The most important Wikidata classes that describe a particular semantic theme are listed in the Classes column, in decreasing order of importance in each theme. The Diversity score, expressed in percent, tells us how "diversified" the given semantic theme is. A semantic theme can be focused on some Wikidata items and classes while other items or classes might be relatively unimportant to it. The higher the diversity score of a semantic theme, the larger the number of items and classes that play an important role in it. To understand a particular theme, you need to inspect which classes are more important in it. Later, you will see how the Wikidata classes can be used to describe each Wikipedia with respect to where it focuses its interests within a given semantic theme and category (hint: see Wiki View: Wiki:Topics).

Category View: Themes: Items

The chart represents the most important items in the selected semantic theme for the respective category. The vertical axis represents the item weight (0 - 1) in the given semantic theme: higher weights indicate more important items. In order to understand the meaning of the selected theme, look at the most important items and ask yourself: what principle holds them together?

Category View: Distribution: Items

The chart represents the distribution of item weight in all semantic themes in the selected semantic category. The horizontal axis represents Item Weight (which is a probability measure, thus ranging from 0 to 1), while the vertical axis stands for the number of items of a given weight. Roughly speaking, the more spread out the distribution in a given theme, the more diversified the semantics that it describes (i.e. a larger number of different Wikidata items play a significant role in it; the theme is "less focused").

Category View: Themes: Projects

The chart represents the top 50 Wikipedias in which the selected semantic theme in the respective category plays an important role. Each Wikipedia receives an importance score in each semantic theme of a particular category of Wikidata items. The vertical axis represents the importance score (0 - 1, i.e. how important the respective theme is in a given Wikipedia).

Category View: Distribution: Projects

The chart represents the distribution of the importance score for a selected semantic theme across Wikipedias. Each Wikipedia receives an importance score in each semantic theme of a particular category of Wikidata items. The more spread out the distribution of the importance score, the larger the number of Wikipedias in which the respective semantic theme plays an important role. The horizontal axis represents the importance score (0 - 1), while the vertical axis stands for the count of Wikipedias with the respective score.

Category View: Items: Graph

The graph represents the structure of similarity across the most important items in the selected category. The similarity between any two items is computed from their weights across all semantic themes in the category. Each item in the graph points towards the three most similar items to it: the width of the line that connects them corresponds to how similar they are. Items receiving a lot of incoming links are quite interesting, as they act as "hubs" in the similarity structure of the whole category: they are rather illustrative of the category's semantics in general.
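
A graph of this kind can be sketched in base R as below. The sketch rests on assumptions: the item weights are invented, and Euclidean distance over the theme weights stands in for the similarity measure, which the text does not name.

```r
# Toy item x theme weights; IDs and values are invented.
w <- matrix(c(0.9, 0.1,
              0.8, 0.2,
              0.7, 0.3,
              0.2, 0.8,
              0.1, 0.9),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("Q", 1:5), c("theme1", "theme2")))

# Pairwise Euclidean distances between items (an assumed similarity proxy).
d <- as.matrix(dist(w))
diag(d) <- Inf  # an item should not be listed as its own neighbour

# For each item, the names of its k nearest (most similar) items.
k <- 3
neighbours <- t(apply(d, 1, function(x) names(sort(x))[seq_len(k)]))

# Edge list: one directed edge from each item to each of its k neighbours.
edges <- data.frame(from = rep(rownames(neighbours), each = k),
                    to   = as.vector(t(neighbours)))
```

Counting how often each item appears in `edges$to` gives its number of incoming links, which is how the "hubs" mentioned above would show up in such an edge list.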

Category View: Items: Hierarchy

We first look at (1) how similar the items from the selected category are, then (2) trace how the items form small groups (i.e. clusters) with respect to their mutual similarity, and then (3) how these small groups tend to join to form progressively larger groups of similar items. Do not forget that the similarity between items here is not guided only by what you or anyone else would claim to know about them, but also by how the editor community chooses to use these items across various Wikipedias! For example, if two manifestly unrelated items are frequently used across the same set of Wikipedias, they will be recognized as similar in that respect.
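
The three steps above correspond to agglomerative hierarchical clustering, which can be sketched with base R as follows. The weights are invented, and both the Euclidean distance and the complete linkage are assumptions, since the text does not specify which ones the dashboard uses.

```r
# Toy item x theme weights; IDs and values are invented.
w <- matrix(c(0.9, 0.1,
              0.8, 0.2,
              0.7, 0.3,
              0.2, 0.8,
              0.1, 0.9),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("Q", 1:5), c("theme1", "theme2")))

# (1) pairwise dissimilarities between items (assumed: Euclidean).
d <- dist(w)

# (2) and (3): merge the closest items first, then progressively larger groups
# (assumed: complete linkage).
hc <- hclust(d, method = "complete")

# Cut the tree into two groups to inspect the top-level split.
groups <- cutree(hc, k = 2)

# plot(hc) would draw the dendrogram.
```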

Category View: Projects: Graph

The graph represents the structure of similarity across the Wikipedias in the selected category. The similarity between any two Wikipedias is computed from their importance scores across all semantic themes in the category. Each Wikipedia in the graph points towards the three most similar Wikipedias to it: the width of the line that connects them corresponds to how similar they are. Wikipedias receiving a lot of incoming links act as "attractors" in the similarity structure of the whole category: they are rather representative of the category as such.

Category View: Projects: Hierarchy

We look at (1) how similar the Wikipedias in the selected category are, then (2) trace how they first form small groups of Wikipedias (i.e. clusters) with respect to their mutual similarity, and then (3) how these small groups tend to join to form progressively larger groups of similar Wikipedias. In other words, similar Wikipedias are found under the same branches of the tree spawned by this hierarchical representation of similarity.

Wiki View

In the Wiki View we select one Wikipedia (e.g. enwiki, ruwiki, dewiki, frwiki, cewiki, etc.) and the dashboard produces a set of analytical results on its Wikidata usage. The choice of Wikipedia is always made on the Wiki View: Wikipedia tab (the first tab under the Wiki View), and the choice made there applies to all tabs under the Wiki View.

Wiki View: Wikipedia

Four charts are generated upon the selection of a Wikipedia:

Category Distribution in Wiki. This pie chart presents the distribution of item usage across all considered WDCM semantic categories in the selected project.
Local Semantic Neighbourhood. This graph presents the selected Wikipedia alongside the ten most similar Wikipedias to it. Similarity was computed by inspecting a large number of Wikidata items from all item classes under consideration and registering what items are used across different Wikipedias. Each Wikipedia points towards the three most similar Wikipedias to it. NOTE. This is the local similarity neighbourhood only; a full similarity graph can be obtained from the Wiki:Similarity tab.
Category Usage Profiles. The chart represents the usage of different Wikidata classes in the selected Wikipedia and the ten most similar Wikipedias to it. The vertical axis, representing the count of items used from the respective classes on the horizontal axis, is provided on a logarithmic scale. The data points of the selected Wikipedia are labeled by exact counts.
Wikipedia Similarity Profile. The histogram represents the distribution of similarity between the selected Wikipedia and all other Wikipedias on this dashboard. The measure used is the Jaccard distance, which ranges from 0 (high similarity) to 1 (low similarity). It is binned into ten categories on the horizontal axis, while the count of Wikipedias found in each bin is given on the vertical axis. The more the histogram's mass is concentrated on the left (towards 0), the higher the number of Wikipedias similar to the selected one.
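
The Jaccard-based similarity profile can be sketched as follows. The usage matrix and project names are invented; the sketch assumes that item usage is reduced to a binary used / not used indicator per Wikipedia, and it relies on the "binary" method of base R's dist(), which computes the Jaccard distance for 0/1 data.

```r
# Toy binary usage matrix: rows = Wikipedias, columns = items (1 = item used).
# Project names and values are invented for illustration.
used <- matrix(c(1, 1, 1, 0,
                 1, 1, 0, 0,
                 0, 0, 1, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("wikiA", "wikiB", "wikiC"), paste0("Q", 1:4)))

# Jaccard distance = 1 - |intersection| / |union| of the sets of used items.
jd <- as.matrix(dist(used, method = "binary"))

# Distances from the selected Wikipedia to all others, binned into ten bins.
selected <- "wikiA"
dSel <- jd[selected, setdiff(rownames(jd), selected)]
bins <- cut(dSel, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
binCounts <- table(bins)
```

Here `binCounts` holds the histogram heights: the number of Wikipedias falling into each of the ten distance bins.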

Wiki View: Wiki:Similarity

The graph represents the similarity structure across all Wikipedias that can be compared to the selected one. We first select all Wikipedias that make use of the same semantic categories as the selected one, and then inspect how many times each of the 10,000 most frequently used Wikidata items in each semantic category was used in every comparable Wikipedia. From these data we derive a similarity measure that describes the pairwise similarity among Wikipedias.

Each Wikipedia in the graph points towards the three most similar Wikipedias to it: the width of the line that connects them corresponds to how similar they are.

Wiki View: Wiki:Topics

The chart represents the importance score of the selected Wikipedia in each semantic theme (themes are represented on the horizontal axes of the plots), in each semantic class. Here we can start building an understanding of "what a particular Wikipedia is about": we might first study each semantic theme in each semantic class (in Category View: Category) to understand what the semantic themes represent, and then come back here to see in which semantic themes of particular classes the selected Wikipedia is well represented. While the horizontal axes represent a large number of semantic themes, not every WDCM semantic category (the categories are represented on different panels) encompasses that many themes; take a look at Category View: Category to find out how many semantic themes there are in a particular class. Data points for themes that do not exist in a particular class, or that have an importance score of zero, are not labeled.

WDCM (T)itles Dashboard

The WDCM (T)itles Dashboard is an exact copy of the WDCM (S)itelinks dashboard, except that it extracts only the (T)itle usage aspect from the client's wbc_entity_usage schema.

WDCM Code Repository

All WDCM code is hosted on Gerrit and distributed on GitHub.

WDCM Puppetization

The WDCM is currently undergoing puppetization:

On the statboxes: see https://phabricator.wikimedia.org/T171258
- NOTE: you might have noticed that all WDCM Engine scripts mentioned in this technical documentation have a _goransm suffix. The reason is that, because of the current setup on the statboxes (stat1004, stat1005), the analytics-wmde user is not able to run HiveQL scripts as an automated user, which is currently preventing the puppetization of the WDCM system in production. Phabricator tickets have been opened in that respect and the problem will be resolved soon.