Wikidata Concepts Monitor


The Wikidata Concepts Monitor (WDCM) is a system that analyzes and visualizes Wikidata usage across the Wikimedia projects (see the WDCM Wikidata project page, the WDCM Dashboards, and the code repositories on Gerrit and Diffusion).

WDCM is developed and maintained (mainly) by Goran S. Milovanovic, Data Scientist, WMDE; any suggestions, contributions, and questions are welcome and should be directed to him.


This page presents the technical documentation and the key aspects of the system design of the Wikidata Concepts Monitor (WDCM). The WDCM data product is a set of Shiny dashboards, fully developed in R, that provide analytical insight into Wikidata usage across its client projects. In deployment, WDCM runs on the open-source version of RStudio Shiny Server. The WDCM dashboards are hosted on the wikidataconcepts Cloud VPS (Labs) instance and rely on a MariaDB back-end for their immediate functionality; however, the WDCM system as a whole also depends on a number of ETL procedures that run in production (stat1004 and stat1005) and are supported by Apache Sqoop and Hadoop, as well as on a set of SPARQL queries that extract pre-defined sets of Wikidata items for further analysis. This document explains the modular design of WDCM and documents the critical procedures; the public code repositories that hold the respective procedures are found on Gerrit and Diffusion.

Note: the WDCM Dashboards user manuals are found in the Description section on the respective dashboards (WDCM Overview, WDCM Usage, WDCM Semantics, WDCM Geo).

The WDCM System Operation Workflow: an overview of the WDCM monthly update.

General Approach to the Study of Wikidata usage

While Wikidata itself is a semantic ontology with pre-defined and evolving normative rules of description and inference, Wikidata usage is essentially a social, behavioral phenomenon, suitable for study by means of machine learning in the field of distributional semantics: the analysis and modeling of statistical patterns of occurrence and co-occurrence of Wikidata item and property usage across the client projects (e.g. enwiki, frwiki, ruwiki, etc.). WDCM thus employs various statistical models (topic modeling, clustering, dimensionality reduction, and the like, beyond elementary descriptive statistics, of course) in an attempt to describe and provide insights from the observable Wikidata usage statistics.

Wikidata usage patterns

The “golden line” that connects the reasoning behind all WDCM functions can be described, non-technically, in the following way. Imagine observing the number of times each item in a set of N particular Wikidata items was used across some project (enwiki, for example). Imagine having the same data for other projects as well: if 200 projects are under analysis, then we have 200 counts for each of the N items in the set, and the data can be described by an N x 200 matrix (items x projects). Each column of counts, representing the frequency of occurrence of all Wikidata items under consideration across one of the 200 projects under discussion - a vector, obviously - represents a particular Wikidata usage pattern. By inspecting and statistically modeling the usage pattern matrix - the matrix that encompasses all such usage patterns across the projects, or the derived covariance/correlation matrix - many insights into the similarities between the Wikimedia projects (or, more precisely, between their usage patterns) can be obtained.

In order to illustrate the logic behind all structural WDCM analyses, the following figure presents the Wikidata usage patterns across 14 item categories for some of the largest Wikipedia projects:

WDCM Usage Patterns.

The horizontal axis lists the 14 categories of Wikidata items that are currently tracked by the WDCM system. The vertical axis represents the logarithm of the count of how many times the items from a particular category have been used on a particular Wikipedia. The logarithmic scale is used only to prevent overcrowding of the data points in the plot. Each line connecting the data points represents one Wikidata usage pattern. In this setting, a usage pattern is a characteristic of a particular Wikipedia. However, we could imagine the same plot "transposed", so that each line would represent a category of items rather than a project; we would thus obtain the category-specific usage patterns.

The following explanation is a gross simplification of what WDCM does; however, it offers a useful conceptual introduction to the inner workings of the system. From the viewpoint of Wikidata usage in the 14 semantic categories presented in the picture, any Wikipedia project can be described as a vector of 14 numbers, each number standing for the count of how many times an item from the respective category has been used in the Wikipedia under consideration. The lines connecting the data points for a particular project in the plot represent exactly those counts (precisely: their logarithms). How can we use this information to assess the similarity in Wikidata usage between any two Wikipedias? The simplest possible approach is to compute the correlation between the respective usage patterns. The following table presents a correlation matrix in which both rows and columns stand for projects. The matrix is populated by correlation coefficients (we used Spearman's ρ). These coefficients range from -1 (perfect negative correlation) to +1 (perfect positive correlation); a value of zero would mean that the two usage patterns are not related at all.

        cebwiki dewiki enwiki frwiki itwiki ruwiki tawiki zhwiki
cebwiki    1.00   0.75   0.82   0.80   0.68   0.73   0.75   0.85
dewiki     0.75   1.00   0.90   0.96   0.92   0.89   0.96   0.88
enwiki     0.82   0.90   1.00   0.92   0.93   0.75   0.88   0.86
frwiki     0.80   0.96   0.92   1.00   0.95   0.87   0.93   0.94
itwiki     0.68   0.92   0.93   0.95   1.00   0.77   0.91   0.86
ruwiki     0.73   0.89   0.75   0.87   0.77   1.00   0.90   0.76
tawiki     0.75   0.96   0.88   0.93   0.91   0.90   1.00   0.83
zhwiki     0.85   0.88   0.86   0.94   0.86   0.76   0.83   1.00

As we can see from this correlation matrix, all diagonal elements, which represent the correlations of the particular projects' usage patterns with themselves, equal one, as expected: a Wikidata usage pattern for a particular project is maximally self-similar. Looking into any other cell reveals a positive number less than one, each representing the correlation between the usage patterns of the Wikipedias in the respective row and column of the matrix. Thus, we can say that dewiki and enwiki (with a correlation of 0.90) are more similar with respect to how they use Wikidata items from the 14 categories under consideration than, say, dewiki and cebwiki (with a correlation of 0.75).
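To make this concrete, here is a minimal R sketch of the computation just described (the counts are simulated; only the method - Spearman correlations between usage-pattern vectors - matches the text above, and the project and category labels are illustrative):

 # Toy categories x projects usage matrix; in WDCM the counts come from the
 # wbc_entity_usage data, here they are simulated.
 set.seed(42)
 usage <- matrix(rpois(14 * 8, lambda = 500), nrow = 14, ncol = 8,
                 dimnames = list(paste0("category_", 1:14),
                                 c("cebwiki", "dewiki", "enwiki", "frwiki",
                                   "itwiki", "ruwiki", "tawiki", "zhwiki")))
 # Project-by-project similarity: Spearman correlations between the columns
 # (each column is one Wikidata usage pattern).
 project_similarity <- cor(usage, method = "spearman")
 # Category-by-category similarity: transpose and correlate again; this is the
 # computation behind the second correlation table below.
 category_similarity <- cor(t(usage), method = "spearman")
 round(project_similarity, 2)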

Now let's transpose the usage patterns and re-calculate the correlation matrix:

Architectural Structure Astronomical Object Book Chemical Entities Event Gene Geographical Object Human Organization Scientific Article Taxon Thoroughfare Wikimedia Work Of Art
Architectural Structure 1.00 0.50 0.79 0.93 0.67 0.93 -0.45 0.81 0.95 0.98 -0.10 0.60 0.86 0.67
Astronomical Object 0.50 1.00 0.90 0.62 0.90 0.62 0.05 0.74 0.55 0.57 0.19 0.81 0.50 0.90
Book 0.79 0.90 1.00 0.81 0.95 0.81 -0.14 0.83 0.79 0.81 0.10 0.76 0.74 0.95
Chemical Entities 0.93 0.62 0.81 1.00 0.76 1.00 -0.38 0.93 0.88 0.95 0.05 0.76 0.93 0.76
Event 0.67 0.90 0.95 0.76 1.00 0.76 -0.17 0.83 0.64 0.69 0.02 0.71 0.69 1.00
Gene 0.93 0.62 0.81 1.00 0.76 1.00 -0.38 0.93 0.88 0.95 0.05 0.76 0.93 0.76
Geographical Object -0.45 0.05 -0.14 -0.38 -0.17 -0.38 1.00 -0.40 -0.24 -0.29 0.43 -0.05 -0.26 -0.17
Human 0.81 0.74 0.83 0.93 0.83 0.93 -0.40 1.00 0.79 0.83 0.14 0.88 0.86 0.83
Organization 0.95 0.55 0.79 0.88 0.64 0.88 -0.24 0.79 1.00 0.98 -0.02 0.67 0.81 0.64
Scientific Article 0.98 0.57 0.81 0.95 0.69 0.95 -0.29 0.83 0.98 1.00 0.00 0.69 0.88 0.69
Taxon -0.10 0.19 0.10 0.05 0.02 0.05 0.43 0.14 -0.02 0.00 1.00 0.48 0.29 0.02
Thoroughfare 0.60 0.81 0.76 0.76 0.71 0.76 -0.05 0.88 0.67 0.69 0.48 1.00 0.71 0.71
Wikimedia 0.86 0.50 0.74 0.93 0.69 0.93 -0.26 0.86 0.81 0.88 0.29 0.71 1.00 0.69
Work Of Art 0.67 0.90 0.95 0.76 1.00 0.76 -0.17 0.83 0.64 0.69 0.02 0.71 0.69 1.00

We have now used the Wikidata usage patterns of particular categories of items across the projects to compute the correlations. Again, all categories are maximally self-similar with respect to how they are used across the projects: look at the diagonal elements. However, we can now say that, for example, the Architectural Structure category is used across the Wikipedias under consideration more similarly to the Chemical Entities category (with a correlation of 0.93) than to the Taxon category (with a correlation of -0.10).

The variety of usage patterns

We can imagine computing the usage patterns across many different variables. For example, we did not have to compare the Wikipedias by how much they make use of Wikidata items from particular item categories; we could instead ask how much they use particular items. In that case we would have to deal with usage patterns that encompass many millions of elements, and not only the fourteen elements that represent the aggregate counts in particular categories. Had we included all Wikimedia projects that have client-side Wikidata tracking enabled, we would have to deal with more than 800 usage patterns, one for every project, and not with only eight as in this example. We could also have picked only the items from a single Wikidata category, for example all instances of Human (Q5), and then computed the correlations between the usage patterns across all of them; we would again have to deal with usage patterns several million elements long. Every time we change the definition of a usage pattern, we change the goals of the analysis, and this is the first thing to keep in mind when learning about WDCM. We can analyze the similarity between the Wikidata usage patterns of different projects from the viewpoint of only some Wikidata items, or from the viewpoint of a complete category of Wikidata items, or we can analyze only a subset of projects. On many levels of analysis, WDCM changes these "perspectives" to illustrate the ways in which the Wikimedia projects make use of Wikidata in as much detail as possible.

The second thing to keep in mind at this point is that this example is, once again, a gross oversimplification of our methodology. WDCM uses a much more advanced mathematical model to assess the similarities in usage patterns than the correlation matrices used in this example. We will later use a few words here and there to provide a conceptual introduction to the methodology used to assess project and category similarity in Wikidata usage, but an interested reader who wants to go under the hood will certainly have to do some reading first. Don't worry, we will list the recommended readings too.

In essence, the technology and mathematics behind WDCM rely on the same set of practical tools and ideas that support the development of semantic search engines and recommendation systems, only applied to a specific dataset that encompasses the usage patterns of tens of millions of Wikidata entities across its client projects.


The data obtained in this way, and analyzed properly, allow for inferences about how different communities use Wikidata to build their specific projects, or about the ways in which semantically related collections of entities are used across some set of projects. Knowing this, it becomes possible to suggest what cooperation among the communities would be fruitful and mutually beneficial in terms of enhancing Wikidata usage on the respective projects. On the other hand, communities that are focused on particular semantic topics, categories (sets), sub-ontologies, etc. can advance by recognizing the similarity in their approaches and efforts. Thus, a whole new level of collaborative development around Wikipedia could be achieved. This goal motivates the development of the WDCM system, beyond the obvious possibility of providing data of fundamental scientific importance - for cognitive and data scientists, sociologists of knowledge, AI engineers, ontologists, pure enthusiasts, and many others.

WDCM is designed to answer questions like the following:

  • How much are particular classes of Wikidata items used across the Wikimedia projects?
  • What are the most frequently used Wikidata items on particular Wikimedia projects, or from particular sets of Wikidata items?
  • How can we categorize the Wikimedia projects with respect to the characteristic patterns of Wikidata usage that we discover in them?
  • Which Wikimedia projects are similar with respect to how they use Wikidata, overall and from the perspective of particular sets of items?
  • How is the usage of geolocalized Wikidata items (such as those relevant for the GLAM initiatives) spatially distributed?


Wikidata usage

Wikidata usage analytics

Wikidata usage analytics refers to all important and interesting statistics, summaries of statistical models and tests, visualizations, and reports on how Wikidata is used across the Wikimedia projects. The end goal of WDCM is to deliver consistent, high-quality Wikidata usage analytics.

Wikidata usage (statistics)

Consider a set of sister projects (e.g. enwiki, dewiki, frwiki, zhwiki, ruwiki, etc; from the viewpoint of Wikidata usage, we also call them: client projects). Statistical count data that represent the frequency of usage of particular Wikidata entities over any given set of client projects are considered to be Wikidata usage (statistics) in the context of WDCM.

An important note on the Wikidata usage definition

The following discussion relies on the understanding of the Wikibase Schema, especially the wbc_entity_usage table schema (a more thorough explanation of Wikidata item usage tracking in the wbc_entity_usage tables is provided on Phabricator). The methodological discussion of the development of Wikidata usage tracking in relation to this schema is also found on Phabricator.

A strict, working, operational definition of Wikidata usage data is still under development. The problem is of a technical nature and related to the current logic of the wbc_entity_usage table schema. This table is found on the MariaDB replicas, in the database of any project that has client-side Wikidata usage tracking enabled.

The “S”, “T”, “O”, and “X” usage aspects

The problematic field in the current wbc_entity_usage schema is eu_aspect. Under its current definition, this field allows selecting in a non-redundant way only the “S”, “O”, and “T” entity usage aspects; meaning: only “S”, “O”, and “T” occurrences of any given Wikidata entity on any given sister project that maintains client-side Wikidata usage tracking signal one and only one entity usage in the respective aspect on that project (i.e. these aspects are non-overlapping in their registration of Wikidata usage). However, while “S”, “O”, and “T” do not overlap with each other, they may overlap with the “X” usage aspect. Excluding the “X” aspect from the definition is also not possible: ignoring it implies that the majority of relevant usage, e.g. usage in infoboxes, would not be tracked (accessing statement data via Lua is typically tracked as “X”).
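As a hedged illustration, a count of usages per aspect family can be read directly from one project's wbc_entity_usage table; the host name and credentials file below are placeholders, not the actual replica configuration:

 library(RMySQL)

 con <- dbConnect(MySQL(),
                  host = "<replica-host>",             # placeholder
                  dbname = "enwiki_p",                 # example client project
                  default.file = "~/replica.my.cnf")   # placeholder credentials
 # Group "L.xx" aspects together with "S", "O", "T", "X" by their prefix.
 aspects <- dbGetQuery(con, "
   SELECT SUBSTRING_INDEX(eu_aspect, '.', 1) AS aspect, COUNT(*) AS usages
   FROM wbc_entity_usage
   GROUP BY aspect;")
 dbDisconnect(con)
 aspects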

The “L” aspects problem: tracking the fallback mechanism

The “L” aspects, usually accompanied by a specific language modifier (e.g. “L.en”, “L.de”, and similar), cannot currently be counted in a non-redundant way. This is a consequence of the way the wbc_entity_usage table is produced with respect to the possible triggering of the language fallback mechanism. To explain the language fallback mechanism briefly: for example, let the fallback chain for a particular language be “L.de-ch” → “L.de” → “L.en”. That implies the following: if the usage of an item label in Swiss German (“L.de-ch”) was attempted and no label in Swiss German was found, an attempt to use the German label (“L.de”) would be made, and an attempt at the English label (“L.en”) would be made in the end if the previous attempt fails. However, if the language fallback mechanism is triggered on a particular entity usage occasion, all L aspects in that fallback chain are registered in the wbc_entity_usage table as if they were used simultaneously. From the viewpoint of Wikidata usage, it would be interesting to track (a) the attempted – i.e. the user-intended – L aspect, or (at least) (b) the actually used L aspect for a given entity usage. However, the current design of the wbc_entity_usage table does not provide for an assessment of either of these.

Finally, there are other uncertainties related to the current design of the wbc_entity_usage table. For example, imagine an editor action that results in the presence of a particular item with a sitelink while instantiating a label in a particular language at the same time. How many item usage counts do we have: one, two, or more (one “S” aspect count for the sitelink, and at least one more for the specific “L” aspect)?

In conclusion, if Wikidata usage statistics are to encompass all the different ways in which item usage could be defined, mapping onto all possible editor actions that instantiate a particular item on a particular page, the design of the wbc_entity_usage table would have to undergo a thorough revision, or a new Wikidata usage tracking mechanism would have to be developed from scratch. The wbc_entity_usage table was never designed for analytical purposes in the first place; however, it is the only source of Wikidata usage statistics that we can currently rely on.

A proposal for an initial solution:

NOTE: This is the current Wikidata usage definition in the context of WDCM.

Given the existing wbc_entity_usage table schema, it seems possible to rely on the following definition. For the initial version of the WDCM system, use a simplified definition of Wikidata usage that excludes the multiple-usages-per-page cases (a minimal query sketch is given after the two lists below); in effect:

  • count on how many pages a particular Wikidata item occurs in a project;
  • take that as a Wikidata usage per-project statistic;
  • ignore usage aspects completely until a proper tracking of usage per-page is enabled in the future.

By "proper tracking of usage per-page" the following is meant:

  • a methodology that counts exactly how many usage cases of a particular item there are on a particular page in a particular project.
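Under this simplified definition, the per-project usage statistic reduces to a count of distinct pages per item. The following is a minimal sketch of the corresponding HiveQL aggregation over the sqooped table (column names as documented in the data schemata below; running it through the hive CLI from R is an assumption about the execution environment):

 usage_query <- "
   SELECT eu_entity_id,
          wiki_db AS eu_project,
          COUNT(DISTINCT eu_page_id) AS eu_count
   FROM   wdcm_clients_wb_entity_usage
   GROUP  BY eu_entity_id, wiki_db
 "
 # e.g. system(paste0("hive -e '", usage_query, "' > wdcm_usage_counts.tsv"))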

WDCM Taxonomy

The WDCM Taxonomy is a human-made selection of specific categories and items from the Wikidata ontology that are submitted to WDCM for analytics.

Currently, only one WDCM Taxonomy is specified (Lydia Pintscher, 05/03/2017, Berlin).

The fact that the WDCM relies on a specific choice of taxonomy implies that not all Wikidata items are necessarily tracked and analyzed by the system.

Users of WDCM can specify any imaginable taxonomy that presents a proper subset of Wikidata; no components of the WDCM system are dependent upon any characteristics of some particular choice of taxonomy.

Once defined, the WDCM taxonomy is translated into a set of (typically very simple) SPARQL queries that are used to collect the respective item IDs; only collected items will be tracked and analyzed by the system.
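As a hedged example of one such simple SPARQL query, the following R snippet collects item IDs for instances of Human (Q5) from the Wikidata Query Service; the endpoint URL and the use of {httr}/{jsonlite} are assumptions, and a category of this size would in practice require paging or a dump-based approach rather than a single LIMITed request:

 library(httr)
 library(jsonlite)

 sparql <- "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 100"
 res <- GET("https://query.wikidata.org/sparql",
            query = list(query = sparql, format = "json"))
 stop_for_status(res)
 bindings <- fromJSON(content(res, as = "text", encoding = "UTF-8"))$results$bindings
 item_ids <- sub("^.*entity/", "", bindings$item$value)   # e.g. "Q42", "Q80", ...
 head(item_ids)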

The 14 currently encompassed item categories are:

  • Architectural Structure
  • Astronomical Object
  • Book
  • Chemical Entities
  • Event
  • Gene
  • Geographical Object
  • Human
  • Organization
  • Scientific Article
  • Taxon
  • Thoroughfare
  • Wikimedia
  • Work Of Art

The WDCM Taxonomy is still undergoing refinement. Ideally, category overlap would be avoided completely; this is not yet the case, and it is questionable whether a general solution is possible at all given the structure of Wikidata. The following directed graph shows the current item categories in the WDCM taxonomy and the network of the P279 (Subclass Of) relations in which they take part. The structure was obtained by performing a recurrent search through the P279 paths, starting from entity (Q35120) and descending to a depth of 4 (searching for subclasses of subclasses of subclasses, etc.); a query sketch of this search is given after the figure below. Some item categories from the WDCM Taxonomy are not found even at subclass depth 4 from entity, which is the necessary target item of any such recurrent P279 path (starting from anything, of course). Note: the only cycle in the graph is Entity → Entity.

WDCM Taxonomy, P279 structure down to depth 4 from entity (Q35120).
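A hedged sketch of the recurrent P279 search described above: an iterative, depth-limited walk down the subclass graph starting from entity (Q35120). The endpoint, the batching via VALUES, and the plain GET request are assumptions; at realistic frontier sizes a production version would need POST requests, batching, and throttling.

 library(httr)
 library(jsonlite)

 endpoint <- "https://query.wikidata.org/sparql"

 direct_subclasses <- function(qids) {
   values <- paste0("wd:", qids, collapse = " ")
   query  <- sprintf(
     "SELECT DISTINCT ?sub WHERE { VALUES ?parent { %s } ?sub wdt:P279 ?parent . }",
     values)
   res <- GET(endpoint, query = list(query = query, format = "json"))
   stop_for_status(res)
   b <- fromJSON(content(res, as = "text", encoding = "UTF-8"))$results$bindings
   if (length(b) == 0) return(character(0))
   sub("^.*entity/", "", b$sub$value)
 }

 frontier <- "Q35120"   # entity
 seen <- frontier
 for (depth in 1:4) {
   frontier <- setdiff(direct_subclasses(frontier), seen)
   if (length(frontier) == 0) break
   seen <- union(seen, frontier)
 }
 # 'seen' now holds the classes reachable from Q35120 within four P279 steps.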

WDCM Data Schemata

The WDCM data schemata encompass three components:

  • HDFS, Big Data component (Production, Analytics Cluster)
  • The RDBS Component (Cloud VPS, MariaDB)
  • Cloud VPS Instance Local Data Component

The first, (1) the HDFS Big Data component, is produced by (A) an R-orchestrated Apache Sqoop cycle which transfers the many wbc_entity_usage tables from MariaDB in production to a single Hive table in Hadoop on the Analytics Cluster, and (B) an ETL and machine learning cycle that uses R packages and HiveQL interchangeably to produce the data sets for (2) the RDBS component on tools.labsdb and (3) the Local Data Component, stored as .csv files on the wikidataconcepts.eqiad.wmflabs instance.

The details of this process are given below (WDCM System Operation Workflow). All WDCM public data sets are produced from the HDFS Big Data component and are made publicly available; the WDCM Dashboards operate on the very same files.

HDFS, Big Data component (Production)

This component encompasses the following two Hive tables, currently in the goransm Hadoop database:


wdcm_clients_wb_entity_usage

This table is the result of the WDCM_Sqoop_Clients.R script, which runs a regular weekly Apache Sqoop update to collect the data from all client projects that maintain the wbc_entity_usage table - which means that they have client-side Wikidata usage tracking enabled. This Hive table is the raw WDCM data set; it is simply the product of sqooping many big MariaDB tables, which are not suited for analytics queries, into Hadoop (a sketch of this Sqoop step is given after the table below). This Hive table is used only to produce the wdcm_maintable in Hive, which is then used in all WDCM pre-processing operations.

col_name      data_type  comment
eu_row_id     bigint     row identifier
eu_entity_id  string     Wikidata item ID, e.g. Q5
eu_aspect     string     eu_aspect, see the wbc_entity_usage schema
eu_page_id    bigint     the ID of the page where the item is used
wiki_db       string     partition; the project database, e.g. "enwiki", "dewiki"
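A hedged sketch of the Sqoop orchestration referred to above: one sqoop import per client wiki, loading into the Hive table just described. The host, credentials, and mapper count are placeholders, not the production configuration.

 client_wikis <- c("enwiki", "dewiki", "frwiki")   # in production: 800+ wikis

 for (wiki in client_wikis) {
   cmd <- paste(
     "sqoop import",
     sprintf("--connect jdbc:mysql://<replica-host>/%s", wiki),     # placeholder host
     "--username <user> --password-file <hdfs-path-to-password>",   # placeholders
     "--table wbc_entity_usage",
     "--hive-import --hive-table goransm.wdcm_clients_wb_entity_usage",
     sprintf("--hive-partition-key wiki_db --hive-partition-value %s", wiki),
     "-m 4"
   )
   system(cmd)
 }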


wdcm_maintable

The main WDCM data set: a Hive table produced from the wdcm_clients_wb_entity_usage table by the WDCM_Engine_goransm.R script. All further WDCM data sets are produced from this Hive table by the same script.

col_name      data_type  comment
eu_entity_id  string     Wikidata item ID, e.g. Q5
eu_project    string     the project database, e.g. "enwiki", "dewiki"
eu_count      bigint     the WDCM statistic: on how many different pages is this eu_entity_id used in this eu_project?
category      string     partition; the WDCM Taxonomy category to which this eu_entity_id belongs

The RDBS component (Cloud VPS)

This component is supported by MariaDB, with a number of SQL tables in the u16664__wdcm_p database on tools.labsdb. All these tables are produced by the WDCM_Process.R script that is run from the wikidataconcepts.eqiad.wmflabs Cloud VPS (i.e. Labs) instance; the same instance serves the WDCM front-end. All tables are currently written via the {RMySQL} package; this will change in the future (direct system() calls to MariaDB will be used instead).
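A minimal, hedged sketch of how WDCM_Process.R could write one of the tables listed below with {RMySQL}; the connection parameters and the toy data frame are placeholders, with column names taken from the table descriptions that follow:

 library(RMySQL)

 con <- dbConnect(MySQL(),
                  host = "tools.labsdb",
                  dbname = "u16664__wdcm_p",
                  default.file = "~/replica.my.cnf")   # credentials file (assumption)

 wdcm2_project <- data.frame(eu_project = c("enwiki", "dewiki"),
                             eu_count   = c(123456L, 98765L),
                             projectype = c("Wikipedia", "Wikipedia"))

 dbWriteTable(con, "wdcm2_project", wdcm2_project,
              overwrite = TRUE, row.names = FALSE)
 dbDisconnect(con)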

wdcm2_category The per WDCM semantic category aggregated WDCM usage statistics.

wdcm2_category data type
category text
eu_count bigint(20)

wdcm2_category_item100 The 100 most frequently used Wikidata items per WDCM semantic category, including item labels.

wdcm2_category_item100 data type
eu_entity_id varchar(255)
eu_count int(11)
category varchar(255)
eu_label varchar(255)

wdcm2_category_project_2dmap The t-SNE 2D reduction of the WDCM Semantic Category x Wikimedia Projects matrices. Representation is by category.

wdcm2_category_project_2dmap data type
D1 double
D2 double
category text

wdcm2_itemtopic_<category_name> [generic; as many tables as there are WDCM semantic categories]. The Wikidata Item x Topic matrices obtained from Latent Dirichlet Allocation across the 5,000 most frequently used items per WDCM semantic category. Each category thus receives one table, for example wdcm2_itemtopic_Taxon. The number of topic fields depends on the particular LDA topic model, of course. Generated initially by the WDCM_Engine_goransm.R script in production, then only copied and stored as SQL tables (a sketch of this modeling step is given after the table below).

wdcm2_itemtopic_<category> data type
eu_entity_id text
topic1 double
topic2 double
topic3 double
... double
topicN double
eu_label text
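A hedged sketch of the topic-modeling step that produces these item x topic tables: MAP estimation of an LDA model with {maptpx} on a toy count matrix for one semantic category. The matrix sizes, the orientation (projects as documents, items as the vocabulary), and the number of topics are illustrative assumptions.

 library(maptpx)

 set.seed(1)
 # rows: 20 client projects (documents); columns: 200 items of one category;
 # cells: usage counts (simulated here).
 counts <- matrix(rpois(20 * 200, lambda = 3), nrow = 20, ncol = 200)

 lda_fit <- topics(counts, K = 5)                  # 5 latent topics (illustrative)

 project_topics <- lda_fit$omega                   # projects x topics weights
 item_topics    <- as.data.frame(lda_fit$theta)    # items x topics probabilities
 names(item_topics) <- paste0("topic", seq_len(ncol(item_topics)))
 head(item_topics)                                 # shape of a wdcm2_itemtopic_<category> table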

wdcm2_project WDCM total usage statistics per Wikimedia project.

wdcm2_project data type
eu_project text
eu_count bigint(20)
projectype text

wdcm2_project_category The Wikimedia Project x WDCM Semantic Category usage statistics cross-tabulation.

wdcm2_project_category data types
eu_project varchar(255)
category varchar(37)
eu_count int(11)
projecttype varchar(255)

wdcm2_project_category_2dmap The t-SNE 2D reduction of the WDCM Semantic Category x Wikimedia Projects matrices. Representation is by project.

wdcm2_project_category_2dmap data type
D1 double
D2 double
projects text
projecttype text

wdcm2_project_category_item100 The 100 most frequently used Wikidata items per WDCM semantic category per Wikimedia project.

wdcm2_project_category_item100 data type
eu_project varchar(255)
category varchar(255)
eu_entity_id varchar(37)
eu_count int(11)
projecttype varchar(255)
eu_label varchar(255)

wdcm2_project_item100 The 100 most frequently used Wikidata items per project, with item labels included.

wdcm2_project_item100 data type
eu_project varchar(255)
eu_entity_id varchar(37)
eu_count int(11)
projecttype varchar(255)
eu_label varchar(255)

wdcm2_projects_2dmaps The t-SNE 2D reductions of the Wikimedia Projects x Topics matrices, obtained from Latent Dirichlet Allocation (run across the 5,000 most frequently used items per WDCM semantic category over the Wikidata Item x Wikimedia Project matrices) in WDCM_Engine_goransm.R. Generated initially by the WDCM_Engine_goransm.R script in production, then only copied and stored as SQL tables. These are the most precise WDCM descriptions of usage pattern structures across the Wikimedia projects.
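A hedged sketch of the corresponding t-SNE step: {Rtsne} applied to a project x topic matrix, keeping the two-dimensional coordinates that the *_2dmap tables store. The input matrix and the perplexity value are illustrative only.

 library(Rtsne)

 set.seed(7)
 project_topics <- matrix(runif(200 * 12), nrow = 200, ncol = 12)  # 200 projects x 12 topics (toy)

 tsne_fit <- Rtsne(project_topics, dims = 2, perplexity = 30, check_duplicates = FALSE)

 projects_2dmap <- data.frame(D1 = tsne_fit$Y[, 1],
                              D2 = tsne_fit$Y[, 2])
 head(projects_2dmap)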

Cloud VPS Instance Local Data Component

Sometimes, for smaller data sets, it is more efficient to load them with data.table::fread() than to fetch them from MariaDB. The WDCM system therefore stores some of the data sets locally as .csv files, and the dashboards load them directly from the respective directories.
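A minimal sketch of this local-load path (the directory and file name are placeholders):

 library(data.table)

 local_dir <- "/srv/wdcm/data"                                                # placeholder path
 projects_topics <- fread(file.path(local_dir, "wdcm_projects_topics.csv"))   # placeholder file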

The following .csv tables are produced in production (stat1005) by WDCM_Engine_goransm.R and then copied to the wikidataconcepts.eqiad.wmflabs instance, where they are loaded directly from R to support the WDCM Dashboards:


  • the Wikimedia Projects x Topics matrices obtained from Latent Dirichlet Allocation across the 5,000 most frequently used items per WDCM semantic category



Besides these data frames, a specific set of files used by the WDCM Geo Dashboard is also stored locally on the Cloud VPS wikidataconcepts.eqiad.wmflabs instance:


WDCM System Operation Workflow

The following schema represents the WDCM System Operation Workflow. We will proceed by explaining it component by component.

WDCM System Operation Workflow.
  • The first phase is performed by the WDCM_Sqoop_Clients.R script, run on a regular weekly schedule from stat1004. This R script orchestrates Apache Sqoop operations to (1) transfer the many (currently more than 800) wbc_entity_usage SQL tables from the m2 MariaDB replica in order to produce (2) the Hadoop/Hive wdcm_clients_wb_entity_usage table, partitioned by client project (wiki_db), in the goransm database, where they can be processed. The first phase typically takes 7-8 hours to complete.
  • The second phase (3 - 6) is performed by the WDCM_Engine_goransm.R script from stat1005, run on a regular monthly basis. This script is, in many respects, the central WDCM ETL and computing engine:
    • in the first step, this script will (3) load the current WDCM Taxonomy to determine what Wikidata item classes need to be fetched from Wikidata via the SPARQL endpoint;
    • in the second step, (4) many millions of Wikidata items are selected and their IDs fetched from the SPARQL endpoint, in order to determine what information to search for in the previously produced wdcm_clients_wb_entity_usage Hive table;
    • in the third step, (5) many HiveQL batch-processing cycles are performed in order to aggregate the WDCM statistics per Wikimedia project and per Wikidata item and produce the wdcm_maintable, partitioned by WDCM semantic category;
    • in the fourth step, (6a) ETL steps are performed in HiveQL over the wdcm_maintable to produce various aggregate WDCM data sets;
      • in the fifth step, the Wikimedia Project x Wikidata Item matrices - one matrix per WDCM semantic category - are submitted to topic modeling via the R {maptpx} MAP estimation of the Latent Dirichlet Allocation model; while {maptpx} allows for rapid estimation of LDA topic models relative to other algorithms, this step will soon be replaced by running LDA from Apache Spark in order to gain additional processing efficiency and enable cross-validation procedures;
      • in the sixth step, the Wikimedia Project x Semantic Topic matrices are submitted to the {Rtsne} implementation of t-Distributed Stochastic Neighbor Embedding for 2D dimensionality reduction to support visualizations on the WDCM Dashboards, and the coordinates of the respective 2D representations are stored;
      • in the seventh step, data frames for network visualizations with R {visNetwork} are prepared and stored;
    • in the final step, (6b) all data sets are made public from /srv/published-datasets/wdcm on stat1005, which is mapped to a public location for open access. The second phase - the operation of the WDCM_Engine_goransm.R script from stat1005 - typically takes 32 - 33 hours to complete.
  • The third and final phase begins when (7) a regularly scheduled hourly check for timestamp changes in the WDCM public data sets, run from the wikidataconcepts.eqiad.wmflabs Cloud VPS instance, determines that a new WDCM update is ready (a minimal sketch of such a check is given after this list) and (8) starts the execution of the WDCM_Process.R script:
    • This script (9a) populates the SQL tables in the u16664__wdcm_p MariaDB database and (9b) copies some of the WDCM public data sets into local directories, which in turn support the WDCM Shiny Dashboards directly. This operation typically takes around 20 minutes to complete. Item labels are fetched from the wb_terms table (see the Wikibase schema) in this phase only, in order to avoid overloading the SPARQL endpoint in the previous phases.
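A minimal, hedged sketch of the hourly check in step (7) referenced above; the data set URL, file names, and paths are placeholders, not the actual configuration:

 library(httr)

 dataset_url <- "https://<public-datasets-host>/wdcm/<dataset>.csv"   # placeholder
 stamp_file  <- "~/wdcm_last_update.txt"

 remote_stamp <- headers(HEAD(dataset_url))[["last-modified"]]
 local_stamp  <- if (file.exists(stamp_file)) readLines(stamp_file, n = 1) else ""

 if (!identical(remote_stamp, local_stamp)) {
   system("Rscript WDCM_Process.R")        # (8) start the update
   writeLines(remote_stamp, stamp_file)    # remember the new timestamp
 }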

The process illustrated here does not encompass the WDCM Geo Dashboard engine update, which is run by a separate script (WDCM_EngineGeo_goransm.R) from stat1005, relying on the same WDCM Hive tables as the main engine update.

WDCM Dashboards

The Dashboards module is a set of RStudio Shiny dashboards that serve Wikidata usage analytics to end users.

This set of Shiny dashboards relies on the WDCM database (u16664__wdcm_p on tools.labsdb) to serve Wikidata usage analytics; the database is populated directly by the WDCM_Process.R script (a minimal sketch of a dashboard reading from this database is given at the end of this section).

Currently, the WDCM System runs four dashboards:

  • WDCM Overview, providing an elementary overview - the "big picture" - of Wikidata usage;
  • WDCM Usage, providing detailed usage statistics;
  • WDCM Semantics, providing insights from the topic models derived from the usage data; and
  • WDCM Geo, providing interactive maps of geolocalized Wikidata items alongside the respective WDCM usage statistics.

All WDCM Dashboards are documented in their respective Description sections. A walk-through with illustrative usage examples is provided on the WDCM Project page.
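To illustrate the dashboard pattern, here is a minimal, hedged Shiny sketch that reads one WDCM table from the MariaDB back-end at startup and plots it; the connection details are placeholders, the column names follow the table descriptions above, and the real dashboards are considerably richer:

 library(shiny)
 library(RMySQL)

 con <- dbConnect(MySQL(), host = "tools.labsdb", dbname = "u16664__wdcm_p",
                  default.file = "~/replica.my.cnf")                 # placeholder
 wdcm2_project <- dbGetQuery(con, "SELECT * FROM wdcm2_project;")
 dbDisconnect(con)

 ui <- fluidPage(
   titlePanel("WDCM: Wikidata usage per project (sketch)"),
   selectInput("ptype", "Project type:", choices = unique(wdcm2_project$projectype)),
   plotOutput("usagePlot")
 )

 server <- function(input, output, session) {
   output$usagePlot <- renderPlot({
     d <- wdcm2_project[wdcm2_project$projectype == input$ptype, ]
     d <- head(d[order(-d$eu_count), ], 25)
     barplot(d$eu_count, names.arg = d$eu_project, las = 2,
             main = "Top 25 projects by total Wikidata usage")
   })
 }

 shinyApp(ui, server)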

WDCM Code Repository

All WDCM code is hosted on Gerrit and distributed on Diffusion:

WDCM Puppetization

WDCM puppetization is currently in progress:

  • On the statboxes: see
    • NOTE: you might have noticed that all WDCM Engine scripts mentioned in this technical documentation carry a _goransm suffix. The reason is that, because of the current setup on the statboxes (stat1004, stat1005), the analytics-wmde user is not able to run HiveQL scripts as an automated user, which currently prevents the puppetization of the WDCM system in production. Phabricator tickets have been opened in that respect and the problem should be resolved soon.
  • On the wikidataconcepts labs instance: see