WMDE/Wikidata/Growth

From Wikitech

Last updated March 2020

Data & Writing

Edit rate

Wikidata edit rate (per year)

2019-20 prediction: no vast increase in rate, 200 million to 250 million edits.

Data from Hadoop (https://phabricator.wikimedia.org/P8193). Yearly EPM (edits per minute) is calculated as X/365/24/60, where X is the number of edits in the year.

Past rate:

  • 2012, 2,912,964
  • 2013, 94,323,394 (179 EPM)
  • 2014, 87,411,229 (166 EPM)
  • 2015, 102,362,226 (194 EPM)
  • 2016, 135,511,683 (257 EPM)
  • 2017, 192,353,549 (365 EPM)
  • 2018, 208,944,716 (397 EPM)
  • 2019, 263,729,707 (501 EPM)
  • 2020, 244,624,756 (465 EPM)
  • 2021, 137,918,274 (392 EPM) (End of August 2021)

Yearly edit rate equivalent sustained EPMs

To put the yearly figures in perspective, the table below converts yearly edits to the sustained (average) EPM for the year.

Yearly edits    Sustained EPM
200 million     380
300 million     570
600 million     1141
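
As a minimal sketch of the X/365/24/60 conversion, the following Python snippet reproduces the figures in the table. The function name `yearly_edits_to_epm` is illustrative and not part of any Wikimedia tooling. Rounding down is an assumption inferred from the figures on this page.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def yearly_edits_to_epm(yearly_edits: int) -> int:
    """Return the sustained edits per minute (EPM) for a yearly edit count.

    The result is rounded down (assumption: this matches how the figures
    on this page appear to have been derived).
    """
    return yearly_edits // MINUTES_PER_YEAR


if __name__ == "__main__":
    for edits in (200_000_000, 300_000_000, 600_000_000):
        print(f"{edits:>11,} edits/year = {yearly_edits_to_epm(edits)} EPM")
```

The same function applied to a past year, for example `yearly_edits_to_epm(94_323_394)` for 2013, gives the EPM listed next to that year.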

Revision count

The revision count can be retrieved at any time by looking at the revision ID of the most recent page creation on https://www.wikidata.org/wiki/Special:NewPages

Wikidata

  • 2018 we were at 208,944,716 edits.
  • March 2019 we are at 881,499,873 revisions.
  • June 2019 we are at 965,310,320 revisions.
  • October 2019 we are at 1,042,114,532 revisions.
  • February 2020 we are at 1,129,441,630 revisions.
  • September 2021 we are at 1,494,706,636 revisions.

This will probably increase to over 2 billion around 2023-2024.

Long term, reaching 4,294,967,295 (bigint revids)

Based on what we know now, we predict that we will not need bigints on the revision table until at least 2025, and likely further in the future.

Year (end)    Estimated increase    Total
2021          200-350 million       1.5-1.85 billion
2022          200-400 million       1.7-2.25 billion
2023          200-450 million       1.9-2.7 billion
2024          200-500 million       2.1-3.2 billion
2025          200-550 million       2.3-3.75 billion
2026          200-600 million       2.5-4.15 billion

Commons

  • June 2019 we are at 354,280,797 revisions.
  • October 2019 we are at 372,682,216 revisions.
  • February 2020 we are at 401,617,996 revisions.
  • September 2021 we are at 589,723,281 revisions.

Entity size

Average size

  • Average size of items remains pretty steady: ~18KB in March 2019, rising to ~20KB in September 2021.
  • The 2022-23 prediction is that this will not increase to over ~30KB.
  • Lexeme size isn't tracked, but assumed to be much smaller than items.

Max size

  • In 2019 the max size of entities was increased from 2500 to 3000.

Storage in memcached

Note: This may no longer be relevant?

Currently (March 2019) the size of entities could become an issue for storage in the shared memcached cache when they reach 1MB.

See WMDE/Wikidata/Caching#WikiPageEntityRevisionLookup for more details.

Right now the biggest shared cache entity is less than 200k, meaning the max entity size limit would have to increase roughly fivefold, to around 15,000, to become an issue[citation needed].

Changes in the way the serialization is stored could accelerate this, though.

Number of Entities by type

Grafana: https://grafana.wikimedia.org/d/000000167/wikidata-datamodel

Items

2021-22 predicted growth: 15 million to 25 million items, resulting in no more than 117 million items.

Past growth:

  • 2016-17 5.3 million
  • 2017-18 17.7 million
  • 2018-19 11.3 million (ending with 53.6 million)
  • 2019-20 18.5 million (ending with 72 million)
  • 2020-21 19.5 million (ending with 91.5 million)

Properties

2021-22 predicted growth: 1000 to 2000 properties, resulting in no more than 10,300 properties.

Past growth:

  • 2016-17, 900
  • 2017-18, 1200
  • 2018-19, 1500 (ending in 5715)
  • 2019-20, 1329 (ending in 7044)
  • 2020-21, 1245 (ending in 8289)

Lexemes

Lexemes were released to the world in 2018.

  • 2018-19, (ending in 40k)
  • 2019-20, +189k (ending in 229k)
  • 2020-21, +161k (ending in 390k)

Unless something drastic happens, we should comfortably stay below 2 million lexemes for 2021-2022, and maybe even until 2024/5.

No prediction for Forms or Senses here...

MediaInfo

There is currently no Grafana tracking for MediaInfo entities.

  • 2019-03-xx 273,540 entities, out of 52 million files ≈ 0.5% of files
  • 2019-06-20 1,210,558 entities, out of 54.4 million files ≈ 2.2% of files
  • 2019-11-09 2,861,376 entities, out of 57.1 million files ≈ 5.0% of files
  • 2021-09-08 68,215,980 entities, out of 78.5 million files ≈ 86.9% of files

The number of MediaInfo entities is expected to eventually match the number of files on Commons (>50 million).

DB Tables size

The ratio of auto-increment values to their maximum value for Wikidata is monitored on https://grafana.wikimedia.org/d/79S1Hq9Mz/wikidata-reliability-metrics

Currently everything is below 40% usage.
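
For illustration, the ratio for a single column can be computed by hand. The sketch below assumes the revision ID is limited to 4,294,967,295 (the limit quoted in the revision count section) and uses the September 2021 Wikidata revision count from this page. `auto_increment_usage` is a hypothetical helper, not the query behind the dashboard.

```python
UINT32_MAX = 4_294_967_295  # largest value of an unsigned 32-bit integer


def auto_increment_usage(current_value: int, max_value: int = UINT32_MAX) -> float:
    """Return how much of an auto-increment range is used, as a percentage."""
    return 100 * current_value / max_value


if __name__ == "__main__":
    # Wikidata revision count reported for September 2021 on this page.
    usage = auto_increment_usage(1_494_706_636)
    print(f"revision IDs: {usage:.1f}% of the assumed maximum used")
```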

We could also monitor disk and index sizes, but we don't right now.

Misc storage

WikibaseQualityConstraints check data

TBA (we are going to store this data persistently, but don't yet)

Usage & Reading

TBA more stuff?

Wikidata.org / Repo

3rd party federated wikis

At some point we will develop federation for 3rd parties. This will likely result in an increase in requests to Special:EntityData and/or the API. More details to come in the future...

3rd party WDQS updaters

As identified in https://phabricator.wikimedia.org/T217897#5020183, WDQS updaters, both internal to WMF and external, hit Special:EntityData a lot. These requests account for most of the cache misses on wikidata.org.

WDQS

Naturally this is predicted to increase, but this is mainly for the WMF search platform team to worry about.