WMDE/Wikidata/Growth

From Wikitech

Last updated March 2020

Data & Writing

Edit rate

Wikidata edit rate (per year)

2019-20 prediction: no vast increase in rate, 200 million - 250 million edits per year.

Data from Hadoop: https://phabricator.wikimedia.org/P8193. Yearly EPM (edits per minute) is calculated as X/365/24/60, where X is the yearly edit count.

Past rate:

  • 2012, 2,912,964
  • 2013, 94,323,394 (179 EPM)
  • 2014, 87,411,229 (166 EPM)
  • 2015, 102,362,226 (194 EPM)
  • 2016, 135,511,683 (257 EPM)
  • 2017, 192,353,549 (365 EPM)
  • 2018, 208,944,716 (397 EPM)
  • 2019, (415 EPM) (as of the end of October 2019)

Yearly edit rate equivalent sustained EPMs

To put the yearly figures in perspective, the conversion table below shows the sustained (average) EPM over the year that a given number of yearly edits corresponds to.

Yearly edits    EPM
200 million     380 EPM
300 million     570 EPM
600 million     1141 EPM
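
The conversion is simple enough to sketch in a few lines. The snippet below is only an illustration of the X/365/24/60 formula; the function name is made up for this example, and rounding down is an assumption that happens to match the figures listed on this page.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, as in the X/365/24/60 formula


def yearly_edits_to_epm(yearly_edits: int) -> int:
    """Return the sustained edits per minute for a yearly edit count.

    Rounds down (an assumption; it matches the figures on this page).
    """
    return yearly_edits // MINUTES_PER_YEAR


for edits in (200_000_000, 300_000_000, 600_000_000):
    print(f"{edits:,} edits per year is about {yearly_edits_to_epm(edits)} EPM")
```

The same function reproduces the past rates, for example 208,944,716 edits in 2018 gives 397 EPM.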

Revision count

Can be retrieved at any given time by looking at the rev id of the latest new page creation on https://www.wikidata.org/wiki/Special:NewPages

Wikidata

  • March 2019 we are at 881,499,873 revisions.
  • June 2019 we are at 965,310,320 revisions.
  • October 2019 we are at 1,042,114,532 revisions.
  • February 2020 we are at 1,129,441,630 revisions.

This was predicted to increase to over 1 billion by the end of 2019, which it did by October 2019.

In 2018 the yearly edit count was 208,944,716. The revision count is predicted to keep increasing by around 200 million - 250 million per year for 2019-20.

Long term, reaching 4,294,967,295 (bigint rev ids)

Based on what we know now we would predict that we will not need bigints on the revision table until at least 2025, likely further in the future.

Year (end)    Predicted increase    Total
2020          200-300 million       1.3-1.5 billion
2021          200-350 million       1.5-1.85 billion
2022          200-400 million       1.7-2.25 billion
2023          200-450 million       1.9-2.7 billion
2024          200-500 million       2.1-3.2 billion
2025          200-550 million       2.3-3.75 billion
2026          200-600 million       2.5-4.35 billion
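
The projection can be sketched as a running sum of yearly increases checked against the 4,294,967,295 limit. This is a rough illustration only: the function names are invented for the example, and the starting totals of 1.1 and 1.2 billion are assumptions chosen to line up with the end-of-2020 range in the table.

```python
UNSIGNED_INT_MAX = 4_294_967_295  # 2**32 - 1, the limit discussed above


def project_totals(start_total, first_year, yearly_increases):
    """Accumulate yearly increases into (year, end-of-year total) pairs."""
    totals = []
    total = start_total
    for offset, increase in enumerate(yearly_increases):
        total += increase
        totals.append((first_year + offset, total))
    return totals


def first_year_over(totals, limit=UNSIGNED_INT_MAX):
    """Return the first year whose end-of-year total exceeds the limit, or None."""
    for year, total in totals:
        if total > limit:
            return year
    return None


# Lower bound: a flat 200 million per year from an assumed 1.1 billion.
lower = project_totals(1_100_000_000, 2020, [200_000_000] * 7)
# Upper bound: 300 million in 2020, growing by 50 million each year,
# from an assumed 1.2 billion.
upper = project_totals(1_200_000_000, 2020,
                       range(300_000_000, 650_000_000, 50_000_000))

print("lower bound passes the limit in:", first_year_over(lower))
print("upper bound passes the limit in:", first_year_over(upper))
```

Under these assumptions the lower bound never reaches the limit within the projected years, and the upper bound only passes it after 2025.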

Commons

  • June 2019 we are at 354,280,797 revisions.
  • October 2019 we are at 372,682,216 revisions.
  • February 2020 we are at 401,617,996 revisions.

Entity size

Average size

  • Average size of items remains pretty steady, ~18KB in March 2019
  • The 2019-20 prediction does not see this increasing to over ~30KB
  • Lexeme size isn't tracked, but assumed to be much smaller than items.

Max size

  • In 2019 the max size of entities was increased from 2500 to 3000.

Storage in memcached

Currently (March 2019) the size of entities could become an issue for storage in the shared memcached cache when they reach 1MB.

See WMDE/Wikidata/Caching#WikiPageEntityRevisionLookup for more details.

Right now the biggest shared cache entity is less than 200k, meaning the max entity size limit would have to increase to around 15,000 to become an issue[citation needed].

Changes in the way the serialization is stored though could accelerate this.
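
One plausible reading of the ~15,000 figure is a simple proportional scaling, sketched below. This is not a confirmed derivation: it assumes the biggest cached entity grows linearly with the max entity size limit, reads "200k" as 200KB, and treats 1MB as 1000KB. The function name is made up for the example.

```python
def limit_at_cache_threshold(current_limit, biggest_cached_kb, threshold_kb):
    """Scale the max entity size limit until the biggest cached entity
    would reach the cache threshold (assumes linear scaling)."""
    return current_limit * threshold_kb / biggest_cached_kb


# Current limit 3000, biggest cached entity ~200KB, memcached issue at ~1MB.
print(limit_at_cache_threshold(3000, 200, 1000))  # 15000.0
```

Since the biggest cached entity is described as less than 200k, the real figure under this reading would be somewhat higher than 15,000.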

Number of Entities by type

Grafana: https://grafana.wikimedia.org/d/000000167/wikidata-datamodel

Items

2020-21 predicted growth: 15 million - 25 million, resulting in no more than 97 million items.

Past growth:

  • 2016-17 5.3 million
  • 2017-18 17.7 million
  • 2018-19 11.3 million (ending with 53.6 million)
  • 2019-20 18.5 million (ending with 72 million)

Properties

2020-21 predicted growth: 1500 - 2000 properties, resulting in no more than 9300 properties.

This takes into account the fact that the rate of creation has increased every year, and also that Commons will start using properties in 2019, so we may see an increase in property creation due to that.

Past growth:

  • 2016-17, 900
  • 2017-18, 1200
  • 2018-19, 1500 (ending in 5715)
  • 2019-20, 1550 (ending in 7265)

Lexemes

Lexemes were only released to the world in 2018, so their growth is hard to predict.

The last 9 months (to March 2019) have seen an increase from 3509 to 43500.

Unless something drastic happens we would comfortably stay below 1 million lexemes for 2019-2020, maybe even until 2021.

No prediction for Forms or Senses here...

MediaInfo

There is currently no Grafana tracking for MediaInfo entities.

  • 2019-03-xx 273,540 entities, out of 52 million files ≈ 0.5% of files
  • 2019-06-20 1,210,558 entities, out of 54.4 million files ≈ 2.2% of files
  • 2019-11-09 2,861,376 entities, out of 57.1 million files ≈ 5.0% of files
  • 2020-03-05 entities, out of million files ≈ % of files

MediaInfo entities are expected to have the same number as the number of files on Commons (>50 million).
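
The coverage percentages above are just entities divided by files. A minimal sketch, with a function name invented for the example, reproducing the three complete data points:

```python
def coverage_percent(entities: int, files: int) -> float:
    """Return the share of Commons files that have a MediaInfo entity, in percent."""
    return round(100 * entities / files, 1)


# (date, MediaInfo entities, Commons files) taken from the list above.
snapshots = [
    ("2019-03", 273_540, 52_000_000),
    ("2019-06-20", 1_210_558, 54_400_000),
    ("2019-11-09", 2_861_376, 57_100_000),
]

for date, entities, files in snapshots:
    print(f"{date}: {coverage_percent(entities, files)}% of files")
```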

DB Tables size

Latest info on auto inc fields running out of space: https://phabricator.wikimedia.org/P8198

wb_terms

wb_terms is VERY big (on disk), and is going to see no further adoption.

It is going to be killed in March 2020.

"new" term storage

wbt_* tables

text & revisions

These tables will share the same growth pattern in terms of auto inc ids and the need to switch to bigints.

See predicted revision count in WMDE/Wikidata/Growth#Revision_count.

links tables

wikidatawiki

pagelinks

  • 300,484,443 - 14 Nov 2019
  • 1,424,231,788 - 5 March 2020

commonswiki

pagelinks

  • 564,570,593 - 14 Nov 2019
  • 584,754,585 - 5 March 2020

recentchanges & cu_changes

Based on the predicted revision increase rate (see WMDE/Wikidata/Growth#Revision_count) we would fill the current auto increment fields between 2022 and 2024.

Query for auto inc data: https://phabricator.wikimedia.org/P10620

wikidatawiki recentchanges

- March 2019 - 919219099 out of 2147483647
- March 2020 - 1167669202 out of 2147483647, ratio 0.5437

wikidatawiki cu_changes

- March 2019 - 899023427 out of 2147483647
- March 2020 - 1151724031 out of 2147483647, ratio 0.5363
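
As a rough cross-check of the 2022-2024 estimate, the two snapshots per table can be extrapolated linearly. This sketch assumes the snapshots are exactly one year apart and that growth stays constant, which the rest of this page suggests is optimistic; the function names are made up for the example.

```python
SIGNED_INT_MAX = 2_147_483_647  # current auto increment field limit


def fill_ratio(current: int, limit: int = SIGNED_INT_MAX) -> float:
    """Return how much of the auto increment range is already used."""
    return round(current / limit, 4)


def years_until_full(previous: int, current: int,
                     limit: int = SIGNED_INT_MAX) -> float:
    """Extrapolate one year of growth linearly until the limit is reached."""
    yearly_growth = current - previous
    return (limit - current) / yearly_growth


# (March 2019, March 2020) auto increment values from the lists above.
tables = {
    "recentchanges": (919_219_099, 1_167_669_202),
    "cu_changes": (899_023_427, 1_151_724_031),
}

for name, (march_2019, march_2020) in tables.items():
    years = years_until_full(march_2019, march_2020)
    print(f"{name}: ratio {fill_ratio(march_2020)}, "
          f"full in about {years:.1f} years from March 2020")
```

At the 2019-20 growth rate both tables would fill in just under four years, so around early 2024; any increase in the edit rate pulls that date earlier.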

Misc storage

WikibaseQualityConstraints check data

TBA (we are going to persistently store this stuff)

Usage & Reading

TBA more stuff?

Wikidata.org / Repo

3rd party federated wikis

At some point we will develop federation for 3rd parties. This will likely result in an increase in requests to Special:EntityData and/or the API. More details to come in the future...

3rd party WDQS updaters

As identified in https://phabricator.wikimedia.org/T217897#5020183 WDQS updaters both internal to WMF and external hit Special:EntityData a lot. These requests account for most of the cache misses on wikidata.org.

The PHP processing for these requests is fairly lightweight, but continued uncached requests here will directly increase reads from the shared entity revision cache in memcached.

WDQS

Naturally this is predicted to increase but this is mainly for the WMF discovery team to worry about.

There will likely be a growth in internal WMF requests (particularly from Wikibase quality constraints) as the checks are planned to run after every edit. Thus as edit rate increases the number of these checks increases.

Comments

Lydia's growth thoughts from early 2019

  • Creation Rate
    • Items
      • Interest in project is growing
      • OTOH some groups are splitting out into own projects
      • Creation rate will not slow down
    • Property
      • Creation rate may slow down a bit, but follow existing trend
    • MediaInfo
      • Huge growth expected, number of M entities to be similar to number of files on commons
      • Commons is also expected to grow at a high rate
      • Properties for Commons?
        • No significant rise in number expected.
    • Lexemes
      • Early stage of project, significant growth expected
      • Auto generating forms and senses - how much data is actually stored (curatable?) vs generated on the fly (i.e. only materialized when requested)
  • General edit rate growth
    • Editing from clients (Wikipedias)
      • volumes of edits comparable to bot edit volume currently
  • Growth in the size of the entity
    • On average each item will have more data
  • Data used on client wikis
  • WDQS
    • WMF should be taking care of this
  • External to wikidata?
    • Non-WMF federated wikis accessing Wikidata data