Mediawiki wikitext current
This page describes the dataset on HDFS and Hive that stores the wikitext of the latest revision of every page of WMF's wikis, as provided through the monthly XML Dumps. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline/Spark external table wmf.mediawiki_wikitext_current. A new monthly snapshot is produced around the 13th of each month (the last XML dumps are made available by the 11th); to check whether it is ready to be queried, look at the status of the mediawiki-wikitext-current-coord Oozie job. Also visit Analytics/Data access if you don't know how to access this dataset.
The underlying data is stored in Avro file format on HDFS, at paths following the pattern hdfs:///wmf/data/wmf/mediawiki/wikitext/current/snapshot=YYYY-MM/wiki_db=WIKI_DB.
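If you want to read the files directly instead of going through the Hive table, a minimal sketch from a spark-shell could look like the following. This assumes an Avro data source (e.g. spark-avro) is available in your Spark session; the snapshot and wiki_db values are only illustrative:
// Sketch: read one wiki's Avro files directly from HDFS (assumes an Avro data source is on the classpath)
val path = "hdfs:///wmf/data/wmf/mediawiki/wikitext/current/snapshot=2019-12/wiki_db=simplewiki"
val df = spark.read.format("avro").load(path)
df.printSchema()
df.select("page_id", "page_title", "revision_id").show(10, false)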
This table is deprecated and should not be used for new use cases. For existing use cases, we encourage moving to the new Mediawiki content current v1 dataset, which provides daily updates.
Schema
You can get the canonical version of the schema by running describe wmf.mediawiki_wikitext_current from the Hive/Beeline/Spark command line.
Note that the snapshot and wiki_db fields are Hive partitions. They explicitly map to snapshot folders in HDFS. Since the data is only relevant at the snapshot level, you should always pick a single snapshot in the where clause of your query (see the example query below the schema table).
col_name | data_type | comment |
---|---|---|
page_id | bigint | id of the page |
page_namespace | int | namespace of the page |
page_title | string | title of the page |
page_redirect_title | string | title of the redirected-to page |
page_restrictions | array<string> | restrictions of the page |
user_id | bigint | id of the user that made the revision (or null/0 if anonymous) |
user_text | string | text of the user that made the revision (either username or IP) |
revision_id | bigint | id of the revision |
revision_parent_id | bigint | id of the parent revision |
revision_timestamp | string | timestamp of the revision (ISO8601 format) |
revision_minor_edit | boolean | whether this revision is a minor edit or not |
revision_comment | string | Comment made with revision |
revision_text_bytes | bigint | bytes number of the revision text |
revision_text_sha1 | string | sha1 hash of the revision text |
revision_text | string | text of the revision |
revision_content_model | string | content model of the revision |
revision_content_format | string | content format of the revision |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular imports) |
wiki_db | string | The wiki_db project |
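As noted above, queries should always filter on a single snapshot (and usually a single wiki). For example, a minimal sketch from a spark-shell, with illustrative snapshot and wiki_db values:
// Sketch: count current pages per namespace for one wiki and one snapshot
val pagesPerNamespace = spark.sql("""
  SELECT page_namespace, COUNT(*) AS pages
  FROM wmf.mediawiki_wikitext_current
  WHERE snapshot = '2019-12'
    AND wiki_db = 'simplewiki'
  GROUP BY page_namespace
  ORDER BY pages DESC
""")
pagesPerNamespace.show(20, false)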
Changes and known problems
Date | Phab Task | Snapshot version | Details |
---|---|---|---|
2019-11-01 | task T238858 | 2019-11 | Creation of the table. Data starts flowing regularly (every month); this actually happened in January 2020, but data was made available from 2019-11. |
XML Dumps Raw Data
The mediawiki_wikitext_current dataset is computed from the pages_meta_current XML dumps. Those are imported every month onto HDFS and stored in folders following this pattern: hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_current/YYYYMMDD/WIKI_DB
Note: There is a one-month difference between the snapshot value of the Avro-converted data and the raw data. By convention in Hive we use the date of the data that is available (for instance, snapshot 2019-11 means that November 2019 data is present), while the raw dumps folder is named after the generation date (20191201 means generation started on 2019-12-01, therefore containing 2019-11 data but not 2019-12).
Example usage
Here is an example of using the current dump in Spark to extract the media files currently linked from each page.
Note: This code doesn't count media made available through templates.
Launch a spark-shell:
spark2-shell \
--master yarn \
--executor-memory 8G \
--executor-cores 4 \
--driver-memory 8G \
--conf spark.dynamicAllocation.maxExecutors=64 \
--conf spark.executor.memoryOverhead=2048 \
--jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar,/srv/deployment/analytics/refinery/artifacts/refinery-hive.jar
Do the computation:
// Media matching and classification code
import scala.util.matching.Regex
import org.wikimedia.analytics.refinery.core.media.{MediaTypeClassification,MediaTypeClassifier}
def firstGroups(str: String, pattern: Regex): Array[String] = {
pattern.findAllMatchIn(str).map(_.group(1)).toArray
}
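// Build the list of media type names used as file-extension alternatives in the regex (the catch-all "other" is excluded)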
val extensionsList = MediaTypeClassification.values.map(_.toString.toLowerCase).filter(_ != "other")
val extensionsPattern = extensionsList.mkString("|")
// Note: Being careful in devising the regex is important as A LOT of text will be processed
val imagePattern = (s"(?i)([\\w][\\w \\-_]{1,500}\\.(?:$extensionsPattern))").r
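// Register UDFs: one extracts media file names from wikitext, the other classifies a file name into its parent media type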
val imagesFromText = (txt: String) => firstGroups(txt, imagePattern)
spark.udf.register("imagesFromText", imagesFromText)
spark.udf.register("classify_media", (media: String) => new MediaTypeClassifier().classify(media).getParentType)
// Process text using 8192 partitions (data is big)
spark.sql("SET spark.sql.shuffle.partitions = 8192")
val mediasPerPage = spark.sql("""
WITH titles AS (
SELECT
wiki_db,
page_id,
page_namespace,
explode(imagesFromText(revision_text)) AS page_title
FROM wmf.mediawiki_wikitext_current
WHERE snapshot = '2019-12'
),
medias AS (
SELECT
wiki_db,
page_id,
page_namespace,
lower(replace(COALESCE(page_title, ''), ' ', '_')) AS media
FROM titles
)
SELECT
wiki_db,
page_id,
page_namespace,
classify_media(media) AS media_classification,
media
FROM medias
""")
// Write results for future reuse
mediasPerPage.repartition(16).write.mode("overwrite").parquet("/tmp/medias_per_page/snapshot=2019-12")
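The written parquet files can then be reused in later sessions. For instance, a quick sanity check of the results could look like this sketch (the path matches the write above):
// Sketch: read the results back and look at media counts per wiki and classification
val results = spark.read.parquet("/tmp/medias_per_page/snapshot=2019-12")
results.groupBy("wiki_db", "media_classification").count().orderBy($"count".desc).show(20, false)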