Data Platform/Data Lake/Content/Mediawiki wikitext current
This page describes the dataset on HDFS and Hive that stores the last-revision wikitext of every page of WMF's wikis, as provided through the monthly XML dumps. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline/Spark external table wmf.mediawiki_wikitext_current. A new monthly snapshot is produced around the 13th of each month (the last XML dump is made available on the 11th); to check whether it is ready to be queried, look at the status of the mediawiki-wikitext-current-coord Oozie job. Also visit Analytics/Data access if you don't know how to access this dataset.
The underlying data is stored in Avro file format on HDFS, at paths following the pattern: hdfs:///wmf/data/wmf/mediawiki/wikitext/current/snapshot=YYYY-MM/wiki_db=WIKI_DB.
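For instance, you can list the available snapshot folders from a spark-shell (a minimal sketch using the standard Hadoop FileSystem API; adjust the path to a specific snapshot or wiki as needed):
import org.apache.hadoop.fs.{FileSystem, Path}

// List the snapshot=YYYY-MM folders under the wikitext-current base path
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("hdfs:///wmf/data/wmf/mediawiki/wikitext/current"))
  .foreach(status => println(status.getPath))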
Schema
You can get the canonical version of the schema by running describe wmf.mediawiki_wikitext_current from the Hive/Beeline/Spark command line.
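For example, from a spark-shell (a minimal sketch; the same statement works in Hive or Beeline):
// Print the table schema, including the snapshot and wiki_db partition columns
spark.sql("DESCRIBE wmf.mediawiki_wikitext_current").show(100, false)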
Note that the snapshot and wiki_db fields are Hive partitions. They explicitly map to snapshot folders in HDFS. Since the data is relevant only at the snapshot level, you should always pick a single snapshot in the where clause of your query (see the example query after the table below).
col_name | data_type | comment |
---|---|---|
page_id | bigint | id of the page |
page_namespace | int | namespace of the page |
page_title | string | title of the page |
page_redirect_title | string | title of the redirected-to page |
page_restrictions | array<string> | restrictions of the page |
user_id | bigint | id of the user that made the revision (or null/0 if anonymous) |
user_text | string | text of the user that made the revision (either username or IP) |
revision_id | bigint | id of the revision |
revision_parent_id | bigint | id of the parent revision |
revision_timestamp | string | timestamp of the revision (ISO8601 format) |
revision_minor_edit | boolean | whether this revision is a minor edit or not |
revision_comment | string | Comment made with revision |
revision_text_bytes | bigint | bytes number of the revision text |
revision_text_sha1 | string | sha1 hash of the revision text |
revision_text | string | text of the revision |
revision_content_model | string | content model of the revision |
revision_content_format | string | content format of the revision |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular imports) |
wiki_db | string | The wiki_db project |
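As an illustration of restricting the partitions, here is a sketch of a query counting current pages per namespace for a single wiki (the snapshot and wiki_db values below are only examples):
// Count current pages per namespace for one wiki, in one snapshot only
val pagesPerNamespace = spark.sql("""
  SELECT page_namespace, COUNT(*) AS pages
  FROM wmf.mediawiki_wikitext_current
  WHERE snapshot = '2019-12'
    AND wiki_db = 'simplewiki'
  GROUP BY page_namespace
  ORDER BY pages DESC
""")
pagesPerNamespace.show(50, false)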
Changes and known problems
Date | Phab Task | Snapshot version | Details |
---|---|---|---|
2019-11-01 | task T238858 | 2019-11 | Creation of the table. Data starts to flow regularly every month (this actually happened in January 2020, but data was made available from 2019-11). |
XML Dumps Raw Data
The mediawiki_wikitext_current dataset is computed from the pages_meta_current XML dumps. Those are imported every month onto HDFS and stored in folders following this pattern: hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_current/YYYYMMDD/WIKI_DB
Note: There is a one-month difference between the snapshot value of the Avro-converted data and that of the raw data. This is because, by convention, in Hive we use the date of the data currently available (for instance, 2019-11 means that November 2019 data is present), while the dump's date is its generation date (20191201 means data generation started on 2019-12-01, therefore containing 2019-11 data but not 2019-12).
Example usage
Here is an example of using the current dump in Spark to extract references to currently-linked media.
Note: This code doesn't count media made available through templates.
Launch a spark-shell:
spark2-shell \
--master yarn \
--executor-memory 8G \
--executor-cores 4 \
--driver-memory 8G \
--conf spark.dynamicAllocation.maxExecutors=64 \
--conf spark.executor.memoryOverhead=2048 \
--jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar,/srv/deployment/analytics/refinery/artifacts/refinery-hive.jar
Do the computation:
// Media matching and classification code
import scala.util.matching.Regex
import org.wikimedia.analytics.refinery.core.media.{MediaTypeClassification,MediaTypeClassifier}
def firstGroups(str: String, pattern: Regex): Array[String] = {
pattern.findAllMatchIn(str).map(_.group(1)).toArray
}
val extensionsList = MediaTypeClassification.values.map(_.toString.toLowerCase).filter(_ != "other")
val extensionsPattern = extensionsList.mkString("|")
// Note: Being careful in devising the regex is important as A LOT of text will be processed
val imagePattern = (s"(?i)([\\w][\\w \\-_]{1,500}\\.(?:$extensionsPattern))").r
val imagesFromText = (txt: String) => firstGroups(txt, imagePattern)
spark.udf.register("imagesFromText", imagesFromText)
spark.udf.register("classify_media", (media: String) => new MediaTypeClassifier().classify(media).getParentType)
// Process text using 8192 partitions (data is big)
spark.sql("SET spark.sql.shuffle.partitions = 8192")
val mediasPerPage = spark.sql("""
WITH titles AS (
SELECT
wiki_db,
page_id,
page_namespace,
explode(imagesFromText(revision_text)) AS page_title
FROM wmf.mediawiki_wikitext_current
WHERE snapshot = '2019-12'
),
medias AS (
SELECT
wiki_db,
page_id,
page_namespace,
lower(replace(COALESCE(page_title, ''), ' ', '_')) AS media
FROM titles
)
SELECT
wiki_db,
page_id,
page_namespace,
classify_media(media) AS media_classification,
media
FROM medias
""")
// Write results for future reuse
mediasPerPage.repartition(16).write.mode("overwrite").parquet("/tmp/medias_per_page/snapshot=2019-12")
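The written parquet files can then be read back in a later session for further analysis, for instance (a minimal sketch reusing the path written above):
// Reload the per-page media extraction and look at the classification breakdown
val mediasPerPageReloaded = spark.read.parquet("/tmp/medias_per_page/snapshot=2019-12")
mediasPerPageReloaded.groupBy("media_classification").count().show(false)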