Data Platform/Data Lake/Content/Mediawiki wikitext current

This page describes the dataset on HDFS and Hive that stores the wikitext of the latest revision of every page of WMF's wikis, as provided through the monthly XML dumps. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline/Spark external table wmf.mediawiki_wikitext_current. A new monthly snapshot is produced around the 13th of each month (the underlying XML dumps are made available around the 11th); to check whether it is ready to be queried, look at the status of the mediawiki-wikitext-current-coord Oozie job. Also visit Analytics/Data access if you don't know how to access this dataset.

Underlying data is stored in Avro file format on HDFS, at paths following the pattern: hdfs:///wmf/data/wmf/mediawiki/wikitext/current/snapshot=YYYY-MM/wiki_db=WIKI_DB.
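For instance, the per-wiki partition folders of one snapshot can be listed from a spark-shell; this is only a minimal sketch, and the snapshot value below is illustrative:

// List the wiki_db partition folders of one snapshot (example snapshot value).
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/wmf/data/wmf/mediawiki/wikitext/current/snapshot=2019-12"))
  .foreach(status => println(status.getPath.getName))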

This table is deprecated, and should not be used for new use cases. For current use cases, we encourage moving to the new Mediawiki content current v1, which provides daily updates.

Schema

You can get the canonical version of the schema by running describe wmf.mediawiki_wikitext_current from the Hive/Beeline/Spark command line.

Note that the snapshot and wiki_db fields are Hive partitions: they explicitly map to folders in HDFS. Since the data is only meaningful at the snapshot level, you should always pick a single snapshot in the WHERE clause of your query.
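For instance, from a spark-shell one can inspect the schema and preview a single partition; this is just a sketch, and the snapshot and wiki_db values below are illustrative:

// Inspect the schema, then preview one wiki for one snapshot.
// Filtering on snapshot (and wiki_db when possible) lets Spark prune partitions.
spark.sql("DESCRIBE wmf.mediawiki_wikitext_current").show(100, false)

spark.sql("""
  SELECT page_id, page_title, revision_id, revision_timestamp
  FROM wmf.mediawiki_wikitext_current
  WHERE snapshot = '2019-12'
    AND wiki_db = 'simplewiki'
""").show(10, false)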

col_name                  data_type       comment
page_id                   bigint          id of the page
page_namespace            int             namespace of the page
page_title                string          title of the page
page_redirect_title       string          title of the redirected-to page
page_restrictions         array<string>   restrictions of the page
user_id                   bigint          id of the user that made the revision (or null/0 if anonymous)
user_text                 string          text of the user that made the revision (either username or IP)
revision_id               bigint          id of the revision
revision_parent_id        bigint          id of the parent revision
revision_timestamp        string          timestamp of the revision (ISO8601 format)
revision_minor_edit       boolean         whether this revision is a minor edit or not
revision_comment          string          comment made with the revision
revision_text_bytes       bigint          number of bytes of the revision text
revision_text_sha1        string          sha1 hash of the revision text
revision_text             string          text of the revision
revision_content_model    string          content model of the revision
revision_content_format   string          content format of the revision
snapshot                  string          versioning information to keep multiple datasets (YYYY-MM for regular imports)
wiki_db                   string          the wiki_db project
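As a sketch of how these fields can be combined (the snapshot and wiki_db values are again only illustrative, and it is assumed that non-redirect pages carry an empty or null page_redirect_title), the query below counts pages, redirects, and total wikitext bytes per namespace for a single wiki:

// Per-namespace page counts, redirect counts and total wikitext bytes for one
// wiki and one snapshot (illustrative values).
spark.sql("""
  SELECT
    page_namespace,
    COUNT(1) AS pages,
    SUM(IF(COALESCE(page_redirect_title, '') != '', 1, 0)) AS redirects,
    SUM(revision_text_bytes) AS wikitext_bytes
  FROM wmf.mediawiki_wikitext_current
  WHERE snapshot = '2019-12'
    AND wiki_db = 'simplewiki'
  GROUP BY page_namespace
  ORDER BY pages DESC
""").show(50, false)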

Changes and known problems

Date         Phab task      Snapshot version   Details
2019-11-01   task T238858   2019-11            Creation of the table. Data starts to flow regularly (every month); this actually happened in January 2020, but data was made available from 2019-11.


XML Dumps Raw Data

The mediawiki_wikitext_current dataset is computed from the pages_meta_current XML dumps. Those are imported every month onto HDFS and stored in folders following this pattern: hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_current/YYYYMMDD/WIKI_DB

Note: There is a one-month difference between the snapshot value of the Avro-converted data and the raw data. This is because, by convention, in Hive we use the date of the data that is actually available (for instance 2019-11 means that November 2019 data is present), while the raw dumps folder is named after the generation date (20191201 means generation started on 2019-12-01, therefore containing 2019-11 data but not 2019-12 data).

Example usage

Here is an example of using the current dump in Spark to extract references to currently-linked media.

Note: This code doesn't count media made available through templates.

Launch a spark-shell:

spark2-shell \
  --master yarn \
  --executor-memory 8G \
  --executor-cores 4 \
  --driver-memory 8G \
  --conf spark.dynamicAllocation.maxExecutors=64 \
  --conf spark.executor.memoryOverhead=2048 \
  --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar,/srv/deployment/analytics/refinery/artifacts/refinery-hive.jar

Do the computation:

// Media matching and classification code

import scala.util.matching.Regex
import org.wikimedia.analytics.refinery.core.media.{MediaTypeClassification,MediaTypeClassifier}

def firstGroups(str: String, pattern: Regex): Array[String] = {
    pattern.findAllMatchIn(str).map(_.group(1)).toArray
}

val extensionsList = MediaTypeClassification.values.map(_.toString.toLowerCase).filter(_ != "other")
val extensionsPattern = extensionsList.mkString("|")

// Note: Being careful in devising the regex is important as A LOT of text will be processed
val imagePattern = (s"(?i)([\\w][\\w \\-_]{1,500}\\.(?:$extensionsPattern))").r

val imagesFromText = (txt: String) => firstGroups(txt, imagePattern)

spark.udf.register("imagesFromText", imagesFromText)
spark.udf.register("classify_media", (media: String) => new MediaTypeClassifier().classify(media).getParentType)

// Process text using 8192 partitions (data is big)

spark.sql("SET spark.sql.shuffle.partitions = 8192")

val mediasPerPage = spark.sql("""

WITH titles AS (
  SELECT
    wiki_db,
    page_id,
    page_namespace,
    explode(imagesFromText(revision_text)) AS page_title
  FROM wmf.mediawiki_wikitext_current
  WHERE snapshot = '2019-12'
),

medias AS (
  SELECT
    wiki_db,
    page_id,
    page_namespace,
    lower(replace(COALESCE(page_title, ''), ' ', '_')) AS media
  FROM titles
)

SELECT
  wiki_db,
  page_id,
  page_namespace,
  classify_media(media) AS media_classification,
  media
FROM medias

""")

// Write results for future reuse
mediasPerPage.repartition(16).write.mode("overwrite").parquet("/tmp/medias_per_page/snapshot=2019-12")
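The written output can then be read back in a later session, for instance to count media references per wiki and classification. This is just a sketch, reusing the example output path from above:

// Read the previously written parquet output back and aggregate it.
import org.apache.spark.sql.functions.desc

val savedMedias = spark.read.parquet("/tmp/medias_per_page/snapshot=2019-12")

savedMedias
  .groupBy("wiki_db", "media_classification")
  .count()
  .orderBy(desc("count"))
  .show(50, false)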

See also