Analytics/Data Lake/Content/XMLDumps/Mediawiki wikitext current

From Wikitech

This page describes the dataset, stored on HDFS and exposed through Hive, that contains the wikitext of the last revision of every page of WMF's wikis, as provided through the monthly XML Dumps. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline/Spark external table wmf.mediawiki_wikitext_current. A new monthly snapshot is produced around the 13th of each month (the last XML dump is made available on the 11th); to check whether it is ready to be queried, look at the status of the mediawiki-wikitext-current-coord Oozie job. Also visit Analytics/Data access if you don't know how to access this dataset.

Underlying data is stored in Avro file format on HDFS at paths following the pattern: hdfs:///wmf/data/wmf/mediawiki/wikitext/current/snapshot=YYYY-MM/wiki_db=WIKI_DB.
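As an illustration of the pattern, a snapshot partition path can be built like this (a minimal sketch; the helper name is hypothetical and not part of any WMF tooling):

```python
# Illustrative helper: build the HDFS directory for a given snapshot
# and wiki, following the pattern described above.
def wikitext_current_path(snapshot: str, wiki_db: str) -> str:
    base = "hdfs:///wmf/data/wmf/mediawiki/wikitext/current"
    return f"{base}/snapshot={snapshot}/wiki_db={wiki_db}"

print(wikitext_current_path("2019-12", "enwiki"))
# hdfs:///wmf/data/wmf/mediawiki/wikitext/current/snapshot=2019-12/wiki_db=enwiki
```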


Schema

You can get the canonical version of the schema by running describe wmf.mediawiki_wikitext_current from the Hive/Beeline/Spark command line.

Note that the snapshot and wiki_db fields are Hive partitions. They explicitly map to snapshot folders in HDFS. Since the data is relevant only at snapshot level, you should always pick a single snapshot in the where clause of your query.
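As an illustration of always pinning a single snapshot (a hypothetical helper, not part of any WMF library), a query string can be parameterized by snapshot while rejecting anything that is not a single YYYY-MM partition value:

```python
import re

# Hypothetical helper: interpolate a snapshot into a query, refusing
# anything that is not a single YYYY-MM partition value.
def current_snapshot_query(snapshot: str) -> str:
    if not re.fullmatch(r"\d{4}-\d{2}", snapshot):
        raise ValueError(f"snapshot must be YYYY-MM, got {snapshot!r}")
    return (
        "SELECT wiki_db, COUNT(*) AS pages "
        "FROM wmf.mediawiki_wikitext_current "
        f"WHERE snapshot = '{snapshot}' "
        "GROUP BY wiki_db"
    )

print(current_snapshot_query("2019-12"))
```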

col_name data_type comment
page_id bigint id of the page
page_namespace int namespace of the page
page_title string title of the page
page_redirect_title string title of the redirected-to page
page_restrictions array<string> restrictions of the page
user_id bigint id of the user that made the revision (or null/0 if anonymous)
user_text string text of the user that made the revision (either username or IP)
revision_id bigint id of the revision
revision_parent_id bigint id of the parent revision
revision_timestamp string timestamp of the revision (ISO8601 format)
revision_minor_edit boolean whether this revision is a minor edit or not
revision_comment string Comment made with revision
revision_text_bytes bigint number of bytes of the revision text
revision_text_sha1 string sha1 hash of the revision text
revision_text string text of the revision
revision_content_model string content model of the revision
revision_content_format string content format of the revision
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular imports)
wiki_db string The wiki database (project) the page belongs to

Changes and known problems

Date | Phab Task | Snapshot version | Details
2019-11-01 | T238858 | 2019-11 | Creation of the table. Data starts to flow regularly (every month) (this actually happened in January 2020, but data was made available from 2019-11).


XMLDumps Raw Data

The mediawiki_wikitext_current dataset is computed from the pages_meta_current XML dumps. Those are imported every month onto HDFS and stored in folders following this pattern: hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_current/YYYYMMDD/WIKI_DB

Note: There is a one-month difference between the snapshot value of the Avro-converted data and the raw data. This is because, by convention, in Hive we use the date of the data that is available (for instance 2019-11 means that November 2019 data is present), while the dump folder date is the generation date (20191201 means data generation started on 2019-12-01, and therefore contains 2019-11 data but not 2019-12 data).
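The correspondence can be sketched as follows (illustrative only; the function name is made up): the Hive snapshot is simply the month before the dump generation date.

```python
from datetime import date, timedelta

# Illustrative helper: derive the Hive snapshot (YYYY-MM) from a raw
# dump generation date (YYYYMMDD) by stepping back one month.
def snapshot_for_dump(dump_date: str) -> str:
    generated = date(int(dump_date[:4]), int(dump_date[4:6]), int(dump_date[6:8]))
    # Last day of the previous month carries the snapshot's year and month.
    last_of_previous_month = generated.replace(day=1) - timedelta(days=1)
    return last_of_previous_month.strftime("%Y-%m")

print(snapshot_for_dump("20191201"))  # 2019-11
```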

Example usage

Here is an example of using the current dump to extract references to currently-linked media in Spark.

Note: This code doesn't count media made available through templates.

Launch a spark-shell:

spark2-shell \
  --master yarn \
  --executor-memory 8G \
  --executor-cores 4 \
  --driver-memory 8G \
  --conf spark.dynamicAllocation.maxExecutors=64 \
  --conf spark.executor.memoryOverhead=2048 \
  --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar,/srv/deployment/analytics/refinery/artifacts/refinery-hive.jar

Do the computation:

// Media matching and classification code

import scala.util.matching.Regex
import org.wikimedia.analytics.refinery.core.media.{MediaTypeClassification,MediaTypeClassifier}

def firstGroups(str: String, pattern: Regex): Array[String] = {
    pattern.findAllMatchIn(str).map(_.group(1)).toArray
}

val extensionsList = MediaTypeClassification.values.map(_.toString.toLowerCase).filter(_ != "other")
val extensionsPattern = extensionsList.mkString("|")

// Note: Being careful in devising the regex is important as A LOT of text will be processed
val imagePattern = (s"(?i)([\\w][\\w \\-_]{1,500}\\.(?:$extensionsPattern))").r

val imagesFromText = (txt: String) => firstGroups(txt, imagePattern)

spark.udf.register("imagesFromText", imagesFromText)
spark.udf.register("classify_media", (media: String) => new MediaTypeClassifier().classify(media).getParentType)

// Process text using 8192 partitions (data is big)

spark.sql("SET spark.sql.shuffle.partitions = 8192")

val mediasPerPage = spark.sql("""

WITH titles AS (
  SELECT
    wiki_db,
    page_id,
    page_namespace,
    explode(imagesFromText(revision_text)) AS page_title
  FROM wmf.mediawiki_wikitext_current
  WHERE snapshot = '2019-12'
),

medias AS (
  SELECT
    wiki_db,
    page_id,
    page_namespace,
    lower(replace(COALESCE(page_title, ''), ' ', '_')) AS media
  FROM titles
)

SELECT
  wiki_db,
  page_id,
  page_namespace,
  classify_media(media) AS media_classification,
  media
FROM medias

""")

// Write results for future reuse
mediasPerPage.repartition(16).write.mode("overwrite").parquet("/tmp/medias_per_page/snapshot=2019-12")
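Since a lot of text goes through the image-matching regex, it is worth sanity-checking it on sample wikitext before running the full job. The following is an illustrative Python re-implementation of the pattern (with a hard-coded extension list standing in for MediaTypeClassification; all names here are made up), not the code that runs on the cluster:

```python
import re

# Stand-in extension list; the Scala job derives this from
# MediaTypeClassification instead of hard-coding it.
extensions = ["jpg", "png", "svg", "ogg", "pdf"]
image_pattern = re.compile(
    r"(?i)([\w][\w \-_]{1,500}\.(?:%s))" % "|".join(extensions)
)

def images_from_text(txt):
    # First capture group of every match, mirroring firstGroups above.
    return [m.group(1) for m in image_pattern.finditer(txt)]

sample = "[[File:Example image.jpg|thumb]] and [[File:Chart-1.svg]]"
print(images_from_text(sample))
# ['Example image.jpg', 'Chart-1.svg']
```

Note how the colon in the File: prefix is excluded by the character class, so only the file name itself is captured.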

See also