Data Platform/Internal API requests

From Wikitech

This page documents how to query the MediaWiki Action API, the MediaWiki REST API, and the Wikimedia REST API internally from R and Python, rather than sending requests over the public Internet. The code examples here were tested on stat1010.eqiad.wmnet.

Both the R and Python approaches assume that the HTTPS_PROXY, https_proxy, NO_PROXY, and no_proxy environment variables are already set. If they have been unset, refer to HTTP proxy for how to set them manually.
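As a quick sanity check from Python, the sketch below reports which of the four variables are unset or empty in the current process. The helper name missing_proxy_vars is hypothetical and not part of any Wikimedia tooling.

```python
import os

# The four variables this page assumes are set.
PROXY_VARS = ("HTTPS_PROXY", "https_proxy", "NO_PROXY", "no_proxy")


def missing_proxy_vars(environ=None):
    """Return the names of proxy variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in PROXY_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_proxy_vars()
    if missing:
        print("Unset proxy variables:", ", ".join(missing))
    else:
        print("All proxy variables are set.")
```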

Python

Fixing SSL certificate verification error

To avoid the error

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate

when running the code from a virtual environment (e.g. the conda-analytics environment in JupyterHub on stat hosts), set REQUESTS_CA_BUNDLE before making any requests:

import os

os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'

Thank you to Ben Tullis for figuring this out.[1]
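If the same notebook may also run on a machine where that file does not exist, a guarded version can avoid pointing requests at a missing bundle. This is a minimal sketch; the helper name use_system_ca_bundle is hypothetical, and the default path is simply the one given above.

```python
import os

# Path taken from the text above.
SYSTEM_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def use_system_ca_bundle(path=SYSTEM_CA_BUNDLE):
    """Point requests at the CA bundle at `path`, if that file exists.

    Returns True when REQUESTS_CA_BUNDLE was set, False otherwise.
    """
    if not os.path.isfile(path):
        return False
    os.environ["REQUESTS_CA_BUNDLE"] = path
    return True
```

Call use_system_ca_bundle() once, before the first request is sent.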


Using requests

With the requests library:

import requests

url = 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php'

headers = {'Host': 'en.wikipedia.org'}

payload = {
    'action': 'query',
    'prop': 'info',
    'titles': 'R_(programming_language)|Python_(programming_language)',
    'format': 'json'
}

resp = requests.get(url, headers=headers, params=payload).json()
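The titles parameter takes a pipe-separated list, and the Action API caps how many values a single request may carry (50 for most clients, 500 for those with the apihighlimits right). For longer lists, a sketch like the following splits the titles into batches and builds one payload per batch. The helper name build_payloads and its batch_size default are illustrative assumptions, not part of requests or MediaWiki.

```python
def build_payloads(titles, batch_size=50):
    """Yield one Action API query payload per batch of titles."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        yield {
            "action": "query",
            "prop": "info",
            # Multiple values are joined with a pipe character.
            "titles": "|".join(batch),
            "format": "json",
        }
```

Each payload can then be passed as params to requests.get, with the same url and headers as above.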

Using mwapi

With the mwapi library (which also requires the REQUESTS_CA_BUNDLE environment variable to be set):

import mwapi

session = mwapi.Session(host='https://mw-api-int-ro.discovery.wmnet:4446')
session.headers['Host'] = 'en.wikipedia.org'

resp = session.get(
    action='query',
    prop='info',
    titles='R_(programming_language)|Python_(programming_language)'
)

Thank you to Lucas Werkmeister for figuring this out.[2]

DataFrame from API response

To convert the response into a data frame with one row per page, we can use DataFrame.from_dict from pandas:

import pandas as pd

page_info = pd.DataFrame.from_dict(resp['query']['pages'], orient='index')

        pageid  ns                          title  contentmodel  pagelanguage  pagelanguagehtmlcode  pagelanguagedir               touched   lastrevid  length
23862    23862   0  Python (programming language)      wikitext            en                    en              ltr  2022-05-17T15:11:33Z  1088356878  146500
376707  376707   0       R (programming language)      wikitext            en                    en              ltr  2022-05-16T17:53:29Z  1087609113   59925
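When a requested title does not exist, the default JSON format still returns an entry for it under a negative key (such as "-1") with a "missing" field and no pageid, so those rows would show up in the data frame with empty values. The sketch below separates such entries before conversion. The function name split_pages and the example response are hypothetical and for illustration only.

```python
def split_pages(resp):
    """Split resp['query']['pages'] into (found, missing) lists of page dicts."""
    found, missing = [], []
    for page in resp["query"]["pages"].values():
        # Nonexistent titles carry "missing"; malformed titles carry "invalid".
        if "missing" in page or "invalid" in page:
            missing.append(page)
        else:
            found.append(page)
    return found, missing


# Hypothetical, trimmed-down response for illustration only.
example = {
    "query": {
        "pages": {
            "376707": {"pageid": 376707, "ns": 0, "title": "R (programming language)"},
            "-1": {"ns": 0, "title": "Hypothetical missing page", "missing": ""},
        }
    }
}

found, missing = split_pages(example)
```

The found list can be passed to pd.DataFrame(found), which gives a default integer index rather than the page ID index produced by from_dict above.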

R

Using the httr2 package:

library(httr2)

req <- request("https://mw-api-int-ro.discovery.wmnet:4446/w/api.php") %>%
    req_headers("Host" = "en.wikipedia.org")

req <- req %>%
    req_url_query(
        action = "query",
        prop = "info",
        titles = "R_(programming_language)|Python_(programming_language)",
        format = "json"
    )

# Work around the error "SSL certificate problem: unable to get local issuer certificate"
# (note that this disables peer certificate verification):
req <- req %>%
    req_options(ssl_verifypeer = 0)

# Perform the request:
resp <- req %>%
    req_perform() %>%
    resp_body_json()

To convert the response into a nice data frame we can use map_dfr from purrr and as_tibble from tibble:

library(tidyverse)

page_info <- resp$query$pages %>%
    map_dfr(as_tibble)

A tibble: 2 × 10
pageid    ns title                         contentmodel pagelanguage pagelanguagehtmlcode pagelanguagedir touched               lastrevid length
 <int> <int> <chr>                         <chr>        <chr>        <chr>                <chr>           <chr>                     <int>  <int>
 23862     0 Python (programming language) wikitext     en           en                   ltr             2022-05-17T15:11:33Z 1088356878 146500
376707     0 R (programming language)      wikitext     en           en                   ltr             2022-05-16T17:53:29Z 1087609113  59925

References

  1. T361024#9662135
  2. https://github.com/mediawiki-utilities/python-mwapi/issues/45