Data Platform/Internal API requests
This page documents how to query the MediaWiki Action API, MediaWiki REST API, and Wikimedia REST API internally from R and Python, rather than sending requests over the Internet. The code examples here were tested on stat1010.eqiad.wmnet.
Both the R and Python approaches assume that the HTTPS_PROXY, https_proxy, NO_PROXY, and no_proxy environment variables are already set. Refer to HTTP proxy for setting them manually if they get unset.
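As a quick sanity check before running the examples, the following sketch reports which of the four proxy variables are unset or empty. The helper name missing_proxy_vars is hypothetical and not part of any library; it only reads the environment and does not set anything.

```python
import os

# The four variables this page assumes are already set.
PROXY_VARS = ("HTTPS_PROXY", "https_proxy", "NO_PROXY", "no_proxy")


def missing_proxy_vars(env=None):
    """Return the names of proxy variables that are unset or empty.

    `env` defaults to the current process environment; passing a plain
    dict makes the check easy to try out without touching os.environ.
    """
    if env is None:
        env = os.environ
    return [name for name in PROXY_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_proxy_vars()
    if missing:
        print("Unset proxy variables:", ", ".join(missing))
    else:
        print("All proxy variables are set.")
```

If any names are reported, set them as described on the HTTP proxy page before continuing.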
Python
Fixing SSL certificate verification error
To avoid the error "SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate" when running the code from a virtual environment (e.g. the conda-analytics environment in JupyterHub on stat hosts), point requests at the system CA bundle:
import os
os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
Thank you to Ben Tullis for figuring this out.[1]
Using requests
With the requests library:
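import requests

# Internal read-only Action API endpoint
url = 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php'
# The Host header names the wiki to query
headers = {'Host': 'en.wikipedia.org'}
# Multiple page titles are separated with '|'
payload = {
    'action': 'query',
    'prop': 'info',
    'titles': 'R_(programming_language)|Python_(programming_language)',
    'format': 'json'
}

# Send the request and parse the JSON body into a dict
resp = requests.get(url, headers=headers, params=payload).json()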
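When the page titles come from a list rather than a hand-written string, the titles value can be assembled programmatically. The sketch below is one way to do that; build_info_payloads is a hypothetical helper, not part of requests or the Action API. It joins titles with the | separator used in the example above and splits long lists into batches, assuming the Action API's usual limit of 50 values per multi-value parameter for clients without higher limits.

```python
def build_info_payloads(titles, batch_size=50):
    """Yield one 'prop=info' query payload per batch of page titles.

    batch_size=50 is an assumption based on the usual Action API limit
    for multi-value parameters; adjust it if your client has higher limits.
    """
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        yield {
            "action": "query",
            "prop": "info",
            "titles": "|".join(batch),  # titles are pipe-separated
            "format": "json",
        }


titles = ["R_(programming_language)", "Python_(programming_language)"]
payloads = list(build_info_payloads(titles))
```

Each payload can then be passed as params to requests.get, as in the example above, with one request per batch.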
Using mwapi
With the mwapi library (which also requires the REQUESTS_CA_BUNDLE environment variable):
import mwapi

session = mwapi.Session(host='https://mw-api-int-ro.discovery.wmnet:4446')
session.headers['Host'] = 'en.wikipedia.org'

resp = session.get(
    action='query',
    prop='info',
    titles='R_(programming_language)|Python_(programming_language)'
)
Thank you to Lucas Werkmeister for figuring this out.[2]
DataFrame from API response
To convert the response into a data frame, use DataFrame.from_dict from pandas:
import pandas as pd
page_info = pd.DataFrame.from_dict(resp['query']['pages'], orient='index')
(index) | pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length |
---|---|---|---|---|---|---|---|---|---|---|
23862 | 23862 | 0 | Python (programming language) | wikitext | en | en | ltr | 2022-05-17T15:11:33Z | 1088356878 | 146500 |
376707 | 376707 | 0 | R (programming language) | wikitext | en | en | ltr | 2022-05-16T17:53:29Z | 1087609113 | 59925 |
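The leftmost column in the output above is the data frame index: with orient='index', from_dict uses the keys of resp['query']['pages'] (the page IDs, as strings) as row labels, which duplicates the pageid column. Below is a minimal sketch of dropping that index. It uses a hand-written stand-in for the response so that it runs without network access; sample_resp is illustrative only and is trimmed to a few fields copied from the table above.

```python
import pandas as pd

# Illustrative stand-in for the parsed API response (not a live result).
sample_resp = {
    "query": {
        "pages": {
            "23862": {
                "pageid": 23862,
                "ns": 0,
                "title": "Python (programming language)",
                "length": 146500,
            },
            "376707": {
                "pageid": 376707,
                "ns": 0,
                "title": "R (programming language)",
                "length": 59925,
            },
        }
    }
}

# Keys of 'pages' become the index; each inner dict becomes one row.
page_info = pd.DataFrame.from_dict(sample_resp["query"]["pages"], orient="index")

# Replace the string page-ID index with a default integer index,
# since the same information is already in the 'pageid' column.
page_info = page_info.reset_index(drop=True)
print(page_info)
```

Keeping the index instead is also reasonable if rows will be looked up by page ID; in that case, note that the labels are strings rather than integers.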
R
Using the httr2 package:
library(httr2)

req <- request("https://mw-api-int-ro.discovery.wmnet:4446/w/api.php") %>%
  req_headers("Host" = "en.wikipedia.org")

req <- req %>%
  req_url_query(
    action = "query",
    prop = "info",
    titles = "R_(programming_language)|Python_(programming_language)",
    format = "json"
  )

# Work around the error "SSL certificate problem: unable to get local issuer certificate".
# Note that this turns off verification of the server's certificate.
req <- req %>%
  req_options(ssl_verifypeer = 0)

# Perform the request and parse the JSON body:
resp <- req %>%
  req_perform() %>%
  resp_body_json()
To convert the response into a data frame, use map_dfr from purrr and as_tibble from tibble:
library(tidyverse)

page_info <- resp$query$pages %>%
  map_dfr(as_tibble)
pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length |
---|---|---|---|---|---|---|---|---|---|
<int> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <int> | <int> |
23862 | 0 | Python (programming language) | wikitext | en | en | ltr | 2022-05-17T15:11:33Z | 1088356878 | 146500 |
376707 | 0 | R (programming language) | wikitext | en | en | ltr | 2022-05-16T17:53:29Z | 1087609113 | 59925 |