Check legal html

From Wikitech

The icinga::monitor::legal Puppet module and the Python 3 script check_legal_html check, in production, that certain HTML excerpts are present on our wikis: a copyright and license notice, the terms of use, the WMF privacy policy and (for Wikipedia only) the Wikipedia registered trademark. As of this writing, three checks are implemented: one for desktop English Wikipedia, one for mobile English Wikipedia, and one for desktop Wikibooks.

Background: why this alert was set up and why it is important

Task T108081 made it clear that we needed to continuously check that wiki projects show certain legal texts correctly in their footers. This guards against a software bug or human error causing us to stop showing the right copyright notice and other required legal terms. At first this was conceived as a client-side check done during development; however, the check had to be run against the deployed production site.

Site Reliability Engineering (back then, Technical Operations) took over the solution, and in 2015 Chase implemented a quick script to monitor the HTML present in production. However, because the check relied on certain literal text being present, the alert went off somewhat frequently, as admins modified the text to correct spelling errors or links were moved elsewhere. In 2023, Jaime refactored the script to replace the regular expression checks with HTML parsing through the BeautifulSoup library, and to detect the right text through keywords rather than literal text. The check now also verifies that the linked content matches the expected content, so punctuation and links can change without triggering the alert.
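The difference between literal and keyword matching can be illustrated with a minimal sketch. It is not the production script: it uses the standard library html.parser instead of BeautifulSoup so that it runs without extra dependencies, and the keyword list and footer snippets are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def normalized_text(html):
    """Return the fragment's text, lowercased and with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).lower().split())


def has_keywords(html, keywords):
    """True if every keyword occurs somewhere in the fragment's text."""
    text = normalized_text(html)
    return all(keyword in text for keyword in keywords)


# Hypothetical keyword list and footer snippets (assumptions, not the real ones).
COPYRIGHT_KEYWORDS = ["creative commons", "attribution", "sharealike"]

original = ('<li>Text is available under the <a href="#">Creative Commons '
            'Attribution-ShareAlike License 3.0</a>.</li>')
reworded = ('<li>All text is released under the <a href="#">Creative Commons '
            'Attribution-ShareAlike License 3.0</a>; additional terms may apply.</li>')
broken = "<li>All rights reserved.</li>"

for snippet in (original, reworded, broken):
    print(has_keywords(snippet, COPYRIGHT_KEYWORDS))
```

A literal comparison against the first snippet would reject the reworded one, while the keyword test accepts both and still rejects a footer in which the license notice is gone.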

What you should do if you receive this alert

As part of the WMF legal team

Legal is responsible for checking that the website is compliant with the required legal text. Please check that the website referred to in the alert contains a footer with the required legal text:

  • A copyright and license notice
  • The site terms of use
  • A link to the WMF privacy policy
  • Only for Wikipedia, a statement about Wikipedia being a registered trademark

Checking recent changes on the copyright footer could reveal why the alert went off. Please work with the community or contact the MediaWiki development team if some piece of information that should be there is missing due to either an incorrect edit, or a software bug.

If nothing obvious is wrong, including the links, please contact the SRE team / Observability subteam by email or another method, including the details of the alert received, so they can check whether the check automation is at fault or provide additional details on why it started alerting.

As part of the technical support of the site / Site Reliability Engineering

While Reading may be responsible for the implementation of the notice and Legal for its content, please help facilitate their work, as they may not have the technical knowledge or access needed. Chase or Jaime may also not be around to help you debug why the alert went off, so here are some tips:

  • If you notice the alert and have checked that it is not a false positive, please contact Legal if they are not already aware of it
  • You can run the check from an icinga / alert host (alert1001.wikimedia.org as of this writing), like this:
/usr/lib/nagios/plugins/check_legal_html.py -ensure=mobile -site https://en.m.wikipedia.org
  • You can use -v (verbose mode) to understand what was done and what failed. For example:
/usr/lib/nagios/plugins/check_legal_html.py -ensure=desktop_enwp -site https://people.wikimedia.org/~jynus/bad_license_link.html -v
2023-01-12 13:25:47,871 INFO: Checking site: https://people.wikimedia.org/~jynus/bad_license_link.html
2023-01-12 13:25:47,871 INFO: Downloading website: https://people.wikimedia.org/~jynus/bad_license_link.html
2023-01-12 13:25:47,908 INFO: Executing check of copyright...
2023-01-12 13:25:47,908 INFO: Checking text: "creative commons attribution-sharealike license 3.0"
2023-01-12 13:25:47,908 INFO: Expected word creativecommons.org is missing!
2023-01-12 13:25:47,908 INFO: Downloading website: https://en.wikipedia.org/wiki/End-user_license_agreement
2023-01-12 13:25:47,967 INFO: Expected word remix is missing!
2023-01-12 13:25:47,967 ERROR: copyright html not found for https://people.wikimedia.org/~jynus/bad_license_link.html (desktop site).

Here we can see that the check of the copyright link failed: the link did not point to the Creative Commons website, and the linked content did not contain the word "remix", which is expected on a CC-BY-SA-3.0 license.

  • You can check the source code (available on the Puppet repo) to understand which keywords are expected for each check
  • You can test the script locally against test cases; Jaime set up some
  • Get Legal's approval before making substantial changes to the code's behavior
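The failure shown in the verbose output above can be reproduced with a small sketch of the two-step link check: first the link target must contain an expected word, then the linked page must contain an expected keyword. This is an illustration under assumptions, not the code from the Puppet repo: LinkCollector, check_link and the fetch parameter are hypothetical names, the standard library html.parser stands in for BeautifulSoup, and the page contents in PAGES are stand-ins so that the example runs without network access.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, lowercased link text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip().lower()))
            self._href = None


def check_link(html, link_text, href_word, content_word, fetch):
    """Return the list of problems found for the link labelled link_text.

    fetch is any callable that maps a URL to the text of the linked page.
    """
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    matches = [href for href, text in collector.links if link_text in text]
    if not matches:
        return ["Link text {!r} not found".format(link_text)]
    href = matches[0]
    problems = []
    if href_word not in href:
        problems.append("Expected word {} is missing!".format(href_word))
    if content_word not in fetch(href).lower():
        problems.append("Expected word {} is missing!".format(content_word))
    return problems


# Stand-in page contents (assumptions) keyed by URL, used instead of HTTP requests.
PAGES = {
    "https://creativecommons.org/licenses/by-sa/3.0/": "You are free to share and remix.",
    "https://en.wikipedia.org/wiki/End-user_license_agreement": "An end-user license agreement.",
}
LINK_TEXT = "creative commons attribution-sharealike license 3.0"
good = ('<footer><a href="https://creativecommons.org/licenses/by-sa/3.0/">'
        'Creative Commons Attribution-ShareAlike License 3.0</a></footer>')
bad = ('<footer><a href="https://en.wikipedia.org/wiki/End-user_license_agreement">'
       'Creative Commons Attribution-ShareAlike License 3.0</a></footer>')

print(check_link(good, LINK_TEXT, "creativecommons.org", "remix", PAGES.get))
print(check_link(bad, LINK_TEXT, "creativecommons.org", "remix", PAGES.get))
```

Passing the fetch function as a parameter keeps the sketch testable offline; a real check would download the linked page instead.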

As part of the wiki admin team/editing community

Unlike previously, the community should be free to correct punctuation errors or move internal links, as the check takes different wordings into account, to some extent. The community doesn't need to be notified of check issues unless a real problem has happened (e.g. an important legal text has been mistakenly erased in an edit). If Legal notifies you that something is not in compliance with their legal advice, please work with them to resolve the issues raised.

Common problems with the check

While we tried to future-proof the check, not all potential issues could be foreseen, and some will require human intervention:

  • Network or availability outages: An extended wiki outage will make the check alert. This is by design, so that outdated links are caught: a broken link and an outage both return a 4XX or 5XX error.
  • License changes: although unlikely, a license change or upgrade may require adjusting the check automation
  • Changing legal requirements: If sections are added or removed, the alert will need to be updated
  • Missing software requirements and environment changes: Puppet ensures that icinga and certain Python libraries are available, but the script could become outdated or lack required software in the future
  • HTML changes: the check assumes that the wiki software produces a single HTML <footer> section; the MediaWiki HTML structure could change in the future
  • URL changes: the URL parameter of the check may need updating if a project's main page changes and the configured URL stops redirecting to it, or if the Creative Commons license pages are moved
  • Language limitation: at the moment, only English is supported; localizing the script or setting up a translated list of keywords would be needed to support other languages
  • Differences between mobile and desktop: While the check is currently the same for both mobile and desktop, they don't have the same wording and structure. They could diverge even further in the future, as the mobile and desktop interfaces are controlled by different stakeholders.
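When debugging a suspected HTML change, one quick test is whether the single-footer assumption mentioned in the list above still holds for the page. The following is a minimal sketch using only the standard library; FooterCounter and footer_count are hypothetical helper names and are not part of check_legal_html.

```python
from html.parser import HTMLParser


class FooterCounter(HTMLParser):
    """Count the <footer> opening tags in a document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "footer":
            self.count += 1


def footer_count(html):
    """Return how many <footer> elements the HTML contains."""
    counter = FooterCounter()
    counter.feed(html)
    counter.close()
    return counter.count


def single_footer(html):
    """True only if the page has exactly one <footer>, as the check assumes."""
    return footer_count(html) == 1


# Hypothetical pages, for illustration only.
expected_layout = "<html><body><main>Article</main><footer>Legal text</footer></body></html>"
changed_layout = "<html><body><footer>One</footer><footer>Two</footer></body></html>"
no_footer = "<html><body><div id='footer'>Legal text</div></body></html>"

for page in (expected_layout, changed_layout, no_footer):
    print(footer_count(page), single_footer(page))
```

A count other than one suggests that the page structure changed, rather than the legal text itself.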

Script reference

$ /usr/lib/nagios/plugins/check_legal_html.py --help
usage: check_legal_html.py [-h] [-site SITE]
                           [-ensure {desktop_enwb,desktop_enwp,mobile}] [-v]

Validate certain legal HTML exists on the footer of a given webpage, as per
legal requirement. See https://phabricator.wikimedia.org/T317169 &
https://phabricator.wikimedia.org/T108081 for more context.

optional arguments:
  -h, --help            show this help message and exit
  -site SITE            The url of the website to check(e.g.
                        'https://en.wikipedia.org/wiki/Main_page'). By default
                        it checks 'localhost'.
  -ensure {desktop_enwb,desktop_enwp,mobile}
                        Selection of checks to perform (mobile site and non-
                        Wikipedias don't have a Wikipedia trademark check).
  -v, --verbose         Enable verbose description of checks, useful to debug
                        the icinga errors.
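For repeated manual runs, the command line can be assembled from the options documented in the help output above. This is only a convenience sketch: build_command is a hypothetical helper, and the script path is the one used in the examples on this page, which may differ on other hosts.

```python
# Path taken from the examples on this page; adjust it if the plugin lives elsewhere.
SCRIPT = "/usr/lib/nagios/plugins/check_legal_html.py"

# Values accepted by -ensure, as listed in the help output.
ENSURE_CHOICES = ("desktop_enwb", "desktop_enwp", "mobile")


def build_command(site, ensure, verbose=False):
    """Return the argument list for one run of the check."""
    if ensure not in ENSURE_CHOICES:
        raise ValueError("unknown -ensure value: {!r}".format(ensure))
    command = [SCRIPT, "-ensure=" + ensure, "-site", site]
    if verbose:
        command.append("-v")
    return command


# Print the command instead of running it, so the sketch works on any machine.
print(" ".join(build_command("https://en.m.wikipedia.org", "mobile")))
print(" ".join(build_command("https://en.wikipedia.org/wiki/Main_page", "desktop_enwp", verbose=True)))
```

On an alert host, the returned list could be passed to subprocess.run to execute the check.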

See also