MediaModeration

From Wikitech

This section describes the configuration and operation of maintenance scripts associated with mw:Extension:MediaModeration and the MediaModeration 2.0 milestone.

Processing images manually, January 2024

As of January 2024, we are running extensions/MediaModeration/maintenance/scanFilesInScanTable.php manually on Wikimedia Commons. The invocation on mwmaint2002 is:

Manual processing

Once we have finished processing the Wikimedia Commons backlog, the project moves to a new phase, in which we update the operations/puppet repository so that images are processed daily.

Overview

  • Add items to the scan table on upload
  • Obtain a thumbnail for each file and send the file contents to PhotoDNA
  • Distribute scanning work by image (SHA-1) using the job queue
  • Use sleep to manage rate limits and target 10M requests per month
  • Always update mms_last_checked. Update mms_is_match only if PhotoDNA gives us a response
  • Database
    • Uses an external store
    • Has three columns:
      • mms_sha1 - can match a SHA-1 in the filearchive, image, or oldimage tables
      • mms_last_checked - not a MediaWiki timestamp; uses a shorter, date-only format (e.g. 20240130) that tracks the day but not the time
      • mms_is_match - 1 if the SHA-1 was a match, 0 if the SHA-1 was not a match, NULL if no successful scan has occurred yet
  • For each SHA-1 value to be scanned, do these steps:
    • Iterate over all rows in filearchive, image, and oldimage tables that have the given SHA-1:
      • Check if the image for this row can be scanned by PhotoDNA, otherwise continue to the next row.
      • Attempt to get a suitable thumbnail for the image and, if successful, attempt to get the contents of the thumbnail.
      • If the thumbnail or its contents cannot be obtained, try to get the contents of the original image. If the image contents are not suitable, continue to the next row.
      • Send the contents to PhotoDNA. If the request fails, continue to the next row. If it succeeds, end the loop early.
    • Save the new match status returned by PhotoDNA (NULL is the match status if no row was successfully used to scan the SHA-1).
    • If the new match status is positive, send an email indicating a match.
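
The per-SHA-1 steps above can be sketched as follows. This is a minimal illustration in Python, not the extension's actual code (which is PHP): FileRow, scan_sha1, last_checked_value, and the send_to_photodna callable are all hypothetical names, and the real thumbnail and suitability checks are reduced to precomputed fields.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, Optional


@dataclass
class FileRow:
    """Hypothetical stand-in for a row from filearchive, image, or oldimage."""
    name: str
    scannable: bool             # can PhotoDNA handle this file at all?
    thumbnail: Optional[bytes]  # None if no thumbnail contents could be obtained
    original: Optional[bytes]   # None if the original contents are not suitable


def scan_sha1(
    rows: Iterable[FileRow],
    send_to_photodna: Callable[[bytes], Optional[bool]],
) -> Optional[bool]:
    """Return the match status for one SHA-1, or None if no row could be scanned.

    send_to_photodna is a hypothetical callable that returns True/False for
    a completed scan and None when the request fails.
    """
    for row in rows:
        if not row.scannable:
            continue
        # Prefer the thumbnail; fall back to the original file contents.
        contents = row.thumbnail if row.thumbnail is not None else row.original
        if contents is None:
            continue
        result = send_to_photodna(contents)
        if result is None:
            continue  # request failed, try the next row
        return result  # first successful scan ends the loop early
    return None  # stored as NULL in mms_is_match


def last_checked_value(day: date) -> str:
    """Format a date in the day-only style used by mms_last_checked."""
    return day.strftime("%Y%m%d")
```

In this sketch the caller would always store last_checked_value(date.today()) for the SHA-1, store the return value of scan_sha1 as the match status, and send the notification email only when that value is True.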

Metrics

Once a day, we emit the following metrics (MediaModerationMetricsFactory):

  • the total table count of the mediamoderation_scan table for a given wiki
  • the number of scanned images (mms_is_match IS NOT NULL) in the mediamoderation_scan table
  • the number of unscanned images (mms_is_match IS NULL) in the mediamoderation_scan table
  • the number of unscanned images (mms_is_match IS NULL) in the mediamoderation_scan table for which a scan has previously been attempted (mms_last_checked IS NOT NULL)
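
The four counts above map onto simple aggregate queries. The sketch below uses Python's sqlite3 module purely for illustration; the extension itself runs inside MediaWiki, and the function name and dictionary keys here are hypothetical, not the real metric names.

```python
import sqlite3


def scan_table_metrics(conn: sqlite3.Connection) -> dict:
    """Compute the four daily counts from a mediamoderation_scan table."""
    row = conn.execute(
        """
        SELECT
            COUNT(*),
            COUNT(mms_is_match),
            SUM(mms_is_match IS NULL),
            SUM(mms_is_match IS NULL AND mms_last_checked IS NOT NULL)
        FROM mediamoderation_scan
        """
    ).fetchone()
    # SUM() yields NULL on an empty table, so normalise to 0.
    total, scanned, unscanned, retried = (value or 0 for value in row)
    return {
        "total": total,
        "scanned": scanned,
        "unscanned": unscanned,
        "unscanned_previously_attempted": retried,
    }
```

COUNT(mms_is_match) counts only non-NULL values, which is why it gives the number of scanned images without an explicit WHERE clause.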

The updateMetrics.php script emits these metrics for all wikis via the mediamoderation.pp puppet module (patch).

The metrics are visible on the MediaModeration PhotoDNA dashboard.

PhotoDNA

  • Credentials are available in the Trust and Safety Product team's 1Password
  • Rate limits as of January 2024:
    • 200 requests per second
    • 10 million requests per month
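
As a rough worked example of how these two limits translate into a sleep interval: 10 million requests spread evenly over a 31-day month is about 3.7 requests per second, far below the 200 per second cap, so the monthly quota is the binding constraint. The sketch below assumes a single sequential worker and ignores request latency; the function name and parameters are hypothetical and do not reflect how the maintenance script is actually configured.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def min_sleep_seconds(monthly_limit: int, days_in_month: int, per_second_limit: int) -> float:
    """Smallest gap between requests that respects both rate limits."""
    # Spacing needed to spread the monthly quota evenly over the month.
    monthly_interval = days_in_month * SECONDS_PER_DAY / monthly_limit
    # Spacing needed to stay under the per-second cap.
    per_second_interval = 1 / per_second_limit
    # The stricter (larger) interval wins.
    return max(monthly_interval, per_second_interval)


if __name__ == "__main__":
    # Limits as of January 2024: 10M per month, 200 per second.
    print(min_sleep_seconds(10_000_000, 31, 200))
```

With the January 2024 limits this comes to roughly 0.27 seconds between requests; running several workers in parallel would require a proportionally longer sleep in each.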