MediaModeration

From Wikitech

This section describes the configuration and operation of maintenance scripts associated with mw:Extension:MediaModeration and the MediaModeration 2.0 milestone.

Processing images automatically, December 2024

As of December 2024, extensions/MediaModeration/maintenance/scanFilesInScanTable.php is run using Puppet, so the script is invoked automatically. Files on Wikimedia Commons are scanned within ~90 seconds of upload by a "continuous" scan, in which the script is restarted every hour. Files on all other wikis are scanned within an hour of upload by a separate run that starts every hour and scans newly uploaded images on every wiki except Wikimedia Commons (which is covered by the continuous scan).

The definitions for these are in mediamoderation.pp in the operations/puppet repo.

Processing images manually, late November 2024

As of November 2024, we have finished running extensions/MediaModeration/maintenance/scanFilesInScanTable.php manually on all WMF wikis. However, we found that repeating the scan over images which previously failed to scan allows some of them to be scanned successfully. We are therefore re-running the manual scanning process for images which failed to scan.

This involves running both:

Manual processing for all apart from Wikimedia Commons
Scanning close to upload, with re-attempting scans for failed images

Processing images manually, November 2024

As of November 2024, we have finished running extensions/MediaModeration/maintenance/scanFilesInScanTable.php manually on Wikimedia Commons. We are now scanning the backlog of images on all other wikis using the following code:

Manual processing for all apart from Wikimedia Commons

The above code is used instead of all.dblist because images are uploaded to Wikimedia Commons so frequently that it takes days before there are no images left to scan when the next batch is fetched.

We are also scanning images on Wikimedia Commons very soon after upload using the following. Because images are uploaded frequently, there are always images left to scan when the next batch is fetched, so the script never exits. As a result, images on Wikimedia Commons are scanned within 90 seconds of their upload.

Scanning close to upload

Once we have completed scanning on all WMF wikis, we will update operations/puppet to process images on a daily basis.

Processing images manually, January 2024

As of September 2024, we are running extensions/MediaModeration/maintenance/scanFilesInScanTable.php manually on Wikimedia Commons. The invocation on deploy2002 is:

Manual processing

Once we have completed processing the Wikimedia Commons backlog, we will shift to a new phase of the project, in which we update the operations/puppet repo to process images on a daily basis, and possibly run the Wikimedia Commons scan continuously.

Alerts

Per task T366165, an alert fires when the rate of OK requests drops below 3 per second. So far, this has happened when the script crashed and needed to be restarted, rather than because of a general slowdown in processing throughput. Alerts are sent to the #tsp-engineering channel on Slack. Incoming alerts should be silenced on https://alerts.wikimedia.org. The alert is attached to this panel.

Overview

  • Add items to scan table on upload
  • Obtain a thumbnail for each file and send the file contents to PhotoDNA
  • Distribute scanning work by image (SHA-1) using the job queue
  • Use sleep to manage rate limits and target 10M requests per month
  • Always update mms_last_checked; update mms_is_match only if PhotoDNA gives us a response
  • Database
    • Uses an external store
    • Has three columns:
      • mms_sha1 - can match a SHA-1 in the filearchive, image, or oldimage tables
      • mms_last_checked - not a MW timestamp; uses a shorter format, e.g. 20240130, to track the day but not the time
      • mms_is_match - 1 if the SHA-1 matches, 0 if the SHA-1 was not a match, NULL if no successful scan has occurred yet.
  • For each SHA-1 value to be scanned, do these steps:
    • Iterate over all rows in filearchive, image, and oldimage tables that have the given SHA-1:
      • Check if the image for this row can be scanned by PhotoDNA, otherwise continue to the next row.
      • Attempt to get a suitable thumbnail for the image, and if successful then attempt to get the contents of the thumbnail
      • If the thumbnail or thumbnail contents cannot be generated, then try to get the image contents. If the image contents are not suitable, then continue to the next row.
      • Send the image contents to PhotoDNA. If the request fails, then continue to the next row. If this is successful, then end the loop early.
    • Save the new match status returned by PhotoDNA (NULL is the match status if no row was successfully used to scan the SHA-1).
    • If the new match status is positive, send an email indicating a match.

Metrics

Once a day, we emit the following metrics (MediaModerationMetricsFactory):

  • the total row count of the mediamoderation_scan table for a given wiki
  • the number of scanned images (mms_is_match IS NOT NULL) in the mediamoderation_scan table
  • the number of unscanned images (mms_is_match IS NULL) in the mediamoderation_scan table
  • the number of unscanned images (mms_is_match IS NULL) for a wiki which have previously had a scan attempted (mms_last_checked IS NOT NULL)

The updateMetrics.php script emits these metrics for all wikis via the mediamoderation.pp puppet module (patch).

The metrics are visible on the MediaModeration PhotoDNA dashboard.
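
To make the four counts listed above concrete, here is a rough sketch of how they could be computed in a single query. It uses Python with an in-memory SQLite table as a stand-in for the real database; the function name compute_metrics and the metric keys are hypothetical, and this is not the code in MediaModerationMetricsFactory.

```python
import sqlite3


def compute_metrics(conn: sqlite3.Connection) -> dict:
    """Return the four daily counts for one wiki's mediamoderation_scan table."""
    total, scanned, unscanned, unscanned_attempted = conn.execute(
        """
        SELECT COUNT(*),
               COUNT(mms_is_match),  -- COUNT(col) skips NULLs
               COALESCE(SUM(mms_is_match IS NULL), 0),
               COALESCE(SUM(mms_is_match IS NULL
                            AND mms_last_checked IS NOT NULL), 0)
        FROM mediamoderation_scan
        """
    ).fetchone()
    return {
        "total": total,
        "scanned": scanned,
        "unscanned": unscanned,
        "unscanned_with_last_checked": unscanned_attempted,
    }


# Example data: two scanned rows, one failed attempt, one never attempted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mediamoderation_scan ("
    "mms_sha1 TEXT, mms_last_checked TEXT, mms_is_match INTEGER)"
)
conn.executemany(
    "INSERT INTO mediamoderation_scan VALUES (?, ?, ?)",
    [
        ("a", "20240130", 0),
        ("b", "20240130", 1),
        ("c", "20240130", None),
        ("d", None, None),
    ],
)
metrics = compute_metrics(conn)
```

With the example rows, the total is 4, with 2 scanned and 2 unscanned, 1 of which has a previous scan attempt.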

PhotoDNA

  • Credentials are available in the Trust and Safety Product team's 1Password
  • Rate limits as of January 2024:
    • 200 requests per second
    • 10 million requests per month
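
As a rough worked example of what these limits imply, the sketch below converts the monthly quota into a sustained request rate and a per-request sleep interval. It assumes a 30-day month and a single worker, and ignores request latency; the function names are hypothetical and do not come from the extension.

```python
SECONDS_PER_DAY = 86_400


def sustained_rate(monthly_quota: int, days_in_month: int = 30) -> float:
    """Average requests per second that would exactly use up the monthly quota."""
    return monthly_quota / (days_in_month * SECONDS_PER_DAY)


def sleep_between_requests(monthly_quota: int, days_in_month: int = 30) -> float:
    """Seconds to wait between requests to stay within the monthly quota."""
    return (days_in_month * SECONDS_PER_DAY) / monthly_quota


rate = sustained_rate(10_000_000)           # about 3.86 requests per second
pause = sleep_between_requests(10_000_000)  # about 0.26 seconds
```

Under these assumptions, the monthly quota works out to roughly 3.86 requests per second, far below the 200 requests per second limit, so the monthly quota is the constraint that the sleep interval has to respect.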