Swift/Thumbnails Cleanup

From Wikitech

Media thumbnails are generated on demand at arbitrary widths and stored in Swift, but MediaWiki does not know the full list of widths requested by users. Furthermore, thumbnails are never cleaned up from Swift. As part of the Techops FY2016-2017 goals, we therefore investigated how to clean up thumbnails stored in Swift periodically and effectively.

To better understand which widths are stored in Swift versus which widths users request, we walked all stored thumbnails and imported the listing into Hive. For end-user requests, a subset of webrequest data in Hive was used to drive decisions. Finally, the extracted data has been summarized in this Google spreadsheet. See also Phabricator task T162796 for additional context.

Analysis of webrequest data showed that some prerendered widths (2560 and 2880 pixels wide) were rarely requested but stored nevertheless. Prerendering for those widths was therefore stopped, and all thumbnails larger than 2000px were cleaned up.

Collect webrequest data

A subset of webrequest data is used to gather insight into how thumbnails are requested:

create table filippo.webrequest_upload_thumb_width_201704 stored as parquet as
select
    regexp_extract(webrequest.uri_path, '.*/.*?(\\d+)px-.*', 1) as width,
    response_size,
    cache_status,
    uri_path
from wmf.webrequest
where uri_path like '%/thumb/%px-%'
    and webrequest_source = 'upload'
    and year = 2017
    and month = 4;

-- export back to plaintext on HDFS
insert overwrite directory '/user/filippo/webrequest_upload_201704'
row format delimited fields terminated by ' '
select width, count(width) as count, avg(response_size) as avg_size
from filippo.webrequest_upload_thumb_width_201704
group by width;
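To make the width extraction and the per-width summary above easier to follow, here is a minimal Python sketch of the same logic. It reuses the regular expression from the first query (Hive's '\\d' string literal is the regex \d) and groups by width as the second query does. The sample URI paths and sizes are made up for illustration, and the sketch converts widths to integers, whereas the Hive table keeps them as strings.

```python
import re
from collections import defaultdict

# Same pattern as in the Hive query above.
WIDTH_RE = re.compile(r".*/.*?(\d+)px-.*")


def thumb_width(uri_path):
    """Return the thumbnail width embedded in uri_path, or None if absent."""
    m = WIDTH_RE.search(uri_path)
    return int(m.group(1)) if m else None


def summarize(requests):
    """Map width -> (request count, average response size).

    `requests` is an iterable of (uri_path, response_size) pairs.
    """
    sizes = defaultdict(list)
    for uri_path, response_size in requests:
        width = thumb_width(uri_path)
        if width is not None:
            sizes[width].append(response_size)
    return {w: (len(s), sum(s) / len(s)) for w, s in sizes.items()}


# Hypothetical sample data, not taken from real webrequest logs.
sample = [
    ("/wikipedia/commons/thumb/a/ab/Example.jpg/220px-Example.jpg", 1000),
    ("/wikipedia/commons/thumb/a/ab/Example.jpg/220px-Example.jpg", 3000),
    ("/wikipedia/commons/thumb/c/cd/Doc.pdf/page1-2560px-Doc.pdf.jpg", 500),
]
print(summarize(sample))
```

Because the leading `.*/` is greedy, the pattern anchors on the last path segment, so a prefix such as `page1-` before the width does not prevent the match.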

Collect swift data

Use thumbstats/swift-thumb-stats --hive_export to get a list of the thumbnails present in Swift, in a form ready to be imported into Hive for further processing and analysis.

Once the criteria for deletion are determined, the list of matching thumbnails in Swift can be exported from Hive and fed to thumbstats/swift-thumb-cleanup for the actual deletion. For example, for widths greater than 2000px:

insert overwrite directory '/user/filippo/thumbs_to_delete_notcommons'
row format delimited fields terminated by ' '
select container, path from thumbstats
where container not like '%commons%' and width > 2000;

insert overwrite directory '/user/filippo/thumbs_to_delete_commons'
row format delimited fields terminated by ' '
select container, path from thumbstats
where container like '%commons%' and width > 2000;
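The selection performed by the two export queries above can be sketched in Python as follows. This is only an illustration of the filter (width greater than 2000, split on whether the container name contains "commons"). The function name, the in-memory row format, and the sample container names and paths are assumptions, and the sketch does not reproduce the input format expected by thumbstats/swift-thumb-cleanup.

```python
def split_for_deletion(rows, min_width=2000):
    """Partition thumbnails wider than min_width into (commons, not_commons).

    `rows` is an iterable of (container, path, width) tuples, mirroring the
    columns the queries read from the thumbstats table.
    """
    commons, not_commons = [], []
    for container, path, width in rows:
        if width <= min_width:
            continue  # keep thumbnails at or below the threshold
        # Equivalent of `container like '%commons%'` (case-sensitive).
        target = commons if "commons" in container else not_commons
        target.append((container, path))
    return commons, not_commons


# Hypothetical rows, for illustration only.
rows = [
    ("wikipedia-commons-local-thumb.a1", "a/a1/Foo.jpg/2560px-Foo.jpg", 2560),
    ("wikipedia-commons-local-thumb.a1", "a/a1/Foo.jpg/220px-Foo.jpg", 220),
    ("wikipedia-en-local-thumb.b2", "b/b2/Bar.jpg/2880px-Bar.jpg", 2880),
    ("wikipedia-en-local-thumb.b2", "b/b2/Bar.jpg/2000px-Bar.jpg", 2000),
]
commons, not_commons = split_for_deletion(rows)
print(commons)
print(not_commons)
```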

Further cleanup strategies

The first cleanup, for widths greater than 2000px, is the one that yielded the most space savings in the short term. Other cleanup strategies have been identified in Phabricator, including storing only the top 100 widths and having Thumbor process the rest instead.
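As a rough illustration of the "top 100 widths" idea, the sketch below picks the most requested widths from per-width request counts and checks whether a given thumbnail width would stay in storage. Ranking by request count, the function names, and the sample counts are all assumptions made for this example. The source does not specify how the top widths would be chosen.

```python
from collections import Counter


def top_widths(request_counts, n=100):
    """Return the set of the n most requested widths.

    `request_counts` maps width -> number of requests (hypothetical input,
    e.g. derived from the per-width webrequest summary).
    """
    return {width for width, _ in Counter(request_counts).most_common(n)}


def should_store(width, keep):
    """True if a thumbnail of this width would be kept in storage."""
    return width in keep


# Made-up counts for illustration; n=2 keeps the example small.
counts = {220: 900, 320: 500, 1234: 3, 2560: 1}
keep = top_widths(counts, n=2)
print(sorted(keep))
print(should_store(1234, keep))
```

Under this scheme, a request for a width outside the kept set would be handled by rendering rather than by a stored copy.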