Add Image

From Wikitech

This page contains information about the infrastructure used for the Add Image structured task project (T285587). For project information, see mw:Growth/Personalized first day/Structured tasks/Add an image. (Parts of the infrastructure are used for other image recommendation features, too; currently this page is written from the POV of maintaining Growth team features.)

High-level summary

Add Image is the infrastructure behind a feature which recommends images to be added to articles which don't have any, and provides a streamlined editing interface for doing so. It consists of:

  • A work-in-progress data pipeline (see mw:Structured_Data_Across_Wikimedia/Image_Suggestions/Data_Pipeline; for details see this merge request, for a quick overview see this image) which
    • creates a Hive dataset of articles (on any project wiki) with no images, and image recommendations based on images in other Wikimedia projects which are connected to the article in some way via Wikidata;
    • loads that dataset into the CirrusSearch index as recommendation.image/exists|1 weighted tags;
    • exports the dataset to Cassandra.
  • A hasrecommendation:image CirrusSearch keyword for searching for articles with recommendations
  • An internal image recommendation API (repo, user docs, ops docs) that provides the information in the Cassandra dataset for the queried page IDs.
    • The prop=growthimagesuggestiondata API exposes the image recommendation data (acting as a proxy to the internal API). This is used by beta cluster wikis and the Android app as a temporary workaround for the lack of a suitable public API (which is tracked at T306349).
    • In the early phases, a proof-of-concept API implementation (sandbox, repo, project page) was used.
  • Integration with the structured task functionality of the GrowthExperiments extension: a browsing interface on Special:Homepage and a VisualEditor-based custom editing interface.
  • After a task is resolved, MediaWiki updates the CirrusSearch index. This is exposed via the action=growthinvalidateimagerecommendation API.
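
As a rough illustration of the hasrecommendation:image keyword, the sketch below runs it through the standard action API search module (action=query&list=search). The helper names and the hostname in the usage line are placeholders, and combining the keyword with extra search terms is only an assumption about how one might explore the index by hand.

```shell
# Sketch only: function names are illustrative, not part of the feature.

# Print a CirrusSearch query matching articles with an image recommendation.
# $1 (optional): extra search terms appended to the keyword.
recommendation_query() {
    printf 'hasrecommendation:image%s\n' "${1:+ $1}"
}

# Run the query against a wiki's action API (needs network access).
# $1: wiki hostname, $2 (optional): extra search terms
search_recommendations() {
    curl --silent --get "https://$1/w/api.php" \
        --data-urlencode 'action=query' \
        --data-urlencode 'list=search' \
        --data-urlencode 'format=json' \
        --data-urlencode 'srlimit=5' \
        --data-urlencode "srsearch=$(recommendation_query "$2")"
}

# Usage (not run here; the hostname is a placeholder):
#   search_recommendations cs.wikipedia.org | jq '.query.search[].title'
```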

Infobox exclusion

The GrowthExperiments extension adds a new hastemplatecollection:<collection> CirrusSearch keyword for searching for articles containing any one of a list of templates (typically a list so long that hastemplate: cannot be used). This is used for excluding articles with infoboxes: it defines the infobox and infoboxtest collections based on the GEInfoboxTemplates and GEInfoboxTemplatesTest community configuration fields.

To update the list, first set GEInfoboxTemplatesTest, then check which infobox-containing articles the change would affect: the search hastemplatecollection:infoboxtest -hastemplatecollection:infobox lists articles that would be added to the filter, and -hastemplatecollection:infoboxtest hastemplatecollection:infobox lists articles that would be removed from it.
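
The two comparison searches are set differences between the collections, which can be sketched as follows. The helper names are hypothetical. The term order in the second query differs from the one quoted above, which should not matter since both terms must match either way.

```shell
# Sketch only: helper names are illustrative.

# Query for articles in collection $1 that are not in collection $2.
collection_difference_query() {
    printf 'hastemplatecollection:%s -hastemplatecollection:%s\n' "$1" "$2"
}

# Articles the test list would add to the infobox filter.
added_by_test_list() { collection_difference_query infoboxtest infobox; }

# Articles the test list would remove from the infobox filter.
removed_by_test_list() { collection_difference_query infobox infoboxtest; }

added_by_test_list
removed_by_test_list
```

Each printed line can be pasted into the search box of the wiki being configured.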

The list of infoboxes is generated by the tgr/infobox-templates script. There is no accurate way of telling which template is an infobox; the script is a set of heuristics.

Enabling image recommendations on a new wiki

(See mw:Growth#Deployment_table for the current status.) The data pipeline and API work for all wikis. To enable the feature on the MediaWiki side:

  • Make sure the image recommendation task type is enabled in community config:
    export PHAB=T123456 # deployment task
    for WIKI in wiki1 wiki2 wiki3 ...; do
        mwscript extensions/GrowthExperiments/maintenance/changeWikiConfig.php $WIKI \
            --page MediaWiki:NewcomerTasks.json \
            --create-only \
            --json \
            --summary "Growth features configuration boilerplate ([[phab:$PHAB]])" \
            image-recommendation \
            '{ "type": "image-recommendation", "group": "medium" }';
    done
    
  • Fetch the list of infoboxes with python infobox-templates.py --format=json $LANG (using tgr/infobox-templates) and set it in community configuration, e.g.
    export PHAB=T123456 # deployment task
    mwscript extensions/GrowthExperiments/maintenance/changeWikiConfig.php $WIKI \
      --page MediaWiki:GrowthExperimentsConfig.json \
      --json \
      --summary "machine-generated list of infobox generators ([[phab:$PHAB]])" \
      GEInfoboxTemplates \
      "`jq --compact-output . <infobox-templates.py output file>`"
    
  • Set $wgGENewcomerTasksImageRecommendationsEnabled to true for the wiki.

Section-level images

Section-level image recommendations (T321754) are handled mostly the same way as article-level image recommendations. The data pipeline that generates the search index tags and Cassandra data differs, as does the UI logic. Otherwise the two are almost identical, except that:

  • the search index tag is recommendation.image_section (and correspondingly the search keyword is hasrecommendation:image_section);
  • the task type is section-image-recommendation;
  • in the API responses, the section_heading and section_index fields are not null.
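
The naming differences between the two task types can be summarized in a small lookup. This is a minimal sketch. The function name is hypothetical, while the task types, tags and keywords are the ones named on this page.

```shell
# Sketch only: the function name is illustrative.

# Print the search index tag and the search keyword for an image task type.
# $1: task type (image-recommendation or section-image-recommendation)
image_task_search_terms() {
    case "$1" in
        image-recommendation)
            tag='recommendation.image' ;;
        section-image-recommendation)
            tag='recommendation.image_section' ;;
        *)
            echo "unknown task type: $1" >&2
            return 1 ;;
    esac
    # The keyword is the tag with "recommendation." swapped for "hasrecommendation:".
    printf '%s hasrecommendation:%s\n' "$tag" "${tag#recommendation.}"
}

image_task_search_terms image-recommendation
image_task_search_terms section-image-recommendation
```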

Troubleshooting

To test the API, log into a production host and run the following, replacing <wiki id> and <page id>:

    curl -H 'Accept: application/json' 'http://localhost:6030/public/image_suggestions/suggestions/<wiki id>/<page id>' | jq .
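
For repeated checks, the same request can be wrapped in a small helper. This is a sketch with hypothetical function names. The host, port and path are the ones quoted above, and cswiki / 319 are taken from the example response on this page.

```shell
# Sketch only: helper names are illustrative.

# Build the internal API URL for one page.
# $1: wiki id (e.g. cswiki), $2: page id (e.g. 319)
suggestions_url() {
    printf 'http://localhost:6030/public/image_suggestions/suggestions/%s/%s\n' "$1" "$2"
}

# Fetch and pretty-print the suggestions for one page.
# Only works where the service is reachable on localhost.
fetch_suggestions() {
    curl --silent -H 'Accept: application/json' "$(suggestions_url "$1" "$2")" | jq .
}

# Usage on a production host (not run here):
#   fetch_suggestions cswiki 319
```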

For indirect testing, you can use the action API:
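
One possible shape for such a request is sketched below. It assumes the prop=growthimagesuggestiondata module mentioned earlier is available on the target wiki and follows the usual action=query conventions. The pageids parameter, the hostname and the function name are assumptions rather than details taken from this page, and the module may require further parameters or a logged-in session.

```shell
# Sketch only: the function name, hostname and pageids parameter are assumptions.

# Print an action API URL requesting growthimagesuggestiondata for one page.
# $1: wiki hostname (placeholder), $2: page id
growth_suggestion_url() {
    printf 'https://%s/w/api.php?action=query&prop=growthimagesuggestiondata&pageids=%s&format=json\n' "$1" "$2"
}

# Usage (not run here):
#   curl --silent "$(growth_suggestion_url cs.wikipedia.org 319)" | jq .
```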

Example API response

{
      "wiki": "cswiki",
      "page_id": 319,
      "id": "65f0a7ce-ea3b-11ed-80ee-f4e9d4dbbe90",
      "image": "Vladimir_Smicer.jpg",
      "confidence": 50,
      "found_on": null,
      "kind": [
        "istype-section-topics-p18"
      ],
      "origin_wiki": "commonswiki",
      "page_qid": null,
      "page_rev": 22568084,
      "section_heading": "sport",
      "section_index": 2
}

See also