This tool runs a list of titles through media search on Wikimedia Commons and records some simple data, including:
- Total number of hits per title
- Number of titles that have at least one result
- Percentage of titles with results
To record the number of media search hits per title in a list of titles:
php jobs/RecordResultsCount.php --filename="input/[filename]" --langCode="[langcode]"
This script assumes a .tsv file with 2 columns: pageid and title.
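As a rough illustration of the expected input, the sketch below writes a small sample file and checks its shape before it is passed to a job. The `check_tsv` helper and the `input/sample.tsv` name are hypothetical and not part of this tool. It also assumes the file has no header row, which the description above does not specify.

```shell
# Write a tiny example input file: one "pageid<TAB>title" row per line.
mkdir -p input
printf '12\tAnarchism\n25\tAutism\n39\tAlbedo\n' > input/sample.tsv

# check_tsv FILE: hypothetical helper (not part of this tool) that reports
# rows which do not have exactly two tab-separated fields, or whose first
# field is not a numeric page ID. Exits non-zero if any row is rejected.
# Assumption: the file has no header row.
check_tsv() {
    awk -F '\t' '
        NF != 2 || $1 !~ /^[0-9]+$/ {
            printf "line %d: unexpected row: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

check_tsv input/sample.tsv && echo "input/sample.tsv looks OK"
```

If your file does start with a header row, the check will flag line 1. Whether the jobs themselves tolerate a header is not stated here.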
Another job, RemoveDisambiguationPages, strips Wikipedia disambiguation pages out of the original data. They aren't good candidates for adding images via the future API, so leaving them in would skew the results of this test. To run this job:
php jobs/RemoveDisambiguationPages.php --filename="input/[filename]" --langCode="[langcode]"
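For context, one way to tell whether a page is a disambiguation page is to ask the MediaWiki API for its `disambiguation` page property. The sketch below only builds such a query URL. It is not necessarily how RemoveDisambiguationPages works internally. The `disambig_query_url` helper is hypothetical, and it assumes the language code maps directly to a Wikipedia subdomain.

```shell
# disambig_query_url LANGCODE PAGEID...: hypothetical helper that prints a
# MediaWiki API URL asking whether the given pages carry the
# "disambiguation" page property. Page IDs are joined with an encoded "|".
# Assumption: LANGCODE is a Wikipedia subdomain such as "en".
disambig_query_url() {
    lang=$1
    shift
    ids=$(printf '%s%%7C' "$@")   # e.g. "12%7C25%7C"
    ids=${ids%'%7C'}              # drop the trailing separator
    printf 'https://%s.wikipedia.org/w/api.php?action=query&prop=pageprops&ppprop=disambiguation&format=json&pageids=%s\n' "$lang" "$ids"
}

disambig_query_url en 12 25
# To actually send the request (needs network access):
#   curl -s "$(disambig_query_url en 12 25)"
```

In the JSON response, pages that are disambiguation pages include a `pageprops` object with a `disambiguation` key, and other pages omit it.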
You can also take a random sample rather than going through an entire file. This script picks n random, unique indices and fetches totalhits for them, where n is the limit set via the --limit option (if omitted, the default limit is 1000):
php jobs/RecordRandomSample.php --filename="input/[filename]" --langCode="[langcode]" --limit=10000
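The sampling idea described above (n distinct random indices into the file) can be sketched in shell as follows. This is an illustration, not the script's actual implementation. The `sample_indices` helper is hypothetical, and it assumes `shuf` from GNU coreutils is available.

```shell
# sample_indices TOTAL [N]: hypothetical helper that prints up to N distinct
# zero-based row indices drawn at random from 0..TOTAL-1.
# Requires shuf from GNU coreutils; without -r it never repeats a value.
sample_indices() {
    total=$1
    n=${2:-1000}                      # mirror the default limit of 1000
    [ "$total" -gt 0 ] || return 0    # nothing to sample from
    shuf -i "0-$((total - 1))" -n "$n"
}

# Example: pick 3 distinct row indices for a 10-row file.
sample_indices 10 3
```

If N is larger than the number of rows, every index is returned once, in random order. To sample the lines themselves rather than their indices, `shuf -n 1000 input/[filename]` does the same kind of draw directly on the file.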
To get the results data from an output file:
php jobs/GetResultsData.php --filename="output/[filename]"
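The summary figures listed at the top (titles with at least one result, and the percentage of titles with results) are simple to derive from per-title hit counts. The sketch below shows the arithmetic only. The `summarize_hits` helper is hypothetical, and it assumes each row is `title<TAB>totalhits`, which may not match the real layout of the files in output/.

```shell
# summarize_hits FILE: hypothetical helper, not part of this tool.
# Assumption: each row is "title<TAB>totalhits".
summarize_hits() {
    awk -F '\t' '
        { total++ }                      # count every title
        $2 + 0 > 0 { with_results++ }    # count titles with at least one hit
        END {
            pct = total ? 100 * with_results / total : 0
            printf "titles: %d\nwith results: %d\npercentage: %.1f%%\n", total, with_results, pct
        }
    ' "$1"
}

# Example with made-up hit counts: 2 of 4 titles have results.
sample=$(mktemp)
printf 'Anarchism\t120\nAlbedo\t0\nAutism\t7\nApollo\t0\n' > "$sample"
summarize_hits "$sample"
# titles: 4
# with results: 2
# percentage: 50.0%
rm -f "$sample"
```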