Google Image Scraper

A Python script for web scraping that collects images for machine learning model training.

What is this?

While experimenting with image classification, I realized that for some niche industrial items, the images I needed were very poorly represented in publicly available data sets.

Obviously the solution was to contact every single customer and ask them to take hundreds of pictures of things from every angle, right?

I created this script to kick-start some of those machine learning tasks, and realized it could really be useful for everyone else too.

Is this legal where I live/work/etc.?

(THIS IS NOT LEGAL ADVICE - I AM NOT A LAWYER)

tl;dr - Probably not.

Legality varies widely based on your geographic location and where your code is deployed. If you're in Europe using a virtual machine in the United States for a project that will be deployed in China, I can't really help you on the legality of that, or on which law applies.

In general, however, unless a license is explicitly granted for use of images, the copyright of those images belongs to the photographer / publisher. As such, redistribution and use of those images may be restricted in some manner. It's up to you to comply with the law, and this software is being made available as a learning tool, without warranty.

How does Google feel about this?

To quote the Google Terms of Service:

Google reserves the right to suspend or terminate your access to the services or delete your Google Account if any of these things happen:

you materially or repeatedly breach these terms, service-specific additional terms or policies

we’re required to do so to comply with a legal requirement or a court order

your conduct causes harm or liability to a user, third party, or Google — for example, by hacking, phishing, harassing, spamming, misleading others, or scraping content that doesn’t belong to you

That being said, not all scraping is a ToS violation. People surf Google Images and download content for their own personal use every day. What matters is how much, how fast, and, importantly, for what purpose. Take care to keep to a reasonable number of images, and only collect images that you intend to use for proper purposes.

Quick Start

Clone this repository somewhere on your machine

git clone https://github.com/haukened/google-image-scraper

Move into the cloned directory

cd google-image-scraper

Create a Python virtual environment (optional, but it keeps dependencies clean)

python3 -m venv .venv

source ./.venv/bin/activate

Install dependencies

pip install -r requirements.txt

Run the scraper

./scrape.py a bear riding a tricycle

Explore the options

./scrape.py -h

usage: scrape.py [-h] [-s] [-v] [-n NUMBER] [-o OUTPUT_DIRECTORY] [query ...]

Downloads images from Google image search

positional arguments:
query                 the search query for images. aka "What you would type into google"

options:
-h, --help            show this help message and exit
-s, --show-browser    show the browser during downloading
-v, --verbose         show debug information
-n NUMBER, --number NUMBER
                        number of images to download
-o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        the path to store downloaded images

Created by @haukened | Buy me a beer: https://beer.hauken.us
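For readers curious how a command line like this could be wired up, here is a minimal sketch of an argparse parser that accepts the same flags as the help text above. It is not the actual contents of scrape.py: the function name build_parser and the default values (100 images, an "images" directory) are assumptions made for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags listed in the help output."""
    parser = argparse.ArgumentParser(
        prog="scrape.py",
        description="Downloads images from Google image search",
    )
    # nargs="*" collects every bare word, so the query needs no quoting.
    parser.add_argument(
        "query",
        nargs="*",
        help='the search query for images. aka "What you would type into google"',
    )
    parser.add_argument("-s", "--show-browser", action="store_true",
                        help="show the browser during downloading")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="show debug information")
    # Assumed default: the real script may use a different number.
    parser.add_argument("-n", "--number", type=int, default=100,
                        help="number of images to download")
    # Assumed default: the real script may use a different directory.
    parser.add_argument("-o", "--output-directory", default="images",
                        help="the path to store downloaded images")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["-s", "-n", "500", "jimmy", "kimmel", "laughing"])
query = " ".join(args.query)  # the separate words are joined back into one search string
```

Collecting the query with nargs="*" is what would let you type the search words without wrapping them in quotes.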

Examples

To show the browser during scraping, add the -s flag

./scrape.py -s -n 500 jimmy kimmel laughing

Store the images somewhere else

./scrape.py -n 1000 -o /home/<you>/datasets/bears/ the greatest threat to america
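If you point -o at a directory that does not exist yet, something has to create it. The sketch below shows one way a script could prepare the output directory and build numbered file names with pathlib. Both helper names (prepare_output_dir, image_path) and the zero-padded naming scheme are hypothetical and are not taken from scrape.py.

```python
from pathlib import Path


def prepare_output_dir(raw_path: str) -> Path:
    """Expand '~' and create the directory (and any parents) if it is missing."""
    out_dir = Path(raw_path).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def image_path(out_dir: Path, index: int, extension: str = "jpg") -> Path:
    """Build a zero-padded file name such as 00007.jpg (hypothetical naming scheme)."""
    return out_dir / f"{index:05d}.{extension}"
```

Zero-padded names keep the files in download order when a directory listing is sorted alphabetically.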
