GitHub - DrunkCodes/PinterestScraper: Scraping Images from Pinterest

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
__pycache__		__pycache__
EnglishScraper.py		EnglishScraper.py
PintrestScraper.py		PintrestScraper.py
README.txt		README.txt
chromedriver.exe		chromedriver.exe

Repository files navigation

This project is a web crawler/scraper for Pinterest.
Pinterest has one of the internet's largest image databases.
All sorts of people post their useless images here.
Useless as they may be individually, they have potential to be significant together.
For instance, let's say that you're making a general object detector with machine learning.
How will you teach it to recognize things?

You'll most likely require a huge database for your machine to train upon.
This project provides that database.

You can search for ANY keyword, and the program will download ALL images pinterest has on that subject.

This is the gist of how it works:
1. Two Selenium webdrivers are created. For the sake of simplicity and transparency, we will use Chrome.
2. Both will log onto Pinterest with a bogus account that I created just for this purpose.
3. Here is where threading comes in. One window will search the keyword in Pinterest by messing around with the URL,
while the other waits.
4. The first one will now go through the page and scrape all the URLs embedded in the source code that
leads to a detailed page of the picture. This is because the high-resolution pictures are only
in the detailed page, while the original search page only has the thumbnails.
5. The first webdriver will then place the links into a list that the other thread can access.
It's basically a producer/consumer structure. I designed it this way for maximum efficiency and convenience,
as if you just use one driver it will have to constantly go back and forward in the page.
6. After the first webdriver found everything in the page, it will keep attempting to scroll down.
This is because Pinterest has an infinite-scroll structure, meaning that additional source code for the page
will only load after the "scroll down and load" command is given. The first webdriver will keep doing this until
it has found 10000 pictures or cannot scroll any more, whichever comes first.
7. The second webdriver will take the URLS given by the first webdriver and go into each page. There, it will find the
high-resolution image src file for each page and download it into a directory. If the directory is not made, it will make it first.
The second webdriver will keep going until every single image has been downloaded.

To run it, make sure you have Python on your computer first.
Then, make sure to run the PintrestScraper file from that directory. It will NOT work without chromedriver.exe and EnglishScraper.py in the same directory.
Enjoy, and don't do illegal stuff with it.