DrunkCodes/PinterestScraper

This project is a web crawler and scraper for Pinterest.
Pinterest hosts one of the largest image collections on the internet.
All sorts of people post their useless images there.
Useless as they may be individually, together they have the potential to be significant.
For instance, let's say that you're building a general-purpose object detector with machine learning.
How will you teach it to recognize things?

You'll most likely need a huge image dataset to train your model on.
This project helps you collect that dataset.

You can search for any keyword, and the program will download the images Pinterest returns for it, up to the 10000-picture limit described in step 6 below.

This is the gist of how it works:
1. Two Selenium webdrivers are created. For the sake of simplicity and transparency, both use Chrome.
2. Both log in to Pinterest with a throwaway account that I created just for this purpose.
3. Here is where threading comes in. One window searches for the keyword on Pinterest by editing the search URL,
	while the other waits.
4. The first webdriver then goes through the page and scrapes every URL embedded in the source code that
	leads to the detail page of a picture. This is necessary because the high-resolution pictures are only
	on the detail pages, while the search page only has thumbnails.
5. The first webdriver places the links into a list that the other thread can access.
	It's basically a producer/consumer structure. I designed it this way for efficiency and convenience:
	with just one driver, it would have to keep going back and forth between the search page and the detail pages.
6. Once the first webdriver has found everything on the page, it keeps attempting to scroll down.
	This is because Pinterest uses infinite scrolling, meaning that additional source code for the page
	only loads after the page has been scrolled down. The first webdriver keeps doing this until
	it has found 10000 pictures or cannot scroll any more, whichever comes first.
7. The second webdriver takes the URLs collected by the first webdriver and visits each page. There, it finds the
	high-resolution image src for that page and downloads it into a directory. If the directory does not exist, it creates it first.
	The second webdriver keeps going until every image has been downloaded.
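
The steps above can be sketched roughly as follows. This is not the code from this repository, just a minimal illustration of the producer/consumer idea under some assumptions: the search URL layout, the shape of the detail-page links (/pin/<number>/), and the names search_url, producer, consumer, run and save_image are all made up for the example. The driver argument is expected to behave like a Selenium webdriver (get, page_source, execute_script), and save_image stands in for whatever the second webdriver does with each detail page.

```python
import os
import queue
import re
import threading
import time
from urllib.parse import quote_plus

MAX_PINS = 10000  # cap mentioned in step 6
DONE = None       # sentinel that tells the consumer to stop
# Assumed shape of the detail-page links in the page source.
PIN_LINK = re.compile(r'href="(/pin/\d+/)"')


def search_url(keyword):
    # Assumed URL layout; the README only says the search goes through the URL.
    return "https://www.pinterest.com/search/pins/?q=" + quote_plus(keyword)


def producer(driver, keyword, links, pause=2.0, max_pins=MAX_PINS):
    """Collect detail-page links, scrolling until the cap or the end of the page."""
    seen = set()
    try:
        driver.get(search_url(keyword))
        while len(seen) < max_pins:
            for path in PIN_LINK.findall(driver.page_source):
                if path not in seen and len(seen) < max_pins:
                    seen.add(path)
                    links.put("https://www.pinterest.com" + path)
            old_height = driver.execute_script("return document.body.scrollHeight")
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)  # give the page time to load more pins
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == old_height:
                break  # nothing new loaded, so the page cannot scroll further
    finally:
        links.put(DONE)  # always release the consumer, even after an error


def consumer(links, save_image, out_dir):
    """Take links off the queue and pass each one to save_image."""
    os.makedirs(out_dir, exist_ok=True)  # create the directory if it is missing
    count = 0
    while True:
        link = links.get()
        if link is DONE:
            return count
        save_image(link, out_dir)
        count += 1


def run(search_driver, keyword, save_image, out_dir, pause=2.0):
    links = queue.Queue()  # thread-safe hand-off between the two sides
    worker = threading.Thread(
        target=producer, args=(search_driver, keyword, links, pause)
    )
    worker.start()
    count = consumer(links, save_image, out_dir)
    worker.join()
    return count
```

A queue.Queue is used here instead of a plain list because it handles the locking between the two threads. In a real run, search_driver would be the first Chrome webdriver and save_image would use the second one to open each link and download the high-resolution image.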

To run it, make sure you have Python and the Selenium package installed on your computer first.
Then run the PintrestScraper file from the project directory. It will NOT work without chromedriver.exe and EnglishScraper.py in the same directory.
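
If you want to check the setup before starting, something like the following could help. It is only a sketch and not part of the project: the function name missing_files is made up, and the two file names are simply the ones listed above.

```python
import os

# File names taken from the instructions above.
REQUIRED = ("chromedriver.exe", "EnglishScraper.py")


def missing_files(directory):
    """Return the required files that are not present in the given directory."""
    return [
        name
        for name in REQUIRED
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

Calling missing_files(".") from the project directory returns an empty list when both files are in place.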
Enjoy, and don't do illegal stuff with it.
