forked from civiliangame/PinterestScraper
    
        
        - 
                Notifications
    
You must be signed in to change notification settings  - Fork 0
 
DrunkCodes/PinterestScraper
Folders and files
| Name | Name | Last commit message  | Last commit date  | |
|---|---|---|---|---|
Repository files navigation
This project is a web crawler/scraper for Pinterest. Pinterest has one of the internet's largest image databases. All sorts of people post their useless images here. Useless as they may be individually, they have potential to be significant together. For instance, let's say that you're making a general object detector with machine learning. How will you teach it to recognize things? You'll most likely require a huge database for your machine to train upon. This project provides that database. You can search for ANY keyword, and the program will download ALL images pinterest has on that subject. This is the gist of how it works: 1. Two Selenium webdrivers are created. For the sake of simplicity and transparency, we will use Chrome. 2. Both will log onto Pinterest with a bogus account that I created just for this purpose. 3. Here is where threading comes in. One window will search the keyword in Pinterest by messing around with the URL, while the other waits. 4. The first one will now go through the page and scrape all the URLs embedded in the source code that leads to a detailed page of the picture. This is because the high-resolution pictures are only in the detailed page, while the original search page only has the thumbnails. 5. The first webdriver will then place the links into a list that the other thread can access. It's basically a producer/consumer structure. I designed it this way for maximum efficiency and convenience, as if you just use one driver it will have to constantly go back and forward in the page. 6. After the first webdriver found everything in the page, it will keep attempting to scroll down. This is because Pinterest has an infinite-scroll structure, meaning that additional source code for the page will only load after the "scroll down and load" command is given. The first webdriver will keep doing this until it has found 10000 pictures or cannot scroll any more, whichever comes first. 7. The second webdriver will take the URLS given by the first webdriver and go into each page. There, it will find the high-resolution image src file for each page and download it into a directory. If the directory is not made, it will make it first. The second webdriver will keep going until every single image has been downloaded. To run it, make sure you have Python on your computer first. Then, make sure to run the PintrestScraper file from that directory. It will NOT work without chromedriver.exe and EnglishScraper.py in the same directory. Enjoy, and don't do illegal stuff with it.
About
Scraping Images from Pinterest
Resources
Stars
Watchers
Forks
Releases
No releases published
              Packages 0
        No packages published 
      
              Languages
- Python 100.0%