Python Web Scrapers for Linux

Browse free, open source Python web scrapers for Linux below. Use the toggles on the left to filter projects by OS, license, language, programming language, and project status.

  • 1
    KemonoDownloader

    Kemono Downloader - A cross-platform Python app built with PyQt6

    Welcome to Kemono Downloader, a versatile Python-based desktop application built with PyQt6, designed to download content from Kemono.su. This tool enables users to archive individual posts or entire creator profiles from services like Patreon, Fanbox, and more, supporting a wide range of file types with customizable settings and advanced features.
    Downloads: 472 This Week
  • 2
    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring and automated testing.
    Downloads: 20 This Week
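    As a minimal illustration of the "write the rules" workflow described above, a spider can be sketched as follows (quotes.toscrape.com is a public demo site used purely as an example target):

        import scrapy

        class QuotesSpider(scrapy.Spider):
            """Toy spider: extract quote text and author from a demo site."""
            name = "quotes"
            start_urls = ["https://quotes.toscrape.com/"]

            def parse(self, response):
                # The extraction rules are plain CSS selectors; Scrapy handles
                # scheduling, downloading, retries and item export.
                for quote in response.css("div.quote"):
                    yield {
                        "text": quote.css("span.text::text").get(),
                        "author": quote.css("small.author::text").get(),
                    }

    Run it with, e.g., scrapy runspider quotes_spider.py -o quotes.json to export the items as JSON.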
  • 3
    finvizfinance

    Finviz analysis python library

    finvizfinance is a package that collects financial information from the FinViz website: stock charts, fundamental and technical data, insider information, and stock news, as well as forex and crypto charts and performance. The Screener and Group modules provide data frames for comparing stocks according to different filters and trading signals, while individual-stock lookups return fundamentals, the company description, outer ratings, stock news, and insider-trading data. A short usage sketch follows this entry.
    Downloads: 14 This Week
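    A short usage sketch based on the package's published examples; "TSLA" is only an illustrative ticker, and method names may vary between releases:

        from finvizfinance.quote import finvizfinance

        stock = finvizfinance("TSLA")             # illustrative ticker
        fundamentals = stock.ticker_fundament()   # dict of fundamental ratios
        description = stock.ticker_description()  # company description text
        news = stock.ticker_news()                # DataFrame of recent headlines

        print(fundamentals.get("P/E"))
        print(description[:100])
        print(news.head())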
  • 4
    Snoop Project

    Powerful OSINT username-search software, particularly strong for the CIS region

    Snoop is an open-data intelligence (OSINT) tool. The Snoop Project is one of the most promising OSINT tools for finding nicknames, and it is particularly powerful for the CIS region. Snoop is developed without regard for the opinions of the NSA and its friends, that is, it is available to the average user. Snoop is a research project (own database / closed bug bounty) in the field of searching and processing public data on the Internet. In terms of specialized search, Snoop is able to compete with traditional search engines.
    Downloads: 11 This Week
  • 5
    Basketball Reference

    NBA Stats API via Basketball Reference

    Basketball Reference is a great site (especially for a basketball stats nut like me), and hopefully, they don't get too pissed off at me for creating this. I initially wrote this library as an exercise for creating my first PyPi package, hope you find it valuable! This library was created for another Python project where I was trying to estimate an NBA player's productivity. A lot of sports-related APIs are expensive - luckily, Basketball Reference provides a free service which can be scraped and translated into a usable API.
    Downloads: 2 This Week
  • 6
    Gerapy

    Distributed Crawler Management Framework Based on Scrapy

    Distributed crawler management framework based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. Anyone who has written crawlers in Python has probably used Scrapy: it is a very powerful crawler framework with high crawling efficiency and good scalability, and it is practically a must-have tool for developing crawlers in Python. You can of course run Scrapy crawls on your own machine, but when a crawl is very large that stops being practical; a better approach is to deploy the Scrapy project to a remote server and run it there. That is where Scrapyd comes in: install Scrapyd on the remote server, start the service, and you can deploy your Scrapy projects to the remote host. In addition, Scrapyd provides a variety of operation APIs, which give you free control over the running of the Scrapy project.
    Downloads: 2 This Week
  • 7
    Grab Framework Project

    Web Scraping Framework

    Grab is a Python framework for building web scrapers. With Grab you can build scrapers of varying complexity, from simple five-line scripts to complex asynchronous crawlers processing millions of web pages. Grab provides an API for performing network requests and for handling the received content, e.g. interacting with the DOM tree of the HTML document. The single request/response API lets you build a network request, perform it, and work with the received content; it is built on top of the urllib3 and lxml libraries. The Spider API is for building asynchronous website crawlers: you write classes that define handlers for each type of network request, each handler can spawn new network requests, and requests are processed concurrently by a pool of asynchronous sockets.
    Downloads: 2 This Week
  • 8
    JobFunnel

    Scrape job websites into a single spreadsheet with no duplicates.

    Scrape job websites into a single spreadsheet with no duplicates. An automated tool for scraping job postings into a .csv file. You can search for jobs with YAML configuration files or by passing command-line arguments. By performing regular scraping and reviewing, you can cut through the noise of even the busiest job markets. Run funnel with your settings YAML to populate your master CSV file with jobs from the available providers. JobFunnel can easily be automated to run nightly with crontab. If you have a job website you'd like to write a scraper for, you are welcome to implement it; review the Base Scraper for implementation details. JobFunnel supports scraping jobs from the same job website across locales and domains. If you are interested in adding support, you may only need to define session headers and domain strings; again, review the Base Scraper for further implementation details.
    Downloads: 2 This Week
  • 9
    CyberScraper 2077

    A Powerful web scraper powered by LLM | OpenAI, Gemini & Ollama

    CyberScraper 2077 is not just another web scraping tool – it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI, Gemini and LocalLLM Models to slice through the web's defenses, extracting the data you need with unparalleled precision and style.
    Downloads: 1 This Week
  • 10
    Selectolax

    Python binding to Modest and Lexbor engines

    A fast HTML5 parser with CSS selectors, using the Modest and Lexbor engines. Selectolax supports two backends: Modest and Lexbor. By default, all examples use the Modest backend. Most features are almost identical between backends, but there are still some differences; currently, the Lexbor backend is in beta and missing some features. To use Lexbor, just import its parser and use it in a similar way to the HTMLParser, as in the sketch after this entry.
    Downloads: 1 This Week
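    A brief sketch of both backends, following the usage described above:

        from selectolax.parser import HTMLParser        # Modest backend (default)
        from selectolax.lexbor import LexborHTMLParser  # Lexbor backend (beta)

        html = "<div><p class='title'>Hello</p><p>World</p></div>"

        # Modest backend: CSS selectors over a fast HTML5 parse tree.
        tree = HTMLParser(html)
        print([node.text() for node in tree.css("p")])  # ['Hello', 'World']
        print(tree.css_first("p.title").text())         # 'Hello'

        # The Lexbor backend exposes an almost identical API.
        ltree = LexborHTMLParser(html)
        print(ltree.css_first("p.title").text())        # 'Hello'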
  • 11

    Python Crawler Library

    Python Web Crawler Library

    A simple library for crawling the web. This library gives you the ability to create macros for crawling web sites and performing simple actions, such as logging in, on those sites.
    Downloads: 3 This Week
  • 12

    Domain Analyzer Security Tool

    Finds all the security information for a given domain name

    Domain analyzer is a security analysis tool which automatically discovers and reports information about the given domain. Its main purpose is to analyze domains in an unattended way.
    Downloads: 2 This Week
  • 13
    AutoScraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    This project is made for automatic web scraping to make scraping easy. It gets a URL or the HTML content of a web page and a list of sample data that we want to scrape from that page. This data can be text, URL or any HTML tag value of that page. It learns the scraping rules and returns similar elements. Then you can use this learned object with new URLs to get similar content or the exact same element of those new pages.
    Downloads: 0 This Week
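    A minimal sketch of the learn-then-reuse flow described above, following the project's own style of example (the URLs and sample text are placeholders):

        from autoscraper import AutoScraper

        url = "https://stackoverflow.com/questions/2081586/web-scraping-with-python"
        wanted_list = ["What are metaclasses in Python?"]  # sample of the data we want

        scraper = AutoScraper()
        # build() learns extraction rules that return elements similar to wanted_list.
        result = scraper.build(url, wanted_list)
        print(result)

        # The learned rules can then be reused on other pages of the same site.
        similar = scraper.get_result_similar(
            "https://stackoverflow.com/questions/606191/convert-bytes-to-a-string"
        )
        print(similar)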
  • 14
    Crawlab

    Distributed web crawler admin platform for spiders management

    A Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java and PHP, and various web crawler frameworks including Scrapy, Puppeteer and Selenium. Use docker-compose to start it up with one click; that way you don't even have to configure the MongoDB database. The frontend app interacts with the master node, which communicates with other components such as MongoDB, SeaweedFS and the worker nodes. The master node and worker nodes communicate with each other via gRPC (an RPC framework). Tasks are scheduled by the task scheduler module in the master node and received by the task handler module in worker nodes, which executes them in task runners. Task runners are actually processes running spider or crawler programs, and they can also send data through gRPC (integrated in the SDK) to other data sources, e.g. MongoDB.
    Downloads: 0 This Week
  • 15
    Webhunter is a distributed, multi-threaded web crawler designed for both general indexing and crawling the web for focused content.
    Downloads: 0 This Week
  • 16
    Letterboxd Recommendations

    Scraping publicly-accessible Letterboxd data for movie recommendations

    Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username. A user's "star" ratings are scraped from their Letterboxd profile and assigned numerical ratings from 1 to 10 (accounting for half stars). Their ratings are then combined with a sample of ratings from the top 4000 most active users on the site to create a collaborative filtering recommender model using singular value decomposition (SVD). All movies in the full dataset that the user has not rated are run through the model for predicted scores and the items with the top predicted scores are returned. Due to constraints in time and computing power, the maximum sample size that a user is allowed to select is 500,000 samples, though there are over five million ratings in the full dataset from the top 4000 Letterboxd users alone.
    Downloads: 0 This Week
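    The project's own pipeline is not reproduced here, but the SVD collaborative-filtering step it describes can be sketched with the scikit-surprise library (toy data, hypothetical user and film IDs):

        import pandas as pd
        from surprise import SVD, Dataset, Reader

        # Toy ratings on the 1-10 scale described above (hypothetical users/films).
        ratings = pd.DataFrame({
            "user":   ["a", "a", "b", "b", "c", "c"],
            "film":   ["dune", "heat", "dune", "tar", "heat", "tar"],
            "rating": [9, 7, 8, 10, 6, 9],
        })

        reader = Reader(rating_scale=(1, 10))
        data = Dataset.load_from_df(ratings[["user", "film", "rating"]], reader)

        # Fit an SVD model on all available ratings, then score an unseen pair.
        algo = SVD()
        algo.fit(data.build_full_trainset())
        print(algo.predict("a", "tar").est)  # predicted rating for an unrated film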
  • 17
    MechanicalSoup

    A Python library for automating interaction with websites

    A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms; it doesn't do JavaScript. MechanicalSoup was created by M Hickford, who was a fond user of the Mechanize library. Unfortunately, Mechanize was incompatible with Python 3 until 2019 and its development stalled for several years. MechanicalSoup provides a similar API, built on the Python giants Requests (for HTTP sessions) and BeautifulSoup (for document navigation); since 2017 it has been actively maintained by a small team. A short form-filling sketch follows this entry.
    Downloads: 0 This Week
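    A small sketch of the Requests + BeautifulSoup workflow described above, using the public httpbin.org demo form as the target:

        import mechanicalsoup

        # StatefulBrowser keeps cookies and the current page between calls.
        browser = mechanicalsoup.StatefulBrowser()
        browser.open("http://httpbin.org/forms/post")

        # Select the form, fill in a couple of fields and submit it.
        browser.select_form('form[action="/post"]')
        browser["custname"] = "Jane Doe"
        browser["custtel"] = "555-0100"
        response = browser.submit_selected()  # a requests.Response

        # httpbin echoes the submitted form data back as JSON.
        print(response.json()["form"])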
  • 18
    Nomad is a tiny but efficient search engine and web crawler. It works very well for searching within a set of corporate websites on the Internet and/or an intranet's HTML documents or knowledge repositories.
    Downloads: 0 This Week
  • 19
    ScrapeGraphAI

    Python scraper based on AI

    Extract content from websites and local documents using LLMs. ScrapeGraphAI is a web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you; a minimal sketch follows this entry.
    Downloads: 0 This Week
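    A hedged sketch of the prompt-driven pipeline, loosely following the project's SmartScraperGraph examples; the API key, model name, target URL and config keys are placeholders and may differ between versions:

        from scrapegraphai.graphs import SmartScraperGraph

        graph_config = {
            "llm": {
                "api_key": "YOUR_OPENAI_API_KEY",  # placeholder
                "model": "openai/gpt-4o-mini",     # placeholder model name
            },
            "verbose": False,
        }

        # Describe what to extract in plain language; the library builds the pipeline.
        smart_scraper = SmartScraperGraph(
            prompt="List the article titles on this page",
            source="https://example.com",          # placeholder URL
            config=graph_config,
        )
        print(smart_scraper.run())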
  • 20
    Scrapy-Redis

    Redis-based components for Scrapy

    You can start multiple spider instances that share a single Redis queue, which is best suited for broad multi-domain crawls. Scraped items get pushed into a Redis queue, meaning that you can start as many post-processing processes as needed, all sharing the items queue. It provides a scheduler + duplication filter, an item pipeline, and base spiders. The default requests serializer is pickle, but it can be changed to any module that provides loads and dumps functions; note that pickle is not compatible between Python versions. Version 0.3 changed the requests serialization from marshal to cPickle, therefore requests persisted with version 0.2 will not work on 0.3. The class scrapy_redis.spiders.RedisSpider enables a spider to read its URLs from Redis. The URLs in the Redis queue are processed one after another; if the first request yields more requests, the spider will process those before fetching another URL from Redis. A configuration sketch follows this entry.
    Downloads: 0 This Week
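    A configuration sketch based on the settings documented by the project (the Redis URL and key names are placeholders):

        # settings.py -- enable the Redis-backed scheduler, dupefilter and pipeline.
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        ITEM_PIPELINES = {
            "scrapy_redis.pipelines.RedisPipeline": 300,  # push scraped items to Redis
        }
        REDIS_URL = "redis://localhost:6379"              # placeholder Redis instance

        # myspider.py -- a spider that reads its start URLs from a Redis list.
        from scrapy_redis.spiders import RedisSpider

        class MySpider(RedisSpider):
            name = "myspider"
            redis_key = "myspider:start_urls"  # LPUSH URLs here to feed the spider

            def parse(self, response):
                yield {"url": response.url,
                       "title": response.css("title::text").get()}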
  • 21
    Scrapyd

    A service daemon to run Scrapy spiders

    Scrapyd can manage multiple projects, and each project can have multiple versions uploaded, but only the latest one will be used for launching new spiders. A common (and useful) convention for the version name is the revision number of the version control tool you're using to track your Scrapy project code, for example r23. Versions are not compared alphabetically but with a smarter algorithm (the same one the packaging library uses), so r10 compares greater than r9. Scrapyd is an application (typically run as a daemon) that listens for requests to run spiders and spawns a process for each one. Scrapyd also runs multiple processes in parallel, allocating them to a fixed number of slots given by the max_proc and max_proc_per_cpu options, and starting as many processes as possible to handle the load. A small sketch of its HTTP JSON API follows this entry.
    Downloads: 0 This Week
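    A small sketch of driving the HTTP JSON API described above with plain requests; the host, project and spider names are placeholders:

        import requests

        SCRAPYD = "http://localhost:6800"  # placeholder Scrapyd host

        # Check the daemon and list deployed projects.
        print(requests.get(f"{SCRAPYD}/daemonstatus.json").json())
        print(requests.get(f"{SCRAPYD}/listprojects.json").json())

        # Schedule a spider run; Scrapyd spawns a process for it in a free slot.
        job = requests.post(
            f"{SCRAPYD}/schedule.json",
            data={"project": "myproject", "spider": "myspider"},  # placeholders
        ).json()
        print(job.get("jobid"))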
  • 22
    ScrapydWeb

    Web app for Scrapyd cluster management

    Web app for Scrapyd cluster management, with support for Scrapy log analysis and visualization. Make sure that Scrapyd has been installed and started on all of your hosts, then start ScrapydWeb via the command scrapydweb (a config file is generated for customizing settings on the first startup). Add your Scrapyd servers; both string and tuple formats are supported, and you can attach basic auth for accessing a Scrapyd server as well as a string for grouping or labeling. You can then select any number of Scrapyd servers by grouping and filtering and invoke Scrapyd's HTTP JSON API on the cluster with just a few clicks.
    Downloads: 0 This Week
  • 23
    Trafilatura

    Python & command-line tool to gather text on the Web

    Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats. Going from raw HTML to the essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll, etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast; it runs in production on millions of documents. A minimal usage sketch follows this entry.
    Downloads: 0 This Week
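    A minimal sketch of the fetch-and-extract flow described above (example.org stands in for any target page; option names follow the documented API of recent versions):

        import trafilatura

        # Download a page and extract its main text, dropping boilerplate
        # such as headers, footers and link lists.
        downloaded = trafilatura.fetch_url("https://www.example.org/")
        text = trafilatura.extract(downloaded)
        print(text)

        # The precision/recall trade-off mentioned above is exposed as options.
        precise = trafilatura.extract(downloaded, favor_precision=True)
        print(precise)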
  • 24
    Twitter Intelligence

    Twitter Intelligence OSINT project performs tracking and analysis

    A project written in Python for Twitter tracking and analysis without using the Twitter API. This project is a Python 3.x application. The package dependencies are listed in the file requirements.txt; install them from that file. SQLite is used as the database, and tweet data is stored in the Tweet, User, Location, Hashtag and HashtagTweet tables; the database is created automatically. analysis.py performs the analysis processing: user, hashtag, and location analyses are available. You must set a Google Maps API key in setting.py to display Google Maps.
    Downloads: 0 This Week
  • 25

    VIT Marks Display

    A small program that accesses VIT marks of a specific student

    A small program, written while learning Python and web interfacing, that retrieves the marks of a specific valid VIT student using basic web scraping techniques.
    Downloads: 0 This Week