Frontera: Large-Scale Open Source Web
Crawling Framework
Alexander Sibiryakov, 20 July 2015
sibiryakov@scrapinghub.com
Hello, participants!
• Born in Yekaterinburg, Russia.
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false-positive resolution, large-scale prediction of malicious download attempts.
«A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.»
–Wikipedia: Web Crawler article, July 2015
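In code, this definition reduces to a small loop. The sketch below is plain Python, with hypothetical fetch and extract_links stand-ins for an HTTP client and a link extractor; deciding how to order the frontier (and when to stop) is exactly the problem Frontera addresses.

    from collections import deque

    def crawl(seeds, fetch, extract_links, max_pages=1000):
        # The frontier holds everything discovered but not yet visited.
        frontier = deque(seeds)
        seen = set(seeds)             # avoid re-queueing known URLs
        visited = 0
        while frontier and visited < max_pages:
            url = frontier.popleft()  # FIFO here; real crawlers order by score
            page = fetch(url)
            visited += 1
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)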
Motivation
• A client needed to crawl 1B+ pages/week and identify frequently changing hub pages.
• Scrapy is hard to use for broad crawls and had no crawl frontier capabilities out of the box.
• People tended to favor Apache Nutch over Scrapy.
(Hub pages as in Hyperlink-Induced Topic Search, Jon Kleinberg, 1999)
Frontera: single-threaded and distributed
• Frontera is all about knowing what to crawl next and when to stop.
• Single-threaded mode can be used for up to 100 websites (parallel downloading).
• For high-performance broad crawls there is a distributed mode.
Main features
• Online operation: scheduling of new batches, updating DB state.
• Storage abstraction: write your own backend (SQLAlchemy- and HBase-based backends are included); a skeleton is sketched below.
• Canonical URL resolution abstraction: each document can have many URLs; which one should be used?
• Scrapy ecosystem: good documentation, big community, ease of customization.
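The storage abstraction boils down to subclassing one interface. Below is a hedged skeleton assuming the 2015-era Backend interface (method names and signatures have changed between releases, so verify against your installed version); RedisQueueBackend is a hypothetical example, not a bundled backend.

    from frontera.core.components import Backend

    class RedisQueueBackend(Backend):
        # Hypothetical backend keeping the queue in Redis.

        @classmethod
        def from_manager(cls, manager):
            return cls()

        def frontier_start(self):
            pass  # open connections

        def frontier_stop(self):
            pass  # flush state and close connections

        def add_seeds(self, seeds):
            pass  # enqueue the seed Requests

        def page_crawled(self, response, links):
            pass  # persist crawl state, enqueue newly discovered links

        def request_error(self, page, error):
            pass  # record the failure for retry/skip decisions

        def get_next_requests(self, max_n_requests, **kwargs):
            return []  # pop up to max_n_requests from the queue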
Single-threaded use cases
• Need for URL metadata and content storage,
• Need to isolate URL ordering/queueing logic from the spider,
• Advanced URL ordering logic (big websites, or revisiting).
Single-threaded architecture (diagram)
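The diagram itself is not reproduced here, but the request cycle it depicts can be sketched as follows. This is a hedged sketch: import paths and method signatures follow the 2015-era Frontera docs (verify them against your installed version), and my_fetch, my_extract and myproject.frontera_settings are hypothetical placeholders.

    from frontera.core.manager import FrontierManager
    from frontera.core.models import Request

    def my_fetch(request):     # placeholder for a real downloader (e.g. Scrapy's)
        raise NotImplementedError

    def my_extract(response):  # placeholder for a real link extractor
        raise NotImplementedError

    manager = FrontierManager.from_settings('myproject.frontera_settings')
    manager.add_seeds([Request('http://example.com/')])

    while True:
        batch = manager.get_next_requests(256)  # the backend decides the ordering
        if not batch:
            break                               # the "when to stop" decision
        for request in batch:
            response = my_fetch(request)
            manager.page_crawled(response, links=my_extract(response))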
Frontera and Scrapy
• Frontera is implemented as a custom scheduler and spider middleware for Scrapy.
• Frontera doesn't require Scrapy and can be used separately.
• Scrapy's role is process management and the fetching operation.
• And we're friends forever!
Single-threaded Frontera quickstart
• $pip install frontera
• Write a spider, or take an example one from the Frontera repo,
• Edit the spider's settings.py, changing the scheduler and adding Frontera's spider middleware (sketched below),
• $scrapy crawl [your_spider]
• Check your chosen DB's contents after the crawl.
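A sketch of those settings.py changes, with class paths as given in the Frontera docs of this period (priorities and paths may differ in your version; myproject.frontera_settings is a hypothetical module holding the Frontera-side options):

    # settings.py of your Scrapy project
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999,
    }

    # Module with the Frontera-side options (backend, max requests, etc.)
    FRONTERA_SETTINGS = 'myproject.frontera_settings'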
Distributed use cases: broad crawls
• You have a set of URLs and need to revisit them (e.g. to track changes).
• Building a search engine with content retrieval from the Web.
• All kinds of research work on the web graph: gathering link statistics, graph structure, tracking domain counts, etc.
• You have a topic and want to crawl the documents about that topic.
• More general focused-crawling tasks: e.g. searching for pages that are big hubs and change frequently over time.
Frontera architecture: distributed (diagram: Kafka topics linking spiders, strategy workers (SW) and DB workers (DB))
Main features: distributed
• The communication layer is Apache Kafka: topic partitioning, offsets mechanism.
• Crawling strategy abstraction: the crawling goal, URL ordering, and scoring model live in a separate module.
• Polite by design: each website is downloaded by at most one spider (see the partitioning sketch below).
• Python: workers and spiders.
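The politeness guarantee follows from how work is keyed to Kafka partitions. A minimal sketch of the idea, not Frontera's exact code: key every URL by its host, so a host always maps to one partition and therefore to one spider.

    from hashlib import sha1

    def partition_for(host, n_partitions):
        # Stable hash: all URLs of one host land in one partition,
        # so at most one spider ever downloads from that website.
        return int(sha1(host.encode('utf-8')).hexdigest(), 16) % n_partitions

    # With 12 spider partitions, example.com is always served by the same spider:
    print(partition_for('example.com', 12))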
Software requirements
• Apache HBase,
• Apache Kafka,
• Python 2.7+,
• Scrapy 0.24+,
• DNS service.
CDH (100% open-source Hadoop distribution)
Hardware requirements
• A single-threaded Scrapy spider gives 1200 pages/min from about 100 websites crawled in parallel.
• The spider-to-worker ratio is 4:1 (without content transfer).
• 1 GB of RAM for every SW (state cache, tunable).
• Example (reproduced in the sizing sketch below):
  • 12 spiders ~ 14.4K pages/min,
  • 3 SW and 3 DB workers,
  • 18 cores in total.
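The example is easy to reproduce with a back-of-the-envelope calculation. The helper below is a sketch, not a Frontera tool; it simply encodes the figures above (1200 pages/min per spider, the 4:1 ratio, 1 GB per SW, one core per process):

    PAGES_PER_SPIDER_MIN = 1200

    def size_cluster(n_spiders):
        n_workers = -(-n_spiders // 4)  # ceil(n_spiders / 4), the 4:1 ratio
        return {
            'pages_per_min': n_spiders * PAGES_PER_SPIDER_MIN,
            'strategy_workers': n_workers,
            'db_workers': n_workers,
            'sw_ram_gb': n_workers,              # 1 GB state cache per SW
            'cores': n_spiders + 2 * n_workers,  # one core per process
        }

    print(size_cluster(12))
    # {'pages_per_min': 14400, 'strategy_workers': 3, 'db_workers': 3,
    #  'sw_ram_gb': 3, 'cores': 18} -- matching the example above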
Hardware requirements: gotchas
• The network can be a bottleneck for internal communication. Solution: add network interfaces.
• HBase can be backed by HDDs; free RAM is great for caching the priority queue.
• Kafka throughput is the key performance factor: make sure the Kafka brokers have enough IOPS.
Quickstart for distributed Frontera
• $pip install distributed-frontera
• Prepare HBase and Kafka,
• Write a simple Scrapy spider, passing links and/or content,
• Configure Frontera workers and spiders,
• Run the workers and spiders, and feed in the seeds.
Consult http://distributed-frontera.readthedocs.org/ for more information.
Quick Spanish (.es) internet crawl
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es and docentesconeducacion.es were the biggest websites found,
• 68.7K domains found,
• 46.5M pages crawled overall,
• 1.5 months of crawling,
• 22 websites with more than 50K pages.
For more info and graphs, check the poster.
Planned features: distributed version
• Revisiting strategy,
• PageRank- or HITS-based strategy,
• Own URL and HTML parsing,
• Integration with Scrapinghub's paid services,
• Testing at larger scales.
Questions!
Thank you!
Alexander Sibiryakov,
sibiryakov@scrapinghub.com