### Scrapy Overview

Scrapy is a very popular web crawling framework written in Python. It can be used to crawl websites and extract structured data from their pages, and it is widely used in areas such as data mining, data monitoring, and automated testing. The figure below shows Scrapy's basic architecture, including its main components and the system's data processing flow (the numbered red arrows in the figure).

![](./res/scrapy-architecture.png)

#### Components

The entire data processing flow in Scrapy is controlled by the Scrapy engine, and a typical run goes through the following steps:

1. The engine asks the spider which website it wants to process and has the spider hand over the first URL that needs to be processed.

2. The engine has the scheduler put the URLs that need to be processed into a queue.

3. The engine asks the scheduler for the next page to crawl.

4. The scheduler returns the next URL to crawl to the engine, and the engine sends it through the downloader middleware to the downloader.

5. Once the downloader has finished downloading the page, the response is sent to the engine through the downloader middleware; if the download fails, the engine notifies the scheduler to record that URL so it can be downloaded again later.

6. The engine receives the response from the downloader and sends it through the spider middleware to the spider for processing.

7. The spider processes the response and returns the scraped items; it also sends any new URLs that need to be followed back to the engine.

8. The engine feeds the scraped items into the item pipeline and sends the new URLs to the scheduler to be put into the queue.

Steps 2 through 8 above keep repeating until there are no URLs left to request in the scheduler, at which point the crawler stops working.

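To make the loop above concrete, here is a minimal spider sketch. It is not part of the project built in the next section; the site `quotes.toscrape.com` and the selectors are only illustrative. It shows how a spider hands both scraped items and new URLs back to the engine in steps 7 and 8:

```Python
import scrapy


class QuoteSpider(scrapy.Spider):
    """A hypothetical spider used only to illustrate the data flow above."""
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Step 7: return scraped items to the engine, which routes
        # them into the item pipeline (step 8)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Also hand the next URL to follow back to the engine, which
        # passes it to the scheduler's queue (step 8)
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```
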
### Installing and Using Scrapy

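Scrapy can be installed with pip, and the `startproject` command generates the project skeleton. A rough sketch of the commands, assuming a virtual environment is already active and the project is named `douban`:

```Shell
(venv)$ pip install scrapy
(venv)$ scrapy startproject douban
```
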
```Shell
(venv)$ tree
.
|____scrapy.cfg
|____douban
| |____spiders
| | |______init__.py
| | |______pycache__
```

Based on the data processing flow described above, what we need to do basically comes down to the following things:

1. Define fields in the items.py file; these fields hold the scraped data and make subsequent processing easier.

```Python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    """Item holding the fields scraped for each movie."""

    name = scrapy.Field()            # movie title
    year = scrapy.Field()            # release year
    score = scrapy.Field()           # Douban rating
    director = scrapy.Field()        # director(s)
    classification = scrapy.Field()  # genre(s)
    actor = scrapy.Field()           # lead actor
```

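As a side note (not part of the generated file), an `Item` behaves much like a dictionary, which is why the spider below can fill it with `item['name'] = ...`; a tiny sketch with made-up values:

```Python
from douban.items import DoubanItem

item = DoubanItem()
item['name'] = ['肖申克的救赎 The Shawshank Redemption']  # XPath extraction returns lists
item['year'] = ['1994']
print(item['name'][0])
print(dict(item))  # an Item converts cleanly to a plain dict
# Assigning a key that was not declared as a Field raises KeyError.
```
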
2. Write your own spider in the spiders folder.

```Python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from douban.items import DoubanItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']
    rules = (
        # Follow the pagination links of the Top 250 list
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*'))),
        # Send every movie detail page to parse_item
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = DoubanItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        item['score'] = sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract()
        item['director'] = sel.xpath('//*[@id="info"]/span[1]/a/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract()
        return item
```

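The `CrawlSpider` here does its work through the two rules: the first follows the pagination links of the Top 250 list without a callback, while the second sends every movie detail page to `parse_item`. The XPath expressions can be tried out interactively before committing them to the spider; a sketch using `scrapy shell` (the subject URL is just an example, and the site may require the User-Agent configured in settings.py):

```Shell
(venv)$ scrapy shell "https://movie.douban.com/subject/1292052/"
>>> response.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
>>> response.xpath('//span[@property="v:genre"]/text()').extract()
```
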
3. Implement the persistence of the scraped data in pipelines.py.

```Python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class DoubanPipeline(object):

    def __init__(self):
        # Connect to MongoDB using the parameters configured in settings.py
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop the item if any of its declared fields is missing or empty
        for field in item.fields:
            if not item.get(field):
                raise DropItem("Missing %s of movie %s" % (field, item.get('name')))
        # Insert the movie into the MongoDB collection
        new_movie = [{
            "name": item['name'][0],
            "year": item['year'][0],
            "score": item['score'],
            "director": item['director'],
            "classification": item['classification'],
            "actor": item['actor']
        }]
        self.collection.insert(new_movie)
        log.msg("Item wrote to MongoDB database %s/%s" %
                (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                level=log.DEBUG, spider=spider)
        return item
```

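Note that `scrapy.conf` and `scrapy.log` used above are legacy interfaces that newer Scrapy releases deprecate. A rough sketch of an equivalent pipeline written against the currently recommended hooks (`from_crawler`, `open_spider`, `close_spider`), still reading the MongoDB parameters defined in settings.py (it would be registered in `ITEM_PIPELINES` in place of `DoubanPipeline`):

```Python
import pymongo
from scrapy.exceptions import DropItem


class DoubanMongoPipeline(object):
    """Sketch of the same pipeline using the newer Scrapy pipeline hooks."""

    def __init__(self, server, port, db, collection):
        self.server = server
        self.port = port
        self.db_name = db
        self.collection_name = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from the project settings
        return cls(
            crawler.settings.get('MONGODB_SERVER'),
            crawler.settings.getint('MONGODB_PORT'),
            crawler.settings.get('MONGODB_DB'),
            crawler.settings.get('MONGODB_COLLECTION'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.server, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem('Missing name in %s' % item)
        self.collection.insert_one(dict(item))
        spider.logger.debug('Item written to MongoDB')
        return item
```
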
4. Modify the settings.py file to configure the project.

```Python
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# MongoDB connection parameters used by DoubanPipeline
MONGODB_SERVER = '120.77.222.217'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = 'movie'

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'douban.middlewares.DoubanDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```

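With the four files above in place, the crawler can be launched from the project directory; the `-o` option additionally exports the scraped items to a file (the filename here is just an example):

```Shell
(venv)$ scrapy crawl movie
(venv)$ scrapy crawl movie -o movies.json
```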