### Scrapy Overview

Scrapy is a very popular web crawler framework written in Python. It can be used to crawl websites and extract structured data from their pages, and it is widely used for data mining, data monitoring, automated testing, and similar tasks. The figure below shows Scrapy's basic architecture, including its main components and the system's data processing flow (the numbered red arrows in the figure).

![](./res/scrapy-architecture.png)
#### Components

The main components are the Scrapy engine, the scheduler, the downloader, the spiders, the item pipelines, and the downloader and spider middlewares that sit between them.

The entire data processing flow is controlled by the Scrapy engine, and a typical run goes through the following steps:

1. The engine asks the spider which website it needs to crawl and has the spider hand over the first URL to be processed.

2. The engine has the scheduler put the URLs that need to be processed into a queue.

3. The engine asks the scheduler for the next page to crawl.

4. The scheduler returns the next URL to crawl to the engine, and the engine sends it through the downloader middleware to the downloader.

5. Once the downloader has finished downloading the page, the response is sent back to the engine through the downloader middleware; if the download fails, the engine notifies the scheduler to record the URL so that it can be downloaded again later.

6. The engine receives the response from the downloader and sends it through the spider middleware to the spider for processing.

7. The spider processes the response, returns the scraped data items, and also sends the new URLs that need to be followed back to the engine.

8. The engine feeds the scraped items into the item pipeline and sends the new URLs to the scheduler to be queued.

Steps 2 through 8 repeat until there are no URLs left to request in the scheduler, at which point the crawler stops.
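To make steps 7 and 8 concrete, here is a minimal spider sketch. It targets the public quotes.toscrape.com demo site rather than the project built later in this chapter, so every name in it is purely illustrative: the `parse` callback yields both scraped items and follow-up requests, and the engine routes the items to the item pipeline and the requests back to the scheduler.

```Python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider that only illustrates steps 7 and 8 of the flow above."""
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Step 7: turn the downloaded response into data items ...
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # ... and hand new URLs back to the engine, which queues them
        # with the scheduler (step 8).
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```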
### Installing and Using Scrapy

Inside a virtual environment, the douban project used in this example has the following structure:

```Shell
(venv) $ tree
.
|____scrapy.cfg
|____douban
| |____spiders
| | |______init__.py
| | |______pycache__
```
Based on the data processing flow described above, what we need to do basically comes down to the following:
1. Define fields in items.py. These fields hold the scraped data and make later processing easier.

```Python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DoubanItem(scrapy.Item):

    name = scrapy.Field()
    year = scrapy.Field()
    score = scrapy.Field()
    director = scrapy.Field()
    classification = scrapy.Field()
    actor = scrapy.Field()
```
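A `DoubanItem` behaves much like a dictionary with a fixed set of allowed keys. Purely as an illustration (the values here are made up), you could exercise it in a Python shell inside the project like this:

```Python
from douban.items import DoubanItem

item = DoubanItem(name=['Some Movie'], year=['1994'])
print(item['name'][0])   # fields are read like dictionary keys
item['score'] = ['9.0']  # only declared fields may be assigned;
                         # item['foo'] = 'bar' would raise a KeyError
```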
2. Write your own spider in the spiders folder.

```Python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from douban.items import DoubanItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']
    rules = (
        # Follow the Top 250 pagination links (no callback, just follow).
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*'))),
        # Hand every movie detail page to parse_item.
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = DoubanItem()
        # extract() and re() return lists of strings; the pipeline later
        # takes element [0] where a single value is needed.
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        item['score'] = sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract()
        item['director'] = sel.xpath('//*[@id="info"]/span[1]/a/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract()
        return item
```
3. Persist the scraped data in pipelines.py.

```Python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class DoubanPipeline(object):

    def __init__(self):
        # Connect to MongoDB using the MONGODB_* values from settings.py.
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop items that are missing any of the scraped fields.
        valid = True
        for data in item:
            if not item[data]:
                valid = False
                raise DropItem('Missing %s in movie item' % data)
        if valid:
            # Insert the data into the MongoDB collection.
            new_movie = [{
                'name': item['name'][0],
                'year': item['year'][0],
                'score': item['score'],
                'director': item['director'],
                'classification': item['classification'],
                'actor': item['actor']
            }]
            self.collection.insert(new_movie)
            log.msg('Item written to MongoDB database %s/%s' %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
```
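The pipeline above imports `scrapy.conf` and `scrapy.log`, which only exist in older Scrapy releases and have since been removed. On newer versions, the usual replacement is to receive the settings through a `from_crawler` class method and to log through `spider.logger`. The following is only a rough sketch of that approach, reusing the same `MONGODB_*` setting names; the class name is made up, and if you used it, `ITEM_PIPELINES` in the next step would point at this class instead.

```Python
import pymongo

from scrapy.exceptions import DropItem


class MongoDoubanPipeline(object):
    """Hypothetical variant of DoubanPipeline for Scrapy versions without scrapy.conf/scrapy.log."""

    def __init__(self, server, port, db_name, collection_name):
        self.server = server
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and hands over the crawler, whose settings
        # object replaces the old global scrapy.conf.settings.
        return cls(
            server=crawler.settings.get('MONGODB_SERVER'),
            port=crawler.settings.getint('MONGODB_PORT'),
            db_name=crawler.settings.get('MONGODB_DB'),
            collection_name=crawler.settings.get('MONGODB_COLLECTION'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.server, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Same validation idea as above: drop items with empty fields.
        for field in item:
            if not item[field]:
                raise DropItem('Missing %s in movie item' % field)
        self.collection.insert_one(dict(item))
        spider.logger.debug('Item written to MongoDB collection %s', self.collection_name)
        return item
```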
4. Configure the project by modifying settings.py.

```Python
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# MongoDB connection settings used by DoubanPipeline
MONGODB_SERVER = '120.77.222.217'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = 'movie'

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'douban.middlewares.DoubanDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
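Once these four files are in place, the crawler is normally started from the project directory with the `scrapy crawl movie` command. If you would rather launch it from a Python script, a minimal sketch using Scrapy's CrawlerProcess API could look like this (the run.py file name is just an example):

```Python
# run.py -- start the 'movie' spider from a plain Python script.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


if __name__ == '__main__':
    # Load the settings.py shown above so pipelines, delays, etc. apply.
    process = CrawlerProcess(get_project_settings())
    process.crawl('movie')   # the name defined on MovieSpider
    process.start()          # blocks until the crawl finishes
```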