
Commit 0f19b23

Updated the Scrapy-related documentation
1 parent 9385d43 commit 0f19b23

6 files changed with 265 additions and 4 deletions

Day66-75/05.解析动态内容.md

+12
@@ -0,0 +1,12 @@
## Parsing Dynamic Content

### JavaScript Reverse Engineering

### Using Selenium
@@ -0,0 +1,34 @@
## Form Interaction and CAPTCHA Handling

### Submitting Forms

#### Manual Submission

#### Automatic Submission

### CAPTCHA Handling

#### Loading the CAPTCHA

#### Optical Character Recognition
Optical character recognition (OCR) is a technique for extracting text from images. It is used in many industries such as public security, telecommunications, logistics, and finance, for tasks like license plate recognition, ID card scanning, and business card information extraction. In crawler development, when you run into a form protected by a text CAPTCHA, you can use OCR to handle the CAPTCHA. The Tesseract-OCR engine is an optical character recognition system originally developed by Hewlett-Packard; it is now published on GitHub and its development is sponsored by Google.

![](./res/tesseract.gif)
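
As a rough illustration, below is a minimal sketch of recognizing a text CAPTCHA with Tesseract through the pytesseract wrapper. It assumes a local Tesseract installation plus the pytesseract and Pillow packages, and uses a hypothetical `captcha.png` that has already been saved from the target form.

```Python
# Minimal sketch: recognize the text in a (hypothetical) captcha.png with Tesseract.
# Assumes a local Tesseract installation plus `pip install pytesseract pillow`.
from PIL import Image

import pytesseract


def recognize_captcha(filename='captcha.png'):
    image = Image.open(filename)
    # Convert to greyscale and binarize to reduce background noise before OCR
    image = image.convert('L').point(lambda x: 255 if x > 150 else 0)
    return pytesseract.image_to_string(image).strip()


if __name__ == '__main__':
    print(recognize_captcha())
```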

#### Improving OCR

#### Handling More Complex CAPTCHAs

#### CAPTCHA Handling Services

Day66-75/Scrapy的应用01.md

+219 −4
@@ -2,9 +2,9 @@

### Scrapy Overview

- Scrapy is a very popular web crawler framework developed in Python. It can be used to crawl websites and extract structured data from their pages, and it is widely used in fields such as data mining, data monitoring, and automated testing. The figure below shows the basic architecture of Scrapy, including its main components and the system's data processing flow (the green arrows in the figure).
+ Scrapy is a very popular web crawler framework developed in Python. It can be used to crawl websites and extract structured data from their pages, and it is widely used in fields such as data mining, data monitoring, and automated testing. The figure below shows the basic architecture of Scrapy, including its main components and the system's data processing flow (the numbered red arrows in the figure).

- ![](./res/scrapy-architecture.jpg)
+ ![](./res/scrapy-architecture.png)

#### Components

@@ -20,14 +20,22 @@

The entire data processing flow in Scrapy is controlled by the Scrapy engine, and a typical run includes the following steps:

1. The engine asks the spider which website it wants to process and has the spider hand over the first URL that needs to be crawled.
2. The engine has the scheduler put the URLs that need to be processed into a queue.
3. The engine asks the scheduler for the next page to crawl.
4. The scheduler returns the next URL to crawl to the engine, and the engine sends it through the downloader middleware to the downloader.
5. Once the downloader has finished downloading the page, the response is sent through the downloader middleware to the engine; if the download fails, the engine notifies the scheduler to record the URL so it can be downloaded again later.
6. The engine receives the response from the downloader and sends it through the spider middleware to the spider for processing.
7. The spider processes the response and returns the scraped data items; it also sends the new URLs that need to be followed back to the engine.
8. The engine feeds the scraped items into the item pipeline and sends the new URLs to the scheduler to be queued.

- 9. The above operations keep repeating until there are no more URLs to request in the scheduler, and the crawler stops working.
+ Steps 2 through 8 keep repeating until there are no more URLs to request in the scheduler, at which point the crawler stops working.
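
To make this flow concrete, here is a minimal sketch (not part of the douban project built later in this chapter) of a basic spider whose callback yields both data items and follow-up requests: the engine routes the yielded items to the item pipelines (step 8) and hands the yielded requests back to the scheduler. It assumes the public practice site quotes.toscrape.com.

```Python
# Minimal sketch of a spider driving the flow described above (assumes the
# public practice site quotes.toscrape.com; not part of the douban project).
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Step 7: hand scraped data items back to the engine
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Steps 7-8: new URLs go back to the engine and are queued by the scheduler
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```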
### Installing and Using Scrapy

@@ -45,7 +53,7 @@
(venv) $ tree
.
|____ scrapy.cfg
-|____ qianmu
+|____ douban
|    |____ spiders
|    |    |____ __init__.py
|    |    |____ __pycache__
@@ -68,9 +76,216 @@

Based on the data processing flow just described, what we need to do basically comes down to the following things:

1. Define fields in the items.py file; these fields are used to hold the scraped data and make subsequent operations easier.
```Python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):

    name = scrapy.Field()
    year = scrapy.Field()
    score = scrapy.Field()
    director = scrapy.Field()
    classification = scrapy.Field()
    actor = scrapy.Field()
```

2. Write your own spider in the spiders folder.
```Python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from douban.items import DoubanItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']
    rules = (
        # Follow the pagination links of the Top 250 list
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*'))),
        # Extract data from each movie detail page with parse_item
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = DoubanItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        item['score'] = sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract()
        item['director'] = sel.xpath('//*[@id="info"]/span[1]/a/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract()
        return item
```

3. Implement the operations that persist the data in pipelines.py.
```Python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class DoubanPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop items that are missing any of the expected fields
        valid = True
        for field in item:
            if not item.get(field):
                valid = False
                raise DropItem("Missing %s in the scraped movie item" % field)
        if valid:
            # Insert the data into the MongoDB collection
            new_movie = [{
                "name": item['name'][0],
                "year": item['year'][0],
                "score": item['score'],
                "director": item['director'],
                "classification": item['classification'],
                "actor": item['actor']
            }]
            self.collection.insert(new_movie)
            log.msg("Item written to MongoDB database %s/%s" %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
```

4. Modify the settings.py file to configure the project.
```Python
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

MONGODB_SERVER = '120.77.222.217'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = 'movie'

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'douban.middlewares.DoubanDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
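
With these four pieces in place, the crawler can be launched from the project root with Scrapy's command-line tool; for example (the movies.json output file name here is only an illustration of the optional feed export):

```Shell
(venv) $ scrapy crawl movie -o movies.json
```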

Day66-75/res/scrapy-architecture.jpg

-54.5 KB
Binary file not shown.

Day66-75/res/scrapy-architecture.png

52.7 KB

Day66-75/res/tesseract.gif

791 KB
