Commit bc69170 (parent e10192a)

Use Scrapy to crawl data from the WeChat Mini Program forum and save it to a JSON file

File tree

20 files changed, +726 −0 lines

.DS_Store (0 bytes, binary file not shown)

README.md

Lines changed: 2 additions & 0 deletions

@@ -29,3 +29,5 @@
 
 ## Scrapy framework spiders
 * [Crawl Qiushibaike jokes and save them to a JSON file](./scrapy/qsbk/readme.MD)
+
+* [Crawl data from the WeChat Mini Program forum](./scrapy/weixin_community/readme.MD)
scrapy/.DS_Store (0 bytes, binary file not shown)

scrapy/weixin_community/.DS_Store (6 KB, binary file not shown)

scrapy/weixin_community/.idea/misc.xml (7 additions, 0 deletions; generated file, not rendered)

scrapy/weixin_community/.idea/modules.xml (8 additions, 0 deletions; generated file, not rendered)

scrapy/weixin_community/.idea/vcs.xml (6 additions, 0 deletions; generated file, not rendered)

scrapy/weixin_community/.idea/weixin_community.iml (12 additions, 0 deletions; generated file, not rendered)

scrapy/weixin_community/.idea/workspace.xml (353 additions, 0 deletions; generated file, not rendered)

scrapy/weixin_community/readme.MD

Lines changed: 46 additions & 0 deletions

# Crawling the WeChat Mini Program forum with `CrawlSpider`

1. Create a project

    ```
    scrapy startproject weixin_community
    ```

2. Create a spider

    ```
    # Enter the project directory first
    cd weixin_community

    # Generate a CrawlSpider
    scrapy genspider -t crawl wx_spider "wxapp-union.com"
    ```

3. Open the project in `PyCharm`

4. Configure the `settings.py` file

    ```
    ROBOTSTXT_OBEY = False

    DOWNLOAD_DELAY = 1

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }

    ITEM_PIPELINES = {
        'weixin_community.pipelines.WeixinCommunityPipeline': 300,
    }
    ```
5. Write the spider
6. Write the data model (item)
7. Write the `Pipeline`
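The class name `WeixinCommunityPipeline` comes from the `ITEM_PIPELINES` setting above; one straightforward way to save items to a JSON file is to write one JSON object per line. The output filename is an assumption:

```python
# pipelines.py -- a sketch; the output filename 'wx_community.json' is assumed
import json


class WeixinCommunityPipeline:
    def open_spider(self, spider):
        # One JSON object per line keeps memory usage flat while crawling
        self.fp = open('wx_community.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
```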
8. Run the spider and check the output
