Saving images to local disk with a Scrapy crawler
1. Requirements
Crawl images from a given site, save them to local disk, and automatically follow pagination.
2. Initialize the project
# Create the project
scrapy startproject zb
# Generate the spider
# "zb_spider" is the project name with "_spider" appended
# "mypicture.ipojy.net" is the domain the spider is allowed to crawl
scrapy genspider zb_spider mypicture.ipojy.net
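After these two commands the project should have the standard Scrapy layout, roughly (file names are the defaults generated by startproject/genspider):
zb/
├── scrapy.cfg
└── zb/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── zb_spider.py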
3. Configure settings.py
- Open zb/zb/settings.py and add a USER_AGENT:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
- Open zb/zb/settings.py and add the image settings (note the import of os at the top of the file):
import os

# imgurl is the item field, defined in items.py, that holds the scraped image URLs
IMAGES_URLS_FIELD = "imgurl"
# Configure the local storage path
# Get the absolute path of the crawler project
project_dir = os.path.abspath(os.path.dirname(__file__))
# Build the image storage path from it
IMAGES_STORE = os.path.join(project_dir, 'images')
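Scrapy's ImagesPipeline relies on Pillow for image processing and will not start without it, so install it alongside Scrapy:
pip install Pillow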
- Set ROBOTSTXT_OBEY to False; if the site's robots.txt disallows crawling, Scrapy would otherwise filter the requests and the parse callback would never run:
ROBOTSTXT_OBEY = False
- Open zb/zb/settings.py and enable the image pipelines (the module path must match the project name, zb):
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 5,
    'zb.pipelines.ZbPipeline': 300,
}
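The 'zb.pipelines.ZbPipeline' entry refers to the item pipeline that startproject generates in zb/pipelines.py; by default it simply passes items through, roughly:
class ZbPipeline:
    def process_item(self, item, spider):
        # Pass the item through unchanged; the actual downloading is done by ImagesPipeline
        return item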
4. Build items.py
import scrapy

class ZbItem(scrapy.Item):
    ids = scrapy.Field()
    imgurl = scrapy.Field()
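By default ImagesPipeline stores files under IMAGES_STORE/full/ with the SHA-1 hash of the image URL as the file name. If you would rather name files after the ids field, a minimal sketch (assuming Scrapy 2.4+, which passes the item into file_path; ZbImagesPipeline is a hypothetical name) would be:
import os
from urllib.parse import urlparse
from scrapy.pipelines.images import ImagesPipeline

class ZbImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Hypothetical naming scheme: the item's ids field plus the URL's file extension
        ext = os.path.splitext(urlparse(request.url).path)[1] or '.jpg'
        return f"full/{item['ids']}{ext}"
To use it, point the ITEM_PIPELINES entry at 'zb.pipelines.ZbImagesPipeline' instead of 'scrapy.pipelines.images.ImagesPipeline'.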
5. Build the crawl logic in zb_spider.py
import scrapy

from zb.items import ZbItem


class ZbSpiderSpider(scrapy.Spider):
    name = 'zb_spider'
    # allowed_domains = ['mypicture.ipojy.net']
    start_urls = ['http://mypicture.ipojy.net/?id=*&sharemd5=*&tid=0&eid=0&page=1']

    def parse(self, response):
        print("#" * 60)
        movie_list = response.xpath("//td[@style='border-right:1px solid #BBDDE5;']")
        for i_item in movie_list:
            print("*" * 60)
            zb_item = ZbItem()
            # Sequence number
            ids = i_item.xpath(".//font[@style='color:red']/strong/text()").extract_first()
            zb_item['ids'] = ids
            # Image URL (ImagesPipeline expects a list of URLs)
            zb_item['imgurl'] = [i_item.xpath(".//img/@src").extract_first()]
            yield zb_item
        # Follow the next-page link
        next_link = response.xpath("//span[@id='page-link']/a[3]/@href").extract_first()
        if next_link:
            print("Next page:", next_link)
            # urljoin handles both absolute and relative hrefs
            yield scrapy.Request(response.urljoin(next_link), callback=self.parse)
6. Run the project
- From the directory that contains scrapy.cfg, run:
scrapy crawl zb_spider
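Since IMAGES_STORE is built from the directory of settings.py, the downloaded images should end up under zb/zb/images/full/ (with the default hash-based file names unless the pipeline is customized as sketched above):
zb/zb/images/
└── full/
    └── <sha1-of-image-url>.jpg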