Jumei Youpin pages won't open: paginated scraping of Jumei Youpin with CrawlSpider (MongoDB)

After working through the previous case study (CrawlSpider: scraping the 17k novel site's list, detail and chapter pages together (CrawlSpider pagination, MongoDB) - CSDN blog), this one is much simpler. It is also straightforward in the video tutorial; after all, it's a small hands-on demo for getting started with CrawlSpider. The video tutorial really is thoughtfully made.

The Lancôme brand link on Jumei Youpin won't open; it returns a 404. Did we crawl it to death? 😄 ...

So I picked the Estée Lauder brand instead, and you have to navigate to another page before the dropdown menu is even available. Look how badly Jumei Youpin has been hammered: they don't even dare put the dropdown on the homepage anymore~~~~

Enough chatter, I'm swamped... on to the code.

app.py

from typing import Iterable

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import jumei_product


class AppSpider(CrawlSpider):
    name = "app"
    start_urls = [
        "http://search.jumei.com/?filter=0-11-1&search=%E9%9B%85%E8%AF%97%E5%85%B0%E9%BB%9B&bid=4&site=sh"
    ]

    rules = (
        Rule(
            LinkExtractor(
                allow=r"http://item.jumeiglobal.com/(.+).html",
                restrict_xpaths='//div[@class="s_l_pic"]/a',
            ),
            callback="parse_detail",
            follow=False,
            process_links="process_detail",
        ),
    )

    def start_requests(self) -> Iterable[Request]:
        # Pages 1 to 3: the page number is encoded in the "filter" parameter
        max_page = 4
        for i in range(1, max_page):
            url = (
                "http://search.jumei.com/?filter=0-11-" + str(i)
                + "&search=%E9%9B%85%E8%AF%97%E5%85%B0%E9%BB%9B&bid=4&site=sh"
            )
            yield Request(url)

    def process_detail(self, links):
        # List page: keep only the first 5 product links per page
        for index, link in enumerate(links):
            if index < 5:
                yield link
            else:
                return

    def parse_detail(self, response):
        # Product detail fields
        title = response.xpath('//div[@class="deal_con_content"]//tr[1]/td[2]/span/text()').get()
        category = response.xpath('//div[@class="deal_con_content"]//tr[4]/td[2]/span/text()').get()
        address = response.xpath('//div[@class="deal_con_content"]//tr[6]/td[2]/span/text()').get()
        expired = response.xpath('//div[@class="deal_con_content"]//tr[8]/td[2]/span/text()').get()

        item = jumei_product()
        item["title"] = title
        item["category"] = category
        item["address"] = address
        item["expired"] = expired
        yield item

Each list page contributes 5 products, and the spider loops over 3 list pages, so the run yields 15 items in total.
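As a side note on process_detail: Scrapy hands process_links the full list of extracted Link objects and simply iterates over whatever it returns, so an equivalent and arguably simpler version is a plain slice. A minimal sketch:

def process_detail(self, links):
    # links is the full list extracted by the LinkExtractor;
    # returning a slice keeps the first 5 per list page
    return links[:5]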

items.py

import scrapy


class jumei_product(scrapy.Item):
    title = scrapy.Field()
    category = scrapy.Field()
    address = scrapy.Field()
    expired = scrapy.Field()

The item class above serves as the database entity. Next, the pipeline that writes to MongoDB.

pipelines.py

import pymongo


class Scrapy02Pipeline:
    def __init__(self):
        print("-" * 10, "start", "-" * 10)
        self.res = None
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["jumei"]
        self.collection = self.db["landai"]
        # Clear the collection so each run starts fresh
        self.collection.delete_many({})

    def process_item(self, item, spider):
        self.res = self.collection.insert_one(dict(item))
        # print(self.res.inserted_id)
        return item

    def __del__(self):
        print("-" * 10, "end", "-" * 10)
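One step the listings above don't show: the pipeline only runs if it is enabled in settings.py. A minimal sketch, where the module path "scrapy02" is an assumption inferred from the class name Scrapy02Pipeline; adjust it to your project's actual name:

# settings.py -- enable the MongoDB pipeline
# "scrapy02" is an assumed project module name
ITEM_PIPELINES = {
    "scrapy02.pipelines.Scrapy02Pipeline": 300,
}

With that in place, scrapy crawl app runs the spider end to end.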

Doesn't it feel like anyone with a pair of hands could pull this off? 😄
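If you want to check the result for yourself, here is a quick pymongo query against the database and collection used above (names taken from the pipeline; the expected count of 15 assumes all 3 pages and 5 products per page were reachable):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["jumei"]["landai"]
print(collection.count_documents({}))  # expect 15 if every page loaded
for doc in collection.find().limit(3):
    print(doc["title"], doc["category"])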

Learning never ends, though. Further along, it shouldn't just be "anyone with hands can do it"; ideally it runs itself, no hands at all...
