How to scrape Honor of Kings (王者荣耀) hero skin images with Python and save each image in its hero's own folder (source code included)

Here is the code in action: each skin image is saved in the folder of the hero it belongs to.

Pretty, aren't they?
Let's start scraping.

Development environment

Windows 10
Python 3.6

Development tools

PyCharm
webdriver (from selenium)

requests, os, re, lxml, jsonpath

Open the official Honor of Kings website (pvp.qq.com) and click 游戏资料 (Game Info).


Check whether the hero list is loaded synchronously or asynchronously. Here it turns out to be loaded asynchronously.


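Switch to the XHR filter in the browser's network panel and keep capturing requests. In the JSON response, ename is the hero's ID and cname is the hero's name.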


Extract both fields with jsonpath:

    # First request: get hero_ids and hero_names
    response = requests.get(start_url, headers=headers).json()
    # pprint(response)
    hero_ids = jsonpath.jsonpath(response, '$..ename')
    # pprint(hero_ids)
    hero_names = jsonpath.jsonpath(response, '$..cname')
    # pprint(hero_names)
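
If jsonpath is not installed, the same two lists can be built with plain Python. This is only a sketch: it assumes the JSON is a flat list of hero objects that each carry ename and cname, and the sample records and the helper name extract_ids_and_names are made up for illustration. Note that '$..ename' searches at every depth, so the two approaches agree only when the list really is flat.

```python
import json

# Hypothetical sample shaped like the herolist.json response (illustrative values)
sample_json = '''
[
    {"ename": 105, "cname": "廉颇"},
    {"ename": 106, "cname": "小乔"}
]
'''


def extract_ids_and_names(heroes):
    """Return (hero_ids, hero_names) from a flat list of hero dicts."""
    hero_ids = [hero["ename"] for hero in heroes]
    hero_names = [hero["cname"] for hero in heroes]
    return hero_ids, hero_names


hero_ids, hero_names = extract_ids_and_names(json.loads(sample_json))
for hero_id, hero_name in zip(hero_ids, hero_names):
    print(hero_id, hero_name)
```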

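Build the detail-page URL for each hero:

hero_info_url = r'https://pvp.qq.com/web201605/herodetail/{}.shtml'.format(hero_id)

Have the driver visit each hero URL, grab the page source, parse it with etree, and extract hero_skin_names and hero_skin_urls with XPath: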

    for hero_name, hero_id in zip(hero_names, hero_ids):
        hero_info_url = r'https://pvp.qq.com/web201605/herodetail/{}.shtml'.format(hero_id)
        # Load the hero detail page
        driver.get(hero_info_url)
        # Get the page source
        hero_info_content = driver.page_source
        # Parse the page
        hero_info_content_str = etree.HTML(hero_info_content)
        # Extract hero_skin_names and hero_skin_urls
        hero_skin_names = hero_info_content_str.xpath(
            r'//ul[@class="pic-pf-list pic-pf-list3"]/@data-imgname')[0].split('|')

        hero_skin_urls = hero_info_content_str.xpath(
            r'//ul[@class="pic-pf-list pic-pf-list3"]//img/@data-imgname')
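
The data-imgname attribute on the ul holds all skin names in one string separated by '|', and the full script removes an '&' plus digit suffix from each name before using it as a file name. A minimal sketch of that cleanup, assuming the suffix is always '&' followed by digits at the end of the name (a slightly stricter pattern than the script's own); the sample string and the helper name clean_skin_names are illustrative and not taken from the live page:

```python
import re


def clean_skin_names(raw_attr):
    """Split a 'name&0|name&1' style string and drop the '&<digits>' suffix."""
    names = []
    for part in raw_attr.split('|'):
        # Remove a trailing '&' plus digits, if present
        names.append(re.sub(r'&\d+$', '', part).strip())
    return names


# Hypothetical attribute value in the same shape as the page's data-imgname
sample_attr = 'Skin A&0|Skin B&1|Skin C&12'
print(clean_skin_names(sample_attr))
```

For the sample string this prints ['Skin A', 'Skin B', 'Skin C']. A name without a suffix is returned unchanged.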

Loop over hero_skin_urls, fetch each image's binary data, create the hero's folder if it does not exist yet, and save each skin image in its hero's folder:

            # Complete the hero_skin_url address by adding the scheme
            hero_skin_url = r'https:' + hero_skin_url
            # Fetch the image's binary data
            img_content = requests.get(hero_skin_url, headers=headers).content
            try:
                # Create the hero's folder
                if not os.path.exists('./{}'.format(hero_name)):
                    os.mkdir(r'./{}'.format(hero_name))
                with open(r'./{}/{}.jpg'.format(hero_name, hero_skin_name), 'wb') as f:
                    f.write(img_content)
                    print('Downloading image: {}/{}.jpg'.format(hero_name, hero_skin_name))
            except Exception as e:
                continue
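
The folder check and the write can also be wrapped in a small helper. This is a sketch rather than the article's code: save_image, its parameters, and the character filter are assumptions. It uses os.makedirs with exist_ok=True so that an existing folder is not an error, and it replaces characters that Windows does not allow in file names.

```python
import os
import re
import tempfile


def save_image(base_dir, hero_name, skin_name, img_content):
    """Write img_content to <base_dir>/<hero_name>/<skin_name>.jpg and return the path."""
    # Replace characters that are invalid in Windows file names
    safe_hero = re.sub(r'[\\/:*?"<>|]', '_', hero_name)
    safe_skin = re.sub(r'[\\/:*?"<>|]', '_', skin_name)
    hero_dir = os.path.join(base_dir, safe_hero)
    # exist_ok=True makes this a no-op when the folder already exists
    os.makedirs(hero_dir, exist_ok=True)
    path = os.path.join(hero_dir, safe_skin + '.jpg')
    with open(path, 'wb') as f:
        f.write(img_content)
    return path


# Usage with placeholder bytes instead of a real download
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_image(tmp_dir, 'hero', 'skin: one', b'fake image bytes')
    print(os.path.relpath(saved, tmp_dir))
```

In the scraper, img_content would be the bytes returned by requests.get(hero_skin_url, headers=headers).content.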

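I originally fetched the hero pages' source with the requests library, but that raised an error caused by an encoding problem. So I switched to webdriver: it visits each hero page, driver.page_source returns the source, and the data is extracted from that.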
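
The article does not show the exact error, so the following is only a guess at how the requests route might be kept: decode the raw response bytes explicitly instead of relying on the detected encoding. It assumes the page is served in GBK, which should be checked against the page's own charset declaration, and decode_html is a hypothetical helper. Any elements that the page fills in with JavaScript would still be missing from the raw HTML, in which case webdriver remains the simpler choice.

```python
def decode_html(raw_bytes, encodings=('utf-8', 'gbk')):
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding, replacing undecodable bytes
    return raw_bytes.decode(encodings[0], errors='replace')


# Simulated response body: GBK-encoded bytes, as response.content would hold them
raw = '<title>王者荣耀</title>'.encode('gbk')
print(decode_html(raw))
```

With requests, the input would be requests.get(hero_info_url, headers=headers).content rather than the simulated bytes.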

Full source code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests, os, jsonpath, re
from selenium import webdriver
from pprint import pprint
from lxml import etree


def main():
    start_url = r'https://pvp.qq.com/web201605/js/herolist.json'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/87.0.4280.88 Safari/537.36',
        'Referer': 'https://pvp.qq.com/web201605/herolist.shtml'
    }

    # executable_path works with Selenium 3; Selenium 4 passes the driver path through a Service object
    driver = webdriver.Chrome(executable_path=r'D:\python\chromedriver.exe')

    # First request: get hero_ids and hero_names
    response = requests.get(start_url, headers=headers).json()
    # pprint(response)
    hero_ids = jsonpath.jsonpath(response, '$..ename')
    # pprint(hero_ids)
    hero_names = jsonpath.jsonpath(response, '$..cname')
    # pprint(hero_names)

    for hero_name, hero_id in zip(hero_names, hero_ids):
        hero_info_url = r'https://pvp.qq.com/web201605/herodetail/{}.shtml'.format(hero_id)

        # Load the hero detail page
        driver.get(hero_info_url)
        # Get the page source
        hero_info_content = driver.page_source
        # Parse with lxml
        hero_info_content_str = etree.HTML(hero_info_content)

        # Extract hero_skin_names and hero_skin_urls
        hero_skin_names = hero_info_content_str.xpath(
            r'//ul[@class="pic-pf-list pic-pf-list3"]/@data-imgname')[0].split('|')
        hero_skin_urls = hero_info_content_str.xpath(
            r'//ul[@class="pic-pf-list pic-pf-list3"]//img/@data-imgname')

        for hero_skin_name, hero_skin_url in zip(hero_skin_names, hero_skin_urls):
            # Strip the '&<digit>' suffix from hero_skin_name;
            # re.sub leaves the name unchanged when there is no suffix
            hero_skin_name = re.sub(r'&\d.?', '', hero_skin_name)
            # Complete the hero_skin_url address by adding the scheme
            hero_skin_url = r'https:' + hero_skin_url
            # Fetch the image's binary data
            img_content = requests.get(hero_skin_url, headers=headers).content
            try:
                # Create the hero's folder
                if not os.path.exists('./{}'.format(hero_name)):
                    os.mkdir(r'./{}'.format(hero_name))
                with open(r'./{}/{}.jpg'.format(hero_name, hero_skin_name), 'wb') as f:
                    f.write(img_content)
                    print('Downloading image: {}/{}.jpg'.format(hero_name, hero_skin_name))
            except Exception as e:
                # Skip this skin if the folder or file cannot be written
                continue

    # Close the browser once all heroes are processed
    driver.quit()


if __name__ == '__main__':
    main()