A small tip for getting around anti-scraping | [WinError 10054] An existing connection was forcibly closed by the remote host.

It hardly counts as a trick, but like calling open without close, or new[] without delete[], it is something we tend to overlook.

If the following errors show up while scraping a site, the site may be fighting back: it has recognized the crawler and is closing our requests.

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='d1.weather.com.cn', port=80): Read timed out. (read timeout=None)

One way to mitigate this is to close the response promptly after each request. In my tests it helped.

your_response_name.close()
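
An explicit close() is skipped if read() raises first, so it may help to let a context manager do the closing. The following is only a sketch, not the method tested in this post: the function name fetch_bytes, the opener parameter (there so the function can be tried without a network) and the 10-second timeout are all assumptions made for illustration.

```python
from contextlib import closing
from urllib import request


def fetch_bytes(url, opener=request.urlopen, timeout=10):
    """Read the whole response body, then close the response.

    `opener` and `timeout` are hypothetical parameters for this sketch;
    by default the standard urllib.request.urlopen is used.
    """
    # closing() calls resp.close() when the with-block exits,
    # on the normal path and also when read() raises.
    with closing(opener(url, timeout=timeout)) as resp:
        # read() runs inside the block, i.e. before the close.
        return resp.read()
```

Passing a timeout also avoids waiting forever on a stalled read, which is what "read timeout=None" in the second error above allows.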

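Note that close() must come after read(), otherwise there is nothing left to read. The code below wraps the fetch steps in a function. If the text comes out garbled, the encoding may need to be changed to gbk, so there is also a version that takes the encoding as a parameter. Use whichever fits.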

from bs4 import BeautifulSoup as bs
from urllib import request
from fake_useragent import UserAgent
def getBS(url):
    # Send a browser-like User-Agent header with the request
    header = {'User-Agent': UserAgent().chrome}
    req = request.Request(url, headers=header)
    resp = request.urlopen(req)
    html_ = resp.read().decode(encoding='utf-8')
    resp.close()  # Key line: close only after reading resp; must not be swapped with the line above
    soup = bs(html_, 'html.parser')
    return soup

from bs4 import BeautifulSoup as bs
from urllib import request
from fake_useragent import UserAgent
def getBS(url, encoding='utf-8'):  # defaults to utf-8
    # Send a browser-like User-Agent header with the request
    header = {'User-Agent': UserAgent().chrome}
    req = request.Request(url, headers=header)
    resp = request.urlopen(req)
    html_ = resp.read().decode(encoding=encoding)
    resp.close()  # Key line: close only after reading resp; must not be swapped with the read() line above
    soup = bs(html_, 'html.parser')
    return soup
# Usage example
url = 'http://www.weather.com.cn'
url2 = 'https://book.tiexue.net/mil.htm'  # must be decoded as gbk, otherwise decoding raises an error
soup = getBS(url)
soup2 = getBS(url2, 'gbk')
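
When it is not known in advance whether a page is utf-8 or gbk, the encodings can be tried in turn instead of being passed in by hand. This is a rough sketch under that assumption. The helper name decode_html and its encodings parameter are made up for illustration and are not part of the code above.

```python
def decode_html(raw, encodings=("utf-8", "gbk")):
    """Return the first successful decode of `raw` (bytes).

    Hypothetical helper: tries utf-8 first, then gbk.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one.
            continue
    # Nothing decoded cleanly: fall back to the first encoding and
    # substitute replacement characters rather than raising.
    return raw.decode(encodings[0], errors="replace")
```

Inside getBS this would stand in for the decode step, e.g. html_ = decode_html(resp.read()). The order matters: utf-8 goes first because gbk will often decode utf-8 bytes without an error but produce garbled text.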

Original link: https://blog.csdn.net/weixin_45781186/article/details/121578071