Errors I hit when using proxy IPs with Scrapy!

While crawling a site with Scrapy through a proxy, I ran into a few errors that I'd like to share.

First error:

Connection to the other side was lost in a non-clean fashion: Connection lost.

When I searched for this message, the suggested fix was to add a user-agent in settings.py.
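For reference, that suggestion amounts to something like the fragment below. USER_AGENT is a standard Scrapy setting; the browser string is only an example value picked for illustration, not one taken from this project.

```python
# settings.py (fragment)
# USER_AGENT is a built-in Scrapy setting. The value below is just an
# example browser string chosen for illustration.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/76.0.3809.132 Safari/537.36"
)
```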

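But bugs come in all shapes and sizes. Back to the point: I was using a proxy, so if the request headers might be at fault, the next step was to look at the code that prepares the request.

There I found this: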

if 'proxy' not in request.meta or self.current_proxy.is_expiring:
    print(request.meta)
    self.update_proxy()
    request.meta['proxy'] = self.current_proxy.proxy
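To show where a snippet like this usually lives, here is a minimal sketch of a downloader middleware built around it. The class name IPProxyDownloadMiddleware comes from the traceback further down; ProxyModel, fetch_proxy, the 60-second lifetime and the 5-second expiry margin are all assumptions made up for illustration, and the real project may look quite different.

```python
import time


class ProxyModel:
    """Hypothetical stand-in for the object stored in self.current_proxy."""

    def __init__(self, proxy, ttl_seconds=60):
        self.proxy = proxy                         # e.g. "http://10.0.0.1:8000"
        self.expire_at = time.time() + ttl_seconds
        self.blacked = False                       # True once the site blocks it

    @property
    def is_expiring(self):
        # Assumed rule: expiring when fewer than 5 seconds of lifetime remain.
        return self.expire_at - time.time() < 5


class IPProxyDownloadMiddleware:
    def __init__(self, fetch_proxy):
        # fetch_proxy is an assumed callable returning a fresh ProxyModel,
        # for example by calling a proxy vendor's API.
        self.fetch_proxy = fetch_proxy
        self.current_proxy = fetch_proxy()

    def update_proxy(self):
        self.current_proxy = self.fetch_proxy()

    def process_request(self, request, spider):
        if 'proxy' not in request.meta or self.current_proxy.is_expiring:
            self.update_proxy()
            request.meta['proxy'] = self.current_proxy.proxy
        # Returning None lets Scrapy carry on with the normal download.
        return None
```

In a real project fetch_proxy would be replaced by whatever call hands out proxies, and the middleware would be enabled through DOWNLOADER_MIDDLEWARES in settings.py.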

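Something looked fishy here. Sure enough, debugging with print showed that execution got stuck at this point, and then, well...

OK, a quick word about request.meta:

meta is a dictionary whose main job is to carry data along with a request, between callbacks and middlewares. It does not hold the HTTP headers: User-Agent and Cookie live in request.headers. (A dictionary with keys such as REMOTE_ADDR and HTTP_USER_AGENT is Django's request.META, which is a different thing.) For proxies, the key that matters is 'proxy', which Scrapy's built-in HttpProxyMiddleware reads. At the time I believed the error came from meta having no proxy entry, so I changed the assignment to: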

request.meta['REMOTE_ADDR'] = self.current_proxy.proxy

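and the program started running again. Keep in mind that Scrapy does not recognize a 'REMOTE_ADDR' key in meta, so with this change the request most likely goes out without the proxy at all, which would explain why the connection error went away.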
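To make the role of the 'proxy' key concrete, here is a deliberately simplified model of how a request's proxy is chosen from meta. It is not Scrapy's actual implementation, effective_proxy is a made-up helper name, and proxies configured through environment variables are ignored; the only fact it relies on is that the built-in HttpProxyMiddleware looks at request.meta['proxy'].

```python
def effective_proxy(meta):
    """Return the proxy a request would use, or None for a direct connection.

    Simplified model: only the 'proxy' key is consulted; every other key
    in meta is ignored when deciding how to route the request.
    """
    return meta.get('proxy')


with_proxy = {'proxy': 'http://10.0.0.1:8000'}
with_remote_addr = {'REMOTE_ADDR': 'http://10.0.0.1:8000'}

print(effective_proxy(with_proxy))        # http://10.0.0.1:8000
print(effective_proxy(with_remote_addr))  # None
```

Under this model, writing the proxy address to any key other than 'proxy' has no effect on routing, so a request that "works" after such a change is probably a direct connection.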

Second error:

Traceback (most recent call last):
  File "g:\python_learn\python_setup\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "g:\python_learn\python_setup\lib\site-packages\scrapy\core\downloader\middleware.py", line 56, in process_response
    (six.get_method_self(method).__class__.__name__, type(response)))
scrapy.exceptions._InvalidOutput: Middleware IPProxyDownloadMiddleware.process_response must return Response or Request, got <class 'NoneType'>
2019-09-10 11:42:58 [scrapy.core.scraper] ERROR: Error downloading <GET https://xt.meituan.com/meishi/pn1/>
Traceback (most recent call last):
  File "g:\python_learn\python_setup\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "g:\python_learn\python_setup\lib\site-packages\scrapy\core\downloader\middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "g:\python_learn\python_setup\lib\site-packages\twisted\internet\defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://xt.meituan.com/meishi/pn1/>

This one really was a silly mistake on my part.

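def process_response(self, request, response, spider):
    print(response.status)
    # Treat non-200 responses and captcha redirects as a blocked proxy.
    if response.status != 200 or 'captcha' in response.url:
        print(response.status)
        if not self.current_proxy.blacked:
            self.current_proxy.blacked = True
            # Report the failed proxy before it is replaced.
            print('Proxy %s is no longer usable' % self.current_proxy.proxy)
            self.update_proxy()
            request.meta['proxy'] = self.current_proxy.proxy
            print(request)
        # Returning the request tells Scrapy to schedule it again.
        return request
    # Every other path must return the response, never None.
    return response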

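The cause was that my original return was in the wrong place, so process_response fell through and returned None, which is exactly what the traceback complains about. I need to be more careful next time. The code above is the corrected version; its main purpose is to check whether the current proxy IP can actually crawl this site.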
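The original broken version is not shown, so the "buggy" function below is only a guess at one way a misplaced return produces this error. FakeResponse is a stand-in invented for this sketch so that it runs without Scrapy; the point is simply that every path through process_response has to return either the response or a request.

```python
class FakeResponse:
    """Stand-in for scrapy.http.Response with only the fields used here."""

    def __init__(self, status, url):
        self.status = status
        self.url = url


def process_response_buggy(request, response):
    if response.status != 200 or 'captcha' in response.url:
        return request
        return response  # unreachable: indented one level too deep
    # A normal 200 response falls off the end here, so the result is None.


def process_response_fixed(request, response):
    if response.status != 200 or 'captcha' in response.url:
        return request   # ask for the request to be scheduled again
    return response      # pass good responses on unchanged


ok = FakeResponse(200, 'https://xt.meituan.com/meishi/pn1/')
blocked = FakeResponse(403, 'https://xt.meituan.com/meishi/pn1/')
request = object()  # placeholder; a real middleware receives a scrapy.Request

print(process_response_buggy(request, ok))                  # None
print(process_response_fixed(request, ok) is ok)            # True
print(process_response_fixed(request, blocked) is request)  # True
```

One more thing worth checking in a real spider: a request returned from process_response goes back through the scheduler, so it may need dont_filter=True to avoid being dropped by the duplicate filter. Scrapy's own RetryMiddleware sets that flag on the requests it retries.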

Third error:

pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')

This error showed up when inserting strings through pymysql.

That is, with this statement:

insert into meishi(name,phone,address) values(%s,%s,%s)

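After changing it to the following, the strings could be inserted. Be aware that building SQL by concatenation breaks as soon as a value contains a quote and is open to SQL injection; the usual approach is to keep the %s placeholders and pass the values as the second argument to cursor.execute.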

insert into meishi(name,phone,address) values('"+name[x]+"','"+phone[x]+"','"+address[x]+"')
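As a safer alternative to concatenation, here is a small sketch of a parameterized insert. It uses the standard library's sqlite3 with an in-memory database so that it runs without a MySQL server; sqlite3 uses ? placeholders, whereas with pymysql the placeholders stay as %s and the call would be cursor.execute(sql, (name[x], phone[x], address[x])). The sample rows are invented. I can't tell from the truncated message what triggered the 1064 error, but one plausible cause is executing the %s statement without passing the values alongside it.

```python
import sqlite3

# In-memory database so the example needs no server (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meishi (name TEXT, phone TEXT, address TEXT)")

# Invented sample data; the first name contains a quote on purpose.
rows = [
    ("Lao Wang's Noodles", "0731-0000000", "1 Example Road"),
    ("Spicy House", "0731-1111111", "2 Example Road"),
]

# Values travel separately from the SQL text, so the driver handles quoting.
# sqlite3 uses "?" placeholders; pymysql uses "%s" in the same position.
conn.executemany(
    "INSERT INTO meishi (name, phone, address) VALUES (?, ?, ?)", rows
)
conn.commit()

stored = conn.execute("SELECT name FROM meishi ORDER BY name").fetchall()
print(stored)
```

The row with an apostrophe in the name goes in without any manual escaping, which is the case the concatenated statement cannot handle.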

Day 3, 2019/9/10. Keep going, keep going, keep going.

Original link: https://blog.csdn.net/qq_42992704/article/details/100694394