Garbled Chinese output from Python BeautifulSoup

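While learning to write web crawlers I ran into a strange problem: when crawling different pages of the same website (all of them declaring the encoding 'gb2312'), BeautifulSoup sometimes outputs the Chinese text correctly and sometimes outputs garbled characters.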

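Reference I found: http://bbs.chinaunix.net/thread-4084647-1-1.html

It says: the soup returned by BeautifulSoup looks garbled when printed, but the soup itself has in fact already been parsed correctly from the original GB2312 encoding into Unicode.

According to that thread, the garbling appears because printing the soup calls __str__, which defaults to UTF-8, so the output looks garbled when it is written to a cmd console that uses GBK.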

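But that is not what actually happens here:

links = soup.find_all("a", href=re.compile('/.+?'), title=re.compile('.*?' + titlekey + '.*?'))  # search with regular expressions

This search returns an empty list, although it should not be empty (the matching links are visible in the HTML source in a browser).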

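In addition:

When gb18030 is passed as the encoding, the Chinese output is correct:

soup = BeautifulSoup(content, "html.parser", from_encoding=html_encode)  # html_encode = 'gb18030'

links = soup.find_all("a", href=re.compile('/.+?'), title=re.compile('.*?' + titlekey + '.*?'))  # the returned list is now correct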

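From this analysis, the problem lies in how BeautifulSoup decodes these pages, not merely in how the result is printed: the soup itself holds wrongly decoded text, which is why the title regex finds nothing.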
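
One possible explanation, offered as an assumption rather than something verified against the site: pages that declare gb2312 may contain characters that exist only in its supersets GBK and GB18030. A strict gb2312 decode of such a page fails, while gb18030 succeeds. The sketch below uses only the standard library; the sample string and the try_decode helper are made up for illustration.

```python
# Hypothetical sample: "镕" exists in GBK/GB18030 but not in GB2312.
sample = "朱镕基"
raw = sample.encode("gb18030")  # stand-in for bytes from a page that declares gb2312


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


strict_gb2312 = try_decode(raw, "gb2312")    # strict decode with the declared charset
strict_gb18030 = try_decode(raw, "gb18030")  # strict decode with the superset
lossy_gb2312 = raw.decode("gb2312", errors="ignore")  # undecodable bytes are dropped

print(strict_gb2312)
print(strict_gb18030)
print(repr(lossy_gb2312))
```

If this is what happens on the failing pages, it would explain why only some pages of the same site come out garbled: only those containing such characters fail to decode as gb2312.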

Solution:

soup = BeautifulSoup(content.decode(html_encode, errors='ignore'), "html.parser")  # html_encode = 'gb2312'

That is, first decode the bytes content into a str, then initialize BeautifulSoup with the str, and the problem goes away.
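
A variation on this fix, sketched under the assumption that the garbled pages contain characters outside GB2312: decoding with errors='ignore' silently drops such characters, whereas GB18030 (a superset of GB2312 and GBK) keeps them. The name decode_html and the sample markup are hypothetical and exist only for illustration.

```python
def decode_html(content: bytes, declared: str = "gb2312") -> str:
    """Decode page bytes to str before handing them to BeautifulSoup.

    Hypothetical helper: gb2312 and gbk are swapped for gb18030, their
    superset; bytes that still cannot be decoded are dropped.
    """
    encoding = declared.lower()
    if encoding in ("gb2312", "gbk"):
        encoding = "gb18030"
    return content.decode(encoding, errors="ignore")


# Usage with BeautifulSoup (requires the third-party bs4 package):
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(decode_html(content, html_encode), "html.parser")

# Made-up page fragment whose title contains a character outside GB2312.
page = '<a href="/news" title="朱镕基">link</a>'.encode("gb18030")
html = decode_html(page, "gb2312")
print(html)
```

Because BeautifulSoup receives a str here, it does no encoding detection of its own, which is the same idea as the solution above.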