On str and unicode in Python 2

(1):

>>> d = u'你好'
>>> print d
ÄãºÃ
>>> d
u'\xc4\xe3\xba\xc3'
>>> e = '你好'
>>> print e
你好
>>> e
'\xc4\xe3\xba\xc3'

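Notice what happened: typing Chinese into IDLE with the u prefix did not produce real Unicode. The bytes are still in the system encoding (GBK); the u prefix merely wrapped them in a pseudo-unicode object whose code points are the raw GBK byte values.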

So use decode to get a real unicode object instead.

>>> ee = e.decode('gbk')
>>> ee
u'\u4f60\u597d'
>>> type(ee)
<type 'unicode'>
>>> print ee
你好

That works.
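If a pseudo-unicode value such as d has already been created, it can usually be repaired rather than retyped. The following is a minimal sketch, assuming the broken object holds exactly one code point per original GBK byte (as d does). The helper name repair_pseudo_unicode is hypothetical, not part of any library. It relies on latin-1 mapping code points 0-255 one-to-one onto bytes.

```python
def repair_pseudo_unicode(wrong, real_encoding='gbk'):
    """Hypothetical helper: recover text whose bytes were mistaken for code points."""
    # latin-1 turns code points 0-255 straight back into the bytes 0x00-0xff
    raw = wrong.encode('latin-1')
    # now decode those bytes with the encoding they were really written in
    return raw.decode(real_encoding)

d = u'\xc4\xe3\xba\xc3'           # the pseudo-unicode value from the session
fixed = repair_pseudo_unicode(d)  # same result as e.decode('gbk')
```

This only works when every code point in the broken value is below 256; anything else makes the latin-1 step raise UnicodeEncodeError.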

(2):

>>> print u, type(u)
中国	北京 <type 'unicode'>
>>> utf=u.encode('utf-8')
>>> print utf , type(utf)
中国	北京 <type 'str'>
>>> gbk=u.encode('GBK')
>>> print gbk, type(gbk)
中国	北京 <type 'str'>
>>> print len(u), len(utf), len(gbk)
5 13 9
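The session does not show how u was created. As a sketch, assuming u holds the two two-character words printed in the session separated by a single tab (which matches the reported length of 5), the three lengths can be reproduced with escaped literals that do not depend on the console encoding:

```python
# Assumed value of u: four CJK characters with one tab in the middle
u = u'\u4e2d\u56fd\t\u5317\u4eac'

utf = u.encode('utf-8')   # 3 bytes per CJK character + 1 byte for the tab
gbk = u.encode('gbk')     # 2 bytes per CJK character + 1 byte for the tab

lengths = (len(u), len(utf), len(gbk))
```

Here lengths comes out as (5, 13, 9): 4 * 3 + 1 = 13 bytes in UTF-8 and 4 * 2 + 1 = 9 bytes in GBK.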

A few notes:

1. A str object is a sequence of characters (in Python 2, effectively bytes), whereas a unicode object is a sequence of Unicode code units.

2. The characters in a str can be in any of several encodings, for example single-byte ASCII, double-byte GB2312, or UTF-8.

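3. A str literal typed in directly is encoded with the environment's default encoding. For example, in a GBK environment '你好' is encoded as '\xc4\xe3\xba\xc3', whereas in a UTF-8 environment it becomes '\xe4\xbd\xa0\xe5\xa5\xbd'.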

4. len(str) returns the number of bytes in the string; len(unicode) returns the number of characters.

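5. Printing a unicode object does not produce mojibake, provided the output encoding can represent its characters. print first converts the unicode object to the current output encoding and then writes it; if a character cannot be encoded, it raises UnicodeEncodeError.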
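Notes 3 and 5 can be checked without relying on what the console happens to use. The sketch below encodes the same text explicitly under each encoding. The helper encode_for_output is a hypothetical illustration of the conversion print performs before writing, with a fallback added for characters the target encoding cannot represent; it is not how print itself is implemented.

```python
text = u'\u4f60\u597d'   # the greeting from part (1), written with escapes

# Note 3: the same text gives different bytes under different encodings
gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')

def encode_for_output(value, encoding):
    """Hypothetical helper: encode text for an output stream, never raising."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        # substitute '?' for anything the target encoding cannot represent
        return value.encode(encoding, 'replace')

ascii_bytes = encode_for_output(text, 'ascii')
```

With an ASCII-only output, both characters fall back to '?', which is why a unicode object still needs an output encoding that covers its characters.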

Original link: https://blog.csdn.net/zzhtheone/article/details/20769527