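[Python] Using pandas to read text data delimited by special characters
A small problem I ran into in my day-to-day work as a finance grunt; sharing it here~ (experts, feel free to skip)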
import pandas as pd
df = pd.read_table(file_path, sep='$$', engine='python', header=0)
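Problem description:
- When a text dataset delimited by "$$" is read directly with read_csv or read_table, pandas may produce a DataFrame in which one column is a single very long string (each row is not split on the specified delimiter) and another column, named "Unnamed: 1", contains only np.nan. This is not the result we need
- pandas also reports a DtypeWarning, warning that some of the data has mixed types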
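One likely cause, offered as a sketch rather than a confirmed diagnosis: pandas documents that a `sep` longer than one character (other than `'\s+'`) is interpreted as a regular expression, and in a regular expression `$` is an end-of-string anchor, not a literal dollar sign. The standard-library snippet below uses a made-up sample line (the real file's columns are not shown in this post) to compare the raw pattern with the escaped one.

```python
import re

# Hypothetical sample line; invented purely for illustration.
line = "id$$name$$amount"

# Unescaped: '$$' is two end-of-string anchors, so the literal "$$" is never matched.
unescaped_parts = re.split('$$', line)

# Escaped: re.escape turns '$$' into a pattern that matches two literal dollar signs.
escaped_pattern = re.escape('$$')
escaped_parts = re.split(escaped_pattern, line)

print(unescaped_parts)  # the first element is still the whole, unsplit line
print(escaped_parts)    # ['id', 'name', 'amount']
```

On Python 3.7 and later the unescaped split also yields a trailing empty string, which would be consistent with the extra all-NaN "Unnamed: 1" column described above.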
Solution:
import pandas as pd
# folder_path, file_original_name and file_new_name are placeholders; set them for your own files
file_path = folder_path + file_new_name
# replace '$$' with the ordinary delimiter ','
# (only safe if no field in the data already contains a comma)
with open(folder_path + file_original_name, "rt", encoding='utf-8') as file_input, \
        open(file_path, "wt", encoding='utf-8') as file_output:
    for line in file_input:
        file_output.write(line.replace('$$', ','))
# read the new text file with delimiter ','
# converter_dict is a user-defined dict mapping column names to converter functions
data_chunks = pd.read_table(file_path, sep=',', converters=converter_dict, low_memory=False,
                            encoding='utf-8', header=0, chunksize=100000)
chunk_list = []
for index, chunk in enumerate(data_chunks):
    chunk_list.append(chunk)  # collect each DataFrame yielded by the TextFileReader
    print(">>> chunk{0} loaded...".format(index + 1))
print(">>> concatenating...")
df = pd.concat(chunk_list, ignore_index=True)
print(">>> DataFrame created successfully")
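As an alternative to rewriting the file, escaping the delimiter may let pandas split the columns directly. The following is a minimal sketch, assuming the file has a header row and that "$$" never appears inside a field. The sample data and column names are invented for illustration, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Invented sample data standing in for the real '$$'-delimited file.
sample_text = "id$$name$$amount\n1$$alice$$10.5\n2$$bob$$20.0\n"

# A multi-character sep is treated as a regular expression, so each '$' is escaped.
# Regex separators require the Python parsing engine.
df_direct = pd.read_csv(io.StringIO(sample_text), sep=r'\$\$', engine='python', header=0)

print(df_direct.columns.tolist())
print(df_direct.shape)
```

The Python engine is slower than the C engine, so for very large files the replace-and-reread approach above may still be the faster option.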
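Copyright notice: This is an original article by weixin_44941638, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.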