[Python] Reading text data delimited by special characters with pandas

A small problem that a finance grunt ran into in day-to-day work; sharing it here. (Experts can safely skip this.)

import pandas as pd

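# direct attempt with '$$' as the separator; note that pandas interprets a
# separator longer than one character as a regular expression
df = pd.read_table(file_path, sep='$$', engine='python', header=0)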

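Problem description:

  1. When a text dataset delimited by "$$" is read directly with read_csv or read_table, pandas may produce a DataFrame in which one column holds a very long string (each line was not split on the specified delimiter) and another column, named "Unnamed: 1", contains only np.nan. This is not the intended result.
  2. A DtypeWarning is reported as well, saying that some columns have mixed types.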
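
One likely cause of the first symptom is worth checking: pandas treats a separator longer than one character as a regular expression, and in a regular expression `$` is an end-of-line anchor rather than a literal dollar sign. Below is a minimal sketch of how escaping the separator lets pandas split the columns directly. It assumes a small in-memory sample; the `sample` text and its column names are made up for illustration.

```python
import io
import re

import pandas as pd

# Hypothetical sample data, standing in for the real '$$'-delimited file
sample = "id$$name$$amount\n1$$alice$$10.5\n2$$bob$$20.0\n"

# Unescaped, '$$' is a regex made of two end-of-line anchors, so it never
# matches the literal dollar signs inside the line
print(re.split('$$', 'id$$name$$amount'))

# re.escape('$$') yields a pattern that matches two literal dollar signs
df = pd.read_csv(io.StringIO(sample), sep=re.escape('$$'),
                 engine='python', header=0)
print(df.columns.tolist())
print(df.shape)
```

The python engine is slower than the default C engine, so for very large files rewriting the delimiter first may still be the more practical route.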

Solution:

import pandas as pd

# folder_path, file_original_name and file_new_name are defined elsewhere by the caller
file_path = folder_path + file_new_name

# replace '$$' with the ordinary delimiter ','
# (only safe when no field already contains a comma)
with open(folder_path + file_original_name, "rt", encoding='utf-8') as file_input, \
        open(file_path, "wt", encoding='utf-8') as file_output:
    for line in file_input:
        file_output.write(line.replace('$$', ','))

# read the new text file with delimiter ','
# converter_dict (column name -> conversion function) is defined elsewhere by the caller
data_chunks = pd.read_table(file_path, sep=',', converters=converter_dict, low_memory=False,
                            encoding='utf-8', header=0, chunksize=100000)
chunk_list = []
for index, chunk in enumerate(data_chunks):
    chunk_list.append(chunk)  # collect each DataFrame chunk yielded by the TextFileReader
    print(">>> chunk{0} loaded...".format(index + 1))
print(">>> concatenating...")

df = pd.concat(chunk_list, ignore_index=True)
print(">>> DataFrame created successfully")
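
The replacement step above assumes that no field contains a comma. If that cannot be guaranteed, one option is to let the standard `csv` module write the converted file so that such fields are quoted. The following is only a sketch: the helper name `convert_delimited` is hypothetical, and the file paths are throwaway ones created for the demonstration.

```python
import csv
import os
import tempfile


def convert_delimited(src_path, dst_path, old_sep='$$', encoding='utf-8'):
    """Rewrite a file delimited by old_sep as a standard, quoted CSV file."""
    with open(src_path, 'rt', encoding=encoding, newline='') as src, \
            open(dst_path, 'wt', encoding=encoding, newline='') as dst:
        writer = csv.writer(dst)  # quotes fields that contain commas or quotes
        for line in src:
            # strip the line ending, then split on the literal separator
            writer.writerow(line.rstrip('\r\n').split(old_sep))


# Demonstration on a throwaway file (hypothetical content)
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'original.txt')
dst_path = os.path.join(tmp_dir, 'converted.csv')
with open(src_path, 'wt', encoding='utf-8') as f:
    f.write('id$$note\n1$$hello, world\n')

convert_delimited(src_path, dst_path)
with open(dst_path, 'rt', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```

The converted file can then be read with `sep=','` as in the code above, because pandas honours the default double-quote quoting when parsing.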

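Copyright notice: this is an original article by weixin_44941638, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.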