0: Introduction
This example comes from *Python for Data Analysis, 2nd Edition*. It practices analyzing the time zone information and the operating-system (user agent) information in the dataset.
I: Data source — USA.gov dataset link
II: Setup
Imports
- import json
- import pandas as pd
- from collections import Counter
- import seaborn as sns
Parse the txt file line by line as JSON
- path='C:/Users/17322/Desktop/datasets/bitly_usagov/example.txt'
- records=[json.loads(line) for line in open(path)]
- records[0]
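The one-liner above also leaves the file handle open. A slightly more defensive version, sketched here on a hypothetical two-line sample (since the real example.txt may not be at hand), skips blank lines and can be wrapped in a `with open(path) as f:` block:

```python
import json

# Hypothetical two-line sample standing in for example.txt
sample_lines = [
    '{"a": "Mozilla/5.0", "tz": "America/New_York"}',
    '{"a": "GoogleMaps/RochesterNY", "tz": ""}',
]

def load_records(lines):
    # Skip blank lines so a trailing newline does not raise JSONDecodeError
    return [json.loads(line) for line in lines if line.strip()]

records = load_records(sample_lines)
print(records[0])
```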

III: Analyzing the time zone distribution of the dataset
1: Extract the value of the tz (time zone) field from each record
`time_zones=[rec['tz'] for rec in records if 'tz' in rec]`
time_zones[:10]
Output:
out[]:['America/New_York',
'America/Denver',
'America/New_York',
'America/Sao_Paulo',
'America/New_York',
'America/New_York',
'Europe/Warsaw',
'',
'',
'']
2: Count the time zones
- Method 1: a hand-written counting function
def get_counts(seq):
    counts = {}
    for x in seq:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
Pass in the time zone list to get a {tz: count} dict
counts=get_counts(time_zones)
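The same counter can be written more compactly with `dict.get`, which supplies a default of 0 for keys not yet seen; a sketch:

```python
def get_counts2(seq):
    counts = {}
    for x in seq:
        # get(x, 0) returns 0 the first time a key appears,
        # so no explicit membership test is needed
        counts[x] = counts.get(x, 0) + 1
    return counts

print(get_counts2(['America/New_York', '', 'America/New_York']))
```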
Sort the (count, tz) pairs to get the top N entries
def top_n_counts(count_dict, n):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
top_n_counts(counts, 9)
Output:
[(35, 'Europe/Madrid'),
(36, 'Pacific/Honolulu'),
(37, 'Asia/Tokyo'),
(74, 'Europe/London'),
(191, 'America/Denver'),
(382, 'America/Los_Angeles'),
(400, 'America/Chicago'),
(521, ''),
(1251, 'America/New_York')]
- Method 2: use the collections.Counter class
counts = Counter(time_zones)
counts.most_common(10)
out[]:
[('America/New_York', 1251),
('', 521),
('America/Chicago', 400),
('America/Los_Angeles', 382),
('America/Denver', 191),
('Europe/London', 74),
('Asia/Tokyo', 37),
('Pacific/Honolulu', 36),
('Europe/Madrid', 35),
('America/Sao_Paulo', 33)]
- Method 3: count with pandas
frame = pd.DataFrame(records)
tz_counts = frame['tz'].value_counts()
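`value_counts` returns a Series of counts sorted in descending order; a minimal sketch on hypothetical toy records:

```python
import pandas as pd

# Toy stand-in for the real DataFrame of records
frame = pd.DataFrame({'tz': ['America/New_York', 'America/New_York',
                             'Europe/London', '']})
# One row per distinct tz value, counts sorted largest first
tz_counts = frame['tz'].value_counts()
print(tz_counts)
```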
3: Visualization
- Fill missing values with 'Missing' and relabel empty strings as 'Unknown'
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
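The two cleaning steps can be checked on a toy Series: `fillna` only touches NaN (records that had no 'tz' key at all), while the boolean-mask assignment handles values that are present but empty:

```python
import pandas as pd
import numpy as np

# NaN marks a record with no 'tz' key; '' is present but empty
tz = pd.Series(['America/New_York', np.nan, '', 'Europe/London'])
clean_tz = tz.fillna('Missing')          # replaces only the NaN
clean_tz[clean_tz == ''] = 'Unknown'     # replaces only the empty string
print(clean_tz.tolist())
```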
Visualize with the seaborn package (imported in the setup above)
subset = tz_counts[:10]
sns.barplot(x=subset.values, y=subset.index)
Output plot:
IV: Breaking down the counts above by Windows vs. non-Windows users
1: Remove records with a missing agent field
cframe = frame[frame.a.notnull()].copy()  # .copy() avoids SettingWithCopyWarning on the next assignment
2: Split the dataset by the tz and os columns
import numpy as np
cframe['os'] = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
by_tz_os = cframe.groupby(['tz','os'])
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]
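The `groupby(...).size().unstack()` chain produces one row per time zone and one column per os value; a sketch on hypothetical toy data:

```python
import pandas as pd

cframe = pd.DataFrame({
    'tz': ['America/New_York', 'America/New_York', 'Europe/London'],
    'os': ['Windows', 'Not Windows', 'Windows'],
})
# size() counts rows per (tz, os) pair; unstack() pivots os into columns,
# leaving NaN for absent combinations, which fillna(0) zeroes out
agg_counts = cframe.groupby(['tz', 'os']).size().unstack().fillna(0)
print(agg_counts)
```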
3: Take the ten time zones with the largest total counts
indexer = agg_counts.sum(1).argsort()
count_subset = agg_counts.take(indexer[-10:])
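Here `sum(1)` computes row totals, `argsort` yields positions in ascending order, and `take` picks the last ten; with a recent pandas, `nlargest` on the row sums selects the same rows more directly. A sketch on hypothetical data:

```python
import pandas as pd

agg_counts = pd.DataFrame(
    {'Not Windows': [245.0, 1.0], 'Windows': [276.0, 2.0]},
    index=pd.Index(['America/Chicago', 'Africa/Cairo'], name='tz'))

# argsort/take route: ascending order, so the largest totals come last
indexer = agg_counts.sum(1).argsort()
count_subset = agg_counts.take(indexer[-10:])

# equivalent: select the ten largest row totals directly
top = agg_counts.loc[agg_counts.sum(1).nlargest(10).index]
print(top.index.tolist())
```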
4:可视化
count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
sns.barplot(data=count_subset, x='total', y='tz', hue='os')

Normalize to show the relative proportions within each time zone
Define a function that computes the Windows / Not Windows proportion within each group:
def norm_total(group):
    group['norm_total'] = group.total / group.total.sum()
    return group
results = count_subset.groupby('tz').apply(norm_total)
sns.barplot(data=results, x='norm_total', y='tz', hue='os')
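The groupby/apply with norm_total returns whole groups; `transform('sum')` broadcasts each group's total back onto its rows and avoids the per-group function entirely. A sketch on hypothetical data:

```python
import pandas as pd

count_subset = pd.DataFrame({
    'tz': ['America/New_York'] * 2 + ['Europe/London'] * 2,
    'os': ['Windows', 'Not Windows'] * 2,
    'total': [900.0, 300.0, 50.0, 50.0],
})

# transform('sum') returns a Series aligned with the original rows,
# holding each row's group total
group_sums = count_subset.groupby('tz')['total'].transform('sum')
count_subset['norm_total'] = count_subset['total'] / group_sums
print(count_subset['norm_total'].tolist())
```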

END.
Copyright notice: This is an original article by qq_41925850, licensed under CC 4.0 BY-SA. Please include a link to the original and this notice when reposting.