A Python 3 implementation of the Apriori algorithm, with improvements
The code is adapted from Machine Learning in Action (《机器学习实战》).
The improvement ideas come partly from Data Mining: Concepts and Techniques (《数据挖掘:概念与技术》) and partly from
https://blog.csdn.net/weixin_30702887/article/details/98992919
Here I summarize and implement them as a record of my own study of the Apriori algorithm.
First, build the candidate 1-itemsets from the transaction database:
def createCDDSet(dataSet):
    C = []
    for tid in dataSet:
        for item in tid:
            if not [item] in C:
                C.append([item])
    C.sort()
    return list(map(frozenset, C))
frozenset is used here to keep the candidate and frequent itemsets from being modified.
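A second reason for frozenset is worth spelling out: support counts are later stored in a dictionary keyed by itemsets, and only hashable (immutable) sets can be dictionary keys. A minimal illustration:

```python
# frozenset is hashable and immutable, so it can key the support-count
# dictionary; a plain (mutable) set cannot.
counts = {frozenset(['a', 'c']): 3}   # works, and is order-insensitive
try:
    counts[set(['a', 'c'])] = 3       # raises TypeError: unhashable type
except TypeError as err:
    print('set cannot be a dict key:', err)
```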
Next, implement the repeated filtering of the frequent itemsets LK out of the transaction database:
def scanD(dataSet, CK, minsupport):  # take candidate k-itemsets, return frequent k-itemsets
    ssCnt = {}
    for tid in dataSet:
        for can in CK:
            if can.issubset(tid):
                if not can in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    numItem = float(len(dataSet))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItem
        if support >= minsupport:
            retList.insert(0, key)
        supportData[key] = support  # update the support dictionary
    return retList, supportData
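To see the counting step in isolation, here is a condensed, self-contained rewrite of the same logic (names shortened, toy transactions assumed) run with a minimum support of 0.5:

```python
def scan_support(transactions, candidates, min_support):
    """Count transactions containing each candidate and keep the candidates
    whose support ratio meets the threshold (mirrors scanD above)."""
    counts = {}
    for tid in transactions:
        for can in candidates:
            if can.issubset(tid):
                counts[can] = counts.get(can, 0) + 1
    n = float(len(transactions))
    frequent, support_data = [], {}
    for key, cnt in counts.items():
        support = cnt / n
        support_data[key] = support     # record the support of every candidate
        if support >= min_support:
            frequent.append(key)
    return frequent, support_data

data = [{'a', 'c', 'e'}, {'b', 'd'}, {'b', 'c'}, {'a', 'b', 'c', 'd'}]
c1 = [frozenset([x]) for x in sorted({i for t in data for i in t})]
freq, sup = scan_support(data, c1, 0.5)
# 'e' appears in only 1 of 4 transactions (support 0.25), so it is filtered out
```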
Next, build the candidate itemsets (candidate (k+1)-itemsets generated from the frequent k-itemsets):
def aprioriGen(LK, k):
    retList = []
    lenLK = len(LK)
    for i in range(lenLK):
        for j in range(i + 1, lenLK):
            # sort before slicing: frozensets have no guaranteed element
            # order, so slicing first would compare arbitrary prefixes
            L1 = sorted(LK[i])[:k - 2]
            L2 = sorted(LK[j])[:k - 2]
            if L1 == L2:
                retList.append(LK[i] | LK[j])
    return retList
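The join condition is easiest to see on a tiny example (a minimal sketch of the same join step, toy itemsets assumed): two frequent (k-1)-itemsets merge only when their first k-2 sorted items agree, which guarantees each candidate is generated exactly once.

```python
def apriori_gen(Lk, k):
    # join step: merge two frequent (k-1)-itemsets into a candidate
    # k-itemset only when their first k-2 sorted items agree
    ret = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            if sorted(Lk[i])[:k - 2] == sorted(Lk[j])[:k - 2]:
                ret.append(Lk[i] | Lk[j])
    return ret

L2 = [frozenset(s) for s in ({'a', 'b'}, {'a', 'c'}, {'b', 'c'})]
C3 = apriori_gen(L2, 3)
# only {'a','b'} and {'a','c'} share the sorted prefix ['a'],
# so the single candidate is {'a','b','c'}
```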
Finally, the main body of the Apriori algorithm:
def apriori(dataSet, minsupport):
    C1 = createCDDSet(dataSet)
    D = set()
    for tid in dataSet:
        tid = frozenset(tid)
        D.add(tid)
    L1, supportData = scanD(D, C1, minsupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        CK = aprioriGen(L[k - 2], k)
        LK, supK = scanD(D, CK, minsupport)
        supportData.update(supK)
        L.append(LK)
        k += 1
    L = [i for i in L if i]  # drop the empty lists
    return L, supportData
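As a sanity check, the three functions above can be condensed into one self-contained sketch and run end to end on the ten sample transactions used later (names shortened; the logic mirrors the code above, except that transactions are kept in a list so duplicate transactions retain their full count):

```python
def create_c1(data):
    # candidate 1-itemsets: one frozenset per distinct item
    return [frozenset([i]) for i in sorted({x for t in data for x in t})]

def scan(data, ck, min_sup):
    # count candidates and keep those meeting the minimum support
    cnt = {}
    for tid in data:
        for can in ck:
            if can.issubset(tid):
                cnt[can] = cnt.get(can, 0) + 1
    n = float(len(data))
    lk, sup = [], {}
    for key, c in cnt.items():
        sup[key] = c / n
        if sup[key] >= min_sup:
            lk.append(key)
    return lk, sup

def gen(lk, k):
    # join step: merge itemsets sharing the first k-2 sorted items
    out = []
    for i in range(len(lk)):
        for j in range(i + 1, len(lk)):
            if sorted(lk[i])[:k - 2] == sorted(lk[j])[:k - 2]:
                out.append(lk[i] | lk[j])
    return out

def apriori(data, min_sup):
    d = [frozenset(t) for t in data]
    l1, support = scan(d, create_c1(d), min_sup)
    levels, k = [l1], 2
    while levels[-1]:
        lk, sup = scan(d, gen(levels[-1], k), min_sup)
        support.update(sup)
        levels.append(lk)
        k += 1
    return [x for x in levels if x], support

data = [['a','c','e'], ['b','d'], ['b','c'], ['a','b','c','d'], ['a','b'],
        ['b','c'], ['a','b'], ['a','b','c','e'], ['a','b','c'], ['a','c','e']]
levels, support = apriori(data, 0.2)
# levels[0], levels[1], levels[2] hold the frequent 1-, 2- and 3-itemsets
```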
That completes the main body of the Apriori algorithm.
Next, mine the association rules.
First, load the data. The sample I use is:
a,c,e
b,d
b,c
a,b,c,d
a,b
b,c
a,b
a,b,c,e
a,b,c
a,c,e
stored as a txt file. I load it with pandas:
import pandas as pd
name = 'menu_orders.txt'
minsupport = 0.2
minconfidence = 0.5
def createData(name):  # preprocess the file and return the dataset
    # four column names: the longest transaction in the sample has 4 items
    D = pd.read_csv(name, header=None, index_col=False, names=['1', '2', '3', '4'])
    D = D.fillna(0)
    D = D.values.tolist()
    for i in range(len(D)):
        D[i] = [j for j in D[i] if j != 0]
    return D
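pandas is convenient but not required here; a plain-Python reader produces the same ragged list of transactions (a minimal sketch; the temporary file merely stands in for menu_orders.txt):

```python
import os
import tempfile

def create_data(path):
    # each non-empty line of the file is one comma-separated transaction
    with open(path) as f:
        return [line.strip().split(',') for line in f if line.strip()]

# demo with a throwaway file holding the first three sample transactions
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write('a,c,e\nb,d\nb,c\n')
tmp.close()
transactions = create_data(tmp.name)
os.unlink(tmp.name)
```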
Then compute the confidence and write out the association rules; a rule is emitted only when its confidence reaches the threshold.
import Apriori_self  # the implementation above, assumed saved as Apriori_self.py
def calculate(dataset):  # compute support and confidence with the algorithm
    dataset, dic = Apriori_self.apriori(dataset, minsupport)
    Rname = []
    Rsupport = []
    Rconfidence = []
    emptylist = []
    for i in range(len(dataset)):
        for AB in dataset[i]:
            for A in emptylist:
                if A.issubset(AB):
                    # rule (AB - A) --> A with conf = support(AB) / support(AB - A)
                    conf = dic.get(AB) / dic.get(AB - A)
                    if conf >= minconfidence:
                        Rname.append(str(AB - A) + '-->' + str(A))
                        Rconfidence.append(conf)
                        Rsupport.append(dic.get(AB))
            emptylist.append(AB)
    return Rname, Rsupport, Rconfidence
Building the rules by frozenset subtraction avoids the hassle of freely recombining the items of each frequent itemset.
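Concretely, for a frequent itemset AB and a sub-itemset A, the rule (AB - A) --> A has confidence support(AB) / support(AB - A), and the antecedent falls out of a single subtraction (a minimal sketch with toy support values):

```python
# toy support dictionary, as supportData/dic would provide it
support = {frozenset({'a'}): 0.6, frozenset({'a', 'c'}): 0.4}
AB = frozenset({'a', 'c'})
A = frozenset({'c'})
antecedent = AB - A                       # frozenset({'a'})
conf = support[AB] / support[antecedent]  # 0.4 / 0.6, about 0.667
```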
Finally, print the results and write them to a txt file:
def outputdata(Rname, Rsupport, Rconfidence):
    data = {
        "rule": Rname,
        "support": Rsupport,
        "confidence": Rconfidence
    }
    df = pd.DataFrame(data, columns=['rule', 'support', 'confidence'])
    return df

dataset = createData(name)
R1, R2, R3 = calculate(dataset)
df = outputdata(R1, R2, R3)
df.to_csv('Report.txt')
Let's look at the final result:
,rule,support,confidence
0,frozenset({'e'})-->frozenset({'a'}),0.2,1.0
1,frozenset({'e'})-->frozenset({'c'}),0.2,1.0
2,frozenset({'d'})-->frozenset({'b'}),0.2,1.0
3,frozenset({'a'})-->frozenset({'b'}),0.4,0.6666666666666667
4,frozenset({'b'})-->frozenset({'a'}),0.4,0.5
5,frozenset({'a'})-->frozenset({'c'}),0.4,0.6666666666666667
6,frozenset({'c'})-->frozenset({'a'}),0.4,0.6666666666666667
7,frozenset({'b'})-->frozenset({'c'}),0.4,0.5
8,frozenset({'c'})-->frozenset({'b'}),0.4,0.6666666666666667
9,"frozenset({'a', 'c'})-->frozenset({'e'})",0.2,0.5
10,"frozenset({'a', 'e'})-->frozenset({'c'})",0.2,1.0
11,"frozenset({'c', 'e'})-->frozenset({'a'})",0.2,1.0
12,"frozenset({'e'})-->frozenset({'a', 'c'})",0.2,1.0
13,"frozenset({'a', 'b'})-->frozenset({'c'})",0.2,0.5
14,"frozenset({'a', 'c'})-->frozenset({'b'})",0.2,0.5
15,"frozenset({'b', 'c'})-->frozenset({'a'})",0.2,0.5
Optimizing the Apriori algorithm
1. Itemsets already known to be infrequent can be deleted from the transaction database directly, which avoids scanning them again and reduces I/O overhead.
2. For the frequent k-itemsets: if a single item i occurs fewer than k times among them, no itemset containing i can appear in the frequent (k+1)-itemsets, so the itemsets containing i should be removed from the frequent k-itemsets before the join step.
The complete optimized code:
def createCDDSet(dataSet):
    C = []
    for tid in dataSet:
        for item in tid:
            if not [item] in C:
                C.append([item])
    C.sort()
    return list(map(frozenset, C))
def scanD(dataSet, CK, minsupport, numItem, k=0):  # take candidate k-itemsets, return frequent k-itemsets
    ssCnt = {}
    for tid in dataSet:
        for can in CK:
            if can.issubset(tid):
                if not can in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    # numItem = float(len(dataSet))  # moved into apriori() so it is computed only once
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItem
        if support >= minsupport:
            retList.insert(0, key)
        else:
            # improvement 1: shrink the transaction database by deleting
            # transactions that match an infrequent itemset, reducing the
            # data scanned later (iterate over a copy: removing from a set
            # while iterating it raises RuntimeError)
            for tid in list(dataSet):
                if key == tid:
                    dataSet.remove(tid)
        supportData[key] = support
    R_List = []
    # improvement 2: if a single item i occurs fewer than k times across the
    # frequent k-itemsets, i cannot occur in any frequent (k+1)-itemset, so
    # remove the itemsets containing i before the join step (compresses CK)
    if k > 1:
        ssCnt = {}
        for tid in retList:
            for key in tid:
                if not key in ssCnt:
                    ssCnt[key] = 1
                else:
                    ssCnt[key] += 1
        tids = []
        for tid in retList:
            for item in tid:
                if ssCnt[item] < k:
                    tids.append(tid)
        R_List = list(set(retList) - set(tids))
        print('frequent itemsets before pruning: ' + str(retList) + '  after pruning: ' + str(R_List))
    return retList, supportData, R_List
def aprioriGen(LK, k, RK):  # build candidate itemsets CK; k is the size of each output itemset
    if RK:  # use the pruned frequent itemsets when the pruning produced any
        LK = RK
    retList = []
    lenLK = len(LK)
    for i in range(lenLK):
        for j in range(i + 1, lenLK):
            # sort before slicing, since frozensets have no guaranteed order
            L1 = sorted(LK[i])[:k - 2]
            L2 = sorted(LK[j])[:k - 2]
            if L1 == L2:
                retList.append(LK[i] | LK[j])
    return retList
def apriori(dataSet, minsupport):
    C1 = createCDDSet(dataSet)
    D = set()
    for tid in dataSet:
        tid = frozenset(tid)
        D.add(tid)
    numItem = float(len(D))  # the only place numItem is computed: scanD deletes elements from D, which would otherwise change it
    L1, supportData, R1 = scanD(D, C1, minsupport, numItem)
    L = [L1]
    R = [R1]
    k = 2
    while len(L[k - 2]) > 0:
        CK = aprioriGen(L[k - 2], k, R[k - 2])
        LK, supK, RK = scanD(D, CK, minsupport, numItem, k)
        supportData.update(supK)
        L.append(LK)
        R.append(RK)
        k += 1
    L = [i for i in L if i]  # drop the empty lists
    return L, supportData
Timing comparison before and after the optimization (the before/after timing screenshots are not reproduced here).
The frequent 2-itemsets fed into aprioriGen before the optimization:
[frozenset({'e', 'a'}), frozenset({'c', 'e'}), frozenset({'c', 'a'}), frozenset({'c', 'b'}), frozenset({'b', 'd'}), frozenset({'b', 'a'})]
and after:
[frozenset({'b', 'a'}), frozenset({'c', 'a'}), frozenset({'c', 'e'}), frozenset({'e', 'a'}), frozenset({'c', 'b'})]
The optimization removed one itemset: 'd' occurs in only one frequent 2-itemset (fewer than k = 2 times), so frozenset({'b', 'd'}) is pruned before the join step.
Summary
There are many ways to optimize the Apriori algorithm:
- Hash-based itemset counting
- Reducing the number of database scans
- Partitioning: first find locally frequent itemsets, then find the globally frequent itemsets among them; only two scans of the transaction database are needed in total
- Sampling (at the cost of some accuracy)
- For the frequent k-itemsets, if a single item i occurs fewer than k times among them, no itemset containing i can appear in the frequent (k+1)-itemsets, so remove the itemsets containing i before the join step
- Dynamic itemset counting
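The hash-based counting idea from the first bullet can be sketched as follows (the DHP technique; the bucket count and threshold here are arbitrary assumptions): while scanning for 1-itemset counts, every 2-item pair of each transaction is also hashed into a small table, and a pair can only be a frequent 2-itemset if its bucket total reaches the minimum support count.

```python
from itertools import combinations

transactions = [{'a', 'c', 'e'}, {'b', 'd'}, {'b', 'c'}, {'a', 'b', 'c', 'd'}]
n_buckets, min_count = 8, 2
buckets = [0] * n_buckets

def bucket(pair):
    # frozenset makes the hash order-insensitive within one run
    return hash(frozenset(pair)) % n_buckets

# one pass: hash every 2-item combination into its bucket
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[bucket(pair)] += 1

def may_be_frequent(pair):
    # sound but not exact: collisions can keep infrequent pairs alive,
    # yet no truly frequent pair is ever discarded
    return buckets[bucket(pair)] >= min_count
```

Buckets whose totals fall below the threshold prune their pairs from C2 before any support counting, shrinking the candidate set at the cost of a small hash table.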