Classification Based on Probability Theory: Naive Bayes
Naive Bayes is a classification method built on Bayes' theorem plus the assumption that features are conditionally independent given the class. For a given training set, it first learns the joint probability distribution of the input and output under this conditional independence assumption; then, for a given input x, it applies Bayes' theorem to output the class y with the largest posterior probability. Naive Bayes is simple to implement and efficient in both learning and prediction, which makes it a commonly used method.
Text classification with Python
Preparing the data: building word vectors from text
First we need a toy corpus. loadDataSet() hand-builds six short postings together with their class labels (1 = abusive, 0 = not), and createVocabList() collects the unique words.
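A sketch of both helpers, following the Machine Learning in Action chapter 4 conventions (the posting texts are the book's toy data; any list of tokenized documents works):

```python
def loadDataSet():
    """Return a toy corpus of tokenized postings and their labels."""
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]   # 1 = abusive, 0 = not abusive
    return postingList, classVec

def createVocabList(dataSet):
    """Union of all words across all documents, returned as a list."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)
```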
```python
listPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listPosts)
myVocabList
```
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop',
 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has',
 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take',
 'mr', 'steak', 'my']
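setOfWords2Vec() turns a document into a 0/1 vector over this vocabulary; a sketch in the book's style:

```python
def setOfWords2Vec(vocabList, inputSet):
    """Binary vector: 1 if the vocabulary word occurs in the document."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my Vocabulary!' % word)
    return returnVec
```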
```python
setOfWords2Vec(myVocabList, listPosts[0])
```
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
Training the algorithm: computing probabilities from word vectors
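Training amounts to estimating P(c) as the fraction of abusive documents and P(w|c) as each word's relative frequency within a class. A sketch of trainNB0 in the book's style (no smoothing yet, which is why zeros appear in the output below):

```python
import numpy as np

def trainNB0(trainMatrix, trainCategory):
    """Estimate P(w|c) for both classes and the prior P(c=1)."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # prior P(c=1)
    p0Num = np.zeros(numWords); p1Num = np.zeros(numWords)
    p0Denom = 0.0; p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]            # per-word counts in class 1
            p1Denom += sum(trainMatrix[i])     # total words seen in class 1
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom                   # P(w|c=1)
    p0Vect = p0Num / p0Denom                   # P(w|c=0)
    return p0Vect, p1Vect, pAbusive
```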
```python
import numpy as np

listPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listPosts)
trainMat = [setOfWords2Vec(myVocabList, post) for post in listPosts]
p0V, p1V, pAb = trainNB0(trainMat, listClasses)
pAb
```
0.5
```python
p0V
```
array([ 0.04166667, 0.04166667, 0.04166667, 0. , 0. ,
0.04166667, 0.04166667, 0.04166667, 0. , 0.04166667,
0.04166667, 0.04166667, 0.04166667, 0. , 0. ,
0.08333333, 0. , 0. , 0.04166667, 0. ,
0.04166667, 0.04166667, 0. , 0.04166667, 0.04166667,
0.04166667, 0. , 0.04166667, 0. , 0.04166667,
0.04166667, 0.125 ])
```python
p1V
```
array([ 0. , 0. , 0. , 0.05263158, 0.05263158,
0. , 0. , 0. , 0.05263158, 0.05263158,
0. , 0. , 0. , 0.05263158, 0.05263158,
0.05263158, 0.05263158, 0.05263158, 0. , 0.10526316,
0. , 0.05263158, 0.05263158, 0. , 0.10526316,
0. , 0.15789474, 0. , 0.05263158, 0. ,
0. , 0. ])
Testing the algorithm: modifying the classifier for real-world conditions
Laplace smoothing
Two practical problems arise: a single zero word count makes the whole product of probabilities zero, and multiplying many small probabilities underflows floating point. Applying Laplace smoothing and then taking the natural logarithm fixes both (see 统计学习方法, Statistical Learning Methods). The modified trainer, trainNB02, initializes every word count to 1 and each class denominator to 2, and returns log-probabilities.
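A sketch reconstructed from trainNB0 above with exactly those two changes; the values match the arrays below, e.g. ln(1/26) ≈ -3.258 for a word unseen in class 0:

```python
def trainNB02(trainMatrix, trainCategory):
    """Like trainNB0, but with Laplace smoothing and log-probabilities."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)   # add-one counts
    p0Denom = 2.0; p1Denom = 2.0                           # smoothed denominators
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)   # log avoids floating-point underflow
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
```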
```python
listPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listPosts)
trainMat = [setOfWords2Vec(myVocabList, post) for post in listPosts]
p0V, p1V, pAb = trainNB02(trainMat, listClasses)
pAb
```
0.5
```python
p0V
```
array([-2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
-2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
-2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
-2.15948425, -3.25809654, -3.25809654, -2.56494936, -3.25809654,
-2.56494936, -2.56494936, -3.25809654, -2.56494936, -2.56494936,
-2.56494936, -3.25809654, -2.56494936, -3.25809654, -2.56494936,
-2.56494936, -1.87180218])
```python
p1V
```
array([-3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
-3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
-3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
-2.35137526, -2.35137526, -2.35137526, -3.04452244, -1.94591015,
-3.04452244, -2.35137526, -2.35137526, -3.04452244, -1.94591015,
-3.04452244, -1.65822808, -3.04452244, -2.35137526, -3.04452244,
-3.04452244, -3.04452244])
The naive Bayes classification function
classifyNB() scores a word vector against both classes in log space and picks the larger; testingNB() wires everything together on the toy corpus.
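A sketch of both functions in the book's style. Because the probabilities are logs, the product P(w1|c)...P(wn|c)P(c) becomes a sum:

```python
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Compare log P(c=1|w) and log P(c=0|w) up to a shared constant."""
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0

def testingNB():
    """Train on the toy corpus and classify two hand-picked postings."""
    listPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listPosts)
    trainMat = [setOfWords2Vec(myVocabList, post) for post in listPosts]
    p0V, p1V, pAb = trainNB02(trainMat, listClasses)
    for testEntry in (['love', 'my', 'dalmation'], ['stupid', 'garbage']):
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
```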
```python
testingNB()
```
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1
Preparing the data: the bag-of-words document model
The set-of-words model above only records whether a word occurs; the bag-of-words model counts the total number of occurrences of each word.
bagOfWord2VecMN() differs from setOfWords2Vec() by a single line: it increments the count instead of setting a flag, as sketched below.
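A minimal sketch, keeping the document's function name:

```python
def bagOfWord2VecMN(vocabList, inputSet):
    """Count vector: number of occurrences of each vocabulary word."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1   # += 1 instead of = 1
    return returnVec
```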
Example: filtering spam email with naive Bayes
Preparing the data: tokenizing text
Split on any run of non-word characters, lowercase everything, and drop tokens shorter than three characters. Note that `\W+` replaces the original `\W*`, which can match the empty string and breaks under Python 3, and the parameter is renamed from `str`, which shadows the built-in:

```python
[tok.lower() for tok in re.split(r'\W+', bigString) if len(tok) > 2]
```
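Wrapped as a helper; a sketch, with the name textParse following the book:

```python
import re

def textParse(bigString):
    """Tokenize a raw string into lowercase words of length > 2."""
    return [tok.lower() for tok in re.split(r'\W+', bigString) if len(tok) > 2]
```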
Testing the algorithm: cross-validation with naive Bayes
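A sketch of spamTest following the book: it reads 25 spam and 25 ham emails (the email/spam/*.txt and email/ham/*.txt paths are the book's dataset layout and an assumption here), holds out 10 messages at random as a test set, trains on the rest, and reports the error rate. It relies on the helpers defined above:

```python
import random
import numpy as np

def spamTest():
    docList = []; classList = []
    for i in range(1, 26):
        # paths assume the book's dataset layout
        docList.append(textParse(open('email/spam/%d.txt' % i).read()))
        classList.append(1)
        docList.append(textParse(open('email/ham/%d.txt' % i).read()))
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(50)); testSet = []
    for _ in range(10):                      # random hold-out test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = [setOfWords2Vec(vocabList, docList[i]) for i in trainingSet]
    trainClasses = [classList[i] for i in trainingSet]
    p0V, p1V, pSpam = trainNB02(trainMat, trainClasses)
    errorCount = 0
    for docIndex in testSet:
        wordVector = np.array(setOfWords2Vec(vocabList, docList[docIndex]))
        if classifyNB(wordVector, p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', errorCount / len(testSet))
```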
```python
import re
import random

spamTest()
```
Example: using a naive Bayes classifier to reveal regional tendencies from personal ads
Collecting the data: importing RSS feeds
```python
import feedparser

# The Craigslist feeds used in the book; they may no longer be live, and
# any two RSS feeds whose entries carry a 'summary' field will do.
ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
len(ny['entries'])
```
25
Two helpers do the work: calcMostFreq() ranks vocabulary words by frequency across the full text, and localWords() trains a classifier on the two feeds after removing the 30 most frequent words (mostly stop words); see the sketch below.
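A sketch of both, following the book: entry summaries serve as document text, 20 posts are held out at random for testing, and bagOfWord2VecMN() supplies the features. It relies on the helpers defined above:

```python
import random
import numpy as np

def calcMostFreq(vocabList, fullText):
    """Return the 30 most frequent vocabulary words in fullText."""
    freqDict = {token: fullText.count(token) for token in vocabList}
    return sorted(freqDict.items(), key=lambda kv: kv[1], reverse=True)[:30]

def localWords(feed1, feed0):
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList); fullText.extend(wordList); classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList); fullText.extend(wordList); classList.append(0)
    vocabList = createVocabList(docList)
    for pairW in calcMostFreq(vocabList, fullText):   # drop the top 30 words
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])
    trainingSet = list(range(2 * minLen)); testSet = []
    for _ in range(20):                               # random hold-out test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = [bagOfWord2VecMN(vocabList, docList[i]) for i in trainingSet]
    trainClasses = [classList[i] for i in trainingSet]
    p0V, p1V, pClass1 = trainNB02(trainMat, trainClasses)
    errorCount = 0
    for docIndex in testSet:
        wordVector = np.array(bagOfWord2VecMN(vocabList, docList[docIndex]))
        if classifyNB(wordVector, p0V, p1V, pClass1) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', errorCount / len(testSet))
    return vocabList, p0V, p1V
```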
```python
a, b, c = localWords(ny, sf)
```
the error rate is: 0.25
Analyzing the data: displaying region-specific words
getTopWords() retrains via localWords() and prints, for each city, the words whose log-probability exceeds a threshold, sorted from most to least probable; a sketch follows.
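A sketch in the book's style; the -6.0 threshold on log P(w|c) is the book's choice and can be adjusted:

```python
def getTopWords(ny, sf):
    """Print the most telling words for each feed's class."""
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p1V[i] > -6.0:                  # class 1 = NY feed
            topNY.append((vocabList[i], p1V[i]))
        if p0V[i] > -6.0:                  # class 0 = SF feed
            topSF.append((vocabList[i], p0V[i]))
    print('NY' + '-' * 12)
    for word, _ in sorted(topNY, key=lambda pair: pair[1], reverse=True):
        print(word)
    print('SF' + '-' * 12)
    for word, _ in sorted(topSF, key=lambda pair: pair[1], reverse=True):
        print(word)
```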
```python
getTopWords(ny, sf)
```
the error rate is: 0.15
NY------------
enjoyable
asian
hanging
shot
95748
technique
ucsc
languages
SF------------
all
focus
month
enjoyable
doujins
hanging
shot
certainly