Naive Bayes

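  • Building the vocabulary
    Add the set of new words found in each document to the vocabulary.
  • Set-of-words model
    Use the vocabulary to convert a document into a document vector; each element of the vector indicates whether the corresponding vocabulary word appears in the input document.
  • Bag-of-words model
    Each element of the document vector is the number of times the corresponding vocabulary word appears in the document.
  • Splitting text with a regular expression
    Capture all words, drop strings shorter than two letters, and convert every string to lowercase.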
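
The vocabulary and the two document-vector models above can be sketched roughly as follows. This is a minimal illustration, not the book's code: the function names, the toy documents, and the choice to sort the vocabulary (so that vector positions are reproducible) are all assumptions made for the example.

```python
def create_vocab_list(documents):
    """Collect every distinct word seen across the tokenized documents."""
    vocab = set()
    for doc in documents:
        vocab |= set(doc)  # add the words that are not in the vocabulary yet
    return sorted(vocab)  # assumption: sorted so vector positions are stable


def set_of_words_to_vec(vocab_list, doc):
    """Set-of-words: 1 if the vocabulary word occurs in the document, else 0."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocab_list]


def bag_of_words_to_vec(vocab_list, doc):
    """Bag-of-words: how many times each vocabulary word occurs in the document."""
    index = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in doc:
        if word in index:  # words outside the vocabulary are skipped
            vec[index[word]] += 1
    return vec


# Toy documents, already tokenized (hypothetical data).
docs = [["my", "dog", "has", "flea", "problems"], ["stop", "my", "dog", "dog"]]
vocab = create_vocab_list(docs)
set_vec = set_of_words_to_vec(vocab, docs[1])
bag_vec = bag_of_words_to_vec(vocab, docs[1])
```

The two vectors differ only where a word repeats: "dog" appears twice in the second document, so its bag-of-words entry is 2 while its set-of-words entry stays 1.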

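P.S. The book uses the regular expression r'\W*', which can match an empty string and therefore triggers the warning "split() requires a non-empty pattern match" at run time. The official documentation says: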

Note: split() doesn’t currently split a string on an empty pattern match. For example:

Even though 'x*' also matches 0 'x' before 'a', between 'b' and 'c', and after 'c', currently these matches are ignored. The correct behavior (i.e. splitting on empty matches too and returning ['', 'a', 'b', 'c', '']) will be implemented in future versions of Python, but since this is a backward incompatible change, a FutureWarning will be raised in the meanwhile.
Patterns that can only match empty strings currently never split the string. Since this doesn’t match the expected behavior, a ValueError will be raised starting from Python 3.5:
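
For the tokenizer described in these notes, one way to sidestep the warning is to split on r'\W+' instead, since `+` requires at least one non-word character and so never produces an empty match. The sketch below is an assumption about how the fix could look, not the book's code; the function name `text_parse` and the `min_len` parameter are illustrative, with the default of 2 following the "shorter than two letters" wording above.

```python
import re


def text_parse(text, min_len=2):
    """Split on runs of non-word characters, drop short tokens, lowercase the rest."""
    # '+' needs at least one character, so the pattern cannot match an empty string.
    tokens = re.split(r"\W+", text)
    # The length filter also discards the empty strings left at the edges.
    return [tok.lower() for tok in tokens if len(tok) >= min_len]


words = text_parse("Hi, I am a M.L. fan!")
```

Raising `min_len` filters out more short tokens if they turn out to add noise.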

Training and classification functions

  • Training function

  • Classification function
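
The notes list the two functions without details, so the following is only a sketch of what they might look like for a two-class problem. It assumes add-one (Laplace) smoothing and log-probabilities, neither of which is spelled out above, and it assumes both classes occur in the training data. The function names and the toy vocabulary are hypothetical.

```python
import math


def train_nb(train_matrix, labels):
    """Estimate log P(word | class) for both classes and P(class = 1)."""
    num_words = len(train_matrix[0])
    p_class1 = sum(labels) / len(labels)
    # Add-one smoothing (an assumption): counts start at 1 so that a word
    # never seen in a class does not lead to log(0).
    counts = {0: [1.0] * num_words, 1: [1.0] * num_words}
    totals = {0: float(num_words), 1: float(num_words)}
    for vec, label in zip(train_matrix, labels):
        for i, value in enumerate(vec):
            counts[label][i] += value
        totals[label] += sum(vec)
    log_p0 = [math.log(c / totals[0]) for c in counts[0]]
    log_p1 = [math.log(c / totals[1]) for c in counts[1]]
    return log_p0, log_p1, p_class1


def classify_nb(vec, log_p0, log_p1, p_class1):
    """Return the class whose log posterior (up to a shared constant) is larger."""
    score1 = sum(v * lp for v, lp in zip(vec, log_p1)) + math.log(p_class1)
    score0 = sum(v * lp for v, lp in zip(vec, log_p0)) + math.log(1.0 - p_class1)
    return 1 if score1 > score0 else 0


# Toy vocabulary (hypothetical): ["cheap", "meeting", "offer", "report"]
train_matrix = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham
model = train_nb(train_matrix, labels)
prediction = classify_nb([1, 0, 1, 0], *model)
```

Summing logarithms instead of multiplying raw probabilities avoids underflow when a document contains many words.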

Spam email test function
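
The notes do not describe how the test is carried out. One common approach, assumed here, is a random hold-out split: set aside a few documents, train on the rest, and report the error rate on the held-out ones. The helper below is a generic sketch; `holdout_error_rate`, `train_fn`, `predict_fn`, and the stub classifier are hypothetical names that stand in for the real training and classification functions.

```python
import random


def holdout_error_rate(samples, labels, train_fn, predict_fn, test_size, seed=0):
    """Train on a random subset and return the error rate on the held-out rest."""
    if not 0 < test_size < len(samples):
        raise ValueError("test_size must be between 1 and len(samples) - 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    model = train_fn([samples[i] for i in train_idx],
                     [labels[i] for i in train_idx])
    errors = sum(predict_fn(model, samples[i]) != labels[i] for i in test_idx)
    return errors / test_size


# Stand-in classifier for the demo: predicts the first feature as the label.
def train_stub(train_samples, train_labels):
    return None  # nothing to learn in this stub


def predict_stub(model, sample):
    return sample[0]


samples = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
labels = [1, 0, 1, 0, 1, 0]
rate = holdout_error_rate(samples, labels, train_stub, predict_stub, test_size=2)
```

Because the split is random, averaging the error rate over several seeds gives a steadier estimate than a single run.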

P.S. The officially provided test data contains invalid characters.
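
The note does not say which files or characters are affected. A possible workaround, sketched here under the assumption that the files are mostly UTF-8 with a few stray bytes, is to read the raw bytes and drop whatever cannot be decoded. The helper name `read_text_lenient` and the demo file are hypothetical.

```python
import os
import tempfile


def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, silently dropping bytes that cannot be decoded."""
    with open(path, "rb") as handle:
        raw = handle.read()
    return raw.decode(encoding, errors="ignore")


# Demo with a temporary file that contains one byte that is invalid in UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"Hello \x92world")
    demo_path = tmp.name
text = read_text_lenient(demo_path)
os.remove(demo_path)
```

Dropping bytes loses information, so fixing the offending file by hand may be preferable when only one or two files are affected.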
