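AI: Processing Text with Deep Learning, Week 4 (Beginner)

The Foundation of NLP: Word Segmentation

What Is Word Segmentation

Word segmentation is the process of splitting a continuous text sequence into meaningful, independent word units.

Why Word Segmentation Matters

In an NLP pipeline, word segmentation is usually the first step of text preprocessing. Downstream tasks such as part-of-speech tagging, syntactic parsing, sentiment analysis, and machine translation all build on correct segmentation. If segmentation goes wrong, it is like constructing a building on a poorly laid foundation: everything that follows is prone to error. This is why word segmentation is often called the "cornerstone of NLP".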

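Why Study Word Segmentation

•1. Word segmentation has been studied for a long time; following the evolution of its algorithms traces the history of NLP research

•2. Word segmentation is representative of a whole class of NLP problems

•3. Word segmentation is widely used; many NLP tasks are built on top of it

Difficulties of Word Segmentation

• Ambiguous segmentation

• New words, proper nouns, and modified (coined) words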

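Main Segmentation Methods

Dictionary-Based Matching

The core idea: given a large dictionary (vocabulary), match the sentence to be segmented against the words in it, and cut out every span that matches.

Forward Maximum Matching

Scan from left to right, always matching the longest word in the dictionary.

Implementation 1:

•1. Find the maximum word length in the vocabulary

•2. Starting at the beginning of the string, take a window of the maximum word length and check whether the text in the window is in the vocabulary

•3. If it is, cut at the word boundary, move to that boundary, and repeat step 2

•4. If it is not, move the right edge of the window back by one character and check again; a window that has shrunk to a single character is cut off as a word on its own

Segmentation process

• 北京大学生前来报到

Code implementation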



# Segmentation method: forward maximum matching, implementation 1

import re
import time

# Load the dictionary
def load_word_dict(path):
    max_word_length = 0
    word_dict = {}  # a set works too; a list would be very slow
    with open(path, encoding="utf8") as f:
        for line in f:
            word = line.split()[0]
            word_dict[word] = 0
            max_word_length = max(max_word_length, len(word))
    return word_dict, max_word_length
 
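# First determine the maximum word length
# Search from the longest candidate down to the shortest for a match
# Once a word is found, move the window forward
def cut_method1(string, word_dict, max_len):
    words = []
    while string != '':
        lens = min(max_len, len(string))
        word = string[:lens]
        while word not in word_dict:
            if len(word) == 1:
                break  # a single character is emitted even if it is not in the dictionary
            word = word[:len(word) - 1]
        words.append(word)
        string = string[len(word):]
    return words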
 
# cut_method is the segmentation function
# output_path is the output path
def main(cut_method, input_path, output_path):
    word_dict, max_word_length = load_word_dict("dict.txt")
    writer = open(output_path, "w", encoding="utf8")
    start_time = time.time()
    with open(input_path, encoding="utf8") as f:
        for line in f:
            words = cut_method(line.strip(), word_dict, max_word_length)
            writer.write(" / ".join(words) + "\n")
    writer.close()
    print("耗时：", time.time() - start_time)  # 耗时 = elapsed time
    return


string = "测试字符串"
word_dict, max_len = load_word_dict("dict.txt")
# print(cut_method1(string, word_dict, max_len))

main(cut_method1, "corpus.txt", "cut_method1_output.txt")

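Implementation 2: using a prefix dictionary

•1. Scan from front to back

•2. If the text in the window is a prefix of some word, keep widening the window

•3. If the text in the window is not a prefix of any word, record the longest word found so far and move the window to that word's boundary

Vocabulary:

•北京

•北京大学

•北京大学生

•大学生

0 means the string is not a word itself but is a prefix of a word

1 means the string is a word

{
  "北": 0,
  "北京": 1,
  "北京大": 0,
  "北京大学": 1,
  "北京大学生": 1,
  "大": 0,
  "大学": 0,
  "大学生": 1
}

Segmentation process:

• 北京大学生前来报到

Vocabulary:

北京

北京大学

生前

报道

Code implementation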



# Segmentation method: forward maximum matching, implementation 2

import re
import time
import json

# Load the prefix dictionary
# 0 marks a prefix, 1 marks a real word
# Some prefixes are also real words, so take care not to overwrite them
def load_prefix_word_dict(path):
    prefix_dict = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            word = line.split()[0]
            for i in range(1, len(word)):
                if word[:i] not in prefix_dict:  # never overwrite a word with a prefix
                    prefix_dict[word[:i]] = 0  # prefix
            prefix_dict[word] = 1  # word
    return prefix_dict
 
 
# Takes a string and the prefix dictionary, returns the list of words
def cut_method2(string, prefix_dict):
    if string == "":
        return []
    words = []  # holds the segmented words
    start_index, end_index = 0, 1  # start and end of the current window
    window = string[start_index:end_index]  # start from the first character
    find_word = window  # treat the first character as the default word
    while start_index < len(string):
        # the window is not in the dictionary (or has run past the end of the string)
        if window not in prefix_dict or end_index > len(string):
            words.append(find_word)  # record the word found so far
            start_index += len(find_word)  # move the start to the word boundary
            end_index = start_index + 1
            window = string[start_index:end_index]  # restart character by character from the new position
            find_word = window
        # the window is a word
        elif prefix_dict[window] == 1:
            find_word = window  # found a word, but keep looking for a longer one
            end_index += 1
            window = string[start_index:end_index]
        # the window is a prefix
        elif prefix_dict[window] == 0:
            end_index += 1
            window = string[start_index:end_index]
    # if the final window is not a word in the dictionary, add its characters one by one
    if prefix_dict.get(window) != 1:
        words += list(window)
    else:
        words.append(window)
    return words
 
 
# cut_method is the segmentation function
# output_path is the output path
def main(cut_method, input_path, output_path):
    word_dict = load_prefix_word_dict("dict.txt")
    writer = open(output_path, "w", encoding="utf8")
    start_time = time.time()
    with open(input_path, encoding="utf8") as f:
        for line in f:
            words = cut_method(line.strip(), word_dict)
            writer.write(" / ".join(words) + "\n")
    writer.close()
    print("耗时：", time.time() - start_time)  # 耗时 = elapsed time
    return


string = "王羲之草书《平安帖》共有九行"
# string = "你到很多有钱人家里去看"
# string = "金鹏期货北京海鹰路营业部总经理陈旭指出"
# string = "伴随着优雅的西洋乐"
# string = "非常的幸运"
prefix_dict = load_prefix_word_dict("dict.txt")
# print(cut_method2(string, prefix_dict))
# print(json.dumps(prefix_dict, ensure_ascii=False, indent=2))
main(cut_method2, "corpus.txt", "cut_method2_output.txt")

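Backward (Reverse) Maximum Matching

Scan from right to left, always matching the longest word in the dictionary. The principle is the same as forward matching.

A quick comparison

Vocabulary: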

北京

北京大学

大学生

前来

生前

报到

友好

的哥

哥谭

市民

北京大学生前来报到

Forward matching:     北京大学 / 生前 / 来 / 报到

Backward matching:    北京 / 大学生 / 前来 / 报到

友好的哥谭市民

Forward matching:     友好 / 的哥 / 谭 / 市民

Backward matching:    友好 / 的 / 哥谭 / 市民
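The text gives code for forward matching only, so here is a minimal sketch of backward maximum matching that mirrors the first forward implementation. The function name `backward_max_match` is introduced for illustration and the vocabulary is the one from the comparison above, held in a Python set.

```python
def backward_max_match(text, vocab):
    """Backward maximum matching: scan right to left, longest word first."""
    max_len = max(len(w) for w in vocab)
    words = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the window from the left until it matches a word
        # or only one character is left.
        while end - start > 1 and text[start:end] not in vocab:
            start += 1
        words.append(text[start:end])
        end = start
    words.reverse()  # words were collected from right to left
    return words


vocab = {"北京", "北京大学", "大学生", "前来", "生前",
         "报到", "友好", "的哥", "哥谭", "市民"}
print(" / ".join(backward_max_match("北京大学生前来报到", vocab)))
# 北京 / 大学生 / 前来 / 报到
print(" / ".join(backward_max_match("友好的哥谭市民", vocab)))
# 友好 / 的 / 哥谭 / 市民
```

Both outputs agree with the backward results listed in the comparison.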

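Bidirectional Maximum Matching

Run forward maximum matching and backward maximum matching on the same sentence, then compare the two results to decide which segmentation to keep.

How to compare?

1. Single-character words

The vocabulary may contain single characters; from a segmentation point of view they also count as words. Fewer single-character words usually indicates a better segmentation.

2. Out-of-dictionary words

Words that never appear in the vocabulary usually end up split into single characters. Fewer is better.

3. Total word count

Different segmentations can produce different numbers of words. Fewer words is generally preferred.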
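The comparison can be sketched as a small scoring function. The priority order used here (out-of-dictionary words first, then single-character words, then total word count) and the tie-break in favor of the backward result are assumptions of this sketch; the text lists the three criteria without ranking them. The names `segmentation_cost` and `choose_segmentation` are hypothetical.

```python
def segmentation_cost(words, vocab):
    """Lower is better: (out-of-dictionary words, single characters, total words)."""
    out_of_vocab = sum(1 for w in words if w not in vocab)
    single_chars = sum(1 for w in words if len(w) == 1)
    return (out_of_vocab, single_chars, len(words))


def choose_segmentation(forward, backward, vocab):
    """Pick the cheaper segmentation; ties go to the backward result (assumption)."""
    if segmentation_cost(forward, vocab) < segmentation_cost(backward, vocab):
        return forward
    return backward


vocab = {"北京", "北京大学", "大学生", "前来", "生前", "报到"}
forward = ["北京大学", "生前", "来", "报到"]
backward = ["北京", "大学生", "前来", "报到"]
print(choose_segmentation(forward, backward, vocab))
# ['北京', '大学生', '前来', '报到']
```

Here the forward result contains the out-of-dictionary single character "来", so the backward result wins.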

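Pros and Cons of the Three Methods Above

Pros: simple and fast; works very well for words that are in the dictionary.

Cons: heavily dependent on dictionary quality (for example, whether new words have been added); cannot resolve ambiguity; poor at recognizing out-of-vocabulary words (new words, personal names, place names, and so on).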

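Statistics-Based Methods

As machine learning developed, people began to use statistical information for segmentation. The core idea: the more often adjacent characters occur together, the more likely they are to form a word.

Main algorithms: hidden Markov models (HMM), conditional random fields (CRF), and others.

Core concepts: statistics such as mutual information and information entropy are used to judge how tightly characters bind to one another.

Mutual information: measures whether a fragment made of two parts is a stable collocation. For example, "自然" and "语言" often appear together, so their mutual information is high and "自然语言" is likely to be a word.

Information entropy: measures how uncertain the left and right context of a character is. If a character (such as "蜘") is almost always followed by one specific character (such as "蛛"), its right entropy is very low, which suggests that "蜘蛛" is likely to be a word.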
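For reference, the two statistics can be written as follows. This is one common formulation, given here as an assumption because the text does not fix a notation: p(·) denotes relative frequencies estimated from a corpus, xy is the string formed by x followed by y, and c ranges over the characters observed immediately to the right of w.

```latex
\mathrm{PMI}(x, y) = \log \frac{p(xy)}{p(x)\,p(y)},
\qquad
H_{\mathrm{right}}(w) = -\sum_{c} p(c \mid w)\,\log p(c \mid w)
```

A large PMI means x and y co-occur far more often than chance would predict. A right entropy near zero means the character after w is almost always the same one.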

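Pros and cons:

Pros: can discover new words; handles out-of-vocabulary words better than dictionary-based methods.

Cons: supervised models such as HMM and CRF need a large amount of manually annotated training data; the computational cost is relatively high.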

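Deep-Learning-Based Methods

This is the current mainstream and frontier direction. It treats segmentation as a sequence labeling problem.

Core idea: assign a tag to every character in the sentence, then cut according to the tag sequence.

Common tagging schemes:

BMES: B (word beginning), M (word middle), E (word end), S (single-character word)

For example, "自然语言处理" is tagged [B, E, B, E, B, E] ("自然" is B-E, "语言" is B-E, "处理" is B-E).

BIOES: B (beginning), I (inside), E (end), S (single-character unit), O (outside: in tasks such as proper-name tagging, a character that belongs to no labeled span)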
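To make the BMES scheme concrete, here is a minimal sketch that converts an already segmented sentence into tags. The function name `words_to_bmes` is introduced only for this illustration.

```python
def words_to_bmes(words):
    """Convert a list of words into one BMES tag per character."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


print(words_to_bmes(["自然", "语言", "处理"]))
# ['B', 'E', 'B', 'E', 'B', 'E']
print(words_to_bmes(["我", "来到", "清华大学"]))
# ['S', 'B', 'E', 'B', 'M', 'M', 'E']
```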

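Model architecture:

Input layer: converts each character into a character vector.

Feature extraction layer: uses a Bi-LSTM, or a pretrained model such as BERT, to capture context. A Bi-LSTM sees the context both to the left and to the right of a character, which is crucial for resolving ambiguity.

Tag prediction layer: uses a CRF to constrain the final tags so that the output sequence is legal (for example, B cannot be followed by S).

Pipeline: input sentence -> character vectors -> Bi-LSTM/BERT -> CRF -> output tag sequence -> merge characters into words according to the tags.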
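The last two stages of the pipeline can be illustrated without any neural network. The sketch below checks whether a BMES sequence is legal (the kind of constraint a CRF layer learns to respect) and merges characters into words. The transition table and both function names are assumptions made for this example; they are not part of any library.

```python
# Which tag may follow which in a legal BMES sequence.
ALLOWED_NEXT = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}


def is_valid_bmes(tags):
    """True if the sequence starts, continues, and ends legally."""
    if not tags:
        return True
    if tags[0] not in ("B", "S") or tags[-1] not in ("E", "S"):
        return False
    return all(nxt in ALLOWED_NEXT.get(cur, set())
               for cur, nxt in zip(tags, tags[1:]))


def bmes_to_words(sentence, tags):
    """Cut the sentence after every character tagged E or S."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):
            words.append(sentence[start:i + 1])
            start = i + 1
    if start < len(sentence):
        words.append(sentence[start:])  # leftover characters with no closing tag
    return words


print(is_valid_bmes(list("BEBEBE")))  # True
print(is_valid_bmes(list("BS")))      # False: B cannot be followed by S
print(bmes_to_words("自然语言处理", list("BEBEBE")))
# ['自然', '语言', '处理']
```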

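Pros and cons:

Pros: high accuracy; makes good use of context to handle ambiguity and new words; generally the best-performing approach at present.

Cons: the models are complex and need large amounts of training data and compute; they are a "black box" with poor interpretability.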

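Common Tools

The most popular Chinese word segmentation tool in Python is Jieba.

Supported Modes

Jieba supports three segmentation modes

Accurate mode

Tries to cut the sentence as precisely as possible; suited to text analysis

Full mode

Scans out every span in the sentence that can form a word; very fast, but it cannot resolve ambiguity

Search engine mode

Builds on accurate mode by splitting long words again to improve recall; suited to search-engine tokenization

Using jieba

The commonly used segmentation functions are

jieba.cut

jieba.cut_for_search

jieba.cut

Accepts the string to segment (sentence), the cut_all parameter that controls whether full mode is used, and the HMM parameter that controls whether the HMM model is used

jieba.cut_for_search

Accepts the string to segment and the HMM parameter that controls whether the HMM model is used; it is suited to segmentation for building a search engine's inverted index, and its granularity is finer

Both functions return an iterable generator. You can loop over it with for to get each segmented word (a str), or convert it to a list with list().

jieba.lcut and jieba.lcut_for_search return a list directly.

Code example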



# encoding=utf-8
import jieba

# The string to segment
sentence = "我来到北京清华大学"

# Accurate mode (default)
seg_list_default = jieba.cut(sentence, cut_all=False)
print("精确模式: " + "/ ".join(seg_list_default))

# Full mode
seg_list_all = jieba.cut(sentence, cut_all=True)
print("全模式: " + "/ ".join(seg_list_all))

# Search engine mode
seg_list_search = jieba.cut_for_search(sentence)
print("搜索引擎模式: " + "/ ".join(seg_list_search))



# Output (精确模式 = accurate mode, 全模式 = full mode, 搜索引擎模式 = search engine mode)
'''
精确模式: 我/ 来到/ 北京/ 清华大学
全模式: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
搜索引擎模式: 我/ 来到/ 北京/ 清华/ 华大/ 大学/ 清华大学
'''

Building a word segmentation model with PyTorch and jieba



#coding:utf8

import torch
import torch.nn as nn
import jieba
import numpy as np
import random
import json
from torch.utils.data import DataLoader

"""
A word segmentation model built on a PyTorch network.
jieba's segmentation output is used as the training data,
to see whether a neural network can reach comparable quality.
"""

class TorchModel(nn.Module):
    def __init__(self, input_dim, hidden_size, num_rnn_layers, vocab):
        super(TorchModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab) + 1, input_dim, padding_idx=0) #shape=(vocab_size, dim)
        self.rnn_layer = nn.RNN(input_size=input_dim,
                            hidden_size=hidden_size,
                            batch_first=True,
                            num_layers=num_rnn_layers,
                            )
        self.classify = nn.Linear(hidden_size, 2)  # weight shape: hidden_size x 2
        self.loss_func = nn.CrossEntropyLoss(ignore_index=-100)

    # Returns the loss when true labels are given; otherwise returns the predictions
    def forward(self, x, y=None):
        x = self.embedding(x)  #input shape: (batch_size, sen_len), output shape:(batch_size, sen_len, input_dim)
        x, _ = self.rnn_layer(x)  #output shape:(batch_size, sen_len, hidden_size)
        y_pred = self.classify(x)   #output shape:(batch_size, sen_len, 2) -> y_pred.view(-1, 2) (batch_size*sen_len, 2)
        if y is not None:
            #cross entropy
            #y_pred : n, class_num    [[0.1,0.9], [0.7,0.3]]
            #y      : n               [ 0,         1       ]

            #y:batch_size, sen_len  = 2 * 5
            #[[0,0,1,0,1],[0,1,1, -100, -100]]  y
            #[0,0,1,0,1,  0,1,1,-100,-100]    y.view(-1) shape= n = batch_size*sen_len

            return self.loss_func(y_pred.view(-1, 2), y.view(-1))
        else:
            return y_pred
 
class Dataset:
    def __init__(self, corpus_path, vocab, max_length):
        self.vocab = vocab
        self.corpus_path = corpus_path
        self.max_length = max_length
        self.load()
 
    def load(self):
        self.data = []
        with open(self.corpus_path, encoding="utf8") as f:
            for line in f:
                sequence = sentence_to_sequence(line, self.vocab)
                label = sequence_to_label(line)
                sequence, label = self.padding(sequence, label)
                sequence = torch.LongTensor(sequence)
                label = torch.LongTensor(label)
                self.data.append([sequence, label])
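                # Only part of the data is used for this demo; training on all of it takes correspondingly longer
                if len(self.data) > 10000:
                    break

    # Truncate or pad the text to a fixed length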
    def padding(self, sequence, label):
        sequence = sequence[:self.max_length]
        sequence += [0] * (self.max_length - len(sequence))
        label = label[:self.max_length]
        label += [-100] * (self.max_length - len(label))
        return sequence, label
 
    def __len__(self):
        return len(self.data)
 
    def __getitem__(self, item):
        return self.data[item]
 
# Convert text into a sequence of integer ids, ready for the embedding layer
def sentence_to_sequence(sentence, vocab):
    sequence = [vocab.get(char, vocab['unk']) for char in sentence]
    return sequence

# Build labels from jieba's segmentation: 1 marks the last character of a word, 0 everything else
def sequence_to_label(sentence):
    words = jieba.lcut(sentence)
    label = [0] * len(sentence)
    pointer = 0
    for word in words:
        pointer += len(word)
        label[pointer - 1] = 1
    return label

# Load the character table
def build_vocab(vocab_path):
    vocab = {}
    with open(vocab_path, "r", encoding="utf8") as f:
        for index, line in enumerate(f):
            char = line.strip()
            vocab[char] = index + 1   # each character gets an id; 0 is reserved for padding
    vocab['unk'] = len(vocab) + 1
    return vocab

# Build the dataset
def build_dataset(corpus_path, vocab, max_length, batch_size):
    dataset = Dataset(corpus_path, vocab, max_length) #diy __len__ __getitem__
    data_loader = DataLoader(dataset, shuffle=True, batch_size=batch_size) #torch
    return data_loader
 
 
def main():
    epoch_num = 5        # number of training epochs
    batch_size = 20       # samples per training step
    char_dim = 50         # dimension of each character vector
    hidden_size = 100     # hidden layer dimension
    num_rnn_layers = 1    # number of rnn layers
    max_length = 20       # maximum sample length
    learning_rate = 1e-3  # learning rate
    vocab_path = "chars.txt"  # path to the character table
    corpus_path = "../corpus.txt"  # path to the corpus
    vocab = build_vocab(vocab_path)       # build the character table
    data_loader = build_dataset(corpus_path, vocab, max_length, batch_size)  # build the dataset
    model = TorchModel(char_dim, hidden_size, num_rnn_layers, vocab)   # build the model
    optim = torch.optim.Adam(model.parameters(), lr=learning_rate)     # build the optimizer
    # start training
    for epoch in range(epoch_num):
        model.train()
        watch_loss = []
        for x, y in data_loader:
            optim.zero_grad()    # reset gradients
            loss = model.forward(x, y)   # compute the loss
            loss.backward()      # compute gradients
            optim.step()         # update weights
            watch_loss.append(loss.item())
        # prints the average loss of this epoch
        print("=========\n第%d轮平均loss:%f" % (epoch + 1, np.mean(watch_loss)))
    # save the model
    torch.save(model.state_dict(), "model.pth")
    return
 
# Final prediction
def predict(model_path, vocab_path, input_strings):
    # keep the configuration identical to training
    char_dim = 50  # dimension of each character vector
    hidden_size = 100  # hidden layer dimension
    num_rnn_layers = 1  # number of rnn layers
    vocab = build_vocab(vocab_path)       # build the character table
    model = TorchModel(char_dim, hidden_size, num_rnn_layers, vocab)   # build the model
    model.load_state_dict(torch.load(model_path))   # load the trained weights
    model.eval()
    for input_string in input_strings:
        # predict one sentence at a time
        x = sentence_to_sequence(input_string, vocab)
        with torch.no_grad():
            result = model.forward(torch.LongTensor([x]))[0]
            result = torch.argmax(result, dim=-1)  # the predicted 0/1 sequence
            # cut wherever the prediction is 1 and print the segmented text
            for index, p in enumerate(result):
                if p == 1:
                    print(input_string[index], end=" ")
                else:
                    print(input_string[index], end="")
            print()


if __name__ == "__main__":
    main()
    input_strings = ["同时，国内有望出台，新汽车刺激方案",
                     "沪胶后市有望延续强势！",
                     "经过两个交易日的强势调整后",
                     "昨日上海天然橡胶期货价格再度大幅上扬"]

    predict("model.pth", "chars.txt", input_strings)

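New Word Discovery: Teaching Machines to "Create" Vocabulary

The Core Problem

Imagine you are an alien learning Chinese with no dictionary; all you can do is read a large number of articles and "guess" which character combinations form words.

Why do this?

Language is alive: new words appear every year (for example "内卷" and "元宇宙")

A fixed vocabulary goes out of date, which hurts downstream tasks

Two Key Metrics

1. Internal cohesion (tightness)

Intuition: a few characters that always appear together, like inseparable friends

Examples:

"巧克力": these three characters almost always appear together → high cohesion

"吃了饭" is also common, but "了饭" rarely stands as a unit by itself → relatively low cohesion

2. Left and right entropy (flexibility)

Intuition: a word that can be used in many different contexts, like a social butterfly who befriends everyone

Examples:

"吃" can be preceded by "我", "你", or "他" and followed by "了", "饭", or "面条" → high left and right entropy

"有限公司" usually appears only in company names → low left and right entropy

Condition for a new word: it must be tight on the inside and flexible on the outside!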
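A minimal sketch of how the two metrics could be computed by counting n-grams. The toy corpus, the function names, the choice of the weakest split as the cohesion score, and the use of the smaller of the left and right entropy are all assumptions made for illustration; a real system would use a large corpus and tuned thresholds.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a neighbor-count distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def score_candidates(corpus, max_len=3):
    """Return {ngram: (cohesion, flexibility)} for every n-gram of length >= 2."""
    counts = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for text in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                counts[gram] += 1
                if i > 0:
                    left[gram][text[i - 1]] += 1   # character just before
                if i + n < len(text):
                    right[gram][text[i + n]] += 1  # character just after
    total_chars = sum(c for g, c in counts.items() if len(g) == 1)

    def prob(gram):
        return counts[gram] / total_chars

    scores = {}
    for gram in counts:
        if len(gram) < 2:
            continue
        # cohesion: p(whole) / (p(part1) * p(part2)) at the weakest split point
        cohesion = min(prob(gram) / (prob(gram[:k]) * prob(gram[k:]))
                       for k in range(1, len(gram)))
        flexibility = min(entropy(left[gram]), entropy(right[gram]))
        scores[gram] = (cohesion, flexibility)
    return scores


corpus = ["我吃巧克力", "他买巧克力了", "巧克力很甜", "我吃饭", "你吃面"]
scores = score_candidates(corpus)
print(scores["巧克力"])  # high cohesion, non-zero entropy on both sides
print(scores["吃巧"])    # lower cohesion, zero entropy
```

In this toy corpus "巧克力" scores higher than "吃巧" on both metrics, which matches the rule that a new word must be tight inside and flexible outside.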

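TF-IDF

Core Idea

TF-IDF works like a "keyword detector" that automatically finds the most representative words in each article.

Two Components

1. TF (term frequency): importance within this article

Formula: TF = number of times the word appears in this document / total number of words in this document

Intuition: in an article about Apple Inc., the word "苹果" certainly appears many times → high TF

2. IDF (inverse document frequency): distinctiveness across the whole collection

Formula: IDF = log(total number of documents / number of documents containing the word)

Intuition:

Words like "的" and "是" appear in almost every article → very low IDF

A technical word like "黑洞" appears only in a few astronomy articles → very high IDF

TF-IDF = TF × IDF

Net effect: it selects words that occur frequently in this document but rarely in the others

Suppose 4 documents discuss fruit:

The word "苹果" appears in many of the documents → IDF is not high

The word "爱" appears in only 1 document → IDF is very high
→ Even though "苹果" and "爱" have the same TF in some document, "爱" has the higher TF-IDF and better captures what is distinctive about that document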

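Suppose we have a mini corpus of 3 documents:

Document 1: 我爱吃苹果和香蕉。

Document 2: 苹果是一种水果。

Document 3: 香蕉和苹果都很好吃。

Now we want to compute the TF-IDF value of the word "苹果" in Document 1.

1. Compute TF("苹果", Document 1)

Total words in Document 1: 6 (我, 爱, 吃, 苹果, 和, 香蕉)

Occurrences of "苹果": 1

TF = 1 / 6 ≈ 0.167

2. Compute IDF("苹果", D)

Total number of documents in the collection: N = 3

Number of documents containing "苹果": all 3 documents contain it.

IDF = log(3 / (3 + 1)) = log(3/4) = log(0.75) ≈ -0.288 (natural logarithm; 1 is added to the document frequency in the denominator as smoothing)

3. Compute TF-IDF

TF-IDF = 0.167 * (-0.288) ≈ -0.048

Wait, why is the value negative? The cause is the +1 smoothing in the denominator: when a word appears in every document, N / (1 + df(t)) drops below 1, so its logarithm is negative whatever the base. In practice, to keep values non-negative, IDF is sometimes written as log((1 + N) / (1 + df(t))) + 1 or in other smoothed forms, but the core idea is unchanged. What we care about is the relative ordering.

Now let us compute a more discriminative word, such as "爱".

1. Compute TF("爱", Document 1)

TF = 1 / 6 ≈ 0.167

2. Compute IDF("爱", D)

Number of documents containing "爱": only Document 1.

IDF = log(3 / (1 + 1)) = log(3/2) = log(1.5) ≈ 0.405

3. Compute TF-IDF

TF-IDF = 0.167 * 0.405 ≈ 0.068

Compare the two:

TF-IDF of "苹果" ≈ -0.048

TF-IDF of "爱" ≈ 0.068

Although "苹果" and "爱" occur equally often in Document 1, "爱" appears only once in the whole corpus, so its IDF is higher, its final TF-IDF weight is higher, and it better represents what is unique about Document 1.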
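The arithmetic of this worked example can be checked with a few lines of Python. The documents are tokenized by hand here (an assumption, punctuation dropped) so the sketch does not depend on any segmenter, and the IDF uses the same smoothed form as the example, log(N / (1 + df)), with the natural logarithm.

```python
import math

# Hand-tokenized versions of the three documents (assumed tokenization).
docs = [
    ["我", "爱", "吃", "苹果", "和", "香蕉"],
    ["苹果", "是", "一种", "水果"],
    ["香蕉", "和", "苹果", "都", "很", "好吃"],
]


def tf(word, doc):
    """Term frequency: occurrences of the word divided by document length."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """Smoothed inverse document frequency: log(N / (1 + df))."""
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (1 + df))


def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


print(round(tf_idf("苹果", docs[0], docs), 3))  # -0.048
print(round(tf_idf("爱", docs[0], docs), 3))    # 0.068
```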

Code example: computing TF-IDF



import jieba
import math
import os
import json
from collections import defaultdict

"""
Computing and using tfidf
"""

# Collect the tf and idf statistics
def build_tf_idf_dict(corpus):
    tf_dict = defaultdict(dict)  # key: document index, value: dict of word -> count in that document
    idf_dict = defaultdict(set)  # key: word, value: set of document indexes, used to count how many documents contain the word
    for text_index, text_words in enumerate(corpus):
        for word in text_words:
            if word not in tf_dict[text_index]:
                tf_dict[text_index][word] = 0
            tf_dict[text_index][word] += 1
            idf_dict[word].add(text_index)
    idf_dict = dict([(key, len(value)) for key, value in idf_dict.items()])
    return tf_dict, idf_dict

# Compute tfidf from the tf and idf statistics
def calculate_tf_idf(tf_dict, idf_dict):
    tf_idf_dict = defaultdict(dict)
    for text_index, word_tf_count_dict in tf_dict.items():
        for word, tf_count in word_tf_count_dict.items():
            tf = tf_count / sum(word_tf_count_dict.values())
            # tf-idf = tf * log(D / (df + 1)), where D is the number of documents
            # and idf_dict[word] holds df, the number of documents containing the word
            tf_idf_dict[text_index][word] = tf * math.log(len(tf_dict)/(idf_dict[word]+1))
    return tf_idf_dict

# Input corpus: list of string
#["xxxxxxxxx", "xxxxxxxxxxxxxxxx", "xxxxxxxx"]
def calculate_tfidf(corpus):
    # segment the text first
    corpus = [jieba.lcut(text) for text in corpus]
    tf_dict, idf_dict = build_tf_idf_dict(corpus)
    tf_idf_dict = calculate_tf_idf(tf_dict, idf_dict)
    return tf_idf_dict
 
# Given the tfidf dictionary, show the top-K keywords of each category
def tf_idf_topk(tfidf_dict, paths=None, top=10, print_word=True):
    topk_dict = {}
    for text_index, text_tfidf_dict in tfidf_dict.items():
        word_list = sorted(text_tfidf_dict.items(), key=lambda x:x[1], reverse=True)
        topk_dict[text_index] = word_list[:top]
        if print_word:
            print(text_index, paths[text_index] if paths else "")
            for item in word_list[:top]:  # slicing avoids an IndexError on short documents
                print(item)
            print("----------")
    return topk_dict

def main():
    dir_path = r"category_corpus/"
    corpus = []
    paths = []
    for path in os.listdir(dir_path):
        path = os.path.join(dir_path, path)
        if path.endswith("txt"):
            with open(path, encoding="utf8") as f:
                corpus.append(f.read())
            paths.append(os.path.basename(path))
    tf_idf_dict = calculate_tfidf(corpus)
    tf_idf_topk(tf_idf_dict, paths)

if __name__ == "__main__":
    main()

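Three Main Application Scenarios

1. Search engine

How it works:

Precompute the TF-IDF of every web page

At query time, segment the query and look up each query word's TF-IDF value in every page

Pages with the highest total score are ranked first

Code example: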



import jieba
import math
import os
import json
from collections import defaultdict
from calculate_tfidf import calculate_tfidf, tf_idf_topk
"""
A simple search engine based on tfidf
"""

jieba.initialize()

# Load the documents (think of them as web pages) and compute each page's tfidf dictionary
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            corpus.append(document["title"] + "\n" + document["content"])
        tf_idf_dict = calculate_tfidf(corpus)
    return tf_idf_dict, corpus

def search_engine(query, tf_idf_dict, corpus, top=3):
    query_words = jieba.lcut(query)
    res = []
    for doc_id, tf_idf in tf_idf_dict.items():
        # a document's score is the sum of the tfidf values of the query words it contains
        score = 0
        for word in query_words:
            score += tf_idf.get(word, 0)
        res.append([doc_id, score])
    res = sorted(res, reverse=True, key=lambda x:x[1])
    for doc_id, _ in res[:top]:  # slicing is safe when there are fewer than top documents
        print(corpus[doc_id])
        print("--------------")
    return res

if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, corpus = load_data(path)
    while True:
        query = input("请输入您要搜索的内容:")  # prompt: "enter your search query"
        search_engine(query, tf_idf_dict, corpus)

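2. Text summarization

Approach:

Find the keywords with the highest TF-IDF in each article

Pick the sentences that contain the most of these keywords

Combine those sentences into a summary

Code example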



import jieba
import math
import os
import random
import re
import json
from collections import defaultdict
from calculate_tfidf import calculate_tfidf, tf_idf_topk
"""
A simple text summarizer based on tfidf
"""

jieba.initialize()

# Load the documents (think of them as web pages) and compute each page's tfidf dictionary
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            # title and content are joined with "\n" below, so neither may contain one
            assert "\n" not in document["title"]
            assert "\n" not in document["content"]
            corpus.append(document["title"] + "\n" + document["content"])
        tf_idf_dict = calculate_tfidf(corpus)
    return tf_idf_dict, corpus
 
# Compute the summary of one article
# Inputs: the article's tf_idf dictionary and the article text
# top is the manually chosen number of sentences to keep
# Articles whose body is too short are skipped, since summarizing them is of little use
def generate_document_abstract(document_tf_idf, document, top=3):
    sentences = re.split("？|！|。", document)
    # skip articles whose body has five sentences or fewer
    if len(sentences) <= 5:
        return None
    result = []
    for index, sentence in enumerate(sentences):
        sentence_score = 0
        words = jieba.lcut(sentence)
        for word in words:
            sentence_score += document_tf_idf.get(word, 0)
        sentence_score /= (len(words) + 1)  # average per word; +1 avoids division by zero
        result.append([sentence_score, index])
    result = sorted(result, key=lambda x:x[0], reverse=True)
    # the top-scoring sentences might be the 10th, 6th and 3rd; restore their original order, i.e. 3, 6, 10
    important_sentence_indexs = sorted([x[1] for x in result[:top]])
    return "。".join([sentences[index] for index in important_sentence_indexs])
 
# Generate summaries for all articles
def generate_abstract(tf_idf_dict, corpus):
    res = []
    for index, document_tf_idf in tf_idf_dict.items():
        title, content = corpus[index].split("\n")
        abstract = generate_document_abstract(document_tf_idf, content)
        if abstract is None:
            continue
        corpus[index] += "\n" + abstract
        # keys: 标题 = title, 正文 = body, 摘要 = summary
        res.append({"标题":title, "正文":content, "摘要":abstract})
    return res


if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, corpus = load_data(path)
    res = generate_abstract(tf_idf_dict, corpus)
    writer = open("abstract.json", "w", encoding="utf8")
    writer.write(json.dumps(res, ensure_ascii=False, indent=2))
    writer.close()

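3. Text similarity

Steps:

Take the N words with the highest TF-IDF from each article

Turn the frequencies of these words into a "vector" (a list of numbers)

Compute the "cosine similarity" between vectors (the smaller the angle, the more similar)

Code example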



#coding:utf8
import jieba
import math
import os
import json
from collections import defaultdict
from calculate_tfidf import calculate_tfidf, tf_idf_topk

"""
Text similarity based on tfidf
"""

jieba.initialize()

# Load the documents (think of them as web pages) and compute each page's tfidf dictionary
# Then collect the top 5 words of each document into a vocabulary of important words
# That vocabulary is used later to vectorize text
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            corpus.append(document["title"] + "\n" + document["content"])
    tf_idf_dict = calculate_tfidf(corpus)
    topk_words = tf_idf_topk(tf_idf_dict, top=5, print_word=False)
    vocab = set()
    for words in topk_words.values():
        for word, score in words:
            vocab.add(word)
    print("词表大小：", len(vocab))  # 词表大小 = vocabulary size
    return tf_idf_dict, list(vocab), corpus
 
 
# passage is the text string
# vocab is the list of words
# Vectorization: the frequency of each important word within the document
def doc_to_vec(passage, vocab):
    vector = [0] * len(vocab)
    passage_words = jieba.lcut(passage)
    for index, word in enumerate(vocab):
        vector[index] = passage_words.count(word) / len(passage_words)
    return vector

# Compute the vectors of all documents up front
def calculate_corpus_vectors(corpus, vocab):
    corpus_vectors = [doc_to_vec(c, vocab) for c in corpus]
    return corpus_vectors

# Cosine similarity of two vectors
def cosine_similarity(vector1, vector2):
    x_dot_y = sum([x*y for x, y in zip(vector1, vector2)])
    sqrt_x = math.sqrt(sum([x ** 2 for x in vector1]))
    sqrt_y = math.sqrt(sum([x ** 2 for x in vector2]))
    if sqrt_x == 0 or sqrt_y == 0:  # a zero vector has no direction
        return 0
    return x_dot_y / (sqrt_x * sqrt_y + 1e-7)
 
 
# Given a piece of text, find the most similar documents
def search_most_similar_document(passage, corpus_vectors, vocab):
    input_vec = doc_to_vec(passage, vocab)
    result = []
    for index, vector in enumerate(corpus_vectors):
        score = cosine_similarity(input_vec, vector)
        result.append([index, score])
    result = sorted(result, reverse=True, key=lambda x:x[1])
    return result[:4]


if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, vocab, corpus = load_data(path)
    corpus_vectors = calculate_corpus_vectors(corpus, vocab)
    passage = "魔兽争霸"
    for corpus_index, score in search_most_similar_document(passage, corpus_vectors, vocab):
        # 相似文章 = similar article, 得分 = score
        print("相似文章:\n", corpus[corpus_index].strip())
        print("得分：", score)
        print("--------------")