
Python - fuzzy string comparison

What I am trying to accomplish is a program that reads in a file and compares each sentence against an original sentence. A sentence that is a perfect match to the original gets a score of 1, and a sentence that is the total opposite gets a 0. All other fuzzy sentences receive a grade between 1 and 0.

I am not sure which operation to use to accomplish this in Python 3.

I have included sample text below, in which Text 1 is the original and the other strings that follow are the comparisons.

Sample text:

Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.

Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines. // should score high but not 1

Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. // should score lower than Text 20

Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // should score lower than Text 21 but not 0

Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // should score a 0!

4 Answers

96 votes

There is a package called fuzzywuzzy. Install it via pip:

pip install fuzzywuzzy

Simple usage:

>>> from fuzzywuzzy import fuzz

>>> fuzz.ratio("this is a test", "this is a test!")

96

The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler, it has a number of different matching methods (like token order insensitivity and partial string matching) which make it more powerful in practice. The process.extract functions are especially useful: they find the best matching strings and ratios from a set. From their readme:

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")

100

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

90

>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

100

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

84

>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

100

Process

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

>>> process.extract("new york jets", choices, limit=2)

[('New York Jets', 100), ('New York Giants', 78)]

>>> process.extractOne("cowboys", choices)

("Dallas Cowboys", 90)

congusbongus answered 2019-10-25T04:21:53Z

79 votes

There is a module in the standard library (called difflib) that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you are after.

Edit: a quick example from the Python prompt:

>>> from difflib import SequenceMatcher as SM

>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'

>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'

>>> SM(None, s1, s2).ratio()

0.9112903225806451
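
Building on that, here is a minimal sketch of the file-reading loop the question describes (sentences.txt is a hypothetical file name, with one sentence per line):

from difflib import SequenceMatcher

original = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."

# score every non-empty line of the file against the original sentence
with open("sentences.txt") as f:
    for line in f:
        line = line.strip()
        if line:
            score = SequenceMatcher(None, original, line).ratio()  # 0.0 .. 1.0
            print(f"{score:.3f}  {line}")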

HTH!

mac answered 2019-10-25T04:22:25Z

15 votes

fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.

from fuzzyset import FuzzySet

corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]

fs = FuzzySet(corpus)

query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."

fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]

Warning: be careful not to mix unicode and bytes in your fuzzyset.
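
To flip this around to the question's setup (a sketch, reusing the corpus and query variables from above): index only the original sentence and query each candidate against it, treating "no close match" as a score of 0:

fs = FuzzySet([query])          # index just the original sentence
for candidate in corpus:
    result = fs.get(candidate)  # list of (score, match) pairs, or None if nothing is close
    score = result[0][0] if result else 0.0
    print(round(score, 3), candidate[:40])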

hobs answered 2019-10-25T04:22:59Z

1 vote

This task is called Paraphrase Identification, which is an active area of research in natural language processing. I have linked several state-of-the-art papers, and you can find open source code for many of them on GitHub.

Note that all of the answers so far assume there is some string/surface similarity between the two sentences, whereas in reality two sentences with little surface similarity can be semantically similar: for example, "I own three cats" and "three felines live with me" share almost no words but mean nearly the same thing.

If you are interested in that kind of similarity, you can use Skip-Thoughts. Install the software according to its GitHub guide and go to the paraphrase detection section of the readme:

import skipthoughts

model = skipthoughts.load_model()

# X_sentences is your list of sentence strings
vectors = skipthoughts.encode(model, X_sentences)

This converts your sentences (X_sentences) into vectors. You can then find the similarity of two vectors with:

import scipy.spatial.distance

similarity = 1 - scipy.spatial.distance.cosine(vectors[0], vectors[1])

Here vectors[0] and vectors[1] are the vectors corresponding to X_sentences[0] and X_sentences[1], the sentences whose score you want to find.

There are also other models for converting sentences to vectors, which you can find here.

Once you have converted the sentences into vectors, similarity is just a matter of finding the cosine similarity between those vectors.
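
For reference, cosine similarity is just the normalized dot product, so you can also compute it directly with numpy (a sketch with toy vectors, not real skip-thought encodings):

import numpy as np

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 for identical directions, 0.0 for orthogonal ones
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([0.2, 0.7, 0.1])
v2 = np.array([0.3, 0.6, 0.2])
print(cosine_similarity(v1, v2))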

Ash answered 2019-10-25T04:24:05Z