Quickly Searching for a String Across Multiple PDFs

Problem

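When you have a large collection of papers and want to find a specific string in them, do you really have to open the PDFs one by one? With that many files, opening each one and pressing Ctrl + F is not realistic. So what can be done? Building a full-text index for your own library is one option, but indexing a local text corpus is a fair amount of work. Is there a lightweight alternative? There is, and it is inspired by grep, the standard Linux search command. Can grep search PDF files directly? No: grep is meant for text files (source code, plain text, and so on). Would converting every PDF to TXT and then running grep work? Yes, but that conversion step can be skipped too. Read on for a setup that takes about a minute.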

The method below applies to Linux only; Windows users will need to find an alternative.

Solution

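Just install pdfgrep (see its official website). Compiling it from the source provided there is not recommended, because dependency errors are likely and tedious to resolve. Install it from the Ubuntu repositories on the command line instead: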
sudo apt-get update -y
sudo apt-get install -y pdfgrep

Usage

Usage closely mirrors grep. For more examples, see the official documentation. Recursive search is supported as well.

For example, to search the papers folder for the keyword cosine:
pdfgrep -n -i 'cosine' *.pdf
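An excerpt of the results follows. The number after each file name is the page number, which is what -n prints in pdfgrep: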

(base) ergou@dell:~/Desktop/paper_reading/papers$ pdfgrep -n -i 'cosine' *.pdf
paper0.pdf:22:in VSM or BoW, are compared using similarity measure like Cosine similarity (Vu et al.
paper1.pdf:1:space. Given their vector embeddings, we then use cosine
paper1.pdf:3:first tokenizes the input text and then calculates vectors for                     to average them as the cosine similarity function depends
paper1.pdf:3:or tf-idf, these vectors are contextualized; they consider                   choice of summing or averaging would not influence the cosine
paper1.pdf:3:and problem report as a potential match. Another positive                      the euclidian similarity, and cosine similarity [9], [43]. The
paper1.pdf:4:dimensions. In contrast, the cosine similarity measures the
paper1.pdf:4:Previous research [1]—[3], [9], showed that cosine similarity                                          05/2012              09/2018
paper1.pdf:5:analyze DeepMatcher's cosine similarity values to understand
paper1.pdf:5:consuming over 80% battery. Had to uninstall to even                       cosine similarity, it added one additional suggestion per step
paper1.pdf:6:as many relevant bug reports in the issue tracker as                                 Cosine Similarity Analysis. We analyzed the cosine sim
paper1.pdf:6:as many relevant bug reports in the issue tracker as                                 Cosine Similarity Analysis. We analyzed the cosine sim
paper1.pdf:6:suggested bug reports to three, the MAP score                                    irrelevant bug report suggestions. Figure 4 shows the cosine
paper1.pdf:6:of problem reports for which DeepMatcher                                            We found that VLC has the lowest cosine similarity score
paper1.pdf:6:the MAP and the hit ratio scores for each                                    (26 matches). The lower cosine similarity indicates a higher
paper1.pdf:7:by the developers. our previously reported plot of the cosine                     report 546 days after the corresponding problem report for
paper1.pdf:7:the highest cosine similarity score and highest noun overlap                          It is essential for app developers to address users'
paper1.pdf:11:sensitive embeddings on which we applied cosine similarity to                     [15] M. Honnibal, I. Montani, S. Van Landeghem, and A.
paper2.pdf:3:encoding implies computing ?? based on sine and cosine functions.                   (other formulas stay the same):
paper2.pdf:6:the cosine learning rate schedule [25] with 2000 warm-up steps              Utilizing structure in the self-attention mechanism is much more
paper3.pdf:6:as an abstraction candidate.                         iTrust’s 54 use cases), we ﬁrst compute the cosine similarity
paper3.pdf:6:cosine(abst, R) =           cosine(abst, ri ).             (5)
paper3.pdf:6:cosine(abst, R) =           cosine(abst, ri ).             (5)
paper3.pdf:7:candidates by their cosine similarity scores in a descending                      OpenMRS allows for customizable electronic
Generating Question Titles for Stack Overflow from Mined Code Snippets.pdf:5:and then uses a cosine similarity function to calculate their similarity. Allamanis et al. [2] proposed
paper4.pdf:6:      “burnout”> is identiﬁed as an abstraction candidate.                       iTrust’s 54 use cases), we ﬁrst compute the cosine similarity
paper4.pdf:6:                                                                                               cosine(abst, R) =           cosine(abst, ri ).             (5)
paper4.pdf:6:                                                                                               cosine(abst, R) =           cosine(abst, ri ).             (5)
paper4.pdf:7:candidates by their cosine similarity scores in a descending     OpenMRS allows for customizable electronic medical record
paper5.pdf:7:manner. We compute the cosine similarity between the two              Considering complex and confusing
paper6.pdf:2:of semantically similar user comments                                                    to generate embeddings that are close in terms of their cosine
paper7.pdf:5:from identifier ?? of type T, and replacing the abbreviation with its           two abbreviations based on the cosine similarity [3] of the resulting
paper7.pdf:11:conclusions in future, however, evaluation on more applications               [3] 2021. Cosine Similarity. https://en.wikipedia.org/wiki/Cosine_similarity.
paper7.pdf:11:conclusions in future, however, evaluation on more applications               [3] 2021. Cosine Similarity. https://en.wikipedia.org/wiki/Cosine_similarity.
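
When the hit list is this long, a per-file tally can be easier to scan. The sketch below assumes that pdfgrep's -c option prints one file:count line per file when several files are given, as grep does. rank_by_hits is a hypothetical helper name, not part of pdfgrep.

```shell
# Hypothetical helper: rank PDFs by how many matches each one contains.
# Assumes `pdfgrep -c` prints "file:count" lines when given several files.
# Steps: keep only files with at least one hit and prefix the count,
# sort by that count in descending order, then drop the prefix again.
rank_by_hits() {
    pattern=$1
    shift
    pdfgrep -c -i "$pattern" "$@" |
        awk -F: '$NF > 0 { print $NF "\t" $0 }' |
        sort -nr |
        cut -f2-
}
```

It could be called as rank_by_hits 'cosine' *.pdf to see which papers mention the keyword most often.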

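pdfgrep can search subfolders itself with its -r (--recursive) option. Another way, which also helps when finer control over file selection is wanted, is to combine it with the find command: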
find . -name "*.pdf" -print0 | xargs -0 pdfgrep -i pattern
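
For comparison, a search that relies on pdfgrep's own -r (--recursive) option instead of find might look like the sketch below. It assumes the installed version provides that option. search_tree is a hypothetical wrapper name, and defaulting to the current directory is a choice made for this example.

```shell
# Hypothetical wrapper: search a directory tree with pdfgrep's -r option.
# $1 is the pattern; $2 is the directory to search (defaults to ".").
search_tree() {
    pdfgrep -r -n -i "$1" "${2:-.}"
}
```

For instance, search_tree 'cosine' ~/papers would print matches with page numbers for every PDF below that folder. The find pipeline remains the better fit when files must be filtered by other criteria, such as modification time.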

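pattern can be a single word or a regular expression wrapped in single quotes. A friendly reminder: do not put literal spaces in the regular expression, because the way text is encoded in a PDF may cause the search to return nothing. Use the word-boundary marker \b or a length-limited wildcard instead, for example: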
find . -name "*.pdf" -print0 | xargs -0 pdfgrep -i 'known.{1,3}and.{1,3}unknown'
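
Typing .{1,3} between every word gets tedious for longer phrases, so the substitution can be automated. The following is a minimal sketch: phrase_to_pattern and search_phrase are hypothetical helper names, and the helper only replaces spaces, so any other regex metacharacters in the phrase are passed through unescaped.

```shell
# Hypothetical helper: replace each run of spaces in a phrase with .{1,3}
# so the pattern tolerates whatever the PDF stores between words.
# Other regex metacharacters in the phrase are NOT escaped.
phrase_to_pattern() {
    printf '%s\n' "$1" | sed -E 's/ +/.{1,3}/g'
}

# Hypothetical wrapper: search every PDF under the current directory
# for the phrase, ignoring case.
search_phrase() {
    find . -name "*.pdf" -print0 |
        xargs -0 pdfgrep -i "$(phrase_to_pattern "$1")"
}
```

With these defined, search_phrase 'known and unknown' runs the same search as the command above.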

Give it a try. A poor memory is no longer a problem: any fragment you remember is enough to search on.

Original post: https://blog.csdn.net/huludan/article/details/125407548