Quickly Searching for a String Across Multiple PDFs

Problem

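With a large collection of papers, finding a specific string is painful: opening each PDF and pressing Ctrl + F does not scale. Building a full-text index for the whole library would work, but setting up a local corpus index is a lot of effort. Is there something more lightweight? The Linux search command grep comes to mind, but grep cannot search PDF files directly, because it is meant for text files (source code, plain text, and so on). Converting every PDF to TXT and then running grep would work, but even that conversion step can be skipped. Here is how to get it done in about a minute.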

The method below applies to Linux only; Windows users will need to find an alternative.

Solution

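All you need is pdfgrep (see its official website). Building it from the source code offered there is not recommended, since dependency errors are likely and tedious to resolve. Install it from the Ubuntu repositories on the command line instead: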
sudo apt-get update -y
sudo apt-get install -y pdfgrep
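
To confirm the installation worked, a quick check along these lines may help. It is only a sketch: have_cmd is a helper name made up for this example, not part of pdfgrep.

```shell
# have_cmd is a hypothetical helper: succeeds if the command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pdfgrep; then
    # Print the installed version.
    pdfgrep --version
else
    echo "pdfgrep not found; install it with: sudo apt-get install -y pdfgrep"
fi
```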

Usage

Usage closely mirrors grep; for more examples, see the official documentation. Recursive search is supported as well (the -r option).

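For example, to search a folder of papers for the keyword cosine (-n prints the page number of each match, -i ignores case):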
pdfgrep -n -i 'cosine' *.pdf
A fragment of the output:

(base) ergou@dell:~/Desktop/paper_reading/papers$ pdfgrep -n -i 'cosine' *.pdf
paper0.pdf:22:in VSM or BoW, are compared using similarity measure like Cosine similarity (Vu et al.
paper1.pdf:1:space. Given their vector embeddings, we then use cosine
paper1.pdf:3:first tokenizes the input text and then calculates vectors for                     to average them as the cosine similarity function depends
paper1.pdf:3:or tf-idf, these vectors are contextualized; they consider                   choice of summing or averaging would not influence the cosine
paper1.pdf:3:and problem report as a potential match. Another positive                      the euclidian similarity, and cosine similarity [9], [43]. The
paper1.pdf:4:dimensions. In contrast, the cosine similarity measures the
paper1.pdf:4:Previous research [1][3], [9], showed that cosine similarity                                          05/2012              09/2018
paper1.pdf:5:analyze DeepMatcher's cosine similarity values to understand
paper1.pdf:5:consuming over 80% battery. Had to uninstall to even                       cosine similarity, it added one additional suggestion per step
paper1.pdf:6:as many relevant bug reports in the issue tracker as                                 Cosine Similarity Analysis. We analyzed the cosine sim­
paper1.pdf:6:as many relevant bug reports in the issue tracker as                                 Cosine Similarity Analysis. We analyzed the cosine sim­
paper1.pdf:6:suggested bug reports to three, the MAP score                                    irrelevant bug report suggestions. Figure 4 shows the cosine
paper1.pdf:6:of problem reports for which DeepMatcher                                            We found that VLC has the lowest cosine similarity score
paper1.pdf:6:the MAP and the hit ratio scores for each                                    (26 matches). The lower cosine similarity indicates a higher
paper1.pdf:7:by the developers. our previously reported plot of the cosine                     report 546 days after the corresponding problem report for
paper1.pdf:7:the highest cosine similarity score and highest noun overlap                          It is essential for app developers to address users'
paper1.pdf:11:sensitive embeddings on which we applied cosine similarity to                     [15] M. Honnibal, I. Montani, S. Van Landeghem, and A.
paper2.pdf:3:encoding implies computing ?? based on sine and cosine functions.                   (other formulas stay the same):
paper2.pdf:6:the cosine learning rate schedule [25] with 2000 warm-up steps              Utilizing structure in the self-attention mechanism is much more
paper3.pdf:6:as an abstraction candidate.                         iTrust’s 54 use cases), we first compute the cosine similarity
paper3.pdf:6:cosine(abst, R) =           cosine(abst, ri ).             (5)
paper3.pdf:6:cosine(abst, R) =           cosine(abst, ri ).             (5)
paper3.pdf:7:candidates by their cosine similarity scores in a descending                      OpenMRS allows for customizable electronic
Generating Question Titles for Stack Overflow from Mined Code Snippets.pdf:5:and then uses a cosine similarity function to calculate their similarity. Allamanis et al. [2] proposed
paper4.pdf:6:      “burnout”> is identified as an abstraction candidate.                       iTrust’s 54 use cases), we first compute the cosine similarity
paper4.pdf:6:                                                                                               cosine(abst, R) =           cosine(abst, ri ).             (5)
paper4.pdf:6:                                                                                               cosine(abst, R) =           cosine(abst, ri ).             (5)
paper4.pdf:7:candidates by their cosine similarity scores in a descending     OpenMRS allows for customizable electronic medical record
paper5.pdf:7:manner. We compute the cosine similarity between the two              Considering complex and confusing
paper6.pdf:2:of semantically similar user comments                                                    to generate embeddings that are close in terms of their cosine
paper7.pdf:5:from identifier ?? of type T, and replacing the abbreviation with its           two abbreviations based on the cosine similarity [3] of the resulting
paper7.pdf:11:conclusions in future, however, evaluation on more applications               [3] 2021. Cosine Similarity. https://en.wikipedia.org/wiki/Cosine_similarity.
paper7.pdf:11:conclusions in future, however, evaluation on more applications               [3] 2021. Cosine Similarity. https://en.wikipedia.org/wiki/Cosine_similarity.
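
With output this long, a per-file summary can be easier to scan. The sketch below counts matching lines per file from output in the file:page:text form shown above. hits_per_file is a name made up for this example, and it assumes file names contain no colon. pdfgrep also has a -c (--count) option that reports a count per file directly.

```shell
# hits_per_file (hypothetical helper): reads "file:page:text" lines on stdin
# and prints "<count><TAB><file>", most hits first.
# Assumption: file names do not contain ':'.
hits_per_file() {
    awk -F: '{ count[$1]++ } END { for (f in count) printf "%d\t%s\n", count[f], f }' |
        sort -k1,1nr -k2
}

# Demo on three lines taken from the output above.
hits_per_file <<'EOF'
paper1.pdf:1:space. Given their vector embeddings, we then use cosine
paper1.pdf:4:dimensions. In contrast, the cosine similarity measures the
paper2.pdf:6:the cosine learning rate schedule [25] with 2000 warm-up steps
EOF
# prints "2<TAB>paper1.pdf" followed by "1<TAB>paper2.pdf"
```

In real use the function would sit at the end of a pipeline, for example: pdfgrep -n -i 'cosine' *.pdf | hits_per_file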

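Older pdfgrep releases cannot recurse into subfolders (newer ones provide -r/--recursive). If your version lacks that option, combine pdfgrep with the find command to get the same effect: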
find . -name "*.pdf" -print0 | xargs -0 pdfgrep -i pattern
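
The -print0 / -0 pair passes file names separated by NUL bytes, so names containing spaces (such as "Generating Question Titles for Stack Overflow from Mined Code Snippets.pdf" in the output above) arrive intact. A small wrapper might look like the sketch below. The function name search_pdfs and the SEARCH_CMD variable are made up for this example; SEARCH_CMD exists only so the pipeline can be tried out with grep on a machine without pdfgrep.

```shell
# search_pdfs DIR PATTERN (hypothetical wrapper around the pipeline above)
# SEARCH_CMD is an assumed override for experimenting; it defaults to pdfgrep.
search_pdfs() {
    dir=$1
    pattern=$2
    tool=${SEARCH_CMD:-pdfgrep}
    # -iname also catches upper-case extensions such as .PDF;
    # -print0 and -0 keep file names with spaces in one piece.
    find "$dir" -type f -iname '*.pdf' -print0 |
        xargs -0 "$tool" -i -n -H "$pattern"
}

# Example call (needs pdfgrep installed):
#   search_pdfs . cosine
```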

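pattern can be a single word or a regular expression wrapped in single quotes. A word of caution: avoid literal spaces in the regular expression, because the way a PDF encodes its text can mean the space never matches and nothing is found. Use the word-boundary marker \b or a length-limited wildcard instead, for example: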
find . -name "*.pdf" -print0 | xargs -0 pdfgrep -i 'known.{1,3}and.{1,3}unknown'
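
Typing .{1,3} between every word by hand gets tedious. As a sketch, a helper can turn a remembered phrase into such a pattern. phrase_to_pattern is a name made up for this example, and it does not escape regex metacharacters in the phrase, so it is only suited to plain words.

```shell
# phrase_to_pattern WORD... (hypothetical helper)
# Replaces each run of whitespace with the length-limited wildcard .{1,3}
# Assumption: the words contain no regex metacharacters.
phrase_to_pattern() {
    printf '%s\n' "$*" | sed -E 's/[[:space:]]+/.{1,3}/g'
}

phrase_to_pattern known and unknown
# prints: known.{1,3}and.{1,3}unknown
```

The result can then be passed on, for example: find . -name "*.pdf" -print0 | xargs -0 pdfgrep -i "$(phrase_to_pattern known and unknown)"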

Give it a try. A poor memory is no longer a problem: search for whatever fragment you remember and you are done.


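Copyright notice: This is an original article by huludan, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.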