Text mining and cross-referencing

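I want to build an adjacency matrix with citations.

I want to build an index of 130 words and search 130 papers for those 130 words. Searching by hand takes a long time, so I would like to automate it.

Can anyone suggest whether this can be done with text mining or in some other way?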

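Answer 1

This may be off-topic, since I don't know how to do this in LaTeX, but you can use pdftotext to convert a PDF to text, then scan it for your search term and count the lines on which it occurs: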

pdftotext input.pdf - | grep "search term" | wc -l

Then you can use Python to build a table of occurrence counts:

import subprocess
import os
import numpy as np

pdf_folder = '/path/to/folder'
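# Collect the PDF files in the folder; sorted so the column order is reproducible
pdfs = sorted(f for f in os.listdir(pdf_folder)
              if os.path.isfile(os.path.join(pdf_folder, f)) and os.path.splitext(f)[-1] == '.pdf')
search_terms = ['search', 'me']
latex_filename = 'citematrix.tex'
pdf_fullpaths = [os.path.join(pdf_folder, pdf) for pdf in pdfs]
# Rows are search terms, columns are PDFs (np.int was removed from NumPy, plain int works)
cite_matrix = np.zeros(shape=(len(search_terms), len(pdfs)), dtype=int)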

for y, search_term in enumerate(search_terms):
    for x, pdf in enumerate(pdf_fullpaths):
        # Python equivalent of: pdftotext file.pdf - | grep -i term | wc -l
        pdftotext = subprocess.Popen(['pdftotext', pdf, '-'], stdout=subprocess.PIPE)
        grep = subprocess.Popen(['grep', '-i', search_term], stdin=pdftotext.stdout, stdout=subprocess.PIPE)
        wc = subprocess.Popen(['wc', '-l'], stdin=grep.stdout, stdout=subprocess.PIPE)
        # communicate() reads the output of wc and waits for it to exit
        count = int(wc.communicate()[0].strip())
        cite_matrix[y, x] = count
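
Note that grep piped into wc -l counts matching lines, so a term that appears twice on one line is counted once. If true occurrence counts matter, the counting can be done in Python on the extracted text instead. The following is only a sketch: count_occurrences and count_matching_lines are hypothetical helper names, and the sample string stands in for the output of pdftotext.

```python
import re

def count_occurrences(text, term):
    """Count case-insensitive, non-overlapping occurrences of term in text."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return len(pattern.findall(text))

def count_matching_lines(text, term):
    """Count lines that contain term at least once (roughly grep -i | wc -l)."""
    needle = term.lower()
    return sum(1 for line in text.splitlines() if needle in line.lower())

# Stand-in for the text of one PDF
sample = "Search me.\nNothing here.\nsearch and SEARCH again.\n"
print(count_occurrences(sample, "search"))     # 3
print(count_matching_lines(sample, "search"))  # 2
```

With this approach each PDF only needs to be converted once, for example with subprocess.run(['pdftotext', pdf, '-'], capture_output=True, text=True).stdout, and the resulting text can then be checked against all search terms.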

You can then write the file names of all PDFs in which a search term was found to a LaTeX file:

latex = '\\documentclass{article}\n\\begin{document}\n'

for y, search_term in enumerate(search_terms):
    latex += search_term + ': '
    if np.sum(cite_matrix[y]) < 1:
        # Blank line so the next term starts a new paragraph
        latex += 'No occurrence\n\n'
        continue
    occurrences = 0
    for x, pdf in enumerate(pdfs):
        if cite_matrix[y, x] < 1:
            continue
        if occurrences > 0:
            latex += ', '
        latex += pdf
        occurrences += 1
    latex += '\n\n'
latex += '\\end{document}'

with open(latex_filename, 'w') as out:
    out.write(latex)
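
File names often contain characters such as _ or &, which LaTeX treats specially, so writing them verbatim into the .tex file can break compilation. A minimal sketch of an escaping helper follows; latex_escape and LATEX_SPECIALS are hypothetical names that are not part of the answer, and the mapping covers only the usual ten special characters.

```python
# Replacement text for each character that LaTeX treats specially
LATEX_SPECIALS = {
    '\\': r'\textbackslash{}',
    '&': r'\&', '%': r'\%', '$': r'\$', '#': r'\#', '_': r'\_',
    '{': r'\{', '}': r'\}',
    '~': r'\textasciitilde{}', '^': r'\textasciicircum{}',
}

def latex_escape(text):
    """Escape text character by character, so nothing is escaped twice."""
    return ''.join(LATEX_SPECIALS.get(ch, ch) for ch in text)

print(latex_escape('smith_2019 & co.pdf'))  # smith\_2019 \& co.pdf
```

In the loop that builds the document, the file name would then be appended as latex_escape(pdf) rather than pdf.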

Or, if you know the corresponding BibTeX keys, you can cite them:

bibtex_keys = ['Author2019', 'Buthor2020']
bib_filename = 'refs.bib'

latex = '\\documentclass{article}\n\\usepackage[backend=biber]{biblatex}\n\\addbibresource{'+bib_filename+'}\n\\begin{document}\n'
for y, search_term in enumerate(search_terms):
    latex += search_term + ': '
    if np.sum(cite_matrix[y]) < 1:
        # Blank line so the next term starts a new paragraph
        latex += 'No occurrence\n\n'
        continue
    latex += '\\cite{'
    occurrences = 0
    # bibtex_keys[x] must be the key of pdfs[x], i.e. same length and order
    for x, key in enumerate(bibtex_keys):
        if cite_matrix[y, x] > 0:
            if occurrences > 0:
                latex += ', '
            latex += key
            occurrences += 1
    latex += '}\n\n'
latex += '\\printbibliography\n\\end{document}'

with open(latex_filename, 'w') as out:
    out.write(latex)

Or you could write the whole table to an Excel sheet...
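
One simple route, sketched here as an assumption and not as part of the answer, is to write the matrix as a CSV file, which Excel and other spreadsheet programs can open. write_matrix_csv is a hypothetical helper, and the toy data below stands in for search_terms, pdfs and cite_matrix.

```python
import csv
import io

def write_matrix_csv(out, search_terms, pdfs, cite_matrix):
    """Write one header row of file names, then one row per search term."""
    writer = csv.writer(out)
    writer.writerow(['term'] + list(pdfs))
    for term, row in zip(search_terms, cite_matrix):
        # int() turns NumPy integers into plain Python integers
        writer.writerow([term] + [int(c) for c in row])

# Toy data standing in for the real terms, files and counts
terms = ['search', 'me']
files = ['a.pdf', 'b.pdf']
counts = [[3, 0], [1, 2]]

buffer = io.StringIO()
write_matrix_csv(buffer, terms, files, counts)
print(buffer.getvalue())
```

To write an actual file, an open file object can be passed instead of the buffer, for example one obtained with open('citematrix.csv', 'w', newline='').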
