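I want to build an adjacency matrix with citations.
I want to build an index of 130 words and search 130 papers for those 130 words. Searching by hand takes a long time, so I would like to automate the search.
Can anyone suggest whether this can be done with text mining or in some other way?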
Answer 1
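This may be off-topic, since I don't know how to do this in LaTeX, but you can use pdftotext to convert the PDF to text, then scan it for your search term and count the lines on which it occurs: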
pdftotext input.pdf - | grep "search term" | wc -l
You can then use Python to build a table of these counts, with one row per search term and one column per PDF:
import os
import subprocess
import numpy as np

pdf_folder = '/path/to/folder'
# Sorted so that the column order is reproducible (os.listdir order is arbitrary)
pdfs = sorted(f for f in os.listdir(pdf_folder)
              if os.path.isfile(os.path.join(pdf_folder, f))
              and os.path.splitext(f)[-1] == '.pdf')
search_terms = ['search', 'me']
latex_filename = 'citematrix.tex'
pdf_fullpaths = [os.path.join(pdf_folder, pdf) for pdf in pdfs]

# One row per search term, one column per PDF
cite_matrix = np.zeros(shape=(len(search_terms), len(pdfs)), dtype=int)
for y, search_term in enumerate(search_terms):
    for x, pdf in enumerate(pdf_fullpaths):
        # Equivalent of: pdftotext pdf - | grep -i search_term | wc -l
        pdftotext = subprocess.Popen(['pdftotext', pdf, '-'], stdout=subprocess.PIPE)
        grep = subprocess.Popen(['grep', '-i', search_term], stdin=pdftotext.stdout, stdout=subprocess.PIPE)
        wc = subprocess.Popen(['wc', '-l'], stdin=grep.stdout, stdout=subprocess.PIPE)
        # Close our copies of the pipe ends so the children see EOF correctly
        pdftotext.stdout.close()
        grep.stdout.close()
        # Number of lines that contain the term (case-insensitive)
        cite_matrix[y, x] = int(wc.communicate()[0].strip())
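The loop above starts pdftotext again for every combination of term and PDF, which for 130 terms and 130 papers means 16,900 conversions. A possible variation, only a sketch and not tested on real papers, extracts each PDF's text once and does the counting in Python. The helper names `pdf_to_text`, `count_term` and `build_count_matrix` are made up for this example, and it assumes pdftotext writes UTF-8 (its default). Unlike `grep | wc -l`, it counts every occurrence rather than every matching line.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Return the text of one PDF, using the pdftotext command-line tool."""
    result = subprocess.run(['pdftotext', pdf_path, '-'],
                            capture_output=True, check=True)
    return result.stdout.decode('utf-8', errors='replace')


def count_term(text, term):
    """Count case-insensitive, non-overlapping occurrences of term in text."""
    # re.escape makes the term a literal string, not a regular expression
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))


def build_count_matrix(texts, terms):
    """Rows are search terms, columns are documents."""
    return [[count_term(text, term) for text in texts] for term in terms]
```

With the variables from the script above, this would be used as `texts = [pdf_to_text(p) for p in pdf_fullpaths]` followed by `cite_matrix = np.array(build_count_matrix(texts, search_terms))`. As with grep, a short term such as 'me' also matches inside longer words such as 'some'.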
You can then write, for each search term, the names of all PDF files in which it was found to a LaTeX file:
latex = '\\documentclass{article}\n\\begin{document}\n'
for y, search_term in enumerate(search_terms):
    # PDFs in which this term was found at least once
    hits = [pdf for x, pdf in enumerate(pdfs) if cite_matrix[y, x] > 0]
    latex += search_term + ': '
    latex += ', '.join(hits) if hits else 'No occurrence'
    # Blank line so that every term starts its own paragraph
    latex += '\n\n'
latex += '\\end{document}\n'
with open(latex_filename, 'w') as out:
    out.write(latex)
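One caveat: file names often contain characters such as `_` or `#`, which LaTeX treats specially, so writing them into the document unchanged can make compilation fail. A minimal sketch of an escaping helper follows; the name `escape_latex` and the replacement table are my own choices for this example, not part of any package.

```python
# Replacements for the characters LaTeX treats specially in normal text
LATEX_SPECIALS = {
    '\\': r'\textbackslash{}',
    '&': r'\&', '%': r'\%', '$': r'\$', '#': r'\#', '_': r'\_',
    '{': r'\{', '}': r'\}',
    '~': r'\textasciitilde{}', '^': r'\textasciicircum{}',
}


def escape_latex(text):
    """Escape LaTeX special characters one character at a time."""
    # Working per character avoids escaping the output of an earlier replacement
    return ''.join(LATEX_SPECIALS.get(ch, ch) for ch in text)
```

In the script above you would then write `escape_latex(pdf)` instead of `pdf` (and likewise for the search terms) when building the `latex` string.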
Or, if you know the corresponding BibTeX keys, you can cite the papers instead:
# One key per PDF, in the same order as the pdfs list
bibtex_keys = ['Author2019', 'Buthor2020']
bib_filename = 'refs.bib'
latex = '\\documentclass{article}\n\\usepackage[backend=biber]{biblatex}\n\\addbibresource{' + bib_filename + '}\n\\begin{document}\n'
for y, search_term in enumerate(search_terms):
    # Keys of the papers in which this term was found at least once
    keys = [key for x, key in enumerate(bibtex_keys) if cite_matrix[y, x] > 0]
    latex += search_term + ': '
    latex += ('\\cite{' + ','.join(keys) + '}') if keys else 'No occurrence'
    # Blank line so that every term starts its own paragraph
    latex += '\n\n'
latex += '\\printbibliography\n\\end{document}\n'
with open(latex_filename, 'w') as out:
    out.write(latex)
Or you could write the whole table to an Excel sheet...
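For the spreadsheet route, one simple option is a CSV file, which Excel and other spreadsheet programs can open. The following is only a sketch using Python's standard `csv` module; the function name `write_count_table` and the header label 'term' are assumptions made for this example.

```python
import csv


def write_count_table(path, terms, doc_names, matrix):
    """Write one row per search term and one column per document."""
    # newline='' is what the csv module expects for files it writes
    with open(path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['term'] + list(doc_names))
        for term, row in zip(terms, matrix):
            # int() also turns NumPy integers into plain Python integers
            writer.writerow([term] + [int(count) for count in row])
```

With the variables from the script above, the call would be `write_count_table('citematrix.csv', search_terms, pdfs, cite_matrix)`.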