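I want to write a package that handles automatic tagging of PDF documents. The problem I am trying to solve is tagging paragraphs together with chapters, sections, subsections, subsubsections, and so on down to subparagraphs. I have read the example in the tagpdf package that tags chapters, sections, subsections, etc. However, first, it only handles the titles of those parts of the document, and the body text of the document does not show up in the PDF; second, it only works with the scrbook class, whereas I would like it to work with any document class if possible (I do not know whether the document class can be checked from within latex/lualatex). I tried to use the \everypar command, but as far as I can tell from my log, it is only called n+1 times, where n is the number of sections and subsections. For now my document only needs to compile with lualatex. Many thanks to everyone for the help.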
Tagging test
\documentclass[12pt]{scrbook}
\usepackage{tagpdf}
\tagpdfsetup{interwordspace=true,activate-all,uncompress}
\usepackage{amsmath,amssymb}
\title{test document}
\author{AlexanderKozlovskiy}
\date{\today}
%\maketitle (why does this not work?)
%Marking the toc entries
%around the whole entry so only structure:
\newcommand\tagscrtocentry[1]{\tagstructbegin{tag=TOCI}#1\tagstructend}
%leaf so structure and mc:
\newcommand\tagscrtocpagenumber[1]{%
\tagstructbegin{tag=Reference}%
\tagmcbegin{tag=Reference}%
#1%
\tagmcend
\tagstructend}
\DeclareTOCStyleEntry[
entryformat=\tagscrtocentry,
pagenumberformat=\tagscrtocpagenumber]{tocline}{chapter}
\DeclareTOCStyleEntry[
entryformat=\tagscrtocentry,
pagenumberformat=\tagscrtocpagenumber]{tocline}{section}
\DeclareTOCStyleEntry[
entryformat=\tagscrtocentry,
pagenumberformat=\tagscrtocpagenumber]{tocline}{subsection}
\DeclareTOCStyleEntry[
entryformat=\tagscrtocentry,
pagenumberformat=\tagscrtocpagenumber]{tocline}{subsubsection}
\DeclareTOCStyleEntry[
entryformat=\tagscrtocentry,
pagenumberformat=\tagscrtocpagenumber]{tocline}{paragraph}
\renewcommand{\addtocentrydefault}[3]{%
\ifstr{#3}{}{}
{%\
\ifstr{#2}{}
{%
\addcontentsline{toc}{#1}
{%
\protect\nonumberline
\tagstructbegin{tag=P}%
\tagmcbegin{tag=P}%
#3%
\tagmcend
\tagstructend
}%
}%
{%
\addcontentsline{toc}{#1}{%
\tagstructbegin{tag=Lbl}%
\tagmcbegin{tag=Lbl}%
\protect\numberline{#2}%
\tagmcend\tagstructend
\tagstructbegin{tag=P}%
\tagmcbegin{tag=P}%
#3%
\tagmcend
\tagstructend
}%
}%
}}%
% the dots must be marked too
\makeatletter
\renewcommand*{\TOCLineLeaderFill}[1][.]{%
\leaders\hbox{$\m@th
\mkern \@dotsep mu\hbox{\tagmcbegin{artifact}#1\tagmcend}\mkern \@dotsep
mu$}\hfill
}
%%%%%%%%%
% Sectioning commands
%%%%%%%%
\ExplSyntaxOn
\prop_new:N \g_tag_section_level_prop
\prop_gput:Nnn \g_tag_section_level_prop {chapter}{H1}
\prop_gput:Nnn \g_tag_section_level_prop {section}{H2}
\prop_gput:Nnn \g_tag_section_level_prop {subsection}{H3}
\prop_gput:Nnn \g_tag_section_level_prop {subsubsection}{H4}
\prop_gput:Nnn \g_tag_section_level_prop {paragraph}{H5}
%new 0.6, as attributes are local we have to put \tagmcbegin everywhere.
\renewcommand{\chapterlinesformat}[3]
{
\@hangfrom
{
\tagstructbegin{tag=\prop_item:Nn\g_tag_section_level_prop{chapter}}
\tl_if_empty:nF{#2}
{
\tagmcbegin {tag=\prop_item:Nn\g_tag_section_level_prop{chapter}}
#2
\tagmcend
}
}
{\tagmcbegin {tag=\prop_item:Nn\g_tag_section_level_prop{chapter}}
#3\tagmcend\tagstructend}%
}
%unnumbered sections level give an empty mc, need to think about it.
\renewcommand{\sectionlinesformat}[4]
{
\@hangfrom
{\hskip #2
\tagstructbegin{tag=\prop_item:Nn\g_tag_section_level_prop{#1}}
\tl_if_empty:nF{#3}
{
\tagmcbegin {tag=\prop_item:Nn\g_tag_section_level_prop{#1}}
#3
\tagmcend
}
}
{\tagmcbegin {tag=\prop_item:Nn\g_tag_section_level_prop{#1}}
#4
\tagmcend\tagstructend}%
}
\ExplSyntaxOff
\AfterTOCHead{\tagstructbegin{tag=TOC}}
\AfterStartingTOC{\tagstructend} %end TOC
\begin{document}
\tagstructbegin{tag=Document}
%do tagging of paragraphs
\ExplSyntaxOn
\everypar{
\message{the_size_of_stack_of_structure_elements_is_\seq_count:N \g__uftag_struct_stack_seq} %I don't know why spaces are ignored when I try to write something to the log, so I use _ instead of the space character.
\int_case:nn {\seq_count:N \g__uftag_struct_stack_seq}
{
{2}{\tagstructbegin{tag=P}\tagmcbegin{tag=P}}
{4}{\tagstructend \tagstructbegin{tag=P}\tagmcbegin{tag=P}}}}
\ExplSyntaxOff
\begin{centering}
test tagging of parts of documents\\
\end{centering}
\newpage
\tableofcontents
\newpage
\chapter{first chapter}
start testing of tagging of paragraphs
\section{test of section}
{\tiny
this is test document,which allow to do tests of tagging sections and paragraphs
\subsection{subsection 1}
test
again test
\begin{description}
\item[1] lemon
\item[2] orange
again testing of tagging parts of document
\item[3] red
\item[4] green
\end{description}}
\newpage
\subsection{new test}
end of test of tagging of document.
\ExplSyntaxOn
\int_step_inline:nnn{2}{\seq_count:N \g__uftag_struct_stack_seq }
{\tagstructend}
\ExplSyntaxOff
\end{document}
Answer 1
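At the moment there is no simple way to do this "for every kind of document", because LaTeX and nearly all classes and relevant packages lack suitable interfaces.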
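Solving this is not trivial. Given the large variety of implementations in this area, I doubt that it can be solved with only a few changes to the LaTeX kernel.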
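To add structure to the \section of one of the standard classes, you could redefine \@sect, an internal command of the LaTeX kernel. But this will not work for classes from the KOMA bundle, memoir, revtex4-1, and so on, as they all ignore or overwrite that command.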
For \part and \chapter there is nothing in the kernel at all: every class has its own implementation.
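Since each class needs its own treatment, a package would at least have to find out which class is loaded. A minimal sketch of such a check, using the kernel's preamble-only command \@ifclassloaded; the choice of classes tested and the \typeout messages are only illustrative assumptions, standing in for whatever class-specific code a package would actually run:

```latex
\documentclass{scrbook}

\makeatletter
% \@ifclassloaded{<class>}{<true>}{<false>} may only be used in the preamble.
\@ifclassloaded{scrbook}
  {\typeout{scrbook detected: KOMA-Script interfaces are available}}
  {\@ifclassloaded{book}
     {\typeout{standard book class detected}}
     {\typeout{unknown class: no class-specific code applied}}}
\makeatother

\begin{document}
Test.
\end{document}
```

This only detects the class; it does not remove the need for separate code per class.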
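So my long-term plan at present is to develop a suitable system of hooks and then to convince all major classes and packages to add these hooks in the right places. Tagging packages could then simply add their tagging commands to these hooks.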
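The division of labour described here can be sketched with the hook management that later became part of the LaTeX kernel (\NewHook, \UseHook, \AddToHook). The hook names demo/heading/before and demo/heading/after and the command \demoheading are hypothetical, invented for this illustration, and \typeout merely stands in for real tagging commands:

```latex
\documentclass{article}

% --- Class side (hypothetical): declare hooks and use them in a heading ---
\NewHook{demo/heading/before}
\NewHook{demo/heading/after}
\newcommand\demoheading[1]{%
  \par\noindent
  \UseHook{demo/heading/before}%
  \textbf{#1}%
  \UseHook{demo/heading/after}%
  \par}

% --- Tagging package side: fill the hooks without patching the class ---
% A real package would put commands such as \tagstructbegin and
% \tagstructend here; \typeout is only a placeholder.
\AddToHook{demo/heading/before}{\typeout{heading starts}}
\AddToHook{demo/heading/after}{\typeout{heading ends}}

\begin{document}
\demoheading{First heading}
Some text.
\end{document}
```

The point of the design is that the tagging code never needs to know how the class implements its headings, only which hooks it offers.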
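I do not plan to extend the tagpdf package with patches for all sorts of classes and packages to get this working. The few examples that exist are meant only as examples, to demonstrate that it is possible to get a structure.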
Answer 2
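Recently, this functionality was added in the 2023-06-01 LaTeX release distributed with TeXLive. See https://www.latex-project.org/news/2023/06/10/issue37-of-latex2e-released/. Its scope is relatively small (it supports only 3 document classes and handles basic sections/subsections/figures/some amsmath equations), but as far as I can tell it is an important step in the right direction. The contributors also say that they are working on expanding the scope of automatic tagging in subsequent releases.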
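A minimal sketch of how this automatic tagging is switched on, assuming a sufficiently recent LaTeX installation. The key values below (phase-II, PDF 2.0, en-US) are illustrative choices, and which classes and features are covered should be checked against the release notes linked above:

```latex
% \DocumentMetadata must come before \documentclass.
\DocumentMetadata{
  lang       = en-US,
  pdfversion = 2.0,
  testphase  = phase-II
}
\documentclass{article}

\begin{document}
\section{First section}
A paragraph of body text; with the test phase code loaded,
headings and paragraphs are tagged without manual tagging commands.
\end{document}
```

Compiling with lualatex matches the setup described in the question.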