How to automatically tag chapters, sections, subsections, and paragraphs?


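I want to write a package that handles automatic tagging of PDF documents. I am trying to solve the problem of tagging the paragraphs inside chapters, sections, subsections, subsubsections, and so on, down to the paragraphs inside subparagraphs. I have read the example from the tagpdf package that tags chapters, sections, subsections, etc., but first, it only works for the titles of those parts, and the document text does not show up in the PDF; and second, it only works for the scrbook class, whereas I would like it to work for any document class if possible (I don't know whether the class of a document can be checked from latex/lualatex). I tried using the \everypar command, but as far as I can tell from my log, this command seems to be called only n+1 times, where n is the number of sections and subsections. My document currently only needs to compile with lualatex. Many thanks to everyone for your help.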

Test tagging

\documentclass[12pt]{scrbook}
\usepackage{tagpdf}
\tagpdfsetup{interwordspace=true,activate-all,uncompress}
\usepackage{amsmath,amssymb}
\title{test document}
\author{AlexanderKozlovskiy}
\date{\today}
%\maketitle (why does this not work?)
%Marking the toc entries
%around the whole entry so only structure:
\newcommand\tagscrtocentry[1]{\tagstructbegin{tag=TOCI}#1\tagstructend}

%leaf so structure and mc:
\newcommand\tagscrtocpagenumber[1]{%
 \tagstructbegin{tag=Reference}%
 \tagmcbegin{tag=Reference}%
 #1%
 \tagmcend
 \tagstructend}

\DeclareTOCStyleEntry[
   entryformat=\tagscrtocentry,
   pagenumberformat=\tagscrtocpagenumber]{tocline}{chapter}
\DeclareTOCStyleEntry[
   entryformat=\tagscrtocentry,
   pagenumberformat=\tagscrtocpagenumber]{tocline}{section}
\DeclareTOCStyleEntry[
   entryformat=\tagscrtocentry,
   pagenumberformat=\tagscrtocpagenumber]{tocline}{subsection}
\DeclareTOCStyleEntry[
   entryformat=\tagscrtocentry,
   pagenumberformat=\tagscrtocpagenumber]{tocline}{subsubsection}
\DeclareTOCStyleEntry[
   entryformat=\tagscrtocentry,
   pagenumberformat=\tagscrtocpagenumber]{tocline}{paragraph}

\renewcommand{\addtocentrydefault}[3]{%
 \ifstr{#3}{}{}
   {%
   \ifstr{#2}{}
    {%
     \addcontentsline{toc}{#1}
      {%
       \protect\nonumberline
       \tagstructbegin{tag=P}%
       \tagmcbegin{tag=P}%
        #3%
       \tagmcend
       \tagstructend
      }%
    }%
    {%
    \addcontentsline{toc}{#1}{%
     \tagstructbegin{tag=Lbl}%
     \tagmcbegin{tag=Lbl}%
     \protect\numberline{#2}%
     \tagmcend\tagstructend
     \tagstructbegin{tag=P}%
     \tagmcbegin{tag=P}%
      #3%
     \tagmcend
     \tagstructend
     }%
    }%
   }}%

% the dots must be marked too
\makeatletter
\renewcommand*{\TOCLineLeaderFill}[1][.]{%
  \leaders\hbox{$\m@th
    \mkern \@dotsep mu\hbox{\tagmcbegin{artifact}#1\tagmcend}\mkern \@dotsep
    mu$}\hfill
}

%%%%%%%%%
% Sectioning commands
%%%%%%%%

\ExplSyntaxOn
\prop_new:N   \g_tag_section_level_prop
\prop_gput:Nnn \g_tag_section_level_prop {chapter}{H1}
\prop_gput:Nnn \g_tag_section_level_prop {section}{H2}
\prop_gput:Nnn \g_tag_section_level_prop {subsection}{H3}
\prop_gput:Nnn \g_tag_section_level_prop {subsubsection}{H4}
\prop_gput:Nnn \g_tag_section_level_prop {paragraph}{H5}

%new 0.6, as attributes are local we have to put \tagmcbegin everywhere.
\renewcommand{\chapterlinesformat}[3]
 {
  \@hangfrom
   {
    \tagstructbegin{tag=\prop_item:Nn\g_tag_section_level_prop{chapter}}
    \tl_if_empty:nF{#2}
     {
      \tagmcbegin    {tag=\prop_item:Nn\g_tag_section_level_prop{chapter}}
      #2
      \tagmcend
     }
   }
   {\tagmcbegin    {tag=\prop_item:Nn\g_tag_section_level_prop{chapter}}
    #3\tagmcend\tagstructend}%
}

%unnumbered sections level give an empty mc, need to think about it.
\renewcommand{\sectionlinesformat}[4]
 {
  \@hangfrom
   {\hskip #2
    \tagstructbegin{tag=\prop_item:Nn\g_tag_section_level_prop{#1}}
    \tl_if_empty:nF{#3}
    {
     \tagmcbegin    {tag=\prop_item:Nn\g_tag_section_level_prop{#1}}
     #3
     \tagmcend
    }
   }
   {\tagmcbegin    {tag=\prop_item:Nn\g_tag_section_level_prop{#1}}
    #4
    \tagmcend\tagstructend}%
 }
\ExplSyntaxOff
\makeatother % close the \makeatletter opened before \TOCLineLeaderFill
\AfterTOCHead{\tagstructbegin{tag=TOC}}
\AfterStartingTOC{\tagstructend} %end TOC

\begin{document}
\tagstructbegin{tag=Document}
%do tagging of paragraphs
\ExplSyntaxOn
\everypar{
% Spaces are ignored under \ExplSyntaxOn (use ~ for a space), so underscores are used in the log message.
\message{the_size_of_stack_of_structure_elements_is_\seq_count:N \g__uftag_struct_stack_seq}
\int_case:nn {\seq_count:N \g__uftag_struct_stack_seq}
  {
   {2}{\tagstructbegin{tag=P}\tagmcbegin{tag=P}}
{4}{\tagstructend \tagstructbegin{tag=P}\tagmcbegin{tag=P}}}}
\ExplSyntaxOff
\begin{centering}
test tagging of parts of documents\\
\end{centering}

\newpage

\tableofcontents

\newpage

\chapter{first chapter}

start testing of tagging of paragraphs

\section{test of section}

{\tiny

this is test document,which allow to do tests of tagging sections and paragraphs

\subsection{subsection 1}

test

again test

\begin{description}

\item[1] lemon

\item[2] orange

again testing of tagging parts of document

\item[3] red

\item[4] green

\end{description}}

\newpage

\subsection{new test}

end of test of tagging of document.

\ExplSyntaxOn
\int_step_inline:nnn{2}{\seq_count:N \g__uftag_struct_stack_seq }
{\tagstructend}
\ExplSyntaxOff
\end{document}

Answer 1

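Currently there is no easy way to do this "for all kinds of documents", because
LaTeX and almost all classes and related packages lack suitable interfaces.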

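Solving this will not be easy. Given how varied the implementations in this area are, I doubt that it can be solved with a few changes to the LaTeX kernel alone.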

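To add structure to \section in one of the standard classes, you could redefine \@sect, an internal command of the LaTeX kernel. But this will not work for classes from the KOMA bundle, memoir, revtex4-1 and so on, because they all ignore or overwrite that command.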

For \part and \chapter there is nothing in the kernel at all: every class has its own implementation.
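On the question of whether the document class can be checked: the kernel provides the internal test \@ifclassloaded, which is only usable in the preamble (for example while a package is being loaded). The following is a minimal sketch of how a package might branch on the class; the package name mytagging is hypothetical, and the branches only write messages rather than installing any real tagging code.

```latex
% mytagging.sty -- hypothetical package name, for illustration only
\NeedsTeXFormat{LaTeX2e}
\ProvidesPackage{mytagging}

% \@ifclassloaded{<class>}{<true>}{<false>} is a kernel command;
% inside a package file @ is already a letter.
\@ifclassloaded{scrbook}
  {\PackageInfo{mytagging}{KOMA-Script class scrbook detected}}
  {\@ifclassloaded{article}
     {\PackageInfo{mytagging}{standard class article detected}}
     {\PackageWarning{mytagging}{%
        Unknown class, no sectioning tagging is set up}}}
```

This only detects the class; each branch would still need class-specific code, which is exactly the maintenance burden described above.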

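So my current long-term plan is to develop a suitable hook system and then convince all major classes and packages to add these hooks in the right places. A tagging package could then simply add its tagging commands to these hooks.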
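Since this answer was written, the LaTeX kernel has gained a general hook mechanism, including paragraph hooks. The sketch below assumes LaTeX 2021-06-01 or later and shows only the mechanism, not real tagging: it counts how often the para/begin hook fires. The counter name parcount is an assumption made for this illustration. Unlike a manual \everypar assignment, which sectioning commands and lists tend to overwrite (a likely reason for the "n+1 calls" seen in the question), code added with \AddToHook is not lost when other code also uses the hook.

```latex
\documentclass{article}

\newcounter{parcount}% illustrative counter, not part of any package

% Run code at the start of every paragraph via the kernel hook.
\AddToHook{para/begin}{\stepcounter{parcount}}

% Report the total in the log at the end of the document.
\AddToHook{enddocument}{\typeout{Paragraphs started: \arabic{parcount}}}

\begin{document}
\section{Test}
First paragraph.

Second paragraph.
\end{document}
```

The reported number can be higher than the number of visible text paragraphs, because headings and other constructs may start paragraphs internally.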

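I do not plan to extend the tagpdf package with patches for all sorts of classes and packages to make this work. The few examples it ships with are only examples, meant to demonstrate that obtaining a structure is possible.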

Answer 2

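This functionality was recently added in the 2023-06-01 LaTeX release, available through TeX Live. See https://www.latex-project.org/news/2023/06/10/issue37-of-latex2e-released/. Its scope is still fairly small (only 3 document classes are supported, and it handles basic sections/subsections/figures/some amsmath equations), but as far as I can tell it is an important step in the right direction. The contributors have also said that they are working on widening the scope of automatic tagging in later releases.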
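A minimal sketch of how this automatic tagging is switched on, assuming a LaTeX release of 2023-06-01 or later and one of the supported standard classes (article is used here as an assumption). The test phase is requested through \DocumentMetadata, which has to come before \documentclass; no manual \tagstructbegin or \tagmcbegin calls are needed.

```latex
% Compile with lualatex.
\DocumentMetadata{
  lang       = en-US,
  pdfversion = 2.0,
  testphase  = phase-III, % loads the experimental tagging code
}
\documentclass{article}

\begin{document}
\section{First section}
Some text in a paragraph.

\subsection{A subsection}
More text in another paragraph.
\end{document}
```

Because this code is still in a test phase, the resulting structure should be checked in a PDF structure viewer rather than assumed to be complete.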
