(Automatically) annotating LaTeX source files to make them more readable

Question 1

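I have a quick-and-dirty solution that is useful much of the time. Note that it is for personal use only (1); it could be improved by adding in-place editing, error handling, and so on. Still, I think it is useful as it stands. The idea is to reuse LaTeX's own numbering.

So, first, you need to add labels to your document (which is good practice anyway):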

\documentclass[12pt]{article}
\begin{document}
\section{a}
\label{sec:a}

\section{b}
\label{sec:b}

\subsection{b a}
\label{sec:ba}

\newpage

\subsection{b b}
\label{sec:bb}

\section{c}
\label{sec:c}
\end{document}

Next, run LaTeX on the file as usual; suppose it is called walla.tex. Now run this small Python script:

#!/usr/bin/env python
# Annotate every \label in a .tex file with its number and page,
# both taken from the matching .aux file.
import re
import sys

labels = {}
# build a map from label name to its annotation text
with open(sys.argv[1] + ".aux") as aux:
    for line in aux:
        m = re.search(r'\\newlabel{(.*?)}{{(.*?)}{(.*?)}', line)
        if m:
            labels[m.group(1)] = "label: %s will be number: %s at page: %s" % (
                m.group(1), m.group(2), m.group(3))

# scan input file
with open(sys.argv[1] + ".tex") as tex:
    for line in tex:
        # drop annotations written by a previous run
        if line.startswith('%%% label'):
            continue
        # we may have a label, try to match it by its exact name
        m = re.search(r'\\label{(.*?)}', line)
        if m and m.group(1) in labels:
            # modify this to pretty print
            sys.stdout.write("%%%%%% %s\n" % labels[m.group(1)])
        sys.stdout.write(line)

Name it find_tex_labels, make it executable, and run it as find_tex_labels walla > walla_annotated.tex (note that the argument has no extension).

The output is the annotated LaTeX file:

\documentclass[12pt]{article}
\begin{document}
\section{a}
%%% label: sec:a will be number: 1 at page: 1
\label{sec:a}

\section{b}
%%% label: sec:b will be number: 2 at page: 1
\label{sec:b}

\subsection{b a}
%%% label: sec:ba will be number: 2.1 at page: 1
\label{sec:ba}

\newpage

\subsection{b b}
%%% label: sec:bb will be number: 2.2 at page: 2
\label{sec:bb}

\section{c}
%%% label: sec:c will be number: 3 at page: 2
\label{sec:c}
\end{document}

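...and this works for every label. I find it very useful for cross-referencing equations and the like when I am editing on a device that has no LaTeX installed. You can now replace the original walla.tex with the new file.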
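
For reference, the annotations come from the \newlabel entries that LaTeX writes to walla.aux. The following is a minimal sketch of that parsing step, assuming each entry starts with the shape \newlabel{name}{{number}{page} (extra fields, such as those hyperref adds, are ignored). The name parse_newlabel and the sample line are illustrative and not part of the script above.

```python
import re

# Assumed start of an .aux entry: \newlabel{name}{{number}{page}
NEWLABEL = re.compile(r'\\newlabel\{(.*?)\}\{\{(.*?)\}\{(.*?)\}')

def parse_newlabel(line):
    """Return (name, number, page) for a \\newlabel line, or None otherwise."""
    m = NEWLABEL.search(line)
    return m.groups() if m else None

sample = r'\newlabel{sec:ba}{{2.1}{1}}'   # hypothetical line from walla.aux
print(parse_newlabel(sample))             # ('sec:ba', '2.1', '1')
```

Labels whose number itself contains braces would need a real brace-matching parser rather than a regular expression.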

It is your responsibility to keep things in sync... and do not use "%%% label" comments anywhere else in the file.
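
Because the script skips lines that start with "%%% label" when it copies the .tex file, the same rule can be used to get back to a clean file. A small sketch of that idea; strip_annotations is a hypothetical helper name, not something the script provides:

```python
def strip_annotations(lines, marker='%%% label'):
    """Return the lines without the generated annotation comments."""
    return [line for line in lines if not line.startswith(marker)]

annotated = [
    '\\section{a}\n',
    '%%% label: sec:a will be number: 1 at page: 1\n',
    '\\label{sec:a}\n',
]
print(''.join(strip_annotations(annotated)), end='')
```

This is also why a hand-written comment starting with "%%% label" would silently disappear on the next run.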

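Footnotes:

(1) I have promised myself many times to polish it. But since I am the only one who uses it, I just fix bugs as they show up... and I never find the time to clean it up.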

Question 2

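The relatively tricky part is that you have to buffer a comment line so that you can check whether it needs updating in case the next line is a sectioning command. If that information were on the same line, or on the following line, this would be simpler.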

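The following should get you started. It can be invoked as python script.py input output, or you can omit the output argument to have it write to standard output. Do not run "python script.py xx.tex xx.tex"; instead, write to a temporary file and copy that back over the original.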
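
The temporary-file advice can be sketched as follows. This is only one possible way to do it, under the assumption that the transformation is a callable taking an input file object and an output file object; rewrite_via_tempfile is a hypothetical helper name.

```python
import os
import shutil
import tempfile

def rewrite_via_tempfile(path, transform):
    """Run transform(infile, outfile) on path without writing to the file being read."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as out, open(path) as src:
            transform(src, out)
        shutil.move(tmp_path, path)      # copy the result back over the original
    finally:
        if os.path.exists(tmp_path):     # only true if something failed
            os.remove(tmp_path)
```

Since the ProcessLaTeX class in this answer takes an input and an output file in its constructor, rewrite_via_tempfile('xx.tex', ProcessLaTeX) would be one way to call it.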

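This updates existing lines of the form %x.y.z rest of comment, leaving rest of comment unchanged. If there is no such comment yet, one is inserted. The special comments must start at the beginning of a line, and so must the sectioning commands.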

import sys

class ProcessLaTeX:
    def __init__(self, ifp, ofp):
        self.ofp = ofp
        self.prev_comment = None
        self.level = []
        for line in ifp:
            self.process(line)
        # emit last line if comment
        if self.prev_comment:
            self.ofp.write(self.prev_comment)

    def output(self, line):
        pass

    def process(self, line):
        if line[0] == '%':
            # store comment line, emitting any previously stored line
            if self.prev_comment:
                self.ofp.write(self.prev_comment)
            self.prev_comment = line
            return
        lvl = self.check_level(line)
        if lvl > -1:
            self.output_level_comment(lvl)
        if self.prev_comment:
            self.ofp.write(self.prev_comment)
            self.prev_comment = None
        self.ofp.write(line)

    def output_level_comment(self, lvl):
        if self.prev_comment: # check if we overwrite an old one
            # do not use the starting '%' and final newline
            words = self.prev_comment[1:-1].split(' ', 1)
            for c in words[0]:
                if c not in '01234567890.':
                    self.ofp.write(self.prev_comment)
                    self.prev_comment = None
                    break
        while len(self.level) <= lvl: # pad in case this is a deeper level
            self.level.append(0)
        self.level[lvl] += 1
        self.level = self.level[:lvl+1] # cut off excess levels
        lvls = '%' + '.'.join([str(l) for l in self.level])
        if self.prev_comment: # overwrite the previous words[1]
            words[0] = lvls
            outs = ' '.join(words)
            if not outs[-1] == '\n':
                outs += '\n'
            self.prev_comment = None
        else:
            outs = lvls + '\n'
        self.ofp.write(outs)

    def check_level(self, line):
        if line and not line[0] == '\\':
            return -1
        cmd = line[1:].split('{', 1)[0]
        try:
            res = ['section', 'subsection', 'subsubsection',
                     'paragraph', 'subparagraph'].index(cmd)
        except ValueError:
            return -1
        return res

out = sys.stdout if len(sys.argv) < 3 else open(sys.argv[2], 'w')
pl = ProcessLaTeX(open(sys.argv[1]), out)
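
The numbering itself comes down to a list of counters, one per sectioning depth, as kept in self.level above. A stand-alone sketch of that bookkeeping, written only to illustrate the idea (next_number is a hypothetical helper, not part of the class):

```python
def next_number(levels, lvl):
    """Advance the counter at depth lvl; return (new_levels, 'x.y.z')."""
    levels = levels[:lvl + 1]                   # drop counters of deeper levels
    levels += [0] * (lvl + 1 - len(levels))     # pad when jumping to a deeper level
    levels[lvl] += 1
    return levels, '.'.join(str(n) for n in levels)

levels = []
# 0 = \section, 1 = \subsection, 2 = \subsubsection
for lvl in [0, 1, 2, 1, 2, 0, 1]:
    levels, label = next_number(levels, lvl)
    print('%' + label)
```

For that sequence the sketch prints %1, %1.1, %1.1.1, %1.2, %1.2.1, %2 and %2.1, one per line.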

Question 3

I think what you are looking for is nl's section-delimiter option. From info nl:

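nl decomposes its input into (logical) pages; by default, the line number is reset to 1 at the top of each logical page. nl treats all of the input files as a single document; it does not reset line numbers or logical pages between files.
A logical page consists of three sections: header, body, and footer. Any of the sections can be empty. Each can be numbered in a different style from the others.
The beginnings of the sections of logical pages are indicated in the input file by a line containing exactly one of these delimiter strings:
- \:\:\: - start of header;
- \:\: - start of body;
- \: - start of footer.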
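
To make the quoted description concrete, here is a toy model of it in Python. It is a sketch under several assumptions and not a reimplementation of nl: it only mimics numbering body lines that match a pattern (as -bp does), restarts the count at each header delimiter, prints an empty line in place of a delimiter line, ignores nl's column padding and separator options, and uses Python regular expressions rather than the basic regular expressions nl expects. The name number_pages is hypothetical.

```python
import re

HEADER, BODY, FOOTER = r'\:\:\:', r'\:\:', r'\:'
SECTION = {HEADER: 'header', BODY: 'body', FOOTER: 'footer'}

def number_pages(lines, pattern):
    """Toy model: number body lines matching pattern, per logical page."""
    out, count, section = [], 0, 'body'   # text before any delimiter is body
    for line in lines:
        if line in SECTION:
            if line == HEADER:
                count = 0                 # a header starts a new logical page
            section = SECTION[line]
            out.append('')                # delimiter lines become empty lines
        elif section == 'body' and re.search(pattern, line):
            count += 1
            out.append('%d %s' % (count, line))
        else:
            out.append(line)              # header, footer, non-matching lines
    return out

page = ['x', 'skip', 'x', HEADER, 'title', BODY, 'x']
print(number_pages(page, r'^x'))
# ['1 x', 'skip', '2 x', '', 'title', '', '1 x']
```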

You can set nl's logical-page delimiters on the command line with the -d option, for example:

nl -dCC <infile

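...where CC stands for any two characters that replace the \: shown in the documentation. Given your input, I do not think that is necessary: we can simply insert the default delimiters where applicable and do a little input filtering. Here is nl paired with sed in a shell function I wrote, designed to filter its own output recursively: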

sd() { n='
';     nl -bp"^\\\\$1section" -w1 -s"$n\:\:\:$n" |
       sed '/./!d;/^[0-9]/!s/^[[:blank:]]*//;/^%[0-9.]*$/h;t
            s/./%&/;x;/%/G;s//./2;/\n.*/h;s///;x;s/\n//;N
            s/\(\(.*\)\(\n\)\)\(\(.*\)\(..\)\)/\4\3\1\5/'
}

I fed it something like your example data and piped its output back through it a few times:

sd <<\IN |sd sub | sd subsub | sd subsubsub
\begin{document}
\section{}
some ordinary lines
\subsection{}
whatever
\subsubsection{}
\subsection{}
\subsubsection{}
\subsubsubsection{}
\section{}
\subsection{}
\end{document}
IN

Run exactly as printed above, that gives:

\begin{document}
%1
\section{}
some ordinary lines
%1.1
\subsection{}
whatever
%1.1.1
\subsubsection{}
%1.2
\subsection{}
%1.2.1
\subsubsection{}

\:\:\:
%1.2.1.1
\:\:
\subsubsubsection{}
%2
\section{}
%2.1
\subsection{}
\end{document}

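As you can see, the filtering is not quite finished, but it does seem to do the job. nl numbers the body (-b) of its input according to a pattern (-bp'pattern'), and it restarts its count for every logical page, as delimited by a line consisting only of its logical-page header delimiter \:\:\:.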

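So sed filters nl's output, which already contains the delimiters set through nl's -s (separator) argument, and basically just rearranges it a little so that nl finds its section delimiters on the next pass. sed also keeps a copy of the last ^%[0-9.]*$ line in its hold space (h): if the hold space is not empty when sed meets a line that begins with a number, it appends that line to the contents of the hold space, following a '.'. That is the meat and potatoes, really.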
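
The hold-space idea on its own can be sketched in Python as below. This illustrates only the prefixing step, under the simplifying assumption that the counts produced by nl arrive as lines containing nothing but a number; it is not a translation of the sed script, which also moves nl's delimiter lines around. The name prefix_counts is hypothetical.

```python
import re

def prefix_counts(lines):
    """Turn bare counts into %prefix.count using the last %x.y line seen."""
    hold = ''                                 # plays the role of sed's hold space
    out = []
    for line in lines:
        if re.match(r'^%[0-9.]*$', line):
            hold = line                       # remember the latest %x.y comment
            out.append(line)
        elif re.match(r'^[0-9]+$', line):
            if hold:
                out.append(hold + '.' + line) # append the count after a '.'
            else:
                out.append('%' + line)        # first pass: no prefix yet
        else:
            out.append(line)
    return out

second_pass = ['%1', '\\section{}', '1', '\\subsection{}', '2', '\\subsection{}']
print(prefix_counts(second_pass))
```

In this sketch a new %x.y line replaces the held prefix, which is what lets the numbering continue as %2.1 after the second \section.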

Still, as I said, it is not quite finished. The last pass leaves section delimiters and a blank line in the output. So, to clean that up:

sd <<\IN |sd sub | sd subsub | sd subsubsub | grep -v '^\\:\|^$'
\begin{document}
\section{}
some ordinary lines
\subsection{}
whatever
\subsubsection{}
\subsection{}
\subsubsection{}
\subsubsubsection{}
\subsubsection{}
\subsubsubsection{}
\section{}
\subsection{}
\end{document}
IN

Output:

\begin{document}
%1
\section{}
some ordinary lines
%1.1
\subsection{}
whatever
%1.1.1
\subsubsection{}
%1.2
\subsection{}
%1.2.1
\subsubsection{}
%1.2.1.1
\subsubsubsection{}
%1.2.2
\subsubsection{}
%1.2.2.1
\subsubsubsection{}
%2
\section{}
%2.1
\subsection{}
\end{document}
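
For comparison, the final grep -v step has a direct Python counterpart. A minimal sketch, assuming the lines have already had their trailing newlines removed; drop_delimiters is a hypothetical name:

```python
def drop_delimiters(lines):
    """Keep lines that are neither empty nor start with the \\: delimiter."""
    return [ln for ln in lines if ln and not ln.startswith('\\:')]

leftover = ['\\subsubsection{}', '', '\\:\\:\\:', '%1.2.1.1', '\\:\\:',
            '\\subsubsubsection{}']
print('\n'.join(drop_delimiters(leftover)))
```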

Answer