My .tex document has a .toc (table of contents) file. It contains many lines, some of which have the following form:
\contentsline {part}{Some title here\hfil }{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
I know how to grep for the part lines and for the chapter lines. But I want to filter these lines and save the output in a csv file like this:
{Some title here},{Person name here},{5}
or, without the braces:
Some title here,Person name here,5
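1. Of course, the number in the last {} pair (the page number) is the same for both lines, so we can take it from the second line only.
2. Note that some {} pairs may be empty, and a pair may contain another {} pair. For example, the line could be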
\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}
which should be filtered to
Title with math $\frac{a}{b}$
Edit 1: I was able to get the number at the end of the line, without braces, using
grep '{part}' file.toc | awk -F '[{}]' '{print $(NF-1)}'
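To see why splitting on braces and taking the second-to-last field gives the page number, here is a minimal Python sketch of the same idea. The function name last_brace_field is made up for illustration; it assumes the line ends with the {page} group, as in the examples above.

```python
import re

def last_brace_field(line: str) -> str:
    # Split on every '{' or '}'. A line ending in "{15}" yields
    # [..., "15", ""], so the second-to-last field is the page number,
    # which is what awk's $(NF-1) selects with -F '[{}]'.
    fields = re.split(r"[{}]", line.rstrip("\n"))
    return fields[-2]

line = r"\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}"
print(last_brace_field(line))
```

Nested braces in the title do not matter here, because only the final fields of the split are used.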
Edit 2: I was able to filter the chapter lines and strip the junk with
grep '{chapter}' file.toc | sed 's/\\numberline//' | sed 's/\\contentsline//' | sed 's/{chapter}//' | sed 's/{}//' | sed 's/^ {/{/'
The output, without spaces, is
{Person name here}{5}
Edit 3: I was able to filter the part lines and clean up the output
\contentsline {chapter}{\numberline {}Person name here}{5}
which returns
{Title with math $\frac{a}{b}$}{15}
Answer 1
This uses GNU awk; with POSIX awk it would be cumbersome (it lacks gensub, which I used more than once).
#!/usr/bin/gawk -f

# Join the elements of a 0-indexed array with commas.
# result, i and end are local variables.
function join(array,    result, i, end)
{
    result = array[0];
    end = length(array) - 1;
    for (i = 1; i <= end; i++)
        result = result "," array[i];
    return result;
}

# Append elem to the end of a 0-indexed array.
function push(arr, elem)
{
    arr[length(arr)] = elem;
}

# split("", arr) is a horribly unreadable way to clear an array
BEGIN { split("", arr); }

/{part}|{chapter}/ {
    # Rewrite the line as "kind,page,text".
    l = gensub(".*{(.+)}{(.+)}{([0-9]+)}$", "\\1,\\3,\\2", "g");
    # awk strings are 1-indexed, so the substring starts at 1.
    if ("part" == substr(l, 1, 4)) {
        if (length(arr) > 0) { print join(arr); }
        split("", arr);
        push(arr, gensub("^(.*),(.*),(.*)$", "\\2,\\3", "g", l));
    } else {
        push(arr, gensub("^(.*),(.*),(.*)$", "\\3", "g", l));
    }
}

END { if (length(arr) > 0) print join(arr); }
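This relies on the fact that regular expressions are greedy, so each match consumes the whole line. It took more effort than I first expected.
Given the following input: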
\contentsline {part}{Some title here\hfil }{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {part}{Some title here\hfil }{7}
\contentsline {chapter}{\numberline {}Person name here}{7}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{7}
blah blah
\contentsline {part}{Some title here\hfil }{9}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{9}
running cat input | awk -f the_above_script.awk (where awk is GNU awk) produces:
5,Some title here\hfil ,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here
7,Some title here\hfil ,\numberline {}Person name here,\numberline {}Person name here
9,Some title here\hfil ,\numberline {}Person name here
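The page number is taken from the {part} line, and every {chapter} that occurs after a {part} is included in that part's row. This allows a part of the book to contain multiple chapters.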
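The grouping step can be sketched separately from the parsing. The Python sketch below assumes the lines have already been parsed into (kind, text, page) tuples; the function name group_by_part and the sample names are hypothetical. Unlike the awk script, it drops chapters that appear before the first part.

```python
def group_by_part(entries):
    """Group chapters under the most recent part.

    entries: iterable of (kind, text, page) tuples, already parsed
    (an assumption of this sketch; parsing is not shown here).
    """
    rows = []
    for kind, text, page in entries:
        if kind == "part":
            # A new part starts a new row: page first, then the title.
            rows.append([page, text])
        elif kind == "chapter" and rows:
            # A chapter is appended to the row of the current part.
            rows[-1].append(text)
    return rows

# Hypothetical sample data for illustration only.
entries = [
    ("part", "Some title here", "5"),
    ("chapter", "Person A", "5"),
    ("chapter", "Person B", "5"),
    ("part", "Another title", "7"),
    ("chapter", "Person C", "7"),
]
for row in group_by_part(entries):
    print(",".join(row))
```

A plain join like this, as in the awk script, breaks if a title or name contains a comma; a CSV writer would be needed for such data.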
Answer 2
Using Perl's Text::Balanced module, the contents of the top-level {} pairs can be extracted like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# this will of course fail if an entry spans multiple lines, as this
# is only a line-by-line parser of standard input or the filenames
# passed to this script
while ( my $line = readline ) {
    if ( $line =~ m/\\contentsline / ) {
        my @parts = extract_contents($line);
        # emit as CSV (though ideally instead use Text::CSV module)
        print join( ",", @parts ), "\n";
    } else {
        #print "NO MATCH ON $line";
    }
}

sub extract_contents {
    my $line = shift;
    my @parts;
    # while we can get a {} bit out of the input line, anywhere in the
    # input line
    while ( my $part = extract_bracketed( $line, '{}', qr/[^{]*/ ) ) {
        # trim off the delimiters
        $part = substr $part, 1, length($part) - 2;
        push @parts, $part;
    }
    return @parts;
}
With some input:
% < input
not content line
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}
also not content line
% perl parser input
chapter,\numberline {}Person name here,5
part,Title with math $\frac{a}{b}$\hfil ,15
%
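For comparison, the same top-level extraction can be sketched in Python with a simple depth counter and the standard csv module. This is not what Text::Balanced does internally; the function name top_level_groups is made up, and the sketch assumes braces are balanced and never escaped as \{ or \}.

```python
import csv
import io

def top_level_groups(line):
    """Return the contents of each top-level {...} group in line."""
    groups, depth, start = [], 0, 0
    for i, ch in enumerate(line):
        if ch == "{":
            if depth == 0:
                start = i + 1          # first character inside the group
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:             # the outermost group just closed
                groups.append(line[start:i])
    return groups

lines = [
    "not content line",
    r"\contentsline {chapter}{\numberline {}Person name here}{5}",
    r"\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}",
]
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for line in lines:
    if line.startswith("\\contentsline "):
        writer.writerow(top_level_groups(line))
print(out.getvalue(), end="")
```

Using csv.writer instead of a plain join means a title that contains a comma or a quote is quoted rather than silently splitting into extra columns.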
Answer 3
In TXR:
@(repeat)
\contentsline {part}{@title\hfil }{@page}
@ (trailer)
@ (skip)
\contentsline {chapter}{\numberline {}@author}{@page}
@ (do (put-line `@title,@author,@page`))
@(end)
Sample data:
\lorem{ipsum}
\contentsline {part}{The Art of The Meringue\hfil }{5}
a
b
c
j
\contentsline {chapter}{\numberline {}Doug LeMonjello}{5}
\contentsline {part}{Parachuting Primer\hfil }{16}
\contentsline {chapter}{\numberline {}Hugo Phirst}{16}
\contentsline {part}{Making Sense of $\frac{a}{b}$\hfil }{19}
\contentsline {part}{War and Peace\hfil }{27}
\contentsline {chapter}{\numberline {}D. Vide}{19}
\contentsline {part}{War and Peace\hfil }{19}
Run:
$ txr title-auth.txr data
The Art of The Meringue,Doug LeMonjello,5
Parachuting Primer,Hugo Phirst,16
Making Sense of $\frac{a}{b}$,D. Vide,19
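Notes:
- Because @(trailer) is used, the author lines do not have to strictly follow their part. The data can introduce several \contentsline {part} elements, followed by chapter lines that are matched to them by page number.
- @(skip) means that the entire remainder of the data is searched. Performance can be improved by adding a numeric argument to limit the range. If it can be assumed that the matching {chapter} is always found within 50 lines after the {part}, we can use @(skip 50).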
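The matching strategy described in these notes can be sketched in Python as follows. This is only an analogy to the TXR query, not a translation of it: the function name pair_parts_with_chapters is hypothetical, the input is assumed to be already parsed into (kind, text, page) tuples, and the window counts parsed entries rather than raw lines.

```python
def pair_parts_with_chapters(entries, window=None):
    """Pair each part with the first later chapter on the same page.

    entries: list of (kind, text, page) tuples, already parsed.
    window:  optional limit on how many following entries to search,
             loosely analogous to the numeric argument of @(skip).
    """
    rows = []
    for i, (kind, title, page) in enumerate(entries):
        if kind != "part":
            continue
        stop = len(entries) if window is None else min(len(entries), i + 1 + window)
        # Look ahead without consuming anything, so that several parts
        # may precede the chapters they are matched with.
        for other_kind, author, other_page in entries[i + 1:stop]:
            if other_kind == "chapter" and other_page == page:
                rows.append((title, author, page))
                break
    return rows

entries = [
    ("part", "The Art of The Meringue", "5"),
    ("chapter", "Doug LeMonjello", "5"),
    ("part", "Parachuting Primer", "16"),
    ("chapter", "Hugo Phirst", "16"),
    ("part", r"Making Sense of $\frac{a}{b}$", "19"),
    ("part", "War and Peace", "27"),
    ("chapter", "D. Vide", "19"),
    ("part", "War and Peace", "19"),
]
for title, author, page in pair_parts_with_chapters(entries):
    print(f"{title},{author},{page}")
```

With window=1 the third part no longer finds its chapter, because the next entry is another part. That is the trade-off of limiting the search range: less work per part, but matches outside the window are missed.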