Extracting document statistics? - How many pages does chapter xy have? Counting fixme notes?


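I have a large pdflatex document in which each chapter lives in a separate text file and is included with \include{chapter3.tex}...

  • How can I extract the number of pages of each chapter and write it to a text file?
    I would like to know how many pages each chapter has (at a given moment) and get a list like the one below
    (can this be done using the included text files, or is it better to define a label at the start of each chapter and then read out the page numbers of those labels?):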

Chapter 1: 5 pages
Chapter 2: 10 pages
Chapter 3: 4 pages

Document: 19 pages

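  • Additionally, I would like to count the occurrences of certain commands, e.g. \N{note} and \NK{note} (I have defined these commands in the document with the fixme package to create notes), for each chapter and write the counts to a text file, e.g.: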

Chapter      \N  \NK
Chapter 1:   20    3
Chapter 2:    3    5

Document:    23    8

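Answer 1

I assume you do not want to modify the source files (otherwise it would be easier: just add some smart command after each \chapter).

You could do something like this: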

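\let\origchapter=\chapter
% the chapter counter has not been stepped yet, hence the +1 in the label name;
% \chapter* does not step the counter, so starred chapters would produce duplicate labels
\def\chapter{\label{chap:\the\numexpr\value{chapter}+1\relax}\origchapter}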

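The drawback is that the label ends up on the previous page, because \label is executed before \chapter starts a new page. To do better, you would have to handle the syntax of \chapter (optional star, optional argument), which takes a little time to implement but is entirely doable (and not that difficult).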

Then just use a perl/python/lua/whatever script to parse the .aux file.
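As a minimal sketch of such a parser in Python: it assumes labels named chap:<number> as above, that each label's page is the first page of its chapter (which the simple redefinition above does not guarantee), and that page numbers are arabic. The function names and the sample .aux text are made up for illustration. Note that with \include the \newlabel lines land in the per-chapter .aux files, so the text of all of them would have to be read.

```python
import re

# Matches e.g. \newlabel{chap:3}{{3}{16}} and the longer hyperref form
# \newlabel{chap:3}{{3}{16}{Title}{chapter.3}{}}; group 2 is the page number.
NEWLABEL = re.compile(r'\\newlabel\{chap:([^}]*)\}\{\{[^{}]*\}\{(\d+)\}')


def chapter_start_pages(aux_text):
    """Return [(label_suffix, start_page), ...] in order of appearance."""
    return [(m.group(1), int(m.group(2))) for m in NEWLABEL.finditer(aux_text)]


def pages_per_chapter(starts, last_page):
    """A chapter ends one page before the next one starts; the last ends at last_page."""
    result = []
    for i, (name, start) in enumerate(starts):
        end = starts[i + 1][1] - 1 if i + 1 < len(starts) else last_page
        result.append((name, end - start + 1))
    return result


if __name__ == '__main__':
    # Hypothetical .aux content, chosen to reproduce the list from the question.
    sample = '\n'.join([
        r'\relax',
        r'\newlabel{chap:1}{{1}{1}}',
        r'\newlabel{chap:2}{{2}{6}}',
        r'\newlabel{chap:3}{{3}{16}}',
    ])
    for name, count in pages_per_chapter(chapter_start_pages(sample), last_page=19):
        print('Chapter %s: %d pages' % (name, count))
```

The total page count (last_page) has to come from somewhere else, for example from the "Output written on" line of the log file.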

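Alternatively, you could use something like

\write\somefilehandler{Chapter: \arabic{chapter} \thepage}

instead of \label; you would then have to open \somefilehandler at the beginning of the document and close it in \AtEndDocument.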

As for your second question, a simple idea is:

\newcounter{foocounter}
\let\origfoo=\foo
\def\foo{\stepcounter{foocounter}\origfoo}
\AtEndDocument{\message{\string\foo: \the\value{foocounter}}}

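This gives you the total for the whole document in the log file and on the terminal. If you want totals per chapter, you can issue the \message (or \write) calls in \chapter, redefining it in the spirit of the above.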
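If running TeX is not required, the counts can also be approximated by scanning the chapter sources directly. This is a different technique from the counter approach above: a plain regex count in Python, sketched under the assumption that the commands are called literally as \N and \NK in the chapter files. The function name and the sample texts are hypothetical.

```python
import re


def count_command(tex_source, name):
    """Count uses of \\<name>, ignoring longer commands that merely start with it."""
    # Drop comments: an unescaped % up to the end of the line.
    # (Simplification: a % directly after a double backslash is not treated as a comment.)
    uncommented = re.sub(r'(?<!\\)%.*', '', tex_source)
    # The lookahead keeps \N from matching inside \NK, \Note, ...
    pattern = re.compile(r'\\' + re.escape(name) + r'(?![A-Za-z])')
    return len(pattern.findall(uncommented))


if __name__ == '__main__':
    # Hypothetical chapter contents; in practice these would be read from the chapter files.
    chapters = {
        'Chapter 1': r'Text \N{check this} more text \NK{fix} % \N{commented out}',
        'Chapter 2': r'\NK{a} \NK{b} and \N{c}',
    }
    print('%-12s %3s %4s' % ('Chapter', r'\N', r'\NK'))
    for title, source in chapters.items():
        print('%-12s %3d %4d' % (title + ':', count_command(source, 'N'),
                                 count_command(source, 'NK')))
```

A scan like this also counts a \newcommand{\N} line if the definition sits in a scanned file, and it misses uses hidden inside other macros, so the numbers can differ slightly from what the counters report.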

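Answer 2

As I mentioned in my comment above, this is the Python script I use to split the generated PDF file by section. You may need to adapt it to your needs; I hope it is useful to you. It calls pdftk to do the actual splitting.

Perhaps there is a more "standard" solution; I hope someone can add a comment.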

#!/usr/bin/env python3
# Split master.pdf into one PDF per section, using the page numbers in master.toc.
import re
import subprocess

class BreakPt:
    """A point where the PDF may be cut: start page, file-name title, section number."""
    def __init__(self, pg, title, num):
        self.pg, self.title, self.num = pg, title, num

pts = []
with open('master.toc') as toc:
    for line in toc:
        # the optional trailing % tolerates .toc lines that end with a comment character
        m = re.match(r'^\\contentsline\W*{(section|part|chapter)}{(.*)}{([0-9]+)}({[^}]*})?%?$', line)
        if not m:
            continue
        kind, raw, pg = m.group(1, 2, 3)
        num = -1  # parts and chapters only mark boundaries; no file is written for them
        if kind == 'section':
            m = re.match(r'^\\numberline\W*{([0-9]+)}(.*)$', raw)
            if m:  # numbered section; an unnumbered one stays a boundary only
                num, raw = int(m.group(1)), m.group(2)
                raw = re.sub(r'\\FN@sf@gobble@opt .*$', '', raw)  # strip footnote
                raw = re.sub(r'\\IeC\W*{.*?([a-zA-Z]) ?}', r'\1', raw)  # remove accents
                raw = re.sub(r'\\emph\W*{(.*?)}', r'\1', raw)  # remove \emph
                raw = re.sub(r'(:|\W*\\&|\W*\().*$', '', raw)  # keep just the "first part" as name
                raw = re.sub(r' a ', r' ', raw)  # remove 'a' as conjunction
                raw = re.sub(r'[^a-zA-Z]+', '_', raw)  # replace runs of non-letters with underscores
                raw = raw.lower()
        pts.append(BreakPt(int(pg), raw, num))

for i, pt in enumerate(pts):
    if pt.num < 0:
        continue
    bg_pg = pt.pg
    end_pg = pts[i + 1].pg - 1 if i + 1 < len(pts) else -1  # -1: runs to the end of the PDF
    pg_spec = '%02d-%02d' % (bg_pg, end_pg) if end_pg > 0 else '%02d-end' % bg_pg
    out = '%02d-%s.pdf' % (pt.num, pt.title)
    print(pg_spec, out)
    subprocess.run(['pdftk', 'master.pdf', 'cat', pg_spec, 'output', out], check=True)

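Answer 3

I ended up using a solution based on a bash script.

Since others might be interested, I am sharing it here. I have to say, though, that it is experimental and may contain some really nasty hacks, because I am not very experienced with bash scripting.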

#!/bin/bash

# script overwrites the file Docstat.txt with recent statistics of the LaTeX writing project:
# Block 1: date, number of pages and file size of PDF
# Block 2: All chapters with title and number of pages
# Block 3: All chapters and sections with title and number of pages
# Block 4: word count statistics using texcount for all chapters
# (during script runtime, a temporary file Docstat.tmp is created for collecting the output)
# the script scans the aux files to extract the page numbers where sections begin

date > Docstat.txt
grep "Output written on" Diss.log >> Docstat.txt

grep "contentsline {chapter}"  Diss.toc | sed 's/\\contentsline //g' | sed 's/\\numberline //g' >> Docstat.tmp

# print the number of pages between the labels anf:Kap and end:Kap in the given .aux file
chapter_pages() {
  grep "newlabel{anf:Kap}\|newlabel{end:Kap}" "$1" | awk 'BEGIN {
    FS="[{}]+"
   } {
    if ($2=="anf:Kap")
     KapAnf=$4
    if ($2=="end:Kap")
     KapEnd=$4
   } END {
#     print KapAnf
     print KapEnd-KapAnf+1
   }'
}

NEin=$(chapter_pages 1_Introduction.aux)
NGru=$(chapter_pages 2_Theory.aux)
NExp=$(chapter_pages 3_Experimental.aux)
NEuD=$(chapter_pages 5_ResultsAndDiscussion.aux)
NZus=$(chapter_pages 7_Conclusion.aux)


NLit=$(awk 'BEGIN {
  FS="[{}]+"
 } {
  if ($3=="References") # matched in field 3 because the references chapter has no number
   KapA=$4
  if ($4=="Publications")
   KapB=$5 # assumption: the source has no active assignment here; inferred from the variants below
#   KapB=$5-5 # manual correction by 5 pages!!
#   KapB= $KapB-4 # manual correction by 4 pages!!
 } END {
   print KapB-KapA+1
 }' Docstat.tmp) #&& echo $DIFFERENZ

NGes=$(awk 'BEGIN {
  FS="[{}]+"
 } {
  if ($4=="Introduction")
   KapA=$5
  if ($3=="References") # matched in field 3 because the references chapter has no number
   KapB=$4
 } END {
   print KapB-KapA+1
 }' Docstat.tmp) #&& echo $DIFFERENZ


NGesLit=$(($NGes + $NLit)) 

echo " " >> Docstat.txt
echo "==== page numbers" >>Docstat.txt
echo "total_(withoutRefs): $NGes S." >> Docstat.txt
echo "total:             $NGesLit S." >> Docstat.txt
echo " " >>Docstat.txt
echo "Introduction: $NEin S." >> Docstat.txt
echo "Theory:       $NGru S." >> Docstat.txt
echo "Experimental: $NExp S." >> Docstat.txt
echo "Results:      $NEuD S." >> Docstat.txt
echo "Conclusions:  $NZus S." >> Docstat.txt
echo "References:   $NLit S." >> Docstat.txt

head -14 Docstat.txt

echo '==== number of lines'
grep Zeilenzahl Diss.log # must be 

grep linenumber *.aux | sed 's/\\setcounter{linenumber}/ /g'

growlnotify -t "Diss Statistik: $NGes / $NGesLit S." -m "Ein $NEin, Gru $NGru, Erg $NEuD"

echo " " >> Docstat.txt
echo '==== Chapters' >> Docstat.txt
cat Docstat.tmp >> Docstat.txt

echo "" >> Docstat.txt


echo '==== Details' >> Docstat.txt
grep "contentsline \({chapter}\|{section}\)"  Diss.toc | sed 's/\\contentsline //g' | sed 's/\\numberline //g' >> Docstat.txt

echo '==== Word count' >> Docstat.txt
texcount  1_Introduction.tex >> Docstat.txt
texcount  2_Theory.tex >> Docstat.txt 
texcount  3_Experimental.tex >> Docstat.txt
texcount  5_Results.tex >> Docstat.txt
texcount  7_Conclusion.tex >> Docstat.txt

echo '==== Number of lines' >> Docstat.txt
grep Zeilenzahl Diss.log >> Docstat.txt
echo 'by chapter' >> Docstat.txt
grep linenumber *.aux | sed 's/\\setcounter{linenumber}/ /g' >> Docstat.txt

rm Docstat.tmp
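
The first block of the script simply greps the "Output written on" line from the log. For anyone who prefers to post-process that line rather than just copy it, here is a small Python sketch that pulls the page count and file size out of it. It assumes the usual pdfTeX wording of that line and that the line has not been wrapped in the log; the function name is made up for illustration.

```python
import re

# pdfTeX ends a successful run with a line such as:
#   Output written on Diss.pdf (19 pages, 123456 bytes).
OUTPUT_LINE = re.compile(r'Output written on (.+?) \((\d+) pages?, (\d+) bytes\)')


def pdf_summary(log_text):
    """Return (pdf_name, pages, size_in_bytes), or None if no such line is found."""
    m = OUTPUT_LINE.search(log_text)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


if __name__ == '__main__':
    # Hypothetical log excerpt; in practice the text would be read from Diss.log.
    sample_log = 'Output written on Diss.pdf (19 pages, 123456 bytes).'
    name, pages, size = pdf_summary(sample_log)
    print('%s: %d pages, %.1f kB' % (name, pages, size / 1000))
```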
