How can I randomly extract some pages from a PDF file, sort them, and bundle them into a new PDF?

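Some books on amazon.com come with an excerpt (the first few pages plus the appendix) so that we can get a rough idea of the content, layout, and so on before buying. Statistically speaking, I think an excerpt would be more representative if it were built by picking pages at random from the book (30-50 pages are probably enough), sorting them in ascending order of page number, and finally bundling them into a new PDF.

My question is: how can I do this in LaTeX?

Minimal working example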

% compile with pdflatex -shell-escape 
% =============================================================================================
\def\NoticeThatIAmUsingThisPackageToExtractSomePagesFromAnExternalPDFFileInMyComputer{pdfpages}
% =============================================================================================
\documentclass{article}
\usepackage{filecontents}
\begin{filecontents*}{book.tex}
\documentclass{book}
\usepackage{blindtext}
\begin{document}
\Blinddocument
\end{document}
\end{filecontents*}

\immediate\write18{pdflatex book.tex}
\immediate\write18{pdflatex book.tex}


\usepackage{\NoticeThatIAmUsingThisPackageToExtractSomePagesFromAnExternalPDFFileInMyComputer}

\def\NumberOfPagesOfExcerpt{50}

\begin{document}

% do randomization, sorting and bundling here!
% \includepdf[pages=-]{book}
\end{document}

Answer 1

A lualatex approach:

\documentclass{article}
\usepackage{luatextra}
\usepackage{filecontents}

\begin{filecontents*}{book.tex}
\documentclass{book}
\usepackage{blindtext}
\begin{document}
\Blinddocument
\end{document}
\end{filecontents*}

\begin{luacode}
function get_random_pages(randPages,totalPages, randSeed)
    --[[--
    Constructs a sorted list of randPages random page numbers within a range 1..totalPages     
    @Parameter: randPages
        The number of random pages to extract
    @Parameter: totalPages
        Total number of pages in a pdf file
    @Parameter: randSeed
        Random seed
    --]]--
    local pagesLeft= {} 
    local pageList = {}
    for pageNo=1, totalPages, 1 do
      table.insert(pagesLeft,pageNo)
    end

    math.randomseed (randSeed)

    local r
    for i=1, randPages do
      r=math.random(#pagesLeft)
      table.insert(pageList,pagesLeft[r])
      table.remove(pagesLeft,r)
    end   
    table.sort(pageList)
    local s="\\includepdf[pages={"
    s=s..pageList[1]
    for i=2, randPages do
      s=s..","..pageList[i]
    end
    s=s.."}]{book}"
    tex.print(s)
end
\end{luacode}


\immediate\write18{pdflatex book.tex}
\immediate\write18{pdflatex book.tex}

\usepackage{pdfpages}

\def\NumberOfPagesOfExcerpt{9}
\def\NumberOfPagesInPdf{17}
\def\randomSeed{27449}

\begin{document}

% do randomization, sorting and bundling here!
  \directlua{get_random_pages(\NumberOfPagesOfExcerpt,\NumberOfPagesInPdf,\randomSeed)}

\end{document}

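Process it with lualatex -shell-escape random_pages.tex

Edit:

  • the standard function table.concat is now used, as suggested by @Aditya,

  • a \randomPages command is defined that takes the random seed as an optional argument,

  • the number of pages in the PDF is now determined with pdfTeX primitives (\pdfximage and \pdflastximagepages).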

random_pages.tex

\documentclass{article}
\usepackage{luatextra}
\usepackage{filecontents}

\begin{filecontents*}{book.tex}
\documentclass{article}
\usepackage{geometry}
\geometry{
paperwidth=74mm,
paperheight=105mm,
margin=2em,
bottom=9ex,
nohead
}
\usepackage{blindtext}
\begin{document}
\Blinddocument
\end{document}
\end{filecontents*}

\begin{luacode}
function get_random_pages(randPages,totalPages, randSeed)
    --[[--
    Constructs a sorted list of randPages random page numbers within a range 1..totalPages     
    @Parameter: randPages
        The number of random pages to extract
    @Parameter: totalPages
        Total number of pages in a pdf file
    @Parameter: randSeed
        Random seed: used only if >0
    --]]--
    local pagesLeft= {} 
    local pageList = {}
    for pageNo=1, totalPages, 1 do
      table.insert(pagesLeft,pageNo)
    end  
    if randSeed>0 then math.randomseed(randSeed) end
    local r
    for i=1, math.min(randPages,totalPages) do
      r=math.random(#pagesLeft)
      table.insert(pageList,pagesLeft[r])
      table.remove(pagesLeft,r)
    end   
    table.sort(pageList)
    local s="\\includepdf[pages={"
    s=s..table.concat(pageList,",")
    s=s.."}]{book}"
    tex.print(s)
end
\end{luacode}

\immediate\write18{pdflatex book.tex}
\immediate\write18{pdflatex book.tex}

\usepackage{pdfpages}

\def\NumberOfPagesOfExcerpt{42}
\def\randomSeed{27449}

\makeatletter
\newcommand\@randomPages[3]{%
\pdfximage{#2}%
\def\NumberOfPagesInPdf{\the\pdflastximagepages}%
\directlua{get_random_pages(#1,\NumberOfPagesInPdf,#3)}%
}
\def\randomPages{%
\@ifnextchar[{\@with}{\@without}}%
\def\@with[#1]#2#3{%
\@randomPages{#2}{#3}{#1}%
}%
\def\@without#1#2{%
\@randomPages{#1}{#2}{0}%
}%
\makeatother

\begin{document}
% do randomization, sorting and bundling here!

%  \randomPages[\randomSeed]{10}{book.pdf} % supposed to produce a fixed set of pages every time
  \randomPages{10}{book.pdf}         % supposed to produce a different set of pages every time 

\end{document}
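
As a cross-check outside TeX, the selection logic of get_random_pages (optional seed, clamping the request to the page count, sorting, joining with commas) can be sketched in Python. This is only an illustration and not part of the answer's code; the function name includepdf_line and its parameters are made up for this sketch.

```python
import random


def includepdf_line(n_pages, total_pages, seed=None, pdf_name="book"):
    """Build an includepdf command for a sorted random sample of pages.

    All names here are hypothetical; they only mirror the Lua function above.
    """
    rng = random.Random(seed)          # seed=None: seeded from the OS, differs per run
    k = min(n_pages, total_pages)      # never ask for more pages than exist
    pages = sorted(rng.sample(range(1, total_pages + 1), k))
    page_list = ",".join(str(p) for p in pages)
    return "\\includepdf[pages={" + page_list + "}]{" + pdf_name + "}"


if __name__ == "__main__":
    # Fixed seed: the same page set on every run.
    print(includepdf_line(9, 17, seed=27449))
```

Passing a seed gives a reproducible excerpt; leaving it out gives a fresh selection on each run.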

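Answer 2

Here is a solution using a ConTeXt Lua document. Change the parameters filename and n as appropriate (I will post a version that takes command-line arguments later).

Save it as filter.cld (note the extension!) and process it with context filter.cld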

local random = math.random
local format = string.format

-- Sample n items out of m without replacement
function reservoirsample (n, m)
    local sampledlist = {}
    if n == 0 then return sampledlist end
    for i = 1, m do 
        -- Take the first n samples
        if i <= n then
            sampledlist[i] = i
        else
        -- Randomly replace one sample
            local j = random(i)
            if j <= n then -- <= (not <): slot n must be replaceable too
               sampledlist[j] = i
            end
        end
    end
    table.sort(sampledlist)
    return sampledlist
end

local filename="fonts-mkiv.pdf"
local n = 20

context.starttext()

-- Example taken from grph-inc.lua
local fig = figures.push { name = filename }
figures.identify()
figures.check()
local nofpages = fig.used.pages
figures.pop()

selected = reservoirsample(n, nofpages)

print(format("::: File %s has %d pages, selecting %d", filename, nofpages, n))
print(format("::: %s", table.concat(selected, ", ")))

for i = 1,#selected do
  context.startTEXpage()
  context.externalfigure( {filename}, {page=selected[i]} )
  context.stopTEXpage()
end

context.stoptext()
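
For readers who want to try the sampling step outside ConTeXt, here is a minimal Python sketch of reservoir sampling (Algorithm R) over the page numbers 1..m. It is an illustration under my own naming (reservoir_sample and the rng parameter are not from the answer). The point to note is the comparison j <= n: every one of the n reservoir slots has to be eligible for replacement, otherwise page n would end up in every sample.

```python
import random


def reservoir_sample(n, m, rng=random):
    """Pick n of the integers 1..m uniformly, without replacement (Algorithm R)."""
    sample = []
    for i in range(1, m + 1):
        if i <= n:
            sample.append(i)           # the first n items fill the reservoir
        else:
            j = rng.randint(1, i)      # uniform in 1..i, both ends included
            if j <= n:                 # item i replaces slot j with probability n/i
                sample[j - 1] = i
    return sorted(sample)


if __name__ == "__main__":
    print(reservoir_sample(20, 100, random.Random(1)))
```

If m is smaller than n, the function simply returns all m page numbers.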

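Answer 3

Here is a solution that uses only the math and loop parts of pgf. It borrows some code for shuffling pgfmath lists that Mark Wibrow posted on the pgf-users mailing list a while ago. The lists in this pgfmath code are implemented as hashes (one macro per item) rather than as a single token list.

To get a random list of m elements out of the list {1,...,n} (m is \NumberOfPagesOfExcerpt and n is \NumberOfPagesInPdf in the code), I create the list {1,...,n} and then Knuth shuffle it. Then I bubble sort the first m elements. Finally, for each i from 1 to m, I include the page of the PDF whose number is the i-th list item.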
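
The procedure described above can be sketched in a few lines of Python, which may be easier to follow than the TeX macros. This is a sketch under my own naming (shuffled_prefix and rng are not from the answer). In a Knuth (Fisher-Yates) shuffle the swap partner for position i is drawn from the not-yet-fixed positions i..n, which is what makes every permutation equally likely.

```python
import random


def shuffled_prefix(m, n, rng=random):
    """Shuffle 1..n with a Knuth shuffle, then return the first m items sorted."""
    pages = list(range(1, n + 1))
    for i in range(n - 1):
        j = rng.randint(i, n - 1)      # partner from the unshuffled tail i..n-1
        pages[i], pages[j] = pages[j], pages[i]
    return sorted(pages[:m])           # Python's sort stands in for the bubble sort


if __name__ == "__main__":
    print(shuffled_prefix(9, 17, random.Random(0)))
```

Only the first m positions are ever used, so the loop could stop after m iterations; the full shuffle is kept here to match the description.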

\documentclass{article}
\usepackage{pgf,pgffor}
\usepackage{pdfpages}

\makeatletter

% declare a list by its elements
% e.g., \pgfmathdeclarelist{mylist}{{foo}{bar}{baz}}
\def\pgfmathdeclarelist#1#2{%
    \def\pgfmath@list@name{#1}%
    \c@pgfmath@counta=0%
    \pgfmath@declarelistlist#2{\pgfmath@stop}%
}%
\def\pgfmath@declarelistlist#1{%
    \ifx#1\pgfmath@stop%
        \expandafter\edef\csname pgfmath@list@\pgfmath@list@name @length\endcsname{\the\c@pgfmath@counta}%
    \else%
        \advance\c@pgfmath@counta by1\relax%
        \pgfutil@namedef{pgfmath@list@\pgfmath@list@name @\the\c@pgfmath@counta}{#1}%
        \expandafter\pgfmath@declarelistlist%
    \fi%
}

% get a list item
% \pgfmathgetlistitem{\cs}{mylist}{3} lets \cs be the 3rd item of mylist
\def\pgfmathgetlistitem#1#2#3{%
   \expandafter\let\expandafter#1\expandafter=\csname pgfmath@list@#2@#3\endcsname%
}

% set a list item
% \pgfmathsetlistitem{mylist}{3}{foo} defines the 3rd item of mylist to be foo
% caution - you may need the 3rd argument expanded first.
\def\pgfmathsetlistitem#1#2#3{%
   \pgfutil@namedef{pgfmath@list@#1@#2}{#3}%
}

% get the length of a list
% \pgfmathgetlistlength{\mylistlength}{mylist} lets \mylistlength be the length of the list.
\def\pgfmathgetlistlength#1#2{%
   \expandafter\let\expandafter#1\expandafter=\csname pgfmath@list@#2@length\endcsname%
}

% set the length of a list
% \pgfmathsetlistlength{mylist}{length} defines the length of mylist to be length
\def\pgfmathsetlistlength#1#2{%
   \expandafter\edef\csname pgfmath@list@#1@length\endcsname{#2}
}


\def\pgfmathknuthshuffle#1{%
    \pgfmathgetlistlength\pgfmath@len{#1}%
    \pgfmathloop%
    \ifnum\pgfmathcounter>\pgfmath@len%
    \else%
        \pgfmathrandominteger\pgfmath@temp{\pgfmathcounter}{\pgfmath@len}% partner from counter..len, not 1..len, so the shuffle is unbiased
        \pgfmathgetlistitem\pgfmath@@temp{#1}{\pgfmathcounter}%
        \pgfmathgetlistitem\pgfmath@@@temp{#1}{\pgfmath@temp}%
        \def\pgfmath@marshal{\pgfmathsetlistitem{#1}}%
        \expandafter\pgfmath@marshal\expandafter{\expandafter\pgfmath@temp\expandafter}\expandafter{\pgfmath@@temp}%
        \expandafter\pgfmath@marshal\expandafter{\expandafter\pgfmathcounter\expandafter}\expandafter{\pgfmath@@@temp}%
    \repeatpgfmathloop%
}

\def\NumberOfPagesOfExcerpt{9}
\def\NumberOfPagesInPdf{17}

% Populate page list. Rather than use \pgfmathdeclarelist we allocate the list and assign in a loop.
% sorry for the \global... pgf's \foreach creates a group. 
\def\s@pagelist{pagelist} % makes expansion easier
\pgfmathsetlistlength{pagelist}{\NumberOfPagesInPdf}
\foreach \i in {1,...,\NumberOfPagesInPdf}{
   \global\expandafter\pgfmathsetlistitem\expandafter\s@pagelist\expandafter\i\expandafter{\i}
}

\pgfmathknuthshuffle{pagelist}

% now a bubble sort on the first \NumberOfPagesOfExcerpt items in the list.
\pgfmathtruncatemacro{\n}{\NumberOfPagesOfExcerpt-1}
\foreach \j in {1,...,\n}{
   \pgfmathtruncatemacro{\k}{\NumberOfPagesOfExcerpt-\j}
   \foreach \i in {1,...,\k}{
      \pgfmathtruncatemacro{\iplusone}{\i+1}
      \pgfmathgetlistitem{\pagei}{pagelist}{\i}
      \pgfmathgetlistitem{\pageiplusone}{pagelist}{\iplusone}
      \ifnum\pagei>\pageiplusone
          \global\expandafter\pgfmathsetlistitem\expandafter\s@pagelist\expandafter\i\expandafter{\pageiplusone}
          \global\expandafter\pgfmathsetlistitem\expandafter\s@pagelist\expandafter\iplusone\expandafter{\pagei}       
      \fi
   }
}

\makeatother

\begin{document}

\foreach \i in {1,...,\NumberOfPagesOfExcerpt}{
   \pgfmathgetlistitem{\pagei}{pagelist}{\i}
   \includepdf[pages=\pagei]{book.pdf}
}

\end{document}

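As you can see, it is a bit messy, but it needs neither Lua nor an external script. IANACS, so I do not know how efficient it is. But if you wanted efficiency, you would not be doing this in TeX. :-)

Answer 4

Call an external randomizer, named excerpting.exe, from within LaTeX.

LaTeX code: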

% compile with pdflatex -shell-escape 
\documentclass{article}
\usepackage{pdfpages}
\def\bookfilename{status-lua}% http://chat.stackexchange.com/transcript/41?m=8712421#8712421
\def\take{30}
\def\seeder{1}
\def\auxiliaryfilename{random.txt}

\pdfximage{\bookfilename.pdf}
\immediate\write18{excerpting \the\pdflastximagepages\space \take\space \seeder\space \auxiliaryfilename}

\begin{document}
\newread\reader
\openin\reader=\auxiliaryfilename\relax
    \loop
        \read\reader to \data
        \unless\ifeof\reader
        \includepdf[pages=\data]{\bookfilename}
    \repeat
\closein\reader
\end{document}

C# (Fisher-Yates shuffle):

// excerpting.cs
using System;
using System.IO;
using System.Linq;

namespace Excerpting
{
    class Program
    {
        static void Main(string[] args)
        {
            int total = int.Parse(args[0]);
            int take = int.Parse(args[1]);
            int seeder = int.Parse(args[2]);
            string filename = args[3];


            int[] array = Enumerable.Range(1, total).ToArray();

            Random random = new Random(seeder);
            for (int i = total - 1; i > 0; i--)
            {
                int j = random.Next(i+1);
                int temp = array[i];
                array[i] = array[j];
                array[j] = temp;
            }

            File.WriteAllLines(filename, array.Take(take).OrderBy(x => x).Select(x => x.ToString()));
        }
    }
}

C# (random sort):

Some people claim that sorting by a random key gives a uniform distribution, but I have not checked it.

// excerpting.cs
using System;
using System.IO;
using System.Linq;

namespace Excerpting
{
    class Program
    {
        static void Main(string[] args)
        {
            int total = int.Parse(args[0]);
            int take = int.Parse(args[1]);
            int seeder = int.Parse(args[2]);
            string filename = args[3];

            Random random = new Random(seeder);
            string[] array = Enumerable.Range(1, total)
                                       .OrderBy(x => random.Next())
                                       .Take(take)
                                       .OrderBy(x => x)
                                       .Select(x => x.ToString())
                                       .ToArray();

            File.WriteAllLines(filename, array);
        }
    }
}
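
The uniformity claim for the random-sort variant can at least be checked empirically. The Python sketch below (my own illustration, with hypothetical names random_sort_sample and selection_frequencies) repeats the "sort by a random key, take the first few" selection many times and reports how often each page is chosen. If the method is uniform, every page should come out with a frequency close to take/total. One caveat worth stating: the argument for uniformity assumes the random keys are all distinct, and integer keys such as those from random.Next() can collide, although rarely.

```python
import random
from collections import Counter


def random_sort_sample(total, take, rng):
    """Order 1..total by independent random keys and keep the first `take` pages."""
    keyed = sorted(range(1, total + 1), key=lambda _: rng.random())
    return sorted(keyed[:take])


def selection_frequencies(total, take, trials, seed=0):
    """Fraction of trials in which each page number was selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(random_sort_sample(total, take, rng))
    return {page: counts[page] / trials for page in range(1, total + 1)}


if __name__ == "__main__":
    for page, freq in selection_frequencies(10, 3, 20000).items():
        print(page, round(freq, 3))    # each value should be near 3/10
```

This is a sanity check rather than a proof: it only shows that no page is visibly favoured.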
