如何避免因换行而导致行边缘出现短单词？

Question

这里有两个目标：

不要在紧跟标点符号的短单词后停顿，
不要在标点符号前的短单词前停顿，

受到良好断线规律的约束。

一个简单的解决方案是将标点符号声明为特别适合换行的位置（负惩罚，幅度足够大）。这将使 TeX 能够在尝试在标点符号处换行与其他换行考虑（不良、缺点、其他惩罚）之间进行权衡，但不能保证绝对不会出现此类换行。

以下是前后对比图，以进行说明：

如你看到的，

第一段中，, it第三行末尾的修改后移到了下一行。
第二段中，el.第四行开头的和at,第六行开头的修改后移至上一行。
第三段已经加入，以表明这个技巧并不能保证成功：it.第四行开头的仍然在那里，因为根本无法将其放在上一行。

这是通过以下方式实现的：

\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`,=\active \def,{\char`,\penalty -200\relax}

在以下文档中：

\documentclass{article}
\begin{document}
\frenchspacing % Makes it easier
\hsize=20em
\parskip=10pt

% First, three paragraphs with the default settings
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

\pagebreak

% Now the same text, with the meanings of . and , changed.
\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`,=\active \def,{\char`,\penalty -200\relax}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

% Change it back
\catcode`.=12 \catcode`,=12
\pagebreak

% Same text again, to show that nothing's permanently changed.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

\end{document}

笔记：

.如果像这样改变和的含义会破坏某些东西，我不会感到惊讶。,（事实上，我很惊讶这个例子中什么都没有弄乱，然后我意识到 catcode 更改不适用于已经读入的标记。）
你可以调整惩罚：我使用 -200 只是作为例子，但从 -1 到 -9999 的任何数字都会有一些效果。（在这个例子中，所有这些更改生效的阈值似乎是 -175，尽管甚至在 -100 时也发生了一个更改。）≤ -10000 的惩罚会强制换行，这不是您想要的。
您可以对更多标点符号执行相同操作（?!:;），或者对不同的标点符号设置不同的惩罚。
事情有点困难\nonfrenchspacing（默认），标点符号后的空格更大。这也许是可行的，但想出这些例子需要做很多工作，所以我没有继续。留作练习 :-)
使用 LuaTeX，你甚至可以更改换行算法，这将是一个很酷的方式保证行边缘处没有短词（如果这是您需要的）。

编辑：我忍不住要在 LuaTeX 中实现“保证”解决方案。此版本应适用于和\frenchspacing。\nonfrenchspacing它的作用是检测某些序列并插入无限（10000）个惩罚以防止中断：

(punct, space, short_word, space) -> (punct, space, short_word, penalty, space)

和

(space, short_word, punct) -> (penalty, space, short_word, punct)

对于上面的例子，这将产生：

请注意最后一段中的溢出框，因为约束非常严格，但这就是我们所要求的。（无论如何，您可能不会遇到带有更宽和更长段落的溢出框，并且您可以用通常的方式修复它们，例如重写或添加\emergencystretch等。）

产生上述内容（甚至想法）的代码很可能有错误，甚至可能导致你的 LuaTeX 编译崩溃，但它在这里：

\documentclass{article}
\directlua{dofile("strict.lua")}
\begin{document}
\frenchspacing % Keeping same example as before
\hsize=20em
\parskip=10pt

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.
\end{document}

哪里strict.lua：

function is_punct(n)
   if node.type(n.id) ~= 'glyph' then return false end
   if n.char > 127 then return false end
   c = string.char(n.char)
   if c == '.' or c =='?' or c == '!' or c == ':' or c == ';' or c == ',' then
      return true
   end
   return false
end

function no_punct_short_word_eol(head)
   -- Prevents having a line that ends like "<punctuation><space><short_word>"
   -- How we do this:
   --   (1) detect such short words (punct, space, short_word, space)
   --   (2) insert a penalty of 10000 between the short_word and the following space.
   -- More concretely:
   --   * A punctuation is one of .?!:;, which are the ones affected by \frenchspacing
   --   * A space is any glue node.
   --   * A short_word is a sequence of only glyph and kern nodes.
   -- So we maintain a state machine: default -> seen_punct -> seen_space -> seen_word
   -- where in the last state we maintain length. If we're in seen_word state and we see
   -- a glue, and length is less than threshold, insert a penalty before the glue.
   state = 'default'
   root = head
   while head do
      if state == 'default' then
         if is_punct(head) then
            state = 'seen_punct'
         end
      elseif state == 'seen_punct' then
         if node.type(head.id) == 'glue' then
            state = 'seen_space'
         else
            state = 'default'
         end
      elseif state == 'seen_space' then
         if node.type(head.id) == 'glyph' then
            state = 'seen_word'
            length = 1
         elseif is_punct(head) then
            state = 'seen_punct'
         else
            state = 'default'
         end
      elseif state == 'seen_word' then
         if node.type(head.id) == 'glue' and length <= 2 then
            -- Moment of truth
            penalty = node.new('penalty')
            penalty.penalty = 10000
            root, new = node.insert_before(root, head, penalty)
            -- TODO: Is 'head' invalidated now? Docs don't say anything...
            state = 'default'
         elseif node.type(head.id) == 'glyph' or node.type(head.id) == 'kern' then
            if node.type(head.id) == 'glyph' then length = length + 1 end
         else
            state = 'default'
         end
      else
         assert(false, string.format('Impossible state %s', state))
      end
      head = head.next
   end
   return root
end
luatexbase.add_to_callback('pre_linebreak_filter', no_punct_short_word_eol, 'Prevent short words after punctuation at end of sentence')

function no_bol_short_word_punct(head)
   -- Prevents having a line that starts like "<short_word><punctuation>"
   -- How we do this:
   --   (1) detect such short words (space, short_word, punct)
   --   (2) insert a penalty of 10000 between the space and the following short_word.
   -- More concretely:
   --   * A punctuation is one of .?!:;, which are the ones affected by \frenchspacing
   --   * A space is any glue node.
   --   * A short_word is a sequence of only glyph and kern nodes.
   -- So we maintain a state machine: default -> seen_space -> seen_word
   -- where in the last state we maintain length. If we're in seen_word state and we see
   -- a punct, and length is less than threshold, insert a penalty before the glue.
   -- Note that for this to work, we need to maintain a pointer to where we saw the glue.
   state = 'default'
   root = head
   before_space = nil
   while head do
      if state == 'default' then
         if node.type(head.id) == 'glue' then
            state = 'seen_space'
            before_space = head.prev
         end
      elseif state == 'seen_space' then
         if node.type(head.id) == 'glyph' then
            state = 'seen_word'
            length = 1
         else
            state = 'default'
         end
      elseif state == 'seen_word' then
         if is_punct(head) and length <= 2 then
            -- Moment of truth
            penalty = node.new('penalty')
            penalty.penalty = 10000
            root, new = node.insert_after(root, before_space, penalty)
            -- TODO: Is 'head' invalidated now? Docs don't say anything...
            state = 'default'
         elseif node.type(head.id) == 'glyph' or node.type(head.id) == 'kern' then
            if node.type(head.id) == 'glyph' then length = length + 1 end
         elseif node.type(head.id) == 'glue' then
            state = 'seen_space'
            before_space = head.prev
         else
            state = 'default'
         end
      else
         assert(false, string.format('Impossible state %s', state))
      end
      head = head.next
   end
   return root
end
luatexbase.add_to_callback('pre_linebreak_filter', no_bol_short_word_punct, 'Prevent short words at beginning of sentence before punctuation')

Answer 1