尝试使用ucharclasses

2024-5-24 • tag-icon

拉丁现代罗马字母对于某些 Unicode 组合的输出效果不佳，包括 ā́（带有长音符和尖音符的小拉丁字母“a”）。可以使用TIPA 封装对于这些组合，因此：

\documentclass{article}
\usepackage{fontspec}
\setmainfont{Latin Modern Roman}
\usepackage{tipa}
\begin{document}

\begin{tabular}{ll}
  native Latin Modern Roman `a with acute stacked on top of macron': & ā́\\
  TIPA's `a with acute stacked on top of macron': & \textipa{\'=a}\\
\end{tabular}
\end{document}

在 XeLaTeX 中编译为：

从很多方面来看，这都是一个不错的解决方案，但有一个问题是 TIPA 版本的 PDF 内部表示并不理想。（丑陋的）“原生”版本可以正确地从 PDF 文档中复制粘贴为ā́，但 TIPA 版本则为́ā。

有没有办法在 TIPA 版本之上关联一个“Unicode”表示（例如像扫描 PDF 的 OCR 层）？

编辑1（回复@Davislor）：

我正在使用fontspec，但想使用拉丁现代 Unicode 并且仍然能够在我的文档中使用我需要的某些特殊字符。

尝试使用`ucharclasses`

这部分有效，但会产生其他问题。为了进行比较，以下是“直接”拉丁现代和 CMU 样本：

\documentclass{article}
\usepackage{fontspec}
\defaultfontfeatures{Ligatures=TeX,Numbers=OldStyle,Contextuals=Alternate}
\setmainfont{Latin Modern Roman}
\setsansfont{Latin Modern Sans}
\newfontfamily{\cmuserif}{CMU Serif}
\newfontfamily{\cmusans}{CMU Sans Serif}
\newfontfamily{\cmutypewriter}{CMU Typewriter Text}
\begin{document}
\noindent \textbf{Latin Modern Roman}:\\  ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}\\
{\sffamily\noindent \textbf{Latin Modern Sans}:\\ ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}\\}
\texttt{\textbf{Latin Modern Typewriter}:\\
 ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}}\\

{\cmuserif\noindent\textbf{CMU Serif}:\\  ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}\\
{\cmusans\noindent \textbf{CMU Sans Serif}:\\ ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}\\}
{\cmutypewriter\noindent{\textbf{CMU Typewriter}:\\
ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}}\\}
\end{document}

拉丁现代语存在重大问题ā́, ȭ, ī̀, ī́, R̥, r̥（省略变音符号或将两个变音符号混在一起，难以区分）和排版不佳ṁ（上点未对齐）。卡内基梅隆大学通常可以正确处理这些问题（或至少可读性好），但打字机界面除外。（拉丁现代语的重音符号和分音符/变音符号设置通常更好看：éè üë；而 Eszett ß 的排版明显不同。）

我们ucharclasses可以修复部分（但不是全部）拉丁现代字体问题，但代价是无法使用无衬线字体和打字机字体：

\documentclass{article}
\usepackage{fontspec}
\defaultfontfeatures{Ligatures=TeX,Numbers=OldStyle,Contextuals=Alternate}
\setmainfont{Latin Modern Roman}
\setsansfont{Latin Modern Sans}
\newfontfamily{\defaultfont}{Latin Modern Roman}[SmallCapsFont={Latin Modern Roman Caps}]
\newfontfamily{\lmroman}{Latin Modern Roman}[SmallCapsFont={Latin Modern Roman Caps}]
\newfontfamily{\lmsans}{Latin Modern Sans}
\newfontfamily{\cmuserif}{CMU Serif}
\newfontfamily{\cmusans}{CMU Sans Serif}

\usepackage[Latin,LatinExtendedA]{ucharclasses}
\setDefaultTransitions{\defaultfont}{}
\setTransitionsForLatin{\defaultfont}{}
\setTransitionTo{LatinExtendedA}{\cmuserif}{} % fixes macron+acute, e.g. ā́
\setTransitionTo{LatinExtendedB}{\cmuserif}{}
\setTransitionTo{LatinExtendedC}{\cmuserif}{}
\setTransitionTo{LatinExtendedD}{\cmuserif}{}
\setTransitionTo{LatinExtendedE}{\cmuserif}{}
\setTransitionTo{LatinExtendedAdditional}{\cmuserif}{} % fixes overdot, e.g. ṁ
%\setTransitionTo{LatinSupplement}{\cmuserif}{} % essets, ß etc.
\setTransitionTo{CombiningDiacriticalMarks}{\cmuserif}{} % should contain ring below (U+0325), but doesn't fix R̥ r̥ (!)
\setTransitionTo{CombiningDiacriticalMarksExtended}{\cmuserif}{}
\setTransitionTo{CombiningDiacriticalMarksForSymbols}{\cmuserif}{}
\setTransitionTo{CombiningDiacriticalMarksSupplement}{\cmuserif}{}
\setTransitionTo{CombiningHalfMarks}{\cmuserif}{}

\begin{document}
\noindent \textbf{Latin Modern Roman with CMU Serif `fixes' with \texttt{ucharclasses}}:\\  ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}\\
{\sffamily\noindent \textbf{Latin Modern Sans with CMU Serif `fixes' with \texttt{ucharclasses}}:\\ ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}\\}
\texttt{\textbf{Latin Modern Typewriter with CMU Serif `fixes' with \texttt{ucharclasses}}:\\
 ā́ ȭ R̥r̥ éè ā üë ì í ī ī̀ ī́ è ṁṭ ß \textit{ß R̥r̥ ā́ ȭ}}
\end{document}

ā́, ī̀, ī́, ṁ已修复，但ȭ, R̥, r̥仍存在乱码或缺少变音符号的情况。并且无法使用 sans 和打字机字体。

尝试使用`catcode`

本着使用的精神newunicodechar，我尝试了以下操作：

\def\awithmacronacute{\textipa{\'=a}}
% or \def\awithmacronacute{\cmuserif{ā́}}
\catcode`\ā́=\active
\defā́{\awithmacronacute}

不幸的是，\catcode不适用于由 Unicode 字符组合而成的字符：

! Improper alphabetic constant.
<to be read again> 
                   \ā́ 
l.17 \catcode`\ā́
                 =\active
?

尝试`\ifpdfstringunicode`

我尝试了以下操作：

\def\awithacuteandmacron{%
  \texorpdfstring{\textipa{\'=a}}%
  {\ifpdfstringunicode{ā́}}}

当我将\awithacuteandmacron排版称为 a ā́ 时，' ā当我尝试将其从 PDF 中复制出来时它仍然存在。

我认为仅使用 CMU 字体代替 Latin Modern 最终是众多不完善的解决方案中最好的，因为 CMU 似乎具有更好的字形覆盖率，并且可以比 Latin Modern 更好地处理组合 unicode。

尝试使用ucharclasses

尝试使用catcode

尝试\ifpdfstringunicode

相关内容

尝试使用`ucharclasses`

尝试使用`catcode`

尝试`\ifpdfstringunicode`