Inserting a token by its Unicode position


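With luatex I can inject a token by its Unicode position using a command (see the example below). I would like to know whether this can also be done with pdflatex.

It does not need to have all the features of the luatex example.

It is fine

  • if the file has to be utf8 encoded
  • if it does not work for all Unicode blocks and some preparation is needed for a block
  • if this preparation has to come after some \DeclareUnicodeCharacter or similar definitions
  • if the input has to be lowercase (20ac) rather than uppercase (20AC)
  • if the command is not expandable and so does not work in an \edef (the luatex version does work there if the current meaning of € is expandable).

luatex example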

%run with lualatex
\documentclass{article}
\usepackage{luacode}

\begin{luacode}
function inserttokenbyunicodeposition (unicodeposition)
 tex.sprint("^^^^"..string.lower(unicodeposition))
end
\end{luacode} 

\newcommand\unicodetoutf[1]{\directlua{inserttokenbyunicodeposition("#1")}}

\newcommand\unicodetoutfchar[1]{\char"#1} 

\begin{document}
\catcode`\€=\active \def€{Hallo}

€ ^^^^20ac \unicodetoutf{20AC} %<-- should output Hallo

\unicodetoutfchar{20AC} %<-- unwanted output   

\end{document}
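
A side note on the string.lower call in the example above: the ^^^^ notation expects lowercase hexadecimal digits, whereas TeX's " notation for numbers expects uppercase ones. The following minimal sketch (not part of the original question) illustrates the second half of that statement:

```latex
\documentclass{article}
\begin{document}
% Uppercase digits: the whole of "20AC is read as one hexadecimal number.
\typeout{\number"20AC}
% Lowercase digits: the number ends after "20, and "ac" is left over as text.
\typeout{\number"20ac}
\end{document}
```

The first line should report 8364 in the log, the second 32ac, because the number stops after "20.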

pdflatex MWE

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\DeclareUnicodeCharacter{20AC}{Hallo}

\newcommand\unicodetoutf[1]{??}

\begin{document}

€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo

\end{document}

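Answer 1

Update:

Since the 2018/04/01 LaTeX release, inputenc+utf8 is active by default (sic). This broke the v3 approach below, because the bytes in the range 128-255 are now active, whereas the code assumed them to have catcode 12.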
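
A quick way to see which situation applies on a given installation is to ask TeX for the catcodes of a leading byte and of a continuation byte. This is only a diagnostic sketch and not part of the v3 code:

```latex
\documentclass{article}
\begin{document}
% E2 is a UTF-8 leading byte, 82 is a continuation byte.
\typeout{catcode of byte E2: \the\catcode"E2}
\typeout{catcode of byte 82: \the\catcode"82}
\end{document}
```

A reported value of 13 means the byte is active; 12 means it is an ordinary "other" character, which is what the original v3 code assumed.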

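I made some small edits to my v3 code so that it can be used with old or new LaTeX, with or without inputenc loaded.

(pdflatex only answer)

Contents

  • v1 and v2 exploit macros that already exist in the LaTeX kernel to produce the macros that the LaTeX font selection system ultimately expects.

  • v3 is completely different. It produces the UTF-8 bytes expandably. It does not use inputenc at all, but it makes the leading bytes active so that it is "inputenc+utf8" ready. The active tokens are protected with \protect (no e-TeX \protected here), so to write them to a file stream one has to issue \set@display@protect. They can be used inside an \edef, but for the same reason this has to be LaTeX's \protected@edef. Also, the macro \UniToUTF now accepts numerical input, so hexadecimal input must explicitly use the " prefix (I do not understand why LaTeX's \DeclareUnicodeCharacter, contrary to utf8x, does not do this as well). The advantage is that it is easy to set up loops that produce such characters one after another using an associated counter or some other device. To illustrate this, the MWE uses the xinttools package, but only for the examples.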
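
To make the remark about \protect more concrete, here is a minimal sketch of how a \protect'ed token behaves in \protected@edef and in a \write. The names \leadbyte and \activebyte are hypothetical stand-ins for one of the v3 byte macros and for the active leading byte it protects; they do not appear in the actual code.

```latex
\documentclass{article}
\begin{document}
\makeatletter
% \activebyte plays the role of an active leading byte (hypothetical name),
% \leadbyte the role of the macro that delivers it with \protect in front.
\def\activebyte{[expanded]}
\def\leadbyte{\protect\activebyte}

% \protected@edef leaves the protected token unexpanded:
\protected@edef\x{\leadbyte}
\typeout{\meaning\x}

% For a \write, \protect first has to be made to act like \string:
\begingroup
  \set@display@protect
  \immediate\write16{\leadbyte}%
\endgroup
\makeatother
\end{document}
```

In both cases \activebyte should survive unexpanded: \x keeps \protect\activebyte, and the \write outputs the token name rather than its replacement text. For a real active byte, \string yields the byte itself, which is what gets written to the file.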


v1

Here are some quickly written hacks that seem to do what you want with pdflatex. Some fixes may be needed for the general case (not enough time today).

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\DeclareUnicodeCharacter{20AC}{Hallo}

\makeatletter
\newcommand\unicodetoutf[1]{{\count@ "#1 \parse@XML@charref}%
   \UTFviii@tmp}
\let\@preamblecmds\@empty
\makeatother


\begin{document}

€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo

\end{document}


v2

Here is an edit that avoids \let\@preamblecmds\@empty and also allows ascii characters.

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[LGR,T1]{fontenc}

%\tracingmacros1
\DeclareUnicodeCharacter{20AC}{Hallo}
%\tracingmacros0

\makeatletter
\let\UF@parse@UTFviii@a\parse@UTFviii@a
\let\UF@parse@UTFviii@b\parse@UTFviii@b
\newcommand*\UF@parse@XML@charref{%
  \ifnum\count@<"A0\relax
     \xdef\UTFviii@tmp{\char\the\count@\space}% \count@ not globally modified
  \else\ifnum\count@<"800\relax
     \UF@parse@UTFviii@a,%
     \UF@parse@UTFviii@b C\UTFviii@two@octets.,%
  \else\ifnum\count@<"10000\relax
     \UF@parse@UTFviii@a;%
     \UF@parse@UTFviii@a,%
     \UF@parse@UTFviii@b E\UTFviii@three@octets.{,;}%
   \else
     \UF@parse@UTFviii@a;%
     \UF@parse@UTFviii@a,%
     \UF@parse@UTFviii@a!%
     \UF@parse@UTFviii@b F\UTFviii@four@octets.{!,;}%
    \fi
    \fi
  \fi
}
\newcommand\unicodetoutf[1]{%
  {\count@ "#1 \UF@parse@XML@charref}%
   \UTFviii@tmp
}

\makeatother

\begin{document}

€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo

A ^^41 \unicodetoutf{0041}

e ^^65 \unicodetoutf{0065}

é ^^c3^^a9 \unicodetoutf{00E9}

\fontencoding{LGR}\selectfont

α ^^ce^^b1 \unicodetoutf{03B1}
\end{document}


The idea is to plug into the Unicode to UTF-8 parsing that the LaTeX team has already coded in utf8.def.


v3

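Here is the expandable approach. I hesitated for a long time over the 7-bit ascii range and its catcode issues, first of all because many of these characters are declared by TeX/LaTeX as ignored tokens. In the end I decided to use catcode 12 for all of them. However, when written to a file, some of them will still be output in TeX's ^^ notation rather than as genuine 7-bit ascii bytes.

Setting up macros that expand to catcode 12 tokens for all of these characters is not completely obvious, and I chose a rather systematic way instead of using shortcuts.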
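
For a single character, one possible shortcut is to apply \string to the corresponding control symbol and drop the backslash. The sketch below is only an illustration (the macro name \percentother is made up); the same \expandafter\@gobble\string\% idiom is used in one of the tests of the full code.

```latex
\documentclass{article}
\begin{document}
\makeatletter
% \string\% yields a backslash and a percent sign, both of catcode 12;
% \@gobble then removes the backslash (assumes the default \escapechar).
\edef\percentother{\expandafter\@gobble\string\%}
\makeatother
\typeout{\meaning\percentother}
\end{document}
```

This does not scale well to control characters and to characters with unusual catcodes, which is one reason a systematic loop is convenient for the whole range.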

See the comments in the code below for more details.

\documentclass{article}

\makeatletter

% First, the most complicated part: define macros expanding to catcode 12
% tokens in the 7bit ascii range. Unfortunately, when writing out
% to a file, some of them will get converted back to ^^ notation, and
% not be written there as UTF-8 bytes.

%% 7bit ASCII RANGE (with all its TeX catcode complications)

%% First we handle the upper part (ascii code 127 handled later)

\begingroup
\catcode`\^ 12
\catcode`\_ 12
\catcode`\{ 12
\catcode`\} 12
\catcode`\~ 12
\catcode`\? 0
\catcode`\< 1
\catcode`\> 2
\catcode`\\ 12
?def?y<?endgroup
?count@ 64
?@tfor?x:=%
@ABCDEFGHIJKLMNO%
PQRSTUVWXYZ[\]^_%
`abcdefghijklmno%
pqrstuvwxyz{|}~?do<%
    ?expandafter?xdef?csname utf8byte?the?count@?endcsname<?expandafter?string?x>%
    ?advance?count@?@ne>%
>?y

%% Now we handle the lower part

\begingroup
\catcode127 14
\def\z{utf8byte}%
\def\x{%
\count@ 0
\loop
  \catcode\count@ 12
\advance\count@\@ne
\ifnum\count@<64
\repeat}\x^^?
\def\y{^^?
\count@=0\relax^^?
\@tfor\x:=^^?
^^@^^A^^B^^C^^D^^E^^F^^G^^H^^I^^J^^K^^L^^M^^N^^O^^?
^^P^^Q^^R^^S^^T^^U^^V^^W^^X^^Y^^Z^^[^^\^^]^^^^^_^^?
^^`^^a^^b^^c^^d^^e^^f^^g^^h^^i^^j^^k^^l^^m^^n^^o^^?
^^p^^q^^r^^s^^t^^u^^v^^w^^x^^y^^z^^{^^|^^}^^~\do{^^?
\expandafter\xdef\csname\z\the\count@\endcsname{\x}^^?
\advance\count@\@ne}}^^?
\y\endgroup

%% ascii codes 63 and 127 were not handled yet

\@namedef{utf8byte63}{?}%

\begingroup
\catcode127 12
\global\@namedef{utf8byte127}{^^?}%
\endgroup

%% Now, the easier part, we define the UTF-8 continuation bytes
%% (catcode 12) and the UTF-8 leading bytes (as \protect'ed active
%% tokens). This set-up is ready for utf8+inputenc but does not need
%% it for its definitions. (this document does not load inputenc)
%% UPDATE: but LaTeX newer than 2018/04/01 loads inputenc+utf8 per
%% default, so a small change has been made below.

%% CONTINUATION BYTES

\begingroup
% prior to LaTeX 2018/04/01 these 64 bytes were assigned catcode12
% by LaTeX in absence of inputenc; but nowadays these bytes are active
% by default. Hence I modified \x into \expandafter\string\x which
% works whether or not the document incorporating this code is with 
% old (with or without utf8+inputenc) LaTeX or with newer LaTeX
\count@ "80
\@tfor\x:=%
^^80^^81^^82^^83^^84^^85^^86^^87^^88^^89^^8a^^8b^^8c^^8d^^8e^^8f%
^^90^^91^^92^^93^^94^^95^^96^^97^^98^^99^^9a^^9b^^9c^^9d^^9e^^9f%
^^a0^^a1^^a2^^a3^^a4^^a5^^a6^^a7^^a8^^a9^^aa^^ab^^ac^^ad^^ae^^af%
^^b0^^b1^^b2^^b3^^b4^^b5^^b6^^b7^^b8^^b9^^ba^^bb^^bc^^bd^^be^^bf%
\do{\expandafter\xdef\csname utf8byte\the\count@\endcsname{\expandafter\string\x}%
    \advance\count@\@ne}%

%% LEADING BYTES

%% we will make them TeX active, as inputenc does with utf8
%% We \protect them à la LaTeX, and writing to a file will
%% have to be made in a \set@display@protect context

\count@ "C2
\loop
\catcode\count@\active
\advance\count@ 1
\ifnum\count@ < "F5
\repeat

\count@ "C2

\@tfor\x:=^^c2^^c3^^c4^^c5^^c6^^c7^^c8^^c9^^ca^^cb^^cc^^cd^^ce^^cf%
^^d0^^d1^^d2^^d3^^d4^^d5^^d6^^d7^^d8^^d9^^da^^db^^dc^^dd^^de^^df%
^^e0^^e1^^e2^^e3^^e4^^e5^^e6^^e7^^e8^^e9^^ea^^eb^^ec^^ed^^ee^^ef%
^^f0^^f1^^f2^^f3^^f4%
\do{\expandafter\protected@xdef
      \csname utf8byte\the\count@\endcsname{\expandafter\protect\x}%
    \advance\count@\@ne}%

\endgroup

%% Time now to define our expandable Unicode to UTF-8 converter

%% a "case-switch" utility
\long\def\xintdothis #1#2\xintorthat #3{\fi #1}%
\let\xintorthat \@firstofone

%% EXPANDABLE UNICODE TO UTF-8 BYTES CONVERTER

%% The macro accepts any numexpr compatible input.
%% This means hexadecimal input must be prefixed by "
%% (and must use uppercase hex digits.)
%% Advantage with such numerical inputs is to allow easy usage
%% over some arithmetic range with counts or counters.

%% NOTE: \numexpr is the only e-TeX extension used.

\newcommand*\UniToUTF[1]{\expandafter\UniToUTF@\the\numexpr#1.}%

\def\UniToUTF@ #1.{%
  \ifnum#1<"80 \xintdothis\uni@ascii\fi
  \ifnum#1<"800 \xintdothis\uni@twooctets\fi
  \ifnum#1<"10000 \xintdothis\uni@threeoctets\fi
%% maybe add here some out of range check ?
  \xintorthat\uni@fouroctets {#1}%
}

%% CONVERSION FORMULAS MORALLY BASED ON EXPLANATIONS OF
%% https://en.wikipedia.org/wiki/UTF-8
%% USING OCTAL NOTATION

\def\uni@ascii#1{\csname utf8byte#1\endcsname}%

\def\uni@twooctets#1{\expandafter\uni@twooctets@i
    \the\numexpr(#1+32)/64-1.#1.}%
\def\uni@twooctets@i#1.#2.{%
    \csname utf8byte\the\numexpr192+#1\expandafter\expandafter\expandafter
    \endcsname
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname
}%

\def\uni@threeoctets#1{\expandafter\uni@threeoctets@i
    \the\numexpr(#1+32)/64-1.#1.}%
\def\uni@threeoctets@i#1.#2.{\expandafter\uni@threeoctets@ii
    \the\numexpr(#1+32)/64-1\expandafter.%
    \the\numexpr#1\expandafter\expandafter\expandafter.%
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@threeoctets@ii#1.#2.{%
    \csname utf8byte\the\numexpr224+#1\expandafter\expandafter\expandafter
    \endcsname
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname
}%

\def\uni@fouroctets#1{\expandafter\uni@fouroctets@i
    \the\numexpr(#1+32)/64-1.#1.}%
\def\uni@fouroctets@i#1.#2.{\expandafter\uni@fouroctets@ii
    \the\numexpr(#1+32)/64-1\expandafter.%
    \the\numexpr#1\expandafter\expandafter\expandafter.%
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@fouroctets@ii#1.#2.{\expandafter\uni@fouroctets@iii
    \the\numexpr(#1+32)/64-1\expandafter.%
    \the\numexpr#1\expandafter\expandafter\expandafter.%
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@fouroctets@iii#1.#2.{%
    \csname utf8byte\the\numexpr240+#1\expandafter\expandafter\expandafter
    \endcsname
    \csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%

\makeatother

% xinttools is loaded only to make the loops in the illustrative code easier

\usepackage{xinttools}

\makeatletter

\newcommand*{\WriteOutSixtyFourUTFchars}[2]{%
% #1 = starting Unicode code point
% #2 = out stream
  \begingroup
    \set@display@protect %%<<<---- makes \protect=\string
    \def\mymacro##1{\UniToUTF{#1+##1}}%
    \immediate\write#2{\xintApplyUnbraced\mymacro{\xintSeq{0}{63}}}%
  \endgroup
}

% For \edef we need to take into account  LaTeX's protection mechanism
% for the UTF-8
% leading bytes, as we defined them to be active for inputenc+utf8 readiness.
\begingroup
\global\let\protectedATedef\protected@edef
\endgroup
\makeatother

\begin{document}

\protectedATedef\x{\UniToUTF{"20AC}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"20AC}}
\typeout{}
\protectedATedef\x{\UniToUTF{"03B1}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"03B1}}
\typeout{}
\protectedATedef\x{\UniToUTF{"0416}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"0416}}

{\catcode127 12
\if\UniToUTF{127}\string^^?\else\ERROR\fi
}
\if\UniToUTF{126}\string~\else\ERROR\fi

{\makeatletter
\if\UniToUTF{37}\expandafter\@gobble\string\%\else\ERROR\fi}

\newwrite\foo

\immediate\openout\foo=\jobname ASCIIMEANINGS.txt

% we don't use \typeout here but write to another file,
% because otherwise Emacs/AUCTeX wrongly reports a compilation
% error, misinterpreting a log that contains the ! character

\begingroup
\makeatletter
\set@display@protect %%<<<---- makes \protect=\string
\xintFor* #1 in{\xintSeq{0}{127}}\do
{\protectedATedef\x{\UniToUTF{#1}}%
 \immediate\write\foo{\UniToUTF{#1} (\meaning\x)}}
\endgroup

\immediate\closeout\foo

\immediate\openout\foo=\jobname ASCII.txt

\WriteOutSixtyFourUTFchars{"0000}\foo
\WriteOutSixtyFourUTFchars{"0040}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname0380.txt

\WriteOutSixtyFourUTFchars{"0380}\foo
\WriteOutSixtyFourUTFchars{"03C0}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname0400.txt

\WriteOutSixtyFourUTFchars{"0400}\foo
\WriteOutSixtyFourUTFchars{"0440}\foo
\WriteOutSixtyFourUTFchars{"0480}\foo
\WriteOutSixtyFourUTFchars{"04C0}\foo
\WriteOutSixtyFourUTFchars{"0500}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname2000.txt

\WriteOutSixtyFourUTFchars{"2000}\foo
\WriteOutSixtyFourUTFchars{"2040}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname4000.txt

\WriteOutSixtyFourUTFchars{"4000}\foo
\WriteOutSixtyFourUTFchars{"4040}\foo

\immediate\closeout\foo

\immediate\openout\foo=\jobname10000.txt

\WriteOutSixtyFourUTFchars{"10000}\foo
\WriteOutSixtyFourUTFchars{"10040}\foo

\immediate\closeout\foo
\end{document}

To illustrate what this does, here is what we find in the log:

macro:->\protect €

macro:->\protect α
α
α

macro:->\protect Ж
Ж
Ж

Then, in the file \jobname ASCII.txt:

^^@^^A^^B^^C^^D^^E^^F^^G^^H 
^^L^^M^^N^^O^^P^^Q^^R^^S^^T^^U^^V^^W^^X^^Y^^Z^^[^^\^^]^^^^^_ !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~^^?

(CTRL-I, CTRL-J and CTRL-K are present there as genuine bytes.)

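Then, in the file \jobname ASCIIMEANINGS.txt (this is the default situation; one wonders why CTRL-K, the vertical tab, is handled as a genuine byte... see the note below about the -8bit option of pdflatex...):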

^^@ (macro:->^^@)
^^A (macro:->^^A)
^^B (macro:->^^B)
^^C (macro:->^^C)
^^D (macro:->^^D)
^^E (macro:->^^E)
^^F (macro:->^^F)
^^G (macro:->^^G)
^^H (macro:->^^H)
     (macro:->  )

 (macro:->
)
 (macro:->)
^^L (macro:->^^L)
^^M (macro:->^^M)
^^N (macro:->^^N)
^^O (macro:->^^O)
^^P (macro:->^^P)
^^Q (macro:->^^Q)
^^R (macro:->^^R)
^^S (macro:->^^S)
^^T (macro:->^^T)
^^U (macro:->^^U)
^^V (macro:->^^V)
^^W (macro:->^^W)
^^X (macro:->^^X)
^^Y (macro:->^^Y)
^^Z (macro:->^^Z)
^^[ (macro:->^^[)
^^\ (macro:->^^\)
^^] (macro:->^^])
^^^ (macro:->^^^)
^^_ (macro:->^^_)
  (macro:-> )
! (macro:->!)
" (macro:->")
# (macro:->#)
$ (macro:->$)
% (macro:->%)
& (macro:->&)
' (macro:->')
( (macro:->()
) (macro:->))
* (macro:->*)
+ (macro:->+)
, (macro:->,)
- (macro:->-)
. (macro:->.)
/ (macro:->/)
0 (macro:->0)
1 (macro:->1)
2 (macro:->2)
3 (macro:->3)
4 (macro:->4)
5 (macro:->5)
6 (macro:->6)
7 (macro:->7)
8 (macro:->8)
9 (macro:->9)
: (macro:->:)
; (macro:->;)
< (macro:-><)
= (macro:->=)
> (macro:->>)
? (macro:->?)
@ (macro:->@)
A (macro:->A)
B (macro:->B)
C (macro:->C)
D (macro:->D)
E (macro:->E)
F (macro:->F)
G (macro:->G)
H (macro:->H)
I (macro:->I)
J (macro:->J)
K (macro:->K)
L (macro:->L)
M (macro:->M)
N (macro:->N)
O (macro:->O)
P (macro:->P)
Q (macro:->Q)
R (macro:->R)
S (macro:->S)
T (macro:->T)
U (macro:->U)
V (macro:->V)
W (macro:->W)
X (macro:->X)
Y (macro:->Y)
Z (macro:->Z)
[ (macro:->[)
\ (macro:->\)
] (macro:->])
^ (macro:->^)
_ (macro:->_)
` (macro:->`)
a (macro:->a)
b (macro:->b)
c (macro:->c)
d (macro:->d)
e (macro:->e)
f (macro:->f)
g (macro:->g)
h (macro:->h)
i (macro:->i)
j (macro:->j)
k (macro:->k)
l (macro:->l)
m (macro:->m)
n (macro:->n)
o (macro:->o)
p (macro:->p)
q (macro:->q)
r (macro:->r)
s (macro:->s)
t (macro:->t)
u (macro:->u)
v (macro:->v)
w (macro:->w)
x (macro:->x)
y (macro:->y)
z (macro:->z)
{ (macro:->{)
| (macro:->|)
} (macro:->})
~ (macro:->~)
^^? (macro:->^^?)

Important update: to get genuine bytes for the whole 7-bit ascii range, use pdflatex -8bit.

Then, in the file \jobname 0380.txt:

΁΂΃΄΅Ά·ΈΉΊ΋Ό΍ΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξο
πρςστυφχψωϊϋόύώϏϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶ϷϸϹϺϻϼϽϾϿ

Then, in the file \jobname 0400.txt:

ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмноп
рстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿ
Ҁҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿ
ӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ
ԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧԨԩԪԫԬԭԮԯ԰ԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿ

Then, in the file \jobname 4000.txt:

䀀䀁䀂䀃䀄䀅䀆䀇䀈䀉䀊䀋䀌䀍䀎䀏䀐䀑䀒䀓䀔䀕䀖䀗䀘䀙䀚䀛䀜䀝䀞䀟䀠䀡䀢䀣䀤䀥䀦䀧䀨䀩䀪䀫䀬䀭䀮䀯䀰䀱䀲䀳䀴䀵䀶䀷䀸䀹䀺䀻䀼䀽䀾䀿
䁀䁁䁂䁃䁄䁅䁆䁇䁈䁉䁊䁋䁌䁍䁎䁏䁐䁑䁒䁓䁔䁕䁖䁗䁘䁙䁚䁛䁜䁝䁞䁟䁠䁡䁢䁣䁤䁥䁦䁧䁨䁩䁪䁫䁬䁭䁮䁯䁰䁱䁲䁳䁴䁵䁶䁷䁸䁹䁺䁻䁼䁽䁾䁿

These are only examples. How they render in the browser depends on its font settings.

Here is the file \jobname 10000.txt as displayed in an Emacs buffer:

[screenshot]

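Update: an image showing how the ASCIIMEANINGS file is rendered in an Emacs buffer, proving that in the default usage of pdflatex the ^^K there really is a single byte:

[screenshot]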

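Important additional note

That was the default situation, but using pdflatex -8bit solves all our woes. Here is a screenshot of the Emacs buffer for the ASCII output file:

[screenshot]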

Thus, with -8bit, we can indeed create arbitrary UTF-8 encoded files from within pdflatex, using only Unicode code points as input.

Answer 2

\documentclass{article}
\usepackage[utf8]{inputenc}
\DeclareUnicodeCharacter{20AC}{Hallo}
\DeclareUnicodeCharacter{12345}{Bye}
\begin{document}

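Doing what the question asks has two parts:

  • Converting a Unicode code point to its UTF-8 encoding, e.g. converting U+20AC to the byte sequence E2 82 AC.
  • Injecting these bytes into the input stream, or doing the equivalent of typing them there.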

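The first part is a UTF-8 encoder. It is already implemented in utf8ienc.dtx/utf8.def, but below is a fresh implementation. It was fun to write, and it differs from the official implementation. (For one thing, it uses the e-TeX extension \numexpr, which could not be assumed when the original implementation was written... but apart from that new primitive it uses no external macros, and it can be used in plain (e)TeX as well as in LaTeX.)

Here is how it works (and how UTF-8 is defined). Given a number n (representing a Unicode code point, e.g. "20AC is the number 8364, which represents the code point U+20AC EURO SIGN):

  • Look at the size of n to determine how many bytes the UTF-8 encoding will have.
  • Peel off six bits at a time from the right: these bits, with the prefix 10, become the rightmost bytes.
  • For the first byte, use the remaining bits with a prefix that depends on the number of bytes.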
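
As a worked instance of these three steps for the example code point (my own arithmetic, using floor division and remainders in place of bit operations; the names b_1, b_2, b_3 are just labels for the three bytes):

```latex
\begin{align*}
n   &= \texttt{"20AC} = 8364, \qquad 2048 \le n < 65536
      \;\Rightarrow\; \text{three bytes}\\
b_3 &= 128 + (8364 \bmod 64) = 128 + 44 = 172 = \texttt{"AC}\\
b_2 &= 128 + \bigl(\lfloor 8364/64 \rfloor \bmod 64\bigr)
     = 128 + (130 \bmod 64) = 128 + 2 = 130 = \texttt{"82}\\
b_1 &= 224 + \lfloor 8364/4096 \rfloor = 224 + 2 = 226 = \texttt{"E2}
\end{align*}
```

This gives the byte sequence E2 82 AC. The offsets 128 and 224 are the prefixes 10 and 1110 written as byte values.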

The code that does this:

% Division and remainder. Based on https://tex.stackexchange.com/a/34449/48 but with bugs fixed. Note: assumes #2 positive
\def\modulo#1#2{\ifnum \numexpr(#1 - (#1/#2)*(#2))\relax < 0 (#1 - (#1/#2)*(#2) + #2) \else (#1 - (#1/#2)*(#2)) \fi}
\def\truncdiv#1#2{((#1 - \modulo{#1}{#2})/(#2))}
% The hypothetical continuation bytes: last byte, the one before that, etc.
\def\utfByteLastOne  #1{\numexpr(128 + \modulo{#1}{64})\relax}
\def\utfByteLastTwo  #1{\numexpr(128 + \modulo{\truncdiv{#1}{64}}{64})\relax}
\def\utfByteLastThree#1{\numexpr(128 + \modulo{\truncdiv{#1}{(64*64)}}{64})\relax}
\def\utfByteLastFour #1{\numexpr(128 + \modulo{\truncdiv{#1}{(64*64*64)}}{64})\relax}
% The actual individual bytes in the stream, for lengths 1 to 4.
\def\utfStreamOneByteOne    #1{#1}
\def\utfStreamTwoByteOne    #1{\numexpr(\utfByteLastTwo{#1} + 64)\relax}
\def\utfStreamTwoByteTwo    #1{\utfByteLastOne{#1}}
\def\utfStreamThreeByteOne  #1{\numexpr(\utfByteLastThree{#1} + 96)\relax}
\def\utfStreamThreeByteTwo  #1{\utfByteLastTwo{#1}}
\def\utfStreamThreeByteThree#1{\utfByteLastOne{#1}}
\def\utfStreamFourByteOne   #1{\numexpr(\utfByteLastFour{#1} + 112)\relax}
\def\utfStreamFourByteTwo   #1{\utfByteLastThree{#1}}
\def\utfStreamFourByteThree #1{\utfByteLastTwo{#1}}
\def\utfStreamFourByteFour  #1{\utfByteLastOne{#1}}
% Expands to \utfCallbackOne{#1} or ... or \utfCallbackFour{#1} depending on whether the code point #1 has 1, 2, 3, or 4 bytes in its UTF-8 encoding.
\def\utfStreamFromNumber#1{%
    \ifnum #1 < 128
        \utfCallbackOne{#1}%
    \else \ifnum #1 < 2048  % 2^11
              \utfCallbackTwo{#1}%
          \else \ifnum #1 < 65536  % 2^16
                    \utfCallbackThree{#1}%
                \else
                    \utfCallbackFour{#1}%
                \fi
          \fi
    \fi
}

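The interface of this UTF-8 encoder is quite general (and completely independent of inputenc), because you can simply redefine \utfCallbackOne to \utfCallbackFour to do different things with the bytes of the UTF-8 stream. For example, see the first revision of this answer, which only printed out the bytes (and did not use \usepackage[utf8]{inputenc}).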


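The second part (injecting arbitrary bytes/tokens into the input stream as if the user had typed them) is easy in LuaTeX but nontrivial in other engines. However, if all we want is to simulate what inputenc would do if those tokens were inserted (which is what the question asks for), then we can exploit the fact that inputenc stores the definitions made by \DeclareUnicodeCharacter in internal macros named u8:<byte1><byte2><byte3> etc., and simply call those macros. The code below uses the \lccode and \lowercase trick (inspired by utf8ienc.dtx) to construct the macro names containing the "special" bytes.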
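
Reduced to a single hard-coded code point, the \lccode/\lowercase trick can be sketched as follows. This is a standalone illustration rather than the answer's actual code; the letters A, B and C are arbitrary placeholders that \lowercase replaces by the bytes E2 82 AC.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\DeclareUnicodeCharacter{20AC}{Hallo}
\begin{document}
% Inside the group, A, B, C are mapped to the three UTF-8 bytes of U+20AC.
% \lowercase then turns "u8:ABC" into "u8:" followed by those bytes, and
% \csname builds the internal macro that \DeclareUnicodeCharacter defined.
{\lccode`A="E2 \lccode`B="82 \lccode`C="AC
 \lowercase{\csname u8:ABC\endcsname}}
\end{document}
```

With the declaration above this should typeset Hallo. The general callbacks do the same thing, except that the byte values are computed from the code point instead of being written out by hand.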

% These get the full byte stream -- here, do whatever you want with the individual bytes.
% In this case, these callbacks call internal control sequences u8:... defined by \DeclareUnicodeCharacter
\def\utfCallbackOne  #1{{\char\utfStreamOneByteOne{#1}}}
\def\utfCallbackTwo  #1{{\lccode`A=\utfStreamTwoByteOne{#1}\lccode`B=\utfStreamTwoByteTwo{#1}\lowercase{\csname u8:AB\endcsname}}}
\def\utfCallbackThree#1{{\lccode`A=\utfStreamThreeByteOne{#1}\lccode`B=\utfStreamThreeByteTwo{#1}\lccode`C=\utfStreamThreeByteThree{#1}\lowercase{\csname u8:ABC\endcsname}}}
\def\utfCallbackFour #1{{\lccode`A=\utfStreamFourByteOne{#1}\lccode`B=\utfStreamFourByteTwo{#1}\lccode`C=\utfStreamFourByteThree{#1}\lccode`D=\utfStreamFourByteFour{#1}\lowercase{\csname u8:ABCD\endcsname}}}
\newcommand\unicodetoutf[1]{\utfStreamFromNumber{"#1}}

€ ^^e2^^82^^ac \unicodetoutf{20AC}

© ^^c2^^a9 \unicodetoutf{00A9}

\end{document}

Answer 3

This implements the Unicode to UTF-8 conversion algorithm directly, in a fully expandable way.

It uses the standard feature that a UTF-8 sequence abcd (of up to four characters) is converted into the control sequence \u8:abcd.
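
The same naming convention can be used to test whether a code point has a definition at all. The sketch below uses plain e-TeX primitives instead of expl3 and assumes that the file is saved as UTF-8; it is only an illustration, not part of the answer's code.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\DeclareUnicodeCharacter{20AC}{Hallo}
\begin{document}
% \detokenize turns the three bytes of the euro sign into plain characters,
% so that \ifcsname can look up the internal macro u8:<bytes>.
\ifcsname u8:\detokenize{€}\endcsname
  \typeout{U+20AC is declared}%
\else
  \typeout{U+20AC is not declared}%
\fi
\end{document}
```

The code in this answer performs the equivalent check with \cs_if_exist_use:cF and raises an error in the undefined case.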

If the Unicode point is not defined, pdflatex stops with the error

! Undefined control sequence.
<argument> \ERROR 
                  BAD UTF (U+AAAA)
l.98 \unicodetoutf{AAAA}

Here is the code

% algorithm from https://home.kpn.nl/vanadovv/uni/utf8conversion.html

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn
\NewExpandableDocumentCommand{\unicodetoutf}{m}
 {
  \int_compare:nTF { "#1 < 128 }
   {
    \char_generate:nn { "#1 } { 12 }
   }
   {
    \egreg_unicodetoutf:n { #1 }
   }
 }

\cs_new:Nn \egreg_unicodetoutf:n
 {
  \__egreg_unicodetoutf:nf { #1 }
   {
    \int_compare:nTF { "#1 < 2048 }
     {
      \__egreg_unicodetoutf_two:n { #1 }
     }
     {
      \int_compare:nTF { "#1 < 65536 }
       {
        \__egreg_unicodetoutf_three:n { #1 }
       }
       {
        \__egreg_unicodetoutf_four:n { #1 }
       }
     }
   }
 }

\cs_new:Nn \__egreg_unicodetoutf:nn
 {
  \cs_if_exist_use:cF { #2 } { \ERROR BAD~UTF~(U+#1) }
 }
\cs_generate_variant:Nn \__egreg_unicodetoutf:nn { nf }

\cs_new:Nn \__egreg_unicodetoutf_two:n
 {
  u8:
  \char_generate:nn { 192 + \int_div_truncate:nn { "#1 } { 64 } } { 12 }
  \char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
 }

\cs_new:Nn \__egreg_unicodetoutf_three:n
 {
  u8:
  \char_generate:nn { 224 + \int_div_truncate:nn { "#1 } { 4096} } { 12 }
  \char_generate:nn
   {
    128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 64 } } { 64 }
   }
   { 12 }
  \char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
 }

\cs_new:Nn \__egreg_unicodetoutf_four:n
 {
  u8:
  \char_generate:nn { 240 + \int_div_truncate:nn { "#1 } { 262144 } } { 12 }
  \char_generate:nn
   {
    128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 4096 } } { 64 }
   }
   { 12 }
  \char_generate:nn
   {
    128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 64 } } { 64 }
   }
   { 12 }
  \char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
 }
\ExplSyntaxOff

\DeclareUnicodeCharacter{00E9}{b}
\DeclareUnicodeCharacter{20AC}{Hallo}
\DeclareUnicodeCharacter{10006}{World}

\begin{document}

a ^^61 \unicodetoutf{0061}

é ^^c3^^a9 \unicodetoutf{00E9}

€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo

\end{document}
