Detecting missing glyphs in text

Answer 1

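This differs from the approach you have taken so far, but you could use Python's str.replace() or re.sub() to strip the hex-escaped characters out of the body of the text. For example:

If the hex sequence is predictable: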

originalText = "\xc3\xa5Test"
filteredText = originalText.replace("\xc3\xa5", "")

Or, if you need a regular expression that matches any such character (anything outside the ASCII range):

import re

originalText = "\xc3\xa5Test"
filteredText = re.sub(r'[^\x00-\x7f]', r'', originalText)

There is more good discussion of this strategy elsewhere.
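To make the two options above easier to reuse, here is a minimal sketch that wraps the regex in a helper. The names strip_non_ascii and repair_mojibake are my own, not from any library. The second helper rests on an assumption: that "\xc3\xa5" is really the UTF-8 encoding of "å" decoded as Latin-1, in which case re-decoding keeps the character instead of dropping it. If that assumption does not hold for your data, stick with stripping.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def strip_non_ascii(text):
    """Return `text` with every non-ASCII character removed."""
    return NON_ASCII.sub("", text)


def repair_mojibake(text):
    """Try to undo UTF-8 bytes that were decoded as Latin-1.

    Assumption: the input is mojibake of that specific kind.
    On any encode/decode failure the input is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(strip_non_ascii("\xc3\xa5Test"))   # Test
print(repair_mojibake("\xc3\xa5Test"))   # åTest
```

Stripping is lossy, so it is worth checking whether the text can be repaired before discarding characters.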

Answer 2

Unicode shaping engines

Use a Unicode shaping engine (such as HarfBuzz) to detect missing glyphs. Here is a working example:

from pyharfbuzz import shape

# Path to a font file; adjust for your system.
f = "/usr/local/lib/python3.6/site-packages/werkzeug/debug/shared/ubuntu.ttf"

t = "®"
s = shape(f, t)
print(s[0]['glyph_name'])  # shape() returns a list with one entry per glyph

t = "რ"
s = shape(f, t)
print(s[0]['glyph_name'])

Output:

registered
.notdef

Inspecting the results in IDLE3 shows the following:

>>> t = "®"
>>> s = shape(f, t)
>>> s
[{'cluster': 0, 'glyph_name': 'registered', 'x_advance': 29.453125, 'y_advance': 0.0, 'x_offset': 0.0, 'y_offset': 0.0}]
>>> t = "რ"
>>> s = shape(f, t)
>>> s
[{'cluster': 0, 'glyph_name': '.notdef', 'x_advance': 36.0, 'y_advance': 0.0, 'x_offset': 0.0, 'y_offset': 0.0}]

Make sure the font path is correct for your system; I simply picked the first font path I found on my current machine.
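The check itself can be sketched as a small helper that scans the shaping result for '.notdef'. It assumes only the list-of-dicts format shown in the interpreter session above; missing_clusters is a hypothetical name of my own, and the sample data is copied from that session so the snippet runs without a font or pyharfbuzz installed.

```python
def missing_clusters(shaped):
    """Return the cluster indices whose glyph resolved to '.notdef'.

    `shaped` is assumed to be a list of dicts with 'cluster' and
    'glyph_name' keys, as in the interpreter output shown above.
    """
    return [g["cluster"] for g in shaped if g["glyph_name"] == ".notdef"]


# Sample results copied from the session above.
registered = [{'cluster': 0, 'glyph_name': 'registered', 'x_advance': 29.453125,
               'y_advance': 0.0, 'x_offset': 0.0, 'y_offset': 0.0}]
georgian = [{'cluster': 0, 'glyph_name': '.notdef', 'x_advance': 36.0,
             'y_advance': 0.0, 'x_offset': 0.0, 'y_offset': 0.0}]

print(missing_clusters(registered))  # []
print(missing_clusters(georgian))    # [0]
```

An empty list means every glyph was found in the font; any other result points at the positions in the input that the font cannot render.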

Note:

I am sure Gtk/Pango offers similar functionality, since Pango has switched to HarfBuzz under the hood. However, I have no experience with those libraries.
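If installing a shaping engine is not an option, a rougher alternative is to look characters up directly in the font's character map. The sketch below swaps in a different technique from the one above: a plain cmap lookup using fontTools (TTFont and getBestCmap), not shaping. It therefore ignores ligatures, combining sequences and font fallback. The helper names load_cmap and missing_chars are my own, and the demo uses a stand-in dictionary rather than a real font.

```python
def load_cmap(font_path):
    """Return the font's code point -> glyph name map.

    Requires the third-party fontTools package (pip install fonttools).
    """
    from fontTools.ttLib import TTFont
    font = TTFont(font_path)
    # getBestCmap() may return None if the font has no Unicode cmap.
    return font.getBestCmap() or {}


def missing_chars(text, cmap):
    """Return characters of `text` with no entry in `cmap`, in order, without repeats."""
    missing = []
    for ch in text:
        if ord(ch) not in cmap and ch not in missing:
            missing.append(ch)
    return missing


# Stand-in cmap for illustration; a real one would come from load_cmap(path).
demo_cmap = {0x0041: "A", 0x00AE: "registered"}
print(missing_chars("A®რ", demo_cmap))  # ['რ']
```

With a real font you would call missing_chars(text, load_cmap(f)), using the same font path f as in the example above.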

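Answer 3

I have worked out a solution. Initially I assumed the fortune text files did not contain any hex (control) characters. That turned out to be wrong. Once I realized this, I came up with the following: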

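import codecs
import subprocess

# Run the fortune program and capture its output as text.
# Assumes `fortune` is on PATH; substitute your own invocation as needed.
fortune = subprocess.run(["fortune"], capture_output=True, text=True).stdout
output = ""
for c in fortune:
    # Skip the BEL control character (hex 07), which has no glyph.
    if codecs.encode(str.encode(c), "hex") == b'07':
        continue

    output += c

print(output)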
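The loop above removes only the BEL character. If other control characters turn up in the fortune files, the same idea can be generalized. This is a sketch under my own assumptions: that every Unicode control character (category "Cc") except newline and tab should be dropped. The name strip_control_chars and the keep set are hypothetical choices, not part of the original solution.

```python
import unicodedata

# Control characters worth keeping in ordinary text output (an assumption).
KEEP = frozenset({"\n", "\t"})


def strip_control_chars(text, keep=KEEP):
    """Remove control characters (category 'Cc') except those in `keep`."""
    return "".join(
        c for c in text
        if c in keep or unicodedata.category(c) != "Cc"
    )


print(repr(strip_control_chars("Hi\x07 there\n")))  # 'Hi there\n'
```

For the single-character case, fortune.replace("\x07", "") does the same job as the original loop in one call.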
