How can I find out how many times a font is used on a single PDF page?


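I know how to list the fonts used on a single PDF page, and how to identify which fonts tend to produce garbled output during text extraction. What I want to know is how to count how many characters of text each font accounts for on a single PDF page. For example, suppose page 4 of my_file.pdf uses Arial, TimesNewRoman, and MyriadPro-Regular. I would like to get a result like this: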

Page 4
Arial: 1200 characters
TimesNewRoman: 200 characters
MyriadPro-Regular: 10 characters

There is a related SE question, but what I want is to count the occurrences of thousands of characters across multiple pages.

I tried converting one PDF page to SVG online (https://cloudconvert.com/pdf-to-svg), keeping the font paths, but this is what I got:

<g id="Layer-1" data-name="Artifact">
<clipPath id="cp12">
<path transform="matrix(1,0,0,-1,237.4765,826.6425)" d="M 0 -8 L 120.047 -8 L 120.047 .800049 L 0 .800049 Z "/>
</clipPath>
<g clip-path="url(#cp12)">
<symbol id="font_8_2d">
<path d="M .001302084 .21533203 L .086751308 .22705078 C .089029949 .17236328 .09928385 .13492839 .11751302 .114746097 C .13574219 .094563808 .16097005 .084472659 .19319661 .084472659 C .21695964 .084472659 .23746746 .08984375 .25472007 .10058594 C .27197267 .11165365 .28385417 .12646485 .2903646 .14501953 C .296875 .16389974 .30013023 .19384766 .30013023 .23486328 L .30013023 .72802737 L .39485679 .72802737 L .39485679 .24023438 C .39485679 .18033855 .38753257 .13395183 .37288413 .10107422 C .3585612 .06819662 .33561198 .04313151 .30403648 .025878907 C .27278648 .008626302 .23600261 0 .19368489 0 C .13085938 0 .08268229 .018066407 .04915365 .05419922 C .015950522 .09033203 .00000000062088176 .14404297 .001302084 .21533203 Z "/>
</symbol>
<symbol id="font_8_7b">
<path d="M 0 .2709961 C 0 .36702476 .026692709 .43815104 .080078128 .484375 C .12467448 .52278646 .17903646 .5419922 .24316406 .5419922 C .31445313 .5419922 .37272135 .5185547 .41796876 .4716797 C .46321617 .42513023 .48583985 .3606771 .48583985 .2783203 C .48583985 .21158855 .4757487 .15901692 .4555664 .12060547 C .43570964 .08251953 .40657554 .052897138 .36816407 .03173828 C .33007813 .010579427 .28841148 0 .24316406 0 C .17057292 0 .111816409 .02327474 .06689453 .06982422 C .022298178 .116373699 0 .18343099 0 .2709961 M .09033203 .2709961 C .09033203 .20458985 .10481771 .15478516 .13378906 .12158203 C .16276042 .08870443 .19921875 .072265628 .24316406 .072265628 C .28678385 .072265628 .32307945 .08886719 .35205079 .12207031 C .38102214 .15527344 .3955078 .20589192 .3955078 .27392579 C .3955078 .33805339 .38085938 .386556 .3515625 .4194336 C .32259117 .45263673 .28645835 .46923829 .24316406 .46923829 C .19921875 .46923829 .16276042 .45279948 .13378906 .41992188 C .10481771 .38704429 .09033203 .33740235 .09033203 .2709961 M .24414063 .6777344 L .18896485 .59472659 L .088378909 .59472659 L .19384766 .7314453 L .28759767 .7314453 L .39746095 .59472659 L .29785157 .59472659 L .24414063 .6777344 Z "/>
</symbol>
<symbol id="font_8_4d">
<path d="M .111328128 .82421877 L .111328128 .92626956 L .19921875 .92626956 L .19921875 .82421877 L .111328128 .82421877 M 0 .009277344 L .016601563 .083984378 C .034179689 .079427089 .048014326 .07714844 .05810547 .07714844 C .07600912 .07714844 .08935547 .08317057 .09814453 .095214847 C .106933597 .106933597 .111328128 .13655599 .111328128 .18408203 L .111328128 .7290039 L .19921875 .7290039 L .19921875 .1821289 C .19921875 .11832682 .19091797 .07389323 .1743164 .048828126 C .15315755 .016276041 .118001308 0 .068847659 0 C .045084638 0 .022135416 .003092448 0 .009277344 Z "/>
</symbol>
<symbol id="font_8_11">
<path d="M 0 0 L 0 .100097659 L .100097659 .100097659 L .100097659 0 L 0 0 Z "/>
</symbol>
<use xlink:href="#font_8_27" transform="matrix(8,0,0,-8,238.09369,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_2d" transform="matrix(8,0,0,-8,243.47255,833.05618)" fill="#0000ff"/>
<use xlink:href="#font_8_3" transform="matrix(8,0,0,-8,247.2525,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_28" transform="matrix(8,0,0,-8,250.10932,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_4f" transform="matrix(8,0,0,-8,255.32422,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_48" transform="matrix(8,0,0,-8,256.88148,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_57" transform="matrix(8,0,0,-8,261.17713,833.0132)" fill="#0000ff"/>
<use xlink:href="#font_8_55" transform="matrix(8,0,0,-8,263.78004,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_7b" transform="matrix(8,0,0,-8,266.19013,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_51" transform="matrix(8,0,0,-8,270.89985,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_4c" transform="matrix(8,0,0,-8,275.35176,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_46" transform="matrix(8,0,0,-8,276.909,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_52" transform="matrix(8,0,0,-8,280.86213,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_3" transform="matrix(8,0,0,-8,285.0445,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_10" transform="matrix(8,0,0,-8,287.5274,831.23977)" fill="#0000ff"/>
<use xlink:href="#font_8_3" transform="matrix(8,0,0,-8,289.9375,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_24" transform="matrix(8,0,0,-8,292.14878,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_46" transform="matrix(8,0,0,-8,297.80903,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_48" transform="matrix(8,0,0,-8,301.7895,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_56" transform="matrix(8,0,0,-8,306.1906,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_56" transform="matrix(8,0,0,-8,310.1906,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_48" transform="matrix(8,0,0,-8,314.2375,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_1d" transform="matrix(8,0,0,-8,319.11518,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_3" transform="matrix(8,0,0,-8,320.61653,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_57" transform="matrix(8,0,0,-8,322.98014,833.0132)" fill="#0000ff"/>
<use xlink:href="#font_8_4d" transform="matrix(8,0,0,-8,324.69633,834.6421)" fill="#0000ff"/>
<use xlink:href="#font_8_4a" transform="matrix(8,0,0,-8,327.09733,834.6421)" fill="#0000ff"/>
<use xlink:href="#font_8_52" transform="matrix(8,0,0,-8,331.55314,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_11" transform="matrix(8,0,0,-8,336.46208,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_4d" transform="matrix(8,0,0,-8,337.59233,834.6421)" fill="#0000ff"/>
<use xlink:href="#font_8_58" transform="matrix(8,0,0,-8,340.24723,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_56" transform="matrix(8,0,0,-8,344.4296,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_11" transform="matrix(8,0,0,-8,348.91007,832.9585)" fill="#0000ff"/>
<use xlink:href="#font_8_45" transform="matrix(8,0,0,-8,350.93095,833.05227)" fill="#0000ff"/>
<use xlink:href="#font_8_55" transform="matrix(8,0,0,-8,355.37504,832.9585)" fill="#0000ff"/>
</g>
</g>

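The above is part of the generated SVG file. I don't know how to interpret it. Maybe I could get the font frequencies from it, but I don't think it is possible to link, for example, "Arial" to the font symbols in this SVG file.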
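For what it's worth, here is a rough sketch of how the glyph references in such an SVG could be tallied. It assumes that every glyph is drawn by a `<use>` element whose `xlink:href` follows the `#font_<index>_<hex glyph id>` pattern seen in the excerpt, and that the full file declares the `xlink` namespace on its root element. The function name `count_glyph_uses` and the `font_9` entry in the sample are hypothetical. The result is keyed by font index only, so it does not solve the problem of mapping an index such as `font_8` back to a name like Arial.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Clark-notation name of the xlink:href attribute as ElementTree exposes it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Assumed id pattern, taken from the excerpt: #font_<index>_<hex glyph id>.
FONT_REF = re.compile(r"^#font_(\d+)_[0-9a-fA-F]+$")


def count_glyph_uses(svg_text):
    """Count <use> glyph references per font index in an SVG string."""
    root = ET.fromstring(svg_text)
    counts = Counter()
    for elem in root.iter():
        # Drop any "{namespace}" prefix so namespaced <use> tags match too.
        if elem.tag.rsplit("}", 1)[-1] != "use":
            continue
        href = elem.get(XLINK_HREF) or elem.get("href") or ""
        match = FONT_REF.match(href)
        if match:
            counts["font_" + match.group(1)] += 1
    return counts


# Tiny hand-made sample; the font_9 reference is invented for illustration.
sample = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink">
<use xlink:href="#font_8_27" transform="matrix(8,0,0,-8,238.09369,832.9585)"/>
<use xlink:href="#font_8_2d" transform="matrix(8,0,0,-8,243.47255,833.05618)"/>
<use xlink:href="#font_9_3" transform="matrix(8,0,0,-8,247.2525,832.9585)"/>
</svg>"""

for font_id, n in sorted(count_glyph_uses(sample).items()):
    print(f"{font_id}: {n} glyph references")
```

Each count is a number of drawn glyphs, which is only an approximation of the number of characters: ligatures, or spaces that are not drawn at all, would make the two differ.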
