在 Windows XP 中查找文件名中包含非 ASCII 字符的文件

Question 1

以下是使用 Powershell 的方法：

gci -recurse . | where {$_.Name -cmatch "[^\u0000-\u007F]"}

Answer

以下是使用 Powershell 的方法：

gci -recurse . | where {$_.Name -cmatch "[^\u0000-\u007F]"}

Question 2

我最终为此编写了一个 Python 脚本。发布它以防它对任何人有帮助。欢迎随时移至 StackOverflow。

import sys, os


def main(argv):
    if len(argv) != 2:
        raise Exception('Syntax: FindUnicodeFiles.py <directory>')

    startdir = argv[1]
    if not os.path.isdir(startdir):
        raise Exception('"%s" is not a directory' % startdir)

    for r in recurse_breadth_first(startdir, is_unicode_filename):
        print(r)


def recurse_breadth_first(dirpath, test_func):
    namesandpaths = [(f, os.path.join(dirpath, f)) for f in os.listdir(dirpath)]

    for (name, path) in namesandpaths:
        if test_func(name):
            yield path

    for (_, path) in namesandpaths:
        if os.path.isdir(path):
            for r in recurse_breadth_first(path, test_func):
                yield r


def is_unicode_filename(filename):
    return any(ord(c) >= 0x7F for c in filename)


if __name__ == '__main__':
    main(sys.argv)

Answer

我最终为此编写了一个 Python 脚本。发布它以防它对任何人有帮助。欢迎随时移至 StackOverflow。

import sys, os


def main(argv):
    if len(argv) != 2:
        raise Exception('Syntax: FindUnicodeFiles.py <directory>')

    startdir = argv[1]
    if not os.path.isdir(startdir):
        raise Exception('"%s" is not a directory' % startdir)

    for r in recurse_breadth_first(startdir, is_unicode_filename):
        print(r)


def recurse_breadth_first(dirpath, test_func):
    namesandpaths = [(f, os.path.join(dirpath, f)) for f in os.listdir(dirpath)]

    for (name, path) in namesandpaths:
        if test_func(name):
            yield path

    for (_, path) in namesandpaths:
        if os.path.isdir(path):
            for r in recurse_breadth_first(path, test_func):
                yield r


def is_unicode_filename(filename):
    return any(ord(c) >= 0x7F for c in filename)


if __name__ == '__main__':
    main(sys.argv)

Question 3

打开命令提示符cmd 执行命令：chcp 65001这会将代码页更改为 UTF-8

然后，创建文件列表：dir c:\myfolder /b /s >c:\filelist.txt

现在您有了一个文本文件，utf-8 编码，其中包含您要搜索的所有文件名。现在您可以将此文件加载到记事本++（它是一个免费的文本编辑器）并使用正则表达式搜索非 ascii 字符，例如[^\x{0000}-\x{007F}]：

Answer

打开命令提示符cmd 执行命令：chcp 65001这会将代码页更改为 UTF-8

然后，创建文件列表：dir c:\myfolder /b /s >c:\filelist.txt

现在您有了一个文本文件，utf-8 编码，其中包含您要搜索的所有文件名。现在您可以将此文件加载到记事本++（它是一个免费的文本编辑器）并使用正则表达式搜索非 ascii 字符，例如[^\x{0000}-\x{007F}]：

Question 4

这是 Microsoft 字符类。它仅适用于区分大小写的匹配。大写 P 表示“不”。

gci -recurse . | where { $_.name -cmatch '\P{IsBasicLatin}' } # 0000 - 007F

或者只是不可打印，不是从空格到波浪号，也不是制表符：

gci -recurse . | where { $_.name -cmatch '[^ -~\t]' }

搜索非 ASCII 字符时会遇到一些问题，具体取决于您如何搜索。有 2 个非 ASCII 字符，其小写版本为 ASCII、土耳其语İ(0x130) 和开尔文符号K(0x212a)。在此示例中，小写 i 和大写 I 都匹配为非 ASCII：

# powershell
echo i I | where { $_ -match '[\u0080-\uffff]' }

i
I

一种解决方法是使用区分大小写的搜索（此处为 cmatch）：

echo i I | where { $_ -cmatch '[\u0080-\uffff]' }  # no output

开尔文K匹配为 ASCII（不区分大小写）：

[char]0x0212a | select-string '[\u0000-\u007f]' 

K

区分大小写可以正常工作。

[char]0x0212a | select-string '[\u0000-\u007f]' -CaseSensitive # no output

Answer