The ultimate tool for splitting/extracting street names from street numbers


I have an addresses.csv containing addresses in various international formats:

Example Street 1
Teststraße 2
Teststr. 1-5
Baker Street 221b
221B Baker Street
19th Ave 3B
3B 2nd Ave
1-3 2nd Mount x Ave
105 Lock St # 219
Test Street, 1
BookAve, 54, Extra Text 123#

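For example, in Germany we write Teststraße 2, while in the USA it is 2 Test Street.

Is there a way to separate/extract all street names and street numbers?

Output names.csv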

Example Street
Teststraße
Teststr.
Baker Street
Baker Street
19th Ave
2nd Ave
2nd Mount x Ave
Lock St # 219
Test Street
BookAve

Output numbers.csv

1
2
1-5
221b
221B
3B
3B
1-3
105
1
54

Output extra_text.csv











Extra Text 123#

I'm on macOS 13; the shell is zsh 5.8.1 or bash-3.2.


My idea: you could first sort the addresses into cases, like this:

x='19th Ave 3B'    # one address line
if [[ $x =~ ^[[:alpha:]] ]]; then
    if [[ $x =~ ^[[:alpha:]][0-9]+[[:space:]] ]]; then
        # begins with a letter + number + SPACE
        echo 'something like "A1 Street"'
        # NUMBER = 'A1' / NAME = 'Street'
    else
        echo 'It begins with the STREET-NAME'
    fi
elif [[ $x =~ ^[0-9] ]]; then
    echo 'maybe STREET-NAME like "19th Ave 19B" or STREET-NUMBER like "19B Street"'
    # NUMBER = '19B' / NAME = '19th Ave' or 'Street'
    if [[ $x =~ ^[0-9]+[[:space:]] ]]; then
        echo 'It begins with the STREET-NUMBER like "1 Street"'
        # NUMBER = '1' / NAME = 'Street'
    elif [[ $x =~ ^[0-9]+[[:alpha:]]+[[:space:]].*[[:space:]][0-9]+[[:alpha:]]*$ ]]; then
        # (number)(text)(space)(text)(number(maybe-text))
        echo 'For example 19th Street 19B -> The last number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    elif [[ $x =~ ^[0-9]+[[:alpha:]]*[[:space:]]+[0-9]+[[:alpha:]]+[[:space:]] ]]; then
        # (number(maybe-text))(space)(number)(text)(space)(text)
        echo 'For example 19B 19th Street -> The first number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    else
        echo 'INVALID'
    fi
else
    echo 'INVALID'
fi

Answer 1

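IMHO all you can do is take a best-effort approach, using a series of regexps to match the address formats you know about. For example, using GNU awk for the 3rd argument to match() and for \s as shorthand for [[:space:]], and defining 3 possible regexps: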

$ cat tst.awk
BEGIN { OFS="\",\"" }
{
    name = number = type = ""
    gsub(/"/,"\"\"")
}
match($0,/^([^0-9]+)([0-9]+(-[0-9]+)?[[:alpha:]]?)$/,a) {
    # Example Street 1
    # Teststraße 2
    # Teststr. 1-5
    # Baker Street 221b
    # Test Street, 1
    type   = 1
    name   = a[1]
    number = a[2]
}
!type && match($0,/^([0-9]+[[:alpha:]])\s+([^0-9]+)$/,a) {
    # 221B Baker Street
    type   = 2
    name   = a[2]
    number = a[1]
}
!type && match($0,/^([0-9]+[[:alpha:]]{2}.*)\s+([0-9]+[[:alpha:]]?)$/,a) {
    # 19th Ave 3B
    type   = 3
    name   = a[1]
    number = a[2]
}
{
    gsub(/^\s+|\s+$/,"",name)
    gsub(/^\s+|\s+$/,"",number)
    if ( !doneHdr++ ) {
        print "\"" "type", "name", "number", "$0" "\""
    }
    print "\"" type, name, number, $0 "\""
}

$ awk -f tst.awk file
"type","name","number","$0"
"1","Example Street","1","Example Street 1"
"1","Teststraße","2","Teststraße 2"
"1","Teststr.","1-5","Teststr. 1-5"
"1","Baker Street","221b","Baker Street 221b"
"2","Baker Street","221B","221B Baker Street"
"3","19th Ave","3B","19th Ave 3B"
"","","","3B 2nd Ave"
"","","","1-3 2nd Mount x Ave"
"","","","105 Lock St # 219"
"1","Test Street,","1","Test Street, 1"
"","","","BookAve, 54, Extra Text 123#"

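You can add further regexps to match the other address formats you know about, in an order chosen so that if an address could match 2 or more regexps, the more restrictive one is tried first. You may actually want to modify the above to print a warning whenever an address matches 2 or more regexps, and then adjust, reorder, or combine them.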
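As a rough sketch of that warning idea, assuming a POSIX shell and any awk: the hypothetical check_overlap function below tests each line against the three regexps from tst.awk (rewritten with [ \t] and [A-Za-z] instead of the GNU-only \s and the {2} interval, so only ASCII letter suffixes are recognised) and reports every line matched by more than one pattern. The three regexps above happen to be mutually exclusive, so an invented, deliberately loose pattern 4 ("number first, anything after") is added purely to show an overlap; it is not part of the script above.

```shell
# Hypothetical helper: report lines that match more than one known pattern.
check_overlap() {
    awk '
    {
        n = 0; hits = ""
        # Pattern 1: name first, number last (e.g. "Example Street 1")
        if ($0 ~ /^[^0-9]+[0-9]+(-[0-9]+)?[A-Za-z]?$/)              { n++; hits = hits " 1" }
        # Pattern 2: number plus one letter first (e.g. "221B Baker Street")
        if ($0 ~ /^[0-9]+[A-Za-z][ \t]+[^0-9]+$/)                   { n++; hits = hits " 2" }
        # Pattern 3: ordinal street name, number last (e.g. "19th Ave 3B")
        if ($0 ~ /^[0-9]+[A-Za-z][A-Za-z].*[ \t]+[0-9]+[A-Za-z]?$/) { n++; hits = hits " 3" }
        # Pattern 4 (invented for this sketch): any leading number, then the rest
        if ($0 ~ /^[0-9]+[A-Za-z]?[ \t]+./)                         { n++; hits = hits " 4" }
        if (n > 1) printf "warning: \"%s\" matches patterns%s\n", $0, hits
    }'
}

# Usage: check_overlap < file
```

Run against the sample addresses, only "221B Baker Street" should be flagged (patterns 2 and 4), which is the cue to place the stricter pattern 2 before the looser pattern 4.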

If you reach the print line with type still empty, that is the "INVALID" case, and you can then write/add a new regexp to match those addresses, if appropriate.
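For instance, the four lines left unmatched in the output above could be picked up by two further rules. This is only a sketch under assumptions: the function name split_leftovers, the "|" output separator, and both rules are invented here, and the rules are loose enough that they should run only after the stricter types 1 to 3 have failed. It uses portable awk (match() with RSTART/RLENGTH instead of the GNU 3-argument form).

```shell
# Hypothetical fallback rules for addresses that types 1-3 did not match.
# Prints name|number|extra for each input line (empty fields if still unknown).
split_leftovers() {
    awk '
    BEGIN { OFS = "|" }
    {
        name = number = extra = ""
        if (match($0, /^[0-9]+(-[0-9]+)?[A-Za-z]?[ \t]+/)) {
            # Leading number, optional range and letter: "3B 2nd Ave", "1-3 ..."
            number = substr($0, 1, RLENGTH)
            name   = substr($0, RLENGTH + 1)
        } else if (split($0, f, /[ \t]*,[ \t]*/) == 3 && f[2] ~ /^[0-9]+[A-Za-z]?$/) {
            # Comma-separated "name, number, extra text"
            name = f[1]; number = f[2]; extra = f[3]
        }
        sub(/[ \t]+$/, "", number)   # drop the whitespace captured after the number
        print name, number, extra
    }'
}

# Usage: printf '%s\n' '3B 2nd Ave' 'BookAve, 54, Extra Text 123#' | split_leftovers
```

With these rules "105 Lock St # 219" yields the number 105 and the name "Lock St # 219", as in the desired output; whether "# 219" really belongs to the name is exactly the kind of ambiguity a regexp cannot settle.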

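I do expect you will run into cases where you simply cannot write code to distinguish one address format from another, but hopefully this best-effort approach is good enough for your needs. If you have the city/state/county, you could always curl the address against Google Maps to see whether it is real, as a last-ditch effort for the addresses you cannot identify (doing that for all of your addresses, rather than only the unidentified ones, would take a very long time).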

Once the address-recognition algorithm works, you can generate the output however you like; I just dumped CSV above for ease of development/testing.
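To get the three separate files asked for in the question, one possible final step is sketched below. It assumes the recognition stage emits one name|number|extra record per line (an invented intermediate format, not what tst.awk prints), that none of the fields contains a "|", and that the output directory already exists; write_outputs and the file names names.csv, numbers.csv and extra_text.csv are likewise assumptions.

```shell
# Hypothetical final step: split name|number|extra records into three files.
# $1 is the output directory (assumed to exist).
write_outputs() {
    awk -F '|' -v dir="$1" '
    {
        # One line per address in every file, so line N always refers to
        # the same address; missing fields become empty lines.
        print $1 > (dir "/names.csv")
        print $2 > (dir "/numbers.csv")
        print $3 > (dir "/extra_text.csv")
    }'
}

# Usage: some_recognizer < addresses.csv | write_outputs .
```

Keeping an empty line for a missing field mirrors the sample extra_text.csv in the question, where only the last address has any extra text.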
