I have an addresses.csv containing addresses in different international formats:
Example Street 1
Teststraße 2
Teststr. 1-5
Baker Street 221b
221B Baker Street
19th Ave 3B
3B 2nd Ave
1-3 2nd Mount x Ave
105 Lock St # 219
Test Street, 1
BookAve, 54, Extra Text 123#
For example, in Germany we write Teststraße 2, and in the USA 2 Test Street.
Is there a way to separate/extract all street names and street numbers?
Output-Names.csv
Example Street
Teststraße
Teststr.
Baker Street
Baker Street
19th Ave
2nd Ave
2nd Mount Good Ave
Lock St # 219
Test Street
BookAve
Output-Numbers.csv
1
2
1-5
221b
221B
3B
3B
1-3
105
1
54
Output-extra_text.csv
Extra Text 123#
I'm using macOS 13. The shell is zsh 5.8.1 or bash 3.2.
My idea: you could first sort the addresses into cases, like this:
x='19th Ave 3B'    # the address line

# Patterns are kept in variables so that bash 3.2 and zsh treat them as regexes.
re_num_first='^[0-9]+[[:alpha:]]?[[:space:]]+[^0-9]'
re_num_last='^[0-9]+[[:alpha:]]+[[:space:]].*[[:space:]][0-9]+[[:alpha:]]?$'
re_num_lead='^[0-9]+[[:alpha:]]?[[:space:]]+[0-9]+[[:alpha:]]+[[:space:]]'

if [[ $x =~ ^[[:alpha:]] ]]; then
    echo 'It begins with the STREET-NAME'
elif [[ $x =~ ^[0-9] ]]; then
    # maybe STREET-NAME like "19th Ave 19B" or STREET-NUMBER like "19B Street"
    # NUMBER = '19B' / NAME = '19th Ave' or 'Street'
    if [[ $x =~ $re_num_first ]]; then
        echo 'It begins with the STREET-NUMBER like "1 Street" or "1A Street"'
        # NUMBER = '1' or '1A' / NAME = 'Street'
    elif [[ $x =~ $re_num_last ]]; then
        echo 'For example 19th Street 19B -> The last number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    elif [[ $x =~ $re_num_lead ]]; then
        echo 'For example 19B 19th Street -> The first number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    else
        echo 'INVALID'
    fi
else
    echo 'INVALID'
fi
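Answer 1
IMHO all you can do is take a best-effort approach, using a series of regexps to match the addresses you know about. For example, using GNU awk for the 3rd argument to match() and the \s shorthand for [[:space:]], with 3 possible regexps defined: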
$ cat tst.awk
BEGIN { OFS="\",\"" }
{
    name = number = type = ""
    gsub(/"/,"\"\"")
}
match($0,/^([^0-9]+)([0-9]+(-[0-9]+)?[[:alpha:]]?)$/,a) {
    # Example Street 1
    # Teststraße 2
    # Teststr. 1-5
    # Baker Street 221b
    # Test Street, 1
    type = 1
    name = a[1]
    number = a[2]
}
!type && match($0,/^([0-9]+[[:alpha:]])\s+([^0-9]+)$/,a) {
    # 221B Baker Street
    type = 2
    name = a[2]
    number = a[1]
}
!type && match($0,/^([0-9]+[[:alpha:]]{2}.*)\s+([0-9]+[[:alpha:]]?)$/,a) {
    # 19th Ave 3B
    type = 3
    name = a[1]
    number = a[2]
}
{
    gsub(/^\s+|\s+$/,"",name)
    gsub(/^\s+|\s+$/,"",number)
    if ( !doneHdr++ ) {
        print "\"" "type", "name", "number", "$0" "\""
    }
    print "\"" type, name, number, $0 "\""
}
$ awk -f tst.awk file
"type","name","number","$0"
"1","Example Street","1","Example Street 1"
"1","Teststraße","2","Teststraße 2"
"1","Teststr.","1-5","Teststr. 1-5"
"1","Baker Street","221b","Baker Street 221b"
"2","Baker Street","221B","221B Baker Street"
"3","19th Ave","3B","19th Ave 3B"
"","","","3B 2nd Ave"
"","","","1-3 2nd Mount x Ave"
"","","","105 Lock St # 219"
"1","Test Street,","1","Test Street, 1"
"","","","BookAve, 54, Extra Text 123#"
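You can add other regexps to match the address formats you know about, in an appropriate order, so that if an address could match 2 or more regexps you get the more restrictive one first. You might actually want to modify the above to print a warning when an address matches 2 or more regexps, so that you can then tweak, reorder, or merge them.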
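The ambiguity warning could be sketched roughly as follows. This is a minimal bash illustration, not the awk script above: the function names `matching_types` and `report` are made up for this example, and the fourth, deliberately loose pattern is a hypothetical addition included only so that an overlap actually occurs (the three regexps from tst.awk are mutually exclusive as written). It uses bash's `[[ =~ ]]` operator, so it should also run on bash 3.2.

```shell
#!/usr/bin/env bash
# Patterns 1-3 mirror the regexps in tst.awk; pattern 4 is a hypothetical,
# deliberately loose "number first, then anything" pattern.
patterns=(
    '^[^0-9]+[0-9]+(-[0-9]+)?[[:alpha:]]?$'
    '^[0-9]+[[:alpha:]][[:space:]]+[^0-9]+$'
    '^[0-9]+[[:alpha:]]{2}.*[[:space:]]+[0-9]+[[:alpha:]]?$'
    '^[0-9]+[[:alpha:]]?[[:space:]]+.+$'
)

# Print the 1-based numbers of every pattern that matches the address in $1,
# separated by spaces (an empty line if none match).
matching_types() {
    local addr=$1 i types=''
    for i in "${!patterns[@]}"; do
        if [[ $addr =~ ${patterns[i]} ]]; then
            types+="${types:+ }$((i + 1))"
        fi
    done
    printf '%s\n' "$types"
}

# Report addresses that match no pattern, and warn on stderr about
# addresses that match more than one.
report() {
    local addr=$1 types
    types=$(matching_types "$addr")
    if [[ -z $types ]]; then
        echo "INVALID: $addr"
    elif [[ $types == *' '* ]]; then
        echo "WARNING: '$addr' matches types $types" >&2
    fi
}

for addr in 'Example Street 1' '221B Baker Street' 'BookAve, 54, Extra Text 123#'; do
    report "$addr"
done
```

Any address that triggers the warning is a hint that two patterns need to be reordered, tightened, or merged.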
If you reach the print line with type still empty, that is the "INVALID" case, and you can then write/add a new regexp to match those addresses, if appropriate.
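As an illustration of adding such a regexp, here is a sketch for two of the rows left unmatched above, `3B 2nd Ave` and `1-3 2nd Mount x Ave`. It assumes the rule "house number first, followed by a name that itself starts with an ordinal such as 2nd or 19th"; that rule, the variable name `re_number_then_ordinal`, and the function name `split_number_first` are assumptions made for this example. It is written in bash with `BASH_REMATCH` rather than GNU awk.

```shell
#!/usr/bin/env bash
# Hypothetical extra pattern: a house number (optionally a range and/or a
# trailing letter), whitespace, then a name starting with digits plus two letters.
re_number_then_ordinal='^([0-9]+(-[0-9]+)?[[:alpha:]]?)[[:space:]]+([0-9]+[[:alpha:]]{2}.*)$'

# Print "name|number" for a matching address; return 1 if it does not match.
split_number_first() {
    if [[ $1 =~ $re_number_then_ordinal ]]; then
        # Group 1 is the number, group 3 the name (group 2 is the optional range).
        printf '%s|%s\n' "${BASH_REMATCH[3]}" "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

split_number_first '3B 2nd Ave'
split_number_first '1-3 2nd Mount x Ave'
split_number_first '19th Ave 3B' || echo 'no match: 19th Ave 3B'
```

The last call shows that the new pattern does not also match the existing type 3 format, which is the kind of overlap check worth doing for every regexp you add.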
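I do expect you will come across cases where you simply cannot write code to distinguish one address format from another, but hopefully this best-effort approach will be good enough for your needs. If you have the city/state/county, you could always curl the address against Google Maps to see whether it is real, as a last-ditch effort for the addresses you cannot identify (but it would take a long time if you tried that for all of your addresses).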
Once the address-recognition algorithm is working, generate the output however you like; I just dumped CSV above for ease of development/testing.
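For instance, writing names and numbers to separate files, as the question asks, might look like the sketch below. It is a simplified bash illustration under several assumptions: the output file names (`names.csv`, `numbers.csv`, `unmatched.csv`), the function name `split_addresses`, and the single "name first, number last" pattern are all hypothetical, and fields are written unquoted, so names containing commas or quotes would need proper CSV escaping.

```shell
#!/usr/bin/env bash
# Hypothetical pattern: street name, optional comma, whitespace, then the number.
re_name_first='^([^0-9]+[^0-9,[:space:]]),?[[:space:]]+([0-9]+(-[0-9]+)?[[:alpha:]]?)$'

# Read addresses from file $1 and write three files into directory $2.
split_addresses() {
    local input=$1 outdir=$2 line
    : > "$outdir/names.csv"
    : > "$outdir/numbers.csv"
    : > "$outdir/unmatched.csv"
    # The "|| [[ -n $line ]]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $re_name_first ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}" >> "$outdir/names.csv"
            printf '%s\n' "${BASH_REMATCH[2]}" >> "$outdir/numbers.csv"
        else
            printf '%s\n' "$line" >> "$outdir/unmatched.csv"
        fi
    done < "$input"
}

# Usage example in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' 'Example Street 1' 'Teststr. 1-5' 'Test Street, 1' '3B 2nd Ave' > "$tmp/addresses.csv"
split_addresses "$tmp/addresses.csv" "$tmp"
paste -d'|' "$tmp/names.csv" "$tmp/numbers.csv"
```

Because each recognised address appends exactly one line to each of the two files, line N of names.csv and line N of numbers.csv belong to the same address, while everything unrecognised is collected in unmatched.csv for later review.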