How to read each word in a file and replace it with a substitute word from another file (if found)

Question 1

Using any awk in any shell on every Unix box:

$ cat tst.awk
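BEGIN { FS="=" }                 # map-file lines look like old=new
NR==FNR {                        # first file: load the old -> new map
    map[$1] = $2
    next
}
{
    head = ""                    # part of the line already processed
    tail = $0                    # part of the line still to scan
    while ( match(tail,/[^,= ]+/) ) {              # next run of non-delimiter chars
        old = substr(tail,RSTART,RLENGTH)
        new = (old in map ? map[old] : old)        # replace only on an exact match
        head = head substr(tail,1,RSTART-1) new    # keep delimiters as they are
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}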

$ awk -f tst.awk Search_Replace_File.txt fileA.txt
1, This is a Record Ten, Value1, Dummy_val1 Ten, SUN
2, This is a Record Twenty, Value2, Dummy_val2 Twenty, SNOW
3, This is a Record Thirty, Value3, Dummy_val3 Thirty, SNOW
4, This is a Record Forty, Value4, Dummy_val4 Forty, SUN

My assumption above is that none of your input words contain a comma, an equals sign, or a space; any other characters are fine.
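To make that assumption concrete, here is a minimal sketch that prints the tokens the match() loop above would treat as words. The input line is made up for illustration; only the bracket expression is taken from the script.

```shell
# Show which tokens the regexp /[^,= ]+/ treats as "words".
# The input line is a hypothetical sample, not taken from the original files.
tokens=$(printf '%s\n' '4, Record Four, $$MOON$$ #LATER' | awk '
{
    tail = $0
    while ( match(tail,/[^,= ]+/) ) {
        printf "[%s]", substr(tail,RSTART,RLENGTH)   # one bracketed token per match
        tail = substr(tail,RSTART+RLENGTH)           # continue after the token
    }
    print ""
}')
echo "$tokens"   # prints: [4][Record][Four][$$MOON$$][#LATER]
```

Punctuation such as $ and # stays inside the token, which is why keys like $$MOON$$ can be looked up as a whole.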

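Also, if an old word maps to a new word and that new word could itself be mapped to another new word, the code above will not do that, since it could lead to infinite recursion; only the first mapping is applied.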
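A small sketch of that behaviour, assuming a hypothetical two-line map in which the replacement B is itself a key. The temporary file names are invented for the example; the awk body is the same logic as tst.awk.

```shell
# Hypothetical map: "A" maps to "B", and "B" is itself a key that maps to "C".
tmp=$(mktemp -d)
printf 'A=B\nB=C\n' > "$tmp/map.txt"
printf 'A, B\n'     > "$tmp/input.txt"

out=$(awk '
BEGIN { FS="=" }
NR==FNR { map[$1] = $2; next }
{
    head = ""
    tail = $0
    while ( match(tail,/[^,= ]+/) ) {
        old  = substr(tail,RSTART,RLENGTH)
        new  = (old in map ? map[old] : old)    # each word is looked up exactly once
        head = head substr(tail,1,RSTART-1) new
        tail = substr(tail,RSTART+RLENGTH)      # scanning resumes after the OLD word
    }
    print head tail
}' "$tmp/map.txt" "$tmp/input.txt")

echo "$out"    # prints: B, C
rm -rf "$tmp"
```

The replacement text goes into head and is never rescanned, so the A in the input becomes B and stops there rather than continuing on to C.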

Question 2

We can do this with awk, as follows:

awk '
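BEGIN {
  d = "[$]{2}"                      # a literal "$$"
  w = "[[:alpha:]][_[:alnum:]]*"    # a letter, then letters, digits or "_"
  re = d w d "|" "[#]?" w           # $$word$$, or word with an optional leading "#"
}
FS == "="{a[$1]=$2;next}            # map file (read with FS="="): store old -> new
{
  z = ""
  t = $0
  gsub(re, RS "&" RS, t)            # wrap every word in newlines
  nf = split(t, x, RS)              # odd pieces: separators, even pieces: words
  for (i=1; i<=nf; i++)
    z = z ((i%2) ? x[i] : ((x[i] in a) ? a[x[i]] : x[i]))
  print z
}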
' FS="=" Search_Replace_File.txt FS=" " fileA.txt
1, This is a  Record Ten, Value1, Dummy_val1 Ten, SUN
2, This is a Record Twenty, Value2, Dummy_val2 Twenty, SNOW
3, This is a Record Thirty, Value3, Dummy_val3 Thirty, SNOW
4, This is a Record Forty, Value4, Dummy_val4 Forty, SUN

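Define the regular expression for a word.
Delimit each word in the current line with newlines (RS).
Then split the current line on newlines.
All words are then the even-numbered fields.
Check whether each word is found in array a and, if so, replace it.
Print the modified line.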
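The steps above can be sketched on a single made-up line, printing each piece with its index so the odd/even pattern is visible. The regular expression is the one from the answer; the input string and the output format are assumptions for illustration.

```shell
# Print every piece produced by the gsub/split trick together with its index.
pieces=$(printf '%s\n' 'Two, #LATER' | awk '
BEGIN {
    d  = "[$]{2}"
    w  = "[[:alpha:]][_[:alnum:]]*"
    re = d w d "|" "[#]?" w
}
{
    t = $0
    gsub(re, RS "&" RS, t)       # surround each word with newlines
    nf = split(t, x, RS)         # split on those newlines
    for (i=1; i<=nf; i++)
        printf "%d:[%s]%s", i, x[i], (i<nf ? " " : "\n")
}')
echo "$pieces"   # prints: 1:[] 2:[Two] 3:[, ] 4:[#LATER] 5:[]
```

Because every word is wrapped in a separator on both sides, the pieces alternate between non-word text (odd indexes, possibly empty) and words (even indexes), so only the even pieces need to be looked up in the map.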

Question 3

Using Raku (formerly known as Perl_6)

~$ raku -pe 'BEGIN my %h = (          \ 
               "One" => "Ten",        \ 
               "Two" => "Twenty",     \
               "Three" => "Thirty",   \
               "Four" => "Forty",     \
               q[$$MOON$$] => "SUN",  \
               q[#LATER] => "SNOW");  \ 
             s:g/ [ ^ | <punct>+ | <blank>+] <( @(%h.keys) )> [ <punct>+ | <blank>+ | $ ] /%h{$/}/;'  file

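This is an answer written in Raku (a programming language in the Perl family). Above, the sed-like autoprinting command-line flags -pe are used. The hash %h is declared inline. Note that $ must be escaped, but "\$\$MOON\$\$" can be written as q[$$MOON$$] as above, reducing the need for backslashes.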

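The heart of the substitution is s///, which uses the :g global modifier. Within the matching domain (left half), @(%h.keys) coerces the hash keys into an @-sigiled array, and these are understood as literal strings within the matching domain. In the replacement domain (right half), the $/ match variable is used to look up the value of the corresponding key, and that value is substituted in.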

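The issue here is that a "word" is usually defined as alphanumerics plus _ (underscore). In that case you would use Raku's << (left) and >> (right) zero-width regex anchors, since they denote the left and right word boundaries, respectively. Without these boundary markers, something like Fourteen would be incorrectly replaced with Fortyteen. (See the last line of the sample input file below: the sample output shows the correct result.)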
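The Fourteen/Fortyteen pitfall is not specific to Raku. As a rough illustration in awk (used here only to stay consistent with the earlier answers, not as a translation of the Raku code), the sketch below contrasts an unanchored substring replacement with a whole-token comparison. The input line is made up.

```shell
line='Record Four, Fourteen'

# Unanchored replacement: "Four" is also rewritten inside "Fourteen".
naive=$(printf '%s\n' "$line" | awk '{ gsub(/Four/, "Forty"); print }')

# Whole-token comparison: a token is replaced only if it equals "Four" exactly.
whole=$(printf '%s\n' "$line" | awk '
{
    head = ""; tail = $0
    while ( match(tail, /[^, ]+/) ) {
        old  = substr(tail, RSTART, RLENGTH)
        head = head substr(tail, 1, RSTART-1) (old == "Four" ? "Forty" : old)
        tail = substr(tail, RSTART+RLENGTH)
    }
    print head tail
}')

echo "$naive"   # prints: Record Forty, Fortyteen
echo "$whole"   # prints: Record Forty, Fourteen
```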

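Since the OP has requested a solution that works with keys beginning/ending with characters other than alphanumerics and _ (which rules out the zero-width word-boundary anchors), one approach is to try to spell out the possibilities, as follows: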

s:g/ [ ^ | <punct>+ | <blank>+] <( @(%h.keys) )> [ <punct>+ | <blank>+ | $ ] /%h{$/}/;

Sample Input:

1, This is a Record One, Value1, Dummy_val1 One, $$MOON$$
2, This is a Record Two, Value2, Dummy_val2 Two, #LATER
3, This is a Record Three, Value3, Dummy_val3 Three, #LATER
4, This is a Record Four, Value4, Dummy_val4 Four, $$MOON$$
5, This is a Record Fourteen, Value14, Dummy_val14 Fourteen, #LATER

Sample Output:

1, This is a Record Ten, Value1, Dummy_val1 Ten, SUN
2, This is a Record Twenty, Value2, Dummy_val2 Twenty, SNOW
3, This is a Record Thirty, Value3, Dummy_val3 Thirty, SNOW
4, This is a Record Forty, Value4, Dummy_val4 Forty, SUN
5, This is a Record Fourteen, Value14, Dummy_val14 Fourteen, SNOW

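Perhaps a better (more robust) approach is to choose the non-word keys more carefully, for example by making sure they both begin and end with a non-word character (e.g. #LATER# instead of #LATER). Then use two hashes, as follows: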

~$ raku -pe 'BEGIN    my %words = ("One" => "Ten", "Two" => "Twenty", "Three" => "Thirty", "Four" => "Forty")  \
             andthen  my %non-words = (q[$$MOON$$] => "SUN", q[#LATER#] => "SNOW");  \
             s:g/ << @(%words.keys) >> /%words{$/}/;  \
             s:g/ [ ^ | <punct>+ | <blank>+] <( @(%non-words.keys) )> [ <punct>+ | <blank>+ | $ ] /%non-words{$/}/;'  file

This code takes the same sample input file (after updating #LATER to #LATER#) and produces the same sample output as above.

https://docs.raku.org/language/regexes#Regex_interpolation
https://docs.raku.org/language/regexes
https://docs.raku.org
https://raku.org

Answer