Joining text lines that share the same beginning

Answer 1

perl -p0E 'while(s/^((.+?)\t.*)\n\2\t/$1<br>/gm){}'

(On my 6-year-old laptop, this takes 2 seconds for a 23 MB, 1.5M-line dictionary.)
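
To make the one-liner easier to follow, here is a rough Python sketch of the same idea. It is an illustration rather than a drop-in replacement: it assumes the whole text fits in memory (as with Perl's `-0`), and the name `merge_repeated_keys` is made up for this example. The pattern matches a line of the form key, tab, rest, followed by a newline and the same key and tab, and replaces that newline-plus-key with the joiner.

```python
import re

# Same pattern as the Perl one-liner: a line "key<TAB>rest", a newline,
# then the same key and a tab at the start of the next line.
PATTERN = re.compile(r'^((.+?)\t.*)\n\2\t', re.MULTILINE)

def merge_repeated_keys(text, joiner='<br>'):
    """Hypothetical helper: join consecutive lines that share a first field."""
    while True:
        # One pass only merges non-overlapping pairs of lines,
        # so repeat until nothing changes.
        text, count = PATTERN.subn(lambda m: m.group(1) + joiner, text)
        if count == 0:
            return text

if __name__ == '__main__':
    sample = 'cat\tfeline\ncat\tpet\ndog\tcanine\n'
    print(merge_repeated_keys(sample), end='')
```

The loop mirrors the `while(s/.../.../gm){}` construct: a group of three or more lines with the same key needs more than one pass before it collapses into a single line.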

Answer 2

This is a standard job for awk:

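awk '
{
  # Rebuild the definition from fields 2..NF (runs of whitespace collapse to one space)
  k=$2
  for (i=3;i<=NF;i++)
    k=k " " $i
  # Test with "in" so that a definition of "0" or "" still counts as already seen
  if (!($1 in a))
    a[$1]=k
  else
    a[$1]=a[$1] "<br>" k
}
END{
  # Note: "for (i in a)" visits the keys in an unspecified order
  for (i in a)
    print i "\t" a[i]
}' long.text.file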
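
For comparison, the same collect-everything-then-print approach can be sketched in Python. This is only an illustration under a few assumptions: the helper name `group_lines` is invented here, blank lines are skipped, and lines are split on runs of whitespace much like awk's default field splitting. Unlike awk's `for (i in a)`, a Python dict keeps keys in first-seen order.

```python
def group_lines(lines, joiner='<br>'):
    """Hypothetical helper: collect definitions per first word, in first-seen order."""
    groups = {}
    for line in lines:
        fields = line.split()  # split on runs of whitespace
        if not fields:
            continue  # skip blank lines (an assumption of this sketch)
        word, definition = fields[0], ' '.join(fields[1:])
        groups.setdefault(word, []).append(definition)
    # One output line per word: word, tab, definitions joined by the joiner
    return [word + '\t' + joiner.join(defs) for word, defs in groups.items()]

if __name__ == '__main__':
    for out in group_lines(['b two', 'a one', 'b three four']):
        print(out)
```

Because every line is held in memory until the end, this works on unsorted input, at the cost of memory proportional to the file size.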

If the file is sorted by the first word of each line, the script can be simpler:

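awk '
{
  # Same word as the previous line: continue the current output line.
  # Otherwise finish the previous output line (if any) and start a new one.
  if($1==k)
    printf("%s","<br>")
  else {
    if(NR!=1)
      print ""
    printf("%s\t",$1)
  }
  # Print fields 2..NF separated by single spaces
  for(i=2;i<NF;i++)
    printf("%s ",$i)
  # Guard against one-field lines, where $NF would repeat the word itself
  if(NF>1)
    printf("%s",$NF)
  k=$1
}
END{
  # Terminate the last output line
  print ""
}' long.text.file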
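
When the input is already sorted, only adjacent lines need to be compared, so nothing has to be stored. A minimal Python sketch of that streaming variant, assuming whitespace-separated fields and skipping blank lines, could use `itertools.groupby`; the generator name `join_sorted` is made up for this example.

```python
from itertools import groupby

def join_sorted(lines, joiner='<br>'):
    """Hypothetical helper: merge adjacent lines that share the same first word."""
    rows = (line.split() for line in lines)
    rows = (fields for fields in rows if fields)  # drop blank lines
    # groupby only groups neighbours, which is why sorted input is required
    for word, group in groupby(rows, key=lambda fields: fields[0]):
        yield word + '\t' + joiner.join(' '.join(fields[1:]) for fields in group)

if __name__ == '__main__':
    for out in join_sorted(['a one', 'a two words', 'b three']):
        print(out)
```

On unsorted input this does not fail, but a word that reappears later starts a new output line instead of being merged into the earlier one.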

Or in bash:

unset n last
while read -r word definition
do
    if [ "$last" = "$word" ]
    then
        printf "<br>%s" "$definition"
    else
        if [ "$n" ]
        then
            echo
        else
            n=1
        fi
        printf "%s\t%s" "$word" "$definition"
        last="$word"
    fi
done < long.text.file
echo

Answer 3

This is indeed a standard job for awk. Here is a concise solution that does not alter the data it operates on:

awk 'BEGIN { FS="\t" }
     $1!=key { if (NR>1) print out ; key=$1 ; out=$0 ; next }
     { out=out"<br>"$2 }
     END { if (NR>0) print out }'

Answer 4

In Python:

import sys

def join(file_name, join_text):
    prefix = None
    current_line = ''
    with open(file_name) as lines:
        for line in lines:
            line = line.rstrip('\n')
            try:
                first_word, rest = line.split('\t', 1)
            except ValueError:
                first_word = None  # empty line or one without tab
                rest = line
            # Merge only when the line has a key and it matches the previous key
            if first_word is not None and first_word == prefix:
                current_line += join_text + rest
            else:
                if current_line:
                    print(current_line)
                current_line = line
                prefix = first_word

    if current_line:  # do the last line(s)
        print(current_line)


join(sys.argv[2], sys.argv[1])

This takes the separator (<br>) as the program's first argument and the file name as the second.
