Count the number of words starting with each letter in a file

Answer 1

One way (edited to avoid counting the same word twice):

$ echo "my nice name is Mike Meller" | tr ' ' '\n' | sort -f | uniq -i | sed -nr 's/^([a-z]).*/\U\1/Ip' | uniq -c | sort -r
  3 M
  2 N
  1 I

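tr ' ' '\n' changes spaces to newlines
sort -f sorts the lines so that identical entries end up next to each other, even if their case differs
uniq -i removes duplicate words, ignoring case
sed -nr 's/^([a-z]).*/\U\1/Ip' deletes everything except the first letter, converts it to upper case, and does not print the line if it does not start with a letter (\U and the I flag are GNU sed extensions)
uniq -c counts identical adjacent lines
sort -r sorts in descending order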

(Replace echo "my nice name is Mike Meller" with cat name-of-your-file to process a file.)
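
The same steps can be sketched as a reusable function that avoids the GNU-only sed features. This is a minimal sketch, not the author's command: the name count_initials is hypothetical, the input is assumed to be ASCII text, and LC_ALL=C is set so that sorting and the [A-Za-z] range behave predictably.

```shell
# Hypothetical helper: count distinct words by their first letter.
# Runs in a subshell so the LC_ALL setting does not leak to the caller.
count_initials() (
  export LC_ALL=C
  tr -s '[:space:]' '\n' < "$1" |        # one word per line
    sort -f | uniq -i |                  # drop duplicate words, ignoring case
    sed -nE 's/^([A-Za-z]).*/\1/p' |     # keep the first letter, skip non-letters
    tr '[:lower:]' '[:upper:]' |         # fold to upper case
    uniq -c |                            # count identical adjacent letters
    sort -rn                             # highest count first
)
```

Calling count_initials name-of-your-file prints one count and letter per line. Using sort -rn rather than sort -r compares the counts numerically.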

Answer 2

And with perl:

perl -Mopen=locale -lne '
  $c{uc $_}++ for /\b\p{Alpha}/g;
  END{for (sort {$c{$b} <=> $c{$a}} keys %c) {print "$c{$_} $_"}}'

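Note that this does not work if some of the letters are in their decomposed form. For example, if É is entered as E followed by U+0301 (combining acute accent) instead of the precomposed É (U+00C9), it is counted as E, not É.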
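
To see why the two forms are counted differently, it can help to look at the bytes. The following is a small sketch that assumes UTF-8 encoding; hex_bytes is a hypothetical helper, and the octal escapes are used so that the example does not depend on the shell's \u support or on the current locale.

```shell
# Hypothetical helper: print the bytes of a string as hex, without separators.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

precomposed=$(printf '\303\211')   # U+00C9 encoded as UTF-8
decomposed=$(printf 'E\314\201')   # U+0045 followed by U+0301, as UTF-8

hex_bytes "$precomposed"; echo     # prints c389
hex_bytes "$decomposed"; echo      # prints 45cc81
```

The decomposed form starts with a plain E (0x45), which is what a per-character match on the first letter sees.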

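If that is a concern, the best approach is probably to decompose the text first (since some graphemes have no precomposed form) and to work on grapheme clusters. In any case, there are things like the ﬁ ligature that you may want to decompose: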

Compare:

$ printf 'my ﬁne name is \uc9ric, maybe E\u301ric, certainly not Eric\n' |
  perl -Mopen=locale -lne '
    $c{uc $_}++ for /\b\p{Alpha}/g;
    END{for (sort {$c{$b} <=> $c{$a}} keys %c) {print "$c{$_} $_"}}'
2 E
2 N
2 M
1 C
1 FI
1 É
1 I

with:

$ printf 'my ﬁne name is \uc9ric, maybe E\u301ric, certainly not Eric\n' |
  perl -Mopen=locale -MUnicode::Normalize -lne '
    $c{uc $_}++ for NFKD($_) =~ /\b(?=\p{Alpha})\X/g;
    END{for (sort {$c{$b} <=> $c{$a}} keys %c) {print "$c{$_} $_"}}'
2 É
2 M
2 N
1 E
1 I
1 C
1 F

Answer 3

With GNU awk:

gawk '
  { for (i=1; i<=NF; i++) count[toupper(substr($i,1,1))]++ } 
  END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (key in count) print count[key], key
  }
' file
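
PROCINFO["sorted_in"] is specific to GNU awk. For other awk implementations, a rough equivalent is to let sort do the ordering. This is only a sketch under that assumption; the function name count_initials_awk is hypothetical, and ties are broken alphabetically, which the gawk version does not promise.

```shell
# Hypothetical helper: same counting logic, ordering delegated to sort.
count_initials_awk() {
  awk '
    # Count the upper-cased first character of every field.
    { for (i = 1; i <= NF; i++) count[toupper(substr($i, 1, 1))]++ }
    END { for (key in count) print count[key], key }
  ' "$1" | sort -k1,1nr -k2,2   # by count descending, then by letter
}
```

Like the gawk command, this counts every occurrence of a word, including repeats, and also counts fields that start with a non-letter.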

Answer 4

I hope this isn't homework ;-) The tricky part is that you don't want to count the "L" in Meller twice, right? Hence the uniqueness check.

$ cat t
my nice name is Mike Meller

Then the command pipeline that does the conversion:

$ tr '[a-z]' '[A-Z]' < t |    # Convert all to upper case
fold -b -w 1 |                # Break into one letter per line
awk -f t.awk |                # Pipe the whole mess to awk to count
sort -r -n                    # Sort in reverse numeric order

The awk script is best split out into a separate file, although you could put it all into a single bash one-liner:

$ cat t.awk
/ / {                         # Match spaces,
  for (c in wc) {dc[c]+=1}    #  Accumulate word count (wc) into doc count (dc)
  split("",wc)                #  Reset the word count
}

!/ / {                        # Match non-spaces,
  if (wc[$1] == "") wc[$1]=1  #  If haven't already seen char in this word, mark it as seen
}

# Finally, output the count and the letter
END {
  for (c in wc) {dc[c]+=1}    # Accumulate one last time, in case there is no trailing space
  for (c in dc) {print dc[c], c}   # Count first, so that sort -r -n can order by it
}

It produces (for me) this output:

$ tr '[a-z]' '[A-Z]' < t | fold -b -w 1 | awk -f t.awk  | sort -r -n
4 M
4 E
3 I
2 N
1 Y
1 S
1 R
1 L
1 K
1 C
1 A
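
The same per-word letter count can also be sketched in a single awk program, without fold or a separate script file. This is an illustrative variant rather than the pipeline above: the name letters_per_word is hypothetical, it assumes ASCII input, and unlike the original it ignores characters that are not letters.

```shell
# Hypothetical helper: for each letter, count the words that contain it.
letters_per_word() {
  LC_ALL=C awk '
    {
      for (i = 1; i <= NF; i++) {
        split("", seen)                  # reset the per-word set
        word = toupper($i)
        n = length(word)
        for (j = 1; j <= n; j++) {
          ch = substr(word, j, 1)
          # Count each letter at most once per word.
          if (ch ~ /^[A-Z]$/ && !(ch in seen)) {
            seen[ch] = 1
            count[ch]++
          }
        }
      }
    }
    END { for (ch in count) print count[ch], ch }
  ' "$1" | LC_ALL=C sort -k1,1nr -k2,2r  # count descending, then letter descending
}
```

Because it works on awk fields, words are split on any whitespace, including line breaks, so a word at the end of one line is not merged with the first word of the next.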

