Extract words that start with a consonant and end with a vowel from a text

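I have to write a Linux shell program that converts a text file into another text file containing only the words that start with a consonant and end with a vowel, removing numbers and punctuation as well.

vowels = aoeui, consonants = bcdfghjklmnpqrstvwxyz

That is, keep the format of the original text and remove only the words that do not qualify (those that start with a vowel or end with a consonant), the numbers and the punctuation.

I tried grep and sed, but I could not get anywhere.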

Answer 1

POSIXly:

consonants=BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz
vowels=AEIOUaeiou

< file tr -cs '[:alpha:]' '[\n*]' |
  grep -x "[$consonants].*[$vowels]"

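would report all sequences of alphabetic characters (classified as such in your locale) that start with one of the English consonants and end with one of the English vowels.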

< file tr -cs '[:alpha:]' '[\n*]' |
  grep -x "[$consonants][$consonants$vowels]*[$vowels]"

would restrict the matches to sequences that contain only English letters (it would not match Stéphane, because é is not one of the allowed letters).

< file tr -cs "$consonants$vowels" '[\n*]' |
  grep -x "[$consonants].*[$vowels]"

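would ignore any character that is not one of those English letters (so it would find peri and dico inside periódico).

(Note that some tr implementations, GNU tr for example, do not support multi-byte characters, so they would choke on those ó/é characters anyway.)

As an example, with: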

FooBar Fee123 foo-bar periódico

as input, in a typical en_US.UTF-8 locale on a FreeBSD system (a system with a POSIX tr), the 3 solutions give:

1            2           3

Fee          Fee         Fee
foo          foo         foo
periódico                peri
                         dico
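
For plain ASCII input, the second variant can be wrapped in a small shell function so it is easy to try on a sample line. This is only a sketch: the function name extract_words is made up for illustration and is not part of the answer, and the input is kept ASCII-only so that tr implementations without multi-byte support behave the same way.

```shell
#!/bin/sh
consonants=BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz
vowels=AEIOUaeiou

# extract_words is a hypothetical helper name, not part of the answer.
# It reads text on stdin and prints one matching word per line.
extract_words() {
  # Squeeze every run of non-alphabetic characters into a single newline...
  tr -cs '[:alpha:]' '[\n*]' |
    # ...then keep the lines made only of English letters,
    # with a consonant first and a vowel last.
    grep -x "[$consonants][$consonants$vowels]*[$vowels]"
}

printf 'FooBar Fee123 foo-bar\n' | extract_words
```

On that sample line it prints Fee and foo, one per line, which agrees with column 2 of the table above.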

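Note that while none of them would match Blé when é is entered as the U+00E9 character, all of them would find Ble when é is written as e followed by U+0301, the combining acute accent (which is not an alphabetic character), and the first one would not match Stéphane as a single word when it is written with the combining acute accent.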

To work around that, you could use perl instead of tr in the first approach, so that combining marks are preserved before filtering with grep:

< file perl -Mopen=locale -pe 's/[^\pL\pM]+/\n/g' |
  grep -x "[$consonants].*[$vowels]"

Or do the whole thing in perl:

< file perl -Mopen=locale -lne 'print for
  grep /^[bcdfghj-np-tv-z].*[aeiou]$/i, /[\pL\pM]+/g'

Answer 2

With GNU grep:

grep -io '\<[bcdfghjklmnpqrstvwxyz][a-z]*[aeiou]\>'

Answer 3

With grep:

grep -oiw '[bcdfghjklmnpqrstvwxyz][a-z]*[aeiou]'

The first bracket expression matches a consonant, the second matches any letter a-z, and the last matches a vowel.
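
One behaviour worth checking before relying on -w (or on \< and \>): GNU grep treats letters, digits and the underscore as word constituents, so a run of letters that touches a digit is not a word of its own. The sketch below reuses the pattern from this answer on a sample line chosen for illustration.

```shell
#!/bin/sh
pattern='[bcdfghjklmnpqrstvwxyz][a-z]*[aeiou]'

# Only foo has a non-word character (or a line boundary) on both sides.
# The Fee in Fee123 is followed by a digit, so -w rejects it.
printf 'FooBar Fee123 foo-bar\n' | grep -oiw "$pattern"
```

This prints only foo, whereas a tr-based split on non-letters would also report Fee. Which of the two is wanted depends on how "removing numbers" in the question is read.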

Answer 4

To filter out the required words while preserving the format of the initial text, an awk solution:

Sample textfile contents:

Any delicate you how kindness horrible outlived servants. You high bed wish help call draw side. Girl quit if case mr sing as no have. At none neat am do over will. Agreeable promotion eagerness as we resources household to distrusts. Polite do object at passed it is. Small for ask shade water manor think men begin. 

He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence. Him hard find read are you sang. Parlors visited noisier how explain pleased his see suppose. Do ashamed assured on related offence at equally totally. Use mile her whom they its. Kept hold an want as he bred of. Was dashwood landlord cheerful husbands two. Estate why theirs indeed him polite old settle though she. In as at regard easily narrow roused adieus. 

So delightful up dissimilar by unreserved it connection frequently. Do an high room so in paid. Up on cousin ye dinner should in. Sex stood tried walls manor truth shy and three his. Their to years so child truth. Honoured peculiar families sensible up likewise by on in. 

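The job (IGNORECASE is specific to GNU awk):

awk -v IGNORECASE=1 '{ 
       # print each whitespace-separated field that starts with a consonant and ends with a vowel
       for(i=1;i<=NF;i++) 
           if ($i~/^[bcdfghjklmnpqrstvwxyz][a-z]*[aoeui]$/) 
               printf "%s ",$i; print "" 
       }' textfile > newfile

The newfile contents:

delicate you horrible You case no none do we to Polite do shade 

He desire you see Do mile he polite settle 

So Do so ye three to so sensible likewise 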
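
The awk command above depends on IGNORECASE, which is a GNU awk extension, and it drops a word entirely when punctuation is attached to it (side. is not kept). A possible variant is sketched below under two assumptions that are not in the answer: that "removing punctuation" means stripping non-letters from each field before testing it, and that a portable tolower() comparison is acceptable in place of IGNORECASE. The function name filter_words is hypothetical.

```shell
#!/bin/sh
# filter_words is a hypothetical helper name. It reads text on stdin.
filter_words() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      w = $i
      gsub(/[^A-Za-z]/, "", w)   # assumption: strip digits and punctuation from the field
      # tolower() stands in for the gawk-only IGNORECASE variable
      if (tolower(w) ~ /^[bcdfghjklmnpqrstvwxyz][a-z]*[aeiou]$/)
        out = out (out == "" ? "" : " ") w
    }
    print out                    # one output line per input line keeps the layout
  }'
}

printf 'You high bed wish help call draw side.\nGirl quit if case mr sing as no have.\n' | filter_words
```

For those two sample lines it prints "You side" and "case no have". Note that stripping non-letters also joins hyphenated words, so foo-bar is tested as foobar; splitting on punctuation instead would be a different design choice.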

----------

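To filter each matching word onto a separate line, a grep solution:

grep -woi '[bcdfghjklmnpqrstvwxyz][a-z]*[aoeui]' oldfile > newfile
  • -w (--word-regexp) - the test is that the matching substring must either be at the beginning of the line or be preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character.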
