Add tags around the sentence on each line

So, basically, I have lines like this:

TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” 

I want them to look like this:

TEXT1910\text0001 <s> “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>

I tried to get this working with the following command:

cat text.ign | sed -e 's/\(.*\) \(.*\)/ <s> \1 <\/s>\2/' | less

But this produces:

<s> TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>

Answer 1

If I have interpreted your goal correctly, try:

sed 's| | <s> |; s|$|</s>|'

For example, starting with your file:

$ cat text.ign 
TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” 

And running our command:

$ sed 's| | <s> |; s|$|</s>|' text.ign
TEXT1910\text0001 <s> “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>

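How it works:

  • s| | <s> | replaces the first space on the line with <s> surrounded by spaces.

    Sed allows any character to be used as the delimiter in the substitute command. Here we use | instead of the traditional /.

  • s|$|</s>| appends </s> to the end of the line.

    Because we use | as the delimiter, there is no need to escape the forward slash in </s>.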
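The delimiter point can be checked directly. The following is a minimal sketch, assuming a POSIX shell and the sample line from the question (including its trailing space); the variable names are only illustrative. It runs the same two substitutions once with | and once with /, where the slash in </s> has to be escaped.

```shell
# Sample line from the question; single quotes keep the backslash literal.
line='TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” '

# Same two substitutions, first with | as the delimiter, then with /.
with_pipe=$(printf '%s\n' "$line" | sed 's| | <s> |; s|$|</s>|')
with_slash=$(printf '%s\n' "$line" | sed 's/ / <s> /; s/$/<\/s>/')

printf '%s\n' "$with_pipe"
[ "$with_pipe" = "$with_slash" ] && echo "both delimiters give the same result"
```

Both forms are equivalent; choosing | simply avoids the \/ escape.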

What happened with the original command

From the question, we have:

$ sed -e 's/\(.*\) \(.*\)/ <s> \1 <\/s>\2/' text.ign 
 <s> TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>

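The problem here is that sed's regular expressions take the leftmost, longest match. The first \(.*\) therefore matches everything from the start of the line up to the last space on the line, and the second \(.*\) matches whatever follows that last space.

Because the lines in the example end with a space, the first \(.*\) captures the entire line apart from that trailing space, while the second \(.*\) captures nothing. As a result, <s> is placed before the whole line and </s> after it.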
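To see what each group captures, the groups can be printed in brackets. This is a small sketch, assuming the sample line from the question. The second command is one possible adjustment, not something taken from the answers: it replaces the first .* with [^ ]* so that the first group cannot run past the first space.

```shell
line='TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” '

# Show what the two greedy groups capture in the original pattern.
groups=$(printf '%s\n' "$line" | sed 's/\(.*\) \(.*\)/1=[\1] 2=[\2]/')
printf '%s\n' "$groups"

# Illustrative alternative: [^ ]* stops at the first space, so group 1 is only the ID.
fixed=$(printf '%s\n' "$line" | sed 's/\([^ ]*\) \(.*\)/\1 <s> \2<\/s>/')
printf '%s\n' "$fixed"
```

The first command shows group 2 as empty, which is why both tags ended up around the whole line.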

Answer 2

This is simple. Just use:

sed -Ee 's/(.*[0-9])(.*)/\1 <s>\2 <\/s>/'

In your case:

cat file | sed -Ee 's/(.*[0-9])(.*)/\1 <s>\2 <\/s>/' | less

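However, you should prefer passing the file name to sed directly rather than piping the file in, i.e.:

sed -Ee 's/(.*[0-9])(.*)/\1 <s>\2 <\/s>/' file

To edit the file in place, use the -i option.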
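A cautious way to try in-place editing is to let sed keep a backup copy. The sketch below assumes a sed that accepts -E and -i with an attached suffix (GNU and BSD sed both do), works on a temporary file, and uses a shortened sample line without a trailing space. Note that the pattern anchors on the last digit in the line, so it assumes the sentence itself contains no digits.

```shell
# Work on a throwaway copy so no real data is touched.
tmp=$(mktemp)
printf '%s\n' 'TEXT1910\text0001 “ My hand is broken . ”' > "$tmp"

# -i.bak edits the file in place and saves the original as "$tmp.bak".
sed -i.bak -E 's/(.*[0-9])(.*)/\1 <s>\2 <\/s>/' "$tmp"

edited=$(cat "$tmp")
backup=$(cat "$tmp.bak")
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

rm -f "$tmp" "$tmp.bak"
```

Once the result looks right, the backup suffix can be dropped or the .bak file deleted.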

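Answer 3

$ awk '{ $1 = $1 " <s>"; $(NF+1) = "</s>"; print }' file
TEXT1910\text0001 <s> “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>

This simply appends a space and the opening tag <s> to the first whitespace-delimited field, then adds the closing tag </s> as a new field at the end. It then prints the modified line.

Note that this collapses runs of multiple spaces in the data into single spaces.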
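The collapsing is easy to reproduce. The sketch below uses a made-up line (the ID0001 text is hypothetical, not from the question) with runs of two and three spaces. The second command is one possible variant, not part of the original answer: it uses sub() on the whole record instead of assigning to fields, so awk never rebuilds the line and the spacing survives.

```shell
# Hypothetical input with runs of two and three spaces.
line='ID0001 two  spaces   here'

# Field assignment makes awk rebuild the record with single-space separators.
collapsed=$(printf '%s\n' "$line" | awk '{ $1 = $1 " <s>"; $(NF+1) = "</s>"; print }')

# sub() edits the record text directly: tag after the first space,
# then replace any trailing spaces (possibly none) with " </s>".
preserved=$(printf '%s\n' "$line" | awk '{ sub(/ /, " <s> "); sub(/ *$/, " </s>"); print }')

printf '%s\n%s\n' "$collapsed" "$preserved"
```

Whether the collapsing matters depends on the data; for tokenized text with single spaces the two commands give the same output.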

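Answer 4

What you need is for everything from the first double quote to the last double quote to be wrapped in an HTML-like construct, which can be done as follows: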

sed -e 's|".*"|<s> & </s>|'  inputfile
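One caveat, as far as the sample in the question goes: it uses typographic quotes (“ and ”), which a pattern written with straight quotes (") does not match. The sketch below, assuming the sample line and a UTF-8 capable sed, compares the two patterns; the curly-quote variant is an adaptation, not the command as originally posted.

```shell
line='TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” '

# Straight quotes: no match on this sample, so the line passes through unchanged.
straight=$(printf '%s\n' "$line" | sed -e 's|".*"|<s> & </s>|')

# Curly quotes: matches from the first “ to the last ” on the line.
curly=$(printf '%s\n' "$line" | sed -e 's|“.*”|<s> & </s>|')

[ "$straight" = "$line" ] && echo "straight-quote pattern left the line unchanged"
printf '%s\n' "$curly"
```

With this variant the trailing space of the input ends up after </s> rather than before it.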
