So, basically I have lines like this:
TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ”
I want them to look like this:
TEXT1910\text0001 <s> “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>
I tried to get this working with the following command:
cat text.ign | sed -e 's/\(.*\) \(.*\)/ <s> \1 <\/s>\2/' | less
But this produces:
<s> TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>
Answer 1
If I am interpreting your goal correctly, try:
sed 's| | <s> |; s|$|</s>|'
For example, starting with your file:
$ cat text.ign
TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ”
And, running our command:
$ sed 's| | <s> |; s|$|</s>|' text.ign
TEXT1910\text0001 <s> “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>
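How it works:
s| | <s> | replaces the first space on the line with " <s> ". Sed allows any character to be used as the delimiter in a substitute command; here we use | instead of the traditional /.
s|$|</s>| appends </s> to the end of the line. Because | is the delimiter, the slash in </s> does not need to be escaped.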
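As a small sketch of the two points above (the sample line and the variable names are made up for illustration, not taken from the question's file), the following compares the substitution with and without the g flag, and the | delimiter with the / delimiter:

```shell
# Hypothetical sample line; note the trailing space, as in the question's data.
line='ID0001 one two three '

# Without the g flag, s replaces only the first match on the line.
first_only=$(printf '%s\n' "$line" | sed 's| | <s> |')

# With g, every space would be replaced, which is not what we want here.
every_space=$(printf '%s\n' "$line" | sed 's| | <s> |g')

# Appending the end tag: with | as the delimiter the slash is literal ...
pipe_delim=$(printf '%s\n' "$line" | sed 's|$|</s>|')

# ... while with / as the delimiter the slash in </s> must be escaped.
slash_delim=$(printf '%s\n' "$line" | sed 's/$/<\/s>/')

printf '%s\n' "$first_only" "$every_space" "$pipe_delim" "$slash_delim"
```

The last two commands should give identical results; the choice of delimiter only affects how the expression has to be written.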
What went wrong with the original command
From the question, we have:
$ sed -e 's/\(.*\) \(.*\)/ <s> \1 <\/s>\2/' text.ign
<s> TEXT1910\text0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>
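The problem is that sed regular expressions use leftmost-longest matching. The first \(.*\) therefore matches everything from the start of the line up to the last space in the line, and the second \(.*\) matches whatever follows that last space.
Because the line in the example ends with a space, the first \(.*\) matches the entire line apart from that trailing space, and the second \(.*\) matches nothing. As a result, <s> is placed before the whole line and </s> after it.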
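To see the greedy behaviour directly, here is a minimal sketch that prints each captured group in square brackets. The sample lines are hypothetical, and the last command is only one possible way to keep the original two-group approach (it is not the command proposed above): it restricts the first group to non-space characters so it cannot run past the first space.

```shell
# Hypothetical sample lines, one with and one without a trailing space.
with_trailing='ID0001 one two '
without_trailing='ID0001 one two'

# Same pattern as in the question, but showing each group in brackets.
show_groups='s/\(.*\) \(.*\)/[\1][\2]/'

# The first group runs up to the LAST space; here that is the trailing one,
# so the second group is empty.
r1=$(printf '%s\n' "$with_trailing" | sed "$show_groups")

# Without the trailing space, the last space is the one before "two".
r2=$(printf '%s\n' "$without_trailing" | sed "$show_groups")

# Assumed variant: [^ ]* cannot cross a space, so the first group stops at
# the first space instead of the last one.
r3=$(printf '%s\n' "$with_trailing" | sed 's/\([^ ]*\) \(.*\)/\1 <s> \2<\/s>/')

printf '%s\n' "$r1" "$r2" "$r3"
```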
Answer 2
This is simple; just use:
sed -Ee 's/(.*[0-9])(.*)/\1 <s>\2 <\/s>/'
In your case:
cat file | sed -Ee 's/(.*[0-9])(.*)/\1 <s>\2 <\/s>/' | less
However, it is better to pass the file name to sed directly and avoid the unnecessary pipe, i.e.:
sed -Ee 's/(.*[0-9])(.*)/\1 <s>\2 <\/s>/' file
Use the -i option to edit the file in place.
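One caveat is worth checking before relying on this pattern: (.*[0-9]) is greedy, so it extends to the last digit anywhere on the line, not just to the end of the ID. The sketch below uses made-up sample lines to show this. It also shows in-place editing with a backup suffix (-i.bak); the suffix form is used here on the assumption that it is accepted by both GNU and BSD sed, and the temporary directory and file names are purely illustrative.

```shell
expr='s/(.*[0-9])(.*)/\1 <s>\2 <\/s>/'

# No digits in the sentence: the tag lands right after the ID.
ok=$(printf '%s\n' 'ID0001 no digits here' | sed -E "$expr")

# A digit inside the sentence pulls the first group further to the right.
shifted=$(printf '%s\n' 'ID0001 call 911 now' | sed -E "$expr")

# In-place editing on a scratch copy, keeping the original as sample.txt.bak.
tmpdir=$(mktemp -d)
printf '%s\n' 'ID0001 no digits here' > "$tmpdir/sample.txt"
sed -i.bak -E "$expr" "$tmpdir/sample.txt"
edited=$(cat "$tmpdir/sample.txt")
backup=$(cat "$tmpdir/sample.txt.bak")
rm -rf "$tmpdir"

printf '%s\n' "$ok" "$shifted" "$edited" "$backup"
```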
Answer 3
$ awk '{ $1 = $1 " <s>"; $(NF+1) = "</s>"; print }' file
TEXT1910\text0001 <s> “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” </s>
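This simply appends a space and the start tag <s> to the first whitespace-delimited field, and then adds the end tag </s> as a new field at the end. It then prints the modified line.
Note that this collapses runs of multiple spaces in the data into single spaces.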
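The collapsing happens because awk rebuilds the record from its fields, joined by single spaces, as soon as a field is assigned. A small sketch with a made-up line containing runs of spaces shows the difference from the sed approach of Answer 1, which leaves the original spacing alone:

```shell
# Hypothetical sample line with runs of three spaces and a trailing space.
spaced='ID0001   two   spaces '

# awk: assigning to $1 rebuilds the line with single spaces between fields.
awk_out=$(printf '%s\n' "$spaced" |
  awk '{ $1 = $1 " <s>"; $(NF+1) = "</s>"; print }')

# sed: only the first space is touched; the remaining spacing is preserved.
sed_out=$(printf '%s\n' "$spaced" | sed 's| | <s> |; s|$|</s>|')

printf '%s\n' "$awk_out" "$sed_out"
```

If the exact spacing of the text matters, the sed version is probably the safer choice.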
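Answer 4
What you need is for everything from the first double quote to the last double quote to be wrapped in an HTML-like construct, which can be done as follows: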
sed -e 's|".*"|<s> & </s>|' inputfile
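Note that the pattern above uses straight double quotes ("), whereas the sample line as shown in the question uses typographic quotes (“ and ”). If the file really contains the typographic characters, the straight-quote pattern will not match. The sketch below assumes a UTF-8 file with the typographic quotes and a shortened, made-up ID; it is a variant, not the command given above.

```shell
# Sample line modelled on the question, with typographic quotes and a
# trailing space; the ID is shortened for illustration.
line='ID0001 “ My hand is broken , ” said the sailor , “ and smoked the pipe . ” '

# Straight quotes: nothing matches, so the line comes back unchanged.
straight=$(printf '%s\n' "$line" | sed -e 's|".*"|<s> & </s>|')

# Typographic quotes: wraps everything from the first “ to the last ”.
curly=$(printf '%s\n' "$line" | sed -e 's|“.*”|<s> & </s>|')

printf '%s\n' "$straight" "$curly"
```

Because only the quoted span is wrapped, the trailing space of the line ends up after </s> rather than before it.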