Search and replace (except within quotes)

I have the following text and need to replace every run of spaces with a newline, except for anything inside quotes.

Input

This is an example text with    some      spaces.
This should be 2nd line.
However the spaces between "quotes    should not    change".
last line

The output should look like this:

This
is
an
example
text
with
some
spaces.
This
should
be
2nd
line.
However
the
spaces
between
"quotes    should not    change".
last
line

I tried using awk/sed/perl but could not figure out how to leave the quoted parts alone.

The quoted text will never span more than one line.

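Answer 1

Edit: My solution is complete overkill. I don't know what I was thinking. This problem can be solved with an extremely simple regular expression. See the solution submitted by Joao.

Python's shlex module almost does this out of the box. Here is an example script: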

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""tokens.py"""

import sys
import shlex

with open(sys.argv[1], 'r') as textfile:
    text = ''.join(textfile.readlines())
    for token in shlex.split(text, posix=False):
        print(token)

If your data is in a file named data.txt (for example), you can run it like this:

python tokens.py data.txt

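Here is the output it produces:

This
is
an
example
text
with
some
spaces.
This
should
be
2nd
line.
However
the
spaces
between
"quotes    should not    change"
.
last
line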

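Note that it puts the period on a separate line. This is because shlex ends the token at the closing quote. Since the example you gave doesn't seem to need multi-line strings or escape characters, rolling your own little lexer isn't hard. Here's what I came up with: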

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""tokens.py"""

import sys

def tokenize(string):
    """Break a string into tokens using white-space as the only delimiter
    while respecting double-quoted substrings and keeping the double-quote
    characters in the resulting token."""

    # List to store the resulting list of tokens
    tokens = []

    # List to store characters as we build the current token
    token = []

    # Flag to keep track of whether or not
    # we're currently in a quoted substring
    quoted = False

    # Iterate through the string one character at a time
    for character in string:

        # If the character is a space then we either end the current
        # token (if quoted is False) or add the space to the current
        # token (if quoted is True)
        if character == ' ':
            if quoted:
                token.append(character)
            elif token:
                tokens.append(''.join(token))
                token = []

        # A double-quote character is always added to the token
        # It also toggles the 'quoted' flag
        elif character == '"':
            token.append(character)
            quoted = not quoted

        # All other characters are added to the token
        else:
            token.append(character)

    # Whatever is left at the end becomes another token
    if token:
        tokens.append(''.join(token))

    # Return the resulting list of strings
    return tokens

if __name__=="__main__":
    # Read in text from a file and print out the resulting tokens.
    with open(sys.argv[1], 'r') as textfile:
        text = ''.join(textfile.readlines()).replace("\n", " ")
        for token in tokenize(text):
            print(token)

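This produces exactly the result you asked for. You could easily implement the same algorithm in another language (such as Perl). I just happen to prefer Python.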
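The lexer above treats only the space character as a delimiter and relies on the caller replacing newlines beforehand. As a minimal sketch, assuming tabs and newlines should also separate tokens, the same state machine can be written as a generator. The name `iter_tokens` is made up for illustration and is not part of the original answer.

```python
def iter_tokens(text):
    """Yield tokens split on any whitespace, keeping double-quoted runs intact."""
    token = []      # characters of the token being built
    quoted = False  # True while inside a double-quoted substring

    for character in text:
        if character == '"':
            # A double quote is kept in the token and toggles the flag
            quoted = not quoted
            token.append(character)
        elif character.isspace() and not quoted:
            # Unquoted whitespace ends the current token, if there is one
            if token:
                yield "".join(token)
                token = []
        else:
            # Everything else, including quoted whitespace, is kept
            token.append(character)

    # Whatever is left at the end becomes the last token
    if token:
        yield "".join(token)


if __name__ == "__main__":
    sample = 'between "quotes    should not    change".\nlast line'
    for tok in iter_tokens(sample):
        print(tok)
```

With an unbalanced quote, everything after the stray quote ends up in a single token, so this sketch is only suitable for input where quotes are paired, as in the question.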

Answer 2

Using GNU grep:

grep -Po '(".*?"|\S)+' file.txt
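Where a grep with `-P` support is not available, roughly the same pattern can be tried with Python's standard `re` module. This is only a sketch of an equivalent, not the answer's own code; the function name `split_outside_quotes` is hypothetical, and the group is made non-capturing so that `findall` returns whole matches.

```python
import re

# A token is a run made of quoted strings and/or non-whitespace characters,
# mirroring the grep pattern (".*?"|\S)+
TOKEN = re.compile(r'(?:".*?"|\S)+')


def split_outside_quotes(text):
    """Return whitespace-separated tokens, keeping double-quoted runs intact."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    sample = 'However the spaces between "quotes    should not    change".\nlast line'
    print("\n".join(split_outside_quotes(sample)))
```

Because `.` does not match a newline by default, a quoted run cannot extend past the end of its line, which matches the line-by-line behaviour of grep.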

Answer 3

If blank lines in the original text may be removed:

sed -r 's/("[^"]*"[^ ]?)/\n\1\n/g' input.txt |
sed -r '/^"/!s/\s{1,}/\n/g' |
sed '/^$/d'

If blank lines in the original text should be preserved:

sed -r 's/("[^"]*"[^ ]?)/###\n\1\n###/g' input.txt |
sed -r '/^"/!s/\s{1,}/\n/g' |
sed '/###/d'

Input (a more complex test)

This is an "example text" with    some      spaces.
This should be 2nd line.
"However the spaces" between "quotes    should not    change".
"last line"

Output

This
is
an
"example text"
with
some
spaces.
This
should
be
2nd
line.
"However the spaces"
between
"quotes    should not    change".
"last line"
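For comparison, the blank-line-preserving behaviour can also be sketched in Python by handling the text one line at a time. This is a different technique from the sed pipeline (a single regular expression per line instead of marker lines), and the name `split_lines_keep_blanks` is made up for illustration.

```python
import re

# A token is a run made of quoted strings and/or non-whitespace characters
TOKEN = re.compile(r'(?:"[^"]*"|\S)+')


def split_lines_keep_blanks(text):
    """Split each line into tokens; a blank input line becomes one empty string."""
    output = []
    for line in text.splitlines():
        if not line.strip():
            # Keep a blank (or whitespace-only) line as an empty output line
            output.append("")
        else:
            output.extend(TOKEN.findall(line))
    return output


if __name__ == "__main__":
    sample = 'This is an "example text" with    some      spaces.\n\n"last line"'
    print("\n".join(split_lines_keep_blanks(sample)))
```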
