How can I retrieve email addresses and websites from a CSV file on the command line?

Answer 1

A plain-text based, admittedly more verbose, Python option:

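#!/usr/bin/env python3
import sys

# first argument: input csv file, second argument: output file
f = sys.argv[1]; out = sys.argv[2]

with open(out, "wt") as wr:
    with open(f) as read:
        for l in read:
            # split each line on commas and keep the fields that
            # look like an email address or a website
            for s in l.strip().split(","):
                if any(["@" in s, "www" in s, "http" in s]):
                    wr.write(s+"\n")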

Or, just for fun, a bit more compressed:

#!/usr/bin/env python3
import sys

with open(sys.argv[2], "wt") as wr:
    with open(sys.argv[1]) as read:
        [[wr.write(s+"\n") for s in l.strip().split(",") if any(["@" in s, "www" in s, "http" in s])] for l in read]

How to use

Copy the script into an empty file and save it as get_stuff.py
Run it with the source file and the target output file as arguments:
```
python3 /path/to/get_stuff.py <input_file> <output_file>
```

Result:

[email protected]
http://www.example.com
[email protected]
http://www.example.com
[email protected]
[email protected]
www.example.com
[email protected]
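
The scripts above split each line on every comma, so a quoted field that itself contains a comma would be cut in two. As a hedged sketch only (not part of the original answer), the same filter can be run through Python's standard `csv` module, which understands quoted fields. The function name `extract_contacts` and the sample rows are made up for illustration:

```python
import csv
import io

def extract_contacts(lines):
    """Return the fields that contain "@", "www" or "http" (same test as above)."""
    markers = ("@", "www", "http")
    found = []
    # csv.reader accepts any iterable of lines and handles quoted fields
    for row in csv.reader(lines):
        for field in row:
            field = field.strip()
            if any(m in field for m in markers):
                found.append(field)
    return found

# Hypothetical sample data; the first row has a quoted field with a comma in it
sample = io.StringIO(
    '1,"Doe, Jane",jane@example.com,http://www.example.com\n'
    '2,Bob,,www.example.org\n'
)
print(extract_contacts(sample))
```

For files without quoted fields this should give the same fields as the plain `split(",")` version.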

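Time comparison

Interestingly, on smaller files (like the one in the example) the sed option is faster, but on large files the Python option is faster:

On a file of 150,000 lines: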

sed

real    0m0.073s
user    0m0.068s
sys     0m0.000s

Python

real    0m0.046s
user    0m0.044s
sys     0m0.000s

On a file of 10 lines:

sed

real    0m0.003s
user    0m0.000s
sys     0m0.000s

Python

real    0m0.037s
user    0m0.032s
sys     0m0.000s

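(I should say that I have an ancient box; on a serious machine all times should be shorter.)

The takeaway might be: especially if you need to process many smaller files in a loop, use sed; for large files in a loop, use Python.

For a single file, whatever its size, the difference between 0.073 and 0.046 is completely irrelevant.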
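
To reproduce a comparison like this on your own machine, a rough sketch such as the following could be used. It is an assumption-laden illustration, not the method used for the numbers above: it generates made-up rows in a temporary directory and times only the extraction loop with `time.perf_counter`, so, unlike the shell's `time`, it does not include interpreter startup. The helper names `extract` and `time_extract` are hypothetical.

```python
import os
import tempfile
import time

def extract(src, dst):
    """Same filter as the script above; returns the number of fields written."""
    count = 0
    with open(src) as read, open(dst, "wt") as wr:
        for line in read:
            for s in line.strip().split(","):
                if "@" in s or "www" in s or "http" in s:
                    wr.write(s + "\n")
                    count += 1
    return count

def time_extract(n_lines):
    """Build a throwaway CSV with n_lines made-up rows and time one run."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.csv")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "wt") as f:
            for i in range(n_lines):
                f.write(f"{i},name{i},user{i}@example.com,http://www.example.com\n")
        start = time.perf_counter()
        count = extract(src, dst)
        return count, time.perf_counter() - start

for n in (10, 150_000):
    count, seconds = time_extract(n)
    print(f"{n} lines: {count} fields in {seconds:.4f}s")
```

The absolute numbers will differ from machine to machine; only the relative trend is of interest.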

Additionally

Below is a version that extracts the same data from a whole directory of (plain) files.

#!/usr/bin/env python3
import sys
import os

dr = sys.argv[1]

def extract(f, out):
    with open(out, "wt") as wr:
        with open(f) as read:
            [[wr.write(s+"\n") for s in l.strip().split(",") if any(
                ["@" in s, "www" in s, "http" in s]
                )] for l in read]

for file in os.listdir(dr):
    f = os.path.join(dr, file); out = os.path.join(dr, "extracted_"+file)
    extract(f, out)

For each file, the script creates a new file containing the extracted data. From a file named:

somefile.csv

it will create a second file, named:

extracted_somefile.csv
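
Note that the directory script as written processes every entry returned by `os.listdir`, so a second run would also process the `extracted_` files from the first run, and a subdirectory would raise an error. A minimal sketch of a variant that skips both, assuming the same `extracted_` prefix (the function name `extract_dir` and the demo file contents are hypothetical):

```python
import tempfile
from pathlib import Path

def extract_dir(directory, prefix="extracted_"):
    """Write prefix+name next to each regular file; return the names written."""
    written = []
    # sorted() takes a snapshot, so files created in the loop are not revisited
    for src in sorted(Path(directory).iterdir()):
        # skip subdirectories and output files from an earlier run
        if not src.is_file() or src.name.startswith(prefix):
            continue
        dst = src.with_name(prefix + src.name)
        with open(src) as read, open(dst, "wt") as wr:
            for line in read:
                for s in line.strip().split(","):
                    if "@" in s or "www" in s or "http" in s:
                        wr.write(s + "\n")
        written.append(dst.name)
    return written

# Demo on a throwaway directory; the file name and contents are made up
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "somefile.csv").write_text("1,Jane,jane@example.com,www.example.com\n")
    Path(tmp, "subdir").mkdir()
    print(extract_dir(tmp))  # first run
    print(extract_dir(tmp))  # second run ignores the extracted_ file
```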

Answer 2

I think your desired output is missing two lines?

$ sed -r 's|.*,([^,]+@[^0-9]+),.*|\1|' file | tr ',' '\n'
[email protected]
http://www.example.com
[email protected]
http://www.example.com
[email protected]
http://example.com
[email protected]
www.example.com
[email protected]

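If not, please clarify.

Explanation

-r : use ERE (extended regular expressions)
s|old|new| : replace old with new
.*, : any characters, ending with a comma
([^,]+@[^0-9]+),.* : save some non-comma characters before an @, then some non-digit characters up to a comma; match everything after that so we can discard it
\1 : backreference to the saved pattern
tr ',' '\n' : change the remaining commas to newlines (I resorted to piping to tr because the fields are inconsistent, but it can probably be avoided with some cleverness)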
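
To see how the pieces of the expression work together, here is a rough Python sketch of the same substitution using the `re` module. It is an illustration only: the function name `sed_like` and the sample line are made up, and Python's regex engine is not guaranteed to behave identically to sed on every input.

```python
import re

# Same pattern as in the sed command: capture from the field containing "@"
# up to the last comma that is preceded only by non-digit characters
PATTERN = re.compile(r".*,([^,]+@[^0-9]+),.*")

def sed_like(line):
    """Apply the substitution, then split on commas (the job done by tr)."""
    kept = PATTERN.sub(r"\1", line.rstrip("\n"))
    return kept.split(",")

# Hypothetical input line: id, name, email, website, number
for item in sed_like("1,John,john@example.com,http://www.example.com,123"):
    print(item)
```

Because `[^0-9]+` also matches commas, the capture runs on past the email address into the website field and only stops at the comma before the trailing number, which is why `tr` (here `split`) is needed afterwards.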
