Extracting strings from a file using a shell script or awk

Question 1

The first example uses grep to grab all the links:

$ grep -o 'http[^"]*' file
http://www.dakar.com
http://www.docomolabs-usa.com/
http://www.google.com/
http://www.hpl.hp.com/
http://www.ibm.com/
http://research.microsoft.com/
http://www.vmware.com/

The second uses awk: on each line whose first field is Host:, it prints the second field:

$ awk '$1=="Host:"{print $2}' file
mail.google.com
mail.google.com
mail.google.com
www.slashdot.org
slashdot.org
store.dakar.com
genweb.ostg.com
pagead2.googlesyndication.com
ad.doubleclick.net
bs.serving-sys.com
ds-ll.serving-sys.com
images.slashdot.org
store.dakar.com
www.google-analytics.com
www.google.com
www.usenix.org
www.thelocal.se
www.usenix.org
www.thelocal.se
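
The host listing above contains repeats. As a minimal sketch, the same awk filter can be piped through sort and uniq to count requests per host. The sample input and the function name count_hosts are assumptions made for illustration; they are not taken from the original capture.

```shell
#!/bin/sh
# Hypothetical sample input: a few HTTP request headers, made up for this sketch.
sample=$(mktemp)
cat > "$sample" <<'EOF'
GET / HTTP/1.1
Host: mail.google.com
GET /inbox HTTP/1.1
Host: mail.google.com
GET / HTTP/1.1
Host: www.slashdot.org
EOF

# count_hosts FILE: print each Host: value preceded by its number of
# occurrences, most frequent first (illustrative name).
count_hosts() {
    awk '$1 == "Host:" { print $2 }' "$1" | sort | uniq -c | sort -rn
}

count_hosts "$sample"
rm -f "$sample"
```

If only the distinct host names are wanted, replacing the last three stages with sort -u should be enough.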

Question 2

A simple approach is to have elinks(1) dump the file, as described in its man page:

   -dump [0|1] (default: 0)
       Print formatted plain-text versions of given URLs to stdout.

Something like:

$ elinks -dump < infile | awk '$0~/\s*[[:digit:]]*\. http/ {print $2}'
http://www.dakar.com/
http://www.docomolabs-usa.com/
http://www.google.com/
http://www.hpl.hp.com/
http://www.hpl.hp.com/
http://research.microsoft.com/
http://www.vmware.com/

Of course, this may catch unwanted lines. Refine the regular expression to fit your criteria.

Other text-mode browsers (lynx, links) and some pagers (w3m) also have a dump option.
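
One way to tighten the match is to anchor the pattern at the start of the line, so that only numbered reference lines are accepted. The sketch below assumes the dump lists links as lines of the form "   1. http://..."; that layout, the sample text, and the function name list_dump_links are assumptions for illustration. It uses plain [0-9] and [ \t] instead of \s and [[:digit:]] so that it does not depend on a particular awk implementation.

```shell
#!/bin/sh
# list_dump_links: read a text-browser dump on stdin and print the URL from
# each numbered reference line such as "   1. http://example.com/".
# The pattern is anchored with ^ so a number in running text is ignored.
list_dump_links() {
    awk '/^[ \t]*[0-9]+\.[ \t]+https?:\/\// { print $2 }'
}

# Hypothetical dump text: the middle line would fool an unanchored pattern.
printf '%s\n' \
    '   1. http://www.dakar.com/' \
    '   See section 2. http is described there' \
    '   2. http://www.google.com/' | list_dump_links
```

With a real file, the function would sit at the end of the pipeline in place of the awk command, for example elinks -dump < infile | list_dump_links.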

Question 3

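With the -o option, grep prints only the text that matches the given pattern, one match per output line. For example, the following command extracts every citation of the form \cite{citation-key} from a LaTeX file.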

grep -o '[\]cite{[a-zA-Z0-9,-]*}' inputfile.tex

To redirect the output to another file, use

grep -o '[\]cite{[a-zA-Z0-9,-]*}' inputfile.tex > outputfile.tex
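
If only the keys themselves are needed, the matches can be post-processed. The following is a minimal sketch, not part of the original answer: the function name cite_keys and the sample sentence are made up for illustration. It strips the \cite{ and } wrapper with sed, splits comma-separated keys with tr, and removes duplicates with sort -u.

```shell
#!/bin/sh
# cite_keys FILE: list the unique citation keys found in \cite{...} commands.
cite_keys() {
    grep -o '[\]cite{[a-zA-Z0-9,-]*}' "$1" |
        sed -e 's/^.cite{//' -e 's/}$//' |   # drop the \cite{ ... } wrapper
        tr ',' '\n' |                        # one key per line
        sort -u                              # remove duplicates
}

# Hypothetical sample document.
tex=$(mktemp)
cat > "$tex" <<'EOF'
As shown in \cite{knuth84,lamport94} and later in \cite{knuth84}.
EOF
cite_keys "$tex"
rm -f "$tex"
```

Because the character class contains no space, a citation written as \cite{a, b} is not matched; the class would need a space added to cover that style.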

Question 4

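Assuming you want to extract this from an existing file (called blag.txt in this example), you can use cat blag.txt | grep http | cut -d \" -f2 for the first example.

First, grep extracts the lines containing http. That gives you lines like <li><a href="http://www.dakar.com" TARGET=_BLANK>dakar.com</a></li>. Then we use the double quote as the delimiter for cut, but since quotes are also used by the shell to enclose strings, we need to escape it with a backslash (\).

For the second case, you would grep for "Host" and then use : as the delimiter (you could use the space after the colon in the same way). cat blag2.txt | grep Host | cut -d : -f2 is how I would do it, although the result keeps the space that follows the colon; cat blag2.txt | grep Host | cut -d \  -f2 is more elegant. There are two spaces after the \: the first is the one we use as the delimiter, and the second separates it from the next argument.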
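
The same two pipelines can be written with the delimiter in single quotes instead of backslash-escaped, and with grep reading the file directly instead of through cat. This is only a sketch of that variation: the function names first_href and host_values and the sample file contents are illustrative.

```shell
#!/bin/sh
# first_href FILE: print the text between the first pair of double quotes
# on each line that contains "http".
first_href() {
    grep http "$1" | cut -d '"' -f2
}

# host_values FILE: print the field after the first space on each line that
# starts with "Host:", so no leading space is left in the result.
host_values() {
    grep '^Host:' "$1" | cut -d ' ' -f2
}

# Hypothetical sample file combining one line of each kind.
f=$(mktemp)
cat > "$f" <<'EOF'
<li><a href="http://www.dakar.com" TARGET=_BLANK>dakar.com</a></li>
Host: store.dakar.com
EOF
first_href "$f"
host_values "$f"
rm -f "$f"
```

Quoting the delimiter makes the single-space case easier to read than the two spaces after a backslash, which are easy to lose when the command is copied.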
