grep forward slash to forward slash

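I have HTML files containing a lot of URLs. I am trying to grep only the paths that run from a forward slash to a closing slash. For example, given: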

"/index.php/pub/xx/en/details/123456/"
"/index.php/pub/xx/en/details/993455/xxx/ff/3e/"
"/index.php/pub/xx/en/details/74939300/"
"/index.php/pub/xx/en/details/9584443/"
"/index.php/pub/xx/en/details/9583832/cdf/dr/wwe/"

My expected result is:

/index.php/pub/xx/en/details/123456/
/index.php/pub/xx/en/details/74939300/
/index.php/pub/xx/en/details/9584443/

Answer 1

grep

grep -P -o '/[^"]*?/[0-9]+/(?=")' file.txt

sed

sed -nE 's|.*"(/[^"]*/[0-9]+/)".*|\1|p' file.txt

Answer 2

cat file.txt
"/index.php/pub/xx/en/details/123456/"
"/index.php/pub/xx/en/details/993455/xxx/ff/3e/"
"/index.php/pub/xx/en/details/74939300/"
"/index.php/pub/xx/en/details/9584443/"
"/index.php/pub/xx/en/details/9583832/cdf/dr/wwe/"
grep -Po '"\K[^"]+?/\d+/(?=")' file.txt
/index.php/pub/xx/en/details/123456/
/index.php/pub/xx/en/details/74939300/
/index.php/pub/xx/en/details/9584443/

Explanation:

-Po             # Perl regex, only the matched string
"               # a double quote
\K              # forget it (the quote matched so far is left out of the output)
[^"]+?          # 1 or more non double quote, not greedy
/               # a slash
\d+             # 1 or more digits
/               # a slash
(?=")           # positive lookahead, make sure we have a double quote after
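The same pattern can be tried on lines that look more like real HTML. This is only a sketch: the anchor tags and link texts below are made-up sample data, and it assumes a grep built with PCRE support (GNU grep with -P).

```shell
# Hypothetical HTML snippet; only the href values come from the question.
page=$(mktemp)
cat > "$page" <<'EOF'
<a href="/index.php/pub/xx/en/details/123456/">first</a>
<a href="/index.php/pub/xx/en/details/993455/xxx/ff/3e/">second</a>
<li><a href="/index.php/pub/xx/en/details/74939300/">third</a></li>
EOF

# Same regex as above: the lookahead (?=") rejects the second link,
# because its numeric segment is not directly followed by the closing quote.
matches=$(grep -Po '"\K[^"]+?/\d+/(?=")' "$page")
printf '%s\n' "$matches"

rm -f "$page"
```

With this sample, only the first and third paths should be printed, without the surrounding quotes or tags.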

Answer 3

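You say you want to grep "forward slash to forward slash". I guess that means you want everything from the first slash through the last slash, omitting any characters before the first slash and after the last slash (i.e., the " quotes, in your data). Without any explanation, you show that you want only the lines that have exactly six pathname components; i.e., seven slashes.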
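To see where the number seven comes from, the slashes in each sample line can be counted. This is a rough sketch, assuming the five lines from the question are saved in a temporary file (the variable names are only illustrative):

```shell
# Temporary copy of the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"/index.php/pub/xx/en/details/123456/"
"/index.php/pub/xx/en/details/993455/xxx/ff/3e/"
"/index.php/pub/xx/en/details/74939300/"
"/index.php/pub/xx/en/details/9584443/"
"/index.php/pub/xx/en/details/9583832/cdf/dr/wwe/"
EOF

# With "/" as the field separator, a line with n slashes has n + 1 fields,
# so NF - 1 is the number of slashes on that line.
counts=$(awk -F/ '{ print NF - 1 }' "$sample" | paste -sd' ' -)
echo "$counts"   # prints: 7 10 7 7 10

rm -f "$sample"
```

The three wanted lines each have seven slashes; the two unwanted ones have ten.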

One command, using PCRE:

grep -Po '(?<=")/([^/]*/){6}(?=")' file.txt

Two commands, without PCRE:

grep -E '"/([^/]*/){6}"' file.txt | grep -o '/.*/'
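As a quick sanity check, both variants can be run against the sample data and their output compared. This is a sketch under the assumption that the question's five lines are the whole input and that grep supports -P for the first variant:

```shell
# Temporary copy of the sample data from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"/index.php/pub/xx/en/details/123456/"
"/index.php/pub/xx/en/details/993455/xxx/ff/3e/"
"/index.php/pub/xx/en/details/74939300/"
"/index.php/pub/xx/en/details/9584443/"
"/index.php/pub/xx/en/details/9583832/cdf/dr/wwe/"
EOF

# Variant 1: a single PCRE grep; lookbehind and lookahead keep the quotes out of the match.
pcre_out=$(grep -Po '(?<=")/([^/]*/){6}(?=")' "$sample")

# Variant 2: select the lines with ERE, then trim to first-slash-through-last-slash.
ere_out=$(grep -E '"/([^/]*/){6}"' "$sample" | grep -o '/.*/')

if [ "$pcre_out" = "$ere_out" ]; then
    echo "both variants agree:"
    printf '%s\n' "$pcre_out"
else
    echo "variants differ"
fi

rm -f "$sample"
```

On this input both variants should print the same three paths as the expected result in the question.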
