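I have HTML files that contain a lot of URLs. I am trying to grep only the part from forward slash to forward slash. For example, from this input: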
"/index.php/pub/xx/en/details/123456/"
"/index.php/pub/xx/en/details/993455/xxx/ff/3e/"
"/index.php/pub/xx/en/details/74939300/"
"/index.php/pub/xx/en/details/9584443/"
"/index.php/pub/xx/en/details/9583832/cdf/dr/wwe/"
My expected result is:
/index.php/pub/xx/en/details/123456/
/index.php/pub/xx/en/details/74939300/
/index.php/pub/xx/en/details/9584443/
Answer 1
With grep:
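grep -P -o '/.*?/[0-9]+/(?=")' file.txt
The (?=") lookahead requires the closing quote right after the numeric component. Without it, the pattern also prints the leading part of the longer URLs, such as /index.php/pub/xx/en/details/993455/.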
With sed:
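sed -nE 's|^"*(/.*/[0-9]+/)"*$|\1|p' file.txt
ERE has no lazy quantifiers, so .*? is not non-greedy under sed -E. The pattern is therefore anchored to the whole line, and -n together with the p flag prints only the lines where the substitution succeeds, so the longer URLs are dropped instead of being passed through or truncated.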
Answer 2
cat file.txt
"/index.php/pub/xx/en/details/123456/"
"/index.php/pub/xx/en/details/993455/xxx/ff/3e/"
"/index.php/pub/xx/en/details/74939300/"
"/index.php/pub/xx/en/details/9584443/"
"/index.php/pub/xx/en/details/9583832/cdf/dr/wwe/"
grep -Po '"\K[^"]+?/\d+/(?=")' file.txt
/index.php/pub/xx/en/details/123456/
/index.php/pub/xx/en/details/74939300/
/index.php/pub/xx/en/details/9584443/
Explanation:
-Po      # -P: Perl-compatible regex; -o: print only the matched part
"        # a double quote
\K       # reset the match start, so the quote is not part of the output
[^"]+?   # 1 or more non-quote characters, not greedy
/        # a slash
\d+      # 1 or more digits
/        # a slash
(?=")    # positive lookahead: a double quote must follow
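Since the question mentions HTML files, the same pattern can be aimed at an attribute rather than at a bare quoted line. The following is only a sketch: it assumes the URLs sit in double-quoted href attributes and that grep is GNU grep built with PCRE support. The function name extract_detail_urls is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): read HTML from stdin or from
# the files given as arguments and print each href value ending in /<digits>/.
extract_detail_urls() {
    # href="   literal attribute prefix (assumed markup)
    # \K       drop the prefix from the reported match
    # [^"]+?   lazily consume the path without crossing the closing quote
    # /\d+/    final numeric component
    # (?=")    the closing quote must follow immediately
    grep -Po 'href="\K[^"]+?/\d+/(?=")' "$@"
}
```

It could be called as extract_detail_urls page.html, where page.html stands for any of your files. Because [^"] cannot cross a quote, a match never spills from one attribute into the next.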
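Answer 3
You say you want to grep "forward slash to forward slash". I take that to mean you want everything from the first slash through the last slash, leaving out any characters before the first slash and after the last one (i.e., the " quotes, in your data). You have not explained it, but your example suggests that you only want lines with exactly six pathname components, i.e. seven slashes.
One command, using PCRE: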
grep -Po '(?<=")/([^/]*/){6}(?=")' file.txt
Two commands, without PCRE:
grep -E '"/([^/]*/){6}"' file.txt | grep -o '/.*/'
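The "exactly seven slashes" rule can also be expressed by counting the slashes directly, which avoids both PCRE and the second grep. This is a rough sketch rather than a drop-in replacement: it assumes the quotes on each line are balanced, and the function name seven_slash_paths is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): print every double-quoted
# value that starts and ends with "/" and contains exactly seven slashes,
# i.e. six path components.
seven_slash_paths() {
    awk -F'"' '{
        # With a double quote as the field separator, the even-numbered
        # fields are the quoted values (assuming balanced quotes).
        for (i = 2; i <= NF; i += 2) {
            s = $i
            n = gsub("/", "/", s)   # gsub returns the number of slashes replaced
            if (n == 7 && s ~ /^\// && s ~ /\/$/)
                print s
        }
    }' "$@"
}
```

Running seven_slash_paths file.txt on the sample data should print the same three paths as the grep commands above. Working on a copy of the field (s) keeps awk from rebuilding the input line.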