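I have a list of URLs from different domains, and I would like to remove the hostnames and keep only the paths, using sed, awk, or something similar. None of the URLs contain a port or username:password@ credentials.
Input: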
http://www.example.com/
https://www.example.com/
http://example.com/blog/
https://example.com/blog/
https://www.example.co.uk/blog/
https://example.co.uk/blog/
https://sub.example.co.uk/blog/
https://www.example.com/blog/
https://www.example.com/cases/page/4/
https://www.example.com/cdn-cgi/challenge-platform/h/g/cv/result/7c9123dc38da6841
https://www.example.com/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
https://www.example.co.uk/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
https://sub.example.co.uk/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
The output should be:
/
/
/blog/
/blog/
/blog/
/blog/
/blog/
/blog/
/cases/page/4/
/cdn-cgi/challenge-platform/h/g/cv/result/7c9123dc38da6841
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
I hope someone can help me, because all I can find are regular expressions, and I don't know how to turn them into proper sed or awk commands.
Answer 1
With perl:
perl -pe 's|^([^/:]+:)?//[^/]*||' < your-file
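This removes an optional scheme (so that both http://host/path and //host/path are handled), followed by // and any number of characters other than /. For example, in ftp://user:password@host:8080/pub it removes ftp://user:password@host:8080 and leaves /pub.
The sed equivalent could be: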
LC_ALL=C sed 's|^\([^/:]\{1,\}:\)\{0,1\}//[^/]*||' < your-file
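As a quick check of the edge cases described above, the following sketch feeds three sample URLs through that sed command. The `strip_host` function name is made up for illustration, and the URLs are the ones used in the explanation.

```shell
# Hypothetical helper wrapping the sed command from this answer.
strip_host() {
  LC_ALL=C sed 's|^\([^/:]\{1,\}:\)\{0,1\}//[^/]*||'
}

# A plain URL, a scheme-less URL, and one with credentials and a port.
result=$(printf '%s\n' \
  'http://host/path' \
  '//host/path' \
  'ftp://user:password@host:8080/pub' | strip_host)
printf '%s\n' "$result"
```

All three lines should come out reduced to their path part.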
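In any case, the s/pattern/replacement/ operators of both sed and perl take a regular expression as the pattern: so-called basic regular expressions for sed, and Perl regular expressions for perl. The latter improve on and extend the extended regular expressions that many sed implementations nowadays also support with the -E option.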
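For comparison, here is a sketch of the same substitution written with extended regular expressions. It assumes a sed that accepts -E (GNU and BSD sed both do); the sample URLs are only illustrative.

```shell
# With -E, grouping and repetition need no backslashes: ( ) + ?
ere_result=$(printf '%s\n' \
  'https://www.example.com/cases/page/4/' \
  '//host/path' | LC_ALL=C sed -E 's|^([^/:]+:)?//[^/]*||')
printf '%s\n' "$ere_result"
```

The pattern is now character-for-character the same as the one passed to perl.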
There is also a URI module for perl that can be used to parse URIs into structured objects:
perl -MURI -lpe '$_ = URI->new($_)->path' < your-file
Note that this discards the query string (as in http://host/path?query) and the fragment (as in http://host/file.html#anchor), if any. If you want the query to be included when present, replace ->path with ->path_query.
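The regex-based commands, by contrast, keep whatever follows the host, including any query string or fragment. If you want path-only output from sed as well, one possible sketch is to add a second expression that drops everything from the first ? or # onward. The sample URLs below are illustrative.

```shell
# First expression: strip scheme and host. Second: strip query/fragment.
path_only=$(printf '%s\n' \
  'http://host/path?query' \
  'http://host/file.html#anchor' \
  'https://example.com/blog/' |
  LC_ALL=C sed -e 's|^\([^/:]\{1,\}:\)\{0,1\}//[^/]*||' -e 's|[?#].*||')
printf '%s\n' "$path_only"
```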
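Answer 2
This is easy to do with standard Linux command-line tools such as cut and sed:
cut -d '/' -f 4- somefilewithyoururls.txt | sed 's/^/\//'
This keeps everything after the third / (field 4 onward, since the host is field 3), then inserts a / at the start of each line. No complicated regular expression is needed.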
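To see why the path starts at the fourth field, it helps to look at how cut splits a URL on /. The sketch below uses one of the URLs from the question; the variable names are only for illustration.

```shell
url='https://www.example.com/cases/page/4/'

# Field 1 is the scheme ("https:"), field 2 is empty (it sits between the
# two slashes), field 3 is the host, and fields 4 onward are the path.
scheme=$(printf '%s\n' "$url" | cut -d '/' -f 1)
host=$(printf '%s\n' "$url" | cut -d '/' -f 3)
path="/$(printf '%s\n' "$url" | cut -d '/' -f 4-)"

printf '%s\n' "$scheme" "$host" "$path"
```

Because cut drops the delimiter in front of the first selected field, the leading / has to be put back, which is what the sed step in the pipeline does.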
Answer 3
Using any sed:
$ sed 's:[^/]*//[^/]*::' file
/
/
/blog/
/blog/
/blog/
/blog/
/blog/
/blog/
/cases/page/4/
/cdn-cgi/challenge-platform/h/g/cv/result/7c9123dc38da6841
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
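Since the question also mentions awk, the same substitution can be sketched with awk's sub() function. This is an illustrative equivalent rather than part of the original answer; the regular expression is passed as a string so that the slashes need no escaping, and the sample URLs are taken from the question.

```shell
# sub() replaces the leftmost match: scheme, "//", and host.
awk_result=$(printf '%s\n' \
  'http://www.example.com/' \
  'https://sub.example.co.uk/blog/' \
  'https://www.example.com/cases/page/4/' |
  awk '{ sub("[^/]*//[^/]*", ""); print }')
printf '%s\n' "$awk_result"
```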
Answer 4
Using Raku (formerly known as Perl_6)
~$ raku -MURL -ne 'my $url = URL.new($_); put "/" ~ .path.join("/") for $url;' file
Sample output:
/
/
/blog
/blog
/blog
/blog
/blog
/blog
/cases/page/4
/cdn-cgi/challenge-platform/h/g/cv/result/7c9123dc38da6841
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
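For Raku, loading the URL module is probably the cleanest answer, since it can also handle a username/password in the URL. Above, the recognized path elements are joined with / and prefixed with a forward slash, then printed with put. Note that, as the sample output shows, a trailing slash is not preserved.
Simplifying the code above shows which elements are recognized: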
~$ raku -MURL -ne 'my $url = URL.new($_); .raku.put for $url;' file
URL.new(scheme => "http", username => Str, password => Str, hostname => "www.example.com", port => Int, path => [], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => [], query => {}, fragment => Str)
URL.new(scheme => "http", username => Str, password => Str, hostname => "example.com", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "example.com", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.co.uk", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "example.co.uk", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "sub.example.co.uk", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => ["cases", "page", "4"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => ["cdn-cgi", "challenge-platform", "h", "g", "cv", "result", "7c9123dc38da6841"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => ["cdn-cgi", "challenge-platform", "h", "g", "scripts", "jsd", "7fe83wdcs", "invisible.js"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.co.uk", port => Int, path => ["cdn-cgi", "challenge-platform", "h", "g", "scripts", "jsd", "7fe83wdcs", "invisible.js"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "sub.example.co.uk", port => Int, path => ["cdn-cgi", "challenge-platform", "h", "g", "scripts", "jsd", "7fe83wdcs", "invisible.js"], query => {}, fragment => Str)
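If you really dare to parse URLs with regular expressions (are you sure none of the data is maliciously crafted?), then the following is a fairly direct translation of the Perl answer posted by @Stéphane_Chazelas: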
~$ raku -pe 's|^ ( <-[/:]>+ \: )? \/ \/ <-[/]>* ||;' < file