Removing the hostname from URLs using sed/awk

Question 1

With perl:

perl -pe 's|^([^/:]+:)?//[^/]*||' < your-file

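This removes an optional scheme (so that both http://host/path and //host/path are handled), followed by //, followed by any number of characters other than /. For instance, given ftp://user:password@host:8080/pub it removes ftp://user:password@host:8080 and leaves /pub.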

The sed equivalent could be:

LC_ALL=C sed 's|^\([^/:]\{1,\}:\)\{0,1\}//[^/]*||' < your-file

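In any case, the s/pattern/replacement/ operator of both sed and perl takes a regular expression as the pattern: so-called basic regular expressions for sed, and Perl regular expressions for perl (which improve on and extend the extended regular expressions that many sed implementations nowadays also support with the -E option).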
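As a small sketch of the -E variant mentioned above, the same substitution can be written in extended regular expression syntax, which drops most of the backslashes. The function name strip_host is hypothetical, and this assumes a sed that accepts -E (GNU and BSD sed both do).

```shell
#!/bin/sh
# strip_host (hypothetical helper name): read URLs on stdin, one per line,
# and print each one with the optional scheme and the //authority part removed.
# Assumes a sed implementation that supports -E (extended regular expressions).
strip_host() {
  LC_ALL=C sed -E 's|^([^/:]+:)?//[^/]*||'
}

# Sample input taken from the URLs discussed in this answer.
printf '%s\n' \
  'http://host/path' \
  '//host/path' \
  'ftp://user:password@host:8080/pub' | strip_host
```

The pattern is the same as in the perl one-liner; only the quoting of the group and the quantifiers differs from the basic regular expression form.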

There's also the URI module for perl, which can be used to parse URIs into structured objects.

perl -MURI -lpe '$_ = URI->new($_)->path' < your-file

Note that it discards the query string (as in http://host/path?query) and the fragment (as in http://host/file.html#anchor), if any. If you want to include the query (if any), replace ->path with ->path_query.
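To make the difference between path only and path plus query concrete, here is a rough sed-based approximation. It is not the URI module and does no real URI parsing. The helper names url_path and url_path_query are hypothetical, and the sketch assumes absolute URLs of the form scheme://authority/path?query#fragment and a sed that accepts -E.

```shell
#!/bin/sh
# url_path and url_path_query are hypothetical helper names.
# Both read URLs on stdin, one per line. This is a regex approximation only,
# not a substitute for a real URI parser.

# Path only: drop scheme and authority, then everything from the first ? or #.
url_path() {
  LC_ALL=C sed -E 's|^([^/:]+:)?//[^/?#]*||; s|[?#].*$||'
}

# Path plus query: drop scheme and authority, then only the #fragment.
url_path_query() {
  LC_ALL=C sed -E 's|^([^/:]+:)?//[^/?#]*||; s|#.*$||'
}

printf '%s\n' 'http://host/path?query' | url_path
printf '%s\n' 'http://host/path?query#anchor' | url_path_query
```

For anything beyond well-formed input, a parsing module remains the safer choice.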

Question 2

This is easy to do with Linux coreutils:

cut -d '/' -f 4- somefilewithyoururls.txt | sed 's/^/\//'

This cuts out everything after the third /, then puts a / back at the start of each line. No complicated regex needed.
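The field counting is easy to get wrong, so here is a minimal sketch that spells it out. The function name path_with_cut is hypothetical, and it assumes every input line has the form scheme://host/...

```shell
#!/bin/sh
# path_with_cut (hypothetical helper name): print the path of each URL on stdin.
# Splitting "http://host/a/b" on "/" gives the fields
#   1: "http:"   2: ""   3: "host"   4: "a"   5: "b"
# so the path starts at field 4. cut drops the slash before field 4,
# and sed puts a leading "/" back on every line.
path_with_cut() {
  cut -d '/' -f 4- | sed 's|^|/|'
}

printf '%s\n' \
  'https://www.example.com/cases/page/4/' \
  'http://www.example.com' | path_with_cut
```

A URL with no path at all still contains the delimiter, so cut emits an empty line for it and the result is a bare /.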

Question 3

With any sed:

$ sed 's:[^/]*//[^/]*::' file
/
/
/blog/
/blog/
/blog/
/blog/
/blog/
/blog/
/cases/page/4/
/cdn-cgi/challenge-platform/h/g/cv/result/7c9123dc38da6841
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js

Question 4

Using Raku (formerly known as Perl_6)

~$ raku -MURL -ne 'my $url = URL.new($_); put "/" ~ .path.join("/") for $url;'  file

Sample output:

/
/
/blog
/blog
/blog
/blog
/blog
/blog
/cases/page/4
/cdn-cgi/challenge-platform/h/g/cv/result/7c9123dc38da6841
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js
/cdn-cgi/challenge-platform/h/g/scripts/jsd/7fe83wdcs/invisible.js

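For Raku, loading the URL module is probably the cleanest answer, as it handles usernames/passwords in URLs. Above, the recognized path elements are joined with a forward slash /, a leading forward slash is prepended, and the result is output with put.

Simplifying the code above lets you see which elements are recognized: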

~$ raku -MURL -ne 'my $url = URL.new($_); .raku.put for $url;'  file
URL.new(scheme => "http", username => Str, password => Str, hostname => "www.example.com", port => Int, path => [], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => [], query => {}, fragment => Str)
URL.new(scheme => "http", username => Str, password => Str, hostname => "example.com", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "example.com", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.co.uk", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "example.co.uk", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "sub.example.co.uk", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => ["blog"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => ["cases", "page", "4"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => ["cdn-cgi", "challenge-platform", "h", "g", "cv", "result", "7c9123dc38da6841"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.com", port => Int, path => ["cdn-cgi", "challenge-platform", "h", "g", "scripts", "jsd", "7fe83wdcs", "invisible.js"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "www.example.co.uk", port => Int, path => ["cdn-cgi", "challenge-platform", "h", "g", "scripts", "jsd", "7fe83wdcs", "invisible.js"], query => {}, fragment => Str)
URL.new(scheme => "https", username => Str, password => Str, hostname => "sub.example.co.uk", port => Int, path => ["cdn-cgi", "challenge-platform", "h", "g", "scripts", "jsd", "7fe83wdcs", "invisible.js"], query => {}, fragment => Str)

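If you really dare to parse URLs with a regex (are you sure none of the data has been maliciously crafted?), then here is a fairly direct translation of the Perl answer posted by @Stéphane_Chazelas: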

~$ raku -pe 's|^ ( <-[/:]>+ \: )? \/ \/ <-[/]>* ||;'  < file

https://raku.land/cpan:TYIL/URL
https://raku.org
