Extracting PDF links from an already-downloaded index.html so they can be fetched with wget, even when there are several PDFs

Answer 1

Since you have GNU sed, you can presumably install GNU awk too. With GNU awk, which supports a multi-character RS and the RT variable:

$ awk -v RS='href="http[^"]+[.]pdf"' -F'"' 'RT{$0=RT; print $2}' file
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé-C1_L1.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé_C1_L2.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Henrot-Versillé_C3.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_L1_Bayesian.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_TD_Bayesian.pdf
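
To see why this works: when RS is a regexp, gawk sets RT to the text that actually matched RS at the end of each record, so every href match ends up in RT. Here is a minimal sketch on made-up input; the function name show_rt and the example.com URLs are assumptions for illustration, and it needs gawk specifically:

```shell
# Print the RT value gawk sees for each record of a tiny sample.
# show_rt and the example.com URLs are hypothetical, for illustration only.
show_rt() {
    printf '%s\n' '<a href="http://example.com/a.pdf">A</a> <a href="http://example.com/b.pdf">B</a>' |
    gawk -v RS='href="http[^"]+[.]pdf"' 'RT { print "RT=" RT }'
}
```

The final record has an empty RT because the input ends without another match, which is why the main command tests RT before printing.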

Otherwise, using any awk in any shell on every Unix box:

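$ awk '{
    # Consume every href="http...pdf" match on the current line.
    while ( match($0,/href="http[^"]+[.]pdf"/) ) {
        # Split the matched text on double quotes; f[2] is the URL.
        split(substr($0,RSTART,RLENGTH),f,/"/)
        print f[2]
        # Drop everything up to the end of this match, then rescan.
        $0 = substr($0,RSTART+RLENGTH)
    }
}' file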
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé-C1_L1.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé_C1_L2.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Henrot-Versillé_C3.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_L1_Bayesian.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_TD_Bayesian.pdf

Just pipe that output to xargs -n 1 curl -O to download the PDFs (assuming your URLs contain no spaces).
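
Put together, the extract-and-download step might look like the sketch below. The function names extract_pdf_urls and download_pdfs are made up for illustration, and FETCH is a hypothetical hook that lets you dry-run the loop without touching the network. It reads URLs one line at a time rather than through xargs, so it does not depend on how xargs splits words or treats quotes.

```shell
# Print every href="http...pdf" target in the given HTML file,
# one per line. Uses only POSIX awk features.
extract_pdf_urls() {
    awk '{
        while ( match($0, /href="http[^"]+[.]pdf"/) ) {
            split(substr($0, RSTART, RLENGTH), f, /"/)
            print f[2]
            $0 = substr($0, RSTART + RLENGTH)
        }
    }' "$1"
}

# Run the fetch command once per URL read from stdin.
# FETCH is a hypothetical override: it defaults to "curl -O",
# and FETCH=echo gives a dry run that only prints the URLs.
download_pdfs() {
    while IFS= read -r url; do
        ${FETCH:-curl -O} "$url"
    done
}

# Usage (this performs real downloads):
#   extract_pdf_urls index.html | download_pdfs
```

Since the original goal was wget, setting FETCH=wget should work the same way; wget can also read the URL list from stdin directly with -i -.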
