Toy problem:
$ echo "foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar" | sed -E 's@.*<a href=/topic/[^>]*>([^<]*)</a>.*@\1@'
Null hypothesis
Real world (sed filters nothing):
$ cat *html | grep '<a href="/topic' | sed -E 's@.*<a href=/topic/[^>]*>([^<]*)</a>.*@\1@'
<a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a>,
<a href="/topic/approximation/" data-sc="text link:topic link">Approximation</a>,
<a href="/topic/estimation-methods/" data-sc="text link:topic link">Estimation methods</a>,
<a href="/topic/statistical-variance/" data-sc="text link:topic link">Statistical variance</a>,
<a href="/topic/identifiability/" data-sc="text link:topic link">Identifiability</a>,
<a href="/topic/preliminary-estimates/" data-sc="text link:topic link">Preliminary estimates</a>,
<a href="/topic/matrix-inversion/" data-sc="text link:topic link">Matrix inversion</a>
What needs to change to get "Null hypothesis"?
PS:
$ cat *html | grep -n10 '<a href="/topic' | sed -E 's@.*<a href=/topic/[^>]*>([^<]*)</a>.*@\1@'
538-
539-
540-
541-
542-
543-
544- <div class="topics-list mtl">
545- <p class="hide">You can always find the topics here!</p>
546- <strong>Topics:</strong>
547-
548: <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a>,
549-
550: <a href="/topic/approximation/" data-sc="text link:topic link">Approximation</a>,
551-
552: <a href="/topic/estimation-methods/" data-sc="text link:topic link">Estimation methods</a>,
553-
554: <a href="/topic/statistical-variance/" data-sc="text link:topic link">Statistical variance</a>,
555-
556: <a href="/topic/identifiability/" data-sc="text link:topic link">Identifiability</a>,
557-
558: <a href="/topic/preliminary-estimates/" data-sc="text link:topic link">Preliminary estimates</a>,
559-
560: <a href="/topic/matrix-inversion/" data-sc="text link:topic link">Matrix inversion</a>
561-
562- </div>
563-
564- <div class="mvl left">
565-
566-
567-
568-
569-<div id="flag-description" aria-live="assertive">
570- <a class="hover" data-qa="give-feedback" data-toggle="flag-reason" href="#" title="Give feedback on the topics for this item.">
PS2: the complete *html file: https://pastebin.com/RLnWXKWe
Answer 1
cat *html | grep -oE '\"\/.*\/\"' | awk -F'/' '{print $(NF-1)}'
This should work fine.
Answer 2
You should try running your first command up to (but not including) the |, i.e., just the echo command.
$ echo "foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar"
foo <a href=/topic/null-hypothesis/ data-sc=text link:topic link>Null hypothesis</a> bar
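See the problem? Were you expecting href= and data-sc= to be followed by quoted strings?
Your echo command is wrong. The opening " is not matched with the closing one; it is matched with the first " found after it: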
$ echo "foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar"
       ↑▲▲▲▲▲▲▲▲▲▲▲▲↑.......................↑▲▲▲▲▲▲▲▲▲↑....................↑▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲↑
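The characters with a ▲ beneath them are between quotes, so you have successfully quoted the < and > characters, which would cause chaos if they were left unquoted. The characters with a . beneath them are not quoted. And the quote characters themselves (marked ↑) are not quoted either: the shell treats them as syntax and removes them!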
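To see how the shell actually splits that line, here is a minimal sketch that passes the same badly quoted text to `set --` instead of echo and prints each resulting argument in brackets. The variable name `count` is only for illustration.

```shell
# Same quoting as the broken echo command: the inner double quotes end and
# restart quoting, so the unquoted spaces split the text into several words.
set -- "foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar"
count=$#

# Print one argument per line, wrapped in brackets.
printf '[%s]\n' "$@"
```

The shell ends up with three separate arguments and none of them contains a " character; echo simply joins them with spaces, which is why the output looks almost right.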
The simplest way to quote quotes is to use the other kind of quote, i.e., change the first and last " to '.
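As a minimal sketch of that change, here is the toy echo with single quotes on the outside. The variable name `line` is only for illustration.

```shell
# Outer single quotes keep the inner double quotes as literal characters.
line=$(echo 'foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar')

# The printed text now still contains href="..." and data-sc="...".
printf '%s\n' "$line"
```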
Then fix your sed command to work with the corrected toy data, like this:
sed -E 's@.*<a href="/topic/[^>]*>([^<]*)</a>.*@\1@'
Note the " added after href=.
You should then be fine with the real data too.
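Putting it together, here is a sketch of the corrected substitution run against a few lines copied from the question's output. The `sample` and `topics` variables are illustrative stand-ins for the real HTML files. As an optional variation that is not part of the fix itself, `-n` with the `p` flag makes sed print only the lines where the substitution succeeded, so the separate grep step is not needed.

```shell
# A few lines standing in for the real HTML (taken from the question).
sample='  <strong>Topics:</strong>
  <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a>,
  <a href="/topic/approximation/" data-sc="text link:topic link">Approximation</a>,'

# The pattern now includes the " after href=, so it matches the real markup.
# -n suppresses default output; the trailing p prints only substituted lines.
topics=$(printf '%s\n' "$sample" | sed -nE 's@.*<a href="/topic/[^>]*>([^<]*)</a>.*@\1@p')
printf '%s\n' "$topics"
```

On this sample the sketch prints Null hypothesis and Approximation on separate lines, and the Topics: line is dropped because it does not match.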