Grouping and capturing works for the toy problem, but not in the real world

Toy problem:

$ echo "foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar" | sed -E 's@.*<a href=/topic/[^>]*>([^<]*)</a>.*@\1@'
Null hypothesis

Real world (sed filters nothing; every line comes out unchanged):

$ cat *html | grep '<a href="/topic' | sed -E 's@.*<a href=/topic/[^>]*>([^<]*)</a>.*@\1@'
                <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a>, 
                <a href="/topic/approximation/" data-sc="text link:topic link">Approximation</a>, 
                <a href="/topic/estimation-methods/" data-sc="text link:topic link">Estimation methods</a>, 
                <a href="/topic/statistical-variance/" data-sc="text link:topic link">Statistical variance</a>, 
                <a href="/topic/identifiability/" data-sc="text link:topic link">Identifiability</a>, 
                <a href="/topic/preliminary-estimates/" data-sc="text link:topic link">Preliminary estimates</a>, 
                <a href="/topic/matrix-inversion/" data-sc="text link:topic link">Matrix inversion</a>

What do I need to change to get "Null hypothesis"?

PS:

$ cat *html | grep -n10 '<a href="/topic' | sed -E 's@.*<a href=/topic/[^>]*>([^<]*)</a>.*@\1@'
538-
539-                
540-
541-
542-
543-    
544-        <div class="topics-list mtl">
545-            <p class="hide">You can always find the topics here!</p>
546-            <strong>Topics:</strong>
547-            
548:                <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a>, 
549-            
550:                <a href="/topic/approximation/" data-sc="text link:topic link">Approximation</a>, 
551-            
552:                <a href="/topic/estimation-methods/" data-sc="text link:topic link">Estimation methods</a>, 
553-            
554:                <a href="/topic/statistical-variance/" data-sc="text link:topic link">Statistical variance</a>, 
555-            
556:                <a href="/topic/identifiability/" data-sc="text link:topic link">Identifiability</a>, 
557-            
558:                <a href="/topic/preliminary-estimates/" data-sc="text link:topic link">Preliminary estimates</a>, 
559-            
560:                <a href="/topic/matrix-inversion/" data-sc="text link:topic link">Matrix inversion</a>
561-            
562-        </div>
563-
564-        <div class="mvl left">
565-            
566-
567-
568-
569-<div id="flag-description" aria-live="assertive">
570-    <a class="hover" data-qa="give-feedback" data-toggle="flag-reason" href="#" title="Give feedback on the topics for this item.">

PS2: The complete *html file: https://pastebin.com/RLnWXKWe

Answer 1

cat *html | grep -oE '\"\/.*\/\"' | awk -F'/' '{print $(NF-1)}'
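This should work. Note that it prints the last path component of each URL (for example null-hypothesis), not the link text.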
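To make the behaviour of that pipeline concrete, here is a minimal sketch run on one line copied from the question. The variable names `line` and `slug` are only for illustration, and the backslashes in the grep pattern are dropped because neither `"` nor `/` needs escaping in an extended regular expression.

```shell
# One line from the question's grep output (leading whitespace omitted).
line='<a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a>, '

# grep -o keeps only the quoted path; awk then prints the field just
# before the last "/"-separated field, i.e. the final path component.
slug=$(printf '%s\n' "$line" | grep -oE '"/.*/"' | awk -F'/' '{print $(NF-1)}')
printf '%s\n' "$slug"
```

On this input the result is the URL slug. Any other quoted value in the HTML that starts and ends with a slash would be picked up as well, so the earlier grep for '<a href="/topic' is probably still worth keeping in front.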

Answer 2

You should try running your first command up to (but not including) the |, i.e. just the echo command:

$ echo "foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar"
foo <a href=/topic/null-hypothesis/ data-sc=text link:topic link>Null hypothesis</a> bar

See the problem? Were you expecting quotes around the strings that follow href= and data-sc=?

Your echo command is wrong. The opening " is not paired with the final "; it is paired with the first " that follows it:

$ echo "foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar"
       ↑▲▲▲▲▲▲▲▲▲▲▲▲↑.......................↑▲▲▲▲▲▲▲▲▲↑....................↑▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲↑

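The characters with ▲ beneath them are between quotes, so you have successfully quoted the < and > characters, which would cause chaos if they were left unquoted. The characters with . beneath them are not quoted. And the quote marks themselves (marked ↑) are not quoted either, so the shell removes them before echo ever sees them!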

The easiest way to quote quotes is to use the other kind of quote, i.e. change the first and the last " to '.
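A small sketch of the difference, assuming a POSIX-style shell; the variable names `broken` and `fixed` are just for illustration:

```shell
# Double quotes outside: the inner " characters end and restart the quoting,
# and the shell strips all of them before echo runs.
broken=$(echo "foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar")

# Single quotes outside: the inner " characters are ordinary data.
fixed=$(echo 'foo <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a> bar')

printf '%s\n' "$broken"
printf '%s\n' "$fixed"
```

The first line printed has lost its double quotes, as in the echo output above; the second keeps them, so it looks like the lines in the real HTML.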

Then fix your sed command so that it works with the corrected toy data, like this:

sed -E 's@.*<a href="/topic/[^>]*>([^<]*)</a>.*@\1@'
Note the " added after href=.

With that change, the command should work on the real data too.
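As a rough check, here is the corrected expression applied to two lines copied from the question. The `sample` and `topics` variables are only for illustration. As a variation that is not part of the answer itself, `-n` together with the `p` flag makes sed print only the lines where the substitution succeeded, so the separate grep is not strictly needed.

```shell
# Two of the matching lines from the question, embedded as sample input.
sample='                <a href="/topic/null-hypothesis/" data-sc="text link:topic link">Null hypothesis</a>,
                <a href="/topic/approximation/" data-sc="text link:topic link">Approximation</a>'

# The pattern now contains the " after href=, so it matches the real markup.
topics=$(printf '%s\n' "$sample" |
  sed -nE 's@.*<a href="/topic/[^>]*>([^<]*)</a>.*@\1@p')
printf '%s\n' "$topics"
```

This prints one link text per line. Against the real files the same sed expression would be fed from cat *html instead of the sample variable.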
