How do I extract lines where the number following a given string is above a threshold?

I have a file (list_20.txt) that looks like this:

[{"d_prime":"0.475425","variation1":"rs909776","r2":"0.057940","variation2":"rs16991816","population_name":"1000GENOMES:phase_3:KHV"}]
[{"r2":"0.057940","variation1":"rs909776","d_prime":"0.475425","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs16991819"}]
[{"variation1":"rs909776","r2":"0.078476","d_prime":"0.546491","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269","r2":"0.073418","variation1":"rs6130034","d_prime":"0.528588"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs1201686","r2":"0.060239","variation1":"rs3746539","d_prime":"0.271891"}]
[{"variation2":"rs1201686","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.280262","r2":"0.058212","variation1":"rs2144011"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs10485662","r2":"0.058826","variation1":"rs844808","d_prime":"0.423639"}]
[{"variation2":"rs6065565","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.638509","r2":"0.110749","variation1":"rs6139746"}]
[{"r2":"0.110749","variation1":"rs6139746","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072936"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6065562","variation1":"rs6139746","r2":"0.091021","d_prime":"0.606214"}]
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
...

I want to extract only the lines where the value after "r2":" is greater than 0.7 and less than or equal to 1.

In this example, the expected output is just this line:

[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]

I tried this:

awk '$NF >= 0.8 && $NF <1 {print $0}' list_20.txt  > 20.out

But I get an empty file. Besides, this command is not specific to the string of interest, "r2":".

Answer 1

Since this looks like JSON, let's use a command-line JSON parser:

$ jq '.[] | select((.r2|tonumber) > 0.7 and (.r2|tonumber) <= 1)' file
{
  "variation1": "rs6139746",
  "r2": "0.910749",
  "d_prime": "0.638509",
  "population_name": "1000GENOMES:phase_3:KHV",
  "variation2": "rs6072937"
}

We have to use tonumber to convert the value of the r2 key from a string to a proper number, but apart from that it is a simple select() filter.

We can shorten it slightly, or at least avoid converting each number twice, with:

jq '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1)' file

If you want the result in the same format as the input, use:

$ jq -c '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1) | [.]' file
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]

That is, request "compact output" with -c, and use [.] to wrap each result extracted by the select() filter in an array.
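To see why the tonumber conversion matters, here is a small check that does not need the input file. It relies on jq's documented ordering of types, in which any string sorts after any number, so comparing the raw string against 0.7 is true no matter what digits the string contains. The variable names as_string and as_number are only for illustration.

```shell
# Compare the raw string with a number: jq orders strings after numbers,
# so this is true even though 0.057940 is numerically below 0.7.
as_string=$(jq -n '"0.057940" > 0.7')

# Convert first, then compare numerically.
as_number=$(jq -n '("0.057940" | tonumber) > 0.7')

printf 'string: %s, number: %s\n' "$as_string" "$as_number"
# prints: string: true, number: false
```

Without the conversion, the `> 0.7` test would accept every record and the `<= 1` test would reject every record, so the filter would print nothing.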

Answer 2

Using awk:

awk 'match($0, /"r2":"[^"]+"/) {
  t = substr($0, RSTART+6, RLENGTH-7)
  f = 0.7<t+0 && t+0<=1
  if ( f ) print 
}' list_20.txt 

You can also do this in perl:

perl -lne '
  print if /"r2":"(.*?)"/ and 0.7<$1 && $1<=1;
' list_20.txt

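Both commands look for the quoted key "r2" and capture the value that follows it. They then apply the range check and print the line if the value falls within the range.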
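If the bounds are likely to change, the same match() idea can take them as awk variables. This is only a sketch: lo and hi are names picked for illustration, and the two sample records are copied from the question into a temporary file so that the snippet runs on its own.

```shell
# Two records from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6065562","variation1":"rs6139746","r2":"0.091021","d_prime":"0.606214"}]
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
EOF

# lo and hi (hypothetical names) hold the exclusive lower and inclusive upper bound.
result=$(awk -v lo=0.7 -v hi=1 '
  match($0, /"r2":"[^"]+"/) {
    # Skip the 6 characters of "r2":" and drop the closing quote; +0 makes it a number.
    v = substr($0, RSTART + 6, RLENGTH - 7) + 0
    if (v > lo && v <= hi) print
  }' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Only the second record is printed, since 0.910749 is the only value in (0.7, 1].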

Answer 3

awk -F'[][{},]' '{
  for (i=3;i<=NF-2;i++){
    if ($i ~ /^"r2"/){
      r2=substr($i, 7, length($i)-7)+0   # +0 forces a numeric comparison below
      if (r2>0.7 && r2<=1){ print; break }
    }
  }
}' list_20.txt > 20.out

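This uses ], [, {, } and , as field separators. It then loops over the fields of each record, skipping the first two and the last two (they are always empty).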

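It tests whether the current field starts with "r2" and extracts the value with substr($i, 7, length($i)-7), i.e. it skips the first 6 characters "r2":" and leaves out the final character ".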

If the value is within the range, it prints the record and breaks out of the loop.
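One detail about the range check is worth illustrating. As far as I know, awk treats the result of substr() as a string, and comparing a string with a numeric constant is done as a string comparison. Adding 0 to the value forces a numeric comparison instead. The value 1.000000 below is a made-up boundary case, not taken from the file.

```shell
out=$(awk 'BEGIN {
  # Same extraction as in the loop above, applied to a hypothetical field.
  s = substr("\"r2\":\"1.000000\"", 7, 8)   # s is the string 1.000000
  print (s <= 1)       # string comparison: "1.000000" sorts after "1"
  print (s + 0 <= 1)   # numeric comparison: 1 <= 1
}')
printf '%s\n' "$out"
# prints 0, then 1
```

For the values in the sample file both forms happen to agree, but a value written as 1.000000 or 0.70 would be classified wrongly by the string comparison.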

Answer 4

If the numbers are written as plain decimal floating-point values, you can grep for the matching lines like this:

$  LC_ALL=C grep -E '"r2":"((0?\.(7[0-9]*[1-9][0-9]*|[89][0-9]*))|1(\.0*)?)"' list_20.txt 

The -E option enables extended regular expressions.
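Since the regular expression encodes the numeric range as text, it is worth checking it against a few boundary values. The sketch below feeds it minimal made-up records (only the r2 key, with values chosen to sit around the limits) and keeps whatever matches.

```shell
# Build one tiny record per test value and run the same grep over them.
matched=$(
  for v in 0.69 0.7 0.70 0.700001 0.8 0.910749 1 1.000 1.01; do
    printf '[{"r2":"%s"}]\n' "$v"
  done |
  LC_ALL=C grep -E '"r2":"((0?\.(7[0-9]*[1-9][0-9]*|[89][0-9]*))|1(\.0*)?)"'
)
printf '%s\n' "$matched"
```

This keeps the records for 0.700001, 0.8, 0.910749, 1 and 1.000, and drops 0.69, 0.7, 0.70 and 1.01. Note that the pattern only understands plain decimal notation, so a value written in exponent form such as 7.5e-1 would not match.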
