I have a file (list_20.txt) that looks like this:
[{"d_prime":"0.475425","variation1":"rs909776","r2":"0.057940","variation2":"rs16991816","population_name":"1000GENOMES:phase_3:KHV"}]
[{"r2":"0.057940","variation1":"rs909776","d_prime":"0.475425","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs16991819"}]
[{"variation1":"rs909776","r2":"0.078476","d_prime":"0.546491","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269","r2":"0.073418","variation1":"rs6130034","d_prime":"0.528588"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs1201686","r2":"0.060239","variation1":"rs3746539","d_prime":"0.271891"}]
[{"variation2":"rs1201686","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.280262","r2":"0.058212","variation1":"rs2144011"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs10485662","r2":"0.058826","variation1":"rs844808","d_prime":"0.423639"}]
[{"variation2":"rs6065565","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.638509","r2":"0.110749","variation1":"rs6139746"}]
[{"r2":"0.110749","variation1":"rs6139746","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072936"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6065562","variation1":"rs6139746","r2":"0.091021","d_prime":"0.606214"}]
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
...
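I want to extract only the lines whose value after "r2":" is greater than 0.7 and less than or equal to 1.
In this example, the expected output is just this line: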
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
I tried this:
awk '$NF >= 0.8 && $NF <1 {print $0}' list_20.txt > 20.out
But I get an empty file. Also, this command is not specific to the string of interest: "r2":"
Answer 1
Since this looks like JSON, let's use a command-line JSON parser:
$ jq '.[] | select((.r2|tonumber) > 0.7 and (.r2|tonumber) <= 1)' file
{
  "variation1": "rs6139746",
  "r2": "0.910749",
  "d_prime": "0.638509",
  "population_name": "1000GENOMES:phase_3:KHV",
  "variation2": "rs6072937"
}
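We have to use tonumber to convert the value of the r2 key from a string to a proper number, but other than that, it is a simple select() filter.
We can shorten it slightly, or at least avoid converting each number twice, with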
jq '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1)' file
If you want the result in the same format as the input, use
$ jq -c '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1) | [.]' file
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
That is, ask for "compact output" with -c, and use [.] to create an array for each result extracted by the select() filter.
Answer 2
Using awk:
awk 'match($0, /"r2":"[^"]+"/) {
t = substr($0, RSTART+6, RLENGTH-7)
f = 0.7<t+0 && t+0<=1
if ( f ) print
}' list_20.txt
You can also do this in perl:
perl -lne '
print if /"r2":"(.*?)"/ and 0.7<$1 && $1<=1;
' list_20.txt
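We look for the string r2 in quotes and the quoted value that follows it. We then apply the range check to that value and print the line if it falls within the range.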
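The t+0 in the awk version matters because substr() returns a string, and awk compares a string with a numeric constant as text rather than as a number. Here is a minimal sketch of the difference, using a hypothetical r2 value of 1.000000 that does not occur in the sample data:

```shell
# The value is cut out of a longer string, so awk treats it as a string.
result=$(awk 'BEGIN {
    t = substr("x1.000000", 2)      # t is the string "1.000000"
    as_string = (t <= 1)            # string comparison: "1.000000" vs "1"
    as_number = (t + 0 <= 1)        # numeric comparison: 1 <= 1
    print as_string, as_number
}')
echo "$result"
```

This prints 0 1: compared as text, 1.000000 sorts after 1 and would be rejected, while the numeric comparison accepts it.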
Answer 3
awk -F'[][{},]' '{
  for (i=3;i<=NF-2;i++){
    if ($i ~ /^"r2"/){
      # substr() returns a string; adding 0 makes the comparisons numeric
      r2=substr($i, 7, length($i)-7)+0
      if (r2>0.7 && r2<=1){ print; break }
    }
  }
}' list_20.txt > 20.out
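Use ], [, {, } and , as field separators. Then loop over the fields in each record, skipping the first two and the last two fields (since they are always empty).
Test whether the current field starts with "r2" and extract the value with substr($i, 7, length($i)-7), that is, skip the first 6 characters "r2":" and drop the last character ".
If the value is in range, print the record and break out of the loop.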
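To see why the loop runs from field 3 to NF-2, it can help to print every field that this separator produces. Here is a small sketch on a shortened, made-up record (the key "a" is only there for illustration):

```shell
fields=$(printf '%s\n' '[{"a":"1","r2":"0.9"}]' |
    awk -F'[][{},]' '{
        # Show each field between angle brackets so empty ones are visible.
        for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i
    }')
echo "$fields"
```

Fields 1, 2, 5 and 6 come out empty here, because [ and { open the line and } and ] close it. Only fields 3 and 4 carry key/value pairs.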
Answer 4
If the numbers are floats written in plain decimal notation, you can grep out the lines like this:
$ LC_ALL=C grep -E '"r2":"((0?\.(7[0-9]*[1-9][0-9]*|[89][0-9]*))|1(\.0*)?)"' list_20.txt
The -E option turns on extended regular expressions.
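Because the range check is encoded in the pattern itself, it is worth checking the boundary cases. Here is a quick sketch that feeds a few made-up r2 values (not taken from the sample file) through the same expression:

```shell
pattern='"r2":"((0?\.(7[0-9]*[1-9][0-9]*|[89][0-9]*))|1(\.0*)?)"'
# printf reuses the format once per argument, giving one test line per value.
matched=$(printf '"r2":"%s"\n' 0.7 0.700000 0.700001 0.85 .9 1 1.000000 1.000001 0.057940 |
    LC_ALL=C grep -E "$pattern")
echo "$matched"
```

Only 0.700001, 0.85, .9, 1 and 1.000000 get through. Exactly 0.7 (with or without trailing zeros), anything above 1, and small values such as 0.057940 are rejected. Values in exponent notation would not be recognised by this pattern.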