How can I randomly reorder two columns together 1000 times using simulated values?

Answer 1

In every case below, "| column -t" is appended only to align the output visually.

1) I should create two columns, named "simulation1" and "simulation2", containing random numbers

$ cat tst.awk
BEGIN { srand(seed) }
{ print $0, r(), r() }
function r() { return rand() * 100001 / 1000 }

$ awk -f tst.awk file | column -t
231   0.12    85.5574  23.7444
432   0.32    23.558   65.5853
11    0.0003  59.2486  50.3799
134   0.33    27.8248  45.7872
2334  0.553   45.7947  13.1887
12    0.33    51.6042  99.55
100   0.331   88.0281  17.4515
1008  1.6     1.37974  65.5945
223   -0.81   14.6773  97.6476
998   -3.001  87.888   31.97

2) After that, I sort the "ID" and "pheno" columns by the values in the "simulation1" column

$ awk -f tst.awk file | sort -k3,3n | column -t
1008  1.6     1.37974  65.5945
223   -0.81   14.6773  97.6476
432   0.32    23.558   65.5853
134   0.33    27.8248  45.7872
2334  0.553   45.7947  13.1887
12    0.33    51.6042  99.55
11    0.0003  59.2486  50.3799
231   0.12    85.5574  23.7444
998   -3.001  87.888   31.97
100   0.331   88.0281  17.4515

3) Then I calculate the mean "pheno" of the first 40% of the rows

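$ cat tst2.awk
{ vals[NR] = $2 }                   # remember the pheno value of every row
END {
    max = int(NR * 40 / 100)        # whole number of rows in the first 40%
    for (i=1; i<=max; i++) {
        sum += vals[i]
    }
    if (max > 0) print sum / max    # mean pheno of those rows
}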

$ awk -f tst.awk file | sort -k3,3n | awk -f tst2.awk
0.36
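
To make the 40% cut-off concrete, here is a minimal sketch of step 3 on its own. The top40_mean name is hypothetical, and truncating the row count with int() is an assumption about what should happen when 40% of the rows is not a whole number; rounding would be another reasonable choice.

```shell
#!/usr/bin/env bash
# Mean of column 2 over the first 40% of the rows read from stdin.
# Assumption: the row count is truncated to a whole number.
top40_mean() {
    awk '{ vals[NR] = $2 }
         END {
             max = int(NR * 40 / 100)
             for (i = 1; i <= max; i++) sum += vals[i]
             if (max > 0) print sum / max
         }'
}

# The ten rows in the order produced by step 2 (sorted by simulation1).
sorted='1008 1.6
223 -0.81
432 0.32
134 0.33
2334 0.553
12 0.33
11 0.0003
231 0.12
998 -3.001
100 0.331'

# Ten rows: 4 rows are averaged, (1.6 - 0.81 + 0.32 + 0.33) / 4.
printf '%s\n' "$sorted" | top40_mean                # prints 0.36

# Seven rows: int(2.8) = 2 rows are averaged, (1.6 - 0.81) / 2.
printf '%s\n' "$sorted" | head -n 7 | top40_mean    # prints 0.395
```

Without the truncation, a seven-row input would add up 2 values but divide by 2.8, which is why the whole-number row count matters.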

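I expect you can figure out the rest. In the above, awk got the same seed on every invocation so that the output stays the same, which makes it easier to follow the calculation from stage to stage. Change the call to tst.awk to awk -v seed="$RANDOM" -f tst.awk to generate different random numbers on each invocation.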
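
One way the remaining loop could look is sketched below. It is not the answerer's code. The simulate and top40_mean names, the temporary copy of the sample data, and the assumption that the input has no header line are all mine. It also combines two $RANDOM draws into one seed, because a single $RANDOM only takes values from 0 to 32767 and repeated seeds become likely over 1000 iterations.

```shell
#!/usr/bin/env bash
# Sketch only: repeat steps 1-3 N times with a fresh seed on each pass.

# Sample input (ID, pheno); assumed to have no header line.
workdir=$(mktemp -d)
file=$workdir/file
cat > "$file" <<'EOF'
231 0.12
432 0.32
11 0.0003
134 0.33
2334 0.553
12 0.33
100 0.331
1008 1.6
223 -0.81
998 -3.001
EOF

# Mean pheno (column 2) over the first 40% of the rows read from stdin.
top40_mean() {
    awk '{ vals[NR] = $2 }
         END {
             max = int(NR * 40 / 100)
             for (i = 1; i <= max; i++) sum += vals[i]
             if (max > 0) print sum / max
         }'
}

# simulate N: print N lines of "iteration <TAB> mean1 <TAB> mean2".
simulate() {
    local n=$1 i seed rows m1 m2
    for (( i = 1; i <= n; i++ )); do
        # Combine two draws so the seed has about 2^30 possible values.
        seed=$(( RANDOM * 32768 + RANDOM ))
        # Append the two random columns; fixed-point output avoids
        # exponent notation, which sort -n would not order correctly.
        rows=$(awk -v seed="$seed" 'BEGIN { srand(seed) }
                   { printf "%s %.6f %.6f\n", $0, rand() * 100, rand() * 100 }' "$file")
        m1=$(printf '%s\n' "$rows" | LC_ALL=C sort -k3,3n | top40_mean)  # order by simulation1
        m2=$(printf '%s\n' "$rows" | LC_ALL=C sort -k4,4n | top40_mean)  # order by simulation2
        printf '%d\t%s\t%s\n' "$i" "$m1" "$m2"
    done
}

simulate 5    # short demo run
```

For the full run, call simulate 1000 > results.tsv. Each pass starts several processes, so 1000 iterations take a few seconds.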

Answer 2

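Updated the script: bc does not cope well with a leading sign on numbers, so domath now uses awk instead.

Also changed it to use shuf to shuffle the array indices on each iteration, since working with a fixed array is simpler.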

#!/bin/bash

function domath {
    #do the math using the 4 indices into the pheno array
    awk '{print ($1+$2+$3+$4)/4}' <<<"${ph[$1]} ${ph[$2]} ${ph[$3]} ${ph[$4]}"
}

function iterate {
    #randomise the indices and get the first 4
    shuf -e 0 1 2 3 4 5 6 7 8 9 | head -n 4
}

#number of iterations
nits=100

#read the pheno values into an array
ph=($(tail -n +3 data | awk '{print $2}'))


echo -e row'\t'sim1'\t'sim2'\t'diff
for (( row=1; row<=$nits; row++ )); do
    #calculate simulation1 
    first=$(printf "%+.3f" $(domath $(iterate)))
    #calculate simulation 2
    second=$(printf "%+.3f" $(domath $(iterate)))
    #calculate the difference
    diff=$(printf "%+.3f" $(awk '{print $2-$1}' <<<"$first $second"))
    #and print
    echo -e $row'\t'$first'\t'$second'\t'$diff
done
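
The script above hard-codes ten indices and a sample of four. If the data file can grow, both numbers could be derived from the array instead. The sketch below is one hedged way to do that. It assumes that four was chosen as 40% of ten rows, it uses a stand-in ph array rather than reading the data file, and it relies on the -i and -n options of GNU shuf.

```shell
#!/usr/bin/env bash
# Stand-in for the pheno array that the script reads from the data file.
ph=(0.12 0.32 0.0003 0.33 0.553 0.33 0.331 1.6 -0.81 -3.001)

# Print k distinct random indices into ph, where k is 40% of the
# array length rounded down (assumption: that is where 4 came from).
iterate() {
    local n=${#ph[@]}
    local k=$(( n * 40 / 100 ))
    shuf -i 0-$(( n - 1 )) -n "$k"
}

# Average the pheno values at the given indices, however many there are.
domath() {
    local idx
    for idx in "$@"; do
        printf '%s\n' "${ph[idx]}"
    done | awk '{ sum += $1 } END { if (NR > 0) print sum / NR }'
}

# One simulated mean; $(iterate) is left unquoted on purpose so that
# each index becomes a separate argument.
domath $(iterate)
```

With these two functions the main loop of the script can stay as it is, since it only calls domath $(iterate).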
