How do I calculate the mean of every 3 columns in a data frame?

I have a data frame like this:

> head(dat_sg2)
               DwoC_2318_norm.1 DwoC_2318_norm.2 DwoC_2318_norm.3 DwoC_3395_norm.1 DwoC_3395_norm.2 DwoC_3395_norm.3 DwoC_6154_norm.1
Ku8QhfS0n_hIOABXuE         4.865523         4.806292         4.478393         4.539028         4.050325         4.440587         4.110421
Bx496XsFXiAlj.Eaeo         6.123590         6.423548         6.561369         5.856075         5.858094         5.930103         5.801459
W38p0ogk.wIBVRXllY         7.791964         7.648746         7.705958         7.561884         7.699504         7.676182         7.479021
QIBkqIS9LR5DfTlTS8         5.810877         5.579234         5.698071         5.088198         5.076525         5.367539         3.887972
BZKiEvS0eQ305U0v34         6.294961         6.358164         5.876450         5.414746         5.664350         5.924501         4.446681
6TheVd.HiE1UF3lX6g         5.268226         5.337910         5.420836         5.604646         5.007336         5.101670         5.590275

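I need a data frame that contains the mean of each group of 3 columns. For the sample above, the result I want has 6 rows and 2 columns, one for DwoC_2318 and one for DwoC_3395.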

The output should look like this:

                    DwoC_2318_mean       DwoC_3395_mean
Ku8QhfS0n_hIOABXuE       4.716736           4.343313
Bx496XsFXiAlj.Eaeo       …                     …
W38p0ogk.wIBVRXllY       …                     …
QIBkqIS9LR5DfTlTS8       …                     …
BZKiEvS0eQ305U0v34       …                     …
6TheVd.HiE1UF3lX6g       …                     …

where:

4.716736=(4.865523+4.806292+4.478393)/3

Note that my original data frame has 21 columns and about 20000 rows.

I think I could use R's apply function together with rowMeans here, but I don't know how to apply it to compute the mean of every 3 columns.

I tried this on the full data frame (df), which has 15568 rows and 21 columns:

groups=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7)
x=apply(df,1,function(x) tapply(x, list(groups), mean))

But instead of an output with 15568 rows and 7 columns, I get:

7 rows and 15568 columns.

Answer 1

I solved this by first transposing the data frame, because computing the mean of every 3 rows was easier for me. Afterwards I transposed it back.

# Read in the data
df <- read.table("DwoC", header = TRUE)
# Transpose it so the original columns become rows
df <- as.data.frame(t(df))
# Strip the .1, .2, ... suffixes from the row names and keep the unique names
rn <- unique(gsub("\\..*", "", rownames(df)))
n <- 3
# Average each consecutive block of n rows
dd <- aggregate(df, list(rep(1:(nrow(df) %/% n + 1), each = n, length.out = nrow(df))), mean)[-1]
# Transpose it back
dt <- as.data.frame(t(dd))
# Restore the column names, which were lost during the transpose step
names(dt) <- rn
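
For comparison, the apply/tapply attempt from the question appears to compute the right numbers already. apply returns one column per input row, which is why the result came out as groups by rows, and a single t() call flips it. A minimal sketch on made-up data (the name toy and its random values are hypothetical, not the poster's data):

```r
# Hypothetical toy data: 4 rows, 6 columns (two groups of 3 columns).
set.seed(1)
toy <- as.data.frame(matrix(rnorm(24), nrow = 4))
groups <- c(1, 1, 1, 2, 2, 2)

# apply() over rows stores each row's result as a column,
# so the result has one row per group and one column per input row.
x <- apply(toy, 1, function(r) tapply(r, groups, mean))
dim(x)    # 2 4

# Transposing gives one row per input row and one column per group.
res <- t(x)
dim(res)  # 4 2
```

On the full data frame the same t() call should turn the 7 x 15568 result into 15568 x 7.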

Answer 2

Based on calculating row means over a subset of columns:

> df = read.table('file')
> 
> data.frame(ID=df[,0], DwoC_2318_mean=rowMeans(df[1:3]), DwoC_3395_mean=rowMeans(df[4:6]))
                   DwoC_2318_mean DwoC_3395_mean
Ku8QhfS0n_hIOABXuE       4.716736       4.343313
Bx496XsFXiAlj.Eaeo       6.369502       5.881424
W38p0ogk.wIBVRXllY       7.715556       7.645857
QIBkqIS9LR5DfTlTS8       5.696061       5.177421
BZKiEvS0eQ305U0v34       6.176525       5.667866
6TheVd.HiE1UF3lX6g       5.342324       5.237884
> 
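
With 21 columns, writing out each rowMeans call by hand gets tedious. One way to generalize it is to group the column indices by name prefix and loop over the groups. The following is only a sketch: it assumes every column name ends in a suffix of the form _norm.N, as in the sample, and the data frame toy and its values are made up for illustration.

```r
# Hypothetical toy data with the same naming pattern as the question.
toy <- data.frame(
  A_norm.1 = c(1, 4),   A_norm.2 = c(2, 5),   A_norm.3 = c(3, 6),
  B_norm.1 = c(10, 40), B_norm.2 = c(20, 50), B_norm.3 = c(30, 60),
  row.names = c("r1", "r2")
)

# Drop the assumed "_norm.N" suffix to get one group label per column.
prefix <- sub("_norm\\.[0-9]+$", "", names(toy))

# Split the column indices by label, keeping the original column order.
idx <- split(seq_along(toy), factor(prefix, levels = unique(prefix)))

# Row means for each group of columns; sapply() binds them into a matrix.
means <- sapply(idx, function(i) rowMeans(toy[i]))
colnames(means) <- paste0(colnames(means), "_mean")
means <- as.data.frame(means)
means
```

Grouping by name rather than by position means the columns do not have to be adjacent, although it does depend on the naming pattern holding for every column.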

Answer 3

Since I'm not very good at R, I'll try a solution with awk instead:

$ awk 'NR == 1 { next } { j=0; for (i = 2; i+2 <= NF; i+=3) m[++j] = ($(i+0)+$(i+1)+$(i+2))/3; $0 = $1; for (i=1; i<=j; ++i) $(i+1)=m[i]; print }' file
Ku8QhfS0n_hIOABXuE 4.71674 4.34331
Bx496XsFXiAlj.Eaeo 6.3695 5.88142
W38p0ogk.wIBVRXllY 7.71556 7.64586
QIBkqIS9LR5DfTlTS8 5.69606 5.17742
BZKiEvS0eQ305U0v34 6.17653 5.66787
6TheVd.HiE1UF3lX6g 5.34232 5.23788

The awk script with annotations:

# Skip header
NR == 1 { next }

{
    j = 0

    # Go through the columns from column 2 onwards in groups of three columns,
    # calculating the average of each group and storing it in the array m.
    for (i = 2; i + 2 <= NF; i += 3)
        m[++j] = ($(i+0) + $(i+1) + $(i+2))/3

    # Rewrite the current row as the first column only.
    $0 = $1

    # Add the calculated averages as new columns after column 1.
    for (i = 1; i <= j; ++i)
        $(i+1) = m[i]

    print
}

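The code assumes that the number of columns after column 1 is a multiple of three. If there are one or two trailing columns (as in the example), that data is dropped.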
