I have several paired R1 and R2 fastq.gz files in a directory on a Linux system. The directory dir looks like this:
dir
|____sampleA_1.fastq.gz
|____sampleA_2.fastq.gz
|____sampleB_1.fastq.gz
|____sampleB_2.fastq.gz
|____sampleC_1.fastq.gz
|____sampleC_2.fastq.gz
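I want to create a txt file with the sample name as the first column, the path to the R1 fastq as the second column, and the path to the R2 fastq as the third column.
Inside dir I tried the following: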
find "$PWD" -name \*1.fastq.gz > list1.txt
find "$PWD" -name \*2.fastq.gz > list2.txt
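I would then have to merge these two files, add column names, and create yet another column with the sample names. Is there a way to produce the file with a single command instead?
The txt file should look like this: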
sample Second Third
sampleA dir/sampleA_1.fastq.gz dir/sampleA_2.fastq.gz
sampleB dir/sampleB_1.fastq.gz dir/sampleB_2.fastq.gz
sampleC dir/sampleC_1.fastq.gz dir/sampleC_2.fastq.gz
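Answer 1
If you can guarantee that each sample always has both files of its pair, this bash/ksh code generates the output from the _1 files that exist:
Example (setting up a demo environment):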
mkdir -p /tmp/710303/dir
cd /tmp/710303
touch dir/sample{A,B,C}_{1,2}.fastq.gz # Assumes a { }-aware shell
File generation (run inside the demo environment):
printf "%s %s %s\n" 'sample' 'Second' 'Third'
for f1 in dir/sample*_1.fastq* # Loop through all first samples
do
fn="${f1##*/}"; fn="${fn%%_*}" # Label
f2="${f1/_1.fastq/_2.fastq}" # Filename for second sample (match the suffix, not the first "1" in the path)
printf "%s %s %s\n" "$fn" "$f1" "$f2" # Output the values
done
Output
sample Second Third
sampleA dir/sampleA_1.fastq.gz dir/sampleA_2.fastq.gz
sampleB dir/sampleB_1.fastq.gz dir/sampleB_2.fastq.gz
sampleC dir/sampleC_1.fastq.gz dir/sampleC_2.fastq.gz
These are space-separated columns. If you want tab-separated output, change the printf format strings to use \t (tab) instead of a space.
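As a rough sketch of that tab-separated variant, the loop can be wrapped in a function and redirected to a file. The function name make_sample_table and the output name samples.txt are my own placeholders, not part of the answer above, and the sketch still assumes every _1 file has a matching _2 file.

```shell
# Hypothetical helper (name chosen for illustration only).
# Prints a tab-separated table: sample name, R1 path, R2 path.
# Assumes every *_1.fastq.gz file has a matching *_2.fastq.gz file.
make_sample_table() {
    dir=${1:-dir}                              # directory holding the fastq files
    printf '%s\t%s\t%s\n' 'sample' 'Second' 'Third'
    for f1 in "$dir"/*_1.fastq.gz; do
        [ -e "$f1" ] || continue               # skip the literal pattern if nothing matched
        fn=${f1##*/}                           # drop the directory part
        fn=${fn%_1.fastq.gz}                   # drop the suffix, leaving the sample name
        f2=${f1%_1.fastq.gz}_2.fastq.gz        # path of the matching R2 file
        printf '%s\t%s\t%s\n' "$fn" "$f1" "$f2"
    done
}

# Usage: make_sample_table dir > samples.txt
```

Stripping the full _1.fastq.gz suffix, rather than cutting at the first underscore, should also keep sample names that themselves contain an underscore or a digit intact.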
Answer 2
This looks unnecessarily complicated, but it handles the case where only one file of a pair exists:
{
printf '%s\n' sample Second Third
find ./dir/ -type f -name '*.fastq.gz' -print \
| cut -d _ -f 1 \
| sort -u \
| bash -c '
while read -r root; do
echo "${root##*/}"
for i in 1 2; do
f="${root}_${i}.fastq.gz"
[[ -f "$f" ]] && echo "$f" || echo ""
done
done
'
} \
| paste - - - \
| column -s $'\t' -t
Test:
mkdir dir
touch dir/sample{A,B,C}_{1,2}.fastq.gz
touch dir/sample{D_1,E_2}.fastq.gz
touch dir/ignore.me
The command above then outputs
sample Second Third
sampleA ./dir/sampleA_1.fastq.gz ./dir/sampleA_2.fastq.gz
sampleB ./dir/sampleB_1.fastq.gz ./dir/sampleB_2.fastq.gz
sampleC ./dir/sampleC_1.fastq.gz ./dir/sampleC_2.fastq.gz
sampleD ./dir/sampleD_1.fastq.gz
sampleE ./dir/sampleE_2.fastq.gz
Perhaps this GNU awk version is a bit tidier:
find ./dir -type f | gawk -F/ -v OFS='\t' '
BEGIN { print "sample", "Second", "Third" }
match($NF, /^(.*)_([12])\.fastq\.gz$/, m) {
file[m[1]][m[2]] = $0
}
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for (sample in file)
print sample, file[sample][1], file[sample][2]
}
' | column -s $'\t' -t
This produces the same output as above.
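If you only want to know which files are missing their mate before building the table, something along these lines may be enough. This is a minimal sketch; list_unpaired is a made-up name, and it assumes the _1.fastq.gz / _2.fastq.gz naming used in the question.

```shell
# Hypothetical helper (name chosen for illustration only).
# Prints every *_1.fastq.gz or *_2.fastq.gz file whose mate does not exist.
list_unpaired() {
    dir=${1:-dir}
    for f in "$dir"/*_[12].fastq.gz; do
        [ -e "$f" ] || continue                # skip the literal pattern if nothing matched
        case $f in
            *_1.fastq.gz) mate=${f%_1.fastq.gz}_2.fastq.gz ;;
            *)            mate=${f%_2.fastq.gz}_1.fastq.gz ;;
        esac
        [ -e "$mate" ] || printf '%s\n' "$f"   # report files without a mate
    done
}

# Usage: list_unpaired dir
```

With the test files created above, this should list only the sampleD and sampleE files.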
Answer 3
$ cat tst.awk
BEGIN {
FS="[/_]"; OFS="\t"
print "sample", "Second", "Third"
}
NR%2 { second = $0; next }
{ print $2, second, $0 }
$ printf '%s\n' dir/* | awk -f tst.awk
sample Second Third
sampleA dir/sampleA_1.fastq.gz dir/sampleA_2.fastq.gz
sampleB dir/sampleB_1.fastq.gz dir/sampleB_2.fastq.gz
sampleC dir/sampleC_1.fastq.gz dir/sampleC_2.fastq.gz
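Note that the script above pairs up consecutive input lines, so it relies on dir containing nothing but complete _1/_2 pairs in sorted order. A guarded variant of the same idea might look like the sketch below. The name paired_table and the error message are my own, and the glob is narrowed to *_[12].fastq.gz as an assumption about the file naming.

```shell
# Hypothetical wrapper (name chosen for illustration only).
# Pairs consecutive lines like the script above, but exits with status 1
# if an even line does not share its sample name with the preceding odd line,
# or if a trailing file is left without a mate.
paired_table() {
    dir=${1:-dir}
    printf '%s\n' "$dir"/*_[12].fastq.gz | awk '
        BEGIN { OFS = "\t"; print "sample", "Second", "Third" }
        {
            name = $0
            sub(/.*\//, "", name)                # drop the directory part
            sub(/_[12]\.fastq\.gz$/, "", name)   # drop the read suffix
        }
        NR % 2 { first = $0; prev = name; next } # odd lines: remember the R1 path
        name != prev { bad = 1; exit }           # even line does not match its R1
        { print name, first, $0 }
        END {
            if (bad || NR % 2) {                 # mismatch, or a leftover unpaired line
                print "unpaired fastq files in input" | "cat 1>&2"
                exit 1
            }
        }
    '
}

# Usage: paired_table dir > samples.txt || echo "check the input files" >&2
```

Rows for the pairs seen before the problem are still printed, so check the exit status before trusting the file.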