我试图用新的数字列表替换以下 for 循环中的这些数字“A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023”。但我的新列表是一个 .CSV 文件,其中有数百个数字。我的问题是,我可以直接读取 .CSV 文件并使其作为 for 循环中的列表工作吗?
for ACC in A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023
do
echo -n -e "$ACC\t"
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" |\
grep TSeq_taxid |\
cut -d '>' -f 2 |\
cut -d '<' -f 1 |\
tr -d "\n"
echo
done
.csv 文件如下所示:
WP_004064712.1
WP_023555236.1
WP_051593235.1
KAJ52037.1
WP_012103448.1
WP_049740904.1
WP_003346264.1
WP_026134014.1
WP_051870539.1
AKF93952.1
XP_008397367.1
XP_014896959.1
XP_007567109.1
XP_014847432.1
EHG27035.1
EGX75147.1
WP_033630878.1
答案1
如果 @Mark 要求 CSV 文件每行包含一个值,您可以通过用命令替换替换初始列表来轻松完成此操作:
for ACC in `cat csvfile`
do
...
done
答案2
如果您知道要将“A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023”替换为什么值,您可以执行以下操作:
CSV=`cat csvfile`
for LINE in $CSV
do
sed -i "s/A00002/NewValue/g" $CSV
sed -i "s/X53307/NewValue/g" $CSV
...
done
sed命令解释:
sed -i“s/X53307/NewValue/g”$CSV
该命令的作用是:直接在 $CSV 文件中将 X53307 替换为 NewValue。
答案3
你在这里忘记了两件事:
- Curl 语句中的字符串扩展确实会产生输出。
- 您可以按照@John 的建议,使用 CSV 文件作为输入控件。
因此,您不需要替换字符串值,只需覆盖它们即可。
老的:
<?xml version="1.0"?>
<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
<TSeq_seqtype value="nucleotide"/>
<TSeq_gi>39899</TSeq_gi>
<TSeq_accver>X53307.1</TSeq_accver>
<TSeq_taxid>1423</TSeq_taxid>
<TSeq_orgname>Bacillus subtilis</TSeq_orgname>
<TSeq_defline>Bacillus subtilis epr gene for a novel serine protease</TSeq_defline>
<TSeq_length>2521</TSeq_length>
<TSeq_sequence>GTTAACAGGATATCCGAGCTTATCGGCCCACTCGTTCCCAAACACACTCGCCATGAAATCAGCATACCCCGGAATCGGCAAGCTCGTTAAAATCAAGAAGACAGACCCGATAATAATCAGCGGCATGGACTGGATAATTCCGTCACGCAAAGCGCTGAGATGCCGCTGCCCGGCAATTTTCCCGGCGACAGGCATTATTTTTTCCTCCATCACCCGAGTGAATGTGCTCATCTTAAAAACCCCCTTTTCTCATTGCTTTGTGAACAACAACCTCCGCAATGTTTTCTTTATCTTATTTTGAAAACGCTTAGAAATTCATTTGGAAAATTTCCTCTTCATGCGGAAAAAATCTGCATTTTGCTAAACAACCCTGCCCATGAAAATTTTTTCCTTCTTACTATTAATCTCTCTTTTTTTCTCCGATATATATATCAAACATCATAGAAAAAGGAGATGAATCATGAAAAACATGTCTTGCAAACTTGTTGTATCAGTCACTCTGTTTTTCAGTTTTCTCACCATAGGCCCTCTCGCTCATGCGCAAAACAGCAGCGAGAAAGAGGTTATTGTGGTTTATAAAAACAAGGCCGGAAAGGAAACCATCCTGGACAGTGATGCTGATGTTGAACAGCAGTATAAGCATCTTCCCGCGGTAGCGGTCACAGCAGACCAGGAGACAGTAAAAGAATTAAAGCAGGATCCTGATATTTTGTATGTAGAAAACAACGTATCATTTACCGCAGCAGACAGCACGGATTTCAAAGTGCTGTCAGACGGCACTGACACCTCTGACAACTTTGAGCAATGGAACCTTGAGCCCATTCAGGTGAAACAGGCTTGGAAGGCAGGACTGACAGGAAAAAATATCAAAATTGCCGTCATTGACAGCGGGATCTCCCCCCACGATGACCTGTCGATTGCCGGCGGGTATTCAGCTGTCAGTTATACCTCTTCTTACAAAGATGATAACGGCCACGGAACACATGTCGCAGGGATTATCGGAGCCA
AGCATAACGGCTACGGAATTGACGGCATCGCACCGGAAGCACAAATATACGCGGTTAAAGCGCTTGATCAGAACGGCTCGGGGGATCTTCAAAGTCTTCTCCAAGGAATTGACTGGTCGATCGCAAACAGGATGGACATCGTCAATATGAGCCTTGGCACGACGTCAGACAGCAAAATCCTTCATGACGCCGTGAACAAAGCATATGAACAAGGTGTTCTGCTTGTTGCCGCAAGCGGTAACGACGGAAACGGCAAGCCAGTGAATTATCCGGCGGCATACAGCAGTGTCGTTGCGGTTTCAGCAACAAACGAAAAGAATCAGCTTGCCTCCTTTTCAACAACTGGAGATGAAGTTGAATTTTCAGCACCGGGGACAAACATCACAAGCACTTACTTAAACCAGTATTATGCAACGGGAAGCGGAACATCCCAAGCGACACCGCACGCCGCTGCCATGTTTGCCTTGTTAAAACAGCGTGATCCTGCCGAGACAAACGTCCAGCTTCGCGAGGAAATGCGGAAAAACATCGTTGATCTTGGTACCGCAGGCCGCGATCAGCAATTTGGCTACGGCTTAATCCAGTATAAAGCACAGGCAACAGATTCAGCGTACGCGGCAGCAGAGCAAGCGGTGAAAAAAGCGGAACAAACAAAAGCACAAATCGATATCAACAAAGCGCGAGAACTCATCAGCCAGCTGCCGAACTCCGACGCCAAAACTGCCCTGCACAAAAGACTGGATAAAGTACAGTCATACAGAAATGTAAAAGATGCGAAAGACAAAGTCGCAAAGGCAGAAAAATATAAAACACAGCAAACCGTTGACACAGCACAAACTGCCATCAACAAGCTGCCAAACGGAACAGACAAAAAGAACCTTCAAAAACGCTTAGACCAAGTAAAACGATACATCGCGTCAAAGCAAGCGAAAGACAAAGTTGCGAAAGCGGAAAAAAGCAAAAAGAAAACAGATGTGGACAGCGCACAATCAGCAATTGGCAAGCTGCCTGCAAGTTCAGAAAA
AACGTCCCTGCAGAAACGCCTTAACAAAGTGAAGAGCACCAATTTGAAGACGGCACAGCAATCCGTATCTGCGGCTGAAAAGAAATCAACTGATGCAAATGCGGCAAAAGCACAATCAGCCGTCAATCAGCTTCAAGCAGGCAAGGACAAAACGGCATTGCAAAAACGGTTAGACAAAGTGAAGAAAAAGGTGGCGGCGGCTGAAGCAAAAAAAGTGGAAACTGCAAAGGCAAAAGTGAAGAAAGCGGAAAAAGACAAAACAAAGAAATCAAAGACATCCGCTCAGTCTGCAGTGAATCAATTAAAAGCATCCAATGAAAAAACAAAGCTGCAAAAACGGCTGAACGCCGTCAAACCGAAAAAGTAACCAAAAACCTTTAAGATTTGCATTCCAAGTCTTAAAGGTTTTTTTCATTCTAAGAACACCACACACAACCTTTTTCCCATCCATTGTACAGGCTTTTCATACTATTGCTATACAGCCATGAAC</TSeq_sequence>
</TSeq>
</TSeqSet>
新的:
<?xml version="1.0"?>
<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
<TSeq_seqtype value="protein"/>
<TSeq_gi>490166065</TSeq_gi>
<TSeq_accver>WP_004064712.1</TSeq_accver>
<TSeq_taxid>97253</TSeq_taxid>
<TSeq_orgname>Eubacterium plexicaudatum</TSeq_orgname>
<TSeq_defline>hypothetical protein [Eubacterium plexicaudatum]</TSeq_defline>
<TSeq_length>1508</TSeq_length>
<TSeq_sequence>MKKSFMTRVLAVSLSAAMAFSMSSASNLVTASAASTVNLKTTFKTLKVGQTYKLTLKKNTLNWKITKVQTTNKKICTVYGKTASSVMLKGKGVGRAKISVKVKTTKRKYPKNIKIMKCTANVKAADGSGTTDEFKVTSATASSNTEVRVMFSKAIDAAEMTNFTVSDSVTVSKAELSEDKKSVLLTIAGAEYGKNYELTVNGIKVAGKEQAAQKVTFTTPSASEKYPTTLEAKDPVLASDGHSQTLVTFTIKDANGNPITDKGVEVAFATSLGKFAEQRVSIQNGVATVMYTSEALMETQTSAITATVVESTDNQELMGLSATSSITLTPNPDEFNIVPIITSITAPTADRVIAYFNEKVSASDFKTASGKLDHSKFTANVAWGFDNGFDELGNRLVGRSNVVGILDVPGSDNALQLLVDRPMTDNTNISVTFENKTKASSLVSASNTVYTKLTDAHQPSVLTAKGDGLRTVVVNFSEAVLPTAYCDNVETDKKNANQTLFAADNIENYLIDGKPLSYWGVTEVKTPDSETPDDTSSNLKKESSKNDATKTGSEKPGEIQVGSYKDGEDNRHVVTIKLSRERFLEPGTHSMTISNVGDWAAKTDRERNIVNTQTFDFVVENNDVIPTFEVEEQSPEQWLLKFNSDIEPVSETLTTPNSQYSDQASILKLQELVGSTWVDISDSDAAGKNPIRVSQVDDTRNYVVEVRKDWTEVYNTSSTKQNYFNKQLRLHIDAGKIVNIANNKQNGTIDIPLDGTIMRTPDVVSPEIGEVTPAEDTSGNVLDSYNVKLSEPVKLSDGTGGAGGANGEGLTPSQIQSANGSNSNNQGVPMPSAQFIRVDNGQTVEGIITSNVFVDAYDTTINIAPESALSAGKWRLVISSISDDYGNTASTVAHEIDVTQESVTTDFKIVWAAVSDQQTYAEDHIGVERGRYIFVKFSKPVTMTGNSVNAGVTGNYTVNGATLPTGTQIRANIVGYDDHDAVTDSVTIMLPTGNVNAGWGATGDYTV
SGKNAMLNVSRAITATTGENLSNGGLIRIPFQYGSATEDTGYNDYNDSLTALTDAVWGNYRSETRAGYDNLRDYYKALKSALENDKYRRVVLTAPLDLSNPDDNPNEDQKDAVAVFGRSHTLTIKRAVDFDLNGNNITGNVVISTTDAVNRIKLHSSKERAHIYGYANNKDNVATLTVNAGSAKEFLLDNVEVHETDKGNALNINDTWKASFVNNGVIDGKIRITDTNGCGFKNENTTDGFTNRTRFIIDSTGDVNLKGDLSALRNLTDEFGITVNQAAKLSFGVDSKDETTPCDISGVKIVVRGPGARVIFTPVATTTADTALTAEADNVRVQLSQANSGSGKIQFFTDRGGKIVAVDKDNKEVTSDSKDAVKISSDDIKVTGIQKALENLDVQTGVITDGKVDSTVTISCGAISGGSYNIEELAKNIKKAEFEYKGKPDTTGIVANYSLLSTNLLKKDSTHIWPKDNWTDQKDDVSDTIRVTLAYDGYTMVKYIKVTRV</TSeq_sequence>
</TSeq>
</TSeqSet>
答案4
这是一个重构,可以避免将整个 CSV 文件读入内存,并稍微简化后处理。
# Use lower case for private variables
# and https://mywiki.wooledge.org/DontReadLinesWithFor
while read -r acc; do
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${acc}&rettype=fasta&retmode=xml" |
# Run a single awk script for extraction and formatting
awk -v acc="$acc" '/TSeq_taxid/ {
sub(/>.*/, ""); sub(/.*</, ""); print acc "\t" $0 }'
done <csvfile