Extract part of a string in a column and keep other columns

Question 1

$ # assuming `rs[digits]` string will match only in 2nd column
$ # string matched within () will get printed
$ perl -lne 'print /(rs\d+\t)[^\t]+\t([^\t]+)/' ip.txt
rs199   info2
rs2778  info5

$ # to match from 2nd column only
$ perl -lne 'print /^[^\t]+\t[^\t]*(rs\d+\t)[^\t]+\t([^\t]+)/' ip.txt
rs199   info2
rs2778  info5

$ # to get some other column, say 2nd and 5th
$ perl -lne 'print /^[^\t]+\t[^\t]*(rs\d+\t)(?:[^\t]+\t){2}([^\t]+)/' ip.txt
rs199   info3
rs2778  info6

To print only when a match is found:

$ perl -lne '/^[^\t]+\t[^\t]*(rs\d+\t)(?:[^\t]+\t){1}([^\t]+)/ && print $1,$2' ip.txt
rs199   info2
rs2778  info5
$ perl -lne '/^[^\t]+\t[^\t]*(rs\d+\t)(?:[^\t]+\t){2}([^\t]+)/ && print $1,$2' ip.txt
rs199   info3
rs2778  info6

Previous solution, for the case where the strings to be extracted are adjacent to each other:

$ # assuming the shell being used supports $'' strings
$ grep -o $'rs[0-9]*\t[^\t]*' ip.txt
rs199   info1
rs2778  info4

Question 2

Here are some options:

awk
```
$ awk -v OFS="\t" '{sub(/.*-/,"",$2);print $2,$4}' file
rs199   info2
rs2778  info5
```
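This removes everything up to and including the last `-` in the second field (the `.*` is greedy), then prints the resulting second field and the fourth field.
Perl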
```
$ perl -pe 's/.*?-*(rs\d+\t)\S+\t(\S+).*/$1$2/' file
rs199   info2
rs2778  info5
```
This will fail if `rs` followed by digits can also appear in the first field. A more robust approach is:
```
$ perl -F'\t' -lane '$F[1]=~s/.+-//; print join "\t",@F[1,3]' file
rs199   info2
rs2778  info5
```
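This removes everything up to and including the last `-` in the second field (and does nothing if the second field contains no `-`), then prints the second and fourth fields.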

Question 3

I did it with the following approach.

Input file

ILM-rs199    info1    info2    info3
aws-rs2778   info4    info5    info6
345-678945   info7    info8    info9
aws-rs789    info10   info11   info-rs789

Command

awk -F "-" '{print $1,$2,$3,$4,$5}' inputfile | awk '$2 ~ /^rs[0-9]/{print $2,$4}'

Output

rs199 info2
rs2778 info5
rs789 info11
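
The same result can probably be had from a single `awk` call. This is only a sketch, assuming whitespace-separated columns and that the ID is whatever follows the last `-` in the first column; the variable name `id` is just illustrative.

```shell
# Recreate the sample input shown above (whitespace-separated columns).
cat > inputfile <<'EOF'
ILM-rs199    info1    info2    info3
aws-rs2778   info4    info5    info6
345-678945   info7    info8    info9
aws-rs789    info10   info11   info-rs789
EOF

# Strip everything up to the last "-" in column 1, keep the row only if
# what remains is an rs ID, then print that ID with column 3.
result=$(awk '{ id = $1; sub(/.*-/, "", id) } id ~ /^rs[0-9]+$/ { print id, $3 }' inputfile)
printf '%s\n' "$result"
```

On this input it prints the same three lines as the two-stage pipeline. Working on a copy of `$1` leaves the rest of the line untouched, so a `-` in a later column (as in `info-rs789`) does not shift the field numbers.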
