Strategies for extracting movie titles from this non-uniform dataset?

Answer 1

Using gawk, and assuming the year always ends the record:

gawk -F"[0-9]{4}$" '{print $1}' movies
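
The field-separator approach leaves the whitespace that preceded the year at the end of each title. As a minimal sketch of an alternative, assuming the same "year ends the record" layout, the year and the blanks before it can be removed with `sub()`. The helper name `strip_year` is hypothetical, and the digits are spelled out rather than written as `{4}` so that the pattern does not depend on interval-expression support:

```shell
# Hypothetical helper: remove a trailing 4-digit year plus the blanks around it.
# Reads the named files, or standard input when no file is given.
strip_year() {
    awk '{ sub(/[ \t]+[0-9][0-9][0-9][0-9][ \t]*$/, ""); print }' "$@"
}

# Example run on two sample records.
printf 'Casablanca  1942\nThe Blues Brothers 1980\n' | strip_year
```

Lines that do not end in a 4-digit year are printed unchanged, since `sub()` simply finds nothing to replace.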

Answer 2

Bash:

while read -r line; do
    if [[ $line =~ (.*)[[:blank:]]+[0-9]{4}$ ]]; then
        echo "${BASH_REMATCH[1]}"
    fi
done < data

sed:

sed 's/[[:blank:]]\+[0-9]\{4\}$//' < data
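
Note that the Bash loop prints nothing for a line that does not end in a 4-digit year, while the sed command passes such a line through unchanged. A minimal sketch of a Bash variant that behaves like sed in that respect is shown here. The function name `extract_titles` is hypothetical, and the sketch assumes Bash (it relies on `[[ =~ ]]` and `BASH_REMATCH`):

```shell
# Hypothetical helper: print each title without its trailing year.
# Lines without a trailing 4-digit year are printed as they are.
extract_titles() {
    local line
    # Title must end in a non-blank character, followed by blanks and the year.
    local re='^(.*[^[:blank:]])[[:blank:]]+[0-9]{4}[[:blank:]]*$'
    # IFS= and -r keep leading blanks and backslashes intact;
    # the || test also handles a final line with no newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
        else
            printf '%s\n' "$line"
        fi
    done
}

printf 'Casablanca  1942\nUntitled\n' | extract_titles
```

Keeping the regular expression in a variable avoids quoting surprises inside `[[ ... ]]`.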

Answer 3

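This is actually quite simple. As long as the last field (the year) contains no whitespace (that is not clear from your question, but I am assuming it is the case), all you need to do is remove the last field. For example: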

$ cat movies
Casablanca  1942
Eternal Sunshine        of the Spotless Mind            2004
He Died with a Felafel in His Hand                       2001
The Blues Brothers 1980

So, if you only want to print the titles, you can use:

$ perl -lpe 's/[^\s]+$//' movies
Casablanca  
Eternal Sunshine        of the Spotless Mind            
He Died with a Felafel in His Hand                       
The Blues Brothers 

$ sed 's/[^ \t]*$//' movies 
Casablanca  
Eternal Sunshine        of the Spotless Mind            
He Died with a Felafel in His Hand                       
The Blues Brothers

Or, to collapse the whitespace in the titles as well:

$ sed -r 's/[\t ]+/ /g;s/[^ \t]*$//' movies 
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers 

$ perl -lpe 's/\s+/ /g;s/[^\s]+$//' movies
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers 

$ awk '{for(i=1;i<NF-1;i++){printf "%s ",$i} print $(NF-1)}' movies
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers

If the year is always 4 digits, you can use:

$ perl -lpe 's/....$//' movies 
Casablanca  
Eternal Sunshine        of the Spotless Mind            
He Died with a Felafel in His Hand                       
The Blues Brothers 

Or:

$ perl -lpe 's/\s+/ /g;s/....$//' movies 
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers

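Or, in the shell (the unquoted expansion is word-split by `echo`, which is what collapses the whitespace here):

$ while read -r line; do echo ${line%%????}; done < movies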
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers
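
The shell loop collapses runs of whitespace only because the expansion is unquoted. If the original spacing inside the title should be kept, a minimal sketch, assuming the year is always the last four characters of the line, is to quote the expansion and use `printf`. The function name `strip_last4` is hypothetical:

```shell
# Hypothetical helper: drop the last four characters of every input line.
# Quoting the expansion prevents word splitting, so inner spacing is kept;
# any blanks that preceded the year remain at the end of the line.
strip_last4() {
    while IFS= read -r line; do
        printf '%s\n' "${line%????}"
    done
}

printf 'Eternal Sunshine        of the Spotless Mind            2004\n' | strip_last4
```

This variant does not check that the removed characters are digits, so it is only safe when every record really ends in a 4-digit year.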

Answer 4

This should remove the trailing digits, along with any tabs and spaces before them:

sed -e 's#[\t ]*[0-9]*$##' movies.txt
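
Because `[0-9]*` matches a run of digits of any length, this command would also strip a number that is part of the title when a record has no year (for example, a line containing only `Apollo 13`). As a minimal sketch of a sanity check to run before stripping, the following lists every line that does not end in a blank followed by exactly four digits. The function name `lines_without_year` is hypothetical:

```shell
# Hypothetical helper: list records that do NOT end in a 4-digit year,
# so they can be inspected before any year-stripping command is applied.
# Reads the named files, or standard input when no file is given.
lines_without_year() {
    grep -vE '[[:blank:]][0-9]{4}[[:blank:]]*$' "$@"
}

printf 'Casablanca 1942\nApollo 13\n' | lines_without_year
```

An empty result suggests that every record follows the "title, whitespace, 4-digit year" layout that the commands in these answers assume.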
