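An external vendor has given me a data dump of more than 20 tables as *.csv files. They have no documentation, so I have to go through the files by hand to work out which ones are "related" in the RDBMS sense. Is there a way to find out which files share the same column names and print that neatly?
At the moment I run this and link the files together manually: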
$> head -n 1 *.csv
which gives me output like this:
==> EVO_ANGLE.csv <==
"evo_ang_id","angle_description"
==> EVOP_IMAGE.csv <==
"evop_image_id","evop_id","evo_ang_id","evo_collection","file_format","image_name","image_path", "image_type"
==> IMAGE_TYPE.csv <==
"id","image_type","group","description"
As you can see, the files EVO_ANGLE and EVOP_IMAGE are related through evo_ang_id, and EVOP_IMAGE and IMAGE_TYPE have image_type in common.
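Is there a better way to print this information, so that for each file I can see which other files have the same field?
The best workaround I can think of is a shell script that does the following, in order:
- Get the first line of each file and store it in an array used as a map
- For each word in each line, find where it occurs in the array
- Collate this information and print the output.
This is a chore that would take a lot of debugging to get right, and eyeballing the console output might well be faster. Is there a better way? Any tricks with a cut/join/grep combination?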
Answer 1
If you want to know which files a given attribute belongs to, you can use awk.
Assuming your csv files look like this:
$ for i in *.csv; do echo "$i"; head -n1 "$i"; echo; done
EVO_ANGLE.csv
"evo_ang_id","angle_description"
EVOP_IMAGE.csv
"evop_image_id","evop_id","evo_ang_id","evo_collection","file_format","image_name","image_path", "image_type"
IMAGE_TYPE.csv
"id","image_type","group","description"
The following awk command inverts the mapping, listing the file names for each attribute:
$ awk -F', *' '                    # field separator = comma and optional spaces
FNR==1{                            # Parse only the first line of each file.
    for(i=1;i<=NF;i++)             # Loop through all fields, and store them
        a[$i]=a[$i] " " FILENAME   # in an array together with the filename.
}
END{                               # When all files parsed,
    for(i in a) print i,a[i]       # print the content of the array
}' *.csv
"image_name" EVOP_IMAGE.csv
"evo_collection" EVOP_IMAGE.csv
"image_path" EVOP_IMAGE.csv
"file_format" EVOP_IMAGE.csv
"image_type" EVOP_IMAGE.csv IMAGE_TYPE.csv
"evop_id" EVOP_IMAGE.csv
"evop_image_id" EVOP_IMAGE.csv
"id" IMAGE_TYPE.csv
"evo_ang_id" EVO_ANGLE.csv EVOP_IMAGE.csv
"description" IMAGE_TYPE.csv
"group" IMAGE_TYPE.csv
"angle_description" EVO_ANGLE.csv
If you only want the attributes that belong to more than one file, pipe the result through a second awk:
$ awk -F', *' 'FNR==1{for(i=1;i<=NF;i++) a[$i]=a[$i] " " FILENAME}END{for(i in a) print i,a[i]}' *.csv | awk 'NF>2'
"image_type" EVOP_IMAGE.csv IMAGE_TYPE.csv
"evo_ang_id" EVO_ANGLE.csv EVOP_IMAGE.csv
Answer 2
Here is a bash-centric version; it ended up looking very similar to the awk version in Answer 1.
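unset fileheads fields
declare -A fileheads    # file name  -> header line
declare -A fields       # field name -> one "x" per occurrence

for f in *.csv
do
    # Two plain assignments, so IFS stays set to "," for the rest of the script
    IFS=, fileheads[$f]=$(head -n1 "$f")
    set -f              # disable globbing while the header is split on commas
    for field in ${fileheads[$f]}
    do
        fields[$field]+=x
    done
    set +f
done

for field in ${!fields[*]}
do
    [[ ${#fields[$field]} -gt 1 ]] || continue    # skip fields seen only once
    for file in ${!fileheads[*]}
    do
        [[ ${fileheads[$file]} =~ $field ]] && echo "$file has $field"
    done
    echo
done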
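This collects each file's fields (line 1) into the fileheads associative array, indexed by file name. It also keeps a tally of how many times each field name occurs, by appending one x to fields[$field] per occurrence. We assume here that commas do not appear in the field names themselves.
We then loop over all known fields; if a field was seen more than once, we loop over the files (the indices of the fileheads array) to see which of them contain that field. At least two files should match; for readability, their file names and the linking field are echoed, followed by a blank line.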
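As a quick cross-check of the tally step described above, the same "seen more than once" test can be sketched with a plain head/tr/sort/uniq pipeline. This is only a sketch: the function name duplicate_fields is made up for illustration, and it assumes simple headers (no commas or escaped quotes inside field names). Unlike the scripts above, it strips the surrounding quotes and whitespace before comparing, and it reports only the field names, not the files.

```shell
# duplicate_fields FILE...: print header fields that occur in more than one file.
# Hypothetical helper; assumes no embedded commas or escaped quotes in field names.
duplicate_fields() {
    for f in "$@"; do
        head -n 1 "$f" |
            tr ',' '\n' |                                               # one field per line
            sed 's/^[[:space:]"]*//; s/[[:space:]"]*$//; /^$/d' |       # trim quotes/spaces, drop empties
            sort -u                                                     # count each file once per field
    done | sort | uniq -d                                               # keep names seen in 2+ files
}
```

It could be called as duplicate_fields *.csv; the names it prints are the candidates for join columns.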
Sample run of the bash script:
Input
$ head -n1 *.csv
==> EVOP_IMAGE.csv <==
"evop_image_id","evop_id","evo_ang_id","evo_collection","file_format","image_name","image_path","image_type"
==> EVO_ANGLE.csv <==
"evo_ang_id","angle_description"
==> IMAGE_TYPE.csv <==
"id","image_type","group","description"
Output
EVOP_IMAGE.csv has "evo_ang_id"
EVO_ANGLE.csv has "evo_ang_id"
EVOP_IMAGE.csv has "image_type"
IMAGE_TYPE.csv has "image_type"
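The question asks for a per-file view: for each file, which other files share one of its fields. One possible sketch of that, again in awk, is below. The function name related_files and the output layout are assumptions made for illustration, not part of either answer. It assumes simple headers without embedded commas or escaped quotes, and it strips quotes, spaces and a trailing carriage return so that "image_type" and  "image_type" (with a leading space) compare equal.

```shell
# related_files FILE...: for each file, list the header fields it shares with
# other files and name those files. Hypothetical helper for illustration.
related_files() {
    awk -F',' '
    FNR == 1 {                                    # header line of each file
        nfiles++; fname[nfiles] = FILENAME
        for (i = 1; i <= NF; i++) {
            name = $i
            gsub(/^[ \t"]+|[ \t\r"]+$/, "", name) # trim quotes, spaces, CR
            if (name == "" || (name, FILENAME) in has) continue
            has[name, FILENAME] = 1               # membership test used in END
            nfields[FILENAME]++
            field[FILENAME, nfields[FILENAME]] = name   # keep header order
        }
    }
    END {
        for (f = 1; f <= nfiles; f++) {
            print fname[f] ":"
            for (k = 1; k <= nfields[fname[f]]; k++) {
                name = field[fname[f], k]
                others = ""
                for (g = 1; g <= nfiles; g++)     # every other file with this field
                    if (g != f && (name, fname[g]) in has)
                        others = others " " fname[g]
                if (others != "") print "  " name " ->" others
            }
        }
    }' "$@"
}
```

Called as related_files *.csv, it prints one block per file, with fields that appear nowhere else left out. Matching is on the exact trimmed name, so a column called id is not confused with evop_id.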