我有一个以下形式的三列表数据:
TIME MPID CPID
16:45:51 10051 77845
16:45:51 10051 77845
16:46:52 10051 77846
16:46:53 10051 77846
16:48:38 10051 77847
16:48:38 10051 77847
16:48:39 10051 77995
16:49:31 10051 77848
16:49:31 10051 77848
16:51:03 10051 77849
16:51:03 10051 77849
这里,TIME 列由 HH:MM:SS 格式的时间戳组成。 MPID 和 CPID 列是标识号;它们的含义对于我的问题并不重要。 MPID 值也不起作用,除了它们存在于数据集中并且需要传播到输出这一事实之外。
我想要的是识别具有匹配 CPID 值的行对并计算它们对应时间之间的差异。例如,上面的示例中有两行(第三行和第四行),CPID 为 77846。对应的时间是 16:46:52 和 16:46:53,所以我想计算差异:
16:46:53 - 16:46:52 = 00:00:01
我还想以以下格式输出此结果:
MPID 10051 CPID 77846 Total time difference: 01 seconds
如果给定的 CPID 在数据集中没有恰好出现两次,那么我想忽略它。
给定示例数据的所需输出应如下所示:
MPID 10051 CPID 77845 Total time difference: 00 seconds
MPID 10051 CPID 77846 Total time difference: 01 seconds
MPID 10051 CPID 77847 Total time difference: 00 seconds
MPID 10051 CPID 77848 Total time difference: 00 seconds
MPID 10051 CPID 77849 Total time difference: 00 seconds
答案1
假设输入是有差异的文件,用于
uniq -D
显示所有有重复的行第三名领域仅有的,这也摆脱了标题。然后使用date
格式%s
将时间转换为秒,并减去它们。其余为标准壳:uniq -D -f 2 input | while read a b c && read d e f ; do g=$(( $(date -d $d +%s) - $(date -d $a +%s) )) printf "MPID %s CPID %s Total time difference: %02i seconds\n" $b $c $g done
输出:
MPID 10051 CPID 77845 Total time difference: 00 seconds MPID 10051 CPID 77846 Total time difference: 01 seconds MPID 10051 CPID 77847 Total time difference: 00 seconds MPID 10051 CPID 77848 Total time difference: 00 seconds MPID 10051 CPID 77849 Total time difference: 00 seconds
较早的规范(根据OP评论),需要更多代码:
uniq -D -f 2 input | while read a b c && read d e f ; do g=$(( $(date -d $d +%s) - $(date -d $a +%s) )) printf "For %s %s time difference: %02i:%02i:%02i\n" \ $b $c $((g/360)) $(((g/60)%60)) $((g%60)) done
输出:
For 10051 77845 time difference: 00:00:00 For 10051 77846 time difference: 00:00:01 For 10051 77847 time difference: 00:00:00 For 10051 77848 time difference: 00:00:00 For 10051 77849 time difference: 00:00:00
答案2
这是一个粗略的 shell 脚本,可以完成您想要的操作:
#!/bin/bash
# pairwise_pid_time_diff.sh
# Write the data to a temporary file, sorted first by pid and then by time
cat "${1}" | sed 's/\s\+/ /g' | sort -k3,3 -k1,1 -n > sorted_pids.csv
# Compute the pair-wise time differences
for pid in $(cat sorted_pids.csv | cut -d' ' -f3 | uniq); do
if (( "$(grep "${pid}" sorted_pids.csv | wc -l)" == 2 )); then
time1_string="$(cat sorted_pids.csv | grep "${pid}" | head -1 | cut -d' ' -f1)"
time2_string="$(cat sorted_pids.csv | grep "${pid}" | tail -1 | cut -d' ' -f1)"
time1_seconds="$(date -u -d "${time1_string}" +"%s")"
time2_seconds="$(date -u -d "${time2_string}" +"%s")"
date -u -d "0 ${time2_seconds} sec - ${time1_seconds} sec" +"%H:%M:%S"
fi
done
# Delete the temporary file
rm sorted_pids.csv
为了测试它,我们将示例数据写入文件:
cat <<HEREDOC > pids.csv
16:45:51 10051 77845
16:45:51 10051 77845
16:46:52 10051 77846
16:46:53 10051 77846
16:48:38 10051 77847
16:48:38 10051 77847
16:48:39 10051 77995
16:49:31 10051 77848
16:49:31 10051 77848
16:51:03 10051 77849
16:51:03 10051 77849
HEREDOC
然后运行脚本:
bash pairwise_pid_time_diff.sh pids.csv
这给了我们以下输出:
00:00:00
00:00:01
00:00:00
00:00:00
00:00:00
由于您似乎在运行脚本时遇到了麻烦,因此这里有另一个用于调试目的的版本:
#!/bin/bash
# pairwise_pid_time_diff_debug.sh
# Write the data to a temporary file, sorted first by pid and then by time
1>&2 echo "Writing sorted data to temporary file."
cat "${1}" | sed 's/\s\+/ /g' | sort -k3,3 -k1,1 -n > sorted_pids.csv
# Compute the pair-wise time differences
1>&2 echo "Entering main loop..."
for pid in $(cat sorted_pids.csv | cut -d' ' -f3 | uniq); do
1>&2 echo "Looking at pid: ${pid}"
if (( "$(grep "${pid}" sorted_pids.csv | wc -l)" == 2 )); then
1>&2 echo "Found matching pair."
time1_string="$(cat sorted_pids.csv | grep "${pid}" | head -1 | cut -d' ' -f1)"
1>&2 echo "Time 1 string: ${time1_string}"
time2_string="$(cat sorted_pids.csv | grep "${pid}" | tail -1 | cut -d' ' -f1)"
1>&2 echo "Time 2 string: ${time2_string}"
time1_seconds="$(date -u -d "${time1_string}" +"%s")"
1>&2 echo "Time 1 in seconds: ${time1_seconds}"
time2_seconds="$(date -u -d "${time2_string}" +"%s")"
1>&2 echo "Time 1 in seconds: ${time2_seconds}"
time_difference="$(date -u -d "0 ${time2_seconds} sec - ${time1_seconds} sec" +"%H:%M:%S")"
echo "${time_difference}"
fi
done
# Delete the temporary file
rm sorted_pids.csv
1>&2 echo "Deleted temporary file."
当我运行此脚本时,我得到以下输出:
bash pairwise_pid_time_diff_debug.sh pids.csv
Writing sorted data to temporary file.
Entering main loop...
Looking at pid: 77845
Found matching pair.
Time 1 string: 16:45:51
Time 2 string: 16:45:51
Time 1 in seconds: 1510332351
Time 1 in seconds: 1510332351
00:00:00
Looking at pid: 77846
Found matching pair.
Time 1 string: 16:46:52
Time 2 string: 16:46:53
Time 1 in seconds: 1510332412
Time 1 in seconds: 1510332413
00:00:01
Looking at pid: 77847
Found matching pair.
Time 1 string: 16:48:38
Time 2 string: 16:48:38
Time 1 in seconds: 1510332518
Time 1 in seconds: 1510332518
00:00:00
Looking at pid: 77995
Looking at pid: 77848
Found matching pair.
Time 1 string: 16:49:31
Time 2 string: 16:49:31
Time 1 in seconds: 1510332571
Time 1 in seconds: 1510332571
00:00:00
Looking at pid: 77849
Found matching pair.
Time 1 string: 16:51:03
Time 2 string: 16:51:03
Time 1 in seconds: 1510332663
Time 1 in seconds: 1510332663
00:00:00
Deleted temporary file.
让我们手动逐步完成该过程,而不是运行脚本。首先,让我们创建我们的数据文件:
cat <<HEREDOC > pids.csv
16:45:51 10051 77845
16:45:51 10051 77845
16:46:52 10051 77846
16:46:53 10051 77846
16:48:38 10051 77847
16:48:38 10051 77847
16:48:39 10051 77995
16:49:31 10051 77848
16:49:31 10051 77848
16:51:03 10051 77849
16:51:03 10051 77849
HEREDOC
接下来,让我们对数据进行排序 - 首先按 PID,然后按时间:
cat pids.csv | sed 's/\s\+/ /g' | sort -k3,3 -k1,1 -n > sorted_pids.csv
让我们检查文件是否正确写入:
cat sorted_pids.csv
我们应该得到以下输出:
16:45:51 10051 77845
16:45:51 10051 77845
16:46:52 10051 77846
16:46:53 10051 77846
16:48:38 10051 77847
16:48:38 10051 77847
16:49:31 10051 77848
16:49:31 10051 77848
16:51:03 10051 77849
16:51:03 10051 77849
16:48:39 10051 77995
现在让我们从数据中获取 uniq PID:
user@host:~$ cat sorted_pids.csv | cut -d' ' -f3 | uniq
77845
77846
77847
77848
77849
77995
让我们尝试计算第一个 PID 的时间差:77845。
首先让我们获取 77845 的第一个时间字符串:
time1_string="$(cat sorted_pids.csv | grep 77845 | head -1 | cut -d' ' -f1)"
让我们检查一下我们的值是否正确:
user@host:~$ echo ${time1_string}
16:45:51
现在让我们将其转换为秒:
time1_seconds="$(date -u -d "${time1_string}" +"%s")"
并检查我们得到了什么值:
user@host:~$ echo ${time1_seconds}
1510332351
现在让我们对第二次字符串执行相同的操作:
time2_string="$(cat sorted_pids.csv | grep 77845 | tail -1 | cut -d' ' -f1)"
time2_seconds="$(date -u -d "${time2_string}" +"%s")"
user@host:~$ echo "${time2_string}"
16:45:51
user@host:~$ echo "${time2_seconds}"
1510332351
现在我们计算以秒为单位的差异并将其转换为拍号:
difference=$(date -u -d "0 ${time2_seconds} sec - ${time1_seconds} sec" +"%H:%M:%S")
并检查结果:
00:00:00