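I am currently writing a bash script that processes large log files from one of my programs. When I first started, the script took about 15 seconds to finish, which is not bad, but I wanted to improve it. I implemented a queue with mkfifo and reduced the parsing time to 6 seconds. Is there any way to make the script parse faster?
Current version of the script: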
#!/usr/bin/env bash
# $1 is server log file
# $2 is client logs file directory
declare -A orders_array
fifo=$HOME/.fifoDate-$$
mkfifo $fifo
# Queue for time conversion
exec 5> >(exec stdbuf -o0 date -f - +%s%3N >$fifo)
exec 6< $fifo
# Queue for ID extraction
exec 7> >(exec stdbuf -o0 grep -oP '[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*' >$fifo)
exec 8< $fifo
rm $fifo
while read line; do
    order_id=${line:52:36};
    echo >&5 "${line:1:26}"
    read -t 1 -u6 converted_time
    orders_array[$order_id]=$converted_time
done < <(grep -ah 'MarketOrderTransitions.*\[MarketMessages::OrderExecuted\]' $1)
while read line; do
    echo >&7 "$line"
    read -t 1 -u8 id
    echo >&5 "${line:1:26}"
    read -t 1 -u6 converted_time
    time_diff="$(($converted_time - orders_array[$id]))"
    echo "$id -> $time_diff ms"
done < <(grep -ah 'Event received (OrderExecuted)' $2/*market*.log)
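The basic task of the script is to extract message timestamps from the client and server log files, match them by message ID, and from that compute how many milliseconds elapsed between the server sending a message and the client receiving it.
The first while loop finishes fairly quickly (1.5 seconds), but the second part takes longer (my guess is because of grep).
The engine file in my test has about 500k lines. I also have about 700 client log files (1.3 million lines in total).
The order ID is at a fixed position in the server file, but in the client logs I have to grep for it.
Edit:
As suggested, I am adding samples of the input files. Server: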
[2022-12-07 07:36:18.209496] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_process_event] [MarketMessages::OrderExecuted]
[2022-12-07 07:36:18.209558] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_guard] [[True] (lambda at ../subprojects/market_session/private_include/MarketSession/MarketOrderTransitions.hpp:81:24)]
[2022-12-07 07:36:18.209564] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_state_change] [GatewayCommon::States::New --> GatewayCommon::States::Executed]
[2022-12-07 07:36:18.209567] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_action] [(lambda at ../subprojects/market_session/private_include/MarketSession/MarketOrderTransitions.hpp:57:25) for event: MarketMessages::OrderExecuted]
[2022-12-07 07:36:18.209574] [MarketOrderTransitionsa4ec2abf-059f-4452-b503-ae58da2ce1ff] [info] [log_process_event] [boost::sml::v1_1_0::back::on_entry<boost::sml::v1_1_0::back::_, MarketMessages::OrderExecuted>]
The id is inside the square brackets, right after MarketOrderTransitions (a4ec2abf-059f-4452-b503-ae58da2ce1ff).
Client:
[2022-12-07 07:38:47.545433] [twap_algohawk] [info] [] [Event received (OrderExecuted): {"MessageType":"MarketMessages::OrderExecuted","averagePrice":"49.900000","counterPartyIds":{"activeId":"dIh5wYd/S4ChqMQSKMxEgQ**","executionId":"2295","inactiveId":"","orderId":"3dOKjIoURqm8JjWERtInkw**"},"cumulativeQuantity":"1200.000000","executedPrice":"49.900000","executedQuantity":"1200.000000","executionStatus":"Executed","instrument":[["Symbol","5"],["Isin","5"],["SecurityIDSource","4"],["Mic","MARS"]],"lastFillMarket":"MARS","leavesQuantity":"0.000000","marketSendTime":"07:38:31.972000000","orderId":"a4ec2abf-059f-4452-b503-ae58da2ce1ff","orderPrice":"49.900000","orderQuantity":"1200.000000","propagationData":[],"reportId":"Qx2k73f7QqCqcT0LTEJIXQ**","side":"Buy","sideDetails":"Unknown","transactionTime":"00:00:00.000000000"}]
In the client log the id is inside the orderId tag (there are 2 of them; I use the second one).
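For reference, a minimal sketch of that extraction done with bash parameter expansion alone, so no grep process has to be consulted per line. It assumes the wanted orderId is always the last one on the line, as in the sample above; the function name last_order_id is purely illustrative and is not part of the script in the question.

```shell
#!/usr/bin/env bash
# last_order_id LINE
# Hypothetical helper: prints the value of the last "orderId" field on LINE.
last_order_id() {
    local line=$1 id
    # No orderId tag at all: print nothing and report failure.
    [[ $line == *'"orderId":"'* ]] || return 1
    id=${line##*'"orderId":"'}   # drop everything up to and including the last tag
    id=${id%%'"'*}               # drop the closing quote and the rest of the line
    printf '%s\n' "$id"
}
```

Called as id=$(last_order_id "$line") this still forks a subshell per line, so inside a hot loop the two expansions would more likely be written inline.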
The desired output is:
98ddcfca-d838-4e49-8f10-b9f780a27470 -> 854 ms
5a266ca4-67c6-4482-9068-788a3520b2f3 -> 18 ms
2e8d28de-eac0-4776-85ab-c75d9719b7c6 -> 58950 ms
409034eb-4e55-4e39-901a-eba770d497c0 -> 56172 ms
5b1dc7e8-fae0-43d2-86ea-d3df4dbe810b -> 52505 ms
5249ac24-39d2-40f5-8adf-dcf0410aebb5 -> 17446 ms
bef18cb3-8cef-4d8a-b244-47fed82f21ea -> 1691 ms
7c53c950-23fd-497e-a011-c07363d5fe02 -> 18194 ms
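I am specifically interested in the "OrderExecuted" messages in the log files.
Answer 1
To show where we are so far: you have shown us 2 input files, client and server, and told us where to find the ID in each. Using any awk, that would be: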
$ cat tst.sh
#!/usr/bin/env bash
awk '
    (NR == FNR) && match($0,/\[MarketOrderTransitions[^]]+]/) {
        id = substr($0,RSTART+23,RLENGTH-24)
        print FILENAME, id
    }
    (NR > FNR) && match($0,/.*"orderId":"/) {
        id = substr($0,RLENGTH+1)
        sub(/".*/,"",id)
        print FILENAME, id
    }
' "$@"
$ ./tst.sh Server Client
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Server a4ec2abf-059f-4452-b503-ae58da2ce1ff
Client a4ec2abf-059f-4452-b503-ae58da2ce1ff
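You also said your expected output is a list of similar-looking IDs with numbers next to them, but those IDs do not appear to be related to the sample input you provided, and you did not tell us where those numbers come from.
Once you can state your requirements and provide a testable example in your question, we can complete this script, and it will run orders of magnitude faster than your shell script.
One guess at what you might be trying to do is the following, using GNU awk for the time functions: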
$ cat tst.sh
#!/usr/bin/env bash
awk '
    { time = substr($0,2,26) }
    (NR == FNR) && match($0,/\[MarketOrderTransitions[^]]+]/) {
        id = substr($0,RSTART+23,RLENGTH-24)
        orders_time[id] = time
    }
    (NR > FNR) && match($0,/.*"orderId":"/) {
        id = substr($0,RLENGTH+1)
        sub(/".*/,"",id)
        time_diff = time2ms(time) - time2ms(orders_time[id])
        print id " -> " time_diff " ms"
    }
    function time2ms(time, t,secs) {
        gsub(/[-:]/," ",time)
        split(time,t,/[.]/)
        return ( mktime(t[1]) substr(t[2],1,3) )
    }
' "$@"
$ ./tst.sh Server Client
a4ec2abf-059f-4452-b503-ae58da2ce1ff -> 149336 ms
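But since the expected output you posted appears to be unrelated to the sample input you posted, I do not know whether this is correct.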
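If GNU awk is not available, the only gawk-specific piece above is mktime(). What follows is a rough sketch of the same idea in portable awk, not a tested replacement. Its assumptions: the wrapper name order_latency and the helpers days_from_civil and time2ms are made up for illustration; timestamps are treated as UTC, so only differences are meaningful and a DST change between the two stamps is ignored; and the two message filters are copied from the grep patterns in the question rather than from any stated requirement.

```shell
#!/usr/bin/env bash
# order_latency SERVER_LOG CLIENT_LOG...
# Hypothetical helper: prints "<id> -> <n> ms" for each client OrderExecuted
# event whose id was also seen in the server log.
order_latency() {
    awk '
    # Days since 1970-01-01 for a Gregorian date, using integer arithmetic only.
    function days_from_civil(y, m, d,    era, yoe, doy, doe) {
        if (m <= 2) y--
        era = int(y / 400)
        yoe = y - era * 400
        doy = int((153 * (m > 2 ? m - 3 : m + 9) + 2) / 5) + d - 1
        doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
        return era * 146097 + doe - 719468
    }
    # "2022-12-07 07:36:18.209496" -> milliseconds, treating the stamp as UTC.
    function time2ms(ts,    p, mins) {
        split(ts, p, /[-: .]/)
        mins = (days_from_civil(p[1] + 0, p[2] + 0, p[3] + 0) * 24 + p[4]) * 60 + p[5]
        return mins * 60000 + p[6] * 1000 + substr(p[7], 1, 3)
    }
    { ts = substr($0, 2, 26) }
    # First file: server log. Remember the send time per order id.
    NR == FNR {
        if (/\[MarketMessages::OrderExecuted\]/ && match($0, /\[MarketOrderTransitions[^]]+]/))
            sent[substr($0, RSTART + 23, RLENGTH - 24)] = ts
        next
    }
    # Remaining files: client logs. Use the last orderId on the line.
    /Event received \(OrderExecuted\)/ && match($0, /.*"orderId":"/) {
        id = substr($0, RLENGTH + 1)
        sub(/".*/, "", id)
        if (id in sent)
            print id " -> " (time2ms(ts) - time2ms(sent[id])) " ms"
    }
    ' "$@"
}
```

It would be called as order_latency Server Client, or with the whole set of client logs after the server file. Client events whose ID never appeared in the server log are skipped silently instead of producing a meaningless difference; whether that is the desired behavior depends on the requirements.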