Aspera 日志包含每个文件传输会话的大量不同事件类型。其中一些事件相当隐晦:
LOG Sender DS Qs ds/re/a/d/ao/eo/n/ap/rq/rd/ru/cl/di/do=0/0/0/0/0/0/0/0/0/0/0/0/0/0 Rs o/i=1/1 mgmt backlog i/s/n = 0/412/0 1/456/0
有些——不是那么神秘,但仍然可能需要一些破译:
2020-08-21 19:48:44.622 [112c-00001cf0] LOG FASP Session Statistics [Sender] id=029c1333-c732-4636-83c1-e1acec2864cb delay=64ms rex_delay=72ms solicited_rex=0.01% unsolicited_rex=0.00% ave_xmit_rate 299.53Mbps effective=99.99% effective_rate=299.50Mbps (detail: good_blks 25762789 bl_total 25765926 bl_orig 25762790 bl_rex 3136 dup_last_blks 1) (sndr ctl: sent 1061 rcvd 1061 lost 0 lost 0.00%) (rcvr ctl: sent 9618 rcvd 9617 lost 1 lost 0.01%) (rex ctl: sent 20 rcvd 20 lost 0 lost 0.00%) (progress: tx_bytes 36274005756 file_bytes 55740661756 tx_time 996528996) (estimate transfer size: 0 bytes) sess->instru.num_rex_xfer_not_found 0 sess->instru.num_rex_abrtxfer_not_found 0
... 这只是冰山一角,因为每个文件传输会话都由几个事件类型组成,这些事件类型由某些公共字段(线程 ID、传输 ID 和传输 UUID)连接而成。以下是一些示例:
2020-08-21 19:32:07.730 [112c-00001cf0] LOG FASP Session Start uuid=029c1333-c732-4636-83c1-e1acec2864cb op=send status=started source=D:/Content/somefile.mov dest=/some_destination source_prefix=- local=anon_IP:port peer=dest_IP:33001 tcp_port=22 os="Microsoft Windows *.* " ver=3.9.* lic=9:999:9999 peeros="Isilon OneFS v*.*" peerver=3.* peerlic=99:99:9999 proto_sess=20003 proto_udp=20000 proto_bwmeas=20000 proto_data=2000e
2020-08-21 19:32:07.730 [112c-00001cf0] LOG FASP Session Params uuid=029c1333-c732-4636-83c1-e1acec2864cb userid=0 user="peer_user" targetrate=300000000 minrate=50000000 rate_policy=fair cipher=aes-128-cfb resume=2 create=8 ovr=1 times=0 precalc=no mf=0 mf_path=- mf_suffix=.aspera-inprogress partial_file_suffix=- files_encrypt=yes files_decrypt=no file_csum=NONE dgram_sz=0 prepostcmd=- tcp_mode=no rtt_auto=yes cookie="-" vl_proto_ver=1 peer_vl_proto_ver=1 vl_local=2500000000 vlink_remote=0 vl_sess_id=8 srcbase=- rd_sz=0 wr_sz=0 cluster_num_nodes=1 cluster_node_id=0 cluster_multi_session_threshold=-1 range=0-0 keepalive=no test_login=no proxy_ip=- net_rc_alg=alg_delay exclude_older/newer_than=0/0
2020-08-21 19:32:08.198 [112c-00001cf0] LOG FASP Transfer Start uuid=029c1333-c732-4636-83c1-e1acec2864cb op=send status=started file="D:/Content/somefile.mov" size=55740661756 start_byte=19466656000 rate=300.00Mbps loss=0.00 rexreqs=0 overhead=0
2020-08-21 19:32:26.433 [112c-00001cf0] LOG Sender bl t/o/r/dl/c/h=448201/448201/0/0/442785/442785 ctl bm/bs=0/0 rex b/r/n=0/0/0 r_ott q/s/n=0/0/146 rate t/m/c=300000000/50000000/300000000 prog t/f/e=616782848/20083438848/18011034 xmtD=0
2020-08-21 19:32:26.433 [112c-00001cf0] LOG Sender DS Qs ds/re/a/d/ao/eo/n/ap/rq/rd/ru/cl/di/do=1/0/0/0/0/0/0/0/0/0/1/0/0/0 Rs o/i=1/1 mgmt backlog i/s/n = 0/412/0 1/456/0
...+ 传输过程中大约有 100 个“LOG Sender DS”和“LOG Sender bl”事件,持续约 16 分钟,然后总结:
2020-08-21 19:48:44.231 [112c-00001cf0] LOG FASP Transfer Stop uuid=029c1333-c732-4636-83c1-e1acec2864cb op=send status=success file="D:/Content/somefile.mov" size=55740661756 start_byte=19466656000 rate=291.35Mbps elapsed=996.02s loss=0.00 rexreqs=0 overhead=0
2020-08-21 19:48:44.622 [112c-00001cf0] LOG FASP Session Stop uuid=029c1333-c732-4636-83c1-e1acec2864cb op=send status=success source=D:/Content/somefile.mov dest=/some_destination source_prefix=- local=local_IP:port peer=dest_IP:33001 tcp_port=22 os="Microsoft Windows *.* " ver=3.9.* lic=9:999:9999 peeros="Isilon OneFS v*.* " peerver=3.* peerlic=99:99:9999 proto_sess=20003 proto_udp=20000 proto_bwmeas=20000 proto_data=2000e
2020-08-21 19:48:44.622 [112c-00001cf0] LOG FASP Session Params uuid=029c1333-c732-4636-83c1-e1acec2864cb userid=0 user="peer_user" targetrate=300000000 minrate=50000000 rate_policy=fair cipher=aes-128-cfb resume=2 create=8 ovr=1 times=0 precalc=no mf=0 mf_path=- mf_suffix=.aspera-inprogress partial_file_suffix= files_encrypt=yes files_decrypt=no file_csum=NONE dgram_sz=0 prepostcmd=- tcp_mode=no rtt_auto=yes cookie="-" vl_proto_ver=1 peer_vl_proto_ver=1 vl_local=2500000000 vlink_remote=0 vl_sess_id=8 srcbase=- rd_sz=0 wr_sz=0 cluster_num_nodes=1 cluster_node_id=0 cluster_multi_session_threshold=-1 range=0-0 keepalive=no test_login=no proxy_ip=- net_rc_alg=alg_delay exclude_older/newer_than=0/0
... 会话以上面已经发布的“会话统计”事件结束:
2020-08-21 19:48:44.622 [112c-00001cf0] LOG FASP Session Statistics [Sender] id=029c1333-c732-4636-83c1-e1acec2864cb delay=64ms rex_delay=72ms solicited_rex=0.01% unsolicited_rex=0.00% ave_xmit_rate 299.53Mbps effective=99.99% effective_rate=299.50Mbps (detail: good_blks 25762789 bl_total 25765926 bl_orig 25762790 bl_rex 3136 dup_last_blks 1) (sndr ctl: sent 1061 rcvd 1061 lost 0 lost 0.00%) (rcvr ctl: sent 9618 rcvd 9617 lost 1 lost 0.01%) (rex ctl: sent 20 rcvd 20 lost 0 lost 0.00%) (progress: tx_bytes 36274005756 file_bytes 55740661756 tx_time 996528996) (estimate transfer size: 0 bytes) sess->instru.num_rex_xfer_not_found 0 sess->instru.num_rex_abrtxfer_not_found 0
我请 IBM 支持人员解释一下其中一些的含义,并得到了回复;我会将我了解到的信息发布为答案。如果您有任何其他信息 - 请分享!
答案1
以下是 IBM 的回复,解释了一些日志 + 我们自己对“会话统计”事件的分析 - 这使我们能够解决一个相当具有挑战性的问题。希望这对某些人有帮助!
IBM 的回应(很长):
工程部门不考虑为客户解密大多数日志消息,它们主要用于内部故障排除,因此我们没有记录很多这些消息,并且不能保证这些消息在未来的版本中不会发生变化。
以下是我们记录的一些描述(请注意,新版本中的消息有所不同)
FASP receiver session logging (2.0.2/2.1.x/2.2.x/2.5.x/2.6.x/2.7.x/2.8.x/3.x.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/f/l=2/1/0/0/0 rex_rtt r/s/o=0/0/8 rate_rtt b/r/s/f=0/0/0/1 rex n/s/q/a/r=0/0/0/0/0 bl l/d/o/x/r=0/1/0/0/0 disk b/w=0/1452 rate t/m/c=10000000/200000/10000000 prog t/f/e=60/60/132229 rcvD=1 (2.0.2/2.1.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/f/l=2/1/0/0/0 rex_rtt r/s/o=0/0/8 rate_rtt b/r/s/f=0/0/0/1 rex n/s/q/a/r=0/0/0/0/0 bl l/d/o/x/r=0/1/0/0/0 disk b/w=0/1452 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (2.2.x/2.5.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/f/l=2/1/0/0/0 rex_rtt r/s/o=0/0/8 ooo_rtt r/s/o=0/0/8 rate_rtt b/r/s/f=0/0/0/1 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/x/r=0/1/0/0/0 disk b/w=0/1452 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (2.6.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/d=2/1/0/0 rex_rtt l/h/s/o=0/20/5/8 ooo_rtt l/h/s/o=0/0/0/8 rate_rtt b/l/h/s/f=1/0/20/10/1 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/r/a/x/dl/df/dm/ds=0/1/0/0/1/1/0/0/0/0 disk l/h/b/w=0/6/1/1452 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (2.7.1/2.7.3) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/d=2/1/0/0 rex_rtt l/h/s/o=0/20/5/8 ooo_rtt l/h/s/o=0/0/0/8 rate_rtt b/l/h/s/r/f=1/0/20/10/0/1 ctl bm/bs=0/1 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/r/a/x/dl/df/dm/ds=0/1/0/0/1/1/0/0/0/0 disk l/h/b/w=0/6/1/1452 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (2.7.4 and 2.8.x) 2008-01-11 18:28:34 [31c9-b6238bb0] Receiver bl t/o/r/d/ts=2/1/0/0/2 rex_rtt l/h/s/o=0/20/5/8 ooo_rtt l/h/s/o=0/0/0/8 rate_rtt b/l/h/s/r/f=1/0/20/10/0/1 ctl bm/bs=0/1 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/r/a/x/dl/df/dm/ds=0/1/0/0/1/1/0/0/0/0 disk l/h/b=0/6/1 vlink lq/lo/rq/ro=5/10/0/0 rate t/m/c/n/vl/vr/r=10000000/200000/8000000/10000000/10000000/10000000/8000000 prog t/f/e=60/60/132229 rcvD=1 (3.0.x and later) Receiver Blocks Info (bl) [t/o/r/f/l] (pre-2.7) t = total number of blocks received o = number of original-blocks received r = number of retransmitted-blocks received f = number of times first block was received l = number of times the last block was received Receiver Blocks Info (bl) [t/o/r/d] (2.7 and 2.8.x) t = total number of blocks received o = number of original-blocks received r = number of retransmitted-blocks received d = number of blocks discarded (before getting accepted) (t = o + r + d) Receiver Blocks Info (bl) [t/o/r/d/ts] (3.0.x) t = total number of blocks received o = number of original-blocks received r = number of retransmitted-blocks received d = number of blocks discarded (before getting accepted) ts= total number of blocks sent by sender (only non-zero if peer supports sending it >=3.0.x) (t = o + r + d) Retransmission round-trip time (rex_rtt) [r/s/o] (pre-2.7) r = latest measurement of REX RTT s = smoothed version of REX RTT o = latest computed time-out value (i.e. RTO) Retransmission round-trip time (rex_rtt) [l/h/s/o] (2.7) l = smallest measured REX RTT during last period h = biggest measured REX RTT during last period s = smoothed version of REX RTT o = latest computed time-out value (i.e. RTO) Out-of-order blocks round-trip time (ooo_rtt) [r/s/o] (pre-2.7) r = latest measurement of OOO RTT s = smoothed version of OOO RTT o = latest computed time-out value (i.e. RTO) Out-of-order blocks round-trip time (ooo_rtt) [l/h/s/o] (2.7) l = smallest measured OOO RTT during last period h = biggest measured OOO RTT during last period s = smoothed version of OOO RTT o = latest computed time-out value (i.e. RTO) Network round-trip time (rate_rtt) [b/r/s/f] (pre-2.7) b = base (minimum) network RTT r = latest measurement of network RTT s = smoothed version of (network) RTT (In 2.2.x, "s" is the sum of the base RTT plus the largest queueing component -- the one that currently takes effect) f = FASP rate adaptation mode (meaningful with –Q) Network round-trip time (rate_rtt) [b/l/h/s/r/f] (2.7) b = base (minimum) network RTT l = smallest measured NET RTT during last period h = biggest measured NET RTT during last period s = smoothed version of (network) RTT (In 2.2.x, "s" is the sum of the base RTT plus the largest queueing component -- the one that currently takes effect) r = number of times base RTT was reset f = FASP rate adaptation mode (meaningful with –Q) Control PDU stats (ctl) [bm/bs] (2.7.4 - ) bm = number of control PDUs with badmac bs = number of control PDUs with badseqno Retransmit Requests (rex) [n/s/q/a/r] (2.0.x - 2.5.x) n = number of NACKs outstanding in the NACK-list s = number of NACK requests sent q = number of NACK request messages sent (1 NACK request message carries multiple NACK requests) a = number of ADD failures to NACK list (due to out of space) r = number of REMOVE failures from nack list (implies duplicated retransmissions or buggy behavior) Retransmit Requests (rex) [n/s/q/v/a/r] (2.6.x) n = number of NACKs outstanding in the NACK-list s = number of NACK requests sent q = number of NACK request messages sent (1 NACK request message carries multiple NACK requests) v = number of NACK requests removed because block already revd a = number of ADD failures to NACK list (due to out of space) r = number of REMOVE failures from nack list (implies duplicated retransmissions or buggy behavior) Lost blocks (bl) [l/d/o/x/r] (pre-2.7) l = number of lost original blocks (note: if rex buffer/timer full, this counter is incorrectly adding same lost > originals multiple times) d = number of duplicate blocks received o = number of out-of-order blocks received x = number of lost blocks for which retransmission should have been requested, but were not because the rex table was full r = number of lost blocks due to receiver ring buffer full Lost blocks (bl) [l/d/o/r/a/x/dl/df/dm/ds] (2.7) l = number of lost original blocks (note: if rex buffer/timer full, this counter is incorrectly adding same lost > originals multiple times) d = number of duplicate blocks received o = number of out-of-order blocks received r = number of lost blocks due to receiver ring buffer full a = number of blocks discarded due to aborted transfers x = number of blocks discarded when xfer is not ready dl= number of duplicate last blocks df= number of blocks dropped due to bad format (before being accepted) dm= number of blocks dropped due to bad mac checksum (before being accepted) ds= number of blocks dropped due to bad sequence number (before being accepted) (d (number of blocks discarded) = r + a + x + dl) Disk (disk) [b/w] b = number of buffers occupied in the disk-ring (equiv to disk backlog) w = total number of bytes written (including non-contiguous) Disk (disk) [l/h/b/w] (2.7) l = smallest measured buffer depth during last period h = largest measured buffer depth during last period b = number of buffers occupied in the disk-ring (equiv to disk backlog) w = total number of bytes written (including non-contiguous) Rate (rate) [t/m/c] (2.0.2/2.1.x) t = target rate of the session m = minimum rate of the session c = transfer rate currently computed Rate (rate) [t/m/c/n/vl/vr/r] (2.2.x) t = target rate of the session m = minimum rate of the session c = transfer rate currently computed (minimal of n/vl/vr/r) n = transfer rate computed based on network congestion vl = transfer rate computed based on local virtual link vr = transfer rate computed based on remote virtual link r = transfer rate computed based on receiver storage I/O Vlink (vlink) [lq/lo/rq/ro] (3.0.x) lq = local vlink queueing delay (ms) lo = local vlink occupancy (%) rq = remote vlink queueing delay (ms) ro = remote vlink occupancy (%) Progress (prog) [t/f/e] t = running total of bytes received f = running total of file bytes received e = total elapsed time (microseconds) Number of files received (rcvD)
FASP 发送方会话日志记录(2.0.2/2.1.x/2.2.x/2.5.x/2.6.x/2.7.x/2.8.x)
2008-01-11 14:37:30 [97f-b6962bb0] Sender bl t/o/r/f/l=3/3/0/0/0 rex b/r/n=0/0/0 rate t/m/c=10000000/200000/10000000 prog t/f/e=0/0/46 xmtD=1 (2.0.2-2.6.x excluding 2.6.4) 2008-01-11 14:37:30 [97f-b6962bb0] Sender bl t/o/r/f/l=3/3/0/0/0 rex b/r/n=0/0/0 r_ott q/s/n=2/1/1000 rate t/m/c=10000000/200000/10000000 prog t/f/e=0/0/46 xmtD=1 (2.6.4/2.7.1) 2008-01-11 14:37:30 [97f-b6962bb0] Sender bl t/o/r/dl=3/3/0/1 rex b/r/n=0/0/0 r_ott q/s/n=2/1/1000 rate t/m/c=10000000/200000/10000000 prog t/f/e=0/0/46 xmtD=1 (2.7.1/2.7.3) 2008-01-11 14:37:30 [97f-b6962bb0] Sender bl t/o/r/dl=3/3/0/1 ctl bm/bs=0/1 rex b/r/n=0/0/0 r_ott q/s/n=2/1/1000 rate t/m/c=10000000/200000/10000000 prog t/f/e=0/0/46 xmtD=1 (2.7.4 and after) Sender Blocks Info (bl) [t/o/r/f/l] t = total number of blocks sent o = number of original-blocks transmitted r = number of retransmitted-blocks transmitted f = number of times first block was transmitted l = number of times the last block was transmitted Sender Blocks Info (bl) [t/o/r/dl] (2.7.3 and newer) t = total number of blocks sent o = number of original-blocks transmitted r = number of retransmitted-blocks transmitted dl = number of duplicated last block Control PDU stats (ctl) [bm/bs] (2.7.4 - ) bm = number of control PDUs with badmac bs = number of control PDUs with badseqno Retransmissions (rex) [b/r/n] b = number of NACKs received (1 NACK per block) r = number of NACK requests received n = number of outstanding blocks to be retransmitted Reverse path queueing (r_ott) [q/s/n] q = reverse path queueing (ms) s = estimated clock skew (magnified by 1000) n = number of reverse path queueing probes accepted Rate [t/m/c] t = target rate of the session m = minimum rate of the session c = current transfer rate (computed by the receiver) Progress (prog) [t/f/e] t = running total of bytes transmitted f = running total of file bytes transmitted e = total elapsed time (microseconds) Number of files transmitted (xmtD)
DS 发送方日志记录
Queues (Qs) [ds/re/a/d/ao/eo/n/ap/rq/rd/ru/di/do] ds = nubmer of DS count re = number of entries in the "retry" queue (used when OS returns too many open file) a = number of entries in the "arguments" queue d = number of entries in the directory queue ao = number of entries in the open asynchronous queue eo = number of entries in the open error queue n = number of entries in the "new file" queue ap = number of entries in the "attribute preparation" queue (in the process of calculating md5 checksum) rq = number of entries for which the "DS request" was sent (not yet responded to) rd = number of entries in the READY queue (DS response received) ru = number of entries in the RUNNING queue cl = number of entries in the CLEAN queue di = number of entries in the DISP queue (no action needed, for example the file was already transferred) do = number of entries in the DONE queue (finished transferring) Requests/Responses (Rs) [o/i] o = number of DS requests sent (cumulative) i = number of DS responses received (cumulative) Management backlog (mgmt backlog) [i/s/n] i = index of management socket s = file descriptor of management socket n = total messages buffered for socket
DS 接收器日志记录
Queues (Qs) [ds/n/rq/ao/ap/rd/ru/no/po/pc/do] ds = nubmer of DS count n = number of entries in the "new file" queue rq = number of entries for which the "DS request" was received ao = number of entries in the open asynchronous queue ap = number of entries in the attribute preparation queue rd = number of entries in the READY queue ru = number of entries in the RUNNING queue no = number of entries in the NOXFER DISP queue (no action needed, for example the file was already transferred) po = number of entries in the POST queue pc = number of entries in the post complete queue do = number of entries in the DONE queue (finished transferring) Requests/Responses (Rs) [o/i] o = number of DS requests received (cumulative) i = number of DS responses sent (cumulative)
IBM 尚未提供有关“FASP 会话统计”的信息,因此这里是我的团队的实证分析:
FASP 会话统计
[112c-00001cf0] thread ID - shared across events for the same transfer session
[Sender] Role / type. I.e. "sender" means an outbound transfer
id=029c1333-***64cb another form of a transfer id ("UUID" in some other events)
delay=64ms TBD
rex_delay=72ms "retransmission delay", TBD
solicited_rex=0.01% TBD
unsolicited_rex=0.00% TBD
ave_xmit_rate 299.53Mbps self explanatory; formula on how it's calculated - TBD
effective=99.99% self explanatory
effective_rate=299.50Mbps self explanatory
(detail: good_blks 25762789 bl_total 25765926 bl_orig 25762790 bl_rex 3136 dup_last_blks 1)
"detail" stanza
good_blks : actual blocks to transfer
bl_total : total blocks transferred incl. rex
: forumula: bl_total = bl_orig + bl_rex
bl_orig : "original blocks" = good_blks + dup_last_blks
bl_rex : retransmitted blocks*
dup_last_blks : "duplicated last blocks"
* high ratio of bl_rex to good_blks
indicates an issue
(sndr ctl: sent 1061 rcvd 1061 lost 0 lost 0.00%) (rcvr ctl: sent 9618 rcvd 9617 lost 1 lost 0.01%)
"Sender Control" stanza (TBD)
(rex ctl: sent 20 rcvd 20 lost 0 lost 0.00%)
"Retransmission Control" stanza (TBD)
(progress: tx_bytes 36274005756 file_bytes 55740661756 tx_time 996528996)
"Progress" stanza
tx_bytes : transfer bytes
file_bytes : file bytes
tx_time : transfer time (in microseconds)
(estimate transfer size: 0 bytes) self explanatory (relationship with blocks and
progress bytes TBD
sess->instru.num_rex_xfer_not_found 0 TBD
sess->instru.num_rex_abrtxfer_not_found 0 TBD
用例:性能不佳的 Aspera 服务器
通过解读此特定事件类型(上面的“LOG FASP 会话统计信息”)中的字段,我们可以在 Splunk 中并排比较我们的两个 Aspera 服务器及其关键统计数据(重传率和每小时 GB 数)。比较结果显示,在整个 2020 年 8 月,其中一个服务器的重传块比率异常高,性能降低了 40%。查看前几个月的类似统计数据显示,这两台服务器的重传率低于百分之一,性能 >90GB/h,处于同一水平。
8 月初,有问题的服务器被迁移到了不同的网络段。除了我们看到的其他错误之外,这种分析还使我们能够检测和衡量问题的影响,并最终锁定罪魁祸首(网络配置)。问题缓解后,服务器的性能恢复正常。
如果您将 Aspera 日志导入 Splunk 并希望在搜索时提取字段并显示统计数据的 SPL,我会将一些内容发布到 Splunk Answers 并链接到这里。