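I have been using the shell script written by Mr. Rob Verschoor, a great Sybase expert. It is invoked hourly from a cron job and sends us an email if any line in the errorlog matches one of a set of predefined keywords. For ease of reference, I am posting below the part of the code that may be causing the problem: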
LAST_MARKER=$(${AWK} '/'$MARKER'/ { a=NR } END { print a }' $LOGFILE_COPY)
LAST_MARKER=`echo "$LAST_MARKER+0"|bs`
if [ ! "$LAST_MARKER" = "" ]
then
sed "1,${LAST_MARKER}d" $LOGFILE_COPY > $TMP.x
cp $TMP.x $LOGFILE_COPY
fi
This has worked fine for the past two years without any problems, with just one line added by me after the first line, as follows:
LAST_MARKER=`echo "$LAST_MARKER+0"|bs`
This was to convert the returned line number into a plain number, because it was coming out in scientific notation.
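For the last few days, since we disabled a monitoring tool, there seems to be a problem finding the last marker. That monitoring tool used to fill the errorlog with trace messages almost every second. So between the last marker and the new marker we used to have many lines of entries, and we never ran into any problem. Now that the tool is disabled, there is no activity during non-working hours, so the last marker and the new marker end up on consecutive lines.
Earlier, the errorlog looked like this, with many messages: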
00:0005:00000:00514:2020/04/17 10:15:59.92 server _Marker_For_Checking_Errorlog_
00:0005:00000:00514:2020/04/17 10:15:59.92 server _Marker_End_
...
00:0002:00000:00608:2020/04/16 11:12:40.88 server DBCC TRACEON 3604, SPID 608
00:0002:00000:00608:2020/04/16 11:12:40.88 server DBCC TRACEOFF 3604, SPID 608
00:0006:00000:00660:2020/04/16 11:13:40.47 server DBCC TRACEON 3604, SPID 660
00:0006:00000:00660:2020/04/16 11:13:40.47 server DBCC TRACEOFF 3604, SPID 660
00:0006:00000:00664:2020/04/16 11:13:40.51 server DBCC TRACEON 3604, SPID 664
00:0006:00000:00664:2020/04/16 11:13:40.51 server DBCC TRACEOFF 3604, SPID 664
00:0002:00000:00608:2020/04/16 11:13:40.54 server DBCC TRACEON 3604, SPID 608
00:0002:00000:00608:2020/04/16 11:13:40.54 server DBCC TRACEOFF 3604, SPID 608
00:0006:00000:00660:2020/04/16 11:13:40.87 server DBCC TRACEON 3604, SPID 660
00:0006:00000:00660:2020/04/16 11:13:40.87 server DBCC TRACEOFF 3604, SPID 660
00:0004:00000:00608:2020/04/16 11:14:40.92 server DBCC TRACEOFF 3604, SPID 608
...
00:0005:00000:00514:2020/04/17 11:15:59.92 server _Marker_For_Checking_Errorlog_
00:0005:00000:00514:2020/04/17 11:15:59.92 server _Marker_End_
Now, the errorlog looks like this:
00:0004:00000:00974:2020/04/17 09:15:28.80 server _Marker_For_Checking_Errorlog_
00:0004:00000:00974:2020/04/17 09:15:38.80 server _Marker_End_
00:0005:00000:00514:2020/04/17 10:15:59.92 server _Marker_For_Checking_Errorlog_
00:0005:00000:00514:2020/04/17 10:15:59.92 server _Marker_End_
00:0003:00000:00030:2020/04/17 11:16:01.51 server _Marker_For_Checking_Errorlog_
00:0003:00000:00030:2020/04/17 11:16:01.51 server _Marker_End_
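The tool is unable to distinguish between the previous marker and the last marker, so it keeps re-sending errors that occurred 3-4 hours ago, when it should send no error email at all, because nothing was written to the errorlog in the past hour.
I am not an expert in shell scripting, so any help with this would be highly appreciated.
Edit: The correct behaviour of this tool is to send an email like the one below at 4:15 (the scheduled time), because there were lines matching the predefined keywords in the past hour (between 3:15 and 4:15):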
Checking ASE errorlog
Fri Apr 17 04:16:06 WAT 2020
Server=Sybaseprd
Errorlog=/mount/ASE-15_0/install/Sybaseprd.log
00:0006:00000:00061:2020/04/17 04:03:37.15 server Error: 1621, Severity: 18, State: 1
00:0006:00000:00061:2020/04/17 04:03:37.15 server Type '16' not allowed before login.
00:0004:00000:00668:2020/04/17 04:03:42.17 server Error: 1621, Severity: 18, State: 1
00:0004:00000:00668:2020/04/17 04:03:42.17 server Type '16' not allowed before login.
00:0004:00000:00100:2020/04/17 04:03:42.17 server Error: 1621, Severity: 18, State: 1
00:0004:00000:00100:2020/04/17 04:03:42.17 server Type '16' not allowed before login.
00:0012:00000:00000:2020/04/17 04:03:49.30 kernel ksmask__rpacket: Invalid tdslength value 21536, kpid: 268895208
00:0003:00000:00932:2020/04/17 04:04:59.20 server Error: 1621, Severity: 18, State: 1
00:0003:00000:00932:2020/04/17 04:04:59.20 server Type '3' not allowed before login.
9 error lines found in errorlog for ASE server 'SybasePrd'
(end)
The incorrect behaviour is as follows:
Checking ASE errorlog
Fri Apr 17 05:16:01 WAT 2020
Server=SybasePrd
Errorlog=/mount/ASE-15_0/install/Sybaseprd.log
00:0006:00000:00061:2020/04/17 04:03:37.15 server Error: 1621, Severity: 18, State: 1
00:0006:00000:00061:2020/04/17 04:03:37.15 server Type '16' not allowed before login.
00:0004:00000:00668:2020/04/17 04:03:42.17 server Error: 1621, Severity: 18, State: 1
00:0004:00000:00668:2020/04/17 04:03:42.17 server Type '16' not allowed before login.
00:0004:00000:00100:2020/04/17 04:03:42.17 server Error: 1621, Severity: 18, State: 1
00:0004:00000:00100:2020/04/17 04:03:42.17 server Type '16' not allowed before login.
00:0012:00000:00000:2020/04/17 04:03:49.30 kernel ksmask__rpacket: Invalid tdslength value 21536, kpid: 268895208
00:0003:00000:00932:2020/04/17 04:04:59.20 server Error: 1621, Severity: 18, State: 1
00:0003:00000:00932:2020/04/17 04:04:59.20 server Type '3' not allowed before login.
9 error lines found in errorlog for ASE server 'SybasePRD'
(end)
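The above job was triggered at 5:15, and there were no matching lines between 4:15 and 5:15, so nothing should have been reported. As I mentioned earlier, the program keeps sending this email for the next 5 scheduled runs, i.e. until 10:15, and stops only once the number of entries in the errorlog after the above errors exceeds 40 or so.
So the desired outcome is to find the bug in the above shell script and fix it so that it checks exactly the past hour, i.e. from the last marker to the last line of the errorlog. If there are no entries in that range, meaning no lines have been added since the last check, it should check nothing and report nothing, as in this case: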
00:0004:00000:00974:2020/04/17 09:15:28.80 server _Marker_For_Checking_Errorlog_
00:0004:00000:00974:2020/04/17 09:15:38.80 server _Marker_End_
00:0005:00000:00514:2020/04/17 10:15:59.92 server _Marker_For_Checking_Errorlog_
00:0005:00000:00514:2020/04/17 10:15:59.92 server _Marker_End_
00:0003:00000:00030:2020/04/17 11:16:01.51 server _Marker_For_Checking_Errorlog_
00:0003:00000:00030:2020/04/17 11:16:01.51 server _Marker_End_
Answer 1
We've kind of stalled, so let's see if we can get things rolling again. Assuming the code you posted:
LAST_MARKER=$(${AWK} '/'$MARKER'/ { a=NR } END { print a }' $LOGFILE_COPY)
LAST_MARKER=`echo "$LAST_MARKER+0"|bs`
if [ ! "$LAST_MARKER" = "" ]
then
sed "1,${LAST_MARKER}d" $LOGFILE_COPY > $TMP.x
cp $TMP.x $LOGFILE_COPY
fi
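is intended to delete the text from $LOGFILE_COPY up to and including the last line that contains $MARKER (if one exists), then this will do that if you have tac: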
tac "$LOGFILE_COPY" | awk -v m="$MARKER" '$0~m{exit} 1' | tac > "${TMP}.x" &&
mv "${TMP}.x" "$LOGFILE_COPY"
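To check what that pipeline does on a log shaped like the one in your question, here is a small self-contained sketch. The sample lines, the shortened timestamps and the `work` directory are made up for illustration; only the `tac | awk | tac` pipeline is the command above.

```shell
#!/bin/sh
# Illustrative only: the sample data and paths below are made up.
MARKER='_Marker_For_Checking_Errorlog_'
work=$(mktemp -d) || exit 1
TMP="$work/tmp"
LOGFILE_COPY="$work/errlog"

# Two hourly runs: one error between the markers, and nothing after the
# last marker except the _Marker_End_ line the script always writes.
cat > "$LOGFILE_COPY" <<'EOF'
03:15:59 server _Marker_For_Checking_Errorlog_
03:15:59 server _Marker_End_
04:03:37 server Error: 1621, Severity: 18, State: 1
04:16:01 server _Marker_For_Checking_Errorlog_
04:16:01 server _Marker_End_
EOF

# Reverse the file, print lines until the first marker seen (which is the
# last marker in the original order), then reverse the result back.
tac "$LOGFILE_COPY" | awk -v m="$MARKER" '$0~m{exit} 1' | tac > "${TMP}.x" &&
mv "${TMP}.x" "$LOGFILE_COPY"

remaining=$(cat "$LOGFILE_COPY")
echo "$remaining"
rm -rf "$work"
```

Only the final `_Marker_End_` line should be left, and the error from before the last marker is gone, so a run with no new log entries would have nothing to report.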
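If you don't have tac, then the following 2-pass awk-only solution will run a bit slower and won't work on input coming from a pipe, but it will work for input files of any size, whereas the tac solution above could fail if the input file is absolutely massive: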
awk -v m="$MARKER" 'NR==FNR{if ($0~m) a=NR; next} FNR>a' "$LOGFILE_COPY" "$LOGFILE_COPY" > "${TMP}.x" &&
mv "${TMP}.x" "$LOGFILE_COPY"
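As a rough illustration of how the two passes behave, the sketch below wraps the same awk program in a function and runs it on two made-up files, one with a marker and one without. The function name `lines_after_last_marker` and the sample lines are hypothetical, not part of the original script.

```shell
#!/bin/sh
# lines_after_last_marker FILE MARKER  (hypothetical helper name)
# Pass 1 (NR==FNR) remembers the line number of the last MARKER line;
# pass 2 prints only the lines whose number is greater than that.
lines_after_last_marker() {
    awk -v m="$2" 'NR==FNR{if ($0~m) a=NR; next} FNR>a' "$1" "$1"
}

work=$(mktemp -d) || exit 1

# Made-up sample files: one containing a marker, one without any marker.
printf '%s\n' 'old error' 'X _Marker_For_Checking_Errorlog_' \
              'X _Marker_End_' 'new error' > "$work/with_marker"
printf '%s\n' 'first line' 'second line' > "$work/no_marker"

with_marker=$(lines_after_last_marker "$work/with_marker" '_Marker_For_Checking_Errorlog_')
no_marker=$(lines_after_last_marker "$work/no_marker" '_Marker_For_Checking_Errorlog_')

printf 'with marker:\n%s\n' "$with_marker"
printf 'without marker:\n%s\n' "$no_marker"
rm -rf "$work"
```

When no marker is present, `a` is never set and compares as 0, so every line is kept. That matches the first run against a log that has never had a marker written to it.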
If that's too slow (I'd be surprised if it is), this may be a bit faster (it'll certainly be faster than the script you started with):
start=$(awk -v m="$MARKER" '$0~m{a=NR} END{printf "%d\n", a+1; exit (a?0:1)}' "$LOGFILE_COPY") &&
tail -n +"$start" "$LOGFILE_COPY" > "${TMP}.x" &&
mv "${TMP}.x" "$LOGFILE_COPY"
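Here is a small sketch of that last variant wrapped in a function, mainly to show what happens when the file contains no marker at all. The function name `trim_to_last_marker`, the `.x` suffix on the file name and the sample files are made up for illustration.

```shell
#!/bin/sh
# trim_to_last_marker FILE MARKER  (hypothetical helper name)
# Rewrites FILE so that only the lines after the last MARKER line remain.
# If MARKER never occurs, awk exits non-zero, the && chain stops, and FILE
# is left untouched.
trim_to_last_marker() {
    file=$1
    marker=$2
    start=$(awk -v m="$marker" '$0~m{a=NR} END{printf "%d\n", a+1; exit (a?0:1)}' "$file") &&
    tail -n +"$start" "$file" > "$file.x" &&
    mv "$file.x" "$file"
}

work=$(mktemp -d) || exit 1
printf '%s\n' 'old error' 'X _Marker_For_Checking_Errorlog_' 'X _Marker_End_' > "$work/errlog"
printf '%s\n' 'first line' 'second line' > "$work/plain"

if trim_to_last_marker "$work/errlog" '_Marker_For_Checking_Errorlog_'
then errlog_result=trimmed
else errlog_result=untouched
fi
errlog_content=$(cat "$work/errlog")

if trim_to_last_marker "$work/plain" '_Marker_For_Checking_Errorlog_'
then plain_result=trimmed
else plain_result=untouched
fi
plain_content=$(cat "$work/plain")

echo "errlog: $errlog_result"
echo "plain:  $plain_result"
rm -rf "$work"
```

The `printf "%d\n"` is there so the line number comes out as a plain integer whatever awk you have, which should make a separate reformatting step like your `bs` line unnecessary. I'm assuming the scientific notation you saw came from awk's default number-to-string conversion.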
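Does that solve your problem?
Aside: here's the start of how to modify your original script to address the most fundamental problems in it and make it easier to read: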
#!/bin/sh
this_prog=$(basename "$0")
usage()
{
echo "Usage:"
echo " $this_prog <servername> <login> <passwd> [<errorlog-pathname> [\"all\"]]"
}
#---------------------------------------------------------------------------
# Check parameters
if [ $# -lt 3 ] || [ $# -gt 5 ]
then
usage
exit 1
fi
srv=$1
login=$2
psswd=$3
logfile=$4
opt=$5
#---------------------------------------------------------------------------
# Temp directory
tmp=$(mktemp -d) || exit 1
trap 'rm -f "$tmp"/*; rmdir "$tmp"; exit' 0
logfile_copy="${tmp}/errlog"
#---------------------------------------------------------------------------
# Some constants; do NOT change these !
dft_mailprog="your_mail_program" #DO NOT CHANGE -- go to the next section
dft_dba_mail="[email protected] [email protected]" #DO NOT CHANGE
# -- go to the next section
#---------------------------------------------------------------------------
# Some definitions
#
# mailprog must be set to your command-line mail program, like 'mail', 'mailx',
# etc. Later in this script, it is assumed that this mail program supports
# specifying the mail subject on the command line with the "-s" option.
# Should you use 'sendmail', you'll have to modify the script, or do without
# the mail subject, as 'sendmail' does not have this "-s" option.
# NT users may want to use 'ssmtp' (part of CygWin) as their mail
# program (also see comment below).
mailprog="$dft_mailprog" # define your own setting here
# Define a list of people receiving results by email:
dba_mail="$dft_dba_mail" # define your own setting here
skip_when_empty=NO # if YES, will not send mail when no errors were found
#---------------------------------------------------------------------------
# The marker strings below can be set to any arbitrary string, as long
# as this is unique and does not appear in the errorlog as part of any
# error message.
# These strings should not be changed anymore once you've started using
# this script.
marker="_Marker_For_Checking_Errorlog_" #do not change this !
marker2="_Marker_End_" #do not change this !
#--------------------------------------------------------------------------
# Change the below to 'gawk' (or 'nawk') if desired... This may be needed
# when hitting built-in max. string length limits in 'awk'. 'gawk' etc.
# tend to be more flexible.
AWK='awk' # awk|gawk
#---------------------------------------------------------------------
# Check the mail program and email addresses have been defined
if [ "$mailprog" = "$dft_mailprog" ]
then
echo ""
echo "You must first define the variable 'mailprog' in this script;"
echo "please set it to the name of your command-line mail program,"
echo "like 'mail', 'mailx', etc."
echo ""
exit 1
fi
if [ "$dba_mail" = "$dft_dba_mail" ]
then
echo ""
echo "You must first define the variable 'dba_mail' in this script;"
echo "please set it to a list of recipients."
echo ""
exit 1
fi
#--------------------------------------------------------------------------
# First locate the server errorlog
rm -f "$logfile_copy"
if [ "$logfile" = "" ]
then
# Pick up the server errorlog pathname; first check if this is 12.0
# or later to determine the method for doing this
#
cat << --EOF-- > "${tmp}/vchk.sql"
select name from sysobjects -- used for ASE version check
where name = "sysqueryplans"
go
dbcc traceon(3604)
go
dbcc resource -- contains errorlog pathname
go
--EOF--
# The below isql session also doubles as an ASE access and
# privilege check.
# Using 'cat' and piping the SQL to isql is done to make it run on
# Windows NT as well ('cos the NT version of 'isql' won't understand
# Unix-style pathnames)
#
< "${tmp}/vchk.sql" isql -S"$srv" -U"$login" -P"$psswd" -w500 > "${tmp}/vchk"
if grep -q "CT-LIBRARY error" "${tmp}/vchk"
then
cat "${tmp}/vchk"
echo ""
echo "*** Note: in case you cannot connect because the ASE server is down,"
echo "*** you can also specify the errorlog pathname explicitly."
echo ""
usage
exit 1
fi
if grep "You must have the following role(s) to" "${tmp}/vchk"
then
exit 1
fi
# 18-Sep-2001 Corrected the test below: it said "-ne 1" instead of "-eq 1",
# causing it to not to identify version pre-12.0 correctly
# (thanks to Jean Loesch)
#
if [ "$(grep -c "sysqueryplans" "${tmp}/vchk")" -eq 1 ]
then
#--------------------------------------------------------------------------
# This is ASE 12.0+, so locate the errorlog through @@errorlog (this isn't
# really necessary, as dbcc resource would still work fine), but let's do
# it anyway for educational purposes ...
cat << --EOF-- > "${tmp}/ataterrlog.sql"
print @@errorlog
go
--EOF--
< "${tmp}/ataterrlog.sql" isql -S"$srv" -U"$login" -P"$psswd" > "${tmp}/ataterrlog"
logfile=$( "$AWK" '{print $1}' "${tmp}/ataterrlog" )
#--------------------------------------------------------------------------
else # not 12.0+
# This is ASE pre-12.0, so locate the errorlog through dbcc resource (already
# executed above)
logfile=$( "$AWK" 'sub(/.*rerrfile=/,""){print $1}' "${tmp}/vchk" )
fi
fi # if $logfile = ""
#--------------------------------------------------------------------------
# Errorlog file name known now, check if it's there
if [ ! -f "$logfile" ]
then
echo "Error accessing server errorlog file [$logfile] - file not found"
echo "Note: this script must be run on the same host where the "
echo "ASE errorlog file is located."
exit 1
fi
cp "$logfile" "$logfile_copy"
#--------------------------------------------------------------------------
# Check option parameter
#
if [ "$opt" = "" ]
then
scan_all=N
else
scan_all=Y
echo "Scanning the entire ASE errorlog."
fi
#--------------------------------------------------------------------------
if [ "$scan_all" = "N" ]
then
# Skip the part of the errorlog until the last marker
# Note: if the next line gives an error message, use a different shell
last_marker=$("$AWK" -v marker="$marker" '$0 ~ marker { a=NR } END { printf "%d\n", a }' "$logfile_copy")
# last_marker is 0 when no marker was found; then there is nothing to delete
if [ "$last_marker" -gt 0 ]
then
sed "1,${last_marker}d" "$logfile_copy" > "${tmp}/x" &&
cp "${tmp}/x" "$logfile_copy"
fi
fi
#--------------------------------------------------------------------------
# Create output file
{
echo "Checking ASE errorlog"
date
echo "Server=$srv"
echo "Errorlog=$logfile"
echo ""
} > "${tmp}/out"
#--------------------------------------------------------------------------
# Finally... search for errors in the log file. The below set of search
# strings catches pretty much everything, but you can add any string here
# which you would also like to search for...
#
# Note that these strings indicate the presence of messages that should
# be investigated. Still, this may require further inspection of the
# errorlog, as more messages may be present which contain additional
# information.
grep -Ei '(warning|severity|fail|unmirror|mirror exit|not enough|error|suspect|corrupt|correct|deadlock|critical|allow|infect|error|full|problem|unable|not found|threshold|couldn|not valid|invalid|NO_LOG|logsegment|syslogs|stacktrace)' "$logfile_copy" |
grep -Evi '(successfull|_Marker_|(Suspect Granularity))' > "${tmp}/out2"
nrlines=$(wc -l "${tmp}/out2" | "$AWK" '{print $1}')
cat "${tmp}/out2" >> "${tmp}/out"
#--------------------------------------------------------------------------
#
echo "$nrlines error lines found in errorlog for ASE server '$srv'"
{
echo ""
echo "$nrlines error lines found in errorlog for ASE server '$srv'"
echo ""
echo "(end)"
} >> "${tmp}/out"
if [ "$skip_when_empty" = "NO" ] && [ "$nrlines" -eq 0 ]
then
nrlines=1 # to force it into mailing anyway
fi
if [ "$nrlines" -gt 0 ]
then
# Mail any error messages found to the list of recipients
# (note: assumption is that the -s "subject" option is available for
# your email program. Should you use "sendmail", it may not be
# available, and you'd have to remove this option; when you're familiar
# with 'sendmail', you can add the subject line yourself by inserting
# header lines into the message file)
#
# Note for NT users: if you need a command-line mail program on NT,
# consider 'ssmtp'. This is part of the CygWin package, which you need
# anyway to run this script on NT. The download location for CygWin
# is in the file header above.
subj="Results of ASE errorlog check for '$srv'"
"$mailprog" -s "$subj" "$dba_mail" < "${tmp}/out"
fi
#--------------------------------------------------------------------------
if [ "$scan_all" = "N" ]
then
# Write a new marker to the server errorlog to indicate we got till here
# Only do this when (i) no explicit errorlog pathname was specified and
# (ii) only the last part of the log was scanned.
cat << --EOF-- > "${tmp}/logprint.sql"
dbcc logprint ("$marker")
dbcc logprint ("$marker2") -- need a second line to avoid missing the last line
if @@error = 0 print "Writing marker to ASE errorlog."
-- note: in ASE 12.0, we could use the more tidy "dbcc printolog(string)" instead
go
--EOF--
< "${tmp}/logprint.sql" isql -S"$srv" -U"$login" -P"$psswd" | grep -Ev '(DBCC execution compl|(SA))'
fi
#--------------------------------------------------------------------------
# end
#
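There are other improvements that could be made, and it hasn't been tested so it may contain bugs, but hopefully you can compare it to the original to see the ways in which the original should be changed.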