Extract specific words and their data from an html/xml file

Sample input:

<bre rt="1600" et="1550794901464" st="1550794899864" tid="8390500116294391399" mh="N" cn="" lc="" ts="N/A" cidc="" IDC="" eidc="BRE-S-TRA-0085418501"/>
    <r1>
        <gr1>
            <a="1" b="smaple data with spaces" c="Created TrasctionInfo" d="1550794901228"/>
            <e="INITIAL" f="2" g="INITIAL_LEGACY" h="1550794901228" i="LegacyToggle is off. Follow Legacy flow"/>
            <lx ets="2019-02-22T00:21:41.228Z" trxn="smaple data with spaces 2 record" rn="Derive data" abc="COT def" def="Season occur" trxn="smaple data with spaces 3rd record" den="andys and others" trxn="smaple data with spaces 4th record" kit="Theater - Span day"
             rns="Span day" trxn="smaple data with spaces 5th record" off="|"/>
            <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl>
                </gr1>
            </r1>
</bre>
<bre rt="1234" et="1234794901464" st="1234794899864" tid="2345500116294391399" mh="Y" cn="At123" lc="" ts="NA" cidc="" IDC="some text value" eidc="abc-def-gh-2385418501"/>
    <r1>
        <gr1>
            <a="1" trxn="other data with spaces" c="Created Info" d="3434794545228"/>
            <e="begin" f="2" g="INITIAL_LEGACY" h="1234709901228" i="Toggle hig. Follow toggle flow"/>
            <lx ets="2017-02-22T00:21:41.228Z" trxn="another record data" rn="Derive data" abc="COT def" trxn="smaple data with spaces record" def="Season occur" den="andys and others" trxn="smaple data with spaces 4th record" kit="Theater - Span day"
             rns="Span day" trxn="data with spaces" off="|"/>
            <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl>
                </gr1>
            </r1>
</bre>
<bre rt="1234" et="1234794901464" st="1234794899864" tid="2345500116294391399" mh="Y" cn="At123" lc="" ts="NA" cidc="" IDC="some text value" eidc="abc-def-gh-2385418501"/>
    <r1>
        <gr1>
            <a="1" c="Created transaction" b="3434794545228"/>
            <e="begin" f="2" g="INITIAL_LEGACY" h="1234709901228" i="Toggle hig. Follow toggle flow"/>
            <lx ets="2017-02-22T00:21:41.228Z" rn="Derive data" abc="COT def" def="Season occur" den="andys and others" kit="Theater - Span day"
             rns="Span day" off="|"/>
            <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl>
                </gr1>
            </r1>
</bre>

The expected output is:

tid="8390500116294391399"
ts="N/A"
ets="2019-02-22T00:21:41.228Z" 
trxn="smaple data with spaces 2 record"
trxn="smaple data with spaces 3rd record"
trxn="smaple data with spaces 5th record"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z" 
trxn="other data with spaces"
trxn="another record data"
trxn="smaple data with spaces record"
trxn="data with spaces"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z"

I tried the following:

sed -e 's/trxn=/\ntrxn=/g' -e 's/tid=/\ntid=/g' -e 's/ts=/\nts=/g'

while IFS= read -r var
do
    if grep -Fxq "$trxn" temp2.txt
    then
      awk -F"=" '/tid/{print VAL=$i} /ts/{print VAL=$i} /ets/{print VAL=$i} /trxn/{print VAL=$i} /tid/{print VAL=$i;next}' temp2.txt >> out.txt
    else
      awk -F"=" '/tid/{print VAL=$i} /ts/{print VAL=$i} /ets/{print VAL=$i} /tid/{print VAL=$i;next}' temp2.txt >> out.txt
    fi
done < "$input"

Answer 1

Alternatively, using grep:

$ grep -Eo '(ets|tid|trxn|ts)="[^"]+"' file
tid="8390500116294391399"
ts="N/A"
ets="2019-02-22T00:21:41.228Z"
trxn="smaple data with spaces 2 record"
trxn="smaple data with spaces 3rd record"
trxn="smaple data with spaces 4th record"
trxn="smaple data with spaces 5th record"
tid="2345500116294391399"
ts="NA"
trxn="other data with spaces"
ets="2017-02-22T00:21:41.228Z"
trxn="another record data"
trxn="smaple data with spaces record"
trxn="smaple data with spaces 4th record"
trxn="data with spaces"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z"
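
One caveat with the pattern above: it is not anchored at the start of the attribute name, so an attribute whose name merely ends in `ts`, `tid`, `ets` or `trxn` would be reported too. The following is a minimal sketch of an anchored variant, assuming GNU grep (where `\b` is a word-boundary extension). The file name `sample.xml` and the `hits` attribute are invented for illustration and do not appear in the question's data.

```shell
# Hypothetical sample; "hits" is an invented attribute whose name ends in "ts".
cat > sample.xml <<'EOF'
<bre tid="42" ts="N/A" hits="7"/>
<lx ets="2019-02-22T00:21:41.228Z" trxn="first record" rns="Span day" trxn="second record"/>
EOF

# \b (GNU extension) requires the name to start at a word boundary,
# so hits="7" is no longer reported as ts="7".
# [^"]+ still skips attributes with an empty value, as in the original pattern.
result=$(grep -Eo '\b(ets|tid|trxn|ts)="[^"]+"' sample.xml)
printf '%s\n' "$result"
```

Without the `\b`, the same command would also print `ts="7"` for the `hits` attribute.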

Answer 2

Try this:

sed -e 's/trxn=/\ntrxn=/g' -e 's/tid=/\ntid=/g' -e 's/ets=/\nets=/g' input | awk -F '"' '$1~/ets|trx|tid/{print $1"\""$2"\""}'


tid="8390500116294391399"
ets="2019-02-22T00:21:41.228Z"
trxn="smaple data with spaces 2 record"
trxn="smaple data with spaces 3rd record"
trxn="smaple data with spaces 4th record"
trxn="smaple data with spaces 5th record"
tid="2345500116294391399"
trxn="other data with spaces"
ets="2017-02-22T00:21:41.228Z"
trxn="another record data"
trxn="smaple data with spaces record"
trxn="smaple data with spaces 4th record"
trxn="data with spaces"
tid="2345500116294391399"
ets="2017-02-22T00:21:41.228Z"
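
Note that the command above does not print the `ts` attribute, which the question asks for. A possible extension is sketched below, assuming GNU sed (which accepts `\n` in the replacement) and a hypothetical sample file `sample.xml`. Simply adding `s/ts=/\nts=/g` would also split `ets=` in two, so this sketch matches ` ts=` together with its leading space and anchors the name check in awk.

```shell
# Hypothetical two-line sample modelled on the question's input.
cat > sample.xml <<'EOF'
<bre rt="1600" tid="42" mh="N" ts="N/A" eidc="BRE-S"/>
<lx ets="2019-02-22T00:21:41.228Z" trxn="first record" rns="Span day" trxn="second record" off="|"/>
EOF

# Break the line before each wanted attribute. " ts=" is matched with its
# leading space so that the "ts=" inside "ets=" is left alone.
result=$(sed -e 's/trxn=/\ntrxn=/g' -e 's/tid=/\ntid=/g' -e 's/ets=/\nets=/g' -e 's/ ts=/\nts=/g' sample.xml |
    awk -F '"' '$1 ~ /^(ets|trxn|tid|ts)=$/ { print $1 "\"" $2 "\"" }')
printf '%s\n' "$result"
```

This relies on `ts` always being preceded by a single space, which holds for the sample data but is an assumption about real input.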

Answer 3

sed -e "s#\" #\"\n#g;s#.*<lx ##" filename  | grep -E "tid=|ts=|ets=|trxn"

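This replaces every " (double quote) that is followed by a space with " (double quote) plus a newline, strips everything up to and including "<lx ", and then greps only for the required patterns. An awk alternative: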

$ awk -F\" '{for(i=1;i<=NF;i++)if($i~/tid=|ts=|ets=|trxn/){gsub(".* ","",$i);print $i""$(i+1)}}' filename
tid=8390500116294391399
ts=N/A
ets=2019-02-22T00:21:41.228Z
trxn=smaple data with spaces 2 record
trxn=smaple data with spaces 3rd record
trxn=smaple data with spaces 4th record
trxn=smaple data with spaces 5th record
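
The awk output above loses the double quotes because `"` is used as the field separator, and the pattern is tested against values as well as names. A minimal sketch of a variant that keeps the quotes and checks only the name fragments is shown below. It assumes every quote on a line is balanced and that values contain no embedded `"`; `sample.xml` is a hypothetical file name.

```shell
# Hypothetical two-line sample modelled on the question's input.
cat > sample.xml <<'EOF'
<bre rt="1600" tid="42" ts="N/A" eidc="BRE-S"/>
<lx ets="2019-02-22T00:21:41.228Z" trxn="first record" rns="Span day" trxn="second record" off="|"/>
EOF

result=$(awk -F'"' '{
    # Odd-numbered fields hold the name fragments; the next even field is the value.
    for (i = 1; i < NF; i += 2) {
        name = $i
        sub(/.*[ <]/, "", name)    # keep only the text after the last space or "<"
        if (name ~ /^(tid|ts|ets|trxn)=$/)
            print name "\"" $(i + 1) "\""
    }
}' sample.xml)
printf '%s\n' "$result"
```

Because the name is compared exactly, `rns=` and `eidc=` are skipped even though they contain `s=` or `id`.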
