Selectively extracting data

I have the following block of data in a txt file:

Informatica(r) PMCMD, version [9.5.1 HotFix4], build [313.0217], SOLARIS 64-bit 
Copyright (c) Informatica Corporation 1994 - 2014   
All Rights Reserved. 

Invoked at Wed Dec 30 11:13:42 2015 

Connected to Integration Service: [TEST_Integration_Service].   
Integration Service status: [Running]   
Integration Service startup time: [Sun Dec 27 06:37:32 2015]  
Integration Service current time: [Wed Dec 30 11:13:42 2015]  
Folder: [ALS_DIM]  
Workflow: [wf_ld_als_dim] version [1].  
Workflow run status: [Scheduled]  
Workflow run error code: [0]  
Workflow run id [0].  
Schedule time: [Wed Dec 30 19:00:00 2015]  
Workflow run type: [Schedule]  
Run workflow as user: [Administrator]  
Run workflow with Impersonated OSProfile in domain: []  
Integration Service: [TEST_Integration_Service]  
Folder: [ALS_FACT]  
Workflow: [wf_s_m_ld_interchanges_detail_log] version [1].  
Workflow run status: [Scheduled]  
Workflow run error code: [0]  
Workflow run id [0].  
Schedule time: [Mon Jan 04 16:30:00 2016]  
Workflow run type: [Schedule]  
Run workflow as user: [Administrator]  
Run workflow with Impersonated OSProfile in domain: []  
Integration Service: [TEST_Integration_Service]  
Folder: [ALS_PRD]  
Workflow: [wf_maint_service_fields] version [1].  
Workflow run status: [Scheduled]  
Workflow run error code: [0]  
Workflow run id [0].  
Schedule time: [Thu Dec 31 07:10:00 2015]  
Workflow run type: [Schedule]  
Run workflow as user: [Administrator]  
Run workflow with Impersonated OSProfile in domain: []  
Integration Service: [TEST_Integration_Service]  
Number of scheduled workflows on this Integration Service: [3]  
Disconnecting from Integration Service  

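I need to extract the data into another text file so that, for each repetition, only the folder name, workflow name, workflow run status, schedule time, and Integration Service name are extracted, in that order.

For example: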

Insert into <tablename> values('ALS_DIM', 'wf_ld_als_dim', 'Scheduled', 'Wed Dec 30 19:00:00 2015', 'TEST_Integration_Service')

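should be extracted for the first set, and so on.

I have written a script that is specific to the 3 sets of data shown here, but it needs to work for any number of sets.

I have only a very basic knowledge of shell scripting, so any help with this would be great.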

Answer 1

A sed approach...

sed -ne'/^Folder: *\[/!{'                                     \
         -e'/^Workflow\( run status\)\{0,1\}: *\[/!{'         \
              -e'/^Schedule time: *\[/!{'                     \
                   -e'/^Integration Service: *\[/!d'          \
    -e\} -e\} -e\} -e"s//'/"      -e"s/\].*/'/"  -e'H;x'      \
                   -e'/.* .*\n.*/h' -e's///'     -e'x'        \
                   -e's//Insert into <tablename> values(&)/'  \
                   -e's/\n//'     -e's//, /gp'   file

Insert into <tablename> values('ALS_DIM', 'wf_ld_als_dim', 'Scheduled', 'Wed Dec 30 19:00:00 2015', 'TEST_Integration_Service')
Insert into <tablename> values('ALS_FACT', 'wf_s_m_ld_interchanges_detail_log', 'Scheduled', 'Mon Jan 04 16:30:00 2016', 'TEST_Integration_Service')
Insert into <tablename> values('ALS_PRD', 'wf_maint_service_fields', 'Scheduled', 'Thu Dec 31 07:10:00 2015', 'TEST_Integration_Service')

So the first lines negate the acceptable matches, like this:

if ! match ^Folder: *\[
then  if ! match ^Workflow: *\[ or ^Workflow run status: *\[
      then if !  match ^Schedule time: *\[
           then  if !  match ^Integration Service: *\[
                 then  delete
                 fi
           fi
      fi
fi

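However, as soon as a line matches any link in the chain, the chain is broken and is not traversed to the end. This means that the last regular expression tested against any matching line describes the head of that line up to and including the opening square bracket. In a sed script, the empty regular expression // refers back to the most recently used regular expression. So I substitute that head away with s//'/, replacing it with the leading quote ' of your desired output.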

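What remains is all the information you want, plus the trailing context that follows the first ] on each line. So I also substitute away the unwanted tail of each line with s/\].*/'/, replacing it with your trailing quote '.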
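As a minimal sketch of those two substitutions in isolation, the following runs them on a single sample line taken from the question. The variable names line and out are only illustrative, and the reuse of the address through the empty regex assumes a sed that behaves as POSIX describes (GNU sed does).

```shell
# One sample line from the question's data (with its trailing blanks).
line='Folder: [ALS_DIM]  '

# The address keeps only Folder lines. The empty regex in s//'/ then reuses
# that same expression, so the matched head "Folder: [" becomes a quote.
# The second substitution turns the first ] and everything after it into
# the closing quote, and p prints the result.
out=$(printf '%s\n' "$line" |
  sed -n -e '/^Folder: *\[/!d' -e "s//'/" -e "s/\].*/'/p")

printf '%s\n' "$out"
```

For this input the command prints 'ALS_DIM' including the quotes.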

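At this point every line has been stripped down to the part you want, but the lines are not yet joined. To do that I use the hold space, which preserves data across line cycles. I append a copy of each line to the hold space with H, exchange the hold and pattern buffers with x, and then test for a <space> followed somewhere later by a <newline> - which only happens on the next retained line after a date line.

If that pattern is found, I overwrite the hold space with h and use s/// to substitute away all of the pattern space (leaving it empty for the next iteration, because this is the last line of the current one). The empty regex in s/// only removes anything when the <space> + <newline> pattern matches, so both of these instructions can succeed only on the last line of each iteration.

Either way, I then exchange the hold and pattern buffers one last time with x. The pattern buffer now contains either all of this iteration's matches delimited by <newline> characters, or only the most recent match and no <newline> characters. If this is the last matched line of the iteration, the hold buffer is empty at this point; otherwise it contains all lines matched so far, each prefixed with a leading <newline>.

Next I refer to the same regex again and try to replace whatever the <space> + <newline> pattern matched with itself, wrapped in Insert into <tablename> values( and a trailing ).

Last, s/\n// removes the leading <newline> from the pattern space of a last matched line, if there is one, and s//, /g replaces each remaining <newline> with a comma and a <space>. If that substitution succeeds, its result is printed to standard output with p.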
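The hold-space accumulation described above can be seen in a reduced form in the sketch below. It is not the script from this answer: to keep it short it detects the end of a record by testing for the Integration Service key directly instead of using the <space> + <newline> test, it handles only two fields per record, and it leaves out the quoting. The sample input and the variable name out are assumptions made for the illustration.

```shell
out=$(printf '%s\n' \
  'Folder: [ALS_DIM]' \
  'Integration Service: [TEST_Integration_Service]' \
  'Folder: [ALS_FACT]' \
  'Integration Service: [TEST_Integration_Service]' |
sed -n '
  # Any line that is not the last field of a record:
  /^Integration Service:/!{
    # keep only the value inside the first pair of brackets,
    s/^[^[]*\[\([^]]*\)\].*/\1/
    # append it to hold space (H prefixes it with a newline),
    H
    # and start the next cycle without printing.
    d
  }
  # Last field of a record: strip it and append it as well.
  s/^[^[]*\[\([^]]*\)\].*/\1/
  H
  # Bring the accumulated record into pattern space.
  x
  # Drop the leading newline, then join the fields with ", ".
  s/\n//
  s/\n/, /g
  p
  # Empty pattern space and swap, so hold space is empty for the next record.
  s/.*//
  x
')

printf '%s\n' "$out"
```

Because hold space is emptied after every record, the same commands work for any number of records in the input.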

Answer 2

A Perl approach:

$ perl -lne 'if(/^(Folder|Workflow|Workflow.*?status|Sched.*time|Integration Service):.*?\[([^][]+)/){++$k%5==0 ? print "$2"  : printf "%s,",$2}' file
ALS_DIM,wf_ld_als_dim,Scheduled,Wed Dec 30 19:00:00 2015,TEST_Integration_Service
ALS_FACT,wf_s_m_ld_interchanges_detail_log,Scheduled,Mon Jan 04 16:30:00 2016,TEST_Integration_Service
ALS_PRD,wf_maint_service_fields,Scheduled,Thu Dec 31 07:10:00 2015,TEST_Integration_Service

Or, the same thing written out less condensed:

$ perl -lne '
 if(/^                       ## Match the beginning of the line
     (                       ## 1st capturing group: $1
      Folder               | ## The various things we want to match
      Workflow             | 
      Workflow.*?status    |
      Sched.*time          |
      Integration\s*Service
      ):                     ## Only if they are followed by a :
      .*?\[
      (                      ## 2nd capturing group: $2.
        [^][]+               ## The longest string of non-] or [
      )/x                    ## The x allows writing multiline regexes          
    )
{                            ## If this line matches...
    $k=$k+1;                   ## Increment the counter $k by one
    if($k%5==0){               ## If the current value of $k is a multiple of 5.
      print "$2"               ## Print the 2nd captured group and a newline.
    }                          ## The newline is automatically added by the -l. 

    else{
      printf "%s,",$2         ## For other lines, just print with no newline.
    }
}' file
ALS_DIM,wf_ld_als_dim,Scheduled,Wed Dec 30 19:00:00 2015,TEST_Integration_Service
ALS_FACT,wf_s_m_ld_interchanges_detail_log,Scheduled,Mon Jan 04 16:30:00 2016,TEST_Integration_Service
ALS_PRD,wf_maint_service_fields,Scheduled,Thu Dec 31 07:10:00 2015,TEST_Integration_Service
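The counter logic that both Perl versions rely on can be tried on its own. The sketch below uses awk instead of Perl, purely for illustration, and feeds it ten placeholder values (the letters a to j) in place of the captured fields. Every fifth value ends the output line and all others are followed by a comma.

```shell
# Ten placeholder values stand in for the captured fields of two records.
out=$(printf '%s\n' a b c d e f g h i j |
  awk '{
    k++                    # count the values seen so far
    if (k % 5 == 0)
      print $0             # every fifth value closes the record
    else
      printf "%s,", $0     # other values are followed by a comma
  }')

printf '%s\n' "$out"
```

This relies on every record contributing exactly five matching lines. If a record were missing a field, the grouping would drift for all records after it.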

To add the Insert ..., you can simply pass the output through a simple sed:

$ perl -lne 'if(/^(Folder|Workflow|Workflow.*?status|Sched.*time|Integration Service):.*?\[([^][]+)/){++$k%5==0 ? print "$2"  : printf "%s,",$2}' file | 
    sed "s/^/Insert into <tablename> values('/; s/,/','/g; s/$/')/"
Insert into <tablename> values('ALS_DIM','wf_ld_als_dim','Scheduled','Wed Dec 30 19:00:00 2015','TEST_Integration_Service')
Insert into <tablename> values('ALS_FACT','wf_s_m_ld_interchanges_detail_log','Scheduled','Mon Jan 04 16:30:00 2016','TEST_Integration_Service')
Insert into <tablename> values('ALS_PRD','wf_maint_service_fields','Scheduled','Thu Dec 31 07:10:00 2015','TEST_Integration_Service')

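The sed runs three substitution operators:

  • s/^/Insert into <tablename> values('/: ^ is the start of the line, so s/^/foo/ simply inserts foo at the beginning of the line. Here, it inserts Insert into <tablename> values('.
  • s/,/','/g: replaces every comma (the g in s///g means all of them) with ','.
  • s/$/')/: $ is the end of the line, so this appends ') to the end.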
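To see what each of the three substitutions contributes, the sketch below applies them one at a time to a single line of the Perl output. The variable names line, step1, step2 and step3 are only illustrative.

```shell
# One comma-separated line as produced by the Perl command.
line='ALS_DIM,wf_ld_als_dim,Scheduled,Wed Dec 30 19:00:00 2015,TEST_Integration_Service'

# 1. Insert the statement prefix and the first opening quote at line start.
step1=$(printf '%s\n' "$line" | sed "s/^/Insert into <tablename> values('/")

# 2. Turn every comma into: closing quote, comma, opening quote.
step2=$(printf '%s\n' "$step1" | sed "s/,/','/g")

# 3. Append the last closing quote and the closing parenthesis.
step3=$(printf '%s\n' "$step2" | sed "s/$/')/")

printf '%s\n' "$step1" "$step2" "$step3"
```

The second step is only safe because none of the extracted values contains a comma. A value with an embedded comma would be split into two quoted fields.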
