Filter a CSV in PowerShell and add a new tag column

Answer 1

You can try the following.

Suppose you have a file named in.txt at path/to/in.txt with the following content:

"Doc_Title","Doc_Date","Doc_URL"
"/pub/howto.en.pdf","1980-01-01","An easy introduction"
"/pub/howto.de.pdf","1980-01-01","Eine einfache Einführung"
"/pub/howto.fr.pdf","1980-01-01","Une introduction simple"
"/lit/intro.en.pdf","1980-01-01","Literature review"
"/lit/intro.pdf","1980-01-01","Revue de littérature"
"/foo/intro.pdf","1980-01-01","Literatur-Review"

and you want to export the result to a .csv file at path/to/out.csv. You can use the following code:

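# Lookup tables: directory name -> document type, file suffix -> language
$types=@{'pub'='Publication';'lit'='Literature'}
$languages=@{'.en.pdf'='English';'.fr.pdf'='French';'.de.pdf'='German'}
$rows=import-csv "path/to/in.txt"
$table=foreach ($row in $rows) {
    $title=$row.Doc_Title
    $date=$row.Doc_Date
    $url=$row.Doc_URL
    # Three word characters between two slashes, e.g. 'pub' in '/pub/howto.en.pdf'
    $type=($title | Select-String -pattern "(?<=\/)([\w]{3})(?=\/)").matches.value
    $type=$types.$type
    # Two word characters between dots, followed by 'pdf', e.g. '.en.pdf'
    $language=($title | Select-String -pattern "(\.\w{2}\.pdf)").matches.value
    $language=$languages.$language
    [PSCustomObject]@{Doc_Title=$title;Doc_Date=$date;Doc_URL=$url;Doc_Type=$type;Doc_Lang=$language}
}
# -NoTypeInformation suppresses the #TYPE header line that Windows PowerShell 5.1 writes by default
$table | export-csv "path/to/out.csv" -NoTypeInformation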

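Please try my code and let me know whether it gives the result you want.

My code is simple, clear, and readable. I think it is best to work through it yourself so that you fully understand it; I don't like spoon-feeding.

Some notes:

1. I think a file's extension should match the format of its content. Although this is a plain text file, it is structured rather than arbitrary text, so I think a CSV file should have the .csv extension.

2. I think putting two values in one column is a mistake. CSV stands for comma-separated values: values are separated by commas, so it is better not to put commas inside values, and to keep the two values as separate columns rather than one.

3. The regular expressions I gave you work well with the examples you provided. The type regex accepts three word characters (letters, digits, and underscores) between slashes, and the language regex accepts two word characters between dots. Adjust the regexes if needed.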
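As a rough illustration of note 3, the sketch below runs the same two patterns against two of the sample titles. It uses the -match operator and the automatic $Matches variable instead of Select-String, which is a substitution made here for brevity, and the variable names are illustrative only.

```powershell
# Sample values taken from the input file above
$titles = '/pub/howto.en.pdf', '/lit/intro.pdf'

$results = foreach ($title in $titles) {
    # Reset on each pass: a failed -match leaves the previous $Matches in place
    $type = $null
    $lang = $null
    # Exactly three word characters enclosed by slashes
    if ($title -match '(?<=/)\w{3}(?=/)') { $type = $Matches[0] }
    # Exactly two word characters between dots, followed by 'pdf'
    if ($title -match '\.\w{2}\.pdf')     { $lang = $Matches[0] }
    [PSCustomObject]@{ Title = $title; Type = $type; Lang = $lang }
}
$results
```

The second title has no language code, so its Lang stays empty, which matches how the lookup in the main script ends up with no language for that row.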

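Minimum PowerShell version: unknown. I have only tested my code on PowerShell 7.1.1 and do not know whether it runs on earlier versions, but using the latest software is always a good idea.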

Update

If there are always exactly three characters between the slashes in Doc_Title, you can get those three letters this way:

$title.substring(1,3)

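This only works when there are exactly three letters between the slashes (and the first slash is at the start of the field).

You can use this to get a string like .en.pdf: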

$title.substring(($title.length-7),7)

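This only works if two conditions are met: such a string must be present at the end of $title, and there must be exactly two characters between the dots.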
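Because both Substring calls depend on those conditions, one option is to check the conditions first. The following is only a sketch: Get-TitlePart is a hypothetical helper name that does not appear in the original answer, and the -like test merely checks that the tail has the shape dot, two characters, dot, pdf.

```powershell
# Get-TitlePart is a hypothetical helper, introduced only for this illustration
function Get-TitlePart {
    param([string]$Title)
    $type = $null
    $suffix = $null
    # Substring(1, 3) is only meaningful for the shape '/xxx/...'
    if ($Title.Length -ge 5 -and $Title[0] -eq '/' -and $Title[4] -eq '/') {
        $type = $Title.Substring(1, 3)
    }
    # The last 7 characters must look like '.xx.pdf'
    if ($Title.Length -ge 7) {
        $tail = $Title.Substring($Title.Length - 7, 7)
        if ($tail -like '.??.pdf') { $suffix = $tail }
    }
    [PSCustomObject]@{ Type = $type; Suffix = $suffix }
}

$withLang    = Get-TitlePart '/pub/howto.en.pdf'
$withoutLang = Get-TitlePart '/lit/intro.pdf'
$withLang, $withoutLang
```

For '/lit/intro.pdf' the last seven characters are 'tro.pdf', so the unguarded Substring call would return that fragment rather than a language suffix; the guard leaves Suffix empty instead.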

I modified my code to produce the result you want:

$types=@{'pub'='Publication';'lit'='Literature'}
$languages=@{'.en.pdf'='English';'.fr.pdf'='French';'.de.pdf'='German'}
$rows=import-csv "path/to/in.csv"
foreach ($row in $rows) {
    $title=$row.Doc_Title
    $date=$row.Doc_Date
    $url=$row.Doc_URL
    $category=$types.$(($title | Select-String -pattern "(?<=\/)([\w]+)(?=\/)").matches.value)
    if ($title -match "(\.[\w]+\.pdf)"){$category=$category+","+$languages.$(($title | Select-String -pattern "(\.[\w]+\.pdf)").matches.value)}
    [PSCustomObject]@{Doc_Title=$title;Doc_Date=$date;Doc_URL=$url;Doc_Category=$category} | export-csv -path "path/to/out.csv" -NoTypeInformation -append
}

Sample output:

"Doc_Title","Doc_Date","Doc_URL","Doc_Category"
"/pub/howto.en.pdf","1980-01-01","An easy introduction","Publication,English"
"/pub/howto.de.pdf","1980-01-01","Eine einfache Einführung","Publication,German"
"/pub/howto.fr.pdf","1980-01-01","Une introduction simple","Publication,French"
"/lit/intro.en.pdf","1980-01-01","Literature review","Literature,English"
"/lit/intro.pdf","1980-01-01","Revue de littérature","Literature"
"/foo/intro.pdf","1980-01-01","Literatur-Review",

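The regular expressions I used match any number of word characters (a-zA-Z0-9_). They work well with the examples you gave, but will not work if your strings contain non-word characters; adjust them as needed.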
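One possible adjustment, sketched here under assumptions, is to match "anything except a slash" for the directory and "anything except a dot or slash" for the language code. The sample title and the looser patterns are illustrative and do not come from the original question.

```powershell
# Hypothetical title containing hyphens, which \w does not match
$title = '/my-docs/how-to.zh-CN.pdf'

# The word-character patterns find nothing in this title
$wordType = $title -match '(?<=/)\w+(?=/)'
$wordLang = $title -match '\.\w+\.pdf'

# Looser patterns: first path segment, and a dot-delimited code before a trailing '.pdf'
$type = if ($title -match '(?<=^/)[^/]+(?=/)') { $Matches[0] }
$lang = if ($title -match '\.[^./]+\.pdf$')    { $Matches[0] }

"word-character patterns matched: $wordType / $wordLang"
"looser patterns found: $type and $lang"
```

The lookup tables would of course need matching keys (for example 'my-docs' and '.zh-CN.pdf') for these values to translate into tags.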

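As for why I did not create a field containing a comma in the first place, see Wikipedia.

In short, the CSV file format is not fully standardized and there is no consensus on what a CSV file should look like. Different CSV implementations may or may not allow commas inside fields, so a comma in a field may or may not break the format.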
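Within PowerShell itself, a quick round trip suggests that a comma inside a quoted field survives. This is a minimal sketch that says nothing about how other CSV consumers behave, which is the concern raised above.

```powershell
$row = [PSCustomObject]@{
    Doc_Title    = '/pub/howto.en.pdf'
    Doc_Category = 'Publication,English'
}

# ConvertTo-Csv wraps each field in double quotes, so the embedded comma is not read as a delimiter
$csv = $row | ConvertTo-Csv -NoTypeInformation
$csv

# Parsing the text back yields two columns, with the comma still inside Doc_Category
$back = $csv | ConvertFrom-Csv
$back.Doc_Category
```

A tool that splits lines on every comma without honoring the quotes would see three columns here, so keeping the values in separate columns remains the safer layout.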

Answer 2

Just for fun, here is another strategy for parsing the URL to get the tags:

$Tags = @'
id,text
pub,Publication
lit,Literature
en,English
fr,French
de,German
'@ | ConvertFrom-Csv | ForEach { $hash = @{} } {
    $hash.Add( $_.ID, $_.Text )
} { $hash }

@'
Doc_URL,Doc_Date,Doc_Title
/pub/howto.en.pdf,1980-01-01,An easy introduction
/pub/howto.de.pdf,1980-01-01,Eine einfache Einführung
/pub/howto.fr.pdf,1980-01-01,Une introduction simple
/lit/intro.en.pdf,1980-01-01,Literature review
/lit/intro.pdf,1980-01-01,Revue de littérature
/foo/intro.pdf,1980-01-01,Literatur-Review
'@ | ConvertFrom-CSV | ForEach {
    $Doc_Tags = @( $Tags[$_.Doc_URL.Split('/')[1]] , $Tags[$_.Doc_URL.Split('.')[-2]] ) -ne $null -join ', '
    [PSCustomObject]@{
        'Doc_URL'   = $_.Doc_URL
        'Doc_Date'  = $_.Doc_Date
        'Doc_Title' = $_.Doc_Title
        'Doc_Tags'  = $Doc_Tags
    }
} | Export-Csv $env:Temp\out.csv -NoTypeInformation
Import-Csv $env:Temp\out.csv

The <*Here-String*> | ConvertFrom-Csv construct can be replaced with an Import-Csv <FileName> statement:

# Paths must be quoted; unquoted, PowerShell would try to run them as commands
$TagFIle = 'c:\Tag.txt'
$InFIle  = 'c:\In.txt'

$Tags = Import-Csv $TagFIle | ForEach { $hash = @{} } {
    $hash.Add( $_.ID, $_.Text )
} { $hash }

Import-Csv $InFIle | ForEach {
    $Doc_Tags = @( $Tags[$_.Doc_URL.Split('/')[1]] , $Tags[$_.Doc_URL.Split('.')[-2]] ) -ne $null -join ', '
    [PSCustomObject]@{
        'Doc_URL'   = $_.Doc_URL
        'Doc_Date'  = $_.Doc_Date
        'Doc_Title' = $_.Doc_Title
        'Doc_Tags'  = $Doc_Tags
    }
} | Export-Csv $env:Temp\out.csv -NoTypeInformation
Import-Csv $env:Temp\out.csv
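To make the $Doc_Tags line easier to follow, here is a small sketch of its pieces on three of the sample URLs. The shortened $Tags table and the $dir, $code, and $lines names are used only for this illustration.

```powershell
$Tags = @{ pub = 'Publication'; lit = 'Literature'; en = 'English' }

$lines = foreach ($url in '/pub/howto.en.pdf', '/lit/intro.pdf', '/foo/intro.pdf') {
    $dir  = $url.Split('/')[1]    # text between the first two slashes
    $code = $url.Split('.')[-2]   # second-to-last dot-separated piece
    # Unknown keys return $null; -ne $null drops them from the array before -join runs
    $joined = @($Tags[$dir], $Tags[$code]) -ne $null -join ', '
    '{0} -> dir={1} code={2} tags=[{3}]' -f $url, $dir, $code, $joined
}
$lines
```

When a URL has no language code, the second-to-last piece is the whole path before '.pdf' (for example '/lit/intro'), which is not a key in the table, so only the directory tag remains.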
