Cleaning up concatenated XML files

Answer 1

Since the data format is consistent, head and tail are your friends. This should work even for the last, shorter file.

tail -n +3 file | head -n -1 > trimmed_file

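tail -n +3 outputs everything from the third line to the end of the file, and head -n -1 outputs everything except the last line of the file. (The negative count for head is a GNU extension.)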

Once you have a set of trimmed files, cat them together with the appropriate header and footer sections for the whole file.

Update: to avoid creating a lot of extra files, just wrap it in a for loop:

for i in *
do
    # Quote "$i" so file names containing spaces survive.
    tail -n +3 "$i" | head -n -1 >> newfile
done

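Before running the for loop above, run one of the files through head to pull out the header lines (the first 2 lines, since tail -n +3 keeps everything from line 3 on) and use them as the header template. Then do something similar with tail to grab the last line of one of the files and append it to newfile. I imagine you will need to update the header and footer information.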
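The whole procedure can be sketched end to end as follows. This is only an illustration under assumptions: the chunk1.xml and chunk2.xml files, their contents, and the merged.out name are made up for the demo, and head -n -1 assumes GNU head.

```shell
#!/bin/sh
# Hypothetical demo: build two small chunk files, then merge them.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Made-up sample chunks: 2 header lines, a body, 1 footer line.
printf '%s\n' '<?xml version="1.0"?>' '<posts>' \
    '<post id="1"/>' '<post id="2"/>' '</posts>' > chunk1.xml
printf '%s\n' '<?xml version="1.0"?>' '<posts>' \
    '<post id="3"/>' '</posts>' > chunk2.xml

# The output name does not match chunk*.xml, so the loop never reads it.
out=$workdir/merged.out

head -n 2 chunk1.xml > "$out"                # header template
for i in chunk*.xml
do
    tail -n +3 "$i" | head -n -1 >> "$out"   # body only
done
tail -n 1 chunk1.xml >> "$out"               # footer line

cat "$out"
```

Writing the result to a name outside the glob keeps the loop from picking up its own output, which a bare * could do if newfile already existed.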

Answer 2

Got it working:

sed -i -e '3,${/^</d}' file

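In other words, between line 3 and the last line, delete every line that starts with <. Sorry, the indentation did not show up in the original post.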
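A small trial run shows what the command keeps. The sample data here is invented for illustration, with the entries indented and the repeated header and footer lines starting at column 0. The script is written with a ; before the closing } because some sed implementations require it, and it prints to stdout instead of using -i so the input is left alone.

```shell
#!/bin/sh
set -e
f=$(mktemp)

# Made-up concatenation of two chunks; only the entries are indented.
printf '%s\n' \
    '<?xml version="1.0"?>' \
    '<posts>' \
    '  <post id="1"/>' \
    '</posts>' \
    '<?xml version="1.0"?>' \
    '<posts>' \
    '  <post id="2"/>' \
    '</posts>' > "$f"

# From line 3 through the last line, drop lines that begin with "<".
cleaned=$(sed -e '3,${/^</d;}' "$f")
printf '%s\n' "$cleaned"
```

Because the range includes the last line, the final closing tag is deleted as well in this sample, so it would have to be appended again afterwards.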

Answer 3

This looks a bit clumsy. Why not process the data as it comes in?

bookmarks_count=$chunk_size
total_bookmarks_count=0
{
  # Keep fetching as long as the previous chunk was a full one.
  while [ "$bookmarks_count" -eq "$chunk_size" ]; do
    chunk=$(wget … -O - "$EXPORT_URL?start=$total_bookmarks_count")
    bookmarks_count=$(printf %s "$chunk" | grep -c "$bookmark_prefix")
    total_bookmarks_count=$((total_bookmarks_count + bookmarks_count))
    printf %s "$chunk" |
    sed -e 's#><#>\n<#g' -e "$EXPORT_COMPATIBILITY" -e "$EXPORT_COMPATIBILITY"
  done
  # No backslash here: inside single quotes it would be printed literally.
  echo '</posts>'
} >"$EXPORT_PATH"

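You can even avoid storing each chunk in memory, though it's a bit trickier. Here's a method that only works in ksh and zsh; in other shells, the right-hand side of a pipeline runs in a subshell, so the value of this_chunk_size, which is assigned in the last stage of the pipeline, would not be updated.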

{
  total_bookmarks_count=0
  while
      wget … -O - "$EXPORT_URL?start=$total_bookmarks_count" |
      sed -e … |
      tee /dev/fd/3 |
      this_chunk_size=$(grep -c "$bookmark_prefix")
      [[ $this_chunk_size = $chunk_size ]]
  do
    ((total_bookmarks_count += chunk_size))
  done
  echo '</posts>' >&3
} 3>"$EXPORT_PATH"

Here's a way to make this approach work in other shells, where the only information you can get out of a pipeline is its return status.

: >"$EXPORT_PATH"
total_bookmarks_count=0
while
    wget … -O - "$EXPORT_URL?start=$total_bookmarks_count" |
    sed -e … |
    tee -a "$EXPORT_PATH" |
    [ "$(grep -c "$bookmark_prefix")" = "$chunk_size" ]
do
  total_bookmarks_count=$((total_bookmarks_count + chunk_size))
done
echo '</posts>' >> "$EXPORT_PATH"
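The exit-status trick in the loop condition can be tried in isolation. In this rough sketch, the is_full_chunk helper, the sample lines, and the values of chunk_size and bookmark_prefix are all made up for illustration; the test itself is the same one used as the last stage of the pipeline above.

```shell
#!/bin/sh
# Hypothetical values, chosen only for this demo.
chunk_size=2
bookmark_prefix='<post '

# Succeeds only when stdin holds exactly $chunk_size matching lines.
# Nothing but the exit status has to leave the pipeline.
is_full_chunk() {
    [ "$(grep -c "$bookmark_prefix")" = "$chunk_size" ]
}

if printf '%s\n' '<post a/>' '<post b/>' | is_full_chunk
then first=full
else first=short
fi

if printf '%s\n' '<post c/>' | is_full_chunk
then second=full
else second=short
fi

echo "first=$first second=$second"
```

Since the variables are assigned in the branches of the if rather than inside the pipeline, they are set in the main shell regardless of which shell runs the script.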
