Sed script crashes when processing large files

Answer 1

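General approach

You can split each file into a header file and another file with the data lines.
Then you can easily edit the header on its own with your current sed command.
Finally, you can concatenate the edited header and the file with the data lines.

Lightweight tools to manage huge files

You can use head and tail to create a header file and a data file.
You can use cat to concatenate the modified header file and the data file.
See also: Efficient way to print lines from a massive file using awk, sed, or something else?
Another method is to use split.

Test

I tested with your header and a file with 1080000000 numbered lines (size 19 GiB), 1080000007 lines in total, and it worked. The output file (with 1080000004 lines) was written in 5 minutes on my old hp xw8400 workstation (including the time to type the command that starts the shellscript).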
```
$ ls -lh --time-style=full-iso huge*
-rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:50:45.278328120 +0100 huge.in
-rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:55:46.808798456 +0100 huge.out
```
The big write operations were between the system partition on an SSD and a data partition on an HDD.
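One way to check such a test is to compare the data lines of the input and output files, skipping the headers. The following is only a sketch under stated assumptions: the header lengths (7 lines in, 4 lines out) are derived from the line counts quoted above, and the small files built with `printf` and `seq` are hypothetical stand-ins for `huge.in` and `huge.out`.

```shell
# Hypothetical stand-ins for huge.in / huge.out: 7 and 4 header lines,
# followed by the same numbered data lines.
workdir="$(mktemp -d)"
{ printf 'h%d\n' 1 2 3 4 5 6 7; seq 1 1000; } > "$workdir/huge.in"
{ printf 'H%d\n' 1 2 3 4; seq 1 1000; } > "$workdir/huge.out"

in_header=7    # assumption: header lines in the input (1080000007 - 1080000000)
out_header=4   # assumption: header lines in the output (1080000004 - 1080000000)

# Checksum only the data lines of each file; the files are streamed,
# so nothing large is held in memory.
sum_in="$(tail -n +"$((in_header + 1))" "$workdir/huge.in" | cksum)"
sum_out="$(tail -n +"$((out_header + 1))" "$workdir/huge.out" | cksum)"

if [ "$sum_in" = "$sum_out" ]
then
  verdict="data lines identical"
else
  verdict="data lines differ"
fi
echo "$verdict"
```

On the real files this reads both of them once in full, so expect it to take roughly as long as the conversion itself.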

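Shellscript

You need enough free space in the file system where /tmp is located for the huge temporary 'data' file, more than 9 GB according to your original question.

```
$ LANG=C df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       106G   32G   69G  32% /
```

This may look like an awkward way to do it, but it works for huge files without crashing the tools. Maybe you must store the temporary 'data' file somewhere else, for example on an external drive (but it will probably be slower).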

```
#!/bin/bash

# $1 : FCIDUMP file to convert from "new format" to "old format"
# $2 : output file

if [ $# -ne 2 ]
then
  echo "Usage:   $0 fcidumpfile oldstylefile" 1>&2
  echo "Example: $0 file.in file.out" 1>&2
  exit 1
fi

if [ "$1" == "$2" ]
then
  echo "The names of the input file and output file must differ" 1>&2
  exit 2
fi

# line number of the first line that contains the end marker of the header
endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
if [ "$endheader" == "" ]
then
  echo "Bad input file: the end marker of the header was not found" 1>&2
  exit 3
fi
#echo "endheader=$endheader"

< "$1" head -n "$endheader" > /tmp/header
#cat /tmp/header

if grep -E -q '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' /tmp/header
then
  echo "The provided file is already in old FCIDUMP format." 1>&2
  exit 4
fi

# run sed in place on /tmp/header (GNU sed): read the whole header into
# the pattern space, join the lines, put ORBSYM on a new line and
# replace &END with 'ISYM=1,' followed by a line with '/'
sed -i '
{
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' /tmp/header

if [ $? -ne 0 ]
then
  echo "Failed to convert the header format in /tmp/header" 1>&2
  exit 5
fi

# everything after the header goes to the temporary 'data' file
< "$1" tail -n +"$((endheader + 1))" > /tmp/tailer

if [ $? -ne 0 ]
then
  echo "Failed to create the 'data' file /tmp/tailer" 1>&2
  exit 6
fi

#echo "---"
#cat /tmp/tailer
#echo "---"

cat /tmp/header /tmp/tailer > "$2"

exit 0
```
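If free space in /tmp is tight, the temporary 'data' file can probably be avoided by writing the edited header and the remaining lines straight into the output file. This is only a sketch of that variation, not part of the tested script: the sample file, the names `in` and `out`, and the one-line `sed` edit are hypothetical placeholders (the real header conversion would take the place of that `sed` call, without `-i`).

```shell
# Hypothetical tiny input: a header ending in '&END', then two data lines.
workdir="$(mktemp -d)"
printf '%s\n' '&FCI' 'NORB=4,' '&END' '1 1 1 1' '2 2 2 2' > "$workdir/sample.in"
in="$workdir/sample.in"
out="$workdir/sample.out"

# Line number of the first line that contains the end marker of the header.
n="$(grep -m 1 -n '&END' "$in" | cut -d: -f1)"

{
  # Edited header: only the first n lines pass through sed.
  # Placeholder edit for illustration: turn the '&END' line into '/'.
  head -n "$n" "$in" | sed 's|^&END$|/|'
  # Data lines: streamed from line n+1 onward, no temporary copy.
  tail -n +"$((n + 1))" "$in"
} > "$out"

cat "$out"
```

The trade-off is that the input is opened twice, but `head` stops after the header, so only `tail` reads the whole file.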

Answer 2

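sed may not be the best tool for this; look into perl. However, you can restate the problem as:

Extract the old header from the giant data file and save it as its own file.
Adjust the extracted old header so that it becomes the new header.
Replace the old header in the giant data file with the new header.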

```
endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
# extract the old header from the input file
head -n "$endheader" "$1" >/tmp/header
trap "/bin/rm -f /tmp/header" EXIT
# do the sed stuff to /tmp/header, I assume it does what you want
sed -i '
{
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' /tmp/header

# Then combine the new header with the rest of the giant data file,
# using `ed` (see `man ed;info Ed`) and here-document:
# delete the old header, read the new one in at the top, write and quit
ed "$1" <<EndOfEd
1,${endheader}d
0r /tmp/header
w
q
EndOfEd
```
