Evaluating a large file with ^H and ^M characters

I have a log file that contains many ^H and ^M characters, because the process that generated it updates a text-based progress bar.

When the file is viewed with cat, the output is interpreted by the terminal and appears human-readable and concise. Below is an example of that output.

Epoch 11/120
4355/4355 [==============================] - ETA: 0s - loss: 0.0096   
Epoch 00011: val_loss did not improve from 0.00992
4355/4355 [==============================] - 1220s 280ms/step - loss: 0.0096 - val_loss: 0.0100

However, the file itself is large compared with the text that cat actually prints above (roughly 900 lines, 70 MB).

Here is a snippet of the raw text contained in the log file.

1/Unknown - 0s 81us/step - loss: 0.5337^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M  2/Unknown - 1s 438ms/step - loss: 0.5299^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^
H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M      3/Unknown - 1s 386ms/step - loss: 0.5286^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M      4/Unknown - 1s 357ms/step - loss: 0.5289^H^H^H^H^H^H^H^H^H^
H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M      5/Unknown - 2s 339ms/step - loss: 0.5277^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M      6/Unknown - 2s 327ms/
step - loss: 0.5258^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M      7/Unknown - 2s 318ms/step - loss: 0.5250^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^
H^H^H^H^H^M      8/Unknown - 2s 312ms/step - loss: 0.5260^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M      9/Unknown - 3s 307ms/step - loss: 0.5265^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^
H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M     10/Unknown - 3s 303ms/step - loss: 0.5257^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^M     11/Unknown - 3s 299ms/step - loss: 0.5258^H^H^H^
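
For context, the terminal interprets ^H as a backspace and ^M as a carriage return, so every intermediate state of the progress bar is still stored in the file even though cat renders it cleanly. A rough way to see the difference (the sample line below is made up; cat -v just makes the control characters visible):

$ printf 'loss: 0.11\rloss: 0.22\n'
loss: 0.22
$ printf 'loss: 0.11\rloss: 0.22\n' | cat -v
loss: 0.11^Mloss: 0.22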

Essentially, I want to produce a file that looks just like what cat displays.

Here are some things I have tried, with little success:

  • tr -d '\b\r' < logfile > new_file removes all the control characters, but consequently leaves behind all the unwanted text they were supposed to erase (see the toy example after this list).
  • cat logfile > new_file really just copies the file verbatim, without interpreting the special characters.
  • cat logfile | col -b > new_file comes very close, but does something strange to one of the repeated lines:
4355/4355 [==============================] - ETA: 0ss--loss::0.0096557
Epoch 00011: val_loss did not improve from 0.00992
4355/4355 [==============================] - 1220s 280ms/step - loss: 0.0096 - val_loss: 0.0100
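
To make the first point (tr) concrete, here is a toy example (the line is made up, not taken from the real log): deleting the \b and \r bytes leaves every overwritten intermediate state in place:

$ printf 'loss: 0.11\b\b\b\b0.22\r  done\n' | tr -d '\b\r'
loss: 0.110.22  done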

Any help would be greatly appreciated.

Thanks

Answer 1

Posting this as an answer for the sake of clarity.

As pointed out in a comment, the command awk -F '\r' '{print $NF}' file works as expected in this case, discarding everything up to and including the last carriage return on each line. It is not entirely robust, though, as another commenter noted.
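
As a toy illustration (the line below is made up, not from the real log), the one-liner keeps only the last \r-separated field of each line, so any ^H characters that occur after the last carriage return pass through untouched, which is where the lack of robustness shows up:

$ printf 'step 1\rstep 2\rstep 3\b\b4\n' | awk -F '\r' '{print $NF}' | cat -v
step 3^H^H4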

I have written a more robust solution in C++ below.

#include <fstream>
#include <string>
#include <iostream>

using namespace std;

// Rebuild a line the way a terminal would render it:
// a backspace removes the previously kept character,
// a carriage return restarts the line from scratch.
string filter_string(const string &line, char bspace, char creturn){

    string new_str;

    for(string::size_type i = 0; i < line.size(); ++i) {
        if (line[i] == bspace){
            // Step back one character if the string is not empty
            if (!new_str.empty()){
                new_str.pop_back();
            }
        } else if (line[i] == creturn){
            // Reset on carriage return
            new_str.clear();
        } else {
            new_str += line[i];
        }
    }

    return new_str;
}

int main(int argc, char* argv[]){
    const char backspace = '\x08';
    const char creturn = '\r';

    if (argc != 2){
        cerr << "USAGE: " << argv[0] << " [src]" << endl;
        return 1;
    }

    // Filter each line of the file and write the result to stdout
    string line;
    ifstream infile(argv[1]);
    while (getline(infile, line)){
        cout << filter_string(line, backspace, creturn) << endl;
    }

    return 0;
}

This iterates over every character of each line: when a ^H is encountered the string is shortened by one character (unless it is already empty), and when a ^M carriage return is encountered the string is reset. The output is written to stdout, from where it can be redirected to a file.
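
Compiling and running it might look like this (the file names are just examples):

$ g++ -O2 -o filter_log filter_log.cpp
$ ./filter_log logfile > new_file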

Answer 2

sed 's/.*\x0d//' logfile

seems to do what you are asking for.
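
As a quick sanity check on a made-up line (this assumes GNU sed, where \x0d denotes a carriage return): the greedy .* removes everything up to and including the last ^M on each line:

$ printf '  1/Unknown - loss: 0.5337\r  2/Unknown - loss: 0.5299\n' | sed 's/.*\x0d//'
  2/Unknown - loss: 0.5299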

Note that col -b fails because it ignores blanks (a space does not overwrite the character already in that column):

$ echo $'--------\r1st try\r2nd   \r3rd\n' | col -b
3rd-try-
