Fastest way to download data from a GCS bucket

Answer 1

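Well, I'm posting this as an answer, but new answers are always welcome.

gsutil pattern matching can be very slow when it has to search through thousands or millions of files. It is better to list the files first and then download them using their absolute paths.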

vikrant_singh_rana@cloudshell:~/download$ cat download_gcs_file.sh
#!/bin/bash

# Delete the listing file if it already exists in the current working directory
file="ls_output.csv"

if [ -f "$file" ] ; then
    rm "$file"
fi
# List the files matching the search pattern and write their URLs to ls_output.csv;
# the awk filter keeps only objects whose size (column 1) is at most 100 bytes
gsutil ls -l "gs://test-bucket-data-prod-ingest/cm_data/AN/AM/*/a01_*_20210128*.csv.bz2" | awk '!hdr{ print "filename"; hdr=1; }; $1 <= 100{ print $3; }' >ls_output.csv

input_file_path='/home/vikrant_singh_rana/download/ls_output.csv'

# Read each file name from the input file and download it from GCS to the local output directory
count=0

{
    read
    while IFS=, read -r inputfilename
    do

        echo "input filename is: $inputfilename"

        # Download only when the line is non-empty and is not the header
        if [ -n "$inputfilename" ] && [ "$inputfilename" != "filename" ]
        then
            echo "downloading file: $inputfilename"
            gsutil -m cp -R "$inputfilename" /home/vikrant_singh_rana/download/output/
        else
            echo "skipping empty or header line"
        fi

        count=$((count + 1))
        echo "count is: $count"
    done
} < "$input_file_path"

# Decompress the downloaded .bz2 files to CSV
bzip2 -d /home/vikrant_singh_rana/download/output/*

This is the input file:

vikrant_singh_rana@cloudshell:~/download$ cat ls_output.csv
filename
gs://test-bucket-data-prod-ingest/cm_data/AN/AM/172.24.105.197-CORE-2/a01_1h_255_XYZ_202101282300_0009.csv.bz2
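
The list-then-download idea above can also be sketched without a per-file loop. gsutil's `cp -I` option reads the URLs to copy from stdin, so a single `gsutil -m cp` call can parallelise across all listed objects. The function names `extract_urls` and `download_listed` are hypothetical helpers introduced here for illustration. The bucket pattern and output directory in the example call are copied from the script above. The size filter from the original awk command is left out of this sketch.

```shell
#!/bin/bash
# Sketch only: helper names are illustrative, not part of gsutil.

# Keep only the object URL (third column) from `gsutil ls -l` output.
# Lines without a gs:// URL in column 3, such as the trailing
# "TOTAL:" summary line, are dropped.
extract_urls() {
    awk '$3 ~ /^gs:\/\// { print $3 }'
}

# List once, then hand every URL to a single parallel copy.
# `cp -I` makes gsutil read the URLs to copy from stdin.
download_listed() {
    local pattern="$1" dest="$2"
    mkdir -p "$dest"
    gsutil ls -l "$pattern" | extract_urls | gsutil -m cp -I "$dest"
}

# Example call (commented out so that sourcing this file has no side effects):
# download_listed "gs://test-bucket-data-prod-ingest/cm_data/AN/AM/*/a01_*_20210128*.csv.bz2" \
#     /home/vikrant_singh_rana/download/output/
```

Because the listing is piped straight into one copy command, the intermediate ls_output.csv file and the header check are not needed. Object names that contain spaces would not survive the awk column split, so this assumes names without whitespace.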
