Writing a csv from a CF to a bucket: 'with open(filepath, "w") as MY_CSV:' leads to "FileNotFoundError: [Errno 2] No such file or directory:"

Answer 1

The answer is surprising: if you want to write to a file using open(), you have to import and use the gcsfs module.
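A likely reason for the error, offered as an assumption rather than something confirmed in the question: the built-in open() only knows the local filesystem, so a bucket path is treated as a relative local path whose parent directory does not exist. The sketch below reproduces that on a Linux runtime; the gs:// path and the bucket name are hypothetical.

```python
import errno


def errno_from_builtin_open(path):
    """Try to write with the built-in open() and return the errno, or None on success."""
    try:
        with open(path, "w") as handle:
            handle.write("a|b\n")
    except FileNotFoundError as exc:
        # The parent "directory" (here "gs:/hypothetical-bucket/out") does not
        # exist locally, so open() fails before anything is written.
        return exc.errno
    return None


# Hypothetical bucket path, for illustration only.
code = errno_from_builtin_open("gs://hypothetical-bucket/out/data.csv")
print(code == errno.ENOENT)
```

Nothing here talks to Cloud Storage, which is the point: the plain open() never leaves the local disk.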

If you use pd.to_csv(), the line import gcsfs is not needed, but gcsfs still has to be listed in requirements.txt for pd.to_csv() to work, so pandas' to_csv() seems to use it automatically.
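As a rough sketch of that pandas route, assuming the target is addressed with a gs:// URL: the helper names, the bucket name and the "|" separator default are illustrative choices of mine, and df stands for any pandas DataFrame.

```python
def gcs_url(bucket: str, blob: str) -> str:
    """Build a gs:// URL from a bucket name and an object path."""
    return f"gs://{bucket.strip('/')}/{blob.lstrip('/')}"


def dataframe_to_bucket(df, bucket: str, blob: str, sep: str = "|") -> str:
    """Write a DataFrame straight to the bucket and return the URL used."""
    url = gcs_url(bucket, blob)
    # No `import gcsfs` in this file: pandas resolves the gs:// URL itself,
    # which only works if gcsfs is installed (listed in requirements.txt).
    df.to_csv(url, sep=sep, index=False)
    return url
```

In the Cloud Function this would be called as something like dataframe_to_bucket(df, "my-bucket", "exports/data.csv"), with gcsfs and pandas both present in requirements.txt.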

The pd.to_csv() surprise aside, here is the code that answers the question (tested):

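import csv
from datetime import datetime

import gcsfs

# MY_PROJECT, BUCKET_NAME, FIELDNAMES and query_execute_batch() are assumed
# to be defined elsewhere in the Cloud Function.


def write_to_csv_file(connection, filepath):
    """Write the QUERY result in a loop over batches into a csv.
    This is done in batches since the query from the database is huge.
    :param connection: mysqldb connection to DB
    :param filepath: path to csv file to write data
    :return: metadata on rows and time
    """
    countrows = 0
    print("Right before opening the file ...")
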
    # A gcsfs object is needed to open a file.
    # https://stackoverflow.com/questions/52805016/how-to-open-a-file-from-google-cloud-storage-into-a-cloud-function
    # https://gcsfs.readthedocs.io/en/latest/index.html#examples
    # Side-note (Exception):
    # pd.to_csv() needs neither the gcsfs object, nor its import.
    # It is not used here, but it has been tested with examples.
    fs = gcsfs.GCSFileSystem(project=MY_PROJECT)
    fs.ls(BUCKET_NAME)


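    # wb needed, else "builtins.TypeError: must be str, not bytes"
    # https://stackoverflow.com/questions/5512811/builtins-typeerror-must-be-str-not-bytes
    # Caveat: csv writers emit str, so if 'wb' raises a str/bytes TypeError
    # in your runtime, text mode ('w') is the usual alternative.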
    with fs.open(filepath, 'wb') as outcsv:
        print("Right after opening the file ...")

        writer = csv.DictWriter(
            outcsv,
            fieldnames=FIELDNAMES,
            extrasaction="ignore",
            delimiter="|",
            lineterminator="\n",
        )
        # write header according to fieldnames
        print("before writer.writeheader()")
        writer.writeheader()
        print("after writer.writeheader()")

        for batch in query_execute_batch(connection):
            writer.writerows(batch)
            countrows += len(batch)
        datetime_now_save = datetime.now()
    return countrows, datetime_now_save
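If the binary-mode handle gives a str/bytes TypeError in your runtime, a text-mode variant may be worth trying. This is only a sketch under assumptions: write_batches is a name I made up, fs is expected to be a gcsfs.GCSFileSystem (or any object with a compatible open()), and batches stands in for whatever query_execute_batch(connection) yields.

```python
import csv


def write_batches(fs, filepath, fieldnames, batches):
    """Write batches of row dicts to filepath via fs and return the row count."""
    countrows = 0
    # Text mode: csv writers produce str, and newline="" leaves line endings
    # to the writer (lineterminator below).
    with fs.open(filepath, "w", newline="") as outcsv:
        writer = csv.DictWriter(
            outcsv,
            fieldnames=fieldnames,
            extrasaction="ignore",  # silently drop keys not in fieldnames
            delimiter="|",
            lineterminator="\n",
        )
        writer.writeheader()
        for batch in batches:
            writer.writerows(batch)
            countrows += len(batch)
    return countrows
```

Passing fs in as an argument, instead of creating it inside the function, also makes it possible to try the function against a stand-in filesystem without touching a real bucket.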

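Side note

Do not use the csv writer like this.

It takes too long. With pd.to_csv() and a chunksize parameter of 5000, the CF needs only 62 seconds to load the 700k rows and store them as a csv in the bucket, whereas the CF with the batch writer takes more than 9 minutes, which is over the timeout limit. I therefore had to switch to pd.to_csv() and convert my data into a dataframe.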
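Roughly, the switch could look like the sketch below. It is not my exact code: the function names are made up for illustration, batches stands in for the output of query_execute_batch(connection), url is assumed to be a gs:// path, and all rows are held in memory at once.

```python
from itertools import chain


def flatten_batches(batches):
    """Merge an iterable of row batches (lists of dicts) into one list of rows."""
    return list(chain.from_iterable(batches))


def batches_to_bucket(batches, url, fieldnames, chunksize=5000):
    """Convert the batches to a DataFrame, write it as csv, return the row count."""
    # Imported lazily so flatten_batches() stays usable without pandas.
    import pandas as pd

    # columns=fieldnames keeps only the wanted columns, in that order.
    frame = pd.DataFrame(flatten_batches(batches), columns=fieldnames)
    # chunksize is the number of rows pandas writes at a time.
    frame.to_csv(url, sep="|", index=False, chunksize=chunksize)
    return len(frame)
```

As with the earlier pd.to_csv() remark, gcsfs has to be in requirements.txt for the gs:// URL to resolve.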
