记忆/缓存命令行输出

记忆/缓存命令行输出

我有一个 bash 脚本,用于按顺序运行一些 python 和 C++ 程序。

每个程序都接受我在 bash 脚本中定义的一些输入参数。作为一个例子,我像这样运行程序:

echo $param1 $param2 $param3 | python foo.py

python 程序输出一些值,我们将其用作后续程序的输入。正如我上面所说,如果我将值存储在某个文件中并从那里读取它们,我不需要运行 python 程序。

那么我的问题是。有没有一些通用工具可以实现此功能?也就是说,是否有一些名为“bar”的程序,我可以像这样运行

bar $param1 $param2 $param3 "python foo.py"

它将检查缓存文件是否存在,如果存在,它将检查程序是否已使用给定的参数运行,如果是,它将输出缓存的输出值,而不是再次运行程序。

编辑:当然也可以提供日志文件的名称作为输入。

答案1

我刚刚被书呆子狙击到为此编写了一个相当完整的脚本;最新版本位于https://gist.github.com/akorn/51ee2fe7d36fa139723c851d87e56096

相对于 sivann 的 shell 实现的优点:

  • 计算缓存键时也会考虑环境变量;
  • 使用锁定而不是依赖随机等待完全避免竞争条件
  • 由于分叉较少,在紧密循环中调用时性能更好
  • 还缓存 stderr
  • 完全透明:不打印任何内容;不阻止同一命令的并行执行;如果缓存有问题,则仅运行未缓存的命令
  • 可通过环境变量和命令行开关进行配置
  • 可以修剪其缓存(删除所有过时的条目)

缺点:用zsh编写,而不是bash。

#!/bin/zsh
#
# Purpose: run speficied command with specified arguments and cache result. If cache is fresh enough, don't run command again but return cached output.
# Also cache exit status and stderr.
# Copyright (c) 2019-2020 András Korn; License: GPLv3

# Use silly long variable names to avoid clashing with whatever the invoked program might use
RUNCACHED_MAX_AGE=${RUNCACHED_MAX_AGE:-300}
RUNCACHED_IGNORE_ENV=${RUNCACHED_IGNORE_ENV:-0}
RUNCACHED_IGNORE_PWD=${RUNCACHED_IGNORE_PWD:-0}
[[ -n "$HOME" ]] && RUNCACHED_CACHE_DIR=${RUNCACHED_CACHE_DIR:-$HOME/.runcached}
RUNCACHED_CACHE_DIR=${RUNCACHED_CACHE_DIR:-/var/cache/runcached}

function usage() {
    echo "Usage: runcached [--ttl <max cache age>] [--cache-dir <cache directory>]"
    echo "       [--ignore-env] [--ignore-pwd] [--help] [--prune-cache]"
    echo "       [--] command [arg1 [arg2 ...]]"
    echo
    echo "Run 'command' with the specified args and cache stdout, stderr and exit"
    echo "status. If you run the same command again and the cache is fresh, cached"
    echo "data is returned and the command is not actually run."
    echo
    echo "Normally, all exported environment variables as well as the current working"
    echo "directory are included in the cache key. The --ignore options disable this."
    echo "The OLDPWD variable is always ignored."
    echo
    echo "--prune-cache deletes all cache entries older than the maximum age. There is"
    echo "no other mechanism to prevent the cache growing without bounds."
    echo
    echo "The default cache directory is ${RUNCACHED_CACHE_DIR}."
    echo "Maximum cache age defaults to ${RUNCACHED_MAX_AGE}."
    echo
    echo "CAVEATS:"
    echo
    echo "Side effects of 'command' are obviously not cached."
    echo
    echo "There is no cache invalidation logic except cache age (specified in seconds)."
    echo
    echo "If the cache can't be created, the command is run uncached."
    echo
    echo "This script is always silent; any output comes from the invoked command. You"
    echo "may thus not notice errors creating the cache and such."
    echo
    echo "stdout and stderr streams are saved separately. When both are written to a"
    echo "terminal from cache, they will almost certainly be interleaved differently"
    echo "than originally. Ordering of messages within the two streams is preserved."
    exit 0
}

while [[ -n "$1" ]]; do
    case "$1" in
        --ttl)      RUNCACHED_MAX_AGE="$2"; shift 2;;
        --cache-dir)    RUNCACHED_CACHE_DIR="$2"; shift 2;;
        --ignore-env)   RUNCACHED_IGNORE_ENV=1; shift;;
        --ignore-pwd)   RUNCACHED_IGNORE_PWD=1; shift;;
        --prune-cache)  RUNCACHED_PRUNE=1; shift;;
        --help)     usage;;
        --)     shift; break;;
        *)      break;;
    esac
done

zmodload zsh/datetime
zmodload zsh/stat
zmodload zsh/system
zmodload zsh/files

# the built-in mv doesn't fall back to copy if renaming fails due to EXDEV;
# since the cache directory is likely on a different fs than the tmp
# directory, this is an important limitation, so we use /bin/mv instead
disable mv  

mkdir -p "$RUNCACHED_CACHE_DIR" >/dev/null 2>/dev/null

((RUNCACHED_PRUNE)) && find "$RUNCACHED_CACHE_DIR/." -maxdepth 1 -type f \! -newermt @$[EPOCHSECONDS-RUNCACHED_MAX_AGE] -delete 2>/dev/null

[[ -n "$@" ]] || exit 0 # if no command specified, exit silently

(
    # Almost(?) nothing uses OLDPWD, but taking it into account potentially reduces cache efficency.
    # Thus, we ignore it for the purpose of coming up with a cache key.
    unset OLDPWD
    ((RUNCACHED_IGNORE_PWD)) && unset PWD
    ((RUNCACHED_IGNORE_ENV)) || env
    echo -E "$@"
) | md5sum | read RUNCACHED_CACHE_KEY RUNCACHED__crap__
# make the cache dir hashed unless a cache file already exists (created by a previous version that didn't use hashed dirs)
if ! [[ -f $RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.exitstatus ]]; then
    RUNCACHED_CACHE_KEY=$RUNCACHED_CACHE_KEY[1,2]/$RUNCACHED_CACHE_KEY[3,4]/$RUNCACHED_CACHE_KEY[5,$]
    mkdir -p "$RUNCACHED_CACHE_DIR/${RUNCACHED_CACHE_KEY:h}" >/dev/null 2>/dev/null
fi

# If we can't obtain a lock, we want to run uncached; otherwise
# 'runcached' wouldn't be transparent because it would prevent
# parallel execution of several instances of the same command.
# Locking is necessary to avoid races between the mv(1) command
# below replacing stderr with a newer version and another instance
# of runcached using a newer stdout with the older stderr.
: >>$RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.lock 2>/dev/null
if zsystem flock -t 0 $RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.lock 2>/dev/null; then
    if [[ -f $RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.stdout ]]; then
        if [[ $[EPOCHSECONDS-$(zstat +mtime $RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.stdout)] -le $RUNCACHED_MAX_AGE ]]; then
            cat $RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.stdout &
            cat $RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.stderr >&2 &
            wait
            exit $(<$RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.exitstatus)
        else
            rm -f $RUNCACHED_CACHE_DIR/$RUNCACHED_CACHE_KEY.{stdout,stderr,exitstatus} 2>/dev/null
        fi
    fi

    # only reached if cache didn't exist or was too old
    if [[ -d $RUNCACHED_CACHE_DIR/. ]]; then
        RUNCACHED_tempdir=$(mktemp -d 2>/dev/null)
        if [[ -d $RUNCACHED_tempdir/. ]]; then
            $@ >&1 >$RUNCACHED_tempdir/${RUNCACHED_CACHE_KEY:t}.stdout 2>&2 2>$RUNCACHED_tempdir/${RUNCACHED_CACHE_KEY:t}.stderr
            RUNCACHED_ret=$?
            echo $RUNCACHED_ret >$RUNCACHED_tempdir/${RUNCACHED_CACHE_KEY:t}.exitstatus 2>/dev/null
            mv $RUNCACHED_tempdir/${RUNCACHED_CACHE_KEY:t}.{stdout,stderr,exitstatus} $RUNCACHED_CACHE_DIR/${RUNCACHED_CACHE_KEY:h} 2>/dev/null
            rmdir $RUNCACHED_tempdir 2>/dev/null
            exit $RUNCACHED_ret
        fi
    fi
fi

# only reached if cache not created successfully or lock couldn't be obtained
exec $@

答案2

这里有一个实现:https://bitbucket.org/sivann/runcached/src缓存可执行路径、输出、退出代码、记住参数。可配置的到期时间。用 bash、C、python 实现,选择适合你的。 “bash”版本有些限制。

答案3

对于我创建的 Bash 用户bash缓存,它提供了与 András Korn 的 zsh 方法类似的功能集。它支持环境变量、避免竞争、缓存 stderr 和退出代码、使陈旧数据无效、支持异步预热和互斥锁等等。

例如,我认为您可以像这样实现您所描述的内容:

foo() {
  python foo.py <<<"$*" # this writes all arguments to the python stdin
} && bc::cache foo

然后您可以将其调用为:

foo param1 param2 param3

并将foo缓存结果并在后续调用中重用它们。

看我的先前的回答更多细节。

答案4

基于现有包的简单 python 解决方案(joblib.内存用于记忆):

cmdcache.py:

import sys, os
import joblib
import subprocess

mem = joblib.Memory('.', verbose=False)
@mem.cache
def run_cmd(args, env):
    process = subprocess.run(args, capture_output=True, text=True)
    return process.stdout, process.stderr, process.returncode

stdout, stderr, returncode = run_cmd(sys.argv[1:], dict(os.environ))
sys.stdout.write(stdout)
sys.stdout.write(stderr)
sys.exit(returncode)

使用示例:

python cmdcache.py ls

相关内容