Using jq to split a very large JSON file based on a specific key

Suppose I have a very large JSON file containing an array with millions of entries that have these keys (among many others):

        "name": "assets/fUCcxWczWT0",
        "displayName": "The House",
        "authorName": "John Smith",
        "resources": {"house" : "address","car":"bla"},

How can I use jq to slice off (delete) everything before a specific name key? For example, everything before the element uniquely identified by "name": "assets/fUCcxWczWT0".


{
  "name": "assets/4vds6twPsb7",
  "displayName": "pim",
  "authorName": "Erwin Braak",
  "createTime": "2017-12-06T11:52:40.557236Z",
  "updateTime": "2020-10-07T09:49:08.752848Z",
  "formats": [
    {
      "root": {
        "relativePath": "Pim.gltf",
        "url": "https://poly.googleapis.com/downloads/fp/1602064148752848/4vds6twPsb7/c8Ksvo0_VjG/Pim.gltf",
        "contentType": "model/gltf+json"
      },
      "resources": [
        {
          "relativePath": "Pim.bin",
          "url": "https://poly.googleapis.com/downloads/fp/1602064148752848/4vds6twPsb7/c8Ksvo0_VjG/Pim.bin",
          "contentType": "application/octet-stream"
        },
        {
          "relativePath": "C:/Users/PC/Desktop/pom/pimSurface_Color.png",
          "url": "https://poly.googleapis.com/downloads/fp/1602064148752848/4vds6twPsb7/c8Ksvo0_VjG/C:/Users/PC/Desktop/pom/pimSurface_Color.png",
          "contentType": "image/png"
        }
      ],
      "formatComplexity": {
        "triangleCount": "586149"
      },
      "formatType": "GLTF2"
    },
    {
      "root": {
        "relativePath": "Pim.obj",
        "url": "https://poly.googleapis.com/downloads/fp/1602064148752848/4vds6twPsb7/5QVGpvfauVK/Pim.obj",
        "contentType": "text/plain"
      },
      "resources": [
        {
          "relativePath": "Pim.mtl",
          "url": "https://poly.googleapis.com/downloads/fp/1602064148752848/4vds6twPsb7/5QVGpvfauVK/Pim.mtl",
          "contentType": "text/plain"
        },
        {
          "relativePath": "C:/Users/PC/Desktop/pom/pimSurface_Color.png",
          "url": "https://poly.googleapis.com/downloads/fp/1602064148752848/4vds6twPsb7/5QVGpvfauVK/C:/Users/PC/Desktop/pom/pimSurface_Color.png",
          "contentType": "image/png"
        }
      ],
      "formatComplexity": {
        "triangleCount": "586149"
      },
      "formatType": "OBJ"
    }
  ],
  "thumbnail": {
    "relativePath": "4vds6twPsb7.png",
    "url": "https://lh3.googleusercontent.com/C_il4QubLWYMAyrnGPMjXWx4E7MVYLZAoX_Hf-qr4WyHfebvf2y3lndh71A350g",
    "contentType": "image/png"
  },
  "license": "CREATIVE_COMMONS_BY",
  "visibility": "PUBLIC",
  "isCurated": true,
  "presentationParams": {
    "orientingRotation": {
      "w": 1
    },
    "colorSpace": "LINEAR",
    "backgroundColor": "#eeeeee"
  }
}



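To achieve this we do need some text manipulation, so that jq does not use its array-handling facilities even for the final output array. I can think of a few variants, each leaving a different amount of work to text manipulation, but I would pick the one that limits such manipulation to the bare minimum: adding the array syntax to the final output only.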



jq -rn --stream --arg f name --arg q 'assets/fUCcxWczWT0' '"[", foreach fromstream(1|truncate_stream(inputs)) as $o ([null,null]; if .[1] or $o[$f]? == $q then [.[1], ","] else . end; .[0]//empty, if .[1] then $o else empty end), "]"'


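  • -r for raw output (i.e. not converted to JSON) when emitting arbitrary text
  • -n so that the input data is consumed only through jq's inputs builtin; otherwise the first whole object in the input would be eaten by jq's regular input loop
  • --stream to consume the input using jq's own streaming mode
  • --arg f name the name of the field that holds the value we are looking for
  • --arg q 'assets/fUCcxWczWT0' the value we look for in that field of each object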
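
To see what the combination of -n, --stream and inputs delivers, the following sketch runs fromstream(1|truncate_stream(inputs)) on a made-up two-element array (the `sample` variable and its contents are assumptions for illustration only). Each top-level element should come out as a separate JSON value, without the enclosing array ever being rebuilt.

```shell
# Hypothetical stand-in for the huge array.
sample='[{"name":"assets/1"},{"name":"assets/2"}]'

# -n keeps jq from consuming the first streaming event itself; inputs then
# yields every event, 1|truncate_stream(...) drops the outer array index from
# each event path, and fromstream rebuilds one element at a time.
elements=$(printf '%s\n' "$sample" |
  jq -cn --stream 'fromstream(1|truncate_stream(inputs))')

printf '%s\n' "$elements"
```

With this sample the two objects are printed on separate lines, one per array element.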


# Firstly, the initial bit of arbitrary text: the opening bracket of the final output,
# as required for valid JSON array syntax
"[",

# Then a foreach statement, looping over the objects provided one by one
# by the streamed input
# NOTE 1: jq's streaming mode is used for this solution primarily so that we
#         can use `1 | truncate_stream()` here, which natively strips the first
#         (1 | ...) level of structure from the original input as the stream
#         is consumed.
#         That first level is the main huge array containing the objects,
#         hence we receive the objects one at a time, in a truly streamed
#         fashion, freed from the containing array
# NOTE 2: using `inputs | tostream` here in place of `--stream`, although it works,
#         would not achieve the streaming goal, because it would first read the
#         entire input as a whole rather than streaming it from the start
    foreach fromstream( 1|truncate_stream(inputs) ) as $o (
        # the initial state of the loop: we use 2 values as a "shifting 2-value state machine"
        # for the comma to be output (as text) before every element except the
        # first. The second value also serves as the overall "found" state
        [null, null];
        # here we look for the wanted value in the wanted field, unless it was already
        # found according to the overall state, and we update the loop state by "shifting"
        # the 2-value state for the comma as soon as the wanted value is found
        if .[1] or $o[$f]? == $q then [.[1], ","] else . end;
        # here we print the comma text (if its 2-value state says so),
        # as required for separating elements in JSON array syntax, then we print
        # the object element itself if the overall state says so
        .[0]//empty, if .[1] then $o else empty end
    ),

# Lastly, the final bit of arbitrary text: the closing bracket of the output,
# as required for valid JSON array syntax
"]"
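
As a quick sanity check of the state machine described in the comments, here is a sketch that runs the same program on a made-up three-element array and then feeds the raw text back through a second jq to confirm it parses as one JSON array. The `sample` data, the `-c` flag and the final `jq -c .` are additions for the demonstration, not part of the solution itself.

```shell
# Hypothetical input; the wanted element is the second one.
sample='[{"name":"a"},{"name":"b"},{"name":"c"}]'

raw=$(printf '%s\n' "$sample" |
  jq -rnc --stream --arg f name --arg q b '
    "[",
    foreach fromstream(1|truncate_stream(inputs)) as $o (
      [null, null];
      if .[1] or $o[$f]? == $q then [.[1], ","] else . end;
      .[0]//empty, if .[1] then $o else empty end
    ),
    "]"')

# raw holds a bracket, objects and commas on separate lines;
# reparsing it shows that together they form a single valid array.
result=$(printf '%s\n' "$raw" | jq -c .)
printf '%s\n' "$result"
```

The reparsed result contains the elements named "b" and "c"; the element named "a" is dropped, and no comma precedes the first kept element.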



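  • Clearly, this approach assumes (and indeed exploits the fact) that the original input is essentially flat and simple, albeit huge: if it were a more complex overall structure, perhaps with cross-references between inner and outer values, the script could easily need to be correspondingly more complex and convoluted
  • If we wanted to output the elements up to the wanted one, rather than starting from the wanted one, the streaming logic could be simpler and might not need any state at all, because in principle we could just halt the script on encountering the wanted object, which as a side effect would also be faster (a fully streamed operation)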
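
The "up to the wanted one" variant mentioned in the last point might look like the sketch below. It assumes jq 1.6 or later (for the halt builtin), emits the elements as a plain stream of objects rather than as an array, and stops just before the wanted element; whether the wanted element itself should be included is a choice made here, not something the answer specifies. The `sample` data is hypothetical.

```shell
# Hypothetical input; the wanted element is the second one.
sample='[{"name":"a"},{"name":"b"},{"name":"c"}]'

# Emit each element until the wanted one shows up, then stop reading.
# No loop state is needed because nothing after the match is wanted.
before=$(printf '%s\n' "$sample" |
  jq -cn --stream --arg f name --arg q b '
    fromstream(1|truncate_stream(inputs))
    | if .[$f]? == $q then halt else . end')

printf '%s\n' "$before"
```

To include the wanted element as well, one option would be to emit `.` before testing it.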








$ jq '.[]' file | jq -c --argjson query 4 'select(. == $query) as $elem | [$elem, inputs]'
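
A minimal way to try this pipeline, assuming the file holds a flat array of numbers (the array below is made up for the demonstration): the first jq unpacks the array into separate values, and the second collects the matching value plus everything after it.

```shell
# Hypothetical contents of "file".
sample='[1,2,3,4,5,6,7]'

# select() drops values until the query matches; from then on inputs
# pulls all remaining values into the same output array.
result=$(printf '%s\n' "$sample" | jq '.[]' |
  jq -c --argjson query 4 'select(. == $query) as $elem | [$elem, inputs]')

printf '%s\n' "$result"
```

Note that the final array is still assembled in memory by the second jq.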


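It would be nice if we could find a way to avoid holding the input data in memory. We can try to do this with jq's streaming feature. With --stream, the jq expression receives a stream of arrays representing the current state of the input stream, without the complete input having to be parsed (see the "Streaming" section of the jq manual).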
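
To make the shape of those arrays concrete, this sketch prints the raw streaming events for a tiny made-up document. Each event pairs a path with a leaf value, and a one-element event carrying only a path marks the end of an object or array.

```shell
# Hypothetical two-element array: a number and a small object.
events=$(printf '%s\n' '[10,{"a":20}]' | jq -c --stream .)

printf '%s\n' "$events"
# [[0],10]
# [[1,"a"],20]
# [[1,"a"]]
# [[1]]
```

The first component of every path is the index into the top-level array, which is what the commands in this answer test and rewrite.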

$ jq -c --stream --argjson query 4 'fromstream(select(.[1] == $query) as $elem | $elem, inputs | .[0][0] -= $elem[0][0])' file
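
The index arithmetic in that command can be inspected by leaving out fromstream and looking at the rewritten events directly. The sketch below assumes a made-up numeric array; it first prints the events from the match onwards, with the top-level index shifted so the match sits at position 0, and then the array that fromstream rebuilds from them.

```shell
# Hypothetical contents of "file".
sample='[1,2,3,4,5]'

# Events from the matching one onwards, rebased to start at index 0.
rebased=$(printf '%s\n' "$sample" |
  jq -c --stream --argjson query 4 '
    select(.[1] == $query) as $elem | $elem, inputs | .[0][0] -= $elem[0][0]')

# The same events reassembled into an array.
result=$(printf '%s\n' "$sample" |
  jq -c --stream --argjson query 4 '
    fromstream(select(.[1] == $query) as $elem | $elem, inputs | .[0][0] -= $elem[0][0])')

printf '%s\n' "$rebased" "$result"
```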




$ cat file
[
   {
      "name": "assets/1",
      "authorName": "John Smith",
      "displayName": "The House",
      "resources": { "car": "bla", "house": "address" }
   },
   {
      "name": "assets/6",
      "authorName": "John Smith",
      "displayName": "The House",
      "resources": { "car": "bla", "house": "address" }
   },
   {
      "name": "assets/0",
      "authorName": "John Smith",
      "displayName": "The House",
      "resources": { "car": "bla", "house": "address" }
   },
   {
      "name": "assets/2",
      "authorName": "John Smith",
      "displayName": "The House",
      "resources": { "car": "bla", "house": "address" }
   }
]


$ jq --stream --arg query 'assets/0' 'fromstream(select(.[0][1] == "name" and .[1] == $query) as $elem | $elem, inputs | .[0][0] -= $elem[0][0])' file
[
  {
    "name": "assets/0",
    "authorName": "John Smith",
    "displayName": "The House",
    "resources": {
      "car": "bla",
      "house": "address"
    }
  },
  {
    "name": "assets/2",
    "authorName": "John Smith",
    "displayName": "The House",
    "resources": {
      "car": "bla",
      "house": "address"
    }
  }
]

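The single-pass command relies on name being the first key of each element, because every event streamed before the matching one is discarded. A small sketch with a hypothetical id key placed ahead of name shows the effect: the matched element comes out without its id.

```shell
# Hypothetical data in which "name" is NOT the first key.
sample='[{"id":1,"name":"a"},{"id":2,"name":"b"}]'

# The match happens on the "name" event of the second element, after its
# "id" event has already gone by, so "id" never reaches fromstream.
result=$(printf '%s\n' "$sample" |
  jq -c --stream --arg query b '
    fromstream(select(.[0][1] == "name" and .[1] == $query) as $elem
               | $elem, inputs | .[0][0] -= $elem[0][0])')

printf '%s\n' "$result"
```
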
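To work around the restriction that the query key must be the first key in each object, we can do a 2-pass operation: first determine the index of the first array element that matches on the name key, then use that index to extract our array elements from the data: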

jq --stream --arg query 'assets/0' \
   'select(.[0][1] == "name" and .[1] == $query) | .[0][0]' file |
   head -n 1 |
   jq --stream '.[1] as $index | fromstream(inputs | select(.[0][0] >= $index) | .[0][0] -= $index)' - file

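The first jq in the pipeline above outputs the indexes of all elements whose name field has the specific value we are looking for. The head -n 1 picks the first of these indexes. The second jq reads the integer from standard input into the internal $index variable, then extracts from the input file (which is read a second time) every element whose index is equal to or greater than $index.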
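
The two passes can also be tried on a small made-up file. This sketch is a variant rather than the exact pipeline above: it stores the index in a shell variable and hands it to the second jq with --argjson instead of reading it from standard input. The temporary file, the `id` key and the variable names are assumptions for illustration.

```shell
# Hypothetical data in which "name" is not the first key.
tmp=$(mktemp)
printf '%s\n' '[{"id":1,"name":"a"},{"id":2,"name":"b"},{"id":3,"name":"c"}]' > "$tmp"

# Pass 1: index of the first element whose name matches.
index=$(jq --stream --arg query b \
  'select(.[0][1] == "name" and .[1] == $query) | .[0][0]' "$tmp" | head -n 1)

# Pass 2: keep events at or beyond that index and rebase them to start at 0.
result=$(jq -cn --stream --argjson index "$index" \
  'fromstream(inputs | select(.[0][0] >= $index) | .[0][0] -= $index)' "$tmp")

rm -f "$tmp"
printf '%s\n' "$result"
```

Because the second pass filters on the element index rather than on the matching event, keys that precede name (here id) are kept.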
