Using jq vs. grep/sed

2024-6-8

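I am searching for the text of tweets using the following pattern: e, "text":

I was advised not to use grep and to use jq instead, but jq seems quite difficult for a beginner. I would like to know what my options are for searching for the pattern above in a JSON file that holds information on 100 tweets. Here is a snippet of the JSON file: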

 {
        "favorited": false, 
        "contributors": null, 
        "truncated": false, 
        "text": "RT @Shakti_Shetty: Hillary Clinton is killing it mad for the first time since Benghazi. \n\n#DebateNight", 
        "is_quote_status": false, 
        "in_reply_to_status_id": null, 
        "user": {
            "follow_request_sent": false, 
            "has_extended_profile": false, 
            "profile_use_background_image": true, 
            "time_zone": null, 
            "id": 1082649110, 
            "description": "words are who someone wants to be..\nACTIONS are who they truly are. if you want peace in life please don't assume don't over-think & be a nonjudgemental person", 
            "default_profile": true, 
            "verified": false, 
            "entities": {
                "description": {
                    "urls": []
                }
            }, 
            "profile_image_url_https": "https://pbs.twimg.com/profile_images/691152324291661824/kr2IsMs8_normal.jpg", 
            "profile_sidebar_fill_color": "DDEEF6", 
            "is_translator": false, 
            "geo_enabled": false, 
            "profile_text_color": "333333", 
            "followers_count": 162, 
            "protected": false, 
            "id_str": "1082649110", 
            "default_profile_image": false, 
            "listed_count": 25, 
            "lang": "en", 
            "utc_offset": null, 
            "statuses_count": 38396,

Here is another one:

{
        "favorited": false, 
        "contributors": null, 
        "truncated": true, 
        "text": "Wikileaks: NYT\u2019s Amy Chozick Privately Praised Hillary for Strong Connection with Working Class\u2026 https://t.co/bXUEHwEccE", 
        "possibly_sensitive": false, 
        "is_quote_status": false, 
        "in_reply_to_status_id": null, 
        "user": {
            "follow_request_sent": false, 
            "has_extended_profile": false, 
            "profile_use_background_image": true, 
            "time_zone": null, 
            "id": 763916668171149312, 
            "description": "We show you the truth Hot Breaking news, USA politics, Trump and conservative support", 
            "default_profile": true, 
            "verified": false, 
            "entities": {
                "description": {
                    "urls": []
                }
            }, 
            "profile_image_url_https": "https://pbs.twimg.com/profile_images/763917371702513664/IPlCWEqa_normal.jpg", 
            "profile_sidebar_fill_color": "DDEEF6", 
            "is_translator": false, 
            "geo_enabled": false, 
            "profile_text_color": "333333", 
            "followers_count": 155, 
            "protected": false, 
            "id_str": "763916668171149312", 
            "default_profile_image": false, 
            "listed_count": 3, 
            "lang": "es", 
            "utc_offset": null, 
            "statuses_count": 14162, 
            "profile_background_color": "F5F8FA", 
            "friends_count": 295, 
            "profile_link_color": "1DA1F2", 
            "profile_image_url": "http://pbs.twimg.com/profile_images/763917371702513664/IPlCWEqa_normal.jpg", 
            "notifications": false, 
            "profile_background_image_url_https": null, 
            "profile_banner_url": "https://pbs.twimg.com/profile_banners/763916668171149312/1470967188", 
            "profile_background_image_url": null, 
            "name": "Politic Manager",

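The file itself is large, and I don't know where I could share it for free (suggestions are welcome).

On a larger scale, I have 500k JSON files like this one that I need to process: count the tweets in them and extract other kinds of information, including the text of the tweets.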

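Answer 1

I am assuming that your JSON file contains a list of tweet objects, as you would get when dumping a timeline from the Twitter API.

Count the number of tweets: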

jq '. | length' tweets.json
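
For the larger job of counting tweets across many files, the same idea can probably be extended as in the sketch below. It assumes that every file holds a top-level array like the one above; the file names and their contents are hypothetical stand-ins for your data.

```shell
# Hypothetical sample files, each holding a JSON array of tweet objects.
tmpdir=$(mktemp -d)
printf '%s' '[{"text":"a"},{"text":"b"}]' > "$tmpdir/tweets1.json"
printf '%s' '[{"text":"c"}]' > "$tmpdir/tweets2.json"

# Per-file counts: jq applies the filter once to each input file.
per_file=$(jq 'length' "$tmpdir/tweets1.json" "$tmpdir/tweets2.json")
printf '%s\n' "$per_file"

# Grand total: -n stops jq from reading input on its own, `inputs` then
# yields each file's array in turn, and reduce keeps only a running sum
# instead of holding every file in memory at once.
total=$(jq -n 'reduce (inputs | length) as $n (0; . + $n)' \
  "$tmpdir/tweets1.json" "$tmpdir/tweets2.json")
echo "$total"
```

With 500k files the argument list is likely to exceed what the shell allows in one command, so the files would have to be fed to jq in batches and the partial totals added up.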

Get the text field of each tweet:

jq '.[] | .text' tweets.json
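
Note that this prints each text as a JSON string, with the surrounding quotes and with escapes such as \n left in place. If you want the plain text, the -r (--raw-output) option should do it. A small sketch with a hypothetical one-tweet file:

```shell
# Hypothetical sample: one tweet whose text contains an escaped line break.
tmpdir=$(mktemp -d)
printf '%s' '[{"text":"first line\nsecond line"}]' > "$tmpdir/sample.json"

# Default output: a JSON-encoded string, quotes and \n escape included.
encoded=$(jq '.[] | .text' "$tmpdir/sample.json")

# With -r: the decoded text, so the \n becomes a real line break.
raw=$(jq -r '.[] | .text' "$tmpdir/sample.json")

printf '%s\n---\n%s\n' "$encoded" "$raw"
```

Because a tweet can contain line breaks, the raw output is no longer one tweet per line. Keep the default JSON output if a later step counts or splits by line.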

Get the parts of each tweet's text field that match the regular expression leak.*:

jq '.[] | .text | scan("leak.*")' tweets.json
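
scan only returns the matched part and is case-sensitive. If you would rather keep the whole text of every matching tweet, which is closer to what grep does, select combined with test is one option. This is a sketch on a hypothetical sample file; the "i" flag makes the match case-insensitive.

```shell
# Hypothetical sample file with three tweets.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.json" <<'EOF'
[
  {"text": "Wikileaks: NYT reporter praised Hillary"},
  {"text": "RT @someone: #DebateNight"},
  {"text": "New LEAK published today"}
]
EOF

# Keep the full text of every tweet whose text matches, ignoring case.
matches=$(jq -r '.[] | .text | select(test("leak"; "i"))' "$tmpdir/sample.json")
printf '%s\n' "$matches"

# Count the matching tweets instead of printing them.
count=$(jq '[.[] | select(.text | test("leak"; "i"))] | length' "$tmpdir/sample.json")
echo "$count"
```

Here the first and third tweets match, so the count is 2.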
