How do I iterate over URLs for a curl command?

2024-6-7

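I'm new to web scraping (and to programming in general), and I'm using Python and bash scripts to get the information I need. I'm running on WSL (Windows Subsystem for Linux), and for some reason the scripts run with git-bash.
I'm trying to create a bash script that downloads a web page's HTML and passes it to a Python script, which returns 2 txt files containing links to other web pages. The original script then loops over the links in one of the txt files and downloads each page's HTML into a file named after a specific part of the link. But that final loop doesn't work.
If I write the link in the curl command by hand, it works. But when I run the script, it doesn't.
Here is the bash script: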

#!/bin/bash

curl http://mythicspoiler.com/sets.html |
cat >>mainpage.txt
python creatingAListOfAllExpansions.py #returns two txt files containing the expansion links and the commander decks' links
rm mainpage.txt

#get the pages from the links
cat commanderDeckLinks.txt |
while read a ; do
    curl $a |          ##THIS DOESN'T WORK
    cat >>$(echo $a | cut --delimiter="/" -f4).txt
done

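I've tried several different approaches and looked at similar questions, but for the life of me I can't figure this out. I always get the same error: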

curl: (3) URL using bad/illegal format or missing URL

Here are the contents of commanderDeckLinks.txt:

http://mythicspoiler.com/cmd/index.html
http://mythicspoiler.com/c13/index.html
http://mythicspoiler.com/c14/index.html
http://mythicspoiler.com/c15/index.html
http://mythicspoiler.com/c16/index.html
http://mythicspoiler.com/c17/index.html
http://mythicspoiler.com/c18/index.html
http://mythicspoiler.com/c19/index.html
http://mythicspoiler.com/c20/index.html

Here is the Python script:

#reads the main page of the website
with open("mainpage.txt") as datafile:
    data = datafile.read()

#gets the content after the first appearance of the introduced string
def getContent(data, x):
    j=0
    content=[]
    for i in range(len(data)):
        if(data[i].strip().startswith(x) and j == 0):
            j=i
        if(i>j and j != 0):
            content.append(data[i])
    return content

#gets the content of the website that is inside the body tag
mainNav = getContent(data.splitlines(), "<!--MAIN NAVIGATION-->")

#gets the content of the website that is inside of the outside center tags
content = getContent(mainNav, "<!--CONTENT-->")

#removes extra content from list
def restrictNoise(data, string):
    content=[]
    for i in data:
        if(i.startswith(string)):
            break
        content.append(i)
    return content

#return only lines which are links
def onlyLinks(data):
    content=[]
    for i in data:
        if(i.startswith("<a")):
            content.append(i)
    return content


#creates a list of the ending of the links to later fetch
def links(data):
    link=[]
    for i in data:
        link.append(i.split('"')[1])
    return link

#adds the rest of the link
def completLinks(data):
    completeLinks=[]
    for i in data:
        completeLinks.append("http://mythicspoiler.com/"+i)
    return completeLinks

#getting the commander decks
commanderDecksAndNoise = getContent(content,"<!---->")
commanderDeck = restrictNoise(commanderDecksAndNoise, "<!---->")
commanderDeckLinks = onlyLinks(commanderDeck)
commanderDecksCleanedLinks = links(commanderDeckLinks)

#creates a txt file and writes in it
def writeInTxt(nameOfFile, restrictions, usedList):
    file = open(nameOfFile,restrictions)
    for i in usedList:
        file.write(i+"\n")
    file.close()

#creating the commander deck text file
writeInTxt("commanderDeckLinks.txt", "w+", completLinks(commanderDecksCleanedLinks))

#getting the expansions
expansionsWithNoise = getContent(commanderDecksAndNoise, "<!---->")
expansionsWithoutNoise = restrictNoise(expansionsWithNoise, "</table>")
expansionsLinksWNoise = onlyLinks(expansionsWithoutNoise)
expansionsCleanedLinks = links(expansionsLinksWNoise)

#creating the expansions text file
writeInTxt("expansionLinks.txt", "w+", completLinks(expansionsCleanedLinks))

If more information is needed to solve my problem, please let me know. Thanks to everyone trying to help.

Answer 1

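The problem here was that bash (Linux) and Windows use different line endings: LF and CRLF respectively (I'm not entirely sure, since this is all new to me). When I created a file in Python with one item per line, the file ended up with CRLF endings. Bash only treats LF as the end of a line, so every URL it read still had a CR at the end that shouldn't be there, which made the URLs unusable for curl. I didn't know how to fix this in bash, so what I did instead was have Python write a file in which the items are separated by underscores ("_"), with a final item n appended, so that I never have to deal with line endings. Then I ran a for loop in bash that iterates over each underscore-separated item except the last one. That solved the problem.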
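If the diagnosis above is right, another option is to keep one URL per line and make sure the file is written with LF endings. The following is only a minimal sketch of that idea, not what the answer actually did: the function names write_lines_lf and strip_carriage_returns are hypothetical, and it assumes the CRLF endings come from Python's default newline translation on Windows.

```python
def write_lines_lf(path, lines):
    """Write one item per line, always ending each line with LF."""
    # newline="\n" turns off Python's translation of "\n" to the platform
    # default, so no "\r\n" is written even when Python runs on Windows.
    with open(path, "w", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def strip_carriage_returns(path):
    """Rewrite an existing file in place, converting CRLF endings to LF."""
    # Binary mode is used so that Python performs no newline translation.
    with open(path, "rb") as f:
        raw = f.read()
    with open(path, "wb") as f:
        f.write(raw.replace(b"\r\n", b"\n"))


if __name__ == "__main__":
    # Hypothetical usage with two of the URLs from the question.
    write_lines_lf("commanderDeckLinks.txt", [
        "http://mythicspoiler.com/cmd/index.html",
        "http://mythicspoiler.com/c13/index.html",
    ])
```

With LF-only endings, the original while read loop should receive each URL without a trailing CR. On the bash side, a comparable effect can probably be had by filtering the file through tr -d '\r' before the loop, though that was not tried here.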
