I want to open a network connection to a website, read its data line by line, and store it in a text file on my system using a shell script. I have already done this in Java, where I can read that particular resource with a URLConnection object.
In a shell script, is WGET spider the only way to do this? If not, what other methods are there to read a text file from a website, parse it, and store it in a local directory?
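For reference, a minimal sketch of one wget-based way to do this, reading the resource line by line; the URL and output path below are placeholders:

# Stream the resource to stdout with -O -, process it line by line,
# and write the result to a local text file.
wget -q -O - 'https://www.someurl.com/data.txt' |
while IFS= read -r line; do
    printf '%s\n' "$line"   # replace with real per-line parsing as needed
done > /home/user/Desktop/training.txt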
Edit
I tried using wget:
wget -o /home/user/Desktop/training.txt https://www.someurl.com
But the output was this:
--2014-04-15 00:39:15-- https://s3.amazonaws.com/hr-testcases/368/assets/trainingdata.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 176.32.99.154
Connecting to s3.amazonaws.com (s3.amazonaws.com)|176.32.99.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1554016 (1.5M) [text/plain]
Saving to: ‘trainingdata.txt.1’
0K .......... .......... .......... .......... .......... 3% 47.5K 31s
50K .......... .......... .......... .......... .......... 6% 129K 20s
100K .......... .......... .......... .......... .......... 9% 136K 16s
150K .......... .......... .......... .......... .......... 13% 149K 14s
200K .......... .......... .......... .......... .......... 16% 1.57M 11s
250K .......... .......... .......... .......... .......... 19% 162K 10s
300K .......... .......... .......... .......... .......... 23% 678K 9s
350K .......... .......... .......... .......... .......... 26% 612K 7s
400K .......... .......... .......... .......... .......... 29% 307K 7s
450K .......... .......... .......... .......... .......... 32% 630K 6s
500K .......... .......... .......... .......... .......... 36% 699K 5s
550K .......... .......... .......... .......... .......... 39% 520K 5s
600K .......... .......... .......... .......... .......... 42% 580K 4s
650K .......... .......... .......... .......... .......... 46% 516K 4s
700K .......... .......... .......... .......... .......... 49% 551K 3s
750K .......... .......... .......... .......... .......... 52% 713K 3s
800K .......... .......... .......... .......... .......... 56% 720K 3s
850K .......... .......... .......... .......... .......... 59% 701K 2s
900K .......... .......... .......... .......... .......... 62% 603K 2s
950K .......... .......... .......... .......... .......... 65% 670K 2s
1000K .......... .......... .......... .......... .......... 69% 715K 2s
1050K .......... .......... .......... .......... .......... 72% 671K 1s
1100K .......... .......... .......... .......... .......... 75% 752K 1s
1150K .......... .......... .......... .......... .......... 79% 535K 1s
1200K .......... .......... .......... .......... .......... 82% 607K 1s
1250K .......... .......... .......... .......... .......... 85% 675K 1s
1300K .......... .......... .......... .......... .......... 88% 727K 1s
1350K .......... .......... .......... .......... .......... 92% 707K 0s
1400K .......... .......... .......... .......... .......... 95% 632K 0s
1450K .......... .......... .......... .......... .......... 98% 785K 0s
1500K .......... ....... 100% 931K=4.5s
2014-04-15 00:39:23 (341 KB/s) - ‘trainingdata.txt.1’ saved [1554016/1554016]
It seems to only give statistics such as the time taken for the download. It does not save the actual data from the URL.
Answer 1
It sounds like you want netcat:
Netcat is a featured networking utility which reads and writes data across network connections, using the TCP/IP protocol. It is designed to be a reliable "back-end" tool that can be used directly or easily driven by other programs and scripts. At the same time, it is a feature-rich network debugging and exploration tool, since it can create almost any kind of connection you would need and has several interesting built-in capabilities.
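As a rough sketch (not part of the original answer), fetching a resource with nc could look like the lines below. Note that nc speaks plain TCP only, so an https URL like the one in the question would need a TLS-capable front end such as openssl s_client instead; the hostname and path here are placeholders:

# Send a minimal HTTP/1.0 request and strip the response headers,
# keeping only the body (plain HTTP only; nc does not do TLS).
printf 'GET /data.txt HTTP/1.0\r\nHost: example.com\r\n\r\n' \
  | nc example.com 80 \
  | sed '1,/^\r\{0,1\}$/d' > /home/user/Desktop/training.txt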
For more information, you can always run man nc.
Answer 2
The command you are running uses the -o flag, which does the following (from man wget):
-o logfile
--output-file=logfile
Log all messages to logfile. The messages are normally reported to
standard error.
It does not actually save the target of the URL; it only saves the standard error stream of wget. By default, wget saves the target under the same name as the remote file. For example, running
wget http://www.foo.com/index.html
saves the file as index.html in the current directory. To give the file a different name, use -O instead (a capital O, as in Oliver):
-O file
--output-document=file
The documents will not be written to the appropriate files, but all
will be concatenated together and written to file. If - is used as
file, documents will be printed to standard output, disabling link
conversion. (Use ./- to print to a file literally named -.)
Use of -O is not intended to mean simply "use the name file instead
of the one in the URL;" rather, it is analogous to shell
redirection: wget -O file http://foo is intended to work like wget
-O - http://foo > file; file will be truncated immediately, and all
downloaded content will be written there.
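Applied to the command from the question, the fix is simply to use the capital-O flag:

wget -O /home/user/Desktop/training.txt https://www.someurl.com

Per the excerpt above, this behaves like wget -O - https://www.someurl.com > /home/user/Desktop/training.txt: the file is truncated immediately and all downloaded content is written there.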