Extracting links from a sitemap (XML)

Answer 1

You can use a Python script for this.

This script prints every link that starts with http:

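import re

# Print every URL that sits between a '>' and a '<' on the same line.
# 'https?' accepts both http and https; '.+?' stops at the first '<'.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'>(https?://.+?)<', line):
            print(url)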

In your case, the next script finds all the URLs wrapped in <loc> tags:

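import re

# Print the URL inside every <loc>...</loc> element.
# '.+?' is non-greedy, so several <loc> elements on one line are kept apart.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'<loc>(https?://.+?)</loc>', line):
            print(url)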
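
To make it easier to see what a <loc> pattern of this kind matches, here is a minimal sketch that runs it on an in-memory string instead of a file. The helper name extract_loc_urls and the sample URLs are made up for illustration. The non-greedy .+? is an assumption about what is wanted: it keeps two <loc> elements on the same line from being merged into one match.

```python
import re

# Hypothetical helper, not part of any library: pulls the text out of
# every <loc>...</loc> pair whose content starts with http:// or https://.
LOC_PATTERN = re.compile(r'<loc>(https?://.+?)</loc>')


def extract_loc_urls(text):
    """Return all URLs wrapped in <loc> tags, in document order."""
    return LOC_PATTERN.findall(text)


# Made-up sample: two entries on one line, as in a minified sitemap.
sample = ('<url><loc>http://example.com/a</loc></url>'
          '<url><loc>https://example.com/b</loc></url>')

if __name__ == '__main__':
    for url in extract_loc_urls(sample):
        print(url)
```

With a greedy .+ the two entries in the sample would come back as a single match that runs from the first <loc> to the last </loc>.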

Here is a good tool if you are not familiar with regular expressions.

If you need to load a remote file, you can use the following code:

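import re
from urllib.request import urlopen

# Download the sitemap and print the URL inside every <loc> element.
with urlopen('http://server.com/sitemap.xml') as f:
    for raw in f:
        # The response yields bytes; decode before applying a str pattern.
        line = raw.decode('utf-8')
        for url in re.findall(r'<loc>(https?://.+?)</loc>', line):
            print(url)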

Answer 2

If you are on Linux, or any other system with GNU grep (the -P option requires it), you can simply run:

grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml
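
Where GNU grep is not available, roughly the same match can be sketched in Python. This is an approximation of the command above, not a drop-in replacement: find_urls and the sample text are hypothetical, and the character class excludes all whitespace (not just spaces) because Python searches the whole string while grep works line by line.

```python
import re

# Same idea as the grep pattern: a URL runs until the first whitespace,
# double quote, parenthesis, or angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s"()<>]*')


def find_urls(text):
    """Return every http/https URL found anywhere in the text."""
    return URL_PATTERN.findall(text)


# Made-up sample mixing a <loc> element and a quoted attribute value.
sample = '<loc>https://example.com/page?id=1</loc> see "http://example.com/x"'

if __name__ == '__main__':
    for url in find_urls(sample):
        print(url)
```

Because the pattern is not tied to <loc>, it also picks up any other URL in the file, including the xmlns namespace URI on the root element of a sitemap.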

Answer 3

This can be done with a single sed command, which seems more reliable than the grep solution:

sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile

(Found at: linuxquestions.org)
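
The two steps of the sed command (delete every line that has no <loc>, then strip the tag and any leading whitespace) can be mirrored in Python as a rough sketch. The function name loc_lines and the sample sitemap are made up for illustration.

```python
import re


def loc_lines(lines):
    """Mimic the sed command on an iterable of lines without newlines."""
    result = []
    for line in lines:
        if '<loc>' not in line:
            continue  # sed: /<loc>/!d
        # sed: s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/  (first match only)
        result.append(re.sub(r'\s*<loc>(.*)</loc>', r'\1', line, count=1))
    return result


# Made-up sample with one element per line.
sample = """<url>
  <loc>http://example.com/a</loc>
  <lastmod>2020-01-01</lastmod>
</url>"""

if __name__ == '__main__':
    for url in loc_lines(sample.splitlines()):
        print(url)
```

Like the sed command, this only removes the <loc> tag itself, so it gives clean output when each <loc> element sits on its own line. A line such as <url><loc>http://example.com/a</loc></url> would keep its surrounding <url> tags.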

Answer 4

XSLT solution:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:s="http://www.sitemaps.org/schemas/sitemap/0.9">

  <xsl:output method="text" />

  <xsl:template match="s:url">
    <xsl:value-of select="s:loc" />
    <xsl:text>
</xsl:text>
  </xsl:template>

</xsl:stylesheet>
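
For readers without an XSLT processor at hand, the same selection (the loc child of every url element in the sitemap namespace) can be sketched with Python's standard xml.etree.ElementTree module. This is an equivalent in spirit rather than a way of running the stylesheet; sitemap_locs and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same prefix-to-namespace binding as the stylesheet's xmlns:s declaration.
NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def sitemap_locs(xml_text):
    """Return the text of every s:url/s:loc element under the root."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall('s:url/s:loc', NS)
            if loc.text]


# Made-up sample sitemap.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc><lastmod>2020-01-01</lastmod></url>
</urlset>"""

if __name__ == '__main__':
    for url in sitemap_locs(sample):
        print(url)
```

Unlike the regular-expression approaches, a real XML parser does not depend on line breaks or on how the file is indented, and it decodes entities such as &amp; in the URLs.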
