How to find out the exact RSS XML path of a website?

Developer https://www.devze.com 2022-12-22 17:47 Source: web

How do I get the exact feed.xml/rss.xml/atom.xml path of a website?

For example, I supplied "http://www.example.com/news/today/this_is_a_news", but the RSS feed is at "http://www.example.com/rss/feed.xml". Most modern browsers already have this feature, and I'm curious how they find the feed.

Can you give example code in Ruby, Python, or bash?

Something like this in Ruby will work...

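require 'nokogiri'
require 'open-uri'

# Fetch the page and parse it as HTML. URI.open comes from open-uri;
# a bare Kernel#open no longer opens URLs as of Ruby 3.0.
html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/2441954/how-to-find-out-the-exact-rss-xml-path-of-a-website'))

# Print the href of the first Atom autodiscovery link in the page.
puts html.css('link[type="application/atom+xml"]').first.attr('href')
#  => "/feeds/question/2441954"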

Notice that the href is a host-relative path rather than a full URL. That is legal, so you need to resolve it against the page's URL (here, by prepending the scheme and host).
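As a minimal sketch of that resolution step in Python, the standard library's urljoin can combine the page URL with whatever href the link element carries. The helper name resolve_feed_url is made up for illustration; the URLs are the ones from the question and the Ruby example.

```python
from urllib.parse import urljoin


def resolve_feed_url(page_url, href):
    """Resolve a feed href (relative or already absolute) against the page URL.

    resolve_feed_url is a hypothetical helper name, not part of any library.
    """
    return urljoin(page_url, href)


page_url = "http://stackoverflow.com/questions/2441954/how-to-find-out-the-exact-rss-xml-path-of-a-website"

# A host-relative href is resolved against the scheme and host of the page.
print(resolve_feed_url(page_url, "/feeds/question/2441954"))
# prints: http://stackoverflow.com/feeds/question/2441954

# An href that is already a full URL is returned unchanged.
print(resolve_feed_url(page_url, "http://www.example.com/rss/feed.xml"))
# prints: http://www.example.com/rss/feed.xml
```

Using urljoin rather than string concatenation also covers hrefs that are relative to the current directory of the page, such as "feed.xml".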

Also, the type "application/atom+xml" could instead be "application/rss+xml" or "application/rdf+xml", and a page can contain several such links, so you'll need to decide how to handle multiples. According to the autodiscovery docs, the first one listed should be the preferred one, but in my experience that isn't always the case. The docs also say that the links should point to different content rather than to alternate formats (RSS and Atom) of the same content, but again, I've seen pages do exactly that.
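One way to handle the multiple-type, multiple-link case is to collect every candidate in document order and let the caller choose. The following is a rough Python sketch using only the standard library's html.parser. The names FEED_TYPES, FeedLinkParser, and find_feed_links are invented for this example, the sample HTML is made up, and the rel="alternate" check is an assumption based on the autodiscovery convention (the Ruby snippet above does not check rel).

```python
from html.parser import HTMLParser

# Feed MIME types mentioned above; extend as needed.
FEED_TYPES = {
    "application/atom+xml",
    "application/rss+xml",
    "application/rdf+xml",
}


class FeedLinkParser(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> feed elements."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for bare attributes, so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "").strip()
        # Assumption: only rel="alternate" links count as feed candidates.
        if "alternate" in rel_tokens and mime in FEED_TYPES and href:
            self.feeds.append((mime, href))


def find_feed_links(html):
    """Return all feed candidates in the order they appear in the page."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds


# Made-up sample page with two feeds and one unrelated link element.
sample = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="/rss/feed.xml" title="News">
<link rel="stylesheet" type="text/css" href="/site.css">
<link rel="alternate" type="application/atom+xml" href="/atom.xml" />
</head><body></body></html>
"""

for mime, href in find_feed_links(sample):
    print(mime, href)
# prints:
# application/rss+xml /rss/feed.xml
# application/atom+xml /atom.xml
```

Returning the whole list keeps the policy decision (first link wins, prefer Atom over RSS, show all to the user) out of the parser, which seems prudent given that real pages do not always follow the docs.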

You may also use a command line tool like xmlstarlet (together with HTML tidy):

# version 1
curl -s http://stackoverflow.com/questions/2441954/how-to-find-out-the-exact-rss-xml-path-of-a-website | 
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -T -t -m "//*[local-name()='link']" --if "@type='application/atom+xml' or @type='application/rss+xml'" -m "@href" -v '.' -n

# version 2
curl -s http://stackoverflow.com/questions/2441954/how-to-find-out-the-exact-rss-xml-path-of-a-website | 
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:link[@type='application/atom+xml' or @type='application/rss+xml']" -v "@href" -n

In Python, use this classic solution: http://www.aaronsw.com/2002/feedfinder/