filtering pdf links from html source code

devze.com (https://www.devze.com), 2023-02-28 02:50. Source: the web

I'm about to write a class that takes a look at the HTML source code and filters all PDF links from it. The idea behind it is simply to take the parent link plus the relative link. Basically, it works for

<a href="blabla/123.pdf">pdf</a>

but in some cases it doesn't, e.g. if the same PDF link is written as

<a href="./blabla/123.pdf">pdf</a>

<a href=" blabla/123.pdf">pdf</a>

(a leading dot and a leading space). Both are working links that point to the same PDF in the same directory when a browser parses them, but they are completely useless for the composition in my class.

I fixed the problem for the two cases above. The question is whether there are other special cases in the syntax that I should pay attention to.
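
Rather than handling each variant by hand, one option is to let `java.net.URI` do the composition. The sketch below is only an illustration: the class name `LinkResolver` and the method `resolve` are made up for this example, and it assumes the page URL has a non-empty path. Trimming the href first mirrors what browsers do with surrounding whitespace, and `URI.resolve` also covers forms such as `../x.pdf`, `/x.pdf`, `//host/x.pdf` and fully absolute links. It does not cover a `<base href>` element in the page or hrefs with unescaped spaces in the middle.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class LinkResolver {
    /**
     * Resolves an href taken from an <a> tag against the URL of the page
     * it was found on. Returns null if either string is not a valid URI.
     * (Hypothetical helper, not part of the original question.)
     */
    static URI resolve(String pageUrl, String href) {
        try {
            // Browsers ignore whitespace around the href value.
            String cleaned = href.trim();
            // resolve() merges the relative path with the base path and
            // removes "." and ".." segments from the result.
            return new URI(pageUrl).resolve(new URI(cleaned));
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

With a page URL of `http://www.mysite.com/pages/index.html`, all three hrefs from the question should resolve to `http://www.mysite.com/pages/blabla/123.pdf`.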

You do not know what the link points to until you download the file.

I can have a link like http://www.mysite.com/pages/brochure.html which internally redirects to a PDF file.

So, if you're not in control of the links, or working on a particular section of your site, you're going to fail.
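
If the links are not under your control, one way to find out what a URL really serves is to ask the server with a `HEAD` request and look at the `Content-Type` header. This is a minimal sketch under assumptions: the class and method names are invented for illustration, the timeouts are arbitrary, and it trusts the server to report `application/pdf` correctly, which not every server does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

final class PdfContentType {
    /** True if a Content-Type header value names the PDF media type. */
    static boolean isPdfContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Ignore parameters such as "; charset=binary".
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }

    /** Sends a HEAD request and checks the reported content type. */
    static boolean pointsToPdf(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD"); // headers only, no body download
            conn.setConnectTimeout(5000);  // arbitrary values for the sketch
            conn.setReadTimeout(5000);
            return isPdfContentType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```

This costs one request per link, so it is probably only worth doing for links that the cheaper extension check cannot classify.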

On the other hand, if you're working on a specific section of the site where you know every PDF link has a .pdf extension, you can simply check the extension instead of the whole path (I don't know how the .lastIndexOf("string") call from C# is written in Java).
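
For what it's worth, Java's `String` has a `lastIndexOf` method as well, so the extension check could look roughly like the sketch below. The class and method names are hypothetical. Ignoring case and stripping any query string or fragment are extra assumptions added here, not something the answer above asks for.

```java
import java.util.Locale;

final class PdfExtension {
    /** True if the path part of the href ends in ".pdf" (any letter case). */
    static boolean hasPdfExtension(String href) {
        String path = href.trim();
        // Cut off a fragment ("#page=2") and a query string ("?v=1"), if any.
        int cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        // The dot must belong to the last path segment, not to a directory.
        if (dot < 0 || dot < slash) {
            return false;
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT).equals("pdf");
    }
}
```

For example, `hasPdfExtension(" blabla/123.PDF")` is expected to return `true`, while `hasPdfExtension("pages/brochure.html")` returns `false`.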