I'm just starting out on my networking assignment and I'm already stuck. The assignment asks me to check a user-provided website for links and to determine whether each link is active or inactive by reading the header info. So far, after some googling, I only have this code, which retrieves the page. I don't get how to go over the retrieved content and look for HTML links. Here's the code:
import java.net.*;
import java.io.*;

public class url_checker {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://yahoo.com");
        URLConnection yc = yahoo.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        yc.getInputStream()));
        String inputLine;
        int count = 0;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();
    }
}
Please help. Thanks!
You can also try jsoup, an HTML fetcher and parser.
Document doc = Jsoup.parse(new URL("<url>"), 2000);
Elements resultLinks = doc.select("div.post-title > a");
for (Element link : resultLinks) {
    String href = link.attr("href");
    System.out.println("title: " + link.text());
    System.out.println("href: " + href);
}
This code lists every link (a element) that is a direct child of a div with class "post-title" on the page at the given URL. To list all links on the page, select "a[href]" instead.
You can try this:
// needs java.io.*, java.net.URL, javax.swing.text.MutableAttributeSet,
// javax.swing.text.html.* and javax.swing.text.html.parser.ParserDelegator
URL url = new URL(link);
// openStream() always returns an InputStream; getContent() may return another type
Reader reader = new InputStreamReader(url.openStream());
new ParserDelegator().parse(reader, new Page(), true);
Then create a class called Page:
class Page extends HTMLEditorKit.ParserCallback {
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            // Look up href directly; it is not necessarily the first attribute
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                String link = href.toString();
                // save link somewhere
            }
        }
    }
}
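To fill in the "save link somewhere" part, here is a minimal sketch that uses only JDK classes. The names LinkCollector and extractLinks, the sample HTML, and the example.com base address are all made up for illustration. The callback stores every href it sees and resolves relative ones against the page address with URI.resolve, skipping values that are not valid URIs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href of every <a> tag as an absolute URL.
class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final URI base;                       // address of the page being parsed
    private final List<String> links = new ArrayList<>();

    LinkCollector(URI base) {
        this.base = base;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t != HTML.Tag.A) {
            return;
        }
        Object href = a.getAttribute(HTML.Attribute.HREF);
        if (href == null) {
            return;                               // e.g. <a name="..."> has no href
        }
        try {
            // Relative links such as "/b" become absolute against the page address.
            links.add(base.resolve(href.toString().trim()).toString());
        } catch (IllegalArgumentException e) {
            // Not a syntactically valid URI; this sketch simply skips it.
        }
    }

    // Parses the given HTML text and returns the links in document order.
    static List<String> extractLinks(String html, URI base) {
        LinkCollector collector = new LinkCollector(base);
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/a\">A</a> "
                + "<a name=\"top\">T</a> <a href=\"/b\">B</a></body></html>";
        URI base = URI.create("http://example.com/dir/page.html");
        for (String link : extractLinks(html, base)) {
            System.out.println(link);
        }
    }
}
```

For a real page you would read the content from the connection into a Reader instead of a String, and pass the page's own address as the base.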
I don't get how to go over this information and look for HTML links
I cannot use any external library on my Assignment
You have a couple of options:
1) You can read the web page into an HTMLDocument. Then you can get an iterator from the document to find all the HTML.Tag.A tags. For each anchor tag you find, you can get the HTML.Attribute.HREF value from the tag's attribute set.
2) You can extend HTMLEditorKit.ParserCallback and override the handleStartTag(...) method. Then whenever you find an A tag, you can get the href attribute, which will again contain the link. The basic code for invoking the parser callback is:
MyParserCallback parser = new MyParserCallback();
// simple test
String file = "<html><head><here>abc<div>def</div></here></head></html>";
StringReader reader = new StringReader(file);
// read a page from the internet
//URLConnection conn = new URL("http://yahoo.com").openConnection();
//Reader reader = new InputStreamReader(conn.getInputStream());
try
{
    new ParserDelegator().parse(reader, parser, true);
}
catch (IOException e)
{
    System.out.println(e);
}
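For option 1, a rough sketch of the HTMLDocument approach could look like the following. The class name DocumentLinks, the findLinks helper and the sample HTML are hypothetical. It assumes the page content is already available as a String, and it sets the "IgnoreCharsetDirective" property so that a charset meta tag in the page does not abort the read.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper: loads HTML into an HTMLDocument and iterates over its A tags.
class DocumentLinks {

    static List<String> findLinks(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoid a ChangedCharSetException when the page declares a charset in a meta tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }

        List<String> links = new ArrayList<>();
        // The iterator visits each A tag in the document in order.
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><p><a href=\"http://example.com/a\">first</a> and "
                + "<a href=\"/b\">second</a></p></body></html>";
        for (String link : findLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Unlike the callback approach, this builds the whole document in memory first, so it is mostly worth it if you want to inspect other parts of the page as well.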
HtmlParser is what you need here. A lot of things can be done with it.
You need to get the HTTP status code that the server returned with the response. A server will return a 404 if the page does not exist.
Check out the HttpURLConnection documentation: http://download.oracle.com/javase/1.4.2/docs/api/java/net/HttpURLConnection.html
Most specifically, look at the getResponseCode method.
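As a rough sketch of that check, something like the following could work. The class name LinkStatus and both helper methods are made up, and treating any 2xx or 3xx status as "active" is my own assumption, so adjust it to whatever your assignment defines. It sends a HEAD request so that only the headers come back, and it assumes the link is an http or https URL (anything else would fail the cast).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: classifies a link by the HTTP status code of a HEAD request.
class LinkStatus {

    // Assumption: 2xx and 3xx count as active, everything else (404, 500, ...) as inactive.
    static boolean isActive(int statusCode) {
        return statusCode >= 200 && statusCode < 400;
    }

    // Sends a HEAD request and returns the status code from the response headers.
    static int fetchStatus(String link) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no response body
        conn.setConnectTimeout(5000);    // milliseconds
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        String link = args.length > 0 ? args[0] : "http://yahoo.com";
        try {
            int code = fetchStatus(link);
            System.out.println(link + " -> " + code
                    + (isActive(code) ? " (active)" : " (inactive)"));
        } catch (IOException e) {
            // DNS failure, refused connection, timeout and similar
            System.out.println(link + " -> unreachable: " + e.getMessage());
        }
    }
}
```

Note that HttpURLConnection follows redirects by default, so you will usually see the status of the final page rather than the 3xx itself. Some servers also reject HEAD requests; falling back to GET is one option in that case.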
I would parse the HTML with a tool like NekoHTML. It basically fixes malformed HTML for you and allows you to access it like XML. Then you can process the link elements and try to follow them, as you did for the original page.
You can check out some sample code that does this.