I have a system in PHP where the user enters a website URL, and we download the HTML and check values in tags. I have to rewrite it in Java now. I have been searching for days and can't find an easy way to do the following tasks.
1) download HTML based on URL
2) After downloading HTML check values in tags
THIS WILL NOT BUILD! CAN SOMEONE HELP ME
public String tagValue(String inHTML, String tag) throws DataNotFoundException
{
    String value = null;
    String searchFor = "/<" + tag + ">(.*?)<\/" + tag + "\>/";
    Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
    Matcher matcher = pattern.matcher(inHTML);
    return value;
}
- Check out the URLConnection documentation: http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html
- Google "java html parser" for options. You could also use regular expressions if the requirements are fairly simple and straightforward.
An example follows. It took me a while, since I haven't worked with these APIs for a long time.
jcomeau@intrepid:~/tmp$ cat test.java; javac test.java; java test
import java.util.regex.*;
import java.net.*;
import java.io.*;

public class test {
    public static void main(String[] args) throws Exception {
        // Open a connection to the page and read it line by line.
        URL target = new URL("http://www.example.com/");
        URLConnection connection = target.openConnection();
        connection.connect();
        StringBuilder html = new StringBuilder();
        String line;
        BufferedReader input = new BufferedReader(new InputStreamReader(
                connection.getInputStream()));
        while ((line = input.readLine()) != null) html.append(line);
        input.close();

        // Group 1 is the href value (quotes included), group 2 is the link text.
        Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
        Matcher matcher = pattern.matcher(html);
        System.out.println("href\ttext");
        while (matcher.find()) {
            System.out.println(matcher.group(1) + "\t" + matcher.group(2));
        }
    }
}
href text
"/"
"/domains/" Domains
"/numbers/" Numbers
"/protocols/" Protocols
"/about/" About IANA
"/go/rfc2606" RFC 2606
"/about/" About
"/about/presentations/" Presentations
"/about/performance/" Performance
"/reports/" Reports
"/domains/" Domains
"/domains/root/" Root Zone
"/domains/int/" .INT
"/domains/arpa/" .ARPA
"/domains/idn-tables/" IDN Repository
"/protocols/" Protocols
"/numbers/" Number Resources
"/abuse/" Abuse Information
"http://www.icann.org/" Internet Corporation for Assigned Names and Numbers
"mailto:iana@iana.org?subject=General%20website%20feedback" iana@iana.org
1) download HTML based on URL
There are various options. You can use a helper library such as Apache HttpComponents, or just use Java's built-in classes. See e.g. "java code to download a file from server".
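As a rough sketch of the built-in route, something like the following should work. The class name PageFetcher, the method name fetch, the 10-second timeouts and the choice of UTF-8 are my own assumptions for illustration; a real page may declare a different charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PageFetcher {
    // Hypothetical helper: downloads the document at the given address as text.
    public static String fetch(String address) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        connection.setConnectTimeout(10_000); // assumed timeout, in milliseconds
        connection.setReadTimeout(10_000);    // assumed timeout, in milliseconds

        StringBuilder html = new StringBuilder();
        // Assumes the page is UTF-8; try-with-resources closes the stream.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        String address = args.length > 0 ? args[0] : "http://www.example.com/";
        System.out.println(fetch(address));
    }
}
```

Because it only relies on URLConnection, the same method also reads file: URLs, which is handy for trying it out without a network.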
2) After downloading HTML check values in tags
You probably want to use an HTML parser. For very simple cases you could use regular expressions (as you seem to be trying in your example), but this quickly leads to problems. See this famous question: "RegEx match open tags except XHTML self-contained tags".
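To give an idea of what the parser route looks like without adding a dependency, here is a minimal sketch using the old HTML parser that ships with the JDK in javax.swing.text.html. The class name LinkLister and the method links are made up for this example, and a dedicated third-party HTML parser is likely a better choice for real-world pages; this only shows the callback style.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkLister {
    // Hypothetical helper: returns "href<TAB>text" for each <a href=...> element.
    public static List<String> links(Reader html) throws IOException {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private String href;         // href of the anchor we are inside, if any
            private StringBuilder text;  // text collected inside that anchor

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object value = attrs.getAttribute(HTML.Attribute.HREF);
                    href = (value == null) ? null : value.toString();
                    text = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (href != null) {
                    text.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && href != null) {
                    found.add(href + "\t" + text);
                    href = null;
                }
            }
        };
        // The last argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(html, callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<html><body><a href=\"/domains/\">Domains</a></body></html>";
        for (String link : links(new StringReader(sample))) {
            System.out.println(link);
        }
    }
}
```

Unlike the regular expression, this does not care about attribute order or quoting style, because the parser hands you the attributes already separated.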
THIS WILL NOT BUILD! CAN SOMEONE HELP ME
To put a "\" (backslash) into a Java string literal, you need to double it, because \ introduces escape sequences in a Java string literal. In your code, "\/" and "\>" are not valid escape sequences, which is why the compiler rejects that line. So to get a string with just a "\", write it as
String myBackslash = "\\";
See e.g. How can I print "\t" (as it looks) in Java?
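Applied to the method from the question, a version that compiles might look roughly like this. Note the assumptions: I return null when the tag is missing instead of throwing DataNotFoundException (that class was not shown), the class name TagValue is made up, and I dropped the PHP-style "/" delimiters, which Java's Pattern does not use. A forward slash needs no escaping at all in a Java regex, so the only backslashes left are the doubled ones in front of "s".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagValue {
    // Returns the text between <tag ...> and </tag>, or null if there is no match.
    public static String tagValue(String inHTML, String tag) {
        // Pattern.quote keeps any special characters in the tag name literal.
        String name = Pattern.quote(tag);
        Pattern pattern = Pattern.compile(
                "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(inHTML);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Domain</title></head></html>";
        System.out.println(tagValue(html, "title")); // prints: Example Domain
        System.out.println("\\".length());           // prints: 1
    }
}
```

This still has the usual limits of matching HTML with regular expressions (nested tags of the same name, comments, and so on), so treat it as a fix for the compile error rather than a robust extractor.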