Checking HTML (Website) tags within Java Code

Developer https://www.devze.com 2023-03-28 05:25 Source: Internet

I have a system in PHP where the user enters a website URL and we download the HTML and check values in tags. I have to rewrite it in Java now. I have been searching for days and can't find any easy way to do the following tasks.

1) download HTML based on URL

2) After downloading HTML check values in tags

THIS WILL NOT BUILD! CAN SOMEONE HELP ME

public String tagValue(String inHTML, String tag) throws DataNotFoundException
    {
        String value = null;

        String searchFor = "/<" + tag + ">(.*?)<\/" + tag + "\>/";

        Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
        Matcher matcher = pattern.matcher(inHTML);

        return value;

    }

Check out http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html
Google "java html parser" for options. You could also use regular expressions if the requirements are fairly simple and straightforward.

An example follows. It took me a while, since I haven't worked with these APIs for a long time.

jcomeau@intrepid:~/tmp$ cat test.java; javac test.java; java test
import java.util.regex.*;
import java.net.*;
import java.io.*;
public class test {
 public static void main(String args[]) throws Exception {
  URL target = new URL("http://www.example.com/");
  URLConnection connection = target.openConnection();
  connection.connect();
  String html = "", line = null;
  BufferedReader input = new BufferedReader(new InputStreamReader(
   connection.getInputStream()));
  while ((line = input.readLine()) != null) html += line;
  Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
  Matcher matcher = pattern.matcher(html);
  System.out.println("href\ttext");
  while (matcher.find()) {
   System.out.println(matcher.group(1) + "\t" + matcher.group(2));
  }
 }
}
href    text
"/" 
"/domains/" Domains
"/numbers/" Numbers
"/protocols/"   Protocols
"/about/"   About IANA
"/go/rfc2606"   RFC 2606
"/about/"   About
"/about/presentations/" Presentations
"/about/performance/"   Performance
"/reports/" Reports
"/domains/" Domains
"/domains/root/"    Root Zone
"/domains/int/" .INT
"/domains/arpa/"    .ARPA
"/domains/idn-tables/"  IDN Repository
"/protocols/"   Protocols
"/numbers/" Number Resources
"/abuse/"   Abuse Information
"http://www.icann.org/" Internet Corporation for Assigned Names and Numbers
"mailto:iana@iana.org?subject=General%20website%20feedback" iana@iana.org

1) download HTML based on URL

There are various options. There are helper libraries, e.g. Apache HttpComponents. You can also just use Java's built-in classes. See e.g. the question "java code to download a file from server".
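As a rough illustration of the built-in route, here is a minimal sketch using only java.net and java.io. The class name HtmlDownloader, the method name download, the 10-second timeouts, and the UTF-8 charset are all assumptions made for this example; a real page may declare a different encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
public class HtmlDownloader {

    /** Reads the whole response body behind the given URL into a String. */
    public static String download(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(10_000); // milliseconds; arbitrary choice
        connection.setReadTimeout(10_000);

        StringBuilder html = new StringBuilder();
        // Assumption: the page is UTF-8. A real page may use another charset.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // readLine() strips the line terminator, so put one back
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = download(URI.create("http://www.example.com/").toURL());
        System.out.println(html.length() + " characters downloaded");
    }
}
```

The try-with-resources block closes the stream even if reading fails, and StringBuilder avoids the repeated copying that string concatenation in a loop would cause on large pages.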

2) After downloading HTML check values in tags

You probably want to use an HTML parser. For very simple cases, you could use regular expressions (as it seems you are trying to in your example), but this quickly leads to problems. See this famous question: RegEx match open tags except XHTML self-contained tags
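For the very simple case only, a regex-based version of the tagValue method from the question might look like the sketch below. It is not a substitute for a real HTML parser: it ignores nesting, comments and malformed markup. The class name TagValue is made up for this example, and it returns an Optional rather than throwing the question's DataNotFoundException, since that class isn't shown. Note that Java patterns have no PHP-style /.../ delimiters, and "/" and ">" need no escaping.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; only suitable for simple, well-formed input.
public class TagValue {

    /** Returns the text between the first <tag ...> and </tag> pair, if any. */
    public static Optional<String> tagValue(String html, String tag) {
        // Quote the tag name so it is always treated as literal text
        String name = Pattern.quote(tag);
        // No PHP-style delimiters; "/" and ">" are ordinary characters here.
        // The only backslashes are for \s, doubled inside the string literal.
        Pattern pattern = Pattern.compile(
                "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find()
                ? Optional.of(matcher.group(1).trim())
                : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Domain</title></head>"
                + "<body><h1 class=\"big\">Hello</h1></body></html>";
        System.out.println(tagValue(html, "title").orElse("(not found)"));
        System.out.println(tagValue(html, "h1").orElse("(not found)"));
        System.out.println(tagValue(html, "p").orElse("(not found)"));
    }
}
```

With the sample string in main, this prints "Example Domain", "Hello" and "(not found)". The DOTALL flag lets the value span several lines, and the reluctant (.*?) stops at the first closing tag instead of the last.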

THIS WILL NOT BUILD! CAN SOMEONE HELP ME

To put a "\" (backslash) into a literal Java string, you need to double it (because \ is used to introduce special sequences in a Java string literal). So to get a string with just a "\", write it as