Help extracting text from html tag with Java and Regex

Developer https://www.devze.com 2023-03-31 16:20 Source: Internet
I would like to extract some text from an HTML file using regex. I am learning regex and still have trouble understanding it all. I have code which extracts all the text included between <body> and </body>; here it is:

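import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {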

public static void main(String[] args) throws IOException{

String toMatch=readFile();
//Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
Pattern pattern=Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
Matcher matcher=pattern.matcher(toMatch);

if(matcher.matches()) {
    System.out.println(matcher.group(1));
}

}

 private static String readFile() {

      try{
            // Open the file that is the first 
            // command line parameter
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            //Read File Line By Line
            while (br.readLine() != null)   {
                // Print the content on the console
                //System.out.println (strLine);
                strLine+=br.readLine();
            }
            //Close the input stream
            in.close();
            return strLine;
            }catch (Exception e){//Catch exception if any

                System.err.println("Error: " + e.getMessage());
                return "";
            }
}
}

Well, it works fine like this, but now I would like to extract the text between the tags <table class="claroTable"> and </table>.

So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?". I have also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?", but it doesn't work and I don't understand why. There is only one table in the HTML file, but there is an occurrence of "table" in some JavaScript code: "...dataTables.js...". Could that be the reason for the mistake?

Thank you in advance for helping me,

EDIT: the HTML text to extract is something like:

<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>

What I would like to extract is anything between <table class="claroTable"> and </table>


Here's how you can do it with the JSoup parser:

File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();

Yes, you can also somehow do it with regex, but it will never be this easy.

Update: The main problem with your regex pattern is that you are missing the DOTALL flag:

Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
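To make the effect of the flag concrete, here is a minimal sketch. By default `.` in a Java regex does not match line terminators, so a pattern that has to span several lines fails under matches() unless Pattern.DOTALL is set. The class name DotallDemo, its helper methods, and the sample HTML are made up for illustration; only the pattern comes from the question.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // The <body> pattern from the question (hypothetical constant name).
    static final String REGEX = ".*?<body.*?>(.*?)</body>.*?";

    // Default mode: '.' stops at line terminators.
    static boolean matchesWithoutDotall(String html) {
        return Pattern.compile(REGEX).matcher(html).matches();
    }

    // DOTALL mode: '.' also matches '\n', so the pattern can span lines.
    static boolean matchesWithDotall(String html) {
        return Pattern.compile(REGEX, Pattern.DOTALL).matcher(html).matches();
    }

    public static void main(String[] args) {
        // Illustrative multi-line input, not the asker's real file.
        String html = "<html>\n<body>\nhello\n</body>\n</html>";
        System.out.println(matchesWithoutDotall(html)); // false
        System.out.println(matchesWithDotall(html));    // true
    }
}
```

An input with no line breaks matches in both modes, which may be why a pattern can appear to work on text whose line breaks were stripped while it was being read.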

And if you just want the specified table tag with contents, you can do something like this:

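// The trailing .* is greedy on purpose: replaceFirst() only replaces what the
// pattern matched, and a lazy .*? at the end would match nothing, leaving
// everything after </table> in the result.
String tableTag = 
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*",Pattern.DOTALL)
           .matcher(html)
           .replaceFirst("$1");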

(Updated: now returns the contents of the table tag only, not the table tag itself)
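A separate point about the question's code: the readFile() loop calls readLine() twice per iteration, so every other line is dropped, and because strLine starts as null the result begins with the text "null". If the line holding the opening table tag is one of the dropped ones, no pattern can find it. The following is only a sketch of an alternative, assuming a single non-nested table: it reads the file in one call, keeping its line breaks, and pulls out the capture group with find() instead of matching the whole document. The names TableExtractor, readFile and extractTable, the [^>]* tag matching, and the sample input are illustrative choices, not from the original posts.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableExtractor {
    // DOTALL lets '.' cross line breaks; [^>]* keeps the attribute
    // match inside the opening <table ...> tag.
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>",
            Pattern.DOTALL);

    /** Reads the whole file at once, keeping every line and its line breaks. */
    static String readFile(String path, Charset charset) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), charset);
    }

    /** Returns the inner HTML of the first claroTable, or null if none is found. */
    static String extractTable(String html) {
        Matcher m = TABLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        // Use the file given on the command line, else a small made-up sample.
        String html = args.length > 0
                ? readFile(args[0], StandardCharsets.ISO_8859_1)
                : "<body>\n<script src=\"dataTables.js\"></script>\n"
                  + "<table class=\"claroTable\">\n<tr><td>some data</td></tr>\n</table>\n</body>";
        System.out.println(extractTable(html));
    }
}
```

Because the pattern requires a literal `<table`, the "dataTables.js" text in a script tag does not interfere. Nested tables would still defeat the lazy `(.*?)</table>` part, which is one more reason to prefer a parser for anything beyond a simple page.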


As stated, this is a bad place to use regex. Only use regex when you actually need to; otherwise, try to stay away from it. For parser options, take a look at this post:

How to parse and modify HTML file in Java
