Here is my code for Regex matching which worked for a webpage:
public class RegexTestHarness {
    public static void main(String[] args) {
        File aFile = new File("/home/darshan/Desktop/test.txt");
        FileInputStream inFile = null;
        try {
            inFile = new FileInputStream(aFile);
        } catch (FileNotFoundException e) {
            e.printStackTrace(System.err);
            System.exit(1);
        }
        BufferedInputStream in = new BufferedInputStream(inFile);
        DataInputStream data = new DataInputStream(in);
        String string = new String();
        try {
            while (data.read() != -1) {
                string += data.readLine();
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        Pattern pattern = Pattern
                .compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");
        Matcher matcher = pattern.matcher(string);
        boolean found = false;
        while (matcher.find()) {
            System.out.println("Name: " + matcher.group(1));
            found = true;
        }
        if (!found) {
            System.out.println("Pattern Not found");
        }
    }
}
But the same regex doesn't work in the crawler code I'm testing it for. My crawler code (I'm using WebSphinx) is:
// Our own Crawler class extends the WebSphinx Crawler
public class MyCrawler extends Crawler {
    MyCrawler() {
        super(); // Do what the parent crawler would do
    }

    // We could choose not to visit a link based on certain circumstances
    // For now we always visit the link
    public boolean shouldVisit(Link l) {
        // String host = l.getHost();
        return false; // always visit a link
    }

    // What to do when we visit the page
    public void visit(Page page) {
        System.out.println("Visiting: " + page.getTitle());
        String content = page.getContent();
        System.out.println(content);
        Pattern pattern = Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");
        Matcher matcher = pattern.matcher(content);
        boolean found = false;
        while (matcher.find()) {
            System.out.println("Name: " + matcher.group(1));
            found = true;
        }
        if (!found) {
            System.out.println("Pattern Not found");
        }
    }
}
This is my code for running the crawler:
public class WebSphinxTest {
    public static void main(String[] args) throws MalformedURLException, InterruptedException {
        System.out.println("Testing Websphinx. . .");
        // Make an instance of our own crawler
        Crawler crawler = new MyCrawler();
        // Create a "Link" object and set it as the crawler's root
        Link link = new Link("http://justeat.in/restaurant/spices/5633/indian-tandoor-chinese-and-seafood/sarjapur-road/bangalore");
        crawler.setRoot(link);
        // Start running the crawler!
        System.out.println("Starting crawler. . .");
        crawler.run(); // Blocking function, could implement a thread, etc.
    }
}
A little detail about the crawler code: shouldVisit(Link link) filters whether to visit a link or not, and visit(Page page) decides what to do when we get the page.
In the above example, test.txt and content contain the same string.
In your RegexTestHarness you're reading lines from a file and concatenating them without line breaks before you do your matching (readLine() returns the contents of the line without the line break!).
The input of your MyCrawler class, on the other hand, probably does contain line break characters. And since the regex meta-char . by default does not match line break chars, the pattern doesn't match in MyCrawler.
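The difference can be reproduced without a file or a crawler. The following is only a minimal sketch: the class name LineBreakDemo, its helper methods and the sample HTML are made up for illustration, and BufferedReader.readLine() stands in for the deprecated DataInputStream.readLine() (both return the line without its terminator).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
public class LineBreakDemo {
    // Same regex as in the question, without any flags.
    static final Pattern TITLE =
            Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");

    // Concatenates the lines the way the test harness does:
    // readLine() drops the line terminator, so the result is one long line.
    public static String joinWithoutLineBreaks(String text) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Returns the first captured title, or null if the pattern does not match.
    public static String firstTitle(String input) {
        Matcher m = TITLE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample input with line breaks, as a crawler would deliver it.
        String html = "<div class=\"rest_title\">\n  <h1>Spices</h1>\n</div>";
        System.out.println(firstTitle(html));                        // no match
        System.out.println(firstTitle(joinWithoutLineBreaks(html))); // match
    }
}
```

With the line breaks left in, the first .*? cannot get from the div to the h1, so the first call finds nothing; once the lines are joined, the same pattern matches.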
To fix this, prepend (?s) to all your patterns that contain a . meta-char. So:
Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>")
would become:
Pattern.compile("(?s)<div class=\"rest_title\">.*?<h1>(.*?)</h1>")
The DOT-ALL flag, (?s), will cause the . to match any character, including line break chars.
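As a rough sketch of the fix, the embedded (?s) flag and the Pattern.DOTALL compile flag should behave the same way here. The class name DotAllDemo, the helper firstGroup and the sample HTML below are assumptions for illustration only; they are not taken from the crawler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
public class DotAllDemo {
    // The regex from the question, without any flags.
    static final String REGEX = "<div class=\"rest_title\">.*?<h1>(.*?)</h1>";

    // Returns group 1 of the first match, or null if there is no match.
    public static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up multi-line input, similar to what page.getContent() might return.
        String html = "<div class=\"rest_title\">\n  <h1>Spices</h1>\n</div>";

        Pattern plain = Pattern.compile(REGEX);                    // . stops at line breaks
        Pattern inline = Pattern.compile("(?s)" + REGEX);          // embedded DOT-ALL flag
        Pattern flagged = Pattern.compile(REGEX, Pattern.DOTALL);  // same flag as an argument

        System.out.println(firstGroup(plain, html));
        System.out.println(firstGroup(inline, html));
        System.out.println(firstGroup(flagged, html));
    }
}
```

Note that with DOT-ALL enabled the capturing group (.*?) can also span several lines, so a captured title may itself contain line breaks if the h1 content is wrapped.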