I am trying to collect all the URLs whose response has the header Content-Type: text/html. So I check the response headers of each URL, and if they include Content-Type: text/html, I want to print that URL. But in my code, when I check whether the header key is Content-Type, nothing is displayed. If I remove the if statement, it prints every link related to the URL I want to crawl, along with its response headers.
public class MyCrawler extends WebCrawler {
Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
+ "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
+ "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
/*
Pattern filters = Pattern.compile("(\\.(html))");
*/
public MyCrawler() {
}
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
//System.out.println("Href: " +href);
/*
if (filters.matcher(href).matches()) {
return false;
}*/
if (href.startsWith("http://www.somehost.com/")) {
return true;
}
return false;
}
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String text = page.getText();
List<WebURL> links = page.getURLs();
int parentDocid = page.getWebURL().getParentDocid();
//HttpGet httpget = new HttpGet(url);
try {
URL url1 = new URL(url);
URLConnection connection = url1.openConnection();
Map responseMap = connection.getHeaderFields();
for (Iterator iterator = responseMap.keySet().iterator(); iterator.hasNext();)
{
String key = (String) iterator.next();
if(key==("Content-Type")) //(Anything wrong with this if loop)
{
System.out.print(key + " = ");
List values = (List) responseMap.get(key);
for (int i = 0; i < values.size(); i++) {
Object o = values.get(i);
System.out.print(o + ", ");
}
System.out.println("");
System.out.println(url1);
}
}
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
//System.out.println("Docid: " + docid);
//System.out.println("URL: " + url);
//System.out.println("Text length: " + text.length());
//System.out.println("Number of links: " + links.size());
//System.out.println("Docid of parent page: " + parentDocid);
System.out.println("=============");
}
}
If you iterate over entrySet() and call toString() on each entry, the key variable contains:
Content-Type=[text/html; charset=ISO-8859-1]
and therefore can't be matched with == or .equals("Content-Type").
If you run the following code, see what it prints out:
URLConnection connection = url1.openConnection();
Map responseMap = connection.getHeaderFields();
Iterator iterator = responseMap.entrySet().iterator();
while (iterator.hasNext())
{
String key = iterator.next().toString();
if (key.contains("Content-Type"))
{
System.out.println(key);
// Content-Type=[text/html; charset=ISO-8859-1]
// Pattern.matcher() never returns null, so test the value itself instead
if (key.contains("text/html")) {
System.out.println(url1);
// http://google.com
}
}
}
Here is the output:
Content-Type=[text/html; charset=ISO-8859-1]
http://google.com
It looks like you could also do this with a single if statement, as follows:
while (iterator.hasNext())
{
String key = iterator.next().toString();
if (key.contains("text/html"))
{
System.out.println(url1);
// http://google.com
}
}
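Note that contains("text/html") on the whole entry string matches that text anywhere in the header line. A slightly stricter sketch, assuming you only care about the media type and not the charset parameter, is to take the Content-Type value on its own, cut it at the first ';' and compare case-insensitively. The names ContentTypeCheck and isHtml are made up for this illustration:

```java
import java.util.Locale;

public class ContentTypeCheck {

    // Returns true when the Content-Type value names the text/html media type.
    // Parameters such as "; charset=ISO-8859-1" are ignored.
    public static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // header missing
        }
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=ISO-8859-1")); // true
        System.out.println(isHtml("Text/HTML"));                     // true
        System.out.println(isHtml("image/png"));                     // false
        System.out.println(isHtml(null));                            // false
    }
}
```

Inside visit(), this could be called as isHtml(connection.getContentType()) before printing url1; getContentType() returns null when the header is absent, which the helper treats as "not HTML".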
BTW, string comparison in Java is not very intuitive; it gets me every time!
For string comparison, use .equals() instead of ==.
It should work with:
if (key != null && key.equals("Content-Type"))
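The reason is that == compares object references, while .equals() compares the characters. The null check matters because the map returned by getHeaderFields() for an HTTP connection typically stores the status line under a null key. Here is a small sketch of both points; HeaderKeyDemo and findHeader are hypothetical names, the map is built by hand rather than fetched from a server, and using equalsIgnoreCase is my own addition on the assumption that you want to match header names regardless of case:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeaderKeyDemo {

    // Returns the values stored under the given header name, matching the
    // name case-insensitively and skipping the null key. Returns null if
    // no such header exists.
    public static List<String> findHeader(Map<String, List<String>> headers, String name) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String key = entry.getKey();
            if (key != null && key.equalsIgnoreCase(name)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Built at runtime, so it is a different object from the literal.
        String key = new String("Content-Type");
        System.out.println(key == "Content-Type");      // false: compares references
        System.out.println(key.equals("Content-Type")); // true: compares characters

        // Hand-built stand-in for connection.getHeaderFields()
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 200 OK")); // status line
        headers.put("Content-Type", Arrays.asList("text/html; charset=ISO-8859-1"));
        System.out.println(findHeader(headers, "content-type"));
        // prints [text/html; charset=ISO-8859-1]
    }
}
```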