I'm using the HTMLAgilityPack to parse HTML pages. However at some point I try to parse wrong data (in this specific case an image), which ofc fails for o开发者_运维问答bvious reasons.
Private Sub parseHtml(ByVal content As String, ByVal url As String)
Try
Dim contentHash As String = hashGenerator.ComputeHash(content, "SHA1")
Dim doc As HtmlDocument = New HtmlDocument()
doc.Load(New StringReader(content))
Dim root As HtmlNode = doc.DocumentNode
Dim anchorTags As New List(Of String)
For Each link As HtmlNode In root.SelectNodes("//a")
cururl = link.OuterHtml
If link.Attributes("href") Is Nothing Then Continue For
If Uri.IsWellFormedUriString(link.Attributes("href").Value, UriKind.Absolute) Then
urlQueue.Enqueue(link.Attributes("href").Value)
Else
Dim myUri As New Uri(url)
urlQueue.Enqueue(myUri.Scheme & "://" & myUri.Host & link.Attributes("href").Value)
End If
Next
Catch ex As Exception
MsgBox(ex.Message, MsgBoxStyle.Critical, "Error (parseHtml(" & url & "))")
End Try
End Sub
The error I get is:
A first chance exception of type 'System.NullReferenceException' occurred in Webcrawler.exe Object reference not set to an instance of an object.
On the content I try to parse:
�����Iޥ�+�: 8�0�x�
How to check whether the content is 'parse-able' before trying to parse it to prevent the error?
For now it is an image which makes an error popup however I think it might be just anything which isn't (x)html.
Thanks in advance ow great community :)
You need to check the returned content-type
header before trying to parse the returned data.
For an HTML page this should be text/html
, for XHTML is would be application/xhtml+xml
.
If you only have the content (If you can't have access to original HTTP headers like Oded suggested), you could assume a good HTML string should contain at least a "<" character within, say, the 10 first characters of the string.
Of course, there is no guarantee and you will still need to handle the extreme cases, but this should discard most garbage or unexpected content types, and will let specific encoding bytes pass fine (like UTF-8 byte order mark, etc...).
精彩评论