Preventing errors with HtmlAgilityPack in VB.NET

Source: web, via https://www.devze.com, 2023-02-01 19:39
I'm using the HtmlAgilityPack to parse HTML pages. However, at some point I try to parse the wrong kind of data (in this specific case an image), which of course fails for obvious reasons.

Private Sub parseHtml(ByVal content As String, ByVal url As String)
    Try
        Dim contentHash As String = hashGenerator.ComputeHash(content, "SHA1")
        Dim doc As HtmlDocument = New HtmlDocument()

        doc.Load(New StringReader(content))

        Dim root As HtmlNode = doc.DocumentNode
        Dim anchorTags As New List(Of String)

        For Each link As HtmlNode In root.SelectNodes("//a")
            cururl = link.OuterHtml
            If link.Attributes("href") Is Nothing Then Continue For
            If Uri.IsWellFormedUriString(link.Attributes("href").Value, UriKind.Absolute) Then
                urlQueue.Enqueue(link.Attributes("href").Value)
            Else
                Dim myUri As New Uri(url)
                urlQueue.Enqueue(myUri.Scheme & "://" & myUri.Host & link.Attributes("href").Value)
            End If
        Next
    Catch ex As Exception
        MsgBox(ex.Message, MsgBoxStyle.Critical, "Error (parseHtml(" & url & "))")
    End Try
End Sub

The error I get is:

A first chance exception of type 'System.NullReferenceException' occurred in Webcrawler.exe Object reference not set to an instance of an object.

On the content I try to parse:

�����Iޥ�+�: 8�0�x�

How to check whether the content is 'parse-able' before trying to parse it to prevent the error?

For now it is an image that makes the error pop up, but I think it could be anything that isn't (X)HTML.

Thanks in advance, oh great community :)


You need to check the returned content-type header before trying to parse the returned data.

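For an HTML page this should be text/html; for XHTML it would be application/xhtml+xml. The header value may also carry parameters, for example text/html; charset=utf-8, so compare only the media type part.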
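
A minimal sketch of that check, assuming the page is downloaded with WebRequest (the question does not show its download code). IsHtmlContentType and DownloadIfHtml are hypothetical helper names made up for this illustration; they are not part of HtmlAgilityPack.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module ContentTypeCheck
    ' Hypothetical helper: True when the Content-Type value names an (X)HTML media type.
    Function IsHtmlContentType(ByVal contentType As String) As Boolean
        If String.IsNullOrEmpty(contentType) Then Return False
        ' Drop parameters such as "; charset=utf-8" before comparing.
        Dim mediaType As String = contentType.Split(";"c)(0).Trim()
        Return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase) OrElse _
               mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase)
    End Function

    ' Hypothetical helper: returns the response body as a string, or Nothing
    ' when the server reports a non-HTML content type (e.g. image/png).
    Function DownloadIfHtml(ByVal url As String) As String
        Dim request As WebRequest = WebRequest.Create(url)
        Using response As WebResponse = request.GetResponse()
            If Not IsHtmlContentType(response.ContentType) Then Return Nothing
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

With something like this in place, the crawler would call parseHtml only when DownloadIfHtml returns something other than Nothing, so image responses never reach the parser.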


If you only have the content (that is, you can't get at the original HTTP headers that Oded suggested checking), you could assume that a good HTML string contains at least one "<" character within, say, the first 10 characters of the string.

Of course, there is no guarantee, and you will still need to handle edge cases, but this should discard most garbage or unexpected content types while still letting leading encoding characters (such as a UTF-8 byte order mark) through.
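
A rough sketch of this heuristic, under the assumption that the content is already available as a String. LooksLikeHtml and CountAnchors are hypothetical names used only for illustration. The second function also guards one edge case worth knowing about: HtmlAgilityPack's SelectNodes returns Nothing rather than an empty collection when no node matches, which is a likely source of the NullReferenceException in the question's For Each loop.

```vbnet
Imports System
Imports HtmlAgilityPack

Module HtmlSniffing
    ' Hypothetical helper: True when a "<" appears within the first 10 characters.
    Function LooksLikeHtml(ByVal content As String) As Boolean
        If String.IsNullOrEmpty(content) Then Return False
        Dim window As Integer = Math.Min(content.Length, 10)
        Return content.IndexOf("<"c, 0, window) >= 0
    End Function

    ' Hypothetical helper: counts the <a> elements, tolerating non-HTML input.
    Function CountAnchors(ByVal content As String) As Integer
        If Not LooksLikeHtml(content) Then Return 0
        Dim doc As New HtmlDocument()
        doc.LoadHtml(content)
        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim links As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a")
        If links Is Nothing Then Return 0
        Return links.Count
    End Function
End Module
```

Binary data can still contain a "<" in its first few characters by chance, so the Nothing check is still needed even when the sniffing test passes.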
