I have an ASP.NET VB project that needs to parse some raw XML coming out of a database. The XML is laid out like this:
<HTML><HEAD><TITLE></TITLE></HEAD><BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG><A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>, <A name=HO>Ph.D.</A>; <A name=OC>scientist, professor</A>; b. <A name=BC>St. Marys</A>, Ont. <A name=BY>1970</A>; <A name=PA>d. Wm. and H. Aarts</A>; <A name=ED>e. Univ. of Western Ont. B.Sc.(Hons.) 1994, M.Sc. 1997</A>; <A name=ED>McGill Univ. Ph.D. 2002</A>; <A name=MA>m. L. MacManus</A>; two children; <A name=PO>CANADA RESEARCH CHAIR IN SIGNAL TRANSDUCTION IN ISCHEMIA</A> and <A name=PO>ASST. PROF., DEPT. OF BIOL. SCI., UNIV. OF TORONTO SCARBOROUGH 2006– </A>; Postdoctoral Fellow, Toronto Western Hosp. 2000–06; Expert Cons., Auris Med. SAS, Montpellier, France; mem., Centre for the Neurobiol. of Stress; named INMHA Brainstar of the Year 2003; Bd. of Dirs. & Fundraising Chair, N'Sheemaehn Childcare; mem., Soc. for Neurosci.; Cdn. Physiol. Soc.; Cdn. Assn. for Neurosci.; <A name=WK>co-author: 'Therapeutic Tools in Brain Damage' in <EM>Proteomics and Protein Interactions: Biology, Chemistry, Bioinformatics and Drug Design </EM>2005; 18 pub. journal articles</A>; Office: <A name=OF1_L1>1265 Military Trail</A>, <A name=OF1_CT>Scarborough</A>, <A name=OF1_PR>Ont.</A> <A name=OF1_PC>M1C 1A4</A>. </BODY></HTML>
And the code-behind I'm using is this:
Dim FullBio As New System.Xml.XmlDocument
Dim NodeList As System.Xml.XmlNodeList
Dim Node As System.Xml.XmlNode
FullBio.LoadXml(bio.Item(11))
NodeList = FullBio.SelectNodes("a")
For Each Node In NodeList
    Dim name = Node.Attributes.GetNamedItem("name").Value()
    lblEducation.Text = lblEducation.Text + name.ToString() + Node.InnerText + "<br />"
Next
So the XML loaded into the XmlDocument at FullBio.LoadXml(bio.Item(11)) is the XML I provided at the top. I am getting this error message:
'SN' is an unexpected token. The expected token is '"' or '''. Line 1, position 49.
I know that the error occurs because the attributes are not quoted. Is there any way to make XmlDocument accept the unquoted attributes, or an easy way to use a regular expression to add quotes to the attributes before loading the string into the XmlDocument?
What you have is not well-formed XML, and XmlDocument expects well-formed XML as input. I would recommend using an HTML parser such as Html Agility Pack to parse HTML (which is what you have as input). For example, if you wanted to list the name attribute values of all anchors, it's as simple as this:
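using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        // Load from a file; use document.LoadHtml(someString) if the markup is already in a string.
        document.Load("test.html");
        foreach (var a in document.DocumentNode.Descendants("a"))
        {
            // GetAttributeValue returns the fallback ("") instead of throwing when an anchor has no name attribute.
            Console.WriteLine("Name: {0}", a.GetAttributeValue("name", ""));
        }
    }
}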
I would write some logic to insert quotes around the attribute values. XmlDocument will not load the document at all if the XML isn't well-formed: LoadXml throws an XmlException instead.
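A minimal sketch of that quoting logic, in case it helps. The names LooseMarkup and QuoteAttributes are made up for this example. It only uses System.Text.RegularExpressions and System.Xml. Note that the sample in the question also contains a bare ampersand ("Bd. of Dirs. & Fundraising Chair"), which breaks XmlDocument as well, so the sketch escapes those too. It assumes no attribute value contains ">", that every tag is properly closed (no lone <br>), and that there are no HTML-only entities such as &nbsp;; real-world HTML often violates these, which is why a proper HTML parser is the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class LooseMarkup
{
    // An opening tag: '<', a letter, then everything up to the next '>'.
    private static readonly Regex Tag = new Regex(@"<[A-Za-z][^<>]*>");

    // One attribute inside a tag; the value is double-quoted, single-quoted, or bare.
    private static readonly Regex Attr =
        new Regex(@"(\s+[\w:.-]+)\s*=\s*(""[^""]*""|'[^']*'|[^\s""'<>]+)");

    // An ampersand that does not start an entity or character reference.
    private static readonly Regex BareAmp =
        new Regex(@"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static string QuoteAttributes(string markup)
    {
        // Rewrite attributes only inside opening tags, leaving text content alone.
        string quoted = Tag.Replace(markup, tag => Attr.Replace(tag.Value, attr =>
        {
            string value = attr.Groups[2].Value;
            bool alreadyQuoted = value[0] == '"' || value[0] == '\'';
            return attr.Groups[1].Value + "=" + (alreadyQuoted ? value : "\"" + value + "\"");
        }));
        // Escape stray ampersands so the XML parser accepts them.
        return BareAmp.Replace(quoted, "&amp;");
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened version of the markup from the question.
        string raw = "<BODY><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>; "
                   + "Bd. of Dirs. & Fundraising Chair</BODY>";

        var doc = new XmlDocument();
        doc.LoadXml(LooseMarkup.QuoteAttributes(raw));

        // XPath is case-sensitive: the elements are <A>, so "//a" would find nothing.
        foreach (XmlNode node in doc.SelectNodes("//A[@name]"))
        {
            Console.WriteLine(node.Attributes["name"].Value + ": " + node.InnerText);
        }
    }
}
```

Two things in the question's code would still need changing after the quoting fix: SelectNodes("a") only looks for lowercase a elements directly under the current node, so an expression such as "//A[@name]" is needed to reach the uppercase anchors nested inside BODY.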
You can use the Html2Xhtml library for this. Here is a link:
http://corsis.sourceforge.net/index.php/Html2Xhtml
And you should be able to use the library to put the contents into an XDocument, like this:
string html = "<html><head><TITLE>title</TITLE></head><body>I♥NY<p>b<br>c:±<img src=2 nonsense=x></a><font size=2>c</font></body></html>";
var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument(keepXhtmlNamespace: true);
Console.WriteLine(xdoc);
I believe Html2Xhtml supports .NET Framework 2.0 and above; if it does not, I'm fairly sure one of the earlier versions will. Failing that, you can use this:
http://www.codeproject.com/KB/XML/HTML2XHTML.aspx
This article uses HTML Tidy, and the source code from this article should work in 2.0.
You can also try SgmlReader, which is great for this kind of problem.
// Requires: using System.IO; using System.Xml; using Sgml;
// html is the string holding the markup to parse.
using (var strReader = new StringReader(html))
{
    using (SgmlReader sgmlReader = new SgmlReader())
    {
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        // Element and attribute names come out in lower case.
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = strReader;
        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.Load(sgmlReader);
    }
}