
C# - Is it possible (and how) to perform XSL transformations using SgmlReader

Developer https://www.devze.com 2023-01-27 09:52 Source: web
I needed to transform the contents of an HTML web page using XSLT, so I used SgmlReader and wrote the snippet shown below (I figured that, in the end, it's an XmlReader too...).

XmlReader xslr = XmlReader.Create(new StringReader(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
    "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">" +
    "<xsl:output method=\"xml\" encoding=\"UTF-8\" version=\"1.0\" />" +
    "<xsl:template match=\"/\">" +
    "<XXX xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><xsl:value-of select=\"count(//br)\" /></XXX>" +
    "</xsl:template>" +
    "</xsl:stylesheet>"));

XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(xslr);

using (SgmlReader html = new SgmlReader())
{
    StringBuilder sb = new StringBuilder();
    using (TextWriter sw = new StringWriter(sb))
    using (XmlWriter xw = new XmlTextWriter(sw))
    {
        html.InputStream = new StringReader(Resources.html_orig);
        html.DocType = "HTML";

        try
        {
            xslt.Transform(html, xw);
            string output = sb.ToString();
            System.Console.WriteLine(output);
        }
        catch (Exception exc)
        {
            System.Console.WriteLine("{0} : {1}", exc.GetType().Name, exc.Message);
            System.Console.WriteLine(exc.StackTrace);
        }
    }
}

Nonetheless, I get this error message:

NullReferenceException : Object reference not set to an instance of an object.
   at MS.Internal.Xml.Cache.XPathDocumentBuilder.Initialize(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at MS.Internal.Xml.Cache.XPathDocumentBuilder..ctor(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at System.Xml.XPath.XPathDocument.LoadFromReader(XmlReader reader, XmlSpace space)
   at System.Xml.XPath.XPathDocument..ctor(XmlReader reader, XmlSpace space)
   at System.Xml.Xsl.Runtime.XmlQueryContext.ConstructDocument(Object dataSource, String uriRelative, Uri uriResolved)
   at System.Xml.Xsl.Runtime.XmlQueryContext..ctor(XmlQueryRuntime runtime, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, WhitespaceRuleLookup wsRules)
   at System.Xml.Xsl.Runtime.XmlQueryRuntime..ctor(XmlQueryStaticData data, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, XmlSequenceWriter seqWrt)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlSequenceWriter results)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter writer, Boolean closeWriter)
   at System.Xml.Xsl.XmlILCommand.Execute(XmlReader contextDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter results)
   at System.Xml.Xsl.XslCompiledTransform.Transform(XmlReader input, XmlWriter results)

I found a way to work around this by converting the HTML to XML first and then applying the transform, but that's an inefficient solution because:

  1. The intermediate XHTML output goes to a buffer, so extra memory is needed.
  2. The conversion needs extra CPU time, and the same hierarchy is traversed twice (in theory unnecessarily).
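
For reference, one way such a two-step workaround might look is sketched below. This is an assumption about its shape, not the exact code I used: the class name `HtmlTransformSketch` and the method `TransformHtml` are hypothetical, and an in-memory `XmlDocument` stands in for the intermediate buffer. It shows exactly the costs listed above, since the whole page is materialised as a DOM before the transform walks it again.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;
using Sgml; // third-party SgmlReader library

// Hypothetical helper illustrating the "convert first, transform second" workaround.
static class HtmlTransformSketch
{
    public static string TransformHtml(string html, XslCompiledTransform xslt)
    {
        // Step 1: let SgmlReader parse the HTML and buffer the result as an XML DOM.
        var doc = new XmlDocument { PreserveWhitespace = true, XmlResolver = null };
        using (var sgml = new SgmlReader())
        {
            sgml.DocType = "HTML";
            sgml.CaseFolding = CaseFolding.ToLower; // so that //br also matches <BR>
            sgml.InputStream = new StringReader(html);
            doc.Load(sgml);
        }

        // Step 2: run the already loaded stylesheet against the buffered document.
        var sb = new StringBuilder();
        using (var xw = XmlWriter.Create(sb, xslt.OutputSettings))
        {
            xslt.Transform(doc, xw);
        }
        return sb.ToString();
    }
}
```

With the stylesheet from the snippet above it would be called as `HtmlTransformSketch.TransformHtml(Resources.html_orig, xslt)`.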

So (since I know the StackOverflow community always provides great answers, whereas other C# forums have completely disappointed me ;o) I'm looking for feedback and suggestions on how to perform XSL transformations on HTML directly (even if SgmlReader needs to be replaced by another similar library).


Even though the SgmlReader class extends the XmlReader class, that doesn't mean it also behaves like an XmlReader in every respect.

Technically, it also does not make sense for SgmlReader to be a subclass of XmlReader, simply because SGML is a superset of XML and not a subset.

You didn't say what the purpose of your transformation is, but in general the HTML Agility Pack is a good option for manipulating HTML.


Have you tried using the HTML Agility Pack instead of SgmlReader? You can load the HTML into it and run a transform against it directly. I'm not positive whether an XML document is created internally, though. It seems as though one is not, but you would probably want to compare memory and CPU usage against the conversion method you tried and discarded.

//You already have your xslt loaded into var xslt...

HtmlDocument doc = new HtmlDocument();
doc.Load( ... );  // load your HTML doc, or use LoadHtml to parse it from a string, etc.
xslt.Transform(doc, xw);

See also this question: How to use HTML Agility pack
