Omair Shakeel

Thursday, May 05, 2011

SgmlReader - converting HTML into well-formed XML


SgmlReader is a nice .NET library that converts an SGML document into well-formed XML. It also has built-in support for converting HTML.


One of our clients was sending emails to our systems, which extracted the required information from them. The content type of their emails was HTML. The main job was to convert the HTML into XML, parse the XML, and extract the information that our system was looking for. Most of today's browsers and email clients can display content even if the HTML is not well formed, and HTML itself is not required to be well formed.

Unclosed tags such as <br> are acceptable. Attributes without enclosing double quotes are also allowed, such as <div id=mDiv> </div>. Loading such an HTML string into an XmlDocument will throw an exception. Free libraries such as SgmlReader come in handy in such cases, since they can turn ill-formed HTML into a well-formed XML document.
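
To see the failure for yourself, here is a minimal sketch that feeds small HTML fragments straight to XmlDocument. The class name WellFormedCheck and the method IsWellFormedXml are made up for this illustration, and the sample strings are hypothetical; only XmlDocument.LoadXml and XmlException come from the .NET framework.

```csharp
using System;
using System.Xml;

static class WellFormedCheck
{
    // Returns true if the text parses as XML, false if XmlDocument rejects it.
    // (Hypothetical helper, not part of SgmlReader.)
    public static bool IsWellFormedXml(string text)
    {
        XmlDocument document = new XmlDocument();
        document.XmlResolver = null; // do not resolve external resources
        try
        {
            document.LoadXml(text);
            return true;
        }
        catch (XmlException)
        {
            // LoadXml throws XmlException when the input is not well formed.
            return false;
        }
    }

    static void Main()
    {
        // Unquoted attribute value: fine for a browser, rejected as XML.
        Console.WriteLine(IsWellFormedXml("<div id=mDiv> </div>"));
        // Unclosed <br>: the closing </p> no longer matches the open tag.
        Console.WriteLine(IsWellFormedXml("<p>line one<br>line two</p>"));
        // Same div with a quoted attribute value: accepted.
        Console.WriteLine(IsWellFormedXml("<div id=\"mDiv\"> </div>"));
    }
}
```

The first two fragments are rejected and only the third one loads, which is exactly the gap SgmlReader is meant to close.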

SgmlReader is an XmlReader API over any SGML document. You can download it from here.

A common code example looks like this:

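using System.IO;
using System.Xml;
using Sgml;

// html is a string holding the (possibly ill-formed) HTML to convert
SgmlReader reader = new SgmlReader();
reader.DocType = "HTML";                       // use the built-in HTML support
reader.WhitespaceHandling = WhitespaceHandling.All;
reader.CaseFolding = CaseFolding.ToLower;      // fold names to lower case

using (StringReader htmlStringReader = new StringReader(html))
{
    reader.InputStream = htmlStringReader;

    // Load the XML document through the SgmlReader
    XmlDocument document = new XmlDocument();
    document.PreserveWhitespace = true;
    document.XmlResolver = null;               // do not resolve external resources
    document.Load(reader);
}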
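
Once the document is loaded, extracting the information is ordinary XmlDocument work. The sketch below is only an illustration: the class ExtractionSketch, the method GetTextById, the id orderNumber, and the sample markup are all made up, and the hand-written XML string merely stands in for a document that would really be loaded through SgmlReader as shown above.

```csharp
using System;
using System.Xml;

static class ExtractionSketch
{
    // Returns the trimmed text of the first element carrying the given id,
    // or null if there is no such element. (Hypothetical helper; the id is
    // pasted into the XPath as-is, so it must not contain a single quote.)
    public static string GetTextById(XmlDocument document, string id)
    {
        XmlNode node = document.SelectSingleNode("//*[@id='" + id + "']");
        return node == null ? null : node.InnerText.Trim();
    }

    static void Main()
    {
        // Stand-in for a document already loaded via SgmlReader.
        XmlDocument document = new XmlDocument();
        document.LoadXml(
            "<html><body><div id=\"orderNumber\"> 12345 </div></body></html>");

        Console.WriteLine(GetTextById(document, "orderNumber"));
    }
}
```

Matching on an attribute value rather than on a fixed path keeps the lookup independent of whatever wrapper elements end up around the content.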
