Is there a DTD for HTML? If so, can an XML parser use it to parse HTML documents? I'm looking for a simple way to parse HTML documents the same way I parse XML documents with JAXP.
HTML is an SGML application, and although it has a DTD (https://www.w3.org/TR/html401/), SGML DTDs are different from XML DTDs.
An SGML system can work with all XML DTDs (in theory at least), but XML systems cannot work with all SGML DTDs.
The big problem is that HTML parsers need to infer the structure of the document when tags (such as </p>) are missing. An XML parser does no such inference: it rejects a document with a missing end tag as not well-formed. Inferring the structure is hard, and not all tools give you the same result.
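To illustrate the missing-tag problem, here is a minimal sketch that feeds two snippets to a standard JAXP DocumentBuilder. The class name HtmlVsXml and the helper parsesAsXml are hypothetical names chosen for this example, and the two markup strings are made up for illustration. It only shows that an XML parser refuses HTML with omitted end tags; it does not show how to parse such HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HtmlVsXml {

    // Hypothetical helper: returns true if the markup parses as XML,
    // false if the JAXP parser reports it as not well-formed.
    public static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for well-formedness errors such as an unclosed element.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every element is explicitly closed, as XML requires.
        String closed = "<html><body><p>One</p><p>Two</p></body></html>";
        // Legal in HTML 4.01 (the </p> end tag is optional), but not well-formed XML.
        String omitted = "<html><body><p>One<p>Two</body></html>";

        System.out.println(parsesAsXml(closed));   // prints "true"
        System.out.println(parsesAsXml(omitted));  // prints "false"
    }
}
```

If the HTML you need to read is also well-formed XML (as XHTML is meant to be), JAXP can handle it directly. Otherwise you need a tool that performs the tag inference described above before the document reaches an XML parser.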