What’s most interesting is that the slow part was not the XPath evaluation but the XHTML parsing.

The document was only 12 kB in size, yet it took around 2 minutes to parse!

To get around this limitation, you could use a streaming parser such as Woodstox (which implements the standard StAX API). Not only can you validate the document's structure, but you can also supply some fairly complex rules about what type of content your nodes and attributes can contain.

If so, you would create an XMLStreamReader and just call "next()" as long as "hasNext()" returns true.
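As a rough sketch of that loop, something like the following should work with any StAX implementation on the classpath (Woodstox, or the one bundled with the JDK). The class name StaxWalk and the helper countElements are made up for illustration, and turning off DTD support is my own assumption about what you want for XHTML input, not something the API requires.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxWalk {
    // Hypothetical helper: walks the whole document and counts start tags.
    // Returns -1 if the parser reports a problem along the way.
    static int countElements(String xml) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Assumption: we do not want the parser to process DTDs or
        // resolve external entities while reading the document.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        int count = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                // The loop described above: next() as long as hasNext() is true.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Thrown when the content is not well-formed.
            return -1;
        }
        return count;
    }
}
```

Calling StaxWalk.countElements on a small document returns the number of elements it contains. Because the reader pulls one event at a time, the document is never held in memory as a tree.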

What you are asking is how to verify that a piece of content is a well-formed XML document.

This is easily done by letting an XML parser (try to) parse the content in question; if there are issues, the parser will report an error by throwing an exception.
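For example, a minimal check along those lines could look like this. It uses the SAX parser that ships with the JDK rather than Woodstox, simply because it needs no extra dependency; the class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Hypothetical helper: true if the parser gets through the content
    // without complaint, false if it throws a parse error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            // DefaultHandler ignores all content events and rethrows
            // fatal errors, which is all a well-formedness check needs.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Not well-formed: the parser gave up with an exception.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Setup or I/O trouble, unrelated to the content itself.
            throw new IllegalStateException(e);
        }
    }
}
```

Note that this only checks well-formedness (balanced tags, legal characters and so on), not validity against a DTD or schema.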

