HTML Tidy takes badly formed HTML markup and generates well-formed XHTML from it.
It is available both as a command-line utility and as a Java API.
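This tool is vital if you want to 'screen scrape' data from HTML pages, because once the markup is well-formed it can be processed with ordinary XML tools. Cocoon provides HTML Tidy as a Generator, so a tidied page can serve as the start of a pipeline.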
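
To illustrate why well-formed output matters for screen scraping, here is a minimal sketch that uses only the standard Java XML APIs. It does not call HTML Tidy itself. The `tidied` string is a hand-written stand-in for the kind of output Tidy would produce (with the DOCTYPE and namespace left out for brevity), and the class name `ScrapeSketch` and helper `cellTexts` are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ScrapeSketch {

    // Hypothetical helper: returns the trimmed text of every <td> element.
    // It only works on well-formed markup; anything else makes the parser throw.
    static List<String> cellTexts(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList cells =
                (NodeList) xpath.evaluate("//td", doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            texts.add(cells.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Typical "tag soup": the <td> and <tr> elements are never closed.
        String raw = "<html><body><table><tr><td>Widget<td>4.50"
                + "</table></body></html>";

        // Assumed stand-in for what a tidying step would hand back.
        String tidied = "<html><head><title>Prices</title></head><body><table>"
                + "<tr><td>Widget</td><td>4.50</td></tr>"
                + "</table></body></html>";

        try {
            cellTexts(raw);
        } catch (SAXException e) {
            System.out.println("raw page is not well-formed, cannot scrape it");
        }

        System.out.println(cellTexts(tidied)); // prints [Widget, 4.50]
    }
}
```

The raw page is rejected by the XML parser before any data can be extracted, while the cleaned-up version can be queried with a one-line XPath expression. In a real setup the `tidied` string would come from the command-line utility or the Java API rather than being typed in by hand.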