- Why are the DOM element names always uppercase?
- Why do I get a hierarchy request error using DOM?
- How do I add filters before the tag balancer?
- How do I parse HTML document fragments?
- How can I get the location of document information?
- Do I have to use all of Xerces2?
- What version of NekoHTML am I using?
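NekoHTML can parse, tidy, and sanitize HTML documents. It can automatically close tags and repair some common errors, and it can also be used to extract text from HTML documents.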
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make when writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags.
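To make the idea of tag balancing concrete, here is a minimal sketch in plain Java. It is not NekoHTML's implementation and uses none of its classes. The class name `TagBalancerSketch`, the `balance` method, and the two-element `IMPLICIT_CLOSE` set are all hypothetical and exist only for this illustration. It handles attribute-free tags only, closes an open `li` or `p` when a sibling of the same name starts, closes anything left open at the end of input, and drops stray end tags.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative toy only: a stack-based balancer, not NekoHTML's algorithm.
final class TagBalancerSketch {
    // Assumption for this sketch: elements whose end tag may be omitted,
    // so that a new sibling of the same name implicitly closes the open one.
    private static final Set<String> IMPLICIT_CLOSE = Set.of("li", "p");

    // Matches a start or end tag without attributes, e.g. <li> or </ul>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of open element names
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy text between tags
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean isEndTag = !m.group(1).isEmpty();
            if (!isEndTag) {
                // Close an open sibling whose end tag was omitted.
                if (IMPLICIT_CLOSE.contains(name) && name.equals(open.peek())) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // Close every element opened after the matching start tag.
                String closed;
                do {
                    closed = open.pop();
                    out.append("</").append(closed).append('>');
                } while (!closed.equals(name));
            }
            // An end tag with no matching start tag is dropped.
        }
        out.append(html.substring(pos));
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <ul><li>one</li><li>two</li></ul>
        System.out.println(balance("<ul><li>one<li>two</ul>"));
    }
}
```

A real balancer such as NekoHTML also has to deal with attributes, comments, missing parent elements, and element-specific nesting rules, all of which this sketch deliberately ignores.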
NekoHTML is written using the Xerces Native Interface (XNI), which is the foundation of the Xerces2 implementation. This enables you to use the NekoHTML parser with existing XNI tools without modifying or rewriting code.