There isn’t much to add to Nokogiri’s "Parsing an HTML/XML Document" tutorial, which is an easy introduction to the subject, so start there, then return to this page to help fill in some gaps.
By default, Nokogiri tries to clean up a malformed document: it may add missing closing tags, and it will add whatever extra tags are needed to make the document well formed.
This is an example of telling Nokogiri that the document being parsed is a complete HTML file, and Nokogiri discovering it isn’t:
    require 'nokogiri'

    doc = Nokogiri::HTML('<body></body>')
    puts doc.to_html

This prints:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body></body></html>
Notice that a DOCTYPE declaration was added, along with a wrapping <html> tag.
If we want to avoid this we can parse the document as a DocumentFragment:
    require 'nokogiri'

    doc = Nokogiri::HTML.fragment('<body></body>')
    puts doc.to_html
which now outputs only what was actually passed in:

    <body></body>
There is an XML variant also:
    require 'nokogiri'

    doc = Nokogiri::XML('<node />')
    puts doc.to_xml

This prints:

    <?xml version="1.0"?>
    <node/>
Parsed as a fragment instead:

    doc = Nokogiri::XML.fragment('<node />')
    puts doc.to_xml

the XML declaration is omitted and only <node/> is printed.
A more verbose variation of fragment is to use DocumentFragment.parse, so sometimes you’ll see it written that way.
Occasionally, Nokogiri will have to do some fix-ups to try to make sense of the document:
    require 'nokogiri'

    # <node> is never closed, so the parser has to repair it
    doc = Nokogiri::XML::DocumentFragment.parse('<node ><foo/>')
    puts doc.to_xml
The repaired markup now has the missing closing tag:

    <node>
      <foo/>
    </node>
The same can happen with HTML.
Sometimes the document is mangled beyond Nokogiri’s ability to fix it, but it will try anyway, resulting in a document that has a changed hierarchy. Nokogiri won’t raise an exception, but it does provide a way to check for errors and the actions it took. See "How to check for parsing errors" for more information.
See the Nokogiri::XML::ParseOptions documentation for various options used when parsing.