Getting started with Nokogiri – How to parse HTML or XML into a Nokogiri XML or HTML document

There isn’t much to add to Nokogiri’s "Parsing an HTML/XML Document" tutorial, which is an easy introduction to the subject, so start there, then return to this page to help fill in some gaps.

Nokogiri’s basic parsing attempts to clean up a malformed document, adding missing closing tags and sometimes inserting additional tags to make the markup well formed.

This is an example of telling Nokogiri that the document being parsed is a complete HTML file, and Nokogiri discovering it isn’t:

require 'nokogiri'

doc = Nokogiri::HTML('<body></body>')
puts doc.to_html

Which outputs:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body></body></html>

Notice that a DOCTYPE declaration was added, along with a wrapping <html> tag.

If we want to avoid this we can parse the document as a DocumentFragment:

require 'nokogiri'

doc = Nokogiri::HTML.fragment('<body></body>')
puts doc.to_html

which now outputs only what was actually passed in:

<body></body>

There is an XML variant also:

require 'nokogiri'

doc = Nokogiri::XML('<node />')
puts doc.to_xml

Which outputs:

<?xml version="1.0"?>
<node/>

and its fragment form:

doc = Nokogiri::XML.fragment('<node />')
puts doc.to_xml

resulting in:

<node/>

The fragment method is shorthand for DocumentFragment.parse, so you’ll sometimes see the longer form used instead.

Occasionally, Nokogiri will have to do some fix-ups to try to make sense of the document:

doc = Nokogiri::XML::DocumentFragment.parse('<node ><foo/>')
puts doc.to_xml

The repaired markup is now:

<node>
  <foo/>
</node>

The same can happen with HTML.

Sometimes the document is mangled beyond Nokogiri’s ability to fix it, but it will try anyway, resulting in a document that has a changed hierarchy. Nokogiri won’t raise an exception, but it does provide a way to check for errors and the actions it took. See "How to check for parsing errors" for more information.

See the Nokogiri::XML::ParseOptions documentation for various options used when parsing.
