Getting started with Nokogiri – How to extract text from a node or nodes

Extracting text from nodes correctly is one of the most common questions we see, and the task is almost invariably made harder by misusing Nokogiri’s "searching" methods.

Nokogiri supports both CSS and XPath selectors. For the two-paragraph document used in the examples below, these calls are equivalent:

doc.at('p').text   # => "foo"
doc.at('//p').text # => "foo"

doc.search('p').size   # => 2
doc.search('//p').size # => 2

The CSS selectors are extended with many of jQuery’s CSS extensions for convenience.

at and search are the generic versions of at_css and at_xpath, and of css and xpath, respectively. Nokogiri tries to determine whether a CSS or an XPath selector is being passed in. It’s possible to write a selector that fools at or search, so occasionally they will guess wrong, which is why the more specific versions of the methods exist. I use the generic versions almost always, and switch to a specific version only when I think Nokogiri will misunderstand. This practice falls under laziness, the first entry in the "Three Virtues".

If you are searching for one specific node and want its text, then use at or one of its at_css or at_xpath variants:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.at('p').text # => "foo"

at is equivalent to search(...).first, so you could use the longer-to-type version, but why?

If you use search, css or xpath and the extracted text comes back concatenated into a single string, call map(&:text) on the result instead of simply calling text:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.search('p').text        # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]

See the text documentation for NodeSet and Node for additional information.
