How to correctly extract text from nodes is one of the most popular questions we see, and almost invariably is made more difficult by misusing Nokogiri’s "searching" methods.
Nokogiri supports using CSS and XPath selectors. These are equivalent:
doc.at('p').text # => "foo"
doc.at('//p').text # => "foo"
doc.search('p').size # => 2
doc.search('//p').size # => 2
The CSS selectors are extended with many of jQuery’s CSS extensions for convenience.
at and search are generic versions of at_css and at_xpath along with css and xpath. Nokogiri makes an attempt to determine whether a CSS or XPath selector is being passed in. It’s possible to create a selector that fools at or search, so occasionally it will misunderstand, which is why we have the more specific versions of the methods. In general I use the generic versions almost always, and only use the specific version if I think Nokogiri will misunderstand. This practice falls under the first entry, laziness, in "Three Virtues".
If you are searching for one specific node and want its text, then use at or one of its at_css or at_xpath variants:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT
doc.at('p').text # => "foo"
at is equivalent to search(...).first, so you could use the longer-to-type version, but why?
If the text being extracted is concatenated after using search, css or xpath, then add map(&:text) instead of simply using text:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]