Computer Things

Querying HTML and XML in Elixir with HtmlQuery and XmlQuery


My colleague Eric Saxby and I created two Elixir libraries for finding and extracting information from HTML and XML: HtmlQuery and XmlQuery.

Quick overview

Before diving into the details, here is a quick overview.

HTML
<h1>Please update your profile</h1>

<form id="profile" test-role="profile">
  <label>Name <input name="name" type="text" value="Fido"> </label>
  <label>Age <input name="age" type="text" value="10"> </label>
  <label>Bio <textarea name="bio">Fido likes long walks and playing fetch.</textarea> </label>
</form>

XML
<profile id="123">
  <name>Fido</name>
  <age>10</age>
  <bio>Fido likes long walks and playing fetch.</bio>
</profile>

Elixir
html_contents
|> HtmlQuery.find("textarea[name=bio]")
|> HtmlQuery.text()

# returns "Fido likes long walks and playing fetch."

xml_contents
|> XmlQuery.find("//bio")
|> XmlQuery.text()

# returns "Fido likes long walks and playing fetch."

Motivation

In the six years or so that we have both been writing Elixir software, we have needed to search through a lot of HTML and XML and extract data from it. The HTML work has mostly been in unit tests of web apps. The XML work has mostly been in healthcare data processing apps.

We started out writing various helpers in each project we worked on, and eventually extracted them into HtmlQuery and, more recently, XmlQuery. The API has changed over time: it started off relatively simple, then got more complicated, and eventually came back to being simple but easily composable.

The API

The API has three main query functions: all/2, find/2, and find!/2. Each takes a string of HTML or XML (depending on the library) and a CSS selector (for HtmlQuery) or an XPath (for XmlQuery). all/2 returns every element that matches the selector. find/2 returns the first element that matches the selector, or nil if none is found. find!/2 returns the single element that matches the selector, and raises if no elements were found or if more than one element was found.
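As a rough illustration of how the three query functions differ, here is a minimal script. It assumes the library is available on Hex under the package name html_query, and the sample HTML, selectors, and labels are made up for this sketch. The article does not name the exception that find!/2 raises, so the script rescues any exception rather than guessing.

```elixir
# Assumption: the Hex package is named :html_query. Pin a real version in a project.
Mix.install([{:html_query, ">= 0.0.0"}])

# Hypothetical sample HTML for this sketch.
html = """
<ul>
  <li class="pet">Fido</li>
  <li class="pet">Rex</li>
</ul>
<p id="owner">Alice</p>
"""

# all/2 returns every matching element, so there is one entry per <li>.
pets = HtmlQuery.all(html, "li.pet")
IO.inspect(length(pets), label: "all/2 match count")

# find/2 returns the first matching element, or nil when nothing matches.
first_pet = html |> HtmlQuery.find("li.pet") |> HtmlQuery.text()
IO.inspect(first_pet, label: "find/2 first match")
IO.inspect(HtmlQuery.find(html, "li.missing"), label: "find/2 no match")

# find!/2 expects exactly one matching element.
owner = html |> HtmlQuery.find!("#owner") |> HtmlQuery.text()
IO.inspect(owner, label: "find!/2 single match")

# With zero or several matches, find!/2 raises. The exception module is not
# named in the article, so this sketch rescues any exception.
try do
  HtmlQuery.find!(html, "li.pet")
rescue
  error -> IO.puts("find!/2 raised: " <> Exception.message(error))
end
```

In tests, find!/2 is probably the safer default when exactly one element is expected, since an ambiguous selector fails loudly instead of silently picking the first match.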

Once an element is found, text/1 returns its text, and attr/2 returns the value of the given attribute. For multiple elements, we discovered that it’s simpler to pipe the results of all/2 into another function than to build various options into all/2 itself. So to get the text or value of each radio button in the following HTML:

HTML
<fieldset>
  <input type="radio" name="animal" value="1">Ant</input>
  <input type="radio" name="animal" value="2">Bat</input>
  <input type="radio" name="animal" value="3">Cat</input>
</fieldset>

you would use the following code:

Elixir
# the text of each radio button
html
|> HtmlQuery.all("input[type=radio]")
|> Enum.map(&HtmlQuery.text/1)

# the value attribute of each radio button
html
|> HtmlQuery.all("input[type=radio]")
|> Enum.map(&HtmlQuery.attr(&1, :value))
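The same pipe-into-Enum pattern should carry over to XmlQuery, with an XPath in place of the CSS selector. The sketch below assumes the library is available on Hex under the package name xml_query, and the profiles document is a made-up extension of the XML from the overview.

```elixir
# Assumption: the Hex package is named :xml_query. Pin a real version in a project.
Mix.install([{:xml_query, ">= 0.0.0"}])

# Hypothetical sample XML for this sketch.
xml = """
<profiles>
  <profile id="123"><name>Fido</name></profile>
  <profile id="456"><name>Rex</name></profile>
</profiles>
"""

# all/2 returns every element matching the XPath; text/1 is then applied to each.
names =
  xml
  |> XmlQuery.all("//profile/name")
  |> Enum.map(&XmlQuery.text/1)

IO.inspect(names, label: "names")
```

Keeping all/2 free of extraction options means any Enum function (map, filter, count, and so on) can be applied to the matched elements without the library needing to anticipate it.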

That’s the main API. Easy to remember, yet very powerful. For a more detailed description, documentation about some extra functions, and details on a shortcut for writing CSS selectors, check out the documentation for HtmlQuery and the documentation for XmlQuery.