My colleague Eric Saxby and I created two Elixir libraries for finding and extracting information from HTML and XML. One is called HtmlQuery and the other is called XmlQuery.
Quick overview
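Before diving into the details, here is a quick overview. Given the HTML and XML documents below, each library can find the bio element and extract its text.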
<h1>Please update your profile</h1>
<form id="profile" test-role="profile">
<label>Name <input name="name" type="text" value="Fido"> </label>
<label>Age <input name="age" type="text" value="10"> </label>
<label>Bio <textarea name="bio">Fido likes long walks and playing fetch.</textarea> </label>
</form>
<profile id="123">
<name>Fido</name>
<age>10</age>
<bio>Fido likes long walks and playing fetch.</bio>
</profile>
HtmlQuery.find(html_contents, "textarea[name=bio]")
|> HtmlQuery.text()
# returns "Fido likes long walks and playing fetch."
XmlQuery.find(xml_contents, "//bio")
|> XmlQuery.text()
# returns "Fido likes long walks and playing fetch."
Motivation
In the six years or so that we have both been writing Elixir software, we have needed to search and extract data from a lot of HTML and XML. The HTML work has mostly been in unit tests of web apps. The XML work has mostly been in healthcare data processing apps.
We started by writing various helpers in each project we worked on, then eventually extracted HtmlQuery and, more recently, XmlQuery. The API has changed over time: it started off relatively simple, grew more complicated, and eventually became simple again, but easily composable.
The API
The API has three main query functions: all/2, find/2, and find!/2.
They each take a string of HTML or XML (depending on the library) and a CSS selector (for HtmlQuery) or an XPath (for XmlQuery). all/2 returns every element that matches the selector, find/2 returns the first element that matches the selector (or nil if none is found), and find!/2 returns the single element that matches the selector, raising if no elements were found or if more than one element was found.
Once an element is found, text/1 returns its text, and attr/2 returns the value of the given attribute.
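To make the difference between find/2 and find!/2 concrete, here is a minimal sketch. It assumes the library can be installed from Hex under the package name :html_query, and it uses a trimmed-down copy of the profile form from the overview. The variable names are only for illustration.

```elixir
# Assumption: the package is available on Hex as :html_query.
Mix.install([:html_query])

html = """
<form id="profile">
  <label>Name <input name="name" type="text" value="Fido"></label>
  <label>Age <input name="age" type="text" value="10"></label>
</form>
"""

# find/2 returns the first matching element, or nil when nothing matches.
first_input = HtmlQuery.find(html, "input")
missing = HtmlQuery.find(html, "select")

# attr/2 reads a single attribute from the element that was found.
first_name = HtmlQuery.attr(first_input, :name)

# find!/2 expects exactly one match, so a selector that matches both
# inputs raises. The exception type is not specified in this post, so
# this sketch rescues any exception.
result =
  try do
    HtmlQuery.find!(html, "input")
  rescue
    _error -> :raised
  end

IO.inspect(first_name, label: "name attribute of first input")
IO.inspect(missing, label: "find/2 with no match")
IO.inspect(result, label: "find!/2 with two matches")
```

In tests, find!/2 is usually the safer choice when a selector is supposed to identify exactly one element, because an ambiguous selector fails loudly instead of silently returning the first match.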
When finding multiple elements with all/2, we discovered that it’s simpler to pipe the results into another function than to build various options into all/2. So to get the text or value of each radio button in the following HTML:
<fieldset>
<input type="radio" name="animal" value="1">Ant</input>
<input type="radio" name="animal" value="2">Bat</input>
<input type="radio" name="animal" value="3">Cat</input>
</fieldset>
you would use the following code:
# text
html
|> HtmlQuery.all("input[type=radio]")
|> Enum.map(&HtmlQuery.text/1)
# value
html
|> HtmlQuery.all("input[type=radio]")
|> Enum.map(&HtmlQuery.attr(&1, :value))
That’s the main API. Easy to remember, yet very powerful. For a more detailed description, some extra functions, and a shortcut for writing CSS selectors, check out the documentation for HtmlQuery and the documentation for XmlQuery.