Automating webbish stuff

2012-12-05

I’ve been working on updating (really, rewriting) a Twitter library to support the 1.1 API and OAuth (see earlier post). Will write more about that at a later date; today I reached a milestone in a related piece of code, and decided it was time to share it. I created a new Webbing repo on Github to contain the html module I’ll describe here.

As I’m tracking interesting Twitter content, I want to be able to summarize any web links I find for later processing. To do that, I need to clip some relevant content from the link.

HTML parsing isn’t much fun, but the trane library abstracts away much of the complexity, so I decided to start with that. I needed to export a function from the library that isn’t by default, so I forked it.

For better or worse, trane uses a SAX model. I’ve typically used DOM in the past for XML processing, so this was a bit of an experiment.

Statism

The first, most obvious problem with SAX and Erlang: SAX depends heavily on tracking state. You want to grab the title of a page? You must keep track of the tags you see so that when you get a callback for a text block, you know you’re in the title.

And, managing state in a functional language is definitely not the same as doing so in an imperative language.

I considered using OTP, particularly gen_fsm, and may yet change my code to do so. Instead, I created a record type and stashed my state in a process that can receive messages from the trane parser.

So far as I know, those are the primary paradigms for tracking state in an Erlang program.

Record syntax

At first, my code was much longer than it needed to be, but eventually I bothered to look at the documentation to discover that the shortcut I wanted was, in fact, available.

Each relevant SAX message my code receives needs to change the state of the process. Originally I did that by explicitly copying all of the old state values from one record to the new record for the next iteration of the function. Long and error-prone.

Fortunately, I was an idiot to do so.

Simplified record example

Here’s a simplified version of the record I use. It contains a boolean value indicating whether I’m currently nested somewhere in the title tag, and an empty list that will hopefully eventually be a list of strings that I encountered while inside that tag.

-record(state, { inside_title=false,
                 title_text=[] }).

Let’s create some fake functions to illustrate how to use the record.

a(#state{inside_title=InTitle, title_text=List}=State) ->
    io:format("Am I in a title? ~p~n", [InTitle]),
    b(State).

The pattern match in function a creates 3 different immutable variables:

State contains the entire record
InTitle contains the boolean value assigned to the portion of the record labeled inside_title
List contains the list assigned to the portion of the record labeled title_text

In this case, we didn’t use List, so we can simplify the code a bit:

a(#state{inside_title=InTitle}=State) ->
    io:format("Am I in a title? ~p~n", [InTitle]),
    b(State).

Now we only reference the values we’re interested in. Much cleaner.

Note that we’re passing the entire record to the b function. What if we want to change the state record before calling b? That’s a fundamental operation we need for this code, so let’s tackle that question.

Let’s assume that a is called when we are entering the title tag in the parsed HTML, so we need to call b with inside_title changed to true.

a(#state{inside_title=InTitle}=State) ->
    b(State#state{inside_title=true}).

The argument to b is definitely a little awkward, but it asserts that State is a record of type state, and whatever the State variable contains for title_text will be copied as is, but inside_title will be true when b sees it.

We can simplify that further, since we don’t care what InTitle is.

a(State) ->
    b(State#state{inside_title=true}).

Advanced pattern matching

Another new problem I encountered: the arguments that trane gives me for HTML attributes are not amenable to pattern matching.

For example, I was interested in the description metadata from HTML pages, represented as:

<meta content='Some text' name='description'/>

When trane gives that to my code as a message, it arrives as a complex tuple:

{ tag, "meta", [ { "content", "Some text" },
                 { "name", "description" } ] }

There’s no way using standard pattern matching to identify that particular meta tag.

Instead, I used what I imagine is a typical Erlang convention: I treated the list of tuples as a property list and matched the content/name values in another function.

Here’s the relevant portion of my receive block, inside my track_state function:

{tag, "meta", Attribs} ->
    track_state(check_for_description(proplists:get_value("name", Attribs),
                                      proplists:get_value("content", Attribs),
                                      State));

And here’s the check_for_description function definition:

check_for_description("description", Value, State) ->
    State#state{description = Value};
check_for_description(_, _, State) ->
    State.

Simple, elegant.

Normalization

Once I was done retrieving the values I wanted, I needed to report back to my calling process the relevant data.

The first part of the normalization process was simple: save the collected strings from the state record and throw away all of the purely internal state mechanisms, such as the inside_text boolean I illustrated above.

The more complex part, however, is the whitespace normalization.

Anyone who’s dealt with XML or HTML processing can attest to the fact that extraneous whitespace is an unavoidable evil. For example, given this HTML code:

<title>
   Page  title

</title>

The text accumulated by a SAX client would be: "\n Page title\n\n"

Fortunately, the Erlang re library makes it fairly easy to change that to "Page title".

My first step: collapse all consecutive whitespace characters to one space. By using \s to indicate any whitespace character, I deal with tabs, newlines, carriage returns, and normal spaces with the same brushstroke.

(Note, however, that in Erlang the \s has to be double-escaped, as \\s.)

re:replace(String, "\\s+", " ", [global,{return,list}])

global as an option indicates that I want the replacement to be applied as many times as possible across the full length of the string; {return,list} indicates that I want a traditional character list back, not a binary string.

Second step: eliminate leading and trailing whitespace. I originally used another re:replace, but realized later that I could rely on string:strip now that all whitespace had been rewritten as simple spaces.

string:strip(String)

My normalize_string function, then:

normalize_string(String) ->
    string:strip(re:replace(String, "\\s+", " ", [global,{return,list}])).

Message processing pitfall

One subtle error (with distinctly unsubtle symptoms) I encountered was in my message handling. As mentioned above, I asked the trane library to send messages to my process so that I could track the state in a loop.

Initially, I ignored any messages that didn’t match one of the SAX events I cared about. That proved to be a serious error.

In a long-running process, that can result in a significant degradation of performance, because the message processing bogs down as it compares each message in the mailbox against the various patterns provided. That, however, wasn’t a problem for me, because this only needed to run long enough to handle a single HTML page.

In my case, the problem is easy enough to recognize after the fact, but wasn’t so obvious to me while I was fighting it.

Messages are guaranteed to arrive in the order sent, at least between two communicating processes.

I ignored any text events until I was inside of a tag that interested me. Once I recognized a tag of interest (such as the title tag) I started storing text events in my title string.

However, because I hadn’t discarded any previous text events that I’d been ignorning, they were all still in my process mailbox, and all text previous to the title landed in the title string as I once again looped through the mailbox with my inside_title state set to true.

Simple solution: match all uninteresting events at the end of the receive block.

Reality is always uglier than theory

My real code involves many more ugly corners than I’ve captured here, so please feel free to grab the code and play with it. I decided to be polite and license this code, rather than tempt anyone to steal the code illicitly and risk polluting some other code base with unlicensed snippets.

Please contact me on Twitter (@macintux) if you have any comments on how to improve my code, or just fork and fix it on Github. (And yes, I know the code needs many more comments.)

erlang twitter

Tech Mesh 2012 Erlang fans, unite

Distributed pair programming Chatroulette done wrong