URL Text Histogram

A toy two-page web service that generates a table of all words and corresponding frequencies for a given URL.

Building

Dependencies:

ghc 7.8 or greater
cabal-install 1.20 or greater
lynx
PostgreSQL 9.4 or greater

Installing the dependencies on OS X is easy if you have Homebrew:

> brew update
> brew install ghc cabal-install lynx postgres

Building is not quite as easy, but we can push through it together. Clone this repo to your directory of choice, cd into said directory, ensure Postgres is running, and then:

> bin/db create                      # create database and apply the schema
> cabal sandbox init                 # keep app packages out of global listing
> cabal install --only-dependencies  # install only what is needed to run
> cabal run 1337                     # or the port number of your choice

That's it! Now just visit localhost:1337 to use the running application.

Caveats

The form only allows full URLs, i.e. the host is required.
Input and server errors are not particularly user-friendly.

Why the Lynx Dependency?

Nothing is better at parsing the contents of various documents (particularly HTML) than a browser, and attempting to write such a parser myself would likely lead to insanity.

Instead I pass the validated URL to Lynx and process the streaming results via conduits (which HOMG are awesome and you should totally use them in any data processing applications!).
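The overall shape of that step can be sketched without conduits. What follows is a simplified, non-streaming stand-in rather than the application's actual pipeline: it reads Lynx's whole output into memory with readProcess and counts words with Data.Map. The names wordHistogram, fetchText, and urlHistogram are made up for illustration, and the tokenizing rule (lowercased runs of letters) is an assumption.

```haskell
import           Data.Char       (isAlpha, toLower)
import qualified Data.Map.Strict as Map
import           System.Process  (readProcess)

-- | Count how often each word occurs in a block of text.
-- Assumption: a "word" is a maximal run of letters, lowercased;
-- every other character is treated as a separator (so "don't"
-- is counted as "don" and "t").
wordHistogram :: String -> Map.Map String Int
wordHistogram =
  Map.fromListWith (+) . map (\w -> (w, 1)) . words . map normalize
  where
    normalize c
      | isAlpha c = toLower c
      | otherwise = ' '

-- | Hypothetical helper: ask Lynx for a plain-text rendering of a page.
-- -dump prints the rendered page to stdout; -nolist omits the link list.
-- Unlike a conduit pipeline, this buffers the whole output in memory.
fetchText :: String -> IO String
fetchText url = readProcess "lynx" ["-dump", "-nolist", url] ""

-- | Fetch a URL through Lynx and build its word histogram.
urlHistogram :: String -> IO (Map.Map String Int)
urlHistogram url = wordHistogram <$> fetchText url
```

For example, Map.toList (wordHistogram "the cat and the hat") evaluates to [("and",1),("cat",1),("hat",1),("the",2)]. A streaming version would replace fetchText with a process source and fold the counts chunk by chunk, which is where conduits earn their keep.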

What's the Point?

There are two:

1. I was asked to build this as part of a job application.
2. I've repeatedly heard that "Haskell-framework-of-your-choice is just a thin wrapper around WAI", and I wanted to know if it was true.

As for #2, that really depends on what you mean by thin. If you don't want to find yourself manually appending headers to an arbitrary HTTP response, I'd recommend starting with Scotty for general HTTP request routing, parameter parsing, and typical response generation.

That said...

Conclusions

According to the job application, something this simple should require only a few hours of development time.

This took me a few days.

The description would've held true had I spun this up in Ruby on Rails or some similar "five-minute blog" web framework; for that purpose alone this would've been massive overkill. However, the fruits from this experiment have been myriad. I've learned that:

Types are absolutely for humans first.

I cannot imagine attempting an operation at this level or lower without types to ensure the program's correctness and guide its architecture. Beyond assisting with construction, the types alone have taught me quite a bit about HTTP.

I've abstracted this concept into a currently unrefined talk, Learning through Libraries.

The HTTP request-response pipeline is relatively simple.

These results really call into question the intentional referential opacity of frameworks like Rails. The functional pipeline of languages like Haskell ensures that the HTTP request and subsequent response can be traced throughout the lifetime of the program without having to consult stale, unverified documentation.

The compromises required when adopting a framework are heavy.

Did you know that the HTTP spec doesn't describe how conflicting query and POST body parameters should be resolved? This means the arbitrary decision is left to each framework, and the programmer has to discover it via ad-hoc experimentation. The same holds for many more arbitrary decisions.

Constructing a service from this lower level, but with the assistance of types, means that the arbitrary decisions are yours and are discernible via the aforementioned referential transparency.
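As an illustration of owning one such decision, here is a minimal sketch of an explicit policy for query versus body parameters. It is not taken from this application; Precedence, Params, and mergeParams are hypothetical names, and parameters are assumed to arrive as already-decoded key/value pairs.

```haskell
import qualified Data.Map.Strict as Map

-- | Which source wins when a key appears in both the query string
-- and the POST body. Hypothetical type, for illustration only.
data Precedence = QueryWins | BodyWins
  deriving (Eq, Show)

-- | Assumption: parameters are already decoded key/value pairs.
type Params = [(String, String)]

-- | Merge the two parameter lists under an explicit policy.
-- Map.union is left-biased, so the preferred source goes on the left.
-- Within a single source, the first occurrence of a repeated key is kept.
mergeParams :: Precedence -> Params -> Params -> Map.Map String String
mergeParams policy query body =
  case policy of
    QueryWins -> Map.union q b
    BodyWins  -> Map.union b q
  where
    q = firstWins query
    b = firstWins body
    -- fromListWith calls its function as (f new old); returning
    -- `old` keeps the first value seen for a key.
    firstWins = Map.fromListWith (\_new old -> old)
```

With a query of [("page","1")] and a body of [("page","2")], mergeParams QueryWins yields "1" for page and mergeParams BodyWins yields "2". The policy is visible in the type signature instead of buried in framework behavior.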

The benefits of Haskell come in the longer term.

I often describe Haskell as "front-loading the programming effort", because the types ensure that any possible error states are described and resolved. The greater payoff comes in the longer term, because once constructed, such systems are demonstrably more robust.

These benefits are not realized when authoring throwaway applications, meaning the programmer must suffer through most of the front-loaded effort without seeing the greater payoff.

I would sans-doubt build another system in this manner.

The ability to have this finer-grained (yet robustly structured because of types) control over the HTTP request/response lifetime would have proven invaluable on any larger applications I've dealt with. And again, I cannot overstate the value of referential transparency for any long-term software system.

For a short-term application, though, this was hell.

Jonplussed / url-text-histogram