Jacob O'Bryant
Building an essay recommender system in 10 days
27 October 2020

I recently finished the MVP for an idea I had last month: Findka Essays, a newsletter that adapts to your preferences with machine learning. I pivoted to this from a much more complicated recommender system app, so I was already familiar with (almost) all the parts needed to build it. I'm extremely happy with the result: the new app is simple, the codebase is clean, and I launched it in under two weeks. In the grand tradition of our people, it is now time for an architecture + toolkit walkthrough. Feel free to skip to whichever sections interest you most. If you read only one section, I suggest Recommendations.

Note: I've abstracted a lot of the code into Biff, a web framework I released several months ago. I'll reference Biff frequently.


Language

The vast majority of the app is written in Clojure (the only exception is the actual recommendation algorithm, which is written in Python). Clojure is a fabulous language that has a strong emphasis on simplicity and stability, which makes codebases easy to maintain and extend in the long run. As a Lisp, it also has an extremely tight feedback loop which is great for rapid development.

There is a trade-off: since Clojure focuses on long-term concerns over short-term concerns, it can take a while to get up to speed (though it's a lot easier if you have a mentor). For example, to do web dev in Clojure, you'll need to learn how to piece together a bunch of individual libraries—there isn't a go-to framework like Rails or Django. (Luminus is probably the best starting point, after you're familiar with the language itself).

In most cases, I'd say the trade-off is well worth it. But for small or experimental apps (like you might build in a startup), speed at the beginning is extra valuable. If you want to move fast right away but you're not already comfortable in the Clojure ecosystem, you might have a bad time. One of my goals for Biff is to help mitigate that.

Front end

OK, now for some actual code. Findka Essays, unlike its predecessor, is a humble multi-page application. No React here. So the front end is pretty simple.

(For brevity, I'll call it "Findka" from here on out).

I use Rum for HTML generation. Here's a quick example:

(ns hello.world
 (:require
   [rum.core :as rum]))

(defn -main []
  (spit "hello.html"
    (rum/render-static-markup
      [:html
       [:body
        [:p {:style {:color "red"}}
         "Hello world!"]]])))

After installing Clojure, you could put that in src/hello/world.clj and then run the program with clj -Sdeps '{:deps {rum/rum {:mvn/version "0.11.5"}}}' -M -m hello.world. That's a great way to get started with Clojure, actually—you can make a whole static site with just functions, data structures, and Rum. That's how I made Findka's landing page. Here's a snippet:

(defn home []
  (base-page {:scripts [[:script {:src "/js/ensure-logged-out.js"}]]}
    [:.bg-dark.text-white
     [:.container.px-3.mx-auto.max-w-screen-md
      [:.nunito-sans-bold.text-2xl.mt-4.mb-8.sm:text-center
       "Great essays delivered to your inbox. " [:br.hidden.md:inline]
       "Chosen specifically for you with machine learning."]
      [:a.btn.btn-green.btn-block.sm:max-w-sm.mx-auto
       {:href "/signup"}
       "Sign up"]
      ...

This project is the first time I've used Tailwind CSS (I used Bootstrap previously). I give it two thumbs up. Tailwind gives you smaller, better building blocks than Bootstrap, and it has responsive variants for every class (hallelujah). Setup was straightforward. After an npm init; npm install tailwindcss --save-dev, you just need a few files:

tailwind.config.js:

module.exports = {
  purge: [
    './src/**/*.clj', // for tree-shaking
  ],
  theme: {
    extend: {
      colors: {
        'dark': '#343a40',
        'hn-orange': '#ff6600',
        ...
      }
    }
  }
}

tailwind.css:

@tailwind base;

@tailwind components;
@responsive {
  .btn-thin {
    @apply text-center py-2 px-4 rounded;
  }
  .btn {
    @apply btn-thin font-bold;
  }
  ...
}

@tailwind utilities;
@responsive {
  .nunito-sans-bold {
    font-family: 'Nunito Sans', sans-serif;
    font-weight: 900;
  }
  ...
}

Build with npx tailwindcss build tailwind.css -o output.css.

The last piece is fonts. I am not a designer, at all. Despite this, I have recently made a startling discovery: using different fonts, instead of just the default, can actually make your site look a lot better. It turns out you can just doom scroll through Google Fonts until you find some you like. I did that for Findka's logo (I tried AI logo generators in the past, but they weren't good).
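
For reference, pulling a Google Font into a Rum page is just a link tag in the head. This is a sketch; the family below is the one the nunito-sans-bold class from earlier assumes:

[:head
 [:link {:rel "stylesheet"
         :href "https://fonts.googleapis.com/css2?family=Nunito+Sans:wght@900&display=swap"}]]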

Authentication

Findka supports sign-in via email link or Google. Biff mostly handles email link auth for you. Just make a form with an email field that POSTs to /api/signup. Biff sends the user an email with a link, they click on it, boom. A minimal sketch of the form (the param name is an assumption; check Biff's signup endpoint for what it expects):
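
[:form {:method "post" :action "/api/signup"}
 ;; "email" is an assumed param name; the reCAPTCHA token discussed
 ;; below gets submitted with this form as well.
 [:input {:name "email" :type "email" :placeholder "Email address"}]
 [:button.btn.btn-green {:type "submit"} "Sign up"]]

You have to provide an email template and a function that actually sends the email. I use Mailgun, so Findka's email function looks like this: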

(def templates
  {:biff.auth/signup
   (fn [{:keys [biff.auth/link]}]
     {:from "Findka Essays <...>"
      :subject "Create your Findka Essays account"
      :html (rum/render-static-markup
              [:div
               [:p "We received a request to create a Findka Essays account using this email address."]
               [:p [:a {:href link :target "_blank"} "Click here to create your account."]]
               [:p "If you did not request this link, you can ignore this email."]])})
   ...})

(defn send** [api-key opts]
  (http/post (str "https://api.mailgun.net/v3/mail.findka.com/messages")
    {:basic-auth ["api" api-key]
     :form-params opts}))

(defn send* [{:keys [mailgun/api-key template data] :as opts}]
  (if (some? template)
    (let [template-fn (get templates template)
          mailgun-opts (template-fn data)]
      (send** api-key mailgun-opts))
    (send** api-key (select-keys opts [:to :subject :text :html]))))

(defn send [{:keys [params template recaptcha/secret-key] :as sys}]
  (if (= template :biff.auth/signup)
    (let [{:keys [success score]}
          (:body
            (http/post "https://www.google.com/recaptcha/api/siteverify"
              {:form-params {:secret secret-key
                             :response (:g-recaptcha-response params)}
               :as :json}))]
      (when (and success (<= 0.5 score))
        (send* sys)))
    (send* sys)))

As shown at the end, I use reCAPTCHA for bot control. It's nice because you don't have to make the user do anything; Google's script simply gives you a score representing how likely the user is to be human.

Biff doesn't have Google sign-in support built in (yet), but it's pretty simple to add. After the front-end bit is taken care of, you just have to add an HTTP endpoint that receives a token, verifies it using Google's library, and sets a session cookie.

That's sort of a funny way to use Google sign-in, I'll admit. You're "supposed" to send the token with every request and verify it each time, letting Google's code handle sessions on the client. However, Biff is already set up for authenticating requests via session cookie, and that's more convenient for multi-page applications anyway.
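
Here's a sketch of what that endpoint can look like. It's not Findka's exact code: client-id and uid-for-email are stand-ins, and the verifier comes from Google's google-api-client Java library.

(ns findka.essays.google-auth
  (:import
    (com.google.api.client.googleapis.auth.oauth2 GoogleIdTokenVerifier$Builder)
    (com.google.api.client.http.javanet NetHttpTransport)
    (com.google.api.client.json.jackson2 JacksonFactory)))

(def client-id "YOUR_GOOGLE_OAUTH_CLIENT_ID")

(defn uid-for-email
  "Stand-in: look up or create the user for this email, returning their UUID."
  [email]
  (java.util.UUID/randomUUID))

(defn token->email
  "Returns the verified email address for a Google ID token, or nil."
  [token]
  (let [verifier (.. (GoogleIdTokenVerifier$Builder.
                       (NetHttpTransport.)
                       (JacksonFactory/getDefaultInstance))
                     (setAudience [client-id])
                     (build))]
    (some-> (.verify verifier token)
            .getPayload
            .getEmail)))

(defn google-signin [{:keys [params session]}]
  (if-some [email (token->email (:id-token params))]
    {:status 302
     :headers/Location "/settings"
     :session (assoc session :uid (uid-for-email email))}
    {:status 403}))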

CRUD

I use Crux for the database. Crux is an immutable document database with datalog queries. Another way to explain it is that Crux works well as a general-purpose database (e.g. a replacement for Postgres), but it fits better with functional programming. You also get flexible data modeling without giving up query power.

(That ignores Crux's raison d'être, bitemporal queries—a cool feature, and no doubt extremely useful for some. My simple applications haven't needed it.)
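
To make "immutable documents + datalog" concrete, here's a minimal round trip (node is a started Crux node; the document and query are throwaway examples):

;; Documents are plain maps; :crux.db/id is the only required key.
(crux/await-tx node
  (crux/submit-tx node
    [[:crux.tx/put
      {:crux.db/id :essay-1
       :title "An example essay"
       :url "https://example.com/essay"}]]))

;; Datalog query over the current snapshot of the database:
(crux/q (crux/db node)
  '{:find [url]
    :where [[e :title "An example essay"]
            [e :url url]]})
;; => #{["https://example.com/essay"]}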

Crux doesn't enforce any schema, but Biff does. You can specify schema using Clojure spec:

(require '[trident.util :as u])

(u/sdefs
  ::event-type #{:submit :email :click :like :dislike}
  ::timestamp  inst?
  ::url        string?
  ::parent     uuid?
  ::user       uuid?
  ::event      (u/only-keys
                 :req-un [::event-type
                          ::timestamp
                          ::user
                          ::url]
                 :opt-un [::parent]))

(def schema
  {:events {:spec [uuid? ::event]}})

This schema defines a "table" for events. The [uuid? ::event] part means that the primary key for an event document should be a UUID, and the rest of the document should conform to the spec given for ::event above. In this case, that means the document must have an event type, timestamp, user ID and URL, and it can optionally have a "parent" key (which is set to the primary key of another event).
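
For example, a document that conforms to that schema might look like this (IDs invented):

{:crux.db/id #uuid "5a7f3d0e-2b1c-4e8a-9f6d-3c2b1a0e9d8c"  ; the uuid? primary key
 :event-type :click
 :timestamp  #inst "2020-10-27T12:00:00Z"
 :user       #uuid "0f8e7d6c-5b4a-4392-8170-6f5e4d3c2b1a"
 :url        "https://example.com/essay"}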

Like a standard multi-page app, Findka has POST endpoints for writing to the database and GET endpoints for reading (though I have no idea if the endpoints adhere to REST or not). Here's one for submitting an essay:

(defn submit-essay [{:keys [biff/node session/uid params] :as sys}]
  (if (nil? uid)
    {:status 302
     :headers/Location "/login/"}
    (do
      (crux/await-tx node
        (biff.crux/submit-tx sys
          {[:events] {:user uid
                      :timestamp :db/current-time
                      :event-type :submit
                      :url (:url params)}}))
      {:status 302
       :headers/Location "/settings"})))

(def routes
  [["/api/submit-essay"
   {:post submit-essay
    :name ::submit-essay
    :middleware [anti-forgery/wrap-anti-forgery]}]
   ...])

And here's a snippet from the /settings page. The call to crux/q performs a datalog query which returns the current user's 10 most recent events, which are displayed in the UI:

(defn settings* [{:keys [biff/db session/uid]}]
  (let [events (map first
                 (crux/q db
                   {:find '[event timestamp]
                    :full-results? true
                    :args [{'user uid}]
                    :order-by '[[timestamp :desc]]
                    :limit 10
                    :where '[[event :event-type]
                             [event :user user]
                             [event :timestamp timestamp]]}))]
    ...
    [:div
     [:.h-5]
     [:.nunito-sans-bold.text-lg "Recent activity"]
     (for [[i {:keys [event-type url timestamp]}] (map-indexed vector events)]
       [:.p-2.leading-snug {:class (when (odd? i) "bg-gray-200")}
        [:.text-xs.text-gray-700 timestamp]
        [:div (case event-type
                :submit  "Submitted: "
                :email   "Received: "
                :click   "Clicked: "
                :like    "Added to favorites: "
                :dislike "Show less like this: ")
         [:a.text-blue-600 {:href url :target "_blank"} url]]])]
    ...))

(defn settings [sys]
  {:body (rum/render-static-markup
           (static/base-page
             {:scripts [[:script {:src "https://apis.google.com/js/platform.js"}]
                        [:script {:src "/js/ensure-logged-in.js"}]
                        [:script {:src "/js/settings.js"}]]}
             (settings* sys)))
   :headers/content-type "text/html"})

(def routes
  [["/settings" {:get settings
                 :name ::settings
                 :middleware [anti-forgery/wrap-anti-forgery]}]
   ...])

Recommendations

And here we are; the whole reason that Findka exists. Findka sends you daily or weekly emails. Each one includes a list of links to essays (submitted by other users). Whenever you click a link, Findka saves it as an event. Over time, Findka learns what kinds of essays you like. Everyone's click data is combined into a model, which Findka uses to pick essays for you.

Most of the work is done by Surprise, a Python library that includes several different collaborative filtering algorithms. (That library is why I'm using Python at all instead of just Clojure—no need to reinvent the wheel). Findka uses the k-NN baseline algorithm. It looks for essays that are often liked by people who like the same essays as you. If there's not much data yet (as is the case right now, since Findka launched recently), it defaults to recommending essays that are the most liked in general.
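
For flavor, the heart of the Python script looks something like this. It's a sketch rather than Findka's exact code: the CSV name, its layout, and the rating scale are assumptions.

import pandas as pd
from surprise import Dataset, KNNBaseline, Reader

# Events exported from Clojure as (user, url, rating) rows.
df = pd.read_csv("events.csv", names=["user", "url", "rating"])
data = Dataset.load_from_df(df, Reader(rating_scale=(0, 1)))
trainset = data.build_full_trainset()

# Item-item k-NN with baseline ratings.
algo = KNNBaseline(sim_options={"user_based": False})
algo.fit(trainset)

# Predicted affinity of one user for one essay:
algo.predict("some-user-id", "https://example.com/essay").est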

I've added two additional layers. I call the first one "popularity smoothing." I sort all the essays by the number of times they've been recommended in the past, and I partition them into 10 bins. Whenever I pick an essay to send someone, I first choose a bin with uniform probability, and then I use k-NN to select an essay from that bin. Thus, the top 10% most popular essays will take up only 10% of the recommendations (if left unchecked, popular items can end up taking much more than their fair share of recommendations).
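
A sketch of that smoothing step (pick_from_bin stands in for the k-NN selection within a bin):

import math
import random

def pick_essay(essays, times_recommended, pick_from_bin):
    # Sort by how often each essay has been recommended before.
    ranked = sorted(essays, key=lambda e: times_recommended[e])
    # Partition into 10 bins, from least- to most-recommended.
    size = max(1, math.ceil(len(ranked) / 10))
    bins = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    # Choose a bin uniformly, then let k-NN choose within that bin.
    return pick_from_bin(random.choice(bins))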

The other layer is for exploration vs. exploitation, which is the need to balance recommending relevant essays with recommending more diverse essays, in order to better learn the user's preferences. I use an "epsilon-greedy" strategy: for a fixed percentage of the time, I recommend a random essay instead of using k-NN. At the moment, that percentage is quite high: 50%. As Findka grows and accumulates more data, I'll likely turn the percentage down.
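
And the epsilon-greedy layer on top (knn_pick again a stand-in):

import random

EPSILON = 0.5  # exploration rate; will come down as Findka grows

def recommend_one(user, essays, knn_pick):
    if random.random() < EPSILON:
        return random.choice(essays)   # explore: uniform random essay
    return knn_pick(user, essays)      # exploit: k-NN's best guess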

Notably, I am not currently using content-based filtering. Often, article recommenders analyze the articles' text and use that to figure out which ones are similar. This makes a lot of sense for news, where you're always recommending new articles that may not have much user interaction data yet. However, Findka is intended for articles that stay relevant over time. We can afford to let articles build up clicks gradually before we recommend them to lots of users. Content-based filtering would likely still be helpful, so I'll experiment with it at some point.

To run the Python code, I simply call it as a subprocess from Clojure. First, I generate a CSV with all the user events. The Python code ingests the CSV and then spits out a different CSV that has a list of URLs for each user. From Clojure, I then send out emails with those URLs.

(require '[trident.util :as u])

(defn get-user->recs [db]
  (write-event-csv db)
  (u/sh "python3" (.getPath (io/resource "python/recommend.py")))
  (read-recommendation-csv))

I read the Python file from the JVM classpath, which makes deployment easy. I just include the file in my app's resources. For better scaling, I'll eventually move the Python code to a dedicated server.

Devops

This is my favorite section. Linux system administration always warms my heart, since messing around with Linux as a teenager played a huge part in my early computing education. I still remember the feeling of wonder that came after I learned shutdown -h now (at least, after I figured out that typing your password in the terminal doesn't make little asterisks show up). There was something about being able to interact with the system so directly that caught my attention.

Anyway. Findka runs on a DigitalOcean droplet. I use Packer and Terraform for provisioning, and I deploy code from Findka's git repository using tools.deps' git dependency feature.

For all my projects, I like to add a task bash script with the following form:

#!/bin/bash
set -e

foo () {
  ...
}

bar () {
  ...
}

"$@"

This way, you add build tasks by simply defining a new function. I put alias t='./task' in my .bashrc, so I can run a task with e.g. t foo.

Here is Findka's task file. I'll go through each of the tasks:

#!/bin/bash
set -e

init () {
  terraform init
}

build-image () {
  packer build -var "do_key=$DIGITALOCEAN_KEY" webserver.json
  curl -X GET -H "Authorization: Bearer $DIGITALOCEAN_KEY" \
       "https://api.digitalocean.com/v2/images?private=true" | jq
}

tf () {
  terraform $1 \
    -var "do_token=${DIGITALOCEAN_KEY}" \
    -var "github_deploy_key=$GITHUB_DEPLOY_KEY"
}

dev () {
  BIFF_ENV=dev clj -m biff.core
}

css () {
  npx tailwindcss build tailwind.css -o resources/www/findka-essays/css/custom.css
}

css-prod () {
  NODE_ENV=production css
}

deploy () {
  git push origin master
  scp config.edn biff@essays.findka.com:config.edn
  scp blank-prod-deps.edn biff@essays.findka.com:deps.edn
  ssh biff@essays.findka.com systemctl restart biff
  ssh biff@essays.findka.com journalctl -u biff -f
}

prod-connect () {
  ssh -NL 7800:localhost:7888 biff@essays.findka.com
}

"$@"

init: not much to say. You have to run this once.

build-image: this sets up an Ubuntu image for the DigitalOcean droplet. Here's the config file, webserver.json:

{
  "builders": [
    {
      "monitoring": true,
      "type": "digitalocean",
      "size": "s-1vcpu-1gb",
      "region": "nyc1",
      "ssh_username": "root",
      "image": "ubuntu-20-04-x64",
      "api_token": "{{user `do_key`}}",
      "private_networking": true,
      "snapshot_name": "findka-essays-webserver"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "./provision.sh"
    }
  ],
  "variables": {
    "do_key": ""
  },
  "sensitive-variables": [
    "do_key"
  ]
}

provision.sh is a ~100 line script (condensed sketch after this list) that:

  1. Installs packages (Clojure, Nginx, Python Surprise, Certbot)
  2. Sets up a non-root user
  3. Creates a Systemd service which starts the app on boot
  4. Configures Nginx, Letsencrypt, and a firewall
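
A heavily condensed sketch of what such a script can look like—package names, the Clojure installer version, and the certbot email are illustrative, not Findka's exact script:

#!/bin/bash
set -e

# 1. Packages
apt-get update
apt-get install -y openjdk-11-jre-headless rlwrap nginx certbot \
  python3-certbot-nginx python3-pip
curl -LO https://download.clojure.org/install/linux-install-1.10.1.697.sh
bash linux-install-1.10.1.697.sh
pip3 install scikit-surprise

# 2. Non-root user
useradd -m biff

# 3. Systemd service that starts the app on boot
cat > /etc/systemd/system/biff.service << EOF
[Unit]
Description=Findka Essays

[Service]
User=biff
WorkingDirectory=/home/biff
ExecStart=/usr/local/bin/clojure -m biff.core

[Install]
WantedBy=multi-user.target
EOF
systemctl enable biff

# 4. Nginx, Letsencrypt, firewall (nginx site config omitted)
certbot --nginx -n --agree-tos -m admin@example.com -d essays.findka.com
ufw allow OpenSSH
ufw allow 'Nginx Full'
ufw --force enable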

After Packer finishes, the build-image task prints out the ID for the newly created image, which Terraform needs.

tf: this task deploys infrastructure according to a webserver.tf file. Here's a snippet:

...

resource "digitalocean_droplet" "webserver" {
    image = "<image ID from the build-image task>"
    name = "findka-essays-webserver"
    region = "nyc1"
    size = "s-1vcpu-1gb"
    private_networking = true
    ssh_keys = [
      data.digitalocean_ssh_key.Jacob.id
    ]
    connection {
      host = self.ipv4_address
      user = "root"
      type = "ssh"
      timeout = "2m"
    }
    provisioner "file" {
      source = "config.edn"
      destination = "/home/biff/config.edn"
    }
    provisioner "file" {
      content = var.github_deploy_key
      destination = "/home/biff/.ssh/id_rsa"
    }
}

...

Not shown is some configuration for DNS records and a managed Postgres database, which Crux uses for persistence.

dev, css: these are used for local development. After some code is finished, I run css-prod and then commit. If I were using CI/CD, I'd have that build the CSS artifact instead of committing it to git.

deploy: the funnest task. It pushes to git, copies over a gitignored config file (which includes secrets, like API keys), deploys the new code, and then watches the logs.

Clojure has a feature where you can depend on a git repository: when you start your app, Clojure clones the repo and adds its code to the classpath. Normally you pin a specific commit hash, but you can leave it out and fetch the hash of the latest commit with a command at startup. The scp blank-prod-deps.edn ... line copies over a deps.edn set up to do just that:

{:deps
 {github-jacobobryant/findka-essays
  {:git/url "git@github.com:jacobobryant/findka.git",
   :tag "HEAD",
   :deps/root "essays"}}}

After running ssh biff@essays.findka.com systemctl restart biff, the Systemd service will fetch the latest commit hash (which was just pushed to master), add it to the dependency file, and start the application. (This works with private repos, by the way: that's why Terraform copies my Github deploy key to the server). This causes roughly a minute of downtime, but that's fine for Findka's scale.

The last task, prod-connect, lets you run arbitrary Clojure code over the wire. It's kind of like SSHing in to your server and running psql so you can query your production database, but it's more powerful: you can run Clojure code in the running production app's JVM process. Thanks to Clojure's late binding, you can even redefine functions—like an HTTP handler—and have the new definitions take effect immediately. All from the comfort of your favorite text editor (Vim, right?).

When I first launched Findka, I used this feature to send out emails manually for a couple days, before adding it to a cron job. I have a Clojure namespace that looks like this:

(ns findka.essays.admin
  (:require
    [trident.util :as u]
    ...))

(defn get-email-data [db]
  ...)

(comment
  (u/pprint
    (let [{:keys [biff/node] :as sys} @biff.core/system
          db (crux/db node)]
      (get-email-data db))))

If I put my cursor on u/pprint and then type cpp while prod-connect is running, it will execute that code on the server. So I can print out the data that will be passed to my email-sending function without actually sending the emails. If something looks wrong in the data, I can modify the get-email-data function and re-run it without having to run the deploy task. When everything's good, I run a different function that sends the emails.

In general, this kind of production REPL access is great for ad-hoc admin tasks like this one, and for debugging live issues without a full redeploy.

Analytics

The last piece of the puzzle. Since Findka already saves events whenever someone submits an article or clicks a link in an email, I use those to calculate daily active users and other stats. I have a daily cron job (not an actual cron job, just a recurring task scheduled by chime) that queries for all the events, prints out a plain text table with stats, and emails it to me.

(defn print-usage-report [{:keys [db now]}]
  (let [events (map first
                 (crux/q db
                   {:find '[doc]
                    :full-results? true
                    :where '[[doc :event-type event-type]
                             [(== event-type #{:like :dislike :click :submit})]
                             [doc :user]]}))
        ...]
    (u/print-table
      [[:t "Day     "]
       [:signups "Conversions"]
       [:returning "Returning users"]]
      (for [t ts
            :let [users (get day->users t)
                  signups (get day->signups t)
                  returning (count (filter #(not= t (user->first-day %)) users))]]
        {:t (u/format-date t "dd MMM")
         :signups signups
         :returning returning}))
    ...))
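
The scheduling side is tiny. Here's a sketch using the chime library with clj-time, where send-usage-report! is a hypothetical wrapper that runs print-usage-report and emails the output:

(require '[chime :refer [chime-at]]
         '[clj-time.core :as t]
         '[clj-time.periodic :refer [periodic-seq]])

(defn send-usage-report! []
  ;; Hypothetical: build the table with print-usage-report and email it.
  )

;; Fire once a day, starting now. chime-at returns a zero-arg fn
;; that cancels the schedule.
(chime-at (periodic-seq (t/now) (t/days 1))
          (fn [_] (send-usage-report!)))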

That doesn't cover page hits, so I also use Simple Analytics. However, it wouldn't be hard to add some custom JS to each page that sends the current URL and the referrer to an endpoint, which saves them as another event. I'll probably do that at some point. Then I could add page hits to my daily analytics emails, including landing page conversion rate.


Well, diligent reader, I congratulate you for sticking with me until the end. Give Findka a try and let me know what you think. Also let me know if you'd like to use this development stack for your own projects. I'm planning on doing a lot more work on Biff soon. Biff already includes parts of this stack, but not all.
