29 June 2021
I posted The Sample on Hacker News recently and there was a good discussion (about 30 comments, half from me). I've written up the highlights in an interview format below. The questions have been highly editorialized and summarized, and I've made slight updates to some of my answers. You can read the original discussion if you like.
This is the latest of several recommender systems that I've attempted to grow over the past couple years. It launched in February and there are about 800 subscribers [now 1,100—thanks HN!]. The algorithm uses a collaborative filtering model ("people who liked X also liked Y"), and to help with cold-start I augment the training data with content-based filtering: I use keyword extraction (tf-idf) and a pre-trained language model (fastText) to cluster the newsletters, then for each cluster I generate k "fake users" who like each newsletter in the cluster. This way, the model will gradually switch from content-based filtering to collaborative filtering as it collects user ratings.
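As a rough illustration of the "fake users" idea (not The Sample's actual code), the sketch below assumes the clustering step has already produced a mapping from newsletter to cluster label. The function name `add_fake_users`, the rating value `1` for "liked", and the example newsletter names are all hypothetical.

```python
from collections import defaultdict


def add_fake_users(ratings, clusters, k=3):
    """Return the real ratings plus k synthetic users per cluster.

    ratings:  list of (user_id, newsletter_id, rating) tuples from real users.
    clusters: dict mapping newsletter_id -> cluster label (assumed to come
              from a separate content-based clustering step).
    """
    # Group newsletters by their cluster label.
    members = defaultdict(list)
    for newsletter_id, label in clusters.items():
        members[label].append(newsletter_id)

    augmented = list(ratings)
    for label, newsletter_ids in sorted(members.items()):
        for i in range(k):
            fake_id = f"fake-{label}-{i}"  # hypothetical naming scheme
            # Each fake user "likes" every newsletter in the cluster.
            for newsletter_id in sorted(newsletter_ids):
                augmented.append((fake_id, newsletter_id, 1))
    return augmented


# Hypothetical example data.
clusters = {"tech-weekly": "tech", "ml-digest": "tech", "garden-notes": "home"}
real_ratings = [("alice", "tech-weekly", 1)]
training_data = add_fake_users(real_ratings, clusters, k=2)
```

Because the number of fake ratings is fixed while real ratings keep accumulating, the synthetic users carry less and less weight in the training data over time, which is one way to picture the gradual shift toward collaborative filtering.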
Some of the newsletters I found on my own, but most are submitted by users (there's a "what other newsletters do you subscribe to" question after you sign up). I set up an inbound-only mail server, and I generate a unique address for each newsletter, which I use to sign up manually. I approve each issue that comes in so that we don't forward welcome emails, promotions, etc. (It only takes 5-10 minutes a day.) Before forwarding, I also scrub out any links with certain keywords like "unsubscribe", "manage your preferences" and so on. It's not a perfect process, but it's good enough for now.
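The link-scrubbing step could look something like the sketch below. It is an assumption-heavy illustration, not the real implementation: it treats the email body as HTML, uses a simple regular expression rather than a proper HTML parser, and the names `BLOCKED_KEYWORDS` and `scrub_links` are made up for the example.

```python
import re

# The first two keywords come from the text above; a real list would be longer.
BLOCKED_KEYWORDS = ("unsubscribe", "manage your preferences")

# Matches a whole <a ...>...</a> element. A regex is fragile for real-world
# HTML; it is used here only to keep the sketch short.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)


def scrub_links(html, keywords=BLOCKED_KEYWORDS):
    """Remove anchors whose URL or link text contains a blocked keyword."""

    def replace(match):
        anchor = match.group(0).lower()
        if any(keyword in anchor for keyword in keywords):
            return ""  # drop the whole link
        return match.group(0)  # keep everything else untouched

    return ANCHOR_RE.sub(replace, html)


email = (
    '<p>Hi! <a href="https://example.com/post">Read more</a> '
    '<a href="https://example.com/unsubscribe?u=1">Unsubscribe</a></p>'
)
cleaned = scrub_links(email)
```

Matching against the whole anchor catches a keyword in either the `href` or the visible text, at the cost of occasionally removing a legitimate link that happens to mention one of the keywords.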
Long-term, I want to build a general-purpose recommender system. I'm starting with newsletters because I think it'll be the easiest way to grow fast initially (which seems to have been validated so far). The short explanation is that I've designed The Sample to be extremely effective at cross-promotion. (If you have a newsletter, submit it and I'll send you a referral link.)
Are you doing anything to prevent filter bubbles?
Yes, the algorithm accounts for the exploration-vs-exploitation problem. Right now it uses a simple epsilon-greedy strategy: 20% of the newsletters you receive are picked completely at random. I also use a technique I call "popularity smoothing," which limits the number of times that popular newsletters can be forwarded.
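A minimal sketch of how these two ideas might fit together is shown below. The 20% exploration rate comes from the answer above; everything else is an assumption, including the function name `pick_newsletter`, the hard cap `max_forwards` as a stand-in for "popularity smoothing" (the exact mechanism isn't described here), and the idea that the model exposes a predicted score per newsletter.

```python
import random


def pick_newsletter(scores, forward_counts, epsilon=0.2, max_forwards=100, rng=random):
    """Pick one newsletter to forward to a user, or None if nothing is eligible.

    scores:         dict newsletter_id -> predicted rating for this user.
    forward_counts: dict newsletter_id -> times forwarded in the current period.
    max_forwards:   hypothetical cap standing in for "popularity smoothing".
    """
    # Skip newsletters that have already been forwarded too many times.
    eligible = [n for n in scores if forward_counts.get(n, 0) < max_forwards]
    if not eligible:
        return None

    # Explore: with probability epsilon, pick uniformly at random.
    if rng.random() < epsilon:
        return rng.choice(eligible)

    # Exploit: otherwise pick the highest predicted score.
    return max(eligible, key=lambda n: scores[n])


# Hypothetical example: "a" scores highest but has hit the cap.
choice = pick_newsletter({"a": 0.9, "b": 0.5}, {"a": 100}, epsilon=0.0)
```

Applying the cap before the explore/exploit split means it limits both kinds of picks, so a popular newsletter can't crowd out the rest through either path.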
I think the concerns about machine learning causing echo chambers/other problems, while not completely unfounded, are overblown (perhaps due to overall anti-big-tech sentiment). I think human behavior plus ease of sharing online is a more important factor. I'm optimistic that ML filtering can actually help people get a much larger variety of information, which is one of my goals for this.
Instead of having some recommendations be completely random, could you be more strategic about it? For example, what if you recommended newsletters that had opposite political views from the ones a user has already rated positively?
Something like that could be an option. However, every time I've thought about "hey, what if we made the random recommendations better by...", my conclusion has been "oh, but then we might never recommend [items with some set of properties that should get recommended], and the algorithm won't be able to correct itself." Any time you try to optimize in a certain direction, you'll always introduce at least a little bias/false negatives, and randomness at least gives a chance of breaking out of that. So this could be a good idea, but I wouldn't want to have it completely replace random recommendations.
Why recommend newsletters instead of individual articles?
There are a few reasons. The biggest one is distribution. I came up with the idea for The Sample because I was working on an essay recommender system and I was trying to figure out how to incentivise people to share it. I've built a referral system into The Sample, and whenever someone submits a newsletter, I send them a referral link. If they share the link, their newsletter gets forwarded more often. So far it's been promising; last I checked, about 15% of our subscribers came from referral links. As The Sample gets larger, the referral system will become more compelling (network effects). I've used a fair amount of paid advertising to get things going, but after some threshold, I think it will grow extremely fast with just the referral system.
The next reason is implementation-related. I prefer to use collaborative filtering as much as possible, falling back to content-based filtering only when there isn't enough rating data for collaborative filtering. But for collaborative filtering, you need long-lived items. If you're recommending news articles, by the time you collect much rating/interaction data, the article will be stale and you won't want to keep recommending it anyway. Recommending newsletters is IMO an excellent way to combine ML with human curation: humans figure out what news is good, and ML helps readers find curators who are good/relevant to them. (To make a transportation analogy: humans are great at last-mile curation.)
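To make the dependence on accumulated rating data concrete, here is a toy version of "people who liked X also liked Y" based on simple co-occurrence counts. It is only an illustration of the general idea and not the model The Sample uses, which isn't specified here; the function names and example data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def co_liked(likes):
    """Count how often each pair of newsletters is liked by the same user.

    likes: dict user_id -> set of newsletter_ids the user rated positively.
    """
    counts = defaultdict(Counter)
    for liked in likes.values():
        # Every ordered pair (x, y) liked by the same user is one vote for
        # "people who liked x also liked y".
        for x, y in permutations(sorted(liked), 2):
            counts[x][y] += 1
    return counts


def also_liked(counts, newsletter_id, n=3):
    """Return up to n newsletters most often liked alongside newsletter_id."""
    return [item for item, _ in counts[newsletter_id].most_common(n)]


# Hypothetical example data.
likes = {
    "alice": {"tech-weekly", "ml-digest"},
    "bob": {"tech-weekly", "ml-digest"},
    "carol": {"tech-weekly", "garden-notes"},
}
counts = co_liked(likes)
suggestions = also_liked(counts, "tech-weekly")
```

The counts only become meaningful once many users have rated the same items, which is why an item that goes stale within a day or two (like a news article) is a poor fit and a long-running newsletter is a good one.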
The categories on the landing page seem too broad. And does it really take 21 days?
In both cases, it's more about marketing/helping people understand what The Sample even does. After you sign up, there's a text field where you can type arbitrary topics to indicate what you're interested in. And topic modeling/content-based filtering is only used as a starting point anyway; as you rate the newsletters you receive, your recommendations will start to be dominated more by collaborative filtering ("people who liked X also liked Y", without thought for what X and Y even are). I added the checkboxes to help people understand immediately that this is an individually personalized newsletter; they're not going to get the same thing as everyone else who signs up.
Similarly, "give us 21 days to hone in on your preferences" is purely meant to explain that as you continue to use The Sample, it will learn from your behaviour and adapt to your preferences. I picked 21 on a whim—someone suggested 30 days, but I thought that was too long. Maybe I should say 7 or 14?
Another somewhat recent change is that new users' emails include a message that says, for example, "You've rated 2 newsletters so far. Rate at least 5 more to help us learn your preferences." There's nothing special about 7; we just needed a way to get more people to rate the newsletters we send them. And this did the trick.
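The prompt logic is simple enough to sketch in a few lines. The target of 7 total ratings is inferred from the example above (2 rated plus 5 more); the function name `rating_prompt` and the behaviour once the target is reached are assumptions.

```python
def rating_prompt(rated_count, target=7):
    """Return the nudge message for a new user, or None once the target is met."""
    if rated_count >= target:
        return None  # assumed: stop showing the message after enough ratings
    remaining = target - rated_count
    noun = "newsletter" if rated_count == 1 else "newsletters"
    return (
        f"You've rated {rated_count} {noun} so far. "
        f"Rate at least {remaining} more to help us learn your preferences."
    )


message = rating_prompt(2)
```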