The Algorithm that Makes Data Segmenting Simple
This post is part of our Summer Intern Blog Series! Each of our most recent class of interns wrote a blog post on their biggest accomplishments and lessons of the summer. This installment comes from Tyrone, a rising senior studying Computer Science at Boston University. He enjoys rock climbing, listening to jazz, and getting overly excited when talking about good food options in Boston.
In college, many computer science projects are small to medium sized affairs, designed to be learning exercises that practice a specific set of concepts or test knowledge retention. Rarely will a well-polished project blossom into something larger. While my previous internships introduced me to the work environment of large-scale collaborative research effort, my summer at AppNexus was my first foray into working at a company that maintains large scale applications impacting millions of people around the globe.
AppNexus transacts billions of impressions daily. When a user clicks on a website, they can be associated with specific data segments that contain information about, among other things, geographic location and demographics. Advertisers seeking to glean insights from segments would naturally be interested in the number of users per segment. Another useful query would be to find the set of users belonging to a specific set of segments, with the assumption that users in similar segments likely have similar product tastes and ad preferences.
Segmenting: Not as easy as it sounds
It’s not difficult to think up simple algorithms to do unique set operations on small data sets. The problem is these algorithms don’t scale well. Counting the number of distinct elements in a set requires keeping track of each unique element being iterated over. For a platform dealing with billions of users a day, this solution is infeasible. Luckily, not knowing the exact number of users becomes less of an issue if we can accept some degree of error while still producing a reasonably accurate estimate. Over the summer, my project was to generate that estimate, creating a ‘sketch’ of the data stored in a data structure aptly titled a data sketch.
The main advantage of sketches is their sublinear space complexity. Instead of storing all the unique items from an incoming data stream, the sketch uses a hashing algorithm to produce an estimate. At the cost of some accuracy, we get reduced memory usage and significant performance speedups. As a bonus, there exist data sketch libraries implemented in Go, the language we would be using to build our application.
We named the project Van (Go)gh after the artist himself, so his artistic talent in sketching would bleed over to coding our Go data sketches. Accordingly, my first order of business was to get a handle on the fundamentals of Go.
Next, we ironed out Van Gogh’s architecture. We settled on running two different Go server types dubbed collectors and aggregators. For a given incoming ad request, a collector will extract a list of segments belonging to the user who sent the request and update a data sketch tracking the unique segment. The aggregator periodically pulls down and merges sketches coming from collectors to store a time series of segment counts, which is exposed to clients through a web API.
Van Gogh in action
The API lets you query for specific segments as a URL parameter, returning the unique count number. To show something more exciting than a blank white screen with a number on it, I also added a web visualization that compares time series data across multiple segments.
One major design consideration I wrestled with was the need for Van Gogh to have some level of concurrency. The collectors needed to extract segment data from hundreds of thousands of incoming ad requests. All of this would be written to the collector’s global sketch and subsequently read across a network by the aggregator. Here we run into the classic readers-writers problem from concurrency. Luckily, Go provides built-in primitives to handle concurrent resource access. We implemented a simple data structure to wrap all sketch read/write operations in Go’s version of locks, preventing other threads from accessing the sketch at the same time.
This summer has been a wild ride, and I’m grateful to the community that made me feel welcome in a chaotic, unfamiliar city. As I worked through the difficulties involved in building my project, I was always able to turn to a friendly team member or slack a resident expert for advice. At the same time, having a multi-faceted, self-contained project allowed me to continually challenge myself and foster the ability to work independently. I hope the experience I’ve gained at AppNexus will aid me far into the future.