AppNexus Common Framework: It’s Out! (Also How to Print Integers Faster)
In the words of a friend and former colleague:
After trying to package some internals for more than two years, we recently got the OK to open source the AppNexus Common Framework under the Apache License 2.0, despite its incomplete state.
ACF contains the base data structure and runtime library code we use to build production services, in C. Some of it is correctly packaged, most of it just has the raw files from our internal repository. Ironically, after settling on the project’s name, we decided not to publish the most “framework-y” bits of code: it’s not clear that anyone else would benefit. The data structures are in C, and tend to be read-optimised, with perhaps some support for non-blocking single-writer/multi-reader concurrency. There’s also some non-blocking algorithms to support the data structures, and some basic HTTP server code that we found useful to run CPU-intensive or mixed CPU/network-intensive services.
Publishing our internal code took a long time because we were trying to open a project that didn’t exist yet, despite being composed of code that we use every day. AppNexus doesn’t sell code or binaries. Like many other companies, AppNexus sells services backed by in-house code. Our internal code base is full of informal libraries (I would be unable to make sense of the code base if it wasn’t organised that way), but enforcing a clean separation between pseudo-libraries is a lot of extra work for questionable value.
These fuzzy demarcations are made worse by the way we imported some ideas directly from Operating Systems literature, in order to support efficient concurrent operations. That had a snowball effect: everything, even basic data structures, ends up indirectly depending on framework-y code that’s specialised for our use case. The usual initial offenders are the safe memory reclamation module, and the tracking memory allocator (with a bump pointer mode); both go deep in internals that probably don’t make sense outside of AppNexus.
Back in 2015, we looked at our internal support code (i.e., code that doesn’t directly run the business) and decided we should share it. We were — and still are — sure that other people face similar challenges, and exchanging ideas, if not directly trading code, can only be good for us and for coders at large. We tried to untangle the internal “Common” support library from the rest of the code base, and to decouple it from the more opinionated parts of our code, while keeping integration around (we need it) but purely opt-in.
That was hard. Aiming for a separate shared object and a Debian package made it even harder than it had to be. The strong separation between packaged ACF code and the rest of repo added a lot of friction, and the majority of the support code stayed in-tree.
Maybe we made a mistake when we tried to “librarify” our internals. We want a library of reusable code; it doesn’t mean we need a literal shared object. I’m reminded of the two definitions of portable code: code sprinkled with platform conditionals, or code that can be made to run on a different platform with minimal effort. Most of the time, I’d rather have the latter. Especially when code mostly runs on a single platform, or is integrated in few programs, I’d rather reduce overhead for the common case, while making reuse possibleand easy enough that others can benefit.
And that’s how we got the ACF effort out of the door: we accepted that the result would not be as polished as our favourite open source libraries, and that most of the code wouldn’t even be packaged or fully disentangled from internals. Far from an ideal state, but it’s closer to our goals than keeping the project private and on the backburner. We got it out by “feature” boxing the amount of work — paring it down to figuring out what would never be useful to others, and tracking down licenses and provenance — before pushing the partial result out to a public repository. Unsurprisingly, once that was done, we completed more work on ACF in a few days than we have in the past year.
Now that ACF is out, we still have to figure out the best way to help others co-opt our code, to synchronise the public repository with our internal repository, and, in my dreams, accept patches for the public repo and have them also work for the internal one. In the end, what’s important is that the code is out there with a clear license, and that someone with similar problems can easily borrow our ideas, if not our code.
If anyone else is in the same situation, please give yourself the permission to open source something that’s not yet fully packaged. As frustrating as such projects can be, it has to be better than keeping it closed.
How to print integers faster
The integer to string conversion file (
an_itoa) implemented a few cute ideas to solve a depressingly common problem. I like it so much that it's the first file I cleaned up and packaged after making the repository public.
We wrote that code in July 2014. Back then, we had an application with a moderate deployment (a couple racks on three continents) that was approaching full capacity… while more machines were in the pipeline, a quick
perf run showed it was spending a lot of time converting strings to integers and back. We already had a fast-ish string to integer function. Converting machine integers back to string however, is a bit more work, and took up around 20% of total CPU time.
Human readable formats resulting in wasted CPU cycles to print integers is a common problem, and we quickly found a few promising approaches or libraries. Our baseline was the radix-10 code in stringencoders. This post about Lwan suggested using radix-10, but generating strings backward instead of reversing like the
stringencoders library. Facebook apparently hit a similar problem in 2013, which lead to this solution by Andrei Alexandrescu. The Facebook code is a combination of two key ideas: radix-100 encoding, and determining the size of the string ahead of time with galloping search to generate the result backward, directly where it should go.
Radix-100 made sense, although I wasn’t a fan of the 200-byte lookup table. I was also dubious of the galloping search; it’s a lot of branches, and not necessarily easy to predict. The kind of
memmove we need to fixup after conversion is small and easy to specialise on x86, so we might not need to know the number of digits at all.
I then looked at the microbenchmarks in Andrei’s post, and they made it look like the code was either tested on integers with a fixed number of digits (e.g., only 4-digit integers), or randomly picked with uniform probability over a large range.
If the number of digits is fixed, the branchiness of galloping search isn’t an issue. When sampling uniformly… it’s also not an issue: most integers are large! If I pick an integer at random in
[0, 1e6), 90% of the integers have 6 digits, 99% 5 or 6, etc.
Sometimes, uniform selection is representative of the real workload (e.g., random uids or sequential object ids). Often, not so much. In general, small numbers are more common; e.g., small counts can be expected to roughly follow an exponential.
I was also worried about data cache footprint with the larger lookup table for radix-100 encoding, but then realised we were converting integers in tight loops, so the lookup table should usually be hot. That also meant we could afford a lot of instruction bytes; a multi-KB
atoi function wouldn't be acceptable, but a couple hundred bytes was fine.
Given these known solutions, John and I started doodling for a bit. Clearly, the radix-100 encoding was a good idea. We now had to know if we could do better.
Our first attempt was to find the number of decimal digits more quickly than with the galloping search. It turns out that approximating
log_10 is hard, and we gave up ;)
We then realised we didn’t need to know the number of decimal digits. If we generated the string in registers, we could find the length after the fact, slide characters with bitwise shifts, and directly write to memory.
I was still worried about the lookup table: the random accesses in the 200 byte table for radix-100 encoding could hurt when converting short arrays of small integers. I was more comfortable with some form of arithmetic that would trade best-case speed for consistent, if slightly sub-optimal, performance. As it turns out, it’s easy to convert values between 0 and 100 to unpacked BCD with a reciprocal multiplication by 1/10 and some in-register bit twiddling. Once we have a string of BCD bytes, we can add
'0' vertically to convert to ASCII characters.
The radix-100 approach is nice because it chops up the input two digits at a time; the makespan for a given integer is roughly half as long, since modern CPUs have plenty of execution units for the body.
The dependency graph for radix-10 encoding of
12345678 looks like the following, with 7 serial steps.
Going for radix-100 halves the number of steps, to 4; the steps are still serial, except for the conversion of integers in
[0, 100) to strings.
Could we expose even more ILP than the radix-100 loop?
The trick is to divide and conquer: divide by 10000 (
1e4) before splitting each group of four digits with a radix-100 conversion.
True recursive encoding gives us fewer steps, and 2 of the 3 steps can execute in parallel. However, that might not always be worth the trouble for small integers, and we know that small numbers are common. So, even if we have a good divide-and-conquer approach for larger integers, we must also implement a fast path for small integers.
The fast path for small integers (or the most significant limb of larger integers) converts a 2 or 4 digit integer to unpacked BCD, finds the number of leading zeros, converts the BCD to ASCII by adding
0x30) to each byte, and shifts out any leading zero; we assume that trailing noise is acceptable, and it's all
NUL bytes anyway.
For 32-bit integers
an_uitoa) looks like:
if number < 100: execute specialised 2-digit function
if number < 10000: execute specialised 4-digit function partition number with first 4 digits, next 4 digits, and remainder. convert first 2 groups of 4 digits to string. if the number is < 1e8: # remainder is 0! shift out leading zeros, print string. else: print remainder # at most 100, since 2^32 < 1e10 print strings for the first 2 groups of 4 digits.
The 64 bit version,
an_ultoa) is more of the same, with differences when the input number exceeds
Is it actually faster?
I had already concluded that cache footprint was mostly not an issue, but it’s still important to make sure we don’t get anything too big.
an_itoa: 400 bytes.
an_ltoa: 880 bytes
fb_itoa: 426 bytes + 200 byte LUT
fb_constant_itoa(without the galloping search): 172 bytes + 200 byte LUT
lwan_itoa(radix-10, backward generation): 60 bytes.
modp_uitoa10: 91 bytes.
The galloping search in Facebook’s converter takes a lot of space (there’s a ton of conditional branches, and large numbers must be encoded somewhere). Even without the lookup table,
an_itoa is smaller than
an_ltoa (which adds code for > 32 bit integers) is only 254 bytes larger than
fb_itoa. Facebook's galloping search attempts to make small integers go faster by checking for them first; if we convert small numbers, we don't expect to use all ~250 bytes in the galloping search. That's also true for
an_ltoa: the code is setup such that larger numbers jump forward over specialised subroutines for small integers. Small integers thus fall through to only execute code at the beginning of the functions. 400 or 800 bytes are sizable footprints compared to the 60 or 90 bytes of the radix-10 functions, but acceptable when called in tight loops.
Now that we feel like the code and data sizes are reasonable (something that microbenchmarks rarely highlight), we can look at speed.
I first ran the conversion with random integers in each digit count class from 1 digit (i.e., numbers in
[0, 10)) to 19 (numbers in
[1e8, 1e9)). The I$ was hot, but the routines were not warmed on a set of numbers in the same class (more realistic that way).
We have data for:
an_itoa, our 32 bit conversion routine;
an_ltoa, our 64 bit conversion routine;
fb_constant_itoa, Facebook's code, with the galloping search stubbed out;
fb_itoa, Facebook's radix-100 code;
itoa, GNU libc conversion (via sprintf);
lw_itoa, Lwan's backward radix-10 converter;
modp, stringencoder's radix-10 /
fb_constant_itoa to serve as a lower bound on the radix-100 approach: the conversion loop stops as soon as it hits 0 (same as
fb_itoa), but the data is written at a fixed offset, like
lw_itoa does. In both cases, we'd need another copy to slide the part of the output buffer that was populated with characters over the unused padding (Facebook's code solves the problem by determining where the output should end before converting).
Some functions only go up to
UINT32_MAX, in which case we have no data after 9 digits.
The solid line is the average time per conversion (in cycles), over 10K data points, while the shaded region covers the 10th percentile to the 90th percentile.
(GNU) libc conversion is just wayy out there. The straightforward
modp(stringencoders) code overlaps with Facebook's
itoa; it's slightly slower, but so much smaller.
We then have two incomplete string encoders: neither
lw_itoa generate their output where it should go. They fill a buffer from the end, and something else (not benchmarked) is responsible for copying the valid bytes where they belong. If an incomplete implementation suffices, Lwan's radix-10 approach is already competitive with (arguably faster than) the Facebook code. The same backward loop, but in radix-100, is definitely faster than Facebook's full galloping search/radix-100 converter.
Finally, we have
an_ltoa, that are neck and neck with one another, faster than both
fb_itoa on small and large integers, and even comparable with or faster than the incomplete converters. Their runtime is also more reliable (less variance) than
modp pays for the second variable length loop in
fb_itoa for the galloping search. There are more code paths in
an_ltoa, but no loop, so the number of (unpredictable) conditional branches is lower.
What have we learned from this experiment?
- It’s easy to improve on (g)libc’s
sprintf. That makes sense, since that code is so generic. However, in practice, we only convert to decimal, some hex, even less octal, and the rest is noise. Maybe we can afford to special case these bases.
- The double traverse in
modp_uitoa10hurts. It does make sense to avoid that by generating backward, ideally in the right spot from the start.
- Radix-100 encoding is a win over radix-10 (
fb_constant_itoais faster than
- Using registers as buffers while generating digits is a win (
an_ltoaare faster for small values).
- Divide and conquer is also a win (
an_ltoais flatter for large integers).
With results that made sense for an easily understood microbenchmark, I decided to try a bunch of distributions. Again, the code was hot, the predictors lukewarm, and we gathered 10K cycle counts per distribution/function.
Clearly, we can disregard glibc’s
The first facet generated integers by choosing uniformly between 100, 1000, ¹⁰⁴, …, ¹⁰⁸. That’s a semi-realistic tweak on the earlier dataset, which generated a bunch of numbers in each size class, and serves as an easily understood worst-case for branch prediction. Both
an_ltoa are faster than the other implementations, and branchier implementations (
modp) show their variance. Facebook's
fb_itoa isn't even faster than
strreverse encoder. The galloping search really hurts (
fb_constant_itoa, without that component, is slightly faster than then comparable radix-10
The second facet is an even harder case for branch predictors: random exponentials (
pow(2, 64.0 * random() / RAND_MAX)), to simulate real-world counts. Both
an_ltoa are faster than the other implementations, although
an_ltoa less so:
an_itoa only handles 32 bit integers, so it deals with less entropy. Between the 32-bit implementations,
an_itoa is markedly faster and more consistent than
lw_itoa (which is also incomplete) and
modp. Full 64-bit converters generally exhibit more variance in runtime (their input is more randomised), but
an_ltoa is still visibly faster than
fb_itoa, and even than the incomplete
fb_constant_itoa. We also notice that
fb_itoa's runtimes are more spread out than
fb_constant_itoa: the galloping search adds overhead in time, but also a lot of variance. That makes me think that the Facebook code is more sensitive than others to difference in data distribution between microbenchmarks and production.
The third facet should be representative of printing internal sequential object ids: uniform integers in
[0, 256K). As expected, every approach is tighter than with the exponential distribution (most integers are large). The
an_ltoa options are faster than the rest, and it's far from clear that
fb_itoa is preferable to even
modp. The range was also chosen because it's somewhat of a worst case for
an_itoa: the code does extra work for values between 10⁴ and 10⁸ to have more to do before the conditional branch for
x < 1e8. That never pays off in the range tested here. However, even with this weakness,
an_itoa still seems preferable to
fb_itoa, and even to the simpler
The fourth facet (first of the second row) shows what happens when we choose random integers in
[0, 20). That test case is interesting because it's small, thus semi-representative of counts for rare events, and because it needs 1 or 2 digits with equal probability. Everything does pretty well, and runtime distributions are overall tight; branch predictors can do a decent job when there are only two options. I'm not sure why there's such a difference between
an_ltoa's distribution. The code for any value less than 100 is identical at the C level, but there are small difference in code generation… but I can't pinpoint where the difference might come from.
The fifth facet, for random integers in
[100, 200) is similar, with a bit more variance.
The sixth facet generates unix timestamps around a date in 2014 with uniform selection plus or minus one million second. It’s meant to be representative of printing timestamps. Again,
an_ltoa are faster than the rest, with
an_itoa being slightly faster and more consistent. Radix-100 (
fb_constant_itoa) is faster and more consistent than radix-10 (
lw_itoa), but it's not clear if
fb_itoa is preferable to
modp. The variance for
modp is larger than for the other implementations, even
fb_itoa: that's the cost of a radix-10 loop and of the additional
This set of results shows that conditional branches are an issue when converting integers to strings, and that the important of branches strongly depends on the distribution. The Facebook’s approach, with a galloping search for the number of digits, seems particularly sensitive to the distribution. Running
fb_itoa in production because it does well in microbenchmark is thus only a good idea if we know that the microbenchmark is representative of production.
Bigger numbers take more time to convert, but the divide and conquer approach of
an_ltoa is consistently faster at the high end, while the unrolled SIMD within a register approach does well for small numbers.
So, if you ever find yourself bottlenecked on
The correct solution to the “
itoais too slow" problem is simple: don't do that. After all, If there's no special requirement, we find Protobuf does very well as a "better JSON."
However, once you find yourself in this bad spot, it’s trivial to do better than generic libc conversion code. That makes it a dangerously fun problem in a way… especially given that the data distribution can matter so much. No benchmark is perfect, but various implementations are affected differently by flaws in microbenchmarks. It’s thus essential not to overfit on the benchmark data, probably even more important than improving performance by another factor of 10% or 20% (doing 4–5x better than libc code is already a given). That’s why I prefer integer conversion code with more consistent cycle counts: there’s less room for differences due to the distribution of data.
Finally, if, like 2014-AppNexus, you find yourself converting a lot of integers to strings in tight loops, try
an_ltoa! The whole repository is Apache 2.0, and it should be easy to copy and paste all the dependencies to pare it down to two files. Also, note that the functions use their destination array (up to 10 bytes for
an_ltoa) as scratch space, even for integers that need fewer digits.
Written by Dr. Paul khuong