The future: no one truly knows what it holds in store for us. Some people read the future with tarot cards or by gazing into crystal balls, with limited to moderate success. On the other hand, I like to emulate Nate Silver and make informed predictions with data and statistical models. Luckily for me, AppNexus has a plethora of data, ripe for building forecasts and helping me learn and develop as a data scientist.
The Project

As a Data Science intern for the Client Insights Analytics (CIA) team, I expected to utilize my background in statistics and computer science and my enthusiasm for helping others. As soon as I met the CIA team, I knew I was in the right place. My manager, Liz, informed me that my main summer project would focus on building a dashboard for internal clients (Technical Account Managers and salespeople) to forecast metrics including inventory availability, impressions, and revenue. This endeavor seemed perfect for me – a combination of data science and internal consulting. Liz and I met with several fellow AppNexians during the first few weeks of my internship to explore what related tools were currently in use, how we could improve upon them, and what data I would need.
Data Wrangling

Speaking of data, what’s a statistical venture without some data snags? Once we had brainstormed the underlying factors behind a publisher’s trends and growth over time, along with the corresponding inputs for the web-based dashboard, Liz empowered me to pull the necessary information from our databases. I dove right into learning SQL and was soon able to write queries in Python using the link package. Of course, it couldn’t be that simple. I spent quite some time exploring different tables in AppNexus’ databases and discovered that certain tables had longer look-backs than others, while those with shorter time frames often contained more of the fields required for forecasting. It was a classic trade-off between quantity and quality of data. I investigated the overlaps between the tables in an attempt to merge them into a single dataset with both a long look-back and the significant variables, but I realized that the solution was choice rather than compromise. Though some friends at school had jokingly nicknamed me “The Wrangler” while I was working on a data visualization project, I could not quite wrangle the data the way I had hoped in this instance. Instead, I decided to proceed with the datasets separately and give users of the dashboard a choice of look-backs and input fields.
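To make the look-back versus fields trade-off concrete, here is a minimal sketch of how one might profile candidate tables before choosing between them. It uses the standard library's sqlite3 with an in-memory database as a stand-in for the real databases and the link package; every table name, column name, and row below is made up for illustration.

```python
import sqlite3

# Hypothetical schema: two tables standing in for sources with different
# look-backs and different sets of fields. None of these names are real.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE long_history (day TEXT, publisher_id INTEGER, imps INTEGER);
    CREATE TABLE short_history (day TEXT, publisher_id INTEGER, imps INTEGER,
                                revenue REAL, country TEXT);
    INSERT INTO long_history VALUES
        ('2015-01-01', 1, 100), ('2015-06-01', 1, 120), ('2015-07-01', 1, 130);
    INSERT INTO short_history VALUES
        ('2015-06-01', 1, 120, 1.5, 'US'), ('2015-07-01', 1, 130, 1.7, 'US');
""")

def describe_source(conn, table):
    """Return (earliest day, latest day, column names) for one table.

    The table name is interpolated directly, so it must come from a fixed,
    trusted list and never from user input.
    """
    first, last = conn.execute(f"SELECT MIN(day), MAX(day) FROM {table}").fetchone()
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return first, last, columns

sources = {name: describe_source(conn, name)
           for name in ("long_history", "short_history")}
for name, (first, last, columns) in sources.items():
    print(name, first, last, columns)
```

A summary like this makes the choice visible: one source reaches further back, the other carries more fields, and a dashboard can offer both as separate options.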
Modelling

Having written (and continually re-written) functions to collect the data, I pressed on with exploratory data analysis. With excellent guidance and assistance from Adam on the Data Science team, I generated several statistical models for predicting future publisher metrics (the dependent variables) based on the dashboard inputs (the independent variables) and on time-series information including lag terms, day of the week, general time of the month, and quarter of the year. The standard regression outputs gave me a sense of which models typically fit the data better. I was able to go a step further, however, by back-testing the models on out-of-sample (OOS) data and comparing their OOS prediction errors. This let me determine which forecasting method yielded the most accurate results, and I used that method in the back-end of the web-based tool. As soon as I had generated a graph showing past data alongside forecasted data, it was time to build the front-end.
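The back-testing idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, and the two forecasters (a last-value model and a weekly-lag model) are simple stand-ins for the regression models described above, not the ones used in the tool. All function names are hypothetical.

```python
from datetime import date, timedelta

def make_series(n_days, start=date(2015, 1, 5)):
    """Synthetic daily metric with a weekly pattern and slow growth (made-up data)."""
    weekday_effect = [1.0, 1.05, 1.1, 1.1, 1.0, 0.7, 0.6]  # Monday..Sunday
    values = []
    for i in range(n_days):
        day = start + timedelta(days=i)
        values.append((1000 + 2 * i) * weekday_effect[day.weekday()])
    return values

def forecast_last_value(history, horizon):
    """Naive model: repeat the most recent observation."""
    return [history[-1]] * horizon

def forecast_weekly_lag(history, horizon):
    """Seasonal-naive model: reuse the latest observed value for the same weekday."""
    return [history[-7 + (h % 7)] for h in range(horizon)]

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def backtest(values, model, holdout):
    """Forecast from all but the last `holdout` points, then score on those points."""
    train, test = values[:-holdout], values[-holdout:]
    return mean_absolute_error(test, model(train, holdout))

values = make_series(120)
scores = {
    "last_value": backtest(values, forecast_last_value, holdout=14),
    "weekly_lag": backtest(values, forecast_weekly_lag, holdout=14),
}
best_model = min(scores, key=scores.get)  # lowest out-of-sample error wins
print(scores, best_model)
```

The key design point is that the held-out days are never seen by the model being scored, so the comparison reflects forecasting accuracy rather than in-sample fit.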
Web Development

With help from Liz, Adam, and various teammates on CIA, I learned about Flask, a Python web framework for building and serving webpages. I used the CIA Custom Analytics Tool template as my starting point and built my tool from other CIA examples and my own scripts for data collection and statistical modelling. Web development can often be tedious, and I wanted to avoid boring repetition. However, some of the inputs for the dashboard were best presented as dropdowns, including a few that were two-layered multi-select dropdowns. Typing up the HTML for such a dropdown could take hours, even when copying and pasting from a template. Therefore, I wrote a function in Python that takes in a dataframe and some column names and prints out the HTML for the desired dropdown. This was a particularly great learning experience and a fun accomplishment.
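A helper along these lines might look like the sketch below. It is not the original function: it assumes the two layers map onto HTML `<optgroup>` and `<option>` elements, it takes a list of dictionaries rather than a dataframe so that it runs with the standard library alone, and the function name, column names, and sample rows are all hypothetical.

```python
from html import escape

def dropdown_html(rows, group_col, option_col, name):
    """Build a two-level multi-select: one <optgroup> per group, one <option> per item."""
    groups = {}
    for row in rows:
        options = groups.setdefault(row[group_col], [])
        if row[option_col] not in options:  # skip duplicate rows
            options.append(row[option_col])

    lines = [f'<select name="{escape(name)}" multiple>']
    for group, options in groups.items():
        lines.append(f'  <optgroup label="{escape(str(group))}">')
        for option in options:
            value = escape(str(option))  # escape text so it is safe inside HTML
            lines.append(f'    <option value="{value}">{value}</option>')
        lines.append("  </optgroup>")
    lines.append("</select>")
    return "\n".join(lines)

# Made-up sample rows; the last one is a deliberate duplicate.
rows = [
    {"region": "North America", "country": "US"},
    {"region": "North America", "country": "Canada"},
    {"region": "Europe", "country": "France"},
    {"region": "North America", "country": "US"},
]
html = dropdown_html(rows, "region", "country", "geo")
print(html)
```

With a pandas DataFrame, `df.to_dict("records")` would produce rows in this shape. Returning the markup as a string, rather than printing inside the function, also makes it easy to pass the result into a page template.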
Figure: Screenshot of example input in the dashboard, prominently displaying the two-layered multi-select dropdowns.
After building out the initial version of the web-tool, I solicited feedback from teammates and internal clients who would be using it. I appreciated their advice and tacked on certain additional input fields, clarified some outputs with supplementary descriptions, and figured out (through some struggles) how to generate some error messages for certain use cases.
The Future

My summer project was an exciting, engaging experience in which I learned a lot and developed many new skills while building upon others. My hope is that internal clients will use this tool for forecasting well into the future, as I love the opportunity to contribute in a meaningful way. I cannot say for sure, though; after all, no one can predict the future perfectly.
About the Author: Adam is a senior at Harvard College, concentrating in Statistics and pursuing the Data Science track. He spends most of his time with friends, learning and working, exploring statistics as a leader of the Harvard Sports Analysis Collective, and rooting for Boston sports teams.