Conducting an Internal Data Science Contest at AppNexus
(Co-Authored with Daniel Austin)
The AppNexus Data Science team is always looking for ways to spread data science knowledge across AppNexus. We’re not the first company to recognize that scaling Machine Learning (ML) means improving ML knowledge beyond the data science department: others, including Airbnb (here, here), Facebook (here, here) and Google (here), all have data and ML literacy strategies in place, usually through an internal data science university and formal training program. We’ve considered developing an ML training curriculum, experimented with mentoring and coaching people taking online courses (e.g. the Mining of Massive Datasets and Coursera Machine Learning classes), and run a deep learning study group based on Ian Goodfellow’s Deep Learning book.
We were looking for a problem to apply our new skills to, and sometime during a team dinner someone suggested, “Why don’t we do a Kaggle-style contest?” Kaggle is a platform for data science competitions in which (professional and amateur) data scientists compete to produce the best predictive and classification models for datasets uploaded by companies. This crowdsourcing approach relies on the fact that there are a large number of strategies that can be applied to any data science problem, and it is impossible to know beforehand which technique or analysis will be most effective.
We wanted to try that internally at AppNexus. Here’s how we did it, why we did it, and how you can do this yourself.
Building the Contest
We needed these things to run the contest: motivation, a problem, a dataset, people, contest management tools and support.
Running a contest takes effort: there has to be a good reason to do it. The AppNexus Data Science team creates machine learning products to assist online content creators (publishers) and advertisers with the billions of auctions transacted daily on our platform, and to keep our platform safe. These products include:
- Reserve price optimization in real-time bidding (RTB) auctions (a “reserve price” is the minimum price that a seller will accept from a bidder in an auction)
- Optimally allocating impressions between guaranteed contracts and RTB auctions
- Bid Price Pacing (automatically adjusting bid prices to match targets)
- Discovery (identifying which publisher inventory is best to show ads)
- Filtering non-human and other invalid traffic and inappropriate domains
We help publishers and advertisers get the best value and outcomes for their money, but our Data Science team can only tackle so many projects a year. So we’ve been experimenting with different ways to engage our Engineering and Product teams, to expand the use of machine learning in our products and services and share responsibility for its evaluation. Side benefits from this include creating an internal pool of data science talent, and more widespread data literacy (e.g. having more people thinking from a data-driven point of view).
An internal contest adds a competitive element and makes the learning process more fun. It lets contestants work on a problem that’s important to the company instead of generic examples (e.g. “cat v dog” classification), and gives them familiarity with the tools used by the DS team.
We needed a problem where:
- The problem is relevant to AppNexus, and ideas, models and algorithms for it are valuable to the company.
- Data to analyze the problem can be easily extracted, is of reasonable size (100 MB to a few GB), and its analysis can be done on the participant’s company laptop.
- We have metrics (and baseline benchmarks for those metrics) to compare contest submissions, so teams can be unequivocally ranked.
We chose Click-Through Rate (CTR) prediction: predicting the probability of a user click, given a set of auction features. This problem met the criteria above, and:
- The Buy Side Data Science team has a lot of expertise on CTR prediction, with existing ML-based models in production.
- We have recent data for train/test sets, metrics to compare submissions and benchmarks obtained by running the production models on this dataset.
- There is a lot of online literature on this topic.
- The problem has good pedagogic value: many different ML models can be applied to this problem, ranging from logistic regression and factorization machines to neural networks.
For scoring, participants were given historical auction data to train their model(s), then used those models to make predictions on a test dataset. These predictions were scored by comparing them with actual click results.
We used a sample of historical auction data from a specific advertising campaign conducted in September 2017. We created a training dataset of 900,000 samples and a test dataset of 100,000 samples. Each sample represents a unique auction and contains numerical and categorical features for the ad impression being auctioned, plus a “click_label” field that reports whether the ad was clicked or not.
The contest was open to AppNexians at our Portland, Oregon office and was held over 6 weeks (October to December 2017). Three teams participated (a total of 9 AppNexians from about 40 employees at this location). We deliberately limited the pool of contestants so we could pilot the competition (to work out implementation details and demonstrate the value of a contest before scaling to the whole company), and help people one-on-one throughout the competition.
Contest management tools
We wanted to provide participants with a development environment where they could start their analysis without being bogged down by setup issues. We also wanted to make it very easy for teams to submit their predictions and see their scores and ranking instantly.
We used a Docker image for the development environment; it contained the datasets, Python and Jupyter notebooks. (Participants could use any programming language for their analysis, but we encouraged Python and Jupyter because they’re popular tools in the AppNexus Data Science community and beyond.) We extended a basic Jupyter Docker image with packages we considered relevant for our problem, including Pandas, NumPy and SciPy for analysis, Scikit-Learn for traditional ML models, TensorFlow and Keras for deep learning models, and Matplotlib and Bokeh for data visualization. When the Docker container starts, it launches a Jupyter server: participants can start using Jupyter notebooks from the browser, and install additional Python packages or other software from the Jupyter terminal window.
We scored each submission as soon as it was uploaded through our submission app. We used the Logarithmic Loss (log loss) metric to score submissions; this is a popular metric for measuring the performance of a classification model whose prediction is a probability value between 0 and 1, and lower values are better. Similar to Kaggle contests, each submission received two scores: a public score and a private score. The public score was calculated using the log loss metric on a fixed 20% random sample of the test dataset and was displayed immediately upon submission. The private score was calculated using the log loss metric on the full test dataset, but was kept hidden from the participants.
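A minimal sketch of this scoring scheme, written with only the Python standard library. The function names, the seed value and the clipping constant are assumptions made for illustration; they are not taken from the contest's actual submission app.

```python
import math
import random


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    if len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def public_indices(n_rows, fraction=0.2, seed=2017):
    """Pick a fixed random subset of test rows (same seed, same subset)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), int(n_rows * fraction)))


def score_submission(y_true, y_pred, fraction=0.2, seed=2017):
    """Return (public_score, private_score) for one submission."""
    idx = public_indices(len(y_true), fraction, seed)
    public = log_loss([y_true[i] for i in idx], [y_pred[i] for i in idx])
    private = log_loss(y_true, y_pred)  # full test set, as described above
    return public, private
```

Because the public subset is fixed by the seed, every submission is scored on the same rows, which keeps public scores comparable across teams and over time.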
To add a competitive streak (this is after all a contest!), we created a Public Leaderboard based on the public score, dynamically updating a read-only Confluence wiki page whenever a submission was made. A Private Leaderboard was also generated from the private score, but was kept hidden until the end of the contest.
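The leaderboard logic can be sketched as follows. This is an illustration, not the contest's actual code: the record layout (`team`, `public`, `private` keys) is hypothetical, and ranking each team by its best (lowest) score is an assumption borrowed from how Kaggle's public leaderboard behaves.

```python
def build_leaderboard(submissions, score_key="public"):
    """Rank teams by their best (lowest) log loss for the given score type."""
    best = {}
    for sub in submissions:
        team, score = sub["team"], sub[score_key]
        # Keep only the lowest score seen so far for each team.
        if team not in best or score < best[team]:
            best[team] = score
    # Lower log loss is better, so sort ascending.
    ranked = sorted(best.items(), key=lambda item: item[1])
    return [(rank, team, score) for rank, (team, score) in enumerate(ranked, start=1)]
```

Calling `build_leaderboard(submissions)` after every upload gives the public ranking, while `build_leaderboard(submissions, "private")` gives the ranking that stays hidden until the contest ends.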
For our pilot run, we adhered to most of the official Kaggle contest rules. One exception was that we did not place any restrictions on the number of daily submissions.
We ran tutorials on setting up the Docker development environment and using Jupyter notebooks. We created starter notebooks that illustrated basic aspects of machine learning, including use of Pandas and NumPy, how to handle categorical features, how to run a logistic regression with Scikit-Learn, and how to run a simple neural network using Keras. We created a Slack channel where participants could ask questions, and we also held weekly office hours.
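To give a flavor of the ideas behind handling categorical features and fitting a logistic regression, here is a dependency-free sketch. It is not one of the starter notebooks, which used Scikit-Learn; the feature names (`device`, `hour`) and the toy rows are made up, and plain stochastic gradient descent stands in for a library solver.

```python
import math


def one_hot(rows, columns):
    """Map each (column, value) pair seen in `rows` to its own feature index."""
    index = {}
    for row in rows:
        for col in columns:
            index.setdefault((col, row[col]), len(index))
    return index


def encode(row, columns, index):
    """Return the active feature indices for one row; unseen values are dropped."""
    return [index[(c, row[c])] for c in columns if (c, row[c]) in index]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, columns, epochs=200, lr=0.1):
    """Fit weights and a bias with stochastic gradient descent on log loss."""
    index = one_hot(rows, columns)
    weights, bias = [0.0] * len(index), 0.0
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            active = encode(row, columns, index)
            p = sigmoid(bias + sum(weights[i] for i in active))
            grad = p - y  # derivative of log loss with respect to the logit
            bias -= lr * grad
            for i in active:
                weights[i] -= lr * grad
    return index, weights, bias


def predict_proba(row, columns, index, weights, bias):
    """Predicted click probability for one row."""
    active = encode(row, columns, index)
    return sigmoid(bias + sum(weights[i] for i in active))


# Toy data (made up): each row is one auction, each label a click_label value.
COLUMNS = ["device", "hour"]
ROWS = [
    {"device": "phone", "hour": "evening"},
    {"device": "phone", "hour": "morning"},
    {"device": "desktop", "hour": "evening"},
    {"device": "desktop", "hour": "morning"},
    {"device": "phone", "hour": "evening"},
    {"device": "phone", "hour": "evening"},
]
LABELS = [1, 0, 0, 0, 1, 0]

index, weights, bias = train_logistic(ROWS, LABELS, COLUMNS)
p_click = predict_proba({"device": "phone", "hour": "evening"}, COLUMNS, index, weights, bias)
```

With real data, a library implementation such as Scikit-Learn's would normally replace the hand-written training loop; the sketch only shows what the encoding and the model are doing.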
Lessons learned from our pilot, and next steps
After the contest ended, we ran a retrospective session, sent a follow-on survey and conducted a semi-structured interview with the participants. We captured the following feedback:
- Overall: Our pilot run was largely successful. We found a lot of interest from people in doing data science, and in participating in these activities. Each team made more than 10 submissions and there was good engagement with the contest and the office hours we conducted (this was especially true at the beginning of the contest, but it tailed off towards the end). Participants also went on to use machine learning techniques in the AppNexus hackathon that happened during the pilot, and we reused the training material and Docker image from the contest for the hackathon learn session.
- Problem selection: The CTR problem we selected was interesting and relevant to participants, and most participants enjoyed the general structure of the contest. There was a lot of interest at the end of the competition in learning more about machine learning and data science, and also applying machine learning in their jobs.
- Contest Infrastructure: Participants thought the submission process was straightforward and good, but having a command line script or other tool to automate submissions would have helped. Most participants liked using Python in Jupyter notebooks because of the ease of use and ability to make plots. Additionally, the Docker container made getting started quick and easy.
- Teaching and Learning Support: The initial classes were a good starting point to get participants competing, but more hands-on classes at the beginning of the contest would have helped. In addition to the office hours, it would have been helpful if we had done more teaching during the course of the contest. For example, some folks would have benefited from more details on how to pick model hyper-parameters and how to use a hold-out set to test generalization. Finally, providing each team with mentors from the Data Science department (perhaps rotating mentors weekly) would likely improve motivation and help keep teams unblocked throughout the competition.
In general, we had a fairly successful pilot study with plenty of room for improvement.
We’re considering rolling out this contest to the whole company in Spring 2018. We will incorporate most of the lessons learned during the pilot run:
- Improve the submission app so that it provides authentication and displays progress while the submission is processed. Both limitations came from the choice of the underlying framework (Dash) that we used to create the app. The second limitation has been addressed by the Dash team and we now have a progress indicator. The free version of the Dash package does not have LDAP integration, but we can assign passwords to participating teams to authenticate them for making submissions. We are also exploring the use of Jupyter Dashboard for the submission app.
- Provide a command-line interface to make submissions.
- Reduce the length of the contest (4 weeks or less) and restrict the number of daily submissions.
- Provide more hands-on training and example notebooks.
- If feasible, provide a rotating set of mentors to each participating team.
We also plan to open source the code we developed for this contest: look for it at https://github.com/appnexus.
We would like to thank the following AppNexians for their generous help and support:
• Moussa Taifi: for creating the Docker infrastructure and assistance with tooling
• Lei Hu: for providing us with the dataset
• Stephanie Tzeng: for fruitful discussions on how to conduct the contest
• Allison Krug: for encouraging participation and getting sponsorship for the prizes
• Sara-Jayne Terp: for helping us convert our writeup into a readable and outward-facing blog post
• John Murray and Scott Moore: for review and feedback of the blog post
• Ryan Woodard and Sam Seljan: our fellow Data Scientists in the Portland office, for their continuous feedback throughout the entire phase of the contest and creation of this blog post
And all the participants for their active engagement in the contest and suggestions on how we can improve this going forward.