More Data Is Not Better and Machine Learning Is a Grind… Just Ask Amazon
I had an opportunity today to sit in on a lecture at UNC by two machine learning experts from Amazon. Ed Banti is Director of Software Development for Core AI at Amazon, and Pat Bajari is VP and Chief Economist of Core AI. Both gentlemen shared their insights on working with AI solutions at Amazon, speaking to a group of students and faculty at UNC’s Carroll Hall (my old graduate school stomping grounds!).
After an overview by the corporate recruiter of Amazon’s values (including the infamous Flywheel Effect), Ed spoke about how data analytics drives several parts of the flywheel, including getting the right products in front of customers, stocking the right quantities, and ensuring that customers are satisfied. He also made some very interesting comments on how AI is really about the process of learning, and does not always mean getting it right the first time. In fact, it was really interesting listening to Ed describe the major types of “things that go wrong” in machine learning programs:
- A model is now running in production but is not producing the same results (or the same level of accuracy) that it demonstrated during experimentation… and no one knows why
- Input data is messy, incomplete, or doesn’t arrive on time, leading to model training delays (a theme I have belabored to death in prior blogs, emphasizing the critical nature of data governance as a foundation for AI and machine learning)
- Model fixes or improvements get stuck in an endless cycle of re-writing, which keeps critical changes from being deployed
- A model was trained offline and is now sitting in some production system without being retrained or monitored
- Eventually your scientists feel frustrated by how slowly their work translates into impact, and your engineers feel less ownership over their work
Ed noted that Amazon had made all of these mistakes in the past (and more), and that an important rule of thumb for machine learning teams is to have a robust set of “guardrails”: standardizing on a single framework, creating experimentation environments that mirror production, defining standard interfaces that models must conform to, and encapsulating systems that abstract both experimentation and production.
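To make the “standard interface” guardrail concrete, here is a minimal sketch of what such a contract might look like in Python. This is my own illustration, not Amazon’s actual framework; the class and method names are hypothetical.

```python
from abc import ABC, abstractmethod

import pandas as pd


class ForecastModel(ABC):
    """Hypothetical standard interface that every model must conform to.

    Both the experimentation harness and the production service call
    models only through this contract, so a model promoted out of a
    notebook needs no re-writing before deployment.
    """

    @abstractmethod
    def fit(self, features: pd.DataFrame, target: pd.Series) -> "ForecastModel":
        """Train the model; returns self so calls can be chained."""

    @abstractmethod
    def predict(self, features: pd.DataFrame) -> pd.Series:
        """Score new data; must be safe to call in production."""

    @abstractmethod
    def metrics(self) -> dict:
        """Report accuracy metrics logged in both environments, so
        production drift (the "no one knows why" problem above) is
        visible against the experimentation baseline."""
```

Because every model passes through the same interface in an experimentation environment that mirrors production, the re-write and retraining problems Ed listed become much easier to catch early.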
Pat then spoke on the outcomes of a recent white paper he worked on with graduate students at the University of Washington. An academic turned executive, Pat is an experienced applied econometrician who specializes in empirical industrial organization. At Amazon, his team employs econometrics, software development, and machine learning to drive data-driven decisions. In this lecture, he spoke on “The Impact of Data on Firm Performance,” noting that as organizations take in more data, they can produce better models and reach more Amazon users, which in turn generates more data, referring back to the Amazon Flywheel Effect.
In this study, Amazon focused on 36 product lines with 5 years of weekly data, comparing forecasts with actuals. His team was interested in whether forecast errors change as more data is gathered, and sought to be more precise about what it means for data to get “big”. He first noted, tongue in cheek, that Amazon once used a single forecasting model for ordering all of its 25 million book titles, which involved stocking at the 85th percentile. In general this approach might work well, but there will be a lot of variance in outcomes. The preliminary results of their research showed the following:
- Data on more products was only useful at the “cold” start of the forecasting process, when a product has little or no history
- As the number of products grew, the benefits became negligible
- More observations per product, however, were important
- The results were consistent with asymptotic theory (the central limit theorem), which predicts that more data has diminishing returns. The results also showed that there is no single model used in Amazon forecasts; accuracy was best with a factor structure with latent time and product effects, generalizing a standard fixed effects model (see the sketch after this list)
- Higher velocity data did not improve percentage accuracy, and actually made accuracy levels worse!
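For readers who want the econometrics made concrete, here is a sketch of what a factor structure with latent time and product effects typically looks like relative to a standard fixed effects model. The notation is my own, not the paper’s.

```latex
% Standard two-way fixed effects: demand y for product i in week t
y_{it} = \alpha_i + \gamma_t + \varepsilon_{it}

% Factor structure: latent time factors f_t with product-specific
% loadings \lambda_i; fixed effects are the special case
% \lambda_i = (\alpha_i, 1), \; f_t = (1, \gamma_t)
y_{it} = \lambda_i^{\top} f_t + \varepsilon_{it}

% CLT intuition behind diminishing returns: with T observations per
% product, estimation error shrinks only at rate
\hat{\theta}_T - \theta = O_p\!\left(T^{-1/2}\right)
```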
The implications of these findings are important, and they contain some lessons for the hyperbolic “big data is better” marketing claims. In fact, Pat surmised that the results seem inconsistent with a naive “data feedback loop,” and they raise doubts as to whether forecast accuracy improves as the number of products increases. Marginal improvements in accuracy can be offset by diseconomies of scale in modeling a larger number of products; effort should instead be focused on observing individual products longitudinally.
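To see the diminishing-returns point concretely, here is a small simulation I put together: it estimates the 85th-percentile stocking level mentioned above from progressively longer demand histories. The lognormal demand distribution and all numbers are invented for illustration; they are not Amazon’s data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical weekly demand for one product: lognormal, so the true
# 85th-percentile stocking level is known in closed form.
mu, sigma = 3.0, 0.8
true_q85 = np.exp(mu + sigma * 1.0364)  # 1.0364 = z-score of the 85th pct

# Estimate the 85th-percentile stocking level from progressively
# longer demand histories and track the estimation error.
for weeks in [13, 52, 260, 1300]:  # a quarter, a year, 5 years, 25 years
    errors = []
    for _ in range(2000):  # Monte Carlo replications
        history = rng.lognormal(mu, sigma, size=weeks)
        errors.append(abs(np.percentile(history, 85) - true_q85))
    print(f"{weeks:5d} weeks of history -> mean |error| = {np.mean(errors):.2f}")

# The error shrinks roughly like 1/sqrt(weeks): each step buys less
# accuracy than the one before -- more data helps, with diminishing returns.
```

Going from one quarter to one year of history buys far more accuracy than going from one year to five, which is exactly the longitudinal, per-product data Pat argued matters most.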
He concluded with some very interesting points on how companies should view and invest in machine learning technologies. It is important to pick a single metric to improve, even if it is imperfect, and use it as the basis for measuring performance improvement. Pat noted that improvement and learning are often very slow, rather like a gradual weight-loss program. Processes may improve by only 20 basis points a quarter, or roughly 80 basis points a year. That isn’t a lot, but over a decade it really makes a difference (a quick arithmetic check appears at the end of this post).

He also noted that tech firms are often run by scientists, who are much more willing to take on new methods, adapting technologies straight out of PhD dissertations. These companies led by “nerds” are the early adopters; elsewhere, management is often the most important barrier to the development and adoption of machine learning models. Cloud computing is also accelerating the diffusion of these technologies, and companies that adopt the scientific method, using rational approaches to explore irrational problems, will be the ones that succeed. His final word of advice: students should be broad in their knowledge of many things, but very deep in one area. Data Interpreters needed!
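As promised, a quick check on Pat’s basis-point arithmetic; this back-of-the-envelope sketch uses his illustrative numbers, compounded quarterly.

```python
# 20 basis points (0.20%) of improvement per quarter, compounded
quarterly_gain = 0.0020
decade = (1 + quarterly_gain) ** 40  # 40 quarters = 10 years
print(f"Cumulative improvement after a decade: {decade - 1:.1%}")
# -> about 8.3%: invisible in any single quarter, but a real
#    competitive edge at Amazon's scale after ten years
```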