Susan Athey on Machine Learning and Econometrics

Russ Roberts
40 min read · Sep 15, 2016

--

EconTalk Episode #542 (Archive of all episodes, here)

SUBSCRIBE TO ECONTALK: iTUNES, SOUNDCLOUD, STITCHER

(Conversation lightly edited for clarity and readability)

Russ Roberts: Today is July [corrected] 18th, 2016 and my guest is Susan Athey, the Economics of Technology professor at the Graduate School of Business at Stanford University. She won the John Bates Clark Medal in 2007 and is a member of the National Academy of Sciences.

We’re going to start with a recent paper you co-authored [with Guido Imbens], on the state of applied econometrics. It’s an overview paper and it gets at a lot of issues that are current in the field. I want to start with an example you use there, a pretty important policy issue, which is the minimum wage. It’s in the news these days; people are talking about increasing the minimum wage; we’ve talked about it in passing on the program many times. What’s the challenge of measuring the impact of the minimum wage on employment, say? Why is that hard? Can’t we just go out and count how many jobs change when we change the minimum wage in the past?

LISTEN TO THE CONVERSATION

Susan Athey: Sure. There are sort of two problems when we think about measuring the impact of a policy like the minimum wage. The first is just basic correlation versus causality. So, if you think about looking at a point in time, we could say, some states have a high minimum wage and other states have a lower minimum wage. And so, the naive thing to do would be to say, ‘Hey, look, this state has a high minimum wage,’ or ‘This city has a high minimum wage, and employment is doing just fine. So, gosh, if we increase the minimum wage everywhere else, they would do just fine, too.’ And the big fallacy there is that the places that have chosen to raise the minimum wage are not the same as the places that have not.

And recently certain cities that have raised the minimum wage tend to be cities that are booming, that have, say, a high influx of tech workers that are making very high wages; the rents have gone up in those cities and so it’s made it very difficult for low-wage workers to even find anyplace to live in those cities. And that’s often been the political motivation for the increase in the minimum wage. And so a minimum wage that works in San Francisco or Seattle is not necessarily going to work in the middle of Iowa.

And there are a number of underlying factors that come into play here. One is that in a city that has lots of tech workers, the customers at fast food restaurants may not be very price-sensitive. So it might be very easy for those restaurants to increase their prices and basically pass on the cost increase to their customers, and the customers may not really mind. So it may not really have a big impact on the bottom line of the establishment.

Another factor is that if it’s a place where workers are in scarce supply, maybe partly because it’s hard to find a place to live, then the establishments may see a tradeoff when they raise wages: if they have higher wages it may help them keep a more stable work force, which helps kind of compensate for the cost of raising the minimum wage. So, all of these forces can lead to very different responses than in a place where those forces aren’t operating. And so, if you just try to relate the magnitude of the minimum wage: ‘Hey, it’s $15 here; it’s $10 there; it’s $8 there,’ and here are the differences in employment, and try to connect the dots between those points, you are going to get a very misleading answer.

Russ Roberts: Just to bring up an example we’ve talked about many times on the program with a similar challenge: if you look at the wages of people who have gone to college or done graduate work, they’re a lot higher than the wages of people who haven’t. And people say, ‘Well, that’s the return to education.’ The question is, if you increased the amount of education in the United States, would you necessarily get the returns that the people who are already educated are getting? And the answer is probably not. The additional people who would go on to college might not be, and probably aren’t, exactly the same as the people who are going now; and therefore the full effect would be more challenging to measure.

Susan Athey: Exactly. And there’s also equilibrium effects, because if you send more people to college, you’ll decrease the scarcity of college graduates, which can also lower the equilibrium wages. So, these things are very hard to measure. So, something that’s typically a better starting point is looking over time and trying to measure the impact of changes in the minimum wage over time. But those also are confounded by various forces. So, it’s not that San Francisco or Seattle have static labor markets. Those markets have been heating up and heating up and heating up over time. And so there are still going to be time trends in the markets.

And so if you try to look, say, one year before and one year after, the counterfactual — what would have happened if you hadn’t changed the minimum wage — is not static. There would have been a time trend in the absence of the change in policy. And so one of the techniques that economists try to use in dealing with that is something that is called ‘difference in differences.’

Essentially you can think about trying to create a control group for the time trend of the cities or states that made a change. So, if you think about the problem you are trying to solve, suppose the city tries to implement a higher minimum wage. You want to know what the causal effect of that minimum wage increase was. And in order to do that, you need to know what would have happened if you hadn’t changed the minimum wage. And we call that a counterfactual.

Counterfactually, if the policy hadn’t changed, what would have happened? And so you can think conceptually about what kind of data would have helped inform what would have happened in Seattle in the absence of a policy change. And so, one way that people traditionally looked at this is they might have said, well, Seattle looks a lot like San Francisco, and San Francisco didn’t have a policy change at exactly the same time; and so we can try to use the time trend in San Francisco to try to predict what the time trend in Seattle would have been.
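
To make the difference-in-differences logic concrete, here is a minimal sketch in Python on fabricated data: two hypothetical cities sharing a common trend, one of which gets an invented policy change. The city labels, the dates, and the size of the effect are all made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fabricated monthly employment indices for two hypothetical cities.
rng = np.random.default_rng(0)
months = np.arange(24)

treated_city = pd.DataFrame({
    "treated": 1, "month": months,
    # shared upward trend, plus an invented -3 point effect after month 12
    "employment": 100 + 0.5 * months - 3 * (months >= 12) + rng.normal(0, 1, 24),
})
control_city = pd.DataFrame({
    "treated": 0, "month": months,
    "employment": 95 + 0.5 * months + rng.normal(0, 1, 24),
})
df = pd.concat([treated_city, control_city], ignore_index=True)
df["post"] = (df["month"] >= 12).astype(int)

# The coefficient on treated:post is the difference-in-differences estimate,
# valid only under the assumption that the two cities share a time trend.
fit = smf.ols("employment ~ treated + post + treated:post", data=df).fit()
print(fit.params["treated:post"])  # roughly -3 with this fabricated data
```

The interaction coefficient is the treated city’s before-versus-after change minus the comparison city’s, which is exactly the counterfactual adjustment described above.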

Russ Roberts: And they both start with the letter ‘S’, so therefore it could be reliable. [humor] Or it might not be. So that one of the challenges of course is that you are making a particular assumption there.

Susan Athey: Exactly. And so the famous Card/Krueger study of the minimum wage compared Pennsylvania to New Jersey. And that was kind of a cutting-edge analysis at the time. Since then, we’ve tried to advance our understanding of how to make those types of analyses more robust. So, one thing you can do is to develop systematic ways to evaluate the validity of your assumptions.

Just as a simple thing, if you were doing Seattle and San Francisco, you might track them back over 5 or 6 years and look at their monthly employment patterns and see whether they are moving up and down in the same patterns, so that you basically validate your assumption. There was no change, say, a year ago; did it look like there was a change? So, for example, if Seattle was on a steeper slope than San Francisco, then a fake analysis done a year ago, when there wasn’t really a change, might also make it look like there was an impact on employment. We call that a placebo test. It’s a way to validate your analysis and pick up whether or not the assumptions underlying your analysis are valid.
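
A placebo test of the kind just described can be sketched the same way, on simulated pre-period data with an invented policy date. If the common-trend assumption is reasonable, the estimated ‘effect’ of the fake change should be near zero; a large one is a warning sign.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
months = np.arange(12)  # pre-policy period only (fabricated)

frames = []
for treated, level in [(1, 100), (0, 95)]:
    frames.append(pd.DataFrame({
        "treated": treated, "month": months,
        "employment": level + 0.5 * months + rng.normal(0, 1, 12),
    }))
df_pre = pd.concat(frames, ignore_index=True)

# Pretend the policy changed at month 6, even though nothing happened then.
df_pre["fake_post"] = (df_pre["month"] >= 6).astype(int)

placebo = smf.ols("employment ~ treated + fake_post + treated:fake_post",
                  data=df_pre).fit()
# Under parallel trends this should be near zero and statistically insignificant.
print(placebo.params["treated:fake_post"], placebo.pvalues["treated:fake_post"])
```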

A more complex thing, developed by Alberto Abadie, is a method called synthetic control groups. Instead of relying on just one city, I might want to look at a whole bunch of cities. I might look at other tech cities like Austin. I might look at San Jose and San Francisco, and Cambridge; find other areas that might look similar in a number of dimensions to Seattle. And create what we call a synthetic control group. And that’s going to be an average of different cities that together form a good control. And you actually look for weights for those cities, so that you match many different elements: you match the characteristics of those cities across a number of dimensions and you make sure that the time trends of those all together look similar. And that’s going to give you a much more credible and believable result.
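
A rough sketch of the synthetic-control idea, on invented numbers: choose non-negative weights summing to one so that a weighted average of untreated cities tracks the treated city’s pre-period path. Real implementations, such as Abadie’s, also match on covariates and handle inference more carefully; this only shows the weighting step.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Fabricated pre-period outcomes: 12 months (rows) for 4 untreated cities (columns).
controls_pre = 100 + 0.5 * np.arange(12)[:, None] + rng.normal(0, 2, size=(12, 4))

# Fabricated treated city, built as a mix of the controls so a good synthetic
# control exists by construction.
true_mix = np.array([0.5, 0.3, 0.2, 0.0])
treated_pre = controls_pre @ true_mix + rng.normal(0, 0.5, 12)

# Choose non-negative weights summing to one that reproduce the treated city's
# pre-period path as closely as possible.
def pre_period_gap(w):
    return np.sum((treated_pre - controls_pre @ w) ** 2)

k = controls_pre.shape[1]
res = minimize(pre_period_gap, x0=np.full(k, 1.0 / k),
               bounds=[(0.0, 1.0)] * k,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
print(np.round(res.x, 2))  # weights roughly recovering the invented mix

# In the post-period, the weighted average of the control cities (a post-period
# matrix of control outcomes, not constructed here) would stand in for the
# treated city's counterfactual; the gap from its actual outcomes is the effect.
```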

Russ Roberts: The standard technique in these kinds of settings in economics is what we would call [using] control variables: we are going to try, in the data itself, for that particular city, to control for educational level, family structure, overall economic growth in the city, etc.; and then try to tease out from that the independent effect of the change in the legislation as a discrete event — a one/zero change — in policy: say, the minimum wage. Of course, I also care about the magnitude of the increase, and so on.

That standard methodology in the past was multivariate regression, which has the general problem that you don’t know whether you’ve controlled for the right factors. You don’t know what you’ve left out. You have to hope that those things are not correlated with the thing you are trying to measure — for example, the political outlook of the city, or perhaps faster growth in a particular sector such as restaurants. There are all kinds of things that would be very challenging. You’d try to find as much data as you could. But these synthetic-control and placebo approaches, I assume, have the same challenge, right? You’ve got an imperfect understanding of the full range of factors.

Susan Athey: That’s right. So if I think about trying to put together a synthetic control group that’s similar to the city that I’m trying to study, there are many, many factors that could be predictive of the outcome that I’m interested in. That brings us into something called high-dimensional statistics, or colloquially, machine learning. These are techniques that are very effective when you might have more variables, more covariates, more control variables than you have, even, observations. And so if you think about trying to control for trends in the political climate of the city, that’s something where there’s not even one number you could focus on. You could look at various types of political surveys. There might be 300 different, noisy measures of political activity. I mean, you could go to Twitter, and you could see in this city how many people are posting political statements of various types.

Russ Roberts: You could look at Google and see what people are searching on; you might get some crude measures, like, did they change the mayor or city council, or some kind of change like that would be some indication, right?

Susan Athey: Exactly. So just trying to capture sort of political sentiment in a city would be highly multidimensional in terms of raw data. And so, I think the way that economists might have posed such a problem in the past is that we would try to come up with a theoretical construct and then come up with some sort of proxy for that, maybe one, or two, or three variables; and sort of argue that those are effective. We might try to demonstrate, kind of cleverly, that they related to some other things of interest in the past. But it would be very much an art form, how you would do that. And very, very subjective. Machine learning has this brilliant characteristic: it’s going to be data-driven model selection. So, you are basically going to be able to specify a very large list of possible covariates, and then use the data to determine which ones are important.

Russ Roberts: Covariates being things that move together?

Susan Athey: Things that you want to control for. And we’ve talked about a few different examples of ways to measure causal effects. There’s difference in differences; there’s things called regression discontinuity; there’s things where you try to control for as much as you possibly can in a cross section, in a single point in time. But all of those methods share a common feature: that it’s important to control for a lot of stuff. And you want to do it very flexibly.

Russ Roberts: (13:20) When I was younger and did actual empirical work — and I think — I know I’m older than you are. In my youth we actually had these things called ‘computer cards’ that we would have to stack in a reader when we wanted to do statistical analysis. And so it was a very, very different game then. We invented, people invented these wonderful software packages that made things a lot easier like SAS (Statistical Analysis System) and SPSS (Statistical Package for the Social Sciences). And one of the things we used to sneer at, in graduate school, was the people who would just sort of — there was a setting or, I can’t remember what you’d call it, where you’d just sort of go fishing. You could explore all the different specifications. So, you don’t know whether this variable should be entered as a log [logarithm — Econlib Ed.] or a squared term. You don’t know whether you want to have nonlinearities, for example. [And what variables to include, I should have added — Russ]. You just let the data tell you. And we used to look down on that, because we’d say, ‘Well, that’s just that set of data. It’s a fishing expedition. And there’s no theoretical reason for it.’ Is machine learning just a different version of that?

Susan Athey: In some ways, yes. And in some ways, no. So, I think if you applied machine learning off the shelf without thinking about it from an economic perspective, you would run back into all the problems that we used to worry about. So, let me just back up a little bit. Let’s first talk about the problem you were worried about. So, you can call it ‘data mining in a bad way.’ Now data mining is a good thing.

Russ Roberts: It’s cool.

Susan Athey: It’s cool. But we used to use ‘data mining’ in a pejorative way. It was sort of a synonym for, as you say, fishing in the data. It was a-theoretical. But the real concern with it was that it would give you invalid results.

Russ Roberts: Non-robust results, for example. And then you’d publish the one that, say, confirmed your hypothesis; and you’d say that was the best specification. But we don’t know how many you searched through.

Susan Athey: Yeah. So, ‘non-robust’ is a very gentle description. So, I think ‘wrong’ is a more accurate one. So, part of what you are doing when you report the results of a statistical analysis is that you want to report the probability that that finding could have occurred by chance. So we use something called ‘standard errors,’ which are used to construct what’s called the ‘p value’.

And that’s basically telling you, if you resampled from a population, how often you would expect to get a result this strong. So, suppose I was trying to measure the effect of the minimum wage. If you drew a bunch of cities and you drew their outcomes, there’s some noise there. And so the question is: Would you sometimes get a positive effect? How large is your effect relative to what you would just get by chance, under the null that there is actually no effect of the minimum wage?

The problem with “data mining,” or looking at lots of different models, is that if you look hard enough, you’ll find the result you are looking for. Suppose you are looking for the result that the minimum wage decreased employment. If you look at enough ways to control for things, you know, if you try a hundred ways, there’s a good chance one of them will come up with your negative result.

Russ Roberts: Even five [of them].

Susan Athey: Exactly. And if you were to just report that one, it’s misleading. And if you report it with a p-value that says, ‘Oh, this is highly unlikely to have happened,’ but you actually tried a hundred to get this, that’s an incorrect p-value. And we call that p-value hacking. And it’s actually led to sort of a crisis in science. We’ve found that in some disciplines, particularly some branches of psychology, large fractions of studies can’t be replicated because of this problem. It’s a very, very serious problem.
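
A toy simulation of the multiple-testing problem being described: run a hundred analyses of a policy that truly does nothing (here, against a hundred different simulated outcome measures, standing in for different subgroups or specifications), and the smallest p-value will usually look convincing. All data below are random noise.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
policy = rng.integers(0, 2, n)        # policy dummy with zero true effect

# One hundred outcome measures (think: employment in different subgroups,
# industries, or time windows), none of which is affected by the policy.
outcomes = rng.normal(0, 1, (n, 100))

X = sm.add_constant(policy)
pvalues = [sm.OLS(outcomes[:, j], X).fit().pvalues[1] for j in range(100)]

# Reporting only the most "significant" of the hundred analyses is p-hacking:
print("smallest p-value:", round(min(pvalues), 4))                 # usually < 0.05
print("analyses with p < 0.05:", sum(p < 0.05 for p in pvalues))   # around 5
```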

Russ Roberts: In economics I associate it historically with Ed Leamer — although ‘historically’ isn’t quite right; he’s alive, and he’s been a guest on EconTalk a few times. And I think his paper, “Let’s Take the Con Out of Econometrics,” was basically saying that if you are on a fishing expedition, even one not as absurd as the one we talked about where you tried a zillion things, if you are running lots of different variations, the classical measures of hypothesis testing and statistical significance literally don’t hold. So there’s a fundamental dishonesty there. It’s not necessarily fraud, because it sort of became common practice in our field: you could try different things and see what worked; but eventually you might come to convince yourself that the thing that worked was the one that was the weirdest or cleverest or confirmed your bias.

Susan Athey: Yeah, confirmed what you were expecting. Exactly. And so I think that this is a very big problem. I have a recent paper about using machine learning to try to develop measures of robustness. So, how do you think about this problem in a new world where you have, perhaps, more variables to control for than you have observations? So, let’s take like a simpler example of measuring the effect of a drug. Suppose I had a thousand patients, but I had also a thousand characteristics of the patients.

Russ Roberts: That might interact with whether the drug works.

Susan Athey: Exactly. Suppose that 15 of these 1000 patients have really great outcomes. Now, I know a thousand things about these people. There must be something that those 15 people have in common. Maybe they are mostly men — not all men, but mostly men. Maybe they are between 60 and 65. Maybe they mostly had a certain pre-existing condition. We have so many characteristics of them that if I just search, I can find some characteristics they have in common. Then I can say: my new data-driven hypothesis is that men between 60 and 65 who had the pre-existing condition have a really good treatment effect from the drug. Now, there are going to be some other people like that, too; but because this group had such good outcomes, if you just average in a few more average people you are still going to find really good outcomes for this group. And so that is the bad kind of data mining. Spuriously, these people just happen to have good outcomes, and I find something in the data that they have in common and then try to prove that the drug worked for these people.

Russ Roberts: An ex post story to tell about why this makes sense — and it turns out that if it were a different set of characteristics, I could tell a different ex post story.

Susan Athey: Exactly. So, you might have liked to do that as a drug company or as a researcher if you had two characteristics of the people. But it just wouldn’t have worked very well, because if it was random that these 15 people did well, it would be unlikely that they’d have these two characteristics in common. But if you have 1000 characteristics then you can almost certainly find something that they have in common, and so you can pretty much always find a positive effect for some subgroup. And so that type of problem is really a huge issue.

Russ Roberts: Because it doesn’t tell you whether in a different population of 1000 people with 1000 characteristics that those four things you identified [in the first study] are going to work for them.

Susan Athey: Exactly.

Russ Roberts: You had no theory.

Susan Athey: Yeah. And so I started from the hypothesis that was spurious: of 1000 people, 15 people are going to have good outcomes. And so even if there was no effect of the drug, you would “prove” that this subgroup had great treatment effects.
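
The drug-trial version of the trap, as a sketch with simulated data: the drug does nothing for anyone, but defining a subgroup by whatever traits the best responders happen to share makes that subgroup’s average outcome look better than the population’s. The patient counts and traits are all invented.

```python
import numpy as np

rng = np.random.default_rng(4)
n_patients, n_traits = 1000, 1000
traits = rng.integers(0, 2, size=(n_patients, n_traits))  # binary characteristics
outcome = rng.normal(0, 1, n_patients)                     # the drug does nothing

# Take the 15 best responders and find the three traits most of them share.
best = np.argsort(outcome)[-15:]
share_among_best = traits[best].mean(axis=0)
picked = np.argsort(share_among_best)[-3:]

# "Data-driven hypothesis": patients with all three of those traits respond well.
subgroup = traits[:, picked].all(axis=1)
print("subgroup size:", int(subgroup.sum()))
print("subgroup mean outcome:", round(float(outcome[subgroup].mean()), 2))
print("overall mean outcome: ", round(float(outcome.mean()), 2))
# The subgroup's average is pulled up by the very patients used to define it,
# even though the treatment has no effect at all.
```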

Russ Roberts: (21:30) And going the other way, if 15 people had tragic side effects, you would not necessarily conclude that the drug was dangerous. Or bad for the rest of the population.

Susan Athey: Exactly. And so the FDA (Food and Drug Administration) requires that you put in a pre-analysis plan when you do a drug trial to exactly avoid this type of problem. But the problem with that, the inefficiency there, is that most drugs do have heterogeneous effects. And they do work better for some people than others. And it’s kind of crazy to have to specify in advance all the different ways that that could happen.

You are actually throwing away good information by not letting the data tell you afterwards what the treatment effects were for different groups. Machine learning can come in here to solve this problem in a positive way, if you are careful. There are a few ways to be careful. The first point is just to distinguish between the causal variable that you are interested in and everything else, which are just characteristics of the individuals. So, in the example of the minimum wage, you are interested in the minimum wage. That variable needs to be carefully modeled; it’s not something you maybe put in and maybe leave out. You are studying the effect of the minimum wage: that’s the causal treatment. So you treat that separately. But then all the characteristics of the city are control variables.

Russ Roberts: And the [characteristics of the] population.

Susan Athey: And the population. And so, you want to give different treatment to those. And in classic econometrics you just kind of put everything in a regression, and the econometrics were the same — you’d put in a dummy variable for whether there was a higher minimum wage, and you would put in all these other covariates; but you’d use the same statistical analysis for both. And so I think the starting point for doing causal inference, which is mostly what we’re interested in, in economics, is that you treat these things differently.

I’m going to use machine learning methods, or sort of data mining techniques, to understand the effects of covariates, like all the characteristics of the patients; but I’m going to tell my model to do something different about the treatment effect. That’s the variable I’m concerned about. That’s the thing I’m trying to measure the effect of. And so, the first point is that, if you are using kind of machine learning to figure out which patient characteristics go into the model, or which characteristics of the city go into the model, you can’t really give a causal interpretation to a particular political variable, or a particular characteristic of the individual, because there might have been lots of different characteristics that are all highly correlated with one another.

What the machine learning methods are going to do is they are going to try to find a very parsimonious way to capture all the information from this very, very large set of covariates. So, the machine learning methods do something called regularization, which is a fancy word for picking some variables and not others. Remember, you might have more variables than you have observations, so you have to boil it down somehow. So, they’ll find — they might find one variable that’s really a proxy for lots of other variables. So, if you’re going to be doing data mining on these variables, you can’t give them a causal interpretation. But the whole point is that you really only should be giving causal interpretations to the policy variables you were interested in to start with — like, the drug. Like the minimum wage. So, in some sense you shouldn’t have been trying to do that anyways. So the first step is if you are going to do kind of data mining, you don’t give causal interpretations to all the control variables you did data mining over. The second point, though, is that even if you are not giving them causal interpretations, you can still find spurious stuff. If you tell a computer to go out and search, like the example that I gave of the patients, it will find stuff that isn’t there.
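
Here is a small sketch of what the regularization step can look like in practice, using the lasso as one common example (the conversation does not commit to a specific method). With more candidate controls than observations, the L1 penalty sets most coefficients exactly to zero; the kept covariates are proxies and, as stressed above, get no causal interpretation. The data are simulated.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 100, 500                     # more candidate controls than observations
X = rng.normal(size=(n, p))

# In this simulated world only the first five covariates actually matter.
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.5]
y = X @ beta + rng.normal(0, 1, n)

# The L1 penalty ("regularization") shrinks most coefficients exactly to zero,
# leaving a parsimonious set of selected controls.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("covariates kept:", selected.size, "out of", p)

# The kept covariates may simply be proxies for correlated ones that were
# dropped, so they get no causal interpretation; only the policy variable,
# modeled separately, would.
```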

The more variables you have, the more likely it is to find stuff that isn’t there. So one of the ways that I’ve advocated modifying machine learning methods for economics, to protect yourself against that kind of over-fitting, is a very simple method — it’s so simple that it seems almost silly, but it’s powerful — which is to do sample splitting. So you use one data set to figure out what the right model is, and another data set to actually estimate your effects. And if you come from a small-data world, throwing away half your data sounds like a crime. How am I going to get my statistical significance if I throw away half my data? But you have to realize the huge gain that you are getting: you are going to get a much, much better functional form, a much, much better goodness of fit, by going out and doing the data mining; but the price of that is that you will have overfit. And so you need a clean data set to actually get a correct standard error.
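
A simplified sketch of the sample-splitting recipe on simulated data: one half of the sample lets a lasso pick the controls, and the other half estimates the treatment effect and its standard error using only those controls. This is an illustration of the idea, not the exact procedure in Athey’s papers; for simplicity the lasso here ignores the policy dummy when selecting controls.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n, p = 400, 200
controls = rng.normal(size=(n, p))                 # many candidate controls
treat = rng.integers(0, 2, n)                      # randomized policy dummy
y = 1.0 * treat + controls[:, 0] - 0.5 * controls[:, 1] + rng.normal(0, 1, n)

# Half the data for model selection, half for estimation.
C_sel, C_est, t_sel, t_est, y_sel, y_est = train_test_split(
    controls, treat, y, test_size=0.5, random_state=0)

# 1) Selection half: let the lasso decide which controls help predict the outcome.
keep = np.flatnonzero(Lasso(alpha=0.1).fit(C_sel, y_sel).coef_)

# 2) Estimation half: plain regression of the outcome on the policy dummy and the
#    selected controls. Because the controls were chosen on different data, the
#    usual standard error on the policy coefficient is not contaminated by the search.
Z = sm.add_constant(np.column_stack([t_est, C_est[:, keep]]))
fit = sm.OLS(y_est, Z).fit()
print("estimated effect:", round(fit.params[1], 2), "std err:", round(fit.bse[1], 2))
```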

Russ Roberts: (26:45) Let me make sure I understand this. It strikes me as an improvement, but not as exciting as it might appear. So, the way I understand what you are saying is: let’s say you’ve got a population of low-education or low-skill workers and you’re trying to find out the impact of the minimum wage on them. What you do is only look at half of them. You build your model, and you do your data mining. And then you try to see whether the curve that you fit, or the relationship that you found in the first half of the sample, also holds in the second half. Is that the right way to say it?

Susan Athey: So, what I would be doing with the model is trying to control for time trends. I would have selected a model and, in a simple case, selected variables for a regression. And then I would use those selected variables and run my analysis in the second stage just as if I were doing it in the old-fashioned way. The old-fashioned way is: the gods of economics told you that these three variables were important, and so you write your paper as if those are the only three you ever considered, and you knew from the start that these were the three variables. And then you report your statistical analysis as if you didn’t search lots of different specifications. Because if you listened to the gods of economics, you didn’t need to search lots of different specifications. And so —

Russ Roberts: That’s in the kitchen. Nobody saw the stuff you threw out.

Susan Athey: Nobody saw you do it.

Russ Roberts: Nobody saw the dishes that failed.

Susan Athey: Exactly. And so you just report the output. Now, what I’m proposing is: let’s be more systematic. Instead of asking your research assistant to run 100 regressions, let the machine run 10,000. Then pick the best one. But you only use what you learned on a separate set of data.

Russ Roberts: It seems to me that you’re trying to hold its [the model’s] feet to the fire. You went through these 10,000 different models, or 100,000. You found 3 that were, or 1, whatever it is, that seemed plausible or interesting. And now you are going to test it on this other set of data?

Susan Athey: No, you’re going to just use it, as if the gods of economics told you that was the right one to start with.

Russ Roberts: Why would that be an improvement?

Susan Athey: Because it’s a better model. It would be a better fit to the data. And you don’t, by the way, in the first part you don’t actually look for what’s plausible. You actually define an objective function for the machine to optimize, and the machine will optimize it for you.

Russ Roberts: And what would that be, for example? So, in the case of traditional statistical analysis, in economics, I’m trying to maximize the fit; I’m trying to minimize the distance between my model and the data set, the observations, right?

Susan Athey: Exactly. So, it would be some sort of goodness of fit. So, you would tell it to find the model that fit the data the best — with part of the sample.

Russ Roberts: And so then I’ve got — traditionally I’d have what we’d call marginal effects: the impact of this variable on the outcome holding everything else constant. Am I going to get the equivalent of a coefficient, in this first stage?

Susan Athey: Sure. So, let’s go back. In the simplest example, suppose you were trying to do a predictive model. So, all I wanted to do is predict employment.

Russ Roberts: Yup. How many jobs are going to be lost or gained.

Susan Athey: And so if you think about a difference in difference setting, the setting where somebody changed the minimum wage, what you are trying to do is predict what would have happened without the minimum wage change. So, you can think of that component as the predictive component. So, I’m going to take a bunch of data that — this would be data from, say, cities that didn’t have a minimum wage change, or also data from the city that did but prior to the change. I would take sort of the untreated data and I would estimate a predictive model. And then the objective I give to my machine learning algorithm is just how well you predict.

Russ Roberts: Do the best job fitting the relationship between any of the variables we have to, say, employment in the city for workers 18–24 who didn’t finish college.

Susan Athey: Exactly. But a key distinction is that that is a predictive model; and the marginal effect of any component of a predictive model should not be given a causal interpretation. So, just because political sentiment was correlated with employment doesn’t mean that political sentiment caused employment.

Russ Roberts: Agreed.

Susan Athey: If you want to draw a conclusion like that, you’re going to need to design a whole model and justify the assumptions required to establish a causal effect. So, a key distinction is that these predictive models shouldn’t be given causal interpretations. But you tell the machine: ‘Find me the best predictive model.’
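
A rough sketch of the predictive step just described, on a fabricated city panel. A random forest stands in for the flexible predictive model (the conversation does not name a specific method): fit it on untreated observations, predict the treated city’s post-period counterfactual, and compare to what actually happened.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)

def make_city(treated, n_months=36):
    """Fabricated monthly employment for one city, with one observed covariate."""
    month = np.arange(n_months)
    covar = rng.normal(0, 1, n_months)             # e.g. some local condition
    employment = 100 + 0.4 * month + 2 * covar + rng.normal(0, 1, n_months)
    if treated:
        employment -= 3 * (month >= 24)            # invented true effect of -3
    return pd.DataFrame({"month": month, "covar": covar,
                         "employment": employment, "treated": int(treated)})

panel = pd.concat([make_city(True)] + [make_city(False) for _ in range(5)],
                  ignore_index=True)

# Fit a purely predictive model on untreated observations: the control cities
# plus the treated city's pre-period.
features = ["month", "covar"]
untreated = panel[(panel.treated == 0) | (panel.month < 24)]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(untreated[features], untreated["employment"])

# The model's prediction stands in for the treated city's counterfactual; the
# average post-period gap is the estimated effect of the change.
post = panel[(panel.treated == 1) & (panel.month >= 24)]
gap = post["employment"] - model.predict(post[features])
print("estimated effect:", round(gap.mean(), 1))   # roughly -3 in this toy setup
```

None of the model’s internal splits get a causal reading; only the gap between actual and predicted outcomes for the treated city is interpreted, and only under the identifying assumptions discussed above.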

Russ Roberts: (31:48) When you say ‘predictions,’ it’s really just fitting. Because I’m never going to be able to evaluate whether the relationships between the variables that I’ve been using or that the machine told me to use, whether they accurately predict going forward.

Susan Athey: Right. It’s only going to fit in the data set you have. And you are going to assume that all those relationships are stable. So what can go wrong is there could be some things that are true in your current data set, but spuriously true.

Russ Roberts: Right. They won’t hold in the future.

Susan Athey: They won’t hold in the future. Just from random variation in terms of how your sample ended up, it just happened that people who were high in some variables happened to be high in other variables: that those relationships were spurious and they are not going to hold up in the future. The machine learning tries to control for that, but it’s limited by the data that it has. And so if there’s a pattern that’s true across many individuals in a particular sample, it’s going to find that and use that.

So the models, by being uninterpretable, are prone to pick up things that may not make sense for making predictions over a longer term, or for holding up when the environment changes. Say there was a bad-weather winter that affected employment: the machine learning model might pick up things that are correlated with bad weather; and then the next year the weather isn’t bad.

Russ Roberts: And those other variables are still there.

Susan Athey: And those other variables are still there, but the relationship between them and employment no longer holds. So, there’s actually a big push right now in some subsets of the machine learning community to try to think about, like, what’s the benefit of having an interpretable model? What’s the benefit of having models that work when you change the environment?

They have words like ‘domain adaptation,’ ‘robust,’ ‘reliable,’ ‘interpretable.’ What’s interesting is that in econometrics we’ve always focused on interpretable, reliable models, and we have always wanted models that are going to hold up in a lot of different settings. And that push in econometrics has caused us to sacrifice predictive power. We never really fully articulated the sacrifice we were making, but it was sort of implicit that we were trying to build models of science, trying to build models of the way the world actually works — not trying to just fit the data as well as possible in the sample.

As a result, we’re not that good at fitting data. The machine learning techniques are much better at fitting any particular data set. Where the econometric approaches can improve is where you want your model to hold up in a variety of different circumstances and where you’re very worried about picking up spurious relationships or being confounded by underlying factors that might change. And so I think where I’d like to go with my research and where I think the community collectively wants to go, both in econometrics and machine learning, is to use the best of machine learning — automate the parts that we’re not good at hand-selecting or interpreting — allowing us to use much larger, much wider data sets, lots of different kind of variables — but figuring out how to constrain the machine to give us more reliable answers.

And finally, to distinguish very clearly the parts of the model that could possibly have a causal interpretation from parts that cannot. All the old tools and insights that we have from econometrics about distinguishing between correlation and causality — none of that goes away. Machine learning doesn’t solve any of those problems. Because most of our intuitions and theorems about correlation versus causation already kind of assumed that you have an infinite amount of data; that you could use the data to its fullest.

Even though in practice we weren’t doing that, when we would say something like ‘it’s not identified,’ we’d mean: you cannot figure out the effect of the minimum wage without some further assumption, like the assumption that the cities that didn’t get the change have a similar time trend to the ones that did. Those are conceptual, fundamental ideas. And using a different estimation technique doesn’t change the fundamentals of identifying causal effects.

Russ Roberts: (36:17) So, let’s step back and let me ask a more general question about the state of applied econometrics. The world looks to us, to economists, for answers all the time. I may have told this story on the program before — I apologize to my listeners, but Susan hasn’t heard it — but a reporter once asked me: How many jobs did NAFTA (North American Free Trade Agreement) create? Or how many jobs did NAFTA destroy? And I said, ‘I have no idea.’ And he said, ‘What do you mean, you have no idea?’ And I said, ‘I have a theoretical framework for thinking about this, that trade is going to destroy certain kinds of jobs: A factory that moves to Mexico, you can count those jobs; but what we don’t see are the jobs that are created because people have lower prices for many of the goods they face, and therefore they have some money left over and might expand employment elsewhere. And I don’t see that, so I can’t count those jobs. So I don’t really have a good measure of that.’

And he said, ‘But you are a professional economist.’ And I said, ‘I know’ (whatever that means.) And he said, ‘Well, you’re ducking my question.’ I said, ‘I’m not ducking your question: I’m answering it. You don’t like the answer, which is: I’d love to know but I don’t have a way to answer that with any reliability.’ And so, similarly, when we look at the minimum wage, when I was younger, everyone knew the minimum wage destroyed lots of jobs; it was a bad idea. In the 1990s with Card and Krueger’s work, and others, people started to think, ‘Well, maybe it’s not so bad,’ or ‘Maybe it has no effect whatsoever.’

And there’s been a little bit of a pushback against that. But now it’s a much more open question. Recently, I’d say the most important question in a while has been: Did the [fiscal] stimulus create jobs, and how many? People on both sides of the ideological fence argue both positions: some say it created a ton of jobs, others say it didn’t; some even would say it cost us jobs. So how good do you think we are at the public policy questions that the public and politicians want us to answer? And have we gotten better? And do you think it’s going to continue to improve, if the answer is yes? Or do you think these are fundamentally artistic questions, as you suggested they might be earlier in our conversation?

Susan Athey: I think we’re certainly getting a lot better. And having more micro data is especially helpful. So, all of these questions are really counterfactual questions. We know what actually happened in the economy: that’s pretty easy to measure. What we don’t know is what would have happened if we hadn’t passed NAFTA. Now, something like NAFTA is a particularly hard one, because, as you said, the effects can be quite indirect.

Russ Roberts: It’s a small part of our economy, contrary to the political noise that was made around it.

Susan Athey: But also the benefits are maybe quite diffuse.

Russ Roberts: Correct.

Susan Athey: And so, one side can be easy to measure, like a factory shutting down. But the benefits from, say, lots of consumers having cheaper goods and thus more disposable income to spend on other things, the benefits to all the poor people who are not having to pay high prices for products: those benefits are much harder to measure because they are more diffuse. So that’s a particularly hard one.

For something like the minimum wage, you know, having more micro-level data is super helpful. In the past you only had state-level aggregates. Now you have more data sources that might have city-level aggregates, plus more, different data sources to really understand at the very micro level what’s happening inside the city to different industries, to different neighborhoods, and so on. So better measurement, I think, can really, really help. If you think about macro predictions, you might ask a question: ‘Gee, if we had a major recession, what’s going to happen in a particular industry?’ Well, in the past, you might have just had to look back and say, ‘Gee, what happened to that industry when the GDP (Gross Domestic Product) of the United States went down?’

If you have more micro data you could look back at the county level, or even the city level, and see the specific economic performance of that specific city, and see how that related to purchases in that particular industry. And then you have much more variation to work with. You don’t just have, like, 2 recessions in the past 10 years — like, 2 data points. Because each locality would have experienced their own mini-recession. And they might have experienced more recessions, in fact. And so you could really have much more variation in the data: many more mini-experiments. Which allows you to build a better model and make much more accurate predictions. So, broadly, modeling things at the micro level rather than at the macro level can be incredibly powerful.

But that’s only going to work well if the phenomena are micro-level phenomena. So, if you think about the economics: say, the average income in a county is a good predictor of restaurant consumption in that county. You can pretty much look at that locally. But NAFTA, or something like that, might affect, you know, people’s movements between cities. It could have other types of macroeconomic impact, so each city wouldn’t be an independent observation. And so for those kinds of macro-level questions, yeah, more micro data can help, but it’s not giving you, you know, 3000 experiments where you used to have one.

Russ Roberts: (41:49) It strikes me that in economics we’re always trying to get to the gold standard of a controlled experiment. And since we often don’t have controlled experiments, we try to use the statistical techniques that we’ve been talking about to deal with the reality that we don’t have a controlled experiment, to control for factors that actually change. And we’re inevitably going to be imperfect, because we don’t observe every variable; and inevitably there are things that are correlated that we don’t understand.

Yet, I feel like, in today’s world, at the super-micro level, and in particular on the Internet with certain companies’ access to data about us, they can do something much closer to a controlled experiment — what’s called an A/B test — which is something we can’t do as policy makers, as economists. Do you think that’s true? Are we learning? Are companies who are essentially trying different treatments on their customers getting more reliable results than they would have, say, by hiring an economist to just do analysis on their customers’ [data], the way they did in the past? With, say, an advertising campaign, which was always very difficult.

Susan Athey: This idea of the A/B test is really incredibly powerful. And I think when people think about Silicon Valley they imagine Steve Jobs in a garage, or the invention of the iPhone or the iPad, and they think that’s what Silicon Valley is. But most of the innovation that happens in the big tech companies is incremental innovation. And the A/B test is probably the most impactful business process innovation that has occurred in a very long time. So, just to think about how does Google get better, they run 10,000 or more randomized control trials every single year. So, nothing changes in the Google search engine — not the fonts, not the color, not the algorithms behind it that tell you what to see — without going through a randomized control trial.

Russ Roberts: It’s not a brilliant person sitting in an armchair trying to figure out [the best] font — which is what Steve Jobs sometimes did.

Susan Athey: Exactly. That is not the way that Amazon or Bing or Google or eBay operate. That’s not how they improve. They might have a brilliant hypothesis, but if it’s not borne out in a randomized control trial, it doesn’t ship to the customers.

Listen to Adam D’Angelo talk about experimentation at Quora — starts at the 32:56 mark

That allows decentralization. That allows thousands of engineers, even interns, to make changes. So, you could be a summer intern at Facebook and have some idea about improving their ranking algorithms; you can write the code; you can test it; and it can go live on users within days. As long as it follows certain rules and guidelines, and as long as the randomized control trial showed that users loved it. So that’s a really powerful concept. Now, the A/B tests are used to evaluate the machine learning algorithms that create all these great results, that figure out which news story to show you, which search results to show.

Russ Roberts: And ‘A/B’ meaning one group, A, gets the treatment; group B doesn’t.

Susan Athey: Exactly. Well, it’s Group B gets the treatment; group A is the control group. So they are able to separate correlation from causality almost perfectly. That is, one group saw the light blue and the other group saw the dark blue.

Russ Roberts: And they assign who got A or B totally randomly.

Susan Athey: They assign that randomly. And so if the people with the dark blue seemed happier — that is, they came to the website more, they clicked more, other types of measures of user happiness — then the dark blue goes to all the users. So they are going to get the causal effect of dark blue because they’ve randomly assigned some people to light blue and some people to dark blue, just like a drug trial. And so they are going to say the causal effect of dark blue is that people come to the site more; therefore, we are going to use dark blue.

Russ Roberts: Even though they don’t understand why that causal relationship might be there, with a large enough number of observations and an effect that’s sufficiently large, or even small but measured over a large enough set of people, they’ll know that there’s a real impact.

Susan Athey: Exactly.

Russ Roberts: There’s still some chance of randomness.

Susan Athey: Yeah. And they’re going to be very strict about statistical significance: they’re going to make sure they have enough data to be confident that dark blue really is better before they ship dark blue to all the users. So they are very rigorous in their scientific analysis. In fact, some of the most important meetings at Google or at Amazon are the meetings where you review the results of your randomized control trials, because that’s going to determine what gets shipped. Those might take place, say, twice a week; and you might have a separate meeting for the advertising at Google, say, than for the algorithms that produce the natural results. But ultimately, that’s how the product changes: through these randomized control trials.
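
The statistical read-out of such an A/B test can be as simple as a two-sample comparison of an engagement metric between the randomly assigned groups. The sketch below uses invented user counts and return rates; because assignment is random, the difference in rates estimates the causal effect of the change.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(8)

# Invented experiment: users randomly assigned to the existing color (A) or the
# new color (B); the metric is whether a user returned to the site that week.
n_a, n_b = 50_000, 50_000
returned_a = rng.binomial(n_a, 0.300)   # control return rate (made up)
returned_b = rng.binomial(n_b, 0.310)   # treatment rate slightly higher (made up)

lift = returned_b / n_b - returned_a / n_a

# The p-value says how often a gap this big would show up by chance if the
# change actually did nothing; the team would only ship with enough evidence.
z, p = proportions_ztest([returned_b, returned_a], [n_b, n_a])
print(f"lift: {lift:.4f}   z = {z:.2f}   p = {p:.4f}")
```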

Russ Roberts: (47:03) In the academic world, if you’re a development economist and you think, ‘Well, maybe having access to textbooks is going to help improve the education in this village,’ you give them the textbooks and you don’t give this other village the textbooks — or you give [textbooks to] half the schools in the village, so you end up making it more controlled for the things you want — and you find out what the impact is. And you might find nothing, or you might find a lot. And whether you find a lot or find nothing, you’re usually going to publish that paper and advance our knowledge about — we hope — the relationship between some variable and educational outcomes in very poor places in the world.

These natural experiments that are going on in Silicon Valley, of course they are proprietary. And nobody publishes a paper that says, ‘Dark blue is better than light blue.’ How does that [non-transparency] change things? And, it strikes me that I guess it’s sort of published, because when they change the color, they change the font, I guess everybody has learned something, if they are paying attention. Are these companies looking at other people’s websites to see what’s working for them? Or they don’t need to because they are already doing it for their own? Or do they assume their customers are different? Right? There’s just a lot of interesting issues there.

Susan Athey: It’s true that most of those 10,000 experiments were not designed to uncover some fundamental truth about the world. So, economists’ experiments are often designed to understand risk aversion, or understand the effect of class size, or some kind of generalizable parameter. There’s an argument in economics about how generalizable are these experiments, and so on. But roughly you are hoping for a generalizable result.

An experiment that’s done at tech companies is really specific to their interface; and often the experiments are actually — color is a very simple one, but often it’s just Algorithm A versus Algorithm B. And both algorithms are black boxes and you won’t really know why one of them worked better than another. It’s just — it was better, so we use it. And so, even if they were publishing it, these wouldn’t be fundamental truths that are generalizable. So, I actually did release the result of an experiment I ran at Bing where I re-ranked search results. And so I found the causal effect of taking a link from the top position to the third position, say. But that was a very unusual experiment. That wasn’t the type of experiment that was typically —

Russ Roberts: Correct. Because they don’t really care. To measure that isn’t so useful, necessarily.

Susan Athey: Right. Well, that particular fact does have some generalizable use. But many things are not like that. So, what’s interesting here is, first of all, how scientific these firms are, and how statistically careful they are. Statistically sophisticated. So, these boring formulas about p-values are right at the heart of how innovation happens at these firms.

Some things I worked on when I consulted for Bing were things like improving the statistical validity of the A/B testing platform to account for various factors, like the fact that if the same user showed up in the treatment group multiple times, their results may be correlated with one another. That’s just a very simple, early example of the kind of thing you need to take care of to make sure that you are actually reporting correct statistical results.

And people are very, very serious about getting the statistics right, because, you know, you are making decisions that affect bonuses and promotions and operating on billions of dollars worth of revenue. It’s very important. I think that that’s one thing that’s interesting. So they are getting very scientifically valid results, but mostly about things that are very specific to those firms: that are very context-specific, that are not generalizable; and they are not looking for fundamental truths.

So it’s not like they are hiding all of this great social science information from the rest of the world. It’s just that that’s not what the experiments are about. Also, many of these tech firms actually do allow social scientists to publish. So we have learned some very interesting fundamental facts from social science collaborating with tech firms. In fact, those are the things they are more likely to allow them to publish, rather than kind of proprietary, business-secrety type things. That’s an interesting phenomenon.

However, when you put social scientists together with these things, you can actually learn some pretty important things. So, there’s a large experiment on Facebook where they randomized nudges for voting. And they found that when you show people that their friends voted, that encouraged people to get out and vote. My experiment on re-ranking search results shows the generalizable thing that how a news site or how a search engine or how a commerce engine ranks results has a very large causal effect on what people buy. What they read. It tells us that technology platforms have a big impact on the informativeness of the economy as well as the winners and losers in commerce. So, that’s an important, generalizable fact.

We’ve also seen other scholars collaborate with tech companies to learn things like the effect of incorporating food health information on food poisoning. So, if somebody like Yelp incorporates more information about the quality of the health scores, that can affect people’s behavior and can affect people’s health.

So there is actually this collaboration that goes on; and I think it’s only growing. A final thing that’s getting back to the statistics part of it: So, some of my research has been specifically about helping tech firms understand better what they are getting out of an A/B test. My methods would allow a firm to look at the results of a randomized controlled experiment and figure out for whom is it good and for whom is it bad — in this personalized way — and still have valid standard errors. To still have valid p-values. To still be able to evaluate accurately whether this result is spurious or whether it’s real — whether it will hold up in another sample.

I had mentioned earlier that the simplest way to make sure that your results are statistically valid is to use sample splitting. Some of my more advanced statistical techniques incorporate sample splitting behind the scenes, but don’t actually cause you to use only half your data. They are techniques that average a bunch of models. So, you repeatedly take half the data to select a model and the other half to estimate. You do that over and over again, with different splits, and then average up the results. And each one of the elements is honest, because for each individual model you used half the data to choose which variables were important and the other half to estimate. So, if each individual one is honest, the average of all of them is honest. But you’ve used all of your data. You haven’t thrown anything away. And so that type of technique can give you valid confidence intervals, valid statistical conclusions.
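
A stylized sketch of the ‘split, estimate honestly, repeat, and average’ idea on simulated data, reusing the select-then-estimate recipe from earlier. It is meant only to show the mechanics, not Athey’s exact estimator: each split is honest on its own, and averaging across many random splits uses all the data.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
n, p = 400, 200
controls = rng.normal(size=(n, p))
treat = rng.integers(0, 2, n)
y = 1.0 * treat + controls[:, 0] - 0.5 * controls[:, 1] + rng.normal(0, 1, n)

def honest_split_estimate(seed):
    """One honest split: select controls on one half, estimate on the other."""
    C_sel, C_est, t_sel, t_est, y_sel, y_est = train_test_split(
        controls, treat, y, test_size=0.5, random_state=seed)
    keep = np.flatnonzero(Lasso(alpha=0.1).fit(C_sel, y_sel).coef_)
    Z = sm.add_constant(np.column_stack([t_est, C_est[:, keep]]))
    return sm.OLS(y_est, Z).fit().params[1]

# Repeat over many random splits and average: each split is honest on its own,
# and across splits every observation is used for both selection and estimation.
estimates = [honest_split_estimate(s) for s in range(50)]
print("averaged effect estimate:", round(float(np.mean(estimates)), 2))
```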

That’s a modification of the way the standard machine learning tools work, but it’s one that doesn’t actually hurt their performance very much but still gives you better statistical properties. And so those are the types of innovations that a social scientist is interested in, because we don’t just want to have a good prediction. We want to have statistical guarantees and statistical properties of the predictions that we make, because they are being used as part of causal inference.

Russ Roberts: (54:50) Let’s close on a related point, which is: I was struck by your remark that, in these companies in Silicon Valley there is a lot of money at stake, so they are really careful about how they measure stuff, and if they do it reliably, and if it’s likely to pay off. And yet in economics, we publish papers that affect people’s lives. Whether it’s the stimulus package or the minimum wage or drug policy. And, I bet we’re not as careful, and I think we should be, in terms of the claims we make for our results. And yet the urge to publish and the urge to get attention for ourselves — we’ve got a different level of skin in the game compared to the Silicon Valley experiment. And it’s also a different amount of skin in the game relative to the people whose lives are being affected. React to that.

Susan Athey: I think what’s interesting is that these different disciplines and different contexts — whether it’s scientific economists or business economists, or machine learning people, or tech firm internal people — actually all of them make systematic mistakes. They’re just different mistakes.

One of the things I see is that the tech firms are very, very serious about getting their statistics right in the short term. But it’s very, very hard to operate a business based on all of these experiments if what you really care about is long-term outcomes. So, suppose that you make a change that’s sort of bad for consumers in the short term, but good in the long term. If you change the interface, it almost always confuses users to start with. And it takes a while for them to really adapt to it. Because the tech firms are trying to run so many different experiments and learn about so many different things, there’s a bias towards running short-term experiments. And it’s expensive to run an experiment for, you know, months. So they tend to be systematically biased toward things that are good in the short term. And bad in the long term.

If you look at these randomized control trials, they are great. And they are perfectly statistically valid for what they are measuring — which is the short term. And if you add up the results of all of their experiments over the years, they would often predict something like, you know, a 300% improvement in revenue or user base. But then, in fact, you only had a 10% change. And that’s because each of those experiments was actually a short-term experiment.

So they tend to systematically innovate in the wrong dimensions. They are awesome at incremental short-term innovation — getting the fonts right, you know, all that stuff. But if there is something more radical, they’ll sometimes miss it.

Russ Roberts: It’s also expensive. Not just in terms of the time. You don’t want some large section of your user base having the wrong exposure, whatever it is.

Susan Athey: I think they overdo it. But I think even if you were a perfectly rational organizational designer, because of the costs and benefits and because of the difficulty of measuring the long term, you would probably set up your organization to innovate more in the short-term directions than in the long-term directions, because it’s more expensive to measure the long-term directions. But that leads to biases. That leads to firms not fully exploring the innovation space, and so on.

As I’ve moved among all of these disciplines, what I’ve seen is that every group, whether it be academic or business, tends to narrow its focus and get really good at a few things — for the sake of scientific progress, for the sake of smoother-functioning organizations, for the sake of more rapid innovation — and ignore other problems. The machine learning community has systematically ignored causal inference. They’ve gotten awesome at prediction. They’ll kill an economist every day of the week, in terms of prediction. But they are much less well developed when it comes to measuring the effect of a policy.

That gap is actually where I’ve come to work — because it was a gap. Economists are very good at small-data causal inference. And machine learning people are very good at big-data prediction. But now we have economic data sets that are big data. And so, let’s go out there and get good at big-data causal inference, with systematic model selection, being honest about how we are using the machine to choose our model and getting valid statistical inference out at the end, rather than hiding it all in the kitchen and not telling people about the back-end work that you did.

My vision of the future of applied work is that we are going to be able to be more honest as researchers. We’re going to be able to delegate some component of our analysis to the machine and be honest about how we’re doing it. Actually, it’s much easier to correct for what an algorithm is doing than for a researcher arbitrarily selecting among columns of an Excel spreadsheet that a research assistant gave to you.

So let’s just take the part that we were kind of embarrassed about and hand it to a machine and be honest about it. And then we can focus all of our energy on the part that’s truly hard, which is the causal inference component. And then, as the article you referred to at the beginning suggests, let’s also get more systematic about the robustness checks and the sensitivity analyses and all these other supplementary analyses we do for the hard part: the causal inference part.

--

Russ Roberts

I host the weekly podcast EconTalk, and I'm the co-creator of the Keynes-Hayek rap videos. My latest book is How Adam Smith Can Change Your Life.