Generalizability by Representativeness

TL;DR: Many psychological studies rely on reasoning by representativeness to argue that their studies capture the causes of important phenomena in the real world. This is fallacious, and psychologists should stop doing it.

In this post I’ll explain what the representativeness heuristic is, provide an example of a recent paper that reasons using it, and try to explain why this is bad.

This idea has been kicking around in the back of my mind for years, but only recently did a salient enough example pop up that I felt compelled to write this.

What is reasoning by representativeness?

(Note: You can skip this if you are already a Judgment and Decision-Making expert)

Maya Bar-Hillel, in her chapter on “Studies of Representativeness” in the classic edited volume, Judgment Under Uncertainty: Heuristics and Biases, describes this reasoning as follows:

Daniel Kahneman and Amos Tversky have proposed that when judging the probability of some uncertain event, people often resort to heuristics, or rules of thumb, which are less than perfectly correlated (if, indeed, at all) with the variables that actually determine the event’s probability. One such heuristic is representativeness, defined as a subjective judgment of the extent to which an event in question is “similar in essential properties to its parent population” or “reflects the salient features of the process by which it is generated” (Kahneman and Tversky, 1972b, p. 431, 3).

Ok, let’s make this a little bit more concrete. The idea here is that when you ask someone to assess the probability of an item belonging to a group, they think about the features of the item, the prototypical features of the group, and then compare them. To the extent that the item seems “representative” of the group – i.e. the more that the item shares features with the group – the more likely someone would judge the item’s membership in the group to be.

The classic illustration of how this reasoning can go astray is in the so-called ‘Linda problem’. This is the set-up:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which is more probable?

  1. Linda is a bank teller.
  2. Linda is a bank teller and is active in the feminist movement.

The fallacy is that most people will say that Linda being a bank teller and active in the feminist movement is more likely than Linda being a bank teller. This violates the laws of probability – the probability of an event conjoined with another event cannot be greater than the probability of the event alone. That’s why this is called the “conjunction fallacy.” And yet, the description of Linda intuitively fits our concept of a feminist bank teller much more than our concept of a bank teller alone.
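To see why the conjunction can never win, here is a minimal sketch in Python. The joint probabilities are made up purely for illustration (they are my assumption, not data from any study); the point is only that the inequality holds for any valid set of numbers, because "bank teller" includes every feminist bank teller plus every non-feminist one.

```python
# Hypothetical joint distribution over (is_bank_teller, is_feminist).
# These numbers are invented for illustration; only the inequality matters.
joint = {
    (True, True): 0.04,    # bank teller and feminist
    (True, False): 0.01,   # bank teller, not feminist
    (False, True): 0.60,   # not a bank teller, feminist
    (False, False): 0.35,  # neither
}


def p_teller(dist):
    """Marginal probability of being a bank teller (feminist or not)."""
    return sum(p for (teller, _), p in dist.items() if teller)


def p_teller_and_feminist(dist):
    """Probability of the conjunction: bank teller AND feminist."""
    return dist[(True, True)]


# The conjunction is one of the terms summed into the marginal,
# so it can never exceed it, however "representative" it feels.
print(p_teller_and_feminist(joint) <= p_teller(joint))
```

Even when the description makes the feminist cell very large relative to the non-feminist one, the marginal still contains it, so option 1 is always at least as probable as option 2.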

To be clear, there’s a lot of debate about what the Linda problem shows, whether it is in fact a mistake that people make, or a function of wording, framing, conversational pragmatics, etc. Whether or not it’s a fallacy I think is not that interesting (others may disagree!). What’s more interesting is how it does a good job illustrating the pull of intuition in cases like this – when I think about this problem it feels like I can perceive the intuitiveness of this heuristic. It feels to me like it’s more likely that Linda is a feminist bank teller!

Now what’s the problem with reasoning in this way? Bar-Hillel again, with the harmful consequences:

Although in some cases more probable events also appear more representative, and vice versa, reliance on the representativeness of an event as an indicator of its probability may introduce two kinds of systematic error into the judgment. First, it may give undue influence to variables that affect the representativeness of an event but not its probability. Second, it may reduce the importance of variables that are crucial to determining the event’s probability but are unrelated to the event’s representativeness.

In other words, reasoning this way, one might (1) be misled into thinking that the probability is higher than it actually is because some irrelevant features are shared between the item and the category, or (2) underweight how important other features of the item are to the probability that the item is in the category.

How do psychologists rely on representativeness in their reasoning?

Here’s how psychologists argue this way. Sometimes when experimental psychologists demonstrate an effect in their lab, they want to make a claim that this effect actually matters for real world behaviors (i.e. it generalizes). Think about something like stereotype threat – why is it important? It’s important because psychologists believe that stereotype threat is a causal mechanism that underpins outcomes that we see happening in the real world (e.g. a race gap in test taking).

How do they make that argument? Well, the situation of the experiment is designed in such a way that it shares salient features with the real world phenomena – i.e. that the lab situation is representative of the real world. The study is explicitly trying to capture all the most important aspects of the situation in testing the hypothesis. In the case of stereotype threat, the situation of the experiment is nearly identical to that of the real world (test-taking), so as long as they captured enough of the most important aspects of the real world, one could say the effect generalizes. Of course in arguments about generalizability we often argue about whether some important detail truly was captured – e.g. in stereotype threat people sometimes argue that the thing missing from the lab experiment is incentives/stakes. People won’t suffer these effects when the stakes are high, and thus lab studies can’t underpin whatever test gap we see in the real world.

Now sometimes this kind of representative reasoning is probably ok: the situation in the lab can certainly be exactly the same as the real world. And oftentimes psychologists will run field studies to show that their mechanism is driving some real world outcome. Stereotype threat does seem like a pretty good analog of real test taking. The problem is when the situation of the lab isn’t the same as the real world and yet psychologists rely on the public’s representativeness intuitions to convince people of the phenomenon’s importance.

Which brings me to a recent paper published in Science, maybe the most visible and prestigious scientific journal, “Prevalence-induced concept change in human judgment”, written by a group of perhaps the most prominent social psychologists currently working in our field (the group includes Dan Gilbert and Tim Wilson – this is like seeing Lennon/McCartney in album liner notes). The paper provides a number of beautiful demonstrations of an effect:

In a series of experiments, we show that people often respond to decreases in the prevalence of a stimulus by expanding their concept of it. When blue dots became rare, participants began to see purple dots as blue; when threatening faces became rare, participants began to see neutral faces as threatening; and when unethical requests became rare, participants began to see innocuous requests as unethical. This “prevalence-induced concept change” occurred even when participants were forewarned about it and even when they were instructed and paid to resist it.

The studies are well run, seem well-powered, and don’t in general seem to suffer from any internal validity issues. The problem is that the generalizability of this effect is argued for on the basis of representativeness:

These results may have sobering implications. Many organizations and institutions are dedicated to identifying and reducing the prevalence of social problems, from unethical research to unwarranted aggressions. But our studies suggest that even well-meaning agents may sometimes fail to recognize the success of their own efforts, simply because they view each new instance in the decreasingly problematic context that they themselves have brought about. Although modern societies have made extraordinary progress in solving a wide range of social problems, from poverty and illiteracy to violence and infant mortality (22, 23), the majority of people believe that the world is getting worse (24). The fact that concepts grow larger when their instances grow smaller may be one source of that pessimism.

This is clearly an instance of reasoning by representativeness. The studies do show that people’s judgments can shift within a study when the prevalence of a target decreases. But how do we know that this phenomenon has anything to do with these real world cases? In the wider world certainly there are cases where problems are gradually solved, and in those cases it can certainly seem like people often become harsher judges as time goes on 1.

Per Bar-Hillel, reasoning this way about psychology studies can steer us wrong in two ways.

“First, it may give undue influence to variables that affect the representativeness of an event but not its probability.” On the basis of this study we might believe that the only relevant fact about our judgments of a social phenomenon’s prevalence is our own personal history or memory of that phenomenon. I might think that because I perceive that racism has become less prevalent, other people’s focus on it could stem from this change in their standards, and really it’s not so bad. Is that actually true?

“Second, it may reduce the importance of variables that are crucial to determining the event’s probability but are unrelated to the event’s representativeness.” On the basis of this study we might ignore the fact that these judgments are social and historical processes that play out. Whether I judge something to be racist stems from more than just repeated observations – it’s not just ‘in my head.’ There could be plenty of reasons why people continue to be sensitized to instances of injustice that are not simply a shifting standard – they were taught, they have personal experiences that make such behavior salient, a different bias is at work (what if it’s availability? or an availability cascade, per Cass Sunstein?), or maybe the injustice is simply invisible to people of privilege who don’t perceive all the ways in which that injustice is manifest. There’s a lot going on here, and flattening out all this nuance into a basic property of statistical reasoning might miss more than it captures.

I will say this – I appreciate their use of “may” to give themselves an out (they aren’t really making this strong claim!), but I worry that the potential implications of their work are what got this paper into Science in the first place. I would love to live in a future where instead of telling stories about generalizability we actually test them. And instead of the most prominent journal in science publishing such claims uncritically, they demand more from researchers. Or maybe we can just eliminate “Discussion” sections from papers. That would suit me as well.

Notes:

  1. It’s striking to me that they don’t seem willing to name any actual social problems that might be a case of this – my guess is that if they did, they would have to confront all the ways in which their lab studies are not representative of those situations.

A modest proposal for solving the gender problem in technology

TL;DR – The way to get more women into tech is to pay women more than men.

My starting point for this argument is the contention that diversity would improve the quality of tech companies’ products and by extension, increase shareholder value. If you don’t believe this proposition, go read Scott Page’s The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies and convince yourself. I’ll wait.

Ok, now that we are on the same page about that, the path forward is clear. In Econ 101, you learn that when the supply of something is smaller, the price is higher. There are fewer women in technology, and given their value to companies, it is clear that the market should be paying more money for them. Therefore, my modest proposal for solving the problem of gender diversity (or any diversity you might want) is to pay more to women engineers than male engineers. If large tech firms actually did this, I have no doubt that top firms (Google, Apple, Facebook, Microsoft, etc) would see their diversity numbers improve markedly. I also have little doubt that we would get more women in tech overall. In fact, maybe in our professional lifetimes we could get to a point where men could be paid as much because the supply is the same (a dream, I know).

You might object that this is illegal–well, if we believe that diversity improves shareholder value, then you are not discriminating (i.e. it’s not equal work since women give you better outcomes, so paying women more is ok).

So, big tech company, if you are unwilling to do this and put your money where your mouth is–why is that? Is it that you are just paying lip service to diversity but unwilling to really do anything about it? If you had another factor of production that was important and you didn’t have enough of it, you’d pay a premium. Why aren’t you at least trying?

(At the very least you might consider increasing your referral bonuses for referring female engineers).

Three authors misunderstanding nudges

David Berreby’s critique of nudging
Jeremy Waldron’s critique of nudging
Steven Poole’s critique of nudging

I am being nudged and I know it. Worse yet, in some cases I literally know the person nudging me. Google is well-known for its behavioral researchers in People Operations, and a friend from graduate school is one of their “nudgers”. When I go into a kitchen and see healthy snacks prominently displayed and the unhealthy snacks hidden away in opaque jars, I know I’m being manipulated (this was literally my friend’s research). When I get an email comparing my retirement contributions to my peers, I know what effect they are drawing upon. Indeed there are researchers I know at CMU using the same techniques to encourage energy conservation. And yet, these nudges influence me just the same. How can that be?

An argument that all three of these pieces make is that nudging depends upon the ignorance of the populace in order to be effective. And this is an affront to human dignity.

Berreby writes,

If the “nudge” works correctly, you can’t evaluate the attempt to influence you, because you aren’t aware of it.

Waldron writes,

Sunstein says he is committed to transparency, but he does acknowledge that some nudges have to operate “behind the back” of the chooser.

Poole writes,

Nudging depends on our cognitive biases being reliably exploitable, and a Stanovichian programme of mindware upgrades would interfere with that. In this sense, nudge politics is at odds with public reason itself: its viability depends precisely on the public not overcoming their biases.

All three of these authors make the same mistake: they assume that awareness and unawareness are binary, and that “mindware upgrades” can lead to a person free of bias, which is at odds with the nudgers’ goals. As I alluded to in my account of being nudged above, however, knowledge of biases does not equate to expertise in overcoming them. As Wilson and Brekke argued in their fantastic (classic) paper “Mental Contamination”, actually correcting for a bias involves the following steps:
1. Awareness of unwanted processing through introspection or application of theory.
2. Motivation to correct bias (I’ll come back to this one)
3. Awareness of direction and magnitude of the bias
4. Ability to adjust response via mental control

So, to describe this in terms of the snack example above: I come into the kitchen–if I’m not in a hurry then I might think about how the snacks are laid out. Then, being aware of the bias I think, “Do I just grab the fruit or am I motivated to engage in extra effort to grab the candy?” Then, when weighing the two options, I might consider “how much is the layout pushing me to grab the fruit?” That last question is quite hard to answer! It’s impossible to introspect to access my unbiased preference, and anyway, even if I could, I now have this biased preference all the same–will it affect my enjoyment of the candy if I override my default reaching for the fruit? That last point is key–Hal Arkes, in another classic paper, “Costs and benefits of judgment errors: Implications for debiasing”, points out that preference biases (like risk aversion, for instance), which he calls psychophysical errors, present themselves as what we want. One doesn’t change the nature of one’s preferences (the heart wants what it wants)–in the best case one can reframe those options. So once the accessibility of the fruit makes it more desirable, knowing that it was accessibility-driven makes it no less desirable-seeming.

I want to say one more thing on this Straussian idea of elites scheming to apply the psychological effects on the unsuspecting masses. I think it’s important to keep in mind that the techniques that are most likely to work as nudges are the most time-honored and well-known techniques from psychology! So the basis for the most effective nudges are likely to be the manipulations that are the most widely known. This provides an important check on the scheming of “Government House utilitarianism”, as Waldron quotes Bernard Williams describing this setup.

Therefore, knowledge of how (or that) we are being manipulated in many cases will have no bearing whatsoever on whether the nudges work. We can safely continue on our “Stanovichian programme of mindware upgrades”, as Poole puts it. (Stanovich is a psychologist who writes about using meta-cognition to overcome biases). Psychologists, Cass Sunstein included, would like to see people more aware of how their mind works!

Hyper-success and the globalization of envy

Economists have written about the star system and the impact of globalization on inequality 1. Capsule version: global markets mean that talent can be monetized at a much greater scale, e.g. the greatest athletes of today make way more money than the greatest athletes of 50 years ago. Not to underplay the role of structural imbalances of power in the rise of the super-wealthy, I think it is nonetheless safe to say that at least part of the rise of the fractal rich has been the result of global markets.

One aspect of this that I think is underexplored is how the presence of the hyper-successful in our society fosters envy in a more pervasive way than before. One important aspect of globalization is the tyranny of small differences. What I mean by this is that in these global competitions (many of which are winner-take-all or winner-take-most), the difference between first and second in terms of ability is ever shrinking. The result of this is that outcomes for people who are very similar on observable talent have diverged. Think about the 8th place Olympian who gets few endorsements or wealth but is still better than 99.9999% of people at their sport, or the ace programmer who makes a very nice middle class living but is no tech millionaire. These people are all tremendously successful, and yet they have ready models of people who are only a little bit better and way way more successful. It’s easy to see how people in that position might feel envy despite the fact that they are living objectively better than most everyone else in all of human history. In fact, in a really fantastic paper on “Hypermotivation”, Scott Rick and George Loewenstein interpret experimental evidence showing that people in more fierce competitions are more likely to cheat out of loss aversion, and a sense that they are only getting themselves what they feel they are entitled to or deserve (they talk about this in the context of fraud in psychology).

Nonetheless, I do think that this kind of envy is what motivates many in the top 5% to increase their output rather than enjoy their added productivity in the form of more leisure. Says the person enjoying his Saturday.

Notes:

  1. Chrystia Freeland writes about Sherwin Rosen’s work on the “economics of superstars” to great effect in her wonderful book, Plutocrats: The Rise of the New Global Super-Rich and the Fall of Everyone Else

Why social science grad students would make great product managers

After my interview with InDecision Blog, a number of graduate students emailed asking me about careers in technology (hey, I asked for it). They were a very impressive lot from top universities, but their programming skills varied quite a bit. Some less technically minded folks were looking at careers in technology aside from data scientist. Enough of them asked specifically about product management, so I thought I would combine my answers for others who might be interested.

What does a product manager do?
Brings the donuts. The nice thing about social science grad students for whom reading about product managers is news is that we can skip over the aggrandized misconceptions about product management that many more familiar with the technology space might harbor. The product manager is the person (or persons) that stands at the interface between an engineering team building a product and the outside world (which here includes not only the customers/users of the product, but also the other teams within a given company who might be working on related products). The product manager is in charge of protecting the “vision” of the product. Sometimes they come up with that vision, but more often than not, the scope of what the product should be and what features it needs to have today, next week, or next year is something that emerges out of interactions between the engineers, the engineers’ manager, the product manager, company executives, etc etc. The product manager is really just the locus of where that battle plays out. So obviously there is a great need for politicking at times as well.

But wait, there’s more! Once the product is actually launched, it is typically still worked on and improved (or fixed). So the product manager is also the person that gets to figure out how to prioritize the various additional work that could be done. But how do they figure out what needs to be changed or fixed? This is one of the places where research comes in! So someone like me might do analysis on the data of people’s actual usage of the product (the product manager prioritized getting the recording of people’s actions properly instrumented, right? RIGHT?). Or a qualitative researcher might conduct interviews of users in the field and try and abstract an understanding from that. Either way, the product manager has to make sense of all this incoming information and figure out how to allocate resources accordingly.

Why would social science graduate students be good at that?
Perhaps you can see where I’m going with this. Products are increasing in scope. Even a simple app has potentially tens of thousands of users. Quantitative methods are becoming increasingly important for understanding what customers do. In such an environment, being savvy about data is hugely advantageous. In the same way that many product managers benefit from computer science degrees without coding on a daily basis, product managers will benefit from knowing statistics, along with domain expertise in psychology, sociology, anthropology even if they aren’t the ones collecting and analyzing the data themselves. It will help them ask the right questions and know when to trust results and when to be more skeptical. It will help them operationalize their measures of success more intelligently.

The soft skills of graduate school also translate more nicely. Replace “crazy advisor” with “manager” (hopefully a good one) and replace “fellow graduate students” with “other product managers” and many of the lessons apply. Many graduate social scientists will have plenty of experience with being part of a lab and engaging in large-scale collaborative projects. Just like in graduate school, a typical product manager will spend hours fine tuning slide decks and giving high stakes presentations meant to convince skeptical elders of the merit of a certain course of research (replace with: feature, product, or strategy).

Finally, building technology products is a kind of applied social science. You start with a hypothesis about a problem that people are having that you can solve. Of course, as a social scientist, the typical grad student understands just how fraught this is! Anthropologist readers of James Scott and Jane Jacobs and economists who love their Hayek will have a keen appreciation for spontaneous order (“look! users are using this feature in a totally unexpected way!”), as well as the difficulties of a priori theories of users’ problems or competencies. In fact, careful reading of social science should make a fledgling PM pretty skeptical of grand theories. For instance–should interfaces be simpler or more complicated? How efficient should we make it to do some set of common actions? If everything is easily accessible from one click on the front page, will there be overload of too many buttons? Is that simpler or more complicated? These sorts of debates, much like debates about the function of particular social institutions or legal proscriptions, are not easily solved with simple bromides like “less is always better”, or “more clear rules, less discretion” (I am reading Simpler: The Future of Government by Cass Sunstein right now, and he makes this point very well with respect to regulations). The ethos of the empirical social scientist is to look for incremental improvements bringing all of our particularist knowledge to bear on a problem, not to solve everything with one sweeping gesture. This openness is exactly the right mentality for a product manager, in my opinion.

Conclusion
I hope I have at least partially convinced you that as an empirical social scientist, you would make a great product manager. Now the question is, how do I convince someone in technology of that? The short and most truthful answer is, I’m not 100% certain. It might take some work to break into product management, but I see lots of people with humanities backgrounds doing it, so it can’t be that hard (one of my favorite Google PMs is an English PhD). One thing I would suggest is carefully framing your resume to emphasize your PM-pertinent skills–things like group project management, public speaking experience, making high stakes presentations, etc. You might also consider making a small persuasive deck to show as a portfolio example of a situation where you convinced someone of something (your dissertation proposal could work?). This would be a great start. Another thing is to consider more junior PM roles initially–as a PhD coming out of grad school you are still going to make a fine salary as an entry-level product manager. If you apply these principles I have no doubt that you will quickly move up.

Why a focus on p-hacking is misplaced, or the coming co-evolution

There has been a lot of recent work on p-hacking (making things statistically significant through taking advantage of analysis degrees-of-freedom), which I think is good (it’s starting to make people aware of the scope of the problem facing social psychology and related fields); however, I think people are missing something fundamental.

As Tal Yarkoni recently pointed out (and as I pointed out in a previous blog post), the incentives in the academy are messed up. Success in funding, in getting a job, etc, all hinges on your ability to produce positive results. When your livelihood literally depends on getting a positive result, it’s very hard to avoid putting your thumb on that scale.

So the solutions thus far proffered involve things like “publishing your data” and other such controls that will purport to “solve” this problem. However, the deep problem with this can be illustrated with a hypothetical computer program called “the Fake-ulator” (I thought about actually writing this program–but I think the thought experiment is enough for now). Version 1 is just a beta, so it only works for Likert scales. But the idea is simple enough–if we scour the literature for Likert scale data and effects we quickly realize that simple random draws from a response distribution will be easy to spot. Humans have lots of unique biases that lead to systematic patterns in response data like Likert scale data. So, the authors of the Fake-ulator have scoured the literature and have built a random data generator that generates data that looks indistinguishable statistically from real human response data! Better yet, you can input an effect size and generate beautiful (but not too beautiful) data that is statistically significant. You can even generate a fake file drawer, since many of these fake experiments will be “failures”! But hey, since your fake effect is positive, random fake experiments on average will find your effect. So with a computer program like this, you could easily imagine someone faking all of their data in a way that no one would ever notice.
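The one purely statistical claim in that thought experiment (a positive input effect still yields plenty of "failed" experiments, yet the average across experiments recovers the effect) is just a statement about statistical power, and it can be checked with a small Monte Carlo sketch. To be clear about the assumptions: this is not the Fake-ulator. It uses plain normal draws (exactly the kind of naive data the paragraph says would be easy to spot), a z-test approximation to stay dependency-free, and made-up defaults for the effect size and sample size. The function names are mine.

```python
import math
import random
import statistics


def simulate_experiment(rng, effect_size, n_per_group):
    """Simulate one two-group experiment; return (mean difference, p-value).

    Both groups are unit-variance normals. The two-sided p-value uses a
    normal (z) approximation, an assumption made to avoid extra libraries.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    std_err = math.sqrt(
        statistics.variance(control) / n_per_group
        + statistics.variance(treated) / n_per_group
    )
    z = diff / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return diff, p_value


def run_batch(effect_size=0.4, n_per_group=30, n_experiments=2000, seed=0):
    """Return (share of significant positive results, mean observed effect)."""
    rng = random.Random(seed)
    results = [
        simulate_experiment(rng, effect_size, n_per_group)
        for _ in range(n_experiments)
    ]
    hits = sum(1 for diff, p in results if p < 0.05 and diff > 0)
    mean_diff = statistics.fmean(diff for diff, _ in results)
    return hits / n_experiments, mean_diff


share_significant, mean_effect = run_batch()
print(f"significant: {share_significant:.2f}, mean effect: {mean_effect:.2f}")
```

With these illustrative defaults only a minority of runs should reach p < .05, while the mean observed difference sits close to the input effect size: a built-in "file drawer" of null results around an effect that is positive on average.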

Now what keeps me up at night is, does this computer program already exist? Did we only catch the really dumb fakers who didn’t take the time to do it the right way? One objection might be that anyone smart enough to do this will just run the studies–I think this is wrong. Actually running the studies leaves things up to chance. If you really want a 6-figure tenure track job at Harvard or Princeton, real data just won’t do!

The point of this is just to say that we need more than just clever statistics and safeguards–until we fundamentally change the incentives of science to reward process instead of outcome, we aren’t going to solve this problem. We are only going to make it much harder to determine if something is real or not. The adaptations are already upon us!

in which I comment on meritocracy

This link, which of course touches on many of the same themes as Chris Hayes’ Twilight of the Elites, points out that an increasingly metrics focused way of weeding out potential candidates for some elite group leads to a narrowing of the backgrounds and viewpoints of that elite. This happens as applicants increasingly narrow their focus of study to optimize their chances of success (the gaokao in China is another modern day example of this–there are many).

This connects up to a comment that Cosma Shalizi made regarding my previous post on SimGradSchool, objecting that “I like this a lot, but suspect the assumption of a unidimensional ability score misses a lot of why shit is fucked up and bullshit in the current academic job market.” I think I understand Cosma’s objection more broadly, and it connects directly to the notion of cognitive diversity.

If you read Scott Page’s terrific book on diversity, The Difference, he utilizes simulation to compellingly argue that the key to solving difficult problems is having a diversity of viewpoints drawn from a large pool of possible ways of thinking. Cosma and Henry Farrell have made a similar argument for the benefits of democracy–that the voting mechanism of democracy is the best way to solve the problem of aggregating preferences and solving complex coordination problems among agents.

So, I think these arguments point to another deeper problem for a unidimensional perspective on research ability. Discovery in science requires a diversity of viewpoints to make progress. If we make all the undergrads come from the same background (e.g. research assistant at a top lab from the beginning of undergrad, poster presentations at relevant conferences, etc.), or new faculty (come from these 10 schools and have 2 JPSPs / psych science journal articles), the problem is that we are going to get too narrow of a pool of potential researchers. One of the unique strengths of my graduate program at CMU was that they took students from many different backgrounds (I basically did a psych/econ grad degree with 0 econ classes, 2 psych classes and a philosophy/cs major). I think it definitely gave us a unique perspective. More broadly, I worry about whether a grades/test scores focused society is going to quash the very creativity that has been so central to innovation. Imagine Steve Jobs trying to get a job today in tech as a dropout from Reed with some calligraphy coursework and no technical major–not happening.

Of course, the problem remains–what do you do with the flood of applicants? You still have a sorting problem. How do you select for cognitive diversity in the right way? This has become an increasingly large problem at tech companies which are leaning on referrals even more than before. I have a few thoughts about this that I will share in an upcoming blog post.

Now the only problem is I probably took the wind out of Cosma’s sails and he won’t blog about me anymore 🙁

SimGradSchool, a study in new faculty hiring practices

[Attention conservation notice: 2000+ words about hiring in academia including an overly complex numerical model. Navel gazing surely to follow.] 1

Like many other social science graduate students who have graduated in the last 5-10 years 2, I have experienced the ratcheting up of competition. It’s interesting to think about the changing landscape of the competition for scarce tenure-track faculty positions in the last 30 years. On the one hand, the successes of identity politics have brought us an increasingly diverse (in terms of race, gender, nationality) academy. On the other, at least in the behavioral sciences, the sheer number of competitors and the need for impartial measures of candidate quality have narrowed the range of what top candidates’ dossiers look like 3. To be clear, I am not advocating for us to go back to the good old days where knowing the right person and a pat on the back were all that mattered 4. But I do think that what we have now is also far from the optimal strategy for selecting the most promising behavioral scientists and giving them the best resources to succeed.

So what does the hiring game today look like? A few recent papers give us some clues. First, there’s this analysis of the political science hiring market, finding that the graduates of top programs (Harvard, Stanford, etc) dominate the job market. So much of the hiring game is already decided when you go through graduate admissions at the outset. In my experience advisor choice definitely is key–pick a good advisor and your chances increase. Another key factor is publication–it used to be the case that a top grad student from a top department maybe had one top tier journal article to their name. Now you typically see top candidates with 3-4 such papers, plus a host of “secondary” publications (like textbooks or edited volumes, which because they are not peer-reviewed, are typically worth less).

As readers of this are likely aware, the behavioral sciences (and biological sciences, to a lesser extent) are facing somewhat of a replicability crisis, with many key findings in our field being called into question. Did these games exist before? Yes, in all likelihood. However, what has changed is the environment of hyper-motivation. Loewenstein and Rick coined this term to describe the feeling of being “in a hole” or disadvantaged in a competition. The desire to “level the playing field” or “get back to even” is shown to be a prime motive in cheating behavior 5. So in a world where you constantly feel like you don’t have enough papers to succeed and get a top job and will be consigned to some far corner of the country or a low-paying post-doc, you are all the more likely to cheat. Oh, and lest you think that whether a promising researcher gets into a top tier school or a second tier school doesn’t really matter–there is strong evidence that it does. See this paper that shows that an exogenous shock to hiring (a recession) affects the long-term productivity of the economists who are hired. So people are legitimately trying to maximize their career opportunities when they cut corners to get a few more papers out.

A Simple Model of Graduate School and Hiring
Struck by this system in which your success is determined by your pedigree and the number of papers you publish, I set out to construct a simple numerical model of graduate school. I wanted to see how your underlying quality as a researcher correlates with hiring outcomes in a stylized environment. I have posted the R simulation code that I wrote below so that other people can examine and even extend my model if they wish. Note that what follows is a high level summary of how the model works and my general findings from playing with it. Readers interested in the details of how it works are advised to check out the code, which is fairly well-commented.

(TL;DR Summary
The number of papers has at best a moderate correlation with your underlying quality as a researcher. The lion’s share of selection comes from advisors picking the best students. If graduate school sorting is reasonably good, hiring will be reasonably good.)

The basic idea of the model is that each graduate student is represented by an underlying quality parameter. This parameter determines the average true effect sizes of experiments they run. So in this representation, a better researcher is someone who proposes more effects with larger average effect size. I realize this is a bit of an oversimplification, but setting things up this way has some nice properties. Essentially we can generate some proposed effects from a pool of researchers, set up some very simple well-powered or under-powered studies to test those effects and simulate how many positive results they get.
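
To make the mapping from quality to results a bit more concrete, here is a rough sketch of what a given quality score implies. It assumes, as in the simulation code, that true effects are drawn from a Beta(alpha, 10) distribution and that outcomes have unit variance. The helper names (avg.effect, power.at.avg) are hypothetical and not part of the original code, and power.t.test assumes equal variances, so the numbers are only a guide to what the simulated t-tests will do.

```r
# Sketch: what a student's quality (alpha) implies for the average true
# effect and for the chance that one study reaches p < .05.
# Assumes Beta(alpha, 10) effects and unit-variance outcomes.
# Helper names are hypothetical, not from the original simulation.

# Mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta).
avg.effect <- function(alpha, beta = 10) alpha / (alpha + beta)

# Approximate power of a two-sample, two-sided t-test evaluated at the
# *average* effect size. The simulation draws a fresh effect for every
# study, so this is an approximation rather than the exact success rate.
power.at.avg <- function(alpha, n) {
  power.t.test(n = n, delta = avg.effect(alpha), sd = 1,
               sig.level = 0.05)$power
}

qualities <- c(1, 3, 5)
power.summary <- data.frame(
  quality    = qualities,
  avg.effect = avg.effect(qualities),
  power.n25  = sapply(qualities, power.at.avg, n = 25),
  power.n50  = sapply(qualities, power.at.avg, n = 50)
)
print(round(power.summary, 3))
```

The table should show power rising with both quality and sample size, which is the sense in which a "better" student in this model gets more positive results from the same number of studies.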

The way I set up the course of graduate school consisted of two parts. First, graduate students are assigned an advisor of some varying quality score. An external parameter controls the degree of correlation between advisor quality and the underlying graduate student quality. So either advisors are great judges of talent, or not so good (I leave it to the reader to decide what the underlying value for the parameter should be). Second, graduate students run either somewhat under-powered or somewhat well-powered studies, the number of which ranges over what someone might expect to run in a typical grad student career (from just a few studies all the way up to 100; yes, there are definitely graduate students who run 100 studies over the course of graduate school).

So a number of different factors contribute to the number of successful (aka t-test reveals p < .05) studies a graduate student in my simulation ends up with: their sample sizes, the number of studies they manage to pull off, and of course, their underlying quality. From there, I created a weighted average of the advisor's quality and the number of successful studies each graduate student had. This weighted score is a measure of job market desirability. In a crude way, it measures how well each student would fare on the market.

Assuming you believe this is a reasonable model (see objections below if/when you don't), you can then correlate the weighted hiring score with the original quality score and ask, under different sets of model parameters, what is the correlation between hiring score and quality score? If the given hiring procedure were doing a reasonable job, you would expect the correlation to be high. If bad, then low. What do actual runs of the model yield?
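
Before getting to the results, here is a quick sanity check on the "quality of admissions" parameter. The advisor score is built as r times student quality plus sqrt(1 - r^2) times an independent draw from the same distribution. Because both pieces have the same variance, the covariance with student quality is r times that variance while the advisor score's variance is unchanged, so the correlation should come out at roughly r. The sketch below mirrors that construction; the name mix.advisor and the sample size are my own assumptions for illustration.

```r
# Sketch: checking that mixing student quality with independent noise
# yields advisor scores whose correlation with quality is close to r.
# mix.advisor is a hypothetical name; the formula mirrors the
# admissions step described in the post.
set.seed(1)

n <- 10000
quality <- runif(n) * 4 + 1           # student quality, uniform on [1, 5]

mix.advisor <- function(q, r) {
  noise <- runif(length(q)) * 4 + 1   # independent draw, same distribution
  r * q + sqrt(1 - r^2) * noise
}

for (r in c(0.5, 0.7)) {
  advisor <- mix.advisor(quality, r)
  cat(sprintf("target r = %.1f, observed r = %.3f\n",
              r, cor(quality, advisor)))
}
```

With a large pool the observed correlation should sit close to the target; with only 1000 students it will wobble by a few hundredths from run to run.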
As you can see from the code below, I set up a pool of 1000 graduate students of different quality, assigned them to advisors with a fairly high correlation between student and advisor quality (I set it at .5 and .7 for the various models), and randomized how many studies they ran and with what sample sizes (25 or 50 per condition). Basically what you find is a moderate correlation between hiring score and quality (~.6). Interestingly, however, most of that is driven by advisor selection. The correlation between positive results and your quality as a student was considerably lower (~.35). Now, you might think that's pretty good (that is a medium effect size by how psychologists typically define it). But the question you should be asking yourself is--relative to what? Also, this model assumes no p-hacking or file-drawer effects. The effects posited by experimenters are assumed to be independent, so dropping failed studies isn't a problem. In real life, the correlation is likely to be much smaller, not larger.

Some Obvious Objections
Now one obvious objection is that this is a gross oversimplification of the job market. There is an extensive interview process, so if someone’s research record does not adequately reflect their skills, the interview process would catch that. Notice, however, that who a typical institution chooses to interview in the first place is largely determined by a weighted average of papers and advisor–no one is interviewing a candidate from University of Random with no papers. So even if there are other factors that affect selection that are more closely correlated with underlying quality, those other factors do not come into play until the end of the process, where arguably the other promising students have already been weeded out. Also there is a huge hindsight bias coming into play–you know already that this person has successful research, so you are more likely to believe that what they are saying is plausible.

Another objection is that we actually do want to reward the people who run more studies–we want to reward some element of “can do gumption” beyond just researcher skill, and your positive result count is a nice indicator of that (since it partly reflects the sheer amount of work you put in). I have some sympathy for this argument, but given what I wrote above about hypermotivation and the various file drawer issues, I’m not sure we want to reward people just for running a lot of studies. Also, it’s not clear to me that the difference between someone who ran “many” and “few” studies translates into differences of effort expended–you might run more studies because you have more grant money at a prestigious university, or a better set up subject pool, or better undergrad research experience, or a very organized advisor–many reasons that have nothing to do with your overall quality as a researcher, even assuming that ceteris paribus, more studies = harder worker.

Conclusions
So improvements in our hiring algorithm could potentially increase the quality of scientific output. Arguably there is a Moneyball opportunity here–the correlation between IQ and SAT/ACT scores ranges from around .6 to .8 6, well above the ~.35 between positive results and researcher quality in the model above. So one might imagine a hiring process built around evaluating research before it’s executed, or a general purpose test meant to measure knowledge of the social sciences, or your ability to think about and present a research problem and solution with a few days’ notice. Surely a battery of tests that attempted to measure the quality of research hypotheses a priori could be much more successful in identifying and promoting the best scientists than relying on the small sample of a few results obtained at the start of their careers.

library("plyr")
 
# Indicates the number of graduate students in the simulation.
num.students <- 1000
 
# Student quality is drawn from real values from 1-5 which correspond
# to the alpha parameter in a beta distribution with the
# beta parameter set to 10. The mean of this distribution is
# alpha / (alpha + beta) and indicates the average true effect size
# of any experiment that a student runs. 
students <- runif(num.students) * 4 + 1
 
# Experiments denotes how many experiments the students run
# On average, a student runs 18 experiments in graduate school.
# Because the distribution is a log-normal, there are a few
# outliers who run a lot of studies (~100). This lines up roughly
# with the author's observations and conversation with other
# students.
experiments <- floor(rlnorm(num.students, 2.75, .65))
 
# Subjects denotes how many subjects the students run per
# condition in their experiments. For simplicity, students
# run either a large number per condition (50) or a
# small number (25).
subjects <- sample(c(25,50), num.students, replace = TRUE)
 
# Inputs
# ss: vector of students
# r: quality of admissions (correlation between the quality of the 
#    students and the quality of the advisors)
# Outputs
# vector of scores which represent advisor quality, correlated with
# student quality at approximately r
run.admissions <- function(ss, r) {
  admit.scores <- (ss * r) + ((runif(length(ss)) * 4 + 1) *
                  sqrt(1-(r^2)))
  admit.scores
}
 
# Inputs
# df: data.frame consisting of a vector for student quality, a vector for
# the number of experiments each student runs in graduate school, and a
# vector for the number of subjects per condition the student runs.
# Outputs
# The number of successful experiments each student has at the end
# of grad school 
run.grad.school <- function(df) {
  out <- adply(df, 1, function(x) run.research(s = x$quality, 
                                               n = x$num.exp,
                                               p = x$num.subjects))
  names(out)[length(out)] <- "num.positives"
  out <- out[order(out$quality, decreasing = TRUE),]
  out
}
 
# Inputs
# s: avg true effect size of a particular student
# p: number of subjects per condition
# Outputs
# 1: indicates a successful study, 0: indicates a failure
run.study <- function(s, p) {
  m <- rbeta(1,s,10)
  x <- rnorm(p)
  y <- rnorm(p,mean = m)
  p.value <- t.test(x,y)$p.value
  out <- ifelse(p.value <= .05, 1, 0)
  out
}
 
# Inputs
# s: avg effect size represents student quality
# n: number of experiments run in grad school
# p: number of subjects per condition in each study  
# Outputs
# number of successful results (i.e. papers published)
run.research <- function(s, n, p) {
  # seq_len(n) handles n = 0 correctly; 1:0 would run two studies
  out <- sum(vapply(seq_len(n), function(i) run.study(s, p), numeric(1)))
  out
}
 
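# Inputs
# df: dataframe containing advisors' quality as well as the number
# of positive results that the student obtained.
# Outputs
# the same dataframe with z-scored columns and an equally weighted
# total_score, sorted from highest to lowest total_score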
run.hiring <- function(df) {
  df$z.exp <- (df$num.positives -  mean(df$num.positives)) / sd(df$num.positives)
  df$z.adv <- (df$advisor - mean(df$advisor)) / sd(df$advisor)
  df$total_score <- df$z.exp + df$z.adv
  df <- df[order(df$total_score, decreasing = TRUE),]
  df
} 
 
### Simulation Code Here
advisors <- run.admissions(students, .5)
s.df <- data.frame(quality = students, num.exp = experiments,
                   num.subjects = subjects, advisor = advisors)
out.df <- run.grad.school(s.df)
hired.df <- run.hiring(out.df)
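
The listing above stops at the sorted hiring table and does not itself compute the correlations discussed in the post. As a rough sketch of how one might get them, here is a compact re-implementation in base R (no plyr) that follows the same steps and then calls cor(). It is not the original code: the variable names (one.study, total.score, cor.hiring, cor.positives) and the seed are my own assumptions, and the exact values will vary with the seed and with the admissions parameter r.

```r
# Sketch: compact base-R re-run of the simulation plus the two
# correlations discussed in the post. Names and seed are hypothetical;
# this mirrors, but is not, the original listing.
set.seed(42)

n.students <- 1000
r          <- 0.5                                  # quality of admissions
quality  <- runif(n.students) * 4 + 1              # alpha in Beta(alpha, 10)
num.exp  <- floor(rlnorm(n.students, 2.75, 0.65))  # studies per student
num.subj <- sample(c(25, 50), n.students, replace = TRUE)
advisor  <- r * quality + sqrt(1 - r^2) * (runif(n.students) * 4 + 1)

# One study: draw a true effect, simulate two groups, test at p <= .05.
one.study <- function(alpha, n) {
  effect <- rbeta(1, alpha, 10)
  t.test(rnorm(n), rnorm(n, mean = effect))$p.value <= 0.05
}

# Number of significant studies per student (0 if no studies were run).
num.positives <- mapply(function(alpha, k, n) {
  sum(vapply(seq_len(k), function(i) one.study(alpha, n), logical(1)))
}, quality, num.exp, num.subj)

# Hiring score: equal-weight sum of z-scored positives and advisor quality.
total.score <- as.numeric(scale(num.positives)) + as.numeric(scale(advisor))

cor.hiring    <- cor(total.score, quality)
cor.positives <- cor(num.positives, quality)
cat("hiring score vs. quality:    ", round(cor.hiring, 2), "\n")
cat("positive results vs. quality:", round(cor.positives, 2), "\n")
```

Changing r between .5 and .7, or rerunning with different seeds, gives a feel for how much of the hiring score's correlation with quality comes from advisor selection rather than from the count of positive results.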

Notes:

  1. You might have noticed me copying a few tropes from another, much more intelligent blogger than me. Well, imitation flattery blah blah.
  2. You might ask, “why are you writing this, Paul? Are you actually embittered about academia given your current life?” Well, overly invasive individual, I am not trying to make some grand claim about how I was jobbed. For one thing, I didn’t even go on the job market. For another, if I had, I probably would have benefited at least partially from some of the effects I discuss. The issue for me had more to do with student debt, being an immigrant with no assets, socio-economic class, being genuinely passionate about technology from a young age, and discovering an interest in business as well as the desire to have a wider impact on people’s lives.
  3. I would also guess it has affected the SES of successful academics–if you are carrying a lot of student debt, it makes the choice to become an academic very difficult. I don’t know of any research on this, however.
  4. Of course, in some ways it still very much matters. I’m not aware of any systematic study, but I’ve observed that the children of professors have an extremely high success rate in academia. Part of that is surely being well taught and genetically more likely to be intelligent. But there is another factor–children of academics seem more savvy about the game, are always well positioned with the right advisor, in the right program, the right research, etc. Hard to say which of these factors is most important.
  5. Incidentally, you can see an instance of this in a recent paper on teacher incentives. The incentive that worked the best to motivate teachers was giving them a bonus and then threatening to take it away unless students’ scores increased. Of course, they increased. The lesson of hyper-motivation is that we shouldn’t be so quick to employ incentives such as these–they may be more likely to lead to cheating.
  6. I’m just using IQ as an example. Not that I believe in IQ as a great measure of intelligence. See another brilliant post by Shalizi here on this topic.

First post

After a long hiatus I set this up so I could post random musings on topics I couldn’t put anywhere else. Coming very soon, a project I have been working on– SimGradSchool.