
Why a focus on p-hacking is misplaced, or the coming co-evolution

There has been a lot of recent work on p-hacking (making things statistically significant through taking advantage of analysis degrees-of-freedom), which I think is good (it’s starting to make people aware of the scope of the problem facing social psychology and related fields); however, I think people are missing something fundamental.

As Tal Yarkoni recently pointed out (and as I pointed out in a previous blog post), the incentives in the academy are messed up. Success in funding, in getting a job, etc., all hinges on your ability to produce positive results. When your livelihood literally depends on getting a positive result, it’s very hard to avoid putting your thumb on the scale.

So the solutions thus far proffered involve things like “publishing your data” and other such controls that will purport to “solve” this problem. However, the deep problem with this can be illustrated with a hypothetical computer program called “the Fake-ulator” (I thought about actually writing this program–but I think the thought experiment is enough for now). Version 1 is just a beta, so it only works for Likert scales. But the idea is simple enough–if we scour the literature for Likert scale data and effects we quickly realize that simple random draws from a response distribution will be easy to spot. Humans have lots of unique biases that lead to systematic patterns in response data like Likert scale data. So, the authors of the Fake-ulator have scoured the literature and have built a random data generator that generates data that looks indistinguishable statistically from real human response data! Better yet, you can input an effect size and generate beautiful (but not too beautiful) data that is statistically significant. You can even generate a fake file drawer, since many of these fake experiments will be “failures”! But hey, since your fake effect is positive, random fake experiments on average will find your effect. So with a computer program like this, you could easily imagine someone faking all of their data in a way that no one would ever notice.

Now what keeps me up at night is, does this computer program already exist? Did we only catch the really dumb fakers who didn’t take the time to do it the right way? One objection might be that anyone smart enough to do this will just run the studies–I think this is wrong. Actually running the studies leaves things up to chance. If you really want a 6-figure tenure track job at Harvard or Princeton, real data just won’t do!

The point of this is just to say that we need more than just clever statistics and safeguards–until we fundamentally change the incentives of science to reward process instead of outcome, we aren’t going to solve this problem. We are only going to make it much harder to determine if something is real or not. The adaptations are already upon us!

in which I comment on meritocracy

This link, which of course touches on many of the same themes as Chris Hayes’ Twilight of the Elites, points out that an increasingly metrics-focused way of weeding out potential candidates for some elite group leads to a narrowing of the backgrounds and viewpoints of that elite. This happens as applicants increasingly narrow their focus of study to optimize their chances of success (the gaokao in China is another modern-day example of this–there are many).

This connects up to a comment that Cosma Shalizi made regarding my previous post on SimGradSchool, objecting that “I like this a lot, but suspect the assumption of a unidimensional ability score misses a lot of why shit is fucked up and bullshit in the current academic job market.” I think I understand Cosma’s objection more broadly, and it connects directly to the notion of cognitive diversity.

If you read Scott Page’s terrific book on diversity, The Difference, he utilizes simulation to compellingly argue that the key to solving difficult problems is having a diversity of viewpoints drawn from a large pool of possible ways of thinking. Cosma and Henry Farrell have made a similar argument for the benefits of democracy–that the voting mechanism of democracy is the best way to solve the problem of aggregating preferences and solving complex coordination problems among agents.

So, I think these arguments point to another deeper problem for a unidimensional perspective on research ability. Discovery in science requires a diversity of viewpoints to make progress. If we make all the undergrads come from the same background (e.g. research assistant at a top lab from the beginning of undergrad, poster presentations at relevant conferences, etc.), or new faculty (come from these 10 schools and have 2 JPSPs / psych science journal articles), the problem is that we are going to get too narrow of a pool of potential researchers. One of the unique strengths of my graduate program at CMU was that they took students from many different backgrounds (I basically did a psych/econ grad degree with 0 econ classes, 2 psych classes and a philosophy/cs major). I think it definitely gave us a unique perspective. More broadly, I worry about whether a grades/test scores focused society is going to quash the very creativity that has been so central to innovation. Imagine Steve Jobs trying to get a job today in tech as a dropout from Reed with some calligraphy coursework and no technical major–not happening.

Of course, the problem remains–what do you do with the flood of applicants? You still have a sorting problem. How do you select for cognitive diversity in the right way? This has become an increasingly large problem at tech companies which are leaning on referrals even more than before. I have a few thoughts about this that I will share in an upcoming blog post.

Now the only problem is I probably took the wind out of Cosma’s sails and he won’t blog about me anymore 🙁

SimGradSchool, a study in new faculty hiring practices

[Attention conservation notice: 2000+ words about hiring in academia including an overly complex numerical model. Navel gazing surely to follow.] 1

Like many other social science graduate students who have graduated in the last 5-10 years 2, I have experienced the ratcheting up of competition. It’s interesting to think about the changing landscape of the competition for scarce tenure-track faculty positions in the last 30 years. On the one hand, the successes of identity politics have brought us an increasingly diverse (in terms of race, gender, nationality) academy. On the other, at least in the behavioral sciences, the sheer number of competitors and the need for impartial measures of candidate quality have narrowed the range of what top candidates’ dossiers look like 3. To be clear, I am not advocating for us to go back to the good old days where knowing the right person and a pat on the back were all that mattered 4. But I do think that what we have now is also far from the optimal strategy for selecting the most promising behavioral scientists and giving them the best resources to succeed.

So what does the hiring game today look like? A few recent papers give us some clues. First, there’s this analysis of the political science hiring market, finding that the graduates of top programs (Harvard, Stanford, etc.) dominate the job market. So much of the hiring game is already decided when you go through graduate admissions at the outset. In my experience advisor choice is definitely key–pick a good advisor and your chances increase. Another key factor is publication–it used to be the case that a top grad student from a top department maybe had one top-tier journal article to their name. Now you typically see top candidates with 3-4 such papers, plus a host of “secondary” publications (like textbooks or edited volumes, which, because they are not peer-reviewed, are typically worth less).

As readers of this are likely aware, the behavioral sciences (and biological sciences, to a lesser extent) are facing somewhat of a replicability crisis, with many key findings in our field being called into question. Did these games exist before? Yes, in all likelihood. However, what has changed is the environment of hyper-motivation. Loewenstein and Rick coined this term to describe the feeling of being “in a hole” or disadvantaged in a competition. The desire to “level the playing field” or “get back to even” is shown to be a prime motive in cheating behavior 5. So in a world where you constantly feel like you don’t have enough papers to succeed and get a top job and will be consigned to some far corner of the country or a low-paying post-doc, you are all the more likely to cheat. Oh, and lest you think that whether a promising researcher gets into a top-tier school or a second-tier school doesn’t really matter–there is strong evidence that it does. See this paper, which shows that an exogenous shock to hiring (a recession) affects the long-term productivity of the economists who are hired. So people are legitimately trying to maximize their career opportunities when they cut corners to get a few more papers out.

A Simple Model of Graduate School and Hiring
Struck by this system in which your success is determined by your pedigree and the number of papers you publish, I set out to construct a simple numerical model of graduate school. I wanted to see how your underlying quality as a researcher correlated with hiring outcomes in a stylized environment. I posted the R simulation code that I wrote below so that other people can examine and even extend my model if they wish. Note that what follows is a high-level summary of how the model works and my general findings from playing with it. Readers interested in the details of how it works are advised to check out the code, which is fairly well-commented.

(TL;DR Summary
The number of papers has at best a moderate correlation with your underlying quality as a researcher. The lion’s share of selection comes from advisors picking the best students. If graduate school sorting is reasonably good, hiring will be reasonably good.)

The basic idea of the model is that each graduate student is represented by an underlying quality parameter. This parameter determines the average true effect sizes of experiments they run. So in this representation, a better researcher is someone who proposes more effects with larger average effect size. I realize this is a bit of an oversimplification, but setting things up this way has some nice properties. Essentially we can generate some proposed effects from a pool of researchers, set up some very simple well-powered or under-powered studies to test those effects and simulate how many positive results they get.
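To get a rough feel for what "well-powered" and "under-powered" mean in this setup, here is a minimal sketch using base R's `power.t.test`. It borrows its assumptions from the simulation code further down: quality is the alpha parameter of a Beta(alpha, 10) distribution, outcomes have unit standard deviation, and each study is a two-sample t-test with 25 or 50 subjects per condition. The names `power.at` and `power.table` are made up for illustration, and the sketch plugs in the mean effect size for each quality level rather than drawing a fresh effect per study, so treat the numbers as approximate.

```r
# Rough power calculation for students of different quality.
# Assumption: quality is the alpha of a Beta(alpha, 10) distribution,
# so the mean true effect size is alpha / (alpha + 10).
quality <- 1:5
mean.effect <- quality / (quality + 10)

# Power of a two-sided, two-sample t-test at a given effect size
# (sd = 1, significance level .05). Hypothetical helper.
power.at <- function(n, d) {
  power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05)$power
}

power.table <- data.frame(
  quality = quality,
  mean.effect = round(mean.effect, 3),
  power.n25 = round(sapply(mean.effect, function(d) power.at(25, d)), 3),
  power.n50 = round(sapply(mean.effect, function(d) power.at(50, d)), 3)
)
print(power.table)
```

Under these assumptions, power at the mean effect size stays below the conventional 80% mark even for the strongest students at 50 subjects per condition, which is worth keeping in mind when reading the results.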

The way I set up the course of graduate school then consisted of two parts. First, graduate students are assigned an advisor of some varying quality score. I had an external parameter modify the degree of correlation between the advisor quality and the underlying graduate student quality. So either advisors are great judges of talent, or not so good (I leave it to the reader to decide what the underlying value for the parameter should be). Graduate students then either run somewhat under-powered or somewhat well-powered studies, the number of which ranges over what someone might expect to run in a typical grad student career (this can range from just a few studies all the way up to 100; yes, there are definitely graduate students who run 100 studies over the course of graduate school). So there are a number of different factors that contribute to the number of successful (aka t-test reveals p < .05) studies a graduate student in my simulation ends up with: their sample sizes, the number of studies they manage to pull off, and of course, their underlying quality. From there, I created a weighted average of the advisor's quality and the number of successful studies each graduate student had. This weighted score is a measure of job market desirability. In a crude way, it measures how well each student would fare on the market. Assuming you believe this is a reasonable model (see objections below if/when you don't), you can then correlate the weighted hiring score with the original quality score and ask, under different sets of model parameters, what is the correlation between hiring score and quality score? If the given hiring procedure were doing a reasonable job, then you would expect the correlation to be high. If bad, then low. What do actual runs of the model yield?

As you can see from the code below, I set up a pool of 1000 graduate students of different quality, assigned them to advisors with fairly high correlation between student and advisor quality (I set it at .5 and .7 for the various models), and randomized how many studies they ran and what sample sizes they used (25 or 50 per condition). Basically what you find is a moderate correlation between hiring score and quality (~.6). However, interestingly, most of that is driven by advisor selection. The correlation between positive results and your quality as a student was significantly lower (~.35). Now, you might think that's pretty good (that is a medium effect size by how psychologists typically define it). But the question you should be asking yourself is: relative to what? Also, this model assumes no p-hacking or file-drawer effects. The effects posited by experimenters are assumed to be independent, so dropping failed studies isn't a problem. In real life, the correlation is likely to be much smaller, not larger.

Some Obvious Objections
Now one obvious objection is that this is a gross oversimplification of the job market. There is an extensive interview process, so if someone’s research does not adequately reflect their skills, then the interview process would catch that. Notice, however, that whom a typical institution chooses to interview in the first place is largely determined by a weighted average of the papers and advisor–no one is interviewing a candidate from University of Random with no papers. So even if there are other factors that affect selection that are more closely correlated with underlying quality, those other factors do not come into play until the end of the process, where arguably the other promising students have already been weeded out. Also there is a huge hindsight bias coming into play–you know already that this person has successful research, so you are more likely to believe that what they are saying is plausible.

Another objection is that we actually do want to reward the people who run more studies–we want to reward some element of “can do gumption” beyond just researcher skill, and your positive result count is a nice indicator of that (since it partly reflects the sheer amount of work you put in). I have some sympathy for this argument, but given what I wrote above about hypermotivation and the various file drawer issues, I’m not sure we want to reward people just for running a lot of studies. Also, it’s not clear to me that the difference between someone who ran “many” and “few” studies translates into differences of effort expended–you might run more studies because you have more grant money at a prestigious university, or a better set up subject pool, or better undergrad research experience, or a very organized advisor–many reasons that have nothing to do with your overall quality as a researcher, even assuming that, ceteris paribus, more studies = harder worker.

So improvements in our hiring algorithm could potentially increase the quality of scientific output. Arguably there is a Moneyball opportunity here–the correlation between IQ and SAT/ACT scores ranges from around .6 to .8 6. So one might imagine a hiring process built around evaluating research before it’s executed, or a general purpose test meant to measure knowledge of the social sciences, or your ability to think about and present a research problem and solution with a few days’ notice. Surely a battery of tests that attempted to measure the quality of research hypotheses a priori could be much more successful in identifying and promoting the best scientists than relying on the small sample size of a few results obtained at the start of their careers.

# Indicates the number of graduate students in the simulation.
num.students <- 1000
# Student quality is drawn from real values from 1-5 which correspond
# to the alpha parameter in a beta distribution with the
# beta parameter set to 10. The mean of this distribution is
# alpha / (alpha + beta) and indicates the average true effect size
# of any experiment that a student runs. 
students <- runif(num.students) * 4 + 1
# Experiments denotes how many experiments the students run
# On average, a student runs 18 experiments in graduate school.
# Because the distribution is a log-normal, there are a few
# outliers who run a lot of studies (~100). This lines up roughly
# with the author's observations and conversation with other
# students.
experiments <- floor(rlnorm(num.students, 2.75, .65))
# Subjects denotes how many subjects the students run per
# condition in their experiments. For simplicity, students
# can either run a large amount per condition (50), or a
# small amount (25).
subjects <- sample(c(25,50), num.students, replace = TRUE)

# Inputs
# ss: vector of students
# r: quality of admissions (controls how closely the quality of the
#    advisors tracks the quality of the students)
# Outputs
# vector of scores which represent advisor quality, positively correlated
# with student quality
# Assumption: the noise term is weighted by (1 - r) so that the two
# weights sum to 1.
run.admissions <- function(ss, r) {
  admit.scores <- (ss * r) + ((runif(length(ss)) * 4 + 1) * (1 - r))
  admit.scores
}

# Inputs
# df: data.frame consisting of a vector for student quality, a vector for
# the number of experiments each student runs in graduate school, and a
# vector for the number of subjects per condition the student runs.
# Outputs
# The input data.frame with an added num.positives column (the number of
# successful experiments each student has at the end of grad school),
# sorted by quality.
library(plyr)  # provides adply()
run.grad.school <- function(df) {
  out <- adply(df, 1, function(x) run.research(s = x$quality, 
                                               n = x$num.exp,
                                               p = x$num.subjects))
  names(out)[length(out)] <- "num.positives"
  out <- out[order(out$quality, decreasing = TRUE),]
  out
}

# Inputs
# s: avg true effect size of a particular student
# p: number of subjects per condition
# Outputs
# 1: indicates a successful study, 0: indicates a failure
run.study <- function(s, p) {
  m <- rbeta(1,s,10)         # true effect size for this study
  x <- rnorm(p)              # control condition
  y <- rnorm(p,mean = m)     # treatment condition
  p.value <- t.test(x,y)$p.value
  out <- ifelse(p.value <= .05, 1, 0)
  out
}

# Inputs
# s: avg effect size represents student quality
# n: number of experiments run in grad school
# p: number of subjects per condition in each study  
# Outputs
# number of successful results (i.e. papers published)
run.research <- function(s, n, p) {
  # seq_len(n) rather than 1:n, so a student with n = 0 runs no studies
  out <- sum(unlist(lapply(seq_len(n), FUN = function(x) run.study(s, p))))
  out
}

# Inputs
# df: dataframe containing advisors' quality as well as the number
# of positive results that the student obtained.
# Outputs
# df with z-scored columns and a total_score column, sorted by total_score
run.hiring <- function(df) {
  df$z.exp <- (df$num.positives -  mean(df$num.positives)) / sd(df$num.positives)
  df$z.adv <- (df$advisor - mean(df$advisor)) / sd(df$advisor)
  df$total_score <- df$z.exp + df$z.adv
  df <- df[order(df$total_score, decreasing = TRUE),]
  df
}

### Simulation Code Here
advisors <- run.admissions(students, .5)
s.df <- data.frame(quality = students, num.exp = experiments,
                   num.subjects = subjects, advisor = advisors)
out.df <- run.grad.school(s.df)
hired.df <- run.hiring(out.df)
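
The correlations discussed in the post can be computed from a single run of the model. What follows is a minimal, self-contained sketch in base R, not the code above: it avoids plyr, uses made-up names (`one.study`, `z`, `hiring.score`), runs 500 students instead of 1000 to keep the run short, and assumes that the advisor score weights student quality by r and uniform noise by (1 - r). Exact values will vary with the seed.

```r
# Self-contained sketch of the whole pipeline in base R (no plyr).
# Assumption: advisor quality mixes student quality and uniform noise
# with weights r and (1 - r).
set.seed(1)  # arbitrary seed, only for reproducibility

n.students <- 500
r <- 0.5

quality  <- runif(n.students, min = 1, max = 5)
num.exp  <- floor(rlnorm(n.students, 2.75, 0.65))
num.subj <- sample(c(25, 50), n.students, replace = TRUE)
advisor  <- quality * r + runif(n.students, min = 1, max = 5) * (1 - r)

# One study: draw a true effect, simulate two groups, test at p <= .05.
one.study <- function(s, p) {
  m <- rbeta(1, s, 10)
  as.numeric(t.test(rnorm(p), rnorm(p, mean = m))$p.value <= 0.05)
}

# Count positive results for each student.
num.positives <- mapply(function(s, n, p) {
  sum(vapply(seq_len(n), function(i) one.study(s, p), numeric(1)))
}, quality, num.exp, num.subj)

# Hiring score: sum of z-scored positives and z-scored advisor quality.
z <- function(x) (x - mean(x)) / sd(x)
hiring.score <- z(num.positives) + z(advisor)

cor.hiring    <- cor(hiring.score, quality)
cor.positives <- cor(num.positives, quality)
cor.advisor   <- cor(advisor, quality)
print(round(c(hiring = cor.hiring, positives = cor.positives,
              advisor = cor.advisor), 2))
```

One thing to note about the assumed weighting: because student quality and the noise term have the same variance, the theoretical student-advisor correlation is r / sqrt(r^2 + (1 - r)^2), roughly .71 at r = .5, rather than r itself.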


  1. You might have noticed me copying a few tropes from another, much more intelligent blogger than me. Well, imitation flattery blah blah.
  2. You might ask, “why are you writing this, Paul? Are you actually embittered about academia given your current life?” Well, overly invasive individual, I am not trying to make some grand claim about how I was jobbed. For one thing, I didn’t even go on the job market. For another, if I had, I probably would have benefited at least partially from some of the effects I discuss. The issue for me had more to do with student debt, being an immigrant with no assets, socio-economic class, being genuinely passionate about technology from a young age, and discovering an interest in business as well as the desire to have a wider impact on people’s lives.
  3. I would also guess it has affected the SES of successful academics–if you are carrying a lot of student debt, it makes the choice to become an academic very difficult. I don’t know of any research on this, however.
  4. Of course, in some ways it still very much matters. I’m not aware of any systematic study, but I’ve observed that the children of professors have an extremely high success rate in academia. Part of that is surely being well taught and genetically more likely to be intelligent. But there is another factor–children of academics seem more savvy about the game, are always well positioned with the right advisor, in the right program, the right research, etc. Hard to say which of these factors is most important.
  5. Incidentally, you can see an instance of this in a recent paper on teacher incentives. The incentive that worked the best to motivate teachers was giving them a bonus and then threatening to take it away unless students’ scores increased. Of course, they increased. The lesson of hyper-motivation is that we shouldn’t be so quick to employ incentives such as these–they may be more likely to lead to cheating.
  6. I’m just using IQ as an example. Not that I believe in IQ as a great measure of intelligence. See another brilliant post by Shalizi here on this topic.