SimGradSchool, a study in new faculty hiring practices

[Attention conservation notice: 2000+ words about hiring in academia including an overly complex numerical model. Navel gazing surely to follow.] ¹

Like many other social science graduate students who have graduated in the last 5-10 years ², I have experienced the ratcheting up of competition. It’s interesting to think about the changing landscape of the competition for scarce tenure-track faculty positions in the last 30 years. On the one hand, the successes of identity politics have brought us an increasingly diverse (in terms of race, gender, nationality) academy. On the other, at least in the behavioral science, the sheer number of competitors and the need for impartial measures of candidate quality has narrowed the range of what top candidates’ dossiers look like ³. To be clear, I am not advocating for us to go back to the good old old days where knowing the right person and a pat on the back is all that mattered ⁴. But I do think that what we have now is also far from the optimal strategy in selecting the most promising behavioral scientists and giving them the best resources to succeed.

So what does the hiring game today look like? A few recent papers give us some clues. First, there’s this analysis of the political science hiring market, finding that the graduates of top programs (Harvard, Stanford, etc) dominate the job market. So much of the hiring game is already decided when you go through graduate admissions at the outset. In my experience advisor choice definitely is key–pick a good advisor and your chances increase. Another key factor is publication–it used to be the case that a top grad student from a top department maybe had one top tier journal article to their name. Now you typically see top candidates with 3-4 such papers, plus a host of “secondary” publications (like textbooks or edited volumes, which because they are not peer-reviewed, are typically worth less).

As readers of this are likely aware, the behavioral sciences (and biological sciences, to a lesser extent) are facing somewhat of a replicability crisis, with many key findings in our field being called into question. Did these games exist before? Yes, in all likelihood. However, what has changed is the environment of hyper-motivation. Loewenstein and Rick coined this term to describe the feeling of being “in a hole” or disadvantaged in a competition. The desire to “level the playing field” or “get back to even” is shown to be a prime motive in cheating behavior ⁵. So in a world where you constantly feel like you don’t have enough papers to succeed and get a top job and will be consigned to some far corner of the country or a low paying post-doc, you are all the more likely to cheat. Oh, and lest you think that whether a promising researcher gets into a top tier school or a second tier school doesn’t really matter–there is strong evidence that it does. See this paper that shows that an exogenous shock to hiring (a recession), affects the long term productivity of economists that are hired. So people are legitimately trying to maximize the career opportunities when they cut corners to get a few more papers out.

A Simple Model of Graduate School and Hiring
Struck by this system in which your success is determined by your pedigree and the number of papers you publish, I sought out to construct a simple numerical model of graduate school. I wanted to see how your underlying quality as a research correlated with hiring outcomes in a stylized environment. I posted the R simulation code that I wrote below so that other people can examine and even extend my model if they wish. Note that what follows is a high level summary of how the model works and my general findings from playing with it. Readers interested in the details of how it works are advised to check out the code, which is fairly well-commented.

(TL;DR Summary
The number of papers has at best a moderate correlation with your underlying quality as a researcher. The lion’s share of selection comes from advisors picking the best students. If graduate school sorting is reasonably good, hiring will be reasonably good.)

The basic idea of the model is that each graduate student is represented by an underlying quality parameter. This parameter determines the average true effect sizes of experiments they run. So in this representation, a better researcher is someone who proposes more effects with larger average effect size. I realize this is a bit of an oversimplification, but setting things up this way has some nice properties. Essentially we can generate some proposed effects from a pool of researchers, set up some very simple well-powered or under-powered studies to test those effects and simulate how many positive results they get.

The way I set up the course of graduate school then consisted in two parts. First, graduate students are assigned an advisor of some varying quality score. I had an external parameter modify the degree of correlation between the advisor quality and the underlying graduate student quality. So either advisors are great judges of talents, or not so good (I leave it to the reader to decide what the underlying value for the parameter should be). Graduate students then either run somewhat under-powered or somewhat well-powered studies, the number of which range over the number someone might expect to run in a typical grad student career (This can range from just a few studies all the way up to 100. Yes, there are definitely graduate students who run 100 studies over the course of graduate school.). So there are a number of different factors that contribute to the number of successful (aka t-test reveals p < .05) studies a graduate student in my simulation ends up with--their sample sizes, the number of studies they manage to pull off, and of course, their underlying quality. From there, I created a weighted average of the advisor's quality and the number of successful studies each graduate student had. This weighted score is a measure of job market desirability. In a crude way, it measures how well each student would fare on the market. Assuming you believe this is a reasonable model (see objections below if/when you don't), you can then correlate the weighted hiring score to the original quality score and ask, under a different set of model parameters, what is the correlation between hiring score and quality score? If the given hiring procedure were doing a reasonable job, then you would expect the correlation would be high. If bad, then low. What do actual runs of the model yield? As you can see from the code below, I set up a pool of 1000 graduate of students of different quality, assigned them to advisors with fairly high correlation between student and advisor quality (I set it at .5 and .7 for the various models), randomized how many studies they ran and what kind of sample sizes (25 or 50 per condition). Basically what you find is a moderate correlation between hiring score and (~ .6). However, interestingly, most of that is driven by advisor selection. The correlation between positive results and your quality as a student was significantly lower (~.35). Now, you might think that's pretty good (that is a medium effect size how psychologists typically define it). But the question you should be asking yourself is--relative to what? Also, this model assumes no p-hacking or file-drawer effects. The effects posited by experimenters are assumed to be independent so dropping failed studies isn't a problem. In real life, the correlation is likely to be much smaller, not larger. Some Obvious Objections
Now one obvious objection is that this is a gross oversimplication of the job market. There is an extensive interview process, so if someone’s research does not adequately reflect their skills, then the interview process would remove that. Notice, however, that the people a typical institution chooses to interview in the first place is largely a weighted average of the papers and advisor–no one is interviewing a candidate from University of Random with no papers. So even if there are other factors that affect selection that are more closely correlated with underlying quality, those other factors do not come into play until the end of the process where arguably the other promising students have already been weeded out. Also there is a huge hindsight bias coming into play–you know already that this person has successful research, so you are more likely to believe that what they are saying is plausible.

Another objection is that we actually do want to reward the people who run more studies–we want to reward some element of “can do gumption” beyond just researcher skill, and your positive result count is a nice indicator of that (since it partly reflects the sheer amount of work you put in). I have some sympathy for this argument, but given what I wrote above about hypermotivation and the various file drawer issues, I’m not sure we want to reward people just for running a lot of studies. Also, its not clear to me that the difference between someone who ran “many” and “few” studies translates into differences of effort expended–you might run more studies because you have more grant money at a prestigious university, or a better set up subject pool, or better undergrad research experience, or a very organized advisor–many reasons that have nothing to do with your overall quality as a researcher, even assuming that ceteris parabis, more studies = harder worker.

Conclusions
So improvements in our hiring algorithm definitely could potentially increase the quality of scientific output. Arguably there is a Moneyball opportunity here–the correlation between IQ and SAT/ACT stores range from around .6 to .8 ⁶. So one might imagine a hiring process built around evaluating research before it’s executed, or a general purpose test meant to measure knowledge of the social sciences, or your ability to think about and present a research problem and solution with a few day’s notice. Surely a battery of tests that attempted to measure quality of research hypotheses a priori could potentially be much more successful in identifying and promoting the best scientists rather than relying on the small sample size of a few results done at the start of their careers.



library("plyr")
 
# Indicates the number of graduate students in the simulation.
num.students <- 1000
 
# Student quality is drawn from real values from 1-5 which correspond
# to the alpha parameter in a beta distribution with the
# beta parameter set to 10. The mean of this distribution is
# alpha / (alpha + beta) and indicates the average true effect size
# of any experiment that a student runs. 
students <- runif(num.students) * 4 + 1
 
# Experiments denotes how many experiments the students run
# On average, a student runs 18 experiments in graduate school.
# Because the distribution is a log-normal, there are a few
# outliers who run a lot of studies (~100). This lines up roughly
# with the author's observations and conversation with other
# students.
experiments <- floor(rlnorm(num.students, 2.75, .65))
 
# Subjects denotes how many subjects the students run per
# condition in their experiments. For simplicity, students
# can either run a large amount per condition (50), or a
# small amount.
subjects <- sample(c(25,50), num.students, replace = T)
 
# Inputs
# ss: vector of students
# r: quality of admissions (correlation between the quality of the 
#    students and the quality of the advisors)
# Outputs
# vector of scores which represent advisor quality, on average correlated with
# students with probability r
run.admissions <- function(ss, r) {
  admit.scores <- (ss * r) + ((runif(length(ss)) * 4 + 1) *
                  sqrt(1-(r^2)))
  admit.scores
}
 
# Inputs
# df: data.frame consisting of a vector for student quality, a vector for
# the number of experiments each student runs in graduate school, and a
# vector for the number of subjects per condition the student runs.
# Outputs
# The number of successful experiments each student has at the end
# of grad school 
run.grad.school <- function(df) {
  out <- adply(df, 1, function(x) run.research(s = x$quality, 
                                               n = x$num.exp,
                                               p = x$num.subjects))
  names(out)[length(out)] <- "num.positives"
  out <- out[order(out$quality, decreasing = T),]
  out
}
 
# Inputs
# s: avg true effect size of a particular student
# p: number of subjects per condition
# Outputs
# 1: indicates a successful study, 0: indicates a failure
run.study <- function(s, p) {
  m <- rbeta(1,s,10)
  x <- rnorm(p)
  y <- rnorm(p,mean = m)
  p.value <- t.test(x,y)$p.value
  out <- ifelse(p.value <= .05, 1, 0)
  out
}
 
# Inputs
# s: avg effect size represents student quality
# n: number of experiments run in grad school
# p: number of subjects per condition in each study  
# Outputs
# number of successful results (i.e. papers published)
run.research <- function(s, n, p) {
  out <- sum(unlist(lapply(1:n, FUN = function(x) run.study(s, p))))
  out
}
 
# Inputs
# df: dataframe containing advisors' quality as well as the number
# of positive results that the student obtained.
run.hiring <- function(df) {
  df$z.exp <- (df$num.positives -  mean(df$num.positives)) / sd(df$num.positives)
  df$z.adv <- (df$advisor - mean(df$advisor)) / sd(df$advisor)
  df$total_score <- df$z.exp + df$z.adv
  df <- df[order(df$total_score, decreasing = T),]
  df
} 
 
### Simulation Code Here
advisors <- run.admissions(students, .5)
s.df <- data.frame(quality = students, num.exp = experiments,
                   num.subjects = subjects, advisor = advisors)
out.df <- run.grad.school(s.df)
hired.df <- run.hiring(out.df)
library("plyr")

# Indicates the number of graduate students in the simulation.
num.students <- 1000

# Student quality is drawn from real values from 1-5 which correspond
# to the alpha parameter in a beta distribution with the
# beta parameter set to 10. The mean of this distribution is
# alpha / (alpha + beta) and indicates the average true effect size
# of any experiment that a student runs. 
students <- runif(num.students) * 4 + 1

# Experiments denotes how many experiments the students run
# On average, a student runs 18 experiments in graduate school.
# Because the distribution is a log-normal, there are a few
# outliers who run a lot of studies (~100). This lines up roughly
# with the author's observations and conversation with other
# students.
experiments <- floor(rlnorm(num.students, 2.75, .65))

# Subjects denotes how many subjects the students run per
# condition in their experiments. For simplicity, students
# can either run a large amount per condition (50), or a
# small amount.
subjects <- sample(c(25,50), num.students, replace = T)

# Inputs
# ss: vector of students
# r: quality of admissions (correlation between the quality of the 
#    students and the quality of the advisors)
# Outputs
# vector of scores which represent advisor quality, on average correlated with
# students with probability r
run.admissions <- function(ss, r) {
  admit.scores <- (ss * r) + ((runif(length(ss)) * 4 + 1) *
                  sqrt(1-(r^2)))
  admit.scores
}

# Inputs
# df: data.frame consisting of a vector for student quality, a vector for
# the number of experiments each student runs in graduate school, and a
# vector for the number of subjects per condition the student runs.
# Outputs
# The number of successful experiments each student has at the end
# of grad school 
run.grad.school <- function(df) {
  out <- adply(df, 1, function(x) run.research(s = x$quality, 
                                               n = x$num.exp,
                                               p = x$num.subjects))
  names(out)[length(out)] <- "num.positives"
  out <- out[order(out$quality, decreasing = T),]
  out
}

# Inputs
# s: avg true effect size of a particular student
# p: number of subjects per condition
# Outputs
# 1: indicates a successful study, 0: indicates a failure
run.study <- function(s, p) {
  m <- rbeta(1,s,10)
  x <- rnorm(p)
  y <- rnorm(p,mean = m)
  p.value <- t.test(x,y)$p.value
  out <- ifelse(p.value <= .05, 1, 0)
  out
}

# Inputs
# s: avg effect size represents student quality
# n: number of experiments run in grad school
# p: number of subjects per condition in each study  
# Outputs
# number of successful results (i.e. papers published)
run.research <- function(s, n, p) {
  out <- sum(unlist(lapply(1:n, FUN = function(x) run.study(s, p))))
  out
}

# Inputs
# df: dataframe containing advisors' quality as well as the number
# of positive results that the student obtained.
run.hiring <- function(df) {
  df$z.exp <- (df$num.positives -  mean(df$num.positives)) / sd(df$num.positives)
  df$z.adv <- (df$advisor - mean(df$advisor)) / sd(df$advisor)
  df$total_score <- df$z.exp + df$z.adv
  df <- df[order(df$total_score, decreasing = T),]
  df
} 

### Simulation Code Here
advisors <- run.admissions(students, .5)
s.df <- data.frame(quality = students, num.exp = experiments,
                   num.subjects = subjects, advisor = advisors)
out.df <- run.grad.school(s.df)
hired.df <- run.hiring(out.df)

Notes:

You might have noticed me copying a few tropes from another, much more intelligent blogger than me. Well, imitation flattery blah blah. ↩
You might ask, “why are you writing this, Paul? Are you actually embittered about academia given your current life?” Well, overly invasive individual, I am not trying to make some grand claim about how I was jobbed. For one thing, I didn’t even go on the job market. For another, if I had, I probably would have benefited at least partially from some of the effects I discuss. The issue for me had more to do with student debt, being an immigrant with no assets, socio-economic class, being genuinely passionate about technology from a young age, and discovering an interest in business as well as the desire to have a wider impact on people’s lives. ↩
I would also guess it has affected the SES of successful academics–if you are carrying a lot of student debt, it makes the choice to become an academic very difficult. I don’t know of any research on this, however. ↩
Of course, in some ways it still very much matters. I’m not aware of any systematic study, but I’ve observed that the children of professors have an extremely high success rate in academia. Part of that is surely being well taught and genetically more likely to be intelligent. But there is another factor–children of academics seem more savvy about the game, are always well positioned with the right advisor, in the right program, the right research, etc. Hard to say which of these factors is most important. ↩
Incidentally, you can see an instance of this in a recent paper on teacher incentives. The incentive that worked the best to motivate teachers was giving them a bonus and then threatening to take it away unless students’ scores increased. Of course, they increased. The lesson of hyper-motivation is that we shouldn’t be so quick to employ incentives such as these–they may be more likely to lead to cheating. ↩
I’m just using IQ as an example. Not that I believe in IQ as a great measure of intelligence. See another brilliant post by Shalizi here on this topic. ↩

Paul Litvak

SimGradSchool, a study in new faculty hiring practices

One thought on “SimGradSchool, a study in new faculty hiring practices”

Leave a Reply