What If Everyone Knew Which Science to Trust?
And now for something completely different...
In graduate school I studied decision science.1 I learned the methods and the great (in)controvertible findings of the field, ran dozens of experiments, and crunched many numbers. I also learned something else: if your results weren't significant, your career was in trouble.
I felt pressure to p-hack. To run a few more subjects and check the results, drop a condition, move failed studies into the file drawer, anything to cross that magical p < .05 threshold. My dissertation proposal was accepted on the condition that I get a positive result in an experiment. But I didn’t want to play that game. Instead I published a meta-analysis with a null effect as part of my dissertation. The findings didn’t support the hypothesis. That’s what the data said, so that’s what I reported.
Then I left academia.
The Problem Followed Me
I went into tech, working at top companies like Facebook, Google and Airbnb. I learned a lot about what it takes to build great products and run effective teams. Eventually I co-founded my own company, a Stanford nanotech spinout building a health-sensing toothbrush. The core technology was based on published research. Thousands of papers supported the sensor approach we were using.
The technology didn’t work. When we tried to replicate the foundational science, we couldn’t. Thousands of papers, and the basic claims didn’t hold up.
The problem had followed me out of academia and into industry! Meanwhile, the news kept confirming what I’d experienced firsthand.
Fraud Makes Headlines
In 2022 we learned that a landmark paper on a protein called Aβ*56, cited nearly 2,500 times and the basis for over a billion dollars in annual Alzheimer’s research funding, contained fabricated images.2 Other labs had quietly failed to find the protein for years, but those null results went unpublished. The field had spent sixteen years chasing a lead that was never real.
Marc Tessier-Lavigne, president of Stanford, resigned after investigations confirmed problems in papers he'd authored, problems first flagged by Elisabeth Bik, a microbiologist who has personally scanned over 20,000 papers for image manipulation.3
And in May 2025, Harvard revoked tenure for the first time in 80 years.4 The professor was Francesca Gino, who had built her career studying honesty and ethics and was dismissed after forensic analysis turned up evidence of fabricated data across multiple studies. The irony of a dishonesty researcher faking her honesty research would be funny if it weren't so devastating.
The Scale of the Problem
The numbers are hard to fully absorb. The landmark 2015 Reproducibility Project tested 100 psychology studies from top journals and found that only 36 percent successfully replicated.5 An eight-year project attempting to verify high-impact cancer biology findings reproduced fewer than half of the experimental effects it tested, and the successful replications showed effects 85 percent smaller than originally claimed.6 Over 10,000 papers were retracted in 2023 alone, a new record and roughly double the previous year's total.7
One widely-cited estimate puts the cost of irreproducible preclinical research in the United States at tens of billions of dollars annually.8 The authors acknowledge significant uncertainty in that figure, but even the lower bound suggests an enormous waste of resources. And that’s just the direct costs. It doesn’t count the years lost chasing false leads, the patients enrolled in trials testing hypotheses that were never true, the graduate students whose careers collapsed when they couldn’t reproduce their mentors’ results.
Something Changed: AI Made a New Approach Possible
Methods to address these problems exist, but applying them by hand across all of science is hopelessly tedious. Manual efforts have scaled to only thousands of papers, and automated field-wide analyses have struggled to distinguish the p-values that test a paper's central hypothesis from incidental ones.
For example, p-curves can detect when a literature has too many p-values clustering just below .05, a telltale sign of p-hacking.9 Meta-analytic techniques can assess whether effect sizes are consistent across studies. Preregistration checks can verify whether researchers tested what they said they’d test. Statistical forensics can flag impossible numbers.10
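To give a flavor of the first of these, here is a toy version of a p-curve red-flag check. It is a deliberately simplified illustration, not the full Simonsohn, Nelson, and Simmons procedure: under a true null, significant p-values should be roughly uniform between 0 and .05, so a pile-up just under .05 is suspicious.

```python
# Toy p-curve red-flag check (simplified; not the full Simonsohn et al. procedure).
# Under a true null, significant p-values are roughly uniform on (0, .05), so about
# half should fall below .025. A surplus just under .05 is a p-hacking warning sign.
from scipy.stats import binomtest

def p_curve_red_flag(p_values, alpha=0.05):
    sig = [p for p in p_values if 0 < p < alpha]
    high = sum(1 for p in sig if p > alpha / 2)     # "just significant" p-values
    test = binomtest(high, n=len(sig), p=0.5, alternative="greater")
    return {"n_significant": len(sig), "just_under_threshold": high, "flag_p": test.pvalue}

# A literature where most reported p-values cluster just below .05
print(p_curve_red_flag([0.048, 0.044, 0.049, 0.041, 0.035, 0.046, 0.012, 0.047]))
# {'n_significant': 8, 'just_under_threshold': 7, 'flag_p': 0.035...}
```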
Having moved into AI product development as an applied practitioner, I started to see a new possibility: AI can now do the tedious extraction work, pulling hypotheses, sample sizes, test statistics, and p-values from thousands of papers automatically, so these analyses can run at scale.
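To make that concrete, here is the kind of structured record the extraction step aims to produce for each paper. The field names are my own illustrative shorthand, not a fixed standard:

```python
# Illustrative target schema for automated extraction; field names are
# my own shorthand, not a fixed standard.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StatisticalTest:
    hypothesis: str                 # the claim this test addresses
    test_statistic: Optional[str]   # e.g. "t(58) = 2.41"
    p_value: Optional[float]
    sample_size: Optional[int]
    is_focal: bool = False          # does it test the paper's central hypothesis?

@dataclass
class PaperRecord:
    doi: str
    title: str
    preregistered: bool
    tests: list[StatisticalTest] = field(default_factory=list)

# One extracted record, ready for downstream checks like the p-curve sketch above
record = PaperRecord(
    doi="10.0000/example",
    title="Example study",
    preregistered=False,
    tests=[StatisticalTest("X improves Y", "t(58) = 2.41", 0.019, 60, is_focal=True)],
)
```

Flagging which tests address a paper's focal hypothesis is what lets downstream analyses separate the relevant p-values from incidental ones.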
I started building tools to do exactly this, and I began to realize the impact they could have.
The Gap in Today’s Tools
There’s a whole ecosystem of scientific AI tools emerging right now. But they all stop short of what’s really needed.
Scientific search engines like Elicit and Consensus are genuinely useful. Elicit searches 138 million papers. Consensus shows you how many studies support or oppose a claim. But as Consensus acknowledges, each claim counts the same whether it comes from a meta-analysis of a thousand studies or a single-subject case study.11 They help you find papers. They can't tell you which ones to trust.
Automated peer review systems, tools that scan papers for methodological issues and suggest improvements, are proliferating.12 They're helpful for researchers, and they surface potential problems that need to be addressed. But they stop short of actual evaluation or scoring. They won't tell you that a finding is probably unreliable.
Specialized fraud-detection tools exist for specific problems. ImageTwin and Proofig catch duplicated or manipulated images.13 StatCheck flags calculation errors.14 The GRIM test detects impossible means. The Institute for Replication is building an AI engine to re-execute code and verify computational reproducibility.15
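To make one of these forensic checks concrete, here is a minimal version of the GRIM consistency test. It is a simplification of Brown and Heathers' procedure that ignores some rounding edge cases: for data measured in whole numbers, a reported mean must be reachable as an integer total divided by the sample size.

```python
# Minimal GRIM check (simplified from Brown & Heathers, 2017): for data measured in
# whole numbers, a reported mean must equal some integer total divided by N.
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    total = round(reported_mean * n)                 # nearest achievable integer sum
    return round(total / n, decimals) == round(reported_mean, decimals)

print(grim_consistent(5.19, n=28))  # False: no sum of 28 integer responses rounds to 5.19
print(grim_consistent(5.18, n=28))  # True: 145 / 28 = 5.1786, which rounds to 5.18
```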
These are all good. But they’re siloed. Nobody is building the integrated system, one that brings all the signals together, reasons about individual papers the way a critical scientist would, looks at the full body of meta-analytic evidence, and produces an overall assessment of how much you should trust a given claim.
That’s what I’m building.
Vision: What a Critical Scientist Looks Like at Scale
Imagine you could ask: what’s the evidence that intervention X actually works? And instead of getting a list of papers, you got an assessment.
Here are 47 studies testing this claim. 12 were preregistered; of those, 8 found the predicted effect. The non-preregistered studies show a suspicious clustering of p-values just below .05. Three independent replications failed to find the effect. The original study was underpowered and has never been directly replicated. Two papers have statistical errors that, when corrected, eliminate significance. Bottom line: weak evidence, high risk of false positive.
This is how a careful, critical scientist thinks about evidence. They don’t just count papers. They weigh methodology, check for red flags, look at replication status, consider the full pattern. We can build AI systems that do this—systems that make this kind of careful evaluation possible for every claim, not just the few that get manual scrutiny.
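As a sketch of what such an assessment could look like in machine-readable form, here is a toy roll-up of the signals described above. The signal names, thresholds, and decision rule are placeholders for illustration, not the actual scoring model:

```python
# Illustrative claim-level assessment; signal names, thresholds, and the decision
# rule are placeholders, not the actual scoring model.
from dataclasses import dataclass

@dataclass
class ClaimAssessment:
    n_studies: int
    n_preregistered: int
    n_prereg_supporting: int
    n_failed_replications: int
    p_hacking_flag: bool            # e.g. from a p-curve check
    n_statistical_errors: int       # e.g. from StatCheck- or GRIM-style checks

    def verdict(self) -> str:
        # Naive rule-based roll-up, standing in for a meta-analytic reasoning model
        red_flags = (self.n_failed_replications + self.n_statistical_errors
                     + int(self.p_hacking_flag))
        prereg_support = (self.n_prereg_supporting / self.n_preregistered
                          if self.n_preregistered else 0.0)
        if red_flags >= 3 or prereg_support < 0.5:
            return "weak evidence, high risk of false positive"
        if red_flags == 0 and self.n_preregistered >= 3 and prereg_support >= 0.8:
            return "strong evidence"
        return "mixed evidence"

# The scenario described in the paragraph above
print(ClaimAssessment(47, 12, 8, 3, True, 2).verdict())
# weak evidence, high risk of false positive
```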
The output should be meaningfully predictive of two things: whether a finding will replicate, and whether it is in line with the thinking of skeptical experts. Those are the real tests of credibility.
Who Needs This?
Almost everyone, it turns out.
Grantmakers and philanthropists deciding which interventions to fund. Right now they rely on manual literature reviews that can’t possibly keep up.
Policymakers basing policies on research. The growth mindset interventions that schools adopted based on Carol Dweck’s work? A large-scale UK trial found zero statistically significant effects on any academic outcome.16
Journalists trying to report on science accurately. Every week brings new studies with dramatic claims. Which ones should they cover? Which should they be skeptical of?
Government research agencies trying to improve the quality of science they fund. You can’t reform what you can’t measure, and right now there’s no way to monitor whether policy changes—like preregistration requirements—actually shift the distribution of evidence quality across a portfolio.
The general public, increasingly trying to navigate scientific papers themselves. If you’ve ever Googled a health question and tried to read the studies, you know how hard it is to evaluate what you’re reading.
AI labs and companies building on scientific literature. As AI systems increasingly use scientific papers for training and retrieval, they need to know which papers to trust. Garbage in, garbage out, at unprecedented scale.
This is a vital public good. Reliable information about what science actually knows, and doesn’t know, is infrastructure for a functioning society.
What We’ve Built So Far
The first piece already exists: an API at evidence.guide that extracts structured data from papers. Upload a PDF, get back JSON with study details, hypotheses, test statistics, and p-values. I've validated it against hundreds of hand-coded papers, with 92%+ accuracy on p-value extraction. As far as I know, no other public API offers this kind of extraction today.
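Usage looks roughly like the sketch below. The endpoint path, authentication scheme, and response field names are illustrative assumptions rather than the exact contract; see evidence.guide for the real details:

```python
# Hypothetical client sketch: the endpoint path, auth header, and response field
# names here are illustrative, not the documented evidence.guide contract.
import requests

with open("study.pdf", "rb") as f:
    resp = requests.post(
        "https://evidence.guide/api/extract",               # illustrative endpoint
        files={"file": f},
        headers={"Authorization": "Bearer YOUR_API_KEY"},    # placeholder credential
        timeout=120,
    )
resp.raise_for_status()
record = resp.json()                                          # assumed response shape
for test in record.get("tests", []):
    print(test.get("hypothesis"), test.get("test_statistic"), test.get("p_value"))
```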
The next step is to build out a more comprehensive set of quality signals and a meta-analytic reasoning model that can weigh them appropriately.
How You Can Help
I need help.
Money. We’re a 501(c)(3) nonprofit. Donations fund development, compute, and eventually a small team. Even small amounts help.
Compute. Running sophisticated analyses on millions of papers requires serious compute. The Render Network Foundation has generously provided initial support. If you have access to compute resources and want to support open science infrastructure, I want to talk to you.
Engineering talent. I’m looking for a junior (potentially new grad!) full-stack engineer or data engineer, someone passionate about this problem who can work on a tightly scoped 3-month project with potential for a full-time role. If that’s you, or you know someone, reach out.
Introductions. If you know funders, researchers, or organizations who should be aware of this work, I’d be grateful for connections.
Feedback. If you’re a researcher who would use these tools, I want to hear from you. What signals matter most? What would make this useful for your work?
Why This Matters
I left academia because the incentives were broken. I watched my startup fail because the foundational science wasn’t real. I’ve seen the toll this takes, on researchers who can’t replicate their mentors’ work, on patients enrolled in trials testing fabricated hypotheses, on the public’s trust in science itself.
The replication crisis is mostly a story of broken incentives, not bad actors. Most researchers want to do good work. The system rewards cutting corners. We need infrastructure that makes reliable science visible and unreliable science obvious.
Learn more: dawes.institute | evidence.guide
Support the work: Donate
Get in touch: info@dawes.institute
For readers of this Substack, the sudden change in topic may feel a bit jarring. Don't worry, more dharma-related content is coming! For those of you who aren't regular readers, welcome! I write about Buddhist-related things, and now metascience too!
Piller, C. (2022). Blots on a field? Science. https://www.science.org/content/article/potential-fabrication-research-images-threatens-key-theory-alzheimers-disease
Bik, E. (2024). Einstein Foundation Award recipient profile. https://award.einsteinfoundation.de/award-winners-finalists/recipients-2024/elisabeth-bik
NBC News. (2025). Harvard professor Francesca Gino’s tenure revoked amid data fraud investigation. https://www.nbcnews.com/news/us-news/know-harvard-professor-francesca-gino-tenure-revoked-data-fraud-invest-rcna209219
Simonsohn, U., Nelson, L., & Simmons, J. (2023). Data Falsificada (Part 2): “My Class Year Is Harvard.” Data Colada. https://datacolada.org/110
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. — Note that the 36% is a contested number depending on how you define successful replication. Depending on how you measure it, you could argue the % is somewhat higher, but I don’t think you could call the resulting replication rate good.
Errington, T. M., et al. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601. https://elifesciences.org/articles/71601
Van Noorden, R. (2023). More than 10,000 research papers were retracted in 2023—a new record. Nature. https://www.nature.com/articles/d41586-023-03974-8
Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLOS Biology, 13(6), e1002165. https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534.
Brown, N. J. L., & Heathers, J. A. J. (2017). The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science, 8(4), 363-369.
Consensus. (2024). Consensus Meter: Guardrails and Limitations. https://consensus.app/home/blog/consensus-meter/
One of the best (and only actually available) ones is refine.ink. It's really impressive! And it costs around $50 per paper — not easily scalable to the entire scientific record.
Proofig. (2024). How Scientific Journals Are Fighting Image Manipulation with AI. https://www.proofig.com/newsroom/nature-shares-how-scientific-journals-are-using-tools-like-proofig-ai-to-combat-image-integrity-issues
Nuijten, M. B., et al. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205-1226.
Institute for Replication. (2024). The AI Replication Engine: Automating Research Verification. https://i4replication.org/the-ai-replication-engine-automating-research-verification/
Foliano, F., Rolfe, H., Buzzeo, J., Runge, J., & Wilkinson, D. (2019). Changing Mindsets: Effectiveness trial. Education Endowment Foundation.

