The future of academic journals?
Why journals lose their grip when scientific claims become legible
Marx wrote that in full communism the state would just wither away, but he conveniently left out exactly how. A lot of writing on the future of scientific publishing is like that: pretty hand-wavy when it comes to exactly how we transition into the glorious post-journal future.
The good news is that the transition is already underway! The reality, though, is that there is a lot yet that will need to happen to get from here to there. In this piece, I’m going to explain where we are on that transition, tell a detailed story of what that transition could look like, and consider some drawbacks and objections. My goal is to convince you both that this process is in some sense inexorable and that it actually portends something good for science and scientists. There’s obviously a lot of fear and perhaps a mounting backlash to AI among academics, some of which is for justified reasons. But I think we can genuinely get excited about the possibility of real reform of academic publishing. This is important because it not only affects academia, but it affects all of us downstream consumers of the research process.
The present gestures toward the future
To that end, there have been a lot of experiments on the future of publishing, from new journals to full end-to-end systems that span research submission, review and dissemination. But the effects of these experiments have been modest.1 To understand why, you have to consider all of the different functions an academic journal fulfills: distribution, curation, certification and ultimately prestige. Alt-journals have largely been able to break the monopoly on distribution, curation, and to a lesser extent, certification. But journals have largely maintained a monopoly on prestige: most academics’ careers and funding depend on where they publish. In a world flooded by new publications, the brand of a top journal like Science or Nature is a powerful signal of quality. And unlike citations or other metrics of impact, which take time to accumulate, publication in a top journal confers instant credibility. In much of academia, grants and tenure can be determined by just a few such papers.
Of course, one alternative approach is to change how tenure and promotion are conferred to deemphasize brand name journal publications. There are experiments like this underway, notably in the University of Maryland psychology department. This is terrific, but changing the hiring and award policies in each one of hundreds of thousands of academic departments is an uphill battle that will likely take a long time.
Free journals are an obvious lever: they already exist and compete with the for-profit journals. They have made some headway, but coverage by field is patchy and the for-profit journals have maintained most of their brand advantage. It’s possible that the brand value of journals will rapidly erode as AI submissions dilute their quality. This could happen, though it’s likely that journals will respond, for example by augmenting peer review with AI triage. Or they might try to use AI text detectors (like Pangram) to filter out AI submissions. I doubt that will even partially stem the tide of submissions, though.
Another approach is to change the model of academic publishing. To date, there have been three kinds of approaches to replacing the prestige part of the traditional journal bundle: the overlay journal approach, the conference model, or relying on post-publication metrics.
The overlay approach (e.g. Discrete Analysis, eLife reviewed preprints) adds peer review on top of public access repositories, unbundling curation and certification from distribution. Yet in mathematics, Annals of Mathematics, Inventiones, and JAMS remain hugely prestige-laden for tenure. Overlay journals are a small fraction of math publishing and grow slowly. It’s possible this approach will crowd out for-profit journals over time, but for now the best one can hope for is that overlay journals become one alternative among several.
In the conference model, most prominent in computer science, conference acceptance and best paper awards are used to confer prestige and tenure. This has its own pathologies: deadline-driven, half-finished work and a host of tactics to game the submission system. More importantly, in experimental sciences, where it can take years to gather the data needed to support a claim, the short timeline of a conference model wouldn’t work. It’s also worth noting that every peer-review approach that depends on free labor is at risk from AI-generated submissions and reviewer burnout. An ideal system would find a way to pay for good reviewers!
Finally, in the post-publication metrics approach, metrics are used to sort research and researchers. Examples include altmetrics, which rely on “attention” broadly construed, or the venerable h-index, which relies on citations. By and large academics hate these. Citation metrics, in particular, have been heavily gamed, with citation collusion rings being discovered regularly. The h-index also, of course, still relies on the paper as the unit of scientific currency. Metrics have had some impact on how scientists are judged, but few would deny that they’ve been corrupted. The last thing we need is to reduce scientists to a few incomplete and corruptible metrics.
And yet, I think AI-enabled metrics computed over a structured representation of scientific knowledge will be part of the answer…
A proposed path
Here’s the idea. We use AI to parse scientific research into atomic claims arranged in a nomological network. A nomological network is a way to represent scientific theories - the entities they operate over (the “ontology”) as well as their observable manifestations (the “operationalization”) and the relationships between them. The representation would also include the scope for the claim (e.g. how far it generalizes) and a list of auxiliary assumptions. The entities of this network can be things like “self-control”, “inflation expectations” or “ribosome”. For something like self-control, for example, we might have a number of different operationalizations, which could include a neurological signature or a survey instrument. The relationship between the entities and how we measure them is itself a claim with evidence. There are different kinds of claims one can make relating entities, like “Depression causes sleep disruptions” or “These gene SNPs correlate with educational attainment.”
Each claim also carries a posterior that reflects current evidence. And the system keeps track of provenance, i.e. which studies, how they were designed, and the various checks the evidence has passed (e.g. computational reproducibility). Moreover, as knowledge grows and evolves, the graph should too. Scientists (and AIs) could propose splitting an entity in two, merging two entities (e.g. ego depletion is just fatigue!), adding a new construct, contesting an operationalization, etc. And ultimately it’s the community of scientists, aided by powerful AI systems, that will govern2 and decide how this system reflects scientific consensus, or the lack thereof.
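As a rough illustration of the structure described above, here is a minimal Python sketch. The class names (`Entity`, `Claim`, `Network`), the field names, and the `merge_entities` operation are all hypothetical choices made for this example, not an existing schema. The default posterior and the study ID are placeholders, not real estimates or references.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A construct in the ontology, e.g. "self-control"."""
    name: str
    # Ways the entity is measured, e.g. a survey instrument.
    operationalizations: list[str] = field(default_factory=list)


@dataclass
class Claim:
    """An atomic claim relating two entities."""
    subject: str
    relation: str  # e.g. "causes" or "correlates_with"
    obj: str
    scope: str  # how far the claim is meant to generalize
    assumptions: list[str] = field(default_factory=list)
    posterior: float = 0.5  # placeholder credence given current evidence
    provenance: list[str] = field(default_factory=list)  # supporting study IDs


@dataclass
class Network:
    entities: dict[str, Entity] = field(default_factory=dict)
    claims: list[Claim] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def merge_entities(self, keep: str, absorb: str) -> None:
        """Fold `absorb` into `keep`, rewriting claims that mention it."""
        absorbed = self.entities.pop(absorb)
        kept = self.entities[keep]
        for op in absorbed.operationalizations:
            if op not in kept.operationalizations:
                kept.operationalizations.append(op)
        for claim in self.claims:
            if claim.subject == absorb:
                claim.subject = keep
            if claim.obj == absorb:
                claim.obj = keep


# Hypothetical example: the "ego depletion is just fatigue" merge proposal.
net = Network()
net.add_entity(Entity("fatigue", ["self-report scale"]))
net.add_entity(Entity("ego depletion", ["sequential-task paradigm"]))
net.claims.append(
    Claim("ego depletion", "causes", "reduced persistence",
          scope="lab tasks", provenance=["study-001"])
)
net.merge_entities(keep="fatigue", absorb="ego depletion")
```

In a real system a merge would be a contested proposal with its own evidence and review rather than an immediate mutation. The sketch only shows that such edits can be expressed as small, auditable graph operations.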
Producing something like this at scale, let alone keeping it up to date, would have been an enormous and massively expensive task to do in the past. But with current AI systems and the promise of even more capable future AI systems, building something like this is possible3. Humans can and likely will (especially at first) contribute to overseeing the creation of the overall ontology, as well as contributing much needed evaluation to ensure it’s correct. In fact, reading the foundational literature of a scientific field and then carefully making sure it’s well represented in this system is just the kind of task that would be useful for a beginning grad student to do.
Then the next step is to be able to quantify the contributions to science by changes to this graph over time. Once you create such metrics, such a system could attribute incremental improvements to our understanding to the scientists who caused it. Here’s how that could work, over time, to replace the scientific journal as a record of contribution for scientists. Each stage brings in a new set of users: research scientists first, then AI systems and policy makers, then everyone downstream of better-funded and better-targeted science.
Stage one: living layers form alongside journals
Initially, the system starts as a tool that allows research scientists to assess the scientific literature. They upload a set of published papers they locate via a search engine, and the system extracts the claims, applies a forensic audit, runs computational reproducibility checks where data is available, and outputs a claim-level posterior. The scientist can save a search that automatically updates the analysis when new papers come in. During this time, it’s mostly used by scientists and investigators, who use the evidence aggregation and forensic tools to quickly assess a scientific literature. They are willing to use it despite its flaws, and in doing so help provide the feedback and data needed to make LLM-based claim extraction highly accurate. They use it mainly to expose poor quality evidence and publish meta-analyses of their own. Some of these become the first living evidence systems. At first, the coverage of this living layer is patchy; limitations in AI quality, funding, and human bandwidth to evaluate model outputs mean that it may take a few years for this system to gain momentum and coverage. Meanwhile, coordinating efforts, injections of philanthropic funding, and enabling technologies (e.g. turning PubMed into structured data AIs can accurately ingest) support investment in this burgeoning technology and put underemployed researchers to work expanding the system’s coverage. During this time, journals continue to certify individual papers.
The turning point is when the review tool passes a threshold where it’s almost entirely accurate. At that point, an investigator makes their first set of major discoveries. A major area of research is found by the AI system to be unreliable or fraudulent, despite dozens of publications in top journals. Everyone takes notice. All of a sudden, the journal brand has been dented - the reliability of research is evaluable outside the traditional journal system. People start to have doubts about the journals. At this point, the system starts to really take off, and interest in scaling it starts to grow. More funding comes in, from governments, philanthropies, and AI companies themselves, who finally realize this is a valuable source of training data for them.
Stage two: certification shifts toward provenance and audit
As these evidence syntheses scale and become authoritative, certification of research quality follows naturally. More studies pass checks, and platforms (like asCollected, which recently launched) provide data provenance verification to certify that data were collected legitimately. Increasingly, the entire scientific workflow is instrumented, so the whole research process is logged, tracked and certified. As the system becomes more authoritative, its mistakes are increasingly rooted out quickly by the scientists themselves, who are incentivized to have their work analyzed correctly. So misunderstandings are smoothed out, and the system’s accuracy increases further.
In addition, organizations of scientists decide to aggregate their own expertise, enabling voting and a way for anyone to see scientists’ consensus on claims4. The system and its metrics continue to evolve as scientists’ opinions change. The system is administered by a non-profit charged with making sure the metrics aren’t gamed. Staffed by forensic data scientists and meta-scientists, this organization measures and maintains the way results and contributions are certified. This is meant to resist, for example, review cartels. The audits and the evidence aggregation mechanisms themselves continue to improve as scientists propose new ways to measure how science is done.
The evidence for atomic claims is certified by surviving forensic audit, computational reproduction where applicable, and integration into the living evidence graph with appropriate weighting. Good studies sway the balance of evidence more than bad studies. A small number of scientists start getting jobs and promotions based on their contributions to the research graph rather than traditional publications. Once that starts happening, other scientists really stop and take notice. Scientists rely on these systems more and more as they become more comprehensive. The more comprehensive the system is, the more certification it provides, and the more it’s relied upon, the more incentive journals have to participate. They need to include their articles, even if doing so erodes the articles’ value as scientific artifacts. Nonetheless, journals continue to play a role for narrative synthesis, theoretical work, and complex multi-claim research. They stop being the primary certification layer for atomic empirical claims, however.
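One way to make “good studies sway the balance of evidence more than bad studies” concrete is to add each study’s evidence to the claim’s log-odds, discounted by a quality weight. The sketch below assumes that each study can be summarized by a Bayes factor and by an audit-derived quality score between 0 and 1. The `Study` fields and the discounting rule are assumptions made for illustration, not a settled method, and the numbers are invented.

```python
import math
from dataclasses import dataclass


@dataclass
class Study:
    study_id: str
    bayes_factor: float  # >1 supports the claim, <1 counts against it
    quality: float       # 0 (ignore entirely) to 1 (take at face value)


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def claim_posterior(prior: float, studies: list[Study]) -> float:
    """Update a prior credence with quality-discounted evidence."""
    log_odds = logit(prior)
    for study in studies:
        # A low-quality study contributes only a fraction of its log Bayes factor.
        log_odds += study.quality * math.log(study.bayes_factor)
    return sigmoid(log_odds)


# Hypothetical studies reporting the same evidence at different quality.
rigorous = Study("preregistered-rct", bayes_factor=3.0, quality=1.0)
shaky = Study("underpowered-pilot", bayes_factor=3.0, quality=0.2)

posterior_rigorous = claim_posterior(0.5, [rigorous])
posterior_shaky = claim_posterior(0.5, [shaky])
```

With a prior of 0.5, the rigorous study moves the posterior further from the prior than the shaky one does, even though both report the same Bayes factor. Working in log-odds keeps each study’s contribution additive, which is convenient for attributing shifts later.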
As the system grows, the evidence layer becomes a direct input into AI training systems. Understanding the most up to date scientific knowledge becomes essential for both humans and AI scientists to quickly devise the best follow-on experiments. Meanwhile, the structuring of policy analysis has a dramatic effect on how government and philanthropic funds are deployed. It becomes very easy to understand the most up to date evidence for different global health interventions, speeding the spread of life-saving changes. Education policy improves with a clearer read on the data.
Stage three: incentives realign and many journals wither
Once there is a trusted self-updating evidence layer which can track and assign credit for scientific progress, it becomes increasingly common for scientists to directly submit their work to the evidence layer, bypassing traditional journal articles. At first, the amount a paper shifts the posterior on a named claim becomes a measurable and attributable way of rewarding scientists. But soon, more methods for measuring the impact of scientists proliferate. Some scientists make conceptual clarifications, while others make methodological improvements that clarify causal inference across dozens of empirical datasets at once. Others collect quality data that comprehensively adjudicates between competing theories5. There are so many different actions that are seen as furthering human and machine understanding. Scientists are rewarded for all of it. As researcher evaluation moves toward more holistic forms, these metrics become key contributors.
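To illustrate the first of these metrics, the sketch below credits authors for how much their work shifts a claim. It assumes each contribution is already expressed as a shift in the claim’s log-odds, splits credit equally among authors, and uses the absolute value so that disconfirming evidence earns credit too. All three choices, and the function name `credit_by_author`, are assumptions for this example rather than a proposed standard.

```python
from collections import defaultdict


def credit_by_author(contributions: list[tuple[list[str], float]]) -> dict[str, float]:
    """Split each contribution's absolute log-odds shift equally among its authors.

    `contributions` is a list of (authors, log_odds_shift) pairs for one claim.
    """
    credit: dict[str, float] = defaultdict(float)
    for authors, shift in contributions:
        share = abs(shift) / len(authors)
        for author in authors:
            credit[author] += share
    return dict(credit)


# Hypothetical contributions to a single named claim.
contributions = [
    (["ada", "ben"], 1.0),   # a study that raised the claim's log-odds
    (["ada"], -0.5),         # a replication that lowered it
]
credit = credit_by_author(contributions)
```

Here `ada` receives 1.0 and `ben` receives 0.5. Crediting the magnitude of a shift rather than its direction is one simple guard against rewarding only confirmatory results.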
At this point, textbooks start being created automatically from these living meta-analyses. Public-facing explainers emerge alongside them and the living meta-analyses now provide the context layer for search engines and for journalists writing news stories. The general public becomes better informed, and it’s easy for anyone to see what the latest research suggests people should do in order to improve their health.
Caveats and complications
There are a number of boundary conditions on this vision, of course: some types of scientific work aren’t amenable to this kind of structured decomposition. As always, there are legitimate fears that we will accidentally disincentivize some key aspect of scientific practice not legible to the machine. But I want to resist the idea that because metrics are imperfect we shouldn’t create them, or frankly that it’s possible to envision a world without them. We need some basis for selection, whether it’s for grants or jobs6; the overthrow of metrics would only lead us to a world that prioritizes who you know, that is, networking and nepotism7. Moreover, the more quickly scientists work via AI, the more telemetry and data artifacts will be created. Ultimately, that may make the creation of these new metrics inevitable.
Of course it’s possible that a much worse version of this plays out. First and foremost, journals could end up co-opting and bankrolling the creation of these systems, using their massive catalogue of papers. They could steamroll any attempts to release public data, and win fair use cases that allow them to maintain their monopoly on the production of knowledge. It would be unfortunate if the infrastructure of science were privately owned. It should be a public good and ultimately maintained by scientists.
Another risk is that this AI system is opaque, full of subtle flaws that make it unreliable upon close inspection. This is why human collaboration is key. Especially at first, we’ll need the system to have confidence scores, to have humans audit high-stakes nodes, and for all the system’s interpretations of papers to be transparent and auditable. It’s crucial that the system be as open as possible to foster trust. This is another reason why the institution that maintains this system should be a non-profit entity.
It’s also possible that we end up with a system that fails to incentivize researchers properly, and ends up with more fruitless gaming. Goodhart’s law is a reality, but that doesn’t mean that all metrics are bad everywhere always. There are better and worse uses of metrics. Goodhart’s law isn’t solvable, but it can be mitigated through the correct institutional structures.
Two cheers for metrics
To mitigate co-option or gaming, metrics need to be flexible and support human judgment. To that end, I want to propose three principles to guide the development of new metrics for valuing scientific contributions.
Values made explicit, and tunable
Metrics are always a reflection of what we value8. Those values are often implicit: “we value writing papers which get cited a lot.” Goodhart’s law still holds; but if we make those values explicit and open, then at least we have the possibility of creating better, more responsive metrics. There are many different scientific values we could care about: novelty, risk-taking, rigor, methodological improvements. All of these and more can be operationalized and included.
Metrics must continue to evolve and respond
The conservatism of academia is such that the few metrics they have are completely frozen. This is a major problem. Consider an alternative model for how metrics can be responsive to efforts to game them: Google search. The Google search ranking algorithm continues to change as researchers find ways to improve it, but also in response to efforts to game it. We know approximately the kinds of signals that Google uses to rank, but we aren’t 100% sure. The real but bounded transparency balances the need for people to be incentivized to create good content for Google to rank with the need for secrecy so that the algorithm can’t be readily reverse engineered. For instance, some of the features that enable sleuths to detect fraudulent papers might need to be secret for a while so that we can better use them. Eventually, those become public and (unfortunately) bad actors will adapt. As stewards of the scientific enterprise, we’ll have to adapt too. Moreover, Google runs experiments to test changes to its ranking - there is a meaningful control. Decisions which hinge on these metrics, like grant decisions, are well served by some decisions made by an alternative procedure for comparison. This is the model we should use when thinking about how to create metrics around the scientific process.
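The idea of keeping a meaningful control can be sketched as a randomized holdout: a random fraction of decisions is routed to an alternative procedure so that outcomes can later be compared. The function name `assign_arms`, the arm labels, and the 20% default are hypothetical choices for this example.

```python
import random


def assign_arms(application_ids, holdout_fraction=0.2, seed=0):
    """Randomly route a fraction of applications to an alternative procedure."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible and auditable
    ids = list(application_ids)
    rng.shuffle(ids)
    n_holdout = round(len(ids) * holdout_fraction)
    return {
        "alternative": sorted(ids[:n_holdout]),
        "metric_informed": sorted(ids[n_holdout:]),
    }


# Hypothetical grant applications numbered 1 through 10.
arms = assign_arms(range(1, 11), holdout_fraction=0.2, seed=42)
```

Because assignment is random, later differences between the two arms can be attributed to the decision procedure rather than to which applications happened to land where.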
Metric-informed, not metric-driven decision-making
Finally, metrics become tyrannical once they become full substitutes for human judgment. There are still “broken leg” cases, a term Paul Meehl used to describe rare circumstances that a human knows are important but that sit outside a predictive model. Imagine an actuarial model designed to predict whether a professor will go to the movies on any given night. The model predicts a 90% probability based on historical data. However, the expert knows the professor just broke their leg, and adjusts the probability to 0%. The bureaucratic nightmare is one where it’s obvious to any human with a brain and a heart that the rules do not apply, and yet the bureaucrat doesn’t make an exception. To be data-informed means to use the data as a tool and not to be a tool of the data. Of course, Meehl’s point was that humans call for exceptions far more often than is warranted, so there’s a balance to strike.
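A minimal sketch of that balance: allow a human override of the model’s output, but require a stated reason and track how often overrides happen, so that exceptions stay rare and reviewable. The `Decision` class and `override_rate` function are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    model_probability: float
    override_probability: Optional[float] = None
    override_reason: str = ""

    def __post_init__(self) -> None:
        # An exception to the model must be justified in writing.
        if self.override_probability is not None and not self.override_reason:
            raise ValueError("an override must state its reason")

    @property
    def final_probability(self) -> float:
        if self.override_probability is not None:
            return self.override_probability
        return self.model_probability


def override_rate(decisions: list[Decision]) -> float:
    """Fraction of decisions where a human overrode the model."""
    overridden = sum(d.override_probability is not None for d in decisions)
    return overridden / len(decisions)


decisions = [
    Decision(0.9),
    Decision(0.9, override_probability=0.0,
             override_reason="professor broke their leg"),
]
rate = override_rate(decisions)
```

A persistently high override rate would be a signal, in Meehl’s spirit, that people are calling for exceptions more often than is warranted, or that the model is missing something it should include.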
Conclusion
Despite reasons for concern and pessimism, there is a real prize for humanity to be won if we steward this carefully. A claim-level living evidence layer will earn scientists’ and the public’s trust as it improves. As it grows in scope and quality, the clarity it brings about which science is most trustworthy will start to shift academic culture. If we build it correctly, we can solve a problem that is only growing in importance with the advent of generative AI: how to use AI to safely expand the research enterprise, ensure high quality scientific work, and credit the best research and researchers. The current journal system is roughly sixty years old in the form we know it. It survived the move from print to web, from subscription to open access. It probably won't survive the move from paper-as-unit to claim-as-unit, because that move dissolves what the journal was actually doing. The state withers, in this case, because something better is doing its job.
Physics and arXiv have been one partial success story, but physics journals still confer prestige. Computer science has conference proceedings, which run on a different timescale, but the basic mechanism is similar.
Governance, voting, mechanism design, aggregation of knowledge, membership and identity verification. These are all extremely complex issues that I’m mostly punting on.
In my previous essay, I outlined a concrete proposal for starting this in one narrowly useful domain: randomized controlled trials.
I’m not an expert on mechanism design, but there are a ton of interesting ways to aggregate preferences, e.g. quadratic voting, Bayesian truth serum, that could be worth considering. There are a lot of very smart economists who think about these kinds of questions who I’m sure will have great ideas.
When I consider all the types of theories, I always think about Murray Davis’s “That’s Interesting!”, which offers a very handy classification of theoretical moves. Some of these lend themselves cleanly to graph operations, while others less so because they imply more value judgments:
Single phenomena:
(i) Organization. (a) What seems disorganized is organized. (b) What seems organized is disorganized.
(ii) Composition. (a) What seems heterogeneous is composed of a single element. (b) What seems unitary is composed of heterogeneous elements.
(iii) Abstraction. (a) What seems individual is holistic. (b) What seems holistic is individual.
(iv) Generalization. (a) What seems local is general. (b) What seems general is local.
(v) Stabilization. (a) What seems stable is unstable. (b) What seems unstable is stable.
(vi) Function. (a) What seems to function ineffectively functions effectively. (b) What seems to function effectively functions ineffectively.
Multiple phenomena:
(vii) Evaluation. (a) What seems bad is good. (b) What seems good is bad.
(viii) Co-relation. (a) What seem unrelated are correlated. (b) What seem related are uncorrelated.
(ix) Co-existence. (a) What seem compatible are incompatible. (b) What seem incompatible are compatible.
(x) Co-variation. (a) What seems positive co-variation is negative. (b) What seems negative co-variation is positive.
(xi) Opposition. (a) What seem similar are opposite. (b) What seem opposite are similar.
(xii) Causation. (a) What seems the cause is the effect. (b) What seems the effect is the cause.
Maybe if we end up with fully automated luxury communism, everyone can be a yeoman researcher and grants will be as plentiful as gumdrops. For now though, the competition for jobs and grants remains.
Robyn Dawes was known for his work on the many problems in human interviews as a screening process. Without metrics, we’d have to rely even more on the biases of human judges.
Thanks to Katie Corker for suggesting this.


