Funding the Team, Not the Institution: A Two-Stage Pilot for Science Grants
I am a clinical endocrinologist by training and an applied scientist by daily practice. I run a small team that builds AI pipelines for medical research and clinical decision support. Over the past two years, I have helped researchers at multiple institutions develop grant applications, advised a longevity nonprofit on its funding strategy, and watched — closely and repeatedly — what happens when scientifically promising teams collide with the modern grant system. The pattern is consistent enough that I no longer think of it as bad luck.
The bottleneck I want to describe is this: science funding is allocated to institutions on the basis of one-shot proposals reviewed at one moment in time, with no built-in mechanism to validate execution before the bulk of the money is committed, and no continuous measurement of scientific quality once it is. The unit being funded is the institution, even though the work is done by a team. The signal being evaluated is a written document, even though the variable that predicts success is the team's ability to operate. And the evaluation is one-shot, even though good science is a long, iterative process that can quietly drift into bad science years before the final paper is submitted.
I think this is a structural problem rather than a problem of bad reviewers or insufficient money, and I think a small, well-designed pilot could test a different architecture.
The team-shaped hole
When a major grant is awarded, the formal recipient is the institution. The institution takes overhead, provides administrative scaffolding, and assumes nominal accountability. But the actual work is done by a principal investigator and the people they recruit. If that team is operationally weak — if it cannot scope, plan, communicate, course-correct, or absorb a setback — the project will underdeliver regardless of the institution's name on the letterhead. Every working scientist knows this. The funding system mostly does not.
This matters because the proposal-review model evaluates scientific vision (sometimes well) and team capability (almost never directly). A stellar idea written by a team that cannot ship will fail. A modest idea executed by a team that can ship and iterate often becomes excellent. Reviewers can guess at execution capability from the CV, but they cannot test it. And because the first tranche of money is also the last consequential decision point, there is no way to update on observed performance without the political and bureaucratic cost of clawing back a multi-year grant.
The result is a high-variance, low-feedback allocation system. Successful PIs accumulate funding partly because of past success rather than ongoing performance. Promising independent or non-traditionally-positioned scientists — a population that is now substantial — are hard to fund because there is no entry-level rung small enough to risk on them. Excellent operators without an institutional pedigree fall through the cracks; pedigreed operators who have stopped operating well keep being renewed.
A two-stage pilot
I propose testing the following architecture:
Stage 1 — Concept grants. A pool of capital is divided into a large number of small grants, in the range of $10,000 to $30,000 each. The applicant is a team, not an institution. The application is short: a hypothesis, a falsifiable prediction, a six-month plan, and a concrete deliverable — a preliminary dataset, a working prototype, a registered protocol, a replication, a methodological audit. The size of the grant is deliberately small enough that it forces operational discipline. The team has to scope, prioritize, and produce something defensible with limited resources. That constraint is the point. Stage 1 is a behavioral test as much as a scientific one. We observe whether the team can plan, distribute money sensibly, communicate, and ship.
Stage 2 — Scaled grants. Teams that produce credible Stage 1 deliverables become eligible for substantially larger awards — the kind of money that today is allocated in a single proposal cycle. The Stage 2 application is a full research program, but now the panel has something the conventional system never has: observed evidence of how this specific team operates with someone else's money on a real deliverable. Funding is awarded with frequent (quarterly or semi-annual) structured reporting and clear, pre-specified conditions under which it can be paused or redirected.
The Stage 1 pool needs to be large enough to fund hundreds of teams per cycle, because the value of the design comes from the breadth of the funnel. This is the inverse of the current system, where the funnel is narrow at the top and the consequences of misallocation are absorbed in multi-year increments.
Importantly, success in Stage 1 is not contingent on confirming the team's hypothesis. Well-designed experiments are informative under failure, and a team that runs a clean negative result is more fundable than one that produces a glossy non-result. The Stage 1 evaluation rewards scientific rigor and operational capability, not favorable outcomes.
What role for AI
There is a version of this proposal that is just "use AI to review grants faster," and that is not what I am proposing. Faster bad review is still bad review. The role I see for AI is structural, not procedural — it makes a different funding architecture administratively cheap enough to actually try.
Concretely: in the systems I build, the language model is a component, not a decider. The structural spine is deterministic code; the model is invoked for narrowly scoped language tasks — parsing a free-text application into structured fields, scoring a section against a published rubric, drafting a summary from data the script has already prepared — and its outputs are logged, checked, and where consequential, surfaced to a human. The model never holds the control flow. It is an instrument for translating between text and structured representation, not an authority on what should happen next.
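As a minimal sketch of that division of labor, assuming Python and a placeholder `complete()` for whatever model endpoint is in use (the field names mirror the Stage 1 application and are illustrative, not a specification):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("intake")

# Illustrative structured fields, mirroring the Stage 1 application described above.
REQUIRED_FIELDS = {"hypothesis", "falsifiable_prediction", "six_month_plan", "deliverable"}

def complete(prompt: str) -> str:
    """Placeholder for a narrowly scoped model call; the real endpoint is an assumption."""
    raise NotImplementedError("wire up to the model endpoint in use")

def parse_application(free_text: str) -> dict:
    """Deterministic spine: the model is asked for exactly one thing, its output is
    logged, and nothing downstream acts on a record that fails validation."""
    prompt = (
        "Extract the following fields from the grant application below as JSON: "
        + ", ".join(sorted(REQUIRED_FIELDS)) + "\n\n" + free_text
    )
    raw = complete(prompt)                      # the narrowly scoped language task
    log.info("model output logged: %r", raw[:200])
    try:
        fields = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "needs_human_review", "reason": "unparseable model output"}
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        return {"status": "needs_human_review", "reason": f"missing fields: {sorted(missing)}"}
    return {"status": "parsed", "fields": fields}
```

The design choice is that the script, not the model, decides what happens when parsing fails: anything unvalidated is routed to a person rather than acted on.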
Within that constraint, two functions seem realistic on a near-term horizon:
First, matching at scale. A platform that ingests team-level applications and matches them to topically aligned funders, donors, and reviewer panels makes a Stage 1 system with hundreds of micro-grants per cycle logistically feasible. This is plumbing, not judgment.
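A sketch of that plumbing, under the same assumptions: `embed()` stands in for any text-embedding model, and the reviewer pool is a toy dictionary of expertise statements.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a text-embedding model; the real endpoint is an assumption."""
    raise NotImplementedError

def top_matches(application_text: str, reviewers: dict[str, str], k: int = 3) -> list[str]:
    """Rank reviewers by cosine similarity between the application text and each
    reviewer's stated expertise. Routing only; no judgment about quality."""
    a = embed(application_text)
    scores = {}
    for name, expertise in reviewers.items():
        r = embed(expertise)
        scores[name] = float(a @ r / (np.linalg.norm(a) * np.linalg.norm(r)))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```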
Second, continuous quality scoring. The theory-quality pipeline I have in mind is something specific: a structured system that assesses ongoing work against a published rubric, on dimensions like internal consistency of reasoning, falsifiability of predictions, appropriate statistical handling, and engagement with the prior literature actually cited. Such a pipeline is imperfect today, but it is benchmarkable: it can be calibrated against expert-rated cases, its disagreements with experts are inspectable, and it produces a longitudinal signal that the current peer-review-then-silence model does not produce at all. It is not a judge. It is an instrument, closer to a quality metric than to a reviewer. If sustained low scores correlate with weak downstream output, that is a signal worth acting on; if they do not, the instrument is recalibrated or retired.
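To make "instrument, not judge" concrete, here is a minimal sketch of the record such a pipeline might emit each reporting period. The dimension names come from the rubric above, the scoring call itself is omitted, and the structure is illustrative: the pipeline's only job is to produce a trend that people can inspect.

```python
from dataclasses import dataclass, field
from datetime import date

# Published rubric dimensions; the anchor text per dimension would be public.
# The per-dimension scoring (a narrowly scoped model call, as above) is omitted here.
RUBRIC = {
    "internal_consistency": "Are the stated conclusions entailed by the stated premises?",
    "falsifiability": "Do the predictions specify observations that would refute them?",
    "statistical_handling": "Are the analyses appropriate to the design and sample?",
    "literature_engagement": "Does the work engage the prior literature it cites?",
}

@dataclass
class QualityRecord:
    team_id: str
    scores: dict[str, float]                    # one 0-5 score per rubric dimension
    when: date = field(default_factory=date.today)

    @property
    def composite(self) -> float:
        return sum(self.scores.values()) / len(self.scores)

def quality_trend(history: list[QualityRecord]) -> list[float]:
    """The longitudinal signal: one composite score per reporting period, in date order.
    A sustained decline is a flag for human review, not an automatic decision."""
    return [r.composite for r in sorted(history, key=lambda r: r.when)]
```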
Both functions are enabling rather than substituting. They are the reason a structural change that would have been administratively prohibitive a decade ago is plausible now.
The pilot
Concretely, what I am proposing can be tested with the following experiment.
Set aside a defined pool — illustratively, $2–5M — for a single Stage 1 cycle of 100–200 micro-grants. Recipients are teams. Applications are evaluated by a hybrid pipeline: AI-assisted triage and rubric scoring, with a small expert panel making final selections. Six months later, deliverables are evaluated by the same panel plus an independent set of blinded domain reviewers. A subset of high-performing teams, selected by pre-registered criteria, advances to Stage 2 on terms comparable to a conventional grant, while a matched control group of conventionally awarded grants of similar topical scope is tracked alongside for comparison.
Pre-registered outcomes:
- Per-dollar quality of Stage 1 deliverables, scored by blinded experts against a published rubric.
- Stage 2 performance of teams selected via the two-stage path versus matched conventional awards, at 12 and 24 months.
- Demographic, geographic, and institutional composition of Stage 1 awardees versus conventional grant recipients in the same area — does the funnel reach scientists the existing system structurally misses?
- Calibration of the quality-scoring pipeline against blinded expert ratings, including a transparent error analysis of where it agrees and disagrees with humans (a minimal sketch of this analysis follows the list).
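As an illustration of that last outcome, here is a minimal sketch of what the calibration report could look like, assuming paired pipeline and expert scores for the same deliverables; the function name and output fields are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def calibration_report(pipeline_scores: np.ndarray,
                       expert_scores: np.ndarray,
                       ids: list[str],
                       n_worst: int = 5) -> dict:
    """Rank correlation between pipeline and blinded expert ratings, plus the
    cases where they disagree most (the error analysis a panel would read)."""
    rho, p = spearmanr(pipeline_scores, expert_scores)
    gap = np.abs(pipeline_scores - expert_scores)
    worst = np.argsort(gap)[::-1][:n_worst]
    return {
        "spearman_rho": float(rho),
        "p_value": float(p),
        "largest_disagreements": [(ids[i], float(gap[i])) for i in worst],
    }
```

The single correlation number summarizes agreement; the list of largest disagreements is what humans would actually inspect when deciding whether to keep, recalibrate, or retire the instrument.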
The experiment is informative under any outcome. If the two-stage path produces no improvement over conventional review, the simpler model is vindicated and we save further effort. If it produces improvement on even one of the four axes — particularly the third — the case for restructuring at least one rung of science funding becomes empirically grounded rather than rhetorically argued.
Why now
Three things have changed in the last few years that make this pilot tractable in a way it was not before.
First, language models have crossed a threshold where structured rubric scoring of scientific text is reproducible enough to be benchmarked against expert ratings. The output is not yet trustworthy as a sole judge, but it is good enough to serve as triage and as a calibrated longitudinal instrument — and crucially, its errors are now legible enough to audit. Second, online infrastructure can match thousands of small applications to topically appropriate reviewers and funders at marginal cost, which is what makes hundreds of micro-grants per cycle administratively realistic. Third, there is now a real population of working scientists — independent, industry-adjacent, internationally mobile, often holding strong credentials but not embedded in the institutions current funding flows through — who do excellent work and would respond to a system whose first rung is small, fast, and team-shaped.
I am not writing this as a critique of any particular funder. I am writing it as a working scientist who has seen, repeatedly, the gap between what scientists actually need and what the system actually delivers, and who believes that gap is closable with a specific, runnable experiment that is informative under any result.
A pilot of this shape would cost a small fraction of a single conventional R01 cycle. The downside is bounded. The upside, if even a modest version of the hypothesis is correct, is a different and better-calibrated way to allocate the next generation of science funding.
Submitted to the Astera Institute 2026 Essay Competition on Identifying Systemic Bottlenecks to Science.