

Applied research questions on the past, present and near future of government operations
by Patrick Atwater

Listening Before We Speak
Written by the Patchwork Protocol in collaboration with Patrick Atwater


Goodhart's Law has a precise formulation, but the pathology it describes is ancient: the moment a measure becomes a target, it ceases to be a good measure. Every auditor learns this, every program evaluator learns this, and then predictably a new round of metrics gets designed, gamed, and discarded. When the French colonial government in Indochina offered a bounty per rat tail, entrepreneurs started farming rats to collect the bounty, and the rat population grew. When NHS targets measured how quickly patients were seen after arriving at the emergency department, ambulance crews were instructed to circle the block until a doctor was available so the clock wouldn't start. When standardized test proficiency rates became the measure of school quality, teachers spent their energy drilling the kids near the cut score on bubble tests, because neither the struggling reader who needed foundational help nor the gifted kid who could be doing calculus moved the proficiency needle.
It helps to remember how recently we acquired the capacity to measure government operations at all. Over a century ago, the Progressive reform movement accomplished something genuinely revolutionary: it routinized the measurement of inputs. The New York Bureau of Municipal Research, founded in 1906, pioneered the radical idea that you could count how many dollars a city spent, how many employees it hired, how many tons of garbage it collected, and then compare those numbers across departments and across years. Before that, municipal government was a black box of patronage and guesswork. The Progressives gave us the budget, the civil service exam, the standardized personnel record. They made the machinery of government legible for the first time.
That achievement was enormous and it endures. But a century later, we remain largely stuck at the level of inputs and activities. We can tell you how much a city spent on street resurfacing, how many lane-miles were completed, how many FTEs were assigned to the project. What we almost never measure, in any routine or comparable way, is what the public actually got for its money. Did the streets get better? Did the water get cleaner? Did the after-school program keep kids safe, or did it just keep them enrolled? The traditional government program evaluation cycle is almost designed to avoid these questions. Success gets defined at the outset, when you know the least about the problem, and measured at the end, when it's too late to adjust. The report lands in an inbox, generates a footnote, and the next program cycle begins with the same assumptions intact.
The Progressives routinized the measurement of inputs. The next frontier is routinizing the measurement of outputs and outcomes. And the tools to do it, from low-cost sensors to AI-driven synthesis to novel funding mechanisms developed in unlikely corners of the internet, are finally becoming available. What if we borrowed from a model that already does this, imperfectly but continuously, in another domain entirely?

Moody's doesn't publish a one-time assessment of a municipality's fiscal health and call it done. It maintains a living judgment, updated as conditions change, sensitive to leading indicators, explicit about what it's watching. Downgrades happen mid-cycle. Outlooks shift from stable to negative before anything catastrophic occurs. The whole apparatus is designed to surface trouble while there's still time to act, and the market pays for that signal because the market moves on it.
Imagine a parallel apparatus for public programs: not an annual performance report but an ongoing impact rating that synthesizes administrative data, community feedback, and independent analysis into a regularly updated assessment. "Watch status." "Outlook: deteriorating." "Core outcomes stable; equity indicators warrant review." The language of credit markets applied to the public interest, with AI doing the continuous synthesis that makes frequent updates tractable rather than prohibitively expensive.
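To make the shape of such a rating concrete, here is a minimal sketch as a simple Python record; the field names, status vocabulary, and example program are illustrative assumptions, not a proposed standard.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ImpactRating:
    """A 'living' program rating in miniature. Borrowing credit-rating
    language: an overall outlook plus the specific indicators being watched,
    refreshed on a rolling basis rather than once per grant cycle."""
    program: str
    as_of: date
    outlook: str  # e.g. "stable", "watch", "deteriorating"
    indicators: dict[str, str] = field(default_factory=dict)  # indicator -> status

# Hypothetical example record, echoing the language above
rating = ImpactRating(
    program="After-school enrichment, District 7",
    as_of=date(2026, 4, 1),
    outlook="watch",
    indicators={"core outcomes": "stable", "equity indicators": "warrant review"},
)
```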
Consider how this would work for something as ordinary as a new after-school program in a mid-size school district. Under the current model, the district applies for a grant, defines success as "200 students enrolled" and "10% improvement in reading scores," runs the program for three years, and submits a final report. Maybe the program is working brilliantly for third graders and failing completely for sixth graders. Maybe attendance craters after the first semester because the bus schedule changed. Maybe the most important thing happening is that a cluster of kids who were otherwise unsupervised between 3 and 6 PM are no longer showing up in juvenile incident reports. None of this nuance survives the binary pass/fail of the final evaluation. A living impact rating would catch the bus schedule problem in month four. It would flag the divergence between age cohorts. It would notice the juvenile incident correlation before anyone thought to look for it, because a well-designed rating system watches for what's actually happening, not just what the original grant application promised would happen.
Or take universal transitional kindergarten, which California is currently rolling out statewide. The policy intention is clear and the aspiration is admirable: give every four-year-old access to a high-quality early learning environment regardless of family income. But implementation varies enormously across the state's thousand-plus school districts. Some districts are converting existing preschool classrooms. Others are hiring new staff, building new facilities, and inventing curriculum on the fly. The question that matters for children and families isn't whether TK exists on paper but whether it functions in practice, and that question has a dozen sub-dimensions: teacher credentialing, classroom ratios, developmental screening, facility quality, the degree to which TK connects to the K-12 continuum rather than sitting in administrative limbo. A rating system would let parents, school board members, and state officials see which implementations are on track and which ones need intervention, before a generation of four-year-olds has passed through a system that looked good in the press release but fell apart in execution.
Infrastructure provides an even starker case, partly because the data problem is so tractable and the current approach so crude. Most cities evaluate their street conditions using the Pavement Condition Index, a methodology dating to the 1970s in which a trained inspector drives around and assigns a subjective score to each road segment based on visible distress. The survey is expensive, infrequent, and immediately out of date. A few years ago, working with my collaborator Varun Adibhatla through our nonprofit ARGO Labs, we built SQUID, a low-cost device that integrates street imagery with ride quality data from phone-mounted accelerometers to produce continuous, objective street condition assessments. We piloted it in Syracuse and New York City, collecting hundreds of miles of data from a single vehicle in days, work that would have taken traditional inspectors months. The underlying logic is simple: if you can measure the actual condition of every street in a city continuously and cheaply, you can answer the question that matters. How much road quality did the public actually get for the money it spent on resurfacing?
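As a rough illustration of the accelerometer half of that idea, a per-segment ride-quality proxy can be as simple as the root-mean-square of vertical acceleration corrected for travel speed. This is a simplified stand-in, not the actual SQUID methodology.

```python
import math

def ride_quality_proxy(accel_z_samples, speed_mps):
    """Toy ride-quality score for one street segment: RMS of vertical
    acceleration (m/s^2) divided by travel speed, so a slow crawl over bad
    pavement isn't rewarded with a smooth-looking score. Lower is smoother.
    Illustrative only."""
    rms = math.sqrt(sum(a * a for a in accel_z_samples) / len(accel_z_samples))
    return rms / max(speed_mps, 1.0)
```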
That question almost never gets asked. A street resurfacing program reports lane-miles completed per year and total dollars expended. But lane-miles completed tells you nothing about whether those particular miles were the ones most in need of repair, or whether the resurfacing actually improved ride quality and extended the useful life of the pavement, or whether the per-mile cost was reasonable compared to peer cities. A living impact rating for a resurfacing program would track the ratio of dollars spent to measurable improvement in road quality across the entire network, updated continuously as sensor data flows in, and would flag when the program is spending efficiently and when it's not.
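Once condition data exist on both sides of the work, the value-for-money question reduces to a short calculation. A sketch, assuming PCI-style condition scores where higher is better; the function name and framing are illustrative, not an established industry metric.

```python
def improvement_per_million(before, after, dollars_spent):
    """Network-wide condition gain bought per million dollars of resurfacing.
    `before` and `after` map segment IDs to condition scores (higher = better)."""
    gain = sum(after[s] - before[s] for s in before if s in after)
    return gain / (dollars_spent / 1_000_000) if dollars_spent else 0.0
```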
Water main replacement programs face the same accountability gap. The standard metric is miles of pipe replaced per year. But what the public is actually buying is water reliability and water quality, not pipe installation. A main replacement program that hits its footage targets while service interruptions and boil-water advisories persist in other parts of the system has succeeded on paper while failing in practice. A rating system grounded in what the infrastructure actually delivers, measured at the tap rather than in the procurement spreadsheet, would surface that gap before the annual report ever lands.
Permitting is where the model gets genuinely interesting and genuinely hard. California has set ambitious policy objectives around housing production, clean energy deployment, and climate adaptation, but the permitting apparatus that mediates between aspiration and construction often operates with no systematic feedback on its own performance. A permitting impact rating would need to capture multiple dimensions simultaneously: timeliness (how long from application to decision), predictability (how much does timeline variance cost applicants), environmental fidelity (are the substantive protections actually being enforced or just generating paperwork), and equity (do well-resourced repeat applicants navigate the system faster than community-based organizations attempting their first affordable housing project). These dimensions often work against each other. A system that prioritizes speed may shortchange environmental review. A system that prioritizes thoroughness may impose costs that only large developers can absorb. The honest rating holds that tension visible rather than collapsing it into a single score.
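One way to keep that tension visible is to publish the rating as separate dimensions with no composite score at all. A sketch of what such a record might contain; the dimension names follow the paragraph above and are illustrative, not a proposed standard.

```python
# Deliberately no "overall_score" field: collapsing these dimensions into
# one number would hide exactly the tradeoffs the rating exists to surface.
permitting_rating_dimensions = {
    "timeliness": "median days from application to decision",
    "predictability": "spread between typical and worst-case timelines",
    "environmental_fidelity": "share of permit conditions verified in the field",
    "equity": "timeline gap between repeat applicants and first-time community applicants",
}
```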
The financing logic matters as much as the analytical one. Credit ratings work partly because issuers have skin in the game. Bond markets move on Moody's judgments, which means someone is always paying attention. The question for impact ratings is: who pays for the signal, and who moves on it?
The traditional answer in government is: a funder commissions an evaluation, and the evaluation either renews the grant or doesn't. This creates a principal-agent problem so familiar it barely registers anymore. The evaluator works for the funder. The program works for the evaluator's metrics. The community works around both.
Social impact bonds, pioneered in the UK and expanded during the Obama administration, tried to solve this by introducing private investors who would fund a social program upfront and get repaid by government only if the program hit pre-defined outcome targets. The structure was clever. If a program to reduce recidivism actually kept people out of prison, the government would save money on incarceration and could share those savings with the investors who took the risk. Denver used this model for supportive housing. Massachusetts applied it to a program aimed at keeping formerly incarcerated young people from reoffending. The mechanism focused minds wonderfully on outcomes rather than outputs.
But social impact bonds also revealed the limits of the approach. The transaction costs were enormous. Each deal required years of negotiation, bespoke legal structures, and an independent evaluator whose methodology itself became a source of contention. And the Goodhart problem didn't disappear; it just migrated to whatever the bond's outcome metrics happened to be. If the SIB measured recidivism at two years, nobody was structurally incentivized to care about what happened at year three.
The Ethereum ecosystem, of all places, has been running a different set of experiments that speak directly to this problem. The mechanisms are worth understanding on their own terms, because the design principles underneath them are more portable than the crypto context might suggest.
The first is quadratic funding, developed in a 2018 paper by Ethereum co-founder Vitalik Buterin along with economists Zoë Hitzig and Glen Weyl. The core insight is mathematical but the intuition is democratic. Imagine a matching fund, the kind that foundations and governments use all the time: the Gates Foundation puts up $10 million and matches community donations to global health projects. Traditional matching works dollar for dollar, which means a project backed by one wealthy donor giving $10,000 gets the same match as a project backed by a hundred community members each giving $100. Both raised $10,000. Both get $10,000 in matching. But the second project clearly has broader community support.
Quadratic funding fixes this by making the match proportional not to the dollars contributed but to the square of the sum of the square roots of each contribution. In practice that means the project with a hundred small donors receives a dramatically larger match than the project with one large donor, even if both raised the same amount. The number of people who care enough to contribute matters more than the total dollar amount. It's a funding mechanism that mathematically weights breadth of community support over depth of any single patron's wallet.
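A minimal sketch of the arithmetic, using the example from the paragraph above. This is a standard simplification of the Buterin-Hitzig-Weyl formula, with the matching pool capping the total; the project names and pool size are illustrative.

```python
import math

def quadratic_match(contributions, matching_pool):
    """For each project, the 'ideal' total is the square of the sum of the
    square roots of its contributions; the match is that ideal minus what was
    actually raised, scaled down if the matching pool can't cover everything."""
    ideal = {p: sum(math.sqrt(c) for c in cs) ** 2 for p, cs in contributions.items()}
    raw = {p: ideal[p] - sum(cs) for p, cs in contributions.items()}
    total_raw = sum(raw.values())
    scale = min(1.0, matching_pool / total_raw) if total_raw > 0 else 0.0
    return {p: round(r * scale, 2) for p, r in raw.items()}

# One patron giving $10,000 vs. a hundred donors giving $100 each:
print(quadratic_match(
    {"single_patron": [10_000], "broad_support": [100] * 100},
    matching_pool=50_000,
))
# single_patron's raw match is zero (sqrt(10,000)^2 equals what it raised);
# broad_support's is (100 * sqrt(100))^2 - 10,000 = 990,000 before scaling,
# so essentially the entire matching pool flows to the broadly supported project.
```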
Gitcoin, a platform built on Ethereum, has run this mechanism at scale since 2019, distributing over $67 million to more than 5,000 projects. The results are consistently interesting: community infrastructure projects, educational resources, and open-source tools that a traditional grants committee might overlook tend to surface when the community itself is doing the weighted signaling.
The second mechanism is retroactive public goods funding, which Buterin and the team at Optimism (an Ethereum scaling network) have been developing since 2021. The premise is disarmingly simple: it's easier to agree on what was useful than to predict what will be useful. Instead of writing a grant proposal describing impact you hope to create, you do the work first, and a "Results Oracle," essentially a governance body with resources, evaluates what actually happened and rewards the projects that delivered genuine public value. Optimism committed its network revenues to this experiment and has run multiple rounds, allocating millions to projects chosen by panels of community evaluators.
The design borrows the startup ecosystem's core incentive, the "exit," and applies it to public goods. A for-profit startup can attract investment because investors expect a return when the company succeeds. A nonprofit or open-source project has no equivalent mechanism, no way to offer early backers a share of future success, because there is no IPO, no acquisition, no equity event. Retroactive funding creates one. And because the reward comes after the work, it creates an ecosystem where investors can fund early-stage public goods projects on the speculation that if those projects succeed, the retroactive reward will cover the bet. The profit motive, redirected toward the public interest.
What makes these mechanisms relevant beyond cryptocurrency is that they address the same structural problems that plague government program evaluation. Quadratic funding solves the "who decides what's valuable" problem by aggregating community preference in a way that resists capture by large donors or political insiders. Retroactive funding solves the "predicting the future is hard" problem by letting reality do the sorting. Both create incentives for independent analysts and evaluators to develop genuine expertise rather than telling funders what they want to hear, because the market rewards accurate assessment over time.
Now imagine combining these mechanisms with the impact rating concept.
A city launches a water main replacement program. Instead of a single evaluation at year five, a consortium of independent analysts publishes quarterly impact ratings across multiple dimensions: infrastructure condition, service reliability, water quality at the tap, cost efficiency per unit of measurable improvement. These ratings are public. They're legible. They use the standardized language of credit assessment that bond markets and city councils and editorial boards already understand.
A retroactive funding pool, capitalized by some combination of municipal revenue, philanthropic investment, and potentially state funds, distributes annual rewards to the analysts whose ratings proved most accurate, the early warning signals that actually led to course corrections. This isn't paying for favorable reviews. It's paying for useful ones. The distinction matters enormously. An analyst who flags that the main replacement program is hitting its footage targets but not improving service reliability, and turns out to be right, gets rewarded. An analyst who publishes only favorable assessments accumulates a track record that the market learns to discount.
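A hypothetical sketch of how such a pool might score analysts; the accuracy measure and the data shape here are assumptions for illustration, not Optimism's mechanism or any existing program's.

```python
def retroactive_analyst_rewards(track_records, reward_pool):
    """Split a retroactive pool in proportion to predictive accuracy.
    `track_records` maps analyst name -> list of (predicted, observed) pairs
    on indicators rescaled to 0-1; accuracy is one minus mean absolute error.
    Analysts who only publish favorable calls that don't pan out score poorly."""
    accuracy = {
        name: max(0.0, 1.0 - sum(abs(p - o) for p, o in calls) / len(calls))
        for name, calls in track_records.items() if calls
    }
    total = sum(accuracy.values())
    return {name: reward_pool * a / total for name, a in accuracy.items()} if total else {}
```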
Meanwhile, the community provides real-time signal through a quadratic funding mechanism that lets residents allocate small amounts to the programs and the program dimensions they believe matter most. If a thousand residents in a neighborhood with chronic service interruptions each contribute five dollars to signal that water reliability should be weighted more heavily in the impact rating, that signal carries mathematical authority that no public comment period currently provides. The mechanism doesn't replace expert judgment. It informs it. It tells the analysts what the community is watching most closely, and it does so in a way that's resistant to the usual dynamics of public engagement, where the loudest voice in the room or the most organized interest group dominates.
For the after-school program, this hybrid approach means parents are signaling in real time which dimensions of the program they value, and independent analysts are tracking whether those dimensions are actually being delivered. For TK implementation, it means the state has a dashboard of living assessments across every district, updated quarterly, with community engagement scores that surface which districts have genuine parent buy-in and which are operating a compliance exercise. For permitting, it means the tension between speed and environmental fidelity becomes visible and trackable, not hidden inside a bureaucratic process that only insiders understand.
The harder problem is what credit raters call qualitative factors: a municipality's willingness to raise revenue, management quality, political stability, the things that don't parse easily into a spreadsheet but that any experienced analyst weights heavily. A program evaluation equivalent would need to capture the warm handoff between caseworker and family, the trust built over eighteen months, the organizational culture that either absorbs learning or defends against it. Goodhart's Law doesn't disappear because the assessment is more frequent; it migrates to whatever signals the raters happen to be watching most closely.
The honest answer to that migration isn't better methodology but institutional structure, and institutional structure requires someone with authority to actually be watching. The credit rating model works partly because the SEC regulates Nationally Recognized Statistical Rating Organizations, because issuers pay for ratings they can't control, and because the market imposes consequences for persistent inaccuracy. An impact rating ecosystem would need its own version of each: regulatory standards for assessment quality, a financing model that keeps the raters independent, and consequences that give the ratings teeth.
None of this is simple. The credit rating industry's own failures, particularly the mortgage-backed securities debacle of 2008, demonstrate vividly what happens when conflicts of interest corrupt the rating process. Any impact rating system would need to learn from those failures rather than replicate them.
But the alternative is what we have now: evaluation reports that arrive too late to be useful, program metrics that measure what's easy rather than what matters, and a structural inability to distinguish between programs that are genuinely transforming outcomes and programs that are generating impressive-looking spreadsheets while the real problems migrate elsewhere.
A decade ago, in the early days of ARGO Labs, Varun and I wrote a manifesto about what we called the next frontier of government operations. We were young and the vision was sweeping in the way that manifestos tend to be: a world where the digital revolution transformed not just how citizens interacted with government but how government actually delivered basic public services. We talked about the evidence gap, NYU's GovLab estimate that only one dollar out of every hundred in government spending was backed by evidence that the money was spent wisely. We built SQUID because we believed that if you could make the measurement of street quality cheap and continuous, you could close part of that gap for one of the most visible and universal public services.
The manifesto was naive in all the ways that early-career documents are naive. But the core intuition holds. The Progressive reformers a century ago routinized the measurement of inputs, and that achievement gave us modern public administration. The task now is to do the same thing for outputs and outcomes, to make the routine question not just "how much did we spend" but "what did the public actually get." The tools have finally caught up to the ambition. Sensors can measure road quality and water quality continuously and cheaply. AI can synthesize administrative data, community feedback, and independent analysis into assessments that would have required armies of analysts a decade ago. And the Ethereum ecosystem, of all places, has been quietly building the mechanism design toolkit that an impact rating system needs to stay honest and stay funded: quadratic funding, retroactive public goods funding, results oracles.
The city of Split, Croatia, is already piloting municipal quadratic funding for green space projects. The principles aren't trapped in crypto. They're portable. And public programs, which affect real people's lives in ways that software protocols generally do not, need them more urgently than anyone.
Success is multidimensional. That's not a dodge; it's the actual structure of human flourishing. The interesting design question is whether you can build an assessment system that holds that complexity honestly, raises an alarm before the ambulances start circling, and is legible enough that someone with the power to act actually does.
Image: a rendering of the voyage of the ARGO. That ancient myth of a band of Greek heroes embarking on an impossible quest resonated with our aspiration to update government operations at scale.