Will AI ever be able to easily ETL messy real world data?

Here’s a simple task that you might think is right up the alley of new AI tools.

Generate a chart of California voter registration trends by party since 2000. Include the categories Democrat, Republican, Decline to State and Other Minor Parties.

These data are all public and available on the web.[1] The charting is trivial. It’s the type of task that an entry level analyst could do in an hour or two.

But no! AI fails pretty spectacularly here.

Granted, California’s voter registration data, like many public data sources, are a bit of a mess. See below for how they are organized on the CA Secretary of State’s official webpage.

That specific link then leads to a series of PDFs and Excel files, none of which provides a time series, so the files would need to be aggregated and merged by hand (a sketch of that work follows below).
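To give a sense of the manual work involved, here is a hedged sketch of that aggregation in pandas. The file names, sheet layouts, and party column headers are all assumptions of mine; the real Secretary of State downloads vary year to year and would each need inspection, and the PDF reports would need a separate extraction step entirely.

```python
import pandas as pd

frames = []
for year in range(2000, 2025, 2):
    # Hypothetical file name and layout; the real downloads are a
    # year-by-year mix of .xls, .xlsx, and PDF with shifting headers.
    df = pd.read_excel(f"registration_{year}.xlsx")
    # Harmonize header drift, e.g. "Decline to State" was later renamed
    # "No Party Preference" (exact column names here are assumptions).
    df = df.rename(columns={"No Party Preference": "Decline to State"})
    df["year"] = year
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)

# Roll hypothetical minor-party columns into one chart category.
minor = ["Green", "Libertarian", "Peace and Freedom", "American Independent"]
combined["Other Minor Parties"] = combined[minor].sum(axis=1)

# Sum county-level rows into a statewide time series.
trend = combined.groupby("year")[
    ["Democratic", "Republican", "Decline to State", "Other Minor Parties"]
].sum()
print(trend)
```

Even this toy version papers over the actual friction: every file has to be eyeballed first, and half of them aren’t spreadsheets at all.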

I have every confidence that AI could generate this chart if the data were in the right format. In fact, it’s very good at doing just that with dummy data that it generates itself.

Generate a chart of California voter registration trends by party since 2000. Include the categories Democrat, Republican, Decline to State and Other Minor Parties.

<Wolfram’s version of ChatGPT failed at this point, so I added the following prompt>

Generate an illustrative example of this chart using dummy data. Generate the chart in the style of xkcd using that python package

<The full GPT chat history is available here> https://chatgpt.com/share/23a2f8d1-52aa-4e89-9597-1d116ff54987

Here Gen AI succeeds, albeit after what has become standard prompt engineering: asking follow-up questions in slightly different ways. Regardless, voilà!
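For flavor, here is a minimal sketch of the kind of chart code it produced, using matplotlib’s built-in xkcd mode. The registration shares below are dummy numbers I made up for illustration, not figures from the chat.

```python
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(2000, 2025, 2)
# Dummy trends: Democrats roughly flat, Republicans declining,
# Decline to State rising -- illustrative shapes only.
dem = np.linspace(45, 47, len(years))
rep = np.linspace(35, 24, len(years))
dts = np.linspace(14, 22, len(years))
other = 100 - dem - rep - dts

with plt.xkcd():  # hand-drawn "xkcd" style shipped with matplotlib
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.plot(years, dem, label="Democrat")
    ax.plot(years, rep, label="Republican")
    ax.plot(years, dts, label="Decline to State")
    ax.plot(years, other, label="Other Minor Parties")
    ax.set_xlabel("Year")
    ax.set_ylabel("Share of registered voters (%)")
    ax.set_title("CA voter registration by party (dummy data)")
    ax.legend()
    plt.tight_layout()
    plt.show()
```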

AI is great at processing digital exhaust and generating images within the mirror-world. By contrast, ETL of messy real-world data has been a perennial tarpit. Palantir solves the problem by basically throwing more data engineers at the issue. Perhaps this friction is just part of the nature of symbolic systems, something deeply embedded in the map-territory divide?[2]


[1] The state’s official open data portal has registration data, albeit just for 2016 and 2018. The posted data look a bit random: the column headers include wonky geospatial-specific categories, and the chronology (just 2016 and 2018, not current to today) is a bit odd. Further note that the data were published by the California Department of Public Health as part of an indicators project, not by the Secretary of State, the official body tasked with managing these data. So I’d guess an analyst from that project got plugged into the open data efforts of CA’s Government Operations Agency, the body that runs the state open data portal, and these data were published as a one-off.

[2] Or perhaps AI will supercharge efforts to organize, well, everything, and enable a flourishing in the style of the early encyclopedists. See this long-ish essay on when intelligence becomes too cheap to meter, in particular the section on classifying makerspace errata.
