About fifteen years ago, data science became a hot term. The title was originally created because Facebook wanted a talented scientist to work for them, but the prospect didn’t want to have a lowly “analyst” title. Thus the term data scientist was born. A few O’Reilly articles and copycat companies later, the title had quickly grown in popularity.
The title rode a big wave. Tech companies were creating more and more data from all the actions people took online, creating a big need to make sense of all that information. The new role came with a bunch of definitions of what data science entailed because, if you think about it for a moment, all science involves data. So what exactly is data science?
Statistics sounds too stodgy. There’s definitely truth to the idea that programming enables new, creative strategies for acquiring data. Today you can scrape data online, deploy novel sensors, parse illegible PDFs, integrate disparate data sources and generally do all the yak-shaving work to “get the damn data” in a way that is different from how statistics has traditionally been practiced.
But the data science Venn diagram always annoyed me. What on earth is “domain”? That can range from behavior on the web to the sounds of cetaceans to California water conservation to tracking cracks in city streets. More critically, why is domain a third circle analogous to coding and stats? A more accurate way to frame the diagram would be to see the coding and stats circles as constituting a lens through which you see the world.
That shift in perspective may seem minor but is actually quite profound. There’s an adage that 80% of the time on a data project goes to getting and parsing the data into a usable form. There’s also an entire lifecycle to a data project after making a visualization or fancy machine learning model. That lifecycle is incredibly domain specific. Recommending products on Amazon and finding opportunities to save water can involve similar tools and math, but the work before and after the model looks very different in each.
Domains are not simply a set of knowledge which overlaps. Domains are distinct regions of the world in which we live which deserve their own unique nuance, similar to the idea of umwelt articulated in this xkcd comic. There is a saying in water analytics that you need to take a meter reader — the person who in 98.9% (number provided for rhetorical effect) of situations actually generates the data — out to lunch in order to understand what the fancy tools are telling you.
Stats and coding are just tools, lenses through which we might better see the world. Highly refined spectacles, without an appreciation of the context in which those lenses operate, can just cause one to confuse a mud puddle for Ursa Major.
In water, if you did not know how a meter is read, you might not know what to do with a negative meter read (meaning there was a previous error). You might also not know that the meter just spins continuously, so usage is calculated by subtracting last month’s running total from this month’s. That type of context should subtly guide an analysis and inform where you point the super sharp spectacles powered by whiz-bang stats and programming tools.
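To make the meter example concrete, here is a minimal sketch in Python of turning running totals into monthly usage and flagging a negative read for a human to look at. The readings are made up, and the names `Usage` and `usage_from_reads` are hypothetical. It also assumes the register never rolls over, which a real meter reader could tell you is or isn’t true.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Usage:
    period: str
    consumption: Optional[int]  # None when the read needs human review
    flag: str


def usage_from_reads(reads: List[Tuple[str, int]]) -> List[Usage]:
    """Convert running (cumulative) meter totals into per-period usage.

    Assumption: reads are in chronological order and the register
    never rolls over. Both are illustrative simplifications.
    """
    results = []
    # Pair each read with the one before it: usage = current - previous.
    for (_, previous), (period, current) in zip(reads, reads[1:]):
        delta = current - previous
        if delta < 0:
            # A negative difference usually points at an earlier misread,
            # so flag it rather than treating it as real usage.
            results.append(Usage(period, None, "negative read: check earlier reads"))
        else:
            results.append(Usage(period, delta, "ok"))
    return results


# Made-up running totals for one meter.
reads = [("2024-01", 1200), ("2024-02", 1215), ("2024-03", 1209), ("2024-04", 1231)]
usage = usage_from_reads(reads)
for row in usage:
    print(row.period, row.consumption, row.flag)
```

The code only knows that March looks wrong. Whether the February read was too high or the March read was too low is something the data alone cannot say, and that is where lunch with the meter reader comes in.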
Taking domain knowledge as a given can lead to “weapons of math destruction” or just mathmagic used to bludgeon opponents in service of an agenda. Consider the following poorly formatted chart showing the projected traffic demand in various states and regions in the US.
The meta-review concluded that states and metropolitan planning organizations “generally have not updated their models and assumptions to account for current conditions, as if they expect the year to be 1980 forever.” One could say that again!
Transpo Agencies Are Terrible at Predicting Traffic Levels
In order to avoid this trap, I try to use data discovery, integration and analysis as tools that are part of a larger process of inquiry. Tools are just tools, and it is critical to remember that any data, no matter how thoughtfully collected, is just a map of the world, not the actual territory it’s intended to represent. An understanding of the liberal arts provides an invaluable resource in navigating that fundamental and intractable divide.
Ergo data sophistry.*
*The word sophist has sadly been tarnished by Plato’s two-millennia-long slander campaign. I’d recommend Zen and the Art of Motorcycle Maintenance or Diogenes Laertius’ Lives of Eminent Philosophers as works to reframe that bias.