Too much of a data scientist’s time is spent on data cleansing and not enough on the math
As a former computational chemist, I rode into the world of data science on a wave of eigenvectors and correlation functions.
I remember my first few interviews, back before I knew what a KPI or EBITDA was, and thinking I was being asked some very odd questions. It felt like I was being given an intro to data science exam: What is overfitting? Can you write this SQL query? If a train is leaving NY Penn Station at 2pm at 50 miles an hour… You get the picture.
Nobody handed me a data set and asked me to poke at it and see what I could find. Though, from the job descriptions, everyone was totally into data-driven decisions. Super into it. It was cooler than Sharknado and twerking (it was 2013). So, the fact that they weren’t trying to hire people who could actualize data-driven decisions was perplexing.
The first thing I got at my new job was a project: some data, presented to me as useful for feature generation, and a problem, namely how we could use this data to optimize some KPI (that’s key performance indicator, for my not-yet-in-business future data science colleagues).
I was just a junior data scientist (albeit my then-company’s first), but as I started to look at the data, I realized the task wasn’t so simple. The consultant we hired had built this large machine learning model (published right from the oracles at Google), but when I compared it to the simplest methods, it couldn’t outperform any of them.
The reason? The data we were using was modelled data — not raw data. And modelling on top of modelled data is like making new hot dogs by cobbling together bits of other hot dogs. (FRANKenstein’s hotdog?)
The answer was clear — GIVE ME RAW DATA.
Luckily for me, I was at a small company and the business development (BD) lead and I happened to play on our company’s soccer team together. (Worst team in the league, baby!) In between getting hit in the face by the opposing team’s throw-ins and nursing our wounds, he was able to go and secure some high-quality raw data from a data provider (which you can now buy on the Narrative platform, btw). My models went from about as good as coin-flipping to hitting it out of the park. (Sadly, my soccer skills did not improve appropriately.)
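For the curious, the sanity check from this story (comparing a fancy model against the simplest possible baseline) can be sketched in a few lines of Python. This is only an illustration under made-up assumptions, not the code from that project: the labels and predictions are toy data, and `accuracy` and `majority_baseline` are hypothetical helper names.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most common training label for each of n rows."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


# Toy, made-up data purely for illustration.
y_train = [0, 0, 1, 0, 1, 0, 0, 1]
y_test = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
# Stand-in for the output of some complex model.
model_preds = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]

baseline_acc = accuracy(y_test, majority_baseline(y_train, len(y_test)))
model_acc = accuracy(y_test, model_preds)

print(f"majority baseline: {baseline_acc:.2f}")
print(f"model:             {model_acc:.2f}")
if model_acc <= baseline_acc:
    print("Model does not beat the baseline; look at the inputs before tuning.")
```

If the complicated model can’t clear a bar this low, the problem is more likely the data going in than the model itself.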
Some of what happened to me was serendipity: I’m inquisitive, and my company let me poke at things until I felt like I understood the problem. Plus, I happen to like playing soccer. But how can you set up your data science team so you don’t have to rely on a Ph.D.’s outgoingness and delusions of athleticism?
It’s not enough to know the difference between an SVM and an SVD. Check a prospective data scientist’s intuition and data literacy by giving them more open-ended problems with real data and adequate time to explore it.
Walk through their thought process with them. Are they curious about the data? Can they see possibilities in the records? Don’t limit data science interviews to trivia questions from an online course or brain teasers.
Remember the science part of data science. That means having both the ability and the time to poke at things. If you don’t let your data scientists play a bit, then you don’t have a data scientist. You have an improperly labelled data engineer or data analyst. (Some of my best friends are data engineers and data analysts!)
Offer them open-ended business goals, such as improving conversion rate, and let them work out what can be done to get there. And don’t make it this amorphous self-regulated 20% time, because we all know product and business priorities creep in. Why not set aside the first week of the month?
The most meaningful contributions I’ve made at my job have all come from such weeks, when I was taken off standard product sprint work and tasked with solving a problem.
So often BD will drop some new data onto an overworked data scientist’s lap and ask whether it’s useful, then request an answer by the following week.
The answers are “maybe” and “almost certainly not.” Why? Because the data scientist is busy doing very important sprint work! They don’t have time to look at this unrelated data set at that very moment.
The process is backwards. The question should be initiated by data scientists when they need data they can use to build such-and-such a model. Then the request should be handed off to BD to locate the data, not the other way around.
In my experience, by the time I’d figured out that the data sets BD had thrown at me were useful, we’d had them sitting in our data stores for half a year. The contracts were signed way too early; a total waste of money.
When data scientists drive the request, there’s no need for them to ask their product owner to talk to the head of product to chat with the head of BD until their request for raw contextual data yields someone signing a contract for some data management platform’s modelled-out demographic data, a full quarter later. (“This is what you wanted, right?”)
As an extra bonus, this frees up BD to actually develop the business instead of hungry-hungry-hippoing all the data they can get their hands on and hoping someone downstream will actually use it.
Companies need to change how they build their data science teams if they want to get more out of their data and get it faster. Find the right data scientists, then give them the autonomy to do work that will actually meet goals. And stop relying on the same old processes that don’t work and start exploring new ways to find data.