Chasing unicorns: Three questions about data scientists
The data scientist is a somewhat legendary figure that navigates the random forests of (big) data in a quest for insight and, eventually, impact. But as often happens with legends, there isn’t much clarity about her actual profile, and questions abound. The only thing that is clear is that she’s hard to find… like unicorns.
The McKinsey Global Institute predicts that this situation is only going to get worse: according to them, the US will have a shortfall of 190,000 'deep data experts' by 2018. In its UK data capability strategy, the UK government has identified skills as a critical resource that will drive (or, if lacking, hinder) the UK's ability to benefit from data. But if our educators and managers are going to nurture and organise data talent more effectively, they will need hard facts instead of the stuff of legends.
Our new programme of research, Skills for the Data Driven Economy, carried out with our colleague Hasan Bakhshi, is taking us into the habitat of data talent (i.e. innovative businesses). Building on our earlier 'Rise of the Datavores' report, it aims to fill important gaps in what is known about data talent.
Here are some questions that have emerged from an initial round of interviews with practitioners and managers of this (oft arcane-sounding) lore:
1. What’s new about data science?
Statisticians, operations researchers, actuaries and data miners have long worked in industry, using sophisticated methods to discover patterns in data. Gosset (a.k.a. ‘Student’) derived the t-distribution while working for the Guinness Brewery in the early 20th century. Operations Research got its start in World War II, optimising shipping convoys and bomb targeting (the big data of that very modern war). So what’s really new about data science?
Our interviews suggest that data science has some genuinely new aspects connected to:
The increase in the volumes and varieties of data that are becoming available for analysis. More about our lives is recorded digitally, and hence is available for analysis, than ever before – from transactional data, to social network data, to text, sound and image data.
The new opportunities to apply the results of analysis, going from description to prediction, and from reporting to building products and services completely based on data (for example, search and recommendation engines, or matchmaking platforms).
These new opportunities for analysis and action raise an immediate question:
2. What are the skills of the effective data scientist?
There is a long list of skills that, it is argued, are needed in data science:
Business understanding: knowing which questions to ask
Software engineering: coding, algorithm design, and database management
Data analysis: both traditional statistics and machine learning
Communication: visualisation and storytelling
Some people doubt all these skills can be found in the same person. Is the data scientist a single person – again, a ‘unicorn’ – or are these skills spread across teams of discipline specialists – erm, a hydra?
Our interviews reveal a spectrum, with variation partly linked to the levels of data work in a company: Large companies tend to support a deeper division of data labour, with more specialisation within their teams. We often see a combination of PhDs who do analysis and people from management consultancy backgrounds who feed them questions and communicate the answers. At the other extreme, start-ups seem to prefer generalists, for their flexibility.
3. How can data scientists be organised and managed?
The emergence of new occupations and competences brings with it innovations in business processes and project management – think of industrial scientists in the early R&D lab, software developers using agile methods, or designers organising the innovation process with design thinking principles.
We'd expect the same thing to happen with data science. In our research we want to identify good practices that can help companies create more value with their data talent. Some critical issues that have come up in the interviews include:
Finding the right place for data talent inside the business: specialist data teams benefit from a critical mass of (diverse) skills, and are able to address a wider range of business problems. At the same time, there is a risk they may get detached from the rest of the organisation, accused of data imperialism, or overwhelmed by a barrage of requests from across the business. Should data talent even be located in-house, or can it be effectively outsourced?
Applying domain expertise effectively: Domain expertise is knowledge about the processes that generate the data being analysed (e.g. knowing the operation of a hospital in order to analyse patient outcome data). There is a debate about how much of this expertise a data scientist needs to do her job well. Domain experts can help identify biases in data, and reduce the risk of confusing correlation with causation. But some argue that too much domain expertise can be harmful if it keeps a data scientist from challenging the status quo (e.g. as Amazon did when it replaced its team of book reviewers with an aesthetically blind recommendation engine). Our interviews suggest that domain expertise is an important component of the data team, if not always part of the profile of the individual data scientist.
Striking the right balance between exploration and exploitation: Some of our interviewees describe data scientists as people who are driven by curiosity and new and exciting problems (and data-sets). The challenge is how to create environments with some of the ‘academic freedom’ that keeps data scientists motivated while steering them towards problems that are relevant for the business, and managing the risk of failure. Mixing projects up, moving people around and setting project goals early can help.
Unicorns for courses
As we see, there are few absolute answers to our questions. Whether the data scientist adopts one shape or another depends to a great extent on the environment she operates in and the type of challenges she faces. Going forward with our research, we will try to identify with greater precision which skills and practices are most suitable for which contexts, and also whether any important skills of the productive data scientist are in short supply. This way, we hope to create robust knowledge about how best to transform her data powers into impact: taking a little of the mythology out of data science.