Data science empowers software engineers and IT developers to extract meaningful insights from the processes and information they encounter during application development. But it also involves communicating the results of data analysis to other stakeholders in a software project – many of whom won’t have the technical level of understanding or expertise of an IT administrator or professional programmer.
So, at all stages of the software lifecycle, it’s necessary for data scientists and developers to be able to work together, understand each other’s points of view, and communicate effectively.
Data Science in Software Development
When developers are incubating their initial ideas for a new piece of software or an IT system, data science can help explore the ramifications and likely outcomes: the effect of incorporating a particular feature, how one piece of functionality plays off against another, or even what data will be produced or can be leveraged.
During programming and testing, the work of data scientists helps in collating results and making sense of the figures of merit. Coupled with appropriate visualization techniques, data science can turn these results and insights from streams of numbers into stories that can be used by financiers, marketers, sales personnel, and other stakeholders in the software ecosystem who come from non-technical or non-computer-related backgrounds. The user base for a particular system or product will likely also include people from a range of cultural and educational backgrounds.
For stakeholders, data scientists can provide tangible evidence of how revenue and business value are being generated, and insight into how and where actions are necessary to sustain good levels of performance or to make improvements.
Clearly, there’s a role for data science throughout the software development lifecycle – so it makes operational and economic sense to have data scientists and developers working together amicably at all stages of the process.
Finding A Common Language or Environment
Harmony and collaboration between software engineers and data scientists may be desirable, but it’s not necessarily that easy to achieve. In part, this derives from actual differences in the way the two disciplines operate, and perceptions that the two groups may have about the way their counterparts think and work. This graphic from CodeMentor sums it up neatly:
(Image source: codementor.io)
You’ll notice that big data frameworks are a tool common to both disciplines. In nearly all industries, data-driven intelligence is now an essential part of day-to-day operations, whether it’s used for supply chain management or personalized marketing. It therefore makes sense to establish and use a shared set of tools and languages for developers and data scientists.
Setting Up A Data Lake
The construction of a data lake is one way of making production data from the development process readily available to data scientists and software engineers alike. This lake is a common pool of information, set up in an environment separate from the production platform. Because it will be a repository for information generated throughout the lifecycle, the data lake must have the potential to store vast quantities of records – so a dedicated data center or cloud environment is best.
The data scientists will decide on the best way for information to be stored and optimized for the queries they expect to run in the near term, and as future ideas develop. Since much of the data will come from the main application that the developers are working on, the teams will need to collaborate on finding the best ways for data to flow into the lake in its raw form.
This design process should take into account factors like the data schema, the level of data compression (if any), and whether information will stream in real time or arrive via scheduled dumps. Each team’s level of responsibility for monitoring data flows should also be established.
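The design factors above can be made concrete in a small sketch. This is only an illustration under stated assumptions, not a recommended lake layout: the schema fields, the `IngestConfig` settings, the `raw/` folder, and the helper names are all hypothetical, and the example uses gzip-compressed JSON Lines files simply because they need nothing beyond the Python standard library.

```python
import gzip
import json
from dataclasses import dataclass
from pathlib import Path

# Hypothetical schema agreed between the two teams: field name -> expected type.
EVENT_SCHEMA = {"user_id": int, "event": str, "timestamp": str}


@dataclass
class IngestConfig:
    """The design decisions named in the text, captured as explicit settings."""
    lake_root: Path
    compress: bool = True    # whether raw files are gzip-compressed
    mode: str = "batch"      # "batch" or "stream"; recorded only, not acted on here
    owner: str = "dev-team"  # team responsible for monitoring this flow (assumed name)


def conforms(record: dict, schema: dict) -> bool:
    """True if the record has exactly the schema's fields with the expected types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], kind) for name, kind in schema.items()
    )


def write_batch(records: list, config: IngestConfig, batch_name: str) -> tuple:
    """Write conforming records as JSON Lines; return (path, accepted, rejected)."""
    valid = [r for r in records if conforms(r, EVENT_SCHEMA)]
    suffix = ".jsonl.gz" if config.compress else ".jsonl"
    path = config.lake_root / "raw" / f"{batch_name}{suffix}"
    path.parent.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if config.compress else open
    with opener(path, "wt", encoding="utf-8") as handle:
        for record in valid:
            handle.write(json.dumps(record) + "\n")
    return path, len(valid), len(records) - len(valid)


def read_batch(path: Path) -> list:
    """Read a batch back, decompressing when the file name ends in .gz."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]
```

Returning the accepted and rejected counts from each dump gives whichever team owns the flow a simple number to monitor. A real lake would more likely use a columnar format and a proper schema registry, but the same decisions still have to be made.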
Making the Right Tools Available
Creating a common environment for developers and data scientists requires tools that enable them to work on the same data sets simultaneously, writing and sharing code or rich text. Notebooks make this possible.
For online operations, open-source tools like PixieDust (a helper library for Jupyter notebooks) enable developers to explore data analysis models without having to learn or code in the statistical languages favored by data scientists. Originally created by data scientists as one-off scripts, Jupyter notebooks also allow for the offline analysis of data sets and algorithms.
Monitoring and Evaluation
Throughout the software development lifecycle, data science algorithms must trace the path from raw data to interpreted information to some kind of value. The work of both the data scientists and the developers has to be observed and assessed at all stages, and this observation and evaluation needs to be built into the development environment from the beginning.
The very process of setting up this scenario creates opportunities for collaboration in and of itself. On the one hand, the software engineers get a chance to build a framework that embeds the work of the data scientists in a pipeline combining various datasets and algorithms. On the other, data scientists play an integral part in its construction by setting parameters and framing the right kinds of questions.
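As a rough sketch of that division of labor, the example below shows a tiny pipeline framework that records how many rows enter and leave each stage and how long the stage takes. It is an illustration only: the `Pipeline` and `StageReport` classes, the stage functions, and the `MAX_LATENCY_MS` threshold are all hypothetical names and values, standing in for whatever framework the engineers build and whatever parameters the data scientists choose.

```python
import time
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable


@dataclass
class StageReport:
    """What the monitoring layer records for one stage of one run."""
    name: str
    rows_in: int
    rows_out: int
    seconds: float


@dataclass
class Pipeline:
    """Built by the engineers: runs stages in order and reports on each one."""
    stages: list = field(default_factory=list)
    reports: list = field(default_factory=list)

    def add(self, name: str, func: Callable[[list], list]) -> "Pipeline":
        self.stages.append((name, func))
        return self

    def run(self, rows: list) -> list:
        self.reports.clear()
        for name, func in self.stages:
            started = time.perf_counter()
            result = func(rows)
            elapsed = time.perf_counter() - started
            self.reports.append(StageReport(name, len(rows), len(result), elapsed))
            rows = result
        return rows


# Parameter set by the data scientists (hypothetical value).
MAX_LATENCY_MS = 500


def drop_outliers(rows: list) -> list:
    """Keep only rows at or below the agreed latency threshold."""
    return [r for r in rows if r["latency_ms"] <= MAX_LATENCY_MS]


def summarize(rows: list) -> list:
    """Reduce the rows to a single summary row (expects at least one row)."""
    return [{"mean_latency_ms": mean(r["latency_ms"] for r in rows), "count": len(rows)}]


pipeline = Pipeline().add("drop_outliers", drop_outliers).add("summarize", summarize)
```

Calling `pipeline.run(rows)` on a list of dictionaries with a `latency_ms` key returns the summary row, and `pipeline.reports` then holds one `StageReport` per stage. Because the reporting lives in the framework and not in the stages, the evaluation is present from the first run onward.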
Using Data Scientists and Developers in Cross-Functional Teams
The final piece of the collaboration puzzle comes with the formation of cross-functional teams consisting of representatives from both camps.
For one thing, having data scientists embedded in a development team (or developers attached to a data science unit) fosters debate and active communication between the members. It also promotes understanding, allowing software engineers and data scientists to better appreciate each other’s needs. In addition, having mixed groups of professionals within the same unit enables those practitioners to step in immediately with their particular skill sets if issues or opportunities arise.
Business units with experience of working with cross-functional teams also stress the importance of allowing a degree of flexibility for the data science members (who may occasionally need to branch out and explore particular topics in isolation for a while), and of creating a forum where various teams can meet to share ideas and knowledge.
At the end of the day, the aim is to enable data science professionals and software developers to use their unique skills to the best advantage of the team – in an environment that promotes creativity, and where trust and respect can build up between the team members as new knowledge and insights are acquired and invested back into the product.