ROI optimization

in data science workflows

Clément Mercier

03 November 2022 | 7 minutes

What are the important factors to consider when optimizing ROI in data science?

Data science is a field that is growing exponentially, with more and more companies starting to hire data scientists.

However, not all of these companies are getting the most out of their data science teams. This is because they don’t have a clear idea of what they want from their data scientists and how to get it.

This article will explore why ROI optimization in data science workflows matters for businesses and how it can be done to get the best results from your team.

INTRO

Data science teams are often faced with the challenge of optimizing their workflow processes. This is where ROI optimization comes in: it allows them to standardize processes in order to get the most out of their data science projects, improving both company efficiency and business development.

In this sense, the ROI optimization process is a combination of software and human workflows that can help data scientists manage their work more efficiently.

But let’s start from the beginning. In this article, we will focus on profitability and key performance indicators (KPIs) as a way to standardize processes. Let’s analyze ROI and data science workflows separately in order to have a better perspective on how to transform data science projects.

CONCEPTS AND MAIN DATA SCIENCE WORKFLOWS

Return on investment (ROI) is a financial measure of profitability; it is usually expressed as a percentage and used as a key performance indicator (KPI). The formula is ROI = (final value of investment - initial cost of investment) / initial cost of investment × 100.

By looking at this formula, we can quickly observe that one of the simplest ways to optimize ROI is by lowering the costs by optimizing processes and resources.
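As a minimal sketch of that calculation in Python (the project figures below are invented for the example):

```python
def roi(final_value: float, initial_cost: float) -> float:
    """Return on investment, expressed as a percentage."""
    return (final_value - initial_cost) / initial_cost * 100

# Hypothetical project: 150,000 of value generated for 100,000 of cost.
print(roi(150_000, 100_000))  # 50.0

# Cutting the cost of the same project to 80,000 raises the ROI.
print(roi(150_000, 80_000))   # 87.5
```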

On the other hand, a data science workflow describes the phases of a data science project. Workflows are useful for reminding all data science team members of the work that needs to be done. A workflow acts as a path that helps data scientists plan, organize, and implement data science projects.

Because this is a concept that has been around for a long time, there are numerous variations. For example, Harvard’s introductory data science courses employ Blitzstein and Pfister’s workflow, which is divided into five stages: ask an interesting question, get the data, explore the data, model the data, and communicate and visualize the results. First, the data scientist defines the scope of his research by asking himself, “What am I looking for in this project?” or “What am I trying to estimate?” In the second phase, he requires access to the data, whether he collected it himself or it was handed to him. He then explores the data and attempts to fit a model to it. The final stage of the framework is critical, because his insights will be useless if he is unable to communicate them.
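As a rough illustration of those five stages in code (the file path and column names below are made up for the example):

```python
import pandas as pd

# 1. Ask an interesting question: which product category drives the most revenue?

# 2. Get the data (hypothetical CSV used for illustration).
sales = pd.read_csv("data/sales.csv")

# 3. Explore the data.
print(sales.describe())
print(sales["category"].value_counts())

# 4. Model the data: a simple aggregation stands in for a real model here.
revenue_by_category = (
    sales.groupby("category")["revenue"].sum().sort_values(ascending=False)
)

# 5. Communicate and visualize the results.
revenue_by_category.plot(kind="bar", title="Revenue by product category")
```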

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is the best-known workflow. This framework’s goal is to standardize the process so that it can be used across industries. The framework consists of six iterative phases, each of which must be completed to produce a deliverable; as in the Harvard workflow, the project can loop back to a previous phase if necessary. It begins with business understanding, then moves on to data understanding, data preparation, modeling, evaluation, and finally deployment.

The first distinction between the two models lies in their intended application: one is designed for asking research questions in many types of fields, while the other is geared toward business purposes. Second, in the Harvard framework, data is not always provided to the data scientist, whereas in CRISP-DM, because it operates in a business environment, the data is usually provided by the company. Finally, CRISP-DM must incorporate the resulting knowledge into the company’s ecosystem.

As previously stated, a return on investment can be quantified as a return in money or in time, both of which improve when costs are lowered. In line with this, data scientists can be costly in terms of both money and time, and one way to reduce this cost is to optimize their workflow process.

This leads us to the case study question: how do we optimize the ROI of a data science workflow?

This case study will be divided into two parts: how to reduce project time while achieving the same results, and how to increase the return on investment.

Organization in the directory and the code

When working on a project, good organization will keep you from getting lost in your work. Some types of analysis, for example, require access to multiple documents, such as the dataset, util.py, or other Python files that are referenced. If the directory is not organized, you will waste a lot of time searching for the location of these documents. Similarly, it is easy to lose track of code when working in a Jupyter Notebook. One solution is to create an index in which you state exactly what you want to do with the dataset and then describe each step of the process. This helps you, and other data scientists, do exactly what you intended.
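A minimal sketch of what such an organization could look like (the layout and the helper below are just one possible convention, not a standard):

```python
# Illustrative project layout (one possible convention):
#
#   project/
#     data/            raw and processed datasets
#     notebooks/       Jupyter notebooks, each starting with an index cell
#     src/util.py      shared helper functions
#     reports/         figures and deliverables
#
# With this structure, a notebook imports shared helpers instead of
# copy-pasting code between analyses.
from src.util import load_dataset  # hypothetical helper defined in util.py

df = load_dataset("data/sales.csv")
```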

Have a good understanding of Python libraries

An understanding of libraries and how to code efficiently distinguishes a junior data scientist from a senior one. Seniority is associated with high performance, but efficiency comes with experience; the more you train on specific problems, the easier it will be the next time you start a new project. The better your understanding of the main Python libraries, such as pandas, NumPy, and Plotly, and their various functions, the more precise and efficient you will be. Furthermore, if you have encountered a similar problem in a previous project, it will be easier to solve it in this one.
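As a small illustration of the efficiency that comes with knowing these libraries (the data below is randomly generated for the example), the same column-wise computation can be written as a slow Python loop or as a vectorized pandas/NumPy expression:

```python
import numpy as np
import pandas as pd

# Made-up data: 100,000 rows of prices and quantities.
df = pd.DataFrame({
    "price": np.random.uniform(10, 100, 100_000),
    "quantity": np.random.randint(1, 10, 100_000),
})

# Inefficient: iterating row by row in pure Python.
revenue_loop = [row.price * row.quantity for row in df.itertuples()]

# Efficient: letting pandas/NumPy vectorize the same computation.
revenue_vectorized = df["price"] * df["quantity"]
```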

Know how to use tools to your advantage

Tools are essential in the workplace for being efficient and delivering a good product; for example, having plug-ins in Visual Studio can help your code-writing flow. Tool usage can also improve the workflow itself. For example, Shapelets provides a place to build your data app visualizations starting from just three lines of Python code; it is an easy and powerful tool to use. A data scientist will not only optimize his workflow but also become more efficient by using event association and triggers, data app visualization, widget state and data flow, and function execution.
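As a rough sketch of the idea (the class and method names below are assumptions for illustration and may not match the exact Shapelets API):

```python
# Hypothetical sketch of a minimal data app; the import path, class and
# method names are illustrative assumptions, not documented Shapelets calls.
from shapelets.apps import DataApp  # assumed import path
import pandas as pd

sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar"],
                      "revenue": [120, 150, 170]})

app = DataApp(name="sales_overview")   # create the data app
chart = app.line_chart(data=sales)     # build a chart widget from the data
app.place(chart)                       # add the widget to the app layout
app.register()                         # publish it to the shared workspace
```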

Event association and triggers

Event association and triggers allow you to customize each component of your widgets throughout the entire interaction with the data app, which makes it easier for the data scientist to automate tasks later on. A widget can be bound to a function (allowing arguments to be passed into that function), it can specify other widgets that are not part of the function arguments as triggers for the execution, and it can be connected to other widgets.

Data app visualization

Once the Python script is completed, the data scientist can register it on the system and see the app in action. The app also lives in a collaborative space where data scientists and business-oriented people can interact, which improves both parties’ interaction and cooperation around data visualization.

Widget state and data flow

The majority of the widgets in the app are interactive, which means you can work with them directly in the visualization and change their values. Because every widget’s state is constantly tracked, any change to a widget’s state, and any update that depends on it, is propagated quickly and efficiently.

Function execution

Shapelets can gather all the information needed and execute functions with different arguments by serializing the function and submitting it to the system. This allows the data scientist to reuse customized functions in different situations without having to write the code again.
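The underlying idea can be sketched with a generic serialization library; this is not Shapelets’ internal implementation, only an illustration of executing a serialized function with different arguments:

```python
import cloudpickle  # generic function-serialization library, used here as a stand-in

def forecast(series, horizon=3):
    """Toy function: naive forecast that repeats the last observed value."""
    return [series[-1]] * horizon

# Serialize the function once...
payload = cloudpickle.dumps(forecast)

# ...then restore and execute it later with different arguments,
# without writing the code again.
restored = cloudpickle.loads(payload)
print(restored([10, 12, 15]))        # [15, 15, 15]
print(restored([3, 4], horizon=5))   # [4, 4, 4, 4, 4]
```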

Conclusion

As we’ve seen, the financial return on an investment is measured by an increase in profits or a reduction in costs. Because data scientists spend so much time in exploratory mode, a good workflow is essential for efficiency.

Famous workflows such as CRISP-DM and Blitzstein and Pfister’s have been developed to address this issue, but improvements can still be made at each step. For instance, organization in the directory and the code helps you avoid getting lost in the work and makes your insights available to all colleagues. Understanding Python libraries helps you use each function efficiently and apply it to a specific problem; experience is also important for a better command of Python. Knowing your tools is the most important aspect of optimizing your workflow; by using plug-ins that help you write code or understand functions, you will be able to work faster.

Furthermore, Shapelets is an ideal tool for building your data app visualizations and automating your tasks. From event association and triggers to data app visualization, widget state and data flow, and function execution, knowing how to use Shapelets will drastically improve your performance and lower your business costs.

Clément Mercier

Data Scientist Intern

Clément Mercier received his Bachelor’s Degree in Finance from Hult International Business School in Boston and is currently finishing his Master’s Degree in Big Data at the IE School of Technology.

Clément has international experience working with startups and large corporations such as the Zinneken’s Group, MediateTech in Boston, and Nestlé in Switzerland.