Project Management | Towards Data Science
https://towardsdatascience.com/tag/project-management/
The world’s leading publication for data science, AI, and ML professionals.

Ivory Tower Notes: The Problem (Thu, 10 Apr 2025)
When a data science problem is "the" problem

Did you ever spend months on a Machine Learning project, only to discover you never defined the “correct” problem at the start? Whether that has happened to you, or you are only starting out in the data science or AI field, welcome to my first Ivory Tower Note, where I will address this topic.


The term “Ivory Tower” is a metaphor for a situation in which someone is isolated from the practical realities of everyday life. In academia, the term often refers to researchers who engage deeply in theoretical pursuits and remain distant from the realities that practitioners face outside academia.

As a former researcher, I wrote a short series of posts from my old Ivory Tower notes — the notes before the LLM era.

Scary, I know. I am writing this to manage expectations and to pre-empt the question, “Why ever did you do things this way?” — “Because no LLM told me how to do otherwise 10+ years ago.”

That’s why my notes contain “legacy” topics such as data mining, machine learning, multi-criteria decision-making, and (sometimes) human interactions, airplanes ✈ and art.

Nonetheless, whenever there is an opportunity, I will map my “old” knowledge to generative AI advances and explain how I applied it to datasets beyond the Ivory Tower.

Welcome to post #1…


How every Machine Learning and AI journey starts

 — It starts with a problem. 

For you, this is usually “the” problem because you need to live with it for months or, in the case of research, years.

With “the” problem, I am addressing the business problem you don’t fully understand or know how to solve at first. 

An even worse scenario is when you think you fully understand it and know how to solve it quickly. This then creates more problems that are, again, only yours to solve. But more about this in the upcoming sections. 

So, what’s “the” problem about?

Causa: It’s mostly about not managing or leveraging resources properly —  workforce, equipment, money, or time. 

Ratio: It’s usually about generating business value, which can span from improved accuracy, increased productivity, cost savings, revenue gains, faster reaction, decision, planning, delivery or turnaround times. 

Veritas: It’s always about finding a solution that relies on, and is hidden somewhere in, the existing dataset. 

Or, more than one dataset that someone labelled as “the one”, and that’s waiting for you to solve the problem. Because datasets follow, and are created from, technical or business process logs, “there has to be a solution lying somewhere within them.”

Ah, if only it were so easy.

Avoiding a different chain of thought again, the point is you will need to:

1 — Understand the problem fully,
2 — If not given, find the dataset “behind” it, and 
3 — Create a methodology to get to the solution that will generate business value from it. 

On this path, you will be tracked and measured, and time will not be on your side to deliver the solution that will solve “the universe equation.” 

That’s why you will need to approach the problem methodologically, drill down to smaller problems first, and focus entirely on them because they are the root cause of the overall problem. 

That’s why it’s good to learn how to…

Think like a Data Scientist.

Returning to the problem itself, let’s imagine that you are a tourist lost somewhere in a big museum, and you want to figure out where you are. What you do next is walk to the closest info map on the floor, which will show your current location. 

At this moment, in front of you, you see something like this: 

Data Science Process. Image by Author, inspired by Microsoft Learn

The next thing you might tell yourself is, “I want to get to Frida Kahlo’s painting.” (Note: These are the insights you want to get.)

Because your goal is to see this one painting that brought you miles away from your home and now sits two floors below, you head straight to the second floor. Beforehand, you memorized the shortest path to reach your goal. (Note: This is the initial data collection and discovery phase.)

However, along the way, you stumble upon some obstacles — the elevator is shut down for renovation, so you have to use the stairs. The museum paintings were reordered just two days ago, and the info plans didn’t reflect the changes, so the path you had in mind to get to the painting is not accurate. 

Then you find yourself wandering around the third floor already, asking quietly again, “How do I get out of this labyrinth and get to my painting faster?”

While you don’t know the answer, you ask the museum staff on the third floor to help you out, and you start collecting the new data to get the correct route to your painting. (Note: This is a new data collection and discovery phase.)

Nonetheless, once you get to the second floor, you get lost again, but what you do next is start noticing a pattern in how the paintings have been ordered chronologically and thematically to group the artists whose styles overlap, thus giving you an indication of where to go to find your painting. (Note: This is a modelling phase overlapped with the enrichment phase from the dataset you collected during school days — your art knowledge.)

Finally, after adapting the pattern analysis and recalling the collected inputs on the museum route, you arrive in front of the painting you had been planning to see since booking your flight a few months ago. 

What I described now is how you approach data science and, nowadays, generative AI problems. You always start with the end goal in mind and ask yourself:

“What is the expected outcome I want or need to get from this?”

Then you start planning from this question backwards. The example above started with requesting holidays, booking flights, arranging accommodation, traveling to a destination, buying museum tickets, wandering around in a museum, and then seeing the painting you’ve been reading about for ages. 

Of course, there is more to it, and this process should be approached differently if you need to solve someone else’s problem, which is a bit more complex than locating the painting in the museum. 

In this case, you have to…

Ask the “good” questions.

To do this, let’s define what a good question means [1]: 

A good data science question must be concrete, tractable, and answerable. Your question works well if it naturally points to a feasible approach for your project. If your question is too vague to suggest what data you need, it won’t effectively guide your work.

Formulating good questions keeps you on track so you don’t get lost in the data that should be used to get to the specific problem solution, or you don’t end up solving the wrong problem.

Going into more detail, good questions will help identify gaps in reasoning, avoid faulty premises, and create alternative scenarios in case things do go south (which almost always happens)👇🏼.

Image created by Author after analyzing “Chapter 2. Setting goals by asking good questions” from “Think Like a Data Scientist” book [2]

From the above-presented diagram, you understand how good questions, first and foremost, need to support concrete assumptions. This means they need to be formulated in a way that your premises are clear and ensure they can be tested without mixing up facts with opinions.

Good questions produce answers that move you closer to your goal, whether through confirming hypotheses, providing new insights, or eliminating wrong paths. They are measurable, and with this, they connect to project goals because they are formulated with consideration of what’s possible, valuable, and efficient [2].

Good questions are answerable with available data, considering current data relevance and limitations. 

Last but not least, good questions anticipate obstacles. If something is certain in data science, this is the uncertainty, so having backup plans when things don’t work as expected is important to produce results for your project.

Let’s exemplify this with one use case of an airline company that has a challenge with increasing its fleet availability due to unplanned technical groundings (UTG).

These unexpected maintenance events disrupt flights and cost the company significant money. Because of this, executives decided to react to the problem and call in a data scientist (you) to help them improve aircraft availability.

Now, if this were the first data science task you ever got, you might start the investigation by asking:

“How can we eliminate all unplanned maintenance events?”

You understand how this question is an example of the wrong or “poor” one because:

  • It is not realistic: It lumps every possible defect, both small and big, into one impossible goal of “zero operational interruptions”.
  • It doesn’t hold a measure of success: There’s no concrete metric to show progress, and if you’re not at zero, you’re at “failure.”
  • It is not data-driven: The question didn’t cover which data is recorded before delays occur, and how the aircraft unavailability is measured and reported from it.

So, instead of this vague question, you would probably ask a set of targeted questions:

  1. Which aircraft (sub)system is most critical to flight disruptions?
    (Concrete, specific, answerable) This question narrows down your scope, focusing on only one or two specific (sub) systems affecting most delays.
  2. What constitutes “critical downtime” from an operational perspective?
    (Valuable, ties to business goals) If the airline (or regulatory body) doesn’t define how many minutes of unscheduled downtime matter for schedule disruptions, you might waste effort solving less urgent issues.
  3. Which data sources capture the root causes, and how can we fuse them?
    (Manageable, narrows the scope of the project further) This clarifies which data sources one would need to find the problem solution.

With these sharper questions, you will drill down to the real problem:

  • Not all delays weigh the same in cost or impact. The “correct” data science problem is to predict critical subsystem failures that lead to operationally costly interruptions so maintenance crews can prioritize them.

That’s why…

Defining the problem determines every step after. 

It’s the foundation upon which your data, modelling, and evaluation phases are built 👇🏼.

Image created by Author after analyzing and overlapping different images from “Chapter 2. Setting goals by asking good questions, Think Like a Data Scientist” book [2]

It means you are clarifying the project’s objectives, constraints, and scope; you need to articulate the ultimate goal first and, besides asking “What’s the expected outcome I want or need to get from this?”, also ask: 

What would success look like and how can we measure it?

From there, drill down to (possible) next-level questions that I have learned from the Ivory Tower days:
 — History questions: “Has anyone tried to solve this before? What happened? What is still missing?”
 — Context questions: “Who is affected by this problem and how? How are they partially resolving it now? Which sources, methods, and tools are they using now, and can they still be reused in the new models?”
 — Impact questions: “What happens if we don’t solve this? What changes if we do? Is there a value we can create by default? How much will this approach cost?”
 — Assumption questions: “What are we taking for granted that might not be true (especially when it comes to data and stakeholders’ ideas)?”
 — …

Then, do this in the loop and always “ask, ask again, and don’t stop asking” questions so you can drill down and understand which data and analysis are needed and what the ground problem is. 

This is the evergreen knowledge you can apply nowadays, too, when deciding if your problem is of a predictive or generative nature.

(More about this in some other note where I will explain how problematic it is trying to solve the problem with the models that have never seen — or have never been trained on — similar problems before.)

Now, going back to memory lane…

I want to add one important note: I have learned from late nights in the Ivory Tower that no amount of data or data science knowledge can save you if you’re solving the wrong problem and trying to get the solution (answer) from a question that was simply wrong and vague. 

When you have a problem on hand, do not rush into assumptions or building the models without understanding what you need to do (Festina lente).

In addition, prepare yourself for unexpected situations and do a proper investigation with your stakeholders and domain experts because their patience will be limited, too. 

With this, I want to say that the “real art” of being successful in data projects is knowing precisely what the problem is, figuring out if it can be solved in the first place, and then coming up with the “how” part. 

You get there by learning to ask good questions.

To end this narrative, recall the quote often attributed to Einstein:

If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.


Thank you for reading, and stay tuned for the next Ivory Tower note.

If you found this post valuable, feel free to share it with your network. 👏

Connect for more stories on Medium ✍ and LinkedIn 🖇.


References: 

[1] DS4Humans, Backwards Design, accessed: April 5th 2025, https://ds4humans.com/40_in_practice/05_backwards_design.html#defining-a-good-question

[2] Godsey, B. (2017), Think Like a Data Scientist: Tackle the data science process step-by-step, Manning Publications.

Data Scientist: From School to Work, Part I (Wed, 19 Feb 2025)

Nowadays, data science projects do not end with the proof of concept; every project has the goal of being used in production. It is important, therefore, to deliver high-quality code. I have been working as a data scientist for more than ten years, and I have noticed that juniors usually have weak development skills, which is understandable, because to be a data scientist you need to master math, statistics, algorithmics and development, and have some knowledge of operational development. In this series of articles, I would like to share some tips and good practices for managing a professional data science project in Python. From Python to Docker, with a detour to Git, I will present the tools I use every day.


The other day, a colleague told me how he had to reinstall Linux because of an incorrect manipulation with Python. He had restored an old project that he wanted to customize. As a result of installing and uninstalling packages and changing versions, his Linux-based Python environment was no longer functional: an incident that could easily have been avoided by setting up a virtual environment. But it shows how important it is to manage these environments. Fortunately, there is now an excellent tool for this: uv.
The origin of these two letters is not clear. According to Zanie Blue (one of the creators):

“We considered a ton of names — it’s really hard to pick a name without collisions this day so every name was a balance of tradeoffs. uv was given to us on PyPI, is Astral-themed (i.e. ultraviolet or universal), and is short and easy to type.”

Now, let’s go into a little more detail about this wonderful tool.


Introduction

UV is a modern, minimalist manager for Python projects and packages. Developed entirely in Rust, it has been designed to simplify dependency management, virtual environment creation and project organization, and to limit common Python project problems such as dependency conflicts and environment management issues. It aims to offer a smoother, more intuitive experience than traditional tools such as the pip + virtualenv combo or the Conda manager. It is claimed to be 10 to 100 times faster than traditional handlers.

Whether for small personal projects or developing Python applications for production, UV is a robust and efficient solution for package management. 


Starting with UV

Installation

To install UV, if you are using Windows, I recommend using this command in a shell:

winget install --id=astral-sh.uv  -e

And, if you are on Mac or Linux, use the command:
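curl -LsSf https://astral.sh/uv/install.sh | sh

(This is the standalone installer script documented by Astral; if in doubt, check https://docs.astral.sh/uv/ for the current instructions.)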

To verify correct installation, simply type into a terminal the following command:

uv version

Creation of a new Python project

Using UV you can create a new project by specifying the version of Python. To start a new project, simply type into a terminal:

uv init --python x.xx project_name

python x.xx must be replaced by the desired version (e.g. python 3.12). If you do not have the specified Python version, UV will take care of this and download the correct version to start the project.

This command creates and automatically initializes a Git repository named project_name. It contains several files:

  • A .gitignore file. It lists the elements of the repository to be ignored by Git versioning (it is basic and should be rewritten for a project that is ready to deploy).
  • A .python-version file. It indicates the Python version used in the project.
  • The README.md file. Its purpose is to describe the project and explain how to use it.
  • A hello.py file.
  • The pyproject.toml file. This file contains all the information about the tools used to build the project.
  • The uv.lock file. It is used to create the virtual environment when you use uv to run the script (it can be compared to requirements.txt).

Package installation

To install new packages in this new environment, you have to use:

uv add package_name

When the add command is used for the first time, UV creates a new virtual environment in the current working directory and installs the specified dependencies. A .venv/ directory appears. On subsequent runs, UV will use the existing virtual environment and install or update only the new packages requested. In addition, UV has a powerful dependency resolver. When executing the add command, UV analyzes the entire dependency graph to find a compatible set of package versions that meet all requirements (package version and Python version). Finally, UV updates the pyproject.toml and uv.lock files after each add command.
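To make this concrete, here is a sketch of what pyproject.toml might look like after running uv add pandas on a freshly initialized project (the exact versions and metadata below are illustrative, not taken from the article):

[project]
name = "project_name"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "pandas>=2.2.3",
]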

To uninstall a package, type the command:

uv remove package_name

It is very important to clean unused packages from your environment. You have to keep the dependency file as minimal as possible. If a package is not used or is no longer used, it must be deleted.

Run a Python script

Now, your repository is initialized, your packages are installed, and your code is ready to be tested. You can activate the created virtual environment as usual, but it is more efficient to use the UV run command:

uv run hello.py

Using the run command guarantees that the script will be executed in the virtual environment of the project.


Manage the Python versions

It is common to need different Python versions across projects. As mentioned in the introduction, you may be working on an old project that requires an old Python version, and it will often be too difficult to update it. To see which Python versions are installed on your machine (and which ones UV can install), type:

uv python list

At any time, it is possible to change the Python version of your project. To do that, you have to modify the line requires-python in the pyproject.toml file.

For instance: requires-python = “>=3.9”

Then you have to synchronize your environment using the command:

uv sync

The command first checks existing Python installations. If the requested version is not found, UV downloads and installs it. UV also creates a new virtual environment in the project directory, replacing the old one.

But the new environment does not have the required package. Thus, after a sync command, you have to type:

uv pip install -e .

Switch from virtualenv to uv

If you have a Python project initiated with pip and virtualenv and wish to switch to UV, nothing could be simpler. If there is no requirements file, you need to activate your virtual environment and then export the installed packages and their versions:

pip freeze > requirements.txt

Then, you have to init the project with UV and install the dependencies:

uv init .
uv pip install -r requirements.txt
Correspondence table between pip + virtualenv and UV, image by author.

Use the tools

UV offers the possibility of using tools via the uv tool command. Tools are Python packages that provide command-line interfaces, such as ruff, pytest, mypy, etc. To install a tool, type the command line:

uv tool install tool_name

But, a tool can be used without having been installed:

uv tool run tool_name

For convenience, an alias was created: uvx, which is equivalent to uv tool run. So, to run a tool, just type:

uvx tool_name
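For example, assuming you want to run the ruff linter mentioned above on the current project without installing it first:

uvx ruff check .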

Conclusion

UV is a powerful and efficient Python package manager designed to provide fast dependency resolution and installation. It significantly outperforms traditional tools like pip or conda, making it an excellent choice to manage your Python projects.

Whether you’re working on small scripts or large projects, I recommend you get into the habit of using UV. And believe me, trying it out means adopting it.


References

1 — UV documentation: https://docs.astral.sh/uv/

2 — UV GitHub repository: https://github.com/astral-sh/uv

3 — A great datacamp article: https://www.datacamp.com/tutorial/python-uv

Manage Environment Variables with Pydantic (Wed, 12 Feb 2025)

Introduction

Developers work on applications that are supposed to be deployed on some server in order to allow anyone to use them. Typically, on the machine where these apps live, developers set up environment variables that allow the app to run. These variables can be API keys of external services, the URL of your database, and much more.

For local development though, it is really inconvenient to declare these variables on the machine because it is a slow and messy process. So I’d like to share in this short tutorial how to use Pydantic to handle environment variables in a secure way.

.env file

What you commonly do in a Python project is to store all your environment variables in a file named .env. This is a text file containing all the variables in a key=value format. You can also use the value of one variable when declaring another by leveraging the ${} syntax.

The following is an example:

#.env file

OPENAI_API_KEY="sk-your private key"
OPENAI_MODEL_ID="gpt-4o-mini"

# Development settings
DOMAIN=example.org
ADMIN_EMAIL=admin@${DOMAIN}

WANDB_API_KEY="your-private-key"
WANDB_PROJECT="myproject"
WANDB_ENTITY="my-entity"

SERPAPI_KEY="your-api-key"
PERPLEXITY_TOKEN="your-api-token"

Be aware that the .env file should remain private, so it is important that this file is listed in your .gitignore file, to be sure that you never push it to GitHub; otherwise, other developers could steal your keys and use the tools you’ve paid for.

env.example file

To ease the life of developers who will clone your repository, you could include an env.example file in your project. This is a file containing only the keys of what is supposed to go into the .env file. In this way, other people know what APIs, tokens, or secrets in general they need to set to make the scripts work.

#env.example

OPENAI_API_KEY=""
OPENAI_MODEL_ID=""

DOMAIN=""
ADMIN_EMAIL=""

WANDB_API_KEY=""
WANDB_PROJECT=""
WANDB_ENTITY=""

SERPAPI_KEY=""
PERPLEXITY_TOKEN=""

python-dotenv

python-dotenv is the library you use to load the variables declared into the .env file. To install this library:

pip install python-dotenv

Now you can use load_dotenv to load the variables, and then get a reference to these variables with the os module.

import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_MODEL_ID = os.getenv('OPENAI_MODEL_ID')

This method will first look into your .env file to load the variables you’ve declared there. If this file doesn’t exist, the variable will be taken from the host machine. This means that you can use the .env file for your local development but then when the code is deployed to a host environment like a virtual machine or Docker container we are going to directly use the environment variables defined in the host environment.

Pydantic

Pydantic is one of the most used libraries in Python for data validation. It is also used for serializing and deserializing classes into JSON and back. It automatically generates JSON schema, reducing the need for manual schema management. It also provides built-in data validation, ensuring that the serialized data adheres to the expected format. Lastly, it easily integrates with popular web frameworks like FastAPI.

pydantic-settings is a Pydantic feature needed to load and validate settings or config classes from environment variables.

!pip install pydantic-settings

We are going to create a class named Settings. This class will inherit from BaseSettings, which means that, by default, the values of its fields are read from environment variables (or from the .env file). If no variable is found, the default value is used, if one was provided.

from pydantic_settings import BaseSettings, SettingsConfigDict

from pydantic import (
    AliasChoices,
    Field,
    RedisDsn,
)


class Settings(BaseSettings):
    auth_key: str = Field(validation_alias='my_auth_key')  
    api_key: str = Field(alias='my_api_key')  

    redis_dsn: RedisDsn = Field(
        'redis://user:pass@localhost:6379/1', #default value
        validation_alias=AliasChoices('service_redis_dsn', 'redis_url'),  
    )

    model_config = SettingsConfigDict(env_prefix='my_prefix_')

In the Settings class above we have defined several fields. The Field class is used to provide extra information about an attribute.

In our case, we set up a validation_alias, so the variable name to look for in the .env file is overridden. In the case reported above, the environment variable my_auth_key will be read instead of auth_key.

You can also have multiple aliases to look for in the .env file, which you specify by leveraging AliasChoices(choice1, choice2).

The last attribute, model_config, configures how the settings themselves are loaded. With env_prefix set, fields without an explicit alias are read from environment variables starting with that prefix (here, my_prefix_), which is a convenient way to group all the variables regarding a particular topic (e.g. the connection to a database).
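One practical detail worth noting: out of the box, BaseSettings reads from the process environment; to have it read a local .env file directly (without calling load_dotenv first), you can point model_config at the file. A minimal sketch, reusing two variables from the .env example earlier (the class name and defaults below are my own, not the article's example):

from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    openai_api_key: str                    # read from OPENAI_API_KEY
    openai_model_id: str = "gpt-4o-mini"   # default used if the variable is absent

    # tell pydantic-settings to read the .env file as well
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

app_settings = AppSettings()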

Instantiate and use settings

The next step would be to actually instantiate and use these settings in your Python project.

from pydantic_settings import BaseSettings, SettingsConfigDict

from pydantic import (
    AliasChoices,
    Field,
    RedisDsn,
)


class Settings(BaseSettings):
    auth_key: str = Field(validation_alias='my_auth_key')  
    api_key: str = Field(alias='my_api_key')  

    redis_dsn: RedisDsn = Field(
        'redis://user:pass@localhost:6379/1', #default value
        validation_alias=AliasChoices('service_redis_dsn', 'redis_url'),  
    )

    model_config = SettingsConfigDict(env_prefix='my_prefix_')

# create immediately a settings object
settings = Settings()

Now we can use the settings in other parts of our codebase.

from Settings import settings

print(settings.auth_key) 

You finally have easy access to your settings, and Pydantic helps you validate that the secrets have the correct format. For more advanced validation tips, refer to the Pydantic documentation: https://docs.pydantic.dev/latest/

Final thoughts

Managing the configuration of a project is a boring but important part of software development. Secrets like API keys and database connections are what usually power your application. Naively, you could hardcode these variables in your code and it would still work, but for obvious reasons this is not a good practice. In this article, I showed you an introduction to using pydantic-settings to handle your configurations in a structured and safe way.

💼 Linkedin | 🐦 X (Twitter) | 💻 Website

Behind the Scenes of a Successful Data Analytics Project (Thu, 23 Jan 2025)

Learn the steps to approach any data analytics project like a pro.

Having worked as a data analyst for a while and tackled numerous projects, I can say that even though each project is unique, there is always a proven way to approach it.

Today, I’ll share with you the steps I usually take when working on a data project so you can follow them too.

Step 1: Define the Problem and Objectives

You cannot solve a problem or answer a business question if you do not understand what it is and how it fits into the bigger picture.

No matter how big or complex the task is, you must always understand what your business stakeholders are trying to achieve before diving into data. This is the part where you ask many questions, and before you get at least some answers, you are not diving into any data.

I learned this the hard way early in my career. Back then, when a vague request like "We saw visitors drop this month. Can you check why?" came in, I would immediately jump into the work. But every single time, I wasted hours trying to understand the real problem because I didn’t ask the right questions upfront.

I didn’t ask for context:

  • Why did the team need the traffic to be high?
  • What was the chosen strategy (brand awareness vs demand generation)?
  • What were the chosen tactics (paid search vs programmatic)?
  • What were the investments?

I didn’t ask stakeholders what they would do after receiving the data.

  • Did they want to increase signups and sales?
  • Were they aware that website visits may look impressive but not necessarily correlate with business outcomes and that focusing on metrics such as conversion rate would have a much better effect?

This initial step is important because it affects everything else: the data sources you will use to retrieve the data, the metrics you will analyze, the format you will use to present the insights, and the timeline you need to be ready for.

So don’t ever skip it or settle for a partial understanding, hoping you will figure it out along the way.

Step 2: Set Expectations

Once you’ve defined the problem, it’s time to set expectations.

Stakeholders don’t always realize how much time and effort goes into collecting and analyzing data. You are among the few people in the organization who can find the answers, so you receive many requests. That is why you need to prioritize and set expectations.

Understanding the problem, its complexity, and how it aligns with the organization’s goals (from Step 1) helps you prioritize and communicate to stakeholders when the task can be done, or why you will not be prioritizing it right now. You want to focus on the most impactful work.

A colleague of mine took a smart approach. They required stakeholders to fill out a questionnaire when submitting a task. This questionnaire included various questions about the problem description, timeline, etc., and it also asked, "What will you do with the insights?". This approach not only gathered all the necessary information upfront, eliminating the need for back-and-forth communication, but it also made stakeholders think twice before submitting another "Can you quickly look at…?" request. Genius, right?

Step 3: Prepare the Data

Now that you’ve defined the problem and set expectations, it’s time to prepare the data.

This is the step where you ask yourself:

  • Do I have all the data available, or do I need to collect it first?
  • Do I have all the domain knowledge needed, or do I need to do the research?
  • Do I have documentation available for the associated datasets? (If there is no documentation, you may need to contact the data owners for clarification.)

Another critical question to answer at this step is "What metrics should I measure?"

I always align my metrics with the business objectives. For instance, if the goal is to increase brand awareness, I prioritize metrics like impressions, branded search volume, direct traffic, and reach. If the objective is to drive sales, I focus on conversion rates, average order value, and customer acquisition cost. I also explore secondary metrics (demographics, device usage, customer behavior) to ensure my analysis is comprehensive and paints a complete picture.

Step 4: Explore the Data

Now comes the fun part – Exploratory Data Analysis (EDA). I love this part because it is where all the magic happens. Like a detective, you review the evidence, investigate the case, formulate hypotheses, and look for hidden patterns.

As you explore the data, you:

  • Ask better questions. As you become more familiar with data, you can approach data owners with concrete questions, making you look competent, knowledgeable, and confident in the eyes of your colleagues.
  • Innovate with feature engineering. You understand whether or not you need to create new features from existing ones. This helps to better capture the underlying patterns in the data that would otherwise go unnoticed.
  • Assess data quality. You check the number of rows of data and whether there are any anomalies, such as outliers, missing, or duplicate data.

If the exploration step shows the data needs to be cleaned (and believe me, it is more often the case than not), you proceed with data cleanup.

Step 5: Clean the Data

No matter how polished a dataset looks at first glance, never assume it’s clean. Data quality issues are more common than not.

The most common data quality problems you need to fix are:

1. Missing values:

The way you will handle missing data differs from case to case.

  • If it is due to errors in data entry, collaborate with the relevant teams to correct it.
  • If the original data cannot be recovered, you need to either remove missing values or impute them using industry benchmarks, calculating the mean or median, or applying machine learning methods.
  • If missing values represent a small portion of the dataset and won’t significantly impact the analysis, it is usually OK to remove them.

2. Inconsistent data: Check data for inconsistent data formats and standardize them.

3. Duplicate records: Identify and remove duplicate records to avoid skewing results.

4. Outliers or errors in data: Check for outliers or errors in the data. Based on the context, decide whether to remove, fix, or keep them.
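To make these fixes concrete, here is a minimal pandas sketch (the file and column names are hypothetical, not from a real project):

import pandas as pd

df = pd.read_csv("visits.csv")  # hypothetical dataset

# 1. Missing values: drop rows where the key metric is missing,
#    impute a numeric column with its median
df = df.dropna(subset=["conversion_rate"])
df["session_duration"] = df["session_duration"].fillna(df["session_duration"].median())

# 2. Inconsistent data: standardize formats
df["country"] = df["country"].str.strip().str.upper()
df["visit_date"] = pd.to_datetime(df["visit_date"], errors="coerce")

# 3. Duplicate records: keep the first occurrence
df = df.drop_duplicates(subset=["visitor_id", "visit_date"])

# 4. Outliers: flag values far outside the interquartile range for review
q1, q3 = df["session_duration"].quantile([0.25, 0.75])
iqr = q3 - q1
df["duration_outlier"] = ~df["session_duration"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)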

Once your data is cleaned, it is time to proceed to the analysis phase.

Step 6: Analyze the Data

This is where your detective work starts to pay off.

The key is to start with a very focused and specific question and not to be biased by having a hypothesis in mind. Using data to tell the story you or your colleagues want or expect to hear might be tempting, but you must let the dataset speak for itself.

I prefer to use the root-cause approach when analyzing data. For example, to answer the question, "Why do we see a drop in signups?" I would follow these 10 steps:

  1. Trend analysis: When does the drop happen for the first time? Is it seasonal?
  2. Traffic and conversion rates: Are fewer people visiting the site or fewer visitors signing up?
  3. Offer performance: Is the decline widespread or isolated to a particular offer?
  4. Website Performance: Are there any technical issues or broken links?
  5. User insights: Is the pattern specific to a particular segment or all users?
  6. User journey analysis: Are there any friction points where potential customers drop off?
  7. Campaign performance: Have any recent marketing campaigns or changes in strategy, budget allocation, or execution impacted effectiveness?
  8. Competitor activity: Have competitors launched a marketing campaign, new product, or feature? Have they changed their prices? Is there another reason that might be attracting customers away?
  9. Market trends: Are there market trends and changes in consumer behavior affecting sales in the industry?
  10. Customer feedback: Are customers dissatisfied with the offering? Did their needs change? Do we receive more support tickets?
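For step 1 (trend analysis), for example, a first pass could be a quick pandas check like the hypothetical sketch below (the file and column names are placeholders, not from a real project):

import pandas as pd

signups = pd.read_csv("signups.csv", parse_dates=["signup_date"])

# Weekly signup counts and week-over-week change, to spot when the drop starts
weekly = signups.set_index("signup_date").resample("W").size()
print(weekly.tail(12))
print(weekly.pct_change().tail(12))

# Same weeks of the previous year, to rule out seasonality
previous_year = weekly[weekly.index.year == weekly.index.year.max() - 1]
print(previous_year.tail(12))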

Another important point is that the fastest and most accurate answers aren’t usually the same, and a lot depends on the context. That is why you need to collaborate with cross-functional teams and develop strong domain and industry knowledge.

Step 7: Build the Story

This step is my second favorite after data exploration because it is when all the data pieces fall into place, revealing a clear story and making perfect sense.

A common mistake here is including everything you found interesting instead of focusing on what the audience cares about. I get it. After working hard to get insights, it’s tempting to show off all the cool stuff you did. But if you overload your audience with data, you can further confuse them.

Don’t throw every data point at stakeholders; focus on what matters most to your audience instead. Think about their level of seniority, how familiar they are with the topic, their data literacy level, how much time they have, and whether you’re presenting in person or sending a report via email. This way you don’t waste anyone’s time – yours or theirs.

Lastly, always include actionable recommendations to stakeholders in your story. Your story should guide stakeholders on the next steps, ensuring that your insights drive meaningful decisions.

This brings us to the next point – sharing the insights and recommendations.

Step 8: Share the Insights

As a Data Analyst, you have the power to drive change. The secret lies in how you share data and tell the story.

First, consider the format your audience expects (see Step 1). Are you creating a dashboard, emailing a report, or presenting in person? Data storytelling becomes crucial for live presentations.

A great data story blends data, narrative, visuals, and practice:

Data: Focus only on insights with real business impact. If you can’t find a compelling reason why your insight will matter to the audience, if it’s unclear what they should do with the insights, or if the business impact is minimal, move it to the appendix.

Narrative: Ensure that your story has a clear structure.

  • Set the scene: What’s happening now?
  • Introduce the problem (to create some tension).
  • Reveal the key insights: What did you discover?
  • Finish with actionable steps: What should they do next?

This keeps your audience interested and makes your story memorable.

Visuals: The chart that helped you discover an insight isn’t always the best for presenting it. Highlight the key points and avoid clutter. For example, if you analyzed 10 categories but only 2 are critical, focus on those.

Practice: Practicing helps you feel more comfortable with the material. It also allows you to focus on important things like eye contact, hand gestures, and pacing. The more you practice, the more confident and credible you will appear.

You might think that once you’ve shared your insights, your job as a data analyst is done. In reality, you want people not only to hear what you’ve discovered but also to act on your insights. This leads us to the final step – making people act on your data.

Step 9: Make People Act on Your Data

Seeing my work have an impact and a chance to drive real change brings me the most satisfaction. So don’t let your hard work go to waste either.

  • Work with the relevant teams to set clear action steps, timelines, and success metrics.
  • Monitor progress and ensure your recommendations are being implemented.
  • Communicate regularly with cross-functional teams to track the impact of your recommendations.

I understand that this might feel like a lot right now, but please don’t worry. With practice, it will become easier, and before you know it, these steps will become second nature.

Good luck on your data analyst journey! You’re on the right track!


All images were created by the author using Canva.com.

Documenting Python Projects with MkDocs (Fri, 22 Nov 2024)
Use Markdown to quickly create a beautiful documentation page for your projects


Introduction

Project Documentation is necessary. Very necessary, I would emphasize.

At the beginning of my career, I learned the hard way that a project must be documented.

Let’s go back in time – to the 2000s – when I was working as a Customer Representative for large US companies. I was part of a team and my colleagues and I had joined the company around the same month. So, for a while, there was no need to worry because nobody was going on vacation just a few weeks or months after starting a new job.

However, after some time, it inevitably would happen. And we were all assigned to back up each other. That is when documentation started to play a major part in my career.

The day the first person took a few days off, I panicked! I got to work and I didn’t know what to do or even where to start. The tasks kept coming and piling up while I was trying to figure out how to process them.

In the end, everything turned out well. I was able to figure it out and move on. But from that day on, I knew that documentation needed to be in place for any time off or team movement, like promotions or offboardings.

In this post, we will learn how to create a simple (and effective) project documentation using Mkdocs in Python. The final result will look similar to MkDocs documentation.

Building The Documentation

mkdocs is a module in Python that allows us to create simple web pages using Markdown language. The benefit is that it is highly customizable, and gives your documentation a professional look, besides easily integrating with GitHub.

Additionally, mkdocs leverages the Markdown notation language, which is very simple to use, being just plain text with the addition of a couple of signs to mark titles, subtitles, bullet points, italics, bold, etc. To illustrate, Medium uses Markdown language for blogging.

Markdown is a lightweight markup language for creating web formatted text using a plain-text editor.

Preparation

I believe that the best time to create the documentation is once we finish the project. At that point, we already know which modules were used, how it was deployed, and how the project can be started and maintained. So, it is time to document those steps for the users.

When documenting something, my experience tells me to:

  • Describe it as if you were describing how to run the project to a complete layperson.
  • Try to avoid highly technical terms, and acronyms used in your company.
  • Describe each step using clear and simple language.
  • If the concept is too dense, or the task is too complex, try breaking it down into bullets.

Before starting with the documentation, let’s create a sample project real quick, using the module uv for virtual environment management. Here, I am using uv and VSCode.

  1. Open a terminal and install uv with pip install uv
  2. Create a new project named "p2": uv init p2
  3. Change the directory to access the new folder: cd p2
  4. Set the Python version for the project: pyenv local 3.12.1
  5. Create a new virtual environment: uv venv --python 3.12.1
  6. Activate the environment: .venv/Scripts/activate
  7. Add some packages: uv add pandas numpy scikit-learn streamlit
Sample project created. Image by the author.

Getting Started

Having the project created, let’s add mkdocs.

# Install mkdocs
uv add mkdocs

Next, we will create a new documentation folder.

# create a new documentation folder in your project
mkdocs new .

That command will generate a docs folder and the files needed for the documentation.

  • File mkdocs.yml: It is used to configure your documentation webpage: the title, theme, and site structure (for example, adding new tabs).
  • Folder docs with the file index.md: This file is where you will write the documentation itself.
Documentation folder. Image by the author.

If we want to look at our documentation, we already can. Just use the serve command.

# Open a local server to display the docs
mkdocs serve
Local version running on the port 8000. Image by the author.

Now, we can just copy and paste that HTTP address into a browser (or Ctrl + click it) to see how the documentation currently looks.

MkDocs documentation "out of the box". Image by the author.

Customization

It is time to customize our documentation.

Let’s start by changing the title of the documentation page. Open the mkdocs.yml file. You will see only the site_name entry in the default file.

mkdocs.yml default file. Image by the author.

Let’s change it.

site_name: P2 Project Documentation

We can add a new tab About with the information about the project. For that to actually work, we also need to add a markdown file about.md to the folder docs.

site_name: P2 Project Documentation
nav:
  - Home: index.md
  - About: about.md

And we can change the theme if we want to. Check the available built-in themes here, or the gallery of installable themes here.

site_name: P2 Project Documentation
nav:
  - Home: index.md
  - About: about.md
theme: mkdocs

Here is the result, so far.

Mkdocs is easily customizable. Image by the author.
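If you prefer one of the installable themes instead, mkdocs-material is a popular choice. A hedged sketch of what that would involve (not part of the original walkthrough): first add the package, for example with uv add mkdocs-material, and then reference it by name in mkdocs.yml:

site_name: P2 Project Documentation
nav:
  - Home: index.md
  - About: about.md
theme:
  name: material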

Next, let us start writing the documentation. This should be done in a markdown file within the folder docs.

I will write the whole example documentation to the file index.md and the project meta information will go to the file about.md.

  • File index.md

We will erase the sample text that is in there and write our documentation instead.

# P2 Project

This project is an example of how we can write a professional documentation using `mkdocs` module in Python.<br>
To learn MarkDown notation, use this [Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Here-Cheatsheet).

---

## Python Version

This project was built using **Python 3.12.1**

---

## Modules

* mkdocs >= 1.6.1
* numpy >= 2.1.3
* pandas >= 2.2.3
* scikit-learn >= 1.5.2
* seaborn >= 0.13.2
* streamlit >= 1.40.1

---

## Quick Start

To create a documentation with MkDocs, these are the main bash commands:

* Install mkdocs: `pip install mkdocs`
* Create a new documentation folder: `mkdocs new .`
* `mkdocs.yml` is the file to customize the web page, such as creating tabs, changing titles and themes.
* The files in the folder **docs** are the ones to hold the documentation text, using MarkDown notation.

  • File about.md

# About This Project
 <br>

* **Author:** Your Name
* **Purpose:** Exemplify how to create a professional looking documentation for your projects using MarkDown notation in Python.

---

### Contact

Find me on [Linkedin](https://www.linkedin.com/in/profile)

The final result is this beautiful documentation site.

Final documentation website. Image by the author.

Adding Functions and Docstrings

MkDocs also has the ability to pull functions and their respective docstrings from the code. To do that, first add the module mkdocstrings-python, using pip install mkdocstrings-python. In our case, I am using uv.

uv add mkdocstrings-python

Next, adjust the mkdocs.yml file to add the plugin. Add the following lines to the end of the file and save it.

plugins:
- mkdocstrings:
    default_handler: python

Now, let’s look at our code. In this example project, we have only the file hello.py with two functions.

Functions in the python file. Image by the author.
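As a stand-in for the screenshot above, this is roughly what hello.py could contain. Only the function names (main_func and calculations) come from the article; the bodies and docstrings below are hypothetical:

# hello.py (hypothetical reconstruction; only the function names are from the article)
def main_func(name: str) -> str:
    """Return a greeting for the given name."""
    return f"Hello, {name}!"


def calculations(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b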

Adding them to the documentation is pretty simple. Use three ::: followed by the path to your file. Since this file is in the main folder of the project, we simply add the file_name.function. If it is within a folder, you can use something like folder.file.function.

### Function

These are the functions used in the code `hello.py`

::: hello.main_func
::: hello.calculations

After saving the file, we can look at the result with mkdocs serve.

Functions pulled directly from the .py file. Image by the author.

Now let’s deploy the docs.

Deploying

Deploying our documentation page is simple.

First, we must create a GitHub repository for the project, if we don’t already have one.

Next, go back to the IDE terminal, where we will build our page with the next command. This command will create the folders and files necessary to deploy the documentation website.

mkdocs build

[OUT]:
INFO  -  Cleaning site directory
INFO  -  Building documentation to directory: C:\MyDocuments\testes\p2\site
INFO  -  Documentation built in 0.06 seconds

Now, we need to add the GitHub repository to the mkdocs.yml file, so the module knows where to deploy the documentation.

Adding the repository name to the mkdocs.yml. Image by the author.

Then we open a Git Bash Terminal to initialize Git and commit.

# Initialize Git
git init

# Add the reference repository
git remote add origin https://github.com/gurezende/MkDocs-Example.git

# Add files from the project
git add .

# Commit the files
git commit -m "Project Code and documentation"

# Create Branch Main
git branch -M main

# Push files to GitHub
git push -u origin main

And then we can deploy the documentation with the following command in a PowerShell terminal.

mkdocs gh-deploy

## Output ##
INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: C:\MyDocuments\testes\p2\site
INFO    -  Documentation built in 0.08 seconds
WARNING -  Version check skipped: No version specified in previous deployment.
INFO    -  Copying 'C:\MyDocuments\testes\p2\site' to 'gh-pages' branch and pushing to GitHub.
Enumerating objects: 39, done.
Counting objects: 100% (39/39), done.
Delta compression using up to 12 threads
Compressing objects: 100% (37/37), done.
Writing objects: 100% (39/39), 829.12 KiB | 11.68 MiB/s, done.
Total 39 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), done.
remote: 
remote: Create a pull request for 'gh-pages' on GitHub by visiting:
remote:      https://github.com/gurezende/MkDocs-Example/pull/new/gh-pages
remote: 
To https://github.com/gurezende/MkDocs-Example.git
 * [new branch]      gh-pages -> gh-pages
INFO - Your documentation should shortly be available at: https://gurezende.github.io/MkDocs-Example/

Notice that on the last line, we have the URL where the documentation was deployed. This address can be added to your GitHub readme file.

P2 Project

After deployment, we just need to push the updates again to update GitHub, using the following commands on a Git Bash terminal.

git add .
git commit -m "Online documentation added"
git push origin main

That’s it! The documentation is live!

From now on, every time we update the markdown files from our project and command mkdocs gh-deploy, the web page is updated and our documentation stays up to date. Easy like that!

GitHub page of the project. Image by the author.

Before You Go

Documenting your projects is important.

After all, nobody knows what was in your head when you developed something. Therefore, documenting is like showing your line of thought, the steps used to reach an end.

Open a window in your mind to show other people how you created that product and how to use it.

MkDocs makes it so easy and looks super professional. I am sure it will help a lot in documenting your projects at work, helping fellow colleagues navigate your code, as well as positively impacting anyone who looks at your portfolio from now on.

If you liked this content, follow me for more.

Gustavo R Santos

GitHub Repository

Here is the GitHub Repository for this article.

GitHub – gurezende/MkDocs-Example: Creating documentation for Python Projects with mkdocs

Learn More

If you want to see this content in video, here’s a product from my Gumroad page.

Create Stunning Documentation with Python and MkDocs

References

MkDocs

Usage – mkdocstrings-python

PYTHON – Project Documentation with MkDocs and Python

Markdown Here Cheatsheet

Choosing Your Theme – MkDocs

Gallery

How I Created a Data Science Project Following CRISP-DM Lifecycle (Wed, 13 Nov 2024)
An end-to-end project using the CRISP-DM framework


Introduction

CRISP-DM stands for Cross-Industry Standard Process for Data Mining, a data mining framework open to anyone who wants to use it.

Its first version was created by SPSS, Daimler-Benz and NCR. Then, a group of companies developed and evolved it into CRISP-DM, which nowadays is one of the best-known and most widely adopted frameworks in Data Science.

The process consists of 6 phases, and it is flexible. It is more like a living organism where you can (and probably should) go back and forth between the phases, iterating and enhancing the results.

The phases are:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

The small arrows show a natural path from Business Understanding to Deployment—where the interactions occur directly—while the circle denotes a cyclic relationship between the phases. This means that the project does not end with Deployment but can be restarted due to new business questions triggered by the project or adjustments potentially needed.

CRISP-DM. Credits: Wikipedia

In this post, we will follow a project throughout its lifecycle using CRISP-DM steps. Our main objective is to show how using this framework is beneficial to the data scientist and to the company.

Let’s dive in.

Project

Let’s go over a project following the CRISP-DM framework.

In summary, our project is to create a classification model to estimate the probability that a customer will subscribe to a term deposit at our client’s institution, a bank.

Here is the GitHub Repository with the code, if you want to code along or follow it while reading the article.

GitHub – gurezende/CRISP-DM-Classification: End to End Classification project using the CRISP-DM…

Business Understanding

Understanding the business is crucial for any project, not just data science projects. We must know things like:

  • What is the business?
  • What is its product?
  • What are we selling/ offering?
  • What is expected for that project?
  • What is the definition of success?
  • Metrics

In this project, we are working with a Bank, therefore we are talking about the Finance Industry. Our client sells financial solutions for people to easily receive, save, and invest their money in a secure environment.

The client reached out to us to discuss their direct marketing campaigns, based on phone calls, aiming to sell a financial product (a term deposit). However, they feel they are wasting their managers' time and effort to get the expected results, so the client wants to increase and optimize conversions by focusing effort on customers with a higher probability of converting.

Certainly, business is a complex subject. Several factors can impact the result of the campaigns, but for the sake of simplicity, we will go straight to this solution:

  • Create a predictive model that would give the managers a probability that the customer will convert or not.

Having that in hand, managers would be equipped with a tool to prioritize calls with a higher probability of success over customers who would need more work along the way.

Ergo, the definition of success for this project is estimating the probability of conversion, and the metric for the model will be the F1-score. For the business, the metric could be the conversion rate, compared in a before-and-after study.

Next, we need to start touching the data.

Data Understanding

The data we will use is the dataset Bank Marketing, available in the UCI Data Science Repository. It is open source under the Creative Commons 4.0 license.

The modules installed and imported in this project can be found on the project’s GitHub page.

!pip install ucimlrepo --quiet

import pandas as pd
from ucimlrepo import fetch_ucirepo

# fetch dataset
bank_marketing = fetch_ucirepo(id=222)

# data (as pandas dataframes)
df = pd.concat([bank_marketing.data.features, bank_marketing.data.targets], 
               axis=1)
df = df.rename(columns={'day_of_week':'day'})

# View
df.sample(3)
The first look at the imported dataset. Image by the author.

Before starting to work on the data, we will go ahead and split it into train and test sets, so we keep it safe from [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).

# Split in train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('y', axis=1),
                                                    df['y'],
                                                    test_size=0.2,
                                                    stratify=df['y'],
                                                    random_state=42)

# train
df_train = pd.concat([X_train, y_train], axis=1)

# test
df_test = pd.concat([X_test, y_test], axis=1)

Great. Now we are ready to move on and understand the data. This is also known as Exploratory Data Analysis (EDA).

Exploratory Data Analysis

The first step in an EDA is to describe the data statistically. This already surfaces insights that help us start understanding the data, such as spotting variables with potential errors or outliers, getting a sense of the distributions and averages, and learning which categories are the most frequent for the categorical variables.

# Statistical description
df_train.describe(include='all').T
Statistical description of the data. Image by the author.

This simple one-line command gives us the following insights:

  • The customers' age is 40 years old on average. The distribution is skewed to the right.
  • More than 20% of the customers are blue-collar workers.
  • Most of the customers are married, have a secondary-level education, and hold a housing loan.
  • Only ~2% have a payment default.
  • Conversion Rate ~ 11.7%
  • The data is highly unbalanced towards the negative class.
Target variable. Negative class dominates it. Image by the author.

Once we know the distribution of the target variable, it is time to understand how the predictor variables interact with the target, trying to figure out which ones could be better for modeling the target variable’s behavior.

Age versus Conversions | Customers who converted to the campaigns are slightly younger than those who did not. However, both distributions are visually similar, even though the KS Test shows they are statistically different.

#Sample 1 - Age of the converted customers
converted = df_train.query('y == "yes"')['age']

#Sample 2 - Age of the not converted customers
not_converted = df_train.query('y == "no"')['age']

# Kolmogorov-Smirnov Test
# The null hypothesis is that the two distributions are identical
from scipy.stats import ks_2samp
statistic, p = ks_2samp(converted, not_converted)

if p > 0.05:
    print("The distributions are identical.")
else:
    print("The distributions are not identical: p-value ==", round(p,10))

----------
[OUT]:
The distributions are not identical: p-value == 0.0
# Age versus Conversion
plt.figure( figsize=(10,5))
ax = sns.boxenplot(data=df_train, x='age', y='y', hue='y', alpha=0.8)
plt.suptitle('Age versus Conversion')
plt.ylabel('Converted')
plt.title('Conversions are concentrated between 30 and 50 years old, which is not that different from the not converted', size=9)

# Annotation
# Medians and Averages
median_converted = df_train.query('y == "yes"')['age'].median()
median_not_converted = df_train.query('y == "no"')['age'].median()
avg_converted = df_train.query('y == "yes"')['age'].mean()
avg_not_converted = df_train.query('y == "no"')['age'].mean()
# Annotation - Insert text with Average and Median for each category
plt.text(95, 0, f"Avg: {round(avg_not_converted,1)} \nMedian: {median_not_converted}",
         ha="center", va="center", rotation=0,
         size=9, bbox=dict(boxstyle="roundtooth, pad=0.5", fc="lightblue",
         ec="r", lw=0))
plt.text(95, 1, f"Avg: {round(avg_converted,1)} \nMedian: {median_converted}",
         ha="center", va="center", rotation=0,
         size=9, bbox=dict(boxstyle="roundtooth, pad=0.5", fc="orange", 
         ec="r", lw=0));

The previous code yields this visualization.

Age versus Conversions. Image by the author.

Job vs. Conversions | Customers who hold management roles are converting more, followed by technicians, blue-collar workers, admins, and the retired.

# job versus Conversions == "YES"
converted = df_train.query('y == "yes"')
plt.figure( figsize=(10,5))
# order of the bars from highest to lowest
order = df_train.query('y == "yes"')['job'].value_counts().index
# Plot and title
ax = sns.countplot(data=converted,
                   x='job',
                   order=order,
                   palette= 5*["#4978d0"] + 6*["#7886a0"])
plt.suptitle('Job versus Converted Customers')
plt.title('Most of the customers who converted are in management jobs. \n75% of the conversions are concentrated in 5 job-categories', size=9);
# X label rotation
plt.xticks(rotation=80);
#add % on top of each bar
for pct in ax.patches:
    ax.annotate(f'{round(pct.get_height()/converted.shape[0]*100,1)}%',
                (pct.get_x() + pct.get_width() / 2, pct.get_height()),
                ha='center', va='bottom')
Job versus Conversions. Image by the author.

Well, it does not make much sense to keep repeating code here for the visualizations, so I will go ahead and present only the graphics and the analysis from now on. Again, it is all available in this GitHub repository.

Marital status vs. Conversions | Married customers convert more to the term deposit.

Marital status vs Conversions. Image by the author.

Education vs. Conversion | More educated people convert more to a financial product. However, the converted distribution follows the overall dataset distribution, so this variable will probably not differentiate conversions from non-conversions.

Education vs. Conversion. Image by the author.

Balance vs. Conversion | Customers with a higher balance on their account are converting more. We tested the statistical significance of the samples and there is a difference.

Balance vs. Conversion. Image by the author.

In the previous plot, we arbitrarily removed the data points above the 98th percentile so the visualization reads better. We can see that the converted customers have higher balances in general, but we can't tell whether there is a statistically significant difference between the two groups. Let's test that. Given that the distributions are heavily skewed to the right, we will use a non-parametric test, the Kolmogorov-Smirnov test.

#Sample 1 - Balance of the converted customers
converted = df_train.query('y == "yes"')['balance']

#Sample 2 - Balance of the not converted customers
not_converted = df_train.query('y == "no"')['balance']

# Kolmogorov-Smirnov Test
# The null hypothesis is that the two distributions are identical
from scipy.stats import ks_2samp
statistic, p = ks_2samp(converted, not_converted)

if p > 0.05:
    print("The distributions are identical.")
else:
    print("The distributions are not identical: p-value ==", round(p,4))

---------
[OUT]: 
The distributions are not identical: p-value == 0.0

Are there people with a negative balance converting to a term deposit? Common sense says that, in order to deposit something, you must have money available. Therefore, if a customer's balance is negative, they should not be able to convert to a deposit. However, we will see that it happens.

neg_converted = df_train.query('y == "yes" & balance < 0').y.count()
pct = round(neg_converted/df_train.query('y == "yes"').y.count()*100,1)
print(f'There are {neg_converted} conversions from people with negative acct balance. \nThis represents {pct}% of the total count of customers converted.')

---------
[OUT]:
There are 161 conversions from people with negative acct balance. 
This represents 3.8% of the total count of customers converted.

Duration vs. Conversions | In this plot, we can visually notice the impact of the duration of the phone calls on conversions. Customers who converted stayed on the call at least twice as long as the other customers.

Duration vs. Conversion. Image by the author.

Campaign contacts vs. Conversions | People who converted generally received between 2 and 4 contacts. After the 5th contact, the points for converted customers start to become sparse. For the not converted, the points remain consistent through roughly 13 contacts.

Campaign contacts vs. Conversion. Image by the author.

Previous Contacts vs. Converted | It appears that more previous contacts can influence the customer to convert. We notice in the graphic that the converted customers received a couple more calls than the not converted.

Previous contacts vs. Conversion. Image by the author.

Previous campaign outcome vs. Conversions | Customers who converted in the past are more inclined to convert again. Likewise, customers with past failures tend to repeat the failure.

Previous Outcome vs. Conversion. Image by the author.

Contact Method vs. Conversions | Despite there being more conversions from customers contacted via cell phone, this just reflects that there are fewer landlines. The proportions of conversion are similar for both types of contact.

Contact Method vs. Conversion. Image by the author.

Month vs. Conversions | There are more conversions in the mid-year months; however, ~76% of the calls were made in those months. Possibly the campaign ran more heavily during that period.

Month vs. Conversion. Image by the author.

Day vs. Conversions | The conversions happen more around the most likely payment days: 5, 15, and 30. We can notice higher peaks around these dates.

Day vs. Conversion. Image by the author.

Days since last contact vs. Conversions | Most of the conversions happened for customers contacted within the past 100 days from a previous campaign.

pDays vs. Conversion. Image by the author.

Most conversions (64%) are made in the first contact.

# The impact of the recency of the contact over conversions
total = df_train.query('y == "yes"').y.count()
print('First contact:', round( df_train.query('y == "yes" & pdays == -1').y.count()/total*100, 0 ), '%')
print('Up to 180 days:', round( df_train.query('y == "yes" & pdays > 0 & pdays <= 180').y.count()/total*100, 0 ), '%')
print('More than 180 days:', round( df_train.query('y == "yes" & pdays > 180').y.count()/total*100, 0 ), '%')

-------
[OUT]:
First contact: 64.0 %
Up to 180 days: 18.0 %
More than 180 days: 18.0 %

However, this is not different from the majority of the data. The non-converting customers with just the first contact are even higher in proportion (84%).

# The impact of the recency of the contact over Not converted
total = df_train.query('y == "no"').y.count()
print('First contact:', round( df_train.query('y == "no" & pdays == -1').y.count()/total*100, 0 ), '%')
print('Up to 180 days:', round( df_train.query('y == "no" & pdays > 0 & pdays <= 180').y.count()/total*100, 0 ), '%')
print('More than 180 days:', round( df_train.query('y == "no" & pdays > 180').y.count()/total*100, 0 ), '%')

-------
[OUT]:
First contact: 84.0 %
Up to 180 days: 6.0 %
More than 180 days: 10.0 %

Housing vs. Conversions | There are more conversions from people without a housing loan: 1.7 times more conversions.

House Loan vs. Conversion. Image by the author.

Personal Loan vs. Conversions | There are more conversions from people without personal loans. Although this follows the overall distribution, people without a loan convert at a proportionally higher rate.

Personal Loan vs. Conversion. Image by the author.

Default vs. Conversions | Conversions come almost entirely from people without payment defaults, which makes sense, as those in default probably have no money to spare.

Default vs. Conversion. Image by the author.

People without a default are converting at twice the rate (12%) of those with a default (6%).

Next, we are ready to write the summary of findings.

Exploration Summary

After thorough exploration of the data, we can summarize it as follows:

  • The converter profile is a 38 to 41 year old person, working in a management role, married, with at least a secondary-level education, holding a positive account balance and no housing or personal loan, thus with less debt.
  • Most conversions happened on the first contact (64%).
  • Customers not converted on the first contact received between 2 and 4 contacts before converting.
  • The more contacts in the current campaign, the lower the probability that a customer converted.
  • Customers never contacted before need more contacts on average than existing customers.
  • There are better chances of conversion for people contacted up to 10 times in previous campaigns.
  • Contacts from previous campaigns can impact the conversion of the current campaign, possibly indicating that the relationship over time matters.
  • Customers who converted in previous campaigns are more likely to convert again, while those who failed to convert also show a tendency not to convert again.
  • The longer the contacts, the higher the chance of converting. People who converted stayed on the call up to four times longer. However, we can't use the duration of the call as a predictor.

Looking at the graphics after exploration, the variables duration, job, marital, balance, previous, campaign, default, housing and loan are interesting for modeling, as they impact more directly on the target variable. However, duration cannot be used, as it is not possible to know the duration of a phone call until it ends. The variable poutcome also looks promising, but it has too many NAs, so it needs further treatment to be considered.

Data Preparation

Understanding the data is very important for better modeling. After the initial insights, we have an idea of what could drive more separation between the classes.

The next step is to prepare this dataset for modeling, transforming variables into categories or numbers, since many data science algorithms require only numbers as input.

Let’s get to work.

Dealing with Missing Data

Missing data can ruin our model, so we must treat it by removing those observations or imputing values for them.

Here is what we have of missing data points.

# Checking for missing data
df_train.isna().sum()[df_train.isna().sum() > 0]

-------
[OUT]:

job 234
education 1482
contact 10386
poutcome 29589

Starting with job, out of those 234 NAs, we see that there are 28 converted customers that would be lost (0.6%) if we drop those NAs.

# NAs in job
 (df_train #data
 .query('job != job') # only NAs
 .groupby('y') #group by target var
 ['y']
 .count() #count values
 )

-------
[OUT]:

y 
no 206
yes 28

There would be three options in this case:

  1. Drop the NAs: only 0.6% may not make a difference
  2. Use Random Forest to predict what is the job.
  3. Add the most frequent job, which is blue collar.

We will move on with dropping them this time, as we consider the number too small to be worth predicting the job.

# Check the impact of NAs for the job variable in the conversions
df_train.query('job != job').groupby('y')['y'].value_counts()

# Drop NAs.
df_train_clean = df_train.dropna(subset='job')

Next, let's look at the missing values in education. There are 1482 missing entries, and 196 of those are Yes, which represents 4.6% of the converted customers. That is a considerable number of converted observations to drop.

In this case, we are going to use the CategoricalImputer from feature_engine to impute the most frequent category for the education of these NAs.

# Check the impact of NAs for the education variable in the conversions
df_train.query('education != education').groupby('y')['y'].value_counts()

# Simple Imputer
from feature_engine.imputation import CategoricalImputer

imputer = CategoricalImputer(
    variables=['education'],
    imputation_method="frequent"
)

# Fit and Transform
imputer.fit(df_train_clean)
df_train_clean = imputer.transform(df_train_clean)

For poutcome, we must come up with a new category. This variable shows the result of a previous marketing campaign. According to our insight from the exploration phase, customers who converted in the past are more likely to convert again, so this variable becomes interesting for the model. However, there are a lot of missing values, and they will need to go to a separate category, so we won't bias our model by imputing the vast majority of the data. We will impute "unknown" for the NAs, as described in the data documentation.

# Impute "unknown" for NAs
df_train_clean['poutcome'] = df_train_clean['poutcome'].fillna('unknown')

For contact, we will also fill the NAs with "unknown", as the data documentation suggests.

# Fill NAs with "unknown"
df_train_clean['contact'] = df_train_clean['contact'].fillna('unknown')
Data clean of missing values. Image by the author.

Next, we need other transformations in this dataset.

Categorical Transformation

Many models don’t deal well with categorical data. Therefore, we need to transform the data into numbers using an encoding type. Here is the strategy to be used for this project:

  • education, contact, balance, marital, job, and poutcome: For these variables, One Hot Encoding can be ideal.
  • default, housing, loan, and y are binary variables that will be mapped to no: 0 and yes: 1.
# Binarizing default, housing, loan, and y
df_train_clean = df_train_clean.replace({'no': 0, 'yes': 1})

Balance needs to be binned prior to One Hot Encoding.

# Balance in 3 categories: <0 = 'negative, 0-median = 'avg', >median = 'over avg'
df_train_clean = (
    df_train_clean
    .assign(balance = lambda x: np.where(x.balance < 0,
                                          'negative',
                                          np.where(x.balance < x.balance.median(),
                                                   'avg',
                                                   'over avg')
                                          )
    )
)

# One Hot Encoding for 'marital', 'poutcome', 'education', 'contact', 'job', 'balance'
from feature_engine.encoding import OneHotEncoder

# Instance
ohe = OneHotEncoder(variables=['marital', 'poutcome', 'education', 'contact', 'job', 'balance'], drop_last=True)

# Fit
ohe.fit(df_train_clean)

# Transform
df_train_clean = ohe.transform(df_train_clean)

# Move y to the first column
df_train_clean.insert(0, 'y', df_train_clean.pop('y'))

Next, month to numerical variable.

# Month to numbers
df_train_clean['month'] = df_train_clean['month'].map({ 'jan':1, 'feb':2, 'mar':3, 'apr':4, 'may':5, 'jun':6, 'jul':7, 'aug':8, 'sep':9, 'oct':10, 'nov':11, 'dec':12})

And other numerical variables will be categorized (bins) to reduce the number of single values, which can help classification models to find patterns.

# Function to replace the variable data with the new categorized bins
def variable_to_category(data, variable, k):
  return pd.cut(data[variable], bins=k).astype(str)

# Transforming variable Age into bins
# Using Sturges rule, where number of bins k = 1 + 3.3*log10(n)
k = int( 1 + 3.3*np.log10(len(df_train_clean)) )

# Categorize age, pdays, previous
for var in ['age', 'pdays', 'previous']:
  df_train_clean[var] = variable_to_category(df_train_clean, var, k=k)

# CatBoost Encoding the dataset
import category_encoders as ce
df_train_clean = ce.CatBoostEncoder().fit_transform(df_train_clean, df_train_clean['y'])

# View of the final dataset for modeling
df_train_clean.sample(5)

Next, you can see a partial view of the final data to be used for modeling.

Data cleaned and transformed for modeling. Image by the author.

Modeling comes in sequence.

Modeling

Once the data is prepared and transformed, we can start modeling. We are going to begin by testing many algorithms to see which one performs best. Knowing that the data has a huge imbalance, with 88% of the observations classified as no, we will use class weights.

For this initial test, let’s get a sample of 10k observations randomly selected, so it runs faster.

# X and y sample for testing models
df_sample = df_train_clean.sample(10_000)
X = df_sample.drop(['y'], axis=1)
y = df_sample['y']

The code for the test is fairly extensive, but it can be seen in the GitHub repo.
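The function isn't reproduced here, but a stripped-down, hypothetical sketch of what such a helper could look like is shown below (the model list is shorter than the article's ten, and weighted F1 is my assumption for the averaging; the real implementation is in the repo).

import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def test_classifiers_sketch(X, y):
    # Hypothetical stand-in for the repo's test_classifiers: fits each model on a
    # holdout split and also cross-validates it, reporting weighted F1 for both.
    models = {
        'Random Forest': RandomForestClassifier(class_weight='balanced', random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(random_state=42),
        'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced'),
    }
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=42)
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        holdout_f1 = f1_score(y_te, model.predict(X_te), average='weighted')
        cv_f1 = cross_val_score(model, X, y, cv=5, scoring='f1_weighted').mean()
        rows.append({'Classifier': name,
                     'F1 Score': holdout_f1,
                     'Cross-Validated F1 Score': cv_f1})
    return pd.DataFrame(rows).sort_values('Cross-Validated F1 Score',
                                          ascending=False)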

# Example of using the function with your dataset
results = test_classifiers(X, y)
print(results)

-------
[OUT]:
               Classifier  F1 Score  Cross-Validated F1 Score
0                Catboost  0.863289                  0.863447
1             Extra Trees  0.870542                  0.862850
2       Gradient Boosting  0.868414                  0.861208
3                 XGBoost  0.858113                  0.858268
4           Random Forest  0.857215                  0.855420
5                AdaBoost  0.858410                  0.851967
6     K-Nearest Neighbors  0.852051                  0.849515
7           Decision Tree  0.831266                  0.833809
8  Support Vector Machine  0.753743                  0.768772
9     Logistic Regression  0.747108                  0.762013

The best-performing models for this problem were the Boosting ones. CatBoost was the top estimator, so we will work with it from now on.

Let’s move on with a new split and test, now for the whole cleaned training set.

# Split X and y
X = df_train_clean.drop(['y', 'duration'], axis=1)
y = df_train_clean['y']

# Split Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Let us begin with a base model with all the columns to try to tune it from that starting point.

from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix, classification_report

model = CatBoostClassifier(verbose=False)
# train the model
model.fit(X_train, y_train)

prediction = model.predict(X_val)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_val, prediction))
print("Confusion Matrix : \n")
display(cm)

# Evaluate the base model
print('Base Model:')
print(classification_report(y_val, prediction))
Result of the base model. Image by the author.

As expected, the model does really well on the negative class, since there is a huge imbalance towards it. Even if the model just classified everything as "no", it would still be correct 88% of the time. That is why accuracy is not the best metric for classification here. The precision of the positive class is not bad, but the recall is terrible. Let's tune this model.

For that, I ran a GridSearchCV and tested a few values of learning_rate, depth, class_weights, border_count, and l2_leaf_reg. The hyperparameters:

  • border_count: Controls the number of binning thresholds for numeric features. Lower values (e.g., 32 or 64) can reduce overfitting, which may help the model generalize better on imbalanced data.
  • l2_leaf_reg: Adds L2 regularization to the model. Higher values (e.g., 5 or 7) can penalize the model, reducing its complexity and potentially preventing it from being overly biased toward the majority class.
  • depth: Controls how deep the decision tree should go for classification.
  • learning_rate: how large the learning step is at each iteration when the algorithm adjusts its weights.
  • class_weights: good for imbalanced data; we can give a higher weight to the minority class.

The Grid Search returned this to me:

Best Parameters: {‘border_count’: 64, ‘class_weights’: [1, 3], ‘depth’: 4, ‘l2_leaf_reg’: 5, ‘learning_rate’: 0.1}
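The exact grid isn't shown in the article, but a minimal sketch of this kind of search could look like the snippet below (the parameter values are illustrative assumptions, not the grid actually used).

from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid around the hyperparameters discussed above
param_grid = {
    'learning_rate': [0.05, 0.1],
    'depth': [4, 6],
    'class_weights': [[1, 3], [1, 5]],
    'border_count': [32, 64],
    'l2_leaf_reg': [3, 5, 7],
}

grid = GridSearchCV(
    estimator=CatBoostClassifier(verbose=False),
    param_grid=param_grid,
    scoring='f1',   # same metric used to evaluate the models
    cv=3,
    n_jobs=-1
)
grid.fit(X_train, y_train)
print('Best Parameters:', grid.best_params_)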

Here, I am considering that a False Positive (predicting 1 when the truth is 0) is worse than a False Negative (a true 1 classified as 0). That is because, thinking as a manager, if I see a customer with a high probability of converting, I wouldn't like to spend energy on that call if it is a false positive. On the other hand, if I call a person with a lower probability but that person converts, I have made my sale.

So, a few other tweaks were manually made by me with that in mind, and I came up with this code snippet.

# Tuning the estimator
model2 = CatBoostClassifier(iterations=300,
                            depth=5,
                            learning_rate=0.1,
                            loss_function='Logloss',
                            eval_metric='F1',
                            class_weights={0: 1, 1: 3},
                            border_count= 64,
                            l2_leaf_reg= 13,
                            early_stopping_rounds=50,
                            verbose=1000)

# train the model
model2.fit(X_train, y_train)

prediction2 = model2.predict(X_val)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_val, prediction2))
print("Confusion Matrix : \n")
display(cm)

# Evaluate the tuned model
print('Tuned Catboost:')
print(classification_report(y_val, prediction2))
print('F1:', f1_score(y_val, prediction2))
print('Accuracy:', accuracy_score(y_val, prediction2))
Result of the tuned model. Image by the author.

Now, we can still run a Recursive Feature Elimination to select fewer variables and try to make this model simpler.
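The selection code itself lives in the repo; as an assumption on my part, a minimal sketch of how recursive elimination could be done with scikit-learn's RFECV wrapping CatBoost (which exposes feature_importances_) might look like this:

from sklearn.feature_selection import RFECV

# Recursively drop the least important feature and keep the subset
# with the best cross-validated F1 score
selector = RFECV(
    estimator=CatBoostClassifier(verbose=False, class_weights={0: 1, 1: 3}),
    step=1,
    cv=3,
    scoring='f1'
)
selector.fit(X_train, y_train)
selected_columns = list(X_train.columns[selector.support_])
print(selected_columns)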

df_train_selected = df_train_clean[['age',  'job_admin.', 'job_services', 'job_management', 'job_blue-collar', 'job_unemployed', 'job_student', 'job_technician',
                                    'contact_cellular', 'contact_telephone', 'job_retired', 'poutcome_failure', 'poutcome_other', 'marital_single', 'marital_divorced',
                                    'previous', 'pdays', 'campaign', 'month', 'day', 'loan', 'housing', 'default', 'poutcome_unknown', 'y']]

The results are as follows.

Result of the selected variables model. Image by the author.

Despite being a good separator for the classes, the variable duration cannot be used, as it is not possible to know the duration of a phone call until it ends. But if we could, these are the results.

Result with the variable duration. Image by the author.

Look how considerably the F1 score improves!

I have also tried some ensemble models, such as a VotingClassifier and a StackingClassifier. The results are presented next.

Voting and Stacking Classifiers. Image by the author.
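For reference, a minimal sketch of how such ensembles could be assembled with scikit-learn follows; the base estimators are illustrative choices, not necessarily the combination used in the article.

from sklearn.ensemble import VotingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

base_estimators = [
    ('catboost', CatBoostClassifier(verbose=False, class_weights={0: 1, 1: 3})),
    ('random_forest', RandomForestClassifier(class_weight='balanced', random_state=42)),
]

# Soft voting averages the predicted probabilities of the base estimators
voting = VotingClassifier(estimators=base_estimators, voting='soft')
voting.fit(X_train, y_train)
print('Voting F1:', f1_score(y_val, voting.predict(X_val)))

# Stacking trains a meta-model on the base estimators' out-of-fold predictions
stacking = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000)
)
stacking.fit(X_train, y_train)
print('Stacking F1:', f1_score(y_val, stacking.predict(X_val)))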

Having trained enough models, it is time to evaluate the results and potentially iterate to adjust the best model.

Evaluation

I like to create a table to display the results of the models. It makes it easier to compare them all together.

pd.DataFrame({
    'Model':['Catboost Base', 'Catboost Tuned', 'Catboost Selected Variables', 'Voting Classifier', 'Voting Classifier + SMOTE', 'Catboost + duration', 'Stacking Classifier'],
    'F1 Score': [f1_score(y_val, prediction), f1_score(y_val, prediction2), f1_score(ys_val, prediction3), f1_score(y_val, y_pred), f1_score(y_val, y_pred2), f1_score(y_vald, prediction4), f1_score(y_val, y_pred3)],
    'Accuracy': [accuracy_score(y_val, prediction), accuracy_score(y_val, prediction2), accuracy_score(ys_val, prediction3), accuracy_score(y_val, y_pred), accuracy_score(y_val, y_pred2), accuracy_score(y_vald, prediction4), accuracy_score(y_val, y_pred3)]
}).sort_values('F1 Score', ascending=False)
Comparison of the models. Image by the author.

The CatBoost model with the variable duration was by far the best one. However, we cannot use that extra variable, since it will not be available to the managers until the call ends, so it makes no sense to use it for prediction.

So, the next best models were the CatBoost Tuned and the model with the selected variables. Let's take the tuned model and analyze the errors it makes. One way I like to do that is by creating histograms or density plots, so we can see where the errors concentrate for each variable.

Distribution of the errors by variable. Image by the author.
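A minimal sketch of how these error plots could be produced (assuming the tuned model2 and the validation split from the previous steps; the variable list is just an example):

import matplotlib.pyplot as plt
import seaborn as sns

errors = X_val.copy()
# Flag observations that the tuned model classifies incorrectly
errors['misclassified'] = (model2.predict(X_val) != y_val.values)

# Density of each variable, split by correctly vs. wrongly classified observations
for var in ['age', 'campaign', 'pdays', 'previous']:
    plt.figure(figsize=(8, 3))
    sns.kdeplot(data=errors, x=var, hue='misclassified', common_norm=False, fill=True)
    plt.title(f'Where the errors concentrate for {var}')
    plt.show()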

Concluding this study, it is clear that the variables presented cannot provide a solid separation of classes.

The imbalance is heavy, but the techniques to correct it – such as class weights and SMOTE – were not sufficient to improve class separation. This causes a problem for the model to find a pattern to properly classify the minority class 1 (converting customers) and perform better.

Given that there are too many observations where the customers did not convert, the variability of the combinations labeled 0 is too large, overlaying and hiding class 1 within it. Thus, the observations falling into this common region have similar probabilities for both classes, and that is where the model fails. These observations are wrongly classified due to the imbalance, since the negative class has more strength and creates more bias.

Predictions

To predict the results, the input data must be the same as the input provided during the training. So, I have created a function to take care of that. Once again, it’s available on GitHub.

# Preparing data for predictions
X_test, y_test = prepare_data(df_test)

# Predict
test_prediction = model3.predict(X_test)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_test, test_prediction))
print("Confusion Matrix : \n")
display(cm)

# Evaluate the model
print('----------- Test Set Results -----------')

print(classification_report(y_test, test_prediction))
print('-------------------------------')
print('F1:', f1_score(y_test, test_prediction))
print('-------------------------------')
print('Accuracy:', accuracy_score(y_test, test_prediction))

The results were as expected, i.e., aligned with what we had been seeing in training. The number of False Positives is slightly smaller than the number of False Negatives, which is better for our case. This prevents managers from erroneously going after customers who will not convert.

Result for the test set. Image by the author.

Finally, I also created a function to predict a single observation at a time, already thinking about the deployment application. The code that follows predicts one observation.

obs = {'age': 37,
       'job': 'management',
       'marital': 'single',
       'education': 'tertiary',
       'default': 'no', #
       'balance': 100,
       'housing': 'yes', #
       'loan': 'no', #
       'contact': 'cellular', #
       'day': 2, #
       'month': 'aug', #
       'duration': np.nan,
       'campaign': 2, #
       'pdays': 272, #
       'previous': 10,
       'poutcome': 'success',
       'y':99}

# Prediction
predict_single_entry(obs)

----------
[OUT]:
array([[0.59514531, 0.40485469]])

As a result, there is a 59% probability that this customer will not convert. This exercise was interesting because, as I manually changed one variable at a time, it was possible to see which ones had a larger influence on the model. It turns out that the variables default, housing, loan, day, contact_cellular, contact_telephone, month, campaign, and pdays changed the probabilities more drastically when modified.

So, I decided to create an even simpler model with those variables. And here is the true value of the CRISP-DM framework. I was almost done with the modeling when I noticed something new and went back to the beginning for another iteration.
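For illustration, a minimal sketch of that simpler model, assuming the encoded df_train_clean from the steps above (the exact hyperparameters of the final model are in the repo):

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Keep only the variables that moved the predicted probabilities the most
simple_cols = ['default', 'housing', 'loan', 'day', 'contact_cellular',
               'contact_telephone', 'month', 'campaign', 'pdays']

X_simple = df_train_clean[simple_cols]
y_simple = df_train_clean['y']

Xs_train, Xs_val, ys_train, ys_val = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42)

simple_model = CatBoostClassifier(class_weights={0: 1, 1: 3}, verbose=False)
simple_model.fit(Xs_train, ys_train)
print(classification_report(ys_val, simple_model.predict(Xs_val)))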

This is the result.

Final model results for the test set. Image by the author.

This model is not only simpler, but it also presents better performance. The gain is very small, but when the results are similar, the simpler model is better, because it requires less data, computing power, and training time. It is a cheaper model overall.

Well, this is a wrap. Let’s go to the final considerations now.

Deployment

CRISP-DM has a Deployment step, but we won’t cover that in this post. It is way too long already.

The deployment will be presented in a future post, with a Streamlit application. Stay tuned to my blog.

Before You Go

In this post, the intention was to go over a whole data science project following the CRISP-DM lifecycle framework.

CRISP-DM is one of the most used lifecycle frameworks for data science, as it is intuitive and complete. The framework preaches that we should not only follow a sequence of steps. In fact, we can go back and forth whenever needed, as new concerns or discoveries are learned.

I loved creating this project and writing this article. I learned a lot, truly. There were many times during modeling when I learned something that could change the results. So I went back to the exploration and understanding phases to incorporate the new knowledge into the model, until I got to the final result, which is the best model I could create with the information and variables in this dataset.

This is a framework that I recommend. It can make you a better Data Scientist and your projects more complete.

Learn More

I have created a mini-course out of this content. So, if you liked this article, here is the link with a coupon code for you to redeem and enroll in the course. Everything you read here is taught in this quick course! Enjoy!

Follow me for more, and bookmark this post for future reference.

Gustavo Santos – Medium

Find me on Linkedin.

Code Repository

GitHub – gurezende/CRISP-DM-Classification: End to End Classification project using the CRISP-DM…

References

Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

UCI Machine Learning Repository

CatBoostClassifier |

Cross-industry standard process for data mining – Wikipedia

CRISP-DM Help Overview

Leakage (machine learning) – Wikipedia

The post How I Created a Data Science Project Following CRISP-DM Lifecycle appeared first on Towards Data Science.

Opening Pandora’s Box: Conquer the 7 “Bringers of Evil” in Data Cloud Migration and Greenfield… https://towardsdatascience.com/opening-pandoras-box-conquer-the-7-bringers-of-evil-in-data-cloud-migration-and-greenfield-d7a18912f2a1/ Tue, 08 Oct 2024 01:28:28 +0000 https://towardsdatascience.com/opening-pandoras-box-conquer-the-7-bringers-of-evil-in-data-cloud-migration-and-greenfield-d7a18912f2a1/ A guide to conquering cloud migration challenges

The post Opening Pandora’s Box: Conquer the 7 “Bringers of Evil” in Data Cloud Migration and Greenfield… appeared first on Towards Data Science.

Opening Pandora’s Box: Conquer the 7 "Bringers of Evil" in Data Cloud Migration and Greenfield Projects
"Despite warnings, Pandora was curious, and she opened the jar, releasing the evils of the world - leaving only hope trapped inside." [Photo by Bailey Heedick on Unsplash]
"Despite warnings, Pandora was curious, and she opened the jar, releasing the evils of the world – leaving only hope trapped inside." [Photo by Bailey Heedick on Unsplash]

Pandora, the first mortal woman, was created by the gods as part of Zeus‘s plan to punish humanity for Prometheus‘s theft of fire [1].

She was gifted with beauty and intelligence, and Zeus sent her to Epimetheus, Prometheus’s brother. For the wedding gift, Zeus gave Pandora a jar (often interpreted as a "box") and warned her never to open it [1].

Despite warnings, Pandora was curious. She opened the jar, releasing the world's bringers of evil, leaving only hope trapped inside [2].

Since then, "to open a Pandora’s Box" has been synonymous with doing or starting something that will cause many unforeseen problems [3].

Comparing this to my professional life, the only occasion I felt like I had opened "Pandora’s Box" was when I began working on a data cloud migration/greenfield project several years ago.

And the funny thing is that this thought hasn’t changed years later, even after working on two additional and almost identical projects.

Not only did I experience new "bringers of evil" with every new data cloud migration project, but I also felt I had managed to release "hope" from the box.

Now, with (a bit) more wisdom, knowledge, and experience in a few data migration/greenfield projects, I will share the "7 bringers of evil" and how to overcome them.

Let’s dive in.

A Guide to Conquering Cloud Migration Challenges

I thought a lot about how to compare the mythical bringers of evil—envy, remorse, avarice, poverty, scorn, ignorance, and inconstancy [3]—to the real-world challenges I’ve seen or gone through in data cloud migration/greenfield projects.

And, as a result, I first created a one-sentence explanation of what a specific bringer of evil can cause in a project.

Then, I provided the [hypothetical] project scenario for more context.

Lastly, I provided my input on best- and worst-case solutions, i.e., how to "conquer" a specific [hypothetical] scenario in case it happens in your project.

The flow of explaining and presenting solutions to project challenges [Diagram by Author]

1. Envy

Comparing your migration project unfavourably to other projects, leading to poor project planning and unrealistic expectations.

Scenario:

You did your research, you talked to people, you called in consultants, and they all confirmed it: "The company XY, which is similar to your company, managed to migrate their whole data platform to the cloud in only 10 months."

By all means, this implies only one thing: you should be able to migrate in the same period, if not faster.

So, you start by creating a project plan, keeping in mind external inputs on the migration deadline. This leads to budget approval, which leads to the execution phase.

Then, suddenly, reality kicks in during the project execution phase.

  • You start realizing that your on-prem infrastructure is more complex and that you have security restrictions that don't allow you to connect to the source database(s). You then understand you can't even use a 3rd-party data integration tool to move your data to the cloud. You need to develop a new solution to overcome this problem.
  • Then you realize that your legacy development depends on data sources you are not allowed to move to the cloud without approval, and you don't have this approval.
  • Then you figure out that 15 to 20% of the data stored locally is no longer used by the business, but you didn't have time to do a proper analysis to design a clean cloud file storage or database landing layer, and you need to do it now.
  • …..

And the problems just started piling up because the planning was rushed and biased by external inputs.

On top of that, project "what-if" scenarios were not developed beforehand to have a "best-case" vs. "worst-case" implementation plan to justify what is happening.

What follows is that you need to communicate this information to project sponsors and inform them of the increased project scope and budget.

Not a fun thing to do.

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Develop a customized migration plan: Create a migration plan that fits your organisation’s needs and infrastructure, rather than competing with external benchmarks.
  • Think about all pillars during planning: During the planning, consider technical, regulatory, and cost pillars, and plan contingency money (plus "what-if" scenarios) for project implementation.
  • Plan for Contingencies: From developed "what-if" scenarios, think and prepare how you will be able to resolve the "worst-case" ones if they happen.

["Worst-case"] OR [What you can do for damage control]

  • Take responsibility and be transparent: Ask for additional resources (both human and monetary ones), but this time be transparent in presenting the maximum project over-budget scenario. Outline the valid arguments/blockers you have faced, and take responsibility for what happened.
  • Highlight the long-term benefits: Emphasize to project sponsors that resolving these issues during the migration phase will lead to long-term cost savings on the data platform.
  • Focus on quick wins: If feasible, ensure that you speed up your developments in components where you don’t have blockers, so you can show the progress and maintain a "positive image" in front of the stakeholders.

2. Remorse

Regretting the selection of the new technologies and development principles, leading to delays in migration and impacting the solution design of the new data components.

Scenario:

You started your cloud migration, designed the architecture, selected the cloud services, and finally – the development began.

What could go wrong, ha?

  • Then you realize that some of the newly selected cloud services may be missing features that you were used to, or that are critical, in your legacy ones, and again you need to find a solution, or even an additional service, to cover the core purpose of these features.
  • Then you understand that your legacy development standards should be adapted, because you changed your data management system and didn't develop new development standards. Pressured by deadlines, you "freelance" in the development, creating a new, even bigger mess.
  • With all the new pitfalls, you dwell on your destiny and regret not starting the migration project 5 years ago, when your dataset was smaller and the business requirements manageable.

Finally, in Cher‘s words, you sing (quietly) the title of the song "If I Could Turn Back Time." On a loop.

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Conduct feature analysis beforehand: Before selecting cloud services, do a feature comparison with the legacy services. Involve technical teams early in the evaluation and run proofs-of-concept (PoC) before the project starts.
  • Create new development standards early: Think in advance about how cloud development principles differ from on-prem ones. Develop new standards by creating a mapping table against the legacy ones.

  • Again: contingency planning.

["Worst-case"] OR [What you can do for damage control]

  • Reevaluate technology choices quickly: When you realize some cloud services don’t fit your needs, temporarily pause the affected data migration component and re-plan the development while focusing on resolving this issue.
  • Temporarily assign extra or external resources to resolve blockers: To resolve missed feature gaps or lack of standards, assign additional resources (preferably consultants) to clean up these specific problems.
  • Again: take responsibility, communicate issues/needs transparently, and highlight the long-term benefits.

3. Avarice

Overemphasizing cost savings and future time-to-market value at the expense of critical components like data quality, security, or performance, leading to higher costs and a poor data platform in the end.

Scenario:

You landed yourself a new job.

You were hired to develop the new data platform where architecture is already defined, services selected, and the first data integration pipelines are functional.

Your main task is to create value from the data as fast as possible, design the semantic layer from scratch, and engineer new data products for key business colleagues.

—No big deal.

  • You start by focusing on transformations and data modelling, and voila – the first data products are here, and you are the company star. Then the business comes to you and starts questioning the accuracy of the delivered insights. You return to your data, compare it to the source systems, and then realize they don’t match. Something went wrong in the data integration part, but you didn’t see this before because you didn’t implement quality checks.
  • In addition, you were so eager to deliver the data products faster that you didn’t even think about infrastructure nor tried to put in place multiple environments. Instead, you did your development in production. Yes, production.
  • As a cherry on top, you started designing machine learning data products as fast as possible, realizing only later that their development and run costs are high with a low return on investment (ROI).
  • ….

And here you are now.

You promised that your cloud costs would be lower than the on-prem ones, with the motto: "The storage is cheap in the cloud, and I don’t need SRE colleagues or a variety of data roles to develop a functional data platform."

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Data quality as a mission: Plan time and staff human resources to set up quality checks at every stage of the data pipeline design—from the data integration to the final products that contain insights.
  • Design a stable and secure infrastructure: Plan time and staff human resources to set up multiple secure environments (dev, test, prod), even under tight deadlines. This allows you to develop quality and reliable data products.
  • Assess the costs of the data products: Before committing to deliver new data products, perform a cost-benefit analysis to assess their (potential) ROI. Avoid starting expensive developments without an understanding of the financial impact.
  • Again: contingency.

["Worst-case"] OR [What you can do for damage control]

  • Shut down the high-cost, low-value data products quickly, and manage expectations: Implement the cost monitoring framework and shut down the data products with low ROI. Show the numbers to the business transparently, and propose scaling back when business needs are more mature.
  • Again: take responsibility, communicate issues/needs transparently, and highlight the long-term benefits.

4. Poverty

Cutting down the budget and resources allocated to the migration project and not getting granted "feel good" benefits, leading to high stress and low morale.

Scenario:

You received your project budget, and implementation is going as planned.

However, the project sponsors expect you to keep the development costs under the assigned budget, while speeding up the delivery.

You got the assignment to build an additional, remote team in a more budget-friendly location.

  • Although this was not originally planned, you acknowledge their input and start the hiring process. However, the process of staffing takes longer than expected. You initially invest 3 months in staffing the new resources, then another 3 months in repeated staffing to replace candidates who declined offers. But finally, the staffing is done, and the new team members are slowly joining the project.
  • You start making organizational and backlog changes to consolidate the project development that is now split between two locations.
  • In the meantime, your small local cloud team is investing daily effort to share the knowledge and onboard the new (remote) colleagues with the current project development.
  • The collaboration is going well, except for minor delays in communication/providing feedback due to different time zones. Because of this, you would like to reward your local team members, who cover 2 to 3 roles simultaneously and work long hours to ensure the project gets successfully delivered.
  • However, this is not feasible. You can’t get this extra money for travel to a remote location, team events, etc.
  • The dissatisfaction only piles up when you realize that you and the team are not allowed to visit your remote team, with whom you work daily, but the business colleagues – are.

Solution(s):

["Best-case"] OR [What you can do to prevent this]:

  • Ensure a higher budget from the start: During the project planning phase, ask for a slice of the budget that covers project team-building events and travel. Argue that "feel-good" benefits are not just perks but are important for maintaining team morale.
  • Plan for unforeseen changes: Even if you don’t expect to get an additional team in your project, you can adapt remote-friendly processes. Create project documentation (space), and adopt Scrum/Kanban to track your development. In addition, create communication channels/groups and organize virtual coffee sessions/open discussion weekly meetings. These will help with every new onboarding, regardless of the new team member’s location.

["Worst-case"] OR [What you can do for damage control]:

  • Request project onboarding time: When building a new remote team, communicate to management the additional time and resources required for onboarding. Get the "buffer time" in the project to ensure proper knowledge transfer and collaboration from the start of the cooperation. Ensure that the local team doesn’t constantly work long hours. The team’s well-being should be a priority, even if this means prolonging project delivery.
  • Focus on non-monetary recognition after budget cuts: Recognize the team’s hard work in a non-financial way. This includes flexible working hours/locations, skipping admin meetings, giving public acknowledgement, or even small tokens of appreciation like chocolate.
  • Negotiate for minimal travel: If the travel between project locations isn’t feasible, negotiate for in-real-life meetings only to celebrate the project’s main milestones.

5. Scorn

Facing internal resistance from colleagues who feel their legacy systems are being dismissed, leading to passive and active obstruction of the migration project.

Scenario:

You know it, business knows it, everyone knows it – the cloud data platform is the fastest way to meet rising business requirements and, if developed properly, stop the rising on-prem maintenance costs.

  • Yet, despite the obvious need for the migration, you begin hearing the same concerns from various colleagues: "Our legacy system is completely functional; why do you want to replace it?" – OR – "This cloud migration is just a trend, and you will fail in it." – OR – "The cloud is not secure enough, and we need to keep data in our on-prem platform." You only hear reasons why this can’t and won’t work.
  • On top of this, the resistance comes in actions too. The colleagues are not sharing inputs in the project planning phase. They show a lack of interest in providing support in the development of proof of concepts. Consequently, project approvals get delayed.
  • Then you manage to resolve the blockers in the project planning/preparation phase, staff the cloud-skilled colleagues, and the project takes off. However, the resistance is higher than ever. The same colleagues who blocked the project now feel excluded and share constant and negative feedback on the new development and colleagues. All this blocks the normal development of the project, as your focus needs to be on conflict resolution instead of delivery.

You seek advice from your business coach, and he tells you the following sentence: "Welcome to big-corp business; now you need to learn how to deal with it." (Side input: it was only a starting sentence said as a joke.)

Solution(s):

["Best-case"] OR [What you can do to prevent this]:

  • Include middle/high management to share the vision: Before the cloud migration starts, seek help from management and project sponsors to share the positive vision of the change and the strategic direction of why platform modernization is important for the business.
  • Organize workshops: Seek consultancy support in organizing workshops for everyone to get insights into the new technology and present customer success stories. This, too, should result in creating awareness of the cloud platform’s advantages and benefits.
  • Create a transition plan: To show the benefits of the cloud platform, create a plan for comparing data products on both platforms. In other words, during the PoC phase, compare the metrics of the same data product developed on-prem vs. cloud. Focus on improvements in performance, costs, and development steps, showing that the legacy work will not be disregarded but evolved.

["Worst-case"] OR [What you can do for damage control]:

  • Address resistance head-on: If you encounter individuals who actively resist and block migration, try to address this behaviour in direct conversation. Address the possible issues of their concerns by providing positive aspects of the new development. Then observe if their behaviour will change after.
  • Escalate if necessary: If resistance persists and affects the project delivery, escalate the issue to middle/higher management. Show them the impact of this behaviour, and if necessary, ask them to help out in distancing some individuals from the project. You will need support from leadership to help push through roadblocks caused by internal resistance.

6. Ignorance

Lack of expertise in cloud technologies and migration best practices, increasing the risk of failure and delays.

Scenario:

You were assigned a new data-stack modernization project from management, and the expectation is to deliver the first PoCs in the next quarter.

You have never worked on anything similar and don’t know where to start.

Your colleagues and you start comparing the existing cloud platforms and services, and you select "the one" you will work with.

  • However, no one in the existing team is familiar with the new technologies or data migration best practices related to them.
  • So, you staff one or two new colleagues, or "fresh blood," who are enthusiastic techies but haven’t had experience in similar projects before.
  • They engage with new technologies, catch up very quickly, and manage to showcase new PoCs. And all of this is done before you get the official project approval and budget.
  • The enthusiasm on your side and the team's grows, and you think that everything from now on will be smooth sailing.
  • Then the project officially starts, and you get a tangible budget. All happy, you again start staffing new colleagues. But this time, you pick the people who have worked on similar projects.
  • And they all bring new ideas and migration concepts from their previous roles. This results in discarding the initial concepts and PoCs you had. And again, you are starting the development from scratch.
  • With the initial 3–4 months of effort gone, you find yourself behind schedule right from the official start of the project.
  • This causes pressure on your team and you, and some colleagues even leave the project due to this.
  • …..

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Conduct pre-project skills gap analysis: Before starting the cloud migration, identify the skills you need for your cloud project. Then try to get a hiring budget before the project budget. Following this, ensure that the same people who work on PoCs will work on the cloud development.
  • Commit to PoC development: Instead of rushing the PoC development with a quick-win solution, invest additional time in the design of several solutions. Approach this problem strategically, and for one data pipeline, try to develop and test two PoCs. On top of the "theoretical" one (read: best practices), a hands-on approach in selecting the optimal solution will bring your team confidence in the development phase.

["Worst-case"] OR [What you can do for damage control]

  • Reassess the staffing and get expert consultants: If the mixed ideas on the development solutions start causing project delays, bring in external consultants who can help with their expertise. These people can assist in improving the team’s learning curve and ensure the project stays on track.
  • Go with the suboptimal solution: Pick the suboptimal solution for part of your development if you have an existing skillset in the team. Existing hands-on experience can speed up the project’s delivery. For example, pick the programming language that everyone is already familiar with.
  • Again: take responsibility, communicate issues/needs transparently, and highlight the long-term benefits.

7. Inconstancy

Changing project scope and priorities, causing confusion and disrupting planned project delivery.

Scenario:

  • You: So, there are new requirements for this project component?
  • Your colleague: Yes, and some additional ones are still being discussed.
  • You: Really?
  • Your colleague: Yes, really.
  • You: But you know the project deadline is in 5 months? And what’s with that – is it still the same?
  • Your colleague: Mhm.
  • ……………………………………………
  • Here we go again. For the fourth time since the project started, the business requirements have changed.
  • Twice in the current release.
  • Some work that you already delivered is now out of the project scope. However, new work has been included in the scope. And you guessed it – with no extra time for this development.
  • As in the previous iterations, you relay this information to your team, which only confuses them more.
  • And again – you know it, they know it, everyone knows it – this time, the changed project scope will break the deadline.
  • ….

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Develop a change management process: Establish rules and guidelines for handling unplanned requirements before the project starts. Ensure that this process includes an impact analysis on the timelines and obtaining approval for the requested changes from the project sponsors.
  • Introduce "hard deadlines": Lock down the changes in the project scope early enough and define the "hard deadlines" to get fixed requirements. In addition, track your project backlog delivery carefully and communicate the pending critical work to the requesters in case of new requirements.
  • Plan project meetings: Organize weekly meetings with project managers, business requirements engineers, technical analysts, and the delivery team. Ensure that everyone is on the same page when it comes to understanding the pending backlog and the timelines.

["Worst-case"] OR [What you can do for damage control]

  • Escalate fast: Escalate the changes in the project scope to your project sponsors. Create awareness of how these changes affect the deadline and the project outcome.
  • Re-prioritize even faster: If the new requirements have an impact on critical components, prioritize this development by re-planning lower-priority tasks for later project phases.
  • Again: take responsibility, communicate issues/needs transparently, and highlight the long-term benefits.

See no evil 🙈 , hear no evil 🙉 , speak no evil 🙊

Although cloud migration or greenfield projects can often feel like opening Pandora’s Box, it doesn’t mean they are not rewarding to work on.

Despite all the challenges I experienced and saw in these projects, the learnings I took from them resulted in personal and professional growth – bigger and more positive than I was ever able to imagine.

I have learnt that good preparation, contingency planning, taking responsibility for mistakes, staffing skilled people, maintaining transparent communication towards sponsors, and consistently highlighting long-term benefits always lead to positive project outcomes.

In summary, it’s important to stay proactive in every challenging situation and find a solution for it.

Hence, in this blog post, I’ve shared my solutions in the hope you can reuse them to handle the challenges in your cloud migration/greenfield projects.

Until next time, happy ☁ data migration planning & development!


Thank you for reading my post. Stay connected for more stories on Medium, Substack ✍ and LinkedIn 🖇 .


The post Opening Pandora’s Box: Conquer the 7 “Bringers of Evil” in Data Cloud Migration and Greenfield… appeared first on Towards Data Science.

]]>
Tips on How to Manage Large Scale Data Science Projects https://towardsdatascience.com/tips-on-how-to-manage-large-scale-data-science-projects-1511f4db3d01/ Sat, 14 Sep 2024 14:10:42 +0000 https://towardsdatascience.com/tips-on-how-to-manage-large-scale-data-science-projects-1511f4db3d01/ Use these tips to maximize the success of your data science project

The post Tips on How to Manage Large Scale Data Science Projects appeared first on Towards Data Science.

]]>
Managing large-scale Data Science and machine learning projects is challenging because they differ significantly from software engineering. Since we aim to discover patterns in data without explicitly coding them, there is more uncertainty involved, which can lead to various issues such as:

  • Stakeholders’ high expectations may go unmet
  • Projects can take longer than initially planned

The uncertainty arising from ML projects is a major cause of setbacks. And when it comes to large-scale projects – which normally have higher expectations attached to them – these setbacks can be amplified and have catastrophic consequences for organizations and teams.

This blog post was born out of my experience managing large-scale data science projects with DareData. I’ve had the opportunity to manage diverse projects across various industries, collaborating with talented teams who’ve contributed to my growth and success along the way – it’s thanks to them that I could gather these tips and lay them out in writing.

Below are some core principles that have guided me in making many of my projects successful. I hope you find them valuable for your own projects as well!


Analysis / Statistics and Prediction

It’s very important to segment the project you are about to start, as the most confusing topic for stakeholders is the difference between AI, ML and DS. These topics are also everywhere on the news and media, and people use the terms interchangeably (I don’t blame them).

The most important thing every stakeholder needs to understand is whether the project is about Machine Learning or not. Some projects are “data science” projects but do not contain any prediction component, which significantly reduces the uncertainty of the project.

I normally group projects in the following:

  • analysis for insight projects involve examining data from current or historical sources to derive actionable insights. These projects typically focus on understanding trends or patterns based on data that has already been collected. Common examples include reporting and business intelligence (BI) initiatives.
  • statistics for causality may seem like machine learning projects but are a bit different. They intend to analyse data in the context of statistical hypotheses, without trying to predict the future. Great examples are all types of A/B tests or other treatment/control analyses.
  • machine learning or clustering: these are the real machine learning projects, and they can be supervised, unsupervised or reinforcement learning (or a mix of them).
  • I view GenAI projects as a subset of machine learning, as they involve prediction and error handling. Like traditional ML projects, they require similar management strategies.

Projects are rarely limited to just one type. For instance, ML projects often feature dashboards that display both historical data and predictions. While this combination is beneficial, it’s important for stakeholders to understand that predictions typically come with more uncertainty and take longer to develop than analyzing past data.


Managing Expectations

ML projects require that you manage uncertainty very well.

Users and stakeholders, particularly under the GenAI hype, expect very high performance from AI systems (sometimes, with unrealistic expectations).

Managing the expectation of speed and accuracy of algorithms is absolutely crucial. Don’t promise high accuracy / f-score / other metric without seeing the data.

Don’t promise high speed without assessing the company’s systems and ability to scale. In essence, don’t overpromise.

Also, understand how to add value with your ML project. Are you working for some organization trying to use ML to reduce costs? Or are you looking to increase sales and revenue? Try to translate the main goal of the project from technical to business performance – this will make the goal of the project much simpler.

And that leads us to…


Success Metric

A project’s success metric is often the most important part of the project – Image by tinkerman @ Unsplash.com

No ML project should start without a success metric. Is it speed of prediction? Is it technical performance? Or, is it saved dollars?

For stakeholders, faster and more accurate is always better. If you don’t address these expectations via the success metric definition right from the start, you may end up in a scenario where stakeholders expect 100% accuracy from an ML system.

The success metric definition is your friend. Right from the start, define the playing field with your stakeholders and make them agree, verbally and in writing, to a certain level of performance (technical, business, or other) and/or business impact.

Ideally, you should focus on optimizing a single metric to guide the project, as trade-offs show up. If that’s not feasible, establish a hierarchy of metrics to prioritize, allowing you to make informed choices when necessary.
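To make the idea concrete, here is a minimal sketch – with entirely hypothetical model names and numbers – of how an agreed primary metric, a hard business constraint, and a hierarchy of tie-breakers can be written down in code, so that model selection follows the playing field you defined with stakeholders rather than ad-hoc judgement:

```python
# Hypothetical candidate models and metrics; the names and values are illustrative only.
candidates = {
    "baseline_logreg": {"recall_at_1pct_fpr": 0.62, "latency_ms": 12, "f1": 0.58},
    "gbm_v2":          {"recall_at_1pct_fpr": 0.71, "latency_ms": 45, "f1": 0.64},
    "deep_net_v1":     {"recall_at_1pct_fpr": 0.72, "latency_ms": 310, "f1": 0.66},
}

MAX_LATENCY_MS = 100  # hard business constraint agreed with stakeholders up front

def ranking_key(name):
    m = candidates[name]
    # Primary metric first, then tie-breakers in the agreed priority order.
    return (m["recall_at_1pct_fpr"], m["f1"], -m["latency_ms"])

# Only models that satisfy the hard constraint are even considered.
feasible = [name for name, m in candidates.items() if m["latency_ms"] <= MAX_LATENCY_MS]
best = max(feasible, key=ranking_key)
print(f"Selected model: {best}")  # picks the best primary metric within the latency budget
```

Writing the rule down like this forces the trade-offs (accuracy vs. speed vs. cost) to be made explicit and agreed once, instead of being re-negotiated every time a new model candidate appears.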


Organizational Stage

When working on a project, do you know the organization’s AI and Data Maturity stage? One common mistake data scientists make is to disregard the data maturity and the context they operate in.

Some organizations are able to deploy machine learning models quickly, while others have more trouble doing so. Some organizations already work under MLOps best practices, while others struggle to keep track of the results of their ML models. These nuances are extremely important to ensure the success of your project.

Answering these questions will set the stage for several things:

  • How much data you will need to feed into your ML model. For example, if you need more features, what’s the likelihood that you will get them quickly?
  • How will you deploy the model or project?
  • Will the model be continuously monitored? Or does it need retraining?

These questions can only be answered by exploring the organization’s data processes. Engage with people (peers and leaders) to understand whether a clear data vision exists within the organization and how this vision aligns with the project you are developing.


Agile vs. CRISP DM

Although Agile is applied by many organizations in Software Engineering projects, it should be used with caution in the context of ML.

For example, this paper compared the usage of different project methodologies in ML context. It came to the conclusion that Agile mixed with CRISP-DM (Cross Industry Standard Process for Data Mining) may be a good combination to achieve positive outcomes (without leading to team frustration).

In my experience, I’ve noticed that some flexibility in sprint planning and task assignment is often needed with ML projects. It’s quite normal to spend 2 or 3 weeks without any major breakthroughs, or for tasks to need to be completely changed due to new discoveries. If this is not taken into consideration, the team may feel disproportionate pressure to deliver results (which may be subpar or, even worse, solve a problem that doesn’t match stakeholders’ expectations).

This study details the inherent conflicts between traditional PM and AI workflow logics. The table below (taken from the study) highlights some of them:

Managing artificial intelligence projects: Key insights from an AI consulting firm

Team Management

Team management is more an art than science. And every team is different and contains its own nuances.

The most important tip I can give on team management and leadership is to adhere as much as possible to the maker’s schedule rather than the manager’s schedule. Did you know that it takes around 10–15 minutes to get back “in the zone” after you are interrupted? This estimate can be even higher if we are speaking of complex tasks that need high levels of focus.

Protect your team from unwanted distractions (random pings on Teams/Slack or meetings that operate on a manager’s schedule) and they will appreciate it. This is the best tip I can give for a happy and productive team.


Thank you for taking the time to read this post. I hope you’ve enjoyed these tips and you can use them in the day-to-day.

Let me know if there’s something that you would like to add! I’m always looking for fresh new perspectives on AI/ML/DS Project Management. Knowing how to handle these projects is definitely a rare skill and often, Project and Product managers define a large portion of the initiative’s success.

This post is likely more relevant to consultancy-based projects or those in non-tech businesses, where data science and machine learning are crucial to day-to-day operations. But, hopefully, some of the tips will also be applicable to other types of organizations.

The post Tips on How to Manage Large Scale Data Science Projects appeared first on Towards Data Science.

]]>
DSLP – The Data Science Project Management Framework that Transformed My Team https://towardsdatascience.com/dslp-the-data-science-project-management-framework-that-transformed-my-team-1b6727d009aa/ Wed, 28 Aug 2024 17:33:10 +0000 https://towardsdatascience.com/dslp-the-data-science-project-management-framework-that-transformed-my-team-1b6727d009aa/ It is the best framework for Data Science, by far. Use it for your team or for just yourself. Here's how I used it.

The post DSLP – The Data Science Project Management Framework that Transformed My Team appeared first on Towards Data Science.

]]>

Whilst software engineering practices dictate that issues are created to adapt to changing client requirements, we require practices that are able to adapt to changing requirements dictated by our own research.

Contents

  • You’ve probably tried Agile
  • Why Agile doesn’t work for Data Science…
  • The Data Science Lifecycle Process (DSLP)
  • The Five Steps of DSLP
  • Example Project: Detecting Credit Card Fraud
  • A New Project – Create an Ask Issue
  • Exploring the data – Data Issue
  • This layout leads to the conventional Agile project
  • The Kanban board that makes sense for Data Science
  • Conclusion

You’ve probably tried Agile…

Let’s face it, we’ve all tried using agile methodologies at some point to manage our data science projects.

And I’m sure that you and your team have seen it slowly fall apart – everyone says the same thing they said in their previous standup meetings, project Kanban boards never get maintained, sprints become meaningless.

You end up feeling like the whole thing is pointless, but you can’t quite place your finger on why it doesn’t work.

Because of this, you don’t know how to improve it or what to change.

Why Agile doesn’t work for Data Science…

Photo by Shekai on Unsplash

The Agile framework was built for software engineering, where there is a product that needs to be delivered at the end of it.

Based on this end-goal, the agile framework is trying to keep the project aligned with end-user requirements which can change over time. It aims to keep the feedback loop between the developers and end-users tight, so that the project can remain ‘agile’ to changes.

Meanwhile, developers maintain a high level of communication between each other to quickly identify the changing requirements, identify blockers, and transfer knowledge.

This is why you have sprints and stand-ups (or scrums) whilst utilising Kanban boards to keep track of your work.

However, this framework quickly falls apart for a data-science project.

Why?

Because Data Science is fundamentally an R&D effort, there is no concept of an end-product that you are trying to build at the start. Research is required to determine what the end-product might look like.

Only after the R&D is finished, and you know what data you need, what preprocessing/feature engineering is required, and what model you are going to use, do you finally know what you are going to build.

This means that the agile framework only becomes applicable when you are trying to productionize your model, which for a Data Science project, is the very last step of the project.

Then what is the alternative?

Photo by Buddha Elemental 3D on Unsplash

The Data Science Lifecycle Process (DSLP)

After researching the field of Data Science Project Management, I came across the Data Science Lifecycle Process which seems to encompass all the key insights that other resources provided into one framework that can be incorporated directly into Github projects or any other Kanban-based project management tool.

GitHub – dslp/dslp: The Data Science Lifecycle Process is a process for taking data science teams…

I have used this for the project management of my own Data Science team, and it has proven to be the best improvement to our workflow to date.

What benefits did we see? Just to name a few,

  1. Improved documentation of the project throughout its lifecycle, where every design choice and piece of research was documented in a single place,
  2. Which facilitated seamless transfer of knowledge and completely removed friction during handovers,
  3. And improved research collaboration between data scientists.
  4. Better prioritisation of projects and reduced wasted hours on ill-defined projects,
  5. Encouraged a task-based workflow that fits seamlessly to existing Kanban-board workflows with minimal required changes,
  6. Which ended up facilitating the iterative approach that Agile strives for, but for data science, not software engineering.

The above is not an exhaustive list, only the immediate points I can think of in my head. It sounds too good to be true, but read on below to see how everything above (and more) is achieved.

The 5 Steps of DSLP

We will use my template Github Project as an example to illustrate how DSLP can be used. This can be found here:

DSLP based example project template – bl3e967

The DSLP example project template which you can find through the link above.

Any related code or content that I refer to from this point will all be available in this project template.

All images used in this article have been generated by the author.

DSLP consists of five project lifecycle steps: Ask, Data, Explore, Experiment and Model. At each step, you would expect to raise a GitHub Issue in the GitHub Project for your Data Science team.

The structure of DSLP

Below are high level summaries, provided by the DSLP, of each step and what their corresponding Issues would involve. We will delve into the details of how they actually work, in addition to the Github Issue templates and Github based workflows I utilised in my team, in the next section with a realistic example.

Ask

Ask issues are for capturing, scoping, and refining the value-based problems your team is trying to solve. They serve as a live definition of work for your projects and will be the anchor for coordinating the rest of the work you do. It provides a summary of all the work that has been done, and the work in progress.

This issue becomes the first port of call for yourself or anyone who needs information about the project, and any issues created during other steps should be linked to this issue (we will see how easy this is later in the example).

Data

Data issues are for collaborating on gathering and creating the datasets needed to solve a problem.

This issue is relevant for sourcing your data and creating your input datasets.

Explore

Explore issues give us a way to provide quick summaries and TLDRs for the exploratory work we do. The goal of explore issues is to increase our understanding of the data and to share those insights with others.

This issue type is akin to the exploratory data analysis that you would do at the start of a Data Science project, and facilitates good documentation of what was explored and collaboration with other data scientists.

Experiment

Experiment issues are for tracking and collaborating on the various approaches taken to solve a problem and for capturing the results.

Once you have an understanding of your data and how it relates to your problem, Experiment issues are used for the modelling you would do.

Perhaps you frame your problem as an anomaly detection model? Perhaps you consider different types of models? Or maybe you try different hyperparameters?

Subject to how you work, each of these can be a separate Experiment issue or part of one big issue.

Model

Model issues are for working to productionise your successful experiments so that you can deploy them. This will often involve writing tests, creating pipelines, parameterising runs, and adding additional monitoring and logging.

As DSLP explains, once your experimentation phase is complete and you know what you want to productionise, any work done to productionise the model should be part of the Model issue.

More details about each step can be found in the DSLP repository.

But for now, let’s dive into how I actually got DSLP working for my Data Science team using GitHub Projects, through a fictional example.

Example Project: Detecting Credit Card Fraud

Let us assume you are a Data Scientist working for a bank, and you are approached by an SME (subject matter expert) named John Doe from the Credit Card Fraud operations team.

They let you know that credit card fraud has become a business priority for the bank and the bank is required to improve on their processes for detecting such cases. This was off the back of regulatory feedback that identified the bank as having been underperforming when it came to credit card fraud detection compared to other banks.

Photo by LinkedIn Sales Solutions on Unsplash

They approach you for your opinion on whether the Data Science team could improve the bank’s fraud detection rates using a model, in light of recent successes the bank has seen with previous models applied to other domains.

Obviously, you have no idea at the outset whether a model could be built – you don’t know what data is available and whether the quality of the labels is good enough (or if they even exist).

But, you do know that your team has the capacity to initiate a new project and research this. So you arrange a follow up call with the SME and some others in the credit card fraud prevention team to learn about the problems they face and get an understanding of what the project would entail.

A New Project – Create an Ask Issue

You get back to your desk with the new project in your head. The first thing you should do is create a new Ask issue, using an Ask issue template (found here. Further information about each section in this template can be found as comments in the markdown template).

The issue has different sections serving different purposes. The description for what each section is for can be found as comments in the markdown code.

Ask issues are for capturing, scoping, and refining the value-based problems your team is trying to solve. They serve as a live definition of work for your projects and will be the anchor for coordinating the rest of the work you do.

By having the definition of work be in an issue, data science teams can collaborate with their business partners and domain experts to refine and rescope the issue as they learn more about the problem.

You should link to the other issues inside of the Ask issue. This will give people context around why a particular issue is being worked on.

As your understanding of the problem evolves you should update your ask issue accordingly. In order to create clarity, you should be as specific as possible and resolve ambiguities by updating the Ask.

So, we are at the very beginning of our project where we have little to no information yet. But, we can update the problem statement to the best of our knowledge.

The problem statement is a high level description of what you’re trying to solve and why. It should be clear what value solving this problem will create and should also set a clear scope for what is and isn’t included as part of the problem.

We update the problem statement as so:

"There has been a shift in business priorities and credit card fraud prevention has become a priority.

Upon speaking to John Doe about this issue, we are in need of better performing controls than what we have traditionally had in place."

At this stage, we have no concrete problem to work with off of the brief chat we had, so we will leave it as is for now. But in order to make this project workable, we need to follow up with John and other SMEs and get a better idea of what the ask is and what a successful outcome might look like in their eyes.

So, you write in the comments section of your Ask issue with the next steps you need to take to get this project moving.

Create a log of any action points that are needed for the project in the comments section.

The whole point of the comment section is to keep a log of all action points and the design choices that have been made for your project.

It serves to be an audit log that yourself and others can reference to get detailed information about the project at each step of the way, whilst the templated Ask Issue at the top serves as a high-level summary of the important points of your project that becomes the first port of call for any queries or questions related to your project.

So, once you have set up the meeting, talked to John and his colleagues to address this action point, then you can update the comment or add a new comment with the meeting notes.

The comment now populated with the outcome of the meeting

This meeting has given us enough information to fill out the remaining sub-sections (Desired Outcome, Current State, Success Criteria).

An example of how one might populate the Desired Outcome and Current State fields. See the issue template for more details on what purpose this section serves.
An example of how one might populate the Success Criteria field. See the issue template for more details on what purpose this section serves.

Keep in mind that as the project progresses and the clearer it becomes in terms of what you can/can’t do, the above sections and any other sections in the template should be changed to reflect the most up-to-date understanding of the project.

Exploring the data – Data Issue

So, after our meeting with John and co, we have an action point:

Data science team will make a start on the available data to verify if a model is even possible.

The first step would be to search for any ground-truth data that will serve as labels for our task of credit card fraud detection.

From the log in the comments section, we create an issue off the back of this action point:

Click the (…) button on the top right, click ‘Reference in a new issue’

and then fill in the details for a new issue, like below:

Fill in the issue title (in red), and reference the issue in your project (in yellow).

and now this issue will appear on your Project board. Drag and drop it into the Data column and your board should look like below:

So, this Data issue you have created will be where you log everything related to obtaining the ground truth data, in addition to any design choices you made and any limitations with the data.

Any code you write to obtain the data should be created via this issue, by creating a branch from the issue directly:

Create a branch linked to this issue (in yellow), and select the repo the branch should belong to (in red)

The greatest thing about this is that the branch you create can be associated with any repository that has Issues enabled. This is useful if you have separate repositories for differents parts of your data science stack, as you can keep everything related to your project all in one board.

Any PRs related to this Data issue are linked.

Finally, once you complete your coding and raise a PR for it, it is automatically linked to your Data issue.

What we have so far…

An outline of how all development related to the Ask is linked together, in the form of issues and PRs.

By working in this manner, we have achieved two things:

  1. All design choices and implementation details are linked together into the relevant Ask issue so everything related to the project can be found through one place.
  2. There is no scope for loss of information, whilst information is summarised in a hierarchical format – high level details in the Ask Issue, low level details in the Data issue, implementation details in the Pull Requests, which makes it easier to collaborate or handover.

In this manner, we can continue to build out our project:

The same ask issue, linked to subsequent DSLP issues like Explore, Experiment, Model…

We’ve covered Data and Explore issues, which are equivalent to the data acquisition and EDA parts of Data Science.

Experiment is the actual modelling, and would involve the Jupyter Notebooks you create to try out different approaches and different models.

Model is the final step for everything to do with productionisation – unit testing, code refactoring, model monitoring etc.

and they work in the same way as we have covered above. For more details on these steps, check out the DSLP page or the example project board. But for now, we will skip to the next section.

This layout leads to the conventional Agile project

So, let’s jump ahead, perhaps two weeks into the future of our example project.

So far, you have carried out the necessary work to acquire the data you need and have performed some exploration into the data and potential features, in addition to attempting to build a first-iteration model.

Example of many issues being open at once for a single Ask issue.

The project will most likely have multiple issues open at the same time.

  • Explore issues or Experiment issues may be blocked by a Data issue as you realise you need some different data to be made available before you proceed.
  • Some other issues may require code review, and you need to chase reviewers to do this for you.
  • Some issues you have perhaps forgotten about, and they haven’t even been started yet.

We need a way to keep track of all these issues, something familiar….

Photo by Tingey Injury Law Firm on Unsplash

The Kanban board that makes sense for Data Science

What we now do is go into the project settings and create a Progress field that we can assign to our issues.

The settings can be found by clicking on the (…) button on the top right (highlighted in red).

Like below, I have created four labels using the Field Type of Single select: To Do, In Progress, Blocked and Waiting for Review. Of course, this is personalised to me, and you can decide what you want to have.

The four progress fields created using Field type of Single select.

Now, going back to the project board, I can create a new view called Kanban Board and using the dropdown button, configure the columns to be based on the Progress field that we just created (highlighted red).

Assign the Progress field to each of your issues in your project,

Create a new view and configure the Column by field to Progress, highlighted in red.

and voila!

You have yourself a Kanban board, which can be used for meetings to organise and track progress of your Data Science projects!

You now have a Kanban board that is familiar to use.

Why does this Kanban Board work, whilst the traditional one fails?

So, we’ve arrived full circle back to using a Kanban board. You may be thinking:

Wait, what was the whole point of this entire article if we are using Kanban boards anyways?

Well, the lineage from which our Kanban board stems is different.

Data science projects are essentially R&D work. It requires a different flavour of rigour to software engineering projects.

Data science issues are determined by the research, not by the client/end-user.

Whilst software engineering practices dictate that issues are created to adapt to changing client requirements, we require practices that are able to adapt to changing requirements dictated by our own research!

That is why we have two different boards:

  • One board (the Project board or the Ask board) that highlights the different research and development work being done for a single project (or Ask),
  • and another board (the Kanban board) for keeping track of the progress of the R&D tasks associated with an Ask.

That is what makes this framework different, and what makes this framework actually work for Data science.

By using these two different boards, you are able to achieve the following:

  • Have a full summary of all R&D steps taken for a project in one Ask issue, which links all related issues and PRs into one. This makes documentation, audit logging, and knowledge sharing a piece of cake. This is especially important for high-stakes industries such as healthcare or finance.
  • Maintain the progress of each Ask using the Kanban board, which is a way of working that most people are already accustomed to. This allows you to incorporate other Agile methodologies such as stand-ups and sprints around a Data Science oriented Kanban board.

Conclusion

I hope that you found this article useful.

This framework can be used by data scientists of any level:

  • be it a manager trying to find a way of working that works for their team,
  • or a junior data scientist, trying to find a way to organise their work.

It also doesn’t have to be in GitHub Projects. Any project management tool that supports a Kanban-board-based workflow is compatible, and most tools allow for integration with GitHub these days, so everything can be integrated together.

I haven’t covered every single detail of the DSLP framework, only the important things you need to get started with using it.

I encourage you to read the framework yourself, as there are some other bits and bobs they recommend that I decided not to use, but perhaps they may be useful for you – let me know in the comments.

And a final comment about why such a framework is becoming increasingly important.

I’ve mentioned Data Science is an R&D based profession, but the truth is that the industry as a whole is fast reaching a point where pure research is just not enough, and we need to be able to deliver concrete value.

Meanwhile, regulatory scrutiny will only increase with time as ML models proliferate throughout industries, starting with high-stakes industries such as Finance and Healthcare. This trend indicates a need for better project management, documentation and audit logging of the models we implement.

Let me know your thoughts in the comments, and if you found this article useful, please give me a clap (you can do up to 50 per person!), and feel free to share it with your colleagues.

Follow me for more practical Data Science content.

How to Stand Out in Your Data Scientist Interview

Enhance Your Network Analysis with the Power of a Graph DB

An Interactive Visualisation for your Graph Neural Network Explanations

The New Best Python Package for Visualising Network Graphs

The post DSLP – The Data Science Project Management Framework that Transformed My Team appeared first on Towards Data Science.

]]>
Solving a Constrained Project Scheduling Problem with Quantum Annealing https://towardsdatascience.com/solving-a-constrained-project-scheduling-problem-with-quantum-annealing-d0640e657a3b/ Tue, 20 Aug 2024 17:18:15 +0000 https://towardsdatascience.com/solving-a-constrained-project-scheduling-problem-with-quantum-annealing-d0640e657a3b/ Solving the resource constrained project scheduling problem (RCPSP) with D-Wave's hybrid constrained quadratic model (CQM)

The post Solving a Constrained Project Scheduling Problem with Quantum Annealing appeared first on Towards Data Science.

]]>
Why did the dog fail quantum mechanics class? He couldn’t grasp the concept of super-paws-ition. Quantum superposition is the principle where a quantum system can exist in multiple states simultaneously until it is measured, at which point it collapses into one of the possible states. (Image generated by DALLE-3)

I’m really excited to share this article with you because it’s closely tied to my current research: optimizing project schedules with quantum computing. My passion for this topic comes from my background as a project manager and my ongoing research into how quantum computing can help solve complex optimization problems. During my PhD, I specifically looked into how today’s quantum technology could be used to tackle complex scheduling challenges that are specific to the field of Project Management.

We’re living in an incredible era where quantum computers are no longer just a concept – they’re real and being used. Imagine what Richard Feynman would say about them! However, these machines aren’t fully ready yet. They’re often referred to as NISQ (Noisy Intermediate-Scale Quantum) devices, a term coined by quantum computing pioneer John Preskill. These machines are still considered early versions of the more advanced quantum computers we hope to see in the future.

Currently, they’re small, with only a few qubits (quantum bits of information) available. For comparison, it’s not uncommon today to have a laptop with 16GB of RAM, which equates to around 128 billion bits of information, that we can work with. In contrast, the largest quantum computers have fewer than 6,000 qubits. Moreover, qubits are highly sensitive to noise, and the limited number of qubits makes it challenging to correct errors and perform complex calculations.

Despite these challenges, the quantum computers we have today are powerful enough to test out algorithms and see their potential for solving tough problems. In broad terms, there are two main types of quantum computers: universal quantum machines, like the ones from IBM and Google, which work on a circuit-gate model of computation, and quantum annealers, which use adiabatic evolution principles. (Technically speaking, there is a third type of quantum computer – measurement-based quantum machines – but these are still in early development.) In quantum computing, “universal” means that these machines can, in theory, run any quantum algorithm by implementing any quantum mechanical gate (including classical gates). While quantum annealers might not be universal, they’re still super promising for optimization tasks.

In this article, I’ll walk you through how you can use a quantum annealer provided by a company called D-Wave to optimize project schedules. The article is based on one of my recently published research papers [1]. If you’re interested in a deeper dive, I invite you to read my full research article, which you can access using the link down below.

Solving the resource constrained project scheduling problem with quantum annealing – Scientific…

Before we get into the coding, I’ll give you a quick intro to quantum annealing and how it’s used in optimization. Let’s get started!

Article index:

What is quantum annealing?

Quantum Annealing is a quantum metaheuristic designed to solve combinatorial optimization problems by leveraging the principles of quantum mechanics [2]. This approach is inspired by simulated annealing [4], a classical metaheuristic where a system is gradually cooled to reach a state of minimum energy. In quantum annealing, quantum phenomena such as superposition (where a quantum system can exist in multiple states simultaneously until measured, at which point it collapses into one of the possible states) and quantum tunneling are utilized to efficiently navigate through local minima, with the goal of finding the global minimum of a cost function.

Quantum annealing devices are primarily employed for optimization purposes, such as finding the minimum or maximum value of a problem. Imagine trying to find the lowest point in a rugged, hilly landscape – aka the “best” solution to a problem. Traditional methods might get stuck in small dips, mistaking them for the lowest point. In contrast, a quantum annealer leverages the properties of quantum mechanics, attempting to tunnel through these local minima and guide the system to the true lowest point, or optimal solution. The effectiveness of quantum annealing, compared to its classical counterpart, can vary depending on the specific problem and the hardware used. The literature presents both theoretical proofs [3] that highlight the advantages of quantum annealing and contrasting perspectives, particularly when compared to simulated annealing.

At the core of quantum annealing is the profound concept that the universe naturally tends to seek states of minimum energy, known as ground states. Quantum annealing also relies on the "adiabatic theorem" of quantum mechanics and the evolution of quantum systems as described by the time-dependent Schrödinger equation.

Optimization problems can be represented as physical systems and formulated as energy Hamiltonian problems, which are mathematical representations of a system’s total energy. A common approach is to use the ISING model to encode the optimization problem into a physical system. In this model, the optimization problem is mapped onto a representation of the magnetic spin of each qubit, interacting in a two-dimensional lattice grid, as shown in Figure 1 below.

Figure 1. ISING Two Dimensional Lattice (Image created by the author)

The mathematical representation of the energy Hamiltonian H₁​ in the ISING model is given by the formula below:
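A standard way to write this Hamiltonian, reconstructed here from the coefficient descriptions that follow (the overall minus-sign convention is an assumption on my part, as it varies across references), is:

$$H_1 = -\sum_{i,j} J_{ij}\,\sigma_i \sigma_j \;-\; \sum_{i} H_i\,\sigma_i$$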

Here, σᵢ is the Z-Pauli gate applied to qubit i, which takes the value +1 if the qubit is pointing upwards and -1 if it is oriented downwards. Jᵢⱼ represents the interaction coefficient between qubits i and j, while Hᵢ represents the coefficient of an external magnetic field interacting with qubit i. Fortunately, most common optimization problems can be encoded using this model, as the ISING problem is NP-Complete, as demonstrated by Istrail in 2000 [5]. In this model, each variable is binary, acting like a magnet that can only point up or down. These magnets will naturally try to orient themselves to minimize the energy of the entire grid, resulting in the system’s minimum energy. The solution to the problem is simply the configuration of the magnets after they have evolved and properly aligned themselves.

You’ve likely already encountered some of the principles of this model without even realizing it. If you’ve ever played with magnets, you know that when you have two or more, they naturally try to align their poles in opposite directions to minimize the total energy of the system, as illustrated in Figure 2.

Figure 2. Magnets minimizing energy (Image created by the Author)

Mother Nature is a giant optimizer that tries to solve this problem on her own. As illustrated in Figure 3, if we start with a random assortment of spin particles, each oriented in a different direction, they will naturally attempt to realign themselves, seeking the minimum energy state. Quantum annealing leverages this principle. However, this process doesn’t always guarantee finding the absolute global minimum, as nature can still get stuck in local minima. That’s why we also rely on the power of the adiabatic theorem.

Figure 3. Quench of an ISING system on a two-dimensional square lattice (500 × 500) starting from a random configuration (Image downloaded from Wikipedia https://en.wikipedia.org/wiki/Ising_model)

The adiabatic theorem, first formally stated by Max Born in 1928 (although Einstein had also explored the concept earlier), tells us that:

If a system starts in its lowest energy state and changes "slowly enough", it will remain in that minimum energy state as it evolves.

This theorem and its impact on quantum annealing can be challenging to grasp, so I’m going to use a metaphor that has worked well for me in the past. Keep in mind that, like all metaphors, it may not be entirely accurate from a formal physics perspective (it is not), but it serves as a helpful way to illustrate the concept.

Imagine we have a toddler who, in this example, represents a quantum system. Let’s get creative, like Elon Musk, and call this child Psi Ψ. Now, Psi Ψ can interact with various environments within the house where they live. These environments represent what we’ll call energy Hamiltonians. For instance, one Hamiltonian could be the living room of the house, with a comfy couch in front of the TV, as shown in Figure 4.

Figure 4. Energy Hamiltonian metaphor example (Image created by the author using Midjourney)

Another Hamiltonian could be for example the bedroom of Psi Ψ, as depicted in Figure 5.

Figure 5. Energy Hamiltonian metaphor example (Image created by the author using Midjourney)

Now, Psi Ψ can exist in various energetic states. For example, Psi Ψ could be in an excited state, let’s say they’re in the middle of a sugar rush, as shown in Figure 6.

Figure 6. Very Excited State Ψ (Image created by DALLE-3)

Psi Ψ can also be in a ground state, a state of minimum energy. For example Psi Ψ might be in deep sleep as shown in Figure 7.

Figure 7. Ground State Ψ (Image created by DALLE-3)

Now, there are certain Hamiltonians, which we’ll call H₀​, where Psi Ψ will naturally evolve toward a minimum energy ground state. For instance, if we let Psi Ψ play in the living room, even if they start in an excited state, they’ll eventually tire out and fall asleep on that comfy sofa, as shown in Figure 8. (I remember a childhood friend’s house with a terrace overlooking a beautiful valley, where you could feel the fresh mountain breeze. The terrace had a sofa and was filled with chimes that sang with the wind. It was one of the most relaxing environments I’ve ever experienced, and I’m sure anyone would naturally settle into a ground state in that kind of "Hamiltonian".)

Figure 8. Easy Hamiltonians H₀ (Images created by DALLE-3)

In contrast, there are other Hamiltonians – typically the ones we’re most interested in – called H₁​, where it’s much more difficult for Psi Ψ to reach the ground state. For example, Psi Ψ really struggles to fall asleep when they are in their bedroom, as shown in Figure 9.

Figure 9. Hamiltonians of interest H₁ and Ψ in excited state (Image created by DALLE-3)

Unfortunately, what we really want is for Psi Ψ to be in the ground state of H₁ – in other words, we want Psi Ψ to be fully asleep in their bedroom. But that’s tough to achieve, so what do we do?

Quantum annealing advises us not to fight against the current. Instead of directly seeking the ground state of H₁​, it suggests leveraging the power of the adiabatic theorem. Quantum annealing says, "Look, the starting point of this process should be an easy Hamiltonian, H₀​". We first allow Psi Ψ to naturally evolve into a ground state because it’s easy to achieve this in H₀​. So, we let Psi Ψ play in the living room, and once they are fully asleep on that comfy sofa, then – and only then – do we very gently, very softly, without disturbing their sweet dreams, begin to slowly change the Hamiltonian from H₀​ to H₁​. We gently pick them up from the sofa, slowly climb the stairs, and softly place them in their bed. If we do this process correctly and "slowly enough" (adiabatically), by the end of it, we should find Psi Ψ in the ground state – not of H₀​, but of H₁, as shown in Figure 10​. And voila 🥐, this means that we could use this technique to find the ground states of Hamiltonians of interest H₁.

Figure 10. Adiabatic evolution of Ψ (Image created by DALLE-3)

Great! So, this means that we can indeed use this technique to solve complex optimization problems. All we need to do is find a way to encode the problem into an H₁ (using the ISING model)​ and identify a suitable H₀​. Then, we gradually transition from H₀​ to H₁​ by following the annealing schedule outlined below.

Figure 11. Annealing schedule (Image created by the author)

At this point, you might be wondering, "What about H₀​? How do we find it?" Fortunately, the initial Hamiltonian H₀ is easy to identify. We typically use a Hamiltonian known as the "tunneling Hamiltonian," which is described by Equation 4 below. The ground state for this Hamiltonian is simply each qubit in a "uniform" superposition (equal probability of being 0 or 1).
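Written out explicitly – as a reconstruction of the standard transverse-field form, so treat the exact normalisation as an assumption – the tunneling Hamiltonian is:

$$H_0 = -\sum_{i} \sigma^{x}_{i}$$

where σˣᵢ is the X-Pauli operator acting on qubit i; its ground state places every qubit in an equal superposition of 0 and 1.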

Now, when it comes to optimization, we often prefer not to use the +1,−1 basis of the ISING model. Instead, we like to encode optimization problems into Hamiltonians H₁​​ using an equivalent isomorphic formulation of the ISING model called QUBO (Quadratic Unconstrained Binary Optimization). QUBO expresses the problem as minimizing a quadratic polynomial over binary variables that can be either 0 or 1. When we formulate a problem as a QUBO, we must represent all variables of our optimization problem as binary and include each constraint as a penalty term added to the objective function. An excellent guide on how to do this was published by Fred Glover [6].
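As a small illustration of that penalty recipe – with tiny made-up numbers, not taken from the RCPSP instance discussed later – the sketch below encodes “pick exactly one of three options at minimum cost” as a QUBO by adding the equality constraint as a squared penalty term:

```python
# A minimal QUBO-encoding sketch: minimize sum(costs[i] * x_i) subject to x_0 + x_1 + x_2 == 1.
# The constraint is added as the penalty P * (x_0 + x_1 + x_2 - 1)^2; all numbers are hypothetical.
import itertools

costs = {0: 3.0, 1: 1.0, 2: 2.0}
P = 10.0  # penalty weight; must dominate the objective so violations are never "worth it"

# Expanding the squared penalty and using x_i^2 = x_i for binary variables gives:
Q = {}
for i, c in costs.items():
    Q[(i, i)] = c + P - 2 * P          # linear terms: cost + P*x_i - 2P*x_i
for i, j in itertools.combinations(costs, 2):
    Q[(i, j)] = 2 * P                  # quadratic cross terms from the squared penalty
offset = P                             # constant term left over from (-1)^2

def energy(x):
    linear = sum(Q[(i, i)] * x[i] for i in costs)
    quadratic = sum(Q[(i, j)] * x[i] * x[j] for i, j in itertools.combinations(costs, 2))
    return linear + quadratic + offset

# Brute-force check: the lowest-energy assignment respects the constraint and picks the cheapest option.
best = min(itertools.product([0, 1], repeat=3), key=energy)
print(best, energy(best))              # (0, 1, 0) with energy 1.0
```

On real hardware the same QUBO would be handed to a sampler instead of brute-forced, but the trade-off in choosing P – large enough to dominate the objective, small enough not to swamp the energy scale – stays exactly the same.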

Once the optimization problem is in QUBO or ISING form, it must be mapped onto the quantum chip. The chip has a fixed architecture for the qubits – a topology, if you will (the current generation of D-Wave quantum annealers uses a topology known as Pegasus). Since the qubits are not fully connected, we need to be creative when mapping the problem onto the quantum chip. If, for example, a qubit i needs to interact with qubit j, but they aren’t directly connected in the machine’s topology, we need to find other qubits (or a group of them) that form a connecting path between i and j. This path is known as a “chain”, and the process of mapping a problem onto the quantum chip is called minor embedding; it consumes extra physical qubits, which are then used to represent the logical qubits. Minor embedding is not trivial, as it is also an NP problem, but we can rely on heuristics to help find these embeddings. An example of the mapping produced from my research can be seen in Figure 12 below.

Figure 12. Minor embeddings for an RCPSP instance used in the research article of the author (Image created by the author)
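To give a feel for what this embedding step looks like in code, here is a minimal sketch using the open-source minorminer heuristic with a small Pegasus graph as a stand-in for the real chip; it assumes the D-Wave Ocean SDK packages `minorminer`, `networkx` and `dwave-networkx` are installed, and it is not the exact code behind Figure 12:

```python
import minorminer
import networkx as nx
import dwave_networkx as dnx

# Source problem graph: three fully connected logical variables (a triangle).
source = nx.complete_graph(3)

# Target hardware graph: a small Pegasus lattice standing in for the actual QPU topology.
target = dnx.pegasus_graph(2)

# Heuristically map each logical variable to a chain of physical qubits.
embedding = minorminer.find_embedding(source.edges, target.edges)
for logical, chain in embedding.items():
    print(f"logical variable {logical} -> physical chain {chain}")
```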

There’s one small detail about the current generation of quantum annealers: it’s been nearly impossible to recreate the ideal conditions required by the adiabatic theorem. Even minor disturbances, like those caused by cosmic rays [9], can disrupt the ground state and interfere with the annealing process. That’s why, in practice, we run the annealing process multiple times and consider all the output solutions as samples. We then collect these samples and select the one with the lowest energy hoping that it is the optimal solution.
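In code, that “run many anneals and keep the best sample” workflow looks roughly like the sketch below. It uses the classical simulated-annealing sampler from the Ocean SDK (the `dwave-neal` package) as a local stand-in, since submitting to the real QPU through DWaveSampler and EmbeddingComposite requires Leap access; the QUBO coefficients are made up for illustration:

```python
import dimod
import neal

# A toy QUBO (hypothetical coefficients) standing in for the embedded problem.
Q = {(0, 0): -1.0, (1, 1): -1.0, (0, 1): 2.0}
bqm = dimod.BinaryQuadraticModel.from_qubo(Q)

# Swap for EmbeddingComposite(DWaveSampler()) from dwave.system to run on real hardware.
sampler = neal.SimulatedAnnealingSampler()
sampleset = sampler.sample(bqm, num_reads=100)   # 100 anneals -> 100 candidate samples

# Keep the lowest-energy sample and hope that it is the global optimum.
print(sampleset.first.sample, sampleset.first.energy)
```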

Now that we have a solid understanding of how quantum annealing works, let’s move on to the next section, where we’ll apply it to solve one of the most challenging scheduling problems – the Resource-Constrained Project Scheduling Problem (RCPSP).

The RCPSP and it’s binary MILP formulation

As a reminder, we’re tackling a project scheduling problem where we need to manage limited resources, such as workers or equipment. This problem is known as the Resource-Constrained Project Scheduling Problem (RCPSP). It’s a challenging combinatorial optimization problem, classified as NP-hard by Blazewicz et al. (1983) [7]. The project consists of many tasks, each with its own duration and resource requirements, and there’s a maximum capacity for each resource. Additionally, tasks have "precedence constraints," meaning some tasks must be completed before others can begin. (Note that there are different types of precedence constraints; the "Finish-to-Start" (FS) type is the most common.)

The main goal is to create a schedule with the minimum makespan – a plan that determines the timing of each task to complete the entire project as quickly as possible. Our schedule must respect all constraints, which means we must adhere to precedence requirements (not breaking the order of tasks) and ensure that resource usage doesn’t exceed the available capacities.
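Before diving into the mathematics, a tiny made-up instance may help fix the ingredients in mind. The activity names, durations, capacities and precedence pairs below are purely illustrative; they are not the project we solve later in the article.

# Hypothetical 4-task toy instance: durations p_i, resource use u_{i,k} for a single
# resource, capacity R_k, and Finish-to-Start precedence pairs (i, j).
durations = {'A': 3, 'B': 2, 'C': 4, 'D': 1}
resource_use = {'A': 2, 'B': 1, 'C': 3, 'D': 2}
capacity = 4
precedence = [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]

# One feasible schedule (start times) respecting precedence and capacity:
start = {'A': 0, 'B': 3, 'C': 3, 'D': 7}   # B and C run in parallel, using 1 + 3 <= 4 units
makespan = max(start[t] + durations[t] for t in durations)
print(makespan)  # 8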

There are many different Mixed Integer Linear Programming (MILP) formulations for the RCPSP, but we need to choose one that best adapts to quantum annealing. In my research, we explored 12 of the most commonly used formulations for the RCPSP. We selected the one that consumes the fewest qubits, as qubits are a precious resource in the current stage of quantum computing development, and their availability is a limiting factor. After an extensive analysis, we determined that the most suitable formulation is the time-indexed binary formulation of Pritsker et al. (1969) [8]. For more details on how we reached this conclusion, please refer to the research article.

This is a binary formulation where we have one binary variable x_{i,t} for each activity i and starting date t. This variable equals 1 if activity i starts on day t, and 0 otherwise. The total number of variables is equal to the number of project activities (plus two extra dummy activities – one for the project ‘Start’ and one for the project ‘End’) multiplied by the number of possible start dates t. The number of possible start dates is determined by the project time horizon T, which represents the maximum possible project duration. Each task has a duration p_i and a resource consumption u_{i,k}​ for each resource k. Precedence constraints are provided as a list of tuples E in the form (i,j), indicating that task j cannot start until task i is finished.

Figure 13. Pritsker et al. (1969) mathematical programming formulation for the RCPSP (Image created by the author)

The formulation above might look daunting at first sight, but it is actually quite simple (a plain LaTeX transcription is also provided right after the list below).

  1. The objective function (1) is the sum, over all possible start dates t, of the products t*x_{n+1,t}, where the dummy variables x_{n+1,t} represent the project's end date. We know that exactly one of these variables will equal 1, for a specific t. Therefore, by minimizing this sum, we are effectively minimizing the project duration.
  2. Constraint (2) ensures that each activity i has exactly one start date.
  3. Constraint (3) ensures that if an activity j follows another activity i, then j cannot start before the finish date of i, which equals the start date of i plus its duration p_i.
  4. Finally, Constraint (4) guarantees that for any time period t, the schedule does not exceed the capacity R_k for any resource k.
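For readers who prefer the formulas spelled out, here is my plain LaTeX transcription of the time-indexed formulation exactly as described in the four points above (notation as in the text):

\min \; \sum_{t=0}^{T} t \, x_{n+1,t} \qquad (1)

\text{s.t.} \quad \sum_{t=0}^{T} x_{i,t} = 1 \quad \forall i \qquad (2)

\sum_{t=0}^{T} t \, x_{j,t} \;-\; \sum_{t=0}^{T} t \, x_{i,t} \;\ge\; p_i \quad \forall (i,j) \in E \qquad (3)

\sum_{i} \; \sum_{t'=\max(0,\, t-p_i+1)}^{t} u_{i,k} \, x_{i,t'} \;\le\; R_k \quad \forall k, \; \forall t \qquad (4)

x_{i,t} \in \{0,1\} \quad \forall i, \; \forall t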

In this article, we will solve a project with 21 activities (n=21), including the two dummy ones, and two resources (k=2). The project details are provided in the instance file below.

You can also download the instance as a text file in the .rcp format using the GitHub link provided below.

medium_rcpsp_dwave_cqm/RCPSP_19_instance.rcp at main · ceche1212/medium_rcpsp_dwave_cqm

Our project instance in the .rcp format is shown in the text file below.

  21    2 
  10   10

   0   0   0  11 2 3 4 5 6 10 16 17 18 19 20 
   3   1   1   3 15 14 7 
   3   2   8   3 13 12 11 
   3   5   5   2 12 11 
   2   7   3   1 9 
   2   5   6   1 8 
   2   0   0   1 12 
   3   4   4   1 11 
   9   4   6   1 11 
   4   5   4   1 11 
   3   5   6   1 21
   3   7   5   1 21
   6   9   6   1 21
   4   7   0   1 21
   7   6   3   1 21
   2   5   3   1 21
   1   3   7   1 21
   1   0   6   1 21
   2   6   4   1 21
   5   4   8   1 21
   0   0   0   0

The first line of the .rcp instance is always empty. The next line contains the number of project tasks/activities (including the dummy start and dummy end activities), followed by the number of resources k; in our project, we have two resources. The third line lists the capacity of each resource (10 units for each of our two resources). Starting with the fourth line of the text file, you'll find data specific to each project task/activity. The first column contains the duration p_i of the activity. The next k columns give the amount of each resource u_{i,k} required by activity i. The following column gives the number of immediate successors, and the remaining columns list those successors, i.e., the activities that have activity i as a predecessor. For example, in the instance above, activity 1 has 11 successors: activities 2, 3, 4, 5, 6, 10, 16, 17, 18, 19, and 20 all require activity 1 to be completed before they can start.
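To make the column layout fully explicit, here is a minimal, illustrative parser for this format. We will not use it later – cheche_pm ships its own reader, which we call in the next section – so treat it purely as documentation of what each column means.

# Minimal parser for the .rcp layout described above (illustrative only).
def parse_rcp(path):
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]  # drop blank lines

    n, k = int(rows[0][0]), int(rows[0][1])        # number of activities, number of resources
    capacities = [int(c) for c in rows[1]]         # R_k for each resource

    durations, demands, successors = [], [], []
    for row in rows[2:2 + n]:
        values = [int(v) for v in row]
        durations.append(values[0])                # p_i
        demands.append(values[1:1 + k])            # u_{i,k}
        n_succ = values[1 + k]                     # number of immediate successors
        successors.append(values[2 + k:2 + k + n_succ])
    return n, k, capacities, durations, demands, successors

# Example call (run after 'instance.rcp' has been downloaded in the next section):
# n_act, n_res, caps, durs, dems, succs = parse_rcp('instance.rcp')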

Installing the libraries and setting up the Colab environment

In this section, we’ll walk through the steps to install and import the necessary libraries for solving our project scheduling problem, including setting up your D-Wave environment in Google Colab. We’ll also install [cheche_pm](https://pypi.org/project/cheche-pm/), a Python library I developed to simplify importing project data and constructing the required data structures for our optimization model. If you’re unfamiliar with cheche_pm, I recommend checking out another article where I’ve detailed its functionalities for a deeper understanding of its capabilities.

Efficient Project Scheduling with Python. The Critical Path Method

Let's start by getting your D-Wave Leap API token.

Figure 14. D-Wave Leap site (Snapshot taken by the author)
    1. Once you have created your account on the D-Wave Leap site (Figure 14), log in; on the lower-left side of your dashboard you will find your personal API token (Figure 15).
Figure 15. D-Wave Leap dashboard (Snapshot taken by the author)
    2. Copy your API token and paste it into the Secrets tab of Google Colab under the key name dwave_leap (Figure 16). We will later import this secret into the Python session without revealing it, similar to importing keys from a .env file.
Figure 16. Google Colab Secrets (Snapshot taken by the author)

Now that we have our D-Wave key, let's install and import the required libraries and set up the development environment. We can achieve this with the snippet of code below.

!pip install cheche_pm
!pip install dwave-ocean-sdk

from itertools import product
from dwave.system.samplers import DWaveSampler
from dwave.system.composites import EmbeddingComposite
from dimod import ConstrainedQuadraticModel, CQM, SampleSet
from dwave.system import LeapHybridCQMSampler
from dimod.vartypes import Vartype
from dimod import Binary, quicksum
from cheche_pm import Project 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
from google.colab import userdata

We will now proceed to set up the D-Wave key using the userdata.get method of Google Colab.

endpoint = 'https://cloud.dwavesys.com/sapi'
token = userdata.get('dwave_leap') 

Now that the environment is set up, let's move on to importing the project data and generating the required data structures. We'll use some of the functionalities of cheche_pm for this. I've coded a method that reads RCPSP instances directly from a .rcp file. We'll use it to download the instance data from my GitHub repo, import it into our Colab session, and create a Project instance by reading the .rcp file.

url = 'https://raw.githubusercontent.com/ceche1212/medium_rcpsp_dwave_cqm/main/RCPSP_19_instance.rcp'

response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Write the content to a file
    with open("instance.rcp", "w") as file:
        file.write(response.text)
    print("File downloaded and saved as 'instance.rcp'")
else:
    print("Failed to download the file. Status code:", response.status_code)

project = Project.from_rangen_1_rcp_file('instance.rcp')
project.create_project_dict()

Now that we have the project data, let’s visualize it using a project network diagram. This diagram represents a graph G(V,S) where each node in V is a project activity and each edge in S represents a precedence constraint. We can create this diagram using the .plot_network_diagram function from cheche_pm, which leverages pydot to generate and plot the graph.

project.plot_network_diagram()
Figure 17. Project Network Diagram (Image created by the author)

We can also generate the data structures typically needed to formulate an RCPSP instance as a MILP. The cheche_pm library includes a method called .produce_outputs_for_MILP, which outputs a dictionary containing all the necessary data structures.

n,p,S,U,C,naive_ph = project.produce_outputs_for_MILP().values()

Here, n represents the number of project activities, p is a list of activity durations, S is a list of all precedence constraints, U is a list of resource consumption for each activity and each resource, and C is a list of resource capacities. The naive project duration, or time horizon naive_ph, is the sum of all activity durations, which is equivalent to executing the activities sequentially.
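A quick, optional sanity check on these structures can be useful before building the model; nothing here is required for the rest of the workflow, it simply confirms the horizon and qubit counts discussed below.

# Optional sanity checks on the unpacked structures (purely illustrative).
print("activities (incl. dummies):", len(p))                  # 21 for this instance
print("resource capacities C:", C)                            # [10, 10]
print("naive time horizon:", naive_ph, "== sum(p):", naive_ph == sum(p))  # 65 days
print("qubits needed with the naive horizon:", len(p) * naive_ph)         # 21 * 65 = 1365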

As I mentioned in the introduction of this article, qubits are a precious resource. Our formulation requires a number of variables/qubits equal to the number of activities (including the two dummies – 21 in our case) multiplied by the number of potential start times t, with t ranging from 0 up to the time horizon naive_ph. Currently, the naive_ph for this project instance is 65 days, meaning we're using a very conservative upper bound. If we can find a better upper bound, we can drastically reduce the number of qubits required and make the quantum annealer's job easier. Fortunately, there are many constructive heuristics for project scheduling that can help us establish a tighter upper bound for the project time horizon. The cheche_pm library includes a collection of these heuristics, which we can call as shown in the code snippet below.

heuristic = project.run_all_pl_heuristics(max_resources=C)
new_ph = heuristic['makespan']

The method generates a bar graph showing the different makespans for each heuristic tested by cheche_pm.

Figure 18. Bar graph of the makespan of different heuristics tested (Image created by the author)

Using this method, we were able to reduce the project time horizon from 65 to 40 days. This reduction allows us to save 525 qubits ((65−40)×21), representing a significant decrease in qubit usage, thanks to simple heuristics that run in fractions of a second.

Now that we’ve optimized our project time horizon and gathered all the necessary data, let’s proceed to optimize the project schedule using the quantum annealer.

D-Wave CQM for the RCPSP

D-Wave offers various methods for solving optimization problems. If your problem is already in QUBO form, you can create a Binary Quadratic Model (BQM) and solve it. That’s exactly what I did in my research article. However, in this article, I wanted to demonstrate a simplified process that allows you to use the quantum annealer with a framework and language similar to those used by other commercial tools like Pyomo, CPLEX or Gurobi.

Here, we'll use D-Wave's Constrained Quadratic Model (CQM), which employs a hybrid quantum-classical approach. This approach simplifies the formulation process and enables us to solve larger problem instances. With the current Advantage 6.4 quantum annealer and the CQM solver, we can tackle problems with up to 1 million binary variables. In contrast, D-Wave's purely quantum BQM method is limited to 5,760 available qubits, making it unsuitable for the project size we're addressing in this article. The advantage of using a BQM is that you have total control over the annealing parameters and the annealing schedule; moreover, you can use more advanced techniques such as reverse quantum annealing, which is similar to starting the annealing evolution from a warm start.

If you’re interested in learning how to transform the RCPSP into a QUBO and solve smaller instances of the problem using the purely quantum BQM method, I cover that in detail in my research article. A link to all the BQM code used during the research is provided there, and the article is open access, so feel free to read it and download it.
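For orientation, the sketch below shows what the BQM route would look like in code: dimod provides cqm_to_bqm, which turns each constraint into a quadratic penalty, and the resulting BQM can then be sent to the pure QPU sampler. It assumes the cqm object we build in the next snippet and the token/endpoint defined earlier; for an instance of this size the BQM would not fit on the 5,760-qubit chip, so treat this only as an illustration of the workflow used for smaller instances in the research article.

# Sketch: convert a CQM to a BQM (constraints become penalty terms) and sample it
# directly on the QPU. Only viable for much smaller instances than the one solved here.
from dimod import cqm_to_bqm
from dwave.system import DWaveSampler, EmbeddingComposite

bqm, invert = cqm_to_bqm(cqm)   # 'cqm' is the model built in the next code snippet

qpu_sampler = EmbeddingComposite(DWaveSampler(token=token, endpoint=endpoint))
bqm_sampleset = qpu_sampler.sample(bqm, num_reads=2000, label='RCPSP_BQM_sketch')

best_bqm = invert(bqm_sampleset.first.sample)   # map the sample back to the CQM variables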

To create a CQM model of our project, we simply follow the code below:

(R, J, T) = (range(len(C)), range(len(p)), range(new_ph))

# Create empty model
cqm=ConstrainedQuadraticModel()

# Create binary variables
x = {(i, t): Binary(f'x{i}_{t}') for i in J for t in T}

# Create objective function (1)
objective = quicksum(t*x[(n+1,t)] for t in T)
cqm.set_objective(objective)

# Add constraint (2)
for a in J:
  cqm.add_constraint( quicksum(x[(a,t)] for t in T) == 1 )

# Add constraint (3) Precedence constraints
for (j,s) in S:
  cqm.add_constraint( quicksum( t*x[(s,t)] - t*x[(j,t)] for t in T ) >= p[j])

# Add constraint (4) Resource capacity constraints
for (r, t) in product(R, T):
  r_c = quicksum( U[j][r]*x[(j,t2)] for j in J for t2 in range(max(0, t - p[j] + 1), t + 1))
  cqm.add_constraint( r_c <= C[r] )

print("CQM model created with number of variables = ",len(cqm.variables))

The CQM model has been created with a total of 840 variables. Now, we just need to invoke the hybrid QA sampler to solve it. First, we'll instantiate the sampler, passing in our API token and endpoint (defined at the start of the previous section).

cqm_sampler = LeapHybridCQMSampler(endpoint=endpoint, token=token,)

With the sampler created, the final step is to perform the QA evolution and retrieve the solution samples.

problem_name = 'RCPSP_Medium_Article'

sampleset = cqm_sampler.sample_cqm(cqm,label=problem_name)

annealing_solutions = len(sampleset) # number of solutions inside sampleset

running_time = sampleset.info['run_time'] # running time reported in microseconds

print(f"Annealing stoped after {running_time/1e6:.2f} seconds. {annealing_solutions} obtained")

Solving the instance and extracting the output

After 5.002 seconds, we obtained 58 samples from the hybrid quantum annealing process. Now, let’s extract the best solution and generate the output visualizations for our project.

We can create a dataframe from the results currently stored in the variable sampleset. This dataframe will have three columns: the solution, the energy (which represents the objective function value), and a boolean variable indicating whether the solution obtained from the quantum annealer is valid (feasible) or not.

df_sampleset = sampleset.to_pandas_dataframe(sample_column=True)
df_sampleset = df_sampleset[['sample','energy','is_feasible']]
df_sampleset = df_sampleset.sort_values(by = ['is_feasible','energy'], ascending = [False,True])

Let’s also proceed to extract the best feasible solution of our problem and see if the quantum annealer was able to obtain the optimal project schedule.

try:
  feasible_sampleset = sampleset.filter(lambda row: row.is_feasible)
  best = feasible_sampleset.first
except:
  print('No Feasible solution found')

Z = best.energy #objective value of the best solution
X_out = best.sample #best solution

print(f"Solution obtained with a final project duration of {Z} days")

The quantum annealer successfully produced a feasible project schedule with a makespan of 39 days, which corresponds to the optimal solution for this project instance. Now that we have the solution, let’s proceed to extract it into a schedule dictionary.

SCHEDULE = dict()
for act in project.PROJECT:
  i = project.PROJECT[act]['idx']
  for t in range(new_ph):
    key = f'x{i}_{t}'
    if X_out[key] == 1:
      row = {'ES':t,'EF':t+p[i]}  # start (ES) and finish (EF) dates of the activity
      SCHEDULE[act] = row
      print(act,row)

With the schedule now stored in a dictionary, we can use another functionality of cheche_pm to produce a Gantt chart by calling the .plot_date_gantt method within the library.

schedule_df = project.generate_datetime_schedule(SCHEDULE)

project.plot_date_gantt(schedule_df,plot_type = 'matplotlib')
Figure 19. Gantt Chart of the solution (Image created by the author)

Conclusions

In conclusion, this article has guided you through a detailed process of using D-Wave’s Constrained Quadratic Model (CQM) to schedule a resource-constrained project. I’ve explained what quantum annealing is and how it can be applied to optimization problems, all based on my own research. It’s important to remember that, unlike classical optimization methods such as branch-and-bound, quantum annealing doesn’t guarantee optimality since it’s a heuristic method. However, it’s an extremely powerful heuristic that can be applied to a wide range of optimization problems.

I also compared Gurobi and quantum annealing on the same instance we have worked with in this article, but this time using the naive time horizon for both methods. As shown in Figure 20, Gurobi took 446.9 seconds to produce the optimal solution of 39 days, while quantum annealing took only 5.1 seconds to generate a near-optimal solution of 40 days.

Figure 20. Benchmark of Gurobi and D-Wave's CQM (Image created by the author)

Even though quantum annealing didn’t achieve the optimal solution in this benchmark, the beauty of heuristics lies in their speed and flexibility. Heuristics can be combined with exact methods to reach the optimal solution more quickly. In fact, most commercial solvers run a series of heuristics before employing more rigorous methods. When the solution from quantum annealing was used as a warm start for Gurobi, the entire process only took 83.5 seconds (including the time for QA) to produce the optimal solution 🎉 .
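For completeness, here is a sketch of how such a warm start can be wired up in gurobipy, assuming a model with the same x_{i,t} binaries as our CQM. The model object and variable dictionary below are illustrative placeholders, not the exact benchmark code, which lives in the repository linked at the end.

# Sketch: seed Gurobi's MIP start with the quantum annealing solution X_out.
import gurobipy as gp

m = gp.Model('RCPSP_warm_start')
gx = {(i, t): m.addVar(vtype=gp.GRB.BINARY, name=f'x{i}_{t}') for i in J for t in T}
# ... add the objective and constraints (2)-(4) exactly as in the CQM above ...

for (i, t), var in gx.items():
    var.Start = X_out[f'x{i}_{t}']   # warm start from the QA schedule

m.optimize()                          # Gurobi refines the warm start to proven optimality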

In my opinion, the true beauty of quantum computing isn't just about proving supremacy or clear quantum advantages; it's about how we can combine quantum capabilities with classical methods to create even more powerful solutions to challenging problems. In summary, I think hybrid approaches are the future, and that is where I am currently steering my research.

You can find the entire notebook with the code developed for this article in the link to my GitHub repository just below.

medium_rcpsp_dwave_cqm/RCPSP_DWAVE_CQM_Medium.ipynb at main · ceche1212/medium_rcpsp_dwave_cqm

I sincerely hope that this article has been enjoyable and has served as a helpful resource for anyone trying to learn more about quantum computing, and more specifically about quantum annealing. If so, I'd love to hear your thoughts! Please feel free to leave a comment or show your appreciation with a clap 👏 . And if you're interested in staying updated on my latest articles, consider following me on Medium. Your support and feedback are what drive me to keep exploring and sharing. Thank you for taking the time to read, and stay tuned for more insights in my next article!

References

  • [1] Pérez Armas, L. F., Creemers, S., & Deleplanque, S. (2024). Solving the resource constrained project scheduling problem with quantum annealing. Scientific Reports, 14(1), 16784. https://doi.org/10.1038/s41598-024-67168-6
  • [2] Albash, T., & Lidar, D. A. (2018). Adiabatic quantum computation. Reviews of Modern Physics, 90(1), 015002.
  • [3] Kadowaki, T., & Nishimori, H. (1998). Quantum annealing in the transverse Ising model. Physical Review E, 58(5), 5355.
  • [4] Kirkpatrick, S., Gelatt Jr, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. science, 220(4598), 671–680.
  • [5] Istrail, S. (2000, May). Statistical mechanics, three-dimensionality and NP-completeness: I. Universality of intractability for the partition function of the Ising model across non-planar surfaces. In Proceedings of the thirty-second annual ACM symposium on Theory of Computing (pp. 87–96).
  • [6] Glover, F., Kochenberger, G., & Du, Y. (2019). Quantum Bridge Analytics I: a tutorial on formulating and using QUBO models. 4or, 17(4), 335–371.
  • [7] Blazewicz, J., Lenstra, J. K., & Kan, A. R. (1983). Scheduling subject to resource constraints: classification and complexity. Discrete applied mathematics, 5(1), 11–24.
  • [8] Pritsker, A. A. B., Watters, L. J., & Wolfe, P. M. (1969). Multiproject scheduling with limited resources: A zero-one programming approach. Management Science, 16(1), 93–108.
  • [9] Martinis, J.M. Saving superconducting quantum processors from decay and correlated errors generated by gamma and cosmic rays. npj Quantum Inf 7, 90 (2021). https://doi.org/10.1038/s41534-021-00431-0

The post Solving a Constrained Project Scheduling Problem with Quantum Annealing appeared first on Towards Data Science.
