Whilst software engineering practices dictate that issues are created to adapt to changing client requirements, we require practices that are able to adapt to changing requirements dictated by our own research.
Contents
- You’ve probably tried Agile…
- Why Agile doesn’t work for Data Science…
- The Data Science Lifecycle Process (DSLP)
- The Five Steps of DSLP
- Example Project: Detecting Credit Card Fraud
- A New Project – Create an Ask Issue
- Exploring the data – Data Issue
- This layout leads to the conventional Agile project
- The Kanban board that makes sense for Data Science
- Conclusion
You’ve probably tried Agile…
Let’s face it, we’ve all tried using Agile methodologies at some point to manage our data science projects.
And I’m sure that you and your team have seen it slowly fall apart – everyone says the same thing they said in their previous standup meetings, project Kanban boards never get maintained, sprints become meaningless.
You end up feeling like the whole thing is pointless, but you can’t quite put your finger on why it doesn’t work.
Because of this, you don’t know how to improve it or what to change.
Why Agile doesn’t work for Data Science…

The Agile framework was built for software engineering, where there is a product that needs to be delivered at the end.
With this end-goal in mind, the Agile framework tries to keep the project aligned with end-user requirements, which can change over time. It aims to keep the feedback loop between developers and end-users tight, so that the project can remain ‘agile’ in the face of change.
Meanwhile, developers maintain a high level of communication between each other to quickly identify the changing requirements, identify blockers, and transfer knowledge.
This is why you have sprints and stand-ups (or scrums) whilst utilising Kanban boards to keep track of your work.
However, this framework quickly falls apart for a Data Science project.
Why?
Because Data Science is fundamentally an R&D project, there is no concept of an end-product that you are trying to build at the start. Research is required to determine what the end-product might look like.
Only after the R&D is finished, and you know what data you need, what preprocessing/feature engineering is required, and what model you are going to use, do you finally know what you are going to build.
This means that the Agile framework only becomes applicable when you are trying to productionise your model, which, for a Data Science project, is the very last step.
Then what is the alternative?

The Data Science Lifecycle Process (DSLP)
After researching the field of Data Science project management, I came across the Data Science Lifecycle Process (DSLP), which condenses the key insights offered by other resources into a single framework that can be incorporated directly into GitHub Projects or any other Kanban-based project management tool.
GitHub – dslp/dslp: The Data Science Lifecycle Process is a process for taking data science teams…
I have used this for the project management of my own Data Science team, and it has proven to be the best improvement to our workflow to date.
What benefits did we see? Just to name a few,
- Improved documentation of the project throughout its lifecycle, where every design choice and piece of research was documented in a single place,
- Which facilitated seamless transfer of knowledge and greatly reduced friction during handovers,
- And improved research collaboration between data scientists.
- Better prioritisation of projects and reduced wasted hours on ill-defined projects,
- Encouraged a task-based workflow that fits seamlessly to existing Kanban-board workflows with minimal required changes,
- Which ended up facilitating the iterative approach that Agile strives for, but for data science, not software engineering.
The above is not an exhaustive list, just the points that come immediately to mind. It sounds too good to be true, but read on to see how everything above (and more) is achieved.
The Five Steps of DSLP
We will use my template GitHub Project as an example to illustrate how DSLP can be used. This can be found here:

Any related code or content that I refer to from this point will all be available in this project template.
All images used in this article have been generated by the author.
DSLP consists of five project lifecycle steps: Ask, Data, Explore, Experiment, and Model. At each step, you would raise a corresponding GitHub Issue in the GitHub Project for your Data Science team.

Below are high-level summaries, provided by the DSLP, of each step and what their corresponding Issues involve. We will delve into the details of how they actually work, alongside the GitHub Issue templates and GitHub-based workflows I used with my team, in the next section with a realistic example.
Ask
Ask issues are for capturing, scoping, and refining the value-based problems your team is trying to solve. They serve as a live definition of work for your projects and will be the anchor for coordinating the rest of the work you do. It provides a summary of all the work that has been done, and the work in progress.
This issue becomes the first port of call for yourself or anyone who needs information about the project, and any issues created during other steps should be linked to this issue (we will see how easy this is later in the example).
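For a quick preview of how that linking looks in practice, GitHub task lists do most of the heavy lifting: referencing child issues by number in the Ask issue body both links them and tracks their completion. The issue numbers and titles below are hypothetical:

```markdown
## Work items
- [ ] #12 Data: Acquire ground-truth fraud labels
- [ ] #13 Explore: EDA on transaction features
- [ ] #14 Experiment: Baseline fraud detection model
```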
Data
Data issues are for collaborating on gathering and creating the datasets needed to solve a problem.
This issue is relevant for sourcing your data and creating your input datasets.
Explore
Explore issues give us a way to provide quick summaries and TLDRs for the exploratory work we do. The goal of explore issues is to increase our understanding of the data and to share those insights with others.
This issue type is akin to the exploratory data analysis that you would do at the start of a Data Science project, and facilitates good documentation of what was explored and collaboration with other data scientists.
Experiment
Experiment issues are for tracking and collaborating on the various approaches taken to solve a problem and for capturing the results.
Once you have an understanding of your data and how it relates to your problem, Experiment issues are used for the modelling you would do. Perhaps you frame your problem as anomaly detection? Perhaps you consider different types of models? Or maybe you try different hyperparameters?
Depending on how you work, each of these can be a separate Experiment issue or part of one big issue.
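As a concrete (and entirely illustrative) sketch of what one such Experiment might contain, here is the anomaly detection framing in Python; the dataset, the column names, and the choice of scikit-learn’s IsolationForest are my assumptions, not something prescribed by DSLP:

```python
# Experiment sketch: frame credit card fraud detection as anomaly detection.
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical dataset produced by the work done in the Data issue.
transactions = pd.read_csv("transactions.csv")
features = transactions[["amount", "merchant_risk_score", "txns_last_24h"]]

# IsolationForest labels inliers as 1 and outliers as -1; 'contamination'
# is a guess at the fraud rate, and is itself a parameter to experiment with.
model = IsolationForest(contamination=0.01, random_state=42)
transactions["flagged"] = model.fit_predict(features) == -1

print(f"Flagged {transactions['flagged'].mean():.2%} of transactions")
```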
Model
Model issues are for working to productionise your successful experiments so that you can deploy them. This will often involve writing tests, creating pipelines, parameterising runs, and adding additional monitoring and logging.
As DSLP explains, once your experimentation phase is complete and you know what you want to productionise, any work done to productionise the model should be part of the Model issue.
More details about each step can be found in the DSLP repository.
But for now, let’s dive into how I actually got DSLP working for my Data Science team using GitHub Projects, through a fictional example.
Example Project: Detecting Credit Card Fraud
Let us assume you are a data scientist working for a bank, and you are approached by an SME (subject matter expert) named John Doe from the credit card fraud operations team.
They let you know that credit card fraud has become a business priority, and the bank is required to improve its processes for detecting such cases. This comes off the back of regulatory feedback that identified the bank as underperforming other banks in credit card fraud detection.

They approach you for your opinion on whether the Data Science team could improve the bank’s fraud detection rates using a model, in light of recent successes the bank has seen with previous models applied to other domains.
Obviously, at the outset you have no idea whether a model can be built – you don’t know what data is available, or whether the quality of the labels is good enough (or if they even exist).
But you do know that your team has the capacity to initiate a new project and research this. So you arrange a follow-up call with the SME and some others in the credit card fraud prevention team to learn about the problems they face and get an understanding of what the project would entail.
A New Project – Create an Ask Issue
You get back to your desk with the new project in your head. The first thing you should do is create a new Ask issue, using an Ask issue template (found here; further information about each section can be found as comments in the markdown template).
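For reference, a GitHub issue template is just a markdown file with YAML front matter placed under .github/ISSUE_TEMPLATE/ in your repository. The sketch below uses the section names from this article’s template; the exact template linked above may differ in its details:

```markdown
---
name: Ask
about: Capture, scope, and refine a value-based problem
title: "[Ask] "
labels: ask
---

## Problem Statement
<!-- A high-level description of what you're trying to solve and why -->

## Desired Outcome
<!-- What would a successful outcome look like? -->

## Current State
<!-- How is the problem handled today? -->

## Success Criteria
<!-- How will we know this Ask is done? -->
```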

Ask issues are for capturing, scoping, and refining the value-based problems your team is trying to solve. They serve as a live definition of work for your projects and will be the anchor for coordinating the rest of the work you do.
By having the definition of work be in an issue, data science teams can collaborate with their business partners and domain experts to refine and rescope the issue as they learn more about the problem.
You should link to the other issues inside of the Ask issue. This will give people context around why a particular issue is being worked on.
As your understanding of the problem evolves you should update your ask issue accordingly. In order to create clarity, you should be as specific as possible and resolve ambiguities by updating the Ask.
So, we are at the very beginning of our project where we have little to no information yet. But, we can update the problem statement to the best of our knowledge.
The problem statement is a high level description of what you’re trying to solve and why. It should be clear what value solving this problem will create and should also set a clear scope for what is and isn’t included as part of the problem.
We update the problem statement as so:
"There has been a shift in business priorities and credit card fraud prevention has become a priority.
Upon speaking to John Doe about this issue, we are in need of better performing controls than what we have traditionally had in place."
At this stage, we have no concrete problem to work with off the back of one brief chat, so we will leave it as is for now. But to make this project workable, we need to follow up with John and other SMEs to get a better idea of what the ask is and what a successful outcome might look like in their eyes.
So, you write in the comments section of your Ask issue the next steps you need to take to get this project moving.

The whole point of the comment section is to keep a log of all action points and the design choices that have been made for your project.
It serves as an audit log that you and others can reference to get detailed information about the project at each step of the way, whilst the templated Ask issue at the top serves as a high-level summary of the project’s important points – the first port of call for any queries or questions related to your project.
So, once you have set up the meeting and talked to John and his colleagues to address this action point, you can update the comment or add a new comment with the meeting notes.

This meeting has given us enough information to fill out the remaining sub-sections (Desired Outcome, Current State, Success Criteria).
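The screenshots aren’t reproduced here, but based on the fictional brief, the filled-in sections might read something like the following; the specifics are invented purely for illustration:

```markdown
## Desired Outcome
A model that flags potentially fraudulent credit card transactions for
review by the fraud operations team.

## Current State
Fraud is caught today through manual review and static rules, which
regulatory feedback identified as underperforming relative to peer banks.

## Success Criteria
A measurable improvement in fraud detection rates over the existing
rule-based controls, evaluated on historical labelled cases.
```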


Keep in mind that as the project progresses and it becomes clearer what you can and can’t do, the above sections (and any others in the template) should be updated to reflect the most up-to-date understanding of the project.
Exploring the data – Data Issue
So, after our meeting with John and co., we have an action point:
The Data Science team will make a start on the available data to verify whether a model is even possible.
The first step would be to search for any ground-truth data that will serve as labels for our task of credit card fraud detection.
From the log in the comments section, we create an issue off the back of this action point:

and then fill in the details for a new issue, like below:

and now this issue will appear on your Project board. Drag and drop it into the Data column, and your board should look like below:

So, this Data issue you have created will be where you log everything related to obtaining the ground-truth data, in addition to any design choices you made and any limitations of the data.
Any code you write to obtain the data should be created via this issue, by creating a branch from the issue directly:

The greatest thing about this is that the branch you create can be associated with any repository that has Issues enabled. This is useful if you have separate repositories for different parts of your data science stack, as you can keep everything related to your project in one board.
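When you create a branch from an issue this way, GitHub also displays the commands to check it out locally, along these lines (the branch name below is hypothetical; GitHub derives it from the issue number and title):

```bash
git fetch origin
git checkout 12-acquire-ground-truth-fraud-labels
```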

Finally, once you complete your coding and raise a PR for it, it is automatically linked to your Data issue.
What we have so far…

By working in this manner, we have achieved two things:
- All design choices and implementation details are linked together into the relevant Ask issue so everything related to the project can be found through one place.
- There is no scope for loss of information, whilst information is summarised in a hierarchical format – high-level details in the Ask issue, low-level details in the Data issue, implementation details in the Pull Requests – which makes it easier to collaborate or hand over.
In this manner, we can continue to build out our project:

We’ve covered Data and Explore issues, which are equivalent to the data acquisition and EDA parts of Data Science.
Experiment is the actual modelling, and would involve the Jupyter Notebooks you create to try out different approaches and different models.
Model is the final step for everything to do with productionisation – unit testing, code refactoring, model monitoring etc.
These work in the same way as we have covered above. For more details on these steps, check out the DSLP page or the example project board. But for now, we will skip to the next section.
This layout leads to the conventional Agile project
So, let’s jump ahead, perhaps two weeks into the future of our example project.
So far, you have carried out the necessary work to acquire the data you need, performed some exploration of the data and potential features, and attempted to build a first-iteration model.

The project will most likely have multiple issues open at the same time.
- Explore issues or Experiment issues may be blocked by a Data issue as you realise you need some different data to be made available before you proceed.
- Some other issues may require code review, and you need to chase reviewers to do this for you.
- Some issues you may have forgotten about entirely, and they haven’t even been started yet.
We need a way to keep track of all these issues, something familiar….

The Kanban board that makes sense for Data Science
What we now do is go into the project settings and create a Progress field that we can assign to our issues.

Like below, I have created four labels using the Single select field type: To Do, In Progress, Blocked, and Waiting for Review. Of course, this is personalised to me, and you can decide what you want to have.
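If you prefer the command line, recent versions of the GitHub CLI expose project management commands too; the invocation below is a sketch based on my assumptions, so check `gh project field-create --help` for your version (the project number is hypothetical):

```bash
# Create a single-select "Progress" field on project number 1
gh project field-create 1 --owner "@me" --name "Progress" \
  --data-type "SINGLE_SELECT" \
  --single-select-options "To Do,In Progress,Blocked,Waiting for Review"
```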

Now, going back to the project board, I can create a new view called Kanban Board and, using the dropdown button, configure the columns to be based on the Progress field that we just created (highlighted in red).
Assign the Progress field to each of your issues in your project,

and voila!
You have yourself a Kanban board, which can be used for meetings to organise and track progress of your Data Science projects!

Why does this Kanban Board work, whilst the traditional one fails?
So, we’ve arrived full circle back to using a Kanban board. You may be thinking:
Wait, what was the whole point of this entire article if we are using Kanban boards anyways?
Well, the lineage from which our Kanban board stems is different.
Data science projects are essentially R&D work, which requires a different flavour of rigour from software engineering projects.
Data science issues are determined by the research, not by the client/end-user.
Whilst software engineering practices dictate that issues are created to adapt to changing client requirements, we require practices that are able to adapt to changing requirements dictated by our own research!
That is why we have two different boards:
- One board (the Project board or the Ask board) that highlights the different research and development work being done for a single project (or Ask),
- and another board (the Kanban board) for keeping track of the progress of the R&D tasks associated with an Ask.
That is what makes this framework different, and what makes it actually work for Data Science.
By using these two different boards, you are able to achieve the following:
- Have a full summary of all R&D steps taken for a project in one Ask issue, which links all related issues and PRs into one. This makes documentation, audit logging, and knowledge sharing a piece of cake. This is especially important for high-stakes industries such as healthcare or finance.
- Maintain the progress of each Ask using the Kanban board, which is a way of working that most people are already accustomed to. This allows you to incorporate other Agile methodologies such as stand-ups and sprints around a Data Science oriented Kanban board.
Conclusion
I hope that you found this article useful.
This framework can be used by data scientists of any level:
- be it a manager trying to find a way of working that works for their team,
- or a junior data scientist, trying to find a way to organise their work.
It also doesn’t have to be in GitHub Projects. Any tool that supports a Kanban-board-based workflow is compatible, and most tools allow for integration with GitHub these days, so everything can be tied together.
I haven’t covered every single detail of the DSLP framework, only the important things you need to get started with using it.
I encourage you to read the framework yourself, as there are some other bits and bobs they recommend that I decided not to use, but perhaps they may be useful for you – let me know in the comments.
And a final comment about why such a framework is becoming increasingly important.
I’ve mentioned that Data Science is an R&D-based profession, but the truth is that the industry as a whole is fast reaching a point where pure research is just not enough – we need to be able to deliver concrete value.
Meanwhile, regulatory scrutiny will only increase with time as ML models proliferate throughout industries, starting with high-stakes industries such as Finance and Healthcare. This trend indicates a need for better project management, documentation and audit logging of the models we implement.
Let me know your thoughts in the comments, and if you found this article useful, please give me a clap (you can do up to 50 per person!), and feel free to share it with your colleagues.
Follow me for more practical Data Science content.