A Little More Conversation, A Little Less Action — A Case Against Premature Data Integration

This article shows how getting people to talk together helps avoid the trap of premature data integration in ML projects, optimising value for money.
When I talk to [large] organisations that have not yet properly started with Data Science (DS) and Machine Learning (ML), they often tell me that they have to run a data integration project first, because "…all the data is scattered across the organisation, hidden in silos and packed away in odd formats on obscure servers run by different departments."

While it may be true that the data is hard to get at, running a large data integration project before embarking on the ML part is easily a bad idea. This is because you integrate data without knowing its use — the chances that the data will be fit for purpose in some future ML use case are slim, at best.

In this article, I discuss some of the most important drivers and pitfalls for this kind of integration project, and instead suggest an approach that focuses on optimising value for money in the integration efforts. The short answer to the challenge is [spoiler alert…] to integrate data on a use-case-per-use-case basis, working backwards from the use case to identify exactly the data you need.

A desire for clean and tidy data

It is easy to understand the urge to do data integration before starting on the data science and machine learning challenges. Below, I list four drivers that I often meet. The list is not exhaustive, but it covers the most important motivations as I see them. We will then go through each driver, discussing its merits, pitfalls and alternatives.

  1. Cracking out AI/ML use cases is difficult, and even more so if you don’t know what data is available, and of which quality.
  2. Snooping out hidden-away data and integrating it into a platform seems like a more concrete and manageable problem to solve.
  3. Many organisations have a culture of not sharing data, and focusing on data sharing and integration first helps to change this.
  4. From history, we know that many ML projects grind to a halt due to data access issues, and tackling the organisational, political and technical challenges prior to the ML project may help remove these barriers.

There are of course other drivers for data integration projects, such as "single source of truth", "Customer 360", FOMO, and the basic urge to "do something now!". While these are important drivers for data integration initiatives, I don't see them as key for ML projects, and therefore will not discuss them any further in this post.

1. Cracking out AI/ML use cases is difficult,

… and even more so if you don't know what data is available, and of which quality. This is, in fact, a real Catch-22 problem: you can't do machine learning without the right data in place, but if you don't know what data you have, identifying the potential of machine learning is essentially impossible too. Indeed, it is one of the main challenges in getting started with machine learning in the first place [see "Nobody puts AI in a corner!" for more on that]. But the problem is not solved most effectively by running an initial data discovery and integration project. It is better solved by an awesome methodology that is well proven in use and applies to many different problem areas. It is called talking together. Since this, to a large extent, is the answer to several of the driving urges, we shall spend a few lines on this topic now.

The value of having people talk to each other cannot be overstated. It is the only way to make a team work, and to make teams across an organisation work together. It is also a very efficient carrier of information about intricate details regarding data, products, services or other contraptions that are made by one team but used by someone else. Compare "Talking Together" to its antithesis in this context: Produce Comprehensive Documentation. Producing self-contained documentation is difficult and expensive. For a dataset to be usable by a third party solely by consulting the documentation, the documentation has to be complete. It must document the full context in which the data must be seen: How was the data captured? What is the generating process? What transformations were applied to bring the data to its current form? What is the interpretation of the different fields/columns, and how do they relate? What are the data types and value ranges, and how should one deal with null values? Are there access or usage restrictions on the data? Privacy concerns? The list goes on and on. And as the dataset changes, the documentation must change too.

Now, if the data is an independent, commercial data product that you provide to customers, comprehensive documentation may be the way to go. If you are OpenWeatherMap, you want your weather data APIs to be well documented — these are true data products, and OpenWeatherMap has built a business out of serving real-time and historical weather data through those APIs. Also, if you are a large organisation and a team finds that it spends so much time talking to people that comprehensive documentation would indeed pay off — then do that. But most internal data products have one or two internal consumers to begin with, and then comprehensive documentation doesn't pay off.

On a general note, Talking Together is actually a key factor for succeeding with a transition to AI and Machine Learning altogether, as I write about in "Nobody puts AI in a corner!". And it is a cornerstone of agile software development. Remember the Agile Manifesto? It values individuals and interactions over processes and tools, and working software over comprehensive documentation. So there you have it. Talk Together.

Also, documentation not only incurs a cost; it also risks raising the barrier to people talking together ("read the $#@!!?% documentation").

Now, just to be clear on one thing: I am not against documentation. Documentation is super important. But, as we discuss in the next section, don’t waste time on writing documentation that is not needed.

2. Snooping out hidden-away data and integrating the data into a platform seems like a much more concrete and manageable problem to solve.

Yes, it is. However, the downside of doing this before identifying the ML use case is that you only solve the "integrate data into a platform" problem. You don't solve the "gather useful data for the machine learning use case" problem, which is what you actually want to do. This is another flip side of the Catch-22 from the previous section: if you don't know the ML use case, you don't know what data you need to integrate. Also, integrating data for its own sake, without the data users being part of the team, requires very good documentation, which we have already covered.

To look deeper into why data integration without the ML use case in view is premature, we can look at how [successful] machine learning projects are run. At a high level, the output of a machine learning project is a kind of oracle (the algorithm) that answers questions for you: "What product should we recommend for this user?", or "When is this motor due for maintenance?". If we stick with the latter, the algorithm would be a function mapping the motor in question to a date, namely the due date for maintenance. If this service is provided through an API, the input could be {"motor-id": 42} and the output could be {"latest maintenance": "March 9th 2026"}. Now, this prediction is done by some "system", so a richer picture of the solution could be something along the lines of the illustration below.

System drawing of a service doing predictive maintenance forecasts for a motor by estimating a latest maintenance date. Image by the author.

The key here is that the motor-id is used to obtain further information about that motor from the data mesh in order to make a robust prediction. The required data is illustrated by the feature vector in the illustration. And exactly which data you need in order to make that prediction is difficult to know before the ML project has started. Indeed, the very precipice on which every ML project balances is whether the project succeeds in figuring out exactly what information is required to answer the question well. And this is done by trial and error in the course of the ML project (we call it hypothesis testing and feature extraction and experiments and other fancy things, but it is just structured trial and error).
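
To make the shape of such a service concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the feature store, the model, and the field names stand in for whatever your platform provides. The point is only that the motor-id is the key used to assemble the feature vector behind the scenes.

```python
from datetime import date, timedelta

# Hypothetical stand-ins for whatever the data platform provides:
# a feature store keyed on motor-id, and a trained regression model.
FEATURE_STORE = {42: [0.8, 1200.0, 3.1]}  # e.g. vibration, run-hours, temp drift


class DummyModel:
    """Pretends to map a feature vector to days-until-maintenance-due."""

    def predict(self, feature_vectors):
        return [270 for _ in feature_vectors]


def predict_latest_maintenance(motor_id: int, model) -> dict:
    # The request carries only the motor-id; all other information the
    # model needs is looked up in the data platform.
    features = FEATURE_STORE[motor_id]
    days_until_due = model.predict([features])[0]
    due_date = date.today() + timedelta(days=days_until_due)
    return {"latest maintenance": due_date.isoformat()}


print(predict_latest_maintenance(42, DummyModel()))
# e.g. {'latest maintenance': '2026-...'}
```

Which entries belong in that feature vector is exactly what the experiments are meant to find out.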

If you integrate your motor data into the platform without these experiments, how are you going to know what data to integrate? Surely, you could integrate everything, and keep updating the platform with all the data (and documentation) until the end of time. But most likely, only a small amount of that data is required to solve the prediction problem. Unused data is waste: both the effort invested in integrating and documenting it, and the storage and maintenance costs for all time to come. According to the Pareto rule, you can expect roughly 20% of the data to provide 80% of the value. But it is hard to know which 20% this is before you know the ML use case, and before you have run the experiments.

This is also a caution against just "storing data for the sake of it". I've seen many data-hoarding initiatives where decrees were passed down from top management about saving away all the data possible, because data is the new oil/gold/cash/currency/etc. For a concrete example: a few years back I met an old colleague, a product owner in the mechanical industry, whose company had started collecting all sorts of time series data about their machinery some time earlier. One day, they came up with a killer ML use case where they wanted to take advantage of how distributed events across the industrial plant were related. But alas, when they looked at their time series data, they realised that the distributed machine instances did not have sufficiently synchronised clocks, leading to non-correlatable timestamps, so the planned cross-correlation between time series was not feasible after all. A bummer, that one, but a classic example of what happens when you don't know the use case you are gathering data for.

3. Many organisations have a culture of not sharing data, and focusing on data sharing and integration first helps to change this culture.

The first part of this sentence is true; there is no doubt that many good initiatives are blocked by cultural issues in the organisation: power struggles, data ownership, reluctance to share, siloing, and so on. The question is whether an organisation-wide data integration effort is going to change this. If someone is reluctant to share their data, a decree from above stating that if you share your data, the world will become a better place is probably too abstract to change that attitude.

However, if you interact with this group, include them in the work, and show them how their data can help the organisation improve, you are much more likely to win their hearts. Because attitudes are about feelings, and the best way to deal with differences of this kind is (believe it or not) to talk together. The team providing the data has a need to shine, too. And if they are not invited into the project, they will feel forgotten and ignored when honour and glory rain down on the ML/product team that delivered some new and fancy solution to a long-standing problem.

Remember that the data feeding the ML algorithms is part of the product stack — if you don't include the data-owning team in the development, you are not running full stack. (An important reason why full stack teams are better than many alternatives is that, inside teams, people talk together. And bringing all the players in the value chain into the [full stack] team gets them talking together.)

I have been in a number of organisations, and many times I have run into collaboration problems due to cultural differences of this kind. Never have I seen such barriers drop due to a decree from the C-suite. Middle management may buy into it, but the rank-and-file employees mostly just give it a scornful look and carry on as before. However, I have been on many teams where we solved this problem by inviting the other party into the fold, and talking about it, together.

4. From history, we know that many DS/ML projects grind to a halt due to data access issues, and tackling the organisational, political and technical challenges prior to the ML project may help remove these barriers.

While the section on cultural change is about human behaviour, this one is about the technical state of affairs. When data is integrated into the platform, it should be safely stored and easy to obtain and use in the right way. For a large organisation, having a strategy and policies for data integration is key. But there is a difference between rigging an infrastructure for data integration, together with a minimum of processes around that infrastructure, and scavenging through the enterprise to integrate a shitload of data. Yes, you need the platform and the policies, but you don't integrate data before you know that you need it. And when you do this step by step, you can benefit from iterative development of the data platform too.

A basic platform infrastructure should also come with the necessary policies to ensure compliance with regulations, privacy and other concerns: the concerns that come with being an organisation that uses machine learning and artificial intelligence to make decisions, trained on data that may or may not have been generated by individuals who may or may not have given their consent to different uses of that data.

But to circle back to the first driver, about not knowing what data the ML projects may get their hands on — you still need something to help people navigate the data residing in various parts of the organisation. And if we are not to run an integration project first, what do we do? Establish a catalogue where departments and teams are rewarded for adding a block of text about the data they are sitting on: a brief description of what kind of data it is, what it is about, who the stewards are, and perhaps a guess at what it can be used for. Put this into a text database or similar structure, and make it searchable. Or, even better, let the database back an AI assistant that allows you to do proper semantic searches through the descriptions of the datasets. As time (and projects) pass by, the catalogue can be extended with further information and documentation as data is integrated into the platform and documentation is created. And if someone queries a department about their dataset, you may just as well shove both the question and the answer into the catalogue database too.

Such a database, containing mostly free text, is a much cheaper alternative to a fully integrated data platform with comprehensive documentation. You just need the different data-owning teams and departments to dump some of their documentation into the database. They may even use generative AI to produce the documentation (allowing them to check off that OKR too 🙉🙈🙊).
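
As a sketch of how little it takes to get started, here is a toy version of such a catalogue search, using TF-IDF from scikit-learn as a stand-in for a proper embedding model. The catalogue entries are invented; the shape, not the content, is the point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example entries: each team dumps a paragraph about the data
# it is sitting on, plus a steward contact.
catalogue = [
    "Time series of vibration and temperature from plant machinery. Steward: team Mech.",
    "Customer support tickets with free-text descriptions and resolution codes. Steward: team CRM.",
    "Daily production volumes per site, 2015 to present. Steward: team Ops.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(catalogue)


def search(query: str, top_k: int = 2) -> list[str]:
    # Rank catalogue entries by cosine similarity to the query.
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    ranked = sorted(zip(scores, catalogue), reverse=True)[:top_k]
    return [entry for score, entry in ranked if score > 0]


print(search("vibration data from machinery"))
```

Swapping the vectoriser for sentence embeddings would give the semantic search mentioned above, without changing the overall structure.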

5. Summing up

To sum up: in the context of ML projects, data integration efforts should be attacked as follows:

  1. Establish a data platform/data mesh strategy, together with the minimally required infrastructure and policies.
  2. Create a catalogue of dataset descriptions that can be queried by using free text search, as a low-cost data discovery tool. Incentivise the different groups to populate the database through use of KPIs or other mechanisms.
  3. Integrate data into the platform or mesh on a use-case-per-use-case basis, working backwards from the use case and the ML experiments, making sure the integrated data is both necessary and sufficient for its intended use.
  4. Solve cultural, cross-departmental (or silo) barriers by including the relevant resources in the ML project's full stack team, and…
  5. Talk Together

Good luck!

Regards
-daniel-

Nobody Puts AI in a Corner!

Two short anecdotes about transformations, and what it takes if you want to become "AI-enabled".

Generated by ChatGPT

Many product companies I talk to struggle to understand what "transformation to AI" means to them. In this post, I share some insights into what it means to be an AI-enabled business, and what you can do to get there. Not by enumerating things you have to do, but through two anecdotes. The first is about digitalisation – what it means for a non-digital company to transform into a digital company. This is because the transition to AI follows the same kind of path; it is a "same same but different" transformation. The second story is about why so many product companies have failed in their investments in AI and Data Science over the last few years: because they put AI in a corner.

But before we go there, keep in mind that becoming AI-enabled is a transformation, or a journey. And to embark upon a journey and successfully ride along to its destination, you are better off knowing where you are going. So: what does it mean to be "AI-enabled"?

To be AI-enabled is to be able to use AI technology to seize an opportunity, or to obtain a competitive advantage, that you otherwise could not.

So, after finishing the transformation, how can you know whether you have succeeded? You ask yourself the question:

What can we do now that we could not do before? Can we take advantage of an opportunity now that we could not before?

Or, more to the point: Will we take advantage of an opportunity now that we could not before?

There is nothing AI-specific about this question. It is valid for any transformation an organisation takes upon itself in order to acquire new capabilities. And, for this very reason, there is a lot to learn from other transformations, if you wish to transition to AI.

Anecdote 1: A tale of digitalisation

Generated by ChatGPT

Over the last decades, there has been a tremendous shift in some large businesses, referred to as digitalisation. This is the process where a company transforms from using IT as a tool in its everyday work to using IT as a strategic asset to achieve competitive advantage. A few years back, I spent some time in the Oil & Gas sector, participating in large digitalisation efforts. And if you have not worked in O&G, you may be surprised to learn that this huge economy is still, to a large extent, not digital. Of course, the sector has used computers since they came about, but as tools: CAD tools for design, logistics systems for project and production planning, CRM systems for managing customer relations, and so on. But the competitive power of one company over another has been in its employees' knowledge about steel and pipes and machinery, about how fluids flow through pipes, about installation of heavy equipment under rough conditions, and many other things of this trade. Computers have been perceived as tools to get the job done, and IT has been considered an expense to be minimised. Digitalisation is the transformation that aims to change that mindset.

To enable IT as leverage in competition, the business must move from thinking about IT as an expense to thinking of IT as an investment opportunity. By investing in your own IT, you can create tools and products that competitors do not have, and that give you a competitive advantage.

But investing in in-house software development is expensive, so to pin down the right investments to shift competition in your favour, you need all the engineers, the steel and machinery specialists, to start thinking about which problems and challenges you can solve with computers in a manner that serves this cause. This is because the knowledge about how to improve your products and services is located in the heads of the employees: the sales people talking to the customers, the marketing people with the market trends at their fingertips, the product people designing and manufacturing the assets, and the engineers designing, making and testing the final product artefacts. These humans must internalise the idea of using computer technology to improve the business as a whole, and then do it. That is the goal of digitalisation.

But you already knew this, right? So why bother reiterating?

Because a transformation to AI is the exact same story over again; you just have to replace "digital transformation" with "transformation to AI". Hence, there is much to learn from digitalisation programs. And if you are lucky, you already understand what it means to be a digital company, so you actually know what a transformation to digital entails.

Anecdote 2: The three eras of Data Science

Generated by ChatGPT

The history of industrial AI and Data Science is short, starting back in 2010–2012. While there is some learning to be had from this history, I'll say it right away: there is still no silver bullet for going AI with a bang. But, as an industry, we are getting better at it. I think of this history as playing out over three distinct eras, demarcated by how companies typically approached AI when launching their first AI initiatives.

In the first era, companies that wanted to use AI and ML invested heavily in large data infrastructures, hired a bunch of data scientists, placed them in a room, and waited for magic to emanate. But nothing happened, and the infrastructure and the people were really expensive, so the method was soon abandoned. The angle of attack was inspired by large successes such as Twitter, Facebook, Netflix, and Google, but the scale of those operations doesn't apply to most companies. Lesson learned.

In the second era, having learned from the first, the AI advisors said that you should start by identifying the killer AI app in your domain, hire a small team of Data Scientists, make an MVP, and iterate from there. This would give you a high-value project and a star example with which you could demonstrate the magnificence of AI to the entire company. Everybody would be flabbergasted, see the light, and the AI transformation would be complete. So companies hired a small team of data scientists, placed them in a corner, and waited for magic to emanate. But nothing happened.

And the reason why magic does not happen in this setting is that the data scientists and AI/ML experts hired to help in the transformation don't know the business. They know neither your pain points nor your customers'. They don't know the hopes, dreams, and ambitions of the business segment. And moreover, the people who do know this, the product people, managers, and engineers in your organisation, don't know the data scientists, or AI, or what AI can be used for. And they don't understand what the Data Scientists are saying. Before these groups learn to talk with each other, there will be no magic. Because, before that, no AI transformation has taken place.

This is why it is important to ask not what you can do, but what you will do, when you check whether you have transformed or not. The AI team can help in applying AI to seize an opportunity, but it will not happen unless they know what to do.

This is a matter of communication. Of getting the right people to talk to each other. But communication across these kinds of boundaries is challenging, leading us to where we are now:

The third era – while still short of a silver bullet, the current advice goes as follows:

  1. Get hold of someone experienced with AI and machine learning. It is a specialist discipline, and you need the competency. Unless you are sitting on exceptional talent, don't try to turn your other-area experts into Data Scientists overnight. Building a team from scratch takes time, and they will have no experience at the outset. Don't hesitate to go externally to find someone to help you get started.
  2. Put the Data Scientists in touch with your domain experts and product development teams, and let them, together, come up with the first AI application in your business. It does not have to be the killer app – if you can find anything that may be of use, it will do.
  3. Go ahead and develop the solution and showcase it to the rest of the organisation.

The point of the exercise is not to strike bullseye, but to set forth a working AI example that the rest of the organisation can recognise, understand, and critique. If the domain experts and the product people come forth saying "But you solved the wrong problem! What you should have done is…", you can consider it a victory. By then, you have the key resources talking to each other, collaborating to find new and better solutions to the problems you have already set out to solve.

In my time as a Data Scientist, the "Data Scientist in the corner" pitfall has been one of the main reasons groups or organisations fail in their initial AI initiatives. Not having the AI resources interacting closely with the product teams should be considered rigging for failure. You need the AI initiatives to be driven by the product teams – that is how you ensure that the AI solutions contribute to solving the right problems.

Summing up

  1. The transformation to being an AI-enabled product organisation builds on top of being digitally enabled, and follows the same kind of path: the key to success is engaging with the domain experts and the product teams, getting them up and running on the extended problem-solving capabilities provided by AI.
  2. AI and Machine Learning is a complicated specialist discipline, and you need someone proficient in the craft. Thereafter, the key is to connect that resource deeply with the domain experts and product teams, so that they can start solving problems together.

And: don’t put AI in a corner!

The process of transformation. Illustration by the author in collaboration with ChatGPT and GIMP.

Can AI Solve Your Problem?


In a product organisation aiming to build AI capabilities into its products and services, there is always the challenge of bringing the non-AI-literate onboard the AI train. While not everybody needs to be an AI expert, it is necessary to have as many as possible contributing ideas for exploiting the power of AI to propel the company to the next level. This applies in particular to domain experts and product people, who are on top of the problems their products and services are trying to solve, and who know where the shoe pinches.

One challenge I have learned is prevalent is the basic question "Which problems can we solve with AI?" – a question that is surprisingly hard to answer when posed by a non-expert. So I have devised three heuristic questions that you can use whenever you are looking at a problem and wondering "Can this be solved with AI?". If you can answer yes to all three, you may find yourself in a position to start an AI project.

Question 1: Can you say it?

You can think of an AI as an oracle that answers questions. What you have to ask yourself is:

Can you express, in writing, the question you wish to have answered?

This is, of course, a test that applies to anything you wish to do. If you want to do something, but you can’t formulate what it is you want, you probably don’t really know what you want. Launching an AI project is no exception to this rule.

Example questions to ask an AI could be

  • Is there a dog in this picture?
  • What will the weather be tomorrow?
  • What are next week’s lottery numbers?

All of these are well-posed questions that can be asked. But not all of them can be answered, so we need another test.

Question 2: Does it exist?

We can think of the oracle as a function mapping questions to answers: picture two circles, where the circle on the left contains all the questions, the circle on the right contains all the answers, and the oracle is the function sending questions to answers. The next thing to ask oneself is:

Does the function exist?

This may seem odd, and it gets queerer still: you should ask this question on a metaphysical level – is there any theoretical possibility for this function to exist? Let us have some examples:

We have all seen AIs answer the "dog in the picture" question, so we know that this function exists. We have also seen the weather forecast, so we know it is possible, to some extent, to predict tomorrow's weather. But there is no way to predict next week's lottery numbers, and the reason is that the lottery is rigged precisely so that this function does not exist. It is impossible. And this is what I mean by "on a metaphysical level".

Why is this important? Because Machine Learning (which is how we make AIs) is about trying to approximate functions by learning from examples.

If we have a lot of examples of how the function (i.e. the oracle) should behave, we can try to learn this behaviour and mimic it as closely as possible. But you can only approximate a function that exists.

Admittedly, all of this is a bit abstract, so I recommend replacing this heuristic with the following meta-heuristic:

Can a well-informed human do the job?

Still metaphysically: given all the information in the world and unlimited time, can a human answer the question? Clearly, humans are pretty good at recognising dogs in pictures. And humans did develop weather forecasts, and make them too. But we are not able to predict next week's lottery numbers.

If you have come this far, answering yes twice, you have 1) a well-posed question, and 2) the knowledge that, at least in theory, the question can be answered. But there is one more box to check off:

Question 3: Is the context available?

This one is a wee bit more technical. The key to the question is that the oracle function often needs more information than just the question itself to find the answer. The informed human doing the job as oracle may need additional information to make a decision or produce an answer. This is what I refer to as the context.

For example, the weather forecast oracle needs to know the current meteorological conditions as well as conditions from some days back to do forecasting. This information is not contained in the phrase "What is the weather going to be tomorrow?" On the other hand, in the case of pictures of dogs and cats, the context is in the picture, and no additional context is required.

The reason why this is important is that when we train an AI, the AI is presented with questions of the type "Is there a dog in this picture?", together with the picture itself, as training examples.

The AI then makes a guess before receiving the true answer, and over time, the hope is that the AI will learn the difference between cats and dogs. But for this to happen, the difference must be available in the data, so that the AI can learn to identify it. In the case of pictures, this is straightforward – you just have to make sure the pictures are of sufficient quality to make the distinction possible. In the case of weather forecasting, it becomes more complicated – you actually have to make an informed decision about what information is required to make a weather prediction. This is a question best answered by domain experts, so you may have to reach out to get a good answer to this one.
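
To make the last two heuristics tangible, here is a toy experiment. The same model is trained on a label that genuinely depends on the available context, and on a lottery-style label that does not; the first is learnable, the second is not. The data is synthetic and invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # the available "context"

y_signal = (X[:, 0] + X[:, 1] > 0).astype(int)  # depends on the context: learnable
y_lottery = rng.integers(0, 2, size=1000)       # independent of the context: not learnable

for name, y in [("real signal", y_signal), ("lottery", y_lottery)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    accuracy = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy = {accuracy:.2f}")

# Expect close to 1.0 for the real signal and close to 0.5 for the
# lottery labels: no amount of training recovers a function that
# does not exist in the data.
```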

But the bottom line is: if there is not enough information available for the informed human to answer the question, then there is little hope that the AI can learn to answer it either. You need that context.

Conclusion

So, to sum up: if you wish to test your AI project idea, to see whether it is something that can be solved with the use of AI, try answering the following three questions:

1. Can you express your question in writing?

2. Can an informed human do the job?

3. Is the context available?

If you can answer yes to all three, then you are ready to move on. There may still be hurdles to overcome, and perhaps it turns out to be too difficult in the end. But that is the topic of another post.

Good luck!

With sincere regards,
Daniel Bakkelund
