Podcast: No Hallucinations, AI That Ships by Astral Forest

Episode 02 transcript

AI-Augmented Data Engineering, with Maksym Karashchuk

Michał Dębski 0:10

Hi, hello, Max. You are a data architect, and I believe that, you know, right now half of the day that you are spending at work as a data architect, data engineer, most probably will evaporate in the next 18 months due to AI and LLMs. An honest answer from you that you can give me in 30 seconds: are you scared? Are you excited? How do you feel about it?

Maksym Karashchuk 0:39

Yeah, yeah, good question. I think I would be lying if I said that I'm not scared at all. Like, of course, you know, this is a disruptive change because it is such a big change and we are stepping into a new era, right? And yeah, it is incredible. But the amount of information we have, and everything we have to learn, actually, it's so overwhelming. So yeah, I'm a bit scared. But you know, you have to adapt. You have to adapt, basically.

Michał Dębski 1:18

You have to adapt, and we will be talking about this adaptation during our episode today. So I'm Michał Dębski. This is No Hallucinations, AI That Ships. Today with Max, we'll be exploring the question of how the data engineering job is evolving in the era of AI. My guest is Max, Maksym Karashchuk, Data Architect at Astral Forest. And Max has spent lots of his time, I believe the past few months at least, half a year, maybe 8 months, maybe even a year, just doing AI things, LLM things. And based on my observation in the past 12 months, 1 year maybe, I imagine that there are like 5 different stages of AI in data teams. And the first stage is what most people are calling adoption right now. So just using an LLM together with you in Cursor, in Visual Studio, whatever, to help you code better, to deliver better data engineering in a faster way. But this is only the first wave. The second one is actually the one in which you start building the agents that are building the code for you, and you are sequencing them, maintaining them, making them work smarter, architecting them, and so on. The thesis is that right now we have both the technology that supports those kinds of things and also the business justification, and I believe that we'll explore this topic very shortly. It's much faster and much more efficient just to rely on agentic development in data engineering, but it seems that we don't have many organizations yet that are ready to embrace this moment. And we'll test why.

Maksym Karashchuk 3:05

Yeah, in my opinion, a lot of organizations just would like to have a stable environment to do their job. They would like to know for sure, to have a plan, for example, how to do some things. But to be honest, with AI currently, it's such a rapidly expanding and evolving subject that, you know, every week something new appears and some new way of doing the same thing, actually. So maybe that's why some of the organizations slowed down when it comes to the AI adoption in their...

Michał Dębski 3:51

So, Max, maybe just a quick question here, a practical example if you can give one: how does your typical day of work today differ from a day 12 months ago, a year ago? What do you do differently today?

Maksym Karashchuk 4:08

Yeah, so since I'm an architect, right, I stopped writing code manually some time ago already. I still like to be hands-on, right, in the technology and touch code, but in most cases I was drawing diagrams anyway for most of the clients. But I think that now I have one additional task in my day-to-day work, which is basically helping businesses and organizations to understand how to adopt AI and how to do it quickly with the help of agents. So I need to understand how to adopt those agents and what the processes of the specific client are.

Michał Dębski 4:59

Okay, so maybe before we go into this, can you just give an example of a small agent and the job this agent is doing right now?

Maksym Karashchuk 5:07

Yeah, yeah. For example, one of the agents we have created analyzes recordings of our organization automatically, for internal purposes, and extracts all the decisions, requirements, and open questions for the client. It puts it all into one huge vault and then allows you to basically query that. So that agent was rather for business purposes, right? Where you would like your documentation to be written automatically without any human touch. But another thing is the development process being sped up with the help of agents, where you would have an agent which would go to Azure DevOps, fetch all the tasks which you have assigned to your sprint, analyze the content, the comments, and all the associated pull requests, go to SharePoint and analyze all the files attached to this specific work item, for example, and propose the solution to the developer. The developer says, yeah, I'd like to implement that. That makes sense. And another agent kicks in and starts implementing all of that. And at the end, it creates a pull request in Azure DevOps. And basically, the whole workflow is done and all your work is done in a couple of hours instead of 5 days, for example.
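
[Editor's note: the two-agent workflow Max describes can be sketched roughly as below. This is only an illustration under stated assumptions, not Astral Forest's implementation. The names `WorkItem`, `plan_work_item`, `implement_plan`, and `run_workflow` are hypothetical, and both agents are stubbed with plain functions instead of real LLM, Azure DevOps, or SharePoint calls.]

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    """A simplified sprint task; the fields here are illustrative only."""
    item_id: int
    title: str
    comments: list[str] = field(default_factory=list)
    attachments: list[str] = field(default_factory=list)


def plan_work_item(item: WorkItem) -> str:
    """Planner agent (stubbed): turn the ticket context into a proposal."""
    context = [item.title, *item.comments, *item.attachments]
    return f"Plan for #{item.item_id}: address {len(context)} context items"


def implement_plan(plan: str) -> dict:
    """Implementer agent (stubbed): return a description of a pull request."""
    return {"title": plan, "status": "awaiting human review"}


def run_workflow(items, approve):
    """Plan each item, ask the developer, and implement only approved plans."""
    pull_requests = []
    for item in items:
        plan = plan_work_item(item)
        if approve(plan):  # human in the loop: the developer accepts or rejects
            pull_requests.append(implement_plan(plan))
    return pull_requests


sprint = [
    WorkItem(101, "Add customer dimension", comments=["Use SCD2"], attachments=["mapping.xlsx"]),
    WorkItem(102, "Fix date filter"),
]
# The lambda stands in for the developer saying "yes" to one proposal only.
prs = run_workflow(sprint, approve=lambda plan: "#101" in plan)
print(prs)
```

[The `approve` callback is where the developer's "yeah, I'd like to implement that" would sit, and the resulting pull request still ends in a review state rather than being merged automatically.]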

Michał Dębski 7:04

Okay, so let me just wrap it up. So you are mentioning that you can have an agent that is going to your backlog, picking up the ticket, analyzing the context of this ticket, developing the plan of how it should be implemented, and another agent is given this plan and is just executing it to create the pull request. And once the pull request is done, once again, I believe it's your job to verify this pull request.

Maksym Karashchuk 7:31

Exactly, exactly. I think that we are still at the stage where you would probably like to have a human in the loop in that process at some point, just to have, you know, that peer review done by a specific person. So yeah, I think it also allows developers to believe in AI and trust it a little bit more.

Michał Dębski 7:58

So, Max, how did you feel the first time the machine actually built the pull request for you?

Maksym Karashchuk 8:04

Oh yeah, very powerful. I mean, I was surprised. It was like magic, right? So basically it goes and uses all of those new tools which we had to, I mean, humanity had to create, like MCP tools, right? And RAGs and all of that. And it just goes, analyzes everything, and does it all for you without your touch. And the only thing which appears on your screen is basically the login page to Azure DevOps. You click on it and basically everything else is done automatically. Isn't that magic?

Michał Dębski 8:40

Like, it is magic. I will have a few questions on this, but before we go into the technical details, another topic: adoption. And by adoption, I'm also talking about adoption among our peers, the other data engineers, the other data architects, because from my observation, not everyone is on the same page when it comes to embracing this LLM technology. What do you think about this? Why are some people, you know, really into this coding and some people are still just relying on their own craft, let's say? What are the blockers of this adoption?

Maksym Karashchuk 9:16

Yeah, I think this is again the same situation which we have with organizations, right? They are afraid of adopting something which is unstable. I hear from a lot of developers that you have to wait, you have to wait a little bit, just wait until it's stable, and maybe then I will start learning this, and so on. And to be honest, I can partially understand that, because the amount of information is overwhelming, really. If you go to YouTube and you start exploring this field, I mean, you will just drown in the information. It may seem like you are already late with the adoption as an engineer or an analyst, but I don't think that's true, actually. I think that if you start right now to adopt it, and I mean discover it for yourself, discover those topics you are most interested in as an analyst or engineer, because those topics can be a bit different, you will catch up with the amount of work done so far pretty quickly, just in a couple of weeks or something like that. I'm pretty sure that once you start, you will be so excited about what you are learning and what you are practicing that you will be just...

Michał Dębski 10:50

Okay. So from your perspective, it's mainly about being drowned in the content that is being generated so fast, and you don't even know when to start or where to start with all this information, because everything is changing so quickly, and disappearing as well. Like OpenClaw, for example, I believe. I don't know if it still exists or not, but you can do more or less the same thing with Claude right now. So this is one of the aspects. From my perspective, also from what I heard and what I've seen in different organizations, there is this kind of dichotomy in which some people are pressured to deliver AI results, but at the same time you have very heavy governance policies, especially here in Europe. We do have those governance policies. That, for example, you cannot use Claude Code because its inference is done in the United States, not here in Europe. You are, or you were, limited only, for example, to Copilot, which wasn't that great a few months ago and is still lagging behind. And the decision-making process is very long. So maybe some people are just frustrated, you know, by the company's policies sometimes, which are as they are because this is a new topic. And here I'm also thinking about one thing, and I would like to hear what you think about it. We have one category, let's say, of data, which is just the information, like, for example, the customer orders or your financial statement. So this is the business information. But what we very often need to send to Claude or to another model is also the metadata: the construction of the table, the technical logs of some processes that are just being generated.

Maksym Karashchuk 12:56

That's right. Very often I find it useless to send any data or whatever to an LLM, because the truth is an LLM works in a bit of an unpredictable way. Everyone knows that basically it tends to hallucinate and so on. And what we call hallucination is caused by the fact that an LLM basically guesses its next word all the time. So basically it is just a guess machine. So yeah, of course, I could be frustrated by that.

Michał Dębski 13:38

Yeah, of course. I see. So this is one of the blockers that I spot. However, we are also here to talk about, you know, moving from this state in which you have access to a pool of LLMs that you can just work with, moving from the stage in which you can accelerate your job on your own, let's say, PC, doing your stuff, and going to the next wave, the wave in which you have the agents. How would you describe this transition? What needs to happen to make it happen, basically?

Maksym Karashchuk 14:15

Yeah, so I think that first of all, the organization should be ready for this. The organization should also understand which processes are to be automated, because to be honest, here we are talking about automating some things in the existing processes which the organization might have and replacing them with agents.

Michał Dębski 14:47

Yeah.

Maksym Karashchuk 14:47

So basically, it would be great if the organization has the whole process written down, in the BPMN format, for example, right? And then you start identifying that, okay, maybe this piece the agent can do, and this piece can be done by an agent too. So, yeah, that helps you. And of course, we differentiate two ways of working with agents, right? The first one is humans with agents, and the other one is agents with humans. So humans with agents means that you create an agent which knows the context of your project, for example, which knows the documentation, which knows the SharePoint files and their content and how to parse, for example, an Excel file with some weird mappings from the business or something like that. But the human initiates the dialogue with that agent, starts asking questions, and expects some output and result to be produced autonomously.

Michał Dębski 15:56

So Max, a few examples here maybe, because when we are moving this way, you need to have some kind of homework done first. Like you mentioned, the agents need to know the context. And if we are operating within the data platform, within the data warehouse, what is the standard that the data warehouse should meet so the agent can operate? Can it operate on any kind of data warehouse, or can it operate only on some data warehouses?

Maksym Karashchuk 16:32

Yeah, yeah. So of course some homework has to be done. First of all, I think that in case we want to automate the development process with the help of AI, as much information as possible has to be in the code, written as code: basically infrastructure as code, your backend as code, your frontend as code. So basically, AI can just go and start analyzing that, because it needs that metadata. Right.

Michał Dębski 17:10

Okay. So the first step is basically to have a good-quality repository of all the things that you are doing, with the infrastructure as code, with the backend as code, and even the frontend, whatever it is, as code as well.

Maksym Karashchuk 17:24

Exactly. And then, since you have your project, right, and you have created some documentation, or you have some part of the documentation, okay? Then maybe you can feed in that documentation and ask the LLM, your agent, to generate rules for the project for you, which it should follow when writing code, just like a junior, like any developer basically, follows them when they write code. Because, you know, garbage in, garbage out. Right. And you need to follow some standards. And this is what you have to prepare next: basically rules and skills which will help your agent drive the development very easily.

Michał Dębski 18:28

Yeah. I see. So you need to have the repo, you need to have the set of rules. In what kind of environment, in what kind of data format, should you store your rules?

Maksym Karashchuk 18:41

So that would probably be Markdown files. For now, this is kind of a standard, because Markdown is just an open standard for writing your documentation. If you have a wiki in your Azure DevOps, for example, you have it stored in Markdown files already. This is a very lightweight format which does not consume a lot of tokens when your agent reads it or writes to it. So this is the most important thing. And yeah, because if you operate on documents like Word documents or presentations, then you just ingest a lot of unnecessary context, which is not always required for the agent to work.
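
[Editor's note: a minimal sketch of keeping project rules as Markdown files and assembling them into agent context. The file names, the `load_rules` helper, and the "about 4 characters per token" heuristic are all assumptions made for illustration; real token counts depend on the model's tokenizer.]

```python
import tempfile
from pathlib import Path


def load_rules(rules_dir: Path) -> str:
    """Concatenate every Markdown rule file, in name order, into one context string."""
    parts = []
    for path in sorted(rules_dir.glob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        parts.append(f"<!-- {path.name} -->\n{body}")
    return "\n\n".join(parts)


def rough_token_estimate(text: str) -> int:
    """Very rough heuristic (assumption): about 4 characters per token."""
    return len(text) // 4


# Two hypothetical rule files, written to a temporary folder for the demo.
with tempfile.TemporaryDirectory() as tmp:
    rules_dir = Path(tmp)
    (rules_dir / "01-naming.md").write_text(
        "# Naming\n- Use snake_case for tables.\n", encoding="utf-8"
    )
    (rules_dir / "02-sql.md").write_text(
        "# SQL\n- Never use SELECT *.\n", encoding="utf-8"
    )
    context = load_rules(rules_dir)

print(context)
print("approx. tokens:", rough_token_estimate(context))
```

[Because the rules are plain text, the same folder can be read by a person, diffed in a pull request, and handed to an agent without any conversion step.]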

Michał Dębski 19:38

So you do recommend having a solid knowledge base in Markdown files. And what about RAG, the vector databases?

Maksym Karashchuk 19:47

Yeah, good question. Because RAG is another way of storing your knowledge. So if you have a set of rules and they are very specific and there are not a lot of them, what I mean is like up to 20, 50 rules, let's say, right? You can still keep them in the Markdown format. However, if we are talking about a huge database with a lot of metadata, and your whole repo is so big that it is really hard to understand what the relationships between objects are and where something is stored, then you might need a vector database or RAG to help you with this. So what is that? This is just a vector database. So basically it takes all the information, all the documents which you might have, and just constructs a mathematical model, right? Do I describe that correctly? Probably yes. I'm not a mathematician, so don't quote me on that. But in general, the idea is that once you put all of those documents into that vector database, something called embeddings are created and stored as vectors. And when the LLM needs some context, it just tries to find similarities in that vector database. Yeah, so this is a very mathematical model, and it tends to hallucinate more than those rules written in a Markdown file.
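
[Editor's note: the "find similarities" step can be illustrated with a toy example. This sketch swaps in simple word-count vectors for real embeddings, which in practice come from a learned model and are stored in a vector database; the function names and sample documents are hypothetical. Only the cosine-similarity ranking is meant to carry over.]

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector, not a learned model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:top_k]


docs = [
    "customer orders are stored in the sales schema",
    "pipeline logs are written to the monitoring table",
    "financial statements are loaded monthly",
]
best = retrieve("where are customer orders stored", docs)
print(best)
```

[Retrieval here returns the nearest match by similarity score, not an exact lookup, which is one reason the retrieved context can be less predictable than a short, hand-written rules file.]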

Michał Dębski 21:36

Okay, so Markdown files are much better, let's say. Personally, I prefer them for one simple reason: I can control them visually very well, and I can't do the same thing with a black box like a RAG, a vector database, because I don't know how everything gets tokenized, let's say, inside. It's a different pattern, let's say. So we need to have the knowledge base. We can have it in Markdown. How about version control of this database? Any tips on this?

Maksym Karashchuk 22:13

Yeah, so in my opinion, the easiest way to store and version that is the same as with any code. Since this is just a set of Markdown files, for example, we can still have a repository with all that stuff added there. And in case you want to change some skill or some rule, right, you would approach that the same way you would change an architecture decision record. You cannot just set something in stone and then suddenly go and change that decision in the middle of the project. So you should approach this the same way, I think. Skills and rules for the project and for your agent should be written once, tested, of course, approved, and then versioned. So basically, you would have a repository which tracks all the changes, and you can always revert to the previous version, which was working much better.

Michał Dębski 23:23

Okay, so you have your KnowledgeDB. Within your KnowledgeDB, you have Markdown files. In the KnowledgeDB, you can have different procedures, the documentation, parts of the reference data, and the skills. And basically the skills, the chain of skills, becomes an agent. And you also need to control those agents right now. And I believe that we are moving to a different angle right now, the future, because we can already see how the design is going, and your job is evolving as well, because you used to be maintaining the repo with the code. Builder. And right now you are starting to maintain and develop the repo of the knowledge base and the agents. And version control them and make them auditable, I believe, as well. Governed, traceable, cost-efficient, all of these things. So my question is, how do you see the next 18 months from now?

Maksym Karashchuk 24:28

Yeah, it is an extremely difficult question to answer. I mean, the next 18 months, I have no idea what will happen even in 2 weeks when it comes to AI. But I will try to answer by looking back at the last year, for example, at what happened in that field. And in my opinion, we will go to the next level and we will be looking at our processes from a higher level than we do right now. Nowadays we sometimes still need to go and check our code, some details, what the specific filter was, and so on. So we really go into a lot of detail when analyzing something. So maybe in a year or a year and a half, we will just stop doing this, and models will be so much better that we will be relying on them entirely when it comes to writing code, you know. But nobody knows that. So this is just my...

Michał Dębski 25:51

So maybe a tricky question right now. At this stage we are at right now, do we need better models or better processes?

Maksym Karashchuk 25:59

Yeah, yeah, very good question. I think that currently the difference in model performance is very small. I mean, every month some new model is being released by the biggest model providers right now. But honestly, I don't feel a big difference between those. Honestly, I still sit on Sonnet 4.6 and I'm fully satisfied with that if you have granular tasks to execute. So that is totally fine with me. Yeah, so I don't think that we need much better models. We do need better processes and better ways of using those models. And this is something we still need to learn, all of us.

Michał Dębski 27:04

So maybe, Max, also from your perspective, because we started this meeting with the hard question. And this is also for all the listeners. If you are a data engineer right now or working in data, what should you do to make sure that you can still evolve in your career path?

Maksym Karashchuk 27:26

So in my opinion, you should go and touch the technology. I mean, start practicing, start using that. Of course, do not forget about security and safety, right? You cannot just go, open that on a VM, and then start feeding it a ton of data from your organization just to learn a little bit about agents. No, but I think that you should go and try to learn what agents are, for example, and how to work with agents. I use agents not only for my professional work, but for my day-to-day life in general. So there are numerous different cases for which you can also find inspiration on the internet. People are sharing that. You can just go, be inspired, and try the technology. And I'm sure you will find a lot of topics to discover.

Michał Dębski 28:29

So the hard statement right now, will the job of data engineer as we know it right now exist in 10 years?

Maksym Karashchuk 28:37

Like we know it right now? No, it will be, I think, changed enormously. But I hope that data engineers will still be required, but maybe their role will just be changed a little bit. And yeah, maybe they will focus more on those processes and business processes than just on writing SQL and notebooks in Databricks.

Michał Dębski 29:07

Yeah. I see. I do agree with your statement as well that the world is evolving very quickly and we all need to evolve and follow because the train is already departing. Going fast. Yeah. And it will go fast, maybe even faster. We just need to jump on it and deliver. Okay. Thank you, Max, for the discussion that we had on this wonderful afternoon.

Maksym Karashchuk 29:31

Uh, thank you for having me, Michał.

Michał Dębski 29:34

It was a real pleasure. So thank you to all of our listeners and have a great day.

Episode 01 transcript

The Role of MDM in AI Transformation

Michał Dębski 0:27

And we are live. Okay, so I believe that we can start right now. So Malcolm, before I even say hello, I want to start with the question I get from every customer every single time, every single week that I'm talking with them. The question that maybe makes every single MDM vendor a little bit nervous. So the question is, why should I pay for the MDM license if I can just drop all of my data into Claude and get the answer in 30 seconds? And I would like you to answer this question in 30 seconds as well, and later we'll do this properly.

Malcolm Hawker 1:09

30-second lightning round on can I use an LLM to do master data management? Well, the short answer is no. You can't. And the number one reason is because you need explainability. What I learned the first time that I ever implemented MDM is that the number one thing that you need to be able to provide are clear answers to your business as to why they're seeing what they are seeing. If you implement MDM and you merge two records together, Acme Incorporated, Acme SA, and create something new, you have to be able to fully explain. MDM is inherently a deterministic enterprise, meaning it is rules-driven. You need clear rules. You need your organization to align on those rules.
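
[Editor's note: a minimal sketch of what "deterministic, rules-driven, and explainable" matching could look like, using Malcolm's Acme example. The rules, field names, suffix list, and sample values are hypothetical and are not taken from Profisee or any other MDM product. The point is only that every match decision names the rule that produced it.]

```python
import re

# Hypothetical legal-form suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "incorporated", "sa", "ltd", "llc", "gmbh"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal-form suffixes."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def match_records(a: dict, b: dict) -> dict:
    """Apply fixed rules in order and report which rule decided the outcome."""
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return {"match": True, "rule": "R1: identical tax_id"}
    same_name = normalize_name(a["name"]) == normalize_name(b["name"])
    if same_name and a["country"] == b["country"]:
        return {"match": True, "rule": "R2: same normalized name and country"}
    return {"match": False, "rule": "no rule fired"}


# Illustrative records only; the country and tax_id values are made up.
acme_inc = {"name": "Acme Incorporated", "country": "CH", "tax_id": None}
acme_sa = {"name": "Acme SA", "country": "CH", "tax_id": None}
decision = match_records(acme_inc, acme_sa)
print(decision)
```

[The same inputs always give the same decision and the same rule label, which is the kind of answer a business user can be given when asking why two records were merged.]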

Michał Dębski 1:52

Okay, welcome.

Malcolm Hawker 1:53

This was just a teaser. 30 seconds.

Michał Dębski 1:54

Yeah, okay. That's all. We're coming back to this a little bit later, maybe in 15 minutes from now, and I'm going to push you even harder. But let me back up a little bit right now and set up our table correctly. So welcome, everyone. Hello to the first episode of No Hallucinations, where AI meets reality. Right now, I believe the decisions that data leaders are taking will shape every single AI transformation that is happening or is going to happen in the next 5, 10 years. Nobody is talking about the correct foundation, the right setup for all of these transformations that are going to happen. Everyone is chasing the context for AI. But context built on duplicated and untrustworthy data is just a more confident way to be wrong. And my guest today is Malcolm, Malcolm Hawker. Malcolm is the CDO of Profisee, with 20 years of experience in master data management, I believe. And I believe also, Malcolm, that you're coming back from the Gartner conference, which happened recently. So let's talk about the elephant in the room right now, and share some insights with us. So what exactly happened at this conference? At Gartner in London, I believe.

Malcolm Hawker 3:26

So London's coming up. London will be in 2 weeks. The Gartner conference that I was at was in Orlando. That was a few weeks ago now. All the days are melting together. But, you know, Gartner, I would argue, is the preeminent conference for data and AI leaders. It is mostly attended by senior directors, VPs, C-level executives who are responsible for AI and for data and for analytics. And the number one thing that I heard over and over again, you said it, Michał, is context, context, context. It seemed to be the word that I heard the most at every session, in the hallways, in the exhibit hall. Everybody's trying to figure out context. And if you ask me, the reason why we're talking about context is because everybody's trying to figure out how to operationalize all of their legacy structured data. We've been doing data a long time, but most of the data that we manage is highly structured, sitting in tables, sitting in relational stores. And that data in and of itself is not very actionable by LLMs. LLMs prefer text. LLMs prefer unstructured data. The more text you can put into a prompt window, the more context, the more detail, the more accurate and consistent and predictable the answers are going to be. So if we're going to use all of this relational data that we have, increasingly what companies are realizing is that we need to be able to add context to it. Probably the number one way most are doing context these days is by talking about things like knowledge graphs and how to implement knowledge graphs.

Michał Dębski 5:02

Knowledge graphs, RAGs, pipelines, stuff like this. It's almost everywhere right now.

Malcolm Hawker 5:07

Agreed. The thing that we're not talking enough about is the context that is inherent to MDM. We have hierarchies. We've always had hierarchies within MDM platforms. We've always been a source of truth for context. So the combination of MDM plus data catalogs, often data catalogs will be where these things are defined, where you will have your business glossary. But MDM is where you're operationalizing a lot of this context. So the good news for MDM practitioners, the good news for data and AI practitioners is that context is hot. We've got to figure this out. But it is a little bit challenging because the skill set that is needed to create and manage a lot of context typically within most organizations is happening in what is known as more of a knowledge management function where people are managing complex ontologies and taxonomies. Maybe they're managing your search infrastructure, maybe they're managing your product catalog. There are people in your organization that know this stuff pretty well. We've got to find a way to reach out to those people and pull them into the data and analytics function within our organizations to have a more holistic approach.

Michał Dębski 6:10

This is maybe a very good question right now that I want to ask you. So we have the context, we have agent workflows that can just feed data directly to MDM tables right now. But exactly and very precisely, what does that unlock from the human perspective as well? How does it impact the work that I'm doing, all my organization is doing right now?

Malcolm Hawker 6:32

Well, the biggest impact to humans is we're going to have to figure out how to master unstructured data. And right now we're not really doing that very much. We may do a little bit of capture of some data for maybe we're capturing data off of forms for healthcare-related use cases or insurance-related use cases. But Gartner says that 80 to 90 percent of all data in organizations is unstructured. And if we are going to use that data, operationalize that data, that 80 to 90 percent of data for AI, we're going to have to govern it. We're going to have to apply data quality to it. We're going to have to master it, even traditional data for traditional MDM. So that's a big change coming for a lot of people: how do we do that? How do we adapt our governance policies to managing, mastering quality for unstructured data?

Michał Dębski 7:28

How do we do it? From my perspective, every single time we have unstructured data, in order to govern it, to model it correctly, we need to squeeze some kind of structure out of this unstructured data. Let it be tags, enrichment of the data, adding additional columns, whatever. LLMs can help us do so, but right now we are entering these questions about AI governance, LLM governance, unstructured data governance as well. Now, do we even know what it really means, AI governance? This is a serious question.

Malcolm Hawker 8:05

It is a serious question, and I'm giving you a serious answer. The answer is no, we don't really know because we've got legacy frameworks, legacy approaches, legacy rules that don't really work that well when it comes to unstructured data, particularly text. If I read a paragraph and come to a certain conclusion about what it meant or its truth, and you read that paragraph and come to a different conclusion, who's right? Who's wrong? These are the questions we need to start to answer. What does it mean to have ethical data? I think we can start to model what ethics mean from the model behavior perspectives, but what does that actually mean? What is ethical data? When we actually look at the data that will be used to train and guide and ground these models, how do we reduce bias in the data? I don't think we really know. We know it when we see it in the models when the models behave a certain way, but in the core data itself, we really don't know. We don't really know what it means to fully govern all of this unstructured data. There's some basic things we can do. Starts with tagging for sure. That's why so many people are focused on data catalogs and increasingly on MDM as well, because we need to tag all of this data. We need to know what's there. We need to apply metadata, create metadata for these video files, for these text files. That's a starting point.

Michał Dębski 9:25

Everything reminds me of libraries, basically. Like the old physical libraries in which you had huge collections of unstructured data. They are called books. And for every single book, you are creating the tags, the categories, the indexes, the indices that help to manage all of this. Is this metaphor or parallel also familiar to you? Like, something similar is happening right now.

Malcolm Hawker9:54

It's not a metaphor, it's literal. It is literal. Library scientists who work in our companies, many of them who work in our companies, they typically don't sit in a data and analytics function, which is a bit of a problem. But these library scientists are out there. They're building corporate ontologies. They are building corporate taxonomies. We need to figure out how to pull these people in, or at the very least integrate them to our data governance processes so they're sitting at the table when we're having conversations around things like metadata standards. There's a huge opportunity just to have common definitions for things. This is why MDM will continue to play a critical role in organizations because it is where definitions are operationalized. MDM is how today you make sure that you define something in CRM one way and it's defined the exact same way in an ERP system. And we can extend that. It won't just be CRM and ERP. It will be MP4 files sitting in your marketing SharePoint service. It'll be Word docs sitting somewhere else. So MDM needs to start moving into unstructured data and we need to start pulling unstructured data in.

Michał Dębski11:03

You did ask a really good question. So we'll be building huge libraries, basically. Huge libraries for all the organizations. And I believe this is something I would like our listeners to wrap their heads around, because it's an easy concept. Everyone has been to a library, so people know how it operates. MDM, ontology, semantics: for people outside our data bubble, those concepts may be difficult to grasp. A library is a very simple one. So right now I'm rephrasing our discussion a little and putting it into a real business case. Something that I've seen in the past years, and as a Profisee implementer we've seen lots of M&A cases, companies being acquired and merged together. It's an extremely important topic for the investors who are purchasing those companies. They want synergy effects, for example simplifying the vendor structures or better addressing the customers' needs. And there is another important topic, for some of our listeners maybe even a nightmare, a few nightmares: the post-merger integration. How can we better embrace all of these legacy systems, put them together, master them? In every single M&A that I've seen in my life, there are a few problems, exactly the same ones: 3 different customer definitions, 4 definitions of vendor, 5 different general ledgers. How can you manage this using the MDM approach, library approach, AI approach? Do you have any stories?

Malcolm Hawker12:51

I have many personal stories. Of course, at Profisee we have many clients that are using us to support merger and acquisition use cases. You mentioned post. So that's a big thing, actually integrating two separate companies. And that's exactly what MDM does all day, every day. We break silos. MDM is about breaking silos, whether that silo exists at the level of a single database or a single application or whether the silo exists at the level of an entire company. Just another way to look at a silo. MDM is very, very good at establishing common definitions for things and then enforcing those common definitions across all of these siloed data assets or siloed business applications. So that's post-acquisition. But something equally important is pre-acquisition. Right now, companies spend a ton of money to hire very expensive consultants to do due diligence for a merger and acquisition activity to understand: okay, how much do our customer bases overlap? It's the first question that a company will ask because they don't want to buy a duplicate customer, or maybe they do want to buy a duplicate customer, but they want to understand what revenue is at risk. If we merge these two companies together, what revenue will be at risk? To do that, you need to understand customer base number 1 and customer base number 2, and see what the overlap is. It's like a Venn diagram, very basic stuff.
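
[Editor's illustration, not part of the recording] The Venn-diagram idea above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `normalize` rule set, the `overlap_report` helper, and the sample customers and revenue figures are all hypothetical, and real customer matching is considerably harder than exact matching on a cleaned-up name, as the conversation goes on to note.

```python
import re

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "ltd", "llc", "gmbh", "corp"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and legal suffixes (illustrative rule set)."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def overlap_report(base_a: dict[str, float], base_b: dict[str, float]) -> dict:
    """Each base maps customer name -> annual revenue.

    Names that normalize to the same key inside one base overwrite each
    other here; a real solution would merge them instead.
    """
    a = {normalize(n): rev for n, rev in base_a.items()}
    b = {normalize(n): rev for n, rev in base_b.items()}
    shared = sorted(a.keys() & b.keys())  # the middle of the Venn diagram
    return {
        "shared": shared,
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
        # Revenue both companies earn from shared customers. How much of it
        # is actually "at risk" is a business judgment, not computed here.
        "revenue_on_shared": sum(a[k] + b[k] for k in shared),
    }


# Made-up sample data for illustration only.
acquirer = {"Acme Inc.": 120.0, "Globex Ltd": 80.0, "Initech": 40.0}
target = {"ACME, Inc": 60.0, "Umbrella Corp": 30.0, "initech llc": 10.0}
report = overlap_report(acquirer, target)
print(report)
```

The hard part is hidden inside `normalize`: agreeing on what counts as the same customer is exactly the definitional work that the speakers attribute to MDM.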

Michał Dębski14:16

So it's very simple as you're explaining right now. It's really that simple?

Malcolm Hawker14:20

It's not that simple. One of the reasons why I'm involved in MDM is because on the surface, it seems like such a simple problem to solve. And when you try to solve it, it is a very complex problem. So in terms of a pre-merger, historically, companies will pay consulting companies to go and basically custom build MDM solutions to answer these questions, when you could just go buy an off-the-shelf MDM platform to understand what are our customer overlaps. Yes, you have to define what a customer is, but that's what MDM does. It's very, very good at helping you scale and automate those.

Michał Dębski14:54

Hold on, hold on, because I believe you've touched on an extremely important topic right now. I don't know if I understood it correctly, so let me rephrase it. So companies decide to invest a lot to build a one-shot solution, just pre-merger, to understand the potential overlap between the customer bases. So it's a one-shot solution. And later we also have these post-merger activities that are really painful and can take years to get solved. So why is this? Do you think there is a lack of awareness in the market of how this problem should be approached? Because in my opinion, why shouldn't we just build one solution to tackle them all? Cheaper, isn't it?

Malcolm Hawker15:38

In the case of pre-acquisition or due diligence, that is a world that is dominated by very expensive consultants, and they bill by the hour. They don't make a ton of money by helping you build a solution that is going to last for the next 30 years. They're going to help you build something that will get you through the due diligence and maybe require you to pay ongoing maintenance for it, if you continue to use it, which most companies don't. So, the question of why companies do this and waste money when it comes to M&A activity? Well, partially because large consultancies are involved. And secondly, because I don't think many understand that there is a different way. For a lot of us, if you talk to a CIO about what the playbook is for a merger, the first thing they'll talk about is how do we physically integrate systems. How do we take two ERP systems and make them one ERP system? When in reality what you might be able to do is keep them individual but then integrate at a data level. Integrate the data, keep the systems the same, but integrate the data. That's what MDM does: complex integrations across your most important data, the shared data across those two systems. MDM is a great and viable tool in the short run to virtually tie these systems together instead of physically tying these systems together.

Michał Dębski16:59

I see. Okay. So we discussed a little bit this MDM, the consultancy, which is an extremely important thing. Right now, let's get back to our initial question. Maybe we can find some ways how we can accelerate also.

Malcolm Hawker17:15

And by the way, I was talking like Big Five consultancies, not Astral Forest. I know Astral Forest would be looking out for customers' best interest and help them build scalable systems that can go into the future. I'm talking about the big ones that focus more on tax-centric and M&A-centric, the giant ones, the Deloittes and the Accentures and the McKinseys. Not Astral Forest.

Michał Dębski17:37

Yeah, but anyway, let's get back to AI right now, because every single day I'm using AI and this is the spiciest question that I can ask you right now. So let's continue the one from the beginning, and the reason why I'm really pushing you on it is that this is the same question I am being asked almost every single day. My customers are coming to me and asking, you know, Michał, I just dropped my entire customer database into Claude. It dedupes, it answers the questions, it's brilliant. It does everything by itself like magic. So why should I pay for your services and also for a Profisee license if I can do everything within Claude? This is a serious question. And I would really like to push you on this one.

Malcolm Hawker18:24

Well, short answer, you can't do everything in Claude. But let's back up. Can some MDM use cases be fully automated by AI? Maybe. I'm thinking more like CDP, customer data platform use cases, marketing use cases where the cost of being wrong is low or where the expectation of consistency and predictability is low. If the cost of being wrong is low, the expectation of getting the same answer consistently over time is low, and the unit costs are reasonably low, okay, maybe you could orchestrate a reasonably complex agentic workflow that ends up looking a lot like MDM and maybe supporting some very basic use cases. Maybe, that's a big maybe, but it wouldn't be enterprise class and it wouldn't be used outside of a marketing function for sure. But if you need consistency, if you need predictability, if you need accuracy over time and, most importantly, the thing I started our conversation with today, explainability: MDM at its core runs on deterministic rules. Those deterministic rules are defined by a data governance council. People will come together and align on how do I define things? How are things related to other things? What are our minimum data quality standards, by use case? What are our match rules? How will we define what a unique corporate or party entity looks like? Those are deterministic rules that are determined by a governance council. When you put those into an MDM, you're going to get the same answer consistently over time. The only thing that's inherently probabilistic is the match process. But even then, you apply data steward resources to it to make sure that what you've got, what you're looking at, is accurate and is consistent in the cases where the probabilities are reasonably low when you run these matches.
Put it all together, and what you have is a system that is inherently deterministic, that is running on rules, that is predictable and consistent, and can stand the scrutiny of audit, can stand the scrutiny of compliance, can stand the scrutiny of use cases where the cost of being wrong is high. If you are wrong about your customer name, or if you are wrong about a product name, the cost can often be extremely high. And as a data leader, the last thing that you want is to look your CEO in the eye when the CEO asks: what happened? Why did we fail the audit? Why are we in trouble from a regulator? The last thing that you want to say is: because Claude. Because Claude. Claude did it.
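
[Editor's illustration, not part of the recording] The combination described above, governed rules plus steward review for uncertain matches, can be sketched as follows. Everything here is an assumption made for illustration: the `Party` fields, the weights, and the two thresholds are hypothetical, and a simple weighted score stands in for a real probabilistic matcher. It is not a description of how Profisee or any other MDM platform works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Party:
    name: str
    email: str
    postcode: str


# Hypothetical weights and thresholds. In the setup described above, a
# governance council would agree on values like these.
WEIGHTS = {"name": 50, "email": 30, "postcode": 20}
AUTO_MATCH_AT = 80   # at or above: merge without human review
REVIEW_AT = 50       # between REVIEW_AT and AUTO_MATCH_AT: route to a steward


def _same(a: str, b: str) -> bool:
    """Case-insensitive equality; empty values never count as a match."""
    return bool(a.strip()) and a.strip().lower() == b.strip().lower()


def match_score(a: Party, b: Party) -> int:
    """Sum the weights of all fields on which the two records agree."""
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if _same(getattr(a, field), getattr(b, field))
    )


def decide(a: Party, b: Party) -> str:
    """Same inputs always give the same decision, so it can be explained."""
    score = match_score(a, b)
    if score >= AUTO_MATCH_AT:
        return "auto_match"
    if score >= REVIEW_AT:
        return "steward_review"
    return "no_match"


# Made-up records for illustration only.
crm = Party("Acme", "ap@acme.com", "10115")
erp = Party("ACME ", "AP@acme.com", "99999")
lead = Party("Acme", "", "10115")
other = Party("Globex", "x@globex.com", "10115")

print(decide(crm, erp))    # name and email agree
print(decide(crm, lead))   # name and postcode agree, email missing
print(decide(crm, other))  # only postcode agrees
```

Because every decision traces back to a named weight and threshold, an auditor can be shown why two records were or were not merged, which is the explainability point made in the conversation.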

Michał Dębski21:09

Because Claude.

Malcolm Hawker21:10

Because Claude did it.

Michał Dębski21:11

Okay, so let's flip this coin around, maybe. So how, in your opinion, can Claude actually enhance building the MDM platforms? What kind of processes can be automated? Just pick up one, maybe the best one. From your experience?

Malcolm Hawker21:25

I would argue that all MDM critical capabilities, and Gartner says there's 13 of them, all like data modeling, data quality, data governance, I would argue all of them can be augmented by Claude. Claude or any LLM can help you define, and we do this in Profisee. We have an AI orchestrator, we call her Aisi, and she runs on your OpenAI tenant and she can help you define data quality rules. She can help you recommend data models. She can help you recommend match processes. So that's augmentation, but that's very different than full automation. You can drastically scale and accelerate a lot of it.

Michał Dębski22:11

I believe we are very close right now to this concept of human in the loop, where the human always stays. Now, in my opinion, the human doesn't need to be in the loop. It's maybe a little bit contrarian to the market consensus right now, but I believe that many processes can be liberated from human decision-making. Humans can orchestrate them, humans can monitor them, humans can make them better. But what's your take on this?

Malcolm Hawker22:45

When it comes to building an MDM platform, I think for the short term that we will continue to have humans in the loop because humans are ultimately accountable. If we build the RACI matrix, especially from a governance perspective around those use cases that I was talking about, audit, compliance, financial data accuracy, on and on, as long as humans are accountable, I think that they will remain in the loop to some degree.

Michał Dębski23:18

So you think it's mostly an accountability question: right now we just cannot allow a Wild West, let's say, with AI taking decisions by itself. There must be a human because we need to attribute the decision to someone.

Malcolm Hawker23:34

It gets tough. In defining rules that you would configure into an MDM platform or data quality platform or data integration platform, I think it's reasonable to assume that we can continue to scale human beings to meet the demand. However, transactionally, that's an area where I don't think it's feasible to always have a human in the loop. If you have an AI-based process, an AI-based chatbot that is doing customer service help, will you always be able to have a human in the loop on every interaction? No, you're not going to. So we need to pivot. This requires a pivot in how we approach governance, from a bottom-up, rules-driven process to more of an exceptions-driven one. There will still be rules. But we need to start working from more of an exceptions-driven process. And this is one of the many ways that we need to adapt our governance processes here because it's not feasible transactionally. If there are agents creating new customer records or new products or new something and they're doing it in milliseconds and they're doing millions a day, we're using AI because it can go that fast, because it enables that scale. And if you start throwing humans into every single transactional loop, it's going to break. And there'll be no value.
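
[Editor's illustration, not part of the recording] One way to picture exceptions-driven governance is a triage step: records created by agents pass through automated rule checks, and only the ones that break a rule are queued for a person. The sketch below is a minimal one under stated assumptions; the rule names, the record fields, and the `triage` helper are hypothetical.

```python
from typing import Callable

Record = dict[str, str]
Rule = Callable[[Record], bool]

# Hypothetical rules. The rules still exist and are still defined by people;
# what changes is that people only look at the records that violate them.
RULES: dict[str, Rule] = {
    "name_present": lambda r: bool(r.get("name", "").strip()),
    "country_is_iso2": lambda r: (
        len(r.get("country", "")) == 2 and r.get("country", "").isalpha()
    ),
}


def triage(records: list[Record]) -> tuple[list[Record], list[tuple[Record, list[str]]]]:
    """Split records into auto-accepted ones and exceptions for human review."""
    accepted: list[Record] = []
    exceptions: list[tuple[Record, list[str]]] = []
    for rec in records:
        failed = [name for name, rule in RULES.items() if not rule(rec)]
        if failed:
            exceptions.append((rec, failed))  # a steward sees only these
        else:
            accepted.append(rec)
    return accepted, exceptions


# Made-up batch of agent-created records for illustration only.
batch = [
    {"name": "Acme", "country": "DE"},
    {"name": " ", "country": "DE"},
    {"name": "Globex", "country": "Germany"},
]
accepted, exceptions = triage(batch)
print(len(accepted), "accepted;", len(exceptions), "sent to review")
```

With millions of records a day, the human workload then scales with the exception rate rather than with the transaction volume.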

Michał Dębski24:53

So basically we need to transform the way that we are doing business right now to embrace the new capabilities of AI. And in order to do so, we will basically need more projects. Talking about projects, I have one additional question for you. This is something that you mentioned, I believe, last week when we were talking. So you said something about the people in the data landscape, in the data environment, and our own difficulty in measuring the impact of our job, the impact of the projects that we are implementing. Everyone needs to measure, everyone needs to forecast the real impact on the organization. However, we very often refuse to do it. Can you just elaborate on this topic and say it once again? Share it with our audience.

Malcolm Hawker25:42

The connection to what we were just talking about, humans in the loop at scale, is in essence about finding the right AI use case. I can kind of paraphrase. What is the right use case? What is the use case that is going to be ROI positive? What is the use case that is going to align well to automation? On and on. That requires us to take a more rigorous perspective when it comes to measuring the ROI of our investments in everything. When I was a Gartner analyst, I would talk to CDOs and CIOs, VPs of data and analytics, and they'd say things like: I'm not getting any business engagement. Nobody's contributing data stewardship resources. I can't get any more people. On and on. Most of the problems most pressing to CDOs are rooted in the fact that we don't measure the economic impact of the things that we do. We don't measure, and I'm talking about dollars, pounds, actual money in the bank that can be attributed to data quality, MDM, better data integration, AI, whatever data and analytics use case that you want. We need to be able to start measuring those. And the thing that drives me a little bit nuts is that you have CDOs out there who are saying: this is impossible. This is impossible. You cannot do this because the benefits are indirect. I can't attribute a dollar in the bank or a euro in the bank to a data quality rule, Malcolm, this is impossible. I hear what you're saying and it sounds like a nice academic exercise, but I can't do it. Can you imagine if you're that chief data officer and you're sitting in a C-level meeting and your CEO asks: what is the value of your function? Everybody else at the table has built attribution models. HR has built some sort of idea of understanding: okay, if we invest in HR and employee retention programs or maybe an employee wellness program, I can reasonably assume that our employee retention will improve 10 percent. HR is doing that.
Finance is doing measurements around what happens when we invest in audit, what risk, how many dollars do we mitigate by investing in better compliance? Everybody else at that table has developed measurements for the effectiveness of their organization. We're in the business of measuring things. We're literally in the business of measuring things. It's what we do. It's what we're hired and paid to do. Yet we're saying impossible. Can't do it. We cannot measure this. And it's ridiculous. Of course we can measure this.

Michał Dębski28:27

So one message maybe to everyone: a little more courage, and start measuring, in the best way possible, the real impact that we are making, because it's for our own sake. And we can do this. Some kind of approximation is always possible. So the last one for you, Malcolm. 5 years from now, how do you see the MDM landscape?

Malcolm Hawker28:47

It's going to look drastically different, but MDM will still play a critical foundation in the management and governance of data because it will have to. All the things that I talked about, unstructured data. How do we apply data quality? Structured data, how do we apply consistent definitions? How do we manage complex hierarchies and complex relationships? And how do we enforce those into the data that matters the most, which is shared across the organization? Our organizations will remain highly federated. Marketing will continue to have its own language. Finance will continue to have its own language. And you need a layer in between those that is consistent, accurate, and predictable. That's MDM. It's not going away. We're still going to be here.

Michał Dębski29:26

Okay. So the market is going to be booming and exploding basically about MDM. Because this will be the image generated by AI, I believe, as well.

Malcolm Hawker29:37

We're going to be using AI. It'll be AI for MDM. But the bigger use case is MDM for AI, figuring out how to master all that unstructured data.

Michał Dębski29:48

Thank you all. Thank you for listening to us. Malcolm Hawker from Profisee. Enjoy the day. Enjoy the week.

No Hallucinations, AI That Ships

AI-Augmented Data Engineering, with Maksym Karashchuk

Every episode in one place.

Every episode, fully transcribed and timestamped.

Search-friendly

AI-ready

Quote-ready

AI-Augmented Data Engineering, with Maksym Karashchuk

The Role of MDM in AI Transformation

Michał Dębski
