Injecting domain expertise into your AI system

When starting their AI initiatives, many companies are trapped in silos and treat AI as a purely technical enterprise, sidelining domain experts or involving them too late. They end up with generic AI applications that miss industry nuances, produce poor recommendations, and quickly become unpopular with users. By contrast, AI systems that deeply understand industry-specific processes, constraints, and decision logic have the following benefits:
Domain experts can help you connect the dots between the technicalities of an AI system and its real-life usage and value. Thus, they should be key stakeholders and co-creators of your AI applications. This guide is the first part of my series on expertise-driven AI. Following my mental model of AI systems, it provides a structured approach to embedding deep domain expertise into your AI.
Throughout the article, we will use supply chain optimisation (SCO) as a running use case to illustrate these different methods. Modern supply chains are under constant strain from geopolitical tensions, climate disruptions, and volatile demand shifts, and AI can provide the kind of dynamic, high-coverage intelligence needed to anticipate delays, manage risks, and optimise logistics. However, without domain expertise, these systems are often disconnected from operational reality. Let’s see how we can solve this by integrating domain expertise across the different components of the AI application.
AI is only as domain-aware as the data it learns from. Raw data isn’t enough – it must be curated, refined, and contextualised by experts who understand its meaning in the real world.
While data scientists can build sophisticated models to analyse patterns and distributions, these analyses often stay at a theoretical, abstract level. Only domain experts can validate whether the data is complete, accurate, and representative of real-world conditions.
In supply chain optimisation, for example, shipment records may contain missing delivery timestamps, inconsistent route details, or unexplained fluctuations in transit times. A data scientist might discard these as noise, but a logistics expert could have real-world explanations of these inconsistencies. For instance, they might be caused by weather-related delays, seasonal port congestion, or supplier reliability issues. If these nuances aren’t accounted for, the AI might learn an overly simplified view of supply chain dynamics, resulting in misleading risk assessments and poor recommendations.
Experts also play a critical role in assessing the completeness of data. AI models work with what they have, assuming that all key factors are already present. It takes human expertise and judgment to identify blind spots. For example, if your supply chain AI isn’t trained on customs clearance times or factory shutdown histories, it won’t be able to predict disruptions caused by regulatory issues or production bottlenecks.
Implementation tip: Run joint Exploratory Data Analysis (EDA) sessions with data scientists and domain experts to identify missing business-critical information, ensuring AI models work with a complete and meaningful dataset, not just statistically clean data.
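As a minimal illustration, such a session could start from a script that surfaces the records worth discussing. The shipment table, column names, and thresholds below are hypothetical placeholders to be refined with the experts:

```python
import pandas as pd

# Hypothetical shipment records -- schema and values are illustrative only.
shipments = pd.DataFrame({
    "shipment_id": ["S1", "S2", "S3", "S4"],
    "supplier": ["Acme", "Acme", "Borealis", "Cetus"],
    "route": ["SHA-HAM", None, "SHA-ROT", "SHA-ROT"],
    "transit_days": [14.0, 43.0, 16.0, None],
})

# Flag gaps and outliers for experts to explain -- not for automatic deletion.
mean, std = shipments["transit_days"].mean(), shipments["transit_days"].std()
issues = pd.DataFrame({
    "missing_route": shipments["route"].isna(),
    "missing_transit_time": shipments["transit_days"].isna(),
    "transit_outlier": shipments["transit_days"] > mean + 2 * std,
})

# These rows become the agenda of the joint EDA session:
# weather delay? port congestion? data-entry error?
print(shipments[issues.any(axis=1)])
```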
One common pitfall when starting with AI is integrating too much data too soon, leading to complexity, congestion of your data pipelines, and blurred or noisy insights. Instead, start with a couple of high-impact data sources and expand incrementally based on AI performance and user needs. For instance, an SCO system may initially use historical shipment data and supplier reliability scores. Over time, domain experts may identify missing information – such as port congestion data or real-time weather forecasts – and point engineers to those data sources where it can be found.
Implementation tip: Start with a minimal, high-value dataset (normally 3–5 data sources), then expand incrementally based on expert feedback and real-world AI performance.
AI models learn by detecting patterns in data, but sometimes, the right learning signals aren’t yet present in raw data. This is where data annotation comes in – by labelling key attributes, domain experts help the AI understand what matters and make better predictions. Consider an AI model built to predict supplier reliability. The model is trained on shipment records, which contain delivery times, delays, and transit routes. However, raw delivery data alone doesn’t capture the full picture of supplier risk – there are no direct labels indicating whether a supplier is "high risk" or "low risk."
Without more explicit learning signals, the AI might make the wrong conclusions. It could conclude that all delays are equally bad, even when some are caused by predictable seasonal fluctuations. Or it might overlook early warning signs of supplier instability, such as frequent last-minute order changes or inconsistent inventory levels.
Domain experts can enrich the data with more nuanced labels, such as supplier risk categories, disruption causes, and exception-handling rules. By introducing these curated learning signals, you can ensure that AI doesn’t just memorise past trends but learns meaningful, decision-ready insights.
You shouldn’t rush your annotation efforts – instead, think about a structured annotation process that includes the following components:
Implementation tip: Define an annotation playbook with clear labelling criteria, involve at least two domain experts per critical label for objectivity, and run regular annotation review cycles to ensure AI is learning from accurate, business-relevant insights.
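For illustration, the playbook can boil down to a shared label schema plus an agreement check that triggers the review cycle; the labels, causes, and records below are hypothetical:

```python
# Hypothetical label schema agreed in the annotation playbook.
RISK_LABELS = ("low_risk", "medium_risk", "high_risk")
DISRUPTION_CAUSES = ("weather", "port_congestion", "supplier_instability", "seasonal_peak")

annotations = [
    {"shipment_id": "S17", "expert_1": "high_risk", "expert_2": "high_risk",
     "cause": "supplier_instability"},
    {"shipment_id": "S18", "expert_1": "low_risk", "expert_2": "medium_risk",
     "cause": "seasonal_peak"},
]

# Two experts per critical label: disagreements go into the review cycle.
agreement = sum(a["expert_1"] == a["expert_2"] for a in annotations) / len(annotations)
to_review = [a["shipment_id"] for a in annotations if a["expert_1"] != a["expert_2"]]
print(f"inter-annotator agreement: {agreement:.0%}, escalate: {to_review}")
```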
So far, our AI models learn from real-life historical data. However, rare, high-impact events – like factory shutdowns, port closures, or regulatory shifts in our supply chain scenario – may be underrepresented. Without exposure to these scenarios, AI can fail to anticipate major risks, leading to overconfidence in supplier stability and poor contingency planning. Synthetic data solves this by creating more datapoints for rare events, but expert oversight is crucial to ensure that it reflects plausible risks rather than unrealistic patterns.
Let’s say we want to predict supplier reliability in our supply chain system. The historical data may have few recorded supplier failures – but that’s not because failures don’t happen. Rather, many companies proactively mitigate risks before they escalate. Without synthetic examples, AI might deduce that supplier defaults are extremely rare, leading to misguided risk assessments.
Experts can help generate synthetic failure scenarios based on:
Actionable step: Work with domain experts to define high-impact but low-frequency events and scenarios, and make these the focus of your synthetic data generation.
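One way to put this into practice is to sample synthetic records from expert-defined scenario templates. The events, delay ranges, and weights below are purely illustrative and would come out of such an expert workshop:

```python
import random

# Expert-defined rare-event templates: plausible ranges, not observed history.
SCENARIOS = [
    {"event": "factory_shutdown", "delay_days": (10, 40), "weight": 0.2},
    {"event": "port_closure",     "delay_days": (5, 20),  "weight": 0.3},
    {"event": "regulatory_hold",  "delay_days": (3, 15),  "weight": 0.5},
]

def synthetic_disruptions(n: int, seed: int = 42) -> list[dict]:
    """Generate n synthetic disruption records to enrich the training data."""
    rng = random.Random(seed)
    weights = [s["weight"] for s in SCENARIOS]
    records = []
    for i in range(n):
        scenario = rng.choices(SCENARIOS, weights=weights)[0]
        low, high = scenario["delay_days"]
        records.append({
            "shipment_id": f"synthetic_{i}",
            "event": scenario["event"],
            "delay_days": rng.randint(low, high),
            "label": "disrupted",  # explicit learning signal
        })
    return records

# Experts review a sample of the output before it enters the training set.
print(synthetic_disruptions(3))
```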
Data makes domain expertise shine. An AI initiative that relies on clean, relevant, and enriched domain data will have an obvious competitive advantage over one that takes the "quick-and-dirty" shortcut to data. However, keep in mind that working with data can be tedious, and experts need to see the outcome of their efforts – whether it’s improving AI-driven risk assessments, optimising supply chain resilience, or enabling smarter decision-making. The key is to make data collaboration intuitive, purpose-driven, and directly tied to business outcomes, so experts remain engaged and motivated.
Once AI has access to high-quality data, the next challenge is ensuring it generates useful and accurate outputs. Domain expertise is needed to:
Let’s look at some common AI approaches and see how they can benefit from an extra shot of domain knowledge.
For structured problems like supply chain forecasting, predictive models such as classification and regression can help anticipate delays and suggest optimisations. However, to make sure these models are aligned with business goals, data scientists and knowledge engineers need to work together. For example, an AI model might try to minimise shipment delays at all costs, but a supply chain expert knows that fast-tracking every shipment through air freight is financially unsustainable. They can formulate additional constraints on the model, making it prioritise critical shipments while balancing cost, risk, and lead times.
Implementation tip: Define clear objectives and constraints with domain experts before training AI models, ensuring alignment with real business priorities.
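As a sketch of how such constraints can enter a model, the snippet below balances delay reduction against an air-freight budget with a tiny linear program; all segments, costs, and the budget are invented, and a real system would embed these constraints in whatever optimiser or loss function it already uses:

```python
from scipy.optimize import linprog

# Three shipment segments; segment 0 is business-critical (expert input).
volume     = [100, 400, 300]      # shipments per month
delay_sea  = [12.0, 10.0, 9.0]    # expected delay in days by sea
delay_air  = [2.0, 2.0, 2.0]      # expected delay in days by air
extra_cost = [800, 500, 600]      # extra cost per shipment if flown
budget     = 150_000              # monthly air-freight budget (expert constraint)

# x[i] = share of segment i sent by air. Minimise total delay-days.
c = [volume[i] * (delay_air[i] - delay_sea[i]) for i in range(3)]
A_ub = [[extra_cost[i] * volume[i] for i in range(3)]]
b_ub = [budget]
bounds = [(0.8, 1.0), (0.0, 1.0), (0.0, 1.0)]  # critical segment flies >= 80%

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print({f"segment_{i}_air_share": round(x, 2) for i, x in enumerate(res.x)})
```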
For a detailed overview of predictive AI techniques, please refer to Chapter 4 of my book The Art of AI Product Management.
While predictive models trained from scratch can excel at very specific tasks, they are also rigid and will "refuse" to perform any other task. GenAI models are more open-minded and can be used for highly diverse requests. For example, an LLM-based conversational widget in an SCO system can allow users to interact with real-time insights using natural language. Instead of sifting through inflexible dashboards, users can ask, "Which suppliers are at risk of delays?" or "What alternative routes are available?" The AI pulls from historical data, live logistics feeds, and external risk factors to provide actionable answers, suggest mitigations, and even automate workflows like rerouting shipments.
But how can you ensure that a huge, out-of-the-box model like ChatGPT or Llama understands the nuances of your domain? Let’s walk through the LLM triad – a progression of techniques to incorporate domain knowledge into your LLM system.
As you progress from left to right, you can ingrain more domain knowledge into the LLM – however, each stage also adds new technical challenges (if you are interested in a systematic deep-dive into the LLM triad, please check out chapters 5–8 of my book The Art of AI Product Management). Here, let’s focus on how domain experts can jump in at each of the stages:
1. Prompting: Domain experts can help craft prompts and few-shot examples that encode domain terminology, constraints, and decision criteria, steering the model’s pre-trained knowledge toward business-relevant answers.
2. RAG (Retrieval-Augmented Generation): While prompting helps guide AI, it still relies on pre-trained knowledge, which may be outdated or incomplete. RAG allows AI to retrieve real-time, company-specific data, ensuring that its responses are grounded in current logistics reports, supplier performance records, and risk assessments. For example, instead of generating generic supplier risk analyses, a RAG-powered AI system would pull real-time shipment data, supplier credit ratings, and port congestion reports before making recommendations. Domain experts can help select and structure these data sources and are also needed when it comes to testing and evaluating RAG systems.
Implementation tip: Work with domain experts to curate and structure knowledge sources – ensuring AI retrieves and applies only the most relevant and high-quality business information.
3. Fine-tuning: While prompting and RAG inject domain knowledge on-the-fly, they do not inherently embed supply-chain-specific workflows, terminology, or decision logic into your LLM. Fine-tuning adapts the LLM to think like a logistics expert. Domain experts can guide this process by creating high-quality training data, ensuring AI learns from real supplier assessments, risk evaluations, and procurement decisions. They can refine industry terminology to prevent misinterpretations (e.g., AI distinguishing between "buffer stock" and "safety stock"). They also align AI’s reasoning with business logic, ensuring it considers cost, risk, and compliance – not just efficiency. Finally, they evaluate fine-tuned models, testing AI against real-world decisions to catch biases or blind spots.
Implementation tip: In LLM fine-tuning, data is the crucial success factor. Quality trumps quantity, and fine-tuning on a small, high-quality dataset can give you excellent results. Thus, give your experts enough time to figure out the right structure and content of the fine-tuning data, and plan for plenty of end-to-end iterations of your fine-tuning process.
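For illustration, expert-reviewed fine-tuning examples are often stored as chat-style records like the one below. The content is invented, and the exact field names depend on your fine-tuning framework or provider:

```python
import json

# One hypothetical, expert-written training example in a common chat format.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a supply chain risk analyst."},
        {"role": "user",
         "content": "Supplier Acme was late on 3 of 10 orders, all during the "
                    "monsoon season. Assess the supplier risk."},
        {"role": "assistant",
         "content": "Medium risk: the delays are seasonal rather than structural. "
                    "Recommend buffer stock for Q3 orders; no supplier change needed."},
    ]
}

with open("fine_tuning_data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```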
Every machine learning algorithm gets it wrong from time to time. To mitigate errors, it helps to set the "hard facts" of your domain in stone, making your AI system more reliable and controllable. This combination of machine learning and deterministic rules is called neuro-symbolic AI.
For example, an explicit knowledge graph can encode supplier relationships, regulatory constraints, transportation networks, and risk dependencies in a structured, interconnected format.
Instead of relying purely on statistical correlations, an AI system enriched with knowledge graphs can:
How can you decide which knowledge should be encoded with rules (symbolic AI), and which should be learned dynamically from the data (neural AI)? Domain experts can help you pick those bits of knowledge where hard-coding makes the most sense:
In most cases, this knowledge will be stored in separate components of your AI system, like decision trees, knowledge graphs, and ontologies. There are also some methods to integrate it directly into LLMs and other statistical models, such as Lamini’s memory fine-tuning.
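Here is a minimal sketch of this split: a statistical model (or LLM) proposes an action, and a small symbolic layer of expert-encoded hard facts vets it before anything is executed. The graph entities and the "sanctioned" attribute are made up, and networkx merely stands in for a proper knowledge-graph store:

```python
import networkx as nx

# Hard domain facts, encoded symbolically by experts (illustrative entries).
kg = nx.DiGraph()
kg.add_edge("Port_Shanghai", "Port_Hamburg", relation="route", sanctioned=False)
kg.add_edge("Port_Shanghai", "Port_X", relation="route", sanctioned=True)

def route_is_allowed(origin: str, destination: str) -> bool:
    """Deterministic rule: never accept a route that is unknown or sanctioned."""
    if not kg.has_edge(origin, destination):
        return False
    return not kg[origin][destination].get("sanctioned", False)

# A learned model proposes a reroute; the symbolic layer has the final word.
proposed_route = ("Port_Shanghai", "Port_X")
if route_is_allowed(*proposed_route):
    print("Accept model suggestion:", proposed_route)
else:
    print("Reject suggestion, escalate to a human planner:", proposed_route)
```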
Generating insights and turning them into actions is a multi-step process. Experts can help you model workflows and decision-making pipelines, ensuring that the process followed by your AI system aligns with their tasks. For example, the following pipeline shows how the AI components we considered so far can be combined into a modular workflow for the mitigation of shipment risks:
Experts are also needed to calibrate the "labor distribution" between humans and AI. For example, when modelling decision logic, they can set thresholds for automation, deciding when AI can trigger workflows versus when human approval is needed.
Implementation tip: Involve your domain experts in mapping your processes to AI models and assets, identifying gaps vs. steps that can already be automated.
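A schematic example of such a mapping, with the automation thresholds as placeholders to be calibrated together with the experts:

```python
# Hypothetical risk-mitigation pipeline with expert-set automation thresholds.
AUTO_APPROVE_THRESHOLD = 0.90   # agreed with domain experts
REVIEW_THRESHOLD = 0.60

def handle_shipment_risk(shipment_id: str, risk_score: float, confidence: float) -> str:
    """Decide whether the AI may act alone or must hand over to a human."""
    if risk_score < 0.5:
        return f"{shipment_id}: no action needed"
    if confidence >= AUTO_APPROVE_THRESHOLD:
        return f"{shipment_id}: auto-trigger rerouting workflow"
    if confidence >= REVIEW_THRESHOLD:
        return f"{shipment_id}: draft rerouting proposal, await planner approval"
    return f"{shipment_id}: escalate to logistics expert with full context"

for args in [("SHP-1", 0.8, 0.95), ("SHP-2", 0.7, 0.70), ("SHP-3", 0.9, 0.40)]:
    print(handle_shipment_risk(*args))
```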
Especially in B2B environments, where workers are deeply embedded in their daily workflows, the user experience must be seamlessly integrated with existing processes and task structures to ensure efficiency and adoption. For example, an AI-powered supply chain tool must align with how logistics professionals think, work, and make decisions. In the development phase, domain experts are the closest "peers" to your real users, and picking their brains is one of the fastest ways to bridge the gap between AI capabilities and real-world usability.
Implementation tip: Involve domain experts early in UX design to ensure AI interfaces are intuitive, relevant, and tailored to real decision-making workflows.
AI thinks differently from humans, which makes us humans skeptical. Often, that’s a good thing since it helps us stay alert to potential mistakes. But distrust is also one of the biggest barriers to AI adoption. When users don’t understand why a system makes a particular recommendation, they are less likely to work with it. Domain experts can define how AI should explain itself – ensuring users have visibility into confidence scores, decision logic, and key influencing factors.
For example, if an SCO system recommends rerouting a shipment, it would be irresponsible on the part of a logistics planner to just accept it. She needs to see the "why" behind the recommendation – is it due to supplier risk, port congestion, or fuel cost spikes? The UX should show a breakdown of the decision, backed by additional information like historical data, risk factors, and a cost-benefit analysis.
Mitigate overreliance on AI: Excessive user dependence on AI can introduce bias, errors, and unforeseen failures. Experts should find ways to balance AI-driven insights with human expertise, ethical oversight, and strategic safeguards to ensure resilience, adaptability, and trust in decision-making.
Implementation tip: Work with domain experts to define key explainability features – such as confidence scores, data sources, and impact summaries – so users can quickly assess AI-driven recommendations.
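In practice, this often comes down to agreeing on a structured explanation payload that the UX can render; the fields and values below are illustrative:

```python
# Illustrative structure of an explainable recommendation, agreed with experts.
recommendation = {
    "action": "reroute_shipment",
    "shipment_id": "SHP-2041",
    "confidence": 0.87,
    "top_factors": [
        {"factor": "port_congestion_rotterdam", "impact": 0.45},
        {"factor": "supplier_delay_history", "impact": 0.30},
        {"factor": "fuel_cost_spike", "impact": 0.12},
    ],
    "data_sources": ["live_port_feed", "shipment_history_2022_2024"],
    "estimated_cost_delta_eur": 1250,
}

def render_explanation(rec: dict) -> str:
    """Turn the payload into the 'why' shown next to the recommendation."""
    factors = ", ".join(f"{f['factor']} ({f['impact']:.0%})" for f in rec["top_factors"])
    return (f"Recommend {rec['action']} for {rec['shipment_id']} "
            f"(confidence {rec['confidence']:.0%}). Main drivers: {factors}.")

print(render_explanation(recommendation))
```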
AI tools should make complex decisions easier, not harder. If users need deep technical knowledge to extract insights from AI, the system has failed from a UX perspective. Domain experts can help strike a balance between simplicity and depth, ensuring the interface provides actionable, context-aware recommendations while allowing deeper analysis when needed.
For instance, instead of forcing users to manually sift through data tables, AI could provide pre-configured reports based on common logistics challenges. However, expert users should also have on-demand access to raw data and advanced settings when necessary. The key is to design AI interactions that are efficient for everyday use but flexible for deep analysis when required.
Implementation tip: Use domain expert feedback to define default views, priority alerts, and user-configurable settings, ensuring AI interfaces provide both efficiency for routine tasks and depth for deeper research and strategic decisions.
AI UX isn’t a one-and-done process – it needs to evolve with real-world user feedback. Domain experts play a key role in UX testing, refinement, and iteration, ensuring that AI-driven workflows stay aligned with business needs and user expectations.
For example, your initial interface may surface too many low-priority alerts, leading to alert fatigue where users start ignoring AI recommendations. Supply chain experts can identify which alerts are most valuable, allowing UX designers to prioritize high-impact insights while reducing noise.
Implementation tip: Conduct think-aloud sessions and have domain experts verbalize their thought process when interacting with your AI interface. This helps AI teams uncover hidden assumptions and refine AI based on how experts actually think and make decisions.
Vertical AI systems must integrate domain knowledge at every stage, and experts should become key stakeholders in your AI development:
An AI system that "gets" the domain of your users will not only be useful and adopted in the short and medium term, but also contribute to the competitive advantage of your business.
Now that you have learned a bunch of methods to incorporate domain-specific knowledge, you might be wondering how to approach this in your organizational context. Stay tuned for my next article, where we will consider the practical challenges and strategies for implementing an expertise-driven AI strategy!
Note: Unless noted otherwise, all images are the author’s.
Carving out your competitive advantage with AI

When I talk to corporate customers, there is often this idea that AI, while powerful, won’t give any company a lasting competitive edge. After all, over the past two years, large-scale LLMs have become a commodity for everyone. I’ve been thinking a lot about how companies can shape a competitive advantage using AI, and a recent article in the Harvard Business Review (AI Won’t Give You a New Sustainable Advantage) inspired me to organize my thoughts around the topic.
Indeed, maybe one day, when businesses and markets are driven by the invisible hand of AI, the equal-opportunity hypothesis might ring true. But until then, there are so many ways – big and small – for companies to differentiate themselves using AI. I like to think of it as a complex ingredient in your business recipe – the success of the final dish depends on the cook who is making it. The magic lies in how you combine AI craft with strategy, design, and execution.
In this article, I’ll focus on real-life business applications of AI and explore their key sources of competitive advantage. As we’ll see, successful AI integration goes far beyond technology, and certainly beyond having the trendiest LLM at work. It’s about finding AI’s unique sweet spot in your organization, making critical design decisions, and aligning a variety of stakeholders around the optimal design, deployment, and usage of your AI systems. In the following, I will illustrate this using the mental model we developed to structure our thinking about AI projects (cf. this article for an in-depth introduction).
Note: If you want to learn more about pragmatic AI applications in real-life business scenarios, subscribe to my newsletter AI for Business.
AI is often used to automate existing tasks, but the more space you allow for creativity and innovation when selecting your AI use cases, the more likely they will result in a competitive advantage. You should also prioritize the unique needs and strengths of your company when evaluating opportunities.
When we brainstorm AI use cases with customers, 90% of them typically fall into one of four buckets – productivity, improvement, personalization, and innovation. Let’s take the example of an airline business to illustrate some opportunities across these categories:
Of course, the first branch – productivity and automation – looks like the low-hanging fruit. It is the easiest one to implement, and automating boring routine tasks has an undeniable efficiency benefit. However, if you’re limiting your use of AI to basic automation, don’t be surprised when your competitors do the same. In our experience, strategic advantage is built up in the other branches. Companies that take the time to figure out how AI can help them offer something different, not just faster or cheaper, are the ones that see long-term results.
As an example, let’s look at a project we recently implemented with the Lufthansa Group. The company wanted to systematize and speed up its innovation processes. We developed an AI tool that acts as a giant sensor into the airline market, monitoring competitors, trends, and the overall market context. Based on this broad information, the tool now provides tailored innovation recommendations for Lufthansa. There are several aspects that cannot be easily imitated by potential competitors, and certainly not by just using a bigger AI model:
All of this is novel know-how that was developed in tight cooperation between industry experts, practitioners, and a specialized AI team, involving lots of discovery, design decisions, and stakeholder alignment. If you get all of these aspects right, I believe you are on a good path toward creating a sustainable and defensible advantage with AI.
Value creation with AI is a highly individual affair. I recently experienced this firsthand when I challenged myself to build and launch an end-to-end AI app on my own. I’m comfortable with Python and don’t massively benefit from AI help there, but other stuff like frontend? Not really my home turf. In this situation, AI-powered code generation worked like a charm. It felt like flowing through an effortless no-code tool, while having all the versatility of the underlying – and unfamiliar – programming languages under my fingertips. This was my very own, personal sweet spot – using AI where it unlocks value I wouldn’t otherwise tap into, and sparing a frontend developer on the way. Most other people would not get so much value out of this case:
While this is a personal example, the same principle applies at the corporate level. For good or for bad, most companies have some notion of strategy and core competence driving their business. The secret is about finding the right place for AI in that equation – a place where it will complement and amplify the existing skills.
Data is the fuel for any AI system. Here, success comes from curating high-quality, focused datasets and continuously adapting them to evolving needs. By blending AI with your unique expertise and treating data as a dynamic resource, you can transform information into long-term strategic value.
To illustrate the importance of proper knowledge management, let’s do a thought experiment and travel to the 16th century. Antonio and Bartolomeo are the best shoemakers in Florence (which means they’re probably the best in the world). Antonio’s family has meticulously recorded their craft for generations, with shelves of notes on leather treatments, perfect fits, and small adjustments learned from years of experience. On the other hand, Bartolomeo’s family has kept their secrets more closely guarded. They don’t write anything down; their shoemaking expertise has been passed down verbally, from father to son.
Now, a visionary named Leonardo comes along, offering both families a groundbreaking technology that can automate their whole shoemaking business – if it can learn from their data. Antonio comes with his wagon of detailed documentation, and the technology can directly learn from those centuries of know-how. Bartolomeo is in trouble – without written records, there’s nothing explicit for the AI to chew on. His family’s expertise is trapped in oral tradition, intuition, and muscle memory. Should he try to write all of it down now – is it even possible, given that most of his work is governed intuitively? Or should he just let it be and go on with his manual business-as-usual? Succumbing to inertia and uncertainty, he goes for the latter option, while Antonio’s business thrives and grows with the help of the new technology. Freed from daily routine tasks, he can get creative and invent new ways to make and improve shoes.
Beyond explicit documentation, valuable domain expertise is also hidden across other data assets such as transactional data, customer interactions, and market insights. AI thrives on this kind of information, extracting meaning and patterns that would otherwise go unnoticed by humans.
Data doesn’t need to be big – on the contrary, today, big often means noisy. What’s critical is the quality of the data you’re feeding into your AI system. As models become more sample-efficient – i.e., able to learn from smaller, more focused datasets – the kind of data you use is far more important than how much of it you have.
In my experience, the companies that succeed with AI treat their data – be it for training, fine-tuning, or evaluation – like a craft. They don’t just gather information passively; they curate and edit it, refining and selecting data that reflects a deep understanding of their specific industry. This careful approach gives their AI sharper insights and a more nuanced understanding than any competitor using a generic dataset. I’ve seen firsthand how even small improvements in data quality can lead to significant leaps in AI performance.
Data needs to evolve along with the real world. That’s where DataOps comes in, ensuring data is continuously adapted and doesn’t drift apart from reality. The most successful companies understand this and regularly update their datasets to reflect changing environments and market dynamics. A powerful mechanism to achieve this is the data flywheel. The more your AI generates insights, the better your data becomes, creating a self-reinforcing feedback loop because users will come back to your system more often. With every cycle, your data sharpens and your AI improves, building an advantage that competitors will struggle to match. To kick off the data flywheel, your system needs to demonstrate some initial value to start with – and then, you can bake in some additional incentives to nudge your users into using your system on a regular basis.
Now, let’s dive into the "intelligence" component. This component isn’t just about AI models in isolation – it’s about how you integrate them into larger intelligent systems. Big Tech is working hard to make us believe that AI success hinges on the use of massive LLMs such as the GPT models. Good for them – bad for those of us who want to use AI in real-life applications. Overrelying on these heavyweights can bloat your system and quickly become a costly liability, while smart system design and tailored models are important sources for differentiation and competitive advantage.
Mainstream LLMs are generalists. Like high-school graduates, they have a mediocre-to-decent performance across a wide range of tasks. However, in business, decent isn’t enough. You need to send your AI model to university so it can specialize, respond to your specific business needs, and excel in your domain. This is where fine-tuning comes into play. However, it’s important to recognize that mainstream LLMs, while powerful, can quickly become slow and expensive if not managed efficiently. As Big Tech boasts about larger model sizes and longer context windows – i.e., how much information you can feed into one prompt – smart tech is quietly moving towards efficiency. Techniques like prompt compression reduce prompt size, making interactions faster and more cost-effective. Small language models (SLMs) are another trend (Figure 4). With up to a couple of billions of parameters, they allow companies to safely deploy task- and domain-specific intelligence on their internal infrastructure (Anacode).
But before fine-tuning an LLM, ask yourself whether generative AI is even the right solution for your specific challenge. In many cases, predictive AI models – those that focus on forecasting outcomes rather than generating content – are more effective, cheaper, and easier to defend from a competitive standpoint. And while this might sound like old news, most of AI value creation in businesses actually happens with predictive AI.
AI models don’t operate in isolation. Just as the human brain consists of multiple regions, each responsible for specific capabilities like reasoning, vision, and language, a truly intelligent AI system often involves multiple components. This is also called a "compound AI system" (BAIR). Compound systems can accommodate different models, databases, and software tools and allow you to optimize for cost and transparency. They also enable faster iteration and extension – modular components are easier to test and rearrange than a huge monolithic LLM.
Take, for example, a customer service automation system for an SME. In its basic form – calling a commercial LLM – such a setup might cost you a significant amount – let’s say $21k/month for a "vanilla" system. This cost can easily scare away an SME, and they will not touch the opportunity at all. However, with careful engineering, optimization, and the integration of multiple models, the costs can be reduced by as much as 98% (FrugalGPT). Yes, you read it right, that’s 2% of the original cost – a staggering difference, putting a company with stronger AI and engineering skills at a clear advantage. At the moment, most businesses are not leveraging these advanced techniques, and we can only imagine how much there is yet to optimize in their AI usage.
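The core of such savings is often a simple cascade: answer with a small, cheap model when it is confident, and escalate only the hard cases to the expensive one. A schematic sketch, in which both model functions are placeholders rather than real API calls:

```python
# Sketch of a cost-aware model cascade (the FrugalGPT idea in miniature).
# `small_model` and `large_model` stand in for your own model calls.

def small_model(query: str) -> tuple[str, float]:
    """Cheap, fast model: returns (answer, confidence). Placeholder logic."""
    if "refund" in query.lower():
        return "Refunds are processed within 5 business days.", 0.93
    return "I am not sure.", 0.30

def large_model(query: str) -> str:
    """Expensive fallback model. Placeholder logic."""
    return f"[detailed answer from the large model for: {query}]"

def answer(query: str, confidence_threshold: float = 0.85) -> str:
    reply, confidence = small_model(query)
    if confidence >= confidence_threshold:
        return reply               # cheap path, ideally most of the traffic
    return large_model(query)      # escalate only when needed

print(answer("How long does a refund take?"))
print(answer("My parcel arrived damaged and the label was wrong."))
```

In production, the confidence signal would come from the small model's own calibration or a separate router, but the cost logic stays the same.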
While generative AI has captured everyone’s imagination with its ability to produce content, the real future of AI lies in reasoning and problem-solving. Unlike content generation, reasoning is nonlinear – it involves skills like abstraction and generalization which generative AI models aren’t trained for.
AI systems of the future will need to handle complex, multi-step activities that go far beyond what current generative models can do. We’re already seeing early demonstrations of AI’s reasoning capabilities, whether through language-based emulations or engineered add-ons. However, the limitations are apparent – past a certain threshold of complexity, these models start to hallucinate. Companies that invest in crafting AI systems designed to handle these complex, iterative processes will have a major head start. These companies will thrive as AI moves beyond its current generative phase and into a new era of smart, modular, and reasoning-driven systems.
User experience is the channel through which you can deliver the value of AI to users. It should smoothly transport the benefits users need to speed up and perfect their workflows, while inherent AI risks and issues such as erroneous outputs need to be filtered or mitigated.
In most real-world scenarios, AI alone can’t achieve full automation. For example, at my company Equintel, we use AI to assist in the ESG reporting process, which involves multiple layers of analysis and decision-making. While AI excels at large-scale data processing, there are many subtasks that demand human judgment, creativity, and expertise. An ergonomic system design reflects this labor distribution, relieving humans from tedious data routines and giving them the space to focus on their strengths.
This strength-based approach also alleviates common fears of job replacement. When employees are empowered to focus on tasks where their skills shine, they’re more likely to view AI as a supporting tool, not a competitor. This fosters a win-win situation where both humans and AI thrive by working together.
Every AI model has an inherent failure rate. Whether generative AI hallucinations or incorrect outputs from predictive models, mistakes happen and accumulate into the dreaded "last-mile problem." Even if your AI system performs well 90% of the time, a small error rate can quickly become a showstopper if users overtrust the system and don’t address its errors.
Consider a bank using AI for fraud detection. If the AI fails to flag a fraudulent transaction and the user doesn’t catch it, the resulting loss could be significant – let’s say $500,000 siphoned from a compromised account. Without proper trust calibration, users might lack the tools or alerts to question the AI’s decision, allowing fraud to go unnoticed.
Now, imagine another bank using the same system but with proper trust calibration in place. When the AI is uncertain about a transaction, it flags it for review, even if it doesn’t outright classify it as fraud. This additional layer of trust calibration encourages the user to investigate further, potentially catching fraud that would have slipped through. In this scenario, the bank could avoid the $500,000 loss. Multiply that across multiple transactions, and the savings – along with improved security and customer trust – are substantial.
Success with AI requires more than just adopting the latest technologies – it’s about identifying and nurturing the individual sweet spots where AI can drive the most value for your business. This involves:
Finally, I believe we are moving into a time when the notion of competitive advantage itself is shaken up. While in the past, competing was all about maximizing profitability, today, businesses are expected to balance financial gains with sustainability, which adds a new layer of complexity. AI has the potential to help companies not only optimize their operations but also move toward more sustainable practices. Imagine AI helping to reduce plastic waste, streamline shared economy models, or support other initiatives that make the world a better place. The real power of AI lies not just in efficiency but in the potential it offers us to reshape whole industries and drive both profit and positive social impact.
For deep-dives into many of the topics that were touched in this article, check out my upcoming book The Art of AI Product Development.
Note: Unless noted otherwise, all images are the author’s.
Designing the relationship between LLMs and user experience

A while ago, I wrote the article Choosing the right language model for your NLP use case on Medium. It focused on the nuts and bolts of LLMs – and while rather popular, by now, I realize it doesn’t actually say much about selecting LLMs. I wrote it at the beginning of my LLM journey and somehow figured that the technical details about LLMs – their inner workings and training history – would speak for themselves, allowing AI product builders to confidently select LLMs for specific scenarios.
Since then, I have integrated LLMs into multiple AI products. This allowed me to discover how exactly the technical makeup of an LLM determines the final experience of a product. It also strengthened the belief that product managers and designers need to have a solid understanding of how an LLM works "under the hood." LLM interfaces are different from traditional graphical interfaces. The latter provide users with a (hopefully clear) mental model by displaying the functionality of a product in a rather implicit way. On the other hand, LLM interfaces use free text as the main interaction format, offering much more flexibility. At the same time, they also "hide" the capabilities and the limitations of the underlying model, leaving it to the user to explore and discover them. Thus, a simple text field or chat window invites an infinite number of intents and inputs and can display as many different outputs.
The responsibility for the success of these interactions is not (only) on the engineering side – rather, a big part of it should be assumed by whoever manages and designs the product. In this article, we will flesh out the relationship between LLMs and User Experience, working with two universal ingredients that you can use to improve the experience of your product:
(Note: These two ingredients are part of any LLM application. Beyond these, most applications will also have a range of more individual criteria to be fulfilled, such as latency, privacy, and safety, which will not be addressed here.)
Thus, in Peter Drucker’s words, it’s about "doing the right things" (functionality) and "doing them right" (quality). Now, as we know, LLMs will never be 100% right. As a builder, you can approximate the ideal experience from two directions:
In this article, we will focus on the engineering part. The design of the ideal partnership with human users will be covered in a future article. First, I will briefly introduce the steps in the engineering process – LLM selection, adaptation, and evaluation – which directly determine the final experience. Then, we will look at the two ingredients – functionality and quality – and provide some guidelines to steer your work with LLMs to optimize the product’s performance along these dimensions.
A note on scope: In this article, we will consider the use of stand-alone LLMs. Many of the principles and guidelines also apply to LLMs used in RAG (Retrieval-Augmented Generation) and agent systems. For a more detailed consideration of the user experience in these extended LLM scenarios, please refer to my book The Art of AI Product Development.
In the following, we will focus on the three steps of LLM selection, adaptation, and evaluation. Let’s consider each of these steps:
The following figure summarizes the process:
In real life, the three stages will overlap, and there can be back-and-forth between the stages. In general, model selection is more the "one big decision." Of course, you can shift from one model to another further down the road and even should do this when new, more suitable models appear on the market. However, these changes are expensive since they affect everything downstream. Past the discovery phase, you will not want to make them on a regular basis. On the other hand, LLM adaptation and evaluation are highly iterative. They should be accompanied by continuous discovery activities where you learn more about the behavior of your model and your users. Finally, all three activities should be embedded into a solid LLMOps pipeline, which will allow you to integrate new insights and data with minimal engineering friction.
Now, let’s move to the second column of the chart, scoping the functionality of an LLM and learning how it can be shaped during the three stages of this process.
You might be wondering why we talk about the "functionality" of LLMs. After all, aren’t LLMs those versatile all-rounders that can magically perform any linguistic task we can think of? In fact, they are, as famously described in the paper Language Models Are Few-Shot Learners. LLMs can learn new capabilities from just a couple of examples. Sometimes, their capabilities will even "emerge" out of the blue during normal training and – hopefully – be discovered by chance. This is because the task of language modeling is just as versatile as it is challenging – as a side effect, it equips an LLM with the ability to perform many other related tasks.
Still, the pre-training objective of LLMs is to generate the next word given the context of past words (OK, that’s a simplification – in auto-encoding, the LLM can work in both directions [3]). This is what a pre-trained LLM, motivated by an imaginary "reward," will insist on doing once it is prompted. In most cases, there is quite a gap between this objective and a user who comes to your product to chat, get answers to questions, or translate a text from German to Italian. The landmark paper Climbing Towards NLU: On Meaning, Form, and Understanding in the Age of Data by Emily Bender and Alexander Koller even argues that language models are generally unable to recover communicative intents and thus are doomed to work with incomplete meaning representations.
Thus, it is one thing to brag about amazing LLM capabilities in scientific research and demonstrate them on highly controlled benchmarks and test scenarios. Rolling out an LLM to an anonymous crowd of users with different AI skills and intents—some harmful—is a different kind of game. This is especially true once you understand that your product inherits not only the capabilities of the LLM but also its weaknesses and risks, and you (not a third-party provider) hold the responsibility for its behavior.
In practice, we have learned that it is best to identify and isolate discrete islands of functionality when integrating LLMs into a product. These functions can largely correspond to the different intents with which your users come to your product. For example, it could be:
Oftentimes, these can be further decomposed into more granular, potentially even reusable, capabilities. "Engaging in conversation" could be decomposed into:
Taking this more discrete approach to LLM capabilities provides you with the following advantages:
Let’s summarize some practical guidelines to make sure that the LLM does the right thing in your product:
These are just short snapshots of the lessons we learned when integrating LLMs. My upcoming book The Art of AI Product Development contains deep dives into each of the guidelines along with numerous examples. For the technical details behind pre-training objectives and procedures, you can refer to this article.
Ok, so you have gained an understanding of the intents with which your users come to your product and "motivated" your model to respond to these intents. You might even have put out the LLM into the world in the hope that it will kick off the data flywheel. Now, if you want to keep your good-willed users and acquire new users, you need to quickly ramp up on our second ingredient, namely quality.
In the context of LLMs, quality can be decomposed into an objective and a subjective component. The objective component tells you when and why things go wrong (i.e., the LLM makes explicit mistakes). The subjective component is more subtle and emotional, reflecting the alignment with your specific user crowd.
Using language to communicate comes naturally to humans. Language is ingrained in our minds from the beginning of our lives, and we have a hard time imagining how much effort it takes to learn it from scratch. Even the challenges we experience when learning a foreign language can’t compare to the training of an LLM. The LLM starts from a blank slate, while our learning process builds on an incredibly rich basis of existing knowledge about the world and about how language works in general.
When working with an LLM, we should constantly remain aware of the many ways in which things can go wrong:
These shortcomings can quickly turn into showstoppers for your product – output quality is a central determinant of the user experience of an LLM product. For example, one of the major determinants of the "public" success of ChatGPT was that it was indeed able to generate correct, fluent, and relatively coherent text across a large variety of domains. Earlier generations of LLMs were not able to achieve this objective quality. Most pre-trained LLMs that are used in production today do have the capability to generate language. However, their performance on criteria like coherence, consistency, and world knowledge can be very variable and inconsistent. To achieve the experience you are aiming for, it is important to have these requirements clearly prioritized and select and adapt LLMs accordingly.
Venturing into the more nuanced subjective domain, you want to understand and monitor how users feel around your product. Do they feel good and trustful and get into a state of flow when they use it? Or do they go away with feelings of frustration, inefficiency, and misalignment? A lot of this hinges on individual nuances of culture, values, and style. If you are building a copilot for junior developers, you hardly want it to speak the language of senior executives and vice versa.
For the sake of example, imagine you are a product marketer. You have spent a lot of your time with a fellow engineer to iterate on an LLM that helps you with content generation. At some point, you find yourself chatting with the UX designer on your team and bragging about your new AI assistant. Your colleague doesn’t get the need for so much effort. He is regularly using ChatGPT to assist with the creation and evaluation of UX surveys and is very satisfied with the results. You counter – ChatGPT’s outputs are too generic and monotonous for your storytelling and writing tasks. In fact, you have been using it at the beginning and got quite embarrassed because, at some point, your readers started to recognize the characteristic ChatGPT flavor. That was a slippery episode in your career, after which you decided you needed something more sophisticated.
There is no right or wrong in this discussion. ChatGPT is good for straightforward factual tasks where style doesn’t matter that much. By contrast, you as a marketer need an assistant that can assist in crafting high-quality, persuasive communications that speak the language of your customers and reflect the unique DNA of your company.
These subjective nuances can ultimately define the difference between an LLM that is useless because its outputs need to be rewritten anyway and one that is "good enough" so users start using it and feed it with suitable fine-tuning data. The holy grail of LLM mastery is personalization – i.e., using efficient fine-tuning or prompt tuning to adapt the LLM to the individual preferences of any user who has spent a certain amount of time with the model. If you are just starting out on your LLM journey, these details might seem far off – but in the end, they can help you reach a level where your LLM delights users by responding in the exact manner and style that is desired, spurring user satisfaction and large-scale adoption and leaving your competition behind.
Here are our tips for managing the quality of your LLM:
Pre-trained LLMs are highly convenient – they make AI accessible to everyone, offloading the huge engineering, computation, and infrastructure spending needed to train a huge initial model. Once published, they are ready to use, and we can plug their amazing capabilities into our product. However, when using a third-party model in your product, you inherit not only its power but also the many ways in which it can and will fail. When things go wrong, the last thing you want to do to maintain integrity is to blame an external model provider, your engineers, or – worse – your users.
Thus, when building with LLMs, you should not only look for transparency into the model’s origins (training data and process) but also build a causal understanding of how its technical makeup shapes the experience offered by your product. This will allow you to find the sensitive balance between kicking off a robust data flywheel at the beginning of your journey and continuously optimizing and differentiating the LLM as your product matures toward excellence.
[1] Janna Lipenkova (2022). Choosing the right language model for your NLP use case, Medium.
[2] Tom B. Brown et al. (2020). Language Models are Few-Shot Learners.
[3] Jacob Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[4] Emily M. Bender and Alexander Koller (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.
[5] Janna Lipenkova (upcoming). The Art of AI Product Development, Manning Publications.
Note: All images are by the author, except when noted otherwise.
Redefining Conversational AI with Large Language Models

Conversational AI is an application of LLMs that has triggered a lot of buzz and attention due to its scalability across many industries and use cases. While conversational systems have existed for decades, LLMs have brought the quality push that was needed for their large-scale adoption. In this article, we will use the mental model shown in Figure 1 to dissect conversational AI applications (cf. Building AI products with a holistic mental model for an introduction to the mental model). After considering the market opportunities and the business value of conversational AI systems, we will explain the additional "machinery" in terms of data, LLM fine-tuning, and conversational design that needs to be set up to make conversations not only possible but also useful and enjoyable.
Traditional UX design is built around a multitude of artificial UX elements, swipes, taps, and clicks, requiring a learning curve for each new app. Using conversational AI, we can do away with this busyness, substituting it with the elegant experience of a naturally flowing conversation in which we can forget about the transitions between different apps, windows, and devices. We use language, our universal and familiar protocol for communication, to interact with different virtual assistants (VAs) and accomplish our tasks.
Conversational UIs are not exactly the new hot stuff. Interactive voice response systems (IVRs) and chatbots have been around since the 1990s, and major advances in NLP have been closely followed by waves of hope and development for voice and chat interfaces. However, before the time of LLMs, most of the systems were implemented in the symbolic paradigm, relying on rules, keywords, and conversational patterns. They were also limited to a specific, pre-defined domain of "competence", and users venturing outside of these would soon hit a dead end. All in all, these systems were riddled with potential points of failure, and after a couple of frustrating attempts, many users never came back to them. The following figure illustrates an example dialogue. A user who wants to order tickets for a specific concert patiently goes through a detailed interrogation flow, only to find out at the end that the concert is sold out.
As an enabling technology, LLMs can take conversational interfaces to new levels of quality and user satisfaction. Conversational systems can now display much broader world knowledge, linguistic competence, and conversational ability. Leveraging pre-trained models, they can also be developed in much shorter timespans since the tedious work of compiling rules, keywords, and dialogue flows is now replaced by the statistical knowledge of the LLM. Let’s look at two prominent applications where conversational AI can provide value at scale:
Beyond these major application areas, there are numerous other applications, such as telehealth, mental health assistants, and educational chatbots, that can streamline UX and bring value to their users in a faster and more efficient way.
LLMs are originally not trained to engage in fluent small talk or more substantial conversations. Rather, they learn to generate the next token at each inference step, eventually resulting in a coherent text. This low-level objective is different from the challenge of human conversation. Conversation is incredibly intuitive for humans, but it gets incredibly complex and nuanced when you want to teach a machine to do it. For example, let’s look at the fundamental notion of intents. When we use language, we do so for a specific purpose, which is our communicative intent – it could be to convey information, socialize, or ask someone to do something. While the first two are rather straightforward for an LLM (as long as it has seen the required information in the data), the latter is already more challenging. Not only does the LLM need to combine and structure the related information in a coherent way, but it also needs to set the right emotional tone in terms of soft criteria such as formality, creativity, humor, etc. This is a challenge for conversational design (cf. section 5), which is closely intertwined with the task of creating fine-tuning data.
Making the transition from classical language generation to recognizing and responding to specific communicative intents is an important step toward better usability and acceptance of conversational systems. As for all fine-tuning endeavors, this starts with the compilation of an appropriate dataset.
The fine-tuning data should come as close as possible to the (future) real-world data distribution. First, it should be conversational (dialogue) data. Second, if your virtual assistant will be specialized in a specific domain, you should try to assemble fine-tuning data that reflects the necessary domain knowledge. Third, if there are typical flows and requests that will be recurring frequently in your application, as in the case of customer support, try to incorporate varied examples of these in your training data. The following table shows a sample of conversational fine-tuning data from the 3K Conversations Dataset for ChatBot, which is freely available on Kaggle:
Manually creating conversational data can become an expensive undertaking – crowdsourcing and using LLMs to help you generate data are two ways to scale up. Once the dialogue data is collected, the conversations need to be assessed and annotated. This allows you to show both positive and negative examples to your model and nudge it towards picking up the characteristics of the "right" conversations. The assessment can happen either with absolute scores or a ranking of different options between each other. The latter approach leads to more accurate fine-tuning data because humans are normally better at ranking multiple options than evaluating them in isolation.
With your data in place, you are ready to fine-tune your model and enrich it with additional capabilities. In the next section, we will look at fine-tuning, integrating additional information from memory and semantic search, and connecting agents to your conversational system to empower it to execute specific tasks.
A typical conversational system is built with a conversational agent that orchestrates and coordinates the components and capabilities of the system, such as the LLM, the memory, and external data sources. The development of conversational AI systems is a highly experimental and empirical task, and your developers will be in a constant back-and-forth between optimizing your data, improving the fine-tuning strategy, playing with additional components and enhancements, and testing the results. Non-technical team members, including product managers and UX designers, will also be continuously testing the product. Based on their customer discovery activities, they are in a great position to anticipate future users’ conversation style and content and should be actively contributing this knowledge.
For fine-tuning, you need your fine-tuning data (cf. section 2) and a pre-trained LLM. LLMs already know a lot about language and the world, and our challenge is to teach them the principles of conversation. In fine-tuning, the target outputs are texts, and the model will be optimized to generate texts that are as similar as possible to the targets. For supervised fine-tuning, you first need to clearly define the conversational AI task you want the model to perform, gather the data, and run and iterate over the fine-tuning process.
With the hype around LLMs, a variety of fine-tuning methods have emerged. For a rather traditional example of fine-tuning for conversation, you can refer to the description of the LaMDA model.[1] LaMDA was fine-tuned in two steps. First, dialogue data is used to teach the model conversational skills ("generative" fine-tuning). Then, the labels produced by annotators during the assessment of the data are used to train classifiers that can assess the model’s outputs along desired attributes, which include sensibleness, specificity, interestingness, and safety ("discriminative" fine-tuning). These classifiers are then used to steer the behavior of the model towards these attributes.
Additionally, factual groundedness – the ability to ground their outputs in credible external information – is an important attribute of LLMs. To ensure factual groundedness and minimize hallucination, LaMDA was fine-tuned with a dataset that involves calls to an external information retrieval system whenever external knowledge is required. Thus, the model learned to first retrieve factual information whenever the user made a query that required new knowledge.
Another popular fine-tuning technique is Reinforcement Learning from Human Feedback (RLHF)[2]. RLHF "redirects" the learning process of the LLM from the straightforward but artificial next-token prediction task towards learning human preferences in a given communicative situation. These human preferences are directly encoded in the training data. During the annotation process, humans are presented with prompts and either write the desired response or rank a series of existing responses. The behavior of the LLM is then optimized to reflect the human preference.
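To illustrate what such preference data might look like, here is a small sketch; the field names and the conversion into pairwise examples are illustrative assumptions, not a prescribed format:

```python
# Illustrative structure for human preference data (field names are assumptions).
# A ranking of candidate responses is converted into pairwise (chosen, rejected)
# examples, the format typically used to train a reward model for RLHF.
annotation = {
    "prompt": "My order hasn't arrived yet. What can I do?",
    "responses": [
        "Please share your order number and I'll check the shipping status for you.",
        "Orders can be late sometimes.",
        "That's not my problem.",
    ],
    "ranking": [1, 2, 3],  # rank positions assigned by the annotator: lower = preferred
}

def ranking_to_pairs(example):
    """Turn one ranked annotation into pairwise preference examples."""
    ranked = sorted(zip(example["ranking"], example["responses"]))
    pairs = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            pairs.append({
                "prompt": example["prompt"],
                "chosen": ranked[i][1],    # higher-ranked response
                "rejected": ranked[j][1],  # lower-ranked response
            })
    return pairs

for pair in ranking_to_pairs(annotation):
    print(pair["chosen"], "is preferred over", pair["rejected"])
```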
Beyond compiling conversations for fine-tuning the model, you might want to enhance your system with specialized data that can be leveraged during the conversation. For example, your system might need access to external data, such as patents or scientific papers, or internal data, such as customer profiles or your technical documentation. This is normally done via semantic search (also known as retrieval-augmented generation, or RAG)[3]. The additional data is saved in a database in the form of semantic embeddings (cf. this article for an explanation of embeddings and further references). When the user request comes in, it is preprocessed and transformed into a semantic embedding. The semantic search then identifies the documents that are most relevant to the request and uses them as context for the prompt. By integrating additional data with semantic search, you can reduce hallucination and provide more useful, factually grounded responses. By continuously updating the embedding database, you can also keep the knowledge and responses of your system up-to-date without constantly rerunning your fine-tuning process.
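As a minimal illustration of this pattern, the following sketch embeds a handful of documents and assembles a context-augmented prompt; the embedding model and documents are placeholders, and a production system would use a proper vector database instead of an in-memory matrix:

```python
# Minimal retrieval-augmented prompting sketch (model name and documents are placeholders).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Shipping to the EU usually takes 3-5 business days.",
    "Returns are accepted within 30 days of delivery.",
    "Premium members get free express shipping.",
]
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

def build_prompt(user_query: str, top_k: int = 2) -> str:
    query_embedding = embedder.encode([user_query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ query_embedding  # cosine similarity (vectors are normalized)
    top_docs = [documents[i] for i in np.argsort(scores)[::-1][:top_k]]
    context = "\n".join(top_docs)
    return (f"Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:")

print(build_prompt("How long do I have to return an item?"))
```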
Imagine going to a party and meeting Peter, a lawyer. You get excited and start pitching the legal chatbot you are currently planning to build. Peter looks interested, leans towards you, uhms and nods. At some point, you want his opinion on whether he would like to use your app. Instead of an informative statement that would compensate for your eloquence, you hear: "Uhm… what was this app doing again?"
The unwritten contract of communication among humans presupposes that we are listening to our conversation partners and building our own speech acts on the context we are co-creating during the interaction. In social settings, the emergence of this joint understanding characterizes a fruitful, enriching conversation. In more mundane settings like reserving a restaurant table or buying a train ticket, it is an absolute necessity in order to accomplish the task and provide the expected value to the user. This requires your assistant to know the history of the current conversation, but also of past conversations – for example, it should not be asking for the name and other personal details of a user over and over whenever they initiate a conversation.
One of the challenges of maintaining context awareness is coreference resolution, i.e. understanding which objects are referred to by pronouns. Humans intuitively use a lot of contextual cues when they interpret language – for example, you can ask a young child, "Please get the green ball out of the red box and bring it to me," and the child will know you mean the ball, not the box. For virtual assistants, this task can be rather challenging, as illustrated by the following dialogue:
Assistant: Thanks, I will now book your flight. Would you also like to order a meal for your flight?
User: Uhm… can I decide later whether I want it?
Assistant: Sorry, this flight cannot be changed or canceled later.
Here, the assistant fails to recognize that the pronoun it from the user refers not to the flight, but to the meal, thus requiring another iteration to fix this misunderstanding.
Every now and then, even the best LLM will misbehave and hallucinate. In many cases, hallucinations are plain accuracy issues – and, well, you need to accept that no AI is 100% accurate. Compared to other AI systems, the "distance" between the user and the AI is rather small. A plain accuracy issue can quickly turn into something that is perceived as toxic, discriminatory, or generally harmful. Additionally, since LLMs don’t have an inherent understanding of privacy, they can also reveal sensitive data such as personally identifiable information (PII). You can work against these behaviors by using additional guardrails. Tools such as Guardrails AI, Rebuff, NeMo Guardrails, and Microsoft Guidance allow you to de-risk your system by formulating additional requirements on LLM outputs and blocking undesired outputs.
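The following sketch shows the underlying pattern with a hand-rolled guardrail rather than the API of any of the libraries mentioned above: validate the output, then pass, fix, or block it (the patterns and blocklist are purely illustrative):

```python
# Simplified, hand-rolled output guardrail: checks an LLM response for PII patterns
# and blocked terms before it reaches the user. Dedicated libraries offer far more,
# but the basic pattern is the same.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
BLOCKED_TERMS = {"confidential", "internal only"}  # illustrative blocklist

def apply_guardrails(llm_output: str) -> str:
    lowered = llm_output.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "I'm sorry, I can't share that information."
    for label, pattern in PII_PATTERNS.items():
        llm_output = pattern.sub(f"[REDACTED {label.upper()}]", llm_output)
    return llm_output

print(apply_guardrails("Sure, you can reach the customer at jane.doe@example.com."))
```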
Multiple architectures are possible in conversational AI. The following schema shows a simple example of how the fine-tuned LLM, external data, and memory can be integrated by a conversational agent, which is also responsible for the prompt construction and the guardrails.
The charm of conversational interfaces lies in their simplicity and uniformity across different applications. If the future of user interfaces is that all apps look more or less the same, is the job of the UX designer doomed? Definitely not – conversation is an art to be taught to your LLM so it can conduct conversations that are helpful, natural, and comfortable for your users. Good conversational design emerges when we combine our knowledge of human psychology, linguistics, and UX design. In the following, we will first consider two basic choices when building a conversational system, namely whether you will use voice and/or chat, as well as the larger context of your system. Then, we will look at the conversations themselves, and see how you can design the personality of your assistant while teaching it to engage in helpful and cooperative conversations.
Conversational interfaces can be implemented using chat or voice. In a nutshell, voice is faster while chat allows users to stay private and to benefit from enriched UI functionality. Let’s dive a bit deeper into the two options since this is one of the first and most important decisions you will face when building a conversational app.
To pick between the two alternatives, start by considering the physical setting in which your app will be used. For example, why are almost all conversational systems in cars, such as those offered by Nuance Communications, based on voice? Because the hands of the driver are already busy and they cannot constantly switch between the steering wheel and a keyboard. This also applies to other activities like cooking, where users want to stay in the flow of their activity while using your app. Cars and kitchens are mostly private settings, so users can experience the joy of voice interaction without worrying about privacy or about bothering others. By contrast, if your app is to be used in a public setting like the office, a library, or a train station, voice might not be your first choice.
After understanding the physical setting, consider the emotional side. Voice can be used intentionally to transmit tone, mood, and personality – does this add value in your context? If you are building your app for leisure, voice might increase the fun factor, while an assistant for mental health could accommodate more empathy and allow a potentially troubled user a wider range of expression. By contrast, if your app will assist users in a professional setting like trading or customer service, a more anonymous, text-based interaction might contribute to more objective decisions and spare you the hassle of designing an overly emotional experience.
As a next step, think about the functionality. The text-based interface allows you to enrich the conversations with other media like images and graphical UI elements such as buttons. For example, in an e-commerce assistant, an app that suggests products by posting their pictures and structured descriptions will be way more user-friendly than one that describes products via voice and potentially provides their identifiers.
Finally, let’s talk about the additional design and development challenges of building a voice UI:
If you go for the voice solution, make sure that you not only clearly understand the advantages as compared to chat, but also have the skills and resources to address these additional challenges.
Now, let’s consider the larger context in which you can integrate conversational AI. All of us are familiar with chatbots on company websites – those widgets on the right of your screen that pop up when we open the website of a business. Personally, more often than not, my intuitive reaction is to look for the Close button. Why is that? Through initial attempts to "converse" with these bots, I have learned that they cannot satisfy more specific information requirements, and in the end, I still need to comb through the website. The moral of the story? Don’t build a chatbot because it’s cool and trendy – rather, build it because you are sure it can create additional value for your users.
Beyond the controversial widget on a company website, there are several exciting contexts to integrate those more general chatbots that have become possible with LLMs:
As humans, we are wired to anthropomorphize, i.e. to attribute additional human traits to anything that vaguely resembles a human. Language is one of humankind’s most unique and fascinating abilities, and conversational products will automatically be associated with humans. People will imagine a person behind their screen or device – and it is good practice not to leave this imagined person to the chance of your users’ imaginations, but rather to lend it a consistent personality that is aligned with your product and brand. This process is called "persona design".
The first step of persona design is understanding the character traits you would like your persona to display. Ideally, this is already done at the level of the training data – for example, when using RLHF, you can ask your annotators to rank the data according to traits like helpfulness, politeness, fun, etc., in order to bias the model towards the desired characteristics. These characteristics can be matched with your brand attributes to create a consistent image that continuously promotes your branding via the product experience.
Beyond general characteristics, you should also think about how your virtual assistant will deal with specific situations beyond the "happy path". For example, how will it respond to user requests that are beyond its scope, reply to questions about itself, and deal with abusive or vulgar language?
It is important to develop explicit internal guidelines on your persona that can be used by data annotators and conversation designers. This will allow you to design your persona in a purposeful way and keep it consistent across your team and over time, as your application undergoes multiple iterations and refinements.
Have you ever had the impression of talking to a brick wall when you were actually speaking with a human? Sometimes, we find our conversation partners are just not interested in leading the conversation to success. Fortunately, in most cases, things are smoother, and humans will intuitively follow the "cooperative principle" introduced by the language philosopher Paul Grice.[4] According to this principle, humans who successfully communicate with each other follow four maxims, namely quantity, quality, relevance, and manner.
Maxim of quantity
The maxim of quantity asks the speaker to be informative and make their contribution as informative as required. On the side of the virtual assistant, this also means actively moving the conversation forward. For example, consider this snippet from an e-commerce fashion app:
Assistant: What kind of clothing items are you looking for?
User: I am looking for a dress in orange.
Assistant: Don’t: Sorry, we don’t have orange dresses at the moment.
Do: Sorry, we don’t have dresses in orange, but we have this great and very comfortable dress in yellow: …
The user hopes to leave your app with a suitable item. Stopping the conversation because you don’t have items that would fit the exact description kills off the possibility of success. However, if your app makes suggestions about alternative items, it will appear more helpful and leave the option of a successful interaction open.
Especially in voice interactions, it is important to find the right balance between providing all the information the user might need for success while not overwhelming them with unnecessary information which might cloud the interaction.
Maxim of quality
The maxim of quality asks speakers to be truthful and not say things they believe are false or for which they lack adequate evidence. There is a lot of subjectivity and personal belief involved here – thus, this maxim cannot be directly projected onto LLMs. As we have seen, LLMs tend to hallucinate, and this risk is especially high when the conversation drifts off into less familiar territories. Additional guardrails can be used in order to avoid misleading statements and bring the conversation back on track.
Maxim of relevance
The maxim of relevance asks us to be – well, relevant. As a precondition, the virtual assistant has to understand the true intent of the user. Here is an example of an airline assistant that violates this rule:
User: I need a flight from Berlin to Zurich tomorrow morning.
Assistant: OK, I found the following flights from Berlin to Zurich for tomorrow morning. Flight BZ36294 leaves Berlin at 6.30am and arrives in Zurich at 8am. Flight BZ96294 leaves Berlin at 8am and arrives in Zurich at 9.30am. Flight BZ95614 leaves Berlin at 11am and arrives in Zurich at 12.30pm. There is also the possibility of a connecting flight, in which case you would leave Berlin at 11.45am and arrive in Zurich at 3pm.
User: Sorry… could you repeat this again?
The assistant is overwhelming the user with detailed information that might not be relevant. By breaking this up and guiding the user to limit their search space, the dialogue becomes much easier to digest:
User: I need a flight from Berlin to Zurich tomorrow morning.
Assistant: Ok, I found flights leaving at 6.30, 8, and 11am. Which time would you like to leave?
User: I need to be in Zurich before 9am.
Assistant: OK, so you can take the flight BZ36294. It leaves at 6.30 and arrives at 8am. Should I buy the ticket for you?
User: Yes, thanks.
Maxim of manner
Finally, the maxim of manner states that our speech acts should be clear, concise and orderly, avoiding ambiguity and obscurity of expression. Your virtual assistant should avoid technical or internal jargon, and favour simple, universally understandable formulations.
While Grice’s principles are valid for all conversations independently of a specific domain, LLMs not trained specifically for conversation often fail to fulfill them. Thus, when compiling your training data, it is important to have enough dialogue samples that allow your model to learn these principles.
The domain of conversational design is developing rather quickly. Whether you are already building AI products or thinking about your career path in AI, I encourage you to dig deeper into this topic (cf. the excellent introductions in [5] and [6]). As AI is turning into a commodity, good design together with a defensible data strategy will become two important differentiators for AI products.
Let’s summarize the key takeaways from the article. Additionally, figure 5 offers a "cheat sheet" with the main points that you can download as a reference.
[1] Heng-Tze Cheng et al. 2022. LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything.
[2] OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. Retrieved on January 13, 2023.
[3] Patrick Lewis et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
[4] Paul Grice. 1989. Studies in the Way of Words.
[5] Cathy Pearl. 2016. Designing Voice User Interfaces.
[6] Michael Cohen et al. 2004. Voice User Interface Design.
Note: All images are by the author, unless noted otherwise.
The post Redefining Conversational AI with Large Language Models appeared first on Towards Data Science.
The post Building AI products with a holistic mental model appeared first on Towards Data Science.
Last update: October 19, 2024
Recently, I coached a client – an SME in the fintech sector – who had hit a dead end with their AI effort for marketing content generation. With their MS Copilot and access to the latest OpenAI models in place, they were all set for their AI adventure. They had even hired a prompt engineer to help them create prompts they could regularly use for new marketing content. The whole project was fun and engaging, but things didn’t make it into production. Dissatisfied with the many issues in the AI outputs – hallucination, the failure to integrate relevant sources, and a heavy AI flavour in the writing style – marketeers would switch back to "manual mode" as soon as things got serious. They simply couldn’t rely on the AI to produce content that would publicly represent the company.
Misunderstanding AI as prompting the latest GenAI models, this company failed to embrace all the elements of a successful AI initiative, so we had to put them back in place. In this article, I will introduce a mental model for AI systems that we often use to help customers build a holistic understanding of their target application. This model can be used as a tool to ease collaboration, align the different perspectives inside and outside the AI team, and create successful applications based on a shared vision.
Inspired by product management, this model has two big sections – an "opportunity space" where you can define your use cases and value creation potentials, and a "solution space" where all the hard work of implementing, fine-tuning, and promoting your AI happens. In the following, I will explain the components that you need to define in each of these spaces.
Note: If you want to learn more about pragmatic AI applications in real-life business scenarios, subscribe to my newsletter AI for Business.
With all the cool stuff you can now do with AI, you might be impatient to get your hands dirty and start building. However, to build something your users need and love, you should back your development with a use case that is in demand by your users. In the ideal world, users tell us what they need or want. For example, in the case of my client, the request for automating content generation came from marketeers who were overwhelmed in their jobs, but also saw that the company needs to produce more content to stay visible and relevant. If you are building for external users, you can look for hints about their needs and painpoints in existing customer feedback, such as in product reviews and notes from your sales and success teams.
However, since AI is an emerging technology, you shouldn’t overrely on what your users ask for – chances are, they simply don’t know what is possible yet. True innovators embrace and hone their information advantage over customers and users – for example, Henry Ford famously said: "If I had asked people what they wanted, they would have said faster horses." Luckily for us, he was proactive and didn’t wait for people to articulate their need for cars. If you stretch out your antennae, AI opportunities will come to you from many directions, such as:
I also advise to proactively brainstorm opportunities and ideas to make your business more efficient, improve, and innovate. You can use the following four "buckets" of AI benefits to guide your brainstorming:
Some of these benefits – for example productivity – can be directly quantified for ROI. For less tangible gains like personalization, you will need to think of proxy metrics like user satisfaction. As you think about your AI strategy, you might want to start with productivity and automation, which are the low-hanging fruits, and move on to the more challenging and transformative buckets later on.
In our content generation project, the company jumped right into using LLMs without having a high-quality, task-specific dataset at hand. This was one of the reasons for its failure – your AI is only as good as the data you feed it. For any kind of serious AI and machine learning, you need to collect and prepare your data so it reflects the real-life inputs and provides sufficient learning signals for your AI models. When you start out, there are different ways to get your hands on a decent dataset:
When creating your data, you face a trade-off between quality and quantity. You can manually annotate less data with a high quality, or spend your budget on developing hacks and tricks for automated data augmentation that will introduce additional noise. A rough rule of thumb is as follows:
Ultimately, you will find your ideal data composition through a constant back-and-forth between training, evaluation, and enhancing your data. Thus, in the content generation project, the team is now continuously collecting and curating data based on new published content and updating the datasets used for LLM fine-tuning. They are able to quickly optimize and sharpen the system because the most valuable data comes directly from production. When your application goes live, you should have data collection mechanisms in place to collect user inputs, AI outputs, and, if possible, additional learning signals such as user evaluations. Using this data for fine-tuning will make your model come as close as possible to the "ground truth" of user expectations. This results in higher user satisfaction, more usage and engagement, and, in turn, more high-quality data – a virtuous cycle that is also called the data flywheel.
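A minimal sketch of such a data collection mechanism might look as follows; the field names and the JSONL sink are assumptions, and most teams would log into a database or analytics pipeline instead:

```python
# Minimal sketch of production logging to feed the data flywheel.
import json
import time
import uuid
from typing import Optional

def log_interaction(user_input: str, ai_output: str, feedback: Optional[str] = None,
                    path: str = "interactions.jsonl") -> None:
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_input": user_input,
        "ai_output": ai_output,
        "feedback": feedback,  # e.g. thumbs up/down, or whether the text was edited before publishing
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_interaction("Draft a post about our new savings product.",
                "Introducing FlexSave: ...", feedback="edited_before_publishing")
```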
Data is the raw material from which your model will learn, and hopefully, you can compile a representative, high-quality dataset for your challenges. Now, the actual intelligence of your AI system – its ability to generalize to new data – resides in the machine learning algorithms and models, and any additional tools and plugins that can be called by these.
In terms of the core AI models, there are three main approaches you can adopt:
Prompt an existing model. Mainstream LLMs (Large Language Models) of the GPT family, such as GPT-4o and GPT-4, as well as models from other providers such as Anthropic and AI21 Labs, are available for inference via API. With prompting, you can directly talk to these models, including in your prompt all the domain- and task-specific information required for a task. This can include specific content to be used, examples of analogous tasks (few-shot prompting) as well as instructions for the model to follow. For example, if your user wants to generate a blog post about a new product they are releasing, you might ask them to provide some core information about the product, such as its benefits and use cases, how to use it, the launch date, etc. Your product then fills this information into a carefully crafted prompt template and asks the LLM to generate the text. Prompting is great to get a head start with pre-trained models. However, just as in the case of my client, it often leads to the "last-mile problem" – the AI gives reasonable outputs, but they are just not good enough for real-life use. You can do whatever you want – provide more data, optimize your formulation, threaten the model – but at some point, you’ve used up the optimization potential of prompting.
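As an illustration of the template-filling pattern described above, here is a sketch using the OpenAI Python SDK; the model name, template fields, and product details are assumptions, and the script expects an API key in the environment:

```python
# Sketch of template-based prompting for the blog-post example above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are a marketing copywriter for a fintech company.
Write a blog post announcing the following product.

Product name: {name}
Key benefits: {benefits}
How to use it: {usage}
Launch date: {launch_date}

Keep the tone factual and avoid unverifiable claims."""

def generate_blog_post(product_info: dict) -> str:
    prompt = PROMPT_TEMPLATE.format(**product_info)
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```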
Beyond the training, evaluation is of primary importance for the successful use of machine learning. Suitable evaluation metrics and methods are not only important for a confident launch of your AI features but will also serve as a clear target for further optimization and as a common ground for internal discussions and decisions. While technical metrics such as precision, recall, and accuracy can provide a good starting point, ultimately, you will want to look for metrics that reflect the real-life value that your AI is delivering to users.
Finally, today, the trend is moving from using a single AI model to compound AI systems which accommodate different models, databases, and software tools and allow you to optimize for cost and transparency. Thus, in the content generation project, we used a RAG (Retrieval-Augmented Generation) architecture and combined the model with a database of domain-specific sources that it could use to produce specialized fintech content.
After the user inputs a query, the system doesn’t pass it directly to the LLM, but rather retrieves the most relevant sources for the query in the database. Then, it uses these sources to augment the prompt passed to the LLM. Thus, the LLM can use up-to-date, specialized sources to generate its final answer. Compared to an isolated fine-tuned model, this reduced hallucinations and allowed users to always have access to the latest sources. Other types of compound systems are agent systems, LLM routers and cascades. A detailed description is out of the scope of this article – if you want to learn more about these patterns, you can refer to my book The Art of AI Product Management.
The user experience of AI products is a captivating theme – after all, users have high hopes but also fears about "partnering" with an AI that can supercharge and potentially outsmart them. The design of this human-AI partnership requires a thoughtful and sensible discovery and design process. One of the key considerations is the degree of automation you want to grant with your product – and, mind you, total automation is by far not always the ideal solution. The following figure illustrates the automation continuum:
Let’s look at each of these levels:
AI systems need special treatment when it comes to design. Standard graphical interfaces are deterministic and allow you to foresee all possible paths the user might take. By contrast, large AI models are probabilistic and uncertain – they expose a range of amazing capabilities but also risks such as toxic, wrong, and harmful outputs. From the outside, your AI interface might look simple because a broad range of the capabilities of your product reside in the model and are not directly visible to users. For example, an LLM can interpret prompts, produce text, search for information, summarize it, adopt a certain style and terminology, execute instructions, etc. Even if your UI is a simple chat or prompting interface, don’t leave this potential unseen – in order to lead users to success, you need to be explicit and realistic. Make users aware of the capabilities and limitations of your AI models, allow them to easily discover and fix errors made by the AI, and teach them ways to iterate themselves to optimal outputs. By emphasizing trust, transparency, and user education, you can make your users collaborate with the AI. While a deep dive into AI UX Design is out of the scope of this article, I strongly encourage you to look for inspiration not only from other AI companies but also from other areas of design, such as human-machine interaction. You will soon identify a range of recurring design patterns, such as autocompletes, prompt suggestions, and templates, that you can integrate into your own interface to make the most out of your data and models.
When you start out with AI, it is easy to forget about governance because you are busy solving technological challenges and creating value. However, without a governance framework, your tool can be vulnerable to security risks, legal violations, and ethical concerns that can erode customer trust and harm your business. In the fintech example mentioned earlier, this led to issues such as hallucinations and irrelevant sources leaking into the public content of the company. A strong governance structure creates guardrails to prevent these issues. It protects sensitive data, ensures compliance with privacy regulations, maintains transparency, and mitigates biases in AI-generated content.
There are different definitions of AI governance. In my practice, companies were especially concerned with the following four types of risk:
Exposure to these risks depends on the application you are building, so it is worth spending some time to analyze your specific situation. For example, demographic bias (based on gender, race, location, etc.) is an important topic if your model generates user-facing content or makes decisions about people, but it turns into a non-issue if you use your model to generate code in the B2B context. In my experience, B2B applications have higher requirements in terms of security and transparency, while B2C applications need more guardrails to safeguard the privacy of user data and mitigate bias.
To set up your AI governance framework, begin by reviewing the relevant regulations and defining your objectives. At a minimum, ensure you meet the regulatory requirements for your industry and geography, such as the EU AI Act in Europe or the California Consumer Privacy Act in the U.S. Beyond compliance, you can also plan for additional guardrails to address key risks specific to your AI application. Next, assemble a cross-functional team of legal, compliance, security, and AI experts to define, implement, and assign governance measures. This team should regularly review and update the framework to adapt to system improvements, new risks, and evolving regulations. For example, the recent FTC actions against companies that exaggerated their AI performance signal the importance of focusing on quality and maintaining realistic communication about AI capabilities.
Let’s summarize how we addressed the different components in the content generation project:
This representation can be used at different stages in the AI journey – to prioritize use cases, to guide your team planning and discussions, and to align different stakeholders. It is an evolving construct that can be updated with new learnings as you move forward with your project.
Let’s summarize the key take-aways from this article:
Where you can go from here:
Note: All images are by the author.
The post Building AI products with a holistic mental model appeared first on Towards Data Science.
The post Creating an Information Edge with Conversational Access to Data appeared first on Towards Data Science.
As our world is getting more global and dynamic, businesses are more and more dependent on data for making informed, objective and timely decisions. In this article, we will see how AI can be used for intuitive conversational data access. We will use the mental model shown in Figure 2 to illustrate the implementation of Text2SQL systems (cf. Building AI products with a holistic mental model for an introduction to the mental model). After considering the market opportunities and the business value, we will explain the additional "machinery" in terms of data, LLM fine-tuning, and UX design that needs to be set up to make data widely accessible throughout the organization.
As of now, unleashing the full potential of organisational data is often a privilege of a handful of data scientists and analysts. Most employees don’t master the conventional Data Science toolkit (SQL, Python, R etc.). To access the desired data, they go via an additional layer where analysts or BI teams "translate" the prose of business questions into the language of data. The potential for friction and inefficiency on this journey is high – for example, the data might be delivered with delays or even when the question has already become obsolete. Information might get lost along the way when the requirements are not accurately translated into analytical queries. Besides, generating high-quality insights requires an iterative approach which is discouraged with every additional step in the loop. On the other side, these ad-hoc interactions create disruption for expensive data talent and distract them from more strategic data work, as described in these "confessions" of a data scientist:
When I was at Square and the team was smaller we had a dreaded "analytics on-call" rotation. It was strictly rotated on a weekly basis, and if it was your turn up you knew you would get very little "real" work done that week and spend most of your time fielding ad-hoc questions from the various product and operations teams at the company (SQL monkeying, we called it). There was cutthroat competition for manager roles on the analytics team and I think this was entirely the result of managers being exempted from this rotation – no status prize could rival the carrot of not doing on-call work.[1]
Wouldn’t it be cool to talk directly to your data instead of having to go through multiple rounds of interaction with your data staff? This vision is embraced by conversational interfaces which allow humans to interact with data using language, our most intuitive and universal channel of communication. After parsing a question, an algorithm encodes it into a structured logical form in the query language of choice, such as SQL. Thus, non-technical users can chat with their data and quickly get their hands on specific, relevant and timely information, without making the detour via a BI team. The three main benefits are:
Now, what are the product scenarios in which you might consider Text2SQL? The three main settings are:
As we will see in the following sections, Text2SQL requires a non-trivial upfront setup. To estimate the ROI, consider the nature of the decisions that are to be supported as well as the available data. Text2SQL can be an absolute win in dynamic environments where data is changing quickly and is actively and frequently used in decision making, such as investing, marketing, manufacturing and the energy industry. In these environments, traditional tools for knowledge management are too static, and more fluent ways to access data and information help companies generate a competitive advantage. In terms of the data, Text2SQL provides the biggest value with a database that is:
Any machine learning endeavour starts with data, so we will start by clarifying the structure of the input and target data that are used during training and prediction. Throughout the article, we will use the Text2SQL flow from Figure 1 as our running representation, and highlight the currently considered components and relationships in yellow.
1.1 Format and structure of the data
Typically, a raw Text2SQL input-output pair consists of a natural-language question and the corresponding SQL query, for example:
Question: "List the name and number of followers for each user."
SQL query:
select name, followers from user_profiles
In the training data space, the mapping between questions and SQL queries is many-to-many:
The manual collection of training data for Text2SQL is particularly tedious. It not only requires SQL mastery on the part of the annotator, but also more time per example than more general linguistic tasks such as sentiment analysis and text classification. To ensure a sufficient quantity of training examples, data augmentation can be used – for example, LLMs can be used to generate paraphrases for the same question. [3] provides a more complete survey of Text2SQL data augmentation techniques.
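As a sketch of this kind of augmentation, an LLM can be prompted to paraphrase the question while the gold SQL query stays fixed; the prompt wording and model name are assumptions, and an API key is expected in the environment:

```python
# Sketch of LLM-based data augmentation: generate paraphrases of an existing
# question and reuse the gold SQL query for each of them.
from openai import OpenAI

client = OpenAI()

def augment_pair(question: str, sql: str, n: int = 3) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{
            "role": "user",
            "content": f"Give {n} different paraphrases of this database question, "
                       f"one per line, without changing its meaning:\n{question}",
        }],
    )
    paraphrases = [p.strip() for p in response.choices[0].message.content.splitlines() if p.strip()]
    return [{"question": p, "query": sql} for p in paraphrases]

augmented = augment_pair("List the name and number of followers for each user.",
                         "select name, followers from user_profiles")
```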
1.2 Enriching the prompt with database information
Text2SQL is an algorithm at the interface between unstructured and structured data. For optimal performance, both types of data need to be present during training and prediction. Specifically, the algorithm has to know about the queried database and be able to formulate the query in such a way that it can be executed against the database. This knowledge can encompass:
There are two options for incorporating database knowledge: on the one hand, the training data can be restricted to examples written for the specific database, in which case the schema is learned directly from the SQL query and its mapping to the question. This single-database setting allows you to optimise the algorithm for an individual database and/or company. However, it kills off any ambitions for scalability, since the model needs to be fine-tuned for every single customer or database. Alternatively, in a multi-database setting, the database schema can be provided as part of the input, allowing the algorithm to "generalise" to new, unseen database schemas. While you will absolutely need to go for this approach if you want to use Text2SQL on many different databases, keep in mind that it requires considerable prompt engineering effort. For any reasonable business database, including the full information in the prompt will be extremely inefficient and most probably impossible due to prompt length limitations. Thus, the function responsible for prompt formulation should be smart enough to select a subset of database information which is most "useful" for a given question, and to do this for potentially unseen databases.
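The following simplified sketch illustrates the idea: select the schema elements that look most relevant to the question – here via naive keyword overlap, whereas real systems use embeddings or dedicated schema-linking models – and serialize them into the prompt (the schema and table names are invented for illustration):

```python
# Simplified sketch of multi-database prompting: pick the seemingly relevant tables
# and serialize them into the Text2SQL prompt.
SCHEMA = {
    "user_profiles": ["name", "followers", "signup_date"],
    "orders": ["order_id", "user_id", "amount", "created_at"],
    "shipments": ["shipment_id", "order_id", "status", "delivered_at"],
}

def select_relevant_tables(question: str, schema: dict, top_k: int = 2) -> dict:
    words = set(question.lower().replace(",", " ").replace(".", " ").split())
    def overlap(table: str, columns: list) -> int:
        tokens = set(table.lower().split("_")) | {c.lower() for c in columns}
        return len(words & tokens)
    ranked = sorted(schema.items(), key=lambda kv: overlap(*kv), reverse=True)
    return dict(ranked[:top_k])

def build_text2sql_prompt(question: str) -> str:
    tables = select_relevant_tables(question, SCHEMA)
    schema_str = "\n".join(f"Table {t}({', '.join(cols)})" for t, cols in tables.items())
    return (f"Given the database schema:\n{schema_str}\n\n"
            f"Write a SQL query answering: {question}\nSQL:")

print(build_text2sql_prompt("List the name and number of followers for each user."))
```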
Finally, database structure plays a crucial role. In those scenarios where you have enough control over the database, you can make your model’s life easier by letting it learn from an intuitive structure. As a rule of thumb, the more your database reflects how business users talk about the business, the better and faster your model can learn from it. Thus, consider applying additional transformations to the data, such as assembling normalised or otherwise dispersed data into wide tables or a data vault, naming tables and columns in an explicit and unambiguous way etc. All business knowledge that you can encode up-front will reduce the burden of probabilistic learning on your model and help you achieve better results.
Text2SQL is a type of semantic parsing – the mapping of texts to logical representations. Thus, the system has not only to "learn" natural language, but also the target representation – in our case, SQL. Specifically, it needs to acquire the following bits of knowledge:
2.1 Solving linguistic variability in the input
At the input, the main challenge of Text2SQL lies in the flexibility of language: as described in the section Format and structure of the data, the same question can be paraphrased in many different ways. Additionally, in the real-life conversational context, we have to deal with a number of issues such as spelling and grammar mistakes, incomplete and ambiguous inputs, multilingual inputs etc.
LLMs such as the GPT models, T5, and CodeX are coming closer and closer to solving this challenge. Learning from huge quantities of diverse text, they learn to deal with a large number of linguistic patterns and irregularities. In the end, they become able to generalise over questions which are semantically similar despite having different surface forms. LLMs can be applied out-of-the-box (zero-shot) or after fine-tuning. The former, while convenient, leads to lower accuracy. The latter requires more skill and work, but can significantly increase accuracy.
In terms of accuracy, as expected, the best-performing models are the latest models of the GPT family including the CodeX models. In April 2023, GPT-4 led to a dramatic accuracy increase of more than 5% over the previous state-of-the-art and achieved an accuracy of 85.3% (on the metric "execution with values").[4] In the open-source camp, initial attempts at solving the Text2SQL puzzle were focussed on auto-encoding models such as BERT, which excel at NLU tasks.[5, 6, 7] However, amidst the hype around generative AI, recent approaches focus on autoregressive models such as the T5 model. T5 is pre-trained using multi-task learning and thus easily adapts to new linguistic tasks, incl. different variants of semantic parsing. However, autoregressive models have an intrinsic flaw when it comes to semantic parsing tasks: they have an unconstrained output space and no semantic guardrails that would constrain their output, which means they can get stunningly creative in their behaviour. While this is amazing stuff for generating free-form content, it is a nuisance for tasks like Text2SQL where we expect a constrained, well-structured target output.
2.2 Query validation and improvement
To constrain the LLM output, we can introduce additional mechanisms for validating and improving the query. This can be implemented as an extra validation step, as proposed in the PICARD system.[8] PICARD uses a SQL parser that can verify whether a partial SQL query can lead to a valid SQL query after completion. At each generation step by the LLM, tokens that would invalidate the query are rejected, and the highest-probability valid tokens are kept. Being deterministic, this approach ensures 100% SQL validity as long as the parser observes correct SQL rules. It also decouples the query validation from the generation, thus allowing you to maintain both components independently of one another and to upgrade and modify the LLM.
Another approach is to incorporate structural and SQL knowledge directly into the LLM. For example, Graphix [9] uses graph-aware layers to inject structured SQL knowledge into the T5 model. Due to the probabilistic nature of this approach, it biases the system towards correct queries, but doesn’t provide a guarantee for success.
Finally, the LLM can be used as a multi-step agent that can autonomously check and improve the query.[10] Using multiple steps in a chain-of-thought prompt, the agent can be tasked to reflect on the correctness of its own queries and improve any flaws. If the validated query can still not be executed, the SQL exception traceback can be passed to the agent as an additional feedback for improvement.
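A minimal sketch of such an execute-and-repair loop could look as follows; the generate_sql callable stands in for whatever LLM call your system uses, and the database path is a placeholder:

```python
# Sketch of an execute-and-repair loop: run the generated query against the database
# and, if execution fails, feed the SQL error back to the model for another attempt.
import sqlite3
from typing import Callable, Optional

def query_with_repair(question: str,
                      generate_sql: Callable[[str, Optional[str]], str],
                      db_path: str = "analytics.db",
                      max_attempts: int = 3):
    feedback = None
    with sqlite3.connect(db_path) as conn:
        for _ in range(max_attempts):
            sql = generate_sql(question, feedback)
            try:
                return conn.execute(sql).fetchall()
            except sqlite3.Error as exc:
                # Pass the database error back as feedback for the next attempt
                feedback = f"The query failed with error: {exc}. Query was: {sql}"
    raise RuntimeError(f"Could not produce a valid query after {max_attempts} attempts")
```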
Beyond these automated methods which happen in the backend, it is also possible to involve the user during the query checking process. We will describe this in more detail in the section on User experience.
2.3 Evaluation
To evaluate our Text2SQL algorithm, we need to generate a test (validation) dataset, run our algorithm on it and apply relevant evaluation metrics on the result. A naive dataset split into training, development and validation data would be based on question-query pairs and lead to suboptimal results. Validation queries might be revealed to the model during training and lead to an overly optimistic view on its generalisation skills. A query-based split, where the dataset is split in such a way that no query appears both during training and during validation, provides more truthful results.
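A minimal sketch of a query-based split (the field names are assumptions):

```python
# Query-based split: all question paraphrases that map to the same gold SQL query
# end up on the same side, so no query leaks into validation.
import random

def query_based_split(pairs: list[dict], val_fraction: float = 0.2, seed: int = 42):
    queries = sorted({p["query"] for p in pairs})
    random.Random(seed).shuffle(queries)
    val_queries = set(queries[: int(len(queries) * val_fraction)])
    train = [p for p in pairs if p["query"] not in val_queries]
    val = [p for p in pairs if p["query"] in val_queries]
    return train, val

pairs = [
    {"question": "List the name and number of followers for each user.",
     "query": "select name, followers from user_profiles"},
    {"question": "Show each user's name and follower count.",
     "query": "select name, followers from user_profiles"},
    {"question": "How many orders were placed last month?",
     "query": "select count(*) from orders where created_at >= date('now','-1 month')"},
]
train, val = query_based_split(pairs)
```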
In terms of evaluation metrics, what we care about in Text2SQL is not to generate queries that are completely identical to the gold standard. This "exact string match" method is too strict and will generate many false negatives, since different SQL queries can lead to the same returned dataset. Instead, we want to achieve high semantic accuracy and evaluate whether the predicted and the "gold standard" queries would always return the same datasets. There are three evaluation metrics that approximate this goal:
Execution accuracy: the datasets resulting from the generated and target SQL queries are compared for identity. With good luck, queries with different semantics can still pass this test on a specific database instance. For example, assuming a database where all users are aged over 30, the following two queries would return identical results despite having different semantics:
select * from user
select * from user where age > 30
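A sketch of how execution accuracy can be computed against a SQLite database follows; the database path is a placeholder, and a robust evaluation would also handle ordering, duplicates and timeouts more carefully:

```python
# Sketch of execution accuracy: run the predicted and gold queries against the same
# database and compare the returned rows (order-insensitive).
import sqlite3
from collections import Counter

def execution_match(predicted_sql: str, gold_sql: str, db_path: str) -> bool:
    with sqlite3.connect(db_path) as conn:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # an invalid prediction counts as a miss
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))
```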
3. User experience
The current state-of-the-art of Text2SQL doesn’t allow a completely seamless integration into production systems – instead, it is necessary to actively manage the expectations and the behaviour of the user, who should always be aware that she is interacting with an AI system.
3.1 Failure management
Text2SQL can fail in two modes, which need to be caught in different ways: either the generated SQL query is invalid and cannot be executed against the database at all, or it executes successfully but is semantically wrong and returns a dataset that does not answer the user’s question.
The second mode is particularly tricky since the risk of "silent failures" – errors that go undetected by the user – is high. The prototypical user will have neither the time nor the technical skill to verify the correctness of the query and/or the resulting data. When data is used for decision making in the real world, this kind of failure can have devastating consequences. To avoid this, it is critical to educate users and establish guardrails on a business level that limit the potential impact, such as additional data checks for decisions with a higher impact. On the other hand, we can also use the user interface to manage the human-machine interaction and help the user detect and improve problematic requests.
3.2 Human-machine interaction
Users can get involved with your AI system with different degrees of intensity. More interaction per request can lead to better results, but it also slows down the fluidity of the user experience. Besides the potential negative impact of erroneous queries and results, also consider how motivated your users will be to provide back-and-forth feedback in order to get more accurate results and also help improve the product in the long term.
The easiest and least engaging way is to work with confidence scores. While the naive calculation of confidence as an average of the probabilities of the generated tokens is overly simplistic, more advanced methods like verbalised feedback can be used. [13] The confidence can be displayed in the interface and highlighted with an explicit alert in case it is dangerously low. This way, the responsibility of an appropriate follow-up in the "real world" – be it a rejection, acceptance or an additional check of the data – lands on the shoulders of your user. While this is a safe bet for you as a vendor, transferring this work to the user can also reduce the value of your product.
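For illustration, here is the naive variant of such a confidence score – the average probability of the generated tokens, with a threshold alert; the logprobs are assumed to come from your generation API, and the threshold is a product decision:

```python
# Naive confidence score: average per-token probability of the generated SQL,
# with an alert below an (assumed) threshold. Token logprob values are illustrative.
import math

def naive_confidence(token_logprobs: list[float]) -> float:
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

token_logprobs = [-0.05, -0.20, -0.01, -1.30, -0.10]  # illustrative values
confidence = naive_confidence(token_logprobs)
if confidence < 0.7:  # threshold shown here as an assumption
    print(f"Low confidence ({confidence:.2f}) - please double-check the generated query.")
```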
A second possibility is to engage the user in a clarification dialogue in the case of low-confidence, ambiguous or otherwise suspicious queries. For example, your system might suggest orthographic or grammar corrections to the input and ask to disambiguate specific words or grammatical structures. It might also allow the user to proactively ask for corrections in the query:[14]
USER: Show me John’s tasks in this sprint.
ASSISTANT: Would you like to see tasks John created, or those he is working on?
USER: tasks John created
ASSISTANT: Ok, here are the task IDs:
USER: Thanks, I would also like to see more information about the tasks. Please also sort by urgency.
ASSISTANT: Sure, here are the tasks along with short descriptions, assignees and deadlines, sorted by deadline.
Finally, to ease the understanding of queries by the user, your system can also provide an explicit textual reformulation of the query and ask the user to either confirm or correct it.[15]
In this section, we discuss the specific non-functional requirements for Text2SQL as well as the trade-offs between them. We will focus on the six requirements that seem most important for the task: accuracy, scalability, speed, explainability, privacy and adaptability over time.
4.1 Accuracy
For Text2SQL, the requirements on accuracy are high. First, Text2SQL is typically applied in a conversation setting where predictions are made one-by-one. Thus, the "Law of large numbers" which typically helps balance off the error in batched predictions, does not help. Second, syntactic and lexical validity is a "hard" condition: the model has to generate a well-formed SQL query, potentially with complex syntax and semantics, otherwise the request cannot be executed against the database. And if this goes well and the query can be executed, it can still contain semantic errors and lead to a wrong returned dataset (cf. section 3.1 Failure management).
4.2 Scalability
The main scalability considerations are whether you want to apply Text2SQL on one or multiple databases – and in the latter case, whether the set of databases is known and closed. If yes, you will have an easier time since you can include the information about these databases during training. However, in a scenario of a scalable product – be it a standalone Text2SQL application or an integration into an existing data product – your algorithm has to cope with any new database schema on the fly. This scenario also doesn’t give you the opportunity to transform the database structure to make it more intuitive for learning (cf. section 1.2). All of this leads to a heavy trade-off with accuracy, which might also explain why current Text2SQL providers that offer ad-hoc querying of new databases have not yet achieved significant market penetration.
4.3 Speed
Since Text2SQL requests will typically be processed online in a conversation, the speed aspect is important for user satisfaction. On the positive side, users are often aware of the fact that data requests can take a certain time and show the required patience. However, this goodwill can be undermined by the chat setting, where users subconsciously expect human-like conversation speed. Brute-force optimisation methods like reducing the size of the model might have an unacceptable impact on accuracy, so consider inference optimisation to satisfy this expectation.
4.4 Explainability and transparency
In the ideal case, the user can follow how the query was generated from the text, see the mapping between specific words or expressions in the question and the SQL query etc. This allows the user to verify the query and make any adjustments when interacting with the system. Besides, the system could also provide an explicit textual reformulation of the query and ask the user to either confirm or correct it.
4.5 Privacy
The Text2SQL function can be isolated from query execution, so the returned database information can be kept invisible. However, the critical question is how much information about the database is included in the prompt. The three options (by decreasing privacy level) are:
Privacy trades off with accuracy – the less constrained you are in including useful information in the prompt, the better the results.
4.6 Adaptability over time
To use Text2SQL in a durable way, you need to adapt to data drift, i.e. the changing distribution of the data to which the model is applied. For example, let’s assume that the data used for initial fine-tuning reflects the simple querying behaviour of users when they start using the BI system. As time passes, the information needs of users become more sophisticated and require more complex queries, which overwhelm your naive model. Besides, the goals or the strategy of a company might also drift and direct the information needs towards other areas of the database. Finally, a Text2SQL-specific challenge is database drift. As the company database is extended, new, unseen columns and tables make their way into the prompt. While Text2SQL algorithms that are designed for multi-database application can handle this issue well, it can significantly impact the accuracy of a single-database model. All of these issues are best solved with a fine-tuning dataset that reflects the current, real-world behaviour of users. Thus, it is crucial to log user questions and results, as well as any associated feedback that can be collected from usage. Additionally, semantic clustering algorithms, for example using embeddings or topic modelling, can be applied to detect underlying long-term changes in user behaviour and use these as an additional source of information for perfecting your fine-tuning dataset.
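As a sketch of the clustering idea, logged questions can be embedded and grouped so that the size of each group can be tracked over time; the embedding model and the number of clusters are assumptions to be tuned on real data:

```python
# Sketch of drift monitoring: embed logged user questions and cluster them with
# KMeans to see which query topics grow or shrink over time.
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

logged_questions = [
    "total revenue last quarter",
    "revenue by region in Q3",
    "average delivery time per carrier",
    "which carriers missed the SLA last month",
    "churn rate of premium customers in 2023",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(logged_questions)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# Track cluster sizes across time windows to spot emerging information needs
print(Counter(labels))
```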
Let’s summarise the key points of the article:
[1] Ken Van Haren. 2023. Replacing a SQL analyst with 26 recursive GPT prompts
[2] Nitarshan Rajkumar et al. 2022. Evaluating the Text-to-SQL Capabilities of Large Language Models
[3] Naihao Deng et al. 2023. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect
[4] Mohammadreza Pourreza et al. 2023. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
[5] Victor Zhong et al. 2021. Grounded Adaptation for Zero-shot Executable Semantic Parsing
[6] Xi Victoria Lin et al. 2020. Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing
[7] Tong Guo et al. 2019. Content Enhanced BERT-based Text-to-SQL Generation
[8] Torsten Scholak et al. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models
[9] Jinyang Li et al. 2023. Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing
[10] LangChain. 2023. LLMs and SQL
[11] Tao Yu et al. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
[12] Ruiqi Zhong et al. 2020. Semantic Evaluation for Text-to-SQL with Distilled Test Suites
[13] Katherine Tian et al. 2023. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
[14] Braden Hancock et al. 2019. Learning from Dialogue after Deployment: Feed Yourself, Chatbot!
[15] Ahmed Elgohary et al. 2020. Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback
[16] Janna Lipenkova. 2022. Talk to me! Text2SQL conversations with your company’s data, talk at New York Natural Language Processing meetup.
All images are by the author.
The post Creating an Information Edge with Conversational Access to Data appeared first on Towards Data Science.
The post Four LLM trends since ChatGPT and their implications for AI builders appeared first on Towards Data Science.
For many AI companies, it seems like ChatGPT has turned into the ultimate competitor. When pitching my analytics startups in earlier days, I would frequently be challenged: "what will you do if Google (Facebook, Alibaba, Yandex…) comes around the corner and does the same?" Now, the question du jour is: "why can’t you use ChatGPT to do this?"
The short answer is: ChatGPT is great for many things, but it covers far from the full spectrum of AI. The current hype happens explicitly around generative AI – not analytical AI, or its rather fresh branch of synthetic AI [1]. What does this mean for LLMs? As described in my previous article, LLMs can be pre-trained with three objectives – autoregression, autoencoding and sequence-to-sequence (cf. also Table 1, column "Pre-training objective"). Typically, a model is pre-trained with one of these objectives, but there are exceptions – for example, UniLM [2] was pre-trained on all three objectives. The fun generative tasks that have popularised AI in the past months are conversation, question answering and content generation – those tasks where the model indeed learns to "generate" the next token, sentence etc. These are best carried out by autoregressive models, which include the GPT family as well as most of the recent open-source models, like MPT-7B, OPT and Pythia. Autoencoding models, which are better suited for information extraction, distillation and other analytical tasks, are resting in the background – but let’s not forget that the initial LLM breakthrough in 2018 happened with BERT, an autoencoding model. While this might feel like the stone age of modern AI, autoencoding models are especially relevant for many B2B use cases where the focus is on distilling concise insights that address specific business tasks. We might indeed witness another wave around autoencoding and a new generation of LLMs that excel at extracting and synthesizing information for analytical purposes.
For builders, this means that popular autoregressive models can be used for everything that is content generation – and the longer the content, the better. However, for analytical tasks, you should carefully evaluate whether the autoregressive LLM you use will output a satisfying result, and consider autoencoding models or even more traditional NLP methods otherwise.
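For illustration, the two model families can be tried side by side with Hugging Face pipelines; the model names are small illustrative stand-ins, not recommendations:

```python
# Sketch contrasting the two model families with Hugging Face pipelines.
from transformers import pipeline

# Autoregressive model: generates a continuation token by token
generator = pipeline("text-generation", model="gpt2")
print(generator("Our supply chain AI can", max_new_tokens=20)[0]["generated_text"])

# Autoencoding model: reconstructs masked tokens, a better fit for analytical tasks
# such as classification or extraction (shown here via its fill-mask head)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Shipment delays were caused by port [MASK].")[0]["token_str"])
```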
In the past months, there has been a lot of debate about the uneasy relationship between open-source and commercial AI. In the short term, the open-source community cannot keep up in a race where winning entails a huge spend on data and/or compute. But with a long-term perspective in mind, even the big companies like Google and OpenAI feel threatened by open-source.[3] Spurred by this tension, both camps have continued building, and the resulting advances are eventually converging into fruitful synergies. The open-source community has a strong focus on frugality, i. e. increasing the efficiency of LLMs by doing more with less. This not only makes LLMs affordable to a broader user base – think AI democratisation – but also more sustainable from an environmental perspective. There are three principal dimensions along which LLMs can become more efficient:
At the other extreme, for now, "generative AI control is in the hands of the few that can afford the dollars to train and deploy models at scale".[5] The commercial offerings are exploding in size – be it model size, data size or the time spent on training – and clearly outcompete open-source models in terms of output quality. There is not much to report here technically – rather, the concerns lie on the side of governance and regulation. Thus, "one key risk is that powerful LLMs like GPT develop only in a direction that suits the commercial objectives of these companies."[5]
How will these two ends meet – and will they meet at all? On the one hand, any trick that reduces resource consumption can eventually be scaled up again by throwing more resources at it. On the other hand, LLM training follows a power law, which means that the learning curve flattens out as model size, dataset size and training time increase.[6] You can think of this in terms of a human education analogy – over the lifetime of humanity, schooling times have increased, but did the intelligence and erudition of the average person follow suit?
The positive thing about a flattening learning curve is the relief it brings amidst fears about AI growing "stronger and smarter" than humans. But brace yourself – the LLM world is full of surprises, and one of the most unpredictable ones is emergence.[7] Emergence is when quantitative changes in a system result in qualitative changes in behaviour – summarised with "quantity leads to quality", or simply "more is different".[8] At some point in their training, LLMs seem to acquire new, unexpected capabilities that were not in the original training scope. At present, these capabilities come in the form of new linguistic skills – for instance, instead of just generating text, models suddenly learn to summarise or translate. It is impossible to predict when this might happen and what the nature and scope of the new capabilities will be. Hence, the phenomenon of emergence, while fascinating for researchers and futurists, is still far away from providing robust value in a commercial context.
As more and more methods are developed that increase the efficiency of LLM finetuning and inference, the resource bottleneck around the physical operation of open-source LLMs seems to be loosening. Concerned with the high usage cost and restricted quota of commercial LLMs, more and more companies consider deploying their own LLMs. However, development and maintenance costs remain, and most of the described optimisations also require extended technical skills for manipulating both the models and the hardware on which they are deployed. The choice between open-source and commercial LLMs is a strategic one and should be made after a careful exploration of a range of trade-offs, including costs (development, operating and usage costs), availability, flexibility and performance. A common line of advice is to get a head start with the big commercial LLMs to quickly validate the business value of your end product, and "switch" to open-source later down the road. But this transition can be tough and even unrealistic, since LLMs widely differ in the tasks they are good at. There is a risk that open-source models cannot satisfy the requirements of your already developed application, or that you need to make considerable modifications to mitigate the associated trade-offs. Finally, the most advanced setup for companies that build a variety of features on LLMs is a multi-LLM architecture that leverages the advantages of different LLMs.
With the big challenges of LLM training roughly solved, another branch of work has focussed on the integration of LLMs into real-world products. Beyond providing ready-made components that enhance convenience for developers, these innovations also help overcome the existing limitations of LLMs and enrich them with additional capabilities such as reasoning and the use of non-linguistic data.[9] The basic idea is that, while LLMs are already great at mimicking human linguistic capacity, they still have to be placed into the context of a broader computational "cognition" to conduct more complex reasoning and execution. This cognition encompasses a number of different capacities such as reasoning, action and observation of the environment. At the moment, it is approximated using plugins and agents, which can be combined using modular LLM frameworks such as LangChain, LlamaIndex and AutoGPT.
Pre-trained LLMs have significant practical limitations when it comes to the data they leverage: on the one hand, the data quickly gets outdated – for instance, while GPT-4 was published in 2023, its data was cut off in 2021. On the other hand, most real-world applications require some customisation of the knowledge in the LLM. Consider building an app that allows you to create personalised marketing content – the more information you can feed into the LLM about your product and specific users, the better the result. Plugins make this possible – your program can fetch data from an external source, like customer e-mails and call records, and insert these into the prompt for a personalised, controlled output.
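A hand-rolled sketch of this plugin pattern might look as follows; the two fetch functions are hypothetical stand-ins for your own data sources, and the assembled prompt is then passed to whichever LLM you use.

```python
def fetch_recent_emails(customer_id: str) -> str:
    # Hypothetical: in practice, query your CRM or mailbox API.
    return "Customer asked about bulk pricing and delivery times."

def load_product_sheet(product_id: str) -> str:
    # Hypothetical: in practice, read from an internal knowledge base.
    return "Gizmo X: modular sensor kit, ships within 5 days, volume discounts available."

def build_marketing_prompt(customer_id: str) -> str:
    return (
        "You are a marketing copywriter.\n"
        f"Product facts:\n{load_product_sheet('gizmo-x')}\n\n"
        f"Recent customer correspondence:\n{fetch_recent_emails(customer_id)}\n\n"
        "Write a short, personalised follow-up email."
    )

prompt = build_marketing_prompt("cust-4711")  # pass this string to the LLM of your choice
```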
Language is closely tied with actionability. Our communicative intents often circle around action, for example when we ask someone to do something or when we refuse to act in a certain way. The same goes for computer programs, which can be seen as collections of functions that execute specific actions, block them when certain conditions are not met etc. LLM-based agents bring these two worlds together. The instructions for these agents are not hard-coded in a programming language, but are freely generated by LLMs in the form of reasoning chains that lead to achieving a given goal. Each agent has a set of plugins at hand and can juggle them around as required by the reasoning chain – for example, it can combine a search engine for retrieving specific information and a calculator to subsequently execute computations on this information. The idea of agents has existed for a long time in reinforcement learning – however, as of today, reinforcement learning still happens in relatively closed and safe environments. Backed by the vast common knowledge of LLMs, agents can now not only venture into the "big world", but also tap into an endless combinatorial potential: each agent can execute a multitude of tasks to reach their goals, and multiple agents can interact and collaborate with each other.[10] Moreover, agents learn from their interactions with the world and build up a memory that comes much closer to the multi-modal memory of humans than does the purely linguistic memory of LLMs.
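Stripped of any framework, the core loop of such an agent can be sketched in a few lines; llm_decide() below is a hypothetical stand-in for the model call that proposes the next action, and the two tools are toy implementations.

```python
def search(query: str) -> str:
    return "Population of France: about 68 million (toy search result)"

def calculator(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))  # toy only, never eval untrusted input

TOOLS = {"search": search, "calculator": calculator}

def run_agent(goal: str, max_steps: int = 5) -> str:
    context = f"Goal: {goal}"
    for _ in range(max_steps):
        action, action_input = llm_decide(context)   # hypothetical LLM call proposing the next step
        if action == "finish":
            return action_input                      # the agent's final answer
        observation = TOOLS[action](action_input)    # execute the chosen plugin
        context += f"\n{action}({action_input}) -> {observation}"
    return "No answer found within the step budget."
```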
In the last months, we have seen a range of new LLM-based frameworks such as LangChain, AutoGPT and LlamaIndex. These frameworks make it possible to combine plugins and agents into complex chains of generations and actions, implementing processes that involve multi-step reasoning and execution. Developers can now focus on efficient prompt engineering and quick app prototyping.[11] At the moment, a lot of hard-coding is still going on when you use these frameworks – but gradually, they may evolve towards a more comprehensive and flexible system for modelling cognition and action, such as the JEPA architecture proposed by Yann LeCun.[12]
What are the implications of these new components and frameworks for builders? On the one hand, they boost the potential of LLMs by enhancing them with external data and agency. Frameworks, in combination with convenient commercial LLMs, have turned app prototyping into a matter of days. But the rise of LLM frameworks also has implications for the LLM layer. It is now hidden behind an additional abstraction, and as any abstraction it requires higher awareness and discipline to be leveraged in a sustainable way. First, when developing for production, a structured process is still required to evaluate and select specific LLMs for the tasks at hand. At the moment, many companies skip this process under the assumption that the latest models provided by OpenAI are the most appropriate. Second, LLM selection should be coordinated with the desired agent behaviour: the more complex and flexible the desired behaviour, the better the LLM should perform to ensure that it picks the right actions in a wide space of options.[13] Finally, in operation, an MLOps pipeline should ensure that the model doesn’t drift away from changing data distributions and user preferences.
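The structured evaluation step can start very small: a handful of task-specific test cases and a loop over candidate models already beats picking a model by brand name. In the sketch below, complete() is a hypothetical wrapper around your various LLM clients, and the model names are placeholders.

```python
TEST_CASES = [
    {"prompt": "Classify the sentiment as positive or negative: 'The update broke my workflow.'",
     "expected": "negative"},
    {"prompt": "Classify the sentiment as positive or negative: 'Setup took two minutes, love it.'",
     "expected": "positive"},
]

def evaluate(model_name: str) -> float:
    hits = 0
    for case in TEST_CASES:
        answer = complete(model_name, case["prompt"])   # hypothetical call to the candidate LLM
        hits += int(case["expected"] in answer.lower())
    return hits / len(TEST_CASES)

for candidate in ["commercial-llm-a", "open-source-llm-b"]:   # placeholder model names
    print(candidate, evaluate(candidate))
```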
With the advance of prompting, using AI to do cool and creative things is becoming accessible for non-technical people. No need to be a programmer anymore – just use language, our natural communication medium, to tell the machine what to do. However, amidst all the buzz and excitement around quick prototyping and experimentation with LLMs, at some point, we still come to realize that "it’s easy to make something cool with LLMs, but very hard to make something production-ready with them."[14] In production, LLMs hallucinate, are sensitive to imperfect prompt designs, and raise a number of issues for governance, safety, and alignment with desired outcomes. And the thing we love most about LLMs – their open-ended space of inputs and outputs – also makes it all the harder to test for potential failures before deploying them to production.
If you have ever built an AI product, you will know that end users are often highly sensitive to AI failures. Users are prone to a "negativity bias": even if your system achieves high overall accuracy, those occasional but unavoidable error cases will be scrutinized with a magnifying glass. With LLMs, the situation is different. Just as with any other complex AI system, LLMs do fail – but they do so in a silent way. Even if they don’t have a good response at hand, they will still generate something and present it in a highly confident way, tricking us into believing and accepting their outputs and putting us in embarrassing situations further downstream. Imagine a multi-step agent whose instructions are generated by an LLM – an error in the first generation will cascade to all subsequent tasks and corrupt the whole action sequence of the agent.
One of the biggest quality issues of LLMs is hallucination, which refers to the generation of texts that are semantically or syntactically plausible but factually incorrect. Noam Chomsky, with his famous sentence "Colorless green ideas sleep furiously", already made the point that a sentence can be perfectly well-formed from the linguistic point of view but completely nonsensical for humans. Not so for LLMs, which lack the non-linguistic knowledge that humans possess and thus cannot ground language in the reality of the underlying world. And while we can immediately spot the issue in Chomsky’s sentence, fact-checking LLM outputs becomes quite cumbersome once we get into more specialized domains that are outside of our field of expertise. The risk of undetected hallucinations is especially high for long-form content as well as for interactions for which no ground truth exists, such as forecasts and open-ended scientific or philosophical questions.[15]
There are multiple approaches to hallucination. From a statistical viewpoint, we can expect that hallucination decreases as language models learn more. But in a business context, the incrementality and uncertain timeline of this "solution" make it rather unreliable. Another approach is rooted in neuro-symbolic AI. By combining the powers of statistical language generation and deterministic world knowledge, we may be able to reduce hallucinations and silent failures and finally make LLMs robust for large-scale production. For instance, ChatGPT makes this promise with the integration of Wolfram Alpha, a vast structured database of curated world knowledge.
On the surface, the natural language interface offered by prompting seems to close the gap between AI experts and laypeople – after all, all of us know at least one language and use it for communication, so why not do the same with an LLM? But prompting is a fine craft. Successful prompting that goes beyond trivia requires not only strong linguistic intuitions but also knowledge about how LLMs learn and work. And the process of designing successful prompts is highly iterative and requires systematic experimentation. As shown in the paper Why Johnny can’t prompt, humans struggle to maintain this rigor. On the one hand, we are often primed by expectations that are rooted in our experience of human interaction. Talking to humans is different from talking to LLMs – when we interact with each other, our inputs are transmitted in a rich situational context, which allows us to neutralize the imprecisions and ambiguities of human language. An LLM only gets the linguistic information and thus is much less forgiving. On the other hand, it is difficult to adopt a systematic approach to prompt engineering, so we quickly end up with opportunistic trial-and-error, making it hard to construct a scalable and consistent system of prompts.
To resolve these challenges, it is necessary to educate both prompt engineers and users about the learning process and the failure modes of LLMs, and to maintain an awareness of possible mistakes in the interface. It should be clear that an LLM output is always an uncertain thing. For instance, this can be communicated using confidence scores in the user interface, which can be derived via model calibration.[15] For prompt engineering, we currently see the rise of LLMOps, a subcategory of MLOps that covers the prompt lifecycle with prompt templating, versioning, optimisation etc. Finally, finetuning trumps few-shot learning in terms of consistency since it removes the variable "human factor" of ad-hoc prompting and enriches the inherent knowledge of the LLM. Whenever possible given your setup, you should consider switching from prompting to finetuning once you have accumulated enough training data.
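Even a minimal LLMOps setup can start with versioned prompt templates instead of ad-hoc strings, so that changes can be tracked, compared and rolled back. The sketch below uses nothing beyond the Python standard library; the template and names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str
    template: str

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

# A registry keyed by (name, version) makes it explicit which prompt variant is in use.
REGISTRY = {
    ("summarise_ticket", "1.2"): PromptTemplate(
        name="summarise_ticket",
        version="1.2",
        template="Summarise the following support ticket in two sentences:\n{ticket}",
    ),
}

prompt = REGISTRY[("summarise_ticket", "1.2")].render(ticket="Printer offline since Monday, error 0x0042.")
print(prompt)
```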
With new models, performance hacks and integrations coming up every day, the LLM rabbit hole keeps deepening. For companies, it is important to stay differentiated, keep an eye on recent developments and new risks, and favour hands-on experimentation over the buzz – many trade-offs and issues related to LLMs only become visible during real-world use. In this article, we took a look at the recent developments and how they affect building with LLMs:
[1] Andreessen Horowitz. 2023. For B2B Generative AI Apps, Is Less More?
[2] Li Dong et al. 2019. Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 13063–13075.
[3] The Information. 2023. Google Researcher: Company Has ‘No Moat’ in AI.
[4] Tri Dao et al. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
[5] EE Times. 2023. Can Open-Source LLMs Solve AI’s Democratization Problem?
[6] Jared Kaplan et al. 2020. Scaling Laws for Neural Language Models.
[7] Jason Wei et al. 2022. Emergent Abilities of Large Language Models.
[8] Philip Anderson. 1972. More is Different. In Science, Vol 177, Issue 4047, pp. 393–396.
[9] Janna Lipenkova. 2023. Overcoming the Limitations of Large Language Models.
[10] Joon Sung Park et al. 2023. Generative Agents: Interactive Simulacra of Human Behavior.
[11] Harvard University. 2023. GPT-4 – How does it work, and how do I build apps with it? – CS50 Tech Talk.
[12] Yann LeCun. 2022. A Path Towards Autonomous Machine Intelligence.
[13] Jerry Liu. 2023. Dumber LLM Agents Need More Constraints and Better Tools.
[14] Chip Huyen. 2023. Building LLM applications for production.
[15] Stephanie Lin et al. 2022. Teaching models to express their uncertainty in words.
The post Four LLM trends since ChatGPT and their implications for AI builders appeared first on Towards Data Science.
The post Overcoming the Limitations of Large Language Models appeared first on Towards Data Science.
Disclaimer: This article was written without the support of ChatGPT.
In the last couple of years, Large Language Models (LLMs) such as ChatGPT, T5 and LaMDA have developed amazing skills at producing human language. We are quick to attribute intelligence to models and algorithms, but how much of this is emulation, and how much truly resembles the rich language capability of humans? When confronted with the natural-sounding, confident outputs of these models, it is sometimes easy to forget that language per se is only the tip of the communication iceberg. Its full power unfolds in combination with a wide range of complex cognitive skills relating to perception, reasoning and communication. While humans acquire these skills naturally from the surrounding world as they grow, the learning inputs and signals for LLMs are rather meagre. They are forced to learn only from the surface form of language, and their success criterion is not communicative efficiency but the reproduction of high-probability linguistic patterns.
In the business context, this can lead to bad surprises when too much power is given to an LLM. When facing its own limitations, it will not admit them but rather gravitate to the other extreme – producing nonsense, toxic content or even dangerous advice with a high level of confidence. For example, a medical virtual assistant driven by GPT-3 can advise its user to kill themselves at a certain point in the conversation.[4]
Considering these risks, how can we safely benefit from the power of LLMs when integrating them in our product development? On the one hand, it is important to be aware of inherent weak points and use rigorous evaluation and probing methods to target them in specific use cases, instead of relying on happy-path interactions. On the other hand, the race is on – all major AI labs are planting their seeds to enhance LLMs with additional capabilities, and there is plenty of space for a cheerful glance into the future. In this article, we will look into the limitations of LLMs and discuss ongoing efforts to control and enhance LLM behaviour. A basic knowledge of the workings of language models is assumed – if you are a newbie, please refer to this article.
Before diving into the technology, let’s set the scene with a thought experiment – the "Octopus test" as proposed by Emily Bender – to understand how differently humans and LLMs see the world.[1]
Imagine that Anna and Maria are stranded on two uninhabited islands. Luckily, they have discovered two telegraphs and an underwater cable left behind by previous visitors and start communicating with each other. Their conversations are "overheard" by a quick-witted octopus who has never seen the world above water but is exceptionally good at statistical learning. He picks up the words, syntactic patterns and communication flows between the two ladies and thus masters the external form of their language without understanding how it is actually grounded in the real world. As Ludwig Wittgenstein once put it, "the limits of my language mean the limits of my world" – while we know today that the world models of humans are composed of much more than language, the octopus would sympathise with this statement, at least regarding his knowledge of the world above water.
At some point, listening is not enough. Our octopus decides to take control, cuts the cable on Maria’s side and starts chatting with Anna. The interesting question is, when will Anna detect the change? As long as the two parties exchange social pleasantries, there is a reasonable chance that Anna will not suspect anything. Their small talk might go on as follows:
A: Hi Maria!
O: Hi Anna, how are you?
A: Thanks, I’m good, just enjoyed a coconut breakfast!
O: You are lucky, there are no coconuts on my island. What are your plans?
A: I wanted to go swimming but I am afraid there will be a storm. And you?
O: I am having my breakfast now and will do some woodwork afterwards.
A: Have a nice day, talk later!
O: Bye!
However, as their relationship deepens, their communication also grows in intensity and sophistication. Over the next sections, we will take the octopus through a couple of scenes from island life that require the mastery of common-sense knowledge, communicative context and reasoning. As we go, we will also survey approaches to incorporate additional intelligence into agents – be they fictive octopuses or LLMs – that are originally only trained from the surface form of language.
One morning, Anna is planning a hunting trip and tries to forecast the weather for the day. Since the wind is coming from Maria’s direction, she asks "Maria" for a report on current weather conditions as an important piece of information. Being caught in deep waters, our octopus grows embarrassed about describing the weather conditions. Even if he had a chance to glance into the skies, he would not know what specific weather terms like "rain", "wind", "cloudy" etc. refer to in the real world. He desperately makes up some weather facts. Later in the day, while hunting in the woods, Anna is surprised by a dangerous thunderstorm. She attributes her failure to predict the storm to a lack of meteorological knowledge rather than a deliberate hallucination by her conversation partner.
On the surface, LLMs are able to reflect many true facts about the world. However, their knowledge is limited to concepts and facts that they explicitly encountered in the training data. Even with huge training data, this knowledge cannot be complete. For example, it might miss domain-specific knowledge that is required for commercial use cases. Another important limitation, as of now, is the recency of the information. Since language models lack a notion of temporal context, they can’t work with dynamic information such as the current weather, stock prices or even today’s date.
This problem can be solved by systematically "injecting" additional knowledge into the LLM. This new input can come from various sources, such as structured external databases (e.g. FreeBase or WikiData), company-specific data sources and APIs. One possibility to inject it is via adapter networks that are "plugged in" between the LLM layers to learn the new knowledge.[2]
The training of this architecture happens in two steps: memorisation and utilisation.
During inference, the hidden state that the LLM provides to the adapter is fused with the adapter’s output using a fusion function to produce the final answer.
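To give a feel for the mechanics, here is a toy PyTorch sketch of an adapter with a simple residual fusion; it is a generic illustration of the idea rather than the exact architecture of [2], and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class KnowledgeAdapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)       # compress the hidden state
        self.up = nn.Linear(bottleneck, hidden_size)         # project back, carrying the injected knowledge
        self.gate = nn.Linear(2 * hidden_size, hidden_size)  # simple learned fusion

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        adapter_out = self.up(torch.relu(self.down(hidden_state)))
        fused = torch.cat([hidden_state, adapter_out], dim=-1)
        return hidden_state + torch.tanh(self.gate(fused))   # fuse adapter output with the LLM's hidden state

hidden = torch.randn(1, 16, 768)            # (batch, tokens, hidden size)
print(KnowledgeAdapter(768)(hidden).shape)  # torch.Size([1, 16, 768])
```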
While architecture-level knowledge injection allows for efficient modular retraining of smaller adapter networks, the modification of the architecture also requires considerable engineering skill and effort. The easier alternative is input-level injection, where the model is directly fine-tuned on the new facts (cf. [3] for an example). The downside is the expensive fine-tuning required after each change – thus, it is not suitable for dynamic knowledge sources. A complete overview of existing knowledge injection approaches can be found in this article.
Knowledge injection helps you build domain intelligence, which is becoming a key differentiator for vertical AI products. In addition, you can use it to establish traceability so the model can point a user to the original sources of information. Beyond structured knowledge injection, efforts are underway to integrate multimodal information and knowledge into LLMs. For instance, in April 2022, DeepMind introduced Flamingo, a visual language model that can seamlessly ingest text, images and video.[5] At the same time, Google is working on Socratic Models, a modular framework in which multiple pre-trained models may be composed zero-shot, i.e. via multi-modal prompting, to exchange information with each other.[6]
As Anna wants to share not only her thoughts about life, but also the delicious coconuts from her island with Maria, she invents a coconut catapult. She sends Maria a detailed instruction on how she did it and asks her for instructions to optimise it. At the receiving end, the octopus falls short of a meaningful reply. Even if he had a way of constructing the catapult underwater, he does not know what words such as rope and coconut refer to, and thus can’t physically reproduce and improve the experiment. So he simply says "Cool idea, great job! I need to go hunting now, bye!". Anna is bothered by the uncooperative response, but she also needs to go on with her daily business and forgets about the incident.
When we use language, we do so for a specific purpose, which is our communicative intent. For example, the communicative intent can be to convey information, socialise or ask someone to do something. While the first two are rather straightforward for an LLM (as long as it has seen the required information in the data), the latter is already more challenging. Let’s forget about the fact that the LLM does not have an ability to act in the real world and limit ourselves to tasks in its realm of language – writing a speech, an application letter etc. Not only does the LLM need to combine and structure the related information in a coherent way, but it also needs to set the right emotional tone in terms of soft criteria such as formality, creativity, humour etc.
Making the transition from classical language generation to recognising and responding to specific communicative intents is an important step to achieve better acceptance of user-facing NLP systems, especially in Conversational AI. One method for this is Reinforcement Learning from Human Feedback (RLHF), which has recently been implemented in ChatGPT [7] but has a longer history in human preference learning.[8] In a nutshell, RLHF "redirects" the learning process of the LLM from the straightforward but artificial next-token prediction task towards learning human preferences in a given communicative situation. These human preferences are directly encoded in the training data: during the annotation process, humans are presented with prompts and either write the desired response or rank a series of existing responses. The behaviour of the LLM is then optimised to reflect the human preference. Technically, RLHF is performed in three steps: first, the pre-trained LLM is fine-tuned on human-written example responses; second, a reward model is trained on the human rankings of candidate responses; third, the LLM is optimised against this reward model with reinforcement learning.
The LLM is thus fine-tuned to produce useful outputs that maximise human preferences in a given communicative situation, for example using Proximal Policy Optimisation (PPO).
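The heart of the second step, the reward model, boils down to a pairwise ranking objective: the human-preferred response should receive a higher score than the rejected one. Here is a toy sketch of that loss, with fixed numbers standing in for the reward model's scores.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Push the score of the preferred response above the score of the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

score_chosen = torch.tensor([1.3, 0.2])    # reward model scores for the preferred answers
score_rejected = torch.tensor([0.1, 0.5])  # scores for the dispreferred answers
print(reward_ranking_loss(score_chosen, score_rejected))
```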
For a more in-depth introduction into RLHF, please check out the excellent materials by Huggingface (article and video).
The RLHF methodology has had mind-blowing success with ChatGPT, especially in the areas of conversational AI and creative content creation. In fact, it not only leads to more authentic and purposeful conversations, but can also positively "bias" the model towards ethical values while mitigating unethical, discriminatory or even dangerous outputs. However, what is often left unsaid amidst the excitement about RLHF is that, while not introducing significant technological breakthroughs, much of its power comes from sheer human annotation effort. RLHF is prohibitively expensive in terms of labelled data, the known bottleneck for all supervised and reinforcement learning endeavours. Beyond human rankings for LLM outputs, OpenAI’s data for ChatGPT also include human-written responses to prompts that are used to fine-tune the initial LLM. It is obvious that only big companies committed to AI innovation can afford the necessary budget for data labelling at this scale.
With the help of a brainy community, most bottlenecks eventually get solved. In the past, the Deep Learning community solved the data shortage with self-supervision – pre-training LLMs using next-token prediction, a learning signal that is available "for free" since it is inherent to any text. The Reinforcement Learning community is using algorithms such as Variational Autoencoders or Generative Adversarial Networks to generate synthetic data – with varying degrees of success. To make RLHF broadly accessible, we will also need to figure out a way to crowdsource communicative reward data and/or to build it in a self-supervised or automated way. One possibility is to use ranking datasets that are available "in the wild", for example Reddit or Stackoverflow conversations where answers to questions are rated by users. Beyond simple ratings and thumbs up/down labels, some conversational AI systems also allow the user to directly edit the response to demonstrate the desired behaviour, which creates a more differentiated learning signal.
Finally, Anna faces an emergency. She is pursued by an angry bear. In a panic, she grabs a couple of metal sticks and asks Maria to tell her how to defend herself. Of course, the octopus has no clue what Anna means. Not only has he never faced a bear – he also doesn’t know how to behave in a bear attack and how the sticks can help Anna. Solving a task like this not only requires the ability to map accurately between words and objects in the real world, but also to reason about how these objects can be leveraged. The octopus fails miserably, and Anna discovers the deception in this lethal encounter.
Now, what if Maria was still there? Most humans can reason logically, even if there are huge individual differences in the mastery of this skill. Using reasoning, Maria could solve the task as follows:
Premise 1 (based on situation): Anna has a couple of metal sticks.
Premise 2 (based on common-sense knowledge): Bears are intimidated by noise.
Conclusion: Anna can try and use her sticks to make noise and scare the bear away.
LLMs often produce outputs with a valid reasoning chain. Yet, on closer inspection, most of this coherence is the result of pattern learning rather than a deliberate and novel combination of facts. DeepMind has been on a quest to solve causality for years, and a recent attempt is the faithful reasoning framework for question answering.[9] The architecture consists of two LLMs – one for the selection of relevant premises and another for inferring the final, conclusive answer to the question. When prompted with a question and its context, the selection LLM first picks the related statements from its data corpus and passes them to the inference LLM. The inference LLM deduces new statements and adds them to the context. This iterative reasoning process comes to an end when all statements line up into a coherent reasoning chain that provides a complete answer to the question.
Applied to our island incident, the selection LLM would pick out the two premises stated above, and the inference LLM would conclude from them that the metal sticks can be used to make noise and scare the bear away.
Another method to perform reasoning with LLMs is chain-of-thought prompting. Here, the user first provides one or more examples of the reasoning process as part of the prompt, and the LLM "imitates" this reasoning process with new inputs.[13]
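In its simplest form, a chain-of-thought prompt just prepends a worked example with explicit intermediate steps; the wording below is purely illustrative.

```python
cot_prompt = """Q: A library had 120 books and lent out 45. How many books remain?
A: The library started with 120 books. Lending out 45 leaves 120 - 45 = 75. The answer is 75.

Q: A train has 8 carriages with 32 seats each. How many seats are there in total?
A:"""
# Send cot_prompt to the LLM of your choice; the trailing "A:" invites the model
# to continue with its own step-by-step reasoning before stating the answer.
```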
Beyond this general ability to reason logically, humans also access a whole toolbox of more specific reasoning skills. A classical example is mathematical calculation. LLMs can produce these calculations up to a certain level – for example, modern LLMs can confidently perform 2- or 3-digit addition. However, they start to fail systematically when complexity increases, for example when more digits are added or multiple operations need to be performed to solve a mathematical task. And "verbal" tasks formulated in natural language (for example, "I had 10 mangoes and lost 3. How many mangoes do I have left?") are much more challenging than explicit computations ("ten minus three equals…"). While LLM performance can be improved by increasing training time, training data, and parameter sizes, using a simple calculator will still remain the more reliable alternative.
Just like children, who explicitly learn the laws of mathematics and other exact sciences, LLMs can also benefit from hard-coded rules. This sounds like a case for neuro-symbolic AI – and indeed, modular systems like MRKL (pronounced "miracle") by AI21 Labs split the workload of understanding the task, executing the computation and formulating the output result between different models.[12] MRKL stands for Modular Reasoning, Knowledge and Language and combines AI modules in a pragmatic plug-and-play fashion, switching back and forth between structured knowledge, symbolic methods and neural models. Coming back to our example, to perform mathematical calculations, an LLM is first fine-tuned to extract the formal arguments from a verbal arithmetic task (numbers, operators, parentheses). The calculation itself is then "routed" to a deterministic mathematical module, and the final result is formatted in natural language using the output LLM.
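The division of labour can be sketched as follows; extract_args() stands in for the fine-tuned extraction LLM and is stubbed with a fixed output here, while the calculation itself stays deterministic.

```python
import operator

OPS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}

def extract_args(question: str) -> tuple:
    # Hypothetical stub: a real system would use a fine-tuned LLM to map the verbal
    # task to formal arguments, e.g. "lost 3 of my 10 mangoes" -> (10, "minus", 3).
    return 10, "minus", 3

def answer(question: str) -> str:
    a, op, b = extract_args(question)
    result = OPS[op](a, b)                      # deterministic calculation module
    return f"You have {result} mangoes left."   # output formatting (could itself be an LLM)

print(answer("I had 10 mangoes and lost 3. How many mangoes do I have left?"))
```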
As opposed to black-box, monolithic LLMs, reasoning add-ons create transparency and trust since they decompose the "thinking" process into individual steps. They are particularly useful for supporting complex, multi-step decision and action paths. For example, they can be used by virtual assistants that make data-driven recommendations and need to perform multiple steps of analytics and aggregation to get to a conclusion.
In this article, we have provided an overview of approaches to complement the intelligence of LLMs. Let’s summarise our guidelines for maximising the benefits of LLMs and potential enhancements:
And even with the described enhancements, LLMs remain far behind human understanding and language use – they simply lack the unique, powerful and mysterious synergy of cultural knowledge, intuition and experience that humans build up as they go through their lives. According to Yann LeCun, "it is clear that these models are doomed to a shallow understanding that will never approximate the full-bodied thinking we see in humans."[11] When using AI, it is important to appreciate the wonders and complexity we find in language and cognition. Looking at smart machines from the right distance, we can differentiate between tasks that can be delegated to them and those that will remain the privilege of humans in the foreseeable future.
[1] Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.
[2] Emelin, Denis & Bonadiman, Daniele & Alqahtani, Sawsan & Zhang, Yi & Mansour, Saab. (2022). Injecting Domain Knowledge in Language Models for Task-Oriented Dialogue Systems. 10.48550/arXiv.2212.08120.
[3] Fedor Moiseev et al. 2022. SKILL: Structured Knowledge Infusion for Large Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1581–1588, Seattle, United States. Association for Computational Linguistics.
[4] Ryan Daws. 2020. Medical chatbot using OpenAI’s GPT-3 told a fake patient to kill themselves. Retrieved on January 13, 2022.
[5] DeepMind. 2022. Tackling multiple tasks with a single visual language model. Retrieved on January 13, 2023.
[6] Zeng et al. 2022. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. Preprint.
[7] OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. Retrieved on January 13, 2023.
[8] Christiano et al. 2017. Deep reinforcement learning from human preferences.
[9] Creswell & Shanahan. 2022. Faithful Reasoning Using Large Language Models. DeepMind.
[10] Karpas et al. 2022. MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. AI21 Labs.
[11] Jacob Browning & Yann LeCun. 2022. AI And The Limits Of Language. Retrieved on January 13, 2023.
[12] Karpas et al. 2022. MRKL Systems – A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.
[13] Wei et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of NeurIPS 2022.
All images unless otherwise noted are by the author.
The post Overcoming the Limitations of Large Language Models appeared first on Towards Data Science.
The post Choosing the right language model for your NLP use case appeared first on Towards Data Science.
Large Language Models (LLMs) are Deep Learning models trained to produce text. With this impressive ability, LLMs have become the backbone of modern Natural Language Processing (NLP). Traditionally, they are pre-trained by academic institutions and big tech companies such as OpenAI, Microsoft and NVIDIA. Most of them are then made available for public use. This plug-and-play approach is an important step towards large-scale AI adoption – instead of spending huge resources on the training of models with general linguistic knowledge, businesses can now focus on fine-tuning existing LLMs for specific use cases.
However, picking the right model for your application can be tricky. Users and other stakeholders have to make their way through a vibrant landscape of language models and related innovations. These improvements address different components of the language model including its training data, pre-training objective, architecture and fine-tuning approach – you could write a book on each of these aspects. On top of all this research, the marketing buzz and the intriguing aura of Artificial General Intelligence around huge language models obfuscate things even more.
In this article, I explain the main concepts and principles behind LLMs. The goal is to provide non-technical stakeholders with an intuitive understanding as well as a language for efficient interaction with developers and AI experts. For broader coverage, the article includes analyses that are rooted in a large number of NLP-related publications. While we will not dive into mathematical details of language models, these can be easily retrieved from the references.
The article is structured as follows: first, I situate language models in the context of the evolving NLP landscape. The second section explains how LLMs are built and pre-trained. Finally, I describe the fine-tuning process and provide some guidance on model selection.
Language is a fascinating skill of the human mind – it is a universal protocol for communicating our rich knowledge of the world, and also more subjective aspects such as intents, opinions and emotions. In the history of AI, there have been multiple waves of research to approximate ("model") human language with mathematical means. Before the era of Deep Learning, representations were based on simple algebraic and probabilistic concepts such as one-hot representations of words, sequential probability models and recursive structures. With the evolution of Deep Learning in the past years, linguistic representations have increased in precision, complexity and expressiveness.
In 2018, BERT was introduced as the first LLM on the basis of the new Transformer architecture. Since then, Transformer-based LLMs have gained strong momentum. Language modelling is especially attractive due to its universal usefulness. While many real-world NLP tasks such as sentiment analysis, information retrieval and information extraction do not need to generate language, the assumption is that a model that produces language also has the skills to solve a variety of more specialised linguistic challenges.
Learning happens based on parameters – variables that are optimized during the training process to achieve the best prediction quality. As the number of parameters increases, the model is able to acquire more granular knowledge and improve its predictions. Since the introduction of the first LLMs in 2017–2018, we have seen an exponential explosion in parameter sizes – while breakthrough BERT was trained with 340M parameters, Megatron-Turing NLG, a model released in 2022, is trained with 530B parameters – a more than thousand-fold increase.
Thus, the mainstream keeps wowing the public with ever larger parameter counts. However, there have been critical voices pointing out that model performance is not increasing at the same rate as model size. Moreover, model pre-training can leave a considerable carbon footprint. Downsizing efforts have countered the brute-force approach to make progress in language modelling more sustainable.
The LLM landscape is competitive and innovations are short-lived. The following chart shows the top-15 most popular LLMs in the timespan 2018–2022, along with their share-of-voice over time:
We can see that most models fade in popularity after a relatively short time. To stay cutting-edge, users should monitor the current innovations and evaluate whether an upgrade would be worthwhile.
Most LLMs follow a similar lifecycle: first, at the "upstream", the model is pre-trained. Due to the heavy requirements on data size and compute, it is mostly a privilege of large tech companies and universities. Recently, there have also been some collaborative efforts (e.g. the BigScience workshop) for the joint advancement of the LLM field. A handful of well-funded startups such as Cohere and AI21 Labs also provide pre-trained LLMs.
After the release, the model is adopted and deployed at the "downstream" by application-focussed developers and businesses. At this stage, most models require an extra fine-tuning step to specific domains and tasks. Others, like GPT-3, are more convenient in that they can learn a variety of linguistic tasks directly during prediction (zero- or few-shot prediction).
Finally, time knocks at the door and a better model comes around the corner – either with an even larger number of parameters, more efficient use of hardware or a more fundamental improvement to the modelling of human language. Models that brought about substantial innovations can give birth to whole model families. For example, BERT lives on in BERT-QA, DistilBERT and RoBERTa, which are all based on the original architecture.
In the next sections, we will look at the first two phases in this lifecycle – the pre-training and the fine-tuning for deployment.
Most teams and NLP practitioners will not be involved in the pre-training of LLMs, but rather in their fine-tuning and deployment. However, to successfully pick and use a model, it is important to understand what is going on "under the hood". In this section, we will look at the basic ingredients of an LLM: the training data, the input representation, the pre-training objective and the model architecture.
Each of these will affect not only the choice, but also the fine-tuning and deployment of your LLM.
The data used for LLM training is mostly text data covering different styles, such as literature, user-generated content and news data. After seeing a variety of different text types, the resulting models become aware of the fine details of language. Other than text data, code is regularly used as input, teaching the model to generate valid programs and code snippets.
Unsurprisingly, the quality of the training data has a direct impact on model performance – and also on the required size of the model. If you are smart in preparing the training data, you can improve model quality while reducing its size. One example is the T0 model, which is 16 times smaller than GPT-3 but outperforms it on a range of benchmark tasks. Here is the trick: instead of just using any text as training data, it works directly with task formulations, thus making its learning signal much more focussed. Figure 3 illustrates some training examples.
A final note on training data: we often hear that language models are trained in an unsupervised manner. While this makes them appealing, it is technically wrong – the training is better described as self-supervised. Well-formed text already provides the necessary learning signals, sparing us the tedious process of manual data annotation. The labels to be predicted correspond to past and/or future words in a sentence. Thus, annotation happens automatically and at scale, making possible the relatively quick progress in the field.
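A tiny illustration of why the labels come "for free": every position in a well-formed sentence yields a (context, next word) training pair without any manual annotation.

```python
sentence = "language models learn from raw text".split()

# Each prefix of the sentence becomes the context, the following word the label.
pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
for context, label in pairs:
    print(" ".join(context), "->", label)
```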
Once the training data is assembled, we need to pack it into a form that can be digested by the model. Neural networks are fed with algebraic structures (vectors and matrices), and the optimal algebraic representation of language is an ongoing quest – reaching from simple sets of words to representations containing highly differentiated context information. Each new step confronts researchers with the endless complexity of natural language, exposing the limitations of the current representation.
The basic unit of language is the word. In the beginnings of NLP, this gave rise to the naive bag-of-words representation that throws all words from a text together, irrespective of their ordering. Consider two sentences that consist of exactly the same words in a different order – for example, "the dog bit the man" and "the man bit the dog".
In the bag-of-words world, these sentences get exactly the same representation since they consist of the same words. Clearly, this embraces only a small part of their meaning.
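The point is easy to verify: counting words erases the difference between the two sentences entirely.

```python
from collections import Counter

s1 = "the dog bit the man"
s2 = "the man bit the dog"

print(Counter(s1.split()))                          # Counter({'the': 2, 'dog': 1, 'bit': 1, 'man': 1})
print(Counter(s1.split()) == Counter(s2.split()))   # True – identical bag-of-words representations
```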
Sequential representations accommodate information about word order. In Deep Learning, the processing of sequences was originally implemented in order-aware Recurrent Neural Networks (RNN).[2] However, going one step further, the underlying structure of language is not purely sequential but hierarchical. In other words, we are not talking about lists, but about trees. Words that are farther apart can actually have stronger syntactic and semantic ties than neighbouring words. Consider an example sentence such as "The girl who lives at the end of the street lost her keys."
Here, her refers to the girl. When an RNN reaches the end of the sentence and finally sees her, its memory of the beginning of the sentence might already be fading, thus not allowing it to recover this relationship.
To solve these long-distance dependencies, more complex neural structures were proposed to build up a more differentiated memory of the context. The idea is to keep words that are relevant for future predictions in memory while forgetting the other words. This was the contribution of Long-Short Term Memory (LSTM)[3] cells and Gated Recurrent Units (GRUs)[4]. However, these models don’t optimise for specific positions to be predicted, but rather for a generic future context. Moreover, due to their complex structure, they are even slower to train than traditional RNNs.
Finally, people have done away with recurrence and proposed the attention mechanism, as incorporated in the Transformer architecture.[5] Attention allows the model to focus back and forth between different words during prediction. Each word is weighted according to its relevance for the specific position to be predicted. For the above sentence, once the model reaches the position of her, girl will have a higher weight than at, despite the fact that it is much farther away in the linear order.
To date, the attention mechanism comes closest to the biological workings of the human brain during information processing. Studies have shown that attention learns hierarchical syntactic structures, incl. a range of complex syntactic phenomena (cf. the Primer on BERTology and the papers referenced therein). It also allows for parallel computation and, thus, faster and more efficient training.
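At its core, the mechanism is surprisingly compact; here is a bare-bones scaled dot-product attention in NumPy, with random vectors standing in for real token representations.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # similarity between every pair of positions
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                             # each position becomes a weighted mix of all values

tokens, dim = 5, 8                                 # e.g. a 5-word sentence, 8-dimensional vectors
Q = K = V = np.random.rand(tokens, dim)
print(attention(Q, K, V).shape)                    # (5, 8)
```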
With the appropriate training data representation in place, our model can start learning. There are three generic objectives used for pre-training language models: sequence-to-sequence transduction, autoregression and auto-encoding. All of them require the model to master broad linguistic knowledge.
The original task addressed by the encoder-decoder architecture as well as the Transformer model is sequence-to-sequence transduction: a sequence is transduced into a sequence in a different representation framework. The classical sequence-to-sequence task is machine translation, but other tasks such as summarisation are frequently formulated in this manner. Note that the target sequence is not necessarily text – it can also be other unstructured data such as images as well as structured data such as programming languages. An example of sequence-to-sequence LLMs is the BART family.
The second task is autoregression, which is also the original language modelling objective. In autoregression, the model learns to predict the next output (token) based on previous tokens. The learning signal is restricted by the unidirectionality of the enterprise – the model can only use information from the right or from the left of the predicted token. This is a major limitation since words can depend both on past as well as on future positions. As an example, consider how the verb written impacts a sentence such as "The student has written a paper on NLP" in both directions.
Here, the position of paper is restricted to something that is writable, while the position of student is restricted to a human or, anyway, another intelligent entity capable of writing.
Many of the LLMs making today’s headlines are autoregressive, incl. the GPT family, PaLM and BLOOM.
The third task – auto-encoding – solves the issue of unidirectionality. Auto-encoding is very similar to the learning of classical word embeddings.[6] First, we corrupt the training data by hiding a certain portion of tokens – typically 10–20% – in the input. The model then learns to reconstruct the correct inputs based on the surrounding context, taking into account both the preceding and the following tokens. The typical example of auto-encoders is the BERT family, where BERT stands for Bidirectional Encoder Representations from Transformers.
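The corruption step can be pictured in a couple of lines: a fraction of tokens is replaced by a mask symbol, and the model's training task is to reconstruct the originals from the bidirectional context.

```python
import random

tokens = "the river overflowed after three days of heavy rain".split()

# Replace roughly 15% of the tokens with a mask symbol (positions vary per run).
masked = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
print(" ".join(masked))
```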
The basic building blocks of a language model are the encoder and the decoder. The encoder transforms the original input into a high-dimensional algebraic representation, also called a "hidden" vector. Wait a minute – hidden? Well, in reality there are no big secrets at this point. Of course you can look at this representation, but a lengthy vector of numbers will not convey anything meaningful to a human. It takes the mathematical intelligence of our model to deal with it. The decoder reproduces the hidden representation in an intelligible form such as another language, programming code, an image etc.
The encoder-decoder architecture was originally introduced for Recurrent Neural Networks. Since the introduction of the attention-based Transformer model, traditional recurrence has lost its popularity while the encoder-decoder idea lives on. Most Natural Language Understanding (NLU) tasks rely on the encoder, while Natural Language Generation (NLG) tasks need the decoder and sequence-to-sequence transduction requires both components.
We will not go into the details of the Transformer architecture and the attention mechanism here. For those who want to master the details, be prepared to spend a good amount of time to wrap your head around it. Beyond the original paper, [7] and [8] provide excellent explanations. For a lightweight introduction, I recommend the corresponding sections in Andrew Ng’s Sequence models course.
Language modelling is a powerful upstream task – if you have a model that successfully generates language, congratulations – it is an intelligent model. However, the business value of having a model bubbling with random text is limited. Instead, NLP is mostly used for more targeted downstream tasks such as sentiment analysis, question answering and information extraction. This is the time to apply transfer learning and reuse the existing linguistic knowledge for more specific challenges. During fine-tuning, a portion of the model is "frozen" and the rest is further trained with domain- or task-specific data.
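With the Hugging Face transformers library, the freezing step might look as follows; the checkpoint name is an illustrative placeholder, and the remaining classification head is then trained in your usual training loop.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the pre-trained encoder; only the newly added classification head stays trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```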
Explicit fine-tuning adds complexity on the path towards LLM deployment. It can also lead to model explosion, where each business task requires its own fine-tuned model, escalating to an unmaintainable variety of models. So, folks have made an effort to get rid of the fine-tuning step using few- or zero-shot learning (e.g. in GPT-3 [9]). This learning happens on-the-fly during prediction: the model is fed with a "prompt" – a task description and potentially a few training examples – to guide its predictions for future examples.
While much quicker to implement, the convenience factor of zero- or few-shot learning is counterbalanced by its lower prediction quality. Besides, many of these models need to be accessed via cloud APIs. This might be a welcome opportunity at the beginning of your development – however, at more advanced stages, it can turn into another unwanted external dependency.
Looking at the continuous supply of new language models on the AI market, selecting the right model for a specific downstream task and staying in sync with the state of the art can be tricky.
Research papers normally benchmark each model against specific downstream tasks and datasets. Standardised task suites such as SuperGLUE and BIG-bench allow for unified benchmarking against a multitude of NLP tasks and provide a basis for comparison. Still, we should keep in mind that these tests are prepared in a highly controlled setting. As of today, the generalisation capacity of language models is rather limited – thus, the transfer to real-life datasets might significantly affect model performance. The evaluation and selection of an appropriate model should involve experimentation on data that is as close as possible to the production data.
As a rule of thumb, the pre-training objective provides an important hint: autoregressive models perform well on text generation tasks such as Conversational AI, question answering and text summarisation, while auto-encoders excel at "understanding" and structuring language, for example for sentiment analysis and various information extraction tasks. Models intended for zero-shot learning can theoretically perform all kinds of tasks as long as they receive appropriate prompts – however, their accuracy is generally lower than that of fine-tuned models.
To make things more concrete, the following chart shows how popular NLP tasks are associated with prominent language models in the NLP literature. The associations are computed based on multiple similarity and aggregation metrics, incl. embedding similarity and distance-weighted co-occurrence. Model-task pairs with higher scores, such as BART / Text Summarization and LaMDA / Conversational AI, indicate a good fit based on historical data.
In this article, we have covered the basic notions of LLMs and the main dimensions where innovation is happening. The following table provides a summary of the key features for the most popular LLMs:
Let’s summarise some general guidelines for the selection and deployment of LLMs:
Finally, be aware of the limitations of LLMs. While they have the amazing, human-like capacity to produce language, their overall cognitive power is galaxies away from us humans. The world knowledge and reasoning capacity of these models are strictly limited to the information they find at the surface of language. They also can’t situate facts in time and might provide you with outdated information without blinking an eye. If you are building an application that relies on generating up-to-date or even original knowledge, consider combining your LLM with additional multimodal, structured or dynamic knowledge sources.
[1] Victor Sanh et al. 2021. Multitask prompted training enables zero-shot task generalization. CoRR, abs/2110.08207.
[2] Yoshua Bengio et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[4] Kyunghyun Cho et al. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar.
[5] Ashish Vaswani et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
[6] Tomas Mikolov et al. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.
[7] Jay Jalammar. 2018. The illustrated transformer.
[8] Alexander Rush et al. 2018. The annotated transformer.
[9] Tom B. Brown et al. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.
[10] Jacob Devlin et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
[11] Julien Simon. 2021. Large Language Models: A New Moore’s Law?
[12] Underlying dataset: more than 320k articles on AI and NLP published 2018–2022 in specialised AI resources, technology blogs and publications by the leading AI think tanks.
All images unless otherwise noted are by the author.
The post Choosing the right language model for your NLP use case appeared first on Towards Data Science.
The post Major trends in NLP: a review of 20 years of ACL research appeared first on Towards Data Science.
The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) is starting this week in Florence, Italy. We took the opportunity to review major research trends in the animated NLP space and formulate some implications from the business perspective. The article is backed by a statistical and – guess what – NLP-based analysis of ACL papers from the last 20 years.
Natural language is one of the primary USPs that set the human mind apart from other species. NLP, a major buzzword in today's tech discussion, deals with how computers can understand and generate language. The rise of NLP in the past decades is backed by a couple of global developments – the universal hype around AI, exponential advances in the field of Deep Learning and an ever-increasing quantity of available text data. But what is the substance behind the buzz? In fact, NLP is a highly complex, interdisciplinary field that is constantly supplied by high-quality fundamental research in linguistics, math and computer science. The ACL conference brings these different angles together. As the following chart shows, research activity has been flourishing in the past years:
Figure 1: Number of papers published at the ACL conference per year
In the following, we summarize some core trends in terms of data strategies, algorithms, tasks as well as multilingual NLP. The analysis is based on ACL papers published since 1998 which were processed using a domain-specific ontology for the fields of NLP and Machine Learning.
The quantity of freely available text data is increasing exponentially, mainly due to the massive production of Web content. However, this large body of data comes with some key challenges. First, large data is inherently noisy. Think of natural resources such as oil and metal – they need a process of refining and purification before they can be used in the final product. The same goes for data. In general, the more "democratic" the production channel, the dirtier the data – which means that more effort has to be spent on its cleaning. For example, data from social media will require a longer cleaning pipeline. Among other things, you will need to deal with extravagances of self-expression like smileys and irregular punctuation, which are normally absent in more formal settings such as scientific papers or legal contracts.
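To make this concrete, here is a minimal, illustrative cleaning step for user-generated text in Python; the specific regular expressions are examples of common heuristics, not a pipeline prescribed by the analysis.

```python
import re

def clean_social_media_text(text: str) -> str:
    """Toy cleaning routine for noisy user-generated text; real pipelines go much further."""
    text = re.sub(r"http\S+", " ", text)                  # drop URLs
    text = re.sub(r"[\U0001F300-\U0001FAFF]", " ", text)  # drop most emoji / smileys
    text = re.sub(r"([!?.])\1+", r"\1", text)             # collapse repeated punctuation
    return re.sub(r"\s+", " ", text).strip()              # normalise whitespace

print(clean_social_media_text("sooo excited!!! 😍 see https://example.com ..."))
# -> "sooo excited! see ."
```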
The other major challenge is the labeled data bottleneck: strictly speaking, most state-of-the-art algorithms are supervised. They not only need labeled data – they need Big Labeled Data. This is especially relevant for the advanced, complex algorithms of the Deep Learning family. Just as a child's brain first needs a massive amount of input before it can learn its native language, an algorithm first needs to see a large quantity of data before it can go "deep" and embrace language in its whole complexity.
Traditionally, training data at smaller scale has been annotated manually. However, dedicated manual annotation of large datasets comes with efficiency trade-offs which are rarely acceptable, especially in the business context.
What are the possible solutions? On the one hand, there are some enhancements on the management side, incl. crowd-sourcing and Training Data as a Service (TDaaS). On the other hand, a range of automatic workarounds for the creation of annotated datasets have also been suggested in the machine learning community. The following chart shows some trends:
Figure 2: Discussion of approaches for creation and reuse of training data (amounts of mentions normalised by paper quantity in the respective year)
Clearly, pretraining has seen the biggest rise in the past five years. In pretraining, a model is first trained on a large, general dataset and subsequently tweaked with task-specific data and objectives. Its popularity is largely due to the fact that companies such as Google and Facebook are making huge models available out-of-the-box to the open-source community. Pretrained word embeddings such as Word2Vec and FastText, as well as pretrained language models such as BERT, allow NLP developers to jump to the next level. Transfer learning is another approach to reusing models across different tasks. If the reuse of existing models is not an option, one can leverage a small quantity of labeled data to automatically label a larger quantity of data, as is done in distant and weak supervision – note, however, that these approaches usually lead to a decrease in labeling precision.
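As a rough illustration of the weak supervision idea, the toy sketch below lets two handcrafted labelling functions vote on unlabelled examples; the keywords, labels and example texts are invented for demonstration, and libraries such as Snorkel implement this pattern at scale.

```python
from typing import Optional

# Heuristic labelling functions: cheap, noisy substitutes for manual annotation.
def lf_positive(text: str) -> Optional[str]:
    return "positive" if any(w in text.lower() for w in ("great", "excellent")) else None

def lf_negative(text: str) -> Optional[str]:
    return "negative" if any(w in text.lower() for w in ("poor", "late", "broken")) else None

def weak_label(text: str) -> Optional[str]:
    votes = [v for v in (lf_positive(text), lf_negative(text)) if v is not None]
    # Keep only examples where the heuristics agree; as noted above, precision
    # is usually lower than with dedicated manual annotation.
    return votes[0] if votes and len(set(votes)) == 1 else None

print([weak_label(t) for t in ("Excellent service", "Parcel arrived broken", "Okay, I guess")])
# -> ['positive', 'negative', None]
```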
In terms of algorithms, research in recent years has been strongly focussed on the Deep Learning family:
Figure 3: Discussion of Deep Learning algorithms (amounts of mentions normalised by paper quantity in the respective year)
Word embeddings are clearly taking off. In their basic form, word embeddings were introduced by Mikolov et al. (2013). The universal linguistic principle behind word embeddings is distributional similarity: a word can be characterized by the contexts in which it occurs. Thus, as humans, we normally have no difficulty completing the sentence "The customer signed the ___ today" with suitable words such as "deal" or "contract". Word embeddings allow machines to do this automatically and are thus extremely powerful for addressing the very core of the context awareness issue.
While word2vec, the original embedding algorithm, is statistical and does not account for complexities of life such as ambiguity, context sensitivity and linguistic structure, subsequent approaches have enriched word embeddings with all kinds of linguistic information. And, by the way, you can embed not only words, but also other things such as senses, sentences and whole documents.
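The following sketch trains a tiny Word2Vec model with gensim (4.x API assumed) to illustrate distributional similarity; the three-sentence corpus is far too small for meaningful vectors, so in practice you would train on millions of sentences or load pretrained embeddings.

```python
from gensim.models import Word2Vec

# Toy corpus: "contract" and "deal" appear in similar contexts.
corpus = [
    ["the", "customer", "signed", "the", "contract", "today"],
    ["the", "customer", "signed", "the", "deal", "yesterday"],
    ["we", "reviewed", "the", "contract", "before", "signing"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=100)
print(model.wv.most_similar("contract", topn=3))  # nearest neighbours by cosine similarity
```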
Neural Networks are the workhorse of Deep Learning (cf. Goldberg and Hirst (2017) for an introduction of the basic architectures in the NLP context). Convolutional Neural Networks have seen an increase in the past years, whereas the popularity of the traditional Recurrent Neural Network (RNN) is dropping. This is due, on the one hand, to the availability of more efficient RNN-based architectures such as LSTM and GRU. On the other hand, a new and pretty disruptive mechanism for sequential processing – attention – has been introduced on top of the sequence-to-sequence (seq2seq) model of Sutskever et al. (2014) by Bahdanau et al. (2014). If you use Google Translate, you might have noticed the leapfrog in the translation quality a couple of years ago – attention-based seq2seq was behind it. And while seq2seq still relies on RNNs in the pipeline, the transformer architecture, another major advance from 2017, finally gets rid of recurrence and completely relies on the attention mechanism (Vaswani et al. 2017).
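At its core, the attention mechanism of the transformer boils down to a weighted sum over value vectors, with weights derived from query-key similarity. The bare-bones NumPy sketch below computes scaled dot-product attention for tiny random matrices; it ignores masking, multiple heads and batching.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                                         # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                                      # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))        # 4 positions, dimension 8
print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 8)
```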
Deep Learning is a vibrant and fascinating domain, but it can also be quite intimidating from the application point of view. When it feels that way, keep in mind that most developments are motivated by increased efficiency at Big Data scale, context awareness and scalability to different tasks and languages. For a mathematical introduction, Young et al. (2018) present an excellent overview of the state-of-the-art algorithms.
When we look at specific NLP tasks such as sentiment analysis and named entity recognition, the inventory is much steadier than that of the underlying algorithms. Over the years, there has been a gradual evolution from preprocessing tasks such as stemming, through syntactic parsing and information extraction, to semantically oriented tasks such as sentiment/emotion analysis and semantic parsing. This corresponds to the three "global" NLP development curves – syntax, semantics and context awareness – as described by Cambria et al. (2014). As we have seen in the previous section, the third curve – the awareness of a larger context – has already become one of the main drivers behind new Deep Learning algorithms.
From an even more general perspective, there is an interesting trend towards task-agnostic research. In the section on data above, we saw how the generalisation power of modern mathematical approaches has been leveraged in scenarios such as transfer learning and pretraining. Indeed, modern algorithms are developing amazing multi-tasking powers – thus, the relevance of the specific task at hand decreases. The following chart shows an overall decline in the discussion of specific NLP tasks since 2006:
Figure 4: Amount of discussion of specific NLP tasks
With globalization, going international becomes an imperative for business growth. English is traditionally the starting point for most NLP research, but the demand for scalable multilingual NLP systems has been increasing in recent years. How is this need reflected in the research community? Think of different languages as different lenses through which we view the same world – they share many properties, a fact that is fully accommodated by modern learning algorithms with their increasing power for abstraction and generalisation. Still, language-specific features have to be thoroughly addressed, especially in the preprocessing phase (a small detect-and-route sketch at the end of this section illustrates the idea). As the following chart shows, the diversity of languages addressed in ACL research keeps increasing:
Figure 5: Frequent languages per year (> 10 mentions per language)
However, just as seen for NLP tasks in the previous section, we can expect a consolidation once language-specific differences have been neutralized for the next wave of algorithms. The most popular languages are summarised in Figure 6.
Figure 6: Languages addressed by ACL research
For some of these languages, research interest meets commercial attractiveness: languages such as English, Chinese and Spanish bring together large quantities of available data, huge native speaker populations and a large economic potential in the corresponding geographical regions. However, the abundance of "smaller" languages also shows that the NLP field is generally evolving towards a theoretically sound treatment of multilinguality and cross-linguistic generalisation.
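As a small, hedged illustration of language-aware preprocessing, the sketch below detects the input language and routes it to a matching spaCy pipeline; it assumes the `langdetect` package is installed along with the small English and German spaCy models (`python -m spacy download en_core_web_sm` and `de_core_news_sm`), and the two-language routing table is a toy.

```python
import spacy
from langdetect import detect

# Toy routing table: one pipeline per supported language.
pipelines = {
    "en": spacy.load("en_core_web_sm"),
    "de": spacy.load("de_core_news_sm"),
}

def tokenize(text: str) -> list[str]:
    lang = detect(text)                           # e.g. "en" or "de"
    nlp = pipelines.get(lang, pipelines["en"])    # fall back to English
    return [token.text for token in nlp(text)]

print(tokenize("Der Vertrag wurde gestern unterschrieben."))
```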
Spurred by the global AI hype, the NLP field is exploding with new approaches and disruptive improvements. There is a shift towards modeling meaning and context dependence, probably the most universal and challenging facet of human language. The generalisation power of modern algorithms allows for efficient scaling across different tasks, languages and datasets, thus significantly speeding up the ROI cycle of NLP developments and allowing for a flexible and efficient integration of NLP into individual business scenarios.
Stay tuned for a review of ACL 2019 and more updates on NLP trends!
The post Major trends in NLP: a review of 20 years of ACL research appeared first on Towards Data Science.