Injecting domain expertise into your AI system

When starting their AI initiatives, many companies are trapped in silos and treat AI as a purely technical enterprise, sidelining domain experts or involving them too late. They end up with generic AI applications that miss industry nuances, produce poor recommendations, and quickly become unpopular with users. By contrast, AI systems that deeply understand industry-specific processes, constraints, and decision logic have the following benefits:
Domain experts can help you connect the dots between the technicalities of an AI system and its real-life usage and value. Thus, they should be key stakeholders and co-creators of your AI applications. This guide is the first part of my series on expertise-driven AI. Following my mental model of AI systems, it provides a structured approach to embedding deep domain expertise into your AI.
Throughout the article, we will use supply chain optimisation (SCO) as a running use case to illustrate these different methods. Modern supply chains are under constant strain from geopolitical tensions, climate disruptions, and volatile demand shifts, and AI can provide the kind of dynamic, high-coverage intelligence needed to anticipate delays, manage risks, and optimise logistics. However, without domain expertise, these systems are often disconnected from operational reality. Let’s see how we can solve this by integrating domain expertise across the different components of the AI application.
AI is only as domain-aware as the data it learns from. Raw data isn’t enough – it must be curated, refined, and contextualised by experts who understand its meaning in the real world.
While data scientists can build sophisticated models to analyse patterns and distributions, these analyses often stay at a theoretical, abstract level. Only domain experts can validate whether the data is complete, accurate, and representative of real-world conditions.
In supply chain optimisation, for example, shipment records may contain missing delivery timestamps, inconsistent route details, or unexplained fluctuations in transit times. A data scientist might discard these as noise, but a logistics expert could have real-world explanations of these inconsistencies. For instance, they might be caused by weather-related delays, seasonal port congestion, or supplier reliability issues. If these nuances aren’t accounted for, the AI might learn an overly simplified view of supply chain dynamics, resulting in misleading risk assessments and poor recommendations.
Experts also play a critical role in assessing the completeness of data. AI models work with what they have, assuming that all key factors are already present. It takes human expertise and judgment to identify blind spots. For example, if your supply chain AI isn’t trained on customs clearance times or factory shutdown histories, it won’t be able to predict disruptions caused by regulatory issues or production bottlenecks.
Implementation tip: Run joint Exploratory Data Analysis (EDA) sessions with data scientists and domain experts to identify missing business-critical information, ensuring AI models work with a complete and meaningful dataset, not just statistically clean data.
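As a minimal illustration, such a session could start from a script that surfaces the records worth discussing. The shipment table, column names, and thresholds below are hypothetical placeholders to be refined with the experts:

```python
import pandas as pd

# Hypothetical shipment records -- schema and values are illustrative only.
shipments = pd.DataFrame({
    "shipment_id": ["S1", "S2", "S3", "S4"],
    "supplier": ["Acme", "Acme", "Borealis", "Cetus"],
    "route": ["SHA-HAM", None, "SHA-ROT", "SHA-ROT"],
    "transit_days": [14.0, 43.0, 16.0, None],
})

# Flag gaps and outliers for experts to explain -- not for automatic deletion.
mean, std = shipments["transit_days"].mean(), shipments["transit_days"].std()
issues = pd.DataFrame({
    "missing_route": shipments["route"].isna(),
    "missing_transit_time": shipments["transit_days"].isna(),
    "transit_outlier": shipments["transit_days"] > mean + 2 * std,
})

# These rows become the agenda of the joint EDA session:
# weather delay? port congestion? data-entry error?
print(shipments[issues.any(axis=1)])
```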
One common pitfall when starting with AI is integrating too much data too soon, leading to complexity, congestion of your data pipelines, and blurred or noisy insights. Instead, start with a couple of high-impact data sources and expand incrementally based on AI performance and user needs. For instance, an SCO system may initially use historical shipment data and supplier reliability scores. Over time, domain experts may identify missing information – such as port congestion data or real-time weather forecasts – and point engineers to those data sources where it can be found.
Implementation tip: Start with a minimal, high-value dataset (normally 3–5 data sources), then expand incrementally based on expert feedback and real-world AI performance.
AI models learn by detecting patterns in data, but sometimes, the right learning signals aren’t yet present in raw data. This is where data annotation comes in – by labelling key attributes, domain experts help the AI understand what matters and make better predictions. Consider an AI model built to predict supplier reliability. The model is trained on shipment records, which contain delivery times, delays, and transit routes. However, raw delivery data alone doesn’t capture the full picture of supplier risk – there are no direct labels indicating whether a supplier is "high risk" or "low risk."
Without more explicit learning signals, the AI might make the wrong conclusions. It could conclude that all delays are equally bad, even when some are caused by predictable seasonal fluctuations. Or it might overlook early warning signs of supplier instability, such as frequent last-minute order changes or inconsistent inventory levels.
Domain experts can enrich the data with more nuanced labels, such as supplier risk categories, disruption causes, and exception-handling rules. By introducing these curated learning signals, you can ensure that AI doesn’t just memorise past trends but learns meaningful, decision-ready insights.
You shouldn’t rush your annotation efforts – instead, think about a structured annotation process that includes the following components:
Implementation tip: Define an annotation playbook with clear labelling criteria, involve at least two domain experts per critical label for objectivity, and run regular annotation review cycles to ensure AI is learning from accurate, business-relevant insights.
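For illustration, the playbook can boil down to a shared label schema plus an agreement check that triggers the review cycle; the labels, causes, and records below are hypothetical:

```python
# Hypothetical label schema agreed in the annotation playbook.
RISK_LABELS = ("low_risk", "medium_risk", "high_risk")
DISRUPTION_CAUSES = ("weather", "port_congestion", "supplier_instability", "seasonal_peak")

annotations = [
    {"shipment_id": "S17", "expert_1": "high_risk", "expert_2": "high_risk",
     "cause": "supplier_instability"},
    {"shipment_id": "S18", "expert_1": "low_risk", "expert_2": "medium_risk",
     "cause": "seasonal_peak"},
]

# Two experts per critical label: disagreements go into the review cycle.
agreement = sum(a["expert_1"] == a["expert_2"] for a in annotations) / len(annotations)
to_review = [a["shipment_id"] for a in annotations if a["expert_1"] != a["expert_2"]]
print(f"inter-annotator agreement: {agreement:.0%}, escalate: {to_review}")
```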
So far, our AI models learn from real-life historical data. However, rare, high-impact events – like factory shutdowns, port closures, or regulatory shifts in our supply chain scenario – may be underrepresented. Without exposure to these scenarios, AI can fail to anticipate major risks, leading to overconfidence in supplier stability and poor contingency planning. Synthetic data solves this by creating more datapoints for rare events, but expert oversight is crucial to ensure that it reflects plausible risks rather than unrealistic patterns.
Let’s say we want to predict supplier reliability in our supply chain system. The historical data may have few recorded supplier failures – but that’s not because failures don’t happen. Rather, many companies proactively mitigate risks before they escalate. Without synthetic examples, AI might deduce that supplier defaults are extremely rare, leading to misguided risk assessments.
Experts can help generate synthetic failure scenarios based on:
Actionable step: Work with domain experts to define high-impact but low-frequency events and scenarios, and make these the focus of your synthetic data generation.
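One way to put this into practice is to sample synthetic records from expert-defined scenario templates. The events, delay ranges, and weights below are purely illustrative and would come out of such an expert workshop:

```python
import random

# Expert-defined rare-event templates: plausible ranges, not observed history.
SCENARIOS = [
    {"event": "factory_shutdown", "delay_days": (10, 40), "weight": 0.2},
    {"event": "port_closure",     "delay_days": (5, 20),  "weight": 0.3},
    {"event": "regulatory_hold",  "delay_days": (3, 15),  "weight": 0.5},
]

def synthetic_disruptions(n: int, seed: int = 42) -> list[dict]:
    """Generate n synthetic disruption records to enrich the training data."""
    rng = random.Random(seed)
    weights = [s["weight"] for s in SCENARIOS]
    records = []
    for i in range(n):
        scenario = rng.choices(SCENARIOS, weights=weights)[0]
        low, high = scenario["delay_days"]
        records.append({
            "shipment_id": f"synthetic_{i}",
            "event": scenario["event"],
            "delay_days": rng.randint(low, high),
            "label": "disrupted",  # explicit learning signal
        })
    return records

# Experts review a sample of the output before it enters the training set.
print(synthetic_disruptions(3))
```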
Data makes domain expertise shine. An AI initiative that relies on clean, relevant, and enriched domain data will have an obvious competitive advantage over one that takes the "quick-and-dirty" shortcut to data. However, keep in mind that working with data can be tedious, and experts need to see the outcome of their efforts – whether it’s improving AI-driven risk assessments, optimising supply chain resilience, or enabling smarter decision-making. The key is to make data collaboration intuitive, purpose-driven, and directly tied to business outcomes, so experts remain engaged and motivated.
Once AI has access to high-quality data, the next challenge is ensuring it generates useful and accurate outputs. Domain expertise is needed to:
Let’s look at some common AI approaches and see how they can benefit from an extra shot of domain knowledge.
For structured problems like supply chain forecasting, predictive models such as classification and regression can help anticipate delays and suggest optimisations. However, to make sure these models are aligned with business goals, data scientists and knowledge engineers need to work together. For example, an AI model might try to minimise shipment delays at all costs, but a supply chain expert knows that fast-tracking every shipment through air freight is financially unsustainable. They can formulate additional constraints on the model, making it prioritise critical shipments while balancing cost, risk, and lead times.
Implementation tip: Define clear objectives and constraints with domain experts before training AI models, ensuring alignment with real business priorities.
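As a sketch of how such constraints can enter a model, the snippet below balances delay reduction against an air-freight budget with a tiny linear program; all segments, costs, and the budget are invented, and a real system would embed these constraints in whatever optimiser or loss function it already uses:

```python
from scipy.optimize import linprog

# Three shipment segments; segment 0 is business-critical (expert input).
volume     = [100, 400, 300]      # shipments per month
delay_sea  = [12.0, 10.0, 9.0]    # expected delay in days by sea
delay_air  = [2.0, 2.0, 2.0]      # expected delay in days by air
extra_cost = [800, 500, 600]      # extra cost per shipment if flown
budget     = 150_000              # monthly air-freight budget (expert constraint)

# x[i] = share of segment i sent by air. Minimise total delay-days.
c = [volume[i] * (delay_air[i] - delay_sea[i]) for i in range(3)]
A_ub = [[extra_cost[i] * volume[i] for i in range(3)]]
b_ub = [budget]
bounds = [(0.8, 1.0), (0.0, 1.0), (0.0, 1.0)]  # critical segment flies >= 80%

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print({f"segment_{i}_air_share": round(x, 2) for i, x in enumerate(res.x)})
```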
For a detailed overview of predictive AI techniques, please refer to Chapter 4 of my book The Art of AI Product Management.
While predictive models trained from scratch can excel at very specific tasks, they are also rigid and will "refuse" to perform any other task. GenAI models are more open-minded and can be used for highly diverse requests. For example, an LLM-based conversational widget in an SCO system can allow users to interact with real-time insights using natural language. Instead of sifting through inflexible dashboards, users can ask, "Which suppliers are at risk of delays?" or "What alternative routes are available?" The AI pulls from historical data, live logistics feeds, and external risk factors to provide actionable answers, suggest mitigations, and even automate workflows like rerouting shipments.
But how can you ensure that a huge, out-of-the-box model like ChatGPT or Llama understands the nuances of your domain? Let’s walk through the LLM triad – a progression of techniques to incorporate domain knowledge into your LLM system.
As you progress from left to right, you can ingrain more domain knowledge into the LLM – however, each stage also adds new technical challenges (if you are interested in a systematic deep-dive into the LLM triad, please check out chapters 5–8 of my book The Art of AI Product Management). Here, let’s focus on how domain experts can jump in at each of the stages:
1. Prompting: Domain experts can help craft prompts and few-shot examples that encode domain terminology, constraints, and decision criteria, steering the model’s pre-trained knowledge toward business-relevant answers.
2. RAG (Retrieval-Augmented Generation): While prompting helps guide AI, it still relies on pre-trained knowledge, which may be outdated or incomplete. RAG allows AI to retrieve real-time, company-specific data, ensuring that its responses are grounded in current logistics reports, supplier performance records, and risk assessments. For example, instead of generating generic supplier risk analyses, a RAG-powered AI system would pull real-time shipment data, supplier credit ratings, and port congestion reports before making recommendations. Domain experts can help select and structure these data sources and are also needed when it comes to testing and evaluating RAG systems.
Implementation tip: Work with domain experts to curate and structure knowledge sources – ensuring AI retrieves and applies only the most relevant and high-quality business information.
3. Fine-tuning: While prompting and RAG inject domain knowledge on-the-fly, they do not inherently embed supply-chain-specific workflows, terminology, or decision logic into your LLM. Fine-tuning adapts the LLM to think like a logistics expert. Domain experts can guide this process by creating high-quality training data, ensuring AI learns from real supplier assessments, risk evaluations, and procurement decisions. They can refine industry terminology to prevent misinterpretations (e.g., AI distinguishing between "buffer stock" and "safety stock"). They also align AI’s reasoning with business logic, ensuring it considers cost, risk, and compliance – not just efficiency. Finally, they evaluate fine-tuned models, testing AI against real-world decisions to catch biases or blind spots.
Implementation tip: In LLM fine-tuning, data is the crucial success factor. Quality trumps quantity, and fine-tuning on a small, high-quality dataset can give you excellent results. Thus, give your experts enough time to figure out the right structure and content of the fine-tuning data, and plan for plenty of end-to-end iterations of your fine-tuning process.
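For illustration, expert-reviewed fine-tuning examples are often stored as chat-style records like the one below. The content is invented, and the exact field names depend on your fine-tuning framework or provider:

```python
import json

# One hypothetical, expert-written training example in a common chat format.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a supply chain risk analyst."},
        {"role": "user",
         "content": "Supplier Acme was late on 3 of 10 orders, all during the "
                    "monsoon season. Assess the supplier risk."},
        {"role": "assistant",
         "content": "Medium risk: the delays are seasonal rather than structural. "
                    "Recommend buffer stock for Q3 orders; no supplier change needed."},
    ]
}

with open("fine_tuning_data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```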
Every machine learning algorithm gets it wrong from time to time. To mitigate errors, it helps to set the "hard facts" of your domain in stone, making your AI system more reliable and controllable. This combination of machine learning and deterministic rules is called neuro-symbolic AI.
For example, an explicit knowledge graph can encode supplier relationships, regulatory constraints, transportation networks, and risk dependencies in a structured, interconnected format.
Instead of relying purely on statistical correlations, an AI system enriched with knowledge graphs can:
How can you decide which knowledge should be encoded with rules (symbolic AI), and which should be learned dynamically from the data (neural AI)? Domain experts can help you pick those bits of knowledge where hard-coding makes the most sense:
In most cases, this knowledge will be stored in separate components of your AI system, like decision trees, knowledge graphs, and ontologies. There are also some methods to integrate it directly into LLMs and other statistical models, such as Lamini’s memory fine-tuning.
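Here is a minimal sketch of this split: a statistical model (or LLM) proposes an action, and a small symbolic layer of expert-encoded hard facts vets it before anything is executed. The graph entities and the "sanctioned" attribute are made up, and networkx merely stands in for a proper knowledge-graph store:

```python
import networkx as nx

# Hard domain facts, encoded symbolically by experts (illustrative entries).
kg = nx.DiGraph()
kg.add_edge("Port_Shanghai", "Port_Hamburg", relation="route", sanctioned=False)
kg.add_edge("Port_Shanghai", "Port_X", relation="route", sanctioned=True)

def route_is_allowed(origin: str, destination: str) -> bool:
    """Deterministic rule: never accept a route that is unknown or sanctioned."""
    if not kg.has_edge(origin, destination):
        return False
    return not kg[origin][destination].get("sanctioned", False)

# A learned model proposes a reroute; the symbolic layer has the final word.
proposed_route = ("Port_Shanghai", "Port_X")
if route_is_allowed(*proposed_route):
    print("Accept model suggestion:", proposed_route)
else:
    print("Reject suggestion, escalate to a human planner:", proposed_route)
```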
Generating insights and turning them into actions is a multi-step process. Experts can help you model workflows and decision-making pipelines, ensuring that the process followed by your AI system aligns with their tasks. For example, the following pipeline shows how the AI components we considered so far can be combined into a modular workflow for the mitigation of shipment risks:
Experts are also needed to calibrate the "labor distribution" between humans and AI. For example, when modelling decision logic, they can set thresholds for automation, deciding when AI can trigger workflows versus when human approval is needed.
Implementation tip: Involve your domain experts in mapping your processes to AI models and assets, identifying gaps vs. steps that can already be automated.
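A schematic example of such a mapping, with the automation thresholds as placeholders to be calibrated together with the experts:

```python
# Hypothetical risk-mitigation pipeline with expert-set automation thresholds.
AUTO_APPROVE_THRESHOLD = 0.90   # agreed with domain experts
REVIEW_THRESHOLD = 0.60

def handle_shipment_risk(shipment_id: str, risk_score: float, confidence: float) -> str:
    """Decide whether the AI may act alone or must hand over to a human."""
    if risk_score < 0.5:
        return f"{shipment_id}: no action needed"
    if confidence >= AUTO_APPROVE_THRESHOLD:
        return f"{shipment_id}: auto-trigger rerouting workflow"
    if confidence >= REVIEW_THRESHOLD:
        return f"{shipment_id}: draft rerouting proposal, await planner approval"
    return f"{shipment_id}: escalate to logistics expert with full context"

for args in [("SHP-1", 0.8, 0.95), ("SHP-2", 0.7, 0.70), ("SHP-3", 0.9, 0.40)]:
    print(handle_shipment_risk(*args))
```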
Especially in B2B environments, where workers are deeply embedded in their daily workflows, the user experience must be seamlessly integrated with existing processes and task structures to ensure efficiency and adoption. For example, an AI-powered supply chain tool must align with how logistics professionals think, work, and make decisions. In the development phase, domain experts are the closest "peers" to your real users, and picking their brains is one of the fastest ways to bridge the gap between AI capabilities and real-world usability.
Implementation tip: Involve domain experts early in UX design to ensure AI interfaces are intuitive, relevant, and tailored to real decision-making workflows.
AI thinks differently from humans, which makes us humans skeptical. Often, that’s a good thing since it helps us stay alert to potential mistakes. But distrust is also one of the biggest barriers to AI adoption. When users don’t understand why a system makes a particular recommendation, they are less likely to work with it. Domain experts can define how AI should explain itself – ensuring users have visibility into confidence scores, decision logic, and key influencing factors.
For example, if an SCO system recommends rerouting a shipment, it would be irresponsible on the part of a logistics planner to just accept it. She needs to see the "why" behind the recommendation – is it due to supplier risk, port congestion, or fuel cost spikes? The UX should show a breakdown of the decision, backed by additional information like historical data, risk factors, and a cost-benefit analysis.
Mitigate overreliance on AI: Excessive user dependence on AI can introduce bias, errors, and unforeseen failures. Experts should find ways to balance AI-driven insights with human expertise, ethical oversight, and strategic safeguards to ensure resilience, adaptability, and trust in decision-making.
Implementation tip: Work with domain experts to define key explainability features – such as confidence scores, data sources, and impact summaries – so users can quickly assess AI-driven recommendations.
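In practice, this often comes down to agreeing on a structured explanation payload that the UX can render; the fields and values below are illustrative:

```python
# Illustrative structure of an explainable recommendation, agreed with experts.
recommendation = {
    "action": "reroute_shipment",
    "shipment_id": "SHP-2041",
    "confidence": 0.87,
    "top_factors": [
        {"factor": "port_congestion_rotterdam", "impact": 0.45},
        {"factor": "supplier_delay_history", "impact": 0.30},
        {"factor": "fuel_cost_spike", "impact": 0.12},
    ],
    "data_sources": ["live_port_feed", "shipment_history_2022_2024"],
    "estimated_cost_delta_eur": 1250,
}

def render_explanation(rec: dict) -> str:
    """Turn the payload into the 'why' shown next to the recommendation."""
    factors = ", ".join(f"{f['factor']} ({f['impact']:.0%})" for f in rec["top_factors"])
    return (f"Recommend {rec['action']} for {rec['shipment_id']} "
            f"(confidence {rec['confidence']:.0%}). Main drivers: {factors}.")

print(render_explanation(recommendation))
```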
AI tools should make complex decisions easier, not harder. If users need deep technical knowledge to extract insights from AI, the system has failed from a UX perspective. Domain experts can help strike a balance between simplicity and depth, ensuring the interface provides actionable, context-aware recommendations while allowing deeper analysis when needed.
For instance, instead of forcing users to manually sift through data tables, AI could provide pre-configured reports based on common logistics challenges. However, expert users should also have on-demand access to raw data and advanced settings when necessary. The key is to design AI interactions that are efficient for everyday use but flexible for deep analysis when required.
Implementation tip: Use domain expert feedback to define default views, priority alerts, and user-configurable settings, ensuring AI interfaces provide both efficiency for routine tasks and depth for deeper research and strategic decisions.
AI UX isn’t a one-and-done process – it needs to evolve with real-world user feedback. Domain experts play a key role in UX testing, refinement, and iteration, ensuring that AI-driven workflows stay aligned with business needs and user expectations.
For example, your initial interface may surface too many low-priority alerts, leading to alert fatigue where users start ignoring AI recommendations. Supply chain experts can identify which alerts are most valuable, allowing UX designers to prioritize high-impact insights while reducing noise.
Implementation tip: Conduct think-aloud sessions and have domain experts verbalize their thought process when interacting with your AI interface. This helps AI teams uncover hidden assumptions and refine AI based on how experts actually think and make decisions.
Vertical AI systems must integrate domain knowledge at every stage, and experts should become key stakeholders in your AI development:
An AI system that "gets" the domain of your users will not only be useful and adopted in the short and medium term, but also contribute to the competitive advantage of your business.
Now that you have learned a bunch of methods to incorporate domain-specific knowledge, you might be wondering how to approach this in your organizational context. Stay tuned for my next article, where we will consider the practical challenges and strategies for implementing an expertise-driven AI strategy!
Note: Unless noted otherwise, all images are the author’s.
Carving out your competitive advantage with AI

When I talk to corporate customers, there is often this idea that AI, while powerful, won’t give any company a lasting competitive edge. After all, over the past two years, large-scale LLMs have become a commodity for everyone. I’ve been thinking a lot about how companies can shape a competitive advantage using AI, and a recent article in the Harvard Business Review (AI Won’t Give You a New Sustainable Advantage) inspired me to organize my thoughts around the topic.
Indeed, maybe one day, when businesses and markets are driven by the invisible hand of AI, the equal-opportunity hypothesis might ring true. But until then, there are so many ways – big and small – for companies to differentiate themselves using AI. I like to think of it as a complex ingredient in your business recipe – the success of the final dish depends on the cook who is making it. The magic lies in how you combine AI craft with strategy, design, and execution.
In this article, I’ll focus on real-life business applications of AI and explore their key sources of competitive advantage. As we’ll see, successful AI integration goes far beyond technology, and certainly beyond having the trendiest LLM at work. It’s about finding AI’s unique sweet spot in your organization, making critical design decisions, and aligning a variety of stakeholders around the optimal design, deployment, and usage of your AI systems. In the following, I will illustrate this using the mental model we developed to structure our thinking about AI projects (cf. this article for an in-depth introduction).
Note: If you want to learn more about pragmatic AI applications in real-life business scenarios, subscribe to my newsletter AI for Business.
AI is often used to automate existing tasks, but the more space you allow for creativity and innovation when selecting your AI use cases, the more likely they will result in a competitive advantage. You should also prioritize the unique needs and strengths of your company when evaluating opportunities.
When we brainstorm AI use cases with customers, 90% of them typically fall into one of four buckets – productivity, improvement, personalization, and innovation. Let’s take the example of an airline business to illustrate some opportunities across these categories:
Of course, the first branch – productivity and automation – looks like the low-hanging fruit. It is the easiest one to implement, and automating boring routine tasks has an undeniable efficiency benefit. However, if you’re limiting your use of AI to basic automation, don’t be surprised when your competitors do the same. In our experience, strategic advantage is built up in the other branches. Companies that take the time to figure out how AI can help them offer something different, not just faster or cheaper, are the ones that see long-term results.
As an example, let’s look at a project we recently implemented with the Lufthansa Group. The company wanted to systematize and speed up its innovation processes. We developed an AI tool that acts as a giant sensor into the airline market, monitoring competitors, trends, and the overall market context. Based on this broad information, the tool now provides tailored innovation recommendations for Lufthansa. There are several aspects that cannot be easily imitated by potential competitors, and certainly not by just using a bigger AI model:
All of this is novel know-how that was developed in tight cooperation between industry experts, practitioners, and a specialized AI team, involving lots of discovery, design decisions, and stakeholder alignment. If you get all of these aspects right, I believe you are on a good path toward creating a sustainable and defensible advantage with AI.
Value creation with AI is a highly individual affair. I recently experienced this firsthand when I challenged myself to build and launch an end-to-end AI app on my own. I’m comfortable with Python and don’t massively benefit from AI help there, but other stuff like frontend? Not really my home turf. In this situation, AI-powered code generation worked like a charm. It felt like flowing through an effortless no-code tool, while having all the versatility of the underlying – and unfamiliar – programming languages under my fingertips. This was my very own, personal sweet spot – using AI where it unlocks value I wouldn’t otherwise tap into, and sparing a frontend developer on the way. Most other people would not get so much value out of this case:
While this is a personal example, the same principle applies at the corporate level. For good or for bad, most companies have some notion of strategy and core competence driving their business. The secret is about finding the right place for AI in that equation – a place where it will complement and amplify the existing skills.
Data is the fuel for any AI system. Here, success comes from curating high-quality, focused datasets and continuously adapting them to evolving needs. By blending AI with your unique expertise and treating data as a dynamic resource, you can transform information into long-term strategic value.
To illustrate the importance of proper knowledge management, let’s do a thought experiment and travel to the 16th century. Antonio and Bartolomeo are the best shoemakers in Florence (which means they’re probably the best in the world). Antonio’s family has meticulously recorded their craft for generations, with shelves of notes on leather treatments, perfect fits, and small adjustments learned from years of experience. On the other hand, Bartolomeo’s family has kept their secrets more closely guarded. They don’t write anything down; their shoemaking expertise has been passed down verbally, from father to son.
Now, a visionary named Leonardo comes along, offering both families a groundbreaking technology that can automate their whole shoemaking business – if it can learn from their data. Antonio comes with his wagon of detailed documentation, and the technology can directly learn from those centuries of know-how. Bartolomeo is in trouble – without written records, there’s nothing explicit for the AI to chew on. His family’s expertise is trapped in oral tradition, intuition, and muscle memory. Should he try to write all of it down now – is it even possible, given that most of his work is governed intuitively? Or should he just let it be and go on with his manual business-as-usual? Succumbing to inertia and uncertainty, he goes for the latter option, while Antonio’s business thrives and grows with the help of the new technology. Freed from daily routine tasks, he can get creative and invent new ways to make and improve shoes.
Beyond explicit documentation, valuable domain expertise is also hidden across other data assets such as transactional data, customer interactions, and market insights. AI thrives on this kind of information, extracting meaning and patterns that would otherwise go unnoticed by humans.
Data doesn’t need to be big – on the contrary, today, big often means noisy. What’s critical is the quality of the data you’re feeding into your AI system. As models become more sample-efficient – i.e., able to learn from smaller, more focused datasets – the kind of data you use is far more important than how much of it you have.
In my experience, the companies that succeed with AI treat their data – be it for training, fine-tuning, or evaluation – like a craft. They don’t just gather information passively; they curate and edit it, refining and selecting data that reflects a deep understanding of their specific industry. This careful approach gives their AI sharper insights and a more nuanced understanding than any competitor using a generic dataset. I’ve seen firsthand how even small improvements in data quality can lead to significant leaps in AI performance.
Data needs to evolve along with the real world. That’s where DataOps comes in, ensuring data is continuously adapted and doesn’t drift apart from reality. The most successful companies understand this and regularly update their datasets to reflect changing environments and market dynamics. A powerful mechanism to achieve this is the data flywheel. The more your AI generates insights, the better your data becomes, creating a self-reinforcing feedback loop because users will come back to your system more often. With every cycle, your data sharpens and your AI improves, building an advantage that competitors will struggle to match. To kick off the data flywheel, your system needs to demonstrate some initial value to start with – and then, you can bake in some additional incentives to nudge your users into using your system on a regular basis.
Now, let’s dive into the "intelligence" component. This component isn’t just about AI models in isolation – it’s about how you integrate them into larger intelligent systems. Big Tech is working hard to make us believe that AI success hinges on the use of massive LLMs such as the GPT models. Good for them – bad for those of us who want to use AI in real-life applications. Overrelying on these heavyweights can bloat your system and quickly become a costly liability, while smart system design and tailored models are important sources for differentiation and competitive advantage.
Mainstream LLMs are generalists. Like high-school graduates, they have a mediocre-to-decent performance across a wide range of tasks. However, in business, decent isn’t enough. You need to send your AI model to university so it can specialize, respond to your specific business needs, and excel in your domain. This is where fine-tuning comes into play. However, it’s important to recognize that mainstream LLMs, while powerful, can quickly become slow and expensive if not managed efficiently. As Big Tech boasts about larger model sizes and longer context windows – i.e., how much information you can feed into one prompt – smart tech is quietly moving towards efficiency. Techniques like prompt compression reduce prompt size, making interactions faster and more cost-effective. Small language models (SLMs) are another trend (Figure 4). With up to a couple of billions of parameters, they allow companies to safely deploy task- and domain-specific intelligence on their internal infrastructure (Anacode).
But before fine-tuning an LLM, ask yourself whether generative AI is even the right solution for your specific challenge. In many cases, predictive AI models – those that focus on forecasting outcomes rather than generating content – are more effective, cheaper, and easier to defend from a competitive standpoint. And while this might sound like old news, most of AI value creation in businesses actually happens with predictive AI.
AI models don’t operate in isolation. Just as the human brain consists of multiple regions, each responsible for specific capabilities like reasoning, vision, and language, a truly intelligent AI system often involves multiple components. This is also called a "compound AI system" (BAIR). Compound systems can accommodate different models, databases, and software tools and allow you to optimize for cost and transparency. They also enable faster iteration and extension – modular components are easier to test and rearrange than a huge monolithic LLM.
Take, for example, a customer service automation system for an SME. In its basic form – calling a commercial LLM – such a setup might cost you a significant amount – let’s say $21k/month for a "vanilla" system. This cost can easily scare away an SME, and they will not touch the opportunity at all. However, with careful engineering, optimization, and the integration of multiple models, the costs can be reduced by as much as 98% (FrugalGPT). Yes, you read it right, that’s 2% of the original cost – a staggering difference, putting a company with stronger AI and engineering skills at a clear advantage. At the moment, most businesses are not leveraging these advanced techniques, and we can only imagine how much there is yet to optimize in their AI usage.
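The core of such savings is often a simple cascade: answer with a small, cheap model when it is confident, and escalate only the hard cases to the expensive one. A schematic sketch, in which both model functions are placeholders rather than real API calls:

```python
# Sketch of a cost-aware model cascade (the FrugalGPT idea in miniature).
# `small_model` and `large_model` stand in for your own model calls.

def small_model(query: str) -> tuple[str, float]:
    """Cheap, fast model: returns (answer, confidence). Placeholder logic."""
    if "refund" in query.lower():
        return "Refunds are processed within 5 business days.", 0.93
    return "I am not sure.", 0.30

def large_model(query: str) -> str:
    """Expensive fallback model. Placeholder logic."""
    return f"[detailed answer from the large model for: {query}]"

def answer(query: str, confidence_threshold: float = 0.85) -> str:
    reply, confidence = small_model(query)
    if confidence >= confidence_threshold:
        return reply               # cheap path, ideally most of the traffic
    return large_model(query)      # escalate only when needed

print(answer("How long does a refund take?"))
print(answer("My parcel arrived damaged and the label was wrong."))
```

In production, the confidence signal would come from the small model's own calibration or a separate router, but the cost logic stays the same.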
While generative AI has captured everyone’s imagination with its ability to produce content, the real future of AI lies in reasoning and problem-solving. Unlike content generation, reasoning is nonlinear – it involves skills like abstraction and generalization which generative AI models aren’t trained for.
AI systems of the future will need to handle complex, multi-step activities that go far beyond what current generative models can do. We’re already seeing early demonstrations of AI’s reasoning capabilities, whether through language-based emulations or engineered add-ons. However, the limitations are apparent – past a certain threshold of complexity, these models start to hallucinate. Companies that invest in crafting AI systems designed to handle these complex, iterative processes will have a major head start. These companies will thrive as AI moves beyond its current generative phase and into a new era of smart, modular, and reasoning-driven systems.
User experience is the channel through which you can deliver the value of AI to users. It should smoothly transport the benefits users need to speed up and perfect their workflows, while inherent AI risks and issues such as erroneous outputs need to be filtered or mitigated.
In most real-world scenarios, AI alone can’t achieve full automation. For example, at my company Equintel, we use AI to assist in the ESG reporting process, which involves multiple layers of analysis and decision-making. While AI excels at large-scale data processing, there are many subtasks that demand human judgment, creativity, and expertise. An ergonomic system design reflects this labor distribution, relieving humans from tedious data routines and giving them the space to focus on their strengths.
This strength-based approach also alleviates common fears of job replacement. When employees are empowered to focus on tasks where their skills shine, they’re more likely to view AI as a supporting tool, not a competitor. This fosters a win-win situation where both humans and AI thrive by working together.
Every AI model has an inherent failure rate. Whether generative AI hallucinations or incorrect outputs from predictive models, mistakes happen and accumulate into the dreaded "last-mile problem." Even if your AI system performs well 90% of the time, a small error rate can quickly become a showstopper if users overtrust the system and don’t address its errors.
Consider a bank using AI for fraud detection. If the AI fails to flag a fraudulent transaction and the user doesn’t catch it, the resulting loss could be significant – let’s say $500,000 siphoned from a compromised account. Without proper trust calibration, users might lack the tools or alerts to question the AI’s decision, allowing fraud to go unnoticed.
Now, imagine another bank using the same system but with proper trust calibration in place. When the AI is uncertain about a transaction, it flags it for review, even if it doesn’t outright classify it as fraud. This additional layer of trust calibration encourages the user to investigate further, potentially catching fraud that would have slipped through. In this scenario, the bank could avoid the $500,000 loss. Multiply that across multiple transactions, and the savings – along with improved security and customer trust – are substantial.
Success with AI requires more than just adopting the latest technologies – it’s about identifying and nurturing the individual sweet spots where AI can drive the most value for your business. This involves:
Finally, I believe we are moving into a time when the notion of competitive advantage itself is shaken up. While in the past, competing was all about maximizing profitability, today, businesses are expected to balance financial gains with sustainability, which adds a new layer of complexity. AI has the potential to help companies not only optimize their operations but also move toward more sustainable practices. Imagine AI helping to reduce plastic waste, streamline shared economy models, or support other initiatives that make the world a better place. The real power of AI lies not just in efficiency but in the potential it offers us to reshape whole industries and drive both profit and positive social impact.
For deep-dives into many of the topics that were touched in this article, check out my upcoming book The Art of AI Product Development.
Note: Unless noted otherwise, all images are the author’s.
Designing the relationship between LLMs and user experience

A while ago, I wrote the article Choosing the right language model for your NLP use case on Medium. It focused on the nuts and bolts of LLMs – and while rather popular, by now, I realize it doesn’t actually say much about selecting LLMs. I wrote it at the beginning of my LLM journey and somehow figured that the technical details about LLMs – their inner workings and training history – would speak for themselves, allowing AI product builders to confidently select LLMs for specific scenarios.
Since then, I have integrated LLMs into multiple AI products. This allowed me to discover how exactly the technical makeup of an LLM determines the final experience of a product. It also strengthened the belief that product managers and designers need to have a solid understanding of how an LLM works "under the hood." LLM interfaces are different from traditional graphical interfaces. The latter provide users with a (hopefully clear) mental model by displaying the functionality of a product in a rather implicit way. On the other hand, LLM interfaces use free text as the main interaction format, offering much more flexibility. At the same time, they also "hide" the capabilities and the limitations of the underlying model, leaving it to the user to explore and discover them. Thus, a simple text field or chat window invites an infinite number of intents and inputs and can display as many different outputs.
The responsibility for the success of these interactions is not (only) on the engineering side – rather, a big part of it should be assumed by whoever manages and designs the product. In this article, we will flesh out the relationship between LLMs and User Experience, working with two universal ingredients that you can use to improve the experience of your product:
(Note: These two ingredients are part of any LLM application. Beyond these, most applications will also have a range of more individual criteria to be fulfilled, such as latency, privacy, and safety, which will not be addressed here.)
Thus, in Peter Drucker’s words, it’s about "doing the right things" (functionality) and "doing them right" (quality). Now, as we know, LLMs will never be 100% right. As a builder, you can approximate the ideal experience from two directions:
In this article, we will focus on the engineering part. The design of the ideal partnership with human users will be covered in a future article. First, I will briefly introduce the steps in the engineering process – LLM selection, adaptation, and evaluation – which directly determine the final experience. Then, we will look at the two ingredients – functionality and quality – and provide some guidelines to steer your work with LLMs to optimize the product’s performance along these dimensions.
A note on scope: In this article, we will consider the use of stand-alone LLMs. Many of the principles and guidelines also apply to LLMs used in RAG (Retrieval-Augmented Generation) and agent systems. For a more detailed consideration of the user experience in these extended LLM scenarios, please refer to my book The Art of AI Product Development.
In the following, we will focus on the three steps of LLM selection, adaptation, and evaluation. Let’s consider each of these steps:
The following figure summarizes the process:
In real life, the three stages will overlap, and there can be back-and-forth between the stages. In general, model selection is more the "one big decision." Of course, you can shift from one model to another further down the road and even should do this when new, more suitable models appear on the market. However, these changes are expensive since they affect everything downstream. Past the discovery phase, you will not want to make them on a regular basis. On the other hand, LLM adaptation and evaluation are highly iterative. They should be accompanied by continuous discovery activities where you learn more about the behavior of your model and your users. Finally, all three activities should be embedded into a solid LLMOps pipeline, which will allow you to integrate new insights and data with minimal engineering friction.
Now, let’s move to the second column of the chart, scoping the functionality of an LLM and learning how it can be shaped during the three stages of this process.
You might be wondering why we talk about the "functionality" of LLMs. After all, aren’t LLMs those versatile all-rounders that can magically perform any linguistic task we can think of? In fact, they are, as famously described in the paper Language Models Are Few-Shot Learners. LLMs can learn new capabilities from just a couple of examples. Sometimes, their capabilities will even "emerge" out of the blue during normal training and – hopefully – be discovered by chance. This is because the task of language modeling is just as versatile as it is challenging – as a side effect, it equips an LLM with the ability to perform many other related tasks.
Still, the pre-training objective of LLMs is to generate the next word given the context of past words (OK, that’s a simplification – in auto-encoding, the LLM can work in both directions [3]). This is what a pre-trained LLM, motivated by an imaginary "reward," will insist on doing once it is prompted. In most cases, there is quite a gap between this objective and a user who comes to your product to chat, get answers to questions, or translate a text from German to Italian. The landmark paper Climbing Towards NLU: On Meaning, Form, and Understanding in the Age of Data by Emily Bender and Alexander Koller even argues that language models are generally unable to recover communicative intents and thus are doomed to work with incomplete meaning representations.
Thus, it is one thing to brag about amazing LLM capabilities in scientific research and demonstrate them on highly controlled benchmarks and test scenarios. Rolling out an LLM to an anonymous crowd of users with different AI skills and intents—some harmful—is a different kind of game. This is especially true once you understand that your product inherits not only the capabilities of the LLM but also its weaknesses and risks, and you (not a third-party provider) hold the responsibility for its behavior.
In practice, we have learned that it is best to identify and isolate discrete islands of functionality when integrating LLMs into a product. These functions can largely correspond to the different intents with which your users come to your product. For example, it could be:
Oftentimes, these can be further decomposed into more granular, potentially even reusable, capabilities. "Engaging in conversation" could be decomposed into:
Taking this more discrete approach to LLM capabilities provides you with the following advantages:
Let’s summarize some practical guidelines to make sure that the LLM does the right thing in your product:
These are just short snapshots of the lessons we learned when integrating LLMs. My upcoming book The Art of AI Product Development contains deep dives into each of the guidelines along with numerous examples. For the technical details behind pre-training objectives and procedures, you can refer to this article.
Ok, so you have gained an understanding of the intents with which your users come to your product and "motivated" your model to respond to these intents. You might even have put out the LLM into the world in the hope that it will kick off the data flywheel. Now, if you want to keep your good-willed users and acquire new users, you need to quickly ramp up on our second ingredient, namely quality.
In the context of LLMs, quality can be decomposed into an objective and a subjective component. The objective component tells you when and why things go wrong (i.e., the LLM makes explicit mistakes). The subjective component is more subtle and emotional, reflecting the alignment with your specific user crowd.
Using language to communicate comes naturally to humans. Language is ingrained in our minds from the beginning of our lives, and we have a hard time imagining how much effort it takes to learn it from scratch. Even the challenges we experience when learning a foreign language can’t compare to the training of an LLM. The LLM starts from a blank slate, while our learning process builds on an incredibly rich basis of existing knowledge about the world and about how language works in general.
When working with an LLM, we should constantly remain aware of the many ways in which things can go wrong:
These shortcomings can quickly turn into showstoppers for your product – output quality is a central determinant of the user experience of an LLM product. For example, one of the major determinants of the "public" success of ChatGPT was that it was indeed able to generate correct, fluent, and relatively coherent text across a large variety of domains. Earlier generations of LLMs were not able to achieve this objective quality. Most pre-trained LLMs that are used in production today do have the capability to generate language. However, their performance on criteria like coherence, consistency, and world knowledge can be very variable and inconsistent. To achieve the experience you are aiming for, it is important to have these requirements clearly prioritized and select and adapt LLMs accordingly.
Venturing into the more nuanced subjective domain, you want to understand and monitor how users feel around your product. Do they feel good and trustful and get into a state of flow when they use it? Or do they go away with feelings of frustration, inefficiency, and misalignment? A lot of this hinges on individual nuances of culture, values, and style. If you are building a copilot for junior developers, you hardly want it to speak the language of senior executives and vice versa.
For the sake of example, imagine you are a product marketer. You have spent a lot of your time with a fellow engineer to iterate on an LLM that helps you with content generation. At some point, you find yourself chatting with the UX designer on your team and bragging about your new AI assistant. Your colleague doesn’t get the need for so much effort. He is regularly using ChatGPT to assist with the creation and evaluation of UX surveys and is very satisfied with the results. You counter – ChatGPT’s outputs are too generic and monotonous for your storytelling and writing tasks. In fact, you have been using it at the beginning and got quite embarrassed because, at some point, your readers started to recognize the characteristic ChatGPT flavor. That was a slippery episode in your career, after which you decided you needed something more sophisticated.
There is no right or wrong in this discussion. ChatGPT is good for straightforward factual tasks where style doesn’t matter that much. By contrast, you as a marketer need an assistant that can assist in crafting high-quality, persuasive communications that speak the language of your customers and reflect the unique DNA of your company.
These subjective nuances can ultimately define the difference between an LLM that is useless because its outputs need to be rewritten anyway and one that is "good enough" so users start using it and feed it with suitable fine-tuning data. The holy grail of LLM mastery is personalization – i.e., using efficient fine-tuning or prompt tuning to adapt the LLM to the individual preferences of any user who has spent a certain amount of time with the model. If you are just starting out on your LLM journey, these details might seem far off – but in the end, they can help you reach a level where your LLM delights users by responding in the exact manner and style that is desired, spurring user satisfaction and large-scale adoption and leaving your competition behind.
Here are our tips for managing the quality of your LLM:
Pre-trained LLMs are highly convenient – they make AI accessible to everyone, offloading the huge engineering, computation, and infrastructure spending needed to train a huge initial model. Once published, they are ready to use, and we can plug their amazing capabilities into our product. However, when using a third-party model in your product, you inherit not only its power but also the many ways in which it can and will fail. When things go wrong, the last thing you want to do to maintain integrity is to blame an external model provider, your engineers, or – worse – your users.
Thus, when building with LLMs, you should not only look for transparency into the model’s origins (training data and process) but also build a causal understanding of how its technical makeup shapes the experience offered by your product. This will allow you to find the sensitive balance between kicking off a robust data flywheel at the beginning of your journey and continuously optimizing and differentiating the LLM as your product matures toward excellence.
[1] Janna Lipenkova (2022). Choosing the right language model for your NLP use case, Medium.
[2] Tom B. Brown et al. (2020). Language Models are Few-Shot Learners.
[3] Jacob Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[4] Emily M. Bender and Alexander Koller (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.
[5] Janna Lipenkova (upcoming). The Art of AI Product Development, Manning Publications.
Note: All images are by the author, except when noted otherwise.
Redefining Conversational AI with Large Language Models

Conversational AI is an application of LLMs that has triggered a lot of buzz and attention due to its scalability across many industries and use cases. While conversational systems have existed for decades, LLMs have brought the quality push that was needed for their large-scale adoption. In this article, we will use the mental model shown in Figure 1 to dissect conversational AI applications (cf. Building AI products with a holistic mental model for an introduction to the mental model). After considering the market opportunities and the business value of conversational AI systems, we will explain the additional "machinery" in terms of data, LLM fine-tuning, and conversational design that needs to be set up to make conversations not only possible but also useful and enjoyable.
Traditional UX design is built around a multitude of artificial UX elements, swipes, taps, and clicks, requiring a learning curve for each new app. Using conversational AI, we can do away with this busyness, substituting it with the elegant experience of a naturally flowing conversation in which we can forget about the transitions between different apps, windows, and devices. We use language, our universal and familiar protocol for communication, to interact with different virtual assistants (VAs) and accomplish our tasks.
Conversational UIs are not exactly the new hot stuff. Interactive voice response systems (IVRs) and chatbots have been around since the 1990s, and major advances in NLP have been closely followed by waves of hope and development for voice and chat interfaces. However, before the time of LLMs, most of the systems were implemented in the symbolic paradigm, relying on rules, keywords, and conversational patterns. They were also limited to a specific, pre-defined domain of "competence", and users venturing outside of these would soon hit a dead end. All in all, these systems were riddled with potential points of failure, and after a couple of frustrating attempts, many users never came back to them. The following figure illustrates an example dialogue. A user who wants to order tickets for a specific concert patiently goes through a detailed interrogation flow, only to find out at the end that the concert is sold out.
As an enabling technology, LLMs can take conversational interfaces to new levels of quality and user satisfaction. Conversational systems can now display much broader world knowledge, linguistic competence, and conversational ability. Leveraging pre-trained models, they can also be developed in much shorter timespans since the tedious work of compiling rules, keywords, and dialogue flows is now replaced by the statistical knowledge of the LLM. Let’s look at two prominent applications where conversational AI can provide value at scale:
Beyond these major application areas, there are numerous other applications, such as telehealth, mental health assistants, and educational chatbots, that can streamline UX and bring value to their users in a faster and more efficient way.
LLMs are originally not trained to engage in fluent small talk or more substantial conversations. Rather, they learn to generate the next token at each inference step, eventually resulting in a coherent text. This low-level objective is different from the challenge of human conversation. Conversation is incredibly intuitive for humans, but it gets incredibly complex and nuanced when you want to teach a machine to do it. For example, let’s look at the fundamental notion of intents. When we use language, we do so for a specific purpose, which is our communicative intent – it could be to convey information, socialize, or ask someone to do something. While the first two are rather straightforward for an LLM (as long as it has seen the required information in the data), the latter is already more challenging. Not only does the LLM need to combine and structure the related information in a coherent way, but it also needs to set the right emotional tone in terms of soft criteria such as formality, creativity, humor, etc. This is a challenge for conversational design (cf. section 5), which is closely intertwined with the task of creating fine-tuning data.
Making the transition from classical language generation to recognizing and responding to specific communicative intents is an important step toward better usability and acceptance of conversational systems. As for all fine-tuning endeavors, this starts with the compilation of an appropriate dataset.
The fine-tuning data should come as close as possible to the (future) real-world data distribution. First, it should be conversational (dialogue) data. Second, if your virtual assistant will be specialized in a specific domain, you should try to assemble fine-tuning data that reflects the necessary domain knowledge. Third, if there are typical flows and requests that will be recurring frequently in your application, as in the case of customer support, try to incorporate varied examples of these in your training data. The following table shows a sample of conversational fine-tuning data from the 3K Conversations Dataset for ChatBot, which is freely available on Kaggle:
Manually creating conversational data can become an expensive undertaking – crowdsourcing and using LLMs to help you generate data are two ways to scale up. Once the dialogue data is collected, the conversations need to be assessed and annotated. This allows you to show both positive and negative examples to your model and nudge it towards picking up the characteristics of the "right" conversations. The assessment can happen either with absolute scores or a ranking of different options between each other. The latter approach leads to more accurate fine-tuning data because humans are normally better at ranking multiple options than evaluating them in isolation.
With your data in place, you are ready to fine-tune your model and enrich it with additional capabilities. In the next section, we will look at fine-tuning, integrating additional information from memory and semantic search, and connecting agents to your conversational system to empower it to execute specific tasks.
A typical conversational system is built with a conversational agent that orchestrates and coordinates the components and capabilities of the system, such as the LLM, the memory, and external data sources. The development of conversational AI systems is a highly experimental and empirical task, and your developers will be in a constant back-and-forth between optimizing your data, improving the fine-tuning strategy, playing with additional components and enhancements, and testing the results. Non-technical team members, including product managers and UX designers, will also be continuously testing the product. Based on their customer discovery activities, they are in a great position to anticipate future users’ conversation style and content and should be actively contributing this knowledge.
For fine-tuning, you need your fine-tuning data (cf. section 2) and a pre-trained LLM. LLMs already know a lot about language and the world, and our challenge is to teach them the principles of conversation. In fine-tuning, the target outputs are texts, and the model will be optimized to generate texts that are as similar as possible to the targets. For supervised fine-tuning, you first need to clearly define the conversational AI task you want the model to perform, gather the data, and run and iterate over the fine-tuning process.
With the hype around LLMs, a variety of fine-tuning methods have emerged. For a rather traditional example of fine-tuning for conversation, you can refer to the description of the LaMDA model.[1] LaMDA was fine-tuned in two steps. First, dialogue data is used to teach the model conversational skills ("generative" fine-tuning). Then, the labels produced by annotators during the assessment of the data are used to train classifiers that can assess the model’s outputs along desired attributes, which include sensibleness, specificity, interestingness, and safety ("discriminative" fine-tuning). These classifiers are then used to steer the behavior of the model towards these attributes.
Additionally, factual groundedness – the ability to ground their outputs in credible external information – is an important attribute of LLMs. To ensure factual groundedness and minimize hallucination, LaMDA was fine-tuned with a dataset that involves calls to an external information retrieval system whenever external knowledge is required. Thus, the model learned to first retrieve factual information whenever the user made a query that required new knowledge.
Another popular fine-tuning technique is Reinforcement Learning from Human Feedback (RLHF)[2]. RLHF "redirects" the learning process of the LLM from the straightforward but artificial next-token prediction task towards learning human preferences in a given communicative situation. These human preferences are directly encoded in the training data. During the annotation process, humans are presented with prompts and either write the desired response or rank a series of existing responses. The behavior of the LLM is then optimized to reflect the human preference.
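To illustrate what such preference data might look like, here is a small sketch; the field names and the conversion into pairwise examples are illustrative assumptions, not a prescribed format:

```python
# Illustrative structure for human preference data (field names are assumptions).
# A ranking of candidate responses is converted into pairwise (chosen, rejected)
# examples, the format typically used to train a reward model for RLHF.
annotation = {
    "prompt": "My order hasn't arrived yet. What can I do?",
    "responses": [
        "Please share your order number and I'll check the shipping status for you.",
        "Orders can be late sometimes.",
        "That's not my problem.",
    ],
    "ranking": [1, 2, 3],  # rank positions assigned by the annotator: lower = preferred
}

def ranking_to_pairs(example):
    """Turn one ranked annotation into pairwise preference examples."""
    ranked = sorted(zip(example["ranking"], example["responses"]))
    pairs = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            pairs.append({
                "prompt": example["prompt"],
                "chosen": ranked[i][1],    # higher-ranked response
                "rejected": ranked[j][1],  # lower-ranked response
            })
    return pairs

for pair in ranking_to_pairs(annotation):
    print(pair["chosen"], "is preferred over", pair["rejected"])
```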
Beyond compiling conversations for fine-tuning the model, you might want to enhance your system with specialized data that can be leveraged during the conversation. For example, your system might need access to external data, such as patents or scientific papers, or internal data, such as customer profiles or your technical documentation. This is normally done via semantic search (also known as retrieval-augmented generation, or RAG)[3]. The additional data is saved in a database in the form of semantic embeddings (cf. this article for an explanation of embeddings and further references). When the user request comes in, it is preprocessed and transformed into a semantic embedding. The semantic search then identifies the documents that are most relevant to the request and uses them as context for the prompt. By integrating additional data with semantic search, you can reduce hallucination and provide more useful, factually grounded responses. By continuously updating the embedding database, you can also keep the knowledge and responses of your system up-to-date without constantly rerunning your fine-tuning process.
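As a minimal illustration of this pattern, the following sketch embeds a handful of documents and assembles a context-augmented prompt; the embedding model and documents are placeholders, and a production system would use a proper vector database instead of an in-memory matrix:

```python
# Minimal retrieval-augmented prompting sketch (model name and documents are placeholders).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Shipping to the EU usually takes 3-5 business days.",
    "Returns are accepted within 30 days of delivery.",
    "Premium members get free express shipping.",
]
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

def build_prompt(user_query: str, top_k: int = 2) -> str:
    query_embedding = embedder.encode([user_query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ query_embedding  # cosine similarity (vectors are normalized)
    top_docs = [documents[i] for i in np.argsort(scores)[::-1][:top_k]]
    context = "\n".join(top_docs)
    return (f"Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:")

print(build_prompt("How long do I have to return an item?"))
```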
Imagine going to a party and meeting Peter, a lawyer. You get excited and start pitching the legal chatbot you are currently planning to build. Peter looks interested, leans towards you, uhms and nods. At some point, you want his opinion on whether he would like to use your app. Instead of an informative statement that would compensate for your eloquence, you hear: "Uhm… what was this app doing again?"
The unwritten contract of communication among humans presupposes that we are listening to our conversation partners and building our own speech acts on the context we are co-creating during the interaction. In social settings, the emergence of this joint understanding characterizes a fruitful, enriching conversation. In more mundane settings like reserving a restaurant table or buying a train ticket, it is an absolute necessity in order to accomplish the task and provide the expected value to the user. This requires your assistant to know the history of the current conversation, but also of past conversations – for example, it should not be asking for the name and other personal details of a user over and over whenever they initiate a conversation.
One of the challenges of maintaining context awareness is coreference resolution, i.e. understanding which objects are referred to by pronouns. Humans intuitively use a lot of contextual cues when they interpret language – for example, you can ask a young child, "Please get the green ball out of the red box and bring it to me," and the child will know you mean the ball, not the box. For virtual assistants, this task can be rather challenging, as illustrated by the following dialogue:
Assistant: Thanks, I will now book your flight. Would you also like to order a meal for your flight?
User: Uhm… can I decide later whether I want it?
Assistant: Sorry, this flight cannot be changed or canceled later.
Here, the assistant fails to recognize that the pronoun it from the user refers not to the flight, but to the meal, thus requiring another iteration to fix this misunderstanding.
Every now and then, even the best LLM will misbehave and hallucinate. In many cases, hallucinations are plain accuracy issues – and, well, you need to accept that no AI is 100% accurate. Compared to other AI systems, the "distance" between the user and the AI is rather small. A plain accuracy issue can quickly turn into something that is perceived as toxic, discriminatory, or generally harmful. Additionally, since LLMs don’t have an inherent understanding of privacy, they can also reveal sensitive data such as personally identifiable information (PII). You can work against these behaviors by using additional guardrails. Tools such as Guardrails AI, Rebuff, NeMo Guardrails, and Microsoft Guidance allow you to de-risk your system by formulating additional requirements on LLM outputs and blocking undesired outputs.
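The following sketch shows the underlying pattern with a hand-rolled guardrail rather than the API of any of the libraries mentioned above: validate the output, then pass, fix, or block it (the patterns and blocklist are purely illustrative):

```python
# Simplified, hand-rolled output guardrail: checks an LLM response for PII patterns
# and blocked terms before it reaches the user. Dedicated libraries offer far more,
# but the basic pattern is the same.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
BLOCKED_TERMS = {"confidential", "internal only"}  # illustrative blocklist

def apply_guardrails(llm_output: str) -> str:
    lowered = llm_output.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "I'm sorry, I can't share that information."
    for label, pattern in PII_PATTERNS.items():
        llm_output = pattern.sub(f"[REDACTED {label.upper()}]", llm_output)
    return llm_output

print(apply_guardrails("Sure, you can reach the customer at jane.doe@example.com."))
```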
Multiple architectures are possible in conversational AI. The following schema shows a simple example of how the fine-tuned LLM, external data, and memory can be integrated by a conversational agent, which is also responsible for the prompt construction and the guardrails.
The charm of conversational interfaces lies in their simplicity and uniformity across different applications. If the future of user interfaces is that all apps look more or less the same, is the job of the UX designer doomed? Definitely not – conversation is an art to be taught to your LLM so it can conduct conversations that are helpful, natural, and comfortable for your users. Good conversational design emerges when we combine our knowledge of human psychology, linguistics, and UX design. In the following, we will first consider two basic choices when building a conversational system, namely whether you will use voice and/or chat, as well as the larger context of your system. Then, we will look at the conversations themselves, and see how you can design the personality of your assistant while teaching it to engage in helpful and cooperative conversations.
Conversational interfaces can be implemented using chat or voice. In a nutshell, voice is faster while chat allows users to stay private and to benefit from enriched UI functionality. Let’s dive a bit deeper into the two options since this is one of the first and most important decisions you will face when building a conversational app.
To pick between the two alternatives, start by considering the physical setting in which your app will be used. For example, why are almost all conversational systems in cars, such as those offered by Nuance Communications, based on voice? Because the hands of the driver are already busy and they cannot constantly switch between the steering wheel and a keyboard. This also applies to other activities like cooking, where users want to stay in the flow of their activity while using your app. Cars and kitchens are mostly private settings, so users can experience the joy of voice interaction without worrying about privacy or about bothering others. By contrast, if your app is to be used in a public setting like the office, a library, or a train station, voice might not be your first choice.
After understanding the physical setting, consider the emotional side. Voice can be used intentionally to transmit tone, mood, and personality – does this add value in your context? If you are building your app for leisure, voice might increase the fun factor, while an assistant for mental health could accommodate more empathy and allow a potentially troubled user a wider range of expression. By contrast, if your app will assist users in a professional setting like trading or customer service, a more anonymous, text-based interaction might contribute to more objective decisions and spare you the hassle of designing an overly emotional experience.
As a next step, think about the functionality. The text-based interface allows you to enrich the conversations with other media like images and graphical UI elements such as buttons. For example, in an e-commerce assistant, an app that suggests products by posting their pictures and structured descriptions will be way more user-friendly than one that describes products via voice and potentially provides their identifiers.
Finally, let’s talk about the additional design and development challenges of building a voice UI:
If you go for the voice solution, make sure that you not only clearly understand the advantages as compared to chat, but also have the skills and resources to address these additional challenges.
Now, let’s consider the larger context in which you can integrate conversational AI. All of us are familiar with chatbots on company websites – those widgets on the right of your screen that pop up when we open the website of a business. Personally, more often than not, my intuitive reaction is to look for the Close button. Why is that? Through initial attempts to "converse" with these bots, I have learned that they cannot satisfy more specific information requirements, and in the end, I still need to comb through the website. The moral of the story? Don’t build a chatbot because it’s cool and trendy – rather, build it because you are sure it can create additional value for your users.
Beyond the controversial widget on a company website, there are several exciting contexts to integrate those more general chatbots that have become possible with LLMs:
As humans, we are wired to anthropomorphize, i.e. to attribute additional human traits to anything that vaguely resembles a human. Language is one of humankind’s most unique and fascinating abilities, and conversational products will automatically be associated with humans. People will imagine a person behind their screen or device – and it is good practice not to leave this imagined person to the chance of your users’ imaginations, but rather to lend it a consistent personality that is aligned with your product and brand. This process is called "persona design".
The first step of persona design is understanding the character traits you would like your persona to display. Ideally, this is already done at the level of the training data – for example, when using RLHF, you can ask your annotators to rank the data according to traits like helpfulness, politeness, fun, etc., in order to bias the model towards the desired characteristics. These characteristics can be matched with your brand attributes to create a consistent image that continuously promotes your branding via the product experience.
Beyond general characteristics, you should also think about how your virtual assistant will deal with specific situations beyond the "happy path". For example, how will it respond to user requests that are beyond its scope, reply to questions about itself, and deal with abusive or vulgar language?
It is important to develop explicit internal guidelines on your persona that can be used by data annotators and conversation designers. This will allow you to design your persona in a purposeful way and keep it consistent across your team and over time, as your application undergoes multiple iterations and refinements.
Have you ever had the impression of talking to a brick wall when you were actually speaking with a human? Sometimes, we find our conversation partners are just not interested in leading the conversation to success. Fortunately, in most cases, things are smoother, and humans will intuitively follow the "cooperative principle" introduced by the language philosopher Paul Grice.[4] According to this principle, humans who successfully communicate with each other follow four maxims, namely quantity, quality, relevance, and manner.
Maxim of quantity
The maxim of quantity asks the speaker to be informative and make their contribution as informative as required. On the side of the virtual assistant, this also means actively moving the conversation forward. For example, consider this snippet from an e-commerce fashion app:
Assistant: What kind of clothing items are you looking for?
User: I am looking for a dress in orange.
Assistant: Don’t: Sorry, we don’t have orange dresses at the moment.
Do: Sorry, we don’t have dresses in orange, but we have this great and very comfortable dress in yellow: …
The user hopes to leave your app with a suitable item. Stopping the conversation because you don’t have items that would fit the exact description kills off the possibility of success. However, if your app makes suggestions about alternative items, it will appear more helpful and leave the option of a successful interaction open.
Especially in voice interactions, it is important to find the right balance between providing all the information the user might need for success while not overwhelming them with unnecessary information which might cloud the interaction.
Maxim of quality
The maxim of quality asks speakers to be truthful and not say things they believe are false or for which they lack adequate evidence. There is a lot of subjectivity and personal belief involved here – thus, this maxim cannot be directly projected onto LLMs. As we have seen, LLMs tend to hallucinate, and this risk is especially high when the conversation drifts off into less familiar territories. Additional guardrails can be used in order to avoid misleading statements and bring the conversation back on track.
Maxim of relevance
The maxim of relevance asks us to be – well, relevant. As a precondition, the virtual assistant has to understand the true intent of the user. Here is an example of an airline assistant that violates this rule:
User: I need a flight from Berlin to Zurich tomorrow morning.
Assistant: OK, I found the following flights from Berlin to Zurich for tomorrow morning. Flight BZ36294 leaves Berlin at 6.30am and arrives in Zurich at 8am. Flight BZ96294 leaves Berlin at 8am and arrives in Zurich at 9.30am. Flight BZ95614 leaves Berlin at 11am and arrives in Zurich at 12.30pm. There is also the possibility of a connecting flight, in which case you would leave Berlin at 11.45am and arrive in Zurich at 3pm.
User: Sorry… could you repeat this again?
The assistant is overwhelming the user with detailed information that might not be relevant. By breaking this up and guiding the user to limit their search space, the dialogue becomes much easier to digest:
User: I need a flight from Berlin to Zurich tomorrow morning.
Assistant: Ok, I found flights leaving at 6.30, 8, and 11am. Which time would you like to leave?
User: I need to be in Zurich before 9am.
Assistant: OK, so you can take the flight BZ36294. It leaves at 6.30 and arrives at 8am. Should I buy the ticket for you?
User: Yes, thanks.
Maxim of manner
Finally, the maxim of manner states that our speech acts should be clear, concise and orderly, avoiding ambiguity and obscurity of expression. Your virtual assistant should avoid technical or internal jargon, and favour simple, universally understandable formulations.
While Grice’s principles are valid for all conversations independently of a specific domain, LLMs not trained specifically for conversation often fail to fulfill them. Thus, when compiling your training data, it is important to have enough dialogue samples that allow your model to learn these principles.
The domain of conversational design is developing rather quickly. Whether you are already building AI products or thinking about your career path in AI, I encourage you to dig deeper into this topic (cf. the excellent introductions in [5] and [6]). As AI is turning into a commodity, good design together with a defensible data strategy will become two important differentiators for AI products.
Let’s summarize the key takeaways from the article. Additionally, figure 5 offers a "cheat sheet" with the main points that you can download as a reference.
[1] Heng-Tze Cheng et al. 2022. LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything.
[2] OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. Retrieved on January 13, 2023.
[3] Patrick Lewis et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
[4] Paul Grice. 1989. Studies in the Way of Words.
[5] Cathy Pearl. 2016. Designing Voice User Interfaces.
[6] Michael Cohen et al. 2004. Voice User Interface Design.
Note: All images are by the author, unless noted otherwise.
The post Redefining Conversational AI with Large Language Models appeared first on Towards Data Science.
The post Building AI products with a holistic mental model appeared first on Towards Data Science.
Last update: October 19, 2024
Recently, I coached a client – an SME in the fintech sector – who had hit a dead end with their AI effort for marketing content generation. With their MS Copilot and access to the latest OpenAI models in place, they were all set for their AI adventure. They had even hired a prompt engineer to help them create prompts they could regularly use for new marketing content. The whole project was fun and engaging, but things didn’t make it into production. Dissatisfied with the many issues in the AI outputs – hallucination, the failure to integrate relevant sources, and a heavy AI flavour in the writing style – marketeers would switch back to "manual mode" as soon as things got serious. They simply couldn’t rely on the AI to produce content that would publicly represent the company.
Misunderstanding AI as prompting the latest GenAI models, this company failed to embrace all the elements of a successful AI initiative, so we had to put them back in place. In this article, I will introduce a mental model for AI systems that we often use to help customers build a holistic understanding of their target application. This model can be used as a tool to ease collaboration, align the different perspectives inside and outside the AI team, and create successful applications based on a shared vision.
Inspired by product management, this model has two big sections – an "opportunity space" where you can define your use cases and value creation potentials, and a "solution space" where all the hard work of implementing, fine-tuning, and promoting your AI happens. In the following, I will explain the components that you need to define in each of these spaces.
Note: If you want to learn more about pragmatic AI applications in real-life business scenarios, subscribe to my newsletter AI for Business.
With all the cool stuff you can now do with AI, you might be impatient to get your hands dirty and start building. However, to build something your users need and love, you should back your development with a use case that is in demand by your users. In the ideal world, users tell us what they need or want. For example, in the case of my client, the request for automating content generation came from marketeers who were overwhelmed in their jobs, but also saw that the company needs to produce more content to stay visible and relevant. If you are building for external users, you can look for hints about their needs and painpoints in existing customer feedback, such as in product reviews and notes from your sales and success teams.
However, since AI is an emerging technology, you shouldn’t overrely on what your users ask for – chances are, they simply don’t know what is possible yet. True innovators embrace and hone their information advantage over customers and users – for example, Henry Ford famously said: "If I had asked people what they wanted, they would have said faster horses." Luckily for us, he was proactive and didn’t wait for people to articulate their need for cars. If you stretch out your antennae, AI opportunities will come to you from many directions, such as:
I also advise to proactively brainstorm opportunities and ideas to make your business more efficient, improve, and innovate. You can use the following four "buckets" of AI benefits to guide your brainstorming:
Some of these benefits – for example productivity – can be directly quantified for ROI. For less tangible gains like personalization, you will need to think of proxy metrics like user satisfaction. As you think about your AI strategy, you might want to start with productivity and automation, which are the low-hanging fruits, and move on to the more challenging and transformative buckets later on.
In our content generation project, the company jumped right into using LLMs without having a high-quality, task-specific dataset at hand. This was one of the reasons for its failure – your AI is only as good as the data you feed it. For any kind of serious AI and machine learning, you need to collect and prepare your data so it reflects the real-life inputs and provides sufficient learning signals for your AI models. When you start out, there are different ways to get your hands on a decent dataset:
When creating your data, you face a trade-off between quality and quantity. You can manually annotate less data with a high quality, or spend your budget on developing hacks and tricks for automated data augmentation that will introduce additional noise. A rough rule of thumb is as follows:
Ultimately, you will find your ideal data composition through a constant back-and-forth between training, evaluation, and enhancing your data. Thus, in the content generation project, the team is now continuously collecting and curating data based on new published content and updating the datasets used for LLM fine-tuning. They are able to quickly optimize and sharpen the system because the most valuable data comes directly from production. When your application goes live, you should have data collection mechanisms in place to collect user inputs, AI outputs, and, if possible, additional learning signals such as user evaluations. Using this data for fine-tuning will make your model come as close as possible to the "ground truth" of user expectations. This results in higher user satisfaction, more usage and engagement, and, in turn, more high-quality data – a virtuous cycle that is also called the data flywheel.
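A minimal sketch of such a data collection mechanism might look as follows; the field names and the JSONL sink are assumptions, and most teams would log into a database or analytics pipeline instead:

```python
# Minimal sketch of production logging to feed the data flywheel.
import json
import time
import uuid
from typing import Optional

def log_interaction(user_input: str, ai_output: str, feedback: Optional[str] = None,
                    path: str = "interactions.jsonl") -> None:
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_input": user_input,
        "ai_output": ai_output,
        "feedback": feedback,  # e.g. thumbs up/down, or whether the text was edited before publishing
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_interaction("Draft a post about our new savings product.",
                "Introducing FlexSave: ...", feedback="edited_before_publishing")
```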
Data is the raw material from which your model will learn, and hopefully, you can compile a representative, high-quality dataset for your challenges. Now, the actual intelligence of your AI system – its ability to generalize to new data – resides in the machine learning algorithms and models, and any additional tools and plugins that can be called by these.
In terms of the core AI models, there are three main approaches you can adopt:
Prompt an existing model. Mainstream LLMs (Large Language Models) of the GPT family, such as GPT-4o and GPT-4, as well as models from other providers such as Anthropic and AI21 Labs, are available for inference via API. With prompting, you can directly talk to these models, including in your prompt all the domain- and task-specific information required for a task. This can include specific content to be used, examples of analogous tasks (few-shot prompting) as well as instructions for the model to follow. For example, if your user wants to generate a blog post about a new product they are releasing, you might ask them to provide some core information about the product, such as its benefits and use cases, how to use it, the launch date, etc. Your product then fills this information into a carefully crafted prompt template and asks the LLM to generate the text. Prompting is great to get a head start with pre-trained models. However, just as in the case of my client, it often leads to the "last-mile problem" – the AI gives reasonable outputs, but they are just not good enough for real-life use. You can do whatever you want – provide more data, optimize your formulation, threaten the model – but at some point, you’ve used up the optimization potential of prompting.
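As an illustration of the template-filling pattern described above, here is a sketch using the OpenAI Python SDK; the model name, template fields, and product details are assumptions, and the script expects an API key in the environment:

```python
# Sketch of template-based prompting for the blog-post example above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are a marketing copywriter for a fintech company.
Write a blog post announcing the following product.

Product name: {name}
Key benefits: {benefits}
How to use it: {usage}
Launch date: {launch_date}

Keep the tone factual and avoid unverifiable claims."""

def generate_blog_post(product_info: dict) -> str:
    prompt = PROMPT_TEMPLATE.format(**product_info)
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```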
Beyond the training, evaluation is of primary importance for the successful use of machine learning. Suitable evaluation metrics and methods are not only important for a confident launch of your AI features but will also serve as a clear target for further optimization and as a common ground for internal discussions and decisions. While technical metrics such as precision, recall, and accuracy can provide a good starting point, ultimately, you will want to look for metrics that reflect the real-life value that your AI is delivering to users.
Finally, today, the trend is moving from using a single AI model to compound AI systems which accommodate different models, databases, and software tools and allow you to optimize for cost and transparency. Thus, in the content generation project, we used a RAG (Retrieval-Augmented Generation) architecture and combined the model with a database of domain-specific sources that it could use to produce specialized fintech content.
After the user inputs a query, the system doesn’t pass it directly to the LLM, but rather retrieves the most relevant sources for the query in the database. Then, it uses these sources to augment the prompt passed to the LLM. Thus, the LLM can use up-to-date, specialized sources to generate its final answer. Compared to an isolated fine-tuned model, this reduced hallucinations and allowed users to always have access to the latest sources. Other types of compound systems are agent systems, LLM routers and cascades. A detailed description is out of the scope of this article – if you want to learn more about these patterns, you can refer to my book The Art of AI Product Management.
The user experience of AI products is a captivating theme – after all, users have high hopes but also fears about "partnering" with an AI that can supercharge and potentially outsmart them. The design of this human-AI partnership requires a thoughtful and sensible discovery and design process. One of the key considerations is the degree of automation you want to grant with your product – and, mind you, total automation is by far not always the ideal solution. The following figure illustrates the automation continuum:
Let’s look at each of these levels:
AI systems need special treatment when it comes to design. Standard graphical interfaces are deterministic and allow you to foresee all possible paths the user might take. By contrast, large AI models are probabilistic and uncertain – they expose a range of amazing capabilities but also risks such as toxic, wrong, and harmful outputs. From the outside, your AI interface might look simple because a broad range of the capabilities of your product reside in the model and are not directly visible to users. For example, an LLM can interpret prompts, produce text, search for information, summarize it, adopt a certain style and terminology, execute instructions, etc. Even if your UI is a simple chat or prompting interface, don’t leave this potential unseen – in order to lead users to success, you need to be explicit and realistic. Make users aware of the capabilities and limitations of your AI models, allow them to easily discover and fix errors made by the AI, and teach them ways to iterate themselves to optimal outputs. By emphasizing trust, transparency, and user education, you can make your users collaborate with the AI. While a deep dive into AI UX Design is out of the scope of this article, I strongly encourage you to look for inspiration not only from other AI companies but also from other areas of design, such as human-machine interaction. You will soon identify a range of recurring design patterns, such as autocompletes, prompt suggestions, and templates, that you can integrate into your own interface to make the most out of your data and models.
When you start out with AI, it is easy to forget about governance because you are busy solving technological challenges and creating value. However, without a governance framework, your tool can be vulnerable to security risks, legal violations, and ethical concerns that can erode customer trust and harm your business. In the fintech example mentioned earlier, this led to issues such as hallucinations and irrelevant sources leaking into the public content of the company. A strong governance structure creates guardrails to prevent these issues. It protects sensitive data, ensures compliance with privacy regulations, maintains transparency, and mitigates biases in AI-generated content.
There are different definitions of AI governance. In my practice, companies were especially concerned with the following four types of risk:
Exposure to these risks depends on the application you are building, so it is worth spending some time to analyze your specific situation. For example, demographic bias (based on gender, race, location, etc.) is an important topic if your model generates user-facing content or makes decisions about people, but it turns into a non-issue if you use your model to generate code in the B2B context. In my experience, B2B applications have higher requirements in terms of security and transparency, while B2C applications need more guardrails to safeguard the privacy of user data and mitigate bias.
To set up your AI governance framework, begin by reviewing the relevant regulations and defining your objectives. At a minimum, ensure you meet the regulatory requirements for your industry and geography, such as the EU AI Act in Europe or the California Consumer Privacy Act in the U.S. Beyond compliance, you can also plan for additional guardrails to address key risks specific to your AI application. Next, assemble a cross-functional team of legal, compliance, security, and AI experts to define, implement, and assign governance measures. This team should regularly review and update the framework to adapt to system improvements, new risks, and evolving regulations. For example, the recent FTC actions against companies that exaggerated their AI performance signal the importance of focusing on quality and maintaining realistic communication about AI capabilities.
Let’s summarize how we addressed the different components in the content generation project:
This representation can be used at different stages in the AI journey – to prioritize use cases, to guide your team planning and discussions, and to align different stakeholders. It is an evolving construct that can be updated with new learnings as you move forward with your project.
Let’s summarize the key take-aways from this article:
Where you can go from here:
Note: All images are by the author.
The post Building AI products with a holistic mental model appeared first on Towards Data Science.
The post Creating an Information Edge with Conversational Access to Data appeared first on Towards Data Science.
As our world is getting more global and dynamic, businesses are more and more dependent on data for making informed, objective and timely decisions. In this article, we will see how AI can be used for intuitive conversational data access. We will use the mental model shown in Figure 2 to illustrate the implementation of Text2SQL systems (cf. Building AI products with a holistic mental model for an introduction to the mental model). After considering the market opportunities and the business value, we will explain the additional "machinery" in terms of data, LLM fine-tuning, and UX design that needs to be set up to make data widely accessible throughout the organization.
As of now, unleashing the full potential of organisational data is often a privilege of a handful of data scientists and analysts. Most employees don’t master the conventional Data Science toolkit (SQL, Python, R etc.). To access the desired data, they go via an additional layer where analysts or BI teams "translate" the prose of business questions into the language of data. The potential for friction and inefficiency on this journey is high – for example, the data might be delivered with delays or even when the question has already become obsolete. Information might get lost along the way when the requirements are not accurately translated into analytical queries. Besides, generating high-quality insights requires an iterative approach which is discouraged with every additional step in the loop. On the other side, these ad-hoc interactions create disruption for expensive data talent and distract them from more strategic data work, as described in these "confessions" of a data scientist:
When I was at Square and the team was smaller we had a dreaded "analytics on-call" rotation. It was strictly rotated on a weekly basis, and if it was your turn up you knew you would get very little "real" work done that week and spend most of your time fielding ad-hoc questions from the various product and operations teams at the company (SQL monkeying, we called it). There was cutthroat competition for manager roles on the analytics team and I think this was entirely the result of managers being exempted from this rotation – no status prize could rival the carrot of not doing on-call work.[1]
Wouldn’t it be cool to talk directly to your data instead of having to go through multiple rounds of interaction with your data staff? This vision is embraced by conversational interfaces which allow humans to interact with data using language, our most intuitive and universal channel of communication. After parsing a question, an algorithm encodes it into a structured logical form in the query language of choice, such as SQL. Thus, non-technical users can chat with their data and quickly get their hands on specific, relevant and timely information, without making the detour via a BI team. The three main benefits are:
Now, what are the product scenarios in which you might consider Text2SQL? The three main settings are:
As we will see in the following sections, Text2SQL requires a non-trivial upfront setup. To estimate the ROI, consider the nature of the decisions that are to be supported as well as the available data. Text2SQL can be an absolute win in dynamic environments where data is changing quickly and is actively and frequently used in decision making, such as investing, marketing, manufacturing and the energy industry. In these environments, traditional tools for knowledge management are too static, and more fluent ways to access data and information help companies generate a competitive advantage. In terms of the data, Text2SQL provides the biggest value with a database that is:
Any machine learning endeavour starts with data, so we will start by clarifying the structure of the input and target data that are used during training and prediction. Throughout the article, we will use the Text2SQL flow from Figure 1 as our running representation, and highlight the currently considered components and relationships in yellow.
1.1 Format and structure of the data
Typically, a raw Text2SQL input-output pair consists of a natural-language question and the corresponding SQL query, for example:
Question: "List the name and number of followers for each user."
SQL query:
select name, followers from user_profiles
In the training data space, the mapping between questions and SQL queries is many-to-many:
The manual collection of training data for Text2SQL is particularly tedious. It not only requires SQL mastery on the part of the annotator, but also more time per example than more general linguistic tasks such as sentiment analysis and text classification. To ensure a sufficient quantity of training examples, data augmentation can be used – for example, LLMs can be used to generate paraphrases for the same question. [3] provides a more complete survey of Text2SQL data augmentation techniques.
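As a sketch of this kind of augmentation, an LLM can be prompted to paraphrase the question while the gold SQL query stays fixed; the prompt wording and model name are assumptions, and an API key is expected in the environment:

```python
# Sketch of LLM-based data augmentation: generate paraphrases of an existing
# question and reuse the gold SQL query for each of them.
from openai import OpenAI

client = OpenAI()

def augment_pair(question: str, sql: str, n: int = 3) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{
            "role": "user",
            "content": f"Give {n} different paraphrases of this database question, "
                       f"one per line, without changing its meaning:\n{question}",
        }],
    )
    paraphrases = [p.strip() for p in response.choices[0].message.content.splitlines() if p.strip()]
    return [{"question": p, "query": sql} for p in paraphrases]

augmented = augment_pair("List the name and number of followers for each user.",
                         "select name, followers from user_profiles")
```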
1.2 Enriching the prompt with database information
Text2SQL is an algorithm at the interface between unstructured and structured data. For optimal performance, both types of data need to be present during training and prediction. Specifically, the algorithm has to know about the queried database and be able to formulate the query in such a way that it can be executed against the database. This knowledge can encompass:
There are two options for incorporating database knowledge: on the one hand, the training data can be restricted to examples written for the specific database, in which case the schema is learned directly from the SQL query and its mapping to the question. This single-database setting allows you to optimise the algorithm for an individual database and/or company. However, it kills off any ambitions for scalability, since the model needs to be fine-tuned for every single customer or database. Alternatively, in a multi-database setting, the database schema can be provided as part of the input, allowing the algorithm to "generalise" to new, unseen database schemas. While you will absolutely need to go for this approach if you want to use Text2SQL on many different databases, keep in mind that it requires considerable prompt engineering effort. For any reasonable business database, including the full information in the prompt will be extremely inefficient and most probably impossible due to prompt length limitations. Thus, the function responsible for prompt formulation should be smart enough to select a subset of database information which is most "useful" for a given question, and to do this for potentially unseen databases.
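The following simplified sketch illustrates the idea: select the schema elements that look most relevant to the question – here via naive keyword overlap, whereas real systems use embeddings or dedicated schema-linking models – and serialize them into the prompt (the schema and table names are invented for illustration):

```python
# Simplified sketch of multi-database prompting: pick the seemingly relevant tables
# and serialize them into the Text2SQL prompt.
SCHEMA = {
    "user_profiles": ["name", "followers", "signup_date"],
    "orders": ["order_id", "user_id", "amount", "created_at"],
    "shipments": ["shipment_id", "order_id", "status", "delivered_at"],
}

def select_relevant_tables(question: str, schema: dict, top_k: int = 2) -> dict:
    words = set(question.lower().replace(",", " ").replace(".", " ").split())
    def overlap(table: str, columns: list) -> int:
        tokens = set(table.lower().split("_")) | {c.lower() for c in columns}
        return len(words & tokens)
    ranked = sorted(schema.items(), key=lambda kv: overlap(*kv), reverse=True)
    return dict(ranked[:top_k])

def build_text2sql_prompt(question: str) -> str:
    tables = select_relevant_tables(question, SCHEMA)
    schema_str = "\n".join(f"Table {t}({', '.join(cols)})" for t, cols in tables.items())
    return (f"Given the database schema:\n{schema_str}\n\n"
            f"Write a SQL query answering: {question}\nSQL:")

print(build_text2sql_prompt("List the name and number of followers for each user."))
```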
Finally, database structure plays a crucial role. In those scenarios where you have enough control over the database, you can make your model’s life easier by letting it learn from an intuitive structure. As a rule of thumb, the more your database reflects how business users talk about the business, the better and faster your model can learn from it. Thus, consider applying additional transformations to the data, such as assembling normalised or otherwise dispersed data into wide tables or a data vault, naming tables and columns in an explicit and unambiguous way etc. All business knowledge that you can encode up-front will reduce the burden of probabilistic learning on your model and help you achieve better results.
Text2SQL is a type of semantic parsing – the mapping of texts to logical representations. Thus, the system has not only to "learn" natural language, but also the target representation – in our case, SQL. Specifically, it needs to acquire the following bits of knowledge:
2.1 Solving linguistic variability in the input
At the input, the main challenge of Text2SQL lies in the flexibility of language: as described in the section Format and structure of the data, the same question can be paraphrased in many different ways. Additionally, in the real-life conversational context, we have to deal with a number of issues such as spelling and grammar mistakes, incomplete and ambiguous inputs, multilingual inputs etc.
LLMs such as the GPT models, T5, and CodeX are coming closer and closer to solving this challenge. Learning from huge quantities of diverse text, they learn to deal with a large number of linguistic patterns and irregularities. In the end, they become able to generalise over questions which are semantically similar despite having different surface forms. LLMs can be applied out-of-the-box (zero-shot) or after fine-tuning. The former, while convenient, leads to lower accuracy. The latter requires more skill and work, but can significantly increase accuracy.
In terms of accuracy, as expected, the best-performing models are the latest models of the GPT family including the CodeX models. In April 2023, GPT-4 led to a dramatic accuracy increase of more than 5% over the previous state-of-the-art and achieved an accuracy of 85.3% (on the metric "execution with values").[4] In the open-source camp, initial attempts at solving the Text2SQL puzzle were focussed on auto-encoding models such as BERT, which excel at NLU tasks.[5, 6, 7] However, amidst the hype around generative AI, recent approaches focus on autoregressive models such as the T5 model. T5 is pre-trained using multi-task learning and thus easily adapts to new linguistic tasks, incl. different variants of semantic parsing. However, autoregressive models have an intrinsic flaw when it comes to semantic parsing tasks: they have an unconstrained output space and no semantic guardrails that would constrain their output, which means they can get stunningly creative in their behaviour. While this is amazing stuff for generating free-form content, it is a nuisance for tasks like Text2SQL where we expect a constrained, well-structured target output.
2.2 Query validation and improvement
To constrain the LLM output, we can introduce additional mechanisms for validating and improving the query. This can be implemented as an extra validation step, as proposed in the PICARD system.[8] PICARD uses a SQL parser that can verify whether a partial SQL query can lead to a valid SQL query after completion. At each generation step by the LLM, tokens that would invalidate the query are rejected, and the highest-probability valid tokens are kept. Being deterministic, this approach ensures 100% SQL validity as long as the parser observes correct SQL rules. It also decouples the query validation from the generation, thus allowing you to maintain both components independently of one another and to upgrade and modify the LLM.
Another approach is to incorporate structural and SQL knowledge directly into the LLM. For example, Graphix [9] uses graph-aware layers to inject structured SQL knowledge into the T5 model. Due to the probabilistic nature of this approach, it biases the system towards correct queries, but doesn’t provide a guarantee for success.
Finally, the LLM can be used as a multi-step agent that can autonomously check and improve the query.[10] Using multiple steps in a chain-of-thought prompt, the agent can be tasked to reflect on the correctness of its own queries and improve any flaws. If the validated query can still not be executed, the SQL exception traceback can be passed to the agent as an additional feedback for improvement.
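A minimal sketch of such an execute-and-repair loop could look as follows; the generate_sql callable stands in for whatever LLM call your system uses, and the database path is a placeholder:

```python
# Sketch of an execute-and-repair loop: run the generated query against the database
# and, if execution fails, feed the SQL error back to the model for another attempt.
import sqlite3
from typing import Callable, Optional

def query_with_repair(question: str,
                      generate_sql: Callable[[str, Optional[str]], str],
                      db_path: str = "analytics.db",
                      max_attempts: int = 3):
    feedback = None
    with sqlite3.connect(db_path) as conn:
        for _ in range(max_attempts):
            sql = generate_sql(question, feedback)
            try:
                return conn.execute(sql).fetchall()
            except sqlite3.Error as exc:
                # Pass the database error back as feedback for the next attempt
                feedback = f"The query failed with error: {exc}. Query was: {sql}"
    raise RuntimeError(f"Could not produce a valid query after {max_attempts} attempts")
```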
Beyond these automated methods which happen in the backend, it is also possible to involve the user during the query checking process. We will describe this in more detail in the section on User experience.
2.3 Evaluation
To evaluate our Text2SQL algorithm, we need to generate a test (validation) dataset, run our algorithm on it and apply relevant evaluation metrics on the result. A naive dataset split into training, development and validation data would be based on question-query pairs and lead to suboptimal results. Validation queries might be revealed to the model during training and lead to an overly optimistic view on its generalisation skills. A query-based split, where the dataset is split in such a way that no query appears both during training and during validation, provides more truthful results.
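A minimal sketch of a query-based split (the field names are assumptions):

```python
# Query-based split: all question paraphrases that map to the same gold SQL query
# end up on the same side, so no query leaks into validation.
import random

def query_based_split(pairs: list[dict], val_fraction: float = 0.2, seed: int = 42):
    queries = sorted({p["query"] for p in pairs})
    random.Random(seed).shuffle(queries)
    val_queries = set(queries[: int(len(queries) * val_fraction)])
    train = [p for p in pairs if p["query"] not in val_queries]
    val = [p for p in pairs if p["query"] in val_queries]
    return train, val

pairs = [
    {"question": "List the name and number of followers for each user.",
     "query": "select name, followers from user_profiles"},
    {"question": "Show each user's name and follower count.",
     "query": "select name, followers from user_profiles"},
    {"question": "How many orders were placed last month?",
     "query": "select count(*) from orders where created_at >= date('now','-1 month')"},
]
train, val = query_based_split(pairs)
```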
In terms of evaluation metrics, what we care about in Text2SQL is not to generate queries that are completely identical to the gold standard. This "exact string match" method is too strict and will generate many false negatives, since different SQL queries can lead to the same returned dataset. Instead, we want to achieve high semantic accuracy and evaluate whether the predicted and the "gold standard" queries would always return the same datasets. There are three evaluation metrics that approximate this goal:
Execution accuracy: the datasets resulting from the generated and target SQL queries are compared for identity. With good luck, queries with different semantics can still pass this test on a specific database instance. For example, assuming a database where all users are aged over 30, the following two queries would return identical results despite having different semantics:
select * from user
select * from user where age > 30
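A sketch of how execution accuracy can be computed against a SQLite database follows; the database path is a placeholder, and a robust evaluation would also handle ordering, duplicates and timeouts more carefully:

```python
# Sketch of execution accuracy: run the predicted and gold queries against the same
# database and compare the returned rows (order-insensitive).
import sqlite3
from collections import Counter

def execution_match(predicted_sql: str, gold_sql: str, db_path: str) -> bool:
    with sqlite3.connect(db_path) as conn:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # an invalid prediction counts as a miss
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))
```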
3. User experience
The current state-of-the-art of Text2SQL doesn’t allow a completely seamless integration into production systems – instead, it is necessary to actively manage the expectations and the behaviour of the user, who should always be aware that she is interacting with an AI system.
3.1 Failure management
Text2SQL can fail in two modes, which need to be caught in different ways: either the generated SQL query is invalid and cannot be executed against the database at all, or it executes successfully but is semantically wrong and returns a dataset that does not answer the user’s question.
The second mode is particularly tricky since the risk of "silent failures" – errors that go undetected by the user – is high. The prototypical user will have neither the time nor the technical skill to verify the correctness of the query and/or the resulting data. When data is used for decision making in the real world, this kind of failure can have devastating consequences. To avoid this, it is critical to educate users and establish guardrails on a business level that limit the potential impact, such as additional data checks for decisions with a higher impact. On the other hand, we can also use the user interface to manage the human-machine interaction and help the user detect and improve problematic requests.
3.2 Human-machine interaction
Users can get involved with your AI system with different degrees of intensity. More interaction per request can lead to better results, but it also slows down the fluidity of the user experience. Besides the potential negative impact of erroneous queries and results, also consider how motivated your users will be to provide back-and-forth feedback in order to get more accurate results and also help improve the product in the long term.
The easiest and least engaging way is to work with confidence scores. While the naive calculation of confidence as an average of the probabilities of the generated tokens is overly simplistic, more advanced methods like verbalised feedback can be used. [13] The confidence can be displayed in the interface and highlighted with an explicit alert in case it is dangerously low. This way, the responsibility of an appropriate follow-up in the "real world" – be it a rejection, acceptance or an additional check of the data – lands on the shoulders of your user. While this is a safe bet for you as a vendor, transferring this work to the user can also reduce the value of your product.
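For illustration, here is the naive variant of such a confidence score – the average probability of the generated tokens, with a threshold alert; the logprobs are assumed to come from your generation API, and the threshold is a product decision:

```python
# Naive confidence score: average per-token probability of the generated SQL,
# with an alert below an (assumed) threshold. Token logprob values are illustrative.
import math

def naive_confidence(token_logprobs: list[float]) -> float:
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

token_logprobs = [-0.05, -0.20, -0.01, -1.30, -0.10]  # illustrative values
confidence = naive_confidence(token_logprobs)
if confidence < 0.7:  # threshold shown here as an assumption
    print(f"Low confidence ({confidence:.2f}) - please double-check the generated query.")
```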
A second possibility is to engage the user in a clarification dialogue in the case of low-confidence, ambiguous or otherwise suspicious queries. For example, your system might suggest orthographic or grammar corrections to the input and ask to disambiguate specific words or grammatical structures. It might also allow the user to proactively ask for corrections in the query:[14]
USER: Show me John’s tasks in this sprint.
ASSISTANT: Would you like to see tasks John created, or those he is working on?
USER: tasks John created
ASSISTANT: Ok, here are the task IDs:
USER: Thanks, I would also like to see more information about the tasks. Please also sort by urgency.
ASSISTANT: Sure, here are the tasks along with short descriptions, assignees and deadlines, sorted by deadline.
Finally, to ease the understanding of queries by the user, your system can also provide an explicit textual reformulation of the query and ask the user to either confirm or correct it.[15]
In this section, we discuss the specific non-functional requirements for Text2SQL as well as the trade-offs between them. We will focus on the six requirements that seem most important for the task: accuracy, scalability, speed, explainability, privacy and adaptability over time.
4.1 Accuracy
For Text2SQL, the requirements on accuracy are high. First, Text2SQL is typically applied in a conversation setting where predictions are made one-by-one. Thus, the "Law of large numbers" which typically helps balance off the error in batched predictions, does not help. Second, syntactic and lexical validity is a "hard" condition: the model has to generate a well-formed SQL query, potentially with complex syntax and semantics, otherwise the request cannot be executed against the database. And if this goes well and the query can be executed, it can still contain semantic errors and lead to a wrong returned dataset (cf. section 3.1 Failure management).
4.2 Scalability
The main scalability considerations are whether you want to apply Text2SQL on one or multiple databases – and in the latter case, whether the set of databases is known and closed. If yes, you will have an easier time since you can include the information about these databases during training. However, in a scenario of a scalable product – be it a standalone Text2SQL application or an integration into an existing data product – your algorithm has to cope with any new database schema on the fly. This scenario also doesn’t give you the opportunity to transform the database structure to make it more intuitive for learning (cf. section 1.2). All of this leads to a heavy trade-off with accuracy, which might also explain why current Text2SQL providers that offer ad-hoc querying of new databases have not yet achieved significant market penetration.
4.3 Speed
Since Text2SQL requests will typically be processed online in a conversation, the speed aspect is important for user satisfaction. On the positive side, users are often aware of the fact that data requests can take a certain time and show the required patience. However, this goodwill can be undermined by the chat setting, where users subconsciously expect human-like conversation speed. Brute-force optimisation methods like reducing the size of the model might have an unacceptable impact on accuracy, so consider inference optimisation to satisfy this expectation.
4.4 Explainability and transparency
In the ideal case, the user can follow how the query was generated from the text, see the mapping between specific words or expressions in the question and the SQL query etc. This allows the user to verify the query and make any adjustments when interacting with the system. Besides, the system could also provide an explicit textual reformulation of the query and ask the user to either confirm or correct it.
4.5 Privacy
The Text2SQL function can be isolated from query execution, so the returned database information can be kept invisible. However, the critical question is how much information about the database is included in the prompt. The three options (by decreasing privacy level) are:
Privacy trades off with accuracy – the less constrained you are in including useful information in the prompt, the better the results.
4.6 Adaptability over time
To use Text2SQL in a durable way, you need to adapt to data drift, i.e. the changing distribution of the data to which the model is applied. For example, let’s assume that the data used for initial fine-tuning reflects the simple querying behaviour of users when they start using the BI system. As time passes, the information needs of users become more sophisticated and require more complex queries, which overwhelm your naive model. Besides, the goals or the strategy of a company might also drift and direct the information needs towards other areas of the database. Finally, a Text2SQL-specific challenge is database drift. As the company database is extended, new, unseen columns and tables make their way into the prompt. While Text2SQL algorithms that are designed for multi-database application can handle this issue well, it can significantly impact the accuracy of a single-database model. All of these issues are best solved with a fine-tuning dataset that reflects the current, real-world behaviour of users. Thus, it is crucial to log user questions and results, as well as any associated feedback that can be collected from usage. Additionally, semantic clustering algorithms, for example using embeddings or topic modelling, can be applied to detect underlying long-term changes in user behaviour and use these as an additional source of information for perfecting your fine-tuning dataset.
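As a sketch of the clustering idea, logged questions can be embedded and grouped so that the size of each group can be tracked over time; the embedding model and the number of clusters are assumptions to be tuned on real data:

```python
# Sketch of drift monitoring: embed logged user questions and cluster them with
# KMeans to see which query topics grow or shrink over time.
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

logged_questions = [
    "total revenue last quarter",
    "revenue by region in Q3",
    "average delivery time per carrier",
    "which carriers missed the SLA last month",
    "churn rate of premium customers in 2023",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(logged_questions)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# Track cluster sizes across time windows to spot emerging information needs
print(Counter(labels))
```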
Let’s summarise the key points of the article:
[1] Ken Van Haren. 2023. Replacing a SQL analyst with 26 recursive GPT prompts
[2] Nitarshan Rajkumar et al. 2022. Evaluating the Text-to-SQL Capabilities of Large Language Models
[3] Naihao Deng et al. 2023. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect
[4] Mohammadreza Pourreza et al. 2023. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
[5] Victor Zhong et al. 2021. Grounded Adaptation for Zero-shot Executable Semantic Parsing
[6] Xi Victoria Lin et al. 2020. Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing
[7] Tong Guo et al. 2019. Content Enhanced BERT-based Text-to-SQL Generation
[8] Torsten Scholak et al. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models
[9] Jinyang Li et al. 2023. Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing
[10] LangChain. 2023. LLMs and SQL
[11] Tao Yu et al. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
[12] Ruiqi Zhong et al. 2020. Semantic Evaluation for Text-to-SQL with Distilled Test Suites
[13] Katherine Tian et al. 2023. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
[14] Braden Hancock et al. 2019. Learning from Dialogue after Deployment: Feed Yourself, Chatbot!
[15] Ahmed Elgohary et al. 2020. Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback
[16] Janna Lipenkova. 2022. Talk to me! Text2SQL conversations with your company’s data, talk at New York Natural Language Processing meetup.
All images are by the author.
The post Creating an Information Edge with Conversational Access to Data appeared first on Towards Data Science.
The post Four LLM trends since ChatGPT and their implications for AI builders appeared first on Towards Data Science.
For many AI companies, it seems like ChatGPT has turned into the ultimate competitor. When pitching my analytics startups in earlier days, I would frequently be challenged: "what will you do if Google (Facebook, Alibaba, Yandex…) comes around the corner and does the same?" Now, the question du jour is: "why can’t you use ChatGPT to do this?"
The short answer is: ChatGPT is great for many things, but it covers far from the full spectrum of AI. The current hype happens explicitly around generative AI – not analytical AI, or its rather fresh branch of synthetic AI [1]. What does this mean for LLMs? As described in my previous article, LLMs can be pre-trained with three objectives – autoregression, autoencoding and sequence-to-sequence (cf. also Table 1, column "Pre-training objective"). Typically, a model is pre-trained with one of these objectives, but there are exceptions – for example, UniLM [2] was pre-trained on all three objectives. The fun generative tasks that have popularised AI in the past months are conversation, question answering and content generation – those tasks where the model indeed learns to "generate" the next token, sentence etc. These are best carried out by autoregressive models, which include the GPT family as well as most of the recent open-source models, like MPT-7B, OPT and Pythia. Autoencoding models, which are better suited for information extraction, distillation and other analytical tasks, are resting in the background – but let’s not forget that the initial LLM breakthrough in 2018 happened with BERT, an autoencoding model. While this might feel like the stone age of modern AI, autoencoding models are especially relevant for many B2B use cases where the focus is on distilling concise insights that address specific business tasks. We might indeed witness another wave around autoencoding and a new generation of LLMs that excel at extracting and synthesizing information for analytical purposes.
For builders, this means that popular autoregressive models can be used for everything that is content generation – and the longer the content, the better. However, for analytical tasks, you should carefully evaluate whether the autoregressive LLM you use will output a satisfying result, and consider autoencoding models or even more traditional NLP methods otherwise.
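For illustration, the two model families can be tried side by side with Hugging Face pipelines; the model names are small illustrative stand-ins, not recommendations:

```python
# Sketch contrasting the two model families with Hugging Face pipelines.
from transformers import pipeline

# Autoregressive model: generates a continuation token by token
generator = pipeline("text-generation", model="gpt2")
print(generator("Our supply chain AI can", max_new_tokens=20)[0]["generated_text"])

# Autoencoding model: reconstructs masked tokens, a better fit for analytical tasks
# such as classification or extraction (shown here via its fill-mask head)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Shipment delays were caused by port [MASK].")[0]["token_str"])
```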
In the past months, there has been a lot of debate about the uneasy relationship between open-source and commercial AI. In the short term, the open-source community cannot keep up in a race where winning entails a huge spend on data and/or compute. But with a long-term perspective in mind, even the big companies like Google and OpenAI feel threatened by open-source.[3] Spurred by this tension, both camps have continued building, and the resulting advances are eventually converging into fruitful synergies. The open-source community has a strong focus on frugality, i. e. increasing the efficiency of LLMs by doing more with less. This not only makes LLMs affordable to a broader user base – think AI democratisation – but also more sustainable from an environmental perspective. There are three principal dimensions along which LLMs can become more efficient:
At the other extreme, for now, "generative AI control is in the hands of the few that can afford the dollars to train and deploy models at scale".[5] The commercial offerings are exploding in size – be it model size, data size or the time spent on training – and clearly outcompete open-source models in terms of output quality. There is not much to report here technically – rather, the concerns lie on the side of governance and regulation. Thus, "one key risk is that powerful LLMs like GPT develop only in a direction that suits the commercial objectives of these companies."[5]
How will these two ends meet – and will they meet at all? On the one hand, any trick that reduces resource consumption can eventually be scaled up again by throwing more resources at it. On the other hand, LLM training follows a power law, which means that the learning curve flattens out as model size, dataset size and training time increase.[6] You can think of this in terms of a human education analogy – over the lifetime of humanity, schooling times have increased, but did the intelligence and erudition of the average person follow suit?
The positive thing about a flattening learning curve is the relief it brings amidst fears about AI growing "stronger and smarter" than humans. But brace yourself – the LLM world is full of surprises, and one of the most unpredictable ones is emergence.[7] Emergence is when quantitative changes in a system result in qualitative changes in behaviour – summarised with "quantity leads to quality", or simply "more is different".[8] At some point in their training, LLMs seem to acquire new, unexpected capabilities that were not in the original training scope. At present, these capabilities come in the form of new linguistic skills – for instance, instead of just generating text, models suddenly learn to summarise or translate. It is impossible to predict when this might happen and what the nature and scope of the new capabilities will be. Hence, the phenomenon of emergence, while fascinating for researchers and futurists, is still far away from providing robust value in a commercial context.
As more and more methods are developed that increase the efficiency of LLM finetuning and inference, the resource bottleneck around the physical operation of open-source LLMs seems to be loosening. Concerned with the high usage cost and restricted quota of commercial LLMs, more and more companies consider deploying their own LLMs. However, development and maintenance costs remain, and most of the described optimisations also require extended technical skills for manipulating both the models and the hardware on which they are deployed. The choice between open-source and commercial LLMs is a strategic one and should be made after a careful exploration of a range of trade-offs, including costs (development, operating and usage costs), availability, flexibility and performance. A common line of advice is to get a head start with the big commercial LLMs to quickly validate the business value of your end product, and "switch" to open-source later down the road. But this transition can be tough and even unrealistic, since LLMs widely differ in the tasks they are good at. There is a risk that open-source models cannot satisfy the requirements of your already developed application, or that you need to make considerable modifications to mitigate the associated trade-offs. Finally, the most advanced setup for companies that build a variety of features on LLMs is a multi-LLM architecture that leverages the advantages of different LLMs.
With the big challenges of LLM training roughly solved, another branch of work has focussed on the integration of LLMs into real-world products. Beyond providing ready-made components that enhance convenience for developers, these innovations also help overcome the existing limitations of LLMs and enrich them with additional capabilities such as reasoning and the use of non-linguistic data.[9] The basic idea is that, while LLMs are already great at mimicking human linguistic capacity, they still have to be placed into the context of a broader computational "cognition" to conduct more complex reasoning and execution. This cognition encompasses a number of different capacities such as reasoning, action and observation of the environment. At the moment, it is approximated using plugins and agents, which can be combined using modular LLM frameworks such as LangChain, LlamaIndex and AutoGPT.
Pre-trained LLMs have significant practical limitations when it comes to the data they leverage: on the one hand, the data quickly gets outdated – for instance, while GPT-4 was published in 2023, its data was cut off in 2021. On the other hand, most real-world applications require some customisation of the knowledge in the LLM. Consider building an app that allows you to create personalised marketing content – the more information you can feed into the LLM about your product and specific users, the better the result. Plugins make this possible – your program can fetch data from an external source, like customer e-mails and call records, and insert these into the prompt for a personalised, controlled output.
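A hand-rolled sketch of this plugin pattern might look as follows; the two fetch functions are hypothetical stand-ins for your own data sources, and the assembled prompt is then passed to whichever LLM you use.

```python
def fetch_recent_emails(customer_id: str) -> str:
    # Hypothetical: in practice, query your CRM or mailbox API.
    return "Customer asked about bulk pricing and delivery times."

def load_product_sheet(product_id: str) -> str:
    # Hypothetical: in practice, read from an internal knowledge base.
    return "Gizmo X: modular sensor kit, ships within 5 days, volume discounts available."

def build_marketing_prompt(customer_id: str) -> str:
    return (
        "You are a marketing copywriter.\n"
        f"Product facts:\n{load_product_sheet('gizmo-x')}\n\n"
        f"Recent customer correspondence:\n{fetch_recent_emails(customer_id)}\n\n"
        "Write a short, personalised follow-up email."
    )

prompt = build_marketing_prompt("cust-4711")  # pass this string to the LLM of your choice
```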
Language is closely tied with actionability. Our communicative intents often circle around action, for example when we ask someone to do something or when we refuse to act in a certain way. The same goes for computer programs, which can be seen as collections of functions that execute specific actions, block them when certain conditions are not met etc. LLM-based agents bring these two worlds together. The instructions for these agents are not hard-coded in a programming language, but are freely generated by LLMs in the form of reasoning chains that lead to achieving a given goal. Each agent has a set of plugins at hand and can juggle them around as required by the reasoning chain – for example, it can combine a search engine for retrieving specific information and a calculator to subsequently execute computations on this information. The idea of agents has existed for a long time in reinforcement learning – however, as of today, reinforcement learning still happens in relatively closed and safe environments. Backed by the vast common knowledge of LLMs, agents can now not only venture into the "big world", but also tap into an endless combinatorial potential: each agent can execute a multitude of tasks to reach their goals, and multiple agents can interact and collaborate with each other.[10] Moreover, agents learn from their interactions with the world and build up a memory that comes much closer to the multi-modal memory of humans than does the purely linguistic memory of LLMs.
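Stripped of any framework, the core loop of such an agent can be sketched in a few lines; llm_decide() below is a hypothetical stand-in for the model call that proposes the next action, and the two tools are toy implementations.

```python
def search(query: str) -> str:
    return "Population of France: about 68 million (toy search result)"

def calculator(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))  # toy only, never eval untrusted input

TOOLS = {"search": search, "calculator": calculator}

def run_agent(goal: str, max_steps: int = 5) -> str:
    context = f"Goal: {goal}"
    for _ in range(max_steps):
        action, action_input = llm_decide(context)   # hypothetical LLM call proposing the next step
        if action == "finish":
            return action_input                      # the agent's final answer
        observation = TOOLS[action](action_input)    # execute the chosen plugin
        context += f"\n{action}({action_input}) -> {observation}"
    return "No answer found within the step budget."
```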
In the last months, we have seen a range of new LLM-based frameworks such as LangChain, AutoGPT and LlamaIndex. These frameworks make it possible to combine plugins and agents into complex chains of generations and actions, implementing processes that involve multi-step reasoning and execution. Developers can now focus on efficient prompt engineering and quick app prototyping.[11] At the moment, a lot of hard-coding is still going on when you use these frameworks – but gradually, they may evolve towards a more comprehensive and flexible system for modelling cognition and action, such as the JEPA architecture proposed by Yann LeCun.[12]
What are the implications of these new components and frameworks for builders? On the one hand, they boost the potential of LLMs by enhancing them with external data and agency. Frameworks, in combination with convenient commercial LLMs, have turned app prototyping into a matter of days. But the rise of LLM frameworks also has implications for the LLM layer. It is now hidden behind an additional abstraction, and as any abstraction it requires higher awareness and discipline to be leveraged in a sustainable way. First, when developing for production, a structured process is still required to evaluate and select specific LLMs for the tasks at hand. At the moment, many companies skip this process under the assumption that the latest models provided by OpenAI are the most appropriate. Second, LLM selection should be coordinated with the desired agent behaviour: the more complex and flexible the desired behaviour, the better the LLM should perform to ensure that it picks the right actions in a wide space of options.[13] Finally, in operation, an MLOps pipeline should ensure that the model doesn’t drift away from changing data distributions and user preferences.
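The structured evaluation step can start very small: a handful of task-specific test cases and a loop over candidate models already beats picking a model by brand name. In the sketch below, complete() is a hypothetical wrapper around your various LLM clients, and the model names are placeholders.

```python
TEST_CASES = [
    {"prompt": "Classify the sentiment as positive or negative: 'The update broke my workflow.'",
     "expected": "negative"},
    {"prompt": "Classify the sentiment as positive or negative: 'Setup took two minutes, love it.'",
     "expected": "positive"},
]

def evaluate(model_name: str) -> float:
    hits = 0
    for case in TEST_CASES:
        answer = complete(model_name, case["prompt"])   # hypothetical call to the candidate LLM
        hits += int(case["expected"] in answer.lower())
    return hits / len(TEST_CASES)

for candidate in ["commercial-llm-a", "open-source-llm-b"]:   # placeholder model names
    print(candidate, evaluate(candidate))
```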
With the advance of prompting, using AI to do cool and creative things is becoming accessible for non-technical people. No need to be a programmer anymore – just use language, our natural communication medium, to tell the machine what to do. However, amidst all the buzz and excitement around quick prototyping and experimentation with LLMs, at some point, we still come to realize that "it’s easy to make something cool with LLMs, but very hard to make something production-ready with them."[14] In production, LLMs hallucinate, are sensitive to imperfect prompt designs, and raise a number of issues for governance, safety, and alignment with desired outcomes. And the thing we love most about LLMs – their open-ended space of inputs and outputs – also makes it all the harder to test for potential failures before deploying them to production.
If you have ever built an AI product, you will know that end users are often highly sensitive to AI failures. Users are prone to a "negativity bias": even if your system achieves high overall accuracy, those occasional but unavoidable error cases will be scrutinized with a magnifying glass. With LLMs, the situation is different. Just as with any other complex AI system, LLMs do fail – but they do so in a silent way. Even if they don’t have a good response at hand, they will still generate something and present it in a highly confident way, tricking us into believing and accepting their outputs and putting us in embarrassing situations further downstream. Imagine a multi-step agent whose instructions are generated by an LLM – an error in the first generation will cascade to all subsequent tasks and corrupt the whole action sequence of the agent.
One of the biggest quality issues of LLMs is hallucination, which refers to the generation of texts that are semantically or syntactically plausible but factually incorrect. Noam Chomsky, with his famous sentence "Colorless green ideas sleep furiously", already made the point that a sentence can be perfectly well-formed from the linguistic point of view but completely nonsensical for humans. Not so for LLMs, which lack the non-linguistic knowledge that humans possess and thus cannot ground language in the reality of the underlying world. And while we can immediately spot the issue in Chomsky’s sentence, fact-checking LLM outputs becomes quite cumbersome once we get into more specialized domains that are outside of our field of expertise. The risk of undetected hallucinations is especially high for long-form content as well as for interactions for which no ground truth exists, such as forecasts and open-ended scientific or philosophical questions.[15]
There are multiple approaches to hallucination. From a statistical viewpoint, we can expect that hallucination decreases as language models learn more. But in a business context, the incrementality and uncertain timeline of this "solution" make it rather unreliable. Another approach is rooted in neuro-symbolic AI. By combining the powers of statistical language generation and deterministic world knowledge, we may be able to reduce hallucinations and silent failures and finally make LLMs robust for large-scale production. For instance, ChatGPT makes this promise with the integration of Wolfram Alpha, a vast structured database of curated world knowledge.
On the surface, the natural language interface offered by prompting seems to close the gap between AI experts and laypeople – after all, all of us know at least one language and use it for communication, so why not do the same with an LLM? But prompting is a fine craft. Successful prompting that goes beyond trivia requires not only strong linguistic intuitions but also knowledge about how LLMs learn and work. And the process of designing successful prompts is highly iterative and requires systematic experimentation. As shown in the paper Why Johnny can’t prompt, humans struggle to maintain this rigor. On the one hand, we are often primed by expectations that are rooted in our experience of human interaction. Talking to humans is different from talking to LLMs – when we interact with each other, our inputs are transmitted in a rich situational context, which allows us to neutralize the imprecisions and ambiguities of human language. An LLM only gets the linguistic information and thus is much less forgiving. On the other hand, it is difficult to adopt a systematic approach to prompt engineering, so we quickly end up with opportunistic trial-and-error, making it hard to construct a scalable and consistent system of prompts.
To resolve these challenges, it is necessary to educate both prompt engineers and users about the learning process and the failure modes of LLMs, and to maintain an awareness of possible mistakes in the interface. It should be clear that an LLM output is always an uncertain thing. For instance, this can be communicated using confidence scores in the user interface, which can be derived via model calibration.[15] For prompt engineering, we currently see the rise of LLMOps, a subcategory of MLOps that covers the prompt lifecycle with prompt templating, versioning, optimisation etc. Finally, finetuning trumps few-shot learning in terms of consistency since it removes the variable "human factor" of ad-hoc prompting and enriches the inherent knowledge of the LLM. Whenever possible given your setup, you should consider switching from prompting to finetuning once you have accumulated enough training data.
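Even a minimal LLMOps setup can start with versioned prompt templates instead of ad-hoc strings, so that changes can be tracked, compared and rolled back. The sketch below uses nothing beyond the Python standard library; the template and names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str
    template: str

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

# A registry keyed by (name, version) makes it explicit which prompt variant is in use.
REGISTRY = {
    ("summarise_ticket", "1.2"): PromptTemplate(
        name="summarise_ticket",
        version="1.2",
        template="Summarise the following support ticket in two sentences:\n{ticket}",
    ),
}

prompt = REGISTRY[("summarise_ticket", "1.2")].render(ticket="Printer offline since Monday, error 0x0042.")
print(prompt)
```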
With new models, performance hacks and integrations coming up every day, the LLM rabbit hole keeps deepening. For companies, it is important to stay differentiated, keep an eye on recent developments and new risks, and favour hands-on experimentation over the buzz – many trade-offs and issues related to LLMs only become visible during real-world use. In this article, we took a look at the recent developments and how they affect building with LLMs:
[1] Andreessen Horowitz. 2023. For B2B Generative AI Apps, Is Less More?
[2] Li Dong et al. 2019. Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 13063–13075.
[3] The Information. 2023. Google Researcher: Company Has ‘No Moat’ in AI.
[4] Tri Dao et al. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
[5] EE Times. 2023. Can Open-Source LLMs Solve AI’s Democratization Problem?
[6] Jared Kaplan et al. 2020. Scaling Laws for Neural Language Models.
[7] Jason Wei et al. 2022. Emergent Abilities of Large Language Models.
[8] Philip Anderson. 1972. More is Different. In Science, Vol 177, Issue 4047, pp. 393–396.
[9] Janna Lipenkova. 2023. Overcoming the Limitations of Large Language Models.
[10] Joon Sung Park et al. 2023. Generative Agents: Interactive Simulacra of Human Behavior.
[11] Harvard University. 2023. GPT-4 – How does it work, and how do I build apps with it? – CS50 Tech Talk.
[12] Yann LeCun. 2022. A Path Towards Autonomous Machine Intelligence.
[13] Jerry Liu. 2023. Dumber LLM Agents Need More Constraints and Better Tools.
[14] Chip Huyen. 2023. Building LLM applications for production.
[15] Stephanie Lin et al. 2022. Teaching models to express their uncertainty in words.
The post Four LLM trends since ChatGPT and their implications for AI builders appeared first on Towards Data Science.
The post Overcoming the Limitations of Large Language Models appeared first on Towards Data Science.
Disclaimer: This article was written without the support of ChatGPT.
In the last couple of years, Large Language Models (LLMs) such as ChatGPT, T5 and LaMDA have developed amazing skills at producing human language. We are quick to attribute intelligence to models and algorithms, but how much of this is emulation, and how much truly resembles the rich language capability of humans? When confronted with the natural-sounding, confident outputs of these models, it is sometimes easy to forget that language per se is only the tip of the communication iceberg. Its full power unfolds in combination with a wide range of complex cognitive skills relating to perception, reasoning and communication. While humans acquire these skills naturally from the surrounding world as they grow, the learning inputs and signals for LLMs are rather meagre. They are forced to learn only from the surface form of language, and their success criterion is not communicative efficiency but the reproduction of high-probability linguistic patterns.
In the business context, this can lead to bad surprises when too much power is given to an LLM. When facing its own limitations, it will not admit them but rather gravitate to the other extreme – producing nonsense, toxic content or even dangerous advice with a high level of confidence. For example, a medical virtual assistant driven by GPT-3 can advise its user to kill themselves at a certain point in the conversation.[4]
Considering these risks, how can we safely benefit from the power of LLMs when integrating them in our product development? On the one hand, it is important to be aware of inherent weak points and use rigorous evaluation and probing methods to target them in specific use cases, instead of relying on happy-path interactions. On the other hand, the race is on – all major AI labs are planting their seeds to enhance LLMs with additional capabilities, and there is plenty of space for a cheerful glance into the future. In this article, we will look into the limitations of LLMs and discuss ongoing efforts to control and enhance LLM behaviour. A basic knowledge of the workings of language models is assumed – if you are a newbie, please refer to this article.
Before diving into the technology, let’s set the scene with a thought experiment – the "Octopus test" as proposed by Emily Bender – to understand how differently humans and LLMs see the world.[1]
Imagine that Anna and Maria are stranded on two uninhabited islands. Luckily, they have discovered two telegraphs and an underwater cable left behind by previous visitors and start communicating with each other. Their conversations are "overheard" by a quick-witted octopus who has never seen the world above water but is exceptionally good at statistical learning. He picks up the words, syntactic patterns and communication flows between the two ladies and thus masters the external form of their language without understanding how it is actually grounded in the real world. As Ludwig Wittgenstein once put it, "the limits of my language mean the limits of my world" – while we know today that the world models of humans are composed of much more than language, the octopus would sympathise with this statement, at least regarding his knowledge of the world above water.
At some point, listening is not enough. Our octopus decides to take control, cuts the cable on Maria’s side and starts chatting with Anna. The interesting question is, when will Anna detect the change? As long as the two parties exchange social pleasantries, there is a reasonable chance that Anna will not suspect anything. Their small talk might go on as follows:
A: Hi Maria!
O: Hi Anna, how are you?
A: Thanks, I’m good, just enjoyed a coconut breakfast!
O: You are lucky, there are no coconuts on my island. What are your plans?
A: I wanted to go swimming but I am afraid there will be a storm. And you?
O: I am having my breakfast now and will do some woodwork afterwards.
A: Have a nice day, talk later!
O: Bye!
However, as their relationship deepens, their communication also grows in intensity and sophistication. Over the next sections, we will take the octopus through a couple of scenes from island life that require the mastery of common-sense knowledge, communicative context and reasoning. As we go, we will also survey approaches to incorporate additional intelligence into agents – be they fictive octopuses or LLMs – that are originally only trained from the surface form of language.
One morning, Anna is planning a hunting trip and tries to forecast the weather for the day. Since the wind is coming from Maria’s direction, she asks "Maria" for a report on current weather conditions as an important piece of information. Being caught in deep waters, our octopus grows embarrassed about describing the weather conditions. Even if he had a chance to glance into the skies, he would not know what specific weather terms like "rain", "wind", "cloudy" etc. refer to in the real world. He desperately makes up some weather facts. Later in the day, while hunting in the woods, Anna is surprised by a dangerous thunderstorm. She attributes her failure to predict the storm to a lack of meteorological knowledge rather than a deliberate hallucination by her conversation partner.
On the surface, LLMs are able to reflect many true facts about the world. However, their knowledge is limited to concepts and facts that they explicitly encountered in the training data. Even with huge training data, this knowledge cannot be complete. For example, it might miss domain-specific knowledge that is required for commercial use cases. Another important limitation, as of now, is the recency of the information. Since language models lack a notion of temporal context, they can’t work with dynamic information such as the current weather, stock prices or even today’s date.
This problem can be solved by systematically "injecting" additional knowledge into the LLM. This new input can come from various sources, such as structured external databases (e.g. FreeBase or WikiData), company-specific data sources and APIs. One possibility to inject it is via adapter networks that are "plugged in" between the LLM layers to learn the new knowledge.[2]
The training of this architecture happens in two steps: memorisation and utilisation.
During inference, the hidden state that the LLM provides to the adapter is fused with the adapter’s output using a fusion function to produce the final answer.
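To give a feel for the mechanics, here is a toy PyTorch sketch of an adapter with a simple residual fusion; it is a generic illustration of the idea rather than the exact architecture of [2], and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class KnowledgeAdapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)       # compress the hidden state
        self.up = nn.Linear(bottleneck, hidden_size)         # project back, carrying the injected knowledge
        self.gate = nn.Linear(2 * hidden_size, hidden_size)  # simple learned fusion

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        adapter_out = self.up(torch.relu(self.down(hidden_state)))
        fused = torch.cat([hidden_state, adapter_out], dim=-1)
        return hidden_state + torch.tanh(self.gate(fused))   # fuse adapter output with the LLM's hidden state

hidden = torch.randn(1, 16, 768)            # (batch, tokens, hidden size)
print(KnowledgeAdapter(768)(hidden).shape)  # torch.Size([1, 16, 768])
```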
While architecture-level knowledge injection allows for efficient modular retraining of smaller adapter networks, the modification of the architecture also requires considerable engineering skill and effort. The easier alternative is input-level injection, where the model is directly fine-tuned on the new facts (cf. [3] for an example). The downside is the expensive fine-tuning required after each change – thus, it is not suitable for dynamic knowledge sources. A complete overview of existing knowledge injection approaches can be found in this article.
Knowledge injection helps you build domain intelligence, which is becoming a key differentiator for vertical AI products. In addition, you can use it to establish traceability so the model can point a user to the original sources of information. Beyond structured knowledge injection, efforts are underway to integrate multimodal information and knowledge into LLMs. For instance, in April 2022, DeepMind introduced Flamingo, a visual language model that can seamlessly ingest text, images and video.[5] At the same time, Google is working on Socratic Models, a modular framework in which multiple pre-trained models may be composed zero-shot, i.e. via multi-modal prompting, to exchange information with each other.[6]
As Anna wants to share not only her thoughts about life, but also the delicious coconuts from her island with Maria, she invents a coconut catapult. She sends Maria a detailed instruction on how she did it and asks her for instructions to optimise it. At the receiving end, the octopus falls short of a meaningful reply. Even if he had a way of constructing the catapult underwater, he does not know what words such as rope and coconut refer to, and thus can’t physically reproduce and improve the experiment. So he simply says "Cool idea, great job! I need to go hunting now, bye!". Anna is bothered by the uncooperative response, but she also needs to go on with her daily business and forgets about the incident.
When we use language, we do so for a specific purpose, which is our communicative intent. For example, the communicative intent can be to convey information, socialise or ask someone to do something. While the first two are rather straightforward for an LLM (as long as it has seen the required information in the data), the latter is already more challenging. Let’s forget about the fact that the LLM does not have an ability to act in the real world and limit ourselves to tasks in its realm of language – writing a speech, an application letter etc. Not only does the LLM need to combine and structure the related information in a coherent way, but it also needs to set the right emotional tone in terms of soft criteria such as formality, creativity, humour etc.
Making the transition from classical language generation to recognising and responding to specific communicative intents is an important step to achieve better acceptance of user-facing NLP systems, especially in Conversational AI. One method for this is Reinforcement Learning from Human Feedback (RLHF), which has recently been implemented in ChatGPT [7] but has a longer history in human preference learning.[8] In a nutshell, RLHF "redirects" the learning process of the LLM from the straightforward but artificial next-token prediction task towards learning human preferences in a given communicative situation. These human preferences are directly encoded in the training data: during the annotation process, humans are presented with prompts and either write the desired response or rank a series of existing responses. The behaviour of the LLM is then optimised to reflect the human preference. Technically, RLHF is performed in three steps: first, the pre-trained LLM is fine-tuned on human-written example responses; second, a reward model is trained on the human rankings of candidate responses; third, the LLM is optimised against this reward model with reinforcement learning.
The LLM is thus fine-tuned to produce useful outputs that maximise human preferences in a given communicative situation, for example using Proximal Policy Optimisation (PPO).
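The heart of the second step, the reward model, boils down to a pairwise ranking objective: the human-preferred response should receive a higher score than the rejected one. Here is a toy sketch of that loss, with fixed numbers standing in for the reward model's scores.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Push the score of the preferred response above the score of the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

score_chosen = torch.tensor([1.3, 0.2])    # reward model scores for the preferred answers
score_rejected = torch.tensor([0.1, 0.5])  # scores for the dispreferred answers
print(reward_ranking_loss(score_chosen, score_rejected))
```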
For a more in-depth introduction into RLHF, please check out the excellent materials by Huggingface (article and video).
The RLHF methodology has had mind-blowing success with ChatGPT, especially in the areas of conversational AI and creative content creation. In fact, it not only leads to more authentic and purposeful conversations, but can also positively "bias" the model towards ethical values while mitigating unethical, discriminatory or even dangerous outputs. However, what is often left unsaid amidst the excitement about RLHF is that, while not introducing significant technological breakthroughs, much of its power comes from sheer human annotation effort. RLHF is prohibitively expensive in terms of labelled data, the known bottleneck for all supervised and reinforcement learning endeavours. Beyond human rankings for LLM outputs, OpenAI’s data for ChatGPT also include human-written responses to prompts that are used to fine-tune the initial LLM. It is obvious that only big companies committed to AI innovation can afford the necessary budget for data labelling at this scale.
With the help of a brainy community, most bottlenecks eventually get solved. In the past, the Deep Learning community solved the data shortage with self-supervision – pre-training LLMs using next-token prediction, a learning signal that is available "for free" since it is inherent to any text. The Reinforcement Learning community is using algorithms such as Variational Autoencoders or Generative Adversarial Networks to generate synthetic data – with varying degrees of success. To make RLHF broadly accessible, we will also need to figure out a way to crowdsource communicative reward data and/or to build it in a self-supervised or automated way. One possibility is to use ranking datasets that are available "in the wild", for example Reddit or Stackoverflow conversations where answers to questions are rated by users. Beyond simple ratings and thumbs up/down labels, some conversational AI systems also allow the user to directly edit the response to demonstrate the desired behaviour, which creates a more differentiated learning signal.
Finally, Anna faces an emergency. She is pursued by an angry bear. In a panic, she grabs a couple of metal sticks and asks Maria to tell her how to defend herself. Of course, the octopus has no clue what Anna means. Not only has he never faced a bear – he also doesn’t know how to behave in a bear attack and how the sticks can help Anna. Solving a task like this not only requires the ability to map accurately between words and objects in the real world, but also to reason about how these objects can be leveraged. The octopus fails miserably, and Anna discovers the deception in this lethal encounter.
Now, what if Maria was still there? Most humans can reason logically, even if there are huge individual differences in the mastery of this skill. Using reasoning, Maria could solve the task as follows:
Premise 1 (based on situation): Anna has a couple of metal sticks.
Premise 2 (based on common-sense knowledge): Bears are intimidated by noise.
Conclusion: Anna can try and use her sticks to make noise and scare the bear away.
LLMs often produce outputs with a valid reasoning chain. Yet, on closer inspection, most of this coherence is the result of pattern learning rather than a deliberate and novel combination of facts. DeepMind has been on a quest to solve causality for years, and a recent attempt is the faithful reasoning framework for question answering.[9] The architecture consists of two LLMs – one for the selection of relevant premises and another for inferring the final, conclusive answer to the question. When prompted with a question and its context, the selection LLM first picks the related statements from its data corpus and passes them to the inference LLM. The inference LLM deduces new statements and adds them to the context. This iterative reasoning process comes to an end when all statements line up into a coherent reasoning chain that provides a complete answer to the question.
Applied to our island incident, the selection LLM would pick out the two premises stated above, and the inference LLM would conclude from them that the metal sticks can be used to make noise and scare the bear away.
Another method to perform reasoning with LLMs is chain-of-thought prompting. Here, the user first provides one or more examples of the reasoning process as part of the prompt, and the LLM "imitates" this reasoning process with new inputs.[13]
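In its simplest form, a chain-of-thought prompt just prepends a worked example with explicit intermediate steps; the wording below is purely illustrative.

```python
cot_prompt = """Q: A library had 120 books and lent out 45. How many books remain?
A: The library started with 120 books. Lending out 45 leaves 120 - 45 = 75. The answer is 75.

Q: A train has 8 carriages with 32 seats each. How many seats are there in total?
A:"""
# Send cot_prompt to the LLM of your choice; the trailing "A:" invites the model
# to continue with its own step-by-step reasoning before stating the answer.
```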
Beyond this general ability to reason logically, humans also access a whole toolbox of more specific reasoning skills. A classical example is mathematical calculation. LLMs can produce these calculations up to a certain level – for example, modern LLMs can confidently perform 2- or 3-digit addition. However, they start to fail systematically when complexity increases, for example when more digits are added or multiple operations need to be performed to solve a mathematical task. And "verbal" tasks formulated in natural language (for example, "I had 10 mangoes and lost 3. How many mangoes do I have left?") are much more challenging than explicit computations ("ten minus three equals…"). While LLM performance can be improved by increasing training time, training data, and parameter sizes, using a simple calculator will still remain the more reliable alternative.
Just like children, who explicitly learn the laws of mathematics and other exact sciences, LLMs can also benefit from hard-coded rules. This sounds like a case for neuro-symbolic AI – and indeed, modular systems like MRKL (pronounced "miracle") by AI21 Labs split the workload of understanding the task, executing the computation and formulating the output result between different models.[12] MRKL stands for Modular Reasoning, Knowledge and Language and combines AI modules in a pragmatic plug-and-play fashion, switching back and forth between structured knowledge, symbolic methods and neural models. Coming back to our example, to perform mathematical calculations, an LLM is first fine-tuned to extract the formal arguments from a verbal arithmetic task (numbers, operators, parentheses). The calculation itself is then "routed" to a deterministic mathematical module, and the final result is formatted in natural language using the output LLM.
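The division of labour can be sketched as follows; extract_args() stands in for the fine-tuned extraction LLM and is stubbed with a fixed output here, while the calculation itself stays deterministic.

```python
import operator

OPS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}

def extract_args(question: str) -> tuple:
    # Hypothetical stub: a real system would use a fine-tuned LLM to map the verbal
    # task to formal arguments, e.g. "lost 3 of my 10 mangoes" -> (10, "minus", 3).
    return 10, "minus", 3

def answer(question: str) -> str:
    a, op, b = extract_args(question)
    result = OPS[op](a, b)                      # deterministic calculation module
    return f"You have {result} mangoes left."   # output formatting (could itself be an LLM)

print(answer("I had 10 mangoes and lost 3. How many mangoes do I have left?"))
```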
As opposed to black-box, monolithic LLMs, reasoning add-ons create transparency and trust since they decompose the "thinking" process into individual steps. They are particularly useful for supporting complex, multi-step decision and action paths. For example, they can be used by virtual assistants that make data-driven recommendations and need to perform multiple steps of analytics and aggregation to get to a conclusion.
In this article, we have provided an overview of approaches to complement the intelligence of LLMs. Let’s summarise our guidelines for maximising the benefits of LLMs and potential enhancements:
And even with the described enhancements, LLMs remain far behind human understanding and language use – they simply lack the unique, powerful and mysterious synergy of cultural knowledge, intuition and experience that humans build up as they go through their lives. According to Yann LeCun, "it is clear that these models are doomed to a shallow understanding that will never approximate the full-bodied thinking we see in humans."[11] When using AI, it is important to appreciate the wonders and complexity we find in language and cognition. Looking at smart machines from the right distance, we can differentiate between tasks that can be delegated to them and those that will remain the privilege of humans in the foreseeable future.
[1] Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.
[2] Emelin, Denis & Bonadiman, Daniele & Alqahtani, Sawsan & Zhang, Yi & Mansour, Saab. (2022). Injecting Domain Knowledge in Language Models for Task-Oriented Dialogue Systems. 10.48550/arXiv.2212.08120.
[3] Fedor Moiseev et al. 2022. SKILL: Structured Knowledge Infusion for Large Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1581–1588, Seattle, United States. Association for Computational Linguistics.
[4] Ryan Daws. 2020. Medical chatbot using OpenAI’s GPT-3 told a fake patient to kill themselves. Retrieved on January 13, 2022.
[5] DeepMind. 2022. Tackling multiple tasks with a single visual language model. Retrieved on January 13, 2023.
[6] Zeng et al. 2022. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. Preprint.
[7] OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. Retrieved on January 13, 2023.
[8] Christiano et al. 2017. Deep reinforcement learning from human preferences.
[9] Creswell & Shanahan. 2022. Faithful Reasoning Using Large Language Models. DeepMind.
[10] Karpas et al. 2022. MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. AI21 Labs.
[11] Jacob Browning & Yann LeCun. 2022. AI And The Limits Of Language. Retrieved on January 13, 2023.
[12] Karpas et al. 2022. MRKL Systems – A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.
[13] Wei et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of NeurIPS 2022.
All images unless otherwise noted are by the author.
The post Overcoming the Limitations of Large Language Models appeared first on Towards Data Science.
The post Choosing the right language model for your NLP use case appeared first on Towards Data Science.
Large Language Models (LLMs) are Deep Learning models trained to produce text. With this impressive ability, LLMs have become the backbone of modern Natural Language Processing (NLP). Traditionally, they are pre-trained by academic institutions and big tech companies such as OpenAI, Microsoft and NVIDIA. Most of them are then made available for public use. This plug-and-play approach is an important step towards large-scale AI adoption – instead of spending huge resources on the training of models with general linguistic knowledge, businesses can now focus on fine-tuning existing LLMs for specific use cases.
However, picking the right model for your application can be tricky. Users and other stakeholders have to make their way through a vibrant landscape of language models and related innovations. These improvements address different components of the language model including its training data, pre-training objective, architecture and fine-tuning approach – you could write a book on each of these aspects. On top of all this research, the marketing buzz and the intriguing aura of Artificial General Intelligence around huge language models obfuscate things even more.
In this article, I explain the main concepts and principles behind LLMs. The goal is to provide non-technical stakeholders with an intuitive understanding as well as a language for efficient interaction with developers and AI experts. For broader coverage, the article includes analyses that are rooted in a large number of NLP-related publications. While we will not dive into mathematical details of language models, these can be easily retrieved from the references.
The article is structured as follows: first, I situate language models in the context of the evolving NLP landscape. The second section explains how LLMs are built and pre-trained. Finally, I describe the fine-tuning process and provide some guidance on model selection.
Language is a fascinating skill of the human mind – it is a universal protocol for communicating our rich knowledge of the world, and also more subjective aspects such as intents, opinions and emotions. In the history of AI, there have been multiple waves of research to approximate ("model") human language with mathematical means. Before the era of Deep Learning, representations were based on simple algebraic and probabilistic concepts such as one-hot representations of words, sequential probability models and recursive structures. With the evolution of Deep Learning in the past years, linguistic representations have increased in precision, complexity and expressiveness.
In 2018, BERT was introduced as the first LLM on the basis of the new Transformer architecture. Since then, Transformer-based LLMs have gained strong momentum. Language modelling is especially attractive due to its universal usefulness. While many real-world NLP tasks such as sentiment analysis, information retrieval and information extraction do not need to generate language, the assumption is that a model that produces language also has the skills to solve a variety of more specialised linguistic challenges.
Learning happens based on parameters – variables that are optimized during the training process to achieve the best prediction quality. As the number of parameters increases, the model is able to acquire more granular knowledge and improve its predictions. Since the introduction of the first LLMs in 2017–2018, we have seen an exponential explosion in parameter sizes – while breakthrough BERT was trained with 340M parameters, Megatron-Turing NLG, a model released in 2022, is trained with 530B parameters – a more than thousand-fold increase.
Thus, the mainstream keeps wowing the public with ever larger parameter counts. However, there have been critical voices pointing out that model performance is not increasing at the same rate as model size. Moreover, model pre-training can leave a considerable carbon footprint. Downsizing efforts have countered the brute-force approach to make progress in language modelling more sustainable.
The LLM landscape is competitive and innovations are short-lived. The following chart shows the top-15 most popular LLMs in the timespan 2018–2022, along with their share-of-voice over time:
We can see that most models fade in popularity after a relatively short time. To stay cutting-edge, users should monitor the current innovations and evaluate whether an upgrade would be worthwhile.
Most LLMs follow a similar lifecycle: first, at the "upstream", the model is pre-trained. Due to the heavy requirements on data size and compute, it is mostly a privilege of large tech companies and universities. Recently, there have also been some collaborative efforts (e.g. the BigScience workshop) for the joint advancement of the LLM field. A handful of well-funded startups such as Cohere and AI21 Labs also provide pre-trained LLMs.
After the release, the model is adopted and deployed at the "downstream" by application-focussed developers and businesses. At this stage, most models require an extra fine-tuning step to specific domains and tasks. Others, like GPT-3, are more convenient in that they can learn a variety of linguistic tasks directly during prediction (zero- or few-shot prediction).
Finally, time knocks at the door and a better model comes around the corner – either with an even larger number of parameters, more efficient use of hardware or a more fundamental improvement to the modelling of human language. Models that brought about substantial innovations can give birth to whole model families. For example, BERT lives on in BERT-QA, DistilBERT and RoBERTa, which are all based on the original architecture.
In the next sections, we will look at the first two phases in this lifecycle – the pre-training and the fine-tuning for deployment.
Most teams and NLP practitioners will not be involved in the pre-training of LLMs, but rather in their fine-tuning and deployment. However, to successfully pick and use a model, it is important to understand what is going on "under the hood". In this section, we will look at the basic ingredients of an LLM: the training data, the input representation, the pre-training objective and the model architecture.
Each of these will affect not only the choice, but also the fine-tuning and deployment of your LLM.
The data used for LLM training is mostly text data covering different styles, such as literature, user-generated content and news data. After seeing a variety of different text types, the resulting models become aware of the fine details of language. Other than text data, code is regularly used as input, teaching the model to generate valid programs and code snippets.
Unsurprisingly, the quality of the training data has a direct impact on model performance – and also on the required size of the model. If you are smart in preparing the training data, you can improve model quality while reducing its size. One example is the T0 model, which is 16 times smaller than GPT-3 but outperforms it on a range of benchmark tasks. Here is the trick: instead of just using any text as training data, it works directly with task formulations, thus making its learning signal much more focussed. Figure 3 illustrates some training examples.
A final note on training data: we often hear that language models are trained in an unsupervised manner. While this makes them appealing, it is technically wrong – the training is better described as self-supervised. Well-formed text already provides the necessary learning signals, sparing us the tedious process of manual data annotation. The labels to be predicted correspond to past and/or future words in a sentence. Thus, annotation happens automatically and at scale, making possible the relatively quick progress in the field.
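A tiny illustration of why the labels come "for free": every position in a well-formed sentence yields a (context, next word) training pair without any manual annotation.

```python
sentence = "language models learn from raw text".split()

# Each prefix of the sentence becomes the context, the following word the label.
pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
for context, label in pairs:
    print(" ".join(context), "->", label)
```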
Once the training data is assembled, we need to pack it into a form that can be digested by the model. Neural networks are fed with algebraic structures (vectors and matrices), and the optimal algebraic representation of language is an ongoing quest – reaching from simple sets of words to representations containing highly differentiated context information. Each new step confronts researchers with the endless complexity of natural language, exposing the limitations of the current representation.
The basic unit of language is the word. In the beginnings of NLP, this gave rise to the naive bag-of-words representation that throws all words from a text together, irrespective of their ordering. Consider two sentences that consist of exactly the same words in a different order – for example, "the dog bit the man" and "the man bit the dog".
In the bag-of-words world, these sentences get exactly the same representation since they consist of the same words. Clearly, this embraces only a small part of their meaning.
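The point is easy to verify: counting words erases the difference between the two sentences entirely.

```python
from collections import Counter

s1 = "the dog bit the man"
s2 = "the man bit the dog"

print(Counter(s1.split()))                          # Counter({'the': 2, 'dog': 1, 'bit': 1, 'man': 1})
print(Counter(s1.split()) == Counter(s2.split()))   # True – identical bag-of-words representations
```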
Sequential representations accommodate information about word order. In Deep Learning, the processing of sequences was originally implemented in order-aware Recurrent Neural Networks (RNN).[2] However, going one step further, the underlying structure of language is not purely sequential but hierarchical. In other words, we are not talking about lists, but about trees. Words that are farther apart can actually have stronger syntactic and semantic ties than neighbouring words. Consider an example sentence such as "The girl who lives at the end of the street lost her keys."
Here, her refers to the girl. When an RNN reaches the end of the sentence and finally sees her, its memory of the beginning of the sentence might already be fading, thus not allowing it to recover this relationship.
To solve these long-distance dependencies, more complex neural structures were proposed to build up a more differentiated memory of the context. The idea is to keep words that are relevant for future predictions in memory while forgetting the other words. This was the contribution of Long-Short Term Memory (LSTM)[3] cells and Gated Recurrent Units (GRUs)[4]. However, these models don’t optimise for specific positions to be predicted, but rather for a generic future context. Moreover, due to their complex structure, they are even slower to train than traditional RNNs.
Finally, people have done away with recurrence and proposed the attention mechanism, as incorporated in the Transformer architecture.[5] Attention allows the model to focus back and forth between different words during prediction. Each word is weighted according to its relevance for the specific position to be predicted. For the above sentence, once the model reaches the position of her, girl will have a higher weight than at, despite the fact that it is much farther away in the linear order.
To date, the attention mechanism comes closest to the biological workings of the human brain during information processing. Studies have shown that attention learns hierarchical syntactic structures, incl. a range of complex syntactic phenomena (cf. the Primer on BERTology and the papers referenced therein). It also allows for parallel computation and, thus, faster and more efficient training.
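At its core, the mechanism is surprisingly compact; here is a bare-bones scaled dot-product attention in NumPy, with random vectors standing in for real token representations.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # similarity between every pair of positions
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                             # each position becomes a weighted mix of all values

tokens, dim = 5, 8                                 # e.g. a 5-word sentence, 8-dimensional vectors
Q = K = V = np.random.rand(tokens, dim)
print(attention(Q, K, V).shape)                    # (5, 8)
```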
With the appropriate training data representation in place, our model can start learning. There are three generic objectives used for pre-training language models: sequence-to-sequence transduction, autoregression and auto-encoding. All of them require the model to master broad linguistic knowledge.
The original task addressed by the encoder-decoder architecture as well as the Transformer model is sequence-to-sequence transduction: a sequence is transduced into a sequence in a different representation framework. The classical sequence-to-sequence task is machine translation, but other tasks such as summarisation are frequently formulated in this manner. Note that the target sequence is not necessarily text – it can also be other unstructured data such as images as well as structured data such as programming languages. An example of sequence-to-sequence LLMs is the BART family.
The second task is autoregression, which is also the original language modelling objective. In autoregression, the model learns to predict the next output (token) based on previous tokens. The learning signal is restricted by the unidirectionality of the enterprise – the model can only use information from the right or from the left of the predicted token. This is a major limitation since words can depend both on past as well as on future positions. As an example, consider how the verb written impacts a sentence such as "The student has written a paper on NLP" in both directions.
Here, the position of paper is restricted to something that is writable, while the position of student is restricted to a human or, anyway, another intelligent entity capable of writing.
Many of the LLMs making today’s headlines are autoregressive, incl. the GPT family, PaLM and BLOOM.
The third task – auto-encoding – solves the issue of unidirectionality. Auto-encoding is very similar to the learning of classical word embeddings.[6] First, we corrupt the training data by hiding a certain portion of tokens – typically 10–20% – in the input. The model then learns to reconstruct the correct inputs based on the surrounding context, taking into account both the preceding and the following tokens. The typical example of auto-encoders is the BERT family, where BERT stands for Bidirectional Encoder Representations from Transformers.
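The corruption step can be pictured in a couple of lines: a fraction of tokens is replaced by a mask symbol, and the model's training task is to reconstruct the originals from the bidirectional context.

```python
import random

tokens = "the river overflowed after three days of heavy rain".split()

# Replace roughly 15% of the tokens with a mask symbol (positions vary per run).
masked = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
print(" ".join(masked))
```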
The basic building blocks of a language model are the encoder and the decoder. The encoder transforms the original input into a high-dimensional algebraic representation, also called a "hidden" vector. Wait a minute – hidden? Well, in reality there are no big secrets at this point. Of course you can look at this representation, but a lengthy vector of numbers will not convey anything meaningful to a human. It takes the mathematical intelligence of our model to deal with it. The decoder reproduces the hidden representation in an intelligible form such as another language, programming code, an image etc.
The encoder-decoder architecture was originally introduced for Recurrent Neural Networks. Since the introduction of the attention-based Transformer model, traditional recurrence has lost its popularity while the encoder-decoder idea lives on. Most Natural Language Understanding (NLU) tasks rely on the encoder, while Natural Language Generation (NLG) tasks need the decoder and sequence-to-sequence transduction requires both components.
We will not go into the details of the Transformer architecture and the attention mechanism here. For those who want to master the details, be prepared to spend a good amount of time to wrap your head around it. Beyond the original paper, [7] and [8] provide excellent explanations. For a lightweight introduction, I recommend the corresponding sections in Andrew Ng’s Sequence models course.
Language modelling is a powerful upstream task – if you have a model that successfully generates language, congratulations – it is an intelligent model. However, the business value of having a model bubbling with random text is limited. Instead, NLP is mostly used for more targeted downstream tasks such as sentiment analysis, question answering and information extraction. This is the time to apply transfer learning and reuse the existing linguistic knowledge for more specific challenges. During fine-tuning, a portion of the model is "frozen" and the rest is further trained with domain- or task-specific data.
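With the Hugging Face transformers library, the freezing step might look as follows; the checkpoint name is an illustrative placeholder, and the remaining classification head is then trained in your usual training loop.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the pre-trained encoder; only the newly added classification head stays trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```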
Explicit fine-tuning adds complexity on the path towards LLM deployment. It can also lead to model explosion, where each business task requires its own fine-tuned model, escalating to an unmaintainable variety of models. So, folks have made an effort to get rid of the fine-tuning step using few- or zero-shot learning (e.g. in GPT-3 [9]). This learning happens on-the-fly during prediction: the model is fed with a "prompt" – a task description and potentially a few training examples – to guide its predictions for future examples.
While much quicker to implement, the convenience factor of zero- or few-shot learning is counterbalanced by its lower prediction quality. Besides, many of these models need to be accessed via cloud APIs. This might be a welcome opportunity at the beginning of your development – however, at more advanced stages, it can turn into another unwanted external dependency.
Looking at the continuous supply of new language models on the AI market, selecting the right model for a specific downstream task and staying in sync with the state of the art can be tricky.
Research papers normally benchmark each model against specific downstream tasks and datasets. Standardised task suites such as SuperGLUE and BIG-bench allow for unified benchmarking against a multitude of NLP tasks and provide a basis for comparison. Still, we should keep in mind that these tests are prepared in a highly controlled setting. As of today, the generalisation capacity of language models is rather limited – thus, the transfer to real-life datasets might significantly affect model performance. The evaluation and selection of an appropriate model should involve experimentation on data that is as close as possible to the production data.
As a rule of thumb, the pre-training objective provides an important hint: autoregressive models perform well on text generation tasks such as Conversational AI, question answering and text summarisation, while auto-encoders excel at "understanding" and structuring language, for example for sentiment analysis and various information extraction tasks. Models intended for zero-shot learning can theoretically perform all kinds of tasks as long as they receive appropriate prompts – however, their accuracy is generally lower than that of fine-tuned models.
To make things more concrete, the following chart shows how popular NLP tasks are associated with prominent language models in the NLP literature. The associations are computed based on multiple similarity and aggregation metrics, incl. embedding similarity and distance-weighted co-occurrence. Model-task pairs with higher scores, such as BART / Text Summarization and LaMDA / Conversational AI, indicate a good fit based on historical data.
In this article, we have covered the basic notions of LLMs and the main dimensions where innovation is happening. The following table provides a summary of the key features for the most popular LLMs:
Let’s summarise some general guidelines for the selection and deployment of LLMs:
Finally, be aware of the limitations of LLMs. While they have the amazing, human-like capacity to produce language, their overall cognitive power is galaxies away from us humans. The world knowledge and reasoning capacity of these models are strictly limited to the information they find at the surface of language. They also can’t situate facts in time and might provide you with outdated information without blinking an eye. If you are building an application that relies on generating up-to-date or even original knowledge, consider combining your LLM with additional multimodal, structured or dynamic knowledge sources.
[1] Victor Sanh et al. 2021. Multitask prompted training enables zero-shot task generalization. CoRR, abs/2110.08207.
[2] Yoshua Bengio et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[4] Kyunghyun Cho et al. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar.
[5] Ashish Vaswani et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
[6] Tomas Mikolov et al. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.
[7] Jay Jalammar. 2018. The illustrated transformer.
[8] Alexander Rush et al. 2018. The annotated transformer.
[9] Tom B. Brown et al. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.
[10] Jacob Devlin et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
[11] Julien Simon. 2021. Large Language Models: A New Moore’s Law?
[12] Underlying dataset: more than 320k articles on AI and NLP published 2018–2022 in specialised AI resources, technology blogs and publications by the leading AI think tanks.
All images unless otherwise noted are by the author.
The post Choosing the right language model for your NLP use case appeared first on Towards Data Science.
The post Major trends in NLP: a review of 20 years of ACL research appeared first on Towards Data Science.
The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) is starting this week in Florence, Italy. We took the opportunity to review major research trends in the animated NLP space and formulate some implications from the business perspective. The article is backed by a statistical and – guess what – NLP-based analysis of ACL papers from the last 20 years.
Natural language is one of the primary USPs that set the human mind apart from other species. NLP, a major buzzword in today's tech discussion, deals with how computers can understand and generate language. The rise of NLP in the past decades is backed by a couple of global developments – the universal hype around AI, exponential advances in the field of Deep Learning and an ever-increasing quantity of available text data. But what is the substance behind the buzz? In fact, NLP is a highly complex, interdisciplinary field that is constantly supplied by high-quality fundamental research in linguistics, math and computer science. The ACL conference brings these different angles together. As the following chart shows, research activity has been flourishing in the past years:
Figure 1: Number of papers published at the ACL conference per year
In the following, we summarize some core trends in terms of data strategies, algorithms, tasks as well as multilingual NLP. The analysis is based on ACL papers published since 1998 which were processed using a domain-specific ontology for the fields of NLP and Machine Learning.
The quantity of freely available text data is increasing exponentially, mainly due to the massive production of Web content. However, this large body of data comes with some key challenges. First, large data is inherently noisy. Think of natural resources such as oil and metal – they need a process of refining and purification before they can be used in the final product. The same goes for data. In general, the more "democratic" the production channel, the dirtier the data – which means that more effort has to be spent on its cleaning. For example, data from social media will require a longer cleaning pipeline. Among other things, you will need to deal with extravagances of self-expression like smileys and irregular punctuation, which are normally absent in more formal settings such as scientific papers or legal contracts.
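To make this concrete, here is a minimal, illustrative cleaning step for user-generated text in Python; the specific regular expressions are examples of common heuristics, not a pipeline prescribed by the analysis.

```python
import re

def clean_social_media_text(text: str) -> str:
    """Toy cleaning routine for noisy user-generated text; real pipelines go much further."""
    text = re.sub(r"http\S+", " ", text)                  # drop URLs
    text = re.sub(r"[\U0001F300-\U0001FAFF]", " ", text)  # drop most emoji / smileys
    text = re.sub(r"([!?.])\1+", r"\1", text)             # collapse repeated punctuation
    return re.sub(r"\s+", " ", text).strip()              # normalise whitespace

print(clean_social_media_text("sooo excited!!! 😍 see https://example.com ..."))
# -> "sooo excited! see ."
```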
The other major challenge is the labeled data bottleneck: strictly speaking, most state-of-the-art algorithms are supervised. They not only need labeled data – they need Big Labeled Data. This is especially relevant for the advanced, complex algorithms of the Deep Learning family. Just as a child's brain first needs a massive amount of input before it can learn its native language, an algorithm first needs to see a large quantity of data before it can go "deep" and embrace language in its whole complexity.
Traditionally, training data at smaller scale has been annotated manually. However, dedicated manual annotation of large datasets comes with efficiency trade-offs which are rarely acceptable, especially in the business context.
What are the possible solutions? On the one hand, there are some enhancements on the management side, incl. crowd-sourcing and Training Data as a Service (TDaaS). On the other hand, a range of automatic workarounds for the creation of annotated datasets have also been suggested in the machine learning community. The following chart shows some trends:
Figure 2: Discussion of approaches for creation and reuse of training data (amounts of mentions normalised by paper quantity in the respective year)
Clearly, pretraining has seen the biggest rise in the past five years. In pretraining, a model is first trained on a large, general dataset and subsequently tweaked with task-specific data and objectives. Its popularity is largely due to the fact that companies such as Google and Facebook are making huge models available out-of-the-box to the open-source community. Pretrained word embeddings such as Word2Vec and FastText, as well as pretrained language models such as BERT, allow NLP developers to jump to the next level. Transfer learning is another approach to reusing models across different tasks. If the reuse of existing models is not an option, one can leverage a small quantity of labeled data to automatically label a larger quantity of data, as is done in distant and weak supervision – note, however, that these approaches usually lead to a decrease in labeling precision.
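As a rough illustration of the weak supervision idea, the toy sketch below lets two handcrafted labelling functions vote on unlabelled examples; the keywords, labels and example texts are invented for demonstration, and libraries such as Snorkel implement this pattern at scale.

```python
from typing import Optional

# Heuristic labelling functions: cheap, noisy substitutes for manual annotation.
def lf_positive(text: str) -> Optional[str]:
    return "positive" if any(w in text.lower() for w in ("great", "excellent")) else None

def lf_negative(text: str) -> Optional[str]:
    return "negative" if any(w in text.lower() for w in ("poor", "late", "broken")) else None

def weak_label(text: str) -> Optional[str]:
    votes = [v for v in (lf_positive(text), lf_negative(text)) if v is not None]
    # Keep only examples where the heuristics agree; as noted above, precision
    # is usually lower than with dedicated manual annotation.
    return votes[0] if votes and len(set(votes)) == 1 else None

print([weak_label(t) for t in ("Excellent service", "Parcel arrived broken", "Okay, I guess")])
# -> ['positive', 'negative', None]
```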
In terms of algorithms, research in recent years has been strongly focussed on the Deep Learning family:
Figure 3: Discussion of Deep Learning algorithms (amounts of mentions normalised by paper quantity in the respective year)
Word embeddings are clearly taking off. In their basic form, word embeddings were introduced by Mikolov et al. (2013). The universal linguistic principle behind word embeddings is distributional similarity: a word can be characterized by the contexts in which it occurs. Thus, as humans, we normally have no difficulty completing the sentence "The customer signed the ___ today" with suitable words such as "deal" or "contract". Word embeddings allow machines to do this automatically and are thus extremely powerful for addressing the very core of the context awareness issue.
While word2vec, the original embedding algorithm, is statistical and does not account for complexities of life such as ambiguity, context sensitivity and linguistic structure, subsequent approaches have enriched word embeddings with all kinds of linguistic information. And, by the way, you can embed not only words, but also other things such as senses, sentences and whole documents.
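The following sketch trains a tiny Word2Vec model with gensim (4.x API assumed) to illustrate distributional similarity; the three-sentence corpus is far too small for meaningful vectors, so in practice you would train on millions of sentences or load pretrained embeddings.

```python
from gensim.models import Word2Vec

# Toy corpus: "contract" and "deal" appear in similar contexts.
corpus = [
    ["the", "customer", "signed", "the", "contract", "today"],
    ["the", "customer", "signed", "the", "deal", "yesterday"],
    ["we", "reviewed", "the", "contract", "before", "signing"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=100)
print(model.wv.most_similar("contract", topn=3))  # nearest neighbours by cosine similarity
```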
Neural Networks are the workhorse of Deep Learning (cf. Goldberg and Hirst (2017) for an introduction of the basic architectures in the NLP context). Convolutional Neural Networks have seen an increase in the past years, whereas the popularity of the traditional Recurrent Neural Network (RNN) is dropping. This is due, on the one hand, to the availability of more efficient RNN-based architectures such as LSTM and GRU. On the other hand, a new and pretty disruptive mechanism for sequential processing – attention – has been introduced on top of the sequence-to-sequence (seq2seq) model of Sutskever et al. (2014) by Bahdanau et al. (2014). If you use Google Translate, you might have noticed the leapfrog in the translation quality a couple of years ago – attention-based seq2seq was behind it. And while seq2seq still relies on RNNs in the pipeline, the transformer architecture, another major advance from 2017, finally gets rid of recurrence and completely relies on the attention mechanism (Vaswani et al. 2017).
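At its core, the attention mechanism of the transformer boils down to a weighted sum over value vectors, with weights derived from query-key similarity. The bare-bones NumPy sketch below computes scaled dot-product attention for tiny random matrices; it ignores masking, multiple heads and batching.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                                         # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                                      # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))        # 4 positions, dimension 8
print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 8)
```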
Deep Learning is a vibrant and fascinating domain, but it can also be quite intimidating from the application point of view. When it feels that way, keep in mind that most developments are motivated by increased efficiency at Big Data scale, context awareness and scalability to different tasks and languages. For a mathematical introduction, Young et al. (2018) present an excellent overview of the state-of-the-art algorithms.
When we look at specific NLP tasks such as sentiment analysis and named entity recognition, the inventory is much steadier than that of the underlying algorithms. Over the years, there has been a gradual evolution from preprocessing tasks such as stemming, through syntactic parsing and information extraction, to semantically oriented tasks such as sentiment/emotion analysis and semantic parsing. This corresponds to the three "global" NLP development curves – syntax, semantics and context awareness – as described by Cambria et al. (2014). As we have seen in the previous section, the third curve – the awareness of a larger context – has already become one of the main drivers behind new Deep Learning algorithms.
From an even more general perspective, there is an interesting trend towards task-agnostic research. In the section on data above, we saw how the generalisation power of modern mathematical approaches has been leveraged in scenarios such as transfer learning and pretraining. Indeed, modern algorithms are developing amazing multi-tasking powers – thus, the relevance of the specific task at hand decreases. The following chart shows an overall decline in the discussion of specific NLP tasks since 2006:
Figure 4: Amount of discussion of specific NLP tasks
With globalization, going international becomes an imperative for business growth. English is traditionally the starting point for most NLP research, but the demand for scalable multilingual NLP systems has been increasing in recent years. How is this need reflected in the research community? Think of different languages as different lenses through which we view the same world – they share many properties, a fact that is fully accommodated by modern learning algorithms with their increasing power for abstraction and generalisation. Still, language-specific features have to be thoroughly addressed, especially in the preprocessing phase (a small detect-and-route sketch at the end of this section illustrates the idea). As the following chart shows, the diversity of languages addressed in ACL research keeps increasing:
Figure 5: Frequent languages per year (> 10 mentions per language)
However, just as seen for NLP tasks in the previous section, we can expect a consolidation once language-specific differences have been neutralized for the next wave of algorithms. The most popular languages are summarised in Figure 6.
Figure 6: Languages addressed by ACL research
For some of these languages, research interest meets commercial attractiveness: languages such as English, Chinese and Spanish bring together large quantities of available data, huge native speaker populations and a large economic potential in the corresponding geographical regions. However, the abundance of "smaller" languages also shows that the NLP field is generally evolving towards a theoretically sound treatment of multilinguality and cross-linguistic generalisation.
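As a small, hedged illustration of language-aware preprocessing, the sketch below detects the input language and routes it to a matching spaCy pipeline; it assumes the `langdetect` package is installed along with the small English and German spaCy models (`python -m spacy download en_core_web_sm` and `de_core_news_sm`), and the two-language routing table is a toy.

```python
import spacy
from langdetect import detect

# Toy routing table: one pipeline per supported language.
pipelines = {
    "en": spacy.load("en_core_web_sm"),
    "de": spacy.load("de_core_news_sm"),
}

def tokenize(text: str) -> list[str]:
    lang = detect(text)                           # e.g. "en" or "de"
    nlp = pipelines.get(lang, pipelines["en"])    # fall back to English
    return [token.text for token in nlp(text)]

print(tokenize("Der Vertrag wurde gestern unterschrieben."))
```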
Spurred by the global AI hype, the NLP field is exploding with new approaches and disruptive improvements. There is a shift towards modeling meaning and context dependence, probably the most universal and challenging facet of human language. The generalisation power of modern algorithms allows for efficient scaling across different tasks, languages and datasets, thus significantly speeding up the ROI cycle of NLP developments and allowing for a flexible and efficient integration of NLP into individual business scenarios.
Stay tuned for a review of ACL 2019 and more updates on NLP trends!
The post Major trends in NLP: a review of 20 years of ACL research appeared first on Towards Data Science.