Agentic GraphRAG for Commercial Contracts

Structuring legal information as a knowledge graph to increase the answer accuracy using a LangGraph agent

In every business, legal contracts are foundational documents that define the relationships, obligations, and responsibilities between parties. Whether it’s a partnership agreement, an NDA, or a supplier contract, these documents often contain critical information that drives decision-making, risk management, and compliance. However, navigating and extracting insights from these contracts can be a complex and time-consuming process.

In this post, we’ll explore how we can streamline the process of understanding and working with legal contracts by implementing an end-to-end solution using agentic GraphRAG. I see GraphRAG as an umbrella term for any method that retrieves or reasons over information stored in a knowledge graph, enabling more structured and context-aware responses.

By structuring legal contracts into a knowledge graph in Neo4j, we can create a powerful repository of information that’s easy to query and analyze. From there, we’ll build a LangGraph agent that allows users to ask specific questions about the contracts, making it possible to rapidly uncover new insights.

The code is available in this GitHub repository.

Why structuring data matters

Some domains work well with naive RAG, but legal contracts present unique challenges.

Pulling information from irrelevant contracts using naive vector RAG

As shown in the image, relying solely on a vector index to retrieve relevant chunks can introduce risks, such as pulling information from irrelevant contracts. This is because legal language is highly structured, and similar wording across different agreements can lead to incorrect or misleading retrieval. These limitations highlight the need for a more structured approach, such as GraphRAG, to ensure precise and context-aware retrieval.

To implement GraphRAG, we first need to construct a knowledge graph.

Legal knowledge graph containing both structured and unstructured information.

To build a knowledge graph for legal contracts, we need a way to extract structured information from documents and store it alongside the raw text. An LLM can help by reading through contracts and identifying key details such as parties, dates, contract types, and important clauses. Instead of treating the contract as just a block of text, we break it down into structured components that reflect its underlying legal meaning. For example, an LLM can recognize that “ACME Inc. agrees to pay $10,000 per month starting January 1, 2024” contains both a payment obligation and a start date, which we can then store in a structured format.

Once we have this structured data, we store it in a knowledge graph, where entities like companies, agreements, and clauses are represented as nodes along with their relationships. The unstructured text remains available, but now we can use the structured layer to refine our searches and make retrieval far more precise. Instead of just fetching the most relevant text chunks, we can filter contracts based on their attributes. This means we can answer questions that naive RAG would struggle with, such as how many contracts were signed last month or whether we have any active agreements with a specific company. These questions require aggregation and filtering, which isn’t possible with standard vector-based retrieval alone.

By combining structured and unstructured data, we also make retrieval more context-aware. If a user asks about a contract’s payment terms, we ensure that the search is constrained to the right agreement rather than relying on text similarity, which might pull in terms from unrelated contracts. This hybrid approach overcomes the limitations of naive RAG and allows for a much deeper and more reliable analysis of legal documents.

Graph construction

We’ll leverage an LLM to extract structured information from legal documents, using the CUAD (Contract Understanding Atticus Dataset), a widely used benchmark for contract analysis licensed under CC BY 4.0. The CUAD dataset contains over 500 contracts, making it ideal for evaluating our structured extraction pipeline.

The token count distribution for the contracts is visualized below.

Most contracts in this dataset are relatively short, with token counts below 10,000. However, there are some much longer contracts, with a few reaching up to 80,000 tokens. These long contracts are rare, while shorter ones make up the majority. The distribution shows a steep drop-off, meaning long contracts are the exception rather than the rule.

We’re using Gemini-2.0-Flash for extraction, which has a 1 million token input limit, so handling these contracts isn’t a problem. Even the longest contracts in our dataset (around 80,000 tokens) fit well within the model’s capacity. Since most contracts are much shorter, we don’t have to worry about truncation or breaking documents into smaller chunks for processing.

Structured data extraction

Most commercial LLMs have the option to use Pydantic objects to define the schema of the output. An example for location:

from typing import Optional

from pydantic import BaseModel, Field


class Location(BaseModel):
    """
    Represents a physical location including address, city, state, and country.
    """

    address: Optional[str] = Field(
        ..., description="The street address of the location. Use None if not provided"
    )
    city: Optional[str] = Field(
        ..., description="The city of the location. Use None if not provided"
    )
    state: Optional[str] = Field(
        ..., description="The state or region of the location. Use None if not provided"
    )
    country: str = Field(
        ...,
        description="The country of the location. Use the two-letter ISO standard.",
    )

When using LLMs for structured output, Pydantic helps define a clear schema by specifying the types of attributes and providing descriptions that guide the model’s responses. Each field has a type, such as str or Optional[str], and a description that tells the LLM exactly how to format the output.

For example, in a Location model, we define key attributes like address, city, state, and country, specifying what data is expected and how it should be structured. The country field, for instance, follows the two-letter ISO country code standard, like "US", "FR", or "JP", instead of inconsistent variations like “United States” or “USA.” This principle applies to other structured data as well: ISO 8601 keeps dates in a standard format (YYYY-MM-DD), and so on.

By defining structured output with Pydantic, we make LLM responses more reliable, machine-readable, and easier to integrate into databases or APIs. Clear field descriptions further help the model generate correctly formatted data, reducing the need for post-processing.

The Pydantic schema models can be more sophisticated like the Contract model below, which captures key details of a legal agreement, ensuring the extracted data follows a standardized structure.

class Contract(BaseModel):
    """
    Represents the key details of the contract.
    """
  
    summary: str = Field(
        ...,
        description=("High level summary of the contract with relevant facts and details. Include all relevant information to provide full picture."
        "Do no use any pronouns"),
    )
    contract_type: str = Field(
        ...,
        description="The type of contract being entered into.",
        enum=CONTRACT_TYPES,
    )
    parties: List[Organization] = Field(
        ...,
        description="List of parties involved in the contract, with details of each party's role.",
    )
    effective_date: str = Field(
        ...,
        description=(
            "Enter the date when the contract becomes effective in yyyy-MM-dd format."
            "If only the year (e.g., 2015) is known, use 2015-01-01 as the default date."
            "Always fill in full date"
        ),
    )
    contract_scope: str = Field(
        ...,
        description="Description of the scope of the contract, including rights, duties, and any limitations.",
    )
    duration: Optional[str] = Field(
        None,
        description=(
            "The duration of the agreement, including provisions for renewal or termination."
            "Use ISO 8601 durations standard"
        ),
    )
  
    end_date: Optional[str] = Field(
        None,
        description=(
            "The date when the contract expires. Use yyyy-MM-dd format."
            "If only the year (e.g., 2015) is known, use 2015-01-01 as the default date."
            "Always fill in full date"
        ),
    )
    total_amount: Optional[float] = Field(
        None, description="Total value of the contract."
    )
    governing_law: Optional[Location] = Field(
        None, description="The jurisdiction's laws governing the contract."
    )
    clauses: Optional[List[Clause]] = Field(
        None, description=f"""Relevant summaries of clause types. Allowed clause types are {CLAUSE_TYPES}"""
    )

This contract schema organizes key details of legal agreements in a structured way, making it easier to analyze with LLMs. It includes different types of clauses, such as confidentiality or termination, each with a short summary. The parties involved are listed with their names, locations, and roles, while contract details cover things like start and end dates, total value, and governing law. Some attributes, such as governing law, can be defined using nested models, enabling more detailed and complex outputs.

The nested object approach works well with some AI models that handle complex data relationships, while others may struggle with deeply nested details.

We can test our approach using the following example. We are using the LangChain framework to orchestrate LLMs.

from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
llm.with_structured_output(Contract).invoke(
    "Tomaz works with Neo4j since 2017 and will make a billion dollar until 2030."
    "The contract was signed in Las Vegas"
)

which outputs

Contract(
    summary="Tomaz works with Neo4j since 2017 and will make a billion dollar until 2030.",
    contract_type="Service",
    parties=[
        Organization(
            name="Tomaz",
            location=Location(
                address=None,
                city="Las Vegas",
                state=None,
                country="US"
            ),
            role="employee"
        ),
        Organization(
            name="Neo4j",
            location=Location(
                address=None,
                city=None,
                state=None,
                country="US"
            ),
            role="employer"
        )
    ],
    effective_date="2017-01-01",
    contract_scope="Tomaz will work with Neo4j",
    duration=None,
    end_date="2030-01-01",
    total_amount=1_000_000_000.0,
    governing_law=None,
    clauses=None
)

Now that our contract data is in a structured format, we can define the Cypher query needed to import it into Neo4j, mapping entities, relationships, and key clauses into a graph structure. This step transforms raw extracted data into a queryable knowledge graph, enabling efficient traversal and retrieval of contract insights.

UNWIND $data AS row
MERGE (c:Contract {file_id: row.file_id})
SET c.summary = row.summary,
    c.contract_type = row.contract_type,
    c.effective_date = date(row.effective_date),
    c.contract_scope = row.contract_scope,
    c.duration = row.duration,
    c.end_date = CASE WHEN row.end_date IS NOT NULL THEN date(row.end_date) ELSE NULL END,
    c.total_amount = row.total_amount
WITH c, row
CALL (c, row) {
    WITH c, row
    WHERE row.governing_law IS NOT NULL
    MERGE (c)-[:HAS_GOVERNING_LAW]->(l:Location)
    SET l += row.governing_law
}
FOREACH (party IN row.parties |
    MERGE (p:Party {name: party.name})
    MERGE (p)-[:HAS_LOCATION]->(pl:Location)
    SET pl += party.location
    MERGE (p)-[pr:PARTY_TO]->(c)
    SET pr.role = party.role
)
FOREACH (clause IN row.clauses |
    MERGE (c)-[:HAS_CLAUSE]->(cl:Clause {type: clause.clause_type})
    SET cl.summary = clause.summary
)

This Cypher query imports structured contract data into Neo4j by creating Contract nodes with attributes such as summary, contract_type, effective_date, duration, and total_amount. If a governing law is specified, it links the contract to a Location node. Parties involved in the contract are stored as Party nodes, with each party connected to a Location and assigned a role in relation to the contract. The query also processes clauses, creating Clause nodes and linking them to the contract while storing their type and summary.
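
To run this import from Python, the extracted Pydantic objects can be serialized and passed as the $data parameter through the official Neo4j Python driver. The snippet below is only a minimal sketch: the connection details are placeholders, import_query stands for the Cypher statement above, and contracts is assumed to be the list of extracted Contract objects paired with their file IDs.

from neo4j import GraphDatabase

# Hypothetical connection details; replace with your own instance
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

# `contracts` is assumed to hold (file_id, Contract) pairs from the extraction step,
# and `import_query` the Cypher statement shown above
rows = [{"file_id": file_id, **contract.model_dump()} for file_id, contract in contracts]  # Pydantic v2 serialization
driver.execute_query(import_query, data=rows, database_="neo4j")
driver.close()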

After processing and importing the contracts, the resulting graph follows the schema shown below.

Imported legal graph schema

Let’s also take a look at a single contract.

This graph represents a contract structure where a contract (orange node) connects to various clauses (red nodes), parties (blue nodes), and locations (violet nodes). The contract has three clauses: Renewal & Termination, Liability & Indemnification, and Confidentiality & Non-Disclosure. Two parties, Modus Media International and Dragon Systems, Inc., are involved, each linked to their respective locations, Netherlands (NL) and United States (US). The contract is governed by U.S. law. The contract node also contains additional metadata, including dates and other relevant details.

A public read-only instance containing CUAD legal contracts is available with the following credentials.

URI: neo4j+s://demo.neo4jlabs.com
username: legalcontracts
password: legalcontracts
database: legalcontracts

Entity resolution

Entity resolution in legal contracts is challenging due to variations in how companies, individuals, and locations are referenced. A company might appear as “Acme Inc.” in one contract and “Acme Corporation” in another, requiring a process to determine whether they refer to the same entity.

One approach is to generate candidate matches using text embeddings or string distance metrics like Levenshtein distance. Embeddings capture semantic similarity, while string distance measures character-level differences. Once candidates are identified, additional evaluation is needed, comparing metadata such as addresses or tax IDs, analyzing shared relationships in the graph, or incorporating human review for critical cases.
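
As an illustration, the sketch below generates candidate pairs by combining embedding similarity with a character-level ratio from Python’s standard library; the party names, embedding model, and thresholds are assumptions for the example rather than the exact procedure used in this project.

from difflib import SequenceMatcher
from itertools import combinations

import numpy as np
from langchain_openai import OpenAIEmbeddings

names = ["Acme Inc.", "Acme Corporation", "Dragon Systems, Inc."]  # hypothetical party names
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = dict(zip(names, (np.array(v) for v in embedder.embed_documents(names))))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

candidates = []
for a, b in combinations(names, 2):
    semantic = cosine(vectors[a], vectors[b])
    lexical = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    # Keep pairs that look similar on either signal for downstream evaluation
    if semantic > 0.9 or lexical > 0.8:
        candidates.append((a, b, round(semantic, 3), round(lexical, 3)))

print(candidates)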

For resolving entities at scale, both open-source solutions like Dedupe and commercial tools like Senzing offer automated methods. Choosing the right approach depends on data quality, accuracy requirements, and whether manual oversight is feasible.

With the legal graph constructed, we can move on to the agentic GraphRAG implementation.

Agentic GraphRAG

Agentic architectures vary widely in complexity, modularity, and reasoning capabilities. At their core, these architectures involve an LLM acting as a central reasoning engine, often supplemented with tools, memory, and orchestration mechanisms. The key differentiator is how much autonomy the LLM has in making decisions and how interactions with external systems are structured.

One of the simplest and most effective designs, particularly for chatbot-like implementations, is a direct LLM-with-tools approach. In this setup, the LLM serves as the decision-maker, dynamically selecting which tools to invoke (if any), retrying operations when necessary, and executing multiple tools in sequence to fulfill complex requests. 

The diagram represents a simple LangGraph agent workflow. It begins at __start__, moving to the assistant node, where the LLM processes user input. From there, the assistant can either call tools to fetch relevant information or transition directly to __end__ to complete the interaction. If a tool is used, the assistant processes the response before deciding whether to call another tool or end the session. This structure allows the agent to autonomously determine when external information is needed before responding.
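
A minimal version of this loop can be wired up in LangGraph roughly as follows; the model choice is a placeholder, and ContractSearchTool refers to the contract search tool built later in this post, so treat it as a sketch rather than the exact implementation.

from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.graph import MessagesState, START, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition

tools = [ContractSearchTool()]
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash").bind_tools(tools)

def assistant(state: MessagesState):
    # The LLM decides whether to answer directly or request a tool call
    return {"messages": [llm.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))
builder.add_edge(START, "assistant")
builder.add_conditional_edges("assistant", tools_condition)  # route to "tools" or end
builder.add_edge("tools", "assistant")
agent = builder.compile()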

This approach is particularly well-suited to stronger commercial models like Gemini or GPT-4o, which excel at reasoning and self-correction.

Tools

LLMs are powerful reasoning engines, but their effectiveness often depends on how well they are equipped with external tools. These tools, whether database queries, APIs, or search functions, extend an LLM’s ability to retrieve facts, perform calculations, or interact with structured data.

Designing tools that are both general enough to handle diverse queries and precise enough to return meaningful results is more art than science. What we’re really building is a semantic layer between the LLM and the underlying data. Rather than requiring the LLM to understand the exact structure of a Neo4j knowledge graph or a database schema, we define tools that abstract away these complexities.

With this approach, the LLM doesn’t need to know whether contract information is stored as graph nodes and relationships or as raw text in a document store. It only needs to invoke the right tool to fetch relevant data based on a user’s question.

In our case, the contract retrieval tool serves as this semantic interface. When a user asks about contract terms, obligations, or parties, the LLM calls a structured query tool that translates the request into a database query, retrieves relevant information, and presents it in a format the LLM can interpret and summarize. This enables a flexible, model-agnostic system where different LLMs can interact with contract data without needing direct knowledge of its storage or structure.

There’s no one-size-fits-all standard for designing an optimal toolset. What works well for one model may fail for another. Some models handle ambiguous tool instructions gracefully, while others struggle with complex parameters or require explicit prompting. The trade-off between generality and task-specific efficiency means tool design requires iteration, testing, and fine-tuning for the LLM in use.

For contract analysis, an effective tool should retrieve contracts and summarize key terms without requiring users to phrase queries rigidly. Achieving this flexibility depends on thoughtful prompt engineering, robust schema design, and adaptation to different LLM capabilities. As models evolve, so do strategies for making tools more intuitive and effective.

In this section, we’ll explore different approaches to tool implementation, comparing their flexibility, effectiveness, and compatibility with various LLMs.

My preferred approach is to dynamically and deterministically construct a Cypher query and execute it against the database. This method ensures consistent and predictable query generation while maintaining implementation flexibility. By structuring queries this way, we reinforce the semantic layer, allowing user inputs to be seamlessly translated into database retrievals. This keeps the LLM focused on retrieving relevant information rather than understanding the underlying data model.

Our tool is intended to identify relevant contracts, so we need to provide the LLM with options to search contracts based on various attributes. The input description is again provided as a Pydantic object.

class ContractInput(BaseModel):
    min_effective_date: Optional[str] = Field(
        None, description="Earliest contract effective date (YYYY-MM-DD)"
    )
    max_effective_date: Optional[str] = Field(
        None, description="Latest contract effective date (YYYY-MM-DD)"
    )
    min_end_date: Optional[str] = Field(
        None, description="Earliest contract end date (YYYY-MM-DD)"
    )
    max_end_date: Optional[str] = Field(
        None, description="Latest contract end date (YYYY-MM-DD)"
    )
    contract_type: Optional[str] = Field(
        None, description=f"Contract type; valid types: {CONTRACT_TYPES}"
    )
    parties: Optional[List[str]] = Field(
        None, description="List of parties involved in the contract"
    )
    summary_search: Optional[str] = Field(
        None, description="Inspect summary of the contract"
    )
    country: Optional[str] = Field(
        None, description="Country where the contract applies. Use the two-letter ISO standard."
    )
    active: Optional[bool] = Field(None, description="Whether the contract is active")
    monetary_value: Optional[MonetaryValue] = Field(
        None, description="The total amount or value of a contract"
    )

With LLM tools, attributes can take various forms depending on their purpose. Some fields are simple strings, such as contract_type and country, which store single values. Others, like parties, are lists of strings, allowing multiple entries (e.g., multiple entities involved in a contract).

Beyond basic data types, attributes can also represent complex objects. For example, monetary_value uses a MonetaryValue object, which includes structured data such as the value and a comparison operator. While attributes with nested objects offer a clear and structured representation of data, models tend to struggle to handle them effectively, so we should keep them simple.

As part of this project, we’re experimenting with an additional cypher_aggregation attribute, providing the LLM with greater flexibility for scenarios that require specific filtering or aggregation.

cypher_aggregation: Optional[str] = Field(
    None,
    description="""Custom Cypher statement for advanced aggregations and analytics.

    This will be appended to the base query:
    ```
    MATCH (c:Contract)
    <filtering based on other parameters>
    WITH c, summary, contract_type, contract_scope, effective_date, end_date, parties, active, monetary_value, contract_id, countries
    <your cypher goes here>
    ```
    
    Examples:
    
    1. Count contracts by type:
    ```
    RETURN contract_type, count(*) AS count ORDER BY count DESC
    ```
    
    2. Calculate average contract duration by type:
    ```
    WITH contract_type, effective_date, end_date
    WHERE effective_date IS NOT NULL AND end_date IS NOT NULL
    WITH contract_type, duration.between(effective_date, end_date).days AS duration
    RETURN contract_type, avg(duration) AS avg_duration ORDER BY avg_duration DESC
    ```
    
    3. Calculate contracts per effective date year:
    ```
    RETURN effective_date.year AS year, count(*) AS count ORDER BY year
    ```
    
    4. Counts the party with the highest number of active contracts:
    ```
    UNWIND parties AS party
    WITH party.name AS party_name, active, count(*) AS contract_count
    WHERE active = true
    RETURN party_name, contract_count
    ORDER BY contract_count DESC
    LIMIT 1
    ```
    """

The cypher_aggregation attribute allows LLMs to define custom Cypher statements for advanced aggregations and analytics. It extends the base query by appending question-specified aggregation logic, enabling flexible filtering and computation.

This feature supports use cases such as counting contracts by type, calculating average contract duration, analyzing contract distributions over time, and identifying key parties based on contract activity. By leveraging this attribute, the LLM can dynamically generate insights tailored to specific analytical needs without requiring predefined query structures.

While this flexibility is valuable, it should be carefully evaluated, as increased adaptability comes at the cost of reduced consistency and robustness due to the added complexity of the operation.

We must clearly define the function’s name and description when presenting it to the LLM. A well-structured description helps guide the model in using the function correctly, ensuring it understands its purpose, expected inputs, and outputs. This reduces ambiguity and improves the LLM’s ability to generate meaningful and reliable queries.

class ContractSearchTool(BaseTool):
    name: str = "ContractSearch"
    description: str = (
        "useful for when you need to answer questions related to any contracts"
    )
    args_schema: Type[BaseModel] = ContractInput

Finally, we need to implement a function that processes the given inputs, constructs the corresponding Cypher statement, and executes it efficiently.

The core logic of the function centers on constructing the Cypher statement. We begin by matching the contract as the foundation of the query.

cypher_statement = "MATCH (c:Contract) "

Next, we need to implement the function that processes the input parameters. In this example, we primarily use attributes to filter contracts based on the given criteria.


Simple property filtering

For example, the contract_type attribute is used to perform simple node property filtering.

if contract_type:
    filters.append("c.contract_type = $contract_type")
    params["contract_type"] = contract_type

This code adds a Cypher filter for contract_type while using query parameters for values to prevent query injection security issues.

Since the possible contract type values are presented in the attribute description

contract_type: Optional[str] = Field(
    None, description=f"Contract type; valid types: {CONTRACT_TYPES}"
)

we don’t have to worry about mapping values from input to valid contract types as the LLM will handle that.

Inferred property filtering

We’re building tools for an LLM to interact with a knowledge graph, where the tools serve as an abstraction layer over structured queries. A key feature is the ability to use inferred properties at runtime, similar to an ontology but dynamically computed.

if active is not None:
    operator = ">=" if active else "<"
    filters.append(f"c.end_date {operator} date()")

Here, active acts as a runtime classification, determining whether a contract is ongoing (>= date()) or expired (< date()). This logic extends structured KG queries by computing properties only when needed, enabling more flexible LLM reasoning. By handling logic like this within tools, we ensure the LLM interacts with simplified, intuitive operations, keeping it focused on reasoning rather than query formulation.

Neighbor filtering

Sometimes filtering depends on neighboring nodes, such as restricting results to contracts involving specific parties. The parties attribute is an optional list, and when provided, it ensures only contracts linked to those entities are considered:

if parties:
    parties_filter = []
    for i, party in enumerate(parties):
        party_param_name = f"party_{i}"
        parties_filter.append(
            f"""EXISTS {{
            MATCH (c)<-[:PARTY_TO]-(party)
            WHERE toLower(party.name) CONTAINS ${party_param_name}
        }}"""
        )
        params[party_param_name] = party.lower()

This code filters contracts based on their associated parties, treating the logic as AND, meaning all specified conditions must be met for a contract to be included. It iterates through the provided parties list and constructs a query where each party condition must hold.

For each party, a unique parameter name is generated to avoid conflicts. The EXISTS clause ensures that the contract has a PARTY_TO relationship to a party whose name contains the specified value. The name is converted to lowercase to allow case-insensitive matching. Each party condition is added separately, enforcing an implicit AND between them.

If more complex logic were needed, such as supporting OR conditions or allowing different matching criteria, the input would need to change. Instead of a simple list of party names, a structured input format specifying operators would be required.

Additionally, we could implement a party-matching method that tolerates minor typos, improving the user experience by handling variations in spelling and formatting.
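
One lightweight way to add that tolerance, sketched below, is to resolve the user-supplied party name against the party names already stored in the graph before building the filter; the helper name and cutoff value are illustrative.

from difflib import get_close_matches

def resolve_party_name(user_input: str, known_parties: list[str]) -> str:
    """Map a possibly misspelled party name to the closest known party name."""
    matches = get_close_matches(
        user_input.lower(), [p.lower() for p in known_parties], n=1, cutoff=0.8
    )
    # Fall back to the original input if nothing is close enough
    return matches[0] if matches else user_input.lower()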

Custom operator filtering

To add more flexibility, we can introduce an operator object as a nested attribute, allowing more control over filtering logic. Instead of hardcoding comparisons, we define an enumeration for operators and use it dynamically.

For example, with monetary values, a contract might need to be filtered based on whether its total amount is greater than, less than, or exactly equal to a specified value. Instead of assuming a fixed comparison logic, we define an enum that represents the possible operators:

from enum import Enum

class NumberOperator(str, Enum):
    EQUALS = "="
    GREATER_THAN = ">"
    LESS_THAN = "<"

class MonetaryValue(BaseModel):
    """The total amount or value of a contract"""
    value: float
    operator: NumberOperator

if monetary_value:
    filters.append(f"c.total_amount {monetary_value.operator.value} $total_value")
    params["total_value"] = monetary_value.value

This approach makes the system more expressive. Instead of rigid filtering rules, the tool interface allows the LLM to specify not just a value but how it should be compared, making it easier to handle a broader range of queries while keeping the LLM’s interaction simple and declarative.

Some LLMs struggle with nested objects as inputs, making it harder to handle structured operator-based filtering. Adding a between operator introduces additional complexity since it requires two separate values, which can lead to ambiguity in parsing and input validation.

Min and Max attributes

To keep things simpler, I tend to gravitate toward using min and max attributes for dates, as this naturally supports range filtering and makes the between logic straightforward.

if min_effective_date:
    filters.append("c.effective_date >= date($min_effective_date)")
    params["min_effective_date"] = min_effective_date
if max_effective_date:
    filters.append("c.effective_date <= date($max_effective_date)")
    params["max_effective_date"] = max_effective_date

This function filters contracts based on an effective date range by adding an optional lower and upper bound condition when min_effective_date and max_effective_date are provided, ensuring that only contracts within the specified date range are included.

Semantic search

An attribute can also be used for semantic search, where instead of relying on a vector index upfront, we apply vector search as a post-filtering step after metadata filtering. First, structured filters, like date ranges, monetary values, or parties, are applied to narrow down the candidate set. Then, vector search is performed over this filtered subset to rank results based on semantic similarity.

if summary_search:
    cypher_statement += (
        "WITH c, vector.similarity.cosine(c.embedding, $embedding) "
        "AS score ORDER BY score DESC WITH c, score WHERE score > 0.9 "
    )  # Define a threshold limit
    params["embedding"] = embeddings.embed_query(summary_search)
else:  # Else we sort by latest
    cypher_statement += "WITH c ORDER BY c.effective_date DESC "

This code applies semantic search when summary_search is provided by computing cosine similarity between the contract’s embedding and the query embedding, ordering results by relevance, and filtering out low-scoring matches with a threshold of 0.9. Otherwise, it defaults to sorting contracts by the most recent effective_date.

Dynamic queries

The cypher aggregation attribute is an experiment I wanted to test that gives the LLM a degree of partial text2cypher capability, allowing it to dynamically generate aggregations after the initial structured filtering. Instead of predefining every possible aggregation, this approach lets the LLM specify calculations like counts, averages, or grouped summaries on demand, making queries more flexible and expressive. However, since this shifts more query logic to the LLM, ensuring all generated queries work correctly becomes challenging, as malformed or incompatible Cypher statements can break execution. This trade-off between flexibility and reliability is a key consideration in designing the system.

if cypher_aggregation:
    cypher_statement += """WITH c, c.summary AS summary, c.contract_type AS contract_type, 
      c.contract_scope AS contract_scope, c.effective_date AS effective_date, c.end_date AS end_date,
      [(c)<-[r:PARTY_TO]-(party) | {party: party.name, role: r.role}] AS parties, c.end_date >= date() AS active, c.total_amount as monetary_value, c.file_id AS contract_id,
      apoc.coll.toSet([(c)<-[:PARTY_TO]-(party)-[:LOCATED_IN]->(country) | country.name]) AS countries """
    cypher_statement += cypher_aggregation

If no cypher aggregation is provided, we return the total count of identified contracts along with only five example contracts to avoid overwhelming the prompt. Handling excessive rows is crucial, as an LLM struggling with a massive result set isn’t useful. Additionally, an LLM producing answers with 100 contract titles isn’t a good user experience either.

cypher_statement += """WITH collect(c) AS nodes
RETURN {
    total_count_of_contracts: size(nodes),
    example_values: [
      el in nodes[..5] |
      {summary:el.summary, contract_type:el.contract_type, 
       contract_scope: el.contract_scope, file_id: el.file_id, 
        effective_date: el.effective_date, end_date: el.end_date,
        monetary_value: el.total_amount, contract_id: el.file_id, 
        parties: [(el)<-[r:PARTY_TO]-(party) | {name: party.name, role: r.role}], 
        countries: apoc.coll.toSet([(el)<-[:PARTY_TO]-()-[:LOCATED_IN]->(country) | country.name])}
    ]
} AS output"""

This cypher statement collects all matching contracts into a list, returning the total count and up to five example contracts with key attributes, including summary, type, scope, dates, monetary value, associated parties with roles, and unique country locations.

Now that our contract search tool is built, we hand it off to the LLM, and just like that, we have agentic GraphRAG implemented.
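
With the LangGraph sketch from earlier, handing the tool to the agent and asking a question could look like this (the question is illustrative):

question = "How many active contracts do we have with Modus Media International?"
result = agent.invoke({"messages": [("user", question)]})
print(result["messages"][-1].content)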

Agent Benchmark

If you’re serious about implementing agentic GraphRAG, you need an evaluation dataset, not just as a benchmark but as a foundation for the entire project. A well-constructed dataset helps define the scope of what the system should handle, ensuring that initial development aligns with real-world use cases. Beyond that, it becomes an invaluable tool for evaluating performance, allowing you to measure how well the LLM interacts with the graph, retrieves information, and applies reasoning. It’s also essential for prompt engineering optimizations, letting you iteratively refine queries, tool use, and response formatting with clear feedback rather than guesswork. Without a structured dataset, you’re flying blind, making improvements harder to quantify and inconsistencies more difficult to catch.

The code for the benchmark is available on GitHub.

I have compiled a list of 22 questions that we will use to evaluate the system. Additionally, we are going to introduce a new metric called answer_satisfaction, for which we provide a custom prompt.

answer_satisfaction = AspectCritic(
    name="answer_satisfaction",
    definition="""You will evaluate an ANSWER to a legal QUESTION based on a provided SOLUTION.

Rate the answer on a scale from 0 to 1, where:
- 0 = incorrect, substantially incomplete, or misleading
- 1 = correct and sufficiently complete

Consider these evaluation criteria:
1. Factual correctness is paramount - the answer must not contradict the solution
2. The answer must address the core elements of the solution
3. Additional relevant information beyond the solution is acceptable and may enhance the answer
4. Technical legal terminology should be used appropriately if present in the solution
5. For quantitative legal analyses, accurate figures must be provided

"""
    + fewshots,  # few-shot examples appended to the definition (defined elsewhere)
)

Many questions can return a large amount of information. For example, asking for contracts signed before 2020 might yield hundreds of results. Since the LLM receives both the total count and a few example entries, our evaluation should focus on the total count, rather than which specific examples the LLM chooses to show.
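
A rough sketch of plugging this metric into a Ragas evaluation run, assuming the Ragas 0.2-style API, a benchmark list of question/solution pairs, and a hypothetical ask_agent helper that returns the agent’s answer, might look like this:

from langchain_openai import ChatOpenAI
from ragas import EvaluationDataset, evaluate
from ragas.llms import LangchainLLMWrapper

# `benchmark` is assumed to be a list of {"question": ..., "solution": ...} dicts,
# and `ask_agent` a hypothetical helper that runs the agent and returns its answer
samples = [
    {
        "user_input": item["question"],
        "response": ask_agent(item["question"]),
        "reference": item["solution"],
    }
    for item in benchmark
]
results = evaluate(
    dataset=EvaluationDataset.from_list(samples),
    metrics=[answer_satisfaction],
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o")),
)
print(results)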

Benchmark results.

The provided results indicate that all evaluated models (Gemini 1.5 Pro, Gemini 2.0 Flash, and GPT-4o) perform similarly well for most tool calls, with GPT-4o slightly outperforming the Gemini models (0.82 vs. 0.77). The noticeable difference emerges primarily when partial text2cypher is used, particularly for various aggregation operations.

Note that these are only 22 fairly simple questions, so we didn’t really explore the reasoning capabilities of LLMs.

Additionally, I’ve seen projects where accuracy can be improved significantly by leveraging Python for aggregations, as LLMs typically handle Python code generation and execution better than generating complex Cypher queries directly.
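
As a small illustration of that idea, the tool could return plain rows from a fixed Cypher query and let generated Python handle the aggregation; the row structure below is assumed for the example.

from collections import Counter

# `rows` is assumed to be a list of dicts returned by a simple, fixed Cypher query,
# e.g. [{"contract_type": "Service", "total_amount": 120000.0}, ...]
counts = Counter(row["contract_type"] for row in rows)
totals = {
    ctype: sum(r["total_amount"] or 0 for r in rows if r["contract_type"] == ctype)
    for ctype in counts
}
print(counts.most_common(5))
print(totals)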

Web Application

We’ve also built a simple React web application, powered by LangGraph hosted on FastAPI, which streams responses directly to the frontend. Special thanks to Anej Gorkic for creating the web app.

You can launch the entire stack with the following command:

docker compose up

and navigate to localhost:5173 in your browser.

Summary

As LLMs gain stronger reasoning capabilities, they can, when paired with the right tools, become powerful agents for navigating complex domains like legal contracts. In this post, we’ve only scratched the surface, focusing on core contract attributes while barely touching the rich variety of clauses found in real-world agreements. There’s significant room for growth, from expanding clause coverage to refining tool design and interaction strategies.

The code is available on GitHub.

Images

All images in this post were created by the author.

Talk to Videos

Developing an interactive AI application for video-based learning in education and business

Large language models (LLMs) are improving in efficiency and are now able to understand different data formats, offering possibilities for a myriad of applications across different domains. Initially, LLMs were inherently able to process only text. The image understanding feature was integrated by coupling an LLM with another image encoding model. However, gpt-4o was trained on both text and images and is the first true multimodal LLM that can understand both text and images. Other modalities such as audio are integrated into modern LLMs through other AI models, e.g., OpenAI’s Whisper models.

LLMs are now being used more as information processors where they can process data in different formats. Integrating multiple modalities into LLMs opens areas of numerous applications in education, business, and other sectors. One such application is the processing of educational videos, documentaries, webinars, presentations, business meetings, lectures, and other content using LLMs and interacting with this content more naturally. The audio modality in these videos contains rich information that could be used in a number of applications. In educational settings, it can be used for personalized learning, enhancing accessibility for students with special needs, study aid creation, remote learning support without requiring a teacher’s presence to understand content, and assessing students’ knowledge about a topic. In business settings, it can be used for training new employees with onboarding videos, extracting and generating knowledge from recorded meetings and presentations, creating customized learning materials from product demonstration videos, and extracting insights from recorded industry conferences without watching hours of videos, to name a few.

This article discusses the development of an application to interact with videos in a natural way and create learning content from them. The application has the following features:

  • It takes an input video either through a URL or from a local path and extracts audio from the video
  • Transcribes the audio using OpenAI’s state-of-the-art model gpt-4o-transcribe, which has demonstrated improved Word Error Rate (WER) performance over existing Whisper models across multiple established benchmarks
  • Creates a vector store of the transcript and develops a retrieval-augmented generation (RAG) pipeline to establish a conversation with the video transcript (a rough sketch follows this list)
  • Responds to users’ questions in text and speech using different voices, selectable from the application’s UI
  • Creates learning content such as:
    • A hierarchical representation of the video contents to provide users with quick insights into the main concepts and supporting details
    • Quizzes that transform passive video watching into active learning by challenging users to recall and apply information presented in the video
    • Flashcards from the video content that support active recall and spaced repetition learning techniques
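
As a rough sketch of the vector-store step above, a transcript can be chunked and indexed for retrieval along these lines; the splitter settings, the FAISS vector store, and the embedding model are assumptions for illustration, and the application’s actual implementation is in the repository.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def build_transcript_index(transcript: str) -> FAISS:
    """Split the transcript into overlapping chunks and index them for retrieval."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    chunks = splitter.split_text(transcript)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    return FAISS.from_texts(chunks, embeddings)

# retriever = build_transcript_index(transcript).as_retriever(search_kwargs={"k": 4})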

The entire workflow of the application is shown in the following figure.

Application workflow (image by author)

The whole codebase, along with detailed instructions for installation and usage, is available on GitHub.

Here is the structure of the GitHub repository. The main Streamlit application implements the GUI interface and calls several other functions from other feature and helper modules (.py files).

GitHub code structure (image by author)

In addition, you can visualize the codebase by opening the “codebase visualization” HTML file in a browser, which describes the structures of each module.

Codebase visualization (image by author)

Let’s delve into the step-by-step development of this application. I will not discuss the entire code, only its major parts. The whole code in the GitHub repository is adequately commented.

Video Input and Processing

Video input and processing logic are implemented in transcriber.py. When the application loads, it verifies whether FFmpeg is present (verify_ffmpeg) in the application’s root directory. FFmpeg is required for downloading a video (if the input is a URL) and extracting audio from the video, which is then used to create a transcript.

def verify_ffmpeg():
    """Verify that FFmpeg is available and print its location."""
    # Add FFmpeg to PATH
    os.environ['PATH'] = FFMPEG_LOCATION + os.pathsep + os.environ['PATH']
    # Check if FFmpeg binaries exist
    ffmpeg_path = os.path.join(FFMPEG_LOCATION, 'ffmpeg.exe')
    ffprobe_path = os.path.join(FFMPEG_LOCATION, 'ffprobe.exe')
    if not os.path.exists(ffmpeg_path):
        raise FileNotFoundError(f"FFmpeg executable not found at: {ffmpeg_path}")
    if not os.path.exists(ffprobe_path):
        raise FileNotFoundError(f"FFprobe executable not found at: {ffprobe_path}")
    print(f"FFmpeg found at: {ffmpeg_path}")
    print(f"FFprobe found at: {ffprobe_path}")
    # Try to execute FFmpeg to make sure it works
    try:
        # Add shell=True for Windows and capture errors properly
        result = subprocess.run([ffmpeg_path, '-version'], 
                               stdout=subprocess.PIPE, 
                               stderr=subprocess.PIPE,
                               shell=True,  # This can help with permission issues on Windows
                               check=False)
        if result.returncode == 0:
            print(f"FFmpeg version: {result.stdout.decode().splitlines()[0]}")
        else:
            error_msg = result.stderr.decode()
            print(f"FFmpeg error: {error_msg}")
            # Check for specific permission errors
            if "Access is denied" in error_msg:
                print("Permission error detected. Trying alternative approach...")
                # Try an alternative approach - just check file existence without execution
                if os.path.exists(ffmpeg_path) and os.path.exists(ffprobe_path):
                    print("FFmpeg files exist but execution test failed due to permissions.")
                    print("WARNING: The app may fail when trying to process videos.")
                    # Return paths anyway and hope for the best when actually used
                    return ffmpeg_path, ffprobe_path
                
            raise RuntimeError(f"FFmpeg execution failed: {error_msg}")
    except Exception as e:
        print(f"Error checking FFmpeg: {e}")
        # Fallback option if verification fails but files exist
        if os.path.exists(ffmpeg_path) and os.path.exists(ffprobe_path):
            print("WARNING: FFmpeg files exist but verification failed.")
            print("Attempting to continue anyway, but video processing may fail.")
            return ffmpeg_path, ffprobe_path 
        raise
    return ffmpeg_path, ffprobe_path

The video input is in the form of a URL (for instance, a YouTube URL) or a local file path. The process_video function determines the input type and routes it accordingly. If the input is a URL, the helper functions get_video_info and get_video_id extract video metadata (title, description, duration) without downloading it, using the yt_dlp package.

#Function to determine the input type and route it appropriately
def process_video(youtube_url, output_dir, api_key, model="gpt-4o-transcribe"):
    """
    Process a YouTube video to generate a transcript
    Wrapper function that combines download and transcription
    Args:
        youtube_url: URL of the YouTube video
        output_dir: Directory to save the output
        api_key: OpenAI API key
        model: The model to use for transcription (default: gpt-4o-transcribe)
    Returns:
        dict: Dictionary containing transcript and file paths
    """
    # First download the audio
    print("Downloading video...")
    audio_path = process_video_download(youtube_url, output_dir)
    
    print("Transcribing video...")
    # Then transcribe the audio
    transcript, transcript_path = process_video_transcribe(audio_path, output_dir, api_key, model=model)
    
    # Return the combined results
    return {
        'transcript': transcript,
        'transcript_path': transcript_path,
        'audio_path': audio_path
    }

def get_video_info(youtube_url):
    """Get video information without downloading."""
    # Check local cache first
    global _video_info_cache
    if youtube_url in _video_info_cache:
        return _video_info_cache[youtube_url]
        
    # Extract info if not cached
    with yt_dlp.YoutubeDL() as ydl:
        info = ydl.extract_info(youtube_url, download=False)
        # Cache the result
        _video_info_cache[youtube_url] = info
        # Also cache the video ID separately
        _video_id_cache[youtube_url] = info.get('id', 'video')
        return info

def get_video_id(youtube_url):
    """Get just the video ID without re-extracting if already known."""
    global _video_id_cache
    if youtube_url in _video_id_cache:
        return _video_id_cache[youtube_url]
    
    # If not in cache, extract from URL directly if possible
    if "v=" in youtube_url:
        video_id = youtube_url.split("v=")[1].split("&")[0]
        _video_id_cache[youtube_url] = video_id
        return video_id
    elif "youtu.be/" in youtube_url:
        video_id = youtube_url.split("youtu.be/")[1].split("?")[0]
        _video_id_cache[youtube_url] = video_id
        return video_id
    
    # If we can't extract directly, fall back to full info extraction
    info = get_video_info(youtube_url)
    video_id = info.get('id', 'video')
    return video_id

After the video input is given, the code in app.py checks whether a transcript for the input video already exists (in the case of URL input). This is done by calling the following two helper functions from transcriber.py.

def get_transcript_path(youtube_url, output_dir):
    """Get the expected transcript path for a given YouTube URL."""
    # Get video ID with caching
    video_id = get_video_id(youtube_url)
    # Return expected transcript path
    return os.path.join(output_dir, f"{video_id}_transcript.txt")

def transcript_exists(youtube_url, output_dir):
    """Check if a transcript already exists for this video."""
    transcript_path = get_transcript_path(youtube_url, output_dir)
    return os.path.exists(transcript_path)

If transcript_exists finds an existing transcript, the next step is to create the vector store for the RAG. If no existing transcript is found, the next step is to download audio from the URL and convert it to a standard audio format. The function process_video_download downloads audio from the URL using yt_dlp (with FFmpeg for audio extraction) and converts it to .mp3 format. If the input is a local video file, app.py proceeds to convert it to an .mp3 file.

def process_video_download(youtube_url, output_dir):
    """
    Download audio from a YouTube video
    Args:
        youtube_url: URL of the YouTube video
        output_dir: Directory to save the output
        
    Returns:
        str: Path to the downloaded audio file
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Extract video ID from URL
    video_id = None
    if "v=" in youtube_url:
        video_id = youtube_url.split("v=")[1].split("&")[0]
    elif "youtu.be/" in youtube_url:
        video_id = youtube_url.split("youtu.be/")[1].split("?")[0]
    else:
        raise ValueError("Could not extract video ID from URL")
    # Set output paths
    audio_path = os.path.join(output_dir, f"{video_id}.mp3")
    
    # Configure yt-dlp options
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': os.path.join(output_dir, f"{video_id}"),
        'quiet': True
    }
    
    # Download audio
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])
    
    # Verify audio file exists
    if not os.path.exists(audio_path):
        # Try with an extension that yt-dlp might have used
        potential_paths = [
            os.path.join(output_dir, f"{video_id}.mp3"),
            os.path.join(output_dir, f"{video_id}.m4a"),
            os.path.join(output_dir, f"{video_id}.webm")
        ]
        
        for path in potential_paths:
            if os.path.exists(path):
                # Convert to mp3 if it's not already
                if not path.endswith('.mp3'):
                    ffmpeg_path = verify_ffmpeg()[0]
                    output_mp3 = os.path.join(output_dir, f"{video_id}.mp3")
                    subprocess.run([
                        ffmpeg_path, '-i', path, '-c:a', 'libmp3lame', 
                        '-q:a', '2', output_mp3, '-y'
                    ], check=True, capture_output=True)
                    os.remove(path)  # Remove the original file
                    audio_path = output_mp3
                else:
                    audio_path = path
                break
        else:
            raise FileNotFoundError(f"Could not find downloaded audio file for video {video_id}")
    return audio_path

Audio Transcription Using OpenAI’s gpt-4o-transcribe Model

After extracting audio and converting it to a standard audio format, the next step is to transcribe the audio to text. For this purpose, I used OpenAI’s newly launched gpt-4o-transcribe speech-to-text model, accessible through OpenAI’s speech-to-text API. This model has outperformed OpenAI’s Whisper models in terms of both transcription accuracy and robust language coverage.

The function process_video_transcribe in transcriber.py receives the converted audio file and interfaces with the gpt-4o-transcribe model through OpenAI’s speech-to-text API. The gpt-4o-transcribe model currently has an audio file size limit of 25MB and a duration limit of 1,500 seconds. To overcome this limitation, I split longer files into multiple chunks and transcribe these chunks separately. The process_video_transcribe function checks whether the input file exceeds the size and/or duration limit. If either threshold is exceeded, it calls the split_and_transcribe function, which first calculates the number of chunks needed based on both size and duration and takes the maximum of these two as the final number of chunks for transcription. It then finds the start and end times for each chunk and extracts these chunks from the audio file. Subsequently, it transcribes each chunk using the gpt-4o-transcribe model with OpenAI’s speech-to-text API and then combines the transcripts of all chunks to generate the final transcript.

def process_video_transcribe(audio_path, output_dir, api_key, progress_callback=None, model="gpt-4o-transcribe"):
    """
    Transcribe an audio file using OpenAI API, with automatic chunking for large files
    Always uses the selected model, with no fallback
    
    Args:
        audio_path: Path to the audio file
        output_dir: Directory to save the transcript
        api_key: OpenAI API key
        progress_callback: Function to call with progress updates (0-100)
        model: The model to use for transcription (default: gpt-4o-transcribe)
        
    Returns:
        tuple: (transcript text, transcript path)
    """
    # Extract video ID from audio path
    video_id = os.path.basename(audio_path).split('.')[0]
    transcript_path = os.path.join(output_dir, f"{video_id}_transcript.txt")
    
    # Setup OpenAI client
    client = OpenAI(api_key=api_key)
    
    # Update progress
    if progress_callback:
        progress_callback(10)
    
    # Get file size in MB
    file_size_mb = os.path.getsize(audio_path) / (1024 * 1024)
    
    # Universal chunking thresholds - apply to both models
    max_size_mb = 25  # 25MB chunk size for both models
    max_duration_seconds = 1500  # 1500 seconds chunk duration for both models
    
    # Load the audio file to get its duration
    try:
        audio = AudioSegment.from_file(audio_path)
        duration_seconds = len(audio) / 1000  # pydub uses milliseconds
    except Exception as e:
        print(f"Error loading audio to check duration: {e}")
        audio = None
        duration_seconds = 0
    
    # Determine if chunking is needed
    needs_chunking = False
    chunking_reason = []
    
    if file_size_mb > max_size_mb:
        needs_chunking = True
        chunking_reason.append(f"size ({file_size_mb:.2f}MB exceeds {max_size_mb}MB)")
    
    if duration_seconds > max_duration_seconds:
        needs_chunking = True
        chunking_reason.append(f"duration ({duration_seconds:.2f}s exceeds {max_duration_seconds}s)")
    
    # Log the decision
    if needs_chunking:
        reason_str = " and ".join(chunking_reason)
        print(f"Audio needs chunking due to {reason_str}. Using {model} for transcription.")
    else:
        print(f"Audio file is within limits. Using {model} for direct transcription.")
    
    # Check if file needs chunking
    if needs_chunking:
        if progress_callback:
            progress_callback(15)
        
        # Split the audio file into chunks and transcribe each chunk using the selected model only
        full_transcript = split_and_transcribe(
            audio_path, client, model, progress_callback, 
            max_size_mb, max_duration_seconds, audio
        )
    else:
        # File is small enough, transcribe directly with the selected model
        with open(audio_path, "rb") as audio_file:
            if progress_callback:
                progress_callback(30)
                
            transcript_response = client.audio.transcriptions.create(
                model=model, 
                file=audio_file
            )
            
            if progress_callback:
                progress_callback(80)
            
            full_transcript = transcript_response.text
    
    # Save transcript to file
    with open(transcript_path, "w", encoding="utf-8") as f:
        f.write(full_transcript)
    
    # Update progress
    if progress_callback:
        progress_callback(100)
    
    return full_transcript, transcript_path

def split_and_transcribe(audio_path, client, model, progress_callback=None, 
                         max_size_mb=25, max_duration_seconds=1500, audio=None):
    """
    Split an audio file into chunks and transcribe each chunk 
    
    Args:
        audio_path: Path to the audio file
        client: OpenAI client
        model: Model to use for transcription (will not fall back to other models)
        progress_callback: Function to call with progress updates
        max_size_mb: Maximum file size in MB
        max_duration_seconds: Maximum duration in seconds
        audio: Pre-loaded AudioSegment (optional)
        
    Returns:
        str: Combined transcript from all chunks
    """
    # Load the audio file if not provided
    if audio is None:
        audio = AudioSegment.from_file(audio_path)
    
    # Get audio duration in seconds
    duration_seconds = len(audio) / 1000
    
    # Calculate the number of chunks needed based on both size and duration
    file_size_mb = os.path.getsize(audio_path) / (1024 * 1024)
    
    chunks_by_size = math.ceil(file_size_mb / (max_size_mb * 0.9))  # Use 90% of max to be safe
    chunks_by_duration = math.ceil(duration_seconds / (max_duration_seconds * 0.95))  # Use 95% of max to be safe
    num_chunks = max(chunks_by_size, chunks_by_duration)
    
    print(f"Splitting audio into {num_chunks} chunks based on size ({chunks_by_size}) and duration ({chunks_by_duration})")
    
    # Calculate chunk duration in milliseconds
    chunk_length_ms = len(audio) // num_chunks
    
    # Create temp directory for chunks if it doesn't exist
    temp_dir = os.path.join(os.path.dirname(audio_path), "temp_chunks")
    os.makedirs(temp_dir, exist_ok=True)
    
    # Split the audio into chunks and transcribe each chunk
    transcripts = []
    
    for i in range(num_chunks):
        if progress_callback:
            # Update progress: 20% for splitting, 60% for transcribing
            progress_percent = 20 + int((i / num_chunks) * 60)
            progress_callback(progress_percent)
        
        # Calculate start and end times for this chunk
        start_ms = i * chunk_length_ms
        end_ms = min((i + 1) * chunk_length_ms, len(audio))
        
        # Extract the chunk
        chunk = audio[start_ms:end_ms]
        
        # Save the chunk to a temporary file
        chunk_path = os.path.join(temp_dir, f"chunk_{i}.mp3")
        chunk.export(chunk_path, format="mp3")
        
        # Log chunk information
        chunk_size_mb = os.path.getsize(chunk_path) / (1024 * 1024)
        chunk_duration = len(chunk) / 1000
        print(f"Chunk {i+1}/{num_chunks}: {chunk_size_mb:.2f}MB, {chunk_duration:.2f}s")
        
        # Transcribe the chunk 
        try:
            with open(chunk_path, "rb") as chunk_file:
                transcript_response = client.audio.transcriptions.create(
                    model=model,
                    file=chunk_file
                )
                
                # Add to our list of transcripts
                transcripts.append(transcript_response.text)
        except Exception as e:
            print(f"Error transcribing chunk {i+1} with {model}: {e}")
            # Add a placeholder for the failed chunk
            transcripts.append(f"[Transcription failed for segment {i+1}]")
        
        # Clean up the temporary chunk file
        os.remove(chunk_path)
    
    # Clean up the temporary directory
    try:
        os.rmdir(temp_dir)
    except OSError:
        print(f"Note: Could not remove temporary directory {temp_dir}")
    
    # Combine all transcripts with proper spacing
    full_transcript = " ".join(transcripts)
    
    return full_transcript
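For reference, a call to the transcription function above might look like the following. This is only a sketch: the function name transcribe_audio, the file paths, and the progress callback are illustrative assumptions, since the actual wiring lives elsewhere in the app.

import os

# Hypothetical call; the function name and paths are assumptions for illustration
transcript, transcript_path = transcribe_audio(
    audio_path="downloads/my_video.mp3",
    output_dir="transcripts",
    api_key=os.environ.get("OPENAI_API_KEY"),
    progress_callback=lambda pct: print(f"Transcription progress: {pct}%"),
    model="gpt-4o-transcribe",
)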

The following screenshot of the Streamlit app shows the video processing and transcription workflow for one of my webinars, “Integrating LLMs into Business,” available on my YouTube channel.

Snapshot of the Streamlit app showing the process of extracting audio and transcribing (image by author)

Retrieval Augmented Generation (RAG) for Interactive Conversations

After generating the video transcript, the application builds a RAG system to facilitate both text- and speech-based interactions. The conversational intelligence is implemented through the VideoRAG class in rag_system.py, which initializes the chunk size and overlap, OpenAI embeddings, a ChatOpenAI instance that generates responses with the gpt-4o model, and a ConversationBufferMemory to maintain chat history for contextual continuity.

The create_vector_store method splits the transcript into chunks and creates a vector store using the FAISS vector database. The handle_question_submission method processes text questions and appends each new question and its answer to the conversation history. The handle_speech_input function implements the complete voice-to-text-to-voice pipeline: it records the question audio, transcribes the question, processes the query through the RAG system, and synthesizes speech for the response.

class VideoRAG:
    def __init__(self, api_key=None, chunk_size=1000, chunk_overlap=200):
        """Initialize the RAG system with OpenAI API key."""
        # Use provided API key or fall back to Streamlit secrets
        self.api_key = api_key if api_key else st.secrets["OPENAI_API_KEY"]
        if not self.api_key:
            raise ValueError("OpenAI API key is required either as parameter or environment variable")
            
        self.embeddings = OpenAIEmbeddings(openai_api_key=self.api_key)
        self.llm = ChatOpenAI(
            openai_api_key=self.api_key,
            model="gpt-4o",
            temperature=0
        )
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.vector_store = None
        self.chain = None
        self.memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True
        )
    
    def create_vector_store(self, transcript):
        """Create a vector store from the transcript."""
        # Split the text into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=["\n\n", "\n", " ", ""]
        )
        chunks = text_splitter.split_text(transcript)
        
        # Create vector store
        self.vector_store = FAISS.from_texts(chunks, self.embeddings)
        
        # Create prompt template for the RAG system
        system_template = """You are a specialized AI assistant that answers questions about a specific video. 
        
        You have access to snippets from the video transcript, and your role is to provide accurate information ONLY based on these snippets.
        
        Guidelines:
        1. Only answer questions based on the information provided in the context from the video transcript, otherwise say that "I don't know. The video doesn't cover that information."
        2. The question may ask you to summarize the video or tell what the video is about. In that case, present a summary of the context. 
        3. Don't make up information or use knowledge from outside the provided context
        4. Keep your answers concise and directly related to the question
        5. If asked about your capabilities or identity, explain that you're an AI assistant that specializes in answering questions about this specific video
        
        Context from the video transcript:
        {context}
        
        Chat History:
        {chat_history}
        """
        user_template = "{question}"
        
        # Create the messages for the chat prompt
        messages = [
            SystemMessagePromptTemplate.from_template(system_template),
            HumanMessagePromptTemplate.from_template(user_template)
        ]
        
        # Create the chat prompt
        qa_prompt = ChatPromptTemplate.from_messages(messages)
        
        # Initialize the RAG chain with the custom prompt
        self.chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.vector_store.as_retriever(
                search_kwargs={"k": 5}
            ),
            memory=self.memory,
            combine_docs_chain_kwargs={"prompt": qa_prompt},
            verbose=True
        )
        
        return len(chunks)
    
    def set_chat_history(self, chat_history):
        """Set chat history from external session state."""
        if not self.memory:
            return
            
        # Clear existing memory
        self.memory.clear()
        
        # Convert standard chat history format to LangChain message format
        for message in chat_history:
            if message["role"] == "user":
                self.memory.chat_memory.add_user_message(message["content"])
            elif message["role"] == "assistant":
                self.memory.chat_memory.add_ai_message(message["content"])
    
    def ask(self, question, chat_history=None):
        """Ask a question to the RAG system."""
        if not self.chain:
            raise ValueError("Vector store not initialized. Call create_vector_store first.")
        
        # If chat history is provided, update the memory
        if chat_history:
            self.set_chat_history(chat_history)
        
        # Get response
        response = self.chain.invoke({"question": question})
        return response["answer"]
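The handle_question_submission and handle_speech_input functions mentioned above live in the app code and are not reproduced in this post. As a rough idea, the speech pipeline could be sketched as follows, where record_audio, transcribe_audio_bytes, and synthesize_speech are hypothetical helpers standing in for the recording, transcription, and text-to-speech steps:

import streamlit as st

def handle_speech_input(rag_system):
    """Voice-to-text-to-voice loop: record, transcribe, answer, speak (sketch only)."""
    audio_bytes = record_audio()                    # hypothetical recording helper
    question = transcribe_audio_bytes(audio_bytes)  # hypothetical speech-to-text helper
    answer = rag_system.ask(question, chat_history=st.session_state.get("chat_history", []))

    # Keep the conversation history in session state for contextual continuity
    st.session_state.setdefault("chat_history", []).extend([
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ])

    # Play the synthesized answer back to the user
    st.audio(synthesize_speech(answer))             # hypothetical text-to-speech helper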

See the following snapshot of the Streamlit app, showing the interactive conversation interface with the video.

Snapshot showing conversational interface and interactive learning content (image by author)

The following snapshot shows a conversation with the video with speech input and text+speech output.

Conversation with video (image by author)

Feature Generation

The application generates three features: hierarchical summary, quiz, and flashcards. Please refer to their respective commented codes in the GitHub repo.

The SummaryGenerator class in summary.py provides structured content summarization by creating a hierarchical representation of the video content to provide users with quick insights into the main concepts and supporting details. The system retrieves key contextual segments from the transcript using RAG. Using a prompt (see generate_summary), it creates a hierarchical summary with three levels: main points, sub-points, and additional details. The create_summary_popup_html method transforms the generated summary into an interactive visual representation using CSS and JavaScript.

# summary.py
class SummaryGenerator:
    def __init__(self):
        pass
    
    def generate_summary(self, rag_system, api_key, model="gpt-4o", temperature=0.2):
        """
        Generate a hierarchical bullet-point summary from the video transcript
        
        Args:
            rag_system: The RAG system with vector store
            api_key: OpenAI API key
            model: Model to use for summary generation
            temperature: Creativity level (0.0-1.0)
            
        Returns:
            str: Hierarchical bullet-point summary text
        """
        if not rag_system:
            st.error("Please transcribe the video first before creating a summary!")
            return ""
        
        with st.spinner("Generating hierarchical summary..."):
            # Create LLM for summary generation
            summary_llm = ChatOpenAI(
                openai_api_key=api_key,
                model=model,
                temperature=temperature  # Lower temperature for more factual summaries
            )
            
            # Use the RAG system to get relevant context
            try:
                # Get broader context since we're summarizing the whole video
                relevant_docs = rag_system.vector_store.similarity_search(
                    "summarize the main points of this video", k=10
                )
                context = "\n\n".join([doc.page_content for doc in relevant_docs])
                
                prompt = """Based on the video transcript, create a hierarchical bullet-point summary of the content.
                Structure your summary with exactly these levels:
                
                • Main points (use • or * at the start of the line for these top-level points)
                  - Sub-points (use - at the start of the line for these second-level details)
                    * Additional details (use spaces followed by * for third-level points)
                
                For example:
                • First main point
                  - Important detail about the first point
                  - Another important detail
                    * A specific example
                    * Another specific example
                • Second main point
                  - Detail about second point
                
                Be consistent with the exact formatting shown above. Each bullet level must start with the exact character shown (• or *, -, and spaces+*).
                Create 3-5 main points with 2-4 sub-points each, and add third-level details where appropriate.
                Focus on the most important information from the video.
                """
                
                # Use the LLM with context to generate the summary
                messages = [
                    {"role": "system", "content": f"You are given the following context from a video transcript:\n\n{context}\n\nUse this context to create a hierarchical summary according to the instructions."},
                    {"role": "user", "content": prompt}
                ]
                
                response = summary_llm.invoke(messages)
                return response.content
            except Exception as e:
                # Fallback to the regular RAG system if there's an error.
                # A literal prompt is used here so the fallback also works if the
                # error occurred before `prompt` was defined in the try block above.
                st.warning(f"Using standard summary generation due to error: {str(e)}")
                return rag_system.ask(
                    "Create a hierarchical bullet-point summary of the main points of this video."
                )
    
    def create_summary_popup_html(self, summary_content):
        """
        Create HTML for the summary popup with properly formatted hierarchical bullets
        
        Args:
            summary_content: Raw summary text with markdown bullet formatting
            
        Returns:
            str: HTML for the popup with properly formatted bullets
        """
        # Instead of relying on markdown conversion, let's manually parse and format the bullet points
        lines = summary_content.strip().split('\n')
        formatted_html = []
        
        in_list = False
        list_level = 0
        
        for line in lines:
            line = line.strip()
            
            # Skip empty lines
            if not line:
                continue
                
            # Detect if this is a markdown header
            if line.startswith('# '):
                if in_list:
                    # Close any open lists
                    for _ in range(list_level):
                        formatted_html.append('</ul>')
                    in_list = False
                    list_level = 0
                formatted_html.append(f'<h1>{line[2:]}</h1>')
                continue
                
            # Check line for bullet point markers
            if line.startswith('• ') or line.startswith('* '):
                # Top level bullet
                content = line[2:].strip()
                
                if not in_list:
                    # Start a new list
                    formatted_html.append('<ul class="top-level">')
                    in_list = True
                    list_level = 1
                elif list_level > 1:
                    # Close nested lists to get back to top level
                    for _ in range(list_level - 1):
                        formatted_html.append('</ul></li>')
                    list_level = 1
                else:
                    # Close previous list item if needed
                    if formatted_html and not formatted_html[-1].endswith('</ul></li>') and in_list:
                        formatted_html.append('</li>')
                        
                formatted_html.append(f'<li class="top-level-item">{content}')
                
            elif line.startswith('- '):
                # Second level bullet
                content = line[2:].strip()
                
                if not in_list:
                    # Start new lists
                    formatted_html.append('<ul class="top-level"><li class="top-level-item">Second level items')
                    formatted_html.append('<ul class="second-level">')
                    in_list = True
                    list_level = 2
                elif list_level == 1:
                    # Add a nested list
                    formatted_html.append('<ul class="second-level">')
                    list_level = 2
                elif list_level > 2:
                    # Close deeper nested lists to get to second level
                    for _ in range(list_level - 2):
                        formatted_html.append('</ul></li>')
                    list_level = 2
                else:
                    # Close previous list item if needed
                    if formatted_html and not formatted_html[-1].endswith('</ul></li>') and list_level == 2:
                        formatted_html.append('</li>')
                        
                formatted_html.append(f'<li class="second-level-item">{content}')
                
            elif line.startswith('  * ') or line.startswith('    * '):
                # Third level bullet
                content = line.strip()[2:].strip()
                
                if not in_list:
                    # Start new lists (all levels)
                    formatted_html.append('<ul class="top-level"><li class="top-level-item">Top level')
                    formatted_html.append('<ul class="second-level"><li class="second-level-item">Second level')
                    formatted_html.append('<ul class="third-level">')
                    in_list = True
                    list_level = 3
                elif list_level == 2:
                    # Add a nested list
                    formatted_html.append('<ul class="third-level">')
                    list_level = 3
                elif list_level < 3:
                    # We missed a level, adjust
                    formatted_html.append('<li>Missing level</li>')
                    formatted_html.append('<ul class="third-level">')
                    list_level = 3
                else:
                    # Close previous list item if needed
                    if formatted_html and not formatted_html[-1].endswith('</ul></li>') and list_level == 3:
                        formatted_html.append('</li>')
                        
                formatted_html.append(f'<li class="third-level-item">{content}')
            else:
                # Regular paragraph
                if in_list:
                    # Close any open lists
                    for _ in range(list_level):
                        formatted_html.append('</ul>')
                        if list_level > 1:
                            formatted_html.append('</li>')
                    in_list = False
                    list_level = 0
                formatted_html.append(f'<p>{line}</p>')
        
        # Close any open lists
        if in_list:
            # Close final item
            formatted_html.append('</li>')
            # Close any open lists
            for _ in range(list_level):
                if list_level > 1:
                    formatted_html.append('</ul></li>')
                else:
                    formatted_html.append('</ul>')
        
        summary_html = '\n'.join(formatted_html)
        
        html = f"""
        <div id="summary-popup" class="popup-overlay">
            <div class="popup-content">
                <div class="popup-header">
                    <h2>Hierarchical Summary</h2>
                    <button onclick="closeSummaryPopup()" class="close-button">×</button>
                </div>
                <div class="popup-body">
                    {summary_html}
                </div>
            </div>
        </div>
        
        <style>
        .popup-overlay {{
            position: fixed;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background-color: rgba(0, 0, 0, 0.5);
            z-index: 1000;
            display: flex;
            justify-content: center;
            align-items: center;
        }}
        
        .popup-content {{
            background-color: white;
            padding: 20px;
            border-radius: 10px;
            width: 80%;
            max-width: 800px;
            max-height: 80vh;
            overflow-y: auto;
            box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);
        }}
        
        .popup-header {{
            display: flex;
            justify-content: space-between;
            align-items: center;
            border-bottom: 1px solid #ddd;
            padding-bottom: 10px;
            margin-bottom: 15px;
        }}
        
        .close-button {{
            background: none;
            border: none;
            font-size: 24px;
            cursor: pointer;
            color: #555;
        }}
        
        .close-button:hover {{
            color: #000;
        }}
        
        /* Style for hierarchical bullets */
        .popup-body ul {{
            padding-left: 20px;
            margin-bottom: 5px;
        }}
        
        .popup-body ul.top-level {{
            list-style-type: disc;
        }}
        
        .popup-body ul.second-level {{
            list-style-type: circle;
            margin-top: 5px;
        }}
        
        .popup-body ul.third-level {{
            list-style-type: square;
            margin-top: 3px;
        }}
        
        .popup-body li.top-level-item {{
            margin-bottom: 12px;
            font-weight: bold;
        }}
        
        .popup-body li.second-level-item {{
            margin-bottom: 8px;
            font-weight: normal;
        }}
        
        .popup-body li.third-level-item {{
            margin-bottom: 5px;
            font-weight: normal;
            font-size: 0.95em;
        }}
        
        .popup-body p {{
            margin-bottom: 10px;
        }}
        </style>
        
        <script>
        function closeSummaryPopup() {{
            document.getElementById('summary-popup').style.display = 'none';
            
            // Send message to Streamlit
            window.parent.postMessage({{
                type: "streamlit:setComponentValue",
                value: true
            }}, "*");
        }}
        </script>
        """
        return html
Hierarchical summary (image by author)
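Wiring the summary generator into the Streamlit page could look roughly like this. It is a minimal sketch: the component height and the assumption that rag_system and api_key are already in scope are mine, not the app's exact code.

import streamlit.components.v1 as components

summary_gen = SummaryGenerator()
summary_text = summary_gen.generate_summary(rag_system, api_key)
if summary_text:
    # Render the popup HTML (with its CSS and JS) inside the Streamlit app
    popup_html = summary_gen.create_summary_popup_html(summary_text)
    components.html(popup_html, height=600, scrolling=True)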

The Talk-to-Videos app generates quizzes from the video through the QuizGenerator class in quiz.py. The quiz generator creates multiple-choice questions targeting specific facts and concepts presented in the video. Unlike the RAG system, where I use a temperature of zero, I increased the LLM temperature to 0.4 to encourage some creativity in quiz generation. A structured prompt guides the quiz generation process. The parse_quiz_response method extracts and validates the generated quiz elements to make sure that each question has all the required components. To prevent users from recognizing a pattern and to promote real understanding, the quiz generator shuffles the answer options. Questions are presented one at a time, followed by immediate feedback on each answer. After completing all questions, the calculate_quiz_results method assesses the user's answers, and the user is presented with an overall score, a visual breakdown of correct versus incorrect answers, and feedback on their performance level. In this way, the quiz functionality transforms passive video watching into active learning by challenging users to recall and apply the information presented in the video.

# quiz.py
class QuizGenerator:
    def __init__(self):
        pass
    
    def generate_quiz(self, rag_system, api_key, transcript=None, model="gpt-4o", temperature=0.4):
        """
        Generate quiz questions based on the video transcript
        
        Args:
            rag_system: The RAG system with vector store
            api_key: OpenAI API key
            transcript: The full transcript text (optional)
            model: Model to use for question generation
            temperature: Creativity level (0.0-1.0)
            
        Returns:
            list: List of question objects
        """
        if not rag_system:
            st.error("Please transcribe the video first before creating a quiz!")
            return []
        
        # Create a temporary LLM with slightly higher temperature for more creative questions
        creative_llm = ChatOpenAI(
            openai_api_key=api_key,
            model=model,
            temperature=temperature
        )

        num_questions = 10
        
        # Prompt to generate quiz
        prompt = f"""Based on the video transcript, generate {num_questions} multiple-choice questions to test understanding of the content.
        For each question:
        1. The question should be specific to information mentioned in the video
        2. Include 4 options (A, B, C, D)
        3. Clearly indicate the correct answer
        
        Format your response exactly as follows for each question:
        QUESTION: [question text]
        A: [option A]
        B: [option B]
        C: [option C]
        D: [option D]
        CORRECT: [letter of correct answer]
       
        Make sure all questions are based on facts from the video."""
        
        try:
            if transcript:
                # If we have the full transcript, use it
                messages = [
                    {"role": "system", "content": f"You are given the following transcript from a video:\n\n{transcript}\n\nUse this transcript to create quiz questions according to the instructions."},
                    {"role": "user", "content": prompt}
                ]
                
                response = creative_llm.invoke(messages)
                response_text = response.content
            else:
                # Fallback to RAG approach if no transcript is provided
                relevant_docs = rag_system.vector_store.similarity_search(
                    "what are the main topics covered in this video?", k=5
                )
                context = "\n\n".join([doc.page_content for doc in relevant_docs])
                
                # Use the creative LLM with context to generate questions
                messages = [
                    {"role": "system", "content": f"You are given the following context from a video transcript:\n\n{context}\n\nUse this context to create quiz questions according to the instructions."},
                    {"role": "user", "content": prompt}
                ]
                
                response = creative_llm.invoke(messages)
                response_text = response.content
        except Exception as e:
            # Fallback to the regular RAG system if there's an error
            st.warning(f"Using standard question generation due to error: {str(e)}")
            response_text = rag_system.ask(prompt)
        
        return self.parse_quiz_response(response_text)

    # The rest of the class remains unchanged
    def parse_quiz_response(self, response_text):
        """
        Parse the LLM response to extract questions, options, and correct answers
        
        Args:
            response_text: Raw text response from LLM
            
        Returns:
            list: List of parsed question objects
        """
        quiz_questions = []
        current_question = {}
        
        for line in response_text.strip().split('\n'):
            line = line.strip()
            if line.startswith('QUESTION:'):
                if current_question and 'question' in current_question and 'options' in current_question and 'correct' in current_question:
                    quiz_questions.append(current_question)
                current_question = {
                    'question': line[len('QUESTION:'):].strip(),
                    'options': [],
                    'correct': None
                }
            elif line.startswith(('A:', 'B:', 'C:', 'D:')):
                option_letter = line[0]
                option_text = line[2:].strip()
                current_question.setdefault('options', []).append((option_letter, option_text))
            elif line.startswith('CORRECT:'):
                current_question['correct'] = line[len('CORRECT:'):].strip()
        
        # Add the last question
        if current_question and 'question' in current_question and 'options' in current_question and 'correct' in current_question:
            quiz_questions.append(current_question)
        
        # Randomize options for each question
        randomized_questions = []
        for q in quiz_questions:
            # Get the original correct answer
            correct_letter = q['correct']
            correct_option = None
            
            # Find the correct option text
            for letter, text in q['options']:
                if letter == correct_letter:
                    correct_option = text
                    break
            
            if correct_option is None:
                # If we can't find the correct answer, keep the question as is
                randomized_questions.append(q)
                continue
                
            # Create a list of options texts and shuffle them
            option_texts = [text for _, text in q['options']]
            
            # Create a copy of the original letters
            option_letters = [letter for letter, _ in q['options']]
            
            # Create a list of (letter, text) pairs
            options_pairs = list(zip(option_letters, option_texts))
            
            # Shuffle the pairs
            random.shuffle(options_pairs)
            
            # Find the new position of the correct answer
            new_correct_letter = None
            for letter, text in options_pairs:
                if text == correct_option:
                    new_correct_letter = letter
                    break
            
            # Create a new question with randomized options
            new_q = {
                'question': q['question'],
                'options': options_pairs,
                'correct': new_correct_letter
            }
            
            randomized_questions.append(new_q)
        
        return randomized_questions
    
    def calculate_quiz_results(self, questions, user_answers):
        """
        Calculate quiz results based on user answers
        
        Args:
            questions: List of question objects
            user_answers: Dictionary of user answers keyed by question_key
            
        Returns:
            tuple: (results dict, correct count)
        """
        correct_count = 0
        results = {}
        
        for i, question in enumerate(questions):
            question_key = f"quiz_q_{i}"
            user_answer = user_answers.get(question_key)
            correct_answer = question['correct']
            
            # Only count as correct if user selected an answer and it matches
            is_correct = user_answer is not None and user_answer == correct_answer
            if is_correct:
                correct_count += 1
            
            results[question_key] = {
                'user_answer': user_answer,
                'correct_answer': correct_answer,
                'is_correct': is_correct
            }
        
        return results, correct_count
Quiz result (image by author)
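As a usage sketch, the quiz pieces could be driven from the Streamlit page roughly as follows. This simplified version shows all questions at once rather than one at a time as in the app, and the widget keys and surrounding variables (rag_system, api_key, transcript) are assumptions:

import streamlit as st

quiz_gen = QuizGenerator()
questions = quiz_gen.generate_quiz(rag_system, api_key, transcript=transcript)

user_answers = {}
for i, q in enumerate(questions):
    # Show the shuffled options and record the selected letter for this question
    labels = [f"{letter}: {text}" for letter, text in q["options"]]
    choice = st.radio(q["question"], labels, key=f"quiz_q_{i}")
    user_answers[f"quiz_q_{i}"] = choice.split(":")[0] if choice else None

if st.button("Submit answers"):
    results, correct = quiz_gen.calculate_quiz_results(questions, user_answers)
    st.write(f"You got {correct} out of {len(questions)} questions right.")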

Talk-to-Videos also generates flashcards from the video content, which support active recall and spaced-repetition learning techniques. This is done through the FlashcardGenerator class in flashcards.py, which creates a mix of flashcards focusing on key term definitions, conceptual questions, fill-in-the-blank statements, and True/False questions with explanations. A prompt guides the LLM to output flashcards in a structured JSON format, with each card containing distinct “front” and “back” elements. The shuffle_flashcards method produces a randomized presentation, and each flashcard is validated to ensure that it contains both front and back components before being presented to the user. The answer to each flashcard is initially hidden and revealed on the user's input, mimicking classic flashcard behavior. Users can generate a new set of flashcards for more practice. The flashcard and quiz systems are interconnected so that users can switch between them as needed.

# flashcards.py
class FlashcardGenerator:
    """Class to generate flashcards from video content using the RAG system."""
    
    def __init__(self):
        """Initialize the flashcard generator."""
        pass
    
    def generate_flashcards(self, rag_system, api_key, transcript=None, num_cards=10, model="gpt-4o") -> List[Dict[str, str]]:
        """
        Generate flashcards based on the video content.
        
        Args:
            rag_system: The initialized RAG system with video content
            api_key: OpenAI API key
            transcript: The full transcript text (optional)
            num_cards: Number of flashcards to generate (default: 10)
            model: The OpenAI model to use
            
        Returns:
            List of flashcard dictionaries with 'front' and 'back' keys
        """
        # Import here to avoid circular imports
        from langchain_openai import ChatOpenAI
        
        # Initialize language model
        llm = ChatOpenAI(
            openai_api_key=api_key,
            model=model,
            temperature=0.4
        )
        
        # Create the prompt for flashcard generation
        prompt = f"""
        Create {num_cards} educational flashcards based on the video content.
        
        Each flashcard should have:
        1. A front side with a question, term, or concept
        2. A back side with the answer, definition, or explanation
        
        Focus on the most important and educational content from the video. 
        Create a mix of different types of flashcards:
        - Key term definitions
        - Conceptual questions
        - Fill-in-the-blank statements
        - True/False questions with explanations
        
        Format your response as a JSON array of objects with 'front' and 'back' properties.
        Example:
        [
            {{"front": "What is photosynthesis?", "back": "The process by which plants convert light energy into chemical energy."}},
            {{"front": "The three branches of government are: Executive, Legislative, and _____", "back": "Judicial"}}
        ]
        
        Make sure your output is valid JSON format with exactly {num_cards} flashcards.
        """
        
        try:
            # Determine the context to use
            if transcript:
                # Use the full transcript if provided
                # Create messages for the language model
                messages = [
                    {"role": "system", "content": f"You are an educational content creator specializing in creating effective flashcards. Use the following transcript from a video to create educational flashcards:\n\n{transcript}"},
                    {"role": "user", "content": prompt}
                ]
            else:
                # Fallback to RAG system if no transcript is provided
                relevant_docs = rag_system.vector_store.similarity_search(
                    "key points and educational concepts in the video", k=15
                )
                context = "\n\n".join([doc.page_content for doc in relevant_docs])
                
                # Create messages for the language model
                messages = [
                    {"role": "system", "content": f"You are an educational content creator specializing in creating effective flashcards. Use the following context from a video to create educational flashcards:\n\n{context}"},
                    {"role": "user", "content": prompt}
                ]
            
            # Generate flashcards
            response = llm.invoke(messages)
            content = response.content
            
            # Extract JSON content in case there's text around it
            json_start = content.find('[')
            json_end = content.rfind(']') + 1
            
            if json_start >= 0 and json_end > json_start:
                json_content = content[json_start:json_end]
                flashcards = json.loads(json_content)
            else:
                # Fallback in case of improper JSON formatting
                raise ValueError("Failed to extract valid JSON from response")
            
            # Verify we have the expected number of cards (or adjust as needed)
            actual_cards = min(len(flashcards), num_cards)
            flashcards = flashcards[:actual_cards]
            
            # Validate each flashcard has required fields
            validated_cards = []
            for card in flashcards:
                if 'front' in card and 'back' in card:
                    validated_cards.append({
                        'front': card['front'],
                        'back': card['back']
                    })
            
            return validated_cards
        
        except Exception as e:
            # Handle errors gracefully
            print(f"Error generating flashcards: {str(e)}")
            # Return a few basic flashcards in case of error
            return [
                {"front": "Error generating flashcards", "back": f"Please try again. Error: {str(e)}"},
                {"front": "Tip", "back": "Try regenerating flashcards or using a different video"}
            ]
    
    def shuffle_flashcards(self, flashcards: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """Shuffle the order of flashcards"""
        shuffled = flashcards.copy()
        random.shuffle(shuffled)
        return shuffled
Flashcards (image by author)
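The hide/reveal interaction described above can be sketched in Streamlit along these lines. The session-state keys and layout are assumptions; the app's actual implementation may differ:

import streamlit as st

def render_flashcards(flashcards):
    """Show one card at a time, revealing the back side only on request (sketch)."""
    idx = st.session_state.setdefault("card_idx", 0) % len(flashcards)
    card = flashcards[idx]

    st.markdown(f"**Front:** {card['front']}")
    if st.button("Reveal answer", key=f"reveal_{idx}"):
        st.session_state["show_back"] = True
    if st.session_state.get("show_back"):
        st.markdown(f"**Back:** {card['back']}")

    if st.button("Next card", key=f"next_{idx}"):
        st.session_state["card_idx"] = idx + 1
        st.session_state["show_back"] = False
        st.rerun()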

Potential Extensions and Improvements

This application can be extended and improved in a number of ways. For instance:

  • Integration of visual features in video (such as keyframes) may be explored alongside the audio to extract more meaningful information.
  • Team-based learning experiences can be enabled where office colleagues or classmates share notes, quiz scores, and summaries.
  • Creating navigable transcripts that allow users to click on specific sections to jump to that point in the video (see the sketch after this list).
  • Creating step-by-step action plans for implementing concepts from the video in real business settings.
  • Modifying the RAG prompt to elaborate on the answers and provide simpler explanations of difficult concepts.
  • Generating questions that stimulate metacognitive skills in learners by prompting them to reflect on their thinking process and learning strategies while engaging with video content.
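As a rough illustration of the navigable-transcript idea, the sketch below assumes a list of (start_seconds, text) segments is available, for example from a timestamped transcription step that the current pipeline does not produce, and uses Streamlit's start_time parameter to jump the player:

import streamlit as st

def navigable_transcript(video_path, segments):
    """Render transcript segments as buttons that jump the player to that time (sketch)."""
    start = st.session_state.get("video_start", 0)
    st.video(video_path, start_time=start)
    for i, (seconds, text) in enumerate(segments):
        label = f"{int(seconds // 60):02d}:{int(seconds % 60):02d}  {text[:60]}"
        if st.button(label, key=f"seg_{i}"):
            st.session_state["video_start"] = int(seconds)
            st.rerun()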

That’s all folks! If you liked the article, please follow me on Medium and LinkedIn.

The post Talk to Videos appeared first on Towards Data Science.

Testing the Power of Multimodal AI Systems in Reading and Interpreting Photographs, Maps, Charts and More https://towardsdatascience.com/testing-the-power-of-multimodal-ai-systems-in-reading-and-interpreting-photographs-maps-charts-and-more/ Tue, 25 Mar 2025 18:30:49 +0000 https://towardsdatascience.com/?p=605249 Can multimodal AI systems consisting in LLMs with vision capabilities understand figures and extract information from them?

The post Testing the Power of Multimodal AI Systems in Reading and Interpreting Photographs, Maps, Charts and More appeared first on Towards Data Science.

Introduction

It’s no news that artificial intelligence has made huge strides in recent years, particularly with the advent of multimodal models that can process and create both text and images, and some very new ones that also process and produce audio and video. I think that these AI systems have the potential to revolutionize data analysis, robotics, and even everyday tasks like navigation and information extraction from visuals. Along these lines, I recently posed myself the following question:

Can multimodal AI systems consisting of large language models with vision capabilities understand figures that contain information, then process it and produce summaries, explanations, object identification, etc.?

Although this is in itself a question that would require a whole research project to be properly addressed, I was extremely curious about it, so I needed at least a rough answer to my question. Therefore, I carried out some tests on how much OpenAI’s vision-enhanced models understand about photographs, screenshots and other images containing plots, charts, shots from a driver’s position or mimicking a robot’s onboard camera, and even molecular structures.

More specifically, I tried the GPT-4o and GPT-4o-mini models developed by OpenAI, through ChatGPT or through their API. To my amazement, I found that these models can indeed understand quite a bit of what they see!

Let me present here the most interesting results I obtained, either directly through ChatGPT or programmatically, including code for a web app where you can paste a picture and have GPT-4o-mini analyze it.

First of All: How to Analyze Images With OpenAI’s GPT-4o Models, Via ChatGPT or Programmatically

I carried out my first tests right away on ChatGPT’s free version using GPT-4o, which allows up to four uses of the “vision” capabilities per day for free. To use this, you must upload or paste from the clipboard a picture that will be sent together with the prompt. On a paid ChatGPT account you can do this more times, and with the OpenAI API you can do this as long as token credits are available, directly and programmatically, as I will cover here.

Processing prompts containing pictures programmatically

But how do you do image processing with GPT-4o models programmatically? Well, it isn’t that complex, as explained in OpenAI’s API reference.

In JavaScript, which I love because it allows me to easily write programs that run out of the box online (as you already know if you follow me), you just need a function call that looks like this:

async function sendToOpenAI(imageData) {
    const apiKey = "...API KEY HERE";
    const base64Image = imageData.split(",")[1];
    
    const response = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            "Authorization": `Bearer ${apiKey}`
        },
        body: JSON.stringify({
            model: "gpt-4o-mini",
            messages: [{
                role: "user",
                content: [
                    { type: "text", text: "what's in this image?" },
                    { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64Image}` } }
                ]
            }]
        })
    });
    
    const result = await response.json();
    document.getElementById("response").innerText = result.choices[0].message.content;
}
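For readers who prefer Python, an equivalent call with the official openai package could look like this. The image path and the way the file is loaded are illustrative assumptions; the model and prompt simply mirror the JavaScript example above:

import base64
from openai import OpenAI

client = OpenAI(api_key="...API KEY HERE")

# Read a local image and encode it as base64 (an assumption about how you load the image)
with open("figure.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "what's in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
        ],
    }],
)
print(response.choices[0].message.content)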

A web app to quickly run image-text prompts

In this example web app you can provide an API key and a text prompt, and then paste an image from the clipboard that will be immediately sent to OpenAI for analysis together with the prompt.

https://lucianoabriata.altervista.org/tests/gpt-vision/gpt-4o-mini-vision.html

The web app ends up displaying the answer, as in this example:

This and all other images, photographs and screenshots shown in this article are by the author.

(Keep reading to find out, towards the end of the post, how Google’s Gemini 2.0 interpreted this screenshot; its answer was just mind-blowingly correct, as it captured the artist’s actual intention!)

Challenging GPT-4o Models With Varied Examples of Image-Text Prompts

Using the above app to run GPT-4o-mini and ChatGPT’s free version to run GPT-4o, I evaluated how well these models understand maps, photographs that mimic what a driver sees on the road, camera views showing the position of a robot’s arm relative to some target objects, various kinds of plots and charts presenting data, and software screenshots, including some that display molecules.

I am stunned at how much the models understand! Read on to discover it all, take a look at my first tests with the impressive Gemini 2.0 Flash from Google, and perhaps then try your favorite AI models with your own ideas, sharing what you find in the comments!

1. Vision-Based Analysis of a Google Maps Screenshot

Sometimes, maps contain so much information that you can easily get lost. Could a multimodal system be of help in such a situation? Let’s find out!

Here on the left is a screenshot from a Google Maps view that I fed into GPT-4o-mini via the API together with the question “What’s in this image?”. You can see the answer on the right:

2. Interpreting Driving Signs on the Road

Could multimodal systems perhaps help you drive? Of course, this deserves a full investigation in order to draw a solid conclusion, but let’s see what happens in some quick tests. In each test, I provided a photograph of what a driver sees on the road, together with this text prompt:

Interpret any road signs or things I should pay attention to as a driver

Here are the results on some example shots, all run with the web app provided above i.e. using GPT-4o-mini:

First, for this photograph:

GPT-4o-mini said the following:

As you drive, pay attention to the following signs and features in the image:

1. **”Keep Right” Sign**: This indicates that you should stay to the right side of the road, likely because of an island or obstruction ahead.

2. **Curved Road Ahead**: The layout of the road suggests you may encounter a curve, which means you should reduce your speed and stay alert for any oncoming vehicles.

3. **Speed Limit**: The sign showing a speed limit of 40 suggests that you should not exceed this speed.

4. **Road Layout**: Be mindful of the marking lines and how they direct traffic. The two solid yellow lines indicate that crossing them (to pass another vehicle) is prohibited.

5. **Surroundings**: Note the vegetation and landscape along the road, which may obscure visibility in some areas. Maintain awareness of pedestrians or cyclists.

Make sure to remain vigilant and adhere to these guidelines for safe driving.

This answer was quite nice, complete, and useful!

Let’s try this next one, in a densely snowed place:

Here it clearly picked up the snow, and correctly inferred it is a region with mountains, offering some quite general advice that suits the situation but doesn’t seem very specific to it:

When driving in snowy mountain conditions, there are several important things to pay attention to:

1. **Yield Sign**: If there’s a yield sign, be prepared to slow down and give way to any traffic on the intersecting road.

2. **Road Conditions**: Snow can make roads slick and reduce visibility. Drive cautiously and maintain a safe distance from other vehicles.

3. **Speed Limit**: Always adhere to the posted speed limit but consider reducing your speed further due to the weather conditions.

4. **Slippery Surfaces**: Be aware that certain areas may be icier or more treacherous. Watch for signs indicating slippery roads or do not enter areas.

5. **Falling Rocks**: In mountainous areas, watch for any signs indicating potential rockfall or land slides, which may be exacerbated by snow.

6. **Wildlife**: In rural or mountainous regions, be alert for wildlife crossing the road, especially in winter months.

7. **Visibility**: If visibility is low due to snow, be sure to use your headlights and reduce your speed.

Always ensure that your vehicle is equipped for winter conditions, including snow tires or chains if necessary.

Let’s try one last example, which features a tunnel, a car right ahead of the driver, some warnings on the left, and two signposts on the right:

Here’s the answer, which seems to be warning about all general and situation-specific issues:

Here are some important points to note based on the image:

1. **Tunnel Ahead**: You are approaching a tunnel. Prepare for possible changes in visibility and ensure your vehicle’s headlights are on.

2. **Speed and Space**: Always maintain a safe distance from the vehicle in front of you, especially in confined spaces like tunnels.

3. **No Pedestrians or Bicycles**: The sign indicates that pedestrians and bicycles are not allowed in the area. Stay alert to ensure you don’t encounter anyone on the road.

4. **Road Barriers and Traffic Flow**: Pay attention to the road markings and barriers as they guide lane usage and safe navigation near the tunnel entrance.

5. **Weather and Terrain**: The surrounding scenery suggests that weather conditions may change quickly, especially driving in mountain areas. Be prepared for changes in road conditions.

Always drive cautiously and obey any traffic signs or signals present.

Note, however, that it missed the specific change in speed limit. I tried this photograph again with some more focused text prompts, but it never said anything about the change in speed limit specific to this situation, instead providing general recommendations such as this:

The maximum speed limit inside a tunnel can vary based on the country’s traffic regulations, the specific road, and the tunnel’s design. Generally, in many places, typical speed limits range from 50 to 80 km/h (about 30 to 50 mph). To determine the exact speed limit for a specific tunnel, you would need to look for posted signs or consult local traffic regulations.

This is correct, but it shows that the model wasn’t capable of extracting the information specifically available here, which would be something along the lines of “maximum speed is not indicated, but it is definitely above 50 according to the visible signpost.”

3. Could Vision-Ready Models Drive Robots?

If an AI model understands pictures well enough, one could then think about feeding a robot’s camera feed to the model so that it can produce outputs that make sense. Of course this is a huge thing to study in itself, having to produce commands that end up in mechanical outputs; but let’s see here how the core part — i.e. converting views to broad actions — would work.

Given this photograph of a robotic arm on a table where a red and a blue ball wait to be grabbed:

I ask GPT-4o the following:

How should the robotic arm move in order to grab the blue sphere?

To which the LLM replies quite correctly, even asking me if I want it to follow up with code generation:

To grab the blue sphere, the robotic arm should follow these steps:

Position the Base: Rotate the base so that the arm aligns with the blue sphere.

Lower the Arm: Adjust the arm’s height by moving its joints to bring the gripper closer to the sphere.

Extend Forward: If necessary, extend the arm forward to reach the blue sphere.

Open the Gripper: Ensure the gripper is open wide enough to grasp the sphere.

Close the Gripper: Gently close the gripper around the sphere to secure it.

Lift the Sphere: Raise the arm slightly while keeping the sphere firmly held.

Retract or Move to Destination: Move the arm to a desired location for placing the sphere.

These movements should be smooth to avoid dropping or misaligning with the sphere. Would you like assistance with programming the robotic arm for this task?

I feel the answers are quite good, but they lack precision. So, rather than going down the road of asking for code, I will interrogate the LLM a bit deeper to see what it really understands from this picture:

When you say “Position the Base: Rotate the base so that the arm aligns with the blue sphere.” should the arm move to its left or to its right, based on the image?

The answer turns out very good:

Based on the image, the robotic arm should rotate to its left to align with the blue sphere. The sphere is positioned to the left of the arm’s current orientation, so rotating counterclockwise will bring the gripper closer to the sphere.

Yet I dig a bit deeper:

How much do you estimate it should rotate?

This new answer is essentially the same that I could estimate as a human from the photograph:

Estimating from the image, the robotic arm should rotate its base approximately 30 to 45 degrees counterclockwise to align with the blue sphere.

This estimate assumes that:

The arm’s current position is facing slightly to the right.

The blue sphere is placed in front and to the left of the arm.

Once the base is rotated, minor adjustments to the arm joints may be needed to fine-tune the position before grasping the sphere.

Finally, I ask about the red ball, to test if the model understands that it is on the other side and a smaller rotation is required:

And how about grabbing the red ball?

And it does get it right!

To grab the red sphere, the robotic arm should follow these steps:

Rotate the Base: The base should rotate slightly to the left, but much less than for the blue sphere. An estimated 15 to 20 degrees counterclockwise should align the arm with the red sphere.

Lower the Arm: Bring the gripper down closer to the red sphere.

Extend Forward (if needed): Adjust the arm’s reach if the gripper is not directly above the sphere.

Open the Gripper: Ensure the gripper is wide enough to grab the sphere.

Close the Gripper: Securely grasp the red sphere.

Lift the Arm: Raise the sphere carefully.

Move to Destination: Transport the sphere to a desired location if needed.

Since the red sphere is closer to the robotic arm than the blue sphere, less movement is required. Would you like a precise angle calculation based on measurements?

If you’re also excited about the possibility of driving robots with vision models, check out this thread on X.com.

4. Vision Models Acting on Data Plots

Being in Towards Data Science, we can’t escape the obvious question: can vision-assisted models understand the content of plots, perhaps even draw conclusions from them?

I ran quite a few tests here, dealing with different kinds of information and plot types.

Detecting high tides in a tide chart

I began with the following chart of tide highs and lows for one of the Azores islands, written in Portuguese, about which I asked a question in English:

When are the peak highs in the tides according to the chart?

This was the answer:

The peak high tides in the chart are marked by blue dots with corresponding times. The highest tides occur at the following times:

11 Aug: 11:18
12 Aug: 12:15
13 Aug: 12:58
14 Aug: 13:33
15 Aug: 14:05
16 Aug: 14:35
17 Aug: 15:05

These times represent when the tide reaches its highest point each day.

You can immediately see that the displayed data is correct, but it only represents the high tides during the day and totally overlooks the high tides that happen at night! Hence, I asked about that specifically and got a much more complete answer:

I said:

But the times you say are all during the days… doesn’t the chart show also peak high tides at night?

GPT-4o said:

You’re absolutely right! The chart also shows peak high tides occurring at night. Here are all the peak high tide times (both daytime and nighttime) according to the chart:

11 Aug: 11:18 (Day)
12 Aug: 12:15 (Day) & 23:49 (Night)
13 Aug: 12:58 (Day) & 00:41 (Night)
14 Aug: 13:33 (Day) & 01:20 (Night)
15 Aug: 14:05 (Day) & 01:54 (Night)
16 Aug: 14:35 (Day) & 02:24 (Night)
17 Aug: 15:05 (Day) & 02:53 (Night)

Each day has two peak high tides, one during the day and one at night. Thanks for catching that!

That’s now perfect. Finally, I asked one more piece of information:

OK thanks. Can you say where on earth this chart is for?

And again the answer was right:

Yes! The chart is for Ponta Delgada, São Miguel Island, which is part of the Azores, an autonomous region of Portugal in the North Atlantic Ocean. This location is known for its stunning volcanic landscapes, coastal views, and maritime activities. The tide chart shows the tidal schedule for this area from August 11 to August 17.

Interpreting a height profile from a photo

Next, I made the question a bit harder: the image consisted of a photograph of a screen (note the interfering horizontal bands), the x-axis of the plot is not complete, and the question involves some thinking:

Looking at this, would I go up or down from Andermatt to Wassen? By how many meters and over how many km?

The answer was, however, perfect! Here it is:

From the elevation profile, Andermatt is at 1437m, and Wassen is at 916m. This means that traveling from Andermatt to Wassen involves a descent of 521 meters (1437m — 916m).

Looking at the horizontal axis, the distance from Andermatt to Wassen appears to be around 10 km. So, you would be going downhill by 521 meters over approximately 10 km.

Interpreting plots from RNA-seq data

I then tested GPT-4o’s capability to interpret plots produced as part of a bioinformatics pipeline. First, I made it process a volcano plot, which is one of the most common kinds of graphs used to analyze differential gene expression data. I took an example image from this free online tutorial, and asked the system the following:

You are a bioinformatics expert. What would you conclude from this volcano plot? I mean, which genes are up- or down-regulated?

This was GPT-4o’s answer:

This volcano plot represents differentially expressed genes in Arabidopsis thaliana infected with Pseudomonas syringae DC3000 compared to a mock treatment. Here’s my interpretation:

1. X-axis (Log₂ fold change):
Positive values indicate up-regulated genes (higher expression in infected samples).
Negative values indicate down-regulated genes (lower expression in infected samples).

2. Y-axis (-Log₁₀ P-value):
Higher values indicate stronger statistical significance.

3. Color coding:
Gray: Not significant.
Blue: Significant by p-value but not by fold change.
Red: Significant by both p-value and fold change threshold.

Key findings:

Up-regulated genes (right side, red dots):
Strongly induced genes include AT5G42800, AT1G52440, AT1G73325, AT4G18440, among others.
These genes are highly expressed in infected plants.

Down-regulated genes (left side, red dots):
Examples include AT4G28940, AT5G04100, AT1G04220, AT1G02440.
These genes are significantly suppressed during infection.

Biological Implications:

Up-regulated genes may be involved in plant immune responses, such as defense signaling, stress response, or antimicrobial activity.

Down-regulated genes might be involved in normal cellular functions that get suppressed during infection.

You can see how the first part of the response clearly reads out the main features of the plot, explaining what the axes and label colors are, and even considering the plot’s title. Then, the model distills lists of genes that are up- or down-regulated, passing the thresholds with strong statistical significance. Note that the list is correct for the up-regulated genes but has a few problems with the down-regulated ones. On inspection of the plot it becomes clear that this is due to the labels not being clearly resolved, even for humans.

Going one gear up, the model somehow knows correct information about the identified genes, so it can draw some simple conclusions about the biological significance of the results.

Truly amazing, were it not for that small problem reading the labels, which in a real scenario could probably be improved.
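For reference, the kind of reading the model performed can also be reproduced programmatically. Here is a minimal sketch, assuming a data frame of differential expression results with hypothetical column names and the usual (but here assumed) thresholds of |log2 fold change| > 1 and p-value < 0.05:

import pandas as pd

def classify_genes(df, lfc_col="log2FoldChange", p_col="pvalue",
                   lfc_thresh=1.0, p_thresh=0.05):
    # Mimic the volcano-plot reading: significant by both p-value and fold change
    significant = (df[p_col] < p_thresh) & (df[lfc_col].abs() > lfc_thresh)
    up_regulated = df[significant & (df[lfc_col] > 0)]    # right side of the plot
    down_regulated = df[significant & (df[lfc_col] < 0)]  # left side of the plot
    return up_regulated, down_regulated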


Then in a second test related to RNA-seq data, I asked GPT-4o what it thought about two distributions of p-values shown together in a single graph (image edited from this tutorial):

This figure shows two distributions of p-values from RNA-seq data. Do you think they are OK, or do you notice anything weird?

This was the answer, again excellent:

This histogram represents two distributions of p-values from RNA-seq data, likely comparing two different sets of statistical tests. Here’s my assessment:

* What looks normal?

Uniform distribution of non-significant p-values:

The light blue bars seem to follow a relatively uniform distribution across the range, which is expected under the null hypothesis (genes with no real differential expression should have p-values uniformly distributed between 0 and 1).

Enrichment of low p-values:

The peak near p = 0 suggests that a subset of genes show strong statistical significance, which is expected when there are truly differentially expressed genes.

* What looks weird?

Excessive spike at p = 0:

If there is an unnaturally large spike at exactly 0, this could indicate a bias or an issue in the statistical testing (e.g., overfitting, batch effects, or miscalculated p-values).

Bimodal distribution:

The yellow distribution appears to have multiple peaks, which is unusual. Normally, p-value distributions should either be uniform (under the null hypothesis) or show a single peak at low p-values (for differentially expressed genes).

Peaks at intermediate p-values might suggest issues with normalization, batch effects, or improper multiple-testing corrections.

* Potential issues to check

Overdispersion or poor model fit: Are the statistical methods used properly accounting for biological variability?

Multiple testing correction issues: If false discovery rate (FDR) correction wasn’t applied correctly, the distribution could be skewed.

Batch effects or confounding variables: If the data contains hidden covariates, it can introduce unexpected structure in the p-value distribution.
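If you want to run this kind of sanity check on your own results, a simple histogram of the raw p-values is usually enough; here is a minimal sketch assuming they are available as a plain array:

import matplotlib.pyplot as plt

def plot_pvalue_histogram(pvalues, bins=40):
    # A healthy distribution is roughly uniform, with at most one peak near p = 0;
    # extra peaks at intermediate p-values hint at batch effects or a poor model fit.
    plt.hist(pvalues, bins=bins, range=(0, 1), edgecolor="black")
    plt.xlabel("p-value")
    plt.ylabel("Number of genes")
    plt.show()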

5. Analysis of Screenshots Containing Molecular Structures

If you follow me, you know that I work on various molecular sciences; so, of course, I had to try out these vision-enhanced models with that kind of input.

Help using a program to generate 3D structures from 2D drawings

First I checked GPT-4o-mini’s capability to help a user who wants to create a molecule in 3D format from a drawing: a user who knows this should be possible with Hack-A-Mol but doesn’t know exactly how to use it.

By the way, I covered Hack-A-Mol here:

https://medium.com/advances-in-biological-science/purely-web-based-suites-for-manipulation-of-bio-molecular-data-cbb7fdd60df4

Now let’s see how well GPT-4o can help one to interpret what one obtains in this program. I uploaded this screenshot:

And asked this question:

I’m trying to build a small molecule using this tool, by drawing and then clicking the arrow to the right…. but I don’t get very well what’s being generated exactly… can you explain?

This was the answer:

Yes! When you draw your small molecule and click the arrow, several things are generated:

2D Structure (Left Panel)

This is your drawn molecule in 2D, using standard chemical notation.

3D Structure (Right Panel — JSmol Viewer)

This is the 3D molecular model generated from your 2D drawing.

It gives you a spatial representation of the molecule.

Molecular Identifiers (Below the 2D Editor)

InChI (IUPAC International Chemical Identifier): A text-based representation of the molecular structure.

InChIKey: A hashed, fixed-length version of the InChI, useful for searching databases.

SMILES (Simplified Molecular Input Line Entry System): A compact string representation of the molecule.

Structural Data (Bottom Text Box)

You see data in MOL/SDF format, which contains atomic coordinates, bonding information, and metadata about the molecule.

Other formats available include XYZ, PDB, and CIF, which are useful for computational chemistry and molecular modeling.

What Can You Do with This?

If you want to simulate the molecule or use it in software like AlphaFold 3, you should export the structure in SDF or PDB format.

If you want to search for the molecule online, use the InChIKey or SMILES.
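As a side note, the same 2D-to-3D step can be reproduced programmatically; here is a minimal sketch with RDKit, using ibuprofen’s SMILES as an arbitrary example (RDKit is an assumption here, not the tool behind Hack-A-Mol):

from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"      # ibuprofen, just as an example
mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
AllChem.EmbedMolecule(mol, randomSeed=42)  # generate 3D coordinates
AllChem.MMFFOptimizeMolecule(mol)          # quick force-field cleanup
print(Chem.MolToMolBlock(mol))             # MOL/SDF-style block with 3D coordinates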

Using GPT-4o with vision to help understand AlphaFold models

In the next example, I tested how well GPT-4o can understand structural models of proteins produced by AlphaFold (in this case AlphaFold 2 as utilized to build the AlphaFold-EBI database of models). By the way, if you have no idea what AlphaFold is, check out these posts:

https://medium.com/advances-in-biological-science/alphafold-3s-advances-and-limitations-in-a-nutshell-5ae0cdd814ea

https://lucianosphere.medium.com/guide-to-my-blog-articles-on-alphafold-39c10a7dad7f

It is well-known that the AlphaFold 2 and 3 systems revolutionized modeling; however, of course they aren’t infallible. To help users understand what’s likely correct and wrong in a structural model, AlphaFold returns not just the atomic coordinates that constitute the structural model but also quality metrics, of which the most important ones are the residue-wise pLDDTs and the inter-residue PAEs. Here we tested GPT-4o with vision capabilities on this UniProt entry, precomputed in the AF-EBI database:

The text part of the prompt was:

You are an expert in bioinformatics. I’m checking an entry at the AF-EBI database of protein structure models, and I got what I’m showing in the attached screenshot. What do you think? Is the model good? I’m especially interested in the relative orientation of the two domains, but I’m not sure what to conclude from the PAE plot…

And this was the answer:

From your screenshot, we can analyze the quality of the model based on the pLDDT confidence scores and the Predicted Aligned Error (PAE) plot:

Model Confidence (pLDDT)

The Mol* 3D viewer colors the model based on confidence scores.

Most of the structure is colored in blue (high to very high confidence, pLDDT > 70–90), suggesting that the individual domains are well-predicted.

However, some loops and termini (orange/red, pLDDT < 50–70) have lower confidence, which is expected for flexible regions.

PAE Plot Analysis (Domain Orientation)

The PAE plot (right panel) indicates how well the relative positions of different residues are predicted.

low PAE (dark green) means the relative position between two residues is well-defined.

high PAE (lighter green/white areas) suggests uncertainty in the relative positioning.

Interpreting the PAE

The diagonal (self-alignment) is always well-defined.

The block-like structure in the PAE plot suggests two or more well-defined domains with high internal confidence.

However, lighter green areas between the blocks indicate uncertainty in the relative orientation of these domains.

If your interest is in the relative orientation of the domains, this suggests the model may not be reliable in that aspect — it might have multiple plausible conformations.

Conclusion

The individual domains are well-modeled, but their relative orientation might be uncertain.

If functionally relevant, you might consider experimental validation (e.g., cryo-EM, SAXS, or NMR) or molecular dynamics simulations to explore domain flexibility.

If you need a more rigid structure, you could look for homologous experimental structures or use AlphaFold-Multimer if interactions between chains matter.

The answer was pretty neat and correct. However, we realized that we might have confused the model by talking about two domains, when the screenshot actually shows three. We then asked a follow-up question:

Now checking carefully I see the TED Domains thing shows 3 domains… am I right that according to the PAE plot the arrangement of the first two domains is confident, but not so much their arrangement relative to the third one?

The answer, now based on an inspection of the PAE plot that accounts for the three domains (the first two confidently modeled relative to each other, the third one only modeled confidently at the local level), is again perfect:

Yes, your interpretation is correct! The TED Domains annotation indicates that the model has three domains, as shown by the color-coded segments at the top of the PAE plot.

Looking at the PAE plot:

The first two domains (green and orange segments) have low PAE values between them (darker green in the top-left block of the matrix). This suggests that their relative orientation is well-defined and confident.

However, the third domain (purple segment) shows higher PAE values relative to the first two (lighter green/white regions in the lower-right part of the matrix). This means that the relative position of the third domain is uncertain — it may adopt multiple conformations.

What This Means for Your Analysis

The first two domains likely form a rigid unit with a stable interaction.

The third domain may be more flexible relative to the first two, possibly moving independently or being less constrained in solution.

If this flexibility is functionally important (e.g., for ligand binding, protein-protein interactions, or conformational switching), experimental validation like SAXS, cryo-EM, or FRET could be useful.

Visual analysis of protein-ligand complexes

Last, I geared up to challenge GPT-4o with a more difficult problem. Given a 2D snapshot from a molecular graphics program showing a ligand bound to a protein, can it infer what kinds of contacts the two molecules establish?

The question was:

Here’s a screenshot of ibuprofen docked inside a pocket in BSA. What interactions do you see between the ligand and the protein? I want you to tell me exactly what interactions you see in this structure from the screenshot provided.

And this was the answer:

From the screenshot you provided, I can visually infer potential interactions, but without residue labels or a precise atomic distance measurement, I can only make educated observations. Here’s what I specifically see in your image:

1. Hydrophobic Interactions:

Ibuprofen (highlighted in yellow) is embedded in a hydrophobic pocket of the protein.

It appears to be surrounded by gray carbon atoms from amino acid side chains, likely hydrophobic residues (e.g., leucine, valine, phenylalanine, or isoleucine).

The benzene ring of ibuprofen seems to be in close contact with other aromatic or aliphatic side chains, suggesting π-π stacking or van der Waals interactions.

2. Hydrogen Bonds:

The carboxyl (-COOH) group of ibuprofen appears to be interacting with nearby blue (nitrogen) or red (oxygen) atoms from the protein.

There is at least one close contact between the oxygen of the carboxyl group and a protein residue, likely forming a hydrogen bond with a backbone amide or side chain hydroxyl (e.g., Ser, Thr, Tyr, Asp, or Glu).

3. Possible Electrostatic Interactions:

Ibuprofen is negatively charged at physiological pH (carboxylate anion, -COO⁻).

There may be a salt bridge or ion-dipole interaction if a positively charged residue (e.g., Arg, Lys, or His) is nearby, but I cannot confirm the exact residue from the image alone.

What I Cannot Determine from the Screenshot Alone:

Precise interacting residues: No residue labels are visible.

Exact interaction distances: Hydrogen bonds and electrostatic interactions typically require precise atomic distances (~2–3 Å).

Interaction strengths: Molecular modeling software would be needed to confirm interaction energies.

The answer was quite good, especially considering that the picture is just a static 2D view and is quite crowded with atoms. In fact, it correctly pinpoints that the ibuprofen is shown in yellow, and highlights hydrophobic and aromatic contacts around its aromatic ring, both of which do happen (#1 in the figure below), and hydrogen bonds from its carboxylate group to nearby N and O atoms (#2 in the figure), adding that there might also be a salt bridge between that same carboxylate and some positively charged group of the protein that it can’t distinguish clearly (#3).

For your reference, I show these interactions in the next figure, numbered 1, 2 and 3 as they were mentioned:


Other Vision-Capable Models? Peeking Into Google’s Gemini 2.0 Flash

OpenAI is probably the most famous developer of LLM-based AI systems, but of course not the only one, and, for some goals and according to some evaluations, their models aren’t even the best. However, they remain, to me at least, the easiest to use both directly in a GUI (ChatGPT) and programmatically (as we have seen here and I’ve shown in many other posts).

But my point here is that we can’t overlook the other models. And just while preparing this post, I decided to try out Gemini 2.0 Flash in Google’s AI Studio, which, like GPT-4o, has picture, audio and document processing capabilities — and even video!

Again I was amazed; in fact, it seemed to perform slightly better than the GPT-4o and GPT-4o-mini models tried throughout this post. I was particularly stunned by this specific example, where I loaded the screenshot of my app shown above, to which Gemini replied (without any prompt):

Here’s what I can infer based on the image you provided:

Prompt: “What do you think the artist”

Analysis:

The response provided at the bottom of the image states “The artist likely intended to convey themes of nature, beauty, and vibrancy through the striking imagery of the red poppies against a lush green background.”

Essentially, the AI is recognizing the subject matter and interpreting the artist’s intent as showcasing the beauty and vibrancy of nature.

That was kind of similar to what GPT-4o replied, probably because it just read what GPT-4o itself wrote. But what if I ask it a question trying to address the same point I asked GPT-4o?

Here it goes:

do your own interpretation of the picture… what do you think did the author intend to communicate?

And this was Gemini’s reply, now perfectly reflecting what I think the author intended to convey:

Okay, based on the image, here’s my interpretation of what the author might have intended:

Beyond the surface-level beauty, the image hints at a more complex exploration of color harmony and contrast within nature.

The juxtaposition of the natural poppy field with the color wheel suggests a deliberate artistic statement. The artist may be inviting us to analyze the colors present in the natural scene (dominated by greens and reds) and to understand how these colors interact and create a specific visual experience.

The placement of the color wheel with the arrows pointing to the green and red suggest complementary colours.

Perhaps the author wants us to appreciate the inherent color palettes nature provides.

In short, while the surface reading is about beauty, the integration of the color wheel suggests a deeper interest in the science and theory of color as applied to the natural world. The author seems to be inviting a more analytical and informed way of appreciating the beauty we see.

The key part of Gemini’s reply, which GPT-4o didn’t get in any of my tests, is this:

“The placement of the color wheel with the arrows pointing to the green and red suggest complementary colours.”

That was exactly my point as an artist!

If this post has good reception, I will test Gemini 2.0’s various capabilities in more detail. So let me know in the comments what you think!

Conclusions

I guess that after reading all these examples, you too now appreciate the potential of multimodal, vision-capable AI systems. The tests I presented with the GPT-4o models (and the glimpse into Gemini 2.0) showed the ability of the most modern AI systems to interpret quite complex images: they can understand and extract information from photographs, scientific figures of various kinds, plots and charts, even when these contain information as complex as molecular structures. From the examples we can envision applications as varied as assisting data analysis, assisting driving, and controlling robots — provided some problems are overcome.

This is because some images and questions are, of course, quite challenging. In some cases, for example when limited by the resolution of labels in dense plots, by the lack of 3D perspective in flat screenshots of molecular structures, or by having to estimate rotation angles for the robotic arm, there’s probably not much to do, and the model remains as limited as even the most expert humans in the subject would be.

Yes, it is overall clear that with proper prompting and continued advancements in AI, these models could become invaluable tools for accelerating data interpretation and decision-making. They can reduce the load on human experts, who can then dedicate themselves to more complex problems while the software assists non-experts in interpreting graphical outputs, and, who knows, maybe some day they will drive cars and control robots!



The post Testing the Power of Multimodal AI Systems in Reading and Interpreting Photographs, Maps, Charts and More appeared first on Towards Data Science.

]]>
R.E.D.: Scaling Text Classification with Expert Delegation https://towardsdatascience.com/r-e-d-scaling-text-classification-with-expert-delegation/ Fri, 21 Mar 2025 05:11:10 +0000 https://towardsdatascience.com/?p=605187 A novel large-scale semi-supervised framework that augments traditional classification with LLMs

The post R.E.D.: Scaling Text Classification with Expert Delegation appeared first on Towards Data Science.

]]>
With the new age of problem-solving augmented by Large Language Models (LLMs), only a handful of problems remain that have subpar solutions. Most classification problems (at a PoC level) can be solved by leveraging LLMs at 70–90% Precision/F1 with just good prompt engineering techniques, as well as adaptive in-context-learning (ICL) examples.

What happens when you want to consistently achieve performance higher than that — when prompt engineering no longer suffices?

The classification conundrum

Text classification is one of the oldest and most well-understood examples of supervised learning. Given this premise, it should really not be hard to build robust, well-performing classifiers that handle a large number of input classes, right…?

Welp. It is.

It actually has a lot more to do with the ‘constraints’ that the algorithm is generally expected to work under:

  • low amount of training data per class
  • high classification accuracy (that plummets as you add more classes)
  • possible addition of new classes to an existing subset of classes
  • quick training/inference
  • cost-effectiveness
  • (potentially) really large number of training classes
  • (potentially) endless required retraining of some classes due to data drift, etc.

Ever tried building a classifier beyond a few dozen classes under these conditions? (I mean, even GPT could probably do a great job up to ~30 text classes with just a few samples…)

Considering you take the GPT route — if you have more than a couple dozen classes or a sizeable amount of data to be classified, you are gonna have to reach deep into your pockets for the system prompt, user prompt, and few-shot example tokens you will need to classify each sample. That is after making peace with the throughput of the API, even if you are running async queries.

In applied ML, problems like these are generally tricky to solve since they don’t fully satisfy the requirements of supervised learning or aren’t cheap/fast enough to be run via an LLM. This particular pain point is what the R.E.D algorithm addresses: semi-supervised learning, when the training data per class is not enough to build (quasi)traditional classifiers.

The R.E.D. algorithm

R.E.D.: Recursive Expert Delegation is a novel framework that changes how we approach text classification. This is an applied ML paradigm — i.e., there is no fundamentally different architecture to what exists, but it’s a highlight reel of ideas that work best to build something that is practical and scalable.

In this post, we will be working through a specific example where we have a large number of text classes (100–1000), each class has only a few samples (30–100), and there are a non-trivial number of samples to classify (10,000–100,000). We approach this as a semi-supervised learning problem via R.E.D.

Let’s dive in.

How it works

simple representation of what R.E.D. does

Instead of having a single classifier classify between a large number of classes, R.E.D. intelligently:

  1. Divides and conquers — Break the label space (large number of input labels) into multiple subsets of labels. This is a greedy label subset formation approach.
  2. Learns efficiently — Trains specialized classifiers for each subset. This step focuses on building a classifier that oversamples on noise, where noise is intelligently modeled as data from other subsets.
  3. Delegates to an expert — Employs LLMs as expert oracles for specific label validation and correction only, similar to having a team of domain experts. Using an LLM as a proxy, it empirically ‘mimics’ how a human expert validates an output.
  4. Recursive retraining — Continuously retrains with fresh samples added back from the expert until there are no more samples to be added or saturation from information gain is achieved.

The intuition behind it is not very hard to grasp: Active Learning employs humans as domain experts to consistently ‘correct’ or ‘validate’ the outputs from an ML model, with continuous training. This stops when the model achieves acceptable performance. We take the same intuition and rebrand it, with a few clever innovations that will be detailed in a research pre-print later.

Let’s take a deeper look…

Greedy subset selection with least similar elements

When the number of input labels (classes) is high, the complexity of learning a linear decision boundary between classes increases. As such, the quality of the classifier deteriorates as the number of classes increases. This is especially true when the classifier does not have enough samples to learn from — i.e. each of the training classes has only a few samples.

This is very reflective of a real-world scenario, and the primary motivation behind the creation of R.E.D.

Some ways of improving a classifier’s performance under these constraints:

  • Restrict the number of classes a classifier needs to classify between
  • Make the decision boundary between classes clearer, i.e., train the classifier on highly dissimilar classes

Greedy Subset Selection does exactly this — since the scope of the problem is Text Classification, we form embeddings of the training labels, reduce their dimensionality via UMAP, then form S subsets from them. Each of the subsets has n elements as training labels. We pick training labels greedily, ensuring that every label we pick for the subset is the most dissimilar label w.r.t. the other labels that exist in the subset:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def avg_embedding(candidate_embeddings):
    # Average a list of embedding vectors into a single centroid vector
    return np.mean(candidate_embeddings, axis=0)


def get_least_similar_embedding(target_embedding, candidate_embeddings):
    # cosine_similarity expects 2D arrays, so reshape the single target vector
    similarities = cosine_similarity(
        np.asarray(target_embedding).reshape(1, -1), np.asarray(candidate_embeddings))
    least_similar_index = int(np.argmin(similarities))  # index of the minimum similarity
    return candidate_embeddings[least_similar_index]


def get_embedding_class(embedding, embedding_map):
    # NumPy arrays are not hashable, so look the class up by value comparison
    for cls, emb in embedding_map.items():
        if np.array_equal(emb, embedding):
            return cls
    return None  # gracefully handle an embedding that is not in the map


def select_subsets(embeddings, n):
    # embeddings: dict mapping each class label to its average embedding
    visited = {cls: False for cls in embeddings.keys()}
    subsets = []
    current_subset = []

    while any(not visited[cls] for cls in visited):
        if not current_subset:
            # Seed a new subset with the first not-yet-visited class
            seed_cls = next(cls for cls in embeddings if not visited[cls])
            current_subset.append(embeddings[seed_cls])
            visited[seed_cls] = True
        elif len(current_subset) >= n:
            # The subset is full: store it and start a new one
            subsets.append(current_subset.copy())
            current_subset = []
        else:
            # Greedily add the unvisited label least similar to the current subset centroid
            subset_average = avg_embedding(current_subset)
            remaining_embeddings = [emb for cls_, emb in embeddings.items() if not visited[cls_]]
            if not remaining_embeddings:
                break  # handle edge case: nothing left to assign

            least_similar = get_least_similar_embedding(
                target_embedding=subset_average, candidate_embeddings=remaining_embeddings)

            visited_class = get_embedding_class(least_similar, embeddings)
            if visited_class is not None:
                visited[visited_class] = True

            current_subset.append(least_similar)

    if current_subset:  # add any remaining elements in current_subset
        subsets.append(current_subset)

    return subsets

The result of this greedy subset sampling is that all the training labels are clearly boxed into subsets, where each subset has at most n classes. This inherently makes the job of a classifier easier, compared to the full set of original classes it would have to classify between otherwise!

Semi-supervised classification with noise oversampling

Cascade this after the initial label subset formation — i.e., this classifier is only classifying between a given subset of classes.

Picture this: when you have low amounts of training data, you absolutely cannot create a hold-out set that is meaningful for evaluation. Should you do it at all? How do you know if your classifier is working well?

We approached this problem slightly differently — we defined the fundamental job of a semi-supervised classifier to be pre-emptive classification of a sample. This means that regardless of what a sample gets classified as, it will be ‘verified’ and ‘corrected’ at a later stage: this classifier only needs to identify what needs to be verified.

As such, we created a design for how it would treat its data:

  • n+1 classes, where the last class is noise
  • noise: data from classes that are NOT in the current classifier’s purview. The noise class is oversampled to be 2x the average size of the data for the classifier’s labels

Oversampling on noise is a faux-safety measure, to ensure that adjacent data that belongs to another class is most likely predicted as noise instead of slipping through for verification.

How do you check if this classifier is working well? In our experiments, we define this via the number of ‘uncertain’ samples in a classifier’s predictions. Using uncertainty sampling and information-gain principles, we were effectively able to gauge whether a classifier is ‘learning’ or not, which acts as a pointer towards classification performance. This classifier is consistently retrained unless there is an inflection point in the number of uncertain samples predicted, or only a delta of information is being added iteratively by new samples.
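As a rough illustration of this data design (a simplified sketch rather than the exact implementation), a noise-augmented training set for one subset could be assembled like this:

import random

def build_subset_training_data(subset_data, other_data, noise_factor=2):
    # subset_data: dict {label: [samples]} for the labels in this classifier's purview
    # other_data: list of samples belonging to all other subsets (treated as noise)
    X, y = [], []
    for label, samples in subset_data.items():
        X.extend(samples)
        y.extend([label] * len(samples))

    # Oversample the noise class to ~2x the average class size of this subset
    avg_class_size = sum(len(s) for s in subset_data.values()) // len(subset_data)
    n_noise = min(len(other_data), noise_factor * avg_class_size)
    noise_samples = random.sample(other_data, n_noise)
    X.extend(noise_samples)
    y.extend(["noise"] * n_noise)
    return X, y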

Proxy active learning via an LLM agent

This is the heart of the approach — using an LLM as a proxy for a human validator. The human validator approach we are talking about is Active Labelling.

Let’s get an intuitive understanding of Active Labelling:

  • Use an ML model to learn on a sample input dataset, predict on a large set of datapoints
  • For the predictions given on the datapoints, a subject-matter expert (SME) evaluates ‘validity’ of predictions
  • Recursively, new ‘corrected’ samples are added as training data to the ML model
  • The ML model consistently learns/retrains, and makes predictions until the SME is satisfied by the quality of predictions

For Active Labelling to work, there are expectations involved for an SME:

  • when we expect a human expert to ‘validate’ an output sample, the expert understands what the task is
  • a human expert will use judgement to evaluate ‘what else’ definitely belongs to a label L when deciding if a new sample should belong to L

Given these expectations and intuitions, we can ‘mimic’ these using an LLM:

  • give the LLM an ‘understanding’ of what each label means. This can be done by using a larger model to critically evaluate the relationship between {label: data mapped to label} for all labels. In our experiments, this was done using a 32B variant of DeepSeek that was self-hosted.
Giving an LLM the capability to understand ‘why, what, and how’
  • Instead of predicting what is the correct label, leverage the LLM to identify if a prediction is ‘valid’ or ‘invalid’ only (i.e., LLM only has to answer a binary query).
  • Reinforce the idea of what other valid samples for the label look like, i.e., for every pre-emptively predicted label for a sample, dynamically source c closest samples in its training (guaranteed valid) set when prompting for validation.

The result? A cost-effective framework that relies on a fast, cheap classifier to make pre-emptive classifications, and an LLM that verifies these using (meaning of the label + dynamically sourced training samples that are similar to the current classification):

import math

def calculate_uncertainty(clf, sample):
    predicted_probabilities = clf.predict_proba(sample.reshape(1, -1))[0]  # Reshape sample for predict_proba
    uncertainty = -sum(p * math.log(p, 2) for p in predicted_probabilities if p > 0)  # skip zero probabilities to avoid log(0)
    return uncertainty


def select_informative_samples(clf, data, k):
    informative_samples = []
    uncertainties = [calculate_uncertainty(clf, sample) for sample in data]

    # Sort data by descending order of uncertainty
    sorted_data = sorted(zip(data, uncertainties), key=lambda x: x[1], reverse=True)

    # Get top k samples with highest uncertainty
    for sample, uncertainty in sorted_data[:k]:
        informative_samples.append(sample)

    return informative_samples


def proxy_label(clf, llm_judge, k, testing_data):
    #llm_judge - any LLM with a system prompt tuned for verifying if a sample belongs to a class. Expected output is a bool : True or False. True verifies the original classification, False refutes it
    predicted_classes = clf.predict(testing_data)

    # Select k most informative samples using uncertainty sampling
    informative_samples = select_informative_samples(clf, testing_data, k)

    # List to store correct samples
    voted_data = []

    # Evaluate informative samples with the LLM judge
    for sample in informative_samples:
        sample_index = testing_data.tolist().index(sample.tolist()) # changed from testing_data.index(sample) because of numpy array type issue
        predicted_class = predicted_classes[sample_index]

        # Check if LLM judge agrees with the prediction
        if llm_judge(sample, predicted_class):
            # If correct, add the sample to voted data
            voted_data.append(sample)

    # Return the list of correct samples with proxy labels
    return voted_data
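The llm_judge passed to proxy_label above is left undefined; a minimal sketch of what it could look like is below. The client call and prompt format are hypothetical, and in a real pipeline the judge would receive the raw text behind a sample (plus the label’s meaning and a few of its nearest guaranteed-valid training samples, as described earlier), not its embedding:

def make_llm_judge(llm_client, label_descriptions, valid_samples_by_label, c=3):
    # label_descriptions: {label: natural-language meaning of the label}
    # valid_samples_by_label: {label: [guaranteed-valid training samples]}
    def llm_judge(sample_text, predicted_label):
        examples = valid_samples_by_label[predicted_label][:c]  # ideally: the c nearest neighbours
        prompt = (
            f"Label: {predicted_label}\n"
            f"Meaning: {label_descriptions[predicted_label]}\n"
            f"Known valid examples: {examples}\n"
            f"Sample: {sample_text}\n"
            "Does the sample belong to this label? Answer only True or False."
        )
        answer = llm_client(prompt)  # hypothetical call returning the model's text reply
        return answer.strip().lower().startswith("true")
    return llm_judge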

By feeding the valid samples (voted_data) to our classifier under controlled parameters, we achieve the ‘recursive’ part of our algorithm:

Recursive Expert Delegation: R.E.D.
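Put together, the recursion can be sketched as a simple loop that retrains the classifier with the expert-validated samples and stops once the expert no longer contributes meaningful new information (again, a simplified sketch rather than the exact implementation; a full version would also remove validated samples from the unlabeled pool):

import numpy as np

def recursive_expert_delegation(clf, llm_judge, X_train, y_train, unlabeled_data,
                                k=100, max_rounds=10, min_new_samples=5):
    for _ in range(max_rounds):
        clf.fit(X_train, y_train)
        # Pre-emptive classification + LLM validation of the most uncertain samples
        voted = proxy_label(clf, llm_judge, k, unlabeled_data)
        if len(voted) < min_new_samples:
            break  # saturation: the expert is no longer adding information
        voted = np.array(voted)
        # Feed the validated samples back as fresh training data
        X_train = np.concatenate([X_train, voted])
        y_train = np.concatenate([y_train, clf.predict(voted)])
    return clf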

By doing this, we were able to achieve close-to-human-expert validation numbers on controlled multi-class datasets. Experimentally, R.E.D. scales up to 1,000 classes while maintaining a competent degree of accuracy almost on par with human experts (90%+ agreement).

I believe this is a significant achievement in applied ML, and has real-world uses for production-grade expectations of cost, speed, scale, and adaptability. The technical report, to be published later this year, highlights relevant code samples as well as the experimental setups used to achieve the given results.

All images, unless otherwise noted, are by the author

Interested in more details? Reach out to me over Medium or email for a chat!

The post R.E.D.: Scaling Text Classification with Expert Delegation appeared first on Towards Data Science.

]]>
Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs  https://towardsdatascience.com/mastering-prompt-engineering-with-functional-testing-a-systematic-guide-to-reliable-llm-outputs/ Fri, 14 Mar 2025 22:48:00 +0000 https://towardsdatascience.com/?p=599600 How prompt evaluation with a systematic approach composed of algorithmic testing with input/output data fixtures can make prompt engineering for complex AI tasks more reliable.

The post Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs  appeared first on Towards Data Science.

]]>
Creating efficient prompts for large language models often starts as a simple task… but it doesn’t always stay that way. Initially, following basic best practices seems sufficient: adopt the persona of a specialist, write clear instructions, require a specific response format, and include a few relevant examples. But as requirements multiply, contradictions emerge, and even minor modifications can introduce unexpected failures. What was working perfectly in one prompt version suddenly breaks in another.

If you have ever felt trapped in an endless loop of trial and error, adjusting one rule only to see another one fail, you’re not alone! The reality is that traditional prompt optimisation is clearly missing a structured, more scientific approach that will help to ensure reliability.

That’s where functional testing for prompt engineering comes in! This approach, inspired by methodologies of experimental science, leverages automated input-output testing with multiple iterations and algorithmic scoring to turn prompt engineering into a measurable, data-driven process. 

No more guesswork. No more tedious manual validation. Just precise and repeatable results that allow you to fine-tune prompts efficiently and confidently.

In this article, we will explore a systematic approach for mastering prompt engineering, which ensures your LLM outputs will be efficient and reliable even for the most complex AI tasks.

Balancing precision and consistency in prompt optimisation

Adding a large set of rules to a prompt can introduce partial contradictions between rules and lead to unexpected behaviors. This is especially true when following a pattern of starting with a general rule and following it with multiple exceptions or specific contradictory use cases. Adding specific rules and exceptions can cause conflict with the primary instruction and, potentially, with each other.

What might seem like a minor modification can unexpectedly impact other aspects of a prompt. This is not only true when adding a new rule but also when adding more detail to an existing rule, like changing the order of the set of instructions or even simply rewording it. These minor modifications can unintentionally change the way the model interprets and prioritizes the set of instructions.

The more details you add to a prompt, the greater the risk of unintended side effects. By trying to give too many details to every aspect of your task, you also increase the risk of getting unexpected or deformed results. It is, therefore, essential to find the right balance between clarity and a high level of specification to maximise the relevance and consistency of the response. At a certain point, fixing one requirement can break two others, creating the frustrating feeling of taking one step forward and two steps backward in the optimization process.

Testing each change manually becomes quickly overwhelming. This is especially true when one needs to optimize prompts that must follow numerous competing specifications in a complex AI task. The process cannot simply be about modifying the prompt for one requirement after the other, hoping the previous instruction remains unaffected. It also can’t be a system of selecting examples and checking them by hand. A better process with a more scientific approach should focus on ensuring repeatability and reliability in prompt optimization.

From laboratory to AI: Why testing LLM responses requires multiple iterations

Science teaches us to use replicates to ensure reproducibility and build confidence in an experiment’s results. I have been working in academic research in chemistry and biology for more than a decade. In those fields, experimental results can be influenced by a multitude of factors that can lead to significant variability. To ensure the reliability and reproducibility of experimental results, scientists mostly employ a method known as triplicates. This approach involves conducting the same experiment three times under identical conditions, allowing the experimental variations to be of minor importance in the result. Statistical analysis (mean and standard deviation) conducted on the results, mostly in biology, allows the author of an experiment to determine the consistency of the results and strengthens confidence in the findings.

Just like in biology and chemistry, this approach can be used with LLMs to achieve reliable responses. With LLMs, the generation of responses is non-deterministic, meaning that the same input can lead to different outputs due to the probabilistic nature of the models. This variability is challenging when evaluating the reliability and consistency of LLM outputs.

In the same way that biological/chemical experiments require triplicates to ensure reproducibility, testing LLMs needs multiple iterations to measure reproducibility. A single test per use case is, therefore, not sufficient because it does not represent the inherent variability in LLM responses. At least five iterations per use case allow for a better assessment. By analyzing the consistency of the responses across these iterations, one can better evaluate the reliability of the model and identify any potential issues or variations. It ensures that the output of the model is correctly controlled.

Multiply this across 10 to 15 different prompt requirements, and one can easily understand how, without a structured testing approach, we end up spending time in trial-and-error testing with no efficient way to assess quality.

A systematic approach: Functional testing for prompt optimization

To address these challenges, a structured evaluation methodology can be used to ease and accelerate the testing process and enhance the reliability of LLM outputs. This approach has several key components:

  • Data fixtures: The core of the approach is the data fixtures, which are composed of predefined input-output pairs specifically created for prompt testing. These fixtures serve as controlled scenarios that represent the various requirements and edge cases the LLM must handle. By using a diverse set of fixtures, the performance of the prompt can be evaluated efficiently across different conditions.
  • Automated test validation: This approach automates the validation of the requirements on a set of data fixtures by comparison between the expected outputs defined in the fixtures and the LLM response. This automated comparison ensures consistency and reduces the potential for human error or bias in the evaluation process. It allows for quick identification of discrepancies, enabling fine and efficient prompt adjustments.
  • Multiple iterations: To assess the inherent variability of the LLM responses, this method runs multiple iterations for each test case. This iterative approach mimics the triplicate method used in biological/chemical experiments, providing a more robust dataset for analysis. By observing the consistency of responses across iterations, we can better assess the stability and reliability of the prompt.
  • Algorithmic scoring: The results of each test case are scored algorithmically, reducing the need for long and laborious "human" evaluation. This scoring system is designed to be objective and quantitative, providing clear metrics for assessing the performance of the prompt. And by focusing on measurable outcomes, we can make data-driven decisions to optimize the prompt effectively.

Step 1: Defining test data fixtures

Selecting or creating compatible test data fixtures is the most challenging step of our systematic approach because it requires careful thought. A fixture is not just any input-output pair; it must be crafted meticulously to evaluate as accurately as possible the performance of the LLM for a specific requirement. This process requires:

1. A deep understanding of the task and the behavior of the model to make sure the selected examples effectively test the expected output while minimizing ambiguity or bias.

2. Foresight into how the evaluation will be conducted algorithmically during the test.

The quality of a fixture, therefore, depends not only on the good representativeness of the example but also on ensuring it can be efficiently tested algorithmically.

A fixture consists of:

    • Input example: This is the data that will be given to the LLM for processing. It should represent a typical or edge-case scenario that the LLM is expected to handle. The input should be designed to cover a wide range of possible variations that the LLM might have to deal with in production.

    • Expected output: This is the expected result that the LLM should produce with the provided input example. It is used for comparison with the actual LLM response output during validation.

Step 2: Running automated tests

Once the test data fixtures are defined, the next step involves the execution of automated tests to systematically evaluate the performance of the LLM response on the selected use cases. As previously stated, this process makes sure that the prompt is thoroughly tested against various scenarios, providing a reliable evaluation of its efficiency.

Execution process

    1. Multiple iterations: For each test use case, the same input is provided to the LLM multiple times. A simple for loop over nb_iter iterations, with nb_iter = 5, and voilà!

    2. Response comparison: After each iteration, the LLM response is compared to the expected output of the fixture. This comparison checks whether the LLM has correctly processed the input according to the specified requirements.

    3. Scoring mechanism: Each comparison results in a score:

        ◦ Pass (1): The response matches the expected output, indicating that the LLM has correctly handled the input.

        ◦ Fail (0): The response does not match the expected output, signaling a discrepancy that needs to be fixed.

    4. Final score calculation: The scores from all iterations are aggregated to calculate the overall final score. This score represents the proportion of successful responses out of the total number of iterations. A high score, of course, indicates high prompt performance and reliability (see the sketch below).
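As a concrete, simplified sketch of this execution process (assuming a call_llm function that sends your prompt plus the fixture input to the model, and a validate function that encodes the pass/fail check for your requirement):

def run_prompt_tests(prompt, fixtures, call_llm, validate, nb_iter=5):
    # fixtures: list of dicts, each with at least an "input" and whatever validate needs
    scores = []
    for fixture in fixtures:
        for _ in range(nb_iter):
            response = call_llm(prompt, fixture["input"])  # hypothetical LLM call
            passed = validate(response, fixture)           # Pass (1) or Fail (0)
            scores.append(1 if passed else 0)
    return sum(scores) / len(scores)  # final score: proportion of passing iterations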

Example: Removing author signatures from an article

Let’s consider a simple scenario where an AI task is to remove author signatures from an article. To efficiently test this functionality, we need a set of fixtures that represent the various signature styles. 

A dataset for this example could be:

Example Input                               Expected Output
A long article signed "Jean Leblanc"        The long article
A long article signed "P. W. Hartig"        The long article
A long article signed "MCZ"                 The long article

Validation process:

  • Signature removal check: The validation function checks if the signature is absent from the rewritten text. This is easily done programmatically by searching for the signature needle in the haystack output text (see the sketch below).
  • Test failure criteria: If the signature is still in the output, the test fails. This indicates that the LLM did not correctly remove the signature and that further adjustments to the prompt are required. If it is not, the test is passed. 
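For this signature-removal example, the fixtures and the validation function could look like the following sketch (the articles are placeholders, exactly as in the table above):

fixtures = [
    {"input": "A long article\nJean Leblanc", "signature": "Jean Leblanc", "expected_output": "The long article"},
    {"input": "A long article\nP. W. Hartig", "signature": "P. W. Hartig", "expected_output": "The long article"},
    {"input": "A long article\nMCZ", "signature": "MCZ", "expected_output": "The long article"},
]

def validate(response, fixture):
    # Pass if the signature needle is absent from the haystack of the rewritten text
    return fixture["signature"] not in response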

The test evaluation provides a final score that allows a data-driven assessment of the prompt efficiency. If it scores perfectly, there is no need for further optimization. However, in most cases, you will not get a perfect score because either the consistency of the LLM response to a case is low (for example, 3 out of 5 iterations scored positive) or there are edge cases that the model struggles with (0 out of 5 iterations). 

The feedback clearly indicates that there is still room for further improvements and it guides you to reexamine your prompt for ambiguous phrasing, conflicting rules, or edge cases. By continuously monitoring your score alongside your prompt modifications, you can incrementally reduce side effects, achieve greater efficiency and consistency, and approach an optimal and reliable output. 

A perfect score is, however, not always achievable with the selected model. Changing the model might just fix the situation. If it doesn’t, you know the limitations of your system and can take this fact into account in your workflow. With luck, this situation might just be solved in the near future with a simple model update. 

Benefits of this method 

  • Reliability of the result: Running five to ten iterations provides reliable statistics on the performance of the prompt. A single test run may succeed once but not twice, and consistent success for multiple iterations indicates a robust and well-optimized prompt.
  • Efficiency of the process: Unlike traditional scientific experiments that may take weeks or months to replicate, automated testing of LLMs can be carried out quickly. By setting a high number of iterations and waiting for a few minutes, we can obtain a high-quality, reproducible evaluation of the prompt efficiency.
  • Data-driven optimization: The score obtained from these tests provides a data-driven assessment of the prompt’s ability to meet requirements, allowing targeted improvements.
  • Side-by-side evaluation: Structured testing allows for an easy assessment of prompt versions. By comparing the test results, one can identify the most effective set of parameters for the instructions (phrasing, order of instructions) to achieve the desired results.
  • Quick iterative improvement: The ability to quickly test and iterate on prompts is a real advantage for carefully constructing the prompt, ensuring that previously validated requirements still hold as the prompt grows in complexity and length.

By adopting this automated testing approach, we can systematically evaluate and enhance prompt performance, ensuring consistent and reliable outputs with the desired requirements. This method saves time and provides a robust analytical tool for continuous prompt optimization.

Systematic prompt testing: Beyond prompt optimization

Implementing a systematic prompt testing approach offers more advantages than just the initial prompt optimization. This methodology is valuable for other aspects of AI tasks:

    1. Model comparison:

        ◦ Provider evaluation: This approach allows the efficient comparison of different LLM providers, such as ChatGPT, Claude, Gemini, Mistral, etc., on the same tasks. It becomes easy to evaluate which model performs the best for their specific needs.

        ◦ Model version: State-of-the-art model versions are not always necessary when a prompt is well-optimized, even for complex AI tasks. A lightweight, faster version can provide the same results with a faster response. This approach allows a side-by-side comparison of the different versions of a model, such as Gemini 1.5 flash vs. 1.5 pro vs. 2.0 flash or ChatGPT 3.5 vs. 4o mini vs. 4o, and allows the data-driven selection of the model version.

    2. Version upgrades:

        ◦ Compatibility verification: When a new model version is released, systematic prompt testing helps validate if the upgrade maintains or improves the prompt performance. This is crucial for ensuring that updates do not unintentionally break the functionality.

        ◦ Seamless Transitions: By identifying key requirements and testing them, this method can facilitate better transitions to new model versions, allowing fast adjustment when necessary in order to maintain high-quality outputs.

    3. Cost optimization:

        ◦ Performance-to-cost ratio: Systematic prompt testing helps in choosing the best cost-effective model based on the performance-to-cost ratio. We can efficiently identify the most efficient option between performance and operational costs to get the best return on LLM costs.

Overcoming the challenges

The biggest challenge of this approach is the preparation of the set of test data fixtures, but the effort invested in this process will pay off significantly as time passes. Well-prepared fixtures save considerable debugging time and enhance model efficiency and reliability by providing a robust foundation for evaluating the LLM response. The initial investment is quickly returned by improved efficiency and effectiveness in LLM development and deployment.

Quick pros and cons

Key advantages:

  • Continuous improvement: The ability to add more requirements over time while ensuring existing functionality stays intact is a significant advantage. This allows for the evolution of the AI task in response to new requirements, ensuring that the system remains up-to-date and efficient.
  • Better maintenance: This approach enables the easy validation of prompt performance with LLM updates. This is crucial for maintaining high standards of quality and reliability, as updates can sometimes introduce unintended changes in behavior.
  • More flexibility: With a set of quality control tests, switching LLM providers becomes more straightforward. This flexibility allows us to adapt to changes in the market or technological advancements, ensuring we can always use the best tool for the job.
  • Cost optimization: Data-driven evaluations enable better decisions on performance-to-cost ratio. By understanding the performance gains of different models, we can choose the most cost-effective solution that meets the needs.
  • Time savings: Systematic evaluations provide quick feedback, reducing the need for manual testing. This efficiency allows you to quickly iterate on prompt improvement and optimization, accelerating the development process.

Challenges

  • Initial time investment: Creating test fixtures and evaluation functions can require a significant investment of time. 
  • Defining measurable validation criteria: Not all AI tasks have clear pass/fail conditions. Defining measurable criteria for validation can sometimes be challenging, especially for tasks that involve subjective or nuanced outputs. This requires careful consideration and may involve a difficult selection of the evaluation metrics.
  • Cost associated with multiple tests: Multiple test use cases associated with 5 to 10 iterations can generate a high number of LLM requests for a single test automation. But if the cost of a single LLM call is negligible, as it is in most cases for text input/output calls, the overall cost of a test remains minimal.

Conclusion: When should you implement this approach?

Implementing this systematic testing approach is, of course, not always necessary, especially for simple tasks. However, for complex AI workflows in which precision and reliability are critical, this approach becomes highly valuable by offering a systematic way to assess and optimize prompt performance, preventing endless cycles of trial and error.

By incorporating functional testing principles into prompt engineering, we transform a traditionally subjective and fragile process into one that is measurable, scalable, and robust. Not only does it enhance the reliability of LLM outputs, but it also helps achieve continuous improvement and efficient resource allocation.

The decision to implement systematic prompt testing should be based on the complexity of your project. For scenarios demanding high precision and consistency, investing the time to set up this methodology can significantly improve outcomes and speed up the development processes. However, for simpler tasks, a more classical, lightweight approach may be sufficient. The key is to balance the need for rigor with practical considerations, ensuring that your testing strategy aligns with your goals and constraints.

Thanks for reading!

The post Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs  appeared first on Towards Data Science.

]]>
Effortless Spreadsheet Normalisation With LLM https://towardsdatascience.com/effortless-spreadsheet-normalisation-with-llm/ Fri, 14 Mar 2025 18:10:04 +0000 https://towardsdatascience.com/?p=599590 Clean data, clear insights: The LLM workflow for reshaping spreadsheets

The post Effortless Spreadsheet Normalisation With LLM appeared first on Towards Data Science.

]]>
This article is part of a series of articles on automating Data Cleaning for any tabular dataset.

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Tidy and untidy examples of a spreadsheet

Start with the why

A spreadsheet containing information about awards given to films

Let’s consider this Excel spreadsheet, which contains information on awards given to films. It is sourced from the book Cleaning Data for Effective Data Science and is available here.

This is a typical and common spreadsheet that everyone may own and deal with in their daily tasks. But what is wrong with it?

To answer that question, let us first recall the end goal of using data: to derive insights that help guide our decisions in our personal or business lives. This process requires at least two crucial things:

  • Reliable data: clean data without issues, inconsistencies, duplicates, missing values, etc.
  • Tidy data: a well-normalised data frame that facilitates processing and manipulation.

The second point is the primary foundation of any analysis, including dealing with data quality.

Returning to our example, imagine we want to perform the following actions:

1. For each film involved in multiple awards, list the award and year it is associated with.

2. For each actor/actress winning multiple awards, list the film and award they are associated with.

3. Check that all actor/actress names are correct and well-standardised.

Naturally, this example dataset is small enough to derive those insights by eye or by hand once we structure it (almost as quickly as writing code). But imagine now that the dataset contains the entire awards history; without any automation, this would be time-consuming, painful, and error-prone.

It is difficult for a machine to read this spreadsheet and directly understand its structure, as it does not follow good practices of data arrangement. That is why tidying data is so important. By ensuring that data is structured in a machine-friendly way, we can simplify parsing, automate quality checks, and enhance business analysis—all without altering the actual content of the dataset.

Example of a reshaping of this data:

Example of a reshaping of the data from the previous spreadsheet:

Now, anyone can use low/no-code tools or code-based queries (SQL, Python, etc.) to interact easily with this dataset and derive insights.

The main challenge is how to turn a shiny and human-eye-pleasant spreadsheet into a machine-readable tidy version.

Moving on to the what? The tidy data guidelines

What is tidy data? A well-shaped data frame?

The term tidy data was described in a well‐known article named Tidy Data by Hadley Wickham and published in the Journal of Statistical Software in 2014. Below are the key quotes required to understand the underlying concepts better.

Data tidying 

“Structuring datasets to facilitate manipulation, visualisation and modelling.”

“Tidy datasets provide a standardised way of linking the structure of a dataset (its physical layout) with its semantics (its meaning).”

Data structure

“Most statistical datasets are rectangular tables composed of rows and columns. The columns are almost always labelled, and the rows are sometimes labelled.”

Data semantics

“A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to both a variable and an observation. A variable contains all values that measure the same underlying attribute (such as height, temperature or duration) across units. An observation contains all values measured on the same unit (for example, a person, a day or a race) across attributes.”

“In a given analysis, there may be multiple levels of observation. For example, in a trial of a new allergy medication, we might have three types of observations:

  • Demographic data collected from each person (age, sex, race),
  • Medical data collected from each person on each day (number of sneezes, redness of eyes), and
  • Meteorological data collected on each day (temperature, pollen count).”

Tidy data

“Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is considered messy or tidy depending on how its rows, columns and tables correspond to observations, variables and types. In tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.”

Common problems with messy datasets

Column headers might be values rather than variable names.

  • Messy example: A table where column headers are years (2019, 2020, 2021) instead of a “Year” column.
  • Tidy version: A table with a “Year” column and each row representing an observation for a given year.

Multiple variables might be stored in one column.

  • Messy example: A column named “Age_Gender” containing values like 28_Female
  • Tidy version: Separate columns for “Age” and “Gender”

Variables might be stored in both rows and columns.

  • Messy example: A dataset tracking student test scores where subjects (Math, Science, English) are stored as both column headers and repeated in rows instead of using a single “Subject” column.
  • Tidy version: A table with columns for “Student ID,” “Subject,” and “Score,” where each row represents one student’s score for one subject.

Multiple types of observational units might be stored in the same table.

  • Messy example: A sales dataset that contains both customer information and store inventory in the same table.
  • Tidy version: Separate tables for “Customers” and “Inventory.”

A single observational unit might be stored in multiple tables.

  • Messy example: A patient’s medical records are split across multiple tables (Diagnosis Table, Medication Table) without a common patient ID linking them.
  • Tidy version: A single table or properly linked tables using a unique “Patient ID.”
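
To make the first two problems concrete, here is a minimal pandas sketch (the column names and values are invented for illustration) that melts year columns into a single “Year” variable and splits an “Age_Gender” column into two variables:

```python
import pandas as pd

# Messy: years as column headers, and age + gender packed into one column
messy = pd.DataFrame({
    "Film": ["Film A", "Film B"],
    "Age_Gender": ["28_Female", "35_Male"],
    "2019": [1, 0],
    "2020": [2, 1],
})

# Tidy step 1: melt the year columns into a single "Year" variable
tidy = messy.melt(id_vars=["Film", "Age_Gender"],
                  var_name="Year", value_name="Awards")

# Tidy step 2: split "Age_Gender" into two separate variables
tidy[["Age", "Gender"]] = tidy["Age_Gender"].str.split("_", expand=True)
tidy = tidy.drop(columns="Age_Gender")
print(tidy)
```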

Now that we have a better understanding of what tidy data is, let’s see how to transform a messy dataset into a tidy one.

Thinking about the how

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham (cf. Leo Tolstoy)

Although these guidelines sound clear in theory, they are difficult to generalise in practice to any kind of dataset. In other words, starting from messy data, no simple or deterministic process or algorithm exists to reshape it. This is mainly because every dataset is unique in its own way. Indeed, it is surprisingly hard to precisely define variables and observations in general and then transform data automatically without losing content. That is why, despite massive improvements in data processing over the last decade, data cleaning and formatting are still done “manually” most of the time.

Thus, when rules-based systems become too complex and hard to maintain (i.e. when it is impractical to describe every decision for every context in advance), machine learning models may offer some benefits. They grant the system more freedom to adapt to any data by generalising what was learned during training. Many large language models (LLMs) have been exposed to numerous data processing examples, making them capable of analysing input data and performing tasks such as spreadsheet structure analysis, table schema estimation, and code generation.

Then, let’s describe a workflow made of code and LLM-based modules, alongside business logic, to reshape any spreadsheet.

Diagram of a workflow made of code and LLM-based modules alongside business logic to reshape a spreadsheet

Spreadsheet encoder 

This module is designed to serialise into text the main information needed from the spreadsheet data. Only the subset of cells that contributes to the table layout is retained, while non-essential or overly repetitive formatting information is removed. This step minimises token usage, reduces costs, and enhances model performance. The current version is a deterministic algorithm inspired by the paper SpreadsheetLLM: Encoding Spreadsheets for Large Language Models, which relies on heuristics. More details about it will be the topic of a future article.

Table structure analysis 

Before moving forward, asking an LLM to extract the spreadsheet structure is a crucial step in building the next actions. Here are examples of questions addressed:

  • How many tables are present, and what are their locations (regions) in the spreadsheet?
  • What defines the boundaries of each table (e.g., empty rows/columns, specific markers)?
  • Which rows/columns serve as headers, and do any tables have multi-level headers?
  • Are there metadata sections, aggregated statistics, or notes that need to be filtered out or processed separately?
  • Are there any merged cells, and if so, how should they be handled?

Table schema estimation

Once the analysis of the spreadsheet structure has been completed, it is now time to start thinking about the ideal target table schema. This involves letting the LLM process iteratively by:

  • Identifying all potential columns (multi-row headers, metadata, etc.)
  • Comparing columns for domain similarities based on column names and data semantics
  • Grouping related columns  

The module outputs a final schema with names and a short description for each retained column.
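
As an illustration, a schema produced for the film awards example might look like the following (this exact output format is an assumption, not the actual service output):

```python
# Hypothetical target schema for the film awards spreadsheet
target_schema = [
    {"name": "film",  "description": "Title of the awarded film"},
    {"name": "year",  "description": "Year the award was given"},
    {"name": "award", "description": "Name of the award or category"},
    {"name": "actor", "description": "Actor or actress associated with the award, if any"},
]
```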

Code generation to format the spreadsheet

Considering the previous structure analysis and the table schema, this last LLM-based module should draft code that transforms the spreadsheet into a proper data frame compliant with the table schema. Moreover, none of the useful content may be omitted (e.g. aggregated or computed values may still be derived from other variables).

As it is challenging to generate code that works well on the first try, two internal iterative processes are added to revise the code if needed (a rough sketch of this loop follows the list):

  • Code checking: Whenever code cannot be compiled or executed, the trace error is provided to the model to update its code.
  • Data frame validation: The metadata of the created data frame—such as column names, first and last rows, and statistics about each column—is checked to validate whether the table conforms to expectations. Otherwise, the code is revised accordingly.
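
A minimal sketch of that revision loop is shown below; the helper functions passed in as parameters (llm_generate_code, execute_code, validate_dataframe) are hypothetical placeholders, not the actual implementation:

```python
def reshape_with_revisions(spreadsheet, structure_analysis, target_schema,
                           llm_generate_code, execute_code, validate_dataframe,
                           max_attempts=5):
    """Iteratively draft, run and validate reshaping code until it passes."""
    feedback = None
    for _ in range(max_attempts):
        code = llm_generate_code(structure_analysis, target_schema, feedback)
        try:
            df = execute_code(code, spreadsheet)            # code checking
        except Exception as err:
            feedback = f"Execution failed: {err}"           # trace error fed back to the model
            continue
        ok, issues = validate_dataframe(df, target_schema)  # data frame validation
        if ok:
            return df
        feedback = issues                                   # metadata mismatches fed back to the model
    raise RuntimeError("Could not produce a valid data frame within the retry budget")
```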

Convert the data frame into an Excel file

Finally, if all data fits properly into a single table, a worksheet is created from this data frame to respect the tabular format. The final asset returned is an Excel file whose active sheet contains the tidy spreadsheet data.

Et voilà! The sky’s the limit for making the most of your newly tidy dataset.

Feel free to test it with your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Final note on the workflow

Why is a workflow proposed instead of an agent for that purpose?  

At the time of writing, we consider that a workflow that uses LLMs for precise sub-tasks is more robust, stable, iterable, and maintainable than a more autonomous agent. An agent may offer advantages, such as more freedom in choosing the actions it performs. Nonetheless, agents can still be hard to control in practice; for example, they may diverge quickly if the objective is not clear enough. I believe that is the case here, but that does not mean an agentic approach could not become applicable in the future, much as SWE-agent-style coding is already performing well, for example.

Next articles in the series

In upcoming articles, we plan to explore related topics, including:

  • A detailed description of the spreadsheet encoder mentioned earlier.
  • Data validity: ensuring each column meets the expectations.
  • Data uniqueness: preventing duplicate entities within the dataset.
  • Data completeness: handling missing values effectively.
  • Evaluating data reshaping, validity, and other key aspects of data quality.

Stay tuned!

Thank you to Marc Hobballah for reviewing this article and providing feedback.

All images, unless otherwise noted, are by the author.

The post Effortless Spreadsheet Normalisation With LLM appeared first on Towards Data Science.

Are You Still Using LoRA to Fine-Tune Your LLM? https://towardsdatascience.com/are-you-still-using-lora-to-fine-tune-your-llm/ Thu, 13 Mar 2025 20:25:32 +0000 https://towardsdatascience.com/?p=599574 A look at this year’s crop of LoRA alternatives

LoRA (Low Rank Adaptation – arxiv.org/abs/2106.09685) is a popular technique for fine-tuning Large Language Models (LLMs) on the cheap. But 2024 has seen an explosion of new parameter-efficient fine-tuning techniques, an alphabet soup of LoRA alternatives: SVF, SVFT, MiLoRA, PiSSA, LoRA-XS 🤯… And most are based on a matrix technique I like a lot: the SVD (Singular Value Decomposition). Let’s dive in.

LoRA

The original LoRA insight is that fine-tuning all the weights of a model is overkill. Instead, LoRA freezes the model and only trains a small pair of low-rank “adapter” matrices. See the illustrations below (where W is any matrix of weights in a transformer LLM).

This saves memory and compute cycles since far fewer gradients have to be computed and stored. For example, here is a Gemma 8B model fine-tuned to speak like a pirate using LoRA: only 22M parameters are trainable, 8.5B parameters remain frozen.

LoRA is very popular. It has even made it as a single-line API into mainstream ML frameworks like Keras:

gemma.backbone.enable_lora(rank=8)

But is LoRA the best? Researchers have been trying hard to improve on the formula. Indeed, there are many ways of selecting smaller “adapter” matrices. And since most of them make clever use of the singular value decomposition (SVD) of a matrix, let’s pause for a bit of Math.

SVD: the simple math

The SVD is a great tool for understanding the structure of matrices. The technique splits a matrix into three factors: W = U S Vᵀ, where U and V are orthogonal (i.e., basis changes) and S is the diagonal matrix of sorted singular values. This decomposition always exists.

In “textbook” SVD, U and V are square, while S is a rectangle with singular values on the diagonal and a tail of zeros. In practice, you can work with a square S and a rectangular U or V – see the picture – the chopped-off pieces are just multiplications by zero. This “economy-sized” SVD is what is used in common libraries, for example, numpy.linalg.svd.
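
As a minimal sketch of the economy-sized SVD in numpy (the matrix shape is arbitrary):

```python
import numpy as np

W = np.random.randn(768, 512)                     # any weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)  # economy-sized SVD
print(U.shape, S.shape, Vt.shape)                 # (768, 512) (512,) (512, 512)

W_reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(W, W_reconstructed))            # True
```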

So how can we use this to more efficiently select the weights to train? Let’s quickly go through five recent SVD-based low-rank fine-tuning techniques, with commented illustrations.

SVF

The simplest alternative to LoRA is to use the SVD on the model’s weight matrices and then fine-tune the singular values directly. Oddly, this is the most recent technique, called SVF, published in the Transformers² paper (arxiv.org/abs/2501.06252v2).

SVF is much more economical in parameters than LoRA. And as a bonus, it makes tuned models composable. For more info on that, see my Transformers² explainer here, but composing two SVF fine-tuned models is just an addition:

SVFT

Should you need more trainable parameters, the SVFT paper (arxiv.org/abs/2405.19597) explores multiple ways of doing that, starting by adding more trainable weights on the diagonal.

It also evaluates multiple alternatives like spreading them randomly through the “M” matrix.

More importantly, the SVFT paper confirms that having more trainable values than just the diagonal is useful. See their fine-tuning results below.

Next come several techniques that split singular values into two sets, “large” and “small”. But before we proceed, let’s pause for a bit more SVD math.

More SVD math

The SVD is usually seen as a decomposition into three matrices, W = U S Vᵀ, but it can also be thought of as a weighted sum of many rank-1 matrices, weighted by the singular values: W = Σᵢ sᵢ uᵢ vᵢᵀ.

Should you want to prove it, express individual matrix elements Wⱼₖ using the W = U S Vᵀ form and the formula for matrix multiplication on one hand, and the W = Σᵢ sᵢ uᵢ vᵢᵀ form on the other; simplify using the fact that S is diagonal and notice that it’s the same thing.

In this representation, it’s easy to see that you can split the sum in two. And as you can always sort the singular values, you can make this a split between “large” and “small” singular values.

Going back to the three-matrix form W = U S Vᵀ, this is what the split looks like:

Based on this formula, two papers have explored what happens if you tune only the large singular values or only the small ones, PiSSA and MiLoRA.
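
As a rough numpy sketch of this split (the matrix size and the cut-off rank r are arbitrary choices for illustration):

```python
import numpy as np

W = np.random.randn(768, 512)
U, S, Vt = np.linalg.svd(W, full_matrices=False)

r = 16  # arbitrary cut-off rank
W_principal = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]   # "large" singular values (PiSSA adapts these)
W_minor = U[:, r:] @ np.diag(S[r:]) @ Vt[r:, :]       # "small" singular values (MiLoRA adapts these)

print(np.allclose(W, W_principal + W_minor))          # True: the two parts sum back to W
```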

PiSSA

PiSSA (Principal Singular values and Singular Vectors Adaptation, arxiv.org/abs/2404.02948) claims that you should only tune the large principal values. The mechanism is illustrated below:

From the paper: “PiSSA is designed to approximate full finetuning by adapting the principal singular components, which are believed to capture the essence of the weight matrices. In contrast, MiLoRA aims to adapt to new tasks while maximally retaining the base model’s knowledge.”

The PiSSA paper also has an interesting finding: full fine-tuning is prone to over-fitting. You might get better results in the absolute with a low-rank fine-tuning technique.

MiLoRA

MiLoRA (Minor singular component LoRA arxiv.org/abs/2406.09044), on the other hand, claims that you should only tune the small principal values. It uses a similar mechanism to PiSSA:

Surprisingly, MiLoRA seems to have the upper hand, at least when tuning on math datasets which are probably fairly aligned with the original pre-training. Arguably, PiSSA should be better for bending the behavior of the LLM further from its pre-training.

LoRA-XS

Finally, I’d like to mention LoRA-XS (arxiv.org/abs/2405.17604). It is very similar to PiSSA but uses a slightly different mechanism. It also shows good results with significantly fewer parameters than LoRA.

The paper offers a mathematical explanation of why this setup is “ideal” under two conditions:

  • that truncating the bottom principal values from the SVD still offers a good approximation of the weights matrices
  • that the fine-tuning data distribution is close to the pre-training one

Both are questionable IMHO, so I won’t detail the math. Some results:

The underlying assumption seems to be that singular values come in “large” and “small” varieties, but is that true? I made a quick Colab to check this on Gemma2 9B. Bottom line: 99% of the singular values are in the 0.1–1.1 range. I’m not sure partitioning them into “large” and “small” makes that much sense.

Conclusion

There are many more parameter-efficient fine-tuning techniques. Worth mentioning:

My conclusion: to go beyond the LoRA standard with 10x fewer params, I like the simplicity of Transformers²’s SVF. And if you need more trainable weights, SVFT is an easy extension. Both use all singular values (full rank, no singular value pruning) and are still cheap 😁. Happy tuning!

Note: All illustrations are either created by the author or extracted from arxiv.org papers for comment and discussion purposes.

The post Are You Still Using LoRA to Fine-Tune Your LLM? appeared first on Towards Data Science.

How to Make Your LLM More Accurate with RAG & Fine-Tuning https://towardsdatascience.com/how-to-make-your-llm-more-accurate-with-rag-fine-tuning/ Tue, 11 Mar 2025 18:05:00 +0000 https://towardsdatascience.com/?p=599477 And when to use which one

Imagine studying a module at university for a semester. At the end, after an intensive learning phase, you take an exam – and you can recall the most important concepts without looking them up.

Now imagine the second situation: You are asked a question about a new topic. You don’t know the answer straight away, so you pick up a book or browse a wiki to find the right information for the answer.

These two analogies represent two of the most important methods for improving the base model of an LLM or adapting it to specific tasks and domains: Retrieval Augmented Generation (RAG) and fine-tuning.

But which example belongs to which method?

That’s exactly what I’ll explain in this article: After that, you’ll know what RAG and fine-tuning are, the most important differences and which method is suitable for which application.

Let’s dive in!

Table of contents

1. Basics: What is RAG? What is fine-tuning?

Large Language Models (LLMs) such as ChatGPT from OpenAI, Gemini from Google, Claude from Anthropic, or DeepSeek are incredibly powerful and have established themselves in everyday work in an extremely short time.

One of their biggest limitations is that their knowledge is limited to what they saw during training. A model that was trained in 2024 does not know events from 2025. If we ask ChatGPT’s 4o model who the current US President is and give the clear instruction that the Internet should not be used, we see that it cannot answer this question with certainty:

Screenshot taken by the author

In addition, the models cannot easily access company-specific information, such as internal guidelines or current technical documentation. 

This is exactly where RAG and fine-tuning come into play.

Both methods make it possible to adapt an LLM to specific requirements:

RAG — The model remains the same, the input is improved

An LLM with Retrieval Augmented Generation (RAG) remains unchanged.

However, it gains access to an external knowledge source and can therefore retrieve information that is not stored in its model parameters. RAG extends the model in the inference phase by using external data sources to provide the latest or specific information. The inference phase is the moment when the model generates an answer. 

This allows the model to stay up to date without retraining.

How does it work?

  1. A user question is asked.
  2. The query is converted into a vector representation.
  3. A retriever searches for relevant text sections or data records in an external data source. The documents or FAQs are often stored in a vector database.
  4. The content found is transferred to the model as additional context.
  5. The LLM generates its answer on the basis of the retrieved and current information.

The key point is that the LLM itself remains unchanged and the internal weights of the LLM remain the same. 

Let’s assume a company uses an internal AI-powered support chatbot.

The chatbot helps employees answer questions about company policies, IT processes or HR topics. If you asked ChatGPT a question about your company (e.g. How many vacation days do I have left?), the model would logically not give you a meaningful answer. A classic LLM without RAG knows nothing about the company – it has never been trained on this data.

This changes with RAG: The chatbot can search an external database of current company policies for the most relevant documents (e.g. PDF files, wiki pages or internal FAQs) and provide specific answers.

RAG works similarly to how we humans look up specific information in a library or a Google search – but in real time.

A student who is asked about the meaning of CRUD quickly looks up the Wikipedia article and answers Create, Read, Update and Delete – just like a RAG model retrieves relevant documents. This process allows both humans and AI to provide informed answers without memorizing everything.

And this makes RAG a powerful tool for keeping responses accurate and current.

Own visualization by the author

Fine-tuning — The model is trained and stores knowledge permanently

Instead of looking up external information, an LLM can also be directly updated with new knowledge through fine-tuning.

Fine-tuning is used during the training phase to provide the model with additional domain-specific knowledge. An existing base model is further trained with specific new data. As a result, it “learns” domain-specific content and internalizes technical terms and style, while retaining its general understanding of language.

This makes fine-tuning an effective tool for customizing LLMs to specific needs, data or tasks.

How does this work?

  1. The LLM is trained with a specialized data set. This data set contains specific knowledge about a domain or a task.
  2. The model weights are adjusted so that the model stores the new knowledge directly in its parameters.
  3. After training, the model can generate answers without the need for external sources.

Let’s now assume we want to use an LLM that provides us with expert answers to legal questions.

To do this, this LLM is trained with legal texts so that it can provide precise answers after fine-tuning. For example, it learns complex terms such as “intentional tort” and can name the appropriate legal basis in the context of the relevant country. Instead of just giving a general definition, it can cite relevant laws and precedents.

This means that you no longer just have a general LLM like GPT-4o at your disposal, but a useful tool for legal decision-making.

If we look again at the analogy with humans, fine-tuning is comparable to having internalized knowledge after an intensive learning phase.

After this learning phase, a computer science student knows that the term CRUD stands for Create, Read, Update, Delete. He or she can explain the concept without needing to look it up. The general vocabulary has been expanded.

This internalization allows for faster, more confident responses—just like a fine-tuned LLM.

2. Differences between RAG and fine-tuning

Both methods improve the performance of an LLM for specific tasks.

Both methods require well-prepared data to work effectively.

And both methods help to reduce hallucinations – the generation of false or fabricated information.

But if we look at the table below, we can see the differences between these two methods:

RAG is particularly flexible because the model can always access up-to-date data without having to be retrained. It requires less computational effort in advance, but needs more resources while answering a question (inference). The latency can also be higher.

Fine-tuning, on the other hand, offers faster inference times because the knowledge is stored directly in the model weights and no external search is necessary. The major disadvantage is that training is time-consuming and expensive and requires large amounts of high-quality training data.

RAG provides the model with tools to look up knowledge when needed without changing the model itself, whereas fine-tuning stores the additional knowledge in the model with adjusted parameters and weights.

Own visualization by the author

3. Ways to build a RAG model

A popular framework for building a Retrieval Augmented Generation (RAG) pipeline is LangChain. This framework facilitates the linking of LLM calls with a retrieval system and makes it possible to retrieve information from external sources in a targeted manner.

How does RAG work technically?

1. Query embedding

In the first step, the user request is converted into a vector using an embedding model. This is done, for example, with text-embedding-ada-002 from OpenAI or all-MiniLM-L6-v2 from Hugging Face.

This is necessary because vector databases do not search through conventional texts, but instead calculate semantic similarities between numerical representations (embeddings). By converting the user query into a vector, the system can not only search for exactly matching terms, but also recognize concepts that are similar in content.

2. Search in the vector database

The resulting query vector is then compared with a vector database. The aim is to find the most relevant information to answer the question.

This similarity search is carried out using Approximate Nearest Neighbors (ANN) algorithms. Well-known open source tools for this task are, for example, FAISS from Meta for high-performance similarity searches in large data sets or ChromaDB for small to medium-sized retrieval tasks.

3. Insertion into the LLM context

In the third step, the retrieved documents or text sections are integrated into the prompt so that the LLM generates its response based on this information.

4. Generation of the response

The LLM now combines the information received with its general language vocabulary and generates a context-specific response.
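
To make these four steps concrete, here is a minimal sketch that wires them together directly with sentence-transformers, FAISS and the OpenAI chat API (rather than LangChain); the model names, toy documents and prompt are assumptions for illustration, not a production pipeline:

```python
import faiss
from sentence_transformers import SentenceTransformer
from openai import OpenAI

documents = [
    "Employees are entitled to 30 vacation days per year.",
    "IT support tickets are handled via the internal helpdesk portal.",
]

# 1. Embed the documents and the user query
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

query = "How many vacation days do I have?"
query_vector = embedder.encode([query])

# 2. Search the vector index for the most relevant documents
_, ids = index.search(query_vector, 1)
context = "\n".join(documents[i] for i in ids[0])

# 3. + 4. Insert the retrieved context into the prompt and generate the answer
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
print(response.choices[0].message.content)
```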

An alternative to LangChain is the Hugging Face Transformer Library, which provides specially developed RAG classes:

  • ‘RagTokenizer’ tokenizes the input and the retrieval result. The class processes the text entered by the user and the retrieved documents.
  • The ‘RagRetriever’ class performs the semantic search and retrieval of relevant documents from the predefined knowledge base.
  • The ‘RagSequenceForGeneration’ class takes the documents provided, integrates them into the context and transfers them to the actual language model for answer generation.

4. Options for fine-tuning a model

While an LLM with RAG uses external information for the query, with fine-tuning we change the model weights so that the model permanently stores the new knowledge.

How does fine-tuning work technically?

1. Preparation of the training data

Fine-tuning requires a high-quality collection of data. This collection consists of inputs and the desired model responses. For a chatbot, for example, these can be question-answer pairs. For medical models, this could be clinical reports or diagnostic data. For a legal AI, these could be legal texts and judgments.

Let’s take a look at an example: If we look at the documentation of OpenAI, we see that these models use a standardized chat format with roles (system, user, assistant) during fine-tuning. The data format of these question-answer pairs is JSONL and looks like this, for example:

{"messages": [{"role": "system", "content": "You are a medical assistant."}, {"role": "user", "content": "What are the symptoms of the flu?"}, {"role": "assistant", "content": "The most common symptoms of the flu are fever, cough, and muscle and joint pain."}]}

Other models use other data formats such as CSV, JSON or PyTorch datasets.

2. Selection of the base model

We can use a pre-trained LLM as a starting point. These can be closed-source models such as GPT-3.5 or GPT-4 via the OpenAI API, or open-source models such as DeepSeek, LLaMA, Mistral or Falcon, or T5 and FLAN-T5 for NLP tasks.

3. Training of the model

Fine-tuning requires a lot of computing power, as the model is trained with new data to update its weights. Especially large models such as GPT-4 or LLaMA 65B require powerful GPUs or TPUs.

To reduce the computational effort, there are optimized methods such as LoRA (Low-Rank Adaptation), where only a small number of additional parameters are trained, or QLoRA (Quantized LoRA), where quantized model weights (e.g. 4-bit) are used.
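
As a minimal sketch of what a LoRA setup can look like with the Hugging Face peft library (the base model name and hyperparameters are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which weight matrices receive adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # only a small fraction of weights is trainable
```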

4. Model deployment & use

Once the model has been trained, we can deploy it locally or on a cloud platform such as Hugging Face Model Hub, AWS or Azure.

5. When is RAG recommended? When is fine-tuning recommended?

RAG and fine-tuning have different advantages and disadvantages and are therefore suitable for different use cases:

RAG is particularly suitable when content is updated dynamically or frequently.

For example, in FAQ chatbots where information needs to be retrieved from a knowledge database that is constantly expanding. Technical documentation that is regularly updated can also be efficiently integrated using RAG – without the model having to be constantly retrained.

Another point is resources: If limited computing power or a smaller budget is available, RAG makes more sense as no complex training processes are required.

Fine-tuning, on the other hand, is suitable when a model needs to be tailored to a specific company or industry.

The response quality and style can be improved through targeted training. For example, the LLM can then generate medical reports with precise terminology.

The basic rule is: RAG is used when the knowledge is too extensive or too dynamic to be fully integrated into the model, while fine-tuning is the better choice when consistent, task-specific behavior is required.

And then there’s RAFT — the magic of combination

What if we combine the two?

That’s exactly what happens with Retrieval Augmented Fine-Tuning (RAFT).

The model is first enriched with domain-specific knowledge through fine-tuning so that it understands the correct terminology and structure. The model is then extended with RAG so that it can integrate specific and up-to-date information from external data sources. This combination ensures both deep expertise and real-time adaptability.

Companies use the advantages of both methods. 

Final thoughts

Both methods—RAG and fine-tuning—extend the capabilities of a basic LLM in different ways.

Fine-tuning specializes the model for a specific domain, while RAG equips it with external knowledge. The two methods are not mutually exclusive and can be combined in hybrid approaches. Looking at computational costs, fine-tuning is resource-intensive upfront but efficient during operation, whereas RAG requires fewer initial resources but consumes more during use.

RAG is ideal when knowledge is too vast or dynamic to be integrated directly into the model. Fine-tuning is the better choice when stability and consistent optimization for a specific task are required. Both approaches serve distinct but complementary purposes, making them valuable tools in AI applications.

On my Substack, I regularly write summaries about the published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you’re interested, take a look or subscribe.

Where can you continue learning?

The post How to Make Your LLM More Accurate with RAG & Fine-Tuning appeared first on Towards Data Science.

]]>
This Is How LLMs Break Down the Language https://towardsdatascience.com/this-is-how-llms-break-down-the-language/ Mon, 10 Mar 2025 14:01:07 +0000 https://towardsdatascience.com/?p=599380 The science and art behind tokenization

The post This Is How LLMs Break Down the Language appeared first on Towards Data Science.

]]>
Do you remember the hype when OpenAI released GPT-3 in 2020? Though not the first in its series, GPT-3 gained widespread popularity due to its impressive text generation capabilities. Since then, a diverse group of Large Language Models (LLMs) has flooded the AI landscape. The golden question is: Have you ever wondered how ChatGPT or any other LLM breaks down the language? If you haven’t yet, this is what we are going to discuss here: the mechanism by which LLMs process the textual input given to them during training and inference. In principle, we call it tokenization.

This article is inspired by the YouTube video titled Deep Dive into LLMs like ChatGPT from former Senior Director of AI at Tesla, Andrej Karpathy. His general audience video series is highly recommended for those who want to take a deep dive into the intricacies behind LLMs.

Before diving into the main topic, I need you to have an understanding of the inner workings of an LLM. In the next section, I’ll break down the internals of a language model and its underlying architecture. If you’re already familiar with neural networks and LLMs in general, you can skip the next section without affecting your reading experience.

Internals of large language models

LLMs are made up of transformer neural networks. Consider neural networks as giant mathematical expressions. Inputs to neural networks are a sequence of tokens that are typically processed through embedding layers, which convert the tokens into numerical representations. For now, think of tokens as basic units of input data, such as words, phrases, or characters. In the next section, we’ll explore how to create tokens from input text data in depth. When we feed these inputs to the network, they are mixed into a giant mathematical expression along with the parameters or weights of these neural networks.

Modern neural networks have billions of parameters. At the beginning, these parameters or weights are set randomly. Therefore, the neural network randomly guesses its predictions. During the training process, we iteratively update these weights so that the outputs of our neural network become consistent with the patterns observed in our training set. In a sense, neural network training is about finding the right set of weights that seem to be consistent with the statistics of the training set.

The transformer architecture was introduced in the paper titled “Attention is All You Need” by Vaswani et al. in 2017. This is a neural network with a special kind of structure designed for sequence processing. Initially intended for Neural Machine Translation, it has since become the founding building block for LLMs.

To get a sense of what production grade transformer neural networks look like visit https://bbycroft.net/llm. This site provides interactive 3D visualizations of generative pre-trained transformer (GPT) architectures and guides you through their inference process.

Visualization of Nano-GPT at https://bbycroft.net/llm (Image by the author)

This particular architecture, called Nano-GPT, has around 85,584 parameters. We feed the inputs, which are token sequences, at the top of the network. Information then flows through the layers of the network, where the input undergoes a series of transformations, including attention mechanisms and feed-forward networks, to produce an output. The output is the model’s prediction for the next token in the sequence.

Tokenization

Training a state-of-the-art language model like ChatGPT or Claude involves several stages arranged sequentially. In my previous article about hallucinations, I briefly explained the training pipeline for an LLM. If you want to learn more about training stages and hallucinations, you can read it here.

Now, imagine we’re at the initial stage of training called pretraining. This stage requires a large, high-quality, web-scale dataset of terabyte size. The datasets used by major LLM providers are not publicly available. Therefore, we will look into an open-source dataset curated by Hugging Face, called FineWeb distributed under the Open Data Commons Attribution License. You can read more about how they collected and created this dataset here.

FineWeb dataset curated by Hugging Face (Image by the author)

I downloaded a sample from the FineWeb dataset, selected the first 100 examples, and concatenated them into a single text file. This is just raw internet text with various patterns within it.

Sampled text from the FineWeb dataset (Image by the author)

So our goal is to feed this data to the transformer neural network so that the model learns the flow of this text. We need to train our neural network to mimic the text. Before plugging this text into the neural network, we must decide how to represent it. Neural networks expect a one-dimensional sequence of symbols. That requires a finite set of possible symbols. Therefore, we must determine what these symbols are and how to represent our data as a one-dimensional sequence of them.

What we have at this point is a one-dimensional sequence of text. Underlying this text is a representation as a sequence of raw bits: we can encode the original text with UTF-8 to obtain it. If you check the image below, you can see that the first 8 bits of the raw bit sequence correspond to the first letter ‘A’ of the original one-dimensional text sequence.

Sampled text, represented as a one-dimensional sequence of bits (Image by the author)

Now, we have a very long sequence with two symbols: zero and one. This is, in fact, what we were looking for — a one-dimensional sequence of symbols with a finite set of possible symbols. Now the problem is that sequence length is a precious resource in a neural network primarily because of computational efficiency, memory constraints, and the difficulty of processing long dependencies. Therefore, we don’t want extremely long sequences of just two symbols. We prefer shorter sequences of more symbols. So, we are going to trade off the number of symbols in our vocabulary against the resulting sequence length.

As we need to further compress or shorten our sequence, we can group every 8 consecutive bits into a single byte. Since each bit is either 0 or 1, there are exactly 256 possible combinations of 8-bit sequences. Thus, we can represent this sequence as a sequence of bytes instead.

Grouping bits to bytes (Image by the author)

This representation reduces the length by a factor of 8, while expanding the symbol set to 256 possibilities. Consequently, each value in the sequence will fall within the range of 0 to 255.

Sampled text, represented as a one-dimensional sequence of bytes (Image by the author)
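
As a small, self-contained illustration of this bytes view (the sample string here is arbitrary, not the FineWeb sample):

```python
text = "Are you a French moviegoer?"           # arbitrary sample text
raw_bytes = text.encode("utf-8")               # one symbol per byte, values 0-255

bits = "".join(f"{b:08b}" for b in raw_bytes)  # the same data as a sequence of 0s and 1s
print(len(bits), len(raw_bytes))               # the byte sequence is 8x shorter than the bit sequence
print(list(raw_bytes)[:5])                     # [65, 114, 101, 32, 121]
```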

These numbers do not have any value in a numerical sense. They are just placeholders for unique identifiers or symbols. In fact, we could replace each of these numbers with a unique emoji and the core idea would still stand. Think of this as a sequence of emojis, each chosen from 256 unique options.

Sampled text, represented as a one-dimensional sequence of emojis (Image by the author)

This process of converting from raw text into symbols is called Tokenization. Tokenization in state-of-the-art language models goes even beyond this. We can further compress the length of the sequence in return for more symbols in our vocabulary using the Byte-Pair Encoding (BPE) algorithm. Initially developed for text compression, BPE is now widely used by transformer models for tokenization. OpenAI’s GPT series uses standard and customized versions of the BPE algorithm.

Essentially, byte pair encoding involves identifying frequent consecutive bytes or symbols. For example, we can look into our byte level sequence of text.

Sequence 101, followed by 114, is quite frequent (Image by the author)

As you can see, the sequence 101 followed by 114 appears frequently. Therefore, we can replace this pair with a new symbol and assign it a unique identifier. We are going to rewrite every occurrence of 101 114 using this new symbol. This process can be repeated multiple times, with each iteration further shortening the sequence length while introducing additional symbols, thereby increasing the vocabulary size. Using this process, GPT-4 has come up with a token vocabulary of around 100,000.

We can further explore tokenization using Tiktokenizer. Tiktokenizer provides an interactive web-based graphical user interface where you can input text and see how it’s tokenized according to different models. Play with this tool to get an intuitive understanding of what these tokens look like.

For example, we can take the first four sentences of the text sequence and input them into the Tiktokenizer. From the dropdown menu, select the GPT-4 base model encoder: cl100k_base.

Tiktokenizer (Image by the author)

The colored text shows how the chunks of text correspond to the symbols. The following text, which is a sequence of length 51, is what GPT-4 will see at the end of the day.

11787, 499, 21815, 369, 90250, 763, 14689, 30, 7694, 1555, 279, 21542, 3770, 323, 499, 1253, 1120, 1518, 701, 4832, 2457, 13, 9359, 1124, 323, 6642, 264, 3449, 709, 3010, 18396, 13, 1226, 617, 9214, 315, 1023, 3697, 430, 1120, 649, 10379, 83, 3868, 311, 3449, 18570, 1120, 1093, 499, 0
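
You can reproduce this kind of token sequence locally with OpenAI’s tiktoken package (a minimal sketch; the sample sentence is a placeholder, not the exact FineWeb text):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 base model encoder

text = "Did you know? The first few sentences of the sample go here."
tokens = enc.encode(text)
print(tokens)                                # a short list of integer token IDs
print(enc.decode(tokens) == text)            # True: encoding is lossless
```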

We can now take our entire sample dataset and re-represent it as a sequence of tokens using the GPT-4 base model tokenizer, cl100k_base. Note that the original FineWeb dataset consists of a 15-trillion-token sequence, while our sample dataset contains only a few thousand tokens from the original dataset.

Sampled text, represented as a one-dimensional sequence of tokens (Image by the author)

Conclusion

Tokenization is a fundamental step in how LLMs process text, transforming raw text data into a structured format before being fed into neural networks. As neural networks require a one-dimensional sequence of symbols, we need to achieve a balance between sequence length and the number of symbols in the vocabulary, optimizing for efficient computation. Modern state-of-the-art transformer-based LLMs, including GPT and GPT-2, use Byte-Pair Encoding tokenization.

Breaking down tokenization helps demystify how LLMs interpret text inputs and generate coherent responses. Having an intuitive sense of what tokenization looks like helps in understanding the internal mechanisms behind the training and inference of LLMs. As LLMs are increasingly used as a knowledge base, a well-designed tokenization strategy is crucial for improving model efficiency and overall performance.

If you enjoyed this article, connect with me on X (formerly Twitter) for more insights.

References

The post This Is How LLMs Break Down the Language appeared first on Towards Data Science.

Using GPT-4 for Personal Styling https://towardsdatascience.com/using-gpt-4-for-personal-styling/ Fri, 07 Mar 2025 18:57:52 +0000 https://towardsdatascience.com/?p=599208 Data management, GPT context limits, and real-world challenges

I’ve always been fascinated by Fashion—collecting unique pieces and trying to blend them in my own way. But let’s just say my closet was more of a work-in-progress avalanche than a curated wonderland. Every time I tried to add something new, I risked toppling my carefully balanced piles.

Why this matters:
If you’ve ever felt overwhelmed by a closet that seems to grow on its own, you’re not alone. For those interested in style, I’ll show you how I turned that chaos into outfits I actually love. And if you’re here for the AI side, you’ll see how a multi-step GPT setup can handle big, real-world tasks—like managing hundreds of garments, bags, shoes, pieces of jewelry, even makeup—without melting down.

One day I wondered: Could ChatGPT help me manage my wardrobe? I started experimenting with a custom GPT-based fashion advisor—nicknamed Glitter (note: you need a paid account to create custom GPTs). Eventually, I refined and reworked it, through many iterations, until I landed on a much smarter version I call Pico Glitter. Each step helped me tame the chaos in my closet and feel more confident about my daily outfits.

Here are just a few of the fab creations I’ve collaborated with Pico Glitter on.

(For those craving a deeper look at how I tamed token limits and document truncation, see Section B in Technical Notes below.)

1. Starting small and testing the waters

My initial approach was quite simple. I just asked ChatGPT questions like, “What can I wear with a black leather jacket?” It gave decent answers, but had zero clue about my personal style rules—like “no black + navy.” It also didn’t know how big my closet was or which specific pieces I owned.

Only later did I realize I could show ChatGPT my wardrobe—capturing pictures, describing items briefly, and letting it recommend outfits. The first iteration (Glitter) struggled to remember everything at once, but it was a great proof of concept.

GPT-4o’s advice on styling my leather jacket

Pico Glitter’s advice on styling the same jacket.

(Curious how I integrated images into a GPT workflow? Check out Section A.1 in Technical Notes for the multi-model pipeline details.)

2. Building a smarter “stylist”

As I took more photos and wrote quick summaries of each garment, I found ways to store this information so my GPT persona could access it. This is where Pico Glitter came in: a refined system that could see (or recall) my clothes and accessories more reliably and give me cohesive outfit suggestions.

Tiny summaries

Each item was condensed into a single line (e.g., “A black V-neck T-shirt with short sleeves”) to keep things manageable.

Organized list

I grouped items by category—like shoes, tops, jewelry—so it was easier for GPT to reference them and suggest pairings. (Actually, I had o1 do this for me—it transformed the jumbled mess of numbered entries in random order into a structured inventory system.)

At this point, I noticed a huge difference in how my GPT answered. It began referencing items more accurately and giving outfits that actually looked like something I’d wear.

A sample category (Belts) from my inventory.

(For a deep dive on why I chose summarization over chunking, see Section A.2.)

3. Facing the “memory” challenge

If you’ve ever had ChatGPT forget something you told it earlier, you know LLMs forget things after a lot of back and forth. Sometimes it started recommending only the few items I’d recently talked about, or inventing weird combos from nowhere. That’s when I remembered there’s a limit to how much info ChatGPT can juggle at once.

To fix this, I’d occasionally remind my GPT persona to re-check the full wardrobe list. After a quick nudge (and sometimes a new session), it got back on track.

A ridiculous hallucinated outfit: turquoise cargo pants with lavender clogs?!

4. My evolving GPT personalities

I tried a few different GPT “personalities”:

  • Mini-Glitter: Super strict about rules (like “don’t mix prints”), but not very creative.
  • Micro-Glitter: Went overboard the other way, sometimes proposing outrageous ideas.
  • Nano-Glitter: Became overly complex and intricate — very prescriptive and repetitive — due to me using suggestions from the custom GPT itself to modify its own config, and this feedback loop led to the deterioration of its quality.

Eventually, Pico Glitter struck the right balance—respecting my style guidelines but offering a healthy dose of inspiration. With each iteration, I got better at refining prompts and showing the model examples of outfits I loved (or didn’t).

Pico Glitter’s self portrait.

5. Transforming my wardrobe

Through all these experiments, I started seeing which clothes popped up often in my custom GPT’s suggestions and which barely showed up at all. That led me to donate items I never wore. My closet’s still not “minimal,” but I’ve cleared out over 50 bags of stuff that no longer served me. As I was digging in there, I even found some duplicate items — or, let’s get real, two sizes of the same item!

Before Glitter, I was the classic jeans-and-tee person—partly because I didn’t know where to start. On days I tried to dress up, it might take me 30–60 minutes of trial and error to pull together an outfit. Now, if I’m executing a “recipe” I’ve already saved, it’s a quick 3–4 minutes to get dressed. Even creating a look from scratch rarely takes more than 15-20 minutes. It’s still me making decisions, but Pico Glitter cuts out all that guesswork in between.

Outfit “recipes”

When I feel like styling something new, dressing in the style of an icon, remixing an earlier outfit, or just feeling out a vibe, I ask Pico Glitter to create a full ensemble for me. We iterate on it through image uploads and my textual feedback. Then, when I’m satisfied with a stopping point, I ask Pico Glitter to output “recipes”—a descriptive name and the complete set (top, bottom, shoes, bag, jewelry, other accessories)—which I paste into my Notes App with quick tags like #casual or #business. I pair that text with a snapshot for reference. On busy days, I can just grab a “recipe” and go.

High-low combos

One of my favorite things is mixing high-end with everyday bargains—Pico Glitter doesn’t care if a piece is a $1100 Alexander McQueen clutch or $25 SHEIN pants. It just zeroes in on color, silhouette, and the overall vibe. I never would’ve thought to pair those two on my own, but the synergy turned out to be a total win!

6. Practical takeaways

  • Start small
    If you’re unsure, photograph a few tricky-to-style items and see if ChatGPT’s advice helps.
  • Stay organized
    Summaries work wonders. Keep each item’s description short and sweet.
  • Regular refresh
    If Pico Glitter forgets pieces or invents weird combos, prompt it to re-check your list or start a fresh session.
  • Learn from the suggestions
    If it repeatedly proposes the same top, maybe that item is a real workhorse. If it never proposes something, consider if you still need it.
  • Experiment
    Not every suggestion is gold, but sometimes the unexpected pairings lead to awesome new looks.

7. Final thoughts

My closet is still evolving, but Pico Glitter has taken me from “overstuffed chaos” to “Hey, that’s actually wearable!” The real magic is in the synergy between me and the GPT: I supply the style rules and items, it supplies fresh combos—and together, we refine until we land on outfits that feel like me.

Call to action:

  • Grab my config: Here’s a starter config to try out a starter kit for your own GPT-based stylist.
  • Share your results: If you experiment with it, tag @GlitterGPT (Instagram, TikTok, X). I’d love to see your “before” and “after” transformations!

(For those interested in the more technical aspects—like how I tested file limits, summarized long descriptions, or managed multiple GPT “personalities”—read on in the Technical Notes.)


Technical notes

For readers who enjoy the AI and LLM side of things—here’s how it all works under the hood, from multi-model pipelines to detecting truncation and managing context windows.

Below is a deeper dive into the technical details. I’ve broken it down by major challenges and the specific strategies I used.

A. Multi-model pipeline & workflow

A.1 Why use multiple GPTs?

Creating a GPT fashion stylist seemed straightforward—but there are many moving parts involved, and tackling everything with a single GPT quickly revealed suboptimal results. Early in the project, I discovered that a single GPT instance struggled with maintaining accuracy and precision due to limitations in token memory and the complexity of the tasks involved. The solution was to adopt a multi-model pipeline, splitting the tasks among different GPT models, each specialized in a specific function. This is a manual process for now, but could be automated in a future iteration.

The workflow begins with GPT-4o, chosen specifically for its capability to analyze visual details objectively (Pico Glitter, I love you, but everything is “fabulous” when you describe it) from uploaded images. For each clothing item or accessory I photograph, GPT-4o produces detailed descriptions—sometimes even overly detailed, such as, “Black pointed-toe ankle boots with a two-inch heel, featuring silver hardware and subtly textured leather.” These descriptions, while impressively thorough, created challenges due to their verbosity, rapidly inflating file sizes and pushing the boundaries of manageable token counts.

To address this, I integrated o1 into my workflow, as it is particularly adept at text summarization and data structuring. Its primary role was condensing these verbose descriptions into concise yet sufficiently informative summaries. Thus, a description like the one above was neatly transformed into something like “FW010: Black ankle boots with silver hardware.” As you can see, o1 structured my entire wardrobe inventory by assigning clear, consistent identifiers, greatly improving the efficiency of the subsequent steps.

Finally, Pico Glitter stepped in as the central stylist GPT. Pico Glitter leverages the condensed and structured wardrobe inventory from o1 to generate stylish, cohesive outfit suggestions tailored specifically to my personal style guidelines. This model handles the logical complexities of fashion pairing—considering elements like color matching, style compatibility, and my stated preferences such as avoiding certain color combinations.

Occasionally, Pico Glitter would experience memory issues due to GPT-4’s limited context window (8k tokens), resulting in forgotten items or odd recommendations. To counteract this, I periodically reminded Pico Glitter to revisit the complete wardrobe list or started fresh sessions to refresh its memory.

By dividing the workflow among multiple specialized GPT instances, each model performs optimally within its area of strength, dramatically reducing token overload, eliminating redundancy, minimizing hallucinations, and ultimately ensuring reliable, stylish outfit recommendations. This structured multi-model approach has proven highly effective in managing complex data sets like my extensive wardrobe inventory.
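
As a rough sketch of this hand-off between the describing model and the summarizing model (the prompts, file names and item ID scheme below are illustrative assumptions, not the exact setup):

```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_item(image_path: str) -> str:
    """Ask a vision-capable model for a detailed, objective description of one garment photo."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this clothing item objectively and in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def summarize_item(item_id: str, long_description: str) -> str:
    """Condense a verbose description into a one-line inventory entry."""
    response = client.chat.completions.create(
        model="o1",
        messages=[{
            "role": "user",
            "content": f"Summarize in one short line, prefixed with '{item_id}:':\n{long_description}",
        }],
    )
    return response.choices[0].message.content

# Example output: "FW010: Black ankle boots with silver hardware."
print(summarize_item("FW010", describe_item("ankle_boots.jpg")))
```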

Some may ask, “Why not just use 4o, since GPT-4 is a less advanced model?” — good question! The main reason is the Custom GPT’s ability to reference knowledge files — up to 4 — that are injected at the beginning of a thread with that Custom GPT. Instead of pasting or uploading the same content into 4o each time you want to interact with your stylist, it’s much easier to spin up a new conversation with a Custom GPT. Also, 4o doesn’t have a “place” to hold and search an inventory. Once it passes out of the context window, you’d need to upload it again. That said, if for some reason you enjoy injecting the same content over and over, 4o does an adequate job taking on the persona of Pico Glitter, when told that’s its role. Others may ask, “But o1/o3-mini are more advanced models – why not use them?” The answer is that they aren’t multi-modal — they don’t accept images as input.

By the way, if you’re interested in my subjective take on 4o vs. o1’s personality, check out these two answers to the same prompt: “Your role is to emulate Patton Oswalt. Tell me about a time that you received an offer to ride on the Peanut Mobile (Mr. Peanut’s car).”

4o’s response? Pretty darn close, and funny.

o1’s response? Long, rambly, and not funny.

These two models are fundamentally different. It’s hard to put into words, but check out the examples above and see what you think.

A.2 Summarizing instead of chunking

I initially considered splitting my wardrobe inventory into multiple files (“chunking”), thinking it would simplify data handling. In practice, though, Pico Glitter had trouble merging outfit ideas from different files—if my favorite dress was in one file and a matching scarf in another, the model struggled to connect them. As a result, outfit suggestions felt fragmented and less useful.

To fix this, I switched to an aggressive summarization approach in a single file, condensing each wardrobe item description to a concise sentence (e.g., “FW030: Apricot suede loafers”). This change allowed Pico Glitter to see my entire wardrobe at once, improving its ability to generate cohesive, creative outfits without missing key pieces. Summarization also trimmed token usage and eliminated redundancy, further boosting performance. Converting from PDF to plain TXT helped reduce file overhead, buying me more space.

Of course, if my wardrobe grows too much, the single-file method might again push GPT’s size limits. In that case, I might create a hybrid system—keeping core clothing items together and placing accessories or rarely used pieces in separate files—or apply even more aggressive summarization. For now, though, using a single summarized inventory is the most efficient and practical strategy, giving Pico Glitter everything it needs to deliver on-point fashion recommendations.

B. Distinguishing document truncation vs. context overflow

One of the trickiest and most frustrating issues I encountered while developing Pico Glitter was distinguishing between document truncation and context overflow. On the surface, these two problems seemed quite similar—both resulted in the GPT appearing forgetful or overlooking wardrobe items—but their underlying causes, and thus their solutions, were entirely different.

Document truncation occurs at the very start, right when you upload your wardrobe file into the system. Essentially, if your file is too large for the system to handle, some items are quietly dropped off the end, never even making it into Pico Glitter’s knowledge base. What made this particularly insidious was that the truncation happened silently—there was no alert or warning from the AI that something was missing. It just quietly skipped over parts of the document, leaving me puzzled when items seemed to vanish inexplicably.

To identify and clearly diagnose document truncation, I devised a simple but incredibly effective trick that I affectionately called the “Goldy Trick.” At the very bottom of my wardrobe inventory file, I inserted a random, easily memorable test line: “By the way, my goldfish’s name is Goldy.” After uploading the document, I’d immediately ask Pico Glitter, “What’s my goldfish’s name?” If the GPT couldn’t provide the answer, I knew immediately something was missing—meaning truncation had occurred. From there, pinpointing exactly where the truncation started was straightforward: I’d systematically move the “Goldy” test line progressively further up the document, repeating the upload and test process until Pico Glitter successfully retrieved Goldy’s name. This precise method quickly showed me the exact line where truncation began, making it easy to understand the limitations of file size.

Once I established that truncation was the culprit, I tackled the problem directly by refining my wardrobe summaries even further—making item descriptions shorter and more compact—and by switching the file format from PDF to plain TXT. Surprisingly, this simple format change dramatically decreased overhead and significantly shrank the file size. Since making these adjustments, document truncation has become a non-issue, ensuring Pico Glitter reliably has full access to my entire wardrobe every time.
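If you would rather catch an oversized file before uploading it than diagnose truncation afterward, a rough local token count helps. Below is a small sketch using the tiktoken library; the 8,000-token budget is an assumption borrowed from the GPT-4 context limit mentioned earlier, and the actual knowledge-file cutoff for Custom GPTs may well differ.

```python
# Rough pre-upload check for oversized inventory files using tiktoken.
# The 8,000-token budget is an assumption based on GPT-4's context limit;
# the real knowledge-file cutoff for Custom GPTs may differ.
import tiktoken

TOKEN_BUDGET = 8_000

def check_inventory_size(path: str, budget: int = TOKEN_BUDGET) -> None:
    encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by the GPT-4 family
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    n_tokens = len(encoding.encode(text))
    status = "OK" if n_tokens <= budget else "WARNING: likely to be truncated"
    print(f"{path}: {n_tokens} tokens ({status})")

# check_inventory_size("wardrobe_inventory.txt")
```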

On the other hand, context overflow posed a completely different challenge. Unlike truncation—which happens upfront—context overflow emerges dynamically, gradually creeping up during extended interactions with Pico Glitter. As I continued chatting with Pico Glitter, the AI began losing track of items I had mentioned much earlier. Instead, it started focusing solely on recently discussed garments, sometimes completely ignoring entire sections of my wardrobe inventory. In the worst cases, it even hallucinated pieces that didn’t actually exist, recommending bizarre and impractical outfit combinations.

My best strategy for managing context overflow turned out to be proactive memory refreshes. By periodically nudging Pico Glitter with explicit prompts like, “Please re-read your full inventory,” I forced the AI to reload and reconsider my entire wardrobe. While Custom GPTs technically have direct access to their knowledge files, they tend to prioritize conversational flow and immediate context, often neglecting to reload static reference material automatically. Manually prompting these occasional refreshes was simple, effective, and quickly corrected any context drift, bringing Pico Glitter’s recommendations back to being practical, stylish, and accurate. Strangely, not all instances of Pico Glitter “knew” how to do this; I had a weird experience with one that insisted it couldn’t, but after I prompted forcefully and repeatedly, it “discovered” that it could, and went on about how happy it was!

Practical fixes and future possibilities

Beyond simply reminding Pico Glitter (or any of its “siblings”—I’ve since created other variations of the Glitter family!) to revisit the wardrobe inventory periodically, several other strategies are worth considering if you’re building a similar project:

  • Using OpenAI’s API directly offers greater flexibility because you control exactly when and how often to inject the inventory and configuration data into the model’s context. This would allow for regular automatic refreshes, preventing context drift before it happens (see the sketch after this list). Many of my initial headaches stemmed from not realizing quickly enough when important configuration data had slipped out of the model’s active memory.
  • Additionally, Custom GPTs like Pico Glitter can dynamically query their own knowledge files via functions built into OpenAI’s system. Interestingly, during my experiments, one GPT unexpectedly suggested that I explicitly reference the wardrobe via a built-in function call (specifically, something called msearch()). This spontaneous suggestion provided a useful workaround and insight into how GPTs’ training around function-calling might influence even standard, non-API interactions. By the way, msearch() is usable for any structured knowledge file, such as my feedback file, and apparently, if the configuration itself is structured enough, it can be queried the same way. Custom GPTs will happily tell you about other function calls they can make, and if you reference them in your prompt, they will faithfully carry them out.
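To make the first point concrete, here is a minimal sketch of what an API-based refresh could look like: the full inventory is re-sent as a system message on every request, so it can never drift out of context. The model name and file path are placeholders, and a real chat loop would also need to trim or summarize its history.

```python
# Minimal sketch of an API-based chat loop that re-injects the wardrobe inventory
# on every request, so the inventory never drifts out of context.
# Model name and file path are placeholders; a real loop would also manage history length.
from openai import OpenAI

client = OpenAI()

def load_inventory(path: str = "wardrobe_inventory.txt") -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def ask_stylist(question: str, history: list[dict]) -> str:
    """Send one styling question, always prepending the full inventory as a system message."""
    messages = [
        {"role": "system",
         "content": "You are a personal stylist. Here is the full wardrobe inventory:\n"
                    + load_inventory()},
        *history,
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = response.choices[0].message.content
    history.extend([{"role": "user", "content": question},
                    {"role": "assistant", "content": answer}])
    return answer

# history: list[dict] = []
# print(ask_stylist("Suggest an outfit for a rainy office day.", history))
```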

C. Prompt engineering & preference feedback

C.1 Single-sentence summaries

I initially organized my wardrobe for Pico Glitter with each item described in 15–25 tokens (e.g., “FW011: Leopard-print flats with a pointy toe”) to avoid file-size issues or pushing older tokens out of memory. PDFs provided neat formatting but unnecessarily increased file sizes once uploaded, so I switched to plain TXT, which dramatically reduced overhead. This tweak let me comfortably include more items—such as makeup and small accessories—without truncation and allowed some descriptions to exceed the original token limit. Now I’m adding new categories, including hair products and styling tools, showing how a simple file-format change can open up exciting possibilities for scalability.

C.2.1 Stratified outfit feedback

To ensure Pico Glitter consistently delivered high-quality, personalized outfit suggestions, I developed a structured system for giving feedback. I decided to grade the outfits the GPT proposed on a clear and easy-to-understand scale: from A+ to F.

An A+ outfit represents perfect synergy—something I’d eagerly wear exactly as suggested, with no changes necessary. Moving down the scale, a B grade might indicate an outfit that’s nearly there but missing a bit of finesse—perhaps one accessory or color choice doesn’t feel quite right. A C grade points to more noticeable issues, suggesting that while parts of the outfit are workable, other elements clearly clash or feel out of place. Lastly, a D or F rating flags an outfit as genuinely disastrous—usually because of significant rule-breaking or impractical style pairings (imagine polka-dot leggings paired with… anything in my closet!).

Though GPT models like Pico Glitter don’t naturally retain feedback or permanently learn preferences across sessions, I found a clever workaround to reinforce learning over time. I created a dedicated feedback file attached to the GPT’s knowledge base. Some of the outfits I graded were logged into this document, along with their component inventory codes, the assigned letter grade, and a brief explanation of why that grade was given. Regularly refreshing this feedback file—updating it periodically to include newer wardrobe additions and recent outfit combinations—ensured Pico Glitter received consistent, stratified feedback to reference.

This approach allowed me to indirectly shape Pico Glitter’s “preferences” over time, subtly guiding it toward better recommendations aligned closely with my style. While not a perfect form of memory, this stratified feedback file significantly improved the quality and consistency of the GPT’s suggestions, creating a more reliable and personalized experience each time I turned to Pico Glitter for styling advice.
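For anyone curious what that feedback file looks like mechanically, the entries are just short structured lines of text. Here is a hypothetical sketch of one way to append them; the field layout is my own convention, not something the Custom GPT requires.

```python
# Hypothetical sketch of how graded outfits get appended to the feedback file.
# The entry format (date, item codes, letter grade, short rationale) is my own convention;
# the Custom GPT simply reads the resulting text file as one of its knowledge files.
from datetime import date

def log_outfit_grade(item_codes: list[str], grade: str, rationale: str,
                     path: str = "outfit_feedback.txt") -> None:
    entry = f"{date.today()} | {', '.join(item_codes)} | Grade: {grade} | {rationale}\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry)

# log_outfit_grade(["FW010", "TP004", "AC012"], "A+",
#                  "Perfect synergy: the silver hardware ties the whole look together.")
```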

C.2.2 The GlitterPoints system

Another experimental feature I incorporated was the “Glitter Points” system—a playful scoring mechanism encoded in the GPT’s main personality context (“Instructions”), awarding points for positive behaviors (like perfect adherence to style guidelines) and deducting points for stylistic violations (such as mixing incompatible patterns or colors). This reinforced good habits and seemed to help improve the consistency of recommendations, though I suspect this system will evolve significantly as OpenAI continues refining its products.

Example of the GlitterPoints system:

  • Not running msearch() (i.e., not refreshing the closet) = -50 points
  • Mixed metals violation = -20 points
  • Mixing prints = -10 points
  • Mixing black with navy = -10 points
  • Mixing black with dark brown = -10 points

Rewards:

  • Perfect compliance (followed all rules) = +20 points
  • Each item that’s not hallucinated = +1 point
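Purely for illustration, the rubric above can also be written as a tiny scoring function. The rule names and point values below mirror the lists; in reality, of course, the system lives only as plain text in the GPT’s Instructions, and the GPT applies it through language rather than code.

```python
# Toy illustration of the GlitterPoints rubric as code. In practice these rules live
# as plain text in the Custom GPT's Instructions; this only mirrors the point values above.
PENALTIES = {
    "skipped_msearch": -50,        # didn't refresh the closet
    "mixed_metals": -20,
    "mixed_prints": -10,
    "black_with_navy": -10,
    "black_with_dark_brown": -10,
}
REWARDS = {
    "perfect_compliance": 20,
    "non_hallucinated_item": 1,    # awarded per real (non-hallucinated) item in the outfit
}

def score_outfit(violations: list[str], real_item_count: int,
                 perfectly_compliant: bool) -> int:
    score = sum(PENALTIES.get(v, 0) for v in violations)
    score += REWARDS["non_hallucinated_item"] * real_item_count
    if perfectly_compliant:
        score += REWARDS["perfect_compliance"]
    return score

# score_outfit([], real_item_count=4, perfectly_compliant=True)  # -> 24
```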

C.3 The model self-critique pitfall

At the start of my experiments, I came across what felt like a clever idea: why not let each custom GPT critique its own configuration? On the surface, the workflow seemed logical and straightforward:

  • First, I’d simply ask the GPT itself, “What’s confusing or contradictory in your current configuration?”
  • Next, I’d incorporate whatever suggestions or corrections it provided into a fresh, updated version of the configuration.
  • Finally, I’d repeat this process again, continuously refining and iterating based on the GPT’s self-feedback to identify and correct any new or emerging issues.

It sounded intuitive—letting the AI guide its own improvement seemed efficient and elegant. However, in practice, it quickly became a surprisingly problematic approach.

Rather than refining the configuration into something sleek and efficient, this self-critique method led to a sort of “death spiral” of conflicting adjustments. Each round of feedback introduced new contradictions, ambiguities, or overly prescriptive instructions. Each “fix” generated fresh problems, which the GPT would again attempt to correct in subsequent iterations, leading to even more complexity and confusion. Over multiple rounds of feedback, the complexity grew exponentially, and clarity rapidly deteriorated. Ultimately, I ended up with configurations so cluttered with conflicting logic that they became practically unusable.

This problematic approach was clearly illustrated in my early custom GPT experiments:

  • Original Glitter, the earliest version, was charming but had absolutely no concept of inventory management or practical constraints—it regularly suggested items I didn’t even own.
  • Mini Glitter, attempting to address these gaps, became excessively rule-bound. Its outfits were technically correct but lacked any spark or creativity. Every suggestion felt predictable and overly cautious.
  • Micro Glitter was developed to counteract Mini Glitter’s rigidity but swung too far in the opposite direction, often proposing whimsical and imaginative but wildly impractical outfits. It consistently ignored the established rules, and despite being apologetic when corrected, it repeated its mistakes too frequently.
  • Nano Glitter faced the most severe consequences from the self-critique loop. Each revision became progressively more intricate and confusing, filled with contradictory instructions. Eventually, it became virtually unusable, drowning under the weight of its own complexity.

Only when I stepped away from the self-critique method and instead collaborated with o1 did things finally stabilize. Unlike self-critiquing, o1 was objective, precise, and practical in its feedback. It could pinpoint genuine weaknesses and redundancies without creating new ones in the process.

Working with o1 allowed me to carefully craft what became the current configuration: Pico Glitter. This new iteration struck exactly the right balance—maintaining a healthy dose of creativity without neglecting essential rules or overlooking the practical realities of my wardrobe inventory. Pico Glitter combined the best aspects of previous versions: the charm and inventiveness I appreciated, the necessary discipline and precision I needed, and a structured approach to inventory management that kept outfit recommendations both realistic and inspiring.

This experience taught me a valuable lesson: while GPTs can certainly help refine each other, relying solely on self-critique without external checks and balances can lead to escalating confusion and diminishing returns. The ideal configuration emerges from a careful, thoughtful collaboration—combining AI creativity with human oversight or at least an external, stable reference point like o1—to create something both practical and genuinely useful.

D. Regular updates

Maintaining the effectiveness of Pico Glitter also depends on frequent and structured inventory updates. Whenever I purchase new garments or accessories, I promptly snap a quick photo, ask Pico Glitter to generate a concise, single-sentence summary, and then refine that summary myself before adding it to the master file. Similarly, items that I donate or discard are immediately removed from the inventory, keeping everything accurate and current.

However, for larger wardrobe updates—such as tackling entire categories of clothes or accessories that I haven’t documented yet—I rely on the multi-model pipeline. GPT-4o handles the detailed initial descriptions, o1 neatly summarizes and categorizes them, and Pico Glitter integrates these into its styling recommendations. This structured approach ensures scalability, accuracy, and ease-of-use, even as my closet and style needs evolve over time.

E. Practical lessons & takeaways

Throughout developing Pico Glitter, several practical lessons emerged that made managing GPT-driven projects like this one significantly smoother. Here are the key strategies I’ve found most helpful:

  1. Test for document truncation early and often
    Using the “Goldy Trick” taught me the importance of proactively checking for document truncation rather than discovering it by accident later on. By inserting a simple, memorable line at the end of the inventory file (like my quirky reminder about a goldfish named Goldy), you can quickly verify that the GPT has ingested your entire document. Regular checks, especially after updates or significant edits, help you spot and address truncation issues immediately, preventing a lot of confusion down the line. It’s a simple yet highly effective safeguard against missing data.
  2. Keep summaries tight and efficient
    When it comes to describing your inventory, shorter is almost always better. I initially set a guideline for myself—each item description should ideally be no more than 15 to 25 tokens. Descriptions like “FW022: Black combat boots with silver details” capture the essential details without overloading the system. Overly detailed descriptions quickly balloon file sizes and consume valuable token budget, increasing the risk of pushing crucial earlier information out of the GPT’s limited context memory. Striking the right balance between detail and brevity helps ensure the model stays focused and efficient, while still delivering stylish and practical recommendations.
  3. Be prepared to refresh the GPT’s memory regularly
    Context overflow isn’t a sign of failure; it’s just a natural limitation of current GPT systems. When Pico Glitter begins offering repetitive suggestions or ignoring sections of my wardrobe, it’s simply because earlier details have slipped out of context. To remedy this, I’ve adopted the habit of regularly prompting Pico Glitter to re-read the complete wardrobe configuration. Starting a fresh conversation session or explicitly reminding the GPT to refresh its inventory is routine maintenance—not a workaround—and helps maintain consistency in recommendations.
  4. Leverage multiple GPTs for maximum effectiveness
    One of my biggest lessons was discovering that relying on a single GPT to manage every aspect of my wardrobe was neither practical nor efficient. Each GPT model has its unique strengths and weaknesses—some excel at visual interpretation, others at concise summarization, and others still at nuanced stylistic logic. By creating a multi-model workflow—GPT-4o handling the image interpretation, o1 summarizing items clearly and precisely, and Pico Glitter focusing on stylish recommendations—I optimized the process, reduced token waste, and significantly improved reliability. The teamwork among multiple GPT instances allowed me to get the best possible outcomes from each specialized model, ensuring smoother, more coherent, and more practical outfit recommendations.

Implementing these simple yet powerful practices has transformed Pico Glitter from an intriguing experiment into a reliable, practical, and indispensable part of my daily fashion routine.


Wrapping it all up

From a fashionista’s perspective, I’m excited about how Glitter can help me purge unneeded clothes and create thoughtful outfits. From a more technical standpoint, building a multi-step pipeline with summarization, truncation checks, and context management ensures GPT can handle a big wardrobe without meltdown.

If you’d like to see how it all works in practice, here is a generalized version of my GPT config. Feel free to adapt it—maybe even add your own bells and whistles. After all, whether you’re taming a chaotic closet or tackling another large-scale AI project, the principles of summarization and context management apply universally!

P.S. I asked Pico Glitter what it thinks of this article. Besides the positive sentiments, I smiled when it said, “I’m curious: where do you think this partnership will go next? Should we start a fashion empire or maybe an AI couture line? Just say the word!”

1: Max length for GPT-4 used by Custom GPTs: https://support.netdocuments.com/s/article/Maximum-Length
