Large Language Models, commonly abbreviated as LLMs, have recently been top-of-mind for the AI open-source community, non-technical consumers, as well as organizations looking to implement their own models into production. Today we will define an LLM, how they work, and the benefit of using an LLM to solve enterprise problems.
A Language Model (LM) is a type of artificial intelligence designed to understand and generate human language. It works by using a neural network, which is a collection of interconnected neurons where each connection has a number attached to it, called a weight. The difference between a Language Model and a Large Language Model (LLM) is that the LLM has billions of these weights and is trained on even more data. The more data the LLM is trained on, typically leads to the model’s better understanding of human language.
LLMs have become increasingly popular due to their impressive performance on a wide range of natural language processing tasks, such as language translation, question answering, and text summarization. If you’ve used an AI customer service chatbot, you may have been interacting with an LLM without even realizing it!
One of the most impressive features of LLMs is their ability to learn and store vast amounts of information about the world through training. This allows them to “memorize” a great quantity of facts and generate responses based on this knowledge. While this has significant implications for fields such as education, healthcare, and customer service where large amounts of information need to be processed and communicated effectively, it also raises concerns about the ethical use of such technology.
As LLMs become more prevalent in enterprise settings, it’s crucial to ensure that the design and use incorporates trust at the foundation. This includes considerations such as privacy, transparency, and fairness. For example, LLMs may have access to sensitive information about individuals or groups, which raises questions about privacy and data security. Already we have seen in the news that Snap’s “My AI” chatbot has been lying to users about the fact that it has access to their location. Additionally, if the data used to train an LLM is biased, the model may perpetuate that bias, resulting in unfair outcomes. This problem of bias is not a new one, in fact, it wasn’t so long ago when Amazon made the news for using an automated tool that penalized resumes from women. To learn more about bias in AI, please read our series on bias monitoring and accountability.
Finally, to ensure that LLMs are implemented ethically, it’s important to have a comprehensive understanding of the technology, its limitations, and potential ethical risks. It’s also important to establish clear guidelines and protocols for the collection and use of data, and to regularly evaluate the model’s performance for any signs of bias or unfairness.
In the simplest case, LLMs are a probabilistic model that, when given a query, tries to predict the answer based on what it has learned about the language during training. This query is usually paired with some form of prompt instruction to create what we call the context window. The context window represents how long our input can be for the model; note that this is capped depending on which model you are using. For example, GPT3 has a context window of about ~3 pages while GPT4 increased the context window to ~50 pages. So, what can you do when you have much more than 50 pages of information that you want to investigate with an LLM? We will answer that in our example use case.
Example Use Case – Question Answering Agent
In the current state of LLMs, they are primarily trained on public information and data licensed from third parties. For businesses, the need arises to derive answers from their own enterprise data that can be distributed across multiple sources such as emails, Slack messages, presentation slides, PDF documents, etc. In the following sections, we will explore the process of developing a comprehensive solution that enables us to extract valuable insights from our private data.
With any problem, hypothetical or not, it is crucial to never start a project before clearly defining its scope.
For our purposes, let’s use the following scenario:
Problem: It currently requires an excessive amount of manual effort and coordination to give out weekly updates on the projects being worked on throughout the company. On average, the current process takes approximately 5 hours per week for each team involved. This inefficiency leads to tensions across team members, project managers, and stakeholders who eagerly anticipate progress updates.
Impact: Assuming there are 10 teams, the current process consumes a total of 50 hours per week across all teams. If we assume the average hourly wage is $50/hour, this leads to a cost of $2,500 a week. This significant investment takes away valuable resources that could be dedicated to project-related tasks, resulting in delays and decreased productivity. Additionally, the manual nature of the process introduces the risk of errors and inconsistent reporting.
Goal: Reduce the time required for weekly updates by at least 80%, saving approximately 40 hours per week across the organization or $2,000 per week.
Methodology: Can we leverage new technology in the LLM space to integrate a model that can accurately answer questions about project information from various data sources (such as Confluence, Slack, Jira, Google Drive, etc.)? By doing so, we can ask the Question-Answering (QA) Agent to summarize our project information to any degree of complexity.
Now that we have a solid understanding of the problem we aim to solve, it’s crucial to identify the necessary tools for the task at hand. In particular, due to LLMs being primarily trained on public data, we cannot use them out-of-the-box and expect a relevant answer to private enterprise information.
To address this issue, it is vital to provide the LLM with a corpus of information that it can reference when generating project updates. While the process of collecting, aggregating, and compiling this information may appear challenging, it is a crucial step in the implementation process. As with any AI model, the quality of the input directly impacts the quality of the output. Fortunately, there are various tools available to help us compile these documents and generate the necessary embeddings for storing our knowledge sources. Embeddings are numerical representations of words or phrases that capture their meaning and relationships. Storing the embeddings, rather than the raw document text, enables what we refer to as Semantic Search. Semantic Search can often be compared with Keyword search, where the latter searches for the exact words or phrases inputted by the user and the former analyzes the input’s meaning by looking at what is similar in terms of its embedding. The designated location for storing our vector embeddings is commonly known as a Vector Store.
This curated knowledge source is an important step in what is typically referred to as Retrieval Augmented Generation (RAG).
To put it all together, we can reference the visual below and see how RAG works.
Now that we have a functioning QA Agent, it is important to remember that proper evaluation is a critical aspect of any AI product. The standard supervised Machine Learning problem has what is called “ground truth” labels that can be used to clearly identify how our model performs. However, because the input of our QA Agent is user-generated, there is no way to predefine every possible input and what the output should be. Therefore, we rely on other techniques to evaluate the model. For simplicity, I will divide evaluation techniques into two categories: Human Feedback and Automated QA.
Human Feedback
If you use ChatGPT today, you can see that each output has the option for a “thumbs up” or “thumbs down”. Although this is a fairly simple feedback loop, the way that information is integrated and used to improve the model is much more complex. In particular, if you want to read more about how ChatGPT was able to iteratively improve, please read this very informative article by AssemblyAI. This process tends to be very tedious and is limited by the bias present by the users.
Automated QA
To help with evaluating, there are also many open-source packages that provide the tooling for us to understand how well our model is doing. LangChain, one of the most popular tools, has what is called a ContextQAEvalChain that will essentially “grade” the accuracy of our answers that are outputted from the model with the context that is retrieved from our knowledge sources. This process will help mitigate hallucinations -when an LLM makes up information.
Now that we have some data points, or at least an understanding of where our model fails, now what? There are a variety of ways we can proceed after evaluating our model’s performance.
Improving the Quality of Our Knowledge Base
During our evaluation analysis, we may discover that some of the sources our model should be using for questions are not being utilized. This could be due to an issue with the way the source information is being processed. To improve our QA Agent, we may need to address this issue by fixing the ingestion process. Additionally, we may want to add more sources to our knowledge base, which could further enhance the performance of our model.
Prompt Engineering
We can include examples of a query and its corresponding response that is correct in our prompt. This approach, known as Few Shot Learning, has shown to improve various types of LLM tasks. Additionally, we can reword and refine the prompt itself if the output doesn’t meet our expected results. For instance, using different prompts like “Answer the following question” and “Answer the following question using the context below. If you don’t know the answer, simply state that you don’t know the answer. Do not try to make up an answer that doesn’t have the context to back it up” can produce vastly different outcomes, with the first prompt being more susceptible to hallucinations.
Finally, it is important to note the risks that are associated with using an LLM that has access to proprietary information. In particular, if your input data includes sensitive information that should not be shown to all users, you need to implement more protections against information leakage.
Prompt Injection
Although the term “Prompt Injection” is relatively new, the concept of injection attacks isn’t; with some of the most catastrophic being SQL injections that can directly access and change confidential information. Prompt injection is essentially an attack on an LLM to make it act in an unintentional way that can be harmful. These can be further divided into two types.
Direct Prompt Injections occur when a user of an LLM is trying to manipulate the LLM by carefully tailoring prompts until they find one that causes the LLM to act inappropriately.
Indirect Prompt Injections can occur in the data the model is trained on (Data Poisoning) or when using an external application. In a study to identify the sensitivity of Data Poisoning, Robust Intelligence showed that simply controlling 0.01% of a dataset is enough to poison a model. A popular example of an indirect prompt injection occurred with Bing Chat, which is an AI tool that allows you to communicate with whatever web page you are on. Going on to the wrong webpage could transform your Bing Chat to act as a scammer trying to get a user’s payment details. You can read more about this example here.
A few ways to mitigate the security risks is to implement guardrails in your system, red-teaming models, scenario-based testing, and checking the input/output against another model to check for nefarious intent. In addition, it is crucial to have a system in place to quickly respond to these issues if and when they occur.
This is just one of many use cases for LLMs, but there are a number of other useful applications for LLMs. The following list is not exhaustive and more use cases are being discovered daily.
With an LLM, you can…
Throughout this blog, we have given a brief introduction on LLMs and have gone through the thought process of building our own Question-Answering (QA) Agent using the Retrieval Augmented Generation approach. It is important to understand that every enterprise problem is different, however, there are key parts to every solution:
To stay up-to-date on the most important responsible AI news, subscribe to our free email digest, Voices of Trusted AI. In addition, you’ll receive top insights on designing and building AI that increases trust, buy-in, and success—straight to your inbox every other month.
Nicolas Decavel is a Data Science Consultant LII at Pandata.