Generative AI’s potential for companies is well-known, but the technology can create new risks if it is not powered by original and trustworthy data sources. In the second blog in our ‘RAGs to Riches’ series, we explore those risks; highlight best practices around pulling data for generative AI using a Retrieval Augmented Generation (RAG) technique; and suggest the key questions to ask your data provider for a trustworthy and effective approach.
87% of companies plan to adopt generative AI technology (if they haven’t already), according to the LexisNexis® Future of Work Report 2024. But, in recent years, far too many corporate AI initiatives have ended in failure. A common cause of this is poor quality data – as the saying goes, “garbage in, garbage out”. The outputs from generative AI tools will only be as accurate and relevant as the data powering them.
The problem typically lies in companies inputting low-quality data from third parties into their generative AI models. This might be a third-party generative AI tool which a company uses to support its work, or a third-party data aggregator from which it pulls content to power its own generative AI solution. If these providers cannot clearly demonstrate where and how they pulled their data, it poses five main risks:
Retrieval Augmented Generation (RAG) is a technique that enhances a generative AI tool to mitigate these risks. A standard model answers from whatever it absorbed during training, which may be stale, unverifiable, or of unknown provenance. RAG instead retrieves relevant information at query time from an additional, curated layer of data that takes precedence over what the model previously learned. This data should be credible, authoritative and pulled directly from original sources, such as the data licensed for generative AI use by LexisNexis®. The model is therefore required to generate every answer using this retrieved data as context and to cite the original source(s) used in each response.
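The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy corpus, the word-overlap scorer, and the prompt template are invented for the example and are not a real LexisNexis API; a production system would use a vector index over licensed content and pass the resulting prompt to an LLM.

```python
# Minimal sketch of the RAG pattern: retrieve authoritative passages,
# then build a prompt that forces the model to answer from them and
# cite its sources. Everything here is illustrative, not a real API.

# Toy corpus standing in for licensed, attributed source documents.
CORPUS = [
    {"id": "doc-1", "source": "Example Publisher A",
     "text": "Retrieval augmented generation grounds answers in retrieved source documents."},
    {"id": "doc-2", "source": "Example Publisher B",
     "text": "Licensed data from original publishers reduces hallucination risk."},
    {"id": "doc-3", "source": "Example Publisher C",
     "text": "Model training data can become stale over time."},
]

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Rank documents by word overlap with the query (a stand-in for vector search)."""
    q = set(query.lower().split())
    return sorted(corpus,
                  key=lambda d: len(q & set(d["text"].lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, docs: list) -> str:
    """Inject retrieved passages as context and instruct the model to cite them."""
    context = "\n".join(f"[{d['id']}] ({d['source']}) {d['text']}" for d in docs)
    return ("Answer using ONLY the context below, and cite the document ids "
            "you relied on.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

docs = retrieve("how does retrieval augmented generation ground answers", CORPUS)
prompt = build_prompt("how does retrieval augmented generation ground answers", docs)
```

Because the answer must be assembled from the retrieved context, the citations in each response can be traced back to an original, licensed source rather than to the model's opaque training data.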
Retrieval Augmented Generation offers myriad benefits, for example:
Unlocking the benefits of a RAG approach to generative AI requires access to trustworthy data which is optimized for use in this specific technology. The LexisNexis® Future of Work Report 2024 found that 9/10 professionals cite the quality and accuracy of its output as their main consideration when choosing a generative AI tool, while 7/10 said trusted, accurate data sources are the key to fostering trust in their use of generative AI. So how can companies pull this contextual data from original sources for their generative AI models using a RAG approach?
Pulling from original sources to power generative AI initiatives involves going to individual, reliable publishers and requesting to use their data. Companies operating worldwide may need to do this for sources across multiple jurisdictions and languages. This would be extremely time-consuming, both to negotiate acquiring the data and to maintain compliance with differing regulations over time.
Therefore, it is far more efficient to outsource the acquisition of data sources to a specialist third-party provider. Depending on your budget, there are two approaches you might take:
Whichever approach you take, it is critical that the third-party provider has ensured each data source it uses is licensed and approved for the specific use of generative AI and meets all relevant regulations and ethical standards around data protection and privacy. Your company will be held accountable for any failures in this respect. Questions to ask a potential provider include:
Integrating Retrieval Augmented Generation into your generative AI development is only effective if the contextual data it brings in is accurate, trustworthy, and approved for use in generative AI tools. LexisNexis provides licensed content and optimized technology to support your generative AI and RAG ambitions: