A Beginner’s Guide To Build A RAG Data Ingestion Pipeline For Large Datasets

Gain an understanding of the RAG data ingestion pipeline for large datasets, which has become an important part of the artificial intelligence landscape.


RAG (Retrieval Augmented Generation) is as simple to understand as its name suggests. Each word names one stage:

  • Retrieval fetches the relevant documents from the vector database.
  • Augmentation applies prompt engineering techniques to the user prompt (user query), which helps in communicating with the LLM.
  • Generation produces the final result for the user by combining the retrieved documents with the engineered user prompt.
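The three stages can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the sample documents, the word-overlap ranking, and the function names `retrieve`, `augment`, and `generate` are all hypothetical, and `generate` is a placeholder where a real LLM call would go.

```python
from collections import Counter

# Toy knowledge base standing in for a vector database (hypothetical data).
DOCUMENTS = [
    "RAG combines document retrieval with text generation.",
    "Chroma is a vector database that stores embeddings.",
    "ETL stands for extract, transform and load.",
]


def _tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Retrieval: rank documents by how many query words they share."""
    query_words = Counter(_tokens(query))

    def overlap(doc: str) -> int:
        return sum((Counter(_tokens(doc)) & query_words).values())

    return sorted(documents, key=overlap, reverse=True)[:k]


def augment(query: str, context: list[str]) -> str:
    """Augmentation: build one prompt from the retrieved text and the query."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Generation: placeholder for an LLM call; it only echoes the prompt."""
    return f"[LLM response to]\n{prompt}"


query = "What does ETL stand for?"
context = retrieve(query, DOCUMENTS)
answer = generate(augment(query, context))
print(answer)
```

A real system would replace the word-overlap ranking with an embedding-based search, which is what the vector database provides later in this blog.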

In this blog, we will go through several topics that will help us better understand the concept of RAG, its uses, and its use cases, along with an implementation. 

We will also discover how RAG fits into ETL (extract, transform and load) in the ingestion pipeline, which ingests large datasets or large documents into the vector database in the form of embeddings. 

These embeddings are not computed from the original datasets directly. They are embeddings of summaries that act as a route back to the original datasets, as discussed further in this blog.

This blog is part one of the complete RAG process. It therefore focuses on an overview of RAG and the basic techniques of an ingestion pipeline for large-context datasets.

What is Retrieval Augmented Generation?

RAG (Retrieval Augmented Generation) is an AI framework that answers user queries with accurate and up-to-date information. It does so by retrieving the data most relevant to the user query from a database. 

RAG is built on the capabilities of LLMs. LLMs such as Mistral and Llama have millions or billions of parameters and are trained on vast datasets. They are used for numerous tasks such as text generation, translation, question answering, and chatbots. 

RAG enhances these LLM capabilities to provide specific answers relevant to the user query. The final answer is generated from the retrieved documents and the user query. 

With Retrieval Augmented Generation, we do not need to retrain the model on the data we want to retrieve. This makes RAG a cost-effective way to get relevant and accurate answers.

How Important is Retrieval-Augmented Generation?

Retrieval Augmented Generation is important because it can be used as an AI tool or framework for powering chatbots, natural language applications, and machine learning applications.

Several use cases show RAG's importance:

Question Answering

A RAG data ingestion pipeline improves question-answering systems by retrieving relevant documents or passages from a knowledge base and generating concise, accurate answers based on the retrieved information.

Content Creation

A RAG data ingestion pipeline supports content creators by retrieving relevant information from a variety of sources and generating coherent, informative content such as articles, blog posts, or reports.

Language Translation

A RAG data ingestion pipeline also enhances language translation systems by retrieving parallel texts or bilingual collections and generating more accurate translations that incorporate the retrieved information.

Code Generation

A RAG data ingestion pipeline assists developers by retrieving relevant code snippets, documentation, or examples from a codebase or online repositories and generating code tailored to specific tasks or requirements.



RAG Data Ingestion Pipeline

A RAG data ingestion pipeline is used for collecting, importing, and processing data. It makes raw data accessible for further processing.

The raw data is taken from different data sources and stored in a database, where it can be analyzed and retrieved. The sources are diverse and include databases, servers, other data centers, and various online sources. 
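As a rough illustration of the extract, transform, and load stages, here is a minimal sketch in plain Python. The source names, the sample records, and the functions `extract`, `transform`, and `load` are hypothetical, and a simple list stands in for the target database.

```python
def extract(sources: dict[str, list[str]]) -> list[dict]:
    """Extract: pull raw records from every source and tag their origin."""
    return [
        {"source": name, "text": text}
        for name, texts in sources.items()
        for text in texts
    ]


def transform(records: list[dict]) -> list[dict]:
    """Transform: normalise whitespace and drop empty records."""
    cleaned = []
    for record in records:
        text = " ".join(record["text"].split())
        if text:
            cleaned.append({**record, "text": text})
    return cleaned


def load(records: list[dict], store: list[dict]) -> int:
    """Load: append the cleaned records to the target store."""
    store.extend(records)
    return len(records)


# Hypothetical raw inputs from two different sources.
raw_sources = {
    "database": ["  RAG   overview ", ""],
    "web": ["Vector\nstores"],
}
store: list[dict] = []
loaded = load(transform(extract(raw_sources)), store)
print(loaded, store)
```

Keeping the three stages as separate functions makes it easy to swap one of them, for example to load into a vector database instead of a list.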

Additionally, there are several types of data ingestion, such as batch ingestion, chunk ingestion, real-time ingestion, change data capture (CDC) ingestion, and streaming ingestion.
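To make batch ingestion a little more concrete, the sketch below groups records into fixed-size batches so that each batch can be written in one step. The helper name `batched`, the batch size, and the sample records are assumptions made for illustration only.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(records: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists holding at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


records = [f"record-{i}" for i in range(5)]
batches = list(batched(records, batch_size=2))
print(batches)
```

Streaming or real-time ingestion would instead process each record as soon as it arrives, which corresponds to a batch size of one.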

How Can We Use RAG Data Ingestion Pipeline For Large Datasets?

By now, you should have an understanding of the components we used to build our project. 

So, let us go through the actual implementation steps with the help of code snippets. They will help you understand the concept better.

1. Install the required dependencies first

pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain youtube-transcript-api pytube

2. Import the modules and set up the LangChain environment 

from langchain_community.document_loaders import DirectoryLoader
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever
import uuid
import os

# Optional tracing configuration
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = ''  # set this to your tracing endpoint URL
os.environ['LANGCHAIN_API_KEY'] = '<your-langchain-api-key>'

3. Likewise, set up the OpenAI environment using your OpenAI API key

os.environ['OPENAI_API_KEY'] = '<your-openai-key>'

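4. Load the data from the local folder

# Recursively load every .txt file in the 'text files' folder.
# Note: DirectoryLoader's default file loader needs the 'unstructured' package.
loader = DirectoryLoader('text files', glob="**/*.txt")
docs = loader.load()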
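To see what this loading step amounts to, here is a minimal standard-library sketch that walks a folder with the same `**/*.txt` pattern and keeps each file's path as metadata. It is not how `DirectoryLoader` is implemented; the function `load_text_files` and the dictionary layout are assumptions chosen to resemble a document with `page_content` and `metadata`.

```python
from pathlib import Path
import tempfile


def load_text_files(folder: str, pattern: str = "**/*.txt") -> list[dict]:
    """Read every file matching the pattern and keep its path as metadata."""
    documents = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():
            documents.append(
                {
                    "page_content": path.read_text(encoding="utf-8"),
                    "metadata": {"source": str(path)},
                }
            )
    return documents


# Demo on a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "nested").mkdir()
    (Path(folder) / "a.txt").write_text("first file", encoding="utf-8")
    (Path(folder) / "nested" / "b.txt").write_text("second file", encoding="utf-8")
    (Path(folder) / "notes.md").write_text("ignored", encoding="utf-8")
    loaded_docs = load_text_files(folder)

print([doc["page_content"] for doc in loaded_docs])
```

The `**` part of the pattern is what makes the search recursive, so files in nested folders are picked up as well.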

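5. Define a processing chain

# Summarize each document: extract its text, fill the prompt, call the model, parse the output to a string
chain = (
   {"docs": lambda x: x.page_content}
   | ChatPromptTemplate.from_template("Summarize the following document:\n\n{docs}")
   | ChatOpenAI(model="gpt-3.5-turbo",max_retries=0)
   | StrOutputParser()
)

# Run the chain over all documents, with at most 5 requests in parallel
summaries = chain.batch(docs, {"max_concurrency": 5})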
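The idea behind `max_concurrency` can be sketched with a thread pool from the standard library. This is an illustration of bounded concurrency only, not a description of LangChain internals; `summarize` is a hypothetical stand-in for the LLM call and simply keeps the first sentence.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(text: str) -> str:
    """Stand-in for the LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def batch_summarize(texts: list[str], max_concurrency: int = 5) -> list[str]:
    """Summarize texts with at most max_concurrency calls in flight.

    executor.map returns results in input order, so the i-th summary
    always belongs to the i-th text.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as executor:
        return list(executor.map(summarize, texts))


sample_texts = [
    "LLMs generate text. They are trained on large corpora.",
    "Chroma stores embeddings. It supports similarity search.",
]
sample_summaries = batch_summarize(sample_texts, max_concurrency=2)
print(sample_summaries)
```

Preserving the input order matters here, because the later steps pair each summary with its parent document by position.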

6. Create embeddings in ChromaDB using OpenAI. You can also run another embedding model locally for this task, such as an SBERT model.

# The vectorstore to index the summaries
vectorstore_docs = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
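To give an intuition for what the vector store does with embeddings, the sketch below ranks texts by cosine similarity to a query. The `embed` function is a toy word-count embedding over a hypothetical four-word vocabulary, used only to keep the example self-contained; real embedding models produce dense vectors learned from data.

```python
import math


def embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy embedding: count how often each vocabulary word appears."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(query: str, texts: list[str], vocabulary: list[str], k: int = 1) -> list[str]:
    """Return the k texts whose embeddings are closest to the query embedding."""
    query_vector = embed(query, vocabulary)
    ranked = sorted(
        texts,
        key=lambda text: cosine_similarity(query_vector, embed(text, vocabulary)),
        reverse=True,
    )
    return ranked[:k]


VOCABULARY = ["llm", "vector", "database", "pipeline"]
summary_texts = [
    "an llm generates text",
    "a vector database stores embeddings",
    "a pipeline moves data",
]
top = similarity_search("what is an llm", summary_texts, VOCABULARY)
print(top)
```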

7. Set up a storage layer for the parent documents and define a retriever over them. Then generate a unique ID for each document and create summary documents that are linked to their parent through that ID

# The storage layer for the parent documents
store_docs = InMemoryByteStore()
id_key = "doc_id"
# The retriever: it searches the summaries and returns the parent documents
retriever_docs = MultiVectorRetriever(
   vectorstore=vectorstore_docs,
   byte_store=store_docs,
   id_key=id_key,
)
# Generate a unique ID for each document
doc_ids_docs = [str(uuid.uuid4()) for _ in docs]
# Summary documents linked to their parent document through id_key
summary_docs_docs = [
   Document(page_content=s, metadata={id_key: doc_ids_docs[i]})
   for i, s in enumerate(summaries)
]

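8. Add documents and summaries to the retriever

# Index the summaries in the vector store
retriever_docs.vectorstore.add_documents(summary_docs_docs)
# Store the full parent documents under the same IDs
retriever_docs.docstore.mset(list(zip(doc_ids_docs, docs)))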
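The link between a summary and its parent document can be illustrated without any library. In this minimal sketch, the sample texts, the word-overlap matching, and the function `retrieve_parent` are assumptions; only the pattern of searching the summaries and then following `doc_id` to the full document mirrors the steps above.

```python
import uuid

# Hypothetical parent documents and their much shorter summaries.
parent_docs = [
    "Long original text about large language models and how they are trained.",
    "Long original text about vector databases and similarity search.",
]
doc_summaries = ["about language models", "about vector databases"]

# One unique ID per parent document.
doc_ids = [str(uuid.uuid4()) for _ in parent_docs]

# Searchable index: each summary carries the ID of its parent in metadata.
summary_index = [
    {"page_content": summary, "metadata": {"doc_id": doc_ids[i]}}
    for i, summary in enumerate(doc_summaries)
]

# Storage layer: ID -> full parent document.
docstore = dict(zip(doc_ids, parent_docs))


def retrieve_parent(query: str) -> str:
    """Match the query against the summaries, then follow doc_id to the parent."""
    query_words = set(query.lower().split())
    best = max(
        summary_index,
        key=lambda entry: len(query_words & set(entry["page_content"].split())),
    )
    return docstore[best["metadata"]["doc_id"]]


result = retrieve_parent("vector databases")
print(result)
```

Only the short summaries are searched, yet the caller receives the full original text. This is what makes the summaries a route to the original documents.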

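9. Retrieve documents using vector store similarity search

query = "What are LLMs?"
# Search the summaries directly; k is the number of top results to return
sub_docs = vectorstore_docs.similarity_search(query,k=1)
# Search through the retriever, which returns the full parent documents
# (newer LangChain releases expose the same call as retriever_docs.invoke(query))
retrieved_docs = retriever_docs.get_relevant_documents(query)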


Conclusion

In this blog, we explored Retrieval Augmented Generation (RAG) and its important role in helping large language models (LLMs) retrieve data precisely. 

We demonstrated a practical implementation of a RAG data ingestion pipeline for large datasets: loading data, defining a processing chain, creating embeddings in ChromaDB using OpenAI, setting up a storage layer, defining a retriever, generating unique IDs for documents, and retrieving documents using vector store similarity search.

We at CrossML use RAG for fast retrieval of information from large contexts. We intend to use the RAG technique as a tool for data retrieval and the RAG data ingestion pipeline for large datasets.


RAG stands for Retrieval Augmented Generation. It generates optimized results based on the data returned by the retriever. Augmentation means enriching the user query with the relevant data retrieved from the vector database. RAG increases the capability of LLMs to give accurate responses.

RAG helps improve the accuracy of generated results by grounding them in the retrieved data. Optimizing the retrieval step can also improve the speed of query processing.

RAG can be used for many use cases beyond ordinary QA chatbots. We can also use RAG to create employee information retrieval applications, content creation applications, and translation applications.

A data ingestion pipeline is an important part of any application that handles large datasets, and it can process data from diverse sources. It uses the ETL process (extract, transform, load) for data storage.

Data ingestion methods include batch ingestion, stream ingestion, log-based ingestion, CDC, and manual ingestion. Each of these methods has its own benefits.