PDF Summarizer and Question Answering: Unlocking Insights from PDF Documents📖

sai harish cherukuri
8 min read · Jul 8, 2023

Introduction

PDF documents play an important role in sharing and protecting information in today’s digital world. However, extracting useful information from them can be difficult. This is where a PDF Summarizer & Question Answering tool comes in handy: it summarizes PDF content and also answers user questions accurately.

It enables users to quickly extract key information and gain a deeper understanding of the document’s content. This tool can provide a concise summary of a single page, an overview of a page range, or an overall summary of the entire document.

In this blog, we will explore the process of building a PDF Summarizer and QA system using Streamlit, OpenAI, and Langchain. We use OpenAI models for text summarization and question-answering, and Langchain for document loading, text splitting, and vector storage. Streamlit will be used for designing a user-friendly interface.

Let’s dive into the code and see how it works.

Prerequisites:

Before importing dependencies, make sure to install the following libraries:

!pip install langchain: A framework for building applications around language models. It provides chains (pipelines) for tasks such as summarization, question answering, and text generation, along with utilities for document loading, text splitting, and vector storage.

!pip install openai: The official client for the OpenAI API, used here to call OpenAI’s language models.

!pip install PyPDF2: A library for working with PDF files, such as reading and manipulating PDF documents.

!pip install faiss-cpu: An efficient library for similarity search and clustering of dense vectors. The “cpu” variant is the CPU-only build of FAISS.

!pip install tiktoken: A fast tokenizer used to count tokens for OpenAI models.

!pip install streamlit: A framework for building interactive web applications in Python.

Importing Dependencies:

Let’s start by importing the necessary dependencies for our project:

# Streamlit, used to build the user interface for the application.
import streamlit as st
# Embeddings for text processing using OpenAI language models.
from langchain.embeddings.openai import OpenAIEmbeddings
# Python built-in module for handling temporary files.
import tempfile
# Python built-in module for time-related operations.
import time
# Core Langchain classes: the LLM wrapper, prompt templates, and chains.
from langchain import OpenAI, PromptTemplate, LLMChain
# Splits text into smaller chunks based on the specified parameters.
from langchain.text_splitter import CharacterTextSplitter
# Loads PDF documents and splits them into pages.
from langchain.document_loaders import PyPDFLoader
# Loads a summarization chain for generating summaries.
from langchain.chains.summarize import load_summarize_chain
# Represents a single document (a chunk of text plus metadata).
from langchain.docstore.document import Document
# Vector indexing and similarity search using FAISS.
from langchain.vectorstores import FAISS
# Loads a question-answering chain for generating answers to questions.
from langchain.chains.question_answering import load_qa_chain

Initializing OpenAI and Text Splitter:

llm = OpenAI(openai_api_key = 'your_openai_api_key', temperature=0)

The “temperature” parameter controls the randomness of the generated text and typically ranges from 0 to 1. A higher temperature produces more random and diverse text, because the model is more likely to choose less probable words; a lower temperature makes the model stick to the most probable words. The right temperature value depends on the desired output; here we use 0 so the summaries stay focused and consistent.

Make sure to replace ‘your_openai_api_key’ with your actual OpenAI API key.
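Conceptually, temperature rescales the model’s output scores before sampling. Here is a minimal pure-Python sketch of temperature-scaled softmax — an illustration of the idea only, not OpenAI’s actual implementation (the API treats temperature 0 as greedy decoding rather than dividing by zero):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before normalizing:
    # low temperature sharpens the distribution toward the top token,
    # high temperature flattens it toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

cold = softmax_with_temperature([2.0, 1.0, 0.5], temperature=0.1)
warm = softmax_with_temperature([2.0, 1.0, 0.5], temperature=1.0)
# At low temperature, nearly all probability mass lands on the top token.
```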

# Split the text with CharacterTextSplitter so each chunk stays within the token limit
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)

The ‘CharacterTextSplitter’ class splits text into smaller chunks based on the following parameters:

separator = “\n”: The text is split into chunks at newline occurrences.

chunk_size: The maximum length of each text chunk.

chunk_overlap: The number of overlapping characters between adjacent chunks, which ensures that context at chunk boundaries is not lost.

length_function: The function used to measure chunk length; here, len counts characters.
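To make chunk_size and chunk_overlap concrete, here is a naive pure-Python sketch of fixed-size chunking with overlap (CharacterTextSplitter additionally tries to split at the separator, which this toy version ignores):

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=200):
    # Each chunk starts (chunk_size - chunk_overlap) characters after the
    # previous one, so adjacent chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(2000))
chunks = split_with_overlap(text)
# The last 200 characters of one chunk repeat as the first 200 of the next.
```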

Building User Interface using Streamlit

We will now start building the user interface for the PDF Summarizer and Question Answering system using the Streamlit library.

st.title("PDF Summarizer & QA")
pdf_file = st.file_uploader("Choose a PDF file", type="pdf")

The ‘st.title()’ call sets the title of the web application to “PDF Summarizer & QA”.

The purpose of the ‘st.file_uploader’ function is to provide a user interface where users can choose a PDF file to be used by the PDF Summarizer and QA system.

Handling uploaded PDF:

When the user uploads a PDF file, we will process it to extract the content and split it into pages:

if pdf_file is not None:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        tmp_file.write(pdf_file.read())
        pdf_path = tmp_file.name
    loader = PyPDFLoader(pdf_path)
    pages = loader.load_and_split()

with tempfile.NamedTemporaryFile(delete=False) as tmp_file:: This line creates a temporary file using NamedTemporaryFile from the tempfile module.

tmp_file.write(pdf_file.read()): This line reads the content of the uploaded PDF file and writes it to the temporary file.

PyPDFLoader is a Langchain document loader that reads the PDF at the given path and splits it into one document per page.

In summary, the code reads the uploaded PDF file, saves it to a temporary file, creates a PyPDFLoader object using the temporary file’s path, and then loads and splits the PDF document into pages. The resulting pages are stored in the ‘pages’ variable for further usage.
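Because the Langchain snippets above require API access, here is a self-contained sketch of just the temporary-file step, with a fake byte payload standing in for a real PDF. Note that delete=False means the file survives the with-block (so a loader can open it by path) but is never removed by the code above; os.remove() handles that cleanup:

```python
import os
import tempfile

data = b"%PDF-1.4 fake payload"  # stand-in for pdf_file.read()

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
    tmp_file.write(data)
    pdf_path = tmp_file.name

# The file still exists here because delete=False; a loader such as
# PyPDFLoader(pdf_path) could now open it by path.
with open(pdf_path, "rb") as f:
    round_trip = f.read()

os.remove(pdf_path)  # manual cleanup, which the original code omits
```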

User Selection:

Now we will provide the user with different options for selection:

# User input for page selection
page_selection = st.radio("Page selection", ["Single page", "Page range", "Overall Summary", "Question"])

Single page — Allow users to select a specific page and summarize its content

Page range — Allow users to select a range of pages and generate a summary for the combined content of those pages. (Ex: 3–7 pages)

Overall Summary — Generate an overall summary of the entire PDF

Question — Allow users to input a question and generate an answer based on the content of the PDF.

Single Page Summarization:

When the “Single page” radio button is selected, we ask the user to enter a page number and display a summary of that page.

    if page_selection == "Single page":
        page_number = st.number_input("Enter page number", min_value=1, max_value=len(pages), value=1, step=1)
        view = pages[page_number - 1]
        texts = text_splitter.split_text(view.page_content)
        docs = [Document(page_content=t) for t in texts]
        chain = load_summarize_chain(llm, chain_type="map_reduce")
        summaries = chain.run(docs)

        st.subheader("Summary")
        st.write(summaries)

page_number = st.number_input(“Enter page number”, min_value=1, max_value=len(pages), value=1, step=1):

This line displays a number input field using st.number_input(), where the user can enter the page number they want summarized.

The min_value and max_value parameters ensure that the entered page number stays within the valid range of pages. value=1 makes the first page the default selection, and step defines the increment/decrement size of the input field.

texts = text_splitter.split_text(view.page_content): This line splits the content of the selected page (view.page_content) into smaller chunks of text using the text_splitter object

docs = [Document(page_content=t) for t in texts]: This line creates a list of Document objects where each object represents a chunk of text from the selected page.

chain = load_summarize_chain(llm, chain_type="map_reduce"): This loads a summarization chain, a sequence of steps designed to perform text summarization. The llm parameter supplies the language model to use, while chain_type selects the summarization strategy: “map_reduce” summarizes each chunk independently and then combines those partial summaries into a final summary.

chain.run(docs): This line runs the summarization chain on the list of Document objects (docs). The run() method processes the documents and generates the summaries.

st.subheader(“Summary”) and st.write(summaries): These lines display the summarized text by creating a subheader titled “Summary” and writing the summaries using st.write().
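The “map_reduce” strategy can be pictured as two passes: summarize each chunk on its own (map), then summarize the concatenation of those partial summaries (reduce). Here is a toy pure-Python sketch of that control flow, with a stand-in summarize function instead of an LLM call:

```python
def map_reduce_summarize(chunks, summarize):
    # Map: summarize each chunk independently (parallelizable in practice).
    partials = [summarize(chunk) for chunk in chunks]
    # Reduce: summarize the combined partial summaries into one result.
    return summarize(" ".join(partials))

# Toy "summarizer" that keeps only the first three words of its input.
def toy_summarize(text):
    return " ".join(text.split()[:3])

result = map_reduce_summarize(
    ["alpha beta gamma delta", "one two three four"], toy_summarize
)
```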

    elif page_selection == "Page range":
        start_page = st.number_input("Enter start page", min_value=1, max_value=len(pages), value=1, step=1)
        end_page = st.number_input("Enter end page", min_value=start_page, max_value=len(pages), value=start_page, step=1)

        texts = []
        for page_number in range(start_page, end_page + 1):
            view = pages[page_number - 1]
            page_texts = text_splitter.split_text(view.page_content)
            texts.extend(page_texts)
        docs = [Document(page_content=t) for t in texts]
        chain = load_summarize_chain(llm, chain_type="map_reduce")
        summaries = chain.run(docs)
        st.subheader("Summary")
        st.write(summaries)

    elif page_selection == "Overall Summary":
        combined_content = ''.join([p.page_content for p in pages])  # concatenate every page's text
        texts = text_splitter.split_text(combined_content)
        docs = [Document(page_content=t) for t in texts]
        chain = load_summarize_chain(llm, chain_type="map_reduce")
        summaries = chain.run(docs)
        st.subheader("Summary")
        st.write(summaries)

Similarly, on selecting the “Page range” option, we ask the user to enter the start and end pages and display a summary of that range.

On selecting the “Overall Summary” option, we combine the content of all pages, split it into chunks, and display the summary.

Question-Answering:

For the “Question” option, we will ask the user to enter their question and display the answer:

    elif page_selection == "Question":
        question = st.text_input("Enter your question", value="Enter your question here...")
        combined_content = ''.join([p.page_content for p in pages])
        texts = text_splitter.split_text(combined_content)
        embedding = OpenAIEmbeddings(openai_api_key='your_api_key')
        document_search = FAISS.from_texts(texts, embedding)
        chain = load_qa_chain(llm, chain_type="stuff")
        docs = document_search.similarity_search(question)
        summaries = chain.run(input_documents=docs, question=question)
        st.write(summaries)

Make sure to replace ‘your_api_key’ with your actual OpenAI API key.

embedding = OpenAIEmbeddings(openai_api_key=’your_api_key’):

This line creates an instance of the OpenAIEmbeddings class, which is used for embedding the text data.

‘texts’: Each text segment represents a chunk or segment of the document.

‘embedding’: The embedding object is an instance of the OpenAIEmbeddings class, which is used for embedding text data. It provides the necessary embedding capabilities required for building the document search index.

FAISS.from_texts(texts, embedding): This method creates a document search index based on the provided texts and their corresponding embeddings.

The FAISS library is a popular library for efficient similarity search and clustering of large datasets.

docs = document_search.similarity_search(question): This line performs a similarity search on the document_search object using the user’s input question. It retrieves the most relevant content from the documents based on the similarity of the question to the indexed texts.

Overall, this code segment allows the user to enter a question, search for relevant documents based on the question, and generate summaries or answers using a question-answering chain. The results are then displayed using Streamlit.
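Under the hood, similarity_search embeds the question and returns the stored chunks whose embedding vectors lie closest to it. Here is a toy cosine-similarity sketch of that retrieval step, with hand-made 2-D vectors standing in for real OpenAI embeddings and plain Python standing in for FAISS:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "index": chunk text -> made-up 2-D embedding.
index = {
    "pricing details": (0.9, 0.1),
    "refund policy": (0.2, 0.95),
    "contact info": (0.6, 0.6),
}
query_embedding = (0.25, 0.9)  # pretend embedding of the user's question
best_chunk = max(index, key=lambda chunk: cosine(index[chunk], query_embedding))
```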

No PDF File Uploaded:

If no PDF file has been uploaded, we wait 30 seconds and then display a warning message.

else:
    time.sleep(30)
    st.warning("No PDF file uploaded")

Conclusion:

In this tutorial, we have learned how to build a PDF Summarizer and Question-Answering system using Streamlit, OpenAI, and Langchain. We covered different options for page selection and demonstrated how to summarize pages and answer user questions based on the uploaded PDF file. This system can be useful for quickly extracting information and insights from PDF documents.

I hope you find this tutorial helpful for building your PDF Summarizer and QA system with Streamlit and OpenAI!
