Invoice Data Extraction using IBM Granite LLM locally (not from watsonx!)

Adapting a Granite use case to run locally.


Introduction

The Granite family of foundation models spans an increasing variety of modalities, including language, code, time series, and science (e.g., materials), with much more to come. We’re building them with transparency and with a focus on fulfilling the rigorous enterprise requirements that are emerging for AI. If you’d like to learn more about the models themselves and how we build them, check out Granite Models.

With Granite introduced, there is a wealth of content and demonstrations to explore in its public GitHub repository. Some of these demonstrations are notebooks designed for the IBM Cloud environment. To make one of them easier to try, I translated it into a standalone Python application that can be tested from a local setup.

Implementation and Test

I picked the example “Invoice Data Extraction using IBM Granite LLM from watsonx” and adapted it to run locally, using the Granite LLM available from Hugging Face.

For starters, I reproduce the original notebook content (the text describing the different notebook cells can be found on the original page).

!pip install -q git+https://github.com/ibm-granite-community/utils \
    docling==2.14.0 \
    langchain==0.2.12 \
    langchain-ibm==0.1.11 \
    langchain-community==0.2.11 \
    langchain-core==0.2.28 \
    ibm-watsonx-ai==1.1.2 \
    transformers==4.47.1
from docling.document_converter import DocumentConverter
from langchain_ibm import WatsonxLLM
from langchain_core.prompts import PromptTemplate
import os
import json
import pandas as pd
import re
import requests
from ibm_granite_community.notebook_utils import get_env_var
# Define the InvoiceProcessor class
class InvoiceProcessor:
    def __init__(self, ibm_cloud_api_key, project_id, watson_url):
        self.llm = WatsonxLLM(
            model_id='ibm-granite/granite-3.3-8b-instruct',
            apikey=ibm_cloud_api_key,
            project_id=project_id,
            params={
                "decoding_method": "greedy",
                "max_new_tokens": 8000,
                "min_new_tokens": 1,
                "repetition_penalty": 1.01
            },
            url=watson_url
        )
        self.converter = DocumentConverter()



    def extract_invoice_data(self, source):
        result = self.converter.convert(source)
        markdown_output =  result.document.export_to_markdown()


        prompt_template = PromptTemplate(
            input_variables=["DOCUMENT"],
            template='''
            <|start_of_role|>System<|end_of_role|> You are an AI assistant for processing invoices. Based on the provided invoice data, extract the 'Invoice Number', 'Total Net Amount', 'Total VAT or TAX or GST Amount', 'Total Amount' , 'Invoice Date', 'Purchase Order Number' and 'Customer number', without the currency values.

            |Instructions|
            Identify and extract the following information:
            - **Invoice Number**: The unique identifier for the invoice.
            - **Net Amount**: The Total Net Amount indicated on the invoice.
            - **VAT or TAX or GST Amount**: The Total VAT or TAX or GST Amount indicated on the invoice.
            - **Total Amount**: The Total Cost indicated on the invoice.
            - **Invoice Date**: The date the invoice was issued.
            - **Purchase Order Number**: The unique identifier for the purchase order.
            - **Customer Number**: The unique identifier for the customer.

            Invoice Data:
            {DOCUMENT}


            Strictly provide the extracted information in the following JSON format:

            json
            {{
              "invoice_number": "extracted_invoice_number",
              "net_amount": "extracted_new_amount",
              "vat_or_tax_or_gst_amount" : "extracted_vat_or_tax_or_gst_amount",
              "total_amount": "extracted_total_amount",
              "invoice_date": "extracted_invoice_date",
              "purchase_order_number": "extracted_purchase_order_number",
              "customer_number": "extracted_customer_number"
            }}

            <|end_of_text|>

            <|start_of_role|>assistant<|end_of_role|>
        ''')

        prompt = prompt_template.format(DOCUMENT=str(markdown_output).strip())
        answer = self.llm.invoke(prompt)
        #print(answer)

        json_string = re.search(r'\{.*\}', answer, re.DOTALL).group(0).replace('\n', '')
        data = json.loads(json_string)

        try:
            net_amount = round(float(data['net_amount'].replace(",", "").replace("$", "").strip()), 2)
            vat_or_tax_or_gst_amount = round(float(data['vat_or_tax_or_gst_amount'].replace(",", "").replace("$", "").strip()), 2)
            total_amount = round(float(data['total_amount'].replace(",", "").replace("$", "").strip()), 2)

            data['Validation'] = 'correct' if round(net_amount + vat_or_tax_or_gst_amount, 2) == total_amount else 'check'
            print("Processed -- ", source)
        except (ValueError, KeyError):
            data['Validation'] = 'check'

        return data

    def process_invoices(self, folder_path):
        columns = ['File_Name', 'Invoice_Number', 'Net_Amount', 'TAX_Amount', 'Total_Amount', 'Validation', 'Invoice_Date', 'Purchase_Order_Number', 'Customer_Number']
        df_invoice = pd.DataFrame(columns=columns)

        for filename in os.listdir(folder_path):
            if filename.endswith('.pdf'):
                pdf_path = os.path.join(folder_path, filename)
                try:
                    data = self.extract_invoice_data(pdf_path)
                    data['FileName'] = filename

                    new_row = {
                        'File_Name': data['FileName'],
                        'Invoice_Number': data['invoice_number'],
                        'Net_Amount': data['net_amount'],
                        'TAX_Amount': data['vat_or_tax_or_gst_amount'],
                        'Total_Amount': data['total_amount'],
                        'Validation': data['Validation'],
                        'Invoice_Date': data['invoice_date'],
                        'Purchase_Order_Number': data['purchase_order_number'],
                        'Customer_Number': data['customer_number']
                    }

                    df_invoice = pd.concat([df_invoice, pd.DataFrame([new_row])], ignore_index=True)
                except Exception:
                    pass

        return df_invoice
def setup_directory(directory):
    """
    Ensure the specified directory exists. Create it if it doesn't.
    """
    os.makedirs(directory, exist_ok=True)
    print(f"Directory '{directory}' is ready.")

def download_files(file_list, base_url, directory):
    """
    Download files from a given base URL into a specified directory.
    """
    for file_name in file_list:
        file_url = base_url + file_name
        local_file_path = os.path.join(directory, file_name)
        try:
            # Download the file
            response = requests.get(file_url)
            if response.status_code == 200:
                # Save to the specified directory
                with open(local_file_path, "wb") as file:
                    file.write(response.content)
                print(f"Downloaded: {file_name}")
            else:
                print(f"Failed to download {file_name}. Status code: {response.status_code}")
        except Exception as e:
            print(f"Error downloading {file_name}: {e}")

def delete_files(directory):
    """
    Delete all files in the specified directory.
    """
    try:
        for file_name in os.listdir(directory):
            file_path = os.path.join(directory, file_name)
            if os.path.isfile(file_path):
                os.remove(file_path)
                print(f"Deleted: {file_name}")
    except Exception as e:
        print(f"Error deleting files: {e}")

def cleanup_directory(directory):
    """
    Remove the directory if it is empty.
    """
    try:
        os.rmdir(directory)
        print(f"Directory '{directory}' removed successfully.")
    except OSError as e:
        print(f"Error removing directory '{directory}': {e}")

def download_invoice():
    """
    Main workflow for downloading, processing, and cleaning up files.
    """
    # List of file names
    files = [
        "6900026063.pdf",
        "6900026069.pdf",
        "6905212892.pdf",
        "904000640.pdf",
        "PL_IERPIC_MISSING.pdf"
    ]

    # Base URL for the raw files
    base_url = "https://raw.githubusercontent.com/ibm-granite-community/granite-snack-cookbook/refs/heads/main/recipes/Invoice-Extraction/Invoices/"



    data_dir = "data"
    setup_directory(data_dir)

    # Step 2: Download the files
    print("Downloading files...")
    download_files(files, base_url, data_dir)
if __name__ == "__main__":
    download_invoice()
    # Main script to initialize and process invoices
    ibm_cloud_api_key = get_env_var('WATSONX_APIKEY')
    project_id = get_env_var('WATSONX_PROJECT_ID')
    watson_url = get_env_var('WATSONX_URL')
    folder_path = os.getenv('folder_path')
    invoice_processor = InvoiceProcessor(ibm_cloud_api_key, project_id, watson_url)
    df_invoice = invoice_processor.process_invoices("data")
    delete_files('data')
df_invoice
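The parsing and validation step inside `extract_invoice_data` can be tried on its own, without any model call. The following is a minimal sketch under stated assumptions: the helper names `parse_amount` and `parse_and_validate` and the sample reply are hypothetical and do not come from the notebook. Unlike the notebook code, it converts values with `str()` first, so a numeric JSON value does not raise an uncaught `AttributeError`.

```python
import json
import re


def parse_amount(value):
    """Convert a string such as '$1,234.50' to a float rounded to 2 places."""
    cleaned = str(value).replace(",", "").replace("$", "").strip()
    return round(float(cleaned), 2)


def parse_and_validate(answer):
    """Take the span from the first '{' to the last '}' and check the totals."""
    match = re.search(r"\{.*\}", answer, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the model answer")
    data = json.loads(match.group(0).replace("\n", ""))
    try:
        net = parse_amount(data["net_amount"])
        tax = parse_amount(data["vat_or_tax_or_gst_amount"])
        total = parse_amount(data["total_amount"])
        # Same rule as the notebook: net + tax must equal the total.
        data["Validation"] = "correct" if round(net + tax, 2) == total else "check"
    except (ValueError, KeyError):
        data["Validation"] = "check"
    return data


# Hypothetical model reply, invented for illustration only.
sample_answer = """Here is the result:
{
  "invoice_number": "INV-001",
  "net_amount": "1,000.00",
  "vat_or_tax_or_gst_amount": "200.00",
  "total_amount": "$1,200.00"
}"""

result = parse_and_validate(sample_answer)
print(result["Validation"])
```

Keeping this step separate from the LLM call makes it possible to check the validation rule with hand-written replies before spending time on real invoices.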

To test the above code locally, follow the steps below.

Depending on your hardware, the libraries and configuration may need to differ. I have an Intel, CPU-only laptop!

Environment preparation.

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip

Installing the requirements: the necessary packages are listed below in “requirements.txt”.

# The PyTorch CPU wheels live on a separate index; pip expects this option on its own line.
--extra-index-url https://download.pytorch.org/whl/cpu
git+https://github.com/ibm-granite-community/utils
docling==2.14.0
langchain==0.2.12
langchain-community==0.2.11
langchain-core==0.2.28
python-dotenv
torch==2.0.1+cpu
torchvision==0.15.2+cpu
torchaudio==2.0.2+cpu
transformers==4.31.0
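After `pip install -r requirements.txt`, it can help to confirm which versions actually landed in the virtual environment. The sketch below uses only the standard library; the function name `installed_versions` is hypothetical, and the package list is simply copied from the requirements above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from requirements.txt.
wanted = ["docling", "langchain", "langchain-community", "langchain-core",
          "python-dotenv", "torch", "transformers"]

for name, version in installed_versions(wanted).items():
    print(f"{name}: {version or 'not installed'}")
```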

The “.env” file 🐊.

HUGGINGFACE_API_KEY="your-hf-api-key"
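The script later reads this key with `os.getenv` after calling `load_dotenv()`. As a small sketch, a fail-fast check could look like the following; the helper name `require_env` is hypothetical and is not part of python-dotenv or of the original code.

```python
import os


def require_env(name):
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Calling `require_env("HUGGINGFACE_API_KEY")` right after `load_dotenv()` would stop the script with a clear message before any invoice is converted.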

The code 🔣.

The 🪨 Granite LLM I use in this code is a newer version than the one found in the original code. I used “ibm-granite/granite-3.3-8b-instruct”.


# app.py
from dotenv import load_dotenv
from docling.document_converter import DocumentConverter
from langchain_community.llms import HuggingFaceHub  # langchain.llms is the deprecated import path
from langchain_core.prompts import PromptTemplate
import os
import json
import pandas as pd
import re

# Load environment variables from .env file
load_dotenv()
HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY")

# Define the InvoiceProcessor class
class InvoiceProcessor:
    def __init__(self, huggingface_api_key, model_id):
        # Initialize HuggingFaceHub LLM
        self.llm = HuggingFaceHub(
            repo_id=model_id,
            huggingfacehub_api_token=huggingface_api_key,
            model_kwargs={
                "max_new_tokens": 8000,
                "min_new_tokens": 1,
                "repetition_penalty": 1.01,
                # Add other relevant model parameters here if needed
            }
        )
        self.converter = DocumentConverter()

    def extract_invoice_data(self, source):
        result = self.converter.convert(source)
        markdown_output =  result.document.export_to_markdown()

        prompt_template = PromptTemplate(
            input_variables=["DOCUMENT"],
            template='''
            <|start_of_role|>System<|end_of_role|> You are an AI assistant for processing invoices. Based on the provided invoice data, extract the 'Invoice Number', 'Total Net Amount', 'Total VAT or TAX or GST Amount', 'Total Amount' , 'Invoice Date', 'Purchase Order Number' and 'Customer number', without the currency values.

            |Instructions|
            Identify and extract the following information:
            - **Invoice Number**: The unique identifier for the invoice.
            - **Net Amount**: The Total Net Amount indicated on the invoice.
            - **VAT or TAX or GST Amount**: The Total VAT or TAX or GST Amount indicated on the invoice.
            - **Total Amount**: The Total Cost indicated on the invoice.
            - **Invoice Date**: The date the invoice was issued.
            - **Purchase Order Number**: The unique identifier for the purchase order.
            - **Customer Number**: The unique identifier for the customer.

            Invoice Data:
            {DOCUMENT}


            Strictly provide the extracted information in the following JSON format:

            ```json
            {{
              "invoice_number": "extracted_invoice_number",
              "net_amount": "extracted_new_amount",
              "vat_or_tax_or_gst_amount" : "extracted_vat_or_tax_or_gst_amount",
              "total_amount": "extracted_total_amount",
              "invoice_date": "extracted_invoice_date",
              "purchase_order_number": "extracted_purchase_order_number",
              "customer_number": "extracted_customer_number"
            }}
            ```

            <|end_of_text|>

            <|start_of_role|>assistant<|end_of_role|>
            '''
        )

        prompt = prompt_template.format(DOCUMENT=str(markdown_output).strip())
        answer = self.llm.invoke(prompt)
        #print(answer)

        json_string = re.search(r'\{.*\}', answer, re.DOTALL).group(0).replace('\n', '')
        data = json.loads(json_string)

        try:
            net_amount = round(float(data['net_amount'].replace(",", "").replace("$", "").strip()), 2)
            vat_or_tax_or_gst_amount = round(float(data['vat_or_tax_or_gst_amount'].replace(",", "").replace("$", "").strip()), 2)
            total_amount = round(float(data['total_amount'].replace(",", "").replace("$", "").strip()), 2)

            data['Validation'] = 'correct' if round(net_amount + vat_or_tax_or_gst_amount, 2) == total_amount else 'check'
            print("Processed -- ", source)
        except (ValueError, KeyError):
            data['Validation'] = 'check'

        return data

    def process_invoices(self, folder_path):
        columns = ['File_Name', 'Invoice_Number', 'Net_Amount', 'TAX_Amount', 'Total_Amount', 'Validation', 'Invoice_Date', 'Purchase_Order_Number', 'Customer_Number']
        df_invoice = pd.DataFrame(columns=columns)

        for filename in os.listdir(folder_path):
            if filename.endswith('.pdf'):
                pdf_path = os.path.join(folder_path, filename)
                try:
                    data = self.extract_invoice_data(pdf_path)
                    data['FileName'] = filename

                    new_row = {
                        'File_Name': data['FileName'],
                        'Invoice_Number': data['invoice_number'],
                        'Net_Amount': data['net_amount'],
                        'TAX_Amount': data['vat_or_tax_or_gst_amount'],
                        'Total_Amount': data['total_amount'],
                        'Validation': data['Validation'],
                        'Invoice_Date': data['invoice_date'],
                        'Purchase_Order_Number': data['purchase_order_number'],
                        'Customer_Number': data['customer_number']
                    }

                    df_invoice = pd.concat([df_invoice, pd.DataFrame([new_row])], ignore_index=True)
                except Exception as e:
                    print(f"Error processing {filename}: {e}")

        return df_invoice

if __name__ == "__main__":
    # --- Instructions for local Hugging Face LLM ---
    # Install the required libraries.
    # Create a .env file in the same directory as your script.
    # Add your Hugging Face API key to the .env file like this:
    #    HUGGINGFACE_API_KEY=YOUR_ACTUAL_HUGGINGFACE_API_KEY

    model_id = "ibm-granite/granite-3.3-8b-instruct"

    # Initialize the InvoiceProcessor.
    if HUGGINGFACE_API_KEY:
        invoice_processor = InvoiceProcessor(HUGGINGFACE_API_KEY, model_id)

        # Define the input and output folder paths.
        input_folder = "./input"
        output_folder = "./output"

        # Ensure the output folder exists.
        if not os.path.exists(output_folder):
            os.makedirs(output_folder)
            print(f"Created output folder: {output_folder}. The extracted data will be saved here.")
        else:
            print(f"Extracted data will be saved in: {output_folder}")

        # assuming the input folder './input' already exists and is accessible.
        print(f"Looking for invoices in: {input_folder}")

        # Process the invoices.
        df_invoice = invoice_processor.process_invoices(input_folder)

        # Save the extracted data to a CSV file in the output folder.
        output_file_path = os.path.join(output_folder, "extracted_invoice_data.csv")
        df_invoice.to_csv(output_file_path, index=False)
        print(f"\nExtracted data saved to: {output_file_path}")
    else:
        print("HUGGINGFACE_API_KEY not found in .env file. Please create the file and add your API key.")

Execution of the script.

python app.py

# results will show up in a CSV file
# File_Name,Invoice_Number,Net_Amount,TAX_Amount,Total_Amount,Validation,Invoice_Date,Purchase_Order_Number,Customer_Number
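Once the CSV exists, the rows that failed the net + tax = total check are the ones worth reviewing by hand. A minimal sketch using the standard `csv` module follows; the function name `rows_needing_review` is hypothetical, while the file path and column names are the ones used by the script above.

```python
import csv
import os


def rows_needing_review(csv_path):
    """Return rows whose Validation column is anything other than 'correct'."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.DictReader(handle)
                if row.get("Validation") != "correct"]


output_file = os.path.join("output", "extracted_invoice_data.csv")
if os.path.exists(output_file):
    for row in rows_needing_review(output_file):
        print(row["File_Name"], row["Net_Amount"], row["TAX_Amount"], row["Total_Amount"])
```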

Conclusion

Having run a previously cloud-centric Granite demonstration from a local Python environment, we have seen how adaptable this technology is. The local implementation validates the underlying concepts and offers a more accessible path for experimentation and integration, helping a broader range of users explore its capabilities.
