Here's an optimized version of the script for handling large JSON text files (~25 MB, 500,000 lines) using chunked reading, a thread pool, and a progress monitor. The script splits the file into chunks at JSON object boundaries, parses the chunks in worker threads, and concatenates the per-chunk results into a single DataFrame that is written to Excel at the end.
import pandas as pd
import json
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
from tqdm import tqdm
import os
import queue
import logging
from typing import List, Dict

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def parse_json_chunk(chunk: str) -> List[Dict]:
    """
    Parse a chunk of concatenated JSON objects into a list of dictionaries.
    """
    try:
        # Insert newlines between back-to-back objects so they can be
        # accumulated line by line.
        chunk = re.sub(r'}\s*{', '}\n{', chunk)
        objects = []
        current_obj = ""
        for line in chunk.splitlines():
            line = line.strip()
            if not line:
                continue
            if not current_obj and not line.startswith('{'):
                # Orphaned fragment (e.g. a stray closing brace carried over
                # from a chunk boundary); skip it so it cannot poison the
                # rest of the chunk.
                continue
            current_obj += line
            try:
                objects.append(json.loads(current_obj))
                current_obj = ""
            except json.JSONDecodeError:
                # Object may span several lines; keep accumulating.
                continue
        # Any fragment that never became valid JSON is dropped here.
        return objects
    except Exception as e:
        logger.error(f"Error parsing chunk: {e}")
        return []


def process_chunk(chunk: str, chunk_id: int) -> pd.DataFrame:
    """
    Parse a chunk of text and convert it to a flattened DataFrame.
    """
    try:
        data = parse_json_chunk(chunk)
        if data:
            return pd.json_normalize(data)
        return pd.DataFrame()
    except Exception as e:
        logger.error(f"Error processing chunk {chunk_id}: {e}")
        return pd.DataFrame()


def read_file_in_chunks(file_path: str, chunk_size: int = 1024 * 1024) -> queue.Queue:
    """
    Read the file in chunks, split at JSON object boundaries, and store them
    in a queue. A None sentinel marks the end of the queue.
    """
    chunk_queue = queue.Queue()
    current_chunk = ""
    chunk_id = 0
    with open(file_path, 'r', encoding='utf-8') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                if current_chunk.strip():
                    chunk_queue.put((chunk_id, current_chunk))
                    chunk_id += 1
                break
            current_chunk += chunk
            # Split at the last closing brace so each queued chunk ends on an
            # object boundary; the remainder is carried into the next chunk.
            last_brace = current_chunk.rfind('}')
            if last_brace != -1:
                chunk_queue.put((chunk_id, current_chunk[:last_brace + 1]))
                current_chunk = current_chunk[last_brace + 1:]
                chunk_id += 1
    chunk_queue.put(None)  # Sentinel so consumers know the queue is exhausted
    return chunk_queue


def json_to_excel(json_file: str, excel_file: str = 'result.xlsx',
                  chunk_size: int = 1024 * 1024, max_workers: int = 4):
    """
    Convert a large JSON text file to Excel with multithreading and progress monitoring.
    """
    logger.info(f"Starting conversion of {json_file} to {excel_file}")

    # Read chunks into a queue (the final item is a None sentinel)
    chunk_queue = read_file_in_chunks(json_file, chunk_size)
    total_chunks = chunk_queue.qsize() - 1  # Exclude the sentinel

    # Initialize progress bar
    progress = tqdm(total=total_chunks, desc="Processing chunks", unit="chunk")

    # Process chunks with ThreadPoolExecutor
    results = []
    lock = threading.Lock()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_chunk = {
            executor.submit(process_chunk, chunk, chunk_id): chunk_id
            for chunk_id, chunk in iter(chunk_queue.get, None)
        }
        for future in as_completed(future_to_chunk):
            chunk_id = future_to_chunk[future]
            try:
                df = future.result()
                if not df.empty:
                    with lock:
                        results.append(df)
            except Exception as e:
                logger.error(f"Chunk {chunk_id} generated an exception: {e}")
            progress.update(1)
    progress.close()

    if not results:
        logger.warning("No valid data processed.")
        return

    # Concatenate all DataFrames
    logger.info("Concatenating DataFrames")
    final_df = pd.concat(results, ignore_index=True)

    # Write the result to a single Excel sheet
    logger.info(f"Writing DataFrame to {excel_file}")
    with pd.ExcelWriter(excel_file, engine='openpyxl') as writer:
        final_df.to_excel(writer, index=False, sheet_name='Sheet1')
    logger.info(f"Data written successfully to {excel_file}")


if __name__ == "__main__":
    json_file_path = "sample2.txt"
    excel_file_path = "result.xlsx"
    chunk_size = 1024 * 1024           # 1 MB chunks
    max_workers = os.cpu_count() or 4  # Use the number of CPU cores, or 4 as a fallback
    json_to_excel(json_file_path, excel_file_path, chunk_size, max_workers)
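If you want to sanity-check the parsing logic on a small input before running the full conversion, you can call parse_json_chunk directly in the same module. This is only an illustrative snippet (the sample string is made up), not part of the script itself:

# Quick check: concatenated and multi-line objects are recovered individually.
sample = '{"id": 1, "name": "a"}{"id": 2,\n"name": "b"}\n{"id": 3, "name": "c"}'
print(parse_json_chunk(sample))
# [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}, {'id': 3, 'name': 'c'}]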
Key Optimizations and Features:
- Chunked File Reading:
  - Reads the file in manageable chunks (default: 1 MB) and queues them for the worker threads.
  - Splits chunks at JSON object boundaries so workers are not handed dangling partial objects.
- Multithreading:
  - Uses ThreadPoolExecutor to process chunks concurrently, with the number of workers set to the number of CPU cores (or a default of 4).
  - Collects results behind a lock so the aggregation step stays thread-safe.
- Progress Monitoring:
  - Integrates tqdm to display a progress bar showing the number of chunks processed.
- Memory Efficiency:
  - Parses and normalizes JSON data chunk by chunk, keeping per-task memory small.
  - Collects one DataFrame per chunk and writes the concatenated result to Excel in a single pass with openpyxl.
- Error Handling:
  - Parsing and processing errors are caught and logged instead of aborting the run.
  - Fragments that never form valid JSON are dropped and processing continues.
- Logging:
  - Uses the logging module to report progress details and any errors.
- Flexible Configuration:
  - Adjustable chunk_size and max_workers parameters to tune performance based on system resources.
Requirements:
Install the required libraries:
pip install pandas openpyxl tqdm
Usage:
- Place your JSON text file (sample2.txt) in the same directory as the script.
- Run the script. It will process the file and generate result.xlsx.
- Adjust chunk_size and max_workers if needed for your system or file size.
Performance Notes:
- For a 25 MB file with ~500,000 lines, the script should process efficiently on modern hardware, typically within a few minutes, depending on CPU and disk speed.
- The chunk size (1 MB) and number of workers (CPU cores) are balanced for most systems but can be tuned (a rough timing sketch follows this list):
  - Increase chunk_size for faster processing if you have more RAM.
  - Adjust max_workers to balance CPU usage and memory.
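If you want to measure the effect of these settings on your own data, a rough comparison loop like the one below can help. It assumes the functions above are in scope (for example, pasted at the bottom of the script) and that sample2.txt is the input file; the candidate chunk sizes are just examples:

import os
import time

# Hypothetical tuning loop: time the conversion with a few chunk sizes.
for size in (512 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    start = time.perf_counter()
    json_to_excel("sample2.txt", f"result_{size // 1024}kb.xlsx",
                  chunk_size=size, max_workers=os.cpu_count() or 4)
    print(f"chunk_size={size // 1024} KB -> {time.perf_counter() - start:.1f} s")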
Limitations:
- Assumes JSON objects are relatively small and can be normalized into a DataFrame. Very deeply nested JSON may require additional handling.
- Chunk boundaries are found by searching for the last closing brace, so with nested objects an object that straddles a chunk boundary can occasionally be dropped.
- Excel writing with openpyxl may become a bottleneck for very large datasets. Consider splitting the output into multiple sheets or writing CSV for extremely large files (see the sketch below).
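For example, if the result outgrows a single sheet, one option is to spread the rows across several sheets. The sketch below assumes final_df is the concatenated DataFrame produced inside json_to_excel; for truly large outputs, final_df.to_csv('result.csv', index=False) avoids the Excel overhead entirely:

import pandas as pd

# Sketch: split a large DataFrame across multiple sheets instead of one.
# Excel caps a worksheet at 1,048,576 rows, so stay below that per sheet.
max_rows = 1_000_000
with pd.ExcelWriter('result.xlsx', engine='openpyxl') as writer:
    for i, start in enumerate(range(0, len(final_df), max_rows)):
        final_df.iloc[start:start + max_rows].to_excel(
            writer, index=False, sheet_name=f'Sheet{i + 1}'
        )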
This script should handle your large JSON text file efficiently while providing feedback on progress and errors. Let me know if you need further tweaks or have specific requirements!