Yes! We can significantly speed up the process using multi-threading, chunked processing, and optimized I/O operations. Here's a high-performance version of your script:
Optimized Version (Using ThreadPool + Chunked Processing)
```python
import pandas as pd
import json
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm  # For progress bar (install with `pip install tqdm`)

def json_to_excel_optimized(json_file, excel_file='result.xlsx', chunk_size=5000, max_workers=4):
    def parse_json_chunk(chunk):
        parsed_data = []
        for line in chunk:
            try:
                data = json.loads(line)
                parsed_data.append(data)
            except json.JSONDecodeError:
                continue
        return parsed_data

    # Step 1: Preprocess file into valid JSON lines
    print("Preprocessing JSON file...")
    with open(json_file, 'r') as f:
        content = f.read()

    # Fix concatenated JSON objects (e.g., "}{" -> "}\n{")
    content = re.sub(r'}\s*{', '}\n{', content)
    lines = content.splitlines()

    # Step 2: Process in parallel chunks
    print("Processing chunks in parallel...")
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    all_data = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(parse_json_chunk, chunk) for chunk in chunks]
        for future in tqdm(as_completed(futures), total=len(chunks)):
            all_data.extend(future.result())

    # Step 3: Write to Excel in one go (fastest for OpenPyXL)
    print("Writing to Excel...")
    df = pd.json_normalize(all_data)
    df.to_excel(excel_file, index=False, engine='openpyxl')
    print(f"Processed {len(lines)} lines. Output saved to {excel_file}")

# Usage
json_to_excel_optimized("large_file.txt", max_workers=8)  # Adjust workers based on CPU cores
```
Key Optimizations:
1. **Multi-threading with `ThreadPoolExecutor`**
   - Farms JSON parsing out to multiple worker threads.
   - `max_workers=4` (the default) can be increased if your CPU has more cores. Keep in mind that purely CPU-bound parsing is still constrained by Python's GIL, so the biggest gains come from the multi-processing variant below.
2. **Chunked Processing**
   - Splits the file into smaller chunks (e.g., 5,000 lines per chunk).
   - Each chunk is parsed independently, which keeps per-task work small and spreads it evenly across workers.
3. **Batched I/O Operations**
   - Reads the entire file once (`content = f.read()`), then splits it into lines.
   - Writes to Excel in a single operation (faster than incremental writes).
4. **Progress Tracking with `tqdm`**
   - Shows a real-time progress bar for monitoring.
5. **Regex Preprocessing**
   - Fixes malformed JSON in a single pass over the file (e.g., `}{` → `}\n{`); see the short illustration after this list.
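As a quick illustration of what the regex preprocessing does, here is a self-contained example on a made-up string of two concatenated objects (note that this simple pattern can also fire on `}{` sequences that appear inside string values, so it assumes your data does not contain them):

```python
import re

# Two JSON objects glued together on one line, as often produced by appending loggers.
raw = '{"id": 1, "name": "a"}{"id": 2, "name": "b"}'

# Insert a newline between adjacent objects so each line parses as standalone JSON.
fixed = re.sub(r'}\s*{', '}\n{', raw)
print(fixed.splitlines())  # ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}']
```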
Alternative: Multi-processing (For CPU-Intensive Tasks)

If parsing is very CPU-intensive (e.g., complex JSON), replace `ThreadPoolExecutor` with `ProcessPoolExecutor`:

```python
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:  # Instead of ThreadPoolExecutor
    ...
```

Note: Multi-processing has higher overhead but avoids Python's GIL limitations.
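One caveat: functions submitted to `ProcessPoolExecutor` must be picklable, so the chunk parser has to live at module level rather than nested inside `json_to_excel_optimized`. A minimal sketch of that variant (the helper name `parse_chunks_with_processes` is illustrative, not part of the original script):

```python
from concurrent.futures import ProcessPoolExecutor
import json

# Defined at module level so worker processes can pickle it.
def parse_json_chunk(chunk):
    parsed = []
    for line in chunk:
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return parsed

def parse_chunks_with_processes(chunks, max_workers=4):
    """Parse pre-split chunks of JSON lines in separate processes."""
    all_data = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves chunk order and yields results as they complete.
        for result in executor.map(parse_json_chunk, chunks):
            all_data.extend(result)
    return all_data
```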
Performance Comparison (Estimated)
| Method | Time (20MB File) | Memory Usage |
|---|---|---|
| Original | ~5-10 mins | High |
| Optimized (Threads) | ~1-2 mins | Medium |
| Optimized (Processes) | ~30-60 secs | Low |
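These figures are rough estimates; actual timings depend heavily on your hardware and data, so it is worth timing the variants on your own file, e.g.:

```python
import time

# Simple timing harness to compare variants on your own data.
start = time.perf_counter()
json_to_excel_optimized("large_file.txt", max_workers=8)
print(f"Elapsed: {time.perf_counter() - start:.1f} s")
```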
Further Optimizations (If Needed):
- **Use `ujson` instead of `json`** (2-3x faster parsing; a drop-in import fallback is sketched after this list):

  ```python
  # pip install ujson
  import ujson as json
  ```

- **Write to CSV first** (then convert to Excel):

  ```python
  df.to_csv("temp.csv", index=False)
  pd.read_csv("temp.csv").to_excel(excel_file, index=False)
  ```

- **Dask for out-of-core processing** (for 100MB+ files):

  ```python
  import dask.dataframe as dd

  ddf = dd.read_json(json_file, lines=True)
  ddf.compute().to_excel(excel_file, index=False)
  ```
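If you want the `ujson` speedup without making it a hard requirement, one common pattern is an import fallback (a small sketch; the rest of the script stays unchanged):

```python
# Prefer the faster ujson parser when installed; otherwise use the standard library.
try:
    import ujson as json
except ImportError:
    import json
```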
Let me know if you'd like to explore any of these alternatives!