Yes! We can significantly speed up the process using multi-threading, chunked processing, and optimized I/O operations. Here's a high-performance version of your script:
Optimized Version (Using ThreadPool + Chunked Processing)
```python
import pandas as pd
import json
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm  # For progress bar (install with `pip install tqdm`)

def json_to_excel_optimized(json_file, excel_file='result.xlsx', chunk_size=5000, max_workers=4):
    def parse_json_chunk(chunk):
        parsed_data = []
        for line in chunk:
            try:
                data = json.loads(line)
                parsed_data.append(data)
            except json.JSONDecodeError:
                continue
        return parsed_data

    # Step 1: Preprocess file into valid JSON lines
    print("Preprocessing JSON file...")
    with open(json_file, 'r') as f:
        content = f.read()

    # Fix concatenated JSON objects (e.g., "}{" -> "}\n{")
    content = re.sub(r'}\s*{', '}\n{', content)
    lines = content.splitlines()

    # Step 2: Process in parallel chunks
    print("Processing chunks in parallel...")
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    all_data = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(parse_json_chunk, chunk) for chunk in chunks]
        for future in tqdm(as_completed(futures), total=len(chunks)):
            all_data.extend(future.result())

    # Step 3: Write to Excel in one go (fastest for OpenPyXL)
    print("Writing to Excel...")
    df = pd.json_normalize(all_data)
    df.to_excel(excel_file, index=False, engine='openpyxl')
    print(f"Processed {len(lines)} lines. Output saved to {excel_file}")

# Usage
json_to_excel_optimized("large_file.txt", max_workers=8)  # Adjust workers based on CPU cores
```
Key Optimizations:
1. **Multi-threading with `ThreadPoolExecutor`**
   - Farms JSON parsing out to multiple worker threads.
   - `max_workers=4` (the default) can be increased if your CPU has more cores. Keep in mind that purely CPU-bound parsing is still constrained by Python's GIL, so the biggest gains come from the multi-processing variant below.
2. **Chunked Processing**
   - Splits the file into smaller chunks (e.g., 5,000 lines per chunk).
   - Each chunk is parsed independently, which keeps per-task work small and spreads it evenly across workers.
3. **Batched I/O Operations**
   - Reads the entire file once (`content = f.read()`), then splits it into lines.
   - Writes to Excel in a single operation (faster than incremental writes).
4. **Progress Tracking with `tqdm`**
   - Shows a real-time progress bar for monitoring.
5. **Regex Preprocessing**
   - Fixes malformed JSON in a single pass over the file (e.g., `}{` → `}\n{`); see the short illustration after this list.
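As a quick illustration of what the regex preprocessing does, here is a self-contained example on a made-up string of two concatenated objects (note that this simple pattern can also fire on `}{` sequences that appear inside string values, so it assumes your data does not contain them):

```python
import re

# Two JSON objects glued together on one line, as often produced by appending loggers.
raw = '{"id": 1, "name": "a"}{"id": 2, "name": "b"}'

# Insert a newline between adjacent objects so each line parses as standalone JSON.
fixed = re.sub(r'}\s*{', '}\n{', raw)
print(fixed.splitlines())  # ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}']
```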
Alternative: Multi-processing (For CPU-Intensive Tasks)

If parsing is very CPU-intensive (e.g., complex JSON), replace `ThreadPoolExecutor` with `ProcessPoolExecutor`:

```python
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:  # Instead of ThreadPoolExecutor
    ...
```

Note: Multi-processing has higher overhead but avoids Python's GIL limitations.
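One caveat: functions submitted to `ProcessPoolExecutor` must be picklable, so the chunk parser has to live at module level rather than nested inside `json_to_excel_optimized`. A minimal sketch of that variant (the helper name `parse_chunks_with_processes` is illustrative, not part of the original script):

```python
from concurrent.futures import ProcessPoolExecutor
import json

# Defined at module level so worker processes can pickle it.
def parse_json_chunk(chunk):
    parsed = []
    for line in chunk:
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return parsed

def parse_chunks_with_processes(chunks, max_workers=4):
    """Parse pre-split chunks of JSON lines in separate processes."""
    all_data = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves chunk order and yields results as they complete.
        for result in executor.map(parse_json_chunk, chunks):
            all_data.extend(result)
    return all_data
```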
Performance Comparison (Estimated)
| Method | Time (20MB File) | Memory Usage |
|---|---|---|
| Original | ~5-10 mins | High |
| Optimized (Threads) | ~1-2 mins | Medium |
| Optimized (Processes) | ~30-60 secs | Low |
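These figures are rough estimates; actual timings depend heavily on your hardware and data, so it is worth timing the variants on your own file, e.g.:

```python
import time

# Simple timing harness to compare variants on your own data.
start = time.perf_counter()
json_to_excel_optimized("large_file.txt", max_workers=8)
print(f"Elapsed: {time.perf_counter() - start:.1f} s")
```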
Further Optimizations (If Needed):
- **Use `ujson` instead of `json`** (2-3x faster parsing; a drop-in import fallback is sketched after this list):

  ```python
  # pip install ujson
  import ujson as json
  ```

- **Write to CSV first** (then convert to Excel):

  ```python
  df.to_csv("temp.csv", index=False)
  pd.read_csv("temp.csv").to_excel(excel_file, index=False)
  ```

- **Dask for out-of-core processing** (for 100MB+ files):

  ```python
  import dask.dataframe as dd

  ddf = dd.read_json(json_file, lines=True)
  ddf.compute().to_excel(excel_file, index=False)
  ```
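If you want the `ujson` speedup without making it a hard requirement, one common pattern is an import fallback (a small sketch; the rest of the script stays unchanged):

```python
# Prefer the faster ujson parser when installed; otherwise use the standard library.
try:
    import ujson as json
except ImportError:
    import json
```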
Let me know if you'd like to explore any of these alternatives!