Here's an optimized version of the script for handling large JSON text files (~25 MB, 500,000 lines) using chunked reading, a thread pool, and a progress monitor. The script splits the file into chunks at JSON object boundaries, parses the chunks in worker threads, and concatenates the per-chunk results into a single DataFrame that is written to Excel at the end.
import pandas as pd
import json
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
from tqdm import tqdm
import os
import queue
import logging
from typing import List, Dict

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def parse_json_chunk(chunk: str) -> List[Dict]:
    """
    Parse a chunk of concatenated JSON objects into a list of dictionaries.
    """
    try:
        # Insert newlines between back-to-back objects so they can be
        # accumulated line by line.
        chunk = re.sub(r'}\s*{', '}\n{', chunk)
        objects = []
        current_obj = ""
        for line in chunk.splitlines():
            line = line.strip()
            if not line:
                continue
            if not current_obj and not line.startswith('{'):
                # Orphaned fragment (e.g. a stray closing brace carried over
                # from a chunk boundary); skip it so it cannot poison the
                # rest of the chunk.
                continue
            current_obj += line
            try:
                objects.append(json.loads(current_obj))
                current_obj = ""
            except json.JSONDecodeError:
                # Object may span several lines; keep accumulating.
                continue
        # Any fragment that never became valid JSON is dropped here.
        return objects
    except Exception as e:
        logger.error(f"Error parsing chunk: {e}")
        return []


def process_chunk(chunk: str, chunk_id: int) -> pd.DataFrame:
    """
    Parse a chunk of text and convert it to a flattened DataFrame.
    """
    try:
        data = parse_json_chunk(chunk)
        if data:
            return pd.json_normalize(data)
        return pd.DataFrame()
    except Exception as e:
        logger.error(f"Error processing chunk {chunk_id}: {e}")
        return pd.DataFrame()


def read_file_in_chunks(file_path: str, chunk_size: int = 1024 * 1024) -> queue.Queue:
    """
    Read the file in chunks, split at JSON object boundaries, and store them
    in a queue. A None sentinel marks the end of the queue.
    """
    chunk_queue = queue.Queue()
    current_chunk = ""
    chunk_id = 0
    with open(file_path, 'r', encoding='utf-8') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                if current_chunk.strip():
                    chunk_queue.put((chunk_id, current_chunk))
                    chunk_id += 1
                break
            current_chunk += chunk
            # Split at the last closing brace so each queued chunk ends on an
            # object boundary; the remainder is carried into the next chunk.
            last_brace = current_chunk.rfind('}')
            if last_brace != -1:
                chunk_queue.put((chunk_id, current_chunk[:last_brace + 1]))
                current_chunk = current_chunk[last_brace + 1:]
                chunk_id += 1
    chunk_queue.put(None)  # Sentinel so consumers know the queue is exhausted
    return chunk_queue


def json_to_excel(json_file: str, excel_file: str = 'result.xlsx',
                  chunk_size: int = 1024 * 1024, max_workers: int = 4):
    """
    Convert a large JSON text file to Excel with multithreading and progress monitoring.
    """
    logger.info(f"Starting conversion of {json_file} to {excel_file}")

    # Read chunks into a queue (the final item is a None sentinel)
    chunk_queue = read_file_in_chunks(json_file, chunk_size)
    total_chunks = chunk_queue.qsize() - 1  # Exclude the sentinel

    # Initialize progress bar
    progress = tqdm(total=total_chunks, desc="Processing chunks", unit="chunk")

    # Process chunks with ThreadPoolExecutor
    results = []
    lock = threading.Lock()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_chunk = {
            executor.submit(process_chunk, chunk, chunk_id): chunk_id
            for chunk_id, chunk in iter(chunk_queue.get, None)
        }
        for future in as_completed(future_to_chunk):
            chunk_id = future_to_chunk[future]
            try:
                df = future.result()
                if not df.empty:
                    with lock:
                        results.append(df)
            except Exception as e:
                logger.error(f"Chunk {chunk_id} generated an exception: {e}")
            progress.update(1)
    progress.close()

    if not results:
        logger.warning("No valid data processed.")
        return

    # Concatenate all DataFrames
    logger.info("Concatenating DataFrames")
    final_df = pd.concat(results, ignore_index=True)

    # Write the result to a single Excel sheet
    logger.info(f"Writing DataFrame to {excel_file}")
    with pd.ExcelWriter(excel_file, engine='openpyxl') as writer:
        final_df.to_excel(writer, index=False, sheet_name='Sheet1')
    logger.info(f"Data written successfully to {excel_file}")


if __name__ == "__main__":
    json_file_path = "sample2.txt"
    excel_file_path = "result.xlsx"
    chunk_size = 1024 * 1024           # 1 MB chunks
    max_workers = os.cpu_count() or 4  # Use the number of CPU cores, or 4 as a fallback
    json_to_excel(json_file_path, excel_file_path, chunk_size, max_workers)
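If you want to sanity-check the parsing logic on a small input before running the full conversion, you can call parse_json_chunk directly in the same module. This is only an illustrative snippet (the sample string is made up), not part of the script itself:

# Quick check: concatenated and multi-line objects are recovered individually.
sample = '{"id": 1, "name": "a"}{"id": 2,\n"name": "b"}\n{"id": 3, "name": "c"}'
print(parse_json_chunk(sample))
# [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}, {'id': 3, 'name': 'c'}]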
Key Optimizations and Features:
- Chunked File Reading:
  - Reads the file in manageable chunks (default: 1 MB) and queues them for the worker threads.
  - Splits chunks at JSON object boundaries so workers are not handed dangling partial objects.
- Multithreading:
  - Uses ThreadPoolExecutor to process chunks concurrently, with the number of workers set to the number of CPU cores (or a default of 4).
  - Collects results behind a lock so the aggregation step stays thread-safe.
- Progress Monitoring:
  - Integrates tqdm to display a progress bar showing the number of chunks processed.
- Memory Efficiency:
  - Parses and normalizes JSON data chunk by chunk, keeping per-task memory small.
  - Collects one DataFrame per chunk and writes the concatenated result to Excel in a single pass with openpyxl.
- Error Handling:
  - Parsing and processing errors are caught and logged instead of aborting the run.
  - Fragments that never form valid JSON are dropped and processing continues.
- Logging:
  - Uses the logging module to report progress details and any errors.
- Flexible Configuration:
  - Adjustable chunk_size and max_workers parameters to tune performance based on system resources.
Requirements:
Install the required libraries:
pip install pandas openpyxl tqdm
Usage:
- Place your JSON text file (sample2.txt) in the same directory as the script.
- Run the script. It will process the file and generate result.xlsx.
- Adjust chunk_size and max_workers if needed for your system or file size.
Performance Notes:
- For a 25 MB file with ~500,000 lines, the script should process efficiently on modern hardware, typically within a few minutes, depending on CPU and disk speed.
- The chunk size (1 MB) and number of workers (CPU cores) are balanced for most systems but can be tuned (a rough timing sketch follows this list):
  - Increase chunk_size for faster processing if you have more RAM.
  - Adjust max_workers to balance CPU usage and memory.
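If you want to measure the effect of these settings on your own data, a rough comparison loop like the one below can help. It assumes the functions above are in scope (for example, pasted at the bottom of the script) and that sample2.txt is the input file; the candidate chunk sizes are just examples:

import os
import time

# Hypothetical tuning loop: time the conversion with a few chunk sizes.
for size in (512 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    start = time.perf_counter()
    json_to_excel("sample2.txt", f"result_{size // 1024}kb.xlsx",
                  chunk_size=size, max_workers=os.cpu_count() or 4)
    print(f"chunk_size={size // 1024} KB -> {time.perf_counter() - start:.1f} s")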
Limitations:
- Assumes JSON objects are relatively small and can be normalized into a DataFrame. Very deeply nested JSON may require additional handling.
- Chunk boundaries are found by searching for the last closing brace, so with nested objects an object that straddles a chunk boundary can occasionally be dropped.
- Excel writing with openpyxl may become a bottleneck for very large datasets. Consider splitting the output into multiple sheets or writing CSV for extremely large files (see the sketch below).
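For example, if the result outgrows a single sheet, one option is to spread the rows across several sheets. The sketch below assumes final_df is the concatenated DataFrame produced inside json_to_excel; for truly large outputs, final_df.to_csv('result.csv', index=False) avoids the Excel overhead entirely:

import pandas as pd

# Sketch: split a large DataFrame across multiple sheets instead of one.
# Excel caps a worksheet at 1,048,576 rows, so stay below that per sheet.
max_rows = 1_000_000
with pd.ExcelWriter('result.xlsx', engine='openpyxl') as writer:
    for i, start in enumerate(range(0, len(final_df), max_rows)):
        final_df.iloc[start:start + max_rows].to_excel(
            writer, index=False, sheet_name=f'Sheet{i + 1}'
        )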
This script should handle your large JSON text file efficiently while providing feedback on progress and errors. Let me know if you need further tweaks or have specific requirements!