For handling large text files (~25 MB or 500,000 lines) containing JSON data, we need to optimize the script to process the file efficiently, minimize memory usage, and improve performance. Using a CSV file instead of an Excel file is a good choice for large datasets because CSV files are simpler, faster to write, and consume less memory. Below is a modified version of your script that incorporates optimizations for large files, uses streaming for JSON parsing, and outputs to a CSV file.
Key Optimizations
- **Streaming JSON Parsing**: Use `ijson` to parse JSON objects iteratively, avoiding loading the entire file into memory.
- **Chunked Processing**: Process the data in chunks to manage memory usage when converting to a DataFrame and writing to CSV (a small generator-based helper is sketched after this list).
- **CSV Output**: Write to a CSV file instead of Excel for faster performance and lower memory usage.
- **Generator-Based Processing**: Use generators to yield JSON objects one at a time, reducing memory overhead.
- **Efficient File Reading**: Read the file line by line or in chunks to handle large files efficiently.
- **Avoid Regex for Splitting**: Instead of using `re.sub` to split JSON objects, rely on streaming parsing to handle object boundaries.
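As a standalone illustration of the chunking idea, here is a minimal generator-based helper. The `chunked` name and the helper itself are illustrative assumptions; the full script below inlines the same logic with a plain list and a size check.

```python
# Illustration only: a hypothetical `chunked` helper that groups any iterator
# into fixed-size lists without materializing the whole stream in memory.
from itertools import islice

def chunked(iterable, size):
    """Yield lists of up to `size` items drawn lazily from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            break
        yield chunk

# Example: process a large iterator 3 items at a time
for batch in chunked(range(10), 3):
    print(batch)  # [0, 1, 2], then [3, 4, 5], [6, 7, 8], [9]
```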
Modified Script
```python
import json
import os

import ijson
import pandas as pd


def json_to_csv(json_file, csv_file='result.csv', chunk_size=10000):
    """
    Convert a large JSON file to CSV efficiently using streaming and
    chunked processing.

    Args:
        json_file (str): Path to the input JSON file.
        csv_file (str): Path to the output CSV file.
        chunk_size (int): Number of records to process in each chunk.
    """
    # Check if the input file exists
    if not os.path.exists(json_file):
        print(f"Error: File {json_file} does not exist.")
        return

    first_record = True  # Controls write mode and header output
    records = []

    # Open in binary mode, as ijson expects a byte stream
    with open(json_file, 'rb') as file:
        try:
            # Parse JSON objects iteratively (assumes a top-level JSON array)
            parser = ijson.items(file, 'item')
            for obj in parser:
                records.append(obj)
                # Flush a full chunk to CSV and free the memory
                if len(records) >= chunk_size:
                    write_chunk(records, csv_file, first_record)
                    first_record = False
                    records = []
        except ijson.JSONError as e:
            print(f"JSON parsing error: {e}. Attempting line-by-line parsing...")
            # Fallback: parse line by line (e.g., newline-delimited JSON)
            file.seek(0)  # Reset the file pointer and re-read from the start
            records = []
            current_obj = ""
            for line in file:
                line = line.decode('utf-8').strip()  # Decode bytes to str
                if not line:
                    continue
                current_obj += line
                try:
                    parsed = json.loads(current_obj)
                    records.append(parsed)
                    current_obj = ""
                except json.JSONDecodeError:
                    # Object not complete yet; keep accumulating lines
                    continue
                # Flush a full chunk to CSV
                if len(records) >= chunk_size:
                    write_chunk(records, csv_file, first_record)
                    first_record = False
                    records = []

    # Write any remaining records
    if records:
        write_chunk(records, csv_file, first_record)

    print(f"Data written successfully to {csv_file}")


def write_chunk(records, csv_file, first_record):
    """
    Write a chunk of records to the CSV file.

    Args:
        records (list): List of JSON objects to write.
        csv_file (str): Path to the output CSV file.
        first_record (bool): Whether this is the first chunk (to write headers).
    """
    # Flatten nested objects into columns and append to the CSV
    df = pd.json_normalize(records)
    mode = 'w' if first_record else 'a'
    df.to_csv(csv_file, mode=mode, header=first_record, index=False,
              encoding='utf-8')


# Example usage
json_file_path = "sample2.txt"
json_to_csv(json_file_path, csv_file='result.csv', chunk_size=10000)
```
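If you want a quick check that the conversion completed, you can read the CSV back in chunks so the verification itself stays memory-friendly. This check is optional and not part of the script above:

```python
# Optional sanity check: count rows in the generated CSV in a streaming way.
import pandas as pd

total_rows = 0
for chunk in pd.read_csv('result.csv', chunksize=50_000):
    total_rows += len(chunk)
print(f"Rows written: {total_rows}")
```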
Explanation of Changes and Optimizations
- **Use of `ijson`**:
  - The `ijson` library parses JSON objects iteratively, which is critical for large files: it avoids loading the entire file into memory.
  - The `items` method with `'item'` assumes the JSON file is an array of objects (e.g., `[{}, {}, ...]`). If your JSON file has a different structure (e.g., newline-delimited JSON), adjust the parsing logic accordingly (see the sketch after this list).
- **Chunked Processing**:
  - Records are collected in chunks of `chunk_size` (default: 10,000) before being converted to a DataFrame and written to CSV. This balances memory usage and performance.
  - The `write_chunk` function handles writing each chunk to the CSV file, ensuring headers are written only for the first chunk.
- **CSV Output**:
  - CSV is used instead of Excel because it is faster and more memory-efficient for large datasets; `pandas` writes CSV far more efficiently than Excel formats like XLSX.
  - The `to_csv` method is called with `mode='a'` (append) for subsequent chunks to avoid overwriting the file.
- **Fallback for Malformed JSON**:
  - If `ijson` fails (e.g., due to malformed JSON), the script falls back to line-by-line parsing, similar to your original script, but still processes records in chunks to limit memory use.
- **Binary File Reading**:
  - The file is opened in binary mode (`'rb'`) for `ijson`, which is more efficient for large files. In the fallback path, lines are decoded to strings as needed.
- **Error Handling**:
  - Checks whether the input file exists.
  - Handles JSON parsing errors gracefully with a fallback mechanism.
- **Configurable Chunk Size**:
  - The `chunk_size` parameter lets you tune the number of records processed at a time based on your system's memory capacity. A larger `chunk_size` reduces I/O overhead but increases memory usage.
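For newline-delimited JSON specifically, `ijson` can stream the top-level values directly, which avoids the fallback path entirely. A minimal sketch, assuming your `ijson` version (3.x) supports the `multiple_values` flag:

```python
# Sketch: stream newline-delimited JSON (one object per line) with ijson.
# Assumes ijson 3.x, which supports the multiple_values flag.
import ijson

def iter_ndjson(path):
    """Yield each top-level JSON object from an NDJSON file, one at a time."""
    with open(path, 'rb') as f:
        # prefix '' targets top-level values; multiple_values allows several of them
        for obj in ijson.items(f, '', multiple_values=True):
            yield obj

# Example: feed the streamed objects into the same chunked CSV writing
# for obj in iter_ndjson("sample2.txt"):
#     ...
```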
Prerequisites
Install the required libraries:
```bash
pip install pandas ijson
```

(`openpyxl` is only needed if you use the Excel alternative described in the notes below.)
Notes
- **JSON Structure**: The script assumes the JSON file is either an array of objects or newline-delimited JSON (one object per line). If your JSON has a different structure, you may need to adjust the `ijson.items` call or the fallback parsing logic.
- **Performance Tuning**: Adjust `chunk_size` based on your system's memory. For a system with 8 GB RAM, a `chunk_size` of 10,000–50,000 is typically reasonable.
- **Output Size**: CSV files are generally smaller than Excel files, but ensure you have enough disk space for the output (a ~25 MB input could produce a larger CSV depending on the data).
- **Excel Alternative**: If you still need Excel output, you can modify the `write_chunk` function to use `df.to_excel` with `engine='openpyxl'` and append mode (using `ExcelWriter`), but this will be slower and more memory-intensive (a hedged sketch follows this list).
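For reference, here is a rough sketch of that Excel variant. It assumes pandas 1.4+ (for `if_sheet_exists='overlay'`) and `openpyxl`, mirrors the CSV `write_chunk` signature, and has not been tuned for very large outputs:

```python
# Sketch of an Excel variant of write_chunk (slower and more memory-hungry
# than CSV). Assumes pandas >= 1.4 and openpyxl; names mirror the CSV version.
import os
import pandas as pd
from openpyxl import load_workbook

def write_chunk_excel(records, xlsx_file, first_record):
    df = pd.json_normalize(records)
    if first_record or not os.path.exists(xlsx_file):
        # First chunk: create the workbook and write the header row
        df.to_excel(xlsx_file, index=False, engine='openpyxl')
        return
    # Later chunks: append below the existing rows, without repeating headers
    startrow = load_workbook(xlsx_file).active.max_row
    with pd.ExcelWriter(xlsx_file, engine='openpyxl', mode='a',
                        if_sheet_exists='overlay') as writer:
        df.to_excel(writer, index=False, header=False, startrow=startrow)
```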
Example Input and Output
Input (`sample2.txt`):

```json
[{"id": 1, "name": "Alice", "age": 30}, {"id": 2, "name": "Bob", "age": 25}]
```

or newline-delimited:

```
{"id": 1, "name": "Alice", "age": 30}
{"id": 2, "name": "Bob", "age": 25}
```

Output (`result.csv`):

```csv
id,name,age
1,Alice,30
2,Bob,25
```
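One detail worth knowing: `pd.json_normalize` flattens nested objects into dot-separated column names, so nested records still produce a flat CSV. For example:

```python
# Illustration: nested JSON objects become dot-separated CSV columns.
import pandas as pd

records = [{"id": 1, "user": {"name": "Alice", "age": 30}}]
df = pd.json_normalize(records)
print(list(df.columns))  # ['id', 'user.name', 'user.age']
```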
This script should handle large files efficiently while maintaining robustness and flexibility. Let me know if you need further tweaks or have a specific JSON structure to accommodate!