Benefits of using Parquet file format: Parquet is a columnar storage file format designed for analytical, big data workloads. Because values from the same column are stored together, Parquet files compress well, and a query engine can read only the columns a query needs instead of scanning whole rows. This makes the format particularly beneficial for read-heavy operations where only a subset of columns is accessed.
Amazon Athena, Amazon Redshift Spectrum, and AWS Glue all support reading Parquet files, making it a popular choice for big data processing on AWS. Parquet files can be stored in Amazon S3, which is a highly scalable and durable object storage service. This allows for efficient storage and retrieval of large datasets in a cost-effective manner.
JSON can be converted to the Parquet file format with Python. Below is a simple Python script that reads a JSON file and writes the data to a Parquet file using the pandas library.
You can install pandas with pip install pandas if you don’t already have it installed. Note that writing Parquet from pandas also requires a Parquet engine, either pyarrow or fastparquet, so install one of them as well (for example, pip install pyarrow). You can use virtual environments to manage your Python dependencies too.
main.py file:
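import sys

import pandas as pd

# Usage: python main.py <input.json> <output.parquet>
file_in = sys.argv[1]
file_op = sys.argv[2]

# Read the JSON array of records into a DataFrame, then write it out as Parquet.
data = pd.read_json(file_in)
data.to_parquet(file_op)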
The input JSON data, data.json, is shown below:
[
  {
    "event_id": "b8a7eb74-cfca-49e4-83c4-728c10458c23",
    "title": "Zieme and Sons",
    "language": "en",
    "content": "entrepreneur I am Kate Daniel IV living in Fort Freeman and my email address is Ena62@gmail.com and August6@hotmail.com. I work for the company McKenzie, Koepp and Bednar my mobile to reach is 776.923.5653"
  },
  {
    "event_id": "ab83a922-44f7-4e57-b24a-92a9bd193ed5",
    "title": "Bode - Funk",
    "language": "en",
    "content": "I am Bessie Cormier my mobile to reach is 219-795-3134 x8903"
  },
  {
    "event_id": "bc459a28-3ce5-423e-b68a-05464e32fce0",
    "title": "Reichert Group",
    "language": "en",
    "content": "I am iving in Sandraborough. I work for the company Hermann - Herzog"
  },
  {
    "event_id": "430d3ba2-58a1-4b4f-9628-c6a99eced72a",
    "title": "Schamberger, Schmidt and Reilly",
    "language": "en",
    "content": "Hello I am veteran living in Deckowberg"
  },
  {
    "event_id": "ca40439d-4f0c-4490-b3c9-5e0282dd95f9",
    "title": "Cervantes - Bailey",
    "language": "en",
    "content": "Richard Tucker, born on July 28, 1981, living at 331 Mckinney Mount Cruzland, SD 16758, with a phone number of 540-308-0574x8345 and email address thomasrodriguez@gmail.com, recently applied for a loan."
  }
]
Run the script with the input and output paths as arguments:
python main.py ~/data.json ~/data.parquet