How to ingest CSV, Parquet & JSON files into the Snowflake data warehouse using PySpark DataFrames
Hi,
In this blog, I ingest various types of data files (CSV, Parquet & JSON) into the Snowflake data warehouse. PySpark uses its session to perform the ingestion task; while writing, it creates a temporary stage in Snowflake and loads the data from there into the database table.
You may already be familiar with the PySpark engine and with converting file data from various formats into DataFrames, on which you can then run actions and transformations. I use a few of them here to check the schema, count the records, find and remove duplicate records, and finally ingest the data into the Snowflake data warehouse.
I performed the data ingestion using the following steps:
- Create a Snowflake connection using a private key.
- Create a Spark instance using SparkSession in local cluster mode (see the sketch after this list).
- Read data from files (CSV, JSON, Parquet) and perform transformations on the DataFrames.
- Ingest the data into the Snowflake data warehouse using the PySpark write API.
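A minimal sketch of the SparkSession step: a local-mode session with the Snowflake Spark connector on the classpath. The package versions below are assumptions; pick the ones matching your Spark and Scala versions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("snowflake-ingestion")
    .master("local[*]")  # local cluster mode
    .config(
        "spark.jars.packages",
        # assumed versions; align with your Spark/Scala build
        "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4,"
        "net.snowflake:snowflake-jdbc:3.14.4",
    )
    .getOrCreate()
)
```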
Create a Snowflake connection using a private key:
Add a config.properties file inside your project folder:
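A sample of what the file might contain; every key name and value here is a placeholder you should replace with your own account details.

```properties
# placeholder values only
account_url=xy12345.snowflakecomputing.com
user=INGEST_USER
database=DEMO_DB
schema=PUBLIC
warehouse=COMPUTE_WH
role=SYSADMIN
private_key_path=keys/rsa_key.p8
```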
Create a Snowflake connection method using the details above:
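A minimal sketch of a helper that reads the private key file; the function name, path handling, and passphrase argument are assumptions. It returns the unencrypted key body, which is what the Spark Snowflake connector's pem_private_key option expects.

```python
from cryptography.hazmat.primitives import serialization

def load_private_key(key_path, passphrase=None):
    """Read a PKCS#8 private key file and return its unencrypted PEM body."""
    with open(key_path, "rb") as key_file:
        private_key = serialization.load_pem_private_key(
            key_file.read(),
            password=passphrase.encode() if passphrase else None,
        )
    pem_key = private_key.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    ).decode()
    # Keep only the base64 body between the BEGIN/END markers.
    return "".join(pem_key.strip().split("\n")[1:-1])
```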
The connection is then ready as the connector options shown below:
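A minimal sketch of the options dictionary passed to the connector, assuming the config.properties keys shown earlier and the hypothetical load_private_key helper above.

```python
from configparser import ConfigParser

parser = ConfigParser()
with open("config.properties") as f:
    # a properties file has no section header, so prepend one for configparser
    parser.read_string("[snowflake]\n" + f.read())
cfg = parser["snowflake"]

sf_options = {
    "sfURL": cfg["account_url"],   # <account>.snowflakecomputing.com
    "sfUser": cfg["user"],
    "sfDatabase": cfg["database"],
    "sfSchema": cfg["schema"],
    "sfWarehouse": cfg["warehouse"],
    "sfRole": cfg["role"],
    "pem_private_key": load_private_key(cfg["private_key_path"]),
}
```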
Read the CSV file data:
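A minimal sketch, assuming a headered CSV at a hypothetical path:

```python
csv_df = (
    spark.read
    .option("header", "true")        # first row holds the column names
    .option("inferSchema", "true")   # let Spark infer column types
    .csv("data/employees.csv")       # hypothetical path
)
csv_df.printSchema()
print("record count:", csv_df.count())
```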
Please go through the video below; it is a complete practical walkthrough on the Spark engine showing how to perform CSV data ingestion into the Snowflake data warehouse.
Read the Parquet file data:
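A minimal sketch; the path is an assumption. Parquet files carry their own schema, so no schema inference option is needed.

```python
parquet_df = spark.read.parquet("data/employees.parquet")  # hypothetical path
parquet_df.printSchema()
```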
Read the JSON file data:
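A minimal sketch; the path is an assumption. By default spark.read.json expects one JSON object per line, so enable multiLine for a single pretty-printed document or array.

```python
json_df = (
    spark.read
    .option("multiLine", "true")
    .json("data/employees.json")  # hypothetical path
)
json_df.show(5)
```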
Please go through the video below; it is a complete practical walkthrough on the Spark engine showing how to perform Parquet and JSON data ingestion into the Snowflake data warehouse.
Transformations:
Below are the duplicate records in the DataFrame:
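One way to list them, sketched against the csv_df DataFrame from earlier: group on all columns and keep the groups that occur more than once.

```python
from pyspark.sql import functions as F

duplicates_df = (
    csv_df.groupBy(csv_df.columns)   # group on every column
    .count()
    .filter(F.col("count") > 1)      # keep only repeated rows
)
duplicates_df.show()
```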
Remove the duplicate records using the distinct() or dropDuplicates() method:
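A short sketch of both options; the "id" key column is hypothetical.

```python
# distinct() removes rows duplicated across all columns
dedup_df = csv_df.distinct()

# dropDuplicates keeps one row per key column instead
# dedup_df = csv_df.dropDuplicates(["id"])

print("records after dedup:", dedup_df.count())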
Finally, we perform the data ingestion step :)
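A minimal sketch of the write, assuming the sf_options dictionary built earlier and a hypothetical target table name. The connector stages the DataFrame in a temporary Snowflake stage and then loads it into the database table.

```python
(
    dedup_df.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "EMPLOYEES")  # hypothetical target table
    .mode("append")
    .save()
)
```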
Please watch and subscribe to my videos; they will give you the full picture of how to perform the above steps:
Thank You :)