How to ingest CSV, Parquet & JSON files into the Snowflake data warehouse using PySpark DataFrames

Vidit Tyagi
3 min read · Sep 18, 2022


[Image: PySpark-Snowflake data ingestion with CSV, Parquet, and JSON data files]

Hi,

In this blog, I will ingest various types of data files (CSV, Parquet, and JSON) into the Snowflake data warehouse. PySpark performs the ingestion through its session; under the hood, it creates a temporary stage in Snowflake and inserts the data into the database table from there.

You may already be familiar with the PySpark data engine: how to convert file data from various formats into DataFrames, and then perform actions and transformations on those DataFrames. I also use a few of these here to check the schema, count the records, find duplicate records, remove them, and finally ingest the data into the Snowflake data warehouse.

I performed the data ingestion using the following steps:

  1. Create a Snowflake connection using a private key.
  2. Create a Spark instance using SparkSession in local cluster mode (see the sketch just below).
  3. Read data from files (CSV, JSON, Parquet) and perform transformations on the DataFrames.
  4. Ingest the data into the Snowflake data warehouse using the PySpark write API.
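
A minimal sketch of step 2: a SparkSession running in local cluster mode. The connector package coordinates below are assumptions for a Spark 3 / Scala 2.12 setup; match them to your own Spark version.

```python
from pyspark.sql import SparkSession

# Local-mode SparkSession; the Snowflake connector and JDBC driver
# coordinates are assumptions: pick the versions matching your Spark build.
spark = (
    SparkSession.builder
    .appName("snowflake-ingestion")
    .master("local[*]")
    .config(
        "spark.jars.packages",
        "net.snowflake:spark-snowflake_2.12:2.11.0-spark_3.3,"
        "net.snowflake:snowflake-jdbc:3.13.22",
    )
    .getOrCreate()
)
```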

Create a Snowflake connection using a private key:

Add a config.properties file inside your project folder:

[Image: Snowflake connection properties (account details and rsa_key.p8) used to make the connection in Python]
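
The screenshot is not reproduced here, but a config.properties along these lines would work; the key names and the [snowflake] section header are illustrative choices for this sketch:

```properties
[snowflake]
sfURL=<your_account>.snowflakecomputing.com
sfUser=<your_user>
sfDatabase=<your_database>
sfSchema=<your_schema>
sfWarehouse=<your_warehouse>
sfRole=<your_role>
private_key_path=rsa_key.p8
```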

A method that creates the Snowflake connection options using the above details:

[Image: making the Snowflake connection using account details and rsa_key.p8]
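
A sketch of such a method, assuming the config.properties above and an unencrypted rsa_key.p8 (if your key has a passphrase, pass it as the password argument); get_sf_options is a hypothetical helper name:

```python
import re
import configparser
from cryptography.hazmat.primitives import serialization

def get_sf_options(config_path="config.properties"):
    # Read the [snowflake] section; optionxform = str keeps key case
    # (ConfigParser lowercases option names by default).
    config = configparser.ConfigParser()
    config.optionxform = str
    config.read(config_path)
    props = config["snowflake"]

    # Load the PKCS#8 private key and re-serialize it as an unencrypted
    # PEM, then strip the header/footer and newlines: the Spark connector
    # expects the bare base64 body in the pem_private_key option.
    with open(props["private_key_path"], "rb") as key_file:
        p_key = serialization.load_pem_private_key(key_file.read(), password=None)
    pem = p_key.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    ).decode("utf-8")
    pem_body = re.sub(r"-+(BEGIN|END) PRIVATE KEY-+\n?", "", pem).replace("\n", "")

    return {
        "sfURL": props["sfURL"],
        "sfUser": props["sfUser"],
        "sfDatabase": props["sfDatabase"],
        "sfSchema": props["sfSchema"],
        "sfWarehouse": props["sfWarehouse"],
        "sfRole": props["sfRole"],
        "pem_private_key": pem_body,
    }
```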

The connection is then ready in the format options below:

[Image: Snowflake connection ready with configuration details]
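
For reference, the shape of the options the connector consumes, assuming the hypothetical get_sf_options() helper above:

```python
# Source name registered by the Snowflake Spark connector.
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

sf_options = get_sf_options()
# sf_options now holds the keys the connector expects:
# sfURL, sfUser, sfDatabase, sfSchema, sfWarehouse, sfRole, pem_private_key
```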

Read the CSV file data:

[Image: reading a CSV file using PySpark]
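
A minimal sketch of the CSV read; the file path is illustrative:

```python
# Read a CSV file into a DataFrame; header and schema inference
# are enabled so column names and types come from the file.
csv_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("data/employees.csv")
)
csv_df.printSchema()
print(csv_df.count())   # number of records
```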

Please go through the video below; it is a complete practical session on the Spark data engine showing how to perform CSV data ingestion into the Snowflake data warehouse.

Read the Parquet file data:

[Image: reading Parquet file data using PySpark]
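
A sketch of the Parquet read, with an illustrative path; Parquet files carry their own schema, so no options are needed:

```python
parquet_df = spark.read.parquet("data/employees.parquet")
parquet_df.show(5)   # peek at the first few rows
```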

Read the JSON file data:

[Image: reading JSON file data using PySpark]
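
A sketch of the JSON read; the path is illustrative, and multiLine is only needed if the file is a single pretty-printed JSON document rather than one object per line:

```python
json_df = (
    spark.read
    .option("multiLine", True)
    .json("data/employees.json")
)
json_df.printSchema()
```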

Please go through the video below; it is a complete practical session on the Spark data engine showing how to perform Parquet and JSON file data ingestion into the Snowflake data warehouse.

Transformations:
Below are the duplicate records in the DataFrame:

[Image: duplicate records in a PySpark DataFrame]
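
One way to surface those duplicates, assuming the csv_df DataFrame read earlier: group on every column and keep the groups that occur more than once.

```python
from pyspark.sql import functions as F

dup_df = (
    csv_df.groupBy(csv_df.columns)   # group on all columns
    .count()
    .filter(F.col("count") > 1)      # keep rows that appear more than once
)
dup_df.show()
```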

Remove the duplicate records using the distinct() or dropDuplicates() method:

[Image: removing duplicate records using distinct()]
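
A sketch of both approaches; the "id" column passed to dropDuplicates() is illustrative. distinct() compares whole rows, while dropDuplicates() can be restricted to a subset of columns:

```python
dedup_df = csv_df.distinct()                  # exact duplicate rows removed
dedup_by_id = csv_df.dropDuplicates(["id"])   # one row per illustrative "id"
print(dedup_df.count())
```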

Finally, we perform the data ingestion step :)
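
A sketch of that final write, assuming the dedup_df and sf_options built above; the target table name is illustrative. The connector stages the data in a temporary Snowflake stage before loading it into the table, as mentioned at the start:

```python
(
    dedup_df.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)              # connection options built earlier
    .option("dbtable", "EMPLOYEES")     # illustrative target table
    .mode("append")                     # or "overwrite"
    .save()
)
```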

Please watch and subscribe to my videos; they will give you the full picture of how to perform the above steps:

Thank You :)
