Start the PySpark framework with Jupyter Notebook and Snowflake Data Warehouse using JSON and CSV data files

Vidit Tyagi
3 min read · Sep 5, 2022
Learn PySpark with Jupyter Notebook and Snowflake Data Warehouse, loading and unloading JSON and CSV files.

Hi Guys,

I am writing this blog to show how to get started with PySpark and how to execute actions and transformations on PySpark DataFrames.

The interesting thing is that we can do all of this in Jupyter Notebook, a browser-based notebook environment for Python. We can install it with the pip command pip install notebook and then execute the Python code cell by cell.

Here are the prerequisites/requirements I am going to use for this ETL process.

  1. Have a trial account on Snowflake Data Warehouse (sign up at https://signup.snowflake.com/) to load/unload data using SnowSQL and PySpark.
  2. Python is required on your system; check your version with:
    python --version  (3.7 or later)
  3. Install/upgrade pip: python -m pip install --upgrade pip
  4. Install PySpark: pip install pyspark
  5. Install Jupyter Notebook: pip install notebook
  6. Run Jupyter Notebook: python -m notebook

Let's get ready with some data files (JSON, CSV, etc.). You can choose any data file you like to play with DataFrame transformations and actions.

Video: start PySpark with Jupyter Notebook using Python
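If you prefer a text version, here is a minimal sketch of what the first notebook cells can look like. The file names and paths (data/customer.csv, data/orders.json) are only examples; use whatever files you downloaded from Snowflake.

    from pyspark.sql import SparkSession

    # start a local Spark session inside the notebook
    spark = SparkSession.builder \
        .appName("pyspark-snowflake-demo") \
        .getOrCreate()

    # read the CSV/JSON files; header and inferSchema give proper column names and types
    customer_df = spark.read.option("header", True).option("inferSchema", True).csv("data/customer.csv")
    orders_df = spark.read.json("data/orders.json")

    customer_df.printSchema()
    customer_df.show(5)

Run each cell one by one in Jupyter Notebook and check the output of printSchema() and show().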

I am using Snowflake Data Warehouse here and extracting data from it with the SnowSQL command-line tool. Since Snowflake is part of this setup, you can create a trial account, which is free for 30 days and does not require any card details, so you can enjoy it and learn Snowflake Data Warehouse along the way.

Extract data from Snowflake Data Warehouse using the user stage

Video: extract data from the Snowflake shared database into the Snowflake user stage
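As a rough sketch of that unload step, a COPY INTO statement like the one below (run from SnowSQL or a worksheet) writes a table out to your user stage. The shared table snowflake_sample_data.tpch_sf1.customer and the unload/ path are only example names; swap in whatever table you are extracting.

    -- unload a sample table from the shared database into your user stage (@~)
    COPY INTO @~/unload/customer_
      FROM snowflake_sample_data.tpch_sf1.customer
      FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
      OVERWRITE = TRUE;

    -- check what landed in the user stage
    LIST @~/unload/;

For JSON output, use FILE_FORMAT = (TYPE = JSON) together with a query that returns a single VARIANT column (for example OBJECT_CONSTRUCT(*)).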

Now let's download the files to your system using SnowSQL:

Video: download the CSV/JSON files using the SnowSQL GET command
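For reference, the GET step looks roughly like this inside the snowsql prompt; the stage path and the local directory are example values, so point them at your own staged files and folder.

    -- run inside the snowsql CLI: download staged files to a local folder
    GET @~/unload/ file:///tmp/snowflake_data/;
    -- on Windows the local target looks like file://C:\temp\data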

Please watch and subscribe to the videos below.

There are various functions you can apply to DataFrames; play around with them :) to understand how they work.

Video: PySpark DataFrame operations, filters, and methods
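As a small example of the kind of thing shown there, here is a filtering sketch on the customer DataFrame created earlier; the column names (C_CUSTKEY, C_ACCTBAL, C_MKTSEGMENT, ...) are just examples taken from the sample customer data.

    from pyspark.sql import functions as F

    # pick a few columns and keep only the rows that match a condition
    filtered_df = (customer_df
        .select("C_CUSTKEY", "C_NAME", "C_MKTSEGMENT", "C_ACCTBAL")
        .filter(F.col("C_ACCTBAL") > 1000)
        .orderBy(F.col("C_ACCTBAL").desc()))

    # transformations are lazy; actions such as show() and count() trigger execution
    filtered_df.show(10)
    print(filtered_df.count())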

Once you are able to do that, you can go further and try the PySpark aggregate functions (count, avg, max, min, stddev, etc., used with groupBy), which are important things to learn.

You can watch and follow the PySpark aggregate functions video below.

Video: PySpark aggregate functions
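Here is a minimal groupBy/agg sketch along the same lines; again, the column names are example values from the sample customer data.

    from pyspark.sql import functions as F

    # group by market segment and compute several aggregates in one pass
    agg_df = (customer_df
        .groupBy("C_MKTSEGMENT")
        .agg(F.count("*").alias("customers"),
             F.avg("C_ACCTBAL").alias("avg_balance"),
             F.min("C_ACCTBAL").alias("min_balance"),
             F.max("C_ACCTBAL").alias("max_balance"),
             F.stddev("C_ACCTBAL").alias("stddev_balance")))

    agg_df.show()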

I will keep adding more to this; in the meantime, learn these basics and play around with the data :)
