Getting Started with PySpark in Jupyter Notebook and the Snowflake Data Warehouse Using JSON and CSV Data Files
Hi Guys,
I am writing this blog to show how to start PySpark and how to run actions and transformations on PySpark DataFrames.
The interesting part is that we can do all of this in Jupyter Notebook, a Python package that we can install with pip (pip install notebook) and that lets us run Python code one cell at a time.
I am going to use the following prerequisites for this ETL process.
- Create a trial account on the Snowflake data warehouse at https://signup.snowflake.com/ to load and unload data using snowsql and PySpark.
- Python (3.7+) is required on your system. Check the version using python --version
- Install or upgrade pip using python -m pip install --upgrade pip
- Install pyspark using pip install pyspark
- Install Jupyter Notebook using pip install notebook
- Run Jupyter Notebook using python -m notebook
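Before opening the notebook, it can help to confirm that the prerequisites above are actually in place. Here is a small sketch that uses only the Python standard library. The Java and snowsql checks are my additions (PySpark needs a Java runtime to start, and snowsql is used later in this post), and the function name check_environment is just a name I picked for illustration.

```python
import importlib.util
import shutil
import sys

# Import names of the packages installed with pip in the list above.
REQUIRED_PACKAGES = ["pyspark", "notebook"]
# Command-line tools expected on PATH (assumption: java for Spark, snowsql for Snowflake).
REQUIRED_TOOLS = ["java", "snowsql"]


def check_environment(packages=REQUIRED_PACKAGES, tools=REQUIRED_TOOLS, min_python=(3, 7)):
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when the package cannot be imported.
        report[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item:12s} {'OK' if ok else 'MISSING'}")
```

Anything reported as MISSING should be installed before moving on to the next steps.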
Let's get some data files ready (JSON, CSV, etc.). You can choose any data file to try DataFrame transformations and actions on.
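If you do not have any files at hand, here is a minimal sketch that writes a small CSV file and a JSON file and then loads both into PySpark. The employee records, file names and function names are made up for illustration, so replace them with your own data. The JSON file is written with one object per line because that is what Spark's JSON reader expects by default.

```python
import csv
import json
from pathlib import Path

# Hypothetical sample records; replace with your own data.
EMPLOYEES = [
    {"id": 1, "name": "Asha", "dept": "sales", "salary": 4200},
    {"id": 2, "name": "Ben", "dept": "sales", "salary": 3900},
    {"id": 3, "name": "Chen", "dept": "engineering", "salary": 5100},
    {"id": 4, "name": "Dina", "dept": "engineering", "salary": 4700},
]


def write_sample_files(folder="data"):
    """Write the records above as employees.csv and employees.json."""
    out = Path(folder)
    out.mkdir(parents=True, exist_ok=True)
    csv_path = out / "employees.csv"
    json_path = out / "employees.json"

    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(EMPLOYEES[0]))
        writer.writeheader()
        writer.writerows(EMPLOYEES)

    # One JSON object per line (JSON Lines), Spark's default JSON layout.
    with json_path.open("w") as f:
        for row in EMPLOYEES:
            f.write(json.dumps(row) + "\n")

    return csv_path, json_path


def read_sample_files(csv_path, json_path):
    """Load both files into PySpark DataFrames (needs pyspark and Java)."""
    # Imported here so the file-writing part works even without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sample-files").getOrCreate()
    csv_df = spark.read.csv(str(csv_path), header=True, inferSchema=True)
    json_df = spark.read.json(str(json_path))
    return csv_df, json_df
```

In a notebook cell you could call write_sample_files() first, pass the two returned paths to read_sample_files(), and then call show() or printSchema() on each DataFrame to see what Spark inferred.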
I am using the snowsql command-line tool to extract data from the Snowflake data warehouse. Since Snowflake is part of this walkthrough, you can create a trial account, which is free for 30 days and does not need any card details, so you can enjoy it and learn Snowflake.
Extract data from the Snowflake data warehouse using the user stage
and download it to your system using snowsql:
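The exact commands depend on your account, so treat the following as a rough sketch rather than the one right way to do it. It builds a COPY INTO statement that unloads a table to the user stage (@~) and a GET statement that downloads the staged files, then runs each one through snowsql with the -a, -u and -q options. The table name, stage prefix, local folder and helper names are all placeholders I made up. I am also assuming that snowsql can find your password (for example through the SNOWSQL_PWD environment variable), that your user has a default warehouse, and that the table name is fully qualified or a default database and schema are configured.

```python
import os
import subprocess


def unload_statements(table, stage_prefix, local_dir):
    """Build the SQL to unload a table to the user stage and download it."""
    copy_sql = (
        f"COPY INTO @~/{stage_prefix}/ FROM {table} "
        "FILE_FORMAT = (TYPE = CSV COMPRESSION = NONE) "
        "HEADER = TRUE OVERWRITE = TRUE;"
    )
    # GET copies the staged files to a local folder (Linux/macOS style path).
    get_sql = f"GET @~/{stage_prefix}/ file://{local_dir}/;"
    return copy_sql, get_sql


def snowsql_command(account, user, sql):
    """Return the snowsql invocation as an argument list for subprocess."""
    return ["snowsql", "-a", account, "-u", user, "-q", sql]


def run_unload(account, user, table, stage_prefix, local_dir):
    """Run both statements in order; raises if snowsql exits with an error."""
    os.makedirs(local_dir, exist_ok=True)
    for sql in unload_statements(table, stage_prefix, local_dir):
        subprocess.run(snowsql_command(account, user, sql), check=True)
```

For example, run_unload("my_account", "my_user", "mydb.public.employees", "unload", "/tmp/snowflake_unload") would leave the CSV files in /tmp/snowflake_unload, ready to be read with PySpark. All five argument values here are placeholders.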
Please watch and subscribe to the videos below.
There are various functions you can apply to DataFrames, so play around with them :) to understand how they work.
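To make the difference between transformations and actions concrete, here is a minimal sketch. The sample rows, column names and function names are hypothetical. filter, withColumn, select and orderBy are transformations, so Spark only records them; show and count are actions, and they are what actually trigger the work.

```python
# Hypothetical sample rows; swap in the data you loaded from CSV or JSON.
ROWS = [
    (1, "Asha", "sales", 4200),
    (2, "Ben", "sales", 3900),
    (3, "Chen", "engineering", 5100),
    (4, "Dina", "engineering", 4700),
]
COLUMNS = ["id", "name", "dept", "salary"]


def get_spark(app_name="dataframe-demo"):
    """Start (or reuse) a local SparkSession."""
    # Imported inside the function so this cell still loads without PySpark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


def transform_demo(spark):
    """Chain a few lazy transformations, then trigger them with actions."""
    from pyspark.sql import functions as F

    df = spark.createDataFrame(ROWS, COLUMNS)

    # Transformations: nothing runs yet, Spark only records the plan.
    well_paid = (
        df.filter(F.col("salary") > 4000)
        .withColumn("salary_k", F.col("salary") / 1000)
        .select("name", "dept", "salary_k")
        .orderBy(F.col("salary_k").desc())
    )

    # Actions: show() and count() make Spark execute the plan.
    well_paid.show()
    return well_paid.count()
```

In a notebook cell, transform_demo(get_spark()) prints the filtered rows and returns how many there are. Try changing the filter condition or adding another withColumn to see how the result changes.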
Once you are able to do that, you can try the PySpark aggregate functions (count, avg, max, min, stddev, etc.) with the groupBy command, which are important things to learn.
You can watch and follow the PySpark aggregate functions video below.
I will be adding more to this post. Until then, learn this and play around with the data :)
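As a starting point, here is a small sketch of those aggregate functions used together with groupBy. The data, column names and function names are made up for illustration. Note that stddev is the sample standard deviation, so it is not defined for a group that has only one row.

```python
# Hypothetical sample rows with a "dept" column to group on.
ROWS = [
    (1, "Asha", "sales", 4200),
    (2, "Ben", "sales", 3900),
    (3, "Chen", "engineering", 5100),
    (4, "Dina", "engineering", 4700),
]
COLUMNS = ["id", "name", "dept", "salary"]


def dept_summary(df):
    """Return one row per department with several salary aggregates."""
    # Imported inside the function so this cell still loads without PySpark.
    from pyspark.sql import functions as F

    return df.groupBy("dept").agg(
        F.count("*").alias("employees"),
        F.avg("salary").alias("avg_salary"),
        F.max("salary").alias("max_salary"),
        F.min("salary").alias("min_salary"),
        F.stddev("salary").alias("stddev_salary"),
    )


def aggregate_demo():
    """Build the sample DataFrame, aggregate it and print the result."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("aggregate-demo").getOrCreate()
    df = spark.createDataFrame(ROWS, COLUMNS)
    summary = dept_summary(df)
    summary.show()  # action: runs the grouped aggregation
    return summary
```

Calling aggregate_demo() in a notebook cell prints one line per department. You can pass your own DataFrame to dept_summary as long as it has dept and salary columns.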