Prerequisites: Python
Versions: Python 3.9.6, pandas 2.0.1
Read Time: 60 minutes
Python is a powerful programming language that can be used for a variety of tasks, including analyzing data from a CSV file. We'll go over how to use Python to import data and run an analysis on it. We'll be using the pandas library, a popular data analysis tool for Python.
Amazon Best Sellers are updated every hour. The actual list is made of 100 books, but the data we're working with features just the top 50 books. 📖
In this tutorial, we will work with a CSV (comma-separated values) file that features some fun data about the top 50 best selling books on Amazon from 2009 to 2019 (provided by Kaggle).
Note: If you don't have a Kaggle account, you can also download it from our GitHub:
The .csv file contains 550 rows of books (the top 50 for each year from 2009 to 2019). Here are the seven columns:
Name: Book name
Author: Book author
User Rating: Amazon user rating (0.0 - 5.0)
Reviews: Number of user reviews
Price: Book price (as of 2020)
Year: The year(s) it ranked
Genre: Fiction or non-fiction
Poke around and see if you have any questions. For example: Who are the best selling authors? What are the books with 4.8+ user ratings? Which one performed better: fiction or non-fiction?
Let's dig in!
Before we begin, we need to install the pandas library. To do this, open the terminal and type the following command:
pip3 install pandas
We can confirm that pandas was installed by running pip3 list.
Note: If you run into any issues when running these commands, check if pip is installed by running pip --version or pip3 --version. If you don't get a version number, follow these steps to install pip on your computer.
Next, we'll create a folder named amazon-best-sellers-analysis. Inside the folder, create a main.py file and then import pandas (using pd as an alias):
import pandas as pd
# Rest of the file goes here...
Next, we need to load the data into our Python program.
Download the bestsellers.csv file and add it to the same folder as your main.py file, amazon-best-sellers-analysis.
To read CSV files, we'll use the .read_csv() function provided by pandas. Then we will save this data to a new df variable:
import pandas as pd
df = pd.read_csv('bestsellers.csv')
The df variable is a DataFrame object that allows us to work with tabular data via pandas.
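To get a feel for what a DataFrame is, here is a minimal sketch that builds one by hand from a Python dictionary. The two rows below are made up for illustration; they are not from the real bestsellers.csv:

```python
import pandas as pd

# A tiny stand-in for the real bestsellers data (these rows are made up)
toy = pd.DataFrame({
    "Name": ["Book A", "Book B"],
    "Author": ["Jane Doe", "John Roe"],
    "User Rating": [4.8, 4.2],
})

print(type(toy))           # a pandas DataFrame
print(toy["User Rating"])  # a single column comes back as a Series
```

Reading a CSV produces the same kind of object, just with the rows filled in from the file.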
Let's save the main.py file and move to the next step!
Once we've loaded the spreadsheet data, we can explore it to get a better understanding of what we're working with. We can use various functions provided by pandas to do this. Here are a few examples:
# Get the first 5 rows of the spreadsheet
print(df.head())
# Get the shape of the spreadsheet
print(df.shape)
# Get the column names of the spreadsheet
print(df.columns)
# Get summary statistics for each column
print(df.describe())
These functions will give us a good overview of the data we're working with.
Let's save the main.py file and run python3 main.py from inside the amazon-best-sellers-analysis folder (so the relative path to the CSV resolves correctly) to print the following:
Name ... Genre
0 10-Day Green Smoothie Cleanse ... Non Fiction
1 11/22/63: A Novel ... Fiction
2 12 Rules for Life: An Antidote to Chaos ... Non Fiction
3 1984 (Signet Classics) ... Fiction
4 5,000 Awesome Facts (About Everything!) (Natio... ... Non Fiction
[5 rows x 7 columns]
(550, 7)
Index(['Name', 'Author', 'User Rating', 'Reviews', 'Price', 'Year', 'Genre'], dtype='object')
User Rating Reviews Price Year
count 550.000000 550.000000 550.000000 550.000000
mean 4.618364 11953.281818 13.100000 2014.000000
std 0.226980 11731.132017 10.842262 3.165156
min 3.300000 37.000000 0.000000 2009.000000
25% 4.500000 4058.000000 7.000000 2011.000000
50% 4.700000 8580.000000 11.000000 2014.000000
75% 4.800000 17253.250000 16.000000 2017.000000
max 4.900000 87841.000000 105.000000 2019.000000
Once we've explored the data, we may need to clean it before running an analysis.
The first thing we can do is check for and remove any duplicate rows in the dataset using the drop_duplicates() function.
df.drop_duplicates(inplace=True)
By setting the inplace parameter to True, the changes are made directly to the original DataFrame.
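If you'd rather avoid inplace=True, the same result can be had by reassigning the return value. A minimal sketch, using a toy DataFrame with made-up rows rather than the real CSV:

```python
import pandas as pd

# Toy DataFrame with one duplicated row (hypothetical data, not the real CSV)
df = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B"],
    "User Rating": [4.8, 4.8, 4.2],
})

# Equivalent to df.drop_duplicates(inplace=True), but by reassignment
df = df.drop_duplicates()
print(df.shape)  # (2, 2)
```

Both styles work; reassignment is often preferred because it keeps the data flow explicit.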
Next, we can rename the columns of the DataFrame to make them more descriptive and easier to work with. We can use the rename() function to do this.
df.rename(columns={"Name": "Title", "Year": "Publication Year", "User Rating": "Rating"}, inplace=True)
This code renames the "Name" column to "Title", the "Year" column to "Publication Year", and the "User Rating" column to "Rating". Again, by setting the inplace parameter to True, the changes are made directly to the original DataFrame.
Finally, we can convert the "Price" column to a float data type to make it easier to work with. We can use the astype() function to do this.
df["Price"] = df["Price"].astype(float)
This code converts the "Price" column to a float data type. Note that we select the "Price" column of the DataFrame using square bracket notation, and then apply the astype() function to it. The resulting values are then stored back in the "Price" column of the DataFrame.
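We can verify the conversion by checking the column's dtype. A short sketch with a toy "Price" column of whole numbers, as they appear in the raw CSV:

```python
import pandas as pd

# The raw CSV stores prices as whole numbers; a toy column to illustrate
df = pd.DataFrame({"Price": [8, 13, 0]})
df["Price"] = df["Price"].astype(float)

print(df["Price"].dtype)  # float64
```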
After performing these cleaning operations, our DataFrame should be ready for analysis.
Once we've cleaned the data, we can run an analysis on it. This could involve calculating summary statistics or creating visualizations.
Here are a few examples:
Using methods from our df DataFrame object, we can get a glimpse of which authors have the most books on the Amazon Best Sellers list.
This can be done by selecting the 'Author' column data and using the value_counts() method. We can assign this to an author_counts variable:
author_counts = df['Author'].value_counts()
print(author_counts)
Running this prints the following output:
Author
Jeff Kinney 12
Gary Chapman 11
Rick Riordan 11
Suzanne Collins 11
American Psychological Association 10
..
Keith Richards 1
Chris Cleave 1
Alice Schertle 1
Celeste Ng 1
Adam Gasiewski 1
Name: count, Length: 248, dtype: int64
We can also use the cleaned data to determine the average rating for each genre in the dataset. We can do this by grouping the data by the "Genre" column and calculating the mean of the "Rating" column.
avg_rating_by_genre = df.groupby("Genre")["Rating"].mean()
print(avg_rating_by_genre)
This code groups the data by the "Genre" column, and then calculates the mean of the "Rating" column for each genre. It then prints the resulting series of average ratings by genre.
Genre
Fiction 4.648333
Non Fiction 4.595161
Name: Rating, dtype: float64
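We can also answer one of our earlier questions, which books have 4.8+ user ratings, with a boolean mask. A minimal sketch using the renamed "Rating" column; the rows below are made up for illustration:

```python
import pandas as pd

# Toy frame using the renamed "Rating" column (these values are made up)
df = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "Rating": [4.9, 4.6, 4.8],
})

# A boolean mask keeps only the rows where Rating is 4.8 or higher
top_rated = df[df["Rating"] >= 4.8]
print(top_rated["Title"].tolist())  # ['Book A', 'Book C']
```

The same expression applied to the full DataFrame would list every 4.8+ bestseller.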
These are just a few examples of the types of analyses we can run on our data. The possibilities are endless, depending on the type of data we're working with and the questions we're trying to answer.
Finally, once we have performed our analysis and obtained some insights from the data, we may want to export the results to a file for further analysis or sharing with others. We can use the to_csv() function to export our cleaned and analyzed DataFrame to a CSV file.
# Export top selling authors to a CSV file
author_counts.head(10).to_csv("top_authors.csv")
# Export average rating by genre to a CSV file
avg_rating_by_genre.to_csv("avg_rating_by_genre.csv")
This will export the top 10 best selling authors and the average rating by genre to separate .csv files.
By default, these files will be saved in the same directory as our main.py file. We can specify a different directory by providing a file path instead of just a file name.
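For example, a path-based export might look like the sketch below. The counts and the "exports" folder name are made up for illustration; index_label names the index column in the output file:

```python
import pandas as pd
from pathlib import Path

# A small Series standing in for author_counts (these counts are made up)
counts = pd.Series([12, 11], index=["Jeff Kinney", "Gary Chapman"], name="count")

# Passing a path writes the file into that directory; the directory
# must exist first. "exports" is a hypothetical folder name.
Path("exports").mkdir(exist_ok=True)
counts.to_csv("exports/top_authors.csv", index_label="Author")
```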
Congratulations! We've made it to the end of the tutorial! 🎊
We were able to harness the power of Python libraries like pandas to analyze data from a CSV file. Specifically, we did the following:
View the full source for this project here.
Also, check out the following resources to learn more about data analysis with Python:
Need Help?