IPL Data Analysis & Visualization with Python
Using Numpy, Pandas, Matplotlib & Seaborn
It’s 2020 and almost nine months of it are behind us. The year has not been a good one so far, with a pandemic under way worldwide. In India, it has been almost seven months since the first COVID-19 positive case was found. The situation then grew worse, and sporting events and tournaments were cancelled, including India’s much-loved cricket league, the IPL (Indian Premier League). But now it is happening, and it is taking place in the UAE. It has been five days since IPL 2020 kicked off and I can feel the hype everywhere. I have also been learning data science recently, so let’s do some data analysis on the previous seasons of the IPL.
Data Analysis: Data analysis is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making.
For the data analysis I used the IPL dataset available on Kaggle. The Python libraries I used are NumPy, pandas, Matplotlib and seaborn.
The main steps involved in any data analysis start with reading a dataset.
Reading The data
The dataset I am using has information on IPL matches from 2008 through 2019. It contains columns such as ‘team1’, ‘team2’, ‘toss_winner’, ‘toss_decision’, ‘winner’, ‘Season’, etc.
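Here is a minimal sketch of the loading step. The file name `matches.csv` is an assumption (the Kaggle file may be named differently), and a tiny made-up sample stands in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# Tiny stand-in for the Kaggle CSV; the rows are made up for illustration.
sample_csv = io.StringIO(
    "id,Season,city,team1,team2,toss_winner,toss_decision,result,winner\n"
    "1,2008,Bangalore,Team A,Team B,Team A,field,normal,Team B\n"
    "2,2008,Mumbai,Team C,Team D,Team D,bat,normal,Team D\n"
)

# With the real dataset this would be something like pd.read_csv("matches.csv").
ipl_df = pd.read_csv(sample_csv)

print(ipl_df.shape)             # (rows, columns)
print(ipl_df.columns.tolist())  # column names
```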
Now that we have loaded the data into a DataFrame, let’s take an overview.
It seems we have many columns. Let’s drop the umpire columns, as we won’t be using them in this analysis. We can drop columns using .drop().
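Dropping the columns could look roughly like this. The column names `umpire1`, `umpire2` and `umpire3` are assumed to match the Kaggle file, and the rows are made up.

```python
import pandas as pd

# Made-up two-row frame; only the column names matter here.
ipl_df = pd.DataFrame({
    "id": [1, 2],
    "winner": ["Team A", "Team B"],
    "umpire1": ["U1", "U2"],
    "umpire2": ["U3", "U4"],
    "umpire3": [None, None],
})

# Drop the umpire columns; columns= avoids having to pass axis=1.
ipl_df = ipl_df.drop(columns=["umpire1", "umpire2", "umpire3"])

print(ipl_df.columns.tolist())
```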
Let’s see how many teams we have in our dataset.
As we can see, the Pune team appears under three names: two are real names, while the third is a spelling error, so let’s clean it up: “Rising Pune Supergiants” has an extra ‘s’. For convenience we will also merge Pune Warriors into Rising Pune Supergiant (RPS), as both represent Pune.
Similarly, Bangalore appears under two names, the old one and the new one, ‘Bengaluru’.
The third cleanup concerns Delhi Capitals. We will replace Delhi Daredevils with Delhi Capitals in every column where these names appear.
We will do this using the .replace() method.
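The Pune and Delhi fixes described above can be sketched with a single mapping passed to `.replace()`. The rows are made up, and applying the mapping to the whole frame (rather than column by column) is just one way to do it.

```python
import pandas as pd

# Made-up rows that contain the name variants discussed above.
ipl_df = pd.DataFrame({
    "team1": ["Rising Pune Supergiants", "Pune Warriors", "Delhi Daredevils"],
    "team2": ["Delhi Daredevils", "Rising Pune Supergiant", "Pune Warriors"],
    "winner": ["Rising Pune Supergiants", "Pune Warriors", "Delhi Daredevils"],
})

# old name -> new name; values are matched exactly, not as substrings.
name_fixes = {
    "Rising Pune Supergiants": "Rising Pune Supergiant",
    "Pune Warriors": "Rising Pune Supergiant",
    "Delhi Daredevils": "Delhi Capitals",
}

# Applies the mapping to every column that contains these values.
ipl_df = ipl_df.replace(name_fixes)

print(ipl_df)
```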
Now let’s check for missing values.
We have 15 missing values. Let’s see where they are.
Now let’s replace the missing city values with Dubai.
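A rough sketch of the missing-value check and the fill. It assumes the gaps being filled are in a `city` column; the rows are made up.

```python
import pandas as pd

# Made-up rows with a few gaps (None becomes a missing value).
ipl_df = pd.DataFrame({
    "id": [1, 2, 3],
    "city": ["Mumbai", None, None],
    "winner": ["Team A", "Team B", None],
})

# Number of missing values in each column.
missing_per_column = ipl_df.isnull().sum()
print(missing_per_column)

# Rows where the city is missing.
print(ipl_df[ipl_df["city"].isnull()])

# Fill the missing cities and assign the result back to the column.
ipl_df["city"] = ipl_df["city"].fillna("Dubai")
```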
We also have some NaN values in the winner and player-of-the-match columns, but these are because the match did not have a normal result; it may have been cancelled due to rain or something similar.
Now that we have finished cleaning the data, let’s move on with the analysis.
Let’s see how many matches were played from 2008 to 2019.
Let’s see how many of them had a normal result.
We can see that out of 756 matches played in 12 seasons, 743 had a normal result.
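Counts like these can be obtained along the following lines. The `result` column and its `"normal"` value are assumptions about the dataset, and the rows are made up.

```python
import pandas as pd

# Made-up rows; "result" is assumed to mark how each match ended.
ipl_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "result": ["normal", "normal", "tie", "no result"],
})

total_matches = len(ipl_df)                             # one row per match
normal_matches = (ipl_df["result"] == "normal").sum()   # True counts as 1

print(total_matches, normal_matches)
```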
Now let’s see which cities have hosted IPL matches, using the .unique() method.
Now let’s see which city hosted the most matches.
Let’s visualize the results.
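As a sketch of these three steps, `.unique()` lists the cities, `.value_counts()` ranks them, and a horizontal bar chart shows the ranking. The rows are made up, and plain Matplotlib is used here; the chart type is my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows, one per match.
ipl_df = pd.DataFrame(
    {"city": ["Mumbai", "Mumbai", "Mumbai", "Kolkata", "Kolkata", "Delhi"]}
)

cities = ipl_df["city"].unique()              # distinct host cities
city_counts = ipl_df["city"].value_counts()   # sorted, most matches first

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(city_counts.index.tolist(), city_counts.values)
ax.invert_yaxis()                             # biggest bar at the top
ax.set_xlabel("Matches hosted")
ax.set_title("Matches hosted per city")
fig.tight_layout()
```

In a notebook, `plt.show()` (or just ending the cell with the figure) would display the chart.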
Since the beginning of the IPL there has been a fan battle between CSK and MI. Let’s see:
Which team won the most matches from 2008 to 2019?
Plotting these values:
Now let’s explore more.
How many matches were played in each season?
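One way to count matches per season is a simple group-by on the `Season` column. The rows below are made up.

```python
import pandas as pd

# Made-up rows, one per match.
ipl_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Season": [2008, 2008, 2009, 2009, 2009],
})

# Count match ids within each season.
matches_per_season = ipl_df.groupby("Season")["id"].count()

print(matches_per_season)
```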
Let’s plot this for a clearer insight.
Let’s now dig for more interesting insights.
Which decision is preferred most on winning the toss (bat/field)?
Let’s explore.
This observation tells us that almost 60% of the time the toss winner chooses to field first.
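A percentage like this can be computed with `value_counts(normalize=True)`. The sample rows are made up, so the numbers they produce are only illustrative.

```python
import pandas as pd

# Made-up rows: 3 "field" decisions and 2 "bat" decisions.
ipl_df = pd.DataFrame(
    {"toss_decision": ["field", "bat", "field", "field", "bat"]}
)

# normalize=True returns fractions instead of raw counts.
toss_share = ipl_df["toss_decision"].value_counts(normalize=True) * 100

print(toss_share.round(1))
```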
Now let’s see how well the preferred decision works out.
For this we will create two separate DataFrames, one for bat and the other for field, and then merge them into a single DataFrame.
field_df = ipl_df.loc[(ipl_df['toss_winner'] == ipl_df['winner']) & (ipl_df['toss_decision'] == 'field'), ['id', 'winner', 'toss_decision']]
field_df.winner.count()
#prints 259
bat_df = ipl_df.loc[(ipl_df['toss_winner'] == ipl_df['winner']) & (ipl_df['toss_decision'] == 'bat'), ['id', 'winner', 'toss_decision']]
bat_df.winner.count()
#prints 134
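Among the matches where the toss winner also won the game, 259 of 393 (about 66%) came after choosing to field, so chasing seems to pay off more often than batting first.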
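Raw counts do not account for how often each decision is chosen, so another way to look at it (my own variation, not the approach used above) is the share of matches the toss winner goes on to win for each decision. The rows are made up.

```python
import pandas as pd

# Made-up rows, one per match.
ipl_df = pd.DataFrame({
    "toss_winner":   ["A", "B", "A", "C", "B", "C"],
    "toss_decision": ["field", "field", "field", "bat", "bat", "field"],
    "winner":        ["A", "C", "A", "C", "A", "C"],
})

# True where the team that won the toss also won the match.
toss_winner_won = ipl_df["toss_winner"] == ipl_df["winner"]

# Mean of a boolean series is the fraction of True values per group.
win_rate = toss_winner_won.groupby(ipl_df["toss_decision"]).mean() * 100

print(win_rate)
```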
Now let’s see:
Which stadium has hosted the most matches?
It seems the IPL has hosted matches across 41 venues over 12 seasons.
Let’s see which one has witnessed the most matches.
venue_df = ipl_df.groupby('venue')[['id']].count()
venue_df = venue_df.sort_values('id',ascending=False).reset_index()
venue_df.rename(columns={'id':'Total','venue':'Stadium'},inplace=True)
Talking about maximums, let’s see which player has received the most “Man of the Match” titles.
Let’s first check how many players have been awarded the Man of the Match title.
We can see that a lot of players have been awarded. Let’s sort and pick the top ones.
player_df = ipl_df.groupby('player_of_match')[['id']].count()
player_df = player_df.sort_values('id', ascending=False).reset_index()
Let’s plot the data using a seaborn plot.
Now let’s see:
Which team has held the winner’s title the most?
To do so, we will create a copy of the original DataFrame using .copy(), so that the changes we make won’t affect the main DataFrame.
final_df = ipl_df.groupby('Season').tail(1).copy()
# Now let's sort the data according to seasons
final_df = final_df.sort_values('Season')
final_df
Now let’s see which teams have won the title.
We can see MI has won the most seasons, followed by CSK and KKR.
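Counting titles per team can be sketched as below. It relies on the assumption that the last row of each season is that season’s final, which only holds if the matches are stored in playing order. The rows and team names are made up.

```python
import pandas as pd

# Made-up rows, listed in playing order within each season.
ipl_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "Season": [2009, 2009, 2008, 2008, 2010, 2010],
    "winner": ["Team A", "Team B", "Team C", "Team A", "Team B", "Team A"],
})

# Last match of each season, treated here as the final.
final_df = ipl_df.groupby("Season").tail(1).copy()
final_df = final_df.sort_values("Season")

# Number of titles per team, most titles first.
title_counts = final_df["winner"].value_counts()

print(title_counts)
```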
Now let’s plot this data using a seaborn count plot.
So with that, we’ve come to the end of this analysis of IPL data. From the analysis it’s clear that MI and CSK are the two dominant teams in the IPL. Apart from this, you might have noticed how a few lines of code can give us great insights from the data. Other takeaways are the ease of Python: its syntax as well as its code readability. It’s a powerful language with a wide variety of libraries for almost everything; some examples are TensorFlow and scikit-learn for ML, NumPy for numerical computing, Matplotlib and seaborn for visualization, and many more. I hope you find this blog and its insights interesting and that it helps you get an idea of what data analysis is. If you are interested in data science, join our data science community at Jovian.ml.
Suggestions are always welcome.