IPL Data Analysis & Visualization with Python

Pratik Jadhav
5 min read · Sep 24, 2020

Using NumPy, Pandas, Matplotlib & Seaborn

It’s 2020, and almost nine months of it have gone by. The year has not been good so far, with a pandemic under way worldwide. In India, it has been almost seven months since the first COVID-19 positive case was found. The situation later grew worse, and sports events and tournaments were cancelled, including India’s much-loved cricket league, the IPL (Indian Premier League). But now it is happening, and it is taking place in the UAE. It has already been five days since IPL 2020 kicked off, and I can feel the hype everywhere. I have also been learning data science recently, so let’s do some data analysis on the previous seasons of the IPL.

Data Analysis: Data analysis is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making.

For this analysis I used the IPL dataset available on Kaggle. The Python libraries I used are NumPy, pandas, Matplotlib and Seaborn.

Any data analysis starts with reading a dataset.

Reading the Data

The dataset I am using has information on IPL matches from 2008 through 2019. It contains columns such as ‘team1’, ‘team2’, ‘toss_winner’, ‘toss_decision’, ‘winner’ and ‘Season’.
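
A minimal sketch of the loading step, assuming the Kaggle download is a CSV file. The three rows below are made up for illustration and are not taken from the real dataset; with the actual file you would pass its path (for example a hypothetical "matches.csv") to pd.read_csv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Made-up sample standing in for the Kaggle CSV (hypothetical rows).
sample_csv = io.StringIO(
    "id,Season,city,team1,team2,toss_winner,toss_decision,winner\n"
    "1,2008,Mumbai,Mumbai Indians,Chennai Super Kings,Mumbai Indians,field,Mumbai Indians\n"
    "2,2008,Kolkata,Kolkata Knight Riders,Delhi Daredevils,Delhi Daredevils,bat,Kolkata Knight Riders\n"
    "3,2009,Chennai,Chennai Super Kings,Mumbai Indians,Chennai Super Kings,field,Chennai Super Kings\n"
)

# read_csv accepts a file path or any file-like object.
ipl_df = pd.read_csv(sample_csv)

print(ipl_df.head())
```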

Now that we have loaded the data into a DataFrame, let’s take an overview.

We have 756 rows and 18 columns.

That is a lot of columns. Let’s drop the umpire columns, as we won’t be using them in this analysis. We can drop columns using .drop().
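
Here is a rough sketch of the overview and the drop on a toy DataFrame. The umpire column names (umpire1, umpire2, umpire3) are an assumption about how the dataset labels them, so check ipl_df.columns on the real data first.

```python
import pandas as pd

# Toy frame; the umpire column names are assumed, not confirmed by the post.
ipl_df = pd.DataFrame({
    "id": [1, 2],
    "winner": ["Mumbai Indians", "Chennai Super Kings"],
    "umpire1": ["Umpire A", "Umpire B"],
    "umpire2": ["Umpire C", "Umpire D"],
    "umpire3": [None, None],
})

# (rows, columns) gives a quick feel for the size of the data.
print(ipl_df.shape)

# drop() returns a new DataFrame, so assign the result back.
ipl_df = ipl_df.drop(columns=["umpire1", "umpire2", "umpire3"])

print(ipl_df.columns.tolist())
```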

Let’s see how many teams we have in our dataset.

As we can see, Pune’s team has three names: two are real names, while the third is a naming error, so let’s clean it up: “Rising Pune Supergiants” has an extra ‘s’. For convenience, we will also merge Pune Warriors with RPS (Rising Pune Supergiant), as they both represent Pune.

Similarly, Bangalore has two names: the old one and the new one, ‘Bengaluru’.

The third cleanup concerns Delhi Capitals. We will replace Delhi Daredevils with Delhi Capitals in every column where the name appears.

We will do this using the .replace() method.
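
The three cleanups can be sketched with a single dictionary passed to .replace(). This runs on a toy frame, and it assumes the Bangalore/Bengaluru duplicate lives in the city column. A plain dictionary only replaces cells that match a key exactly, so a team name like “Royal Challengers Bangalore” is left alone.

```python
import pandas as pd

# Toy frame with the naming problems described above (made-up rows).
ipl_df = pd.DataFrame({
    "city": ["Pune", "Bangalore", "Delhi"],
    "team1": ["Rising Pune Supergiants", "Royal Challengers Bangalore", "Delhi Daredevils"],
    "team2": ["Pune Warriors", "Mumbai Indians", "Chennai Super Kings"],
    "winner": ["Rising Pune Supergiants", "Mumbai Indians", "Delhi Daredevils"],
})

# old value -> new value; applied to every column of the frame.
name_fixes = {
    "Rising Pune Supergiants": "Rising Pune Supergiant",
    "Pune Warriors": "Rising Pune Supergiant",
    "Bangalore": "Bengaluru",
    "Delhi Daredevils": "Delhi Capitals",
}
ipl_df = ipl_df.replace(name_fixes)

print(ipl_df)
```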

Now let’s check for missing values.

We have 15 missing values. Let’s see where they are.

Now let’s replace the ones we can with Dubai.

We still have some NaN values in the winner and player of the match columns, but these are there because the match did not have a normal result. Maybe it was cancelled due to rain or something similar.
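
A small sketch of this step, assuming the values filled with Dubai are missing entries in the city column (the post does not name the column, so treat that as a guess). The rows and the venue name are made up.

```python
import pandas as pd

# Toy frame: two matches without a city, one without a winner.
ipl_df = pd.DataFrame({
    "city": ["Mumbai", None, None],
    "venue": ["Stadium A", "Stadium in Dubai", "Stadium in Dubai"],
    "winner": ["Mumbai Indians", "Chennai Super Kings", None],
})

# Count missing values per column.
missing_per_column = ipl_df.isna().sum()
print(missing_per_column)

# Look at the rows where the city is missing before deciding how to fill them.
print(ipl_df[ipl_df["city"].isna()])

# Fill only the city column; the missing winner is left as NaN on purpose.
ipl_df["city"] = ipl_df["city"].fillna("Dubai")
```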

With the cleaning done, let’s move on to the analysis.

Let’s see how many matches were played from 2008 to 2019.

Let’s see how many of them had a normal result.

We can see that out of 756 matches played over 12 seasons, 743 were played to a normal result.
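
One way these two counts might be computed is shown below. The result column and its values ("normal", "tie", "no result") are assumptions about the dataset, so adjust them to whatever your copy contains.

```python
import pandas as pd

# Toy frame; the 'result' column and its labels are assumed.
ipl_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "result": ["normal", "normal", "tie", "no result"],
})

# Every row is one match.
total_matches = ipl_df["id"].count()

# A boolean mask sums to the number of True values.
normal_matches = (ipl_df["result"] == "normal").sum()

print(total_matches, normal_matches)
```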

Now let’s see which cities have hosted IPL matches, using the .unique() method.

Now let’s see which city hosted the most matches.
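
Both steps, listing the host cities and counting matches per city, could look roughly like this on toy data. Using value_counts() for the ranking is my choice here; the post only names .unique().

```python
import pandas as pd

# Toy frame with made-up host cities.
ipl_df = pd.DataFrame({
    "city": ["Mumbai", "Bengaluru", "Mumbai", "Kolkata", "Mumbai", "Bengaluru"],
})

# Distinct cities, in order of first appearance.
cities = ipl_df["city"].unique()
print(cities)

# Matches per city, sorted from most to fewest.
city_counts = ipl_df["city"].value_counts()
print(city_counts)

# The city with the highest count.
top_city = city_counts.idxmax()
```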

Let’s visualize the results.

We can see that Mumbai hosted the most matches (101), followed by Bengaluru, which hosted 80.

Since the beginning of the IPL there has been a fan battle between CSK and MI. Let’s see:

Which team has won the most matches from 2008 to 2019?

Plotting these values:

Although there is not much difference, we can clearly see that Mumbai Indians have won the most matches, with Chennai in second place.
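
The original plotting code is not shown here, so the following is only a sketch of how wins per team could be counted and drawn as a bar chart with plain Matplotlib. The data is made up, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with made-up match winners.
ipl_df = pd.DataFrame({
    "winner": [
        "Mumbai Indians", "Chennai Super Kings", "Mumbai Indians",
        "Kolkata Knight Riders", "Chennai Super Kings", "Mumbai Indians",
    ],
})

# Wins per team, sorted from most to fewest.
wins = ipl_df["winner"].value_counts()

# One bar per team.
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(wins.index, wins.values)
ax.set_xlabel("Team")
ax.set_ylabel("Matches won")
ax.set_title("Matches won per team (toy data)")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
# In a notebook, plt.show() would display the chart here.
```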

Now let’s explore more.

Total matches hosted in each season?
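
A minimal sketch of the per-season count, using the ‘Season’ and ‘id’ column names that appear elsewhere in this post. The rows are made up.

```python
import pandas as pd

# Toy frame: two matches in 2008, three in 2009.
ipl_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Season": [2008, 2008, 2009, 2009, 2009],
})

# Group the rows by season and count the match ids in each group.
matches_per_season = ipl_df.groupby("Season")["id"].count()

print(matches_per_season)
```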

Let’s plot this for a clearer picture.

Let’s now dig out some more interesting insights.

Which decision is preferred most on winning the toss (bat/field)?

Let’s explore.

This observation tells us that almost 60% of the time the toss winner chooses to field first.
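
A percentage like this can be obtained with value_counts(normalize=True). Here is a sketch on made-up toss decisions; the real numbers come from the full dataset.

```python
import pandas as pd

# Toy frame: three 'field' decisions and two 'bat' decisions.
ipl_df = pd.DataFrame({
    "toss_decision": ["field", "field", "bat", "field", "bat"],
})

# normalize=True returns fractions; multiply by 100 for percentages.
decision_share = ipl_df["toss_decision"].value_counts(normalize=True) * 100

print(decision_share.round(1))
```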

Now let’s see how well the preferred decision pays off.

For this we will create two separate DataFrames, one for bat and the other for field, and then merge them into a single DataFrame.

field_df = ipl_df.loc[(ipl_df['toss_winner'] == ipl_df['winner']) & (ipl_df['toss_decision'] == 'field'), ['id', 'winner', 'toss_decision']]
field_df.winner.count()
#prints 259
bat_df = ipl_df.loc[(ipl_df['toss_winner'] == ipl_df['winner']) & (ipl_df['toss_decision'] == 'bat'), ['id', 'winner', 'toss_decision']]
bat_df.winner.count()
#prints 134

So choosing to field paid off almost 55% of the time, while choosing to bat succeeded only about 45% of the time (each measured against the number of matches in which that decision was made).
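
The combining step is not shown above, so here is one possible sketch on toy data. It stacks the two filtered frames with pd.concat (one way to do the “merge”) and divides the wins by the number of times each decision was made. Team names A, B and C are placeholders.

```python
import pandas as pd

# Toy frame with placeholder teams.
ipl_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "toss_winner": ["A", "B", "A", "C", "B", "C"],
    "winner": ["A", "A", "A", "C", "C", "B"],
    "toss_decision": ["field", "field", "bat", "field", "bat", "bat"],
})

# Matches where the toss winner also won the match.
won_both = ipl_df["toss_winner"] == ipl_df["winner"]
cols = ["id", "winner", "toss_decision"]

field_df = ipl_df.loc[won_both & (ipl_df["toss_decision"] == "field"), cols]
bat_df = ipl_df.loc[won_both & (ipl_df["toss_decision"] == "bat"), cols]

# Stack the two frames into one.
combined_df = pd.concat([field_df, bat_df], ignore_index=True)

# Wins per decision divided by how often that decision was made.
wins_by_decision = combined_df["toss_decision"].value_counts()
times_chosen = ipl_df["toss_decision"].value_counts()
success_rate = (wins_by_decision / times_chosen * 100).round(1)

print(success_rate)
```

Dividing by times_chosen rather than by the total number of toss-and-match wins matters: the two denominators answer different questions and give different percentages.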

It seems that chasing a score has resulted in a win almost 60% of the time.

Now let’s see:

Which stadium has hosted the most matches?

It seems the IPL has hosted matches at 41 venues over 12 seasons.

Let’s see which one has witnessed the most matches.

# Count matches per venue, sort from most to fewest, then rename the columns
venue_df = ipl_df.groupby('venue')[['id']].count()
venue_df = venue_df.sort_values('id', ascending=False).reset_index()
venue_df.rename(columns={'id': 'Total', 'venue': 'Stadium'}, inplace=True)

Although there is not much difference, from the pie chart above we can see that Eden Gardens has hosted the most matches, followed by Wankhede Stadium.

Speaking of maximums, let’s see which player has received the most “Man of the Match” titles.

Let’s first check how many players have been awarded the Man of the Match title.

We can see that a lot of players have been awarded. Let’s sort them and pick the top ones.

player_df = ipl_df.groupby('player_of_match')[['id']].count()
player_df = player_df.sort_values('id', ascending=False).reset_index()

Let’s plot the data using Seaborn.

From the plot above we can see that Chris Gayle has been Man of the Match the most times.

Now let’s see:

Which team has held the winner’s title the most?

To do so, we will take the last match recorded for each season and make a copy using .copy(), so that the changes we make won’t affect the main DataFrame.

final_df = ipl_df.groupby('Season').tail(1).copy()
# Now let's sort the data by season
final_df = final_df.sort_values('Season')
final_df

Now let’s see which teams have won the title.

We can see that MI has won the most seasons, followed by CSK and KKR.
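
As a sketch of how the title counts could be derived on toy data: take the last row of each season, then count the winners. This assumes the rows are in chronological order within each season, so that the last row is the final. Teams A, B and C are placeholders.

```python
import pandas as pd

# Toy frame: two matches per season, the second one being the final.
ipl_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "Season": [2008, 2008, 2009, 2009, 2010, 2010],
    "winner": ["A", "B", "A", "C", "C", "B"],
})

# Last row of every season (assumed to be the final), copied so that
# later edits do not touch ipl_df.
final_df = ipl_df.groupby("Season").tail(1).copy()
final_df = final_df.sort_values("Season")

# Number of titles per team.
titles = final_df["winner"].value_counts()

print(titles)
```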

Now let’s plot this data using a Seaborn count plot.

So with that, we’ve come to the end of this analysis of IPL data. From the analysis it’s clear that MI and CSK are the two dominant teams in the IPL. Apart from this, you might have noticed how a few lines of code can give us great insights from the data. Other takeaways are the ease of Python: its syntax as well as its code readability. It’s a powerful language with a wide variety of libraries for almost everything; some examples are TensorFlow and scikit-learn for ML, NumPy for numerical computing, and Matplotlib and Seaborn for visualization, among many more. I hope you find this blog and its insights interesting, and that it helps you get an idea of what data analysis is. If you are interested in data science, join our data science community at Jovian.ml.

Suggestions are always welcome.
