Introduction to Data Science

Pratik Jadhav
9 min read · Apr 19, 2021

You have probably heard data science described as a hot topic in industry, and you may be wondering: what is data science? Start reading about it and you soon run into “big data”, which raises a second question: what is big data? Both are hot topics, but both are often misunderstood. They already matter a great deal, and they are becoming more critical by the day. So let’s clear up some misconceptions and build a brief picture of data science and big data.

Image by Meme Maker

At the heart of data science and big data is data itself. So what exactly is data? In computing, data is information that has been translated into a form that is efficient to move or process. In simple words, we can safely treat whatever we know as data. Your birthdate is a form of data, and so are your name, address, and mobile number. At this point, you should have a clear idea of what data is.

Let’s now explore the types of data. There are numerous kinds of data, but in the end they all boil down to three categories: structured data, unstructured data, and semi-structured data.

Structured Data: As the name suggests, this is data that is highly organized and neatly formatted. In other words, it has rows and columns (it is tabular), and those rows and columns are related to each other. It works easily with most standard analytical models, and it typically requires less storage space than unstructured data. Some examples of structured data are Excel files, Google Sheets, and tables in traditional database management systems (DBMS).

Unstructured Data: Unstructured data is data that is not organized in any predefined manner. It can contain text, numbers, dates, or BLOBs (binary large objects). The irregularity and disorganization of unstructured data make it difficult to handle and understand. Some examples include free text, social media comments, documents, phone call transcriptions, log files such as server logs and sensor logs, images, audio, and video. One interesting fact about unstructured data: by commonly cited estimates, roughly 80% of the world's data is unstructured.

Semi-Structured Data: Semi-structured data is a hybrid of structured and unstructured data. It has some organizational framework but lacks the complete structure required to fit into a relational database. Semi-structured data is self-describing: it contains tags or attributes that separate the various entities within the data. Examples of semi-structured data include XML, JSON, emails, NoSQL databases, event tracking data, and web pages.
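To make the difference between semi-structured and structured data more concrete, here is a minimal sketch in Python. The records, field names, and column choices are made up purely for illustration. It parses a small piece of JSON, where each record carries its own keys, and flattens it into rows that all share the same columns.

```python
import json

# Hypothetical semi-structured records: the keys describe themselves,
# but not every record has the same fields.
raw = """
[
  {"name": "Asha", "birthdate": "1995-04-12", "phones": ["555-0101"]},
  {"name": "Ravi", "address": {"city": "Pune"}}
]
"""

records = json.loads(raw)

# Flatten into a structured (tabular) form: every row has the same columns.
columns = ["name", "birthdate", "city", "phone_count"]
rows = [
    {
        "name": r.get("name"),
        "birthdate": r.get("birthdate"),            # None when the field is absent
        "city": r.get("address", {}).get("city"),   # nested field pulled up into a column
        "phone_count": len(r.get("phones", [])),
    }
    for r in records
]

for row in rows:
    print([row[c] for c in columns])
```

Once every row has the same columns, the data can be stored in a table or a spreadsheet, which is what makes it structured.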

Now that we know what data is and what types it comes in, let’s explore data science.

Formally, data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificial intelligence (AI) systems that perform tasks which ordinarily require human intelligence. In turn, these systems generate insights that analysts and business users can translate into tangible business value.

In the most basic terms, it can be defined as obtaining insights and information, really anything of value, out of data. As with any new field, it’s tempting but counterproductive to put concrete bounds on its definition: this is data science, this is not. In reality, data science is evolving so fast and has already shown such an enormous range of possibilities that a wider definition is essential to understanding it.

And while it’s hard to pin down a specific definition, it’s quite easy to see and feel its impact. Data science, when applied to different fields, can lead to incredible new insights. So let’s now see what steps are involved in data science, beginning with the five stages of the data science life cycle.

The five stages of the data science life cycle are:

  • Capture: data acquisition, data entry, signal reception, data extraction.
  • Maintain: data warehousing, data cleansing, data staging, data processing, data architecture.
  • Process: data mining, clustering/classification, data modeling, data summarization.
  • Analyze: exploratory/confirmatory analysis, predictive analysis, regression, text mining, qualitative analysis.
  • Communicate: data reporting, data visualization, business intelligence, decision making.

Now let’s talk about the fundamental steps required in any data science process:

Step 1: Obtain Data

The very first step of a data science project is straightforward: we obtain the data we need from the available data sources. A data source can be a database, which you query to retrieve the data, or it can be a file in a format such as CSV or JSON, which dedicated packages in Python and R can read. The most basic and widely used way to get started is to download a dataset from Kaggle and use it for your project.
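As a rough sketch of what obtaining data can look like, the snippet below reads a small CSV and a small JSON document using only Python's standard library. The data is made up, and the in-memory strings stand in for files you would normally open from disk. Libraries such as pandas offer higher-level readers, but they are not needed for this illustration.

```python
import csv
import io
import json

# Stand-ins for files on disk; in a real project you would open actual files.
csv_text = "name,age,city\nAsha,29,Pune\nRavi,34,Mumbai\n"
json_text = '[{"name": "Meera", "age": 41, "city": "Nashik"}]'

# csv.DictReader yields one dict per row, keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# json.loads turns JSON text into Python lists and dicts.
json_rows = json.loads(json_text)

# CSV values always arrive as strings, so convert numeric columns explicitly.
for row in csv_rows:
    row["age"] = int(row["age"])

dataset = csv_rows + json_rows
print(len(dataset), "records loaded")
```

Note that the CSV reader returns every value as text, so numeric columns need an explicit conversion before any analysis.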

Now that you have the data, the very next step is cleaning it.

Step 2: Cleaning Data (Pre-processing)

After obtaining the data, the next thing to do is clean it. This is where we “clean” and filter the data. Remember the “garbage in, garbage out” philosophy: if the data is unfiltered and irrelevant, the results of the analysis will not mean anything. Cleaning is always necessary, because the data may contain values that you do not need or that are irrelevant to your requirement. In that case, you may need to drop some columns and delete some rows. Various packages in Python and R let you use data frames to clean and prepare data for further use.
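The cleaning step described above can be sketched in a few lines of plain Python. The records and the `clean` helper below are hypothetical and exist only for illustration; in practice a data frame library would typically do this work. The sketch drops an irrelevant column, rows with missing values, and duplicate rows.

```python
# Hypothetical raw records with a duplicate, a missing value, and an irrelevant column.
raw_rows = [
    {"name": "Asha", "age": "29", "internal_id": "x1"},
    {"name": "Ravi", "age": "", "internal_id": "x2"},      # missing age
    {"name": "Asha", "age": "29", "internal_id": "x3"},    # same person again
    {"name": "Meera", "age": "41", "internal_id": "x4"},
]

def clean(rows, keep_columns):
    """Keep only the wanted columns, then drop incomplete and duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        trimmed = {col: row[col].strip() for col in keep_columns}
        if any(value == "" for value in trimmed.values()):
            continue  # drop rows with missing data
        key = tuple(trimmed[col] for col in keep_columns)
        if key in seen:
            continue  # drop rows that duplicate an earlier one
        seen.add(key)
        cleaned.append(trimmed)
    return cleaned

clean_rows = clean(raw_rows, keep_columns=["name", "age"])
print(clean_rows)
```

Whether to drop a row with a missing value or fill the gap instead depends on the project; dropping is simply the easiest choice to show here.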

Once you have the data in place and have cleaned it thoroughly, the next step, and very much everyone’s favorite, is exploring the data visually.

Step 3: Explore Data

Once your data is ready to be used, and right before you jump into AI and machine learning, you have to examine the data. First, we need to inspect it, because different data types (numerical, categorical, ordinal, nominal, and so on) require different treatment. Next, we compute descriptive statistics to extract features and test for significant variables, which is often done with correlation. For example, we might explore how the risk of someone getting high blood pressure relates to their height and weight.
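Here is a small sketch of that exploration step, computing a few descriptive statistics and a Pearson correlation coefficient with Python's standard library. The weight and blood pressure numbers are invented for illustration and are not real medical data.

```python
import math
import statistics

# Made-up measurements for illustration only; not real medical data.
weight_kg = [58, 64, 71, 80, 92, 101]
systolic_bp = [112, 118, 121, 130, 138, 145]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print("mean weight:", round(statistics.mean(weight_kg), 2))
print("median weight:", statistics.median(weight_kg))
print("std dev of weight:", round(statistics.stdev(weight_kg), 2))
print("correlation with blood pressure:", round(pearson(weight_kg, systolic_bp), 3))
```

A coefficient close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. Correlation alone does not show that one variable causes the other.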

The term “feature”, as used in machine learning and modelling, refers to an attribute of the data that helps characterize it. For example, “Name”, “Age”, and “Gender” are typical features of a members or employees dataset. After getting these insights, we can use plots such as line plots, bar plots, and pie charts to visualize the data and gain further insight.

So once you have obtained, cleaned, and explored the data, the next step is to model it, that is, to use it in your machine learning models.

Step 4: Exploring the World of Machine Learning

Before moving further, let’s explore what machine learning is and how big an impact it has on our day-to-day life.

Formally, machine learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies you will come across. As the name makes evident, it gives the computer something that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.

In the words of data scientists, machine learning is the process of deploying machines to understand a system or an underlying process and to make changes that improve it. An algorithm, in turn, is a set of instructions that tells the computer system how to carry out a particular task.

Here are the three types of machine learning methods you need to know about:

Supervised Learning: The model learns from past examples whose outcomes are already known (labeled data). Supervised learning helps in predicting an outcome based on historical patterns.

Example: Who are the unhappy customers?

  • Another great example of supervised learning is text classification problems. In this set of problems, the goal is to predict the class label of a given piece of text.
  • One particularly popular task in text classification is predicting the sentiment of a piece of text, such as a tweet or a product review. This is widely used in the e-commerce industry to help companies identify negative comments made by customers.

Examples of supervised learning algorithms: Linear Regression, Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), etc.
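To illustrate supervised learning on the "unhappy customers" question, here is a minimal from-scratch sketch of a K-Nearest Neighbors classifier. The features (number of support tickets, months as a customer), the labels, and the `knn_predict` helper are all assumptions made for this example; a real project would normally use a library implementation and scale the features first.

```python
import math
from collections import Counter

# Made-up labeled history: (support_tickets, months_as_customer) -> label.
training_data = [
    ((9, 2), "unhappy"),
    ((7, 3), "unhappy"),
    ((8, 1), "unhappy"),
    ((1, 24), "happy"),
    ((0, 36), "happy"),
    ((2, 18), "happy"),
]

def knn_predict(features, data, k=3):
    """Predict a label by majority vote among the k nearest labeled examples."""
    by_distance = sorted(data, key=lambda item: math.dist(features, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Classify two new, unlabeled customers.
print(knn_predict((6, 4), training_data))
print(knn_predict((1, 30), training_data))
```

The key point is that the labels in `training_data` are known in advance; the algorithm only generalizes from them to new customers.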

Unsupervised Learning: In this approach, we do not have any target or outcome variable to predict or estimate. It is used for clustering a population into groups, for example segmenting customers into groups for specific interventions.

Examples of Unsupervised Learning:

  • Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never told which image shows which animal, so it has no prior idea of the categories in the dataset. Its task is to identify the image features on its own. An unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to the similarities between images.

Examples of unsupervised learning algorithms: Apriori algorithm, K-means.
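As a sketch of how clustering works without labels, here is a bare-bones K-means in plain Python. The points and the fixed starting centroids are assumptions chosen so the example is reproducible; real implementations usually pick the starting centroids randomly and stop once the assignments no longer change.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up 2-D points that form two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Assumption: start from two of the points instead of a random choice.
centroids, clusters = k_means(points, centroids=[points[0], points[3]])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are supplied anywhere; the groups emerge purely from how close the points are to each other.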

Reinforcement Learning: This is an interesting machine learning methodology in which, instead of learning from a fixed dataset, the system learns by interacting with an environment. In simple words, it is a method where the system learns from its mistakes and gets better over time.

In Reinforcement Learning (RL), agents are trained on a reward and punishment mechanism. The agent is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries to minimize wrong moves and maximize the right ones.

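Example of a reinforcement learning algorithm: Q-learning. Reinforcement learning problems are usually formalized as a Markov Decision Process (MDP), which is the framework these algorithms operate on rather than an algorithm itself.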
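The reward-and-punishment idea can be sketched with tabular Q-learning on a toy problem. Everything about the environment below is an assumption made for illustration: a five-cell corridor, a reward of 1.0 for reaching the last cell, a small penalty for every other move, and hand-picked values for the learning rate, discount factor, and exploration rate.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = [-1, +1]    # index 0 = move left, index 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate

def step(state, action):
    """Hypothetical environment: reward at the goal, small penalty for any other move."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.05
    return next_state, reward, done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly take the best known action, sometimes explore.
            if rng.random() < EPSILON:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: nudge the estimate toward reward + discounted best future value.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q

q_table = train()
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

After training, the learned policy should prefer moving right in every cell, because that is the path the rewards reinforce.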

Step 5: Evaluate & Deploy the Model

Once you have picked the right machine learning algorithm and trained a model with it, the next step is evaluation. You need to validate the model to check whether it produces the desired results for your business.

Techniques such as cross-validation, or the ROC (receiver operating characteristic) curve for classifiers, work well for estimating how the model will perform on new data. If the model appears to be producing satisfying results, you are good to go! Deploy the model and see your business making a difference like never before.
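A minimal sketch of k-fold cross-validation is shown below. The dataset, the `k_fold_indices` helper, and the toy nearest-neighbor model are all made up for illustration. The idea is to split the data into k parts, hold out each part in turn as the test set, and average the scores.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def nearest_neighbor_predict(x, train_x, train_y):
    """Toy model: copy the label of the closest training value."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

# Made-up 1-D dataset, interleaved so that each fold contains both classes.
xs = [1.0, 9.0, 1.5, 8.5, 2.0, 8.0, 2.5, 7.5, 3.0, 7.0]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

scores = []
for train_idx, test_idx in k_fold_indices(len(xs), k=5):
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    correct = sum(
        nearest_neighbor_predict(xs[i], train_x, train_y) == ys[i] for i in test_idx
    )
    scores.append(correct / len(test_idx))

print("accuracy per fold:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Because every sample is used for testing exactly once, the averaged score is usually a more trustworthy estimate than a single train/test split. In practice the data is typically shuffled before splitting.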

If you want hands-on experience building a machine learning model, try Teachable Machine: experience the fun of building machine learning models with no code.

To conclude, machine learning, data science, and big data are linked together as follows:

  • You can use machine learning algorithms to do data science.
  • You can use data science and machine learning techniques on big data.

Big data: The term for dealing with a huge amount of data.

Machine learning: The field concerned with implementing methods that allow a computer to learn automatically from data. This data is often big data, since there is a lot to discover in it.

Data science: The field that deals with analyzing data to extract knowledge. It includes many techniques, such as data visualization, probability, statistics, and machine learning.

I hope you enjoyed reading this. If so, do drop some claps below :)

For any queries, feel free to reach out on —

Written by Pratik Jadhav

Data Science Enthusiast with a Love for Tech Communities. Know more on https://www.pratikjadhav.me/
