Techniques in Python pandas Data frame for beginners in Data scientist field #Feed8

Getting started with pandas in python.

Photo by Tim Swaan on Unsplash

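Pandas is one of the most popular Python libraries for data analysis. It works closely with two other core Python libraries:

a) NumPy - pandas is built on top of it and uses it to perform mathematical operations

b) matplotlib - an optional dependency that pandas uses to visualize the data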

The pandas library is used throughout data exploration, from understanding what is inside the data to visualizing it.

Pandas supports two data structures:

a) Series

b) Data frame

Series is a one-dimensional labeled array holding any type of data.

Data frame is a two-dimensional structure with data arranged in the form of rows and columns.
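To make the difference concrete, here is a minimal sketch of both structures. The student names and marks are made up for illustration and are not taken from the Excel file used later in this post.

```python
import pandas as pd

# Hypothetical marks, made up for illustration only
maths = pd.Series([78, 91, 65], index=["Asha", "Ben", "Chen"], name="Maths")

# A data frame is a table; each column is a Series sharing the same index
marks = pd.DataFrame(
    {"Maths": [78, 91, 65], "Physics": [82, 74, 69]},
    index=["Asha", "Ben", "Chen"],
)

print(maths.ndim)   # number of dimensions of the Series
print(marks.ndim)   # number of dimensions of the data frame
print(marks.shape)  # (rows, columns)
```

Selecting a single column, such as marks["Maths"], gives back a Series.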

Let's start with some basic data frame analysis.

1. Importing pandas library

Before we start using pandas for data exploration, we need to import the pandas library.

import pandas as pd

2. Importing data from Excel

Since a lot of data is saved in Excel files, let's get started with Excel. I have created an Excel file with student names and their corresponding marks in different subjects. In total, the table has 9 students with marks in four different subjects.

Syntax:

pd.read_excel(r' path to the excel file ')

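Here the r prefix marks a raw string, so backslashes in the path are kept as literal characters instead of being read as escape sequences. Without it, a Windows path containing a sequence such as \U can raise a unicode escape error.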

Ex:

data=pd.read_excel(r'F:\Python\Student_Marks.xlsx')
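To see what the r prefix changes, the short sketch below writes the same path two ways and compares them. It does not open any file, and the variable names raw_path and escaped_path are only illustrative.

```python
# The same Windows path written two ways
raw_path = r'F:\Python\Student_Marks.xlsx'       # raw string: backslashes kept as-is
escaped_path = 'F:\\Python\\Student_Marks.xlsx'  # normal string: each backslash doubled

# Both spellings produce exactly the same text
print(raw_path == escaped_path)
```

Either form can be passed to pd.read_excel; the raw string is simply easier to type and read.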

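To check the data imported from Excel, print the variable:

print(data)

In a notebook such as Jupyter, typing the variable name on the last line of a cell also displays it:

data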

3. Creating data frames in python pandas

A data frame can be created from existing data, optionally with an index (row labels) and column names. Data frames are data structures made up of rows and columns.

Syntax:

pd.DataFrame(data, index, columns)

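Ex: Creating a data frame from the data imported from Excel

df = pd.DataFrame(data)

In this example, we created a data frame variable called df. Note that pd.read_excel already returns a data frame, so data could also be used directly; this call simply gives us a second variable, df, holding the same table. When the index is not provided, pandas by default uses numbers starting from 0 as the index. The columns will be the same as in the Excel data.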
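The syntax above also accepts plain Python data. The sketch below, which uses made-up names and marks rather than the Excel file, builds the same table twice: once with the default index and once with an explicit one.

```python
import pandas as pd

# Hypothetical marks (made up for illustration); one inner list per student
rows = [
    [78, 82, 69],
    [91, 74, 88],
    [65, 69, 71],
]

# No index given: pandas numbers the rows 0, 1, 2
default_df = pd.DataFrame(rows, columns=["Maths", "Physics", "Chemistry"])

# Explicit index: rows are labelled with the (hypothetical) student names
named_df = pd.DataFrame(
    rows,
    index=["Asha", "Ben", "Chen"],
    columns=["Maths", "Physics", "Chemistry"],
)

print(default_df)
print(named_df)
```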

4. Checking the first few rows of the data frame

Syntax:

data_frame_variable.head(number of rows to display from the top of the table)

Earlier we created a data_frame_variable called df. So let's use it.

a) By default the head() function will provide first 5 rows of the data frame

df.head()

b) To get the first 8 rows of the data frame, we need to pass the number of rows as an argument to the head function

df.head(8)
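Since the Excel file is not available here, the following sketch rebuilds a small stand-in table with 9 made-up students (the names S1 to S9 and the marks are assumptions) and applies both calls to it.

```python
import pandas as pd

# Hypothetical table of 9 students (names and marks are made up)
df = pd.DataFrame({
    "Name": ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"],
    "Maths": [78, 91, 65, 88, 54, 73, 96, 60, 81],
})

first_five = df.head()    # default: first 5 rows
first_eight = df.head(8)  # first 8 rows

print(first_five)
print(first_eight)
```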

5. Checking the last few rows of the data

Syntax:

data_frame_variable.tail(number of rows to display from the bottom of the table)

a) By default the tail() function will provide the last 5 rows of the data frame

df.tail()

b) To get only the last 3 rows of the data frame, we need to pass the number of rows as an argument to the tail function

df.tail(3)

Output:

df.tail(3) will show the last three rows of the table or data frame.
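One detail worth noticing is that tail() keeps the original index labels instead of renumbering the rows. The sketch below shows this on a stand-in table of 9 made-up students (names and marks are assumptions, not the real Excel data).

```python
import pandas as pd

# Hypothetical table of 9 students (names and marks are made up)
df = pd.DataFrame({
    "Name": ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"],
    "Maths": [78, 91, 65, 88, 54, 73, 96, 60, 81],
})

last_five = df.tail()    # default: last 5 rows
last_three = df.tail(3)  # last 3 rows

print(last_five)
print(list(last_three.index))  # the original row labels are preserved
```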

6. To get all the column names in the data frame

Syntax:

df.columns

7. To get the index values in the data frame

Syntax:

df.index
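The attributes from sections 6 and 7 can be tried together on a small stand-in table. This is only a sketch with made-up names and marks; converting the results with list() is an optional extra step that gives ordinary Python lists.

```python
import pandas as pd

# Hypothetical data, made up for illustration
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Maths": [78, 91, 65],
    "Physics": [82, 74, 69],
})

print(df.columns)        # Index object holding the column names
print(df.index)          # RangeIndex, because no index was supplied
print(list(df.columns))  # plain Python list of column names
print(list(df.index))    # plain Python list of row labels
```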

8. To know the data type of the columns in the data frame

Syntax:

df.dtypes

Output:

Here the names of the students in our example are strings, and pandas typically displays the data type of string columns as object. As the marks are whole numbers, the data type of the marks columns is int64.
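A short sketch on made-up data shows the per-column data types. It also uses select_dtypes, a pandas method not covered in this post, as one possible way to keep only the numeric columns.

```python
import pandas as pd

# Hypothetical data, made up for illustration
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Maths": [78, 91, 65],
})

print(df.dtypes)  # one data type per column

# select_dtypes picks columns by kind, here only the numeric ones
numeric_only = df.select_dtypes(include="number")
print(list(numeric_only.columns))
```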

9. Finding the missing values in the table

a) Syntax:

df.isnull()

This call returns a data frame of boolean values with the same shape as the original. Each entry is "False" where the corresponding value is present and "True" where it is missing.

Output:

In this example, there are no missing values, so every entry is False.

b) To find the number of missing values in each column of the data

Syntax:

df.isnull().sum()

Output:

Since there were no missing values in the data, the number of missing values in each column is zero.
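Because the example file has no gaps, it may help to see both calls on data that does. The sketch below uses a made-up table in which two marks are deliberately left missing.

```python
import pandas as pd

# Hypothetical marks; None marks a missing entry on purpose
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Maths": [78, None, 65],
    "Physics": [82, 74, None],
})

mask = df.isnull()               # same shape as df, True where a value is missing
missing_per_column = mask.sum()  # True counts as 1, so this counts missing values

print(mask)
print(missing_per_column)
```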

10. To get the statistics of the data frame

Syntax:

df.describe()

Output:

The describe() function returns the five-number summary along with the count, mean and standard deviation of each numeric column in the data.

count indicates the number of non-missing values in each column. When nothing is missing, it equals the number of rows in the data frame.

The five-number summary consists of:

1. Minimum

2. First quartile -- Q1 (25th percentile)

3. Second quartile, also known as the median -- Q2 (50th percentile)

4. Third quartile -- Q3 (75th percentile)

5. Maximum

This five-number summary can help identify outliers in the data, for example values lying far below Q1 or far above Q3.
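One common way to turn the summary into an outlier check is the 1.5 * IQR rule of thumb, where IQR = Q3 - Q1. This rule is an assumption added here, not something taken from the example file, and the marks below are made up, with one very low value planted on purpose.

```python
import pandas as pd

# Hypothetical marks for 9 students; the value 5 is planted as an outlier
df = pd.DataFrame({
    "Name": ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"],
    "Maths": [78, 91, 65, 88, 72, 73, 96, 5, 81],
})

stats = df.describe()  # only the numeric column "Maths" is summarised
q1 = stats.loc["25%", "Maths"]
q3 = stats.loc["75%", "Maths"]
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["Maths"] < lower) | (df["Maths"] > upper)]

print(stats)
print(outliers)
```

The 1.5 multiplier is a convention rather than a fixed law, so a flagged value is best treated as something to inspect, not something to delete automatically.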

Thank you for reading and Happy learning ...

Space_S
