Python pandas Data Frame Techniques for Beginners in Data Science #Feed8


Getting started with pandas in Python.


Photo by Tim Swaan on Unsplash

Pandas is one of the most popular Python libraries. It is built on top of NumPy and works closely with Matplotlib:

        a) NumPy - To perform mathematical operations

        b) Matplotlib - To visualize the data

The pandas library is used in data exploration, from understanding what is inside the data to visualizing it.

Pandas supports two main data structures:

        a) Series

        b) Data frame

Series is a one-dimensional labeled array holding any type of data.

Data frame is a two-dimensional structure with data arranged in the form of rows and columns. 
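To make the difference between the two structures concrete, here is a minimal sketch. The student names and marks are invented purely for illustration and are not taken from the Excel file used later.

```python
import pandas as pd

# A Series: one-dimensional, every value carries a label (the index).
# Names and marks here are hypothetical.
marks = pd.Series([78, 91, 65], index=["Asha", "Ben", "Chen"], name="Maths")

# A data frame: two-dimensional, with labeled rows and labeled columns.
table = pd.DataFrame(
    {"Maths": [78, 91, 65], "Physics": [82, 74, 90]},
    index=["Asha", "Ben", "Chen"],
)

print(marks["Ben"])          # look up a single value by its label
print(table.shape)           # (number of rows, number of columns)
print(type(table["Maths"]))  # a single column of a data frame is a Series
```

Picking one column out of a data frame gives back a Series, which is a handy way to remember how the two structures relate.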

Let's start with the data frame basic analysis.

1. Importing pandas library

    Before we start using pandas for data exploration we need to import the pandas library

        import pandas as pd                                                                                                                     

2. Importing data from Excel

        As a lot of data is saved in Excel files, let's get started with Excel. I have created an Excel file with student names and their corresponding marks in different subjects. In total, the table has 9 students with marks in four different subjects.

    Syntax:
pd.read_excel(r' path to the excel file ')

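    Here the r prefix marks a raw string, so backslashes in a Windows path are kept as typed instead of being read as escape sequences. Without it, a path such as 'C:\Users\...' fails with a unicode error because \U is treated as the start of a Unicode escape.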

Ex :

    data=pd.read_excel(r'F:\Python\Student_Marks.xlsx')                                                                    
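The effect of the r prefix can be checked without any Excel file at all. The short sketch below only compares plain Python strings, and the variable names are made up for the illustration.

```python
# Two ways of writing the same Windows path.
escaped = 'F:\\Python\\Student_Marks.xlsx'  # every backslash escaped by hand
raw = r'F:\Python\Student_Marks.xlsx'       # raw string: backslashes kept as typed

print(escaped == raw)  # both spellings produce the same text

# In a normal string '\n' is one newline character;
# in a raw string it stays as two characters, a backslash and an n.
print(len('\n'), len(r'\n'))
```

Forward slashes, as in 'F:/Python/Student_Marks.xlsx', are another common way to sidestep the problem on Windows.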

    To check the data imported from excel, print the variable

    print(data)                                                                                                                                        

    or, in a notebook or interactive session, simply type the variable name

    data                                                                                                                                                 

     Output:
    

3. Creating data frames in pandas

    A data frame can be created from data together with optional index (row) labels and column names. Data frames are data structures made up of rows and columns.

    Syntax:

pd.DataFrame(data, index, columns)

    Ex: Creating a data frame from the data imported from Excel

      df=pd.DataFrame(data)                                                                                                                    

        In this example, we created a data frame variable called df. Note that pd.read_excel already returns a data frame, so this call simply wraps the same data again. When the index is not provided, pandas by default numbers the rows starting from 0. The columns will be the same as in the Excel data.

    Output:
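The index and columns arguments are easier to see on a small table built by hand. This is only a sketch: the rows, the labels s1 to s3 and the column names are all hypothetical.

```python
import pandas as pd

# Hypothetical rows: a name followed by two marks.
rows = [["Asha", 78, 82], ["Ben", 91, 74], ["Chen", 65, 90]]

# No index given: pandas numbers the rows 0, 1, 2, ...
default_df = pd.DataFrame(rows, columns=["Name", "Maths", "Physics"])

# Explicit index: one label per row.
labeled_df = pd.DataFrame(
    rows,
    index=["s1", "s2", "s3"],
    columns=["Name", "Maths", "Physics"],
)

print(default_df.index.tolist())
print(labeled_df.index.tolist())
```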


4. Checking the first few rows of the data frame

    Syntax:

 data_frame_variable.head(number of rows to display from the top of the table)

    Earlier we created a data_frame_variable called df. So let's use it.

   a) By default, the head() function returns the first 5 rows of the data frame

        df.head()                                                                                                                                    

        Output:



    b) To get the first 8 rows of the data frame, we need to pass the number of rows as an argument to the head function

            df.head(8)                                                                                                                                

      Output:
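For readers without the Excel file, the behaviour of head() can be tried on a stand-in table. The sketch below assumes a hypothetical 9-row data frame with invented names and marks.

```python
import pandas as pd

# Hypothetical 9-row table standing in for the Excel data (values invented).
df = pd.DataFrame({
    "Name": [f"Student_{i}" for i in range(1, 10)],
    "Maths": [78, 91, 65, 88, 54, 72, 97, 60, 83],
})

print(len(df.head()))     # default: the first 5 rows
print(len(df.head(8)))    # the first 8 rows
print(len(df.head(20)))   # asking for more rows than exist returns them all
print(len(df.head(-2)))   # a negative number means all rows except the last 2
```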

5. Checking the last few rows of the data

        Syntax:

 data_frame_variable.tail(number of rows to display from the bottom of the table)

    a) By default the tail() function will provide the last 5 rows of the data frame

     df.tail()                                                                                                                                            

      Output:
   


    b) To get only the last 3 rows of the data frame, we need to pass the number of rows as an argument to the tail function

     df.tail(3)                                                                                                                                            

      Output:

    df.tail(3) will show the last three rows of the table or data frame.
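The same kind of check works for tail(). Again this is a sketch on a hypothetical 9-row table with invented values, not the original Excel data.

```python
import pandas as pd

# Hypothetical 9-row table standing in for the Excel data (values invented).
df = pd.DataFrame({
    "Name": [f"Student_{i}" for i in range(1, 10)],
    "Maths": [78, 91, 65, 88, 54, 72, 97, 60, 83],
})

last_five = df.tail()     # default: the last 5 rows
last_three = df.tail(3)   # only the last 3 rows

print(last_three)
# The rows keep their original index labels, they are not renumbered from 0.
print(last_three.index.tolist())
```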

6. To get all the column names in the data frame

    Syntax:

    df.columns                                                                                                                                        

    Output:
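A few everyday uses of df.columns, sketched on a small hypothetical table (the column names and values are assumptions for the example):

```python
import pandas as pd

# Hypothetical table; names and marks are invented.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Maths": [78, 91, 65],
    "Physics": [82, 74, 90],
})

names = list(df.columns)          # column names as a plain Python list
has_maths = "Maths" in df.columns # check whether a column exists

# rename returns a new data frame; the original keeps its column names.
renamed = df.rename(columns={"Maths": "Mathematics"})

print(names)
print(has_maths)
print(list(renamed.columns))
```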


7. To get the index values in the data frame

   Syntax:

     df.index                                                                                                                                                

  Output:
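As a rough illustration of what df.index holds, the sketch below first looks at the default numeric index and then swaps it for the student names. The table is hypothetical, and using the Name column as the index is just one possible choice.

```python
import pandas as pd

# Hypothetical table; names and marks are invented.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Maths": [78, 91, 65],
})

print(df.index)  # default index: a range of numbers starting at 0

# Use the Name column as row labels instead of 0, 1, 2.
by_name = df.set_index("Name")
print(by_name.index.tolist())

# Rows can now be looked up by label.
print(by_name.loc["Ben", "Maths"])
```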

8. To know the data type of the columns in the data frame

    Syntax:

    df.dtypes                                                                                                                                                

    Output:

    Here the names of the students in our example are strings, and pandas displays the data type of string columns as object (newer pandas versions may show a dedicated string type instead). As the marks are whole numbers, the data type of the marks columns is int64.
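Knowing the data types is useful for picking out the columns you can do arithmetic on. A minimal sketch, assuming a hypothetical table with one text column, one whole-number column and one decimal column:

```python
import pandas as pd

# Hypothetical table; names and marks are invented.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Maths": [78, 91, 65],           # whole numbers
    "Physics": [82.5, 74.0, 90.0],   # decimal numbers
})

print(df.dtypes)

# Keep only the numeric columns, whatever their exact integer or float type.
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))
```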

9. Finding the missing values in the table

    a) Syntax:
     
    df.isnull()                                                                                                                                    

        This call returns a data frame of the same shape filled with boolean values, one for each value in the original data frame: "False" where a value is present and "True" where it is missing.

   Output:


In this example, there are no missing values, so every cell is False.
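Since the student table has nothing missing, here is a sketch of what isnull() looks like when gaps do exist. The table and the position of the gaps are invented for the illustration.

```python
import pandas as pd

# Hypothetical table with two marks deliberately left out (None).
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Maths": [78, None, 65],
    "Physics": [82, 74, None],
})

mask = df.isnull()  # same shape as df, True wherever a value is missing
print(mask)

# Is there at least one missing value anywhere in the table?
any_missing = bool(mask.any().any())
print(any_missing)
```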

b) To find the number of missing values in each column of the data 

    Syntax: 

     df.isnull().sum()                                                                                                                                    

     Output:



    Since there are no missing values in the data, the number of missing values in each column is zero.
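The count per column works because True is treated as 1 and False as 0 when summing. A small sketch on a hypothetical table with a few gaps added on purpose:

```python
import pandas as pd

# Hypothetical table with some marks deliberately left out (None).
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Maths": [78, None, 65],
    "Physics": [82, None, None],
})

# True counts as 1 and False as 0, so the sum is the number of gaps per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Adding up the per-column counts gives the total for the whole table.
total_missing = int(missing_per_column.sum())
print(total_missing)
```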

10. To get the statistics of the data frame

     Syntax:

    df.describe()                                                                                                                                        

    Output:


For numeric columns, the describe() function returns the five-number summary along with the count, mean and standard deviation of each column in the data.

count indicates the number of non-missing values in each column, which equals the number of rows when nothing is missing.

The five-number summary consists of 

1. Minimum, 

2. First quartile -- Q1 (25th percentile), 

3. Second quartile, also known as the median -- Q2 (50th percentile), 

4. Third quartile -- Q3 (75th percentile) 

5. Maximum.

This five-number summary can be used to help identify outliers in the data, for example by flagging values that lie far below Q1 or far above Q3.
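One common way to turn the quartiles into an outlier check is the 1.5 x IQR rule, where IQR is Q3 minus Q1. The sketch below is only one possible approach and is not part of describe() itself; the marks are invented, with one unusually low value added on purpose.

```python
import pandas as pd

# Hypothetical marks for 9 students; the value 5 is deliberately unusual.
marks = pd.Series([78, 91, 65, 88, 54, 72, 97, 60, 5], name="Maths")

summary = marks.describe()
print(summary)

# Quartiles are read straight from the describe() result.
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1  # interquartile range

# Assumed rule of thumb: anything beyond 1.5 * IQR from the quartiles is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = marks[(marks < lower) | (marks > upper)]
print(outliers.tolist())
```

The factor 1.5 is a convention rather than a fixed law, so it is worth looking at any flagged values before deciding to drop them.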


Thank you for reading and Happy learning ...
