Memory Optimization using Pandas Library

February 20, 2023

Meet the Author : Mr. Sharat Chandra

Sharat Chandra is the head of analytics at 360DigiTMG and one of the founders and directors of Innodatatics Private Limited. With more than 17 years of experience in the IT sector, including 14+ years as a data scientist across several industry domains, he has expertise in areas such as retail, manufacturing, and medical care. As head trainer at 360DigiTMG for over ten years, he has been helping his students make the move into the IT industry. Working with an oncology team, he contributed to research in LSHC, particularly cancer therapy, which was published in a British cancer research magazine.

Tip 1

When working with large datasets, one of the simplest approaches is to identify the specific variables of interest and load only those for processing, rather than importing the entire dataset into memory.

Passing memory_usage="deep" to the info() function makes pandas measure the real memory usage by inspecting the contents of object columns, at the cost of extra computation. Without the deep option, the estimate is based only on each column's dtype and the number of rows, assuming that every value of a given dtype consumes the same amount of memory.
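The difference between the shallow and the deep estimate can be seen on a small frame. This is a minimal sketch: the toy data and the "Show-Up"/"No-Show" labels are assumptions for illustration, not values taken from the sample dataset.

```python
import io
import pandas as pd

# Hypothetical toy frame; Status is forced to object dtype to mimic text columns.
df = pd.DataFrame({
    "Age": [25, 47, 63, 31],
    "Status": pd.Series(["Show-Up", "No-Show", "Show-Up", "Show-Up"], dtype="object"),
})

# Shallow estimate: object columns are counted by their references only.
shallow = df.memory_usage(deep=False).sum()

# Deep estimate: pandas inspects every Python string, which is slower but accurate.
deep = df.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")

# info() reports the deep total as part of its summary.
buf = io.StringIO()
df.info(memory_usage="deep", buf=buf)
summary = buf.getvalue()
print(summary)
```

On a frame with many text columns, the gap between the two numbers can be large, which is why the deep figure is the one worth tracking while optimizing.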

The memory consumption for the sample dataset is 120.2 MB.

The question we need to ask ourselves is:
Q. Do we require the entire data for processing?

Optimize the Memory Usage

Assume we are interested only in the columns Age and Status.

Why not import only these two columns?

This approach will reduce the memory consumption drastically.

The usecols parameter of the read_csv() function filters out all other columns and imports only the required fields.

This lets us use memory more efficiently, which can also improve the performance of the code.
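As a rough illustration of usecols, the sketch below reads the same CSV twice, once in full and once restricted to Age and Status. The CSV content is made up and held in memory; with a real file, the path would be passed to read_csv() instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file on disk.
csv_text = """Age,Gender,Status,DayOfTheWeek
25,F,Show-Up,Monday
47,M,No-Show,Tuesday
63,F,Show-Up,Friday
"""

# Load every column.
full = pd.read_csv(io.StringIO(csv_text))

# Load only the two columns of interest; the others are never kept in memory.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Age", "Status"])

full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()

print(list(subset.columns))
print(f"all columns: {full_bytes} bytes, selected columns: {subset_bytes} bytes")
```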

Tip 2

Choose the Appropriate Data types.

Pandas works with a set of standard data types, and the data type of every column is inferred automatically from the data it holds.

Each of these standard data types has a predefined structure and storage size.

Let's discuss the numerical data types and their memory consumption.
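The storage size and value range of the common integer types can be listed with NumPy, which pandas uses for its numeric columns. This short sketch only prints what NumPy reports; the selection of types shown is our own choice.

```python
import numpy as np

# Collect (bytes per value, minimum, maximum) for each integer type.
ranges = {}
for name in ("int8", "uint8", "int16", "int32", "int64"):
    dtype = np.dtype(name)
    limits = np.iinfo(dtype)
    ranges[name] = (dtype.itemsize, int(limits.min), int(limits.max))
    print(f"{name:>6}: {dtype.itemsize} byte(s) per value, range {limits.min} to {limits.max}")
```

A column can be moved to a smaller type safely only when all of its values fit inside that type's range.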

Importing the entire sample data consumes 120.2 MB of memory. To optimize the memory utilization we can alter the data types from int64 to int32, int16, or int8 as appropriate.

For example, the Age column consists of positive two-digit numbers, so we can typecast Age to int8 (range -128 to 127) or uint8 (range 0 to 255).

Observe the difference in the size of the data post typecasting it to int8.

It has dropped from 2.28 MB to 0.28 MB, an eightfold reduction, because int8 uses 1 byte per value instead of the 8 bytes used by int64.

Typecasting all the numeric columns will reduce the overall memory consumption for the dataframe.
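A minimal sketch of the typecast on a synthetic Age column follows. The row count and the random ages are assumptions chosen for illustration; the range check before the cast is a precaution we add so that values are never silently wrapped.

```python
import numpy as np
import pandas as pd

# Hypothetical Age column; 300,000 rows of random two-digit ages (an assumption).
rng = np.random.default_rng(0)
age = pd.Series(rng.integers(0, 100, size=300_000), dtype="int64", name="Age")

before = age.memory_usage(index=False)

# Make sure every value fits in int8 before downcasting.
limits = np.iinfo(np.int8)
if age.min() < limits.min or age.max() > limits.max:
    raise ValueError("Age has values outside the int8 range")

age_small = age.astype("int8")
after = age_small.memory_usage(index=False)

print(f"int64: {before / 1024**2:.2f} MB, int8: {after / 1024**2:.2f} MB")
```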

Applying this strategy has brought down the size of the data from 120.2 MB to 100.2 MB.

Further, we can convert values to the boolean type if the data is binary in nature. Use the unique() or value_counts() functions to check which object columns hold only two distinct values.
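The boolean conversion could look like the sketch below. The Status labels are hypothetical, so the comparison value has to be adapted to whatever value_counts() shows for the real column.

```python
import pandas as pd

# Hypothetical binary text column; the labels are assumptions.
status = pd.Series(
    ["Show-Up", "No-Show", "Show-Up", "Show-Up"] * 1000,
    dtype="object",
    name="Status",
)

# Inspect the distinct values before deciding on a conversion.
print(status.value_counts())

before = status.memory_usage(index=False, deep=True)

if status.nunique() == 2:
    # The comparison yields a bool column that needs 1 byte per row.
    showed_up = status == "Show-Up"

after = showed_up.memory_usage(index=False, deep=True)
print(f"object: {before} bytes, bool: {after} bytes")
```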

Tip 3

Reduce the memory consumption of categorical values by renaming the values.

The 'DayOfTheWeek' column takes up 18.38 MB (19276698 bytes) because every row stores the full name of the weekday. The size can be reduced by replacing each full name with a short representation, which is what the category data type does: each distinct value is stored once, and every row holds only a small integer code that refers to it. Repeated values in a non-numerical column are therefore stored far more compactly.

The memory consumption has reduced from 18.38 MB to 0.28 MB. This is a huge compression, especially when the data has a large number of rows.
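A small sketch of the category conversion, using a synthetic weekday column (the row count is an assumption), shows where the saving comes from:

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical weekday column stored as plain Python strings.
day = pd.Series(days * 10_000, dtype="object", name="DayOfTheWeek")
as_object = day.memory_usage(index=False, deep=True)

# category keeps the 7 distinct names once plus one small integer code per row.
day_cat = day.astype("category")
as_category = day_cat.memory_usage(index=False, deep=True)

print(f"object: {as_object} bytes, category: {as_category} bytes")
print("code dtype:", day_cat.cat.codes.dtype)
```

The values themselves are unchanged, so filtering and grouping on the column work as before.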

Tip 4

Converting date columns to datetime reduces memory usage considerably. By default, the pandas library infers such values as the string/object type.

The column AppointmentRegistration is inferred as object type by default. Changing this column to datetime will bring the memory consumption down drastically. The current memory usage is roughly 22 MB. Let's convert this data to datetime and measure the consumption.

After the conversion to datetime, the memory consumption has come down from 22 MB to 2.28 MB.
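The conversion can be sketched with pd.to_datetime(). The timestamp strings below are invented and their format is an assumption; the real AppointmentRegistration values may look different.

```python
import pandas as pd

# Hypothetical timestamps stored as text, as pandas would infer them from a CSV.
raw = pd.Series(
    ["2014-12-16 14:46:25", "2015-01-05 09:12:40", "2015-03-21 17:30:02"] * 10_000,
    dtype="object",
    name="AppointmentRegistration",
)
as_text = raw.memory_usage(index=False, deep=True)

# datetime values are stored as 8-byte integers under the hood.
parsed = pd.to_datetime(raw)
as_datetime = parsed.memory_usage(index=False, deep=True)

print(f"object: {as_text} bytes, datetime: {as_datetime} bytes")
print(parsed.dt.year.value_counts())
```

Besides the smaller footprint, the converted column gives access to the .dt accessor for extracting the year, month, weekday, and so on.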

Conclusion:

Let us implement all the techniques that we have discussed in this blog on the sample data and see the overall effect on memory utilization.

Typecasting to 'int8' has reduced the memory usage by 20 MB.

Let's target the date columns now.

The typecasting technique can help reduce the memory usage to some extent.

We have successfully shrunk the memory usage from 120.2 MB to 42.6 MB.
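One possible variation, sketched here under assumptions, is to declare the compact types while reading the file, so the large intermediate frame is never created. The CSV content and its date format are made up; the dtype and parse_dates parameters are standard read_csv() options.

```python
import io
import pandas as pd

# Hypothetical rows repeated to give the comparison some weight.
rows = [
    "25,Show-Up,Monday,2014-12-16 14:46:25",
    "47,No-Show,Tuesday,2015-01-05 09:12:40",
    "63,Show-Up,Friday,2015-03-21 17:30:02",
]
header = "Age,Status,DayOfTheWeek,AppointmentRegistration\n"
csv_text = header + "\n".join(rows * 2000) + "\n"

# Default load: pandas infers int64 and text columns.
baseline = pd.read_csv(io.StringIO(csv_text))

# Optimized load: small integers, categories, and parsed dates from the start.
optimized = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Age": "int8", "Status": "category", "DayOfTheWeek": "category"},
    parse_dates=["AppointmentRegistration"],
)

baseline_bytes = baseline.memory_usage(deep=True).sum()
optimized_bytes = optimized.memory_usage(deep=True).sum()
print(f"default: {baseline_bytes} bytes, optimized: {optimized_bytes} bytes")
print(optimized.dtypes)
```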

There are certain limitations with the Pandas library, though, especially if the dataset is very large compared to the machine's RAM capacity.

Pandas on its own is not an ideal library for handling large datasets. There are supporting packages that can help Pandas scale up to deal with large datasets.