CRISP - DM Data Cleansing / Data Preparation
Data Cleansing / Data Preparation
Other names for data cleaning include data preparation, data organisation, munging, and data wrangling.
Outlier or Extreme Values: Any value that deviates significantly from the rest of the data in terms of size or range.
Outliers are treated using the 3 R technique: Rectify, Retain, or Remove.
Winsorization Technique
Winsorization limits the influence of outliers by capping extreme values instead of removing them, which alters the sample distribution of the variable. In 90% winsorization, every value below the 5th percentile is set to the 5th-percentile value, and every value above the 95th percentile is set to the 95th-percentile value.
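A minimal sketch of 90% winsorization using only the standard library. The helper names `percentile` and `winsorize`, the linear-interpolation percentile rule, and the sample data are all assumptions made for illustration; other percentile definitions give slightly different caps.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (p in 0..100) of an already sorted list."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap every value at the lower_pct and upper_pct percentile values."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]


data = list(range(1, 20)) + [500]   # made-up sample with one extreme value
winsorized = winsorize(data)        # same length as data; 500 is pulled down to the cap
```

No observation is dropped here: the extreme value is kept but replaced by the 95th-percentile cap.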
Alpha Trimmed Technique
The Alpha Trimmed Technique removes a chosen fraction, alpha, from each tail of the data. For instance, if alpha = 5%, the lowest 5% and the highest 5% of the values are trimmed (eliminated).
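The trimming step might be sketched as follows. The function name `alpha_trim`, the integer-percentage parameter, and the sample data are illustrative assumptions; note that this version returns the surviving values in sorted order.

```python
def alpha_trim(values, alpha_pct=5):
    """Drop the lowest and highest alpha_pct percent of the values."""
    ordered = sorted(values)
    k = len(ordered) * alpha_pct // 100   # number of points cut from each tail
    return ordered[k:len(ordered) - k]


data = list(range(1, 20)) + [500]         # made-up sample with one extreme value
trimmed = alpha_trim(data, alpha_pct=5)   # removes 1 and 500, keeps 2..19
```

Unlike winsorization, trimming shrinks the sample: 20 values go in and 18 come out.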
Missing Values
Missing values are data fields that are empty or that hold a placeholder such as NA, NaN, or Null.
3 Variants of Missing Values
- Missing At Random (MAR)
- Missing Not At Random (MNAR)
- Missing Completely At Random (MCAR)
Imputation
Imputation is a technique used to replace missing values with plausible values. A wide variety of techniques is available, and choosing the one that fits the data is an art.
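The source does not name specific imputation techniques here, so the following is only a sketch of three common simple choices (mean, median, and mode imputation), assuming missing entries are represented as `None`. The function name `impute` and the sample columns are hypothetical.

```python
import statistics


def impute(values, strategy="median"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)   # also works for text categories
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = [25, None, 31, 40, None, 29]        # made-up numeric column
cities = ["Pune", None, "Pune", "Agra"]    # made-up categorical column

ages_filled = impute(ages, "median")
cities_filled = impute(cities, "mode")
```

The median is often preferred for skewed numeric columns, while the mode is the usual simple choice for categorical ones.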
Transformation
Transformation changes the scale or the distribution of the data to make it better suited to analysis.
Types of transformation
- Logarithmic
- Exponential
- Square Root
- Reciprocal
- Box-Cox
- Johnson
- Discretization / Binning / Grouping - Converting continuous data to discrete
- Binarization - Converting continuous data into two categories (binary)
- Rounding - Rounding off the decimals to the nearest integer, e.g. 5.6 becomes 6
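A few of the transformations listed above can be sketched with the standard library. The sample values, the binarization threshold of 100, and the fixed Box-Cox lambda of 0.5 are assumptions for illustration; in practice lambda is usually estimated from the data.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value x for a fixed lambda."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


incomes = [1, 10, 100, 1000]   # made-up, strongly right-skewed values

log_values = [math.log10(x) for x in incomes]         # logarithmic
sqrt_values = [math.sqrt(x) for x in incomes]         # square root
reciprocal_values = [1 / x for x in incomes]          # reciprocal
box_cox_values = [box_cox(x, 0.5) for x in incomes]   # Box-Cox with lambda = 0.5

binary = [1 if x >= 100 else 0 for x in incomes]      # binarization at an assumed threshold
rounded = round(5.6)                                  # rounding to the nearest integer
```

The logarithm compresses the three orders of magnitude in `incomes` into evenly spaced values, which is why it is a common first choice for right-skewed data.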
Binning - Two types of Binning
- Fixed Width Binning
- Adaptive Binning
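The two binning styles might look like the sketch below. Fixed-width binning uses intervals of equal size; for adaptive binning this example assumes a quantile-style approach in which each bin receives roughly the same number of observations, which is one common interpretation. The function names, the bin width of 20, and the sample ages are hypothetical.

```python
def fixed_width_bins(values, width):
    """Assign each value to the bin index floor(value / width)."""
    return [int(v // width) for v in values]


def quantile_bins(values, n_bins):
    """Adaptive binning: rank the values and split them into n_bins groups of near-equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


ages = [3, 17, 25, 26, 41, 58, 63, 99]     # made-up sample

fixed = fixed_width_bins(ages, width=20)   # bins 0-19, 20-39, 40-59, ...
adaptive = quantile_bins(ages, n_bins=4)   # two observations per bin
```

With fixed widths the bin counts follow the data (the 60-79 and 80-99 bins hold one value each), whereas the adaptive version keeps the counts balanced and lets the bin boundaries move.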
Normalization
Normalization / Standardization - Making the data scale-free and unitless.
Methods of Normalization / Standardization include:
- Standardized Scaling, also called Standardization
- Min-Max Scaler, also called Normalization or the Range Method
- Robust Scaling
Standardization has two parts:
- Mean Normalization or Mean Subtraction - Mean Normalization will make the mean of the data ‘Zero’
- Variance Normalization - Variance Normalization will make the variance of the data ‘One’
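The two parts of standardization can be sketched as a single z-score computation: subtract the mean, then divide by the standard deviation. This example assumes the population standard deviation (`statistics.pstdev`); the sample standard deviation is also used in practice. The function name and data are illustrative.

```python
import statistics


def standardize(values):
    """Mean-centre the values, then scale them to unit variance (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation (an assumption)
    return [(v - mean) / std for v in values]


heights_cm = [150, 160, 170, 180, 190]   # made-up sample
z_scores = standardize(heights_cm)

# After standardization the mean is 0 and the variance is 1 (up to rounding error).
check_mean = statistics.fmean(z_scores)
check_variance = statistics.pvariance(z_scores)
```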
The Min-Max Scaler, or Range method, is another name for normalisation. It rescales the data so that the minimum value becomes 0 and the maximum value becomes 1; when the data contain negative numbers, the normalised range is sometimes set to -1 to +1 instead.
The drawback of the Min-Max Scaler is that outliers distort the scaled values.
Because it uses the median and the interquartile range (IQR), robust scaling is far less affected by outliers.
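The contrast between the two scalers can be illustrated with a small sketch. The function names and the sample data are assumptions, and the quartiles are computed with `statistics.quantiles` using its "inclusive" method, which is only one of several quartile conventions.

```python
import statistics


def min_max_scale(values):
    """Rescale values to the range 0..1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


salaries = [10, 20, 30, 40, 1000]        # made-up sample with one outlier

min_max = min_max_scale(salaries)        # the first four values are squeezed near 0
robust = robust_scale(salaries)          # the first four values stay evenly spread
```

In this sample the outlier pushes the four ordinary values into roughly the bottom 3% of the min-max range, while robust scaling keeps them at -1, -0.5, 0, and 0.5 and leaves only the outlier far away.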
Dummy Variable
Create dummy variables by converting categorical data into a numerical representation.
Techniques for Dummy Variable creation are:
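The source does not show its list of techniques, so the sketch below assumes one widely used option, one-hot encoding, in which each category level becomes its own 0/1 column. The function name `one_hot`, the `is_` column prefix, and the sample data are hypothetical.

```python
def one_hot(categories):
    """Turn each category level into its own 0/1 indicator column."""
    levels = sorted(set(categories))
    return [
        {f"is_{level}": int(value == level) for level in levels}
        for value in categories
    ]


colours = ["red", "green", "red", "blue"]   # made-up categorical column
encoded = one_hot(colours)
# encoded[0] is {"is_blue": 0, "is_green": 0, "is_red": 1}
```

Exactly one indicator is set to 1 in every row, so the original category can always be recovered from the encoded form.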
Type Casting
Transforming one data type into another, such as changing a character type to a factor type or an integer type to a floating-point type.
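A small sketch of these conversions in plain Python. Python has no built-in factor type, so the character-to-factor step is approximated here by mapping each distinct label to an integer code; that mapping, like the sample data, is an assumption for illustration.

```python
raw_ages = ["25", "31", "40"]              # numbers stored as text
ages = [int(a) for a in raw_ages]          # character -> integer
ages_float = [float(a) for a in ages]      # integer -> floating point

grades = ["low", "high", "medium", "low"]  # made-up text labels
levels = sorted(set(grades))               # distinct levels: high, low, medium
codes = [levels.index(g) for g in grades]  # character -> factor-like integer codes
```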
Handling Duplicates
Handling duplicates lets us consolidate records from many sources into a single source of truth.
For instance, a person may open a bank account, but his transactions might be recorded under John Travolta in some records, John in others, and Travolta in the rest, even though all three names belong to the same individual. We therefore combine all of these names into one.
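The name consolidation in this example could be sketched with a lookup table that maps every known variant to one canonical name. The `ALIASES` table, the `canonical` helper, and the transaction amounts are all hypothetical; real record matching usually also needs extra keys such as an account number, since a first name alone is rarely unique.

```python
# Hypothetical alias table: every known variant -> one canonical name.
ALIASES = {
    "john": "John Travolta",
    "travolta": "John Travolta",
    "john travolta": "John Travolta",
}

transactions = [("John Travolta", 120.0), ("John", 35.5), ("Travolta ", 60.0)]


def canonical(name):
    """Normalise case and whitespace, then map the name to its canonical form."""
    key = name.strip().lower()
    return ALIASES.get(key, name.strip())


cleaned = [(canonical(name), amount) for name, amount in transactions]
unique_customers = sorted({name for name, _ in cleaned})
```

After cleaning, all three transactions belong to a single customer record instead of three.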
String Manipulation
Working with textual data. Various ways of converting unstructured textual data into structured data are:
Zero or Near Zero Variance
Variables that take a single value, or the same value for the vast majority of records. For instance, all of the zip codes are the same, or every entry in the gender column is female.
We exclude variables with zero or near-zero variance from the analysis, because they carry little information.
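One simple way to flag such variables is to check how dominant the most frequent value is. The sketch below assumes a cut-off of 95% for "near zero" variance; the threshold, the function name, and the sample columns are illustrative choices, not a fixed rule.

```python
from collections import Counter


def is_near_zero_variance(values, threshold=0.95):
    """True when the most frequent value covers at least `threshold` of the records."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values) >= threshold


columns = {   # made-up columns of 100 records each
    "gender": ["F"] * 100,
    "zip_code": ["10001"] * 97 + ["10002"] * 3,
    "age_band": ["18-30"] * 40 + ["31-50"] * 35 + ["51+"] * 25,
}

kept = [name for name, values in columns.items() if not is_near_zero_variance(values)]
```

Here `gender` (zero variance) and `zip_code` (near-zero variance) are dropped, and only `age_band` is kept.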