Data cleaning is the third step of data preparation: errors in the data are identified and corrected. Done by traditional, manual methods, cleaning takes up a large share of the time spent on data preparation, but it is essential for removing bad data and filling in missing data.
Data cleaning produces a complete and accurate data set, so that analyzing it yields valid answers. For small data sets this step can be done by hand, but real-world data sets require an automated method.
Data cleaning includes the following: removing duplicate and outlier data, removing extraneous data, correcting input errors, removing or filling in missing values, conforming data to a standardized pattern, and masking private or sensitive data such as names or addresses.
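Several of the tasks listed above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the field names (`name`, `city`, `age`), the plausible age range of 0 to 120, the median fill, and the first-letter masking rule are all assumptions chosen for the example, not a prescribed method.

```python
import statistics

def clean_records(records, age_range=(0, 120)):
    """Clean a list of dicts with hypothetical 'name', 'city', 'age' fields."""
    cleaned, seen = [], set()
    for rec in records:
        # Correct input errors: trim whitespace and unify capitalisation.
        name = rec["name"].strip().title()
        city = rec["city"].strip().title()
        age = rec["age"]

        # Remove outliers: ages outside the assumed plausible range.
        if age is not None and not (age_range[0] <= age <= age_range[1]):
            continue

        # Remove duplicates: same values after normalisation.
        key = (name, city, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})

    # Fill in missing ages with the median of the known ages.
    known = [r["age"] for r in cleaned if r["age"] is not None]
    median_age = statistics.median(known) if known else None

    for r in cleaned:
        if r["age"] is None:
            r["age"] = median_age
        # Mask sensitive data: keep only the first letter of the name.
        if r["name"]:
            r["name"] = r["name"][0] + "*" * (len(r["name"]) - 1)
    return cleaned

raw = [
    {"name": " alice smith", "city": "paris ", "age": 34},
    {"name": "Alice Smith", "city": "Paris", "age": 34},   # duplicate
    {"name": "bob jones", "city": "BERLIN", "age": None},  # missing value
    {"name": "Carol White", "city": "Rome", "age": 430},   # outlier
    {"name": "dan brown", "city": "Oslo", "age": 40},
]

for row in clean_records(raw):
    print(row)
```

Whether to drop or fill a missing value, and what counts as an outlier, depends on the data set; the choices here are placeholders.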
After the data cleaning stage, the data preparation and pre-processing work done so far should be tested for errors, so that any error found can be fixed before the next stage begins.
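One way to test the cleaned data before moving on is a small validation function that reports remaining problems. The sketch below is an assumption about what such a check might look like; the function name `validate` and the required fields are hypothetical, and it only looks for missing values and leftover duplicates.

```python
def validate(records, required=("name", "city", "age")):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        # Every required field must be present and non-empty.
        for field in required:
            if rec.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
        # No record may repeat an earlier one.
        key = tuple(rec.get(f) for f in required)
        if key in seen:
            problems.append(f"row {i}: duplicate record")
        seen.add(key)
    return problems

sample = [
    {"name": "A****", "city": "Paris", "age": 34},
    {"name": "A****", "city": "Paris", "age": 34},
    {"name": "", "city": "Oslo", "age": 40},
]
for problem in validate(sample):
    print(problem)
```

If the returned list is not empty, the problems can be fixed and the check rerun before the data enters the next stage.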