The process of cleaning and manipulating data can be executed with some of the existing tools. This article will explain the phase applied in Microsoft Excel, which is one of the spreadsheets tools managed by Microsoft. Remembering all of the processes will use formulas in a spreadsheet, it is possible to be applied in other tools such as Google Spreadsheet. The case used in this article is adapted from the Weekly Practice Challenge provided by Coursera in “Excel Skills for Data Analytics and Visualization Specialization”
Fig 1 represents the raw data that needs to be processed. The data relating to Human Resources contains some information from each employee. When we look closer, there is some blank column that needs to be filled.
The first step is to fill the “emp id” column which represents Employee ID. In this case, we used to combine the “full/part time” column and “emp no” column to get the Employee ID. We can apply CONCATENATE formula as shown in Fig 2.
The second step is to fill the “full name” column by combining the “first name” and “last name” columns. After checking into detail of each column, it contains inconsistent capitalizations. Because of that, we need to combine two columns and transform them into proper capitalization using PROPER formula as shown in Fig 2.
The next column we need to fix is “email”, it needs to combine the first letter of the first name, the surname, and “@zenco.com”. After getting the pattern of the email, all the letters should be transformed into lower case. Look into this case, we need CONCATENATE formula to combine the column and LOWER formula to make all of them in lower case. The implementation is shown in Fig 4.
The next step is to extract the floor number from the first two numbers of “location”. Because of its location on the left of the column, it means we just need to use LEFT formula. Detail of the formula shown in Fig 5.
From the same source, the text between the two dashes in “location” column will be extracted into “wing” column. After it is extracted, adjust the formula that allows “NorthEast” are replaced with just “North”. To use the value located in the middle, we can use MID formula, and to replace it with another value, we can use SUBSTITUTE formula as shown in Fig 6.
Still from the same source as “floor no” and “wing”, we also extract the last four characters from “location” to fill the “extension” data. We can apply the RIGHT formula, remembering the position value is on the right column. But, after it is applied, you will notice that instead of getting four numbers you only got three or in some cases 2 numbers. Because it contains some spaces in the last character, we need to combine the SUBSTITUTE, CLEAN, and UNICHAR formula. The detail of the used formula shown in Fig 7.
The finished sheet should look like Fig 8, all of the grey highlights represent the answer to work that has been done in this article.
Conclusion:
1. CONCATENATE: used to combine several texts into one text in a cell
2. PROPER: used to convert a text to proper case, this means the first letter in each word will be uppercase and the other letters will be lowercase
3. LOWER: used to convert all letters to lowercase, the opposite function goes for UPPER formula
4. LEFT: used to return a value of characters from the start of a text string, another function is RIGHT which returns a value from the end of a text string.
5. MID: used to return a value of characters from the middle of a text string, we can custom a starting position and how long the length
6. SUBSTITUTE: used to replace existing text with new text
7. UNICHAR: used to return the Unicode character referenced by the given numeric value
8. CLEAN: used to remove all the nonprintable characters