Data Transformation with OpenRefine
August 27, 2020
Data, Data transformation, Tutorial
With cleaner data, we can begin to transform it. Data transformation can manifest in different forms. It can be clustering, merging, adding information, replacing strings, and so on. OpenRefine covers them all.
To watch the video-tutorial visit our YouTube channel.
OpenRefine supports clustering. A useful feature when it comes to data transformation of misspelled records. It takes data from a column and clusters it allowing the user to bundle possibly-similar groups together. For example, if there are many entries as “Central Park”, and a bunch of entries as “central park”. Data under each would be bundled as different groups; however it is clear that the only difference between both is the lower and upper cases of the titles. For easier data analysis we would need to combine the two groups by clustering.
To check if there are any misspelled words or phrases in your data, you simply select the column -> Facet -> Text facet.
As in can be observed from the facet log, various “product-color” entries have spelling errors, such as lower case letters, additional spaces before/after/in-between words, misplaced period points, etc., for the groups “Midnight blue”, “ Grass green”, and “Red”.
To cluster them together click on the “Cluster” button in the top right corner of the facet log and select which groups should be clustered together, choose the default “New Cell Value” title or customize how you prefer, then apply the changes by clicking “Merge Selected & Re-cluster”.
The misspelled entries are now corrected to match the main group’s title.
This data transformation can be very useful and change a lot for your data analysis.
Now, we are going to dive deeper into modifying data more specifically. Frequently, some unnecessary strings or symbols can ruin the work of analysis with the data. For example, the “$” in the “product-price” column is an obstacle, because with the symbol the cell values will not be recognized as numbers for many spreadsheet softwares; therefore, we have to eliminate the symbols. The term “time-consuming” is an underestimation of how much time it would take to manually go over 8000 cells to delete a single symbol. In OpenRefine it can be done in only a minute with the help of GREL (General Refine Expression Language).
GREL obtains such possibilities as:
- creating a custom text or numeric facet;
- adding a column based on another column;
- Create a new column by fetching a URL;
- Transforming cells in a column.
To delete the “$” from the “product-price” column select the column -> Edit cells -> Transform.
In the “Expression” part a formula is needed that will only remove the unnecessary symbols.
This can be done by two functions:
- replace(value, "$", "");
- value.split("$").join("").
Input the function, make sure that the formula works by checking the “Preview” log, then click “OK”.
As simple as that! Now the data transformation from a mixed data type column can be selected as solely a number column, making analysis and work with data easier.
Now, we are going to look at how to import/merge data from a different project with OpenRefine. To begin, download the “Discounts-Data.csv” file here.
Import the data into OpenRefine by creating a new project.
If we take a look at the additional file, it shows that there is a new column of discounts for two specific products.
We are going to merge “product-discount” into our primary data.
For this process, we are going to use another GREL formula.:
cell.cross('arg1','arg2').cells['arg3'].value[arg4]
- arg1 = name of the project you are exporting data from = “Discounts Data csv”;
- arg2 = name of the key column = “product-title”;
- arg3 = name of the column you are importing = “product-discount”;
- arg4 = indicate which value to import in the array (if multiple matches for the key) (recommended to use 0).
Since there will be cells that will not obtain an additional value, we add a little bit of logic to clean things up, and for the data not to show null values.
To begin the merging, select the “product-title” column -> Edit column -> Add column based on this column.
Following the formula layout, the formula we are going to input in the “Expression” log is as follows:
if(value!='null',cell.cross("Discounts Data csv", "product-title").cells["product-discount"].value[0],'')
Name the column as “product-discount” or as you wish to name it, and click “OK”.
Now that the data has been merged together, for simpler layout we are going to reorder the columns so that the “product-discount” is placed after the “product price”.
Select column “All” -> Edit columns -> Re-order / remove columns….
Drag the “product-discount” under the “product-price” and click “OK”.
In this log you can also remove various columns at once, if needed.
It is as easy as that!
Previous parts of "Data Transfromation with OpenRefine" blog series:
Video tutorial of the "Data Cleaning with OpenRefine".