Data Transformation with OpenRefine

August 27, 2020

Fix misspelled data OpenRefine, Merge CSV files OpenRefine, Web scraped data cleaning, Remove special characters OpenRefine, Data enrichment OpenRefine, Transform numeric columns OpenRefine, Add new column OpenRefine, GREL formulas OpenRefine, OpenRefine data manipulation, Clustering similar entries OpenRefine

Data-Transformation-Open-Refine-Web-Scraper-Blog-Image

With cleaner data, we can begin to transform it. Data transformation can manifest in different forms. It can be clustering, merging, adding information, replacing strings, and so on. OpenRefine covers them all.

To watch the video tutorial, visit our YouTube channel.

OpenRefine supports clustering. A useful feature when it comes to data transformation of misspelt records. It takes data from a column and clusters it, allowing the user to bundle possibly-similar groups together. For example, if there are many entries as “Central Park”, and a bunch of entries as “central park”. Data under each would be bundled as different groups; however, it is clear that the only difference between them is the lower and upper cases of the titles. For easier data analysis, we would need to combine the two groups by clustering.

To check if there are any misspelt words or phrases in your data, you simply select the column -> Facet -> Text facet.

Data-Transformation-With-Open-Refine-Web-Scraper-Blog-Clustering

As in can be observed from the facet log, various “product-color” entries have spelling errors, such as lower case letters, additional spaces before/after/in-between words, misplaced period points, etc., for the groups “Midnight blue”, “ Grass green”, and “Red”.

Data-Transformation-With-Open-Refine-Web-Scraper-Blog-Misspelled-Records-Clustering

To cluster them together, click on the “Cluster” button in the top right corner of the facet log and select which groups should be clustered together, choose the default “New Cell Value” title or customise how you prefer, then apply the changes by clicking “Merge Selected & Re-cluster”.

Data-Transformation-With-Open-Refine-Web-Scraper-Blog-Clustering-Records

The misspelt entries are now corrected to match the main group’s title.

This data transformation can be very useful and change a lot for your data analysis.

Data-Transformation-With-Open-Refine-Clustering-Done-Web-Scraper-Blog

Now, we are going to dive deeper into modifying data more specifically. Frequently, some unnecessary strings or symbols can ruin the work of analysis with the data. For example, the “$” in the “product-price” column is an obstacle, because with the symbol, the cell values will not be recognised as numbers for many spreadsheet software; therefore, we have to eliminate the symbols. The term “time-consuming” is an underestimation of how much time it would take to manually go over 8000 cells to delete a single symbol. In OpenRefine , it can be done in only a minute with the help of GREL (General Refine Expression Language).

GREL obtains such possibilities as:

creating a custom text or numeric facet;
adding a column based on another column;
Create a new column by fetching a URL;
Transforming cells in a column.

To delete the “$” from the “product-price” column, select the column -> Edit cells -> Transform.

Data-Transformation-With-Open-Refine-Transform-Cells-Web-Scraper-Blog

In the “Expression” part, a formula is needed that will only remove the unnecessary symbols.

This can be done by two functions:

replace(value, "$", "");
value.split("$").join("").

Input the function, make sure that the formula works by checking the “Preview” log, then click “OK”.

Data-Transformation-With-Open-Refine-Deleting-Strings-Web-Scraper-Blog

As simple as that! Now the data transformation from a mixed data type column can be selected as solely a number column, making analysis and work with data easier.

Now, we are going to look at how to import/merge data from a different project with OpenRefine. To begin, download the “Discounts-Data.csv” file here.

Import the data into OpenRefine by creating a new project.

Data-Transformation-With-Open-Refine-Import-Second-Data-File-Web-Scraper-Blog

If we take a look at the additional file, it shows that there is a new column of discounts for two specific products.

Data-Transformation-With-Open-Refine-Extra-Column-Data-Web-Scraper-Blog

We are going to merge “product-discount” into our primary data.

For this process, we are going to use another GREL formula.:

cell.cross('arg1','arg2').cells['arg3'].value[arg4]

arg1 = name of the project you are exporting data from = “Discounts Data csv”;
arg2 = name of the key column = “product-title”;
arg3 = name of the column you are importing = “product-discount”;
arg4 = indicate which value to import in the array (if multiple matches for the key) (recommended to use 0).

Since there will be cells that will not obtain an additional value, we add a little bit of logic to clean things up, and for the data not to show null values.

To begin the merging, select the “product-title” column -> Edit column -> Add column based on this column.

Data-Transformation-With-Open-Refine-Add-Column-Based-On-This-Column-Web-Scraper-Blog

Following the formula layout, the formula we are going to input in the “Expression” log is as follows:

if(value!='null',cell.cross("Discounts Data csv", "product-title").cells["product-discount"].value[0],'')

Name the column as “product-discount” or as you wish to name it, and click “OK”.

Data-Transformation-With-Open-Refine-GREL-Merge-Data-Web-Scraper-Blog

Now that the data has been merged together, for a simpler layout, we are going to reorder the columns so that the “product-discount” is placed after the “product price”.

Select column “All” -> Edit columns -> Re-order / remove columns….

Data-Transformation-With-Open-Refine-Re-Order-Remove-Columns-Web-Scraper-Blog

Drag the “product-discount” under the “product-price” and click “OK”.

In this log you can also remove various columns at once, if needed.

Data-Transformation-With-Open-Refine-Reordering-Columns-Web-Scraper-Blog

It is as easy as that!

Previous parts of "Data Transfromation with OpenRefine" blog series:

Part #1 - Introduction to OpenRefine;
Part #2 - Data Cleaning with OpenRefine.

Video tutorial of the "Data Cleaning with OpenRefine".

Go back to blog page