Cleaning up datasets with the Datarefine plugin

Posted by

data-refine
Working with messy data can be fustrating

Cleaning up datasets can be fustrating and time consuming, and mass editing of data in Enonic CMS directly is not supported. This plugin is a working starting point to make it possible to perform basic edit operations on multiple content in a single category folder. You can for example replace sub-strings, change casing to lowercase, uppercase, titlecase - or normalize whitespace. It is also open source and you are free to add your own operations.

Sourcecode for data-refine project is available https://bitbucket.org/runeway/enonic-labs-plugins

OpenRefine - a powerful tool for working with data.

OpenRefine (previously called Google Refine), a great tool for working with messy data. There are many video tutorials - and the entire source code is available on Github, so it is easy to dive into the details.

One implementation example is the  http://en.wikipedia.org/wiki/N-gram fingerprint method which is great for finding duplicates without false positives. It works by generating a key from a string value by following the process in order:

  1. remove leading and trailing whitespace
  2. change all characters to their lowercase representation
  3. remove all punctuation and control characters
  4. split the string into whitespace-separated tokens
  5. sort the tokens and remove duplicates
  6. join the tokens back together
  7. normalize extended western characters to their ASCII representation (for example "gödel" → "godel")

The code for this in Open Refine is here . And here is an article which explains clustering in depth  - clustering being  the operation of "finding groups of different values that might be alternative representations of the same thing".

Working with data in Enonic CMS.

One approach for cleaning data present in Enonic Cms is to write a plugin that imports all content to a csv file, import the file into Goolge Refine, clean up the dataset, and then re-import the data using a appropriate import configuration  - or write an import plugin. This can be a bit tedious process if you only want to do some basic operations, like changing all employee names to Titlecase or have correct spelling on street names. 

One common spelling error in Norway is the street name allé/allè/alle - where allé is the right way of spelling it.  Ref:  http://www.osloby.no/nyheter/Her-er-allallalle-forvirret-7139709.html . So using the Datarefine plugin it is for example easy to search all content which contains an address for misspellings, and correct them.

Introducing the data-refine plugin.

The data-refine plugin is a java plugin, built as a Enonic CMS compatible osgi bundle, it implements its own view controller in freemarker, and uses the http://purecss.io library - a set of small, responsive CSS modules that you can use in every web project - directly from CDN http://yui.yahooapis.com/pure/0.3.0/pure-min.css. From their own blog: "Pure is ridiculously tiny. The entire set of modules clocks in at 4.4KB* minified and gzipped, without forgoing responsive styles, design, or ease of use."

The plugin introduces the possibility of "mass-editing" of content inside a content category folder.

Overview of the available editing options.

The data-refine plugin is per default set up as a controller on the path /admin/site[0-9]/datarefine-plugin. This may be changed in the context.xml file inside the plugin. The first thing you will have to do is provide a category key:

data-refine-select-category-key

Image: Screenshot from category selection in data-refine plugin.

As soon as a valid category key is selected, you will be informed of which contenttype the category contains, and what operations are available. The operation targets are dynamically calculated using the client.getContentTypeConfigXML(GetContentTypeConfigXMLParams params) method in the API. Below is an example with a complex contenttype. 

datarefine-overview

Image: Screenshot from overview of all operations and complex contenttype in data-refine plugin.

The Count input _only_ decides how many instances of the content should be shown in preview, all operations performed will always be executed on all contents in the category folder.

Retrieve data from cache / database.

The first test run, all content is fetched from database, but if you want to do a second test run, Use cached data will be automatically selected to decrease response time.

Commit changes to database.

On every run this will be reset to "No, only preview result in browser". If you are happy with the test run result you may change this setting to "Yes, commit changes to database" to perform changes and commit them to the database. If there are many content, this will take some time, there is no progress bar implemented yet, but you may follow the progress in the Enonic CMS Event Log.

 

event-log 

Logentry

Image: The Event Log inside Enonic CMS will log all changes to content and create detailed change comments with information of what was changed. 

Operations

The currently available operations are:

  • Convert names to Full Title Case
  • Convert names to uppercase
  • Convert names to lowercase
  • Normalize whitespace
  • Replace in string [source|target]

These operations will be performed on all content for all selected Operation targets (contenttype inputs)All fields that are affected by the operation(s) will be marked in red/green colour, where red is the old value and green is the updated value - the updated value will also be suffixed with "(+)". Below is an example where I change the "-" in the norwegian counties with a " " (whitespace character), and at the same time remove extra whitespace characters.

datarefine-previewImage: Screenshot from preview of content after performing string replace operation.

When I am happy with my preview I can select the "Yes, commit changes to database" radiobutton and press the "apply" button, and changes will be saved to database and you should see the changes when the operation has finished.

Example

Lets go through an example where we want to clean up some data inconsistencies in our location contenttype, we want to:

  1. Replace all instances of 'gt. ' with 'gate ' for address field.
  2. Convert all street names to Title Case.
We start by entering the category key for the 'location' contenttype and gets the following message: 
 
"Category 'Legekontorer (18)' selected. Category has 1195 content of contenttype 'location'"

The best way is to perform the operations in two separate commits, if both operations are performed simultaneously you will also change all address fields to title case and replace gt. with gate in all names - which is not the intention. But if you think it is safe, you may do this in one commit and add some elements to the preview to see if you get any unwanted changes. 
 
Let us do the first operation first by entering source and target string and address1 as operation target:
 
datarefine-op-address1
Image: Screenshot from string replace operation of "gt. " to "gate " for address1 field. This operation can have some risk of error, f.x "Vogt. gt." would result in "Vogate gate"- because an unwanted instance of "gt." gets replaced. 
 
Press apply and check out the preview result, increase count to 100 to get some more content in preview, you may also select filter "Display modified rows only", to only see rows that has changes.
 
Below is part of the preview result. It looks ok, so we will select "Yes, commit changes to database." and press "Apply" again to submit changes to database. Go into Enonic CMS and check the Event Log to see if everything proceeds as expected. All updated entries gets a change comment like "Field: 'address1', updated by Datarefine-plugin.".

We continue by selecting the "Convert names to Full Title Case" operation and select "name" as operation target, then press "Apply", and we can see in the preview that all names have been converted to Title Case. Select "Yes, commit changes to database." and press apply again, and we are finished with the planned data cleanup.

Preview:
 
datarefine-gate-replacedatarefine-title-case
Image: Screenshot of red/green diff. screen in preview, before committing changes to database.


Finishing words.

Feel free to download the source from bitbucket. An enonic jar file can be compiled by simply running "mvn clean package" inside the  enonic-labs-plugins / datarefine-plugin folder and copying the packaged .jar file to enonic plugin folder. You should then be able to access the plugin at /admin/site[0-9]/datarefine-plugin on your installation.

Disclaimer: Please do testing in a test environment before performing large operations in a production environment, since this plugin is considered a non-official Enonic plugin in "Beta". It has not been tested with all types of contenttypes / inputs, but should work with at least basic input fields, radiobuttons, checkboxes and perhaps textareas.

Please add comments or feature requests in Disqus and/or the public issue tracker at bitbucket - and create pull requests if you expand on functionality that may be useful for others. 

----------------------

License

The software described in this post is licensed under AGPL 3.0 license. Also the distribution includes 3rd party software components. The vast majority of these libraries are licensed under Apache 2.0. For a complete list please read NOTICE.txt inside the git repository.

You are free to use, change and re-distribute the software according to license The software is provided "AS IS" - Enonic accepts no liability, indemnity or warranties.

Comments