Wednesday, November 6, 2013

Cleansing and Matching using Regular Expressions

Here are three module functions in Javascript that I find handy to cleanse data during load.

function parse(pattern, options, text, returnIndex) {
var re = new RegExp(pattern, options);
var result = re.exec(text);
if (result && result.length > returnIndex) {
return result[returnIndex];
}
return null;
}

function matchx(pattern, options, text) {
var re = new RegExp(pattern, options);
var result = re.exec(text);
if (result) { return true }
return false;
}

function replacex(pattern, options, text, newText) {
var re = new RegExp(pattern, options);
return text.replace(re, newText);
}

Use parse to grab interesting text from a field, usually to create a new field. If a field has job description, you may to create a job title field from the description by looking for words that end with 'er' such as 'developer', 'manager', words ending with 'tect' such as 'architect', and words ending with 'yst' such as 'analyst. This results in the following load script code to be used within a load statement:

parse('\s(\w+er|\w+tect|\w+yst)[\s$]', 'i', [Job Desc], 1) as Title

The "i" option makes matching case in-sensitive. The returnIndex of 1 returns index 1 of the array created by Javascript RegExp.exec function. This will allow you to extract and separate multiple items from one field into multiple fields.

Use matchx to match fields to complex regular regular expression patterns. Matchx returns true or false. Here is an example. If a field has the word 'garbage' in any case, any where, that record will not be loaded.

Reject records where myField  is 'This is garbage', 'Garbage no good', 'No GarBage allowed', etc.

LOAD
     myField
FROM mySource
Where not matchx('garbage', 'i', myField);

Use replacex to replace regular expression pattern matched substring(s) in the field. The following will replace 'nyc, NY', 'manhattan,NY', and 'New York City, NY' in Location field with 'New York, NY'.

replacex('^(nyc|new york city|manhattan).*?,', 'i' , Location, 'New York') as CleanLocation

My favorite site to learn writing regular expressions is http://www.regular-expressions.info/

No comments:

Post a Comment