
Thursday, January 14, 2010

Utilizing Apex Pattern and Matcher Classes

In many projects I am involved with, I need to validate a string of data or transform the string into a new one with a specific format. Processing the text with the String primitive type methods or your own custom handling can be a big and time-consuming undertaking.

Sometimes one needs to write hundreds of lines of code to process a string and make sure it's valid (formatted as expected) or to transform it into the proper format. Some examples are validating a string to see if it's a correct email address, postal code, phone number or URL. Other examples are grabbing HTML tags, stripping the XML or HTML tags to get clean text, trimming whitespace, removing duplicate lines or items, and many more.

Apex on the Force.com platform has just the right set of classes to help you carry out such operations, pretty much the same way Java does.

“A regular expression is a string that is used to match another string, using a specific syntax. Apex supports the use of the regular expression through its Pattern and Matcher classes.” Quoted right from the holy guide. Any regular expression that is written for Java can be used with Apex as well.

In order to utilize these classes we first need to know what each of them does.

The Pattern class holds the regular expression: you compile the expression string into an object of this class. You only need to compile a given expression once and can then reuse the object. From it you create a Matcher object by passing in the string that you want to validate or operate on.


Pattern myPattern = Pattern.compile('(a(b)?)+');




The Matcher, in turn, lets you check whether the string matched the pattern, or manipulate the original string in various ways to produce a new one.



Matcher myMatcher = myPattern.matcher('aba');
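
To make the division of labour between the two classes concrete, here is a minimal sketch that could be run as anonymous Apex. It reuses the expression and input from the snippets above; the extra 'abc' input is only an illustrative assumption.

```apex
// Compile the expression once; the Pattern object can be reused.
Pattern myPattern = Pattern.compile('(a(b)?)+');

// Create a Matcher for the string to be checked.
Matcher myMatcher = myPattern.matcher('aba');

// matches() succeeds only if the whole string fits the expression.
System.assert(myMatcher.matches());

// group(0) is the full match; groupCount() is the number of capturing groups.
System.assertEquals('aba', myMatcher.group(0));
System.assertEquals(2, myMatcher.groupCount());

// The same Pattern can hand out a Matcher for a different input.
// 'abc' is a made-up input: the trailing 'c' is not allowed by the expression.
System.assert(!myPattern.matcher('abc').matches());
```

One Pattern can serve any number of inputs, which is why the expression only needs to be compiled once.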



Let’s explore some samples of using regular expressions in Apex and see how we can benefit from them:

My first example is validating an email address. I personally struggled with this, since email addresses can get pretty ugly at times. Imagine this email address:

name.lastname_23@ca.gov.on.com



String inputString = 'email@email.com';
// Local part, then '@', then either a bracketed IP literal or a dotted domain name
String emailRegex = '([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)';
Pattern myPattern = Pattern.compile(emailRegex);

// Then instantiate a new Matcher object "myMatcher"
Matcher myMatcher = myPattern.matcher(inputString);

// matches() is true only when the entire string fits the expression
if (!myMatcher.matches()) {
    // invalid, do something
}
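
If the check is needed in more than one place, it can be wrapped in a small helper. The following is only a sketch: the class name EmailCheck and the method name isValid are hypothetical, and the expression is a simplified variant of the one above that covers dotted domain names only (no IP addresses).

```apex
public class EmailCheck {
    // Simplified, domain-name-only variant of the email expression (an assumption
    // made for readability). Compiled once and reused by every call.
    private static final Pattern EMAIL_PATTERN = Pattern.compile(
        '([a-zA-Z0-9_\\-\\.]+)@(([a-zA-Z0-9\\-]+\\.)+)([a-zA-Z]{2,4})');

    // Returns true only when the whole input looks like an email address.
    public static Boolean isValid(String input) {
        if (String.isBlank(input)) {
            return false; // null, empty or whitespace-only input is never valid
        }
        return EMAIL_PATTERN.matcher(input).matches();
    }
}
```

With this in place, EmailCheck.isValid('name.lastname_23@ca.gov.on.com') should return true, while EmailCheck.isValid('email@email') should return false because there is no dot after the domain label.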



Some more examples on validations:



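// to validate a password (a digit, a lowercase letter, an uppercase letter, one of @#$%, 6 to 20 characters)
// note: backslashes must be doubled inside an Apex string literal
String RegularExpression_Password = '((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%]).{6,20})';

//image file extension
String RegularExpression_ImgFileExt = '([^\\s]+(\\.(?i)(jpg|png|gif|bmp))$)';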

//to validate an IP Address
String RE_IP = '^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])$';

//date format (dd/mm/yyyy)
String RE_date = '(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)';

//to match links tag "A" in HTML
String RE_ATags = '(?i)<a([^>]+)>(.+?)</a>';
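
As a rough illustration of how these expressions might be put to work, the sketch below (runnable as anonymous Apex) uses the date expression with matches() and capturing groups, and the link expression with find() in a loop. The sample date and the sample HTML string are made up for the example.

```apex
// Date expression from the list above (dd/mm/yyyy).
Pattern datePattern = Pattern.compile(
    '(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)');

// matches() requires the whole input to be a date.
Matcher dateMatcher = datePattern.matcher('31/12/2010');
System.assert(dateMatcher.matches());

// Each pair of parentheses is a group: 1 = day, 2 = month, 3 = year.
Integer day = Integer.valueOf(dateMatcher.group(1));
Integer month = Integer.valueOf(dateMatcher.group(2));
Integer year = Integer.valueOf(dateMatcher.group(3));
System.assertEquals(31, day);
System.assertEquals(12, month);
System.assertEquals(2010, year);

// The expression checks the format only, so a day of 32 is rejected
// but an impossible date such as 31/02/2010 would still pass.
System.assert(!datePattern.matcher('32/01/2010').matches());

// Link expression from the list above; find() walks through every
// occurrence inside a larger string instead of matching the whole string.
Pattern linkPattern = Pattern.compile('(?i)<a([^>]+)>(.+?)</a>');
String sampleHtml = '<p><a href="/one">First</a> and <A href="/two">Second</A></p>';
Matcher linkMatcher = linkPattern.matcher(sampleHtml);

List<String> linkTexts = new List<String>();
while (linkMatcher.find()) {
    linkTexts.add(linkMatcher.group(2)); // group 2 is the text between the tags
}
System.assertEquals(2, linkTexts.size());
System.assertEquals('First', linkTexts[0]);
System.assertEquals('Second', linkTexts[1]);
```

The distinction matters in practice: matches() answers "is the entire string in this format?", while find() answers "does this format occur somewhere in the string?".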






Another way that you can benefit from the Matcher class is to reformat a string.

Below is an example that shows how you can strip the HTML tags from a string and extract the plain text. This is very useful when you want to record email contents in Salesforce or convert the HTML version of an email into its plain text counterpart.



String html = 'your html code';

// first replace all <br/> tags with \n to support new lines
String result = html.replaceAll('<br/>', '\n');
result = result.replaceAll('<br />', '\n');

// regular expression to match all HTML/XML tags
String HTML_TAG_PATTERN = '<.*?>';

// compile the pattern
Pattern myPattern = Pattern.compile(HTML_TAG_PATTERN);

// get your matcher instance
Matcher myMatcher = myPattern.matcher(result);

// remove the tags
result = myMatcher.replaceAll('');
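
The same steps can be folded into a reusable method. This is a sketch under a few assumptions: the class name HtmlText and the method name stripTags are hypothetical, and the line-break expression is a looser, case-insensitive variant of the two replacements above rather than the exact code from this post.

```apex
public class HtmlText {
    // Converts line-break tags to newlines, then removes all remaining tags.
    public static String stripTags(String html) {
        if (String.isBlank(html)) {
            return '';
        }
        // (?i) ignores case; \s* and /? together cover <br>, <BR/> and <br />.
        String result = html.replaceAll('(?i)<br\\s*/?>', '\n');

        // Same tag expression as in the example above: '<', as few
        // characters as possible, then '>'.
        result = result.replaceAll('<.*?>', '');

        return result.trim();
    }
}
```

For example, HtmlText.stripTags('<p>Hello<BR>World<br />again</p>') should yield the three words separated by newlines. Regular expressions only go so far with markup, so anything more involved than tag removal is probably better handled by a real parser.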





For a complete reference on Java regular expressions, please refer to the Java documentation for the java.util.regex.Pattern class.

19 comments:

  1. I am really struggling to grasp regular expressions. Is it possible to search a string and find a match for something like this:

    From: John Doe [mailto:

    "From:" and "[mailto:" will always be static but "John Doe" could change.

  2. I think this might work.

    From.*\[mailto:

  3. in your HTML stripping example why do you need these 2 lines? Won't the first line suffice?

    string result = html.replaceAll('<br/>', '\n');
    result = result.replaceAll('<br />', '\n');

  4. This way you can convert the HTML break lines into text "newline" character.

  5. You've already replaced all the br tags with "newline" tags in the first statement, right? So why have the second line that makes no change?

  6. The first and second statements have a slight difference, to cover the scenario where the HTML developer has a space between the "BR" and the closing tag.

  7. Oh, wait... there's a space between the r and slash so the replacements aren't the same. NeverMind!

  8. Hi Sam, this might be what I was looking for... we have an Apex class in Salesforce to process an email body into fields, but it only processes plain text emails. We have an HTML email which gives an error. If we use your HTML stripping code, will this convert the HTML to plain text and then process the email correctly?

  9. Sam - thank you so much for posting this! A wealth of information for a novice like me! :)

  10. Two things you may want to consider:
    Prepend your regular expression with (?i) to make the match case-insensitive so to ensure that any html br tags that are in caps are also replaced:
    result = result.replaceAll('(?i)<br/>', '\n');

    To match br tags that are and aren't self closing, and match any arbitrary number of spaces inside the tag:
    result = result.replaceAll('(?i)<br\\s*/?>', '\n');

    Regular expressions are very... exacting. Also, if you want to do anything more complex than this with an html document, you should probably use a language parser rather than regular expressions because regexes can't readily describe a language grammar well enough to eliminate all edge cases.

  11. Sam,
    I have not been successful in using split function on the matcher object as documented here http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html#split(java.lang.CharSequence, int)
    Do you know if there is a workaround?

    My use case is, I have an email chain and am trying to identify the second occurrence of 'From:'. If I had split function then I can split the email chain into two.

    Thanks in adv
    SK

  12. Have you found a way to use the Pattern flags like you can with Java? Every time I try to use a flag like Pattern.MULTILINE I get a "variable does not exist" error.

  13. Hey Sam ,
    Is there anything wrong with the following code segment? I'm always getting "invalid" :-(

    String InputString = 'PO';
    String emailRegex = '[A-Z]';
    Pattern MyPattern = Pattern.compile(emailRegex);
    Matcher MyMatcher = MyPattern.matcher(InputString);
    if (!MyMatcher.matches()) {
    system.debug('-------- invalid');
    }
    else{
    system.debug('-------- valid');
    }

  14. Akhilesh Soni, the Java documentation says about matches():
    "Attempts to match the entire region against the pattern".

    So in your case it's always false, because 'PO' (the whole string) is not a single letter between A-Z.

  15. help in validate a number with single "-"

  16. In addition to this, I've read your other article on GoogleCharts. Thank you very much for sharing your knowledge.

  17. how to avoid names like 'a namika' or 'a......nnu' in text field Can u plz tell me the regex of this.

  18. Hi,

    I want to replace all HTML tags except "A" (hyperlinks) tags. Is it possible? Can you please give the regex for that?

    Thanks
    Hari

  19. I have To append a hardcoded text on an RE how can i do this

    Example

    Test Date 2012-2-3
    Test Date 2011-4-5
    ...

    I have to append "Test Date" in my date RE, how can i accomplish this
