Wednesday, September 15, 2010

Make it CamelCase: removing sequences of capitals leveraging regex lookahead/lookbehind

I ran into a problem today that I don't have to solve very often: converting names that might contain sequences of capitals to CamelCase with no consecutive capitals, in Java. For example:

BEFORE => AFTER
MyGoodName => MyGoodName
MYGoodName => MyGoodName
MyGOODName => MyGoodName
MyGoodNAME => MyGoodName
EndOfIT => EndOfIt
EndOfItALL => EndOfItAll
UUIDIsACOOLType => UuidIsAcoolType

It's easy enough to find such sequences with a regular expression, but how to correctly replace them was slightly less clear. Starting with the easy part, to find such sequences we can use a pattern that says "a capital, followed by one or more capitals, ending with either another capital or the end of the string":

//1+ CAPS preceded by a CAP and ending in either another CAP or end-of-string.
String expr = "[A-Z][A-Z]+([A-Z]|$)";
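
To see what this pattern actually matches, here is a minimal sketch that collects every match for a given name. The class and method names (CapsRunFinder, findRuns) are my own, made up for illustration. Note that each match includes the leading capital and, unless the run sits at the end of the string, the trailing capital too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are assumptions for illustration.
class CapsRunFinder {
    // The pattern from the text: a capital, one or more capitals,
    // then either another capital or end-of-string.
    static final Pattern CAPS_RUN = Pattern.compile("[A-Z][A-Z]+([A-Z]|$)");

    // Returns every substring of input matched by CAPS_RUN, in order.
    static List<String> findRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher matcher = CAPS_RUN.matcher(input);
        while (matcher.find()) {
            runs.add(matcher.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "MyGoodName", "MyGOODName", "EndOfIT" }) {
            System.out.println(s + " -> " + findRuns(s));
        }
    }
}
```

For "MyGOODName" this finds GOODN, and for "EndOfIT" it finds IT; "MyGoodName" has no match at all.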

The middle run of capitals, the [A-Z]+, is the part we want to change to lowercase. So ... how to match and replace? Probably we'll want to put that into a group so we can easily use it in a replace:

//Same pattern, with the middle run of CAPS captured as group 1.
String expr = "[A-Z]([A-Z]+)([A-Z]|$)";

The normal replaceAll style APIs on Matcher do not appear to easily allow you to swap in a modified version of a group. Luckily the appendReplacement and appendTail APIs do allow this. There is a great example in the javadoc.

So ... we should be able to do something like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
 public static void main(String[] argv) {
  //1+ CAPS preceded by a CAP and ending in either another CAP or end-of-string.
  String expr = "[A-Z][A-Z]+[A-Z]|$";
  String[] testStrings = { "MyGoodName", "MYGoodName", "MyGOODName", "MyGoodNAME", "EndOfIT", "EndOfItALL", "UUIDIsACOOLType" };
  
  Pattern pattern = Pattern.compile(expr);
  for (String testString : testStrings) {
   Matcher matcher = pattern.matcher(testString);
   
   StringBuffer sb = new StringBuffer();
   while (matcher.find()) {
    matcher.appendReplacement(sb, matcher.group().toLowerCase());    
   }
   matcher.appendTail(sb);
   System.out.printf("%1$s => %2$s\n", testString, sb.toString());
  }
 }    
}

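Unfortunately this doesn't work: each match includes the leading capital and the trailing capital along with the run between them, so the whole match gets lowercased. (The expression above also lost the parentheses around the final alternation, so |$ is parsed as a separate alternative and a two-capital run such as the IT in EndOfIT is never matched.) We end up with the following output: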

MyGoodName => MyGoodName
MYGoodName => mygoodName
MyGOODName => Mygoodname
MyGoodNAME => MyGoodname
EndOfIT => EndOfIT
EndOfItALL => EndOfItall
UUIDIsACOOLType => uuidisacooltype

We could monkey around with replacing only a specific group based on its start/end indices or some such nonsense, but it would be much nicer to match only the capitals we want to lowercase, with the leading capital and the trailing capital or end-of-string not considered part of the match. Luckily Java supports lookaround in regular expressions. We can revise our program to use lookbehind for the start and lookahead for the end, so that only the desired bit is the match:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
 public static void main(String[] argv) {
  //1+ CAPS preceded by a CAP (lookbehind) and followed by another CAP or end-of-string (lookahead).
  String expr = "(?<=[A-Z])[A-Z]+(?=[A-Z]|$)";
  String[] testStrings = { "MyGoodName", "MYGoodName", "MyGOODName", "MyGoodNAME", "EndOfIT", "EndOfItALL", "UUIDIsACOOLType" };
  
  Pattern pattern = Pattern.compile(expr);
  for (String testString : testStrings) {
   Matcher matcher = pattern.matcher(testString);
   
   StringBuffer sb = new StringBuffer();
   while (matcher.find()) {
    matcher.appendReplacement(sb, matcher.group().toLowerCase());    
   }
   matcher.appendTail(sb);
   System.out.printf("%1$s => %2$s\n", testString, sb.toString());
  }
 }    
}

This finally produces the desired output:

MyGoodName => MyGoodName
MYGoodName => MyGoodName
MyGOODName => MyGoodName
MyGoodNAME => MyGoodName
EndOfIT => EndOfIt
EndOfItALL => EndOfItAll
UUIDIsACOOLType => UuidIsAcoolType
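
To make the zero-width behaviour of the lookaround concrete, here is a small sketch that reports each match of the lookaround pattern together with its start and end offsets. The class and method names (LookaroundSpans, describe) and the text@start-end output format are assumptions of mine, purely for illustration. The point is that the bounding capitals are checked but never consumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and output format are assumptions for illustration.
class LookaroundSpans {
    // Lookbehind and lookahead assert the surrounding capitals without consuming them.
    static final Pattern INNER_CAPS = Pattern.compile("(?<=[A-Z])[A-Z]+(?=[A-Z]|$)");

    // Describes each match as "text@start-end" (end exclusive), separated by spaces.
    static String describe(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = INNER_CAPS.matcher(input);
        while (m.find()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(m.group()).append('@').append(m.start()).append('-').append(m.end());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "MyGOODName", "EndOfIT", "UUIDIsACOOLType" }) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```

For "MyGOODName" the only match is OOD at offsets 3-6: the G before it and the N after it stay outside the match, which is why they keep their case.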

Both lookahead/lookbehind and appendReplacement/appendTail are very handy but, in my experience, relatively rarely used.
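
If you are on Java 9 or later (an assumption; the programs above run on older versions too), the same idea can be wrapped in a small reusable helper using the Matcher.replaceAll overload that takes a function, which does the appendReplacement/appendTail loop for you. This is only a sketch: the class and method names (CamelCaseFixer, fix) are made up, and I use Locale.ROOT so that lowercasing does not depend on the default locale.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are assumptions for illustration.
class CamelCaseFixer {
    // Capitals preceded by a capital and followed by a capital or end-of-string.
    private static final Pattern INNER_CAPS = Pattern.compile("(?<=[A-Z])[A-Z]+(?=[A-Z]|$)");

    // Lowercases the inner capitals of every run of consecutive capitals.
    static String fix(String name) {
        // Matcher.replaceAll(Function<MatchResult, String>) requires Java 9+.
        // The replacement is letters only, so no '$' or '\' escaping is needed.
        return INNER_CAPS.matcher(name)
                .replaceAll(match -> match.group().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] testStrings = { "MyGoodName", "MYGoodName", "MyGOODName", "MyGoodNAME",
                "EndOfIT", "EndOfItALL", "UUIDIsACOOLType" };
        for (String testString : testStrings) {
            System.out.printf("%s => %s%n", testString, fix(testString));
        }
    }
}
```

Compiling the pattern once in a static field avoids recompiling it on every call, which matters if the helper runs over many names.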
