Wednesday, September 15, 2010

Make it CamelCase: removing sequences of capitals leveraging regex lookahead/lookbehind

I ran into an unusual problem today (unusual in that I rarely need to do it): converting names that might contain sequences of capitals to CamelCase with no consecutive capitals, in Java. For example:

BEFORE => AFTER
MyGoodName => MyGoodName
MYGoodName => MyGoodName
MyGOODName => MyGoodName
MyGoodNAME => MyGoodName
EndOfIT => EndOfIt
EndOfItALL => EndOfItAll
UUIDIsACOOLType => UuidIsAcoolType

It's easy enough to find such sequences using a regular expression, but how to correctly replace them was slightly less clear. So, starting with the easy part, to find such sequences we can use a pattern that says "a capital, followed by one or more capitals, ending with either another capital or the end of the string":

//1+ CAPS preceded by a CAP and ending in either another CAP or end-of-string.
String expr = "[A-Z][A-Z]+([A-Z]|$)";

The middle set of capitals - the [A-Z]+ - are the ones we'd want to change to lowercase. So ... how to match and replace? Probably we'll want to put that into a group so we can easily use it in a replace:

//Same pattern, with the middle run of CAPS captured as group 1.
String expr = "[A-Z]([A-Z]+)([A-Z]|$)";
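
To see what that pattern actually grabs, here is a minimal sketch, assuming the middle run is wrapped in a capturing group. The class name GroupDemo and the helper describeMatches are made up for illustration; they are not part of the program below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: lists each match as "wholeMatch/group1".
    static List<String> describeMatches(String input) {
        // Group 1 is the middle run of capitals; group 2 is the trailing capital (or empty at end-of-string).
        Pattern pattern = Pattern.compile("[A-Z]([A-Z]+)([A-Z]|$)");
        Matcher matcher = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group() + "/" + matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("MyGOODName")); // prints [GOODN/OOD]
        System.out.println(describeMatches("EndOfIT"));    // prints [IT/T]
        System.out.println(describeMatches("MyGoodName")); // prints []
    }
}
```

The whole match still includes the leading and trailing capitals; only group 1 is the part we'd want to lowercase.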

The normal replaceAll style APIs on Matcher do not appear to easily allow you to swap in a modified version of a group. Luckily the appendReplacement and appendTail APIs do allow this. There is a great example in the javadoc.

So ... we should be able to do something like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
 public static void main(String[] argv) {
  //First attempt: a run of 3+ CAPS. Note there are no parentheses here, so "|$" is a separate alternative matching the empty end-of-string.
  String expr = "[A-Z][A-Z]+[A-Z]|$";
  String[] testStrings = { "MyGoodName", "MYGoodName", "MyGOODName", "MyGoodNAME", "EndOfIT", "EndOfItALL", "UUIDIsACOOLType" };
  
  Pattern pattern = Pattern.compile(expr);
  for (String testString : testStrings) {
   Matcher matcher = pattern.matcher(testString);
   
   StringBuffer sb = new StringBuffer();
   while (matcher.find()) {
    matcher.appendReplacement(sb, matcher.group().toLowerCase());    
   }
   matcher.appendTail(sb);
   System.out.printf("%1$s => %2$s\n", testString, sb.toString());
  }
 }    
}

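Unfortunately this doesn't work: each match includes the leading capital and the trailing capital, so those get lowercased along with the middle run. (Also, without parentheses the | splits the whole pattern into "three or more capitals" or "end of string", which is why EndOfIT comes through untouched.) We end up with the following output: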

MyGoodName => MyGoodName
MYGoodName => mygoodName
MyGOODName => Mygoodname
MyGoodNAME => MyGoodname
EndOfIT => EndOfIT
EndOfItALL => EndOfItall
UUIDIsACOOLType => uuidisacooltype

We could monkey around with trying to replace only a specific group based on its start/end indices or some such nonsense, but it would really be much nicer to match only the consecutive caps, with the leading cap and trailing cap or end-of-string not actually being considered a part of the match. Luckily Java supports lookaround in regular expressions. We can revise our program to use lookbehind for the start and lookahead for the end, so that only the desired bit is the match:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
 public static void main(String[] argv) {
  //1+ CAPS preceded by a CAP (lookbehind) and followed by another CAP or end-of-string (lookahead); only the middle run is matched.
  String expr = "(?<=[A-Z])[A-Z]+(?=[A-Z]|$)";
  String[] testStrings = { "MyGoodName", "MYGoodName", "MyGOODName", "MyGoodNAME", "EndOfIT", "EndOfItALL", "UUIDIsACOOLType" };
  
  Pattern pattern = Pattern.compile(expr);
  for (String testString : testStrings) {
   Matcher matcher = pattern.matcher(testString);
   
   StringBuffer sb = new StringBuffer();
   while (matcher.find()) {
    matcher.appendReplacement(sb, matcher.group().toLowerCase());    
   }
   matcher.appendTail(sb);
   System.out.printf("%1$s => %2$s\n", testString, sb.toString());
  }
 }    
}

This will finally produce the desired output:

MyGoodName => MyGoodName
MYGoodName => MyGoodName
MyGOODName => MyGoodName
MyGoodNAME => MyGoodName
EndOfIT => EndOfIt
EndOfItALL => EndOfItAll
UUIDIsACOOLType => UuidIsAcoolType
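
The reason this works is that lookbehind and lookahead are zero-width: they check the neighbouring characters without consuming them. A small sketch that prints each match with its start/end offsets makes that visible. LookaroundSpans and spans are names I made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundSpans {
    // Hypothetical helper: lists each match as "text@start-end" (end is exclusive).
    static List<String> spans(String input) {
        Matcher matcher = Pattern.compile("(?<=[A-Z])[A-Z]+(?=[A-Z]|$)").matcher(input);
        List<String> out = new ArrayList<>();
        while (matcher.find()) {
            out.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // The leading "U" (index 0) and the "I" of "Is" (index 4) are checked but not matched.
        System.out.println(spans("UUIDIsACOOLType")); // prints [UID@1-4, COOL@7-11]
    }
}
```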

Both the lookahead/behind and the appendReplacement/Tail are very handy but relatively rarely used in my experience.
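
For anyone on Java 9 or later, Matcher also has a replaceAll overload that takes a function of the MatchResult, which can stand in for the appendReplacement/appendTail loop. The following is only a sketch under that assumption, not what I used above; the class name CamelCaseFixer, the method fix, and the use of Locale.ROOT are my own choices for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class CamelCaseFixer {
    // Same lookaround pattern as above, compiled once.
    private static final Pattern INNER_CAPS = Pattern.compile("(?<=[A-Z])[A-Z]+(?=[A-Z]|$)");

    // Hypothetical helper: lowercases every run of capitals that follows a capital.
    static String fix(String name) {
        // Requires Java 9+. The function's result is treated as a replacement string,
        // which is safe here because a match contains only the letters A-Z.
        return INNER_CAPS.matcher(name)
                .replaceAll(match -> match.group().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(fix("UUIDIsACOOLType")); // prints UuidIsAcoolType
    }
}
```

Locale.ROOT just keeps the lowercasing independent of the machine's default locale.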
