Friday, February 17, 2012

Moving from Google Translate API to Microsoft Translate API in Scala

Google Translate used to be God's gift to developers who want to verify their internationalization works. It made localizing to some random locale for test purposes downright trivial. And then the bastards went and deprecated it (http://googlecode.blogspot.com/2011/05/spring-cleaning-for-some-of-our-apis.html), ultimately making it pay to play.

The rates for Google Translate are so low that we are unlikely to justify switching to Microsoft Translate based on cost, but we love "free", so we'll spend a few expensive hours of developer time on it anyway. M$ Translate is free for your first 2M characters, and you can pay for more.

The first thing one notices upon trying to call a Microsoft web service is that it isn't as easy as you'd like. Going from reading the http://api.microsofttranslator.com/V2/Http.svc/Translate API to getting a successful translation call through from code took WAY longer than it did with Google Translate (or other APIs) and had more bumps in the road. I'm sure others have had different experiences, but I had programmatic access to Google Translate working maybe 30-60 minutes after I decided to do it (it was a while ago), versus 2-3 hours to get Microsoft Translate working.

Trying to get Microsoft Translate working I ran into the following complications:
  1. A multi-step registration process yielding numerous constants you send in to their APIs
    1. https://code.google.com/apis/console beats the hell out of https://datamarket.azure.com/account (the Microsoft equivalent as far as I can tell)
  2. An extra http call to get a token that you have to modify before you send it back to them (addRequestHeader("Authorization", "Bearer " + tok.access_token))
    1. Bonus points for inconsistent description of how this worked, although I believe this is now fixed
  3. Inaccurate documentation of arguments
    1. I believe this is now fixed
  4. Unhelpful error messages
  5. Inconsistent documentation of how to send in authorization data 
    1. I believe this is now fixed
  6. ISO 639-1 two-letter language codes for every language except Chinese.
    1. Chinese requires use of zh-CHS or zh-CHT to distinguish traditional vs simplified. Apparently having "zh" default to one or the other (probably simplified) is less trouble than having this be the exception case to how everything else works.
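
To make the Chinese exception in the last item concrete, here is a minimal sketch of a code-normalizing helper. The object name MsLanguageCodes and the choice to treat plain "zh" as Simplified Chinese are assumptions for illustration (the code later in this post makes the same choice, but only for the target language); any other code is passed through untouched.

```scala
import java.util.Locale

object MsLanguageCodes {
  // Assumption: plain "zh" means Simplified Chinese. Swap in "zh-CHT" for Traditional.
  private val overrides = Map("zh" -> "zh-CHS")

  /** Maps a two-letter code to the code Microsoft Translate expects; unknown codes pass through unchanged. */
  def normalize(lang: String): String =
    overrides.getOrElse(lang.toLowerCase(Locale.ROOT), lang)
}
```

A helper like this could be applied to both the source and the target language, so that translating from "zh" gets the same treatment as translating to it.
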
In case this is useful to someone else, here is the code (Scala 2.9.1.final). First up, the API our clients will call into:

class Translator(val client: HttpClient) extends Log {
  //We kept Google around just in case we decide to pay for the service one day
  //Services are tried in order; the list is never reassigned, so a val is enough
  private val translationServices = List(new Google(client), new Microsoft(client))

  def this() = this(new HttpClient())

  def apply(text: String, fromLang: String, toLang: String): String = {
    if (fromLang != toLang && StringUtils.isNotBlank(text))
      translate(text, fromLang, toLang)
    else
      text
  }

  private def translate(text: String, fromLang: String, toLang: String): String = {  
    for (svc <- translationServices) {
      try {
        val tr = svc(text, fromLang, toLang)
        if (StringUtils.isNotBlank(tr)) {
          return tr
        }
      } catch {
        case e: Exception =>
          logger.warn("Translation failed using " + svc.getClass().getSimpleName() + ": " + e.getMessage() + ", moving on...")
      }
    }
    
    return ""
  }
}

Consumers will call into the Translator using code similar to:

  //translate Hello, World from English to Chinese
  val tr = new Translator()
  val translated = tr("Hello, World", "en", "zh")
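
The try-each-service-in-order logic in Translator can also be written without the explicit loop and return. The following is only a sketch of that idea, not the code we run: it assumes Scala 2.10 or later for scala.util.Try (this post targets 2.9.1), and the names FallbackSketch, Service and firstSuccessful are made up for the example.

```scala
import scala.util.Try

object FallbackSketch {
  // Hypothetical stand-in for TranslationService: (text, fromLang, toLang) => translation
  type Service = (String, String, String) => String

  /** Returns the first non-blank translation, or "" if every service fails or returns blank. */
  def firstSuccessful(services: Seq[Service], text: String, from: String, to: String): String =
    services.iterator                                   // lazy: later services run only if needed
      .map(svc => Try(svc(text, from, to)).toOption)    // a failing service becomes None
      .collectFirst { case Some(tr) if tr != null && tr.trim.nonEmpty => tr }
      .getOrElse("")
}
```

Because the iterator is lazy, a later service is only called when the earlier ones have failed or returned a blank result, which matches the behavior of the loop above. Unlike the loop, this version does not log the failure.
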

The interesting part is of course the actual Microsoft implementation:

/**
 * The parent TranslationService just defines def apply(text: String, fromLang: String, toLang: String): String
 */
class Microsoft(client: HttpClient) extends TranslationService(client) with Log {

  private val tokenUri = "https://datamarket.accesscontrol.windows.net/v2/OAuth2-13"
  private val translateUri = "http://api.microsofttranslator.com/V2/Http.svc/Translate"
  private val encoding = "ASCII"

  private val appKey = Map("client_id" -> "THE NAME OF YOUR APP", "client_secret" -> "YOUR CLIENT SECRET")

  private var token = new MsAccessToken

  def this() = this(new HttpClient())

  /**
   * Ref http://msdn.microsoft.com/en-us/library/ff512421.aspx
   */
  override def apply(text: String, fromLang: String, toLang: String): String = {
    
    /**
     * Always try to re-use an existing token
     */
    val firstTry:Option[String] = try {
      Some(callTranslate(token, text, fromLang, toLang))
    } catch {
      case e: Exception =>
        logger.info("Failed to re-use token, will retry with a new one. " + e.getMessage())
        None
    }
    
    /**
     * If we didn't get it using our old token try try again.
     * 99% of the time we do a bunch in a row and it works first time; occasionally we end up
     * needing a new key.
     * Code in block won't run unless firstTry is None.
     */
    val response = firstTry getOrElse {  
      this.token = requestAccessToken()
      callTranslate(token, text, fromLang, toLang)
    }
    

    //response is similar to: <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">Hallo Welt</string>
    val translation = StringUtils.substringAfter(StringUtils.substringBeforeLast(response, "</string>"), ">")
    translation
  }

  private def callTranslate(tok: MsAccessToken, text: String, fromLang: String, toLang: String) = {
    val get = new GetMethod(translateUri)

    //Thanks MSFT, it's awesome that the language codes are *almost* ISO 639...
    //We need to specify our type of Chinese, http://www.emreakkas.com/internationalization/microsoft-translator-api-languages-list-language-codes-and-names
    val adjustedToLang = if (toLang.equalsIgnoreCase("zh")) "zh-CHS" else toLang

    val queryPairs = Array(
      new NameValuePair("appId", ""),
      new NameValuePair("text", text),
      new NameValuePair("from", fromLang),
      new NameValuePair("to", adjustedToLang))
    get.setQueryString(queryPairs)

    /**
     * http://msdn.microsoft.com/en-us/library/hh454950.aspx
     */
    get.addRequestHeader("Authorization", "Bearer " + tok.access_token)

    //Release the connection whether or not the call succeeds, as requestAccessToken() does
    try {
      val sc = client executeMethod get
      val response = get.getResponseBodyAsString()
      if (sc != HttpStatus.SC_OK) {
        throw new IllegalArgumentException("Error translating; Microsoft translate request '"
          + translateUri + "?" + get.getQueryString()
          + "' failed with unexpected code " + sc + ", response: " + response)
      }
      response
    } finally {
      get.releaseConnection()
    }
  }

  /**
   * Ref http://msdn.microsoft.com/en-us/library/hh454950.aspx
   */
  def requestAccessToken(): MsAccessToken = {

    val post = new PostMethod(tokenUri)
    post.setParameter("grant_type", "client_credentials")
    post.setParameter("client_id", appKey("client_id"))
    post.setParameter("client_secret", appKey("client_secret"))
    post.setParameter("scope", "http://api.microsofttranslator.com")

    val rawResponse = try {
      val sc = client executeMethod post
      val response = post getResponseBodyAsString ()
      if (sc != HttpStatus.SC_OK) {
        throw new IllegalArgumentException("Error translating; Microsoft access token request failed with unexpected code " + sc + ", response: " + response)
      }
      response
    } finally {
      post releaseConnection
    }

    val tok = Json.fromJson[MsAccessToken](rawResponse, classOf[MsAccessToken])

    tok
  }
}
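
One caveat about the substring extraction in apply: it returns whatever sits between the tags verbatim, so if the translated text contains characters such as & or <, they will presumably come back still escaped as XML entities. A minimal sketch of an alternative that uses the JDK's built-in XML parser is below; the object and method names are hypothetical, and it assumes the response is the single string element shown in the comment above.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource

object MsResponseParser {
  /** Returns the text content of the root element, with XML entities decoded. */
  def extractTranslation(xml: String): String = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new InputSource(new StringReader(xml)))
    doc.getDocumentElement.getTextContent
  }
}
```

The trade-off is a little more ceremony per call in exchange for not having to unescape entities by hand.
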

The MsAccessToken is a rich and exciting class:
  /**
   * Ref http://msdn.microsoft.com/en-us/library/hh454950.aspx
   */
  class MsAccessToken(var access_token: String, var token_type: String, var expires_in: String, var scope: String) {
    def this() = this(null, null, null, null)
  }
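
The code above simply tries the old token and asks for a new one on failure. Since the token response carries an expires_in field (a lifetime in seconds, per OAuth 2), another option is to check freshness before calling. This is only a sketch of that idea under assumed names (TokenExpiry, TimedToken, isFresh and parseExpiresIn are all hypothetical); it is not what our code does.

```scala
object TokenExpiry {
  /** Hypothetical wrapper that remembers when a token was obtained. */
  final case class TimedToken(accessToken: String, expiresInSeconds: Long, obtainedAtMillis: Long) {
    // Treat the token as stale a little early so an in-flight request does not race the expiry.
    def isFresh(nowMillis: Long, safetyMarginSeconds: Long = 30): Boolean =
      nowMillis < obtainedAtMillis + (expiresInSeconds - safetyMarginSeconds) * 1000L
  }

  /** MsAccessToken holds expires_in as a String, so parse it defensively. */
  def parseExpiresIn(raw: String): Option[Long] =
    try Some(raw.trim.toLong) catch { case _: Exception => None }
}
```

A caller would request a new token whenever isFresh(System.currentTimeMillis()) is false, and could still keep the retry-on-failure path as a safety net.
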


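A few third-party libraries are in play here. For HTTP we are using Apache HttpClient (the 3.x commons-httpclient API, which is where GetMethod and PostMethod come from), and StringUtils is from Apache Commons Lang. For JSON we are using Google's excellent Gson library, with a simple implementation of 'using' to make working with java.io cleaner: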

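import java.io.{OutputStream, OutputStreamWriter}
import java.lang.reflect.Type
import com.google.gson.Gson

object Json {
 def writeJson(something: Any): String = new Gson().toJson(something)

 //Write as UTF-8 explicitly rather than relying on the platform default charset
 def writeJson(something: Any, os: OutputStream): Unit
  = using(new OutputStreamWriter(os, "UTF-8")) { osw => new Gson().toJson(something, osw) }

 def fromJson[T](json: String, graphRoot: Type): T
  = new Gson().fromJson(json, graphRoot)
}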

The using() function is in another class; it looks like this (ref http://whileonefork.blogspot.com/2011/03/c-using-is-loan-pattern-in-scala.html):
 def using[T <: {def close(): Unit}, R](c: T)(action: T => R): R = {
  try {
   action(c)
  } finally {
   if (null != c) c.close
  }
 }
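
For anyone who has not seen the loan pattern in use, here is a small self-contained sketch. Note that it is a variant, not the function above: it bounds T by java.io.Closeable instead of using a structural type, which avoids reflective calls but only works for types that implement Closeable. The UsingSketch and render names are made up for the example.

```scala
import java.io.{Closeable, StringWriter}

object UsingSketch {
  // Variant of using() bounded by java.io.Closeable rather than a structural type
  def using[T <: Closeable, R](c: T)(action: T => R): R =
    try action(c) finally if (c != null) c.close()

  /** Borrows a StringWriter, writes a greeting, and returns the result; the writer is closed afterwards. */
  def render(name: String): String =
    using(new StringWriter()) { w =>
      w.write("Hello, ")
      w.write(name)
      w.toString
    }
}
```

The resource is closed in the finally block even if the action throws, which is the whole point of the pattern.
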

And with that our Scala calling of M$ Translate is complete!