Handling Unicode Characters in Android / Kotlin

Well well well… I only occasionally dabble in Android but when I do, I blog about it 😝.

Context

I recently hit an issue when loading HTML into a WebView for one of my Android apps: certain Unicode characters were causing the HTML to break and only partially render.

For further context, the HTML I received back was from an RSS feed (so not always in the best shape to start with). Then, I needed to manipulate the markup in order to inject some additional placeholders after a certain number of paragraphs (I won’t bore you with the details behind this). The manipulation didn’t actually cause the issue; I just thought it was worth pointing out.

Anyway, by the time the HTML came to be rendered in the WebView, I’d occasionally be met with truncated content.

State of that HTML

The problem with the content in the RSS feed was that it originated from a WordPress site. Now, nothing against that (this site is WordPress), but it’s a fairly large site with multiple contributors, all of whom have their own styles and even tools for writing content.

I’d say about 90% of the time the HTML comes back nice and clean; I can just drop it into a WebView and everything is taken care of.

However, on some occasions, when an author uses “double quotes” or, more particularly, ‘single quotes’, I don’t get a nice escaped character that our WebView could understand. Instead I get back the Unicode equivalent, which in turn breaks the rendering 😡.

Quick fix, not really the best option

After I found the offending character, a simple and quick fix was to perform a String.replace() on the Unicode value. Yeah, it felt a little dirty, but it was one of those “needs must” moments.

All was going well until about 2 days later, when my problem arose again… this time with another variation of the quote. At this point I did a little digging and found out there were 16 variations of double and single quotes:

  • left single quotation mark
  • right single quotation mark
  • single low-9 quotation mark
  • single high-reversed-9 quotation mark
  • left double quotation mark
  • right double quotation mark
  • double low-9 quotation mark
  • double high-reversed-9 quotation mark
  • heavy single turned comma quotation mark ornament
  • heavy single comma quotation mark ornament
  • heavy double turned comma quotation mark ornament
  • heavy double comma quotation mark ornament
  • reversed double prime quotation mark
  • double prime quotation mark
  • low double prime quotation mark
  • fullwidth quotation mark

Wow… I’m gonna need more than a String.replace() to handle this…
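To give a sense of what the replace-everything route would involve, here is a minimal sketch that maps the sixteen marks listed above to plain ASCII quotes. The names quoteReplacements and normalizeQuotes are made up for illustration, and the sketch assumes the quotes arrive as literal characters; if they arrive as numeric references instead, the keys would need to be the encoded strings.

```kotlin
// Hypothetical lookup table (not from the original post): the sixteen
// quotation marks listed above, keyed by code point, mapped to ASCII quotes.
val quoteReplacements: Map<Char, Char> = mapOf(
    '\u2018' to '\'', // left single quotation mark
    '\u2019' to '\'', // right single quotation mark
    '\u201A' to '\'', // single low-9 quotation mark
    '\u201B' to '\'', // single high-reversed-9 quotation mark
    '\u201C' to '"',  // left double quotation mark
    '\u201D' to '"',  // right double quotation mark
    '\u201E' to '"',  // double low-9 quotation mark
    '\u201F' to '"',  // double high-reversed-9 quotation mark
    '\u275B' to '\'', // heavy single turned comma quotation mark ornament
    '\u275C' to '\'', // heavy single comma quotation mark ornament
    '\u275D' to '"',  // heavy double turned comma quotation mark ornament
    '\u275E' to '"',  // heavy double comma quotation mark ornament
    '\u301D' to '"',  // reversed double prime quotation mark
    '\u301E' to '"',  // double prime quotation mark
    '\u301F' to '"',  // low double prime quotation mark
    '\uFF02' to '"'   // fullwidth quotation mark
)

// Swap every listed quotation mark for its ASCII counterpart and leave
// all other characters as they are.
fun normalizeQuotes(text: String): String =
    text.map { quoteReplacements[it] ?: it }.joinToString("")
```

It works, but it only covers quotes, and every newly discovered character means another entry in the table.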

Options

I was convinced there must be an API in Java/Kotlin, or even the Android framework, that would just handle this for me. After copious amounts of reading through documentation (and surfing Stack Overflow), the only suggestions I kept finding were variations of the following:

URLDecoder.decode(newContent, "UTF-8")
Html.fromHtml()

Whilst each served a particular purpose, they didn’t decode my Unicode characters in already pre-formatted HTML. With this, the only thing left to do was to get creative… up steps my good old friend Regex.
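As a quick illustration of why URL decoding is a dead end here: URLDecoder only understands percent-escapes (and '+'), so a numeric character reference passes straight through. The wrapper name urlDecodeSample is hypothetical and exists only to make the sketch callable.

```kotlin
import java.net.URLDecoder

// Hypothetical wrapper for illustration: URL-decodes the input as UTF-8.
// Percent-escapes such as %20 are decoded; HTML references such as
// &#8217; are not part of URL encoding and are left untouched.
fun urlDecodeSample(input: String): String = URLDecoder.decode(input, "UTF-8")
```

Calling urlDecodeSample("It&#8217;s%20here") turns %20 into a space but hands the &#8217; back unchanged.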

The Solution

So when broken down, this was quite an easy function to build… I basically needed a list of all the encoded Unicode characters in my HTML. I don’t care what their value or decoded type is; I’m going to tackle all of them at once.

I’ll start by grabbing all instances of &#nnnnn and adding them to a list:

val regex = Regex("&\\B#([a-z0-9]{2,})(?![~!@#\$%^&*()=+_`\\-\\|\\/'\\[\\]\\{\\}]|[?.,]*\\w)")
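To see what that pattern actually captures, here is a small sketch that runs it over a sample string and collects the first capture group. The names entityRegex and findEntityCodes are hypothetical; the pattern itself is copied from the line above.

```kotlin
// Same pattern as above; group 1 holds whatever follows "&#".
val entityRegex = Regex("&\\B#([a-z0-9]{2,})(?![~!@#\$%^&*()=+_`\\-\\|\\/'\\[\\]\\{\\}]|[?.,]*\\w)")

// Hypothetical helper: returns the captured codes in document order.
fun findEntityCodes(html: String): List<String> =
    entityRegex.findAll(html).map { it.groupValues[1] }.toList()
```

For "<p>It&#8217;s a &#8220;test&#8221; &amp; more</p>" this yields 8217, 8220 and 8221, and skips the named &amp;. One thing to be aware of: the character class is lowercase-only, so a reference written with uppercase hex digits, such as &#x201C;, is not captured.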

Once in a list, it’s a simple case of utilising the previously disregarded Html.fromHtml() function and decoding each one back into the HTML.

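     import android.text.Html

     // Decodes numeric character references (e.g. &#8217;) in already-formatted HTML.
     private fun parseUnicode(content: String): String {

         // Capture the code that follows "&#" for every reference in the content.
         val regex = Regex("&\\B#([a-z0-9]{2,})(?![~!@#\$%^&*()=+_`\\-\\|\\/'\\[\\]\\{\\}]|[?.,]*\\w)")
         val unicodeList = regex.findAll(content).map { it.groupValues[1] }.distinct()

         var newContent = content
         for (uc in unicodeList) {
             // Rebuild the full reference, decode it, and swap it in everywhere.
             val encodedChar = "&#$uc;"
             // Html.fromHtml(String) is deprecated from API 24; the overload taking
             // Html.FROM_HTML_MODE_LEGACY is the replacement on newer API levels.
             val decodedChar = Html.fromHtml(encodedChar).toString()
             newContent = newContent.replace(encodedChar, decodedChar)
         }
         return newContent

     }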
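If Html.fromHtml() is not available (in a plain JVM unit test, for example), the same idea can be sketched with the standard library alone. This is a swapped-in alternative, not the function above: decodeNumericEntities and numericEntity are hypothetical names, and unlike the original pattern this one requires the trailing semicolon and accepts uppercase hex digits.

```kotlin
// Hypothetical alternative (not from the original post): matches decimal
// (&#8217;) and hexadecimal (&#x2019;) numeric character references.
val numericEntity = Regex("&#([xX][0-9a-fA-F]+|[0-9]+);")

fun decodeNumericEntities(content: String): String =
    numericEntity.replace(content) { match ->
        val raw = match.groupValues[1]
        // Parse the code point as hex when prefixed with x/X, else as decimal.
        val codePoint = if (raw[0] == 'x' || raw[0] == 'X') {
            raw.substring(1).toIntOrNull(16)
        } else {
            raw.toIntOrNull()
        }
        if (codePoint != null && Character.isValidCodePoint(codePoint)) {
            StringBuilder().appendCodePoint(codePoint).toString()
        } else {
            match.value // leave anything unparseable untouched
        }
    }
```

Either way, keep in mind that decoding a reference such as &#60; produces a literal "<", which can change the markup itself, so it may be worth skipping markup-significant characters.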

Final note

Sometimes code is dirty, and I don’t mean cutting corners to fix that production bug quicker; sometimes limited capabilities or APIs force us to be a little ‘old skool’ in the way we write our functions.

But don’t make the mistake of missing the bigger picture just to get that quick fix out; sometimes a little forward thinking can result in a more strategic and robust solution.

Enjoy

C. 🥃