How to count characters in a JavaScript string

21/02/2019

By: Andres Castillo Ormaechea

When we want to count the characters in a JavaScript string, we can use the .length property as follows:

>'abc'.length
3
>'four'.length
4
>'hello'.length
5

And voilà, everything seems great, no? We can output the number of characters our string has. Nevertheless, that is not 100% true.

What is the problem?

We are counting UTF-16 code units and not characters. Let's illustrate it with an example:

>'💩'.length
2
> '𝐀'.length
2
> '再'.length
2
>'👩‍❤️‍💋‍👩'.length
11

The .length property counts UTF-16 code units. Some characters are composed of two or more code units, for example emojis, mathematical bold letters, and characters from some non-Latin scripts.

What are code units and code points?

To be on the same page, let's go over a few concepts before we dive into code units:

Graphemes are what a reader perceives as a single character. A grapheme can be composed of multiple symbols. For example, the grapheme 👩‍❤️‍💋‍👩 is composed of four (4) symbols, which are 👩 + ❤ + 💋 + 👩.

Symbols are the single unit of information that humans can understand in readable text. Each symbol corresponds to one Unicode code point, which JavaScript stores as one or two code units. For example, the symbol pile of poo 💩 is composed of two (2) code units, which are \uD83D + \uDCA9. The symbol 再 is composed of two (2) code units, \uD87E + \uDC15. As a last example, the bold capital letter 𝐀 is composed of two (2) code units, \uD835 + \uDC00.

  • '💩' === '\uD83D' + '\uDCA9'
  • '再' === '\uD87E' + '\uDC15'
  • '𝐀' === '\uD835' + '\uDC00'

Code units are the single unit of information that JavaScript strings are made of. JavaScript strings are encoded as UTF-16, so each code unit is a 16-bit value. A code point is a single value in the Unicode standard. Code points above U+FFFF do not fit in one code unit and are stored as a pair of code units called a surrogate pair. For example, the computer stores \uD835 + \uDC00, which is the code point U+1D400 and what we humans understand as 𝐀.
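
To see both levels from code, here is a minimal sketch. The helper names toCodeUnits and toCodePoints are hypothetical, made up for this illustration; charCodeAt and codePointAt are standard String methods.

```javascript
// Hypothetical helpers for illustration; the names are not from any library.

// Lists the UTF-16 code units of a string. This is what .length counts.
function toCodeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    const hex = str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0');
    units.push('\\u' + hex);
  }
  return units;
}

// Lists the Unicode code points of a string. Spreading a string iterates
// by code point, so a surrogate pair comes out as a single element.
function toCodePoints(str) {
  return [...str].map((symbol) => {
    const hex = symbol.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
    return 'U+' + hex;
  });
}

const boldA = '\uD835\uDC00'; // the bold capital A used as an example in this article

console.log(toCodeUnits(boldA).join(' '));  // \uD835 \uDC00
console.log(toCodePoints(boldA).join(' ')); // U+1D400
```

The two helpers return different lengths for the same string, which is exactly the gap between what .length reports and what a reader sees.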

Remember '👩‍❤️‍💋‍👩'.length === 11? Let's make a diagram that deconstructs the 👩‍❤️‍💋‍👩 grapheme into symbols and code units:


        What humans see             | What computers see
------------------------------------------------------------------------------
Graphemes       |Symbols            |  Code units
------------------------------------------------------------------------------
 👩‍❤️‍💋‍👩              | 👩                |  1. \uD83D
                |                   |  2. \uDC69
------------------------------------------------------------------------------
                |                   |  3. \u200D (zero width joiner)
------------------------------------------------------------------------------
                | ❤                 |  4. \u2764
------------------------------------------------------------------------------
                |                   |  5. \uFE0F (variation selector-16, emoji presentation)
------------------------------------------------------------------------------
                |                   |  6. \u200D (zero width joiner)
------------------------------------------------------------------------------
                | 💋                |  7. \uD83D
                |                   |  8. \uDC8B
------------------------------------------------------------------------------
                |                   |  9. \u200D (zero width joiner)
------------------------------------------------------------------------------
                | 👩                |  10. \uD83D
                |                   |  11. \uDC69
------------------------------------------------------------------------------
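
The same breakdown can be checked from code. This is a small sketch that builds the grapheme from escape sequences so each part stays visible. Splitting on the zero width joiner is only a demonstration trick for this one string, not a general way to find symbols.

```javascript
// The kiss emoji from the diagram, written as escapes:
// woman, ZWJ, heart + variation selector-16, ZWJ, kiss mark, ZWJ, woman
const ZWJ = '\u200D';
const kiss = ['\u{1F469}', '\u2764\uFE0F', '\u{1F48B}', '\u{1F469}'].join(ZWJ);

// .length counts UTF-16 code units, matching rows 1 to 11 of the diagram.
console.log(kiss.length); // 11

// Spreading iterates by code point, so each surrogate pair counts once.
console.log([...kiss].length); // 8

// Splitting on the joiner recovers the four visible symbols.
console.log(kiss.split(ZWJ).length); // 4
```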
                

Let's do the same with the 💩 emoji:


        What humans see             | What computers see
------------------------------------------------------------------------------
Graphemes       |Symbols            |  Code units
------------------------------------------------------------------------------
                | 💩                |  1. \uD83D
                |                   |  2. \uDCA9
------------------------------------------------------------------------------
                

And the 再 symbol:


        What humans see             | What computers see
------------------------------------------------------------------------------
Graphemes       |Symbols            |  Code units
------------------------------------------------------------------------------
                | 再                |  1. \uD87E
                |                   |  2. \uDC15
------------------------------------------------------------------------------
                

Last but not least, the bold capital 𝐀:


        What humans see             | What computers see
------------------------------------------------------------------------------
Graphemes       |Symbols            |  Code units
------------------------------------------------------------------------------
                | 𝐀                 |  1. \uD835
                |                   |  2. \uDC00
------------------------------------------------------------------------------
                

How can we fix this?

The first easy and straightforward approach is to use Array.from, which splits the string into an array of symbols (one element per Unicode code point), and then read the length of that array.

> Array.from('𝐀').length
1
> Array.from('💩').length
1
> Array.from('再').length
1

This approach covers most readable text and most non-Latin scripts. It does not cover emoji sequences or characters built with combining marks, because those graphemes are made of several symbols.
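
As a quick check of where this approach stops working, the sketch below wraps Array.from in a helper and runs it on strings whose graphemes are made of several symbols. The name countSymbols is hypothetical and used only here.

```javascript
// Hypothetical helper: counts Unicode code points (symbols), not graphemes.
function countSymbols(str) {
  return Array.from(str).length;
}

// One symbol stored as a surrogate pair.
console.log(countSymbols('\u{1F4A9}')); // 1

// "e" followed by a combining acute accent renders as one character
// but is made of two symbols.
console.log(countSymbols('e\u0301')); // 2

// The kiss emoji sequence is one grapheme made of eight symbols
// (four visible ones, three joiners, and one variation selector).
console.log(countSymbols('\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}')); // 8
```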

To go further down the rabbit hole and count emoji sequences, diacritics, and zero-width characters correctly, there are some npm libraries that can help.
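
Besides npm libraries, recent JavaScript runtimes include Intl.Segmenter, a built-in that splits text into graphemes. The sketch below swaps it in for a library and assumes the runtime supports it (older browsers and older Node versions do not). The name countGraphemes and the default locale 'en' are choices made for this illustration.

```javascript
// Hypothetical helper: counts user-perceived characters (graphemes).
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  // segment() returns an iterable with one entry per grapheme.
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

console.log(countGraphemes('abc')); // 3
console.log(countGraphemes('e\u0301')); // 1
console.log(countGraphemes('\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}')); // 1
```

Creating the segmenter inside the function keeps the sketch short. Code that counts many strings would likely create it once and reuse it.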

Why is this important?

Knowing how computers and humans each understand a single unit of information, and how to go back and forth between these two worlds, is important before diving into more complex problems.