How to count characters in a JavaScript string
21/02/2019
By: Andres Castillo Ormaechea
When we want to count characters in a JavaScript string, we can use the .length property as follows:
>'abc'.length
3
>'four'.length
4
>'hello'.length
5
And voilà, everything seems great, no? We can output the number of characters our string has. However, that is not 100% true.
What is the problem?
The .length property counts UTF-16 code units, not characters. Let's illustrate it with an example:
> '💩'.length
2
> '𝐀'.length
2
> '再'.length
2
> '👩‍❤️‍💋‍👩'.length
11
We are counting UTF-16 code units. There are cases where a character is composed of two or more code units, such as emojis, bold (mathematical) characters, and some foreign character sets.
What are code units?
To be on the same page, let's go over a few concepts before we dive into code units:
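Graphemes are what humans perceive as a single character, and they can be composed of multiple symbols. For example, the grapheme 👩‍❤️‍💋‍👩 is composed of four (4) visible symbols, which are 👩 + ❤ + 💋 + 👩, held together by invisible joiner characters.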
Symbols are the single units of information that humans can understand in readable text. Each symbol in these examples corresponds to one Unicode code point, and a symbol can be stored as one or more UTF-16 code units. For example, the symbol pile of poo 💩 is composed of two (2) code units, which are \uD83D + \uDCA9. The symbol 再 (U+2F815) is composed of two (2) code units, \uD87E + \uDC15. As a last example, the bold capital letter 𝐀 is composed of two code units, \uD835 + \uDC00:
- '💩' === '\uD83D' + '\uDCA9'
- '再' === '\uD87E' + '\uDC15'
- '𝐀' === '\uD835' + '\uDC00'
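Code units are the single units of information that JavaScript strings are stored in. JavaScript strings use UTF-16, where each code unit is a 16-bit value. A Unicode code point above U+FFFF does not fit in one code unit, so it is stored as a pair of code units called a surrogate pair. For example, computers store \uD835 + \uDC00, which is equal to what we humans understand as 𝐀.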
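The relationship between a symbol and its code units can be checked directly in the console. The following is a minimal sketch using the standard String.fromCharCode, String.fromCodePoint and codePointAt methods; the variable names are only illustrative and do not come from the original examples.

```javascript
// Build the bold capital A from its two UTF-16 code units (a surrogate pair).
const boldA = String.fromCharCode(0xD835, 0xDC00);

// .length counts code units, so it reports 2 for this single symbol.
const codeUnitCount = boldA.length;

// codePointAt(0) reads the whole surrogate pair as one Unicode code point.
const codePointHex = boldA.codePointAt(0).toString(16).toUpperCase();

// String.fromCodePoint goes the other way: one code point in, one symbol out.
const roundTrip = String.fromCodePoint(0x1D400) === boldA;

console.log(codeUnitCount); // 2
console.log(codePointHex);  // 1D400
console.log(roundTrip);     // true
```

Going through code points rather than code units is what lets us move back and forth between what humans see and what the string actually stores.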
Remember '👩‍❤️‍💋‍👩'.length === 11? Let's make a diagram that deconstructs the 👩‍❤️‍💋‍👩 grapheme into symbols and code units:
What humans see          | What computers see
------------------------------------------------------------------------------
Graphemes  | Symbols     | UTF-16 code units
------------------------------------------------------------------------------
👩‍❤️‍💋‍👩         | 👩          | 1. \uD83D
           |             | 2. \uDC69
------------------------------------------------------------------------------
           |             | 3. \u200D (emoji_zero_width_joiner)
------------------------------------------------------------------------------
           | ❤           | 4. \u2764
------------------------------------------------------------------------------
           |             | 5. \uFE0F (emoji_presentation_selector)
------------------------------------------------------------------------------
           |             | 6. \u200D (emoji_zero_width_joiner)
------------------------------------------------------------------------------
           | 💋          | 7. \uD83D
           |             | 8. \uDC8B
------------------------------------------------------------------------------
           |             | 9. \u200D (emoji_zero_width_joiner)
------------------------------------------------------------------------------
           | 👩          | 10. \uD83D
           |             | 11. \uDC69
------------------------------------------------------------------------------
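The breakdown in the diagram can be reproduced in code. Below is a small sketch, assuming a hypothetical helper named toCodeUnits (not part of any library) that walks the string with the standard charCodeAt method. The emoji is written with escape sequences so the example does not depend on how the file is encoded.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as \uXXXX escapes.
function toCodeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns the 16-bit code unit at index i.
    const hex = str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0');
    units.push('\\u' + hex);
  }
  return units;
}

// The kiss emoji from the diagram: woman, joiner, heart, selector, joiner,
// kiss mark, joiner, woman.
const kiss = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}';

console.log(toCodeUnits(kiss).length); // 11
console.log(toCodeUnits(kiss).join(' '));
// \uD83D \uDC69 \u200D \u2764 \uFE0F \u200D \uD83D \uDC8B \u200D \uD83D \uDC69
```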
Let's do the same with the 💩 emoji:
What humans see          | What computers see
------------------------------------------------------------------------------
Graphemes  | Symbols     | UTF-16 code units
------------------------------------------------------------------------------
           | 💩          | 1. \uD83D
           |             | 2. \uDCA9
------------------------------------------------------------------------------
And the 再 symbol:
What humans see          | What computers see
------------------------------------------------------------------------------
Graphemes  | Symbols     | UTF-16 code units
------------------------------------------------------------------------------
           | 再          | 1. \uD87E
           |             | 2. \uDC15
------------------------------------------------------------------------------
Last but not least, the bold capital 𝐀:
What humans see          | What computers see
------------------------------------------------------------------------------
Graphemes  | Symbols     | UTF-16 code units
------------------------------------------------------------------------------
           | 𝐀           | 1. \uD835
           |             | 2. \uDC00
------------------------------------------------------------------------------
How can we fix this?
The first easy and straightforward approach is to use Array.from, which splits the string into an array of symbols, and then read the length of that array.
> Array.from('𝐀').length
1
> Array.from('💩').length
1
> Array.from('再').length
1
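This approach covers most readable-text use cases and most foreign character sets, because Array.from iterates over the string symbol by symbol rather than code unit by code unit. It does not cover emoji sequences: Array.from('👩‍❤️‍💋‍👩').length is 8, not 1.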
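The Array.from approach can be wrapped in a small function. This is a sketch, and countSymbols is a hypothetical name chosen for illustration. The inputs are written as escape sequences so the example does not depend on file encoding.

```javascript
// Hypothetical helper: count symbols (Unicode code points) instead of code units.
function countSymbols(str) {
  // Array.from uses the string iterator, which keeps surrogate pairs together.
  return Array.from(str).length;
}

console.log(countSymbols('abc'));        // 3
console.log(countSymbols('\u{1D400}'));  // 1 (bold capital A)
console.log(countSymbols('\u{1F4A9}'));  // 1 (pile of poo)
console.log(countSymbols('\u{2F815}'));  // 1

// An emoji sequence is still counted piece by piece:
// woman, joiner, heart, selector, joiner, kiss mark, joiner, woman.
const kissSequence = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}';
console.log(countSymbols(kissSequence)); // 8
```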
To go further down the rabbit hole and count emoji sequences, diacritics, and zero-width characters correctly, there are some npm libraries that can help.
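Besides npm libraries, recent JavaScript runtimes also ship a built-in Intl.Segmenter API that can split text into graphemes. This is not the approach the examples above use; the following is only a sketch, assuming Intl.Segmenter is available in your runtime, and countGraphemes is a hypothetical name.

```javascript
// Hypothetical helper: count graphemes (user-perceived characters).
// Assumption: the runtime provides Intl.Segmenter.
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // segment() returns an iterable with one entry per grapheme.
  return Array.from(segmenter.segment(str)).length;
}

// woman, joiner, heart, selector, joiner, kiss mark, joiner, woman
const kissEmoji = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}';

console.log(kissEmoji.length);             // 11 code units
console.log(Array.from(kissEmoji).length); // 8 symbols
console.log(countGraphemes(kissEmoji));    // 1 grapheme
console.log(countGraphemes('e\u0301'));    // 1 (e + combining acute accent)
```

If the runtime does not provide Intl.Segmenter, the constructor call throws, so a library remains the fallback.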
Why is this important?
Knowing how computers and humans each understand a single unit of information, and how to go back and forth between these two worlds, is an important foundation for diving into more complex problems.