Stripping HTML - dos, don'ts and more

TL;DR: Use template (not div) and textContent (not innerText).

Every so often I run into a situation where I need to strip HTML from a string. For example, this blog's post list includes a short excerpt of the actual post, but I didn't want any of the content's HTML in there, just the plain text.

I recently revisited the stripHtml() function I wrote years ago and thought it might be worth sharing a few dos and don'ts I've learned about stripping HTML from a string.

I'm just doing a very dumb "strip all HTML from this string" method here, no fancy stuff. For use-cases where you want to keep (some of) the HTML, you should look into DOMPurify and/or the (upcoming) built-in HTML Sanitizer API.

What not to do

Do not div.innerText

In the past, I naively used something that looks like this:

function stripHtml(string) {
  const tmp = document.createElement('div');
  tmp.innerHTML = string;
  return tmp.innerText || '';
}

It does the job, but it contains a major security issue that needs fixing and we can make it a little bit faster with one simple change.

I need to get one other thing out of the way first, but the code above is the foundation we'll build on to make a safer and faster stripHtml() function.

Do not use regular expressions

I'm not going to spend much time on this, but it needs to be said: No matter how smart your expression is, it's no match for the DOM parser your browser ships with. HTML is complicated and has a lot of historical baggage. Regex is not up to the task.
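To make that a bit more concrete, here is a small sketch of how a typical tag-stripping expression goes wrong. The naiveRegexStrip() helper and its pattern are my own illustration of the approach, not code from the original function, and the inputs are made up.

```javascript
// A common "remove everything between < and >" pattern (illustrative only).
function naiveRegexStrip(html) {
  return html.replace(/<[^>]*>/g, '');
}

// A ">" inside an attribute value ends the match too early,
// so part of the tag leaks into the output.
console.log(JSON.stringify(naiveRegexStrip('<img alt="a > b" src="x.png">text')));
// → " b\" src=\"x.png\">text"

// The same happens with a ">" inside an HTML comment.
console.log(JSON.stringify(naiveRegexStrip('<!-- a > b -->Hi')));
// → " b -->Hi"

// Character references are left encoded instead of being decoded.
console.log(JSON.stringify(naiveRegexStrip('<p>Fish &amp; chips</p>')));
// → "Fish &amp; chips"
```

A smarter expression can patch each of these cases, but every patch adds another edge case of its own, which is why the parser is the better tool.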

Use template, not div

Setting the innerHTML of a <div /> like I did in the example above is a big security risk, especially if you're working with third-party content you don't control.

Consider the following string:

stripHtml(`
  <p>Hello there!</p>
  <img onerror="alert('Gotcha!')" src="give-me-a-404" />
`);

If you run that through our stripHtml() function, the resulting string will only contain plain text. So looking at just the input and output, it seems to do what you expect. But when we set innerHTML on the tmp element, the browser will parse and - for lack of a better word - execute the input string's HTML. That includes the <img /> tag, which causes the browser to attempt to load the linked image, fail, and execute the inlined onerror callback.

In other words, we've got a Cross Site Scripting (XSS) vulnerability on our hands 😱.

Fortunately, the fix is easy: Use <template /> instead of <div />. It makes sure the HTML is parsed but not 'executed'.

function stripHtml(string) {
  const tmp = document.createElement('template'); // <--- safe!
  tmp.innerHTML = string;
  return tmp.content.textContent || '';
}

Note that we need to read textContent from tmp.content instead of from tmp directly. The content of a <template /> is a DocumentFragment, which has textContent but no innerText.
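If you want to see the difference between the two elements for yourself, something like the following can be pasted into a browser console. This is a rough sketch: firesOnError() and the global probe name are hypothetical helpers of mine, the two-second timeout is an arbitrary guess at how long a failed image load takes, and a strict Content Security Policy may block the inline handler altogether.

```javascript
// Hypothetical helper: resolves to true if an inline onerror handler runs
// after assigning markup to innerHTML of a freshly created <tagName>.
function firesOnError(tagName, timeoutMs = 2000) {
  return new Promise((resolve) => {
    // Expose a global callback the inline handler can reach.
    const probe = `__stripHtmlProbe_${tagName}`;
    globalThis[probe] = () => resolve(true);

    const tmp = document.createElement(tagName);
    tmp.innerHTML = `<img src="give-me-a-404" onerror="${probe}()" />`;

    // If nothing fired within the timeout, assume the markup stayed inert.
    setTimeout(() => resolve(false), timeoutMs);
  });
}

// Only run where a DOM exists; this does nothing in e.g. Node.js.
if (typeof document !== 'undefined') {
  firesOnError('div').then((fired) => console.log('div fired onerror:', fired));
  firesOnError('template').then((fired) => console.log('template fired onerror:', fired));
}
```

If the browser behaves as described above, the div run should report that the handler fired and the template run should not.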

The <template /> tag is not supported on Internet Explorer.

Use DOMParser for IE

There's an IE-compatible solution, too, but it's a LOT slower than the <template /> method. The good news is that IE usage numbers are dwindling and its users will be on a desktop computer with a fairly powerful processor (compared to mobile, at least).

function stripHtml(string) {
  const doc = new DOMParser().parseFromString(string, 'text/html');
  return doc.body.innerText || '';
}

See the last section for a snippet where we use <template /> if it's supported, and fall back to DOMParser on older browsers.

Use textContent, not innerText

If you can, use textContent instead of innerText. It's a lot faster.

function stripHtml(string) {
  const tmp = document.createElement('template'); // <--- safe!
  tmp.innerHTML = string;
  return tmp.content.textContent || ''; // <--- faster!
}

innerText and textContent are very similar, but there is an important difference: textContent doesn't check if elements and their contents are visible, it pretty much returns anything that is not an HTML tag. In other words, textContent will happily return any text that is inside a hidden element, including elements that hide their contents by default, like <script /> and <style />. This is exactly what makes it faster, but it can also cause output you may not expect.

Example:

stripHtml(`
  <script>alert('Hello world');</script>
  <p>Hello world</p>
`);
// innerText: "Hello world"
// textContent: "alert('Hello world');Hello world"

If you're not in control and/or certain of the exact content you're piping through our stripHtml() function, you should use innerText and take the performance hit.

The reason why you need to be careful with textContent is similar to why you should not use a regular expression to strip HTML. While Regex will do the job most of the time, it's quite dumb and will just strip the tags and nothing more, regardless of the tag's nature. Regex also doesn't account for quite a lot of HTML parsing weirdness that browsers know how to handle.

Putting it all together

Use <template /> if it's available.
Use textContent if you can.
Fall back to DOMParser() for older browsers.

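function stripHtml(string, { property = 'textContent' } = {}) {

  // Try to use the fast method on browsers that support <template />.
  // tmp.content is a DocumentFragment, which only has textContent.
  const tmp = document.createElement('template');
  if (property === 'textContent' && 'content' in tmp) {
    tmp.innerHTML = string;
    return tmp.content.textContent || '';
  }
  // Fallback for older browsers, and for innerText
  const doc = new DOMParser().parseFromString(string, 'text/html');
  return doc.body[ property ] || '';

}

Note that I've added the property option so you can easily control whether the function uses textContent (default) or innerText. Because innerText only exists on HTML elements and not on the DocumentFragment in tmp.content, asking for innerText skips <template /> and goes through DOMParser instead.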
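One thing the function doesn't do is tidy up the result: the returned text still carries the line breaks and indentation of the original markup. For the excerpt use case from the introduction, a follow-up step could look roughly like this. The toExcerpt() helper, its 140-character default and the trailing ellipsis are all assumptions of mine, not part of the original function.

```javascript
// Hypothetical helper: collapse whitespace and shorten stripped text
// to at most maxLength characters (plus an ellipsis when truncated).
function toExcerpt(text, maxLength = 140) {
  // Turn runs of spaces, tabs and newlines into single spaces.
  const collapsed = text.replace(/\s+/g, ' ').trim();
  if (collapsed.length <= maxLength) return collapsed;

  // Cut at the last space before the limit so words stay whole.
  const cut = collapsed.slice(0, maxLength);
  const lastSpace = cut.lastIndexOf(' ');
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + '…';
}

console.log(toExcerpt('\n  Hello there!\n  \n'));   // → "Hello there!"
console.log(toExcerpt('one two three four', 9));    // → "one two…"
```

In the browser you would feed it the output of stripHtml(), for example toExcerpt(stripHtml(post.html)), where post.html stands in for whatever string holds your content.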