Migrating from jsdom to htmlparser2

A practical reference for developers who know jsdom and want to switch to the htmlparser2 ecosystem.

jsdom runs a full virtual browser in Node.js. That makes sense when you need to execute scripts or test UI behaviour, but for scraping, templating, and data extraction it brings in a lot of weight that the job simply does not require.

htmlparser2 parses HTML into a plain JavaScript object tree. Querying is handled by css-select, and traversal utilities live in domutils. This guide walks through the htmlparser2 equivalent of the jsdom patterns you are most likely to use, section by section.

1 The Mental Model Shift

jsdom simulates a browser: it gives you a live document object with methods (querySelector, getElementById…) that behave as they do in a browser. The DOM tree is made of class instances.

htmlparser2 is different. It gives you a plain JavaScript object tree. There is no window, no document.querySelector, no classList. Querying and traversal are done by separate utility libraries.

── jsdom ──────────────────────────────────────────────────
const dom = new JSDOM(html)
dom.window.document   ← live Document object (browser API)
  .querySelector()   .getElementById()   .createElement()
  .innerHTML   .textContent   .classList   .getAttribute()

── htmlparser2 ecosystem ──────────────────────────────────
parseDocument(html)   ← plain JS object tree (no methods)
css-select            → CSS selectors (selectOne / selectAll)
domutils              → everything else (text, attrs, siblings…)

⚠ The single biggest gotcha: element.children in jsdom returns only element nodes (an HTMLCollection). In htmlparser2, node.children includes all node types: text nodes, comments, everything. Always filter with domutils.isTag(n) when you want elements only.
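
The elements-only filter can be sketched without any dependency by using hand-built stand-in nodes. The local isElement helper is an assumption that mirrors the three type strings domutils.isTag accepts; in real code, call the library function instead.

```javascript
// Stand-in for domutils.isTag (assumption: same three type strings).
const isElement = (n) =>
  n.type === 'tag' || n.type === 'script' || n.type === 'style';

// Hand-built stand-in for:
// <ul>\n  <li>One</li><!-- note --><li>Two</li>\n</ul>
const ul = {
  type: 'tag',
  name: 'ul',
  attribs: {},
  children: [
    { type: 'text', data: '\n  ' },
    { type: 'tag', name: 'li', attribs: {}, children: [{ type: 'text', data: 'One' }] },
    { type: 'comment', data: ' note ' },
    { type: 'tag', name: 'li', attribs: {}, children: [{ type: 'text', data: 'Two' }] },
    { type: 'text', data: '\n' },
  ],
};

// jsdom-style "children": element nodes only.
const elementChildren = (node) => (node.children || []).filter(isElement);

console.log(ul.children.length);          // 5 (text, li, comment, li, text)
console.log(elementChildren(ul).length);  // 2 (the two <li> elements)
```

Code ported from jsdom that indexes into el.children (for example el.children[0]) is the usual place this difference shows up.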

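A node in htmlparser2 is a lightweight object with a handful of data fields. You can console.log it and read or write its properties directly. Two caveats: nodes are instances of small domhandler classes (Element, Text, and so on) rather than bare object literals, and the parent, next, and prev links make the tree circular, so JSON.stringify throws unless those links are stripped first.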

htmlparser2 — what a node looks like
// <a href="/about" class="nav-link">About</a>
{
  type: 'tag',
  name: 'a',                        // always lowercase
  attribs: { href: '/about', class: 'nav-link' },
  children: [
    { type: 'text', data: 'About', parent: [Circular], … }
  ],
  parent: { … },
  next: { … },   // next sibling (any type)
  prev: { … }    // previous sibling (any type)
}
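
Because parent, next, and prev point back into the tree, the structure is circular. The following is a minimal sketch of one way to serialize a node to JSON anyway, assuming it is acceptable to drop those link fields. The toJSON helper and the hand-built nodes are hypothetical stand-ins, not part of htmlparser2.

```javascript
// Hand-built stand-in for <a href="/about" class="nav-link">About</a>
const link = {
  type: 'tag',
  name: 'a',
  attribs: { href: '/about', class: 'nav-link' },
  children: [],
  parent: null,
  prev: null,
  next: null,
};
const text = { type: 'text', data: 'About', parent: link, prev: null, next: null };
link.children.push(text);

// Fields that create cycles. Caveat: an attribute literally named
// "parent", "prev", or "next" would be dropped by this replacer too.
const LINK_FIELDS = new Set(['parent', 'prev', 'next']);

// Hypothetical helper: JSON without the back-references.
function toJSON(node) {
  return JSON.stringify(node, (key, value) =>
    LINK_FIELDS.has(key) ? undefined : value
  );
}

console.log(toJSON(link));
// {"type":"tag","name":"a","attribs":{"href":"/about","class":"nav-link"},"children":[{"type":"text","data":"About"}]}
```

Calling JSON.stringify(link) directly fails with a "circular structure" TypeError, because text.parent leads back to link.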

2 Parsing HTML

jsdom
import { JSDOM } from 'jsdom';

const dom = new JSDOM(html);
const document = dom.window.document;

// document is a browser Document object
htmlparser2
import { parseDocument } from 'htmlparser2';

const document = parseDocument(html);

// document is a plain JS object:
// { type: 'root', children: [ … ] }
📦 Install: npm install htmlparser2 domutils css-select domhandler

All examples use ESM import syntax. For CommonJS, swap to require() — e.g. const { parseDocument } = require('htmlparser2'), const domutils = require('domutils').
💡 parseDocument options: Pass { decodeEntities: false } if you want raw HTML entities preserved; by default, entities are decoded. parseDocument also accepts xmlMode: true for case-sensitive XML or XHTML parsing.

The returned document has type: 'root' and a children array. You pass it directly to css-select and domutils functions — think of it as your entry point, equivalent to jsdom's document.

3 Querying the DOM

css-select is the query engine. It accepts the same CSS selector syntax you use with querySelector in the browser.

jsdom
// Single element (or null)
document.querySelector('.nav-link')
document.querySelector('#main')
document.querySelector('a[href^="https"]')

// All matching elements (NodeList)
document.querySelectorAll('ul > li')
document.querySelectorAll('[data-id]')

// Classic helpers
document.getElementById('main')
document.getElementsByClassName('card')
document.getElementsByTagName('p')
htmlparser2 + css-select
import { selectOne, selectAll } from 'css-select';

// Single element (or null)
selectOne('.nav-link', document)
selectOne('#main', document)
selectOne('a[href^="https"]', document)

// All matching elements (plain Array)
selectAll('ul > li', document)
selectAll('[data-id]', document)

// ID / class / tag — just use selectors:
selectOne('#main', document)
selectAll('.card', document)
selectAll('p', document)

Predicate-based queries (domutils)

When a CSS selector isn't enough, use domutils.findOne and domutils.findAll with a custom function:

domutils — predicate queries
import * as domutils from 'domutils';

// Find first tag whose text starts with "Price:"
const node = domutils.findOne(
  n => domutils.isTag(n) && domutils.getText(n).startsWith('Price:'),
  document.children
);

// Find all <li> elements that have a data-id attribute
const items = domutils.findAll(
  n => n.name === 'li' && domutils.hasAttrib(n, 'data-id'),
  document.children
);

Reusable compiled selectors

jsdom
// No native way to compile/reuse a selector.
// Every call re-parses the selector string.
rows.forEach(row => {
  const cell = row.querySelector('td.price');
});
css-select
import { compile, selectOne } from 'css-select';

// Compile once, reuse many times (faster in loops)
const priceCell = compile('td.price');

rows.forEach(row => {
  const cell = selectOne(priceCell, row);
});

Check if a node matches a selector

jsdom
element.matches('.active')
element.matches('a[href]')
css-select
import { is } from 'css-select';

is(node, '.active')
is(node, 'a[href]')
📋 Return types: jsdom's querySelectorAll returns a static NodeList, which offers forEach but not map or filter. css-select's selectAll returns a plain Array, so you can use .map(), .filter(), .find() etc. directly on it, no conversion needed.

4 Attributes

In htmlparser2, attributes are stored as a plain object on node.attribs. You can access them directly — no method calls required.

jsdom
// Read
el.getAttribute('href')       // '/about' or null
el.id                         // shorthand for id attr
el.className                  // shorthand for class attr

// Check
el.hasAttribute('disabled')   // true/false

// Write
el.setAttribute('href', '/new')
el.removeAttribute('disabled')

// Iterate all attrs
for (const attr of el.attributes) {
  console.log(attr.name, attr.value);
}
htmlparser2 + domutils
// Read — direct object access
el.attribs['href']                    // '/about' or undefined
el.attribs['id']
el.attribs['class']

// or with domutils (returns undefined, not null)
domutils.getAttributeValue(el, 'href')

// Check
domutils.hasAttrib(el, 'disabled')    // true/false
'disabled' in el.attribs              // same thing

// Write — mutate the object directly
el.attribs['href'] = '/new'
delete el.attribs['disabled']

// Iterate all attrs
Object.entries(el.attribs).forEach(([name, value]) => {
  console.log(name, value);
});
⚠ null vs undefined: jsdom's getAttribute returns null when an attribute is absent. htmlparser2's el.attribs['missing'] and domutils.getAttributeValue return undefined. If your code does if (attr !== null), update it to if (attr != null) or if (attr !== undefined).
💡 Boolean attributes: In HTML, <input disabled> is parsed as attribs: { disabled: '' }, an empty string rather than true. Check presence with domutils.hasAttrib(el, 'disabled'), not by checking the value.
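
If a lot of ported code compares against null, one option is a small shim that restores the jsdom return convention. This is a minimal sketch on a hand-built stand-in node; getAttr and hasAttr are hypothetical helper names, not domutils functions.

```javascript
// Stand-in for a parsed <input type="checkbox" disabled>
const input = {
  type: 'tag',
  name: 'input',
  attribs: { type: 'checkbox', disabled: '' },
  children: [],
};

// Hypothetical helper: jsdom-style read that returns null when absent.
function getAttr(el, name) {
  const attribs = el.attribs || {};
  return Object.prototype.hasOwnProperty.call(attribs, name)
    ? attribs[name]
    : null;
}

// Hypothetical helper: presence check that is safe for boolean attributes.
function hasAttr(el, name) {
  return getAttr(el, name) !== null;
}

console.log(getAttr(input, 'href'));          // null
console.log(getAttr(input, 'disabled'));      // ''
console.log(hasAttr(input, 'disabled'));      // true
console.log(Boolean(input.attribs.disabled)); // false: the value is falsy
```

The last line shows why a truthiness check on the value misreports boolean attributes.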

5 Text Content

jsdom
// All text, recursively (equivalent to textContent)
el.textContent

// Rendered text — layout-aware, skips hidden elements
// (browser only, not really useful in jsdom either)
el.innerText

// Text node's raw string value
textNode.nodeValue   // or textNode.data or textNode.textContent
domutils
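// All text, recursively. domutils.textContent is the closest match
// to the DOM property; domutils.getText is similar but turns <br>
// into a newline (recent domutils versions mark it as deprecated).
domutils.textContent(node)
domutils.getText(node)

// There is no layout-aware innerText. domutils.innerText exists, but
// it only skips <script>/<style> content; it knows nothing about CSS.
domutils.innerText(node)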

// Text node's raw string value
textNode.data

Working with text nodes directly

jsdom
// Get only direct text node children
// NodeList has no .filter() — spread it first
[...el.childNodes]
  .filter(n => n.nodeType === Node.TEXT_NODE)
  .map(n => n.textContent)
  .join('')
htmlparser2
// Get only direct text node children
el.children
  .filter(n => n.type === 'text')
  .map(n => n.data)
  .join('')
📋 Whitespace: htmlparser2 preserves whitespace-only text nodes (e.g. newlines between tags). Add .filter(n => n.type === 'text' && n.data.trim()) if you only want meaningful text.
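
The difference between direct text children and recursive text, together with whitespace cleanup, can be sketched on a hand-built stand-in tree. ownText and allText are hypothetical helpers written for illustration; allText is a simplified take on what a recursive text getter does.

```javascript
// Stand-in for: <p>\n  Price: <b>42</b> EUR\n</p>
const p = {
  type: 'tag',
  name: 'p',
  attribs: {},
  children: [
    { type: 'text', data: '\n  Price: ' },
    { type: 'tag', name: 'b', attribs: {}, children: [{ type: 'text', data: '42' }] },
    { type: 'text', data: ' EUR\n' },
  ],
};

// Collapse runs of whitespace and trim the ends.
const tidy = (s) => s.replace(/\s+/g, ' ').trim();

// Direct text children only (skips the text inside <b>).
function ownText(el) {
  return tidy(
    (el.children || [])
      .filter((n) => n.type === 'text')
      .map((n) => n.data)
      .join('')
  );
}

// All descendant text, recursively; comments contribute nothing.
function allText(node) {
  if (node.type === 'text') return node.data;
  if (node.type === 'comment') return '';
  return (node.children || []).map(allText).join('');
}

console.log(ownText(p));        // Price: EUR
console.log(tidy(allText(p)));  // Price: 42 EUR
```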

7 Type Checking

jsdom uses numeric nodeType constants. htmlparser2 uses a type string on each node.

jsdom
node.nodeType === Node.ELEMENT_NODE    // 1
node.nodeType === Node.TEXT_NODE       // 3
node.nodeType === Node.COMMENT_NODE    // 8
node.nodeType === Node.DOCUMENT_NODE   // 9

node instanceof Element   // is an element
node instanceof Text      // is a text node
htmlparser2 + domutils
domutils.isTag(node)       // type 'tag' | 'script' | 'style'
domutils.isText(node)      // type === 'text'
domutils.isComment(node)   // type === 'comment'
domutils.isDocument(node)  // type === 'root'

// Or check the string directly:
node.type === 'tag'
node.type === 'text'
node.type === 'comment'
node.type === 'root'
💡 isTag includes <script> and <style>: domutils.isTag(node) returns true for type === 'tag', 'script', and 'style'. This mirrors browser behavior where <script> is an Element. If you specifically need only regular tags, check node.type === 'tag'.

Node type strings at a glance

HTML | node.type | domutils check
<div>, <p>, etc. | 'tag' | isTag(n)
<script> | 'script' | isTag(n)
<style> | 'style' | isTag(n)
text between tags | 'text' | isText(n)
<!-- comment --> | 'comment' | isComment(n)
root document | 'root' | isDocument(n)
<!DOCTYPE html> | 'directive' |
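
For ported code that switches on numeric nodeType values, a lookup from type string to DOM constant can bridge the gap. This is a sketch only: the NODE_TYPE map and toNodeType helper are hypothetical, and returning null for 'directive' is a choice made here because that type covers both doctypes and processing instructions.

```javascript
// Hypothetical mapping from htmlparser2 type strings to DOM nodeType numbers.
const NODE_TYPE = new Map([
  ['tag', 1], ['script', 1], ['style', 1], // ELEMENT_NODE
  ['text', 3],                             // TEXT_NODE
  ['cdata', 4],                            // CDATA_SECTION_NODE
  ['comment', 8],                          // COMMENT_NODE
  ['root', 9],                             // DOCUMENT_NODE
]);

// Returns the DOM nodeType number, or null when there is no single match.
function toNodeType(node) {
  return NODE_TYPE.has(node.type) ? NODE_TYPE.get(node.type) : null;
}

console.log(toNodeType({ type: 'script' }));    // 1
console.log(toNodeType({ type: 'text' }));      // 3
console.log(toNodeType({ type: 'directive' })); // null
```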

8 Serialization (innerHTML / outerHTML)

jsdom
el.innerHTML    // content inside the element
el.outerHTML    // element itself + its content
domutils
domutils.getInnerHTML(node)    // content inside
domutils.getOuterHTML(node)    // node + content

// Alternative: dom-serializer (more control)
import render from 'dom-serializer';
render(node.children)          // inner HTML
render(node)                   // outer HTML
📋 dom-serializer: domutils.getInnerHTML and getOuterHTML use dom-serializer internally. Install it directly (npm i dom-serializer) only if you need options like decodeEntities: false or xmlMode: true.

9 Class Manipulation

There is no classList in htmlparser2. Classes are just a space-separated string in node.attribs.class.

jsdom
el.classList.contains('active')
el.classList.add('active')
el.classList.remove('active')
el.classList.toggle('active')
el.classList.replace('old', 'new')
htmlparser2
const classes = () =>
  (el.attribs.class || '').split(/\s+/).filter(Boolean);

// contains
classes().includes('active')
// or with css-select (no mutation):
is(el, '.active')   // import { is } from 'css-select'

// add
if (!classes().includes('active'))
  el.attribs.class = [...classes(), 'active'].join(' ');

// remove
el.attribs.class = classes().filter(c => c !== 'active').join(' ');

// toggle
el.attribs.class = classes().includes('active')
  ? classes().filter(c => c !== 'active').join(' ')
  : [...classes(), 'active'].join(' ');

// replace
el.attribs.class = classes().map(c => c === 'old' ? 'new' : c).join(' ');
💡 Wrap it once: If you use class manipulation a lot, write a tiny helper at the top of the file rather than repeating the split/join pattern. This is also a good indicator of what belongs in an adapter (see section 11).
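
Such a helper might look like the sketch below. The function names are hypothetical, and the helpers assume they are only called on element nodes, which always carry an attribs object.

```javascript
// Read the class attribute as an array of names.
const classesOf = (el) => (el.attribs.class || '').split(/\s+/).filter(Boolean);

// Write an array of names back to the class attribute.
const writeClasses = (el, names) => { el.attribs.class = names.join(' '); };

function hasClass(el, name) {
  return classesOf(el).includes(name);
}

function addClass(el, name) {
  if (!hasClass(el, name)) writeClasses(el, [...classesOf(el), name]);
}

function removeClass(el, name) {
  writeClasses(el, classesOf(el).filter((c) => c !== name));
}

// Returns true when the class is present after the call.
function toggleClass(el, name) {
  if (hasClass(el, name)) removeClass(el, name);
  else addClass(el, name);
  return hasClass(el, name);
}

// Stand-in element with irregular whitespace in its class attribute.
const card = { type: 'tag', name: 'div', attribs: { class: 'card   active' }, children: [] };

removeClass(card, 'active');
addClass(card, 'wide');
console.log(card.attribs.class);           // card wide
console.log(toggleClass(card, 'card'));    // false
console.log(card.attribs.class);           // wide
```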

10 DOM Mutation

You can mutate the tree by editing the plain objects directly, or use domutils helpers for structural changes.

jsdom
// Remove a node
el.parentNode.removeChild(el)
// or: el.remove()

// Replace a node
el.parentNode.replaceChild(newNode, el)

// Insert before/after
parent.insertBefore(newNode, refNode)

// Create nodes
document.createElement('div')
document.createTextNode('hello')
domutils
// Remove a node (updates parent.children + sibling links)
domutils.removeElement(el)

// Replace a node
domutils.replaceElement(el, newNode)

// Insert as sibling
domutils.prepend(refNode, newNode)   // insert newNode before refNode
domutils.append(refNode, newNode)    // insert newNode after refNode

// Create nodes — use domhandler constructors:
import { Element, Text } from 'domhandler';
const div = new Element('div', { class: 'box' }, []);
const txt = new Text('hello');
⚠ Keep the tree consistent: When creating nodes manually and inserting them, domutils helpers (append, prepend) update the parent, next, and prev references for you. If you push directly into node.children without using these helpers, those references go stale and queries/traversal will break.
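
To make the bookkeeping concrete, here is an illustration of the links an append-child operation has to maintain. It is a sketch on hand-built stand-in nodes and is not how domutils implements it; appendChildLinked and makeNode are hypothetical names, and the child is assumed to be detached (not already in another parent).

```javascript
// Stand-in node factory with all link fields present.
const makeNode = (name) => ({
  type: 'tag', name, attribs: {}, children: [],
  parent: null, prev: null, next: null,
});

// Illustration only: append a detached child and fix every link.
function appendChildLinked(parent, child) {
  const last = parent.children[parent.children.length - 1] || null;
  child.parent = parent;   // upward link
  child.prev = last;       // backward sibling link
  child.next = null;       // the new child is now the last one
  if (last) last.next = child;
  parent.children.push(child);
}

const list = makeNode('ul');
const first = makeNode('li');
const second = makeNode('li');

appendChildLinked(list, first);
appendChildLinked(list, second);

console.log(first.next === second);   // true
console.log(second.prev === first);   // true
console.log(second.parent === list);  // true
```

A bare list.children.push(second) would leave first.next as null and second.parent unset, which is exactly the stale state the warning describes.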

Append a child

jsdom
parent.appendChild(child);
htmlparser2
// domutils.appendChild() adds the child at the end of
// parent.children and updates parent, prev, and next for you.
domutils.appendChild(parent, child);

// Building a subtree from scratch works the same way:
import { Element } from 'domhandler';
const ul = new Element('ul', {}, []);
for (const li of [li1, li2, li3]) domutils.appendChild(ul, li);

// Avoid new Element('ul', {}, [li1, li2, li3]) for nodes you will
// query later: the constructor stores the children array as-is and
// leaves the parent, prev, and next links on the children unset.

11 The Adapter Pattern

If you are considering wrapping htmlparser2 in a jsdom-compatible interface so that your business logic does not have to change, here is a realistic view of what that looks like.

Minimal adapter skeleton
// adapter.js
import { parseDocument } from 'htmlparser2';
import * as domutils from 'domutils';
import { selectOne, selectAll, is } from 'css-select';

class NodeAdapter {
  constructor(node) { this._n = node; }

  // ── Querying ──
  querySelector(sel)    { const n = selectOne(sel, this._n); return n ? new NodeAdapter(n) : null; }
  querySelectorAll(sel) { return selectAll(sel, this._n).map(n => new NodeAdapter(n)); }
  matches(sel)          { return is(this._n, sel); }
  closest(sel)          {
    let n = this._n;
    while (n) { if (domutils.isTag(n) && is(n, sel)) return new NodeAdapter(n); n = n.parent; }
    return null;
  }

  // ── Attributes ──
  getAttribute(name)         { return this._n.attribs?.[name] ?? null; }
  setAttribute(name, value)  { if (this._n.attribs) this._n.attribs[name] = value; }
  hasAttribute(name)         { return domutils.hasAttrib(this._n, name); }
  removeAttribute(name)      { delete this._n.attribs?.[name]; }

  // ── Content ──
  get textContent()  { return domutils.getText(this._n); }
  get innerHTML()    { return domutils.getInnerHTML(this._n); }
  get outerHTML()    { return domutils.getOuterHTML(this._n); }

  // ── Identity ──
  get tagName()      { return this._n.name?.toUpperCase() ?? ''; }
  get id()           { return this._n.attribs?.id ?? ''; }
  get className()    { return this._n.attribs?.class ?? ''; }

  // ── Navigation ──
  get parentElement() {
    const p = this._n.parent;
    return p && domutils.isTag(p) ? new NodeAdapter(p) : null;
  }
  get children() {
    return (this._n.children || []).filter(domutils.isTag).map(n => new NodeAdapter(n));
  }
  get childNodes() {
    return (this._n.children || []).map(n => new NodeAdapter(n));
  }
  get nextElementSibling() {
    let n = this._n.next;
    while (n && !domutils.isTag(n)) n = n.next;
    return n ? new NodeAdapter(n) : null;
  }
  get previousElementSibling() {
    let n = this._n.prev;
    while (n && !domutils.isTag(n)) n = n.prev;
    return n ? new NodeAdapter(n) : null;
  }

  // ── classList shim ──
  get classList() {
    const el = this._n;
    const get = () => (el.attribs?.class || '').split(/\s+/).filter(Boolean);
    return {
      contains: c => get().includes(c),
      add:      c => { if (!get().includes(c)) el.attribs.class = [...get(), c].join(' '); },
      remove:   c => { el.attribs.class = get().filter(x => x !== c).join(' '); },
      toggle:   c => get().includes(c)
                  ? (el.attribs.class = get().filter(x => x !== c).join(' '))
                  : (el.attribs.class = [...get(), c].join(' ')),
    };
  }

  // ── Unwrap to raw node if needed ──
  get raw() { return this._n; }
}

function parseHTML(html) {
  const doc = parseDocument(html);
  return new NodeAdapter(doc);
}

export { parseHTML, NodeAdapter };
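
One thing the skeleton does not preserve is object identity: every getter builds a new wrapper, so two lookups of the same node are not === equal, which jsdom-based code sometimes relies on. A possible workaround is to cache wrappers in a WeakMap. The sketch below uses a stripped-down stand-in class and hand-built nodes; MiniAdapter and wrap are hypothetical names.

```javascript
// One wrapper per raw node; entries vanish when the node is collected.
const wrappers = new WeakMap();

// Stripped-down stand-in for the adapter class.
class MiniAdapter {
  constructor(node) { this._n = node; }
  get parentElement() {
    const p = this._n.parent;
    return p && p.type === 'tag' ? wrap(p) : null;
  }
  get raw() { return this._n; }
}

// Return the cached wrapper for a node, creating it on first use.
function wrap(node) {
  let w = wrappers.get(node);
  if (!w) {
    w = new MiniAdapter(node);
    wrappers.set(node, w);
  }
  return w;
}

// Hand-built stand-in for <div><span></span></div>
const div = { type: 'tag', name: 'div', attribs: {}, children: [], parent: null };
const span = { type: 'tag', name: 'span', attribs: {}, children: [], parent: div };
div.children.push(span);

console.log(wrap(span).parentElement === wrap(div));        // true
console.log(new MiniAdapter(div) === new MiniAdapter(div)); // false
```

In the full adapter, every `new NodeAdapter(n)` call inside a getter or query method would go through the cache function instead.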

When the adapter is the right choice

  • Your business logic is large and heavily uses the DOM API — rewriting everything at once is risky.
  • You want to swap the underlying parser without touching tested business logic.
  • You can ship the adapter, migrate gradually, then decide whether to remove it later.

When to skip the adapter and migrate directly

  • Your DOM usage is limited and already mapped (querying + attribute reads + text extraction).
  • The adapter is papering over gaps — every missing method you add is a maintenance burden.
  • You want the full performance benefit of htmlparser2 (the adapter adds object wrapping overhead).
💡 Recommendation: Start with the adapter. It lets you validate that htmlparser2 produces correct results against your existing tests without touching business logic. Once you have confidence, decide per-module whether a direct migration is worth it.

13 XML Mode

htmlparser2 ships a built-in XML mode that changes how the parser interprets the document. jsdom supports the same via the contentType constructor option — both give you case-sensitive tag names, self-closing tags, and no implicit HTML structure injection.

jsdom — parse as XML
import { JSDOM } from 'jsdom';

const dom = new JSDOM(xmlString, {
  contentType: 'application/xml',
});
const document = dom.window.document;

// Tag names are case-sensitive and preserved as-is.
// Malformed XML produces a parseerror document.
htmlparser2 — xmlMode
import { parseDocument } from 'htmlparser2';

const doc = parseDocument(xmlString, { xmlMode: true });

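// xmlMode changes three things:
//   1. Tag and attribute names are preserved as-is (no lowercasing)
//   2. Self-closing tags (<br/>, <MyTag/>) are honoured
//   3. HTML-specific rules are switched off (void elements,
//      implicit closing of tags such as <p>, raw-text <script>)
// Note: htmlparser2 never injects <html>/<body>, in either mode.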

Key behavioural differences

Behaviour | HTML mode (default) | xmlMode: true
Tag name casing | lowercased: el.name === 'div' | preserved: el.name === 'MyTag'
Self-closing tags | only void elements (<br>, <img>…) | any tag: <Foo/> has no children
Implicit structure | no <html>/<head>/<body> injection (unlike jsdom); HTML rules such as auto-closing an open <p> still apply | no injection and no HTML-specific rules; document mirrors input
Error recovery | lenient, but not the full HTML5 parsing algorithm | best-effort; parser does not throw on malformed XML
CDATA sections | treated as comments | parsed as CDATA nodes (type: 'cdata')
Namespaces | ignored | preserved in el.name (e.g. 'svg:path')

Querying XML documents

css-select and domutils work with XML documents too: pass the same parsed root to selectOne / selectAll. One difference matters: css-select lowercases tag names in selectors unless you pass { xmlMode: true } as its third argument, so pass it whenever the document was parsed in XML mode and contains mixed-case names.

htmlparser2 — querying XML
import { parseDocument } from 'htmlparser2';
import { selectAll } from 'css-select';

const xml = `<Library>
  <Book genre="fiction"><Title>Dune</Title></Book>
  <Book genre="non-fiction"><Title>Sapiens</Title></Book>
</Library>`;

const doc = parseDocument(xml, { xmlMode: true });
const opts = { xmlMode: true };   // keep selector tag names case-sensitive

// With xmlMode, tag names must match exactly
const books  = selectAll('Book', doc, opts);           // ✓ matches <Book>
const titles = selectAll('Book > Title', doc, opts);   // ✓
const wrong  = selectAll('book', doc, opts);           // ✗ no match (lowercase)

// Attribute selectors work as usual
const fiction = selectAll('Book[genre="fiction"]', doc, opts);
💡 Parsing SVG and RSS/Atom feeds: SVG embedded in HTML parses fine in default mode. For standalone SVG files or RSS/Atom XML feeds, pass xmlMode: true so that self-closing tags and namespace prefixes are handled correctly.
⚠ htmlparser2 is not a validating XML parser: It does not enforce well-formedness, DTDs, or XML schemas. Malformed XML is handled with best-effort recovery rather than throwing an error. Use a dedicated XML parser (e.g. fast-xml-parser, sax) if strict validation or namespace resolution is required.

14 Cheat Sheet

jsdom | htmlparser2 equivalent

SETUP
new JSDOM(html).window.document | parseDocument(html)

QUERYING
document.querySelector(sel) | selectOne(sel, document)
document.querySelectorAll(sel) | selectAll(sel, document) → plain Array
document.getElementById('x') | selectOne('#x', document)
document.getElementsByClassName('x') | selectAll('.x', document)
document.getElementsByTagName('p') | selectAll('p', document)
el.matches(sel) | is(el, sel) from css-select
el.closest(sel) | manual walk up via el.parent
(no jsdom equivalent) | domutils.findOne(fn, nodes): predicate query
(no jsdom equivalent) | compile(sel): reusable selector

ATTRIBUTES
el.getAttribute('x') → null if missing | el.attribs['x'] → undefined if missing
el.setAttribute('x', v) | el.attribs['x'] = v
el.hasAttribute('x') | domutils.hasAttrib(el, 'x')
el.removeAttribute('x') | delete el.attribs['x']
el.id | el.attribs['id']
el.className | el.attribs['class']

TEXT CONTENT
el.textContent | domutils.getText(el)
textNode.data | textNode.data ← same!

NAVIGATION
el.tagName (uppercase) | el.name (lowercase)
el.parentElement | el.parent (check domutils.isTag)
el.children (elements only) | el.children.filter(domutils.isTag)
el.childNodes (all nodes) | el.children
el.nextElementSibling | walk el.next until isTag
el.previousElementSibling | walk el.prev until isTag
el.nextSibling | el.next
el.previousSibling | el.prev

TYPE CHECKING
nodeType === 1 (element) | domutils.isTag(n)
nodeType === 3 (text) | domutils.isText(n) or n.type === 'text'
nodeType === 8 (comment) | domutils.isComment(n)
node instanceof Element | domutils.isTag(n)

SERIALIZATION
el.innerHTML | domutils.getInnerHTML(el)
el.outerHTML | domutils.getOuterHTML(el)

CLASSES
el.classList.contains('x') | is(el, '.x') or manual split
el.classList.add('x') | manual split/join on el.attribs.class
el.classList.remove('x') | manual split/filter/join

MUTATION
parent.removeChild(el) | domutils.removeElement(el)
parent.replaceChild(newEl, el) | domutils.replaceElement(el, newEl)
parent.insertBefore(newEl, ref) | domutils.prepend(ref, newEl)
document.createElement('div') | new Element('div', {}, []) from domhandler
document.createTextNode('x') | new Text('x') from domhandler
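
The manual walks listed above (closest, nextElementSibling, previousElementSibling) can be sketched as three small helpers. This is a dependency-free illustration on hand-built stand-in nodes: isElement mirrors what domutils.isTag checks, closestBy takes a predicate function rather than a selector string, and all helper names are hypothetical.

```javascript
// Stand-in for domutils.isTag (assumption: same three type strings).
const isElement = (n) =>
  n.type === 'tag' || n.type === 'script' || n.type === 'style';

// Walk forward through siblings until an element is found.
function nextElement(node) {
  let n = node.next;
  while (n && !isElement(n)) n = n.next;
  return n || null;
}

// Walk backward through siblings until an element is found.
function prevElement(node) {
  let n = node.prev;
  while (n && !isElement(n)) n = n.prev;
  return n || null;
}

// Walk up from the node itself until the predicate matches.
function closestBy(node, test) {
  let n = node;
  while (n) {
    if (isElement(n) && test(n)) return n;
    n = n.parent;
  }
  return null;
}

// Test fixture helper: attach children and set parent/prev/next links.
function link(parent, children) {
  parent.children = children;
  children.forEach((child, i) => {
    child.parent = parent;
    child.prev = children[i - 1] || null;
    child.next = children[i + 1] || null;
  });
  return parent;
}

const a = { type: 'tag', name: 'li', attribs: { id: 'a' } };
const gap = { type: 'text', data: '\n' };
const b = { type: 'tag', name: 'li', attribs: { id: 'b' } };
const menu = link(
  { type: 'tag', name: 'ul', attribs: { class: 'menu' }, parent: null },
  [a, gap, b]
);

console.log(nextElement(a) === b);                            // true (skips the text node)
console.log(prevElement(b) === a);                            // true
console.log(closestBy(b, (n) => n.name === 'ul') === menu);   // true
```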