Migrating from jsdom to htmlparser2

A practical reference for developers who know jsdom and want to switch to the htmlparser2 ecosystem.

jsdom runs a full virtual browser in Node.js. That makes sense when you need to execute scripts or test UI behaviour, but for scraping, templating, and data extraction it brings in a lot of weight that the job simply does not require.

htmlparser2 parses HTML into a plain JavaScript object tree. Querying is handled by css-select, and traversal utilities live in domutils. This guide walks through the htmlparser2 equivalents of the jsdom patterns you already know, section by section.

1 The Mental Model Shift

jsdom simulates a browser: it gives you a live document object whose methods (querySelector, getElementById…) behave as they do in a browser. The DOM tree is made of class instances.

htmlparser2 is different. It gives you a plain JavaScript object tree. There is no window, no document.querySelector, no classList. Querying and traversal are done by separate utility libraries.

── jsdom ──────────────────────────────────────────────────
const dom = new JSDOM(html)
dom.window.document        ← live Document object (browser API)
  .querySelector()  .getElementById()  .createElement()
  .innerHTML  .textContent  .classList  .getAttribute()

── htmlparser2 ecosystem ──────────────────────────────────
parseDocument(html)        ← plain JS object tree (no methods)
css-select                 → CSS selectors (selectOne / selectAll)
domutils                   → everything else (text, attrs, siblings…)

⚠ The single biggest gotcha: element.children in jsdom returns only element nodes (HTMLCollection). In htmlparser2, node.children includes all node types: text nodes, comments, everything. Always filter with domutils.isTag(n) when you want elements only.
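To make the difference concrete, here is a minimal sketch on a hand-written tree. The `ul` literal and the `isTagLike` helper are stand-ins invented for this example; real code would use the output of parseDocument and domutils.isTag.

```javascript
// Hypothetical stand-in for domutils.isTag: element-like nodes have
// type 'tag', 'script', or 'style'.
const isTagLike = (n) => n.type === 'tag' || n.type === 'script' || n.type === 'style';

// Hand-written stand-in for: <ul>\n<li>A</li><!-- x --><li>B</li>\n</ul>
const ul = {
  type: 'tag',
  name: 'ul',
  attribs: {},
  children: [
    { type: 'text', data: '\n' },
    { type: 'tag', name: 'li', attribs: {}, children: [{ type: 'text', data: 'A' }] },
    { type: 'comment', data: ' x ' },
    { type: 'tag', name: 'li', attribs: {}, children: [{ type: 'text', data: 'B' }] },
    { type: 'text', data: '\n' },
  ],
};

const allNodes = ul.children;                       // every node type
const elementsOnly = ul.children.filter(isTagLike); // what jsdom's el.children would hold

console.log(allNodes.length);     // 5
console.log(elementsOnly.length); // 2
```

Code ported from jsdom that indexes into children (for example `el.children[0]`) is the usual place where this bites, because index 0 is often a whitespace text node.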

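A node in htmlparser2 is a lightweight object (an instance of a small domhandler class) whose fields you read and write directly. You can console.log it and mutate its properties freely. JSON.stringify needs care, though: the parent, next, and prev links make the structure circular, so stringifying a raw node throws unless you skip those keys.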

htmlparser2 — what a node looks like

// <a href="/about" class="nav-link">About</a>
{
  type: 'tag',
  name: 'a',                        // always lowercase
  attribs: { href: '/about', class: 'nav-link' },
  children: [
    { type: 'text', data: 'About', parent: [Circular], … }
  ],
  parent: { … },
  next: { … },   // next sibling (any type)
  prev: { … }    // previous sibling (any type)
}
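
Because parent, next, and prev point back into the tree, serialising a node for logging or snapshots needs a replacer that skips those links. The following is a minimal sketch; `toJSONSafe` and the hand-built `anchor` node are hypothetical names used only for illustration.

```javascript
// Keys that point back into the tree and therefore create cycles.
// Note: as a debugging aid this also drops attributes literally
// named "parent", "prev", or "next".
const LINK_KEYS = new Set(['parent', 'prev', 'next']);

// Hypothetical helper: serialise a node while skipping the link fields.
function toJSONSafe(node) {
  return JSON.stringify(node, (key, value) => (LINK_KEYS.has(key) ? undefined : value));
}

// Hand-built stand-in for <a href="/about">About</a>
const anchor = {
  type: 'tag', name: 'a', attribs: { href: '/about' }, children: [],
  parent: null, prev: null, next: null,
};
const label = { type: 'text', data: 'About', parent: anchor, prev: null, next: null };
anchor.children.push(label);

let threw = false;
try {
  JSON.stringify(anchor); // label.parent points back at anchor
} catch {
  threw = true;
}

console.log(threw); // true
console.log(toJSONSafe(anchor));
// {"type":"tag","name":"a","attribs":{"href":"/about"},"children":[{"type":"text","data":"About"}]}
```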

2 Parsing HTML

jsdom

import { JSDOM } from 'jsdom';

const dom = new JSDOM(html);
const document = dom.window.document;

// document is a browser Document object

htmlparser2

import { parseDocument } from 'htmlparser2';

const document = parseDocument(html);

// document is a plain JS object:
// { type: 'root', children: [ … ] }

📦 Install: npm install htmlparser2 domutils css-select domhandler

All examples use ESM import syntax. For CommonJS, swap to require() — e.g. const { parseDocument } = require('htmlparser2'), const domutils = require('domutils').

💡 parseDocument options: Pass { decodeEntities: false } to keep HTML entities undecoded; by default they are decoded. The xmlMode: true option enables case-sensitive XML/XHTML parsing (see section 13).

The returned document has type: 'root' and a children array. You pass it directly to css-select and domutils functions — think of it as your entry point, equivalent to jsdom's document.
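
Since the root is just a node with a children array, walking the whole tree is a short recursive function. The sketch below runs on a hand-written `doc` literal that imitates the parser's output; `walk` is a hypothetical helper, not a domutils function.

```javascript
// Depth-first walk over a tree shaped like parseDocument's result.
function walk(node, visit, depth = 0) {
  visit(node, depth);
  for (const child of node.children ?? []) walk(child, visit, depth + 1);
}

// Hand-written stand-in for: <!DOCTYPE html><p>Hi</p>
const doc = {
  type: 'root',
  children: [
    { type: 'directive', name: '!doctype', data: '!DOCTYPE html' },
    { type: 'tag', name: 'p', attribs: {}, children: [{ type: 'text', data: 'Hi' }] },
  ],
};

const seen = [];
walk(doc, (node, depth) => seen.push(`${'  '.repeat(depth)}${node.type}`));
console.log(seen.join('\n'));
// root
//   directive
//   tag
//     text
```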

3 Querying the DOM

css-select is the query engine. It accepts the same selector syntax you use in the browser; the main difference is the call shape, since the root node is passed as the second argument.

jsdom

// Single element (or null)
document.querySelector('.nav-link')
document.querySelector('#main')
document.querySelector('a[href^="https"]')

// All matching elements (NodeList)
document.querySelectorAll('ul > li')
document.querySelectorAll('[data-id]')

// Classic helpers
document.getElementById('main')
document.getElementsByClassName('card')
document.getElementsByTagName('p')

htmlparser2 + css-select

import { selectOne, selectAll } from 'css-select';

// Single element (or null)
selectOne('.nav-link', document)
selectOne('#main', document)
selectOne('a[href^="https"]', document)

// All matching elements (plain Array)
selectAll('ul > li', document)
selectAll('[data-id]', document)

// ID / class / tag — just use selectors:
selectOne('#main', document)
selectAll('.card', document)
selectAll('p', document)

Predicate-based queries (domutils)

When a CSS selector isn't enough, use domutils.findOne and domutils.findAll with a custom function:

domutils — predicate queries

import * as domutils from 'domutils';

// Find first tag whose text starts with "Price:"
const node = domutils.findOne(
  n => domutils.isTag(n) && domutils.getText(n).startsWith('Price:'),
  document.children
);

// Find all <li> elements that have a data-id attribute
const items = domutils.findAll(
  n => n.name === 'li' && domutils.hasAttrib(n, 'data-id'),
  document.children
);

Reusable compiled selectors

jsdom

// No native way to compile/reuse a selector.
// Every call re-parses the selector string.
rows.forEach(row => {
  const cell = row.querySelector('td.price');
});

css-select

import { compile, selectOne } from 'css-select';

// Compile once, reuse many times (faster in loops)
const priceCell = compile('td.price');

rows.forEach(row => {
  const cell = selectOne(priceCell, row);
});

Check if a node matches a selector

jsdom

element.matches('.active')
element.matches('a[href]')

css-select

import { is } from 'css-select';

is(node, '.active')
is(node, 'a[href]')

📋 Return types: jsdom's querySelectorAll returns a static NodeList, which has forEach but no .map() or .filter(). css-select's selectAll returns a plain Array, so .map(), .filter(), .find() etc. work on it directly with no conversion.

4 Attributes

In htmlparser2, attributes are stored as a plain object on node.attribs. You can access them directly — no method calls required.

jsdom

// Read
el.getAttribute('href')       // '/about' or null
el.id                         // shorthand for id attr
el.className                  // shorthand for class attr

// Check
el.hasAttribute('disabled')   // true/false

// Write
el.setAttribute('href', '/new')
el.removeAttribute('disabled')

// Iterate all attrs
for (const attr of el.attributes) {
  console.log(attr.name, attr.value);
}

htmlparser2 + domutils

// Read — direct object access
el.attribs['href']                    // '/about' or undefined
el.attribs['id']
el.attribs['class']

// or with domutils (returns undefined, not null)
domutils.getAttributeValue(el, 'href')

// Check
domutils.hasAttrib(el, 'disabled')    // true/false
'disabled' in el.attribs              // nearly the same (`in` also sees inherited keys like 'toString')

// Write — mutate the object directly
el.attribs['href'] = '/new'
delete el.attribs['disabled']

// Iterate all attrs
Object.entries(el.attribs).forEach(([name, value]) => {
  console.log(name, value);
});

⚠ null vs undefined: jsdom's getAttribute returns null when an attribute is absent. htmlparser2's el.attribs['missing'] and domutils.getAttributeValue return undefined. If your code does if (attr !== null), update it to if (attr != null) or if (attr !== undefined).

💡 Boolean attributes: In HTML, <input disabled> is parsed as attribs: { disabled: '' }, an empty string rather than true. Check presence with domutils.hasAttrib(el, 'disabled'), not by testing the value.
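
If a lot of existing code relies on jsdom's conventions, two small wrappers can restore them. This is a sketch under the assumption that nodes carry a plain `attribs` object as described above; `getAttr`, `hasAttr`, and the `input` literal are hypothetical names, not domutils APIs.

```javascript
const own = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Like el.getAttribute(name): null, not undefined, when the attribute is absent.
const getAttr = (el, name) => (el.attribs && own(el.attribs, name) ? el.attribs[name] : null);

// Like el.hasAttribute(name): presence only, the value is irrelevant.
const hasAttr = (el, name) => Boolean(el.attribs) && own(el.attribs, name);

// Hand-written stand-in for <input type="checkbox" disabled>
const input = { type: 'tag', name: 'input', attribs: { type: 'checkbox', disabled: '' } };

console.log(getAttr(input, 'href') === null);    // true: absent attribute
console.log(getAttr(input, 'disabled') === '');  // true: present, but an empty (falsy) string
console.log(hasAttr(input, 'disabled'));         // true: presence check is the reliable one
```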

5 Text Content

jsdom

// All text, recursively (equivalent to textContent)
el.textContent

// Rendered text — layout-aware, skips hidden elements
// (browser only, not really useful in jsdom either)
el.innerText

// Text node's raw string value
textNode.nodeValue   // or textNode.data or textNode.textContent

domutils

// All text, recursively: the closest match to textContent
domutils.textContent(node)

// domutils.getText(node) is similar but also emits '\n' for <br>.
// domutils.innerText(node) skips <script>/<style> content; it is
// not layout-aware like the browser's innerText.

// Text node's raw string value
textNode.data

Working with text nodes directly

jsdom

// Get only direct text node children
// NodeList has no .filter() — spread it first
[...el.childNodes]
  .filter(n => n.nodeType === Node.TEXT_NODE)
  .map(n => n.textContent)
  .join('')

htmlparser2

// Get only direct text node children
el.children
  .filter(n => n.type === 'text')
  .map(n => n.data)
  .join('')

📋 Whitespace: htmlparser2 preserves whitespace-only text nodes (e.g. newlines between tags). Filter with .filter(n => n.type === 'text' && n.data.trim()) if you only want meaningful text.
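
The whitespace filter can be folded into a small helper that returns only the meaningful direct text of an element. A minimal sketch, with `directText` and the `p` literal invented for this example:

```javascript
// Collect the element's own text (not its descendants'), skipping
// whitespace-only text nodes and trimming the rest.
function directText(el) {
  return el.children
    .filter((n) => n.type === 'text' && n.data.trim() !== '')
    .map((n) => n.data.trim())
    .join(' ');
}

// Hand-written stand-in for:
// <p>
//   <span>Price:</span> 42 EUR<br>
// </p>
const p = {
  type: 'tag', name: 'p', attribs: {},
  children: [
    { type: 'text', data: '\n  ' },
    { type: 'tag', name: 'span', attribs: {}, children: [{ type: 'text', data: 'Price:' }] },
    { type: 'text', data: ' 42 EUR' },
    { type: 'tag', name: 'br', attribs: {}, children: [] },
    { type: 'text', data: '\n' },
  ],
};

console.log(directText(p)); // 42 EUR
```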

6 Tree Navigation

jsdom

// Parent
el.parentElement    // element parent or null
el.parentNode       // any node parent (incl. document)

// Children — ELEMENTS only (HTMLCollection)
el.children
el.firstElementChild
el.lastElementChild

// Children — ALL nodes incl. text (NodeList)
el.childNodes
el.firstChild
el.lastChild

// Siblings — elements only
el.nextElementSibling
el.previousElementSibling

// Siblings — any node
el.nextSibling
el.previousSibling

// Tag name (UPPERCASE)
el.tagName    // 'DIV'

htmlparser2 + domutils

// Parent
el.parent     // any node parent (null at root)
domutils.getParent(el)  // same thing

// Children — ALL nodes (text, comments, elements)
el.children
el.children.find(domutils.isTag)           // first element child
[...el.children].reverse().find(domutils.isTag) // last element child

// Children — ELEMENTS only
el.children.filter(domutils.isTag)
domutils.getChildren(el).filter(domutils.isTag)

// Siblings — any node
el.next       // next sibling
el.prev       // previous sibling

// Siblings — element only (walk next/prev, skip non-elements)
let n = el.next;
while (n && !domutils.isTag(n)) n = n.next;
// n is nextElementSibling (or null) — see clean helpers below

// Tag name (lowercase)
el.name       // 'div'

⚠ tagName casing: jsdom returns el.tagName === 'DIV' (uppercase). htmlparser2 stores el.name === 'div' (lowercase). If your business logic compares tag names, update the comparison or normalize with el.name.toUpperCase().

Next/previous element sibling — clean pattern

htmlparser2

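// domutils also exports nextElementSibling / prevElementSibling with
// this behaviour; the hand-written versions below show what they do.
// Walk forward through siblings until you find an element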
function nextElementSibling(node) {
  let n = node.next;
  while (n && !domutils.isTag(n)) n = n.next;
  return n || null;
}

function prevElementSibling(node) {
  let n = node.prev;
  while (n && !domutils.isTag(n)) n = n.prev;
  return n || null;
}

7 Type Checking

jsdom uses numeric nodeType constants. htmlparser2 uses a type string on each node.

jsdom

node.nodeType === Node.ELEMENT_NODE    // 1
node.nodeType === Node.TEXT_NODE       // 3
node.nodeType === Node.COMMENT_NODE    // 8
node.nodeType === Node.DOCUMENT_NODE   // 9

node instanceof Element   // is an element
node instanceof Text      // is a text node

htmlparser2 + domutils

domutils.isTag(node)       // type 'tag' | 'script' | 'style'
domutils.isText(node)      // type === 'text'
domutils.isComment(node)   // type === 'comment'
domutils.isDocument(node)  // type === 'root'

// Or check the string directly:
node.type === 'tag'
node.type === 'text'
node.type === 'comment'
node.type === 'root'

💡 isTag includes <script> and <style>: domutils.isTag(node) returns true for type === 'tag', 'script', and 'style'. This mirrors browser behavior, where <script> is an Element. If you need regular tags only, check node.type === 'tag'.

Node type strings at a glance

HTML	node.type	domutils check
`<div>`, `<p>`, etc.	`'tag'`	`isTag(n)`
`<script>`	`'script'`	`isTag(n)`
`<style>`	`'style'`	`isTag(n)`
text between tags	`'text'`	`isText(n)`
`<!-- comment -->`	`'comment'`	`isComment(n)`
root document	`'root'`	`isDocument(n)`
`<!DOCTYPE html>`	`'directive'`	—
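
For code that switches on numeric nodeType values, a lookup from the type strings in the table back to the DOM constants can ease the transition. The mapping below is a hypothetical sketch covering only the four constants used in this section; other types (such as 'directive') deliberately return null.

```javascript
// Type string → DOM nodeType number, for the node kinds listed above.
const NODE_TYPES = new Map([
  ['tag', 1], ['script', 1], ['style', 1], // Node.ELEMENT_NODE
  ['text', 3],                             // Node.TEXT_NODE
  ['comment', 8],                          // Node.COMMENT_NODE
  ['root', 9],                             // Node.DOCUMENT_NODE
]);

// Hypothetical helper: returns null for types not in the map.
const nodeTypeOf = (node) => NODE_TYPES.get(node.type) ?? null;

console.log(nodeTypeOf({ type: 'script' }));    // 1
console.log(nodeTypeOf({ type: 'text' }));      // 3
console.log(nodeTypeOf({ type: 'directive' })); // null
```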

8 Serialization (innerHTML / outerHTML)

jsdom

el.innerHTML    // content inside the element
el.outerHTML    // element itself + its content

domutils

domutils.getInnerHTML(node)    // content inside
domutils.getOuterHTML(node)    // node + content

// Alternative: dom-serializer (more control)
import render from 'dom-serializer';
render(node.children)          // inner HTML
render(node)                   // outer HTML

📋 dom-serializer: domutils.getInnerHTML and getOuterHTML use dom-serializer internally. Install it directly (npm i dom-serializer) only if you want to call the serializer yourself, for example to pass options such as decodeEntities: false or xmlMode: true.

9 Class Manipulation

There is no classList in htmlparser2. Classes are just a space-separated string in node.attribs.class.

jsdom

el.classList.contains('active')
el.classList.add('active')
el.classList.remove('active')
el.classList.toggle('active')
el.classList.replace('old', 'new')

htmlparser2

const classes = () =>
  (el.attribs.class || '').split(/\s+/).filter(Boolean);

// contains
classes().includes('active')
// or with css-select (no mutation):
is(el, '.active')   // import { is } from 'css-select'

// add
if (!classes().includes('active'))
  el.attribs.class = [...classes(), 'active'].join(' ');

// remove
el.attribs.class = classes().filter(c => c !== 'active').join(' ');

// toggle
el.attribs.class = classes().includes('active')
  ? classes().filter(c => c !== 'active').join(' ')
  : [...classes(), 'active'].join(' ');

// replace
el.attribs.class = classes().map(c => c === 'old' ? 'new' : c).join(' ');

💡 Wrap it once: If you manipulate classes a lot, write a tiny helper at the top of the file rather than repeating the split/join pattern. It is also a good indicator of what belongs in an adapter (see section 11).
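
One possible shape for such a helper is sketched below. All function names are hypothetical, and the code assumes the element has an `attribs` object; it is meant as a starting point rather than a complete classList replacement.

```javascript
// Read the class attribute as an array of names.
const classList = (el) => (el.attribs.class || '').split(/\s+/).filter(Boolean);
// Write an array of names back as a space-separated string.
const setClasses = (el, list) => { el.attribs.class = list.join(' '); };

const hasClass = (el, name) => classList(el).includes(name);
const addClass = (el, name) => {
  if (!hasClass(el, name)) setClasses(el, [...classList(el), name]);
};
const removeClass = (el, name) => setClasses(el, classList(el).filter((c) => c !== name));
const toggleClass = (el, name) => {
  const on = !hasClass(el, name);
  (on ? addClass : removeClass)(el, name);
  return on; // true if the class is now present, like classList.toggle
};

// Hand-written stand-in for <div class="card  active">
const card = { type: 'tag', name: 'div', attribs: { class: 'card  active' } };

console.log(toggleClass(card, 'active')); // false
addClass(card, 'wide');
console.log(card.attribs.class);          // card wide
```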

10 DOM Mutation

You can mutate the tree by editing the plain objects directly, or use domutils helpers for structural changes.

jsdom

// Remove a node
el.parentNode.removeChild(el)
// or: el.remove()

// Replace a node
el.parentNode.replaceChild(newNode, el)

// Insert before / after a reference node
parent.insertBefore(newNode, refNode)
refNode.after(newNode)

// Create nodes
document.createElement('div')
document.createTextNode('hello')

domutils

// Remove a node (updates parent.children + sibling links)
domutils.removeElement(el)

// Replace a node
domutils.replaceElement(el, newNode)

// Insert as sibling
domutils.prepend(refNode, newNode)   // insert newNode before refNode
domutils.append(refNode, newNode)    // insert newNode after refNode

// Create nodes — use domhandler constructors:
import { Element, Text } from 'domhandler';
const div = new Element('div', { class: 'box' }, []);
const txt = new Text('hello');

⚠ Keep the tree consistent: The domutils helpers (append, prepend, appendChild, prependChild, removeElement, replaceElement) update the parent, next, and prev references for you. If you push directly into node.children instead, those references go stale and queries/traversal will break.
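
To see what "consistent" means in practice, the sketch below spells out the link updates that appending a last child requires. It is an illustration of the bookkeeping only, not the domutils implementation, and it assumes the child is not currently attached anywhere else; `appendChildLinked` and `mk` are hypothetical names.

```javascript
// Illustrative only: the references an append-as-last-child must update.
function appendChildLinked(parent, child) {
  const last = parent.children[parent.children.length - 1] ?? null;
  child.parent = parent;
  child.prev = last;
  child.next = null;
  if (last) last.next = child;
  parent.children.push(child);
}

// Tiny factory for hand-built element-like nodes.
const mk = (name) => ({
  type: 'tag', name, attribs: {}, children: [], parent: null, prev: null, next: null,
});

const list = mk('ul');
const first = mk('li');
const second = mk('li');
appendChildLinked(list, first);
appendChildLinked(list, second);

console.log(first.next === second);  // true
console.log(second.prev === first);  // true
console.log(second.parent === list); // true
```

A bare `list.children.push(second)` would leave all three of these links untouched, which is why sibling walks and parent lookups then fail.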

Append a child

jsdom

parent.appendChild(child);

htmlparser2

import { Element } from 'domhandler';
import * as domutils from 'domutils';

// domutils.appendChild adds child as the last child of parent and
// fixes up parent / prev / next, including for an empty parent.
domutils.appendChild(parent, child);

// Building a list from scratch: create the parent, then append.
// Passing children to the Element constructor only stores the array;
// it does not set their parent / prev / next links.
const ul = new Element('ul', {}, []);
for (const li of [li1, li2, li3]) domutils.appendChild(ul, li);

11 The Adapter Pattern

If your business logic is written against the jsdom API, you can wrap htmlparser2 in a jsdom-compatible interface so that logic doesn't have to change. Here's a realistic view of what that looks like.

Minimal adapter skeleton

// adapter.js
import { parseDocument } from 'htmlparser2';
import * as domutils from 'domutils';
import { selectOne, selectAll, is } from 'css-select';

class NodeAdapter {
  constructor(node) { this._n = node; }

  // ── Querying ──
  querySelector(sel)    { const n = selectOne(sel, this._n); return n ? new NodeAdapter(n) : null; }
  querySelectorAll(sel) { return selectAll(sel, this._n).map(n => new NodeAdapter(n)); }
  matches(sel)          { return is(this._n, sel); }
  closest(sel)          {
    let n = this._n;
    while (n) { if (domutils.isTag(n) && is(n, sel)) return new NodeAdapter(n); n = n.parent; }
    return null;
  }

  // ── Attributes ──
  getAttribute(name)         { return this._n.attribs?.[name] ?? null; }
  setAttribute(name, value)  { if (this._n.attribs) this._n.attribs[name] = String(value); }
  hasAttribute(name)         { return domutils.hasAttrib(this._n, name); }
  removeAttribute(name)      { delete this._n.attribs?.[name]; }

  // ── Content ──
  get textContent()  { return domutils.getText(this._n); }
  get innerHTML()    { return domutils.getInnerHTML(this._n); }
  get outerHTML()    { return domutils.getOuterHTML(this._n); }

  // ── Identity ──
  get tagName()      { return this._n.name?.toUpperCase() ?? ''; }
  get id()           { return this._n.attribs?.id ?? ''; }
  get className()    { return this._n.attribs?.class ?? ''; }

  // ── Navigation ──
  get parentElement() {
    const p = this._n.parent;
    return p && domutils.isTag(p) ? new NodeAdapter(p) : null;
  }
  get children() {
    return (this._n.children || []).filter(domutils.isTag).map(n => new NodeAdapter(n));
  }
  get childNodes() {
    return (this._n.children || []).map(n => new NodeAdapter(n));
  }
  get nextElementSibling() {
    let n = this._n.next;
    while (n && !domutils.isTag(n)) n = n.next;
    return n ? new NodeAdapter(n) : null;
  }
  get previousElementSibling() {
    let n = this._n.prev;
    while (n && !domutils.isTag(n)) n = n.prev;
    return n ? new NodeAdapter(n) : null;
  }

  // ── classList shim ──
  get classList() {
    const el = this._n;
    const get = () => (el.attribs?.class || '').split(/\s+/).filter(Boolean);
    return {
      contains: c => get().includes(c),
      add:      c => { if (!get().includes(c)) el.attribs.class = [...get(), c].join(' '); },
      remove:   c => { el.attribs.class = get().filter(x => x !== c).join(' '); },
      toggle:   c => {
        const on = !get().includes(c);
        el.attribs.class = (on ? [...get(), c] : get().filter(x => x !== c)).join(' ');
        return on; // boolean, like DOMTokenList.toggle
      },
    };
  }

  // ── Unwrap to raw node if needed ──
  get raw() { return this._n; }
}

function parseHTML(html) {
  const doc = parseDocument(html);
  return new NodeAdapter(doc);
}

export { parseHTML, NodeAdapter };
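
One behaviour the skeleton does not reproduce is wrapper identity: every getter creates a fresh NodeAdapter, so `a.parentElement === b.parentElement` is false even when both wrap the same raw node, whereas jsdom returns the same object. If your code compares nodes with ===, a WeakMap cache is one way to address this. The sketch below uses a deliberately tiny, hypothetical `CachedAdapter` class to show the idea in isolation.

```javascript
// One wrapper per raw node, held weakly so wrappers do not leak.
const wrappers = new WeakMap();

class CachedAdapter {
  constructor(node) { this._n = node; }

  // Always go through wrap() instead of `new` to get a stable wrapper.
  static wrap(node) {
    if (!node) return null;
    let wrapper = wrappers.get(node);
    if (!wrapper) {
      wrapper = new CachedAdapter(node);
      wrappers.set(node, wrapper);
    }
    return wrapper;
  }

  get parentElement() {
    const p = this._n.parent;
    return p && p.type === 'tag' ? CachedAdapter.wrap(p) : null;
  }
}

// Hand-built stand-in for <div><span></span><span></span></div>
const div = { type: 'tag', name: 'div', attribs: {}, children: [], parent: null };
const spanA = { type: 'tag', name: 'span', attribs: {}, children: [], parent: div };
const spanB = { type: 'tag', name: 'span', attribs: {}, children: [], parent: div };
div.children.push(spanA, spanB);

const sameParent =
  CachedAdapter.wrap(spanA).parentElement === CachedAdapter.wrap(spanB).parentElement;
console.log(sameParent); // true
```

The same `wrap` call would replace each `new NodeAdapter(n)` in the skeleton; alternatively, compare the underlying nodes through the `raw` getter.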

When the adapter is the right choice

Your business logic is large and heavily uses the DOM API — rewriting everything at once is risky.
You want to swap the underlying parser without touching tested business logic.
You can ship the adapter, migrate gradually, then decide whether to remove it later.

When to skip the adapter and migrate directly

Your DOM usage is limited and already mapped (querying + attribute reads + text extraction).
The adapter is papering over gaps — every missing method you add is a maintenance burden.
You want the full performance benefit of htmlparser2 (the adapter adds object wrapping overhead).

💡 Recommendation: Start with the adapter. It lets you validate that htmlparser2 produces correct results against your existing tests without touching business logic. Once you have confidence, decide per-module whether a direct migration is worth it.

12 Live Playground

Parse HTML and query it with css-select — running real htmlparser2 + domutils + css-select in the browser.

📋 About the playground: Uses esm.sh to load htmlparser2, domutils, and css-select as ES modules, the same code that runs in Node.js. If the playground shows a loading error, check your network connection.

13 XML Mode

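htmlparser2 ships a built-in XML mode that changes how the parser interprets the document. jsdom offers the counterpart through the contentType constructor option. In both, tag names are case-sensitive and self-closing tags are honoured. One asymmetry to keep in mind: jsdom's HTML parser adds missing <html>, <head>, and <body> elements, whereas htmlparser2 does not inject structure in either mode.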

jsdom — parse as XML

import { JSDOM } from 'jsdom';

const dom = new JSDOM(xmlString, {
  contentType: 'application/xml',
});
const document = dom.window.document;

// Tag names are case-sensitive and preserved as-is.
// Malformed XML produces a parseerror document.

htmlparser2 — xmlMode

import { parseDocument } from 'htmlparser2';

const doc = parseDocument(xmlString, { xmlMode: true });

// xmlMode changes three things:
//   1. Tag and attribute names are preserved as-is (no lowercasing)
//   2. Self-closing tags (<br/>, <MyTag/>) are honoured
//   3. CDATA sections are parsed as CDATA nodes

Key behavioural differences

Behaviour	HTML mode (default)	xmlMode: true
Tag name casing	lowercased — `el.name === 'div'`	preserved — `el.name === 'MyTag'`
Self-closing tags	only void elements (`<br>`, `<img>`…)	any tag — `<Foo/>` has no children
Implicit structure	none: missing `<html>`/`<head>`/`<body>` are not added (unlike jsdom)	none: document mirrors the input
Error recovery	lenient and forgiving, but not the HTML5 spec algorithm	best-effort; parser does not throw on malformed XML
CDATA sections	treated as comments	parsed as CDATA nodes (`type: 'cdata'`)
Namespaces	not resolved; a prefix is simply part of the lowercased name	not resolved; prefix preserved in `el.name` (e.g. `'svg:path'`)

Querying XML documents

css-select and domutils work on XML-mode trees too: pass the same parsed root to selectOne / selectAll. The one thing to remember is that css-select lowercases tag names in selectors unless you also pass { xmlMode: true } as its options argument, so mixed-case tag names only match with that option set.

htmlparser2 — querying XML

import { parseDocument } from 'htmlparser2';
import { selectAll } from 'css-select';

const xml = `<Library>
  <Book genre="fiction"><Title>Dune</Title></Book>
  <Book genre="non-fiction"><Title>Sapiens</Title></Book>
</Library>`;

const doc = parseDocument(xml, { xmlMode: true });

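// css-select needs its own xmlMode flag; without it, tag names in
// selectors are lowercased and <Book> is never matched.
const opts = { xmlMode: true };

// Tag names are case-sensitive and must match exactly
const books  = selectAll('Book', doc, opts);           // ✓ matches <Book>
const titles = selectAll('Book > Title', doc, opts);   // ✓
const wrong  = selectAll('book', doc, opts);           // ✗ no match (lowercase)

// Attribute selectors work as usual
const fiction = selectAll('Book[genre="fiction"]', doc, opts);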
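
The case rule for tag selectors can be modelled in a few lines. This is a simplified sketch of the comparison as understood from css-select's documented xmlMode option, not its actual code; `tagMatches` and the `book` literal are hypothetical.

```javascript
// Simplified model: outside xmlMode the selector's tag name is
// lowercased before it is compared with node.name.
function tagMatches(node, selectorName, { xmlMode = false } = {}) {
  const wanted = xmlMode ? selectorName : selectorName.toLowerCase();
  return node.name === wanted;
}

// A node as the parser would store it with xmlMode: true (case preserved).
const book = { type: 'tag', name: 'Book', attribs: { genre: 'fiction' }, children: [] };

console.log(tagMatches(book, 'Book'));                    // false: 'book' !== 'Book'
console.log(tagMatches(book, 'Book', { xmlMode: true })); // true
console.log(tagMatches(book, 'book', { xmlMode: true })); // false
```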

💡 Parsing SVG and RSS/Atom feeds: SVG embedded in HTML parses fine in default mode. For standalone SVG files or RSS/Atom XML feeds, pass xmlMode: true so that self-closing tags and namespace prefixes are handled correctly.

⚠ htmlparser2 is not a validating XML parser: It does not enforce well-formedness, DTDs, or XML schemas. Malformed XML is handled with best-effort recovery rather than throwing an error. Use a dedicated XML parser (e.g. fast-xml-parser, sax) if strict validation or namespace resolution is required.

14 Cheat Sheet

jsdom	htmlparser2 equivalent
SETUP
`new JSDOM(html).window.document`	`parseDocument(html)`
QUERYING
`document.querySelector(sel)`	`selectOne(sel, document)`
`document.querySelectorAll(sel)`	`selectAll(sel, document)` → plain Array
`document.getElementById('x')`	`selectOne('#x', document)`
`document.getElementsByClassName('x')`	`selectAll('.x', document)`
`document.getElementsByTagName('p')`	`selectAll('p', document)`
`el.matches(sel)`	`is(el, sel)` from css-select
`el.closest(sel)`	manual walk up via `el.parent`
—	`domutils.findOne(fn, nodes)` — predicate query
—	`compile(sel)` — reusable selector
ATTRIBUTES
`el.getAttribute('x')` → null if missing	`el.attribs['x']` → undefined if missing
`el.setAttribute('x', v)`	`el.attribs['x'] = v`
`el.hasAttribute('x')`	`domutils.hasAttrib(el, 'x')`
`el.removeAttribute('x')`	`delete el.attribs['x']`
`el.id`	`el.attribs['id']`
`el.className`	`el.attribs['class']`
TEXT CONTENT
`el.textContent`	`domutils.textContent(el)` (or `getText`, which adds `\n` for `<br>`)
`textNode.data`	`textNode.data` ← same!
NAVIGATION
`el.tagName` (uppercase)	`el.name` (lowercase)
`el.parentElement`	`el.parent` (check `domutils.isTag`)
`el.children` (elements only)	`el.children.filter(domutils.isTag)`
`el.childNodes` (all nodes)	`el.children`
`el.nextElementSibling`	`domutils.nextElementSibling(el)` or walk `el.next` until `isTag`
`el.previousElementSibling`	`domutils.prevElementSibling(el)` or walk `el.prev` until `isTag`
`el.nextSibling`	`el.next`
`el.previousSibling`	`el.prev`
TYPE CHECKING
`nodeType === 1` (element)	`domutils.isTag(n)`
`nodeType === 3` (text)	`domutils.isText(n)` or `n.type === 'text'`
`nodeType === 8` (comment)	`domutils.isComment(n)`
`node instanceof Element`	`domutils.isTag(n)`
SERIALIZATION
`el.innerHTML`	`domutils.getInnerHTML(el)`
`el.outerHTML`	`domutils.getOuterHTML(el)`
CLASSES
`el.classList.contains('x')`	`is(el, '.x')` or manual split
`el.classList.add('x')`	manual split/join on `el.attribs.class`
`el.classList.remove('x')`	manual split/filter/join
MUTATION
`parent.removeChild(el)`	`domutils.removeElement(el)`
`parent.replaceChild(newEl, el)`	`domutils.replaceElement(el, newEl)`
`parent.insertBefore(newEl, ref)`	`domutils.prepend(ref, newEl)`
`parent.appendChild(el)`	`domutils.appendChild(parent, el)`
`document.createElement('div')`	`new Element('div', {}, [])` from domhandler
`document.createTextNode('x')`	`new Text('x')` from domhandler
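
The "manual walk up via `el.parent`" entry for `el.closest(sel)` can be sketched as follows. To stay dependency-free the example takes a predicate instead of a selector string (with css-select you would test each ancestor with `is(n, sel)`); `closest`, `isTagLike`, `hasClassName`, and the hand-built nodes are hypothetical.

```javascript
// Stand-in for domutils.isTag.
const isTagLike = (n) => n.type === 'tag' || n.type === 'script' || n.type === 'style';

// Walk from the node itself up through its ancestors and return the
// first element that satisfies the test, or null.
function closest(node, test) {
  for (let n = node; n; n = n.parent) {
    if (isTagLike(n) && test(n)) return n;
  }
  return null;
}

// Hand-built stand-in for <section class="card"><p>Hello</p></section>
const root = { type: 'root', children: [], parent: null };
const section = { type: 'tag', name: 'section', attribs: { class: 'card' }, children: [], parent: root };
const para = { type: 'tag', name: 'p', attribs: {}, children: [], parent: section };
const text = { type: 'text', data: 'Hello', parent: para };
root.children.push(section);
section.children.push(para);
para.children.push(text);

const hasClassName = (n, c) => (n.attribs.class || '').split(/\s+/).includes(c);

console.log(closest(text, (n) => hasClassName(n, 'card')) === section); // true
console.log(closest(para, (n) => n.name === 'table'));                  // null
```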