Migrating from jsdom to htmlparser2
A practical reference for developers who know jsdom and want to switch to the htmlparser2 ecosystem.
jsdom runs a full virtual browser in Node.js. That makes sense when you need to execute scripts or test UI behaviour, but for scraping, templating, and data extraction it brings in a lot of weight that the job simply does not require.
htmlparser2 parses HTML into a plain JavaScript object tree. Querying is handled by css-select, and traversal utilities live in domutils. This guide shows you the equivalent of every jsdom pattern you already know, section by section.
1 The Mental Model Shift
jsdom simulates a browser: it gives you a live document object with methods (querySelector, getElementById…) that behave exactly like in a browser. The DOM tree is made of class instances.
htmlparser2 is different. It gives you a plain JavaScript object tree. There is no window, no document.querySelector, no classList. Querying and traversal are done by separate utility libraries.
element.children in jsdom returns only element nodes (HTMLCollection). In htmlparser2, node.children includes all node types — text nodes, comments, everything. Always filter with domutils.isTag(n) when you want elements only.
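A node in htmlparser2 is a lightweight object whose data lives in ordinary properties (type, name, attribs, children). You can console.log it and read or assign its fields directly. The parent, prev, and next links make the tree circular, though, so JSON.stringify on a raw node throws; drop those links first if you need JSON.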
// <a href="/about" class="nav-link">About</a>
{
type: 'tag',
name: 'a', // always lowercase
attribs: { href: '/about', class: 'nav-link' },
children: [
{ type: 'text', data: 'About', parent: [Circular], … }
],
parent: { … },
next: { … }, // next sibling (any type)
prev: { … } // previous sibling (any type)
}
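Because parent, next and prev point back into the tree, a node cannot be passed straight to JSON.stringify. Below is a minimal sketch of a debugging helper that copies only the data fields. toPlain is a hypothetical name, and the sample node is hand-built to match the shape above rather than produced by the parser, so the snippet runs without installing anything.

```javascript
// Hypothetical helper: copy a node without its parent/prev/next links
// so it can be logged or JSON-serialized safely.
function toPlain(node) {
  const out = { type: node.type };
  if (node.name !== undefined) out.name = node.name;
  if (node.attribs !== undefined) out.attribs = { ...node.attribs };
  if (node.data !== undefined) out.data = node.data;
  if (Array.isArray(node.children)) out.children = node.children.map(toPlain);
  return out;
}

// Hand-built stand-in for <a href="/about" class="nav-link">About</a>
const link = {
  type: 'tag',
  name: 'a',
  attribs: { href: '/about', class: 'nav-link' },
  children: [],
  parent: null,
  prev: null,
  next: null,
};
const text = { type: 'text', data: 'About', parent: link, prev: null, next: null };
link.children.push(text);

// Serializing the copy works; serializing `link` itself would throw
// because link -> children -> text -> parent -> link is a cycle.
const json = JSON.stringify(toPlain(link));
```

The copy is one-way: it is meant for logging and snapshots, not for feeding back into css-select or domutils, which rely on the links being present.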
2 Parsing HTML
import { JSDOM } from 'jsdom';
const dom = new JSDOM(html);
const document = dom.window.document;
// document is a browser Document object
import { parseDocument } from 'htmlparser2';
const document = parseDocument(html);
// document is a plain JS object:
// { type: 'root', children: [ … ] }
npm install htmlparser2 domutils css-select domhandler
All examples use ESM import syntax. For CommonJS, swap to require(), e.g. const { parseDocument } = require('htmlparser2') and const domutils = require('domutils').
Pass { decodeEntities: false } as the second argument to parseDocument if you want raw HTML entities preserved. By default entities are decoded. The same options object also accepts xmlMode: true for case-sensitive XHTML parsing.
The returned document has type: 'root' and a children array. You pass it directly to css-select and domutils functions — think of it as your entry point, equivalent to jsdom's document.
3 Querying the DOM
css-select is the query engine. CSS selectors work identically to what you know from the browser.
// Single element (or null)
document.querySelector('.nav-link')
document.querySelector('#main')
document.querySelector('a[href^="https"]')
// All matching elements (NodeList)
document.querySelectorAll('ul > li')
document.querySelectorAll('[data-id]')
// Classic helpers
document.getElementById('main')
document.getElementsByClassName('card')
document.getElementsByTagName('p')
import { selectOne, selectAll } from 'css-select';
// Single element (or null)
selectOne('.nav-link', document)
selectOne('#main', document)
selectOne('a[href^="https"]', document)
// All matching elements (plain Array)
selectAll('ul > li', document)
selectAll('[data-id]', document)
// ID / class / tag — just use selectors:
selectOne('#main', document)
selectAll('.card', document)
selectAll('p', document)
Predicate-based queries (domutils)
When a CSS selector isn't enough, use domutils.findOne and domutils.findAll with a custom function:
import * as domutils from 'domutils';
// Find first tag whose text starts with "Price:"
const node = domutils.findOne(
n => domutils.isTag(n) && domutils.getText(n).startsWith('Price:'),
document.children
);
// Find all <li> elements that have a data-id attribute
const items = domutils.findAll(
n => n.name === 'li' && domutils.hasAttrib(n, 'data-id'),
document.children
);
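To make the predicate contract concrete, here is a rough hand-written equivalent of a predicate search over the tree. It is a sketch of the idea, not the domutils implementation: findAllTags is a hypothetical name, and the tree is hand-built in the node shape from section 1 so the snippet runs without any packages.

```javascript
// Hypothetical stand-in for a predicate search: visit every node in
// document order and collect the elements for which test(node) is truthy.
function findAllTags(test, nodes) {
  const found = [];
  for (const node of nodes) {
    const isElement =
      node.type === 'tag' || node.type === 'script' || node.type === 'style';
    if (isElement && test(node)) found.push(node);
    if (Array.isArray(node.children)) {
      found.push(...findAllTags(test, node.children));
    }
  }
  return found;
}

// Hand-built tree for: <ul><li data-id="1">A</li><li>B</li></ul>
const tree = [{
  type: 'tag',
  name: 'ul',
  attribs: {},
  children: [
    { type: 'tag', name: 'li', attribs: { 'data-id': '1' }, children: [{ type: 'text', data: 'A' }] },
    { type: 'tag', name: 'li', attribs: {}, children: [{ type: 'text', data: 'B' }] },
  ],
}];

// Only elements reach the predicate, so reading n.attribs is safe here.
const withId = findAllTags(n => n.name === 'li' && 'data-id' in n.attribs, tree);
```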
Reusable compiled selectors
// No native way to compile/reuse a selector.
// Every call re-parses the selector string.
rows.forEach(row => {
const cell = row.querySelector('td.price');
});
import { compile, selectOne } from 'css-select';
// Compile once, reuse many times (faster in loops)
const priceCell = compile('td.price');
rows.forEach(row => {
const cell = selectOne(priceCell, row);
});
Check if a node matches a selector
element.matches('.active')
element.matches('a[href]')
import { is } from 'css-select';
is(node, '.active')
is(node, 'a[href]')
querySelectorAll returns a static NodeList, which has forEach but none of the other array methods. css-select's selectAll returns a plain Array, so you can use .map(), .filter(), .find() etc. directly on it, no conversion needed.
4 Attributes
In htmlparser2, attributes are stored as a plain object on node.attribs. You can access them directly — no method calls required.
// Read
el.getAttribute('href') // '/about' or null
el.id // shorthand for id attr
el.className // shorthand for class attr
// Check
el.hasAttribute('disabled') // true/false
// Write
el.setAttribute('href', '/new')
el.removeAttribute('disabled')
// Iterate all attrs
for (const attr of el.attributes) {
console.log(attr.name, attr.value);
}
// Read — direct object access
el.attribs['href'] // '/about' or undefined
el.attribs['id']
el.attribs['class']
// or with domutils (returns undefined, not null)
domutils.getAttributeValue(el, 'href')
// Check
domutils.hasAttrib(el, 'disabled') // true/false
'disabled' in el.attribs // same thing
// Write — mutate the object directly
el.attribs['href'] = '/new'
delete el.attribs['disabled']
// Iterate all attrs
Object.entries(el.attribs).forEach(([name, value]) => {
console.log(name, value);
});
getAttribute returns null when an attribute is absent. htmlparser2's el.attribs['missing'] and domutils.getAttributeValue return undefined. If your code does if (attr !== null), update it to if (attr != null) or if (attr !== undefined).
<input disabled> is parsed as attribs: { disabled: '' } — an empty string, not true. Check presence with domutils.hasAttrib(el, 'disabled'), not by checking the value.
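If a lot of existing code depends on the null-when-missing contract, one option is a small shim instead of touching every comparison. The following is a minimal sketch under that assumption; getAttr and hasAttr are hypothetical names, and the element is a hand-built object with the attribs shape described above.

```javascript
// Hypothetical shim: mimic getAttribute's "null when absent" contract
// on top of the plain attribs object.
function getAttr(el, name) {
  const attribs = el.attribs || {};
  return Object.prototype.hasOwnProperty.call(attribs, name) ? attribs[name] : null;
}

// Boolean attributes: presence is what matters, the value is an empty string.
function hasAttr(el, name) {
  return getAttr(el, name) !== null;
}

// Hand-built stand-in for <input disabled name="q">
const input = {
  type: 'tag',
  name: 'input',
  attribs: { disabled: '', name: 'q' },
  children: [],
};

const missing = getAttr(input, 'placeholder');  // null, as jsdom would return
const disabledValue = getAttr(input, 'disabled'); // '' (falsy, but present)
const isDisabled = hasAttr(input, 'disabled');    // true
```

Note that a truthiness check such as if (getAttr(el, 'disabled')) would still be wrong for boolean attributes, which is why the presence check is kept separate.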
5 Text Content
// All text, recursively (equivalent to textContent)
el.textContent
// Rendered text — layout-aware, skips hidden elements
// (browser only, not really useful in jsdom either)
el.innerText
// Text node's raw string value
textNode.nodeValue // or textNode.data or textNode.textContent
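// All text, recursively; the closest match for textContent
// (getText also emits "\n" for <br>; recent domutils versions
// additionally export textContent(node))
domutils.getText(node)
// There is no layout-aware innerText equivalent. That's a browser concept.
// domutils.getText is what you want.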
// Text node's raw string value
textNode.data
Working with text nodes directly
// Get only direct text node children
// NodeList has no .filter() — spread it first
[...el.childNodes]
.filter(n => n.nodeType === Node.TEXT_NODE)
.map(n => n.textContent)
.join('')
// Get only direct text node children
el.children
.filter(n => n.type === 'text')
.map(n => n.data)
.join('')
Whitespace between tags is parsed as text nodes too. Use .filter(n => n.type === 'text' && n.data.trim()) if you only want meaningful text.
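The direct-text pattern above can be wrapped in a small helper. This is a sketch, assuming you want whitespace-only nodes dropped and each remaining fragment trimmed; ownText is a hypothetical name and the element is hand-built in the htmlparser2 node shape.

```javascript
// Hypothetical helper: join the direct text children of a node,
// skipping whitespace-only text nodes and all nested elements.
function ownText(el) {
  return (el.children || [])
    .filter(n => n.type === 'text' && n.data.trim() !== '')
    .map(n => n.data.trim())
    .join(' ');
}

// Hand-built stand-in for a <p> containing "Price: ", a <b>42</b>, and " EUR"
const p = {
  type: 'tag',
  name: 'p',
  attribs: {},
  children: [
    { type: 'text', data: '\n  Price: ' },
    { type: 'tag', name: 'b', attribs: {}, children: [{ type: 'text', data: '42' }] },
    { type: 'text', data: ' EUR\n' },
  ],
};

const direct = ownText(p); // 'Price: EUR' (the nested <b> text is not included)
```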
7 Type Checking
jsdom uses numeric nodeType constants. htmlparser2 uses a type string on each node.
node.nodeType === Node.ELEMENT_NODE // 1
node.nodeType === Node.TEXT_NODE // 3
node.nodeType === Node.COMMENT_NODE // 8
node.nodeType === Node.DOCUMENT_NODE // 9
node instanceof Element // is an element
node instanceof Text // is a text node
domutils.isTag(node) // type 'tag' | 'script' | 'style'
domutils.isText(node) // type === 'text'
domutils.isComment(node) // type === 'comment'
domutils.isDocument(node) // type === 'root'
// Or check the string directly:
node.type === 'tag'
node.type === 'text'
node.type === 'comment'
node.type === 'root'
domutils.isTag(node) returns true for type === 'tag', 'script', and 'style'. This mirrors browser behavior where <script> is an Element. If you specifically need only regular tags, check node.type === 'tag'.
Node type strings at a glance
| HTML | node.type | domutils check |
|---|---|---|
| <div>, <p>, etc. | 'tag' | isTag(n) |
| <script> | 'script' | isTag(n) |
| <style> | 'style' | isTag(n) |
| text between tags | 'text' | isText(n) |
| <!-- comment --> | 'comment' | isComment(n) |
| root document | 'root' | isDocument(n) |
| <!DOCTYPE html> | 'directive' | (none) |
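When existing code switches on numeric nodeType values, a lookup from the type strings in the table to the DOM numbers can bridge the gap during migration. A minimal sketch; nodeTypeOf and the map name are hypothetical, and types without an entry in the table above (such as 'directive') deliberately map to null rather than a guessed number.

```javascript
// Hypothetical lookup from htmlparser2 type strings to DOM nodeType numbers.
const NODE_TYPE_BY_TYPE = new Map([
  ['tag', 1], ['script', 1], ['style', 1], // ELEMENT_NODE
  ['text', 3],                             // TEXT_NODE
  ['comment', 8],                          // COMMENT_NODE
  ['root', 9],                             // DOCUMENT_NODE
]);

// Returns the DOM-style number, or null for types this sketch does not map.
function nodeTypeOf(node) {
  return NODE_TYPE_BY_TYPE.get(node.type) ?? null;
}

const scriptType = nodeTypeOf({ type: 'script' });      // 1
const directiveType = nodeTypeOf({ type: 'directive' }); // null
```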
8 Serialization (innerHTML / outerHTML)
el.innerHTML // content inside the element
el.outerHTML // element itself + its content
domutils.getInnerHTML(node) // content inside
domutils.getOuterHTML(node) // node + content
// Alternative: dom-serializer (more control)
import render from 'dom-serializer';
render(node.children) // inner HTML
render(node) // outer HTML
domutils.getInnerHTML and getOuterHTML use dom-serializer internally. Install it directly (npm i dom-serializer) only if you need options like decodeEntities: false or xmlMode: true.
9 Class Manipulation
There is no classList in htmlparser2. Classes are just a space-separated string in node.attribs.class.
el.classList.contains('active')
el.classList.add('active')
el.classList.remove('active')
el.classList.toggle('active')
el.classList.replace('old', 'new')
const classes = () =>
(el.attribs.class || '').split(/\s+/).filter(Boolean);
// contains
classes().includes('active')
// or with css-select (no mutation):
is(el, '.active') // import { is } from 'css-select'
// add
if (!classes().includes('active'))
el.attribs.class = [...classes(), 'active'].join(' ');
// remove
el.attribs.class = classes().filter(c => c !== 'active').join(' ');
// toggle
el.attribs.class = classes().includes('active')
? classes().filter(c => c !== 'active').join(' ')
: [...classes(), 'active'].join(' ');
// replace
el.attribs.class = classes().map(c => c === 'old' ? 'new' : c).join(' ');
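The split/join snippets above can be folded into reusable functions. The sketch below shows one possible toggle that returns a boolean the way classList.toggle does; classNames and toggleClass are hypothetical names, the element is hand-built, and removing the class attribute when it becomes empty is a design choice of this sketch rather than DOM behaviour.

```javascript
// Hypothetical helper: current class names as an array.
function classNames(el) {
  return (el.attribs.class || '').split(/\s+/).filter(Boolean);
}

// Like classList.toggle: returns true if the class is present afterwards.
// The optional `force` argument pins the outcome (true = add, false = remove).
function toggleClass(el, name, force) {
  const current = classNames(el);
  const has = current.includes(name);
  const shouldHave = force === undefined ? !has : Boolean(force);
  let next = current;
  if (shouldHave && !has) next = [...current, name];
  if (!shouldHave && has) next = current.filter(c => c !== name);
  if (next.length > 0) el.attribs.class = next.join(' ');
  else delete el.attribs.class; // sketch-specific choice: avoid class=""
  return shouldHave;
}

// Hand-built stand-in for <div class="card  active">
const card = { type: 'tag', name: 'div', attribs: { class: 'card  active' }, children: [] };

const afterFirst = toggleClass(card, 'active');  // false, class is now 'card'
const afterSecond = toggleClass(card, 'active'); // true, class is now 'card active'
```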
10 DOM Mutation
You can mutate the tree by editing the plain objects directly, or use domutils helpers for structural changes.
// Remove a node
el.parentNode.removeChild(el)
// or: el.remove()
// Replace a node
el.parentNode.replaceChild(newNode, el)
// Insert before/after
parent.insertBefore(newNode, refNode)
// Create nodes
document.createElement('div')
document.createTextNode('hello')
// Remove a node (updates parent.children + sibling links)
domutils.removeElement(el)
// Replace a node
domutils.replaceElement(el, newNode)
// Insert as sibling
domutils.prepend(refNode, newNode) // insert newNode before refNode
domutils.append(refNode, newNode) // insert newNode after refNode
// Create nodes — use domhandler constructors:
import { Element, Text } from 'domhandler';
const div = new Element('div', { class: 'box' }, []);
const txt = new Text('hello');
The domutils helpers (removeElement, replaceElement, append, prepend) update the parent, next, and prev references for you. If you push directly into node.children without using these helpers, those references go stale and queries/traversal will break.
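To see what keeping those references consistent involves, here is a hand-written sketch of appending a child to a parent. It is illustrative only: appendChildLinked and makeNode are hypothetical names, the nodes are hand-built plain objects, and in real code the domutils helpers should do this bookkeeping for you.

```javascript
// Hypothetical sketch: append `child` as the last child of `parent`
// and keep parent/prev/next consistent on every node involved.
function appendChildLinked(parent, child) {
  // Detach from a previous position first, if any.
  if (child.parent) {
    const siblings = child.parent.children;
    const i = siblings.indexOf(child);
    if (i !== -1) siblings.splice(i, 1);
    if (child.prev) child.prev.next = child.next;
    if (child.next) child.next.prev = child.prev;
  }
  const last = parent.children[parent.children.length - 1] || null;
  child.parent = parent;
  child.prev = last;
  child.next = null;
  if (last) last.next = child;
  parent.children.push(child);
}

// Hand-built nodes in the htmlparser2 shape.
const makeNode = name => ({
  type: 'tag', name, attribs: {}, children: [],
  parent: null, prev: null, next: null,
});

const ul = makeNode('ul');
const first = makeNode('li');
const second = makeNode('li');
appendChildLinked(ul, first);
appendChildLinked(ul, second);
```

Four fields on three different nodes change for a single append, which is why pushing into children by hand is easy to get wrong.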
Append a child
parent.appendChild(child);
// domhandler constructors do not set parent/prev/next on the children
// you pass in, so attach children with domutils.appendChild instead:
import { Element } from 'domhandler';
const ul = new Element('ul', {});
for (const li of [li1, li2, li3]) domutils.appendChild(ul, li);
// Appending to an existing parent works the same way
// (domutils.append() would add a sibling, not a child):
domutils.appendChild(parent, child);
11 The Adapter Pattern
You're considering wrapping htmlparser2 in a jsdom-compatible interface so your business logic doesn't change. Here's a realistic view of what that looks like.
// adapter.js
import { parseDocument } from 'htmlparser2';
import * as domutils from 'domutils';
import { selectOne, selectAll, is } from 'css-select';
class NodeAdapter {
constructor(node) { this._n = node; }
// ── Querying ──
querySelector(sel) { const n = selectOne(sel, this._n); return n ? new NodeAdapter(n) : null; }
querySelectorAll(sel) { return selectAll(sel, this._n).map(n => new NodeAdapter(n)); }
matches(sel) { return is(this._n, sel); }
closest(sel) {
let n = this._n;
while (n) { if (domutils.isTag(n) && is(n, sel)) return new NodeAdapter(n); n = n.parent; }
return null;
}
// ── Attributes ──
getAttribute(name) { return this._n.attribs?.[name] ?? null; }
setAttribute(name, value) { if (this._n.attribs) this._n.attribs[name] = value; }
hasAttribute(name) { return domutils.hasAttrib(this._n, name); }
removeAttribute(name) { delete this._n.attribs?.[name]; }
// ── Content ──
get textContent() { return domutils.getText(this._n); }
get innerHTML() { return domutils.getInnerHTML(this._n); }
get outerHTML() { return domutils.getOuterHTML(this._n); }
// ── Identity ──
get tagName() { return this._n.name?.toUpperCase() ?? ''; }
get id() { return this._n.attribs?.id ?? ''; }
get className() { return this._n.attribs?.class ?? ''; }
// ── Navigation ──
get parentElement() {
const p = this._n.parent;
return p && domutils.isTag(p) ? new NodeAdapter(p) : null;
}
get children() {
return (this._n.children || []).filter(domutils.isTag).map(n => new NodeAdapter(n));
}
get childNodes() {
return (this._n.children || []).map(n => new NodeAdapter(n));
}
get nextElementSibling() {
let n = this._n.next;
while (n && !domutils.isTag(n)) n = n.next;
return n ? new NodeAdapter(n) : null;
}
get previousElementSibling() {
let n = this._n.prev;
while (n && !domutils.isTag(n)) n = n.prev;
return n ? new NodeAdapter(n) : null;
}
// ── classList shim ──
get classList() {
const el = this._n;
const get = () => (el.attribs?.class || '').split(/\s+/).filter(Boolean);
return {
contains: c => get().includes(c),
add: c => { if (!get().includes(c)) el.attribs.class = [...get(), c].join(' '); },
remove: c => { el.attribs.class = get().filter(x => x !== c).join(' '); },
toggle: c => get().includes(c)
? (el.attribs.class = get().filter(x => x !== c).join(' '))
: (el.attribs.class = [...get(), c].join(' ')),
};
}
// ── Unwrap to raw node if needed ──
get raw() { return this._n; }
}
function parseHTML(html) {
const doc = parseDocument(html);
return new NodeAdapter(doc);
}
export { parseHTML, NodeAdapter };
When the adapter is the right choice
- Your business logic is large and heavily uses the DOM API — rewriting everything at once is risky.
- You want to swap the underlying parser without touching tested business logic.
- You can ship the adapter, migrate gradually, then decide whether to remove it later.
When to skip the adapter and migrate directly
- Your DOM usage is limited and already mapped (querying + attribute reads + text extraction).
- The adapter is papering over gaps — every missing method you add is a maintenance burden.
- You want the full performance benefit of htmlparser2 (the adapter adds object wrapping overhead).
13 XML Mode
htmlparser2 ships a built-in XML mode that changes how the parser interprets the document. jsdom supports the same via the contentType constructor option — both give you case-sensitive tag names, self-closing tags, and no implicit HTML structure injection.
import { JSDOM } from 'jsdom';
const dom = new JSDOM(xmlString, {
contentType: 'application/xml',
});
const document = dom.window.document;
// Tag names are case-sensitive and preserved as-is.
// Malformed XML produces a parseerror document.
import { parseDocument } from 'htmlparser2';
const doc = parseDocument(xmlString, { xmlMode: true });
// xmlMode changes three things:
// 1. Tag names are preserved as-is (no lowercasing)
// 2. Self-closing tags (<br/>, <MyTag/>) are honoured
// 3. No HTML-specific implicit closing (an open <p> or <li> is not closed by a following tag)
Key behavioural differences
| Behaviour | HTML mode (default) | xmlMode: true |
|---|---|---|
| Tag name casing | lowercased: el.name === 'div' | preserved: el.name === 'MyTag' |
| Self-closing tags | only void elements (<br>, <img>…) | any tag: <Foo/> has no children |
| Implicit structure | nothing is injected (unlike jsdom, htmlparser2 never adds <html>/<head>/<body>); some open tags such as <p> or <li> are closed implicitly | no implicit closing; the tree mirrors the input |
| Error recovery | lenient; tolerates malformed markup, though without spec-exact HTML5 recovery | best-effort; parser does not throw on malformed XML |
| CDATA sections | treated as comments | parsed as CDATA nodes (type: 'cdata') |
| Namespaces | ignored | preserved in el.name (e.g. 'svg:path') |
Querying XML documents
css-select and domutils accept the same parsed root in XML mode; pass it to selectOne / selectAll as before. The one difference is that css-select lowercases tag names in selectors unless you also pass { xmlMode: true } in its options argument, so case-sensitive matching has to be switched on for the query as well as for the parser.
import { parseDocument } from 'htmlparser2';
import { selectAll } from 'css-select';
const xml = `<Library>
<Book genre="fiction"><Title>Dune</Title></Book>
<Book genre="non-fiction"><Title>Sapiens</Title></Book>
</Library>`;
const doc = parseDocument(xml, { xmlMode: true });
// Tell css-select to keep selector tag names case-sensitive too
const opts = { xmlMode: true };
const books = selectAll('Book', doc, opts); // ✓ matches <Book>
const titles = selectAll('Book > Title', doc, opts); // ✓
const wrong = selectAll('book', doc, opts); // ✗ no match (lowercase)
// Attribute selectors work as usual
const fiction = selectAll('Book[genre="fiction"]', doc, opts);
Pass xmlMode: true when parsing XML-based formats so that self-closing tags and namespace prefixes are handled correctly.
Reach for a dedicated XML parser (e.g. fast-xml-parser, sax) if strict validation or namespace resolution is required.
14 Cheat Sheet
| jsdom | htmlparser2 equivalent |
|---|---|
| SETUP | |
| new JSDOM(html).window.document | parseDocument(html) |
| QUERYING | |
| document.querySelector(sel) | selectOne(sel, document) |
| document.querySelectorAll(sel) | selectAll(sel, document) → plain Array |
| document.getElementById('x') | selectOne('#x', document) |
| document.getElementsByClassName('x') | selectAll('.x', document) |
| document.getElementsByTagName('p') | selectAll('p', document) |
| el.matches(sel) | is(el, sel) from css-select |
| el.closest(sel) | manual walk up via el.parent |
| (none) | domutils.findOne(fn, nodes): predicate query |
| (none) | compile(sel): reusable selector |
| ATTRIBUTES | |
| el.getAttribute('x') → null if missing | el.attribs['x'] → undefined if missing |
| el.setAttribute('x', v) | el.attribs['x'] = v |
| el.hasAttribute('x') | domutils.hasAttrib(el, 'x') |
| el.removeAttribute('x') | delete el.attribs['x'] |
| el.id | el.attribs['id'] |
| el.className | el.attribs['class'] |
| TEXT CONTENT | |
| el.textContent | domutils.getText(el) |
| textNode.data | textNode.data ← same! |
| NAVIGATION | |
| el.tagName (uppercase) | el.name (lowercase) |
| el.parentElement | el.parent (check domutils.isTag) |
| el.children (elements only) | el.children.filter(domutils.isTag) |
| el.childNodes (all nodes) | el.children |
| el.nextElementSibling | walk el.next until isTag |
| el.previousElementSibling | walk el.prev until isTag |
| el.nextSibling | el.next |
| el.previousSibling | el.prev |
| TYPE CHECKING | |
| nodeType === 1 (element) | domutils.isTag(n) |
| nodeType === 3 (text) | domutils.isText(n) or n.type === 'text' |
| nodeType === 8 (comment) | domutils.isComment(n) |
| node instanceof Element | domutils.isTag(n) |
| SERIALIZATION | |
| el.innerHTML | domutils.getInnerHTML(el) |
| el.outerHTML | domutils.getOuterHTML(el) |
| CLASSES | |
| el.classList.contains('x') | is(el, '.x') or manual split |
| el.classList.add('x') | manual split/join on el.attribs.class |
| el.classList.remove('x') | manual split/filter/join |
| MUTATION | |
| parent.removeChild(el) | domutils.removeElement(el) |
| parent.replaceChild(newEl, el) | domutils.replaceElement(el, newEl) |
| parent.insertBefore(newEl, ref) | domutils.prepend(ref, newEl) |
| document.createElement('div') | new Element('div', {}, []) from domhandler |
| document.createTextNode('x') | new Text('x') from domhandler |
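The "manual walk" rows in the table can be written as small helpers. The sketch below uses a predicate instead of a selector string so that it runs without any packages; closest, nextElementSibling and isElement are hypothetical names, the fragment is hand-built, and with css-select installed the predicate could be n => is(n, sel).

```javascript
// Elements are the nodes typed 'tag', 'script' or 'style'.
const isElement = n => n.type === 'tag' || n.type === 'script' || n.type === 'style';

// Hypothetical closest(): nearest ancestor-or-self element passing `test`.
function closest(node, test) {
  for (let n = node; n; n = n.parent) {
    if (isElement(n) && test(n)) return n;
  }
  return null;
}

// Hypothetical nextElementSibling(): skip text and comment siblings.
function nextElementSibling(node) {
  let n = node.next;
  while (n && !isElement(n)) n = n.next;
  return n || null;
}

// Hand-built fragment: <section class="card"><h2></h2> <p></p></section>
const section = { type: 'tag', name: 'section', attribs: { class: 'card' }, parent: null, prev: null, next: null };
const h2 = { type: 'tag', name: 'h2', attribs: {}, children: [], parent: section, prev: null, next: null };
const gap = { type: 'text', data: ' ', parent: section, prev: h2, next: null };
const para = { type: 'tag', name: 'p', attribs: {}, children: [], parent: section, prev: gap, next: null };
h2.next = gap;
gap.next = para;
section.children = [h2, gap, para];

const cardAncestor = closest(para, n => n.attribs.class === 'card'); // the <section>
const afterHeading = nextElementSibling(h2);                        // the <p>, skipping the text node
```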