Itay Moav -Malimovka - 2 months ago
Node.js Question

How do I parse an HTML page with Node.js?

I need to parse (server side) a large number of HTML pages.

We all agree that regexp is not the way to go here.

It seems to me that JavaScript is the native way of parsing an HTML page, but that assumption relies on the server-side code having all the DOM abilities JavaScript has inside a browser.

Does Node.js have that ability built in?

Is there a better approach to this problem, parsing HTML on the server side?

kzh

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.
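On the "built in" part of the question: plain Node.js does not expose browser globals such as window or document, which is why a module is needed. A minimal sketch of that check follows; hasBrowserDom is an illustrative helper name, not part of Node.js or jsdom.

```javascript
// Hypothetical helper (not part of Node.js or jsdom): reports whether
// the browser DOM globals exist in the current environment.
function hasBrowserDom() {
    return typeof window !== 'undefined' && typeof document !== 'undefined';
}

// In a browser this logs true; in plain Node.js it logs false.
console.log('DOM globals available:', hasBrowserDom());
```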

Other options include:

  • BeautifulSoup for Python
  • Converting your HTML to XHTML and using XSLT
  • HTMLAgilityPack for .NET
  • CsQuery for .NET (my new favorite)
  • The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Of all these options, I prefer Node.js, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML to write XSLT is just plain sadistic.
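The reuse point can be sketched roughly as follows: if a function takes the document as a parameter instead of reading a global, the same code should run against a browser document or one produced by jsdom. The names countTables and fakeDocument are made up for this illustration, and fakeDocument is only a stand-in so the sketch runs without jsdom installed.

```javascript
// Hypothetical helper: works with any object that implements the
// standard querySelectorAll method (a browser document or a jsdom one).
function countTables(doc) {
    return doc.querySelectorAll('table').length;
}

// Stand-in document, used only so this sketch runs without a real DOM.
// It pretends the page contains two <table> elements.
var fakeDocument = {
    querySelectorAll: function (selector) {
        return selector === 'table' ? [{}, {}] : [];
    }
};

console.log('there are', countTables(fakeDocument), 'tables');
```

In a browser you would call countTables(document); on the server you would pass the document that jsdom hands back instead.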

Quick example for jsdom:

var jsdom = require("jsdom");

// jsdom.env is the callback-style API of older jsdom releases
jsdom.env({
    file: 'some file.html',
    done: function (err, window) {
        if (err) { throw err; }
        global.window = window;
        global.document = window.document;
        // now you can work on parsing HTML as you normally would in a browser
        // e.g. this will work
        showTables();
    }
});

function showTables() {
    var tables = document.querySelectorAll('table');
    console.log("there are ", tables.length, " tables");
}