A brief tour of Nokogiri decorators
Feb 18, 2016
9 minutes read

This is a quick dip into an interesting bit of code in a popular ruby library. I became interested after reading Mike Perham’s excellent Kill Your Dependencies essay. Go ahead and read it if you haven’t—I’ll be here when you get back. Mike described Rails’ dependency on Nokogiri as non-trivial, which got my attention. I was recently complaining that there’s not much “left to do” in ruby, so a clearly defined (albeit daunting) task like this is exciting.

Rails 4.2+ includes rails-html-sanitizer, which uses another library called Loofah to sanitize html fragments. Loofah, in turn, relies on Nokogiri to parse that html/xml. The challenge: To replace Loofah’s dependency on Nokogiri with a smaller, lighter package like Oga.

What’s wrong with Nokogiri? I’ve personally run into challenges installing Nokogiri, and this is fairly common — check out the troubleshooting section on the installing Nokogiri page. Even when it installs cleanly, it takes a long time to build (75-85 seconds on my macbook pro). This is a heavy dependency for rails. For comparison, Oga installs in 7-9 seconds on the same machine.

Usually when I see someone point out such an obvious gain for a high profile open source project, there’s an onslaught of potential solutions. I was surprised that issue flavorjones/loofah#100 was closed so quickly and with so little attention, so I decided to dig in. I don’t know that much about xml, Nokogiri, Oga, or Loofah, but it’s just ruby and seems like a good learning opportunity.

I started by replacing the obvious API entry points as described in Oga’s Migrating From Nokogiri page. I fairly quickly ran into a stumbling block: Loofah uses an extension mechanism that Nokogiri exposes as Nokogiri::XML::Document#decorators, which Oga does not provide. To figure out how best to translate this to Oga’s API, we need to understand exactly what Nokogiri’s decorators are doing.

Our starting point

At the core of Loofah’s extension to Nokogiri nodes and node sets is this little snippet where we’ll begin our exploration:

module DocumentDecorator # :nodoc:
  def initialize(*args, &block)
    super
    self.decorators(Nokogiri::XML::Node) << ScrubBehavior::Node
    self.decorators(Nokogiri::XML::NodeSet) << ScrubBehavior::NodeSet
  end
end

This gets included into a Loofah subclass of a Nokogiri Document, as follows:

class Loofah::XML::Document < Nokogiri::XML::Document
  include Loofah::ScrubBehavior::Node
  include Loofah::DocumentDecorator
end

So what’s a decorator?

Let me wikipedia that for you. Decorator Pattern:

In object-oriented programming, the decorator pattern (also known as Wrapper, an alternative naming shared with the Adapter pattern) is a design pattern that allows behavior to be added to an individual object, either statically or dynamically, without affecting the behavior of other objects from the same class. The decorator pattern is often useful for adhering to the Single Responsibility Principle, as it allows functionality to be divided between classes with unique areas of concern.

The decorator pattern can be used to extend (decorate) the functionality of a certain object statically, or in some cases at run-time, independently of other instances of the same class, provided some groundwork is done at design time. This is achieved by designing a new decorator class that wraps the original class.

So it seems like this is a pattern where we are “extending” an object by wrapping it through composition and delegation. Let’s take a quick look at how other rubyists have implemented this pattern.

HashRocket uses SimpleDelegator as a decorator. SimpleDelegator composes an instance with a container object that delegates to the instance in question.

Forwardable is another way to approach this, as Avdi Grimm covers in his discussion of decoration and extension. We’ll return to this discussion later.

The Draper gem also composes wrapper objects and collections for presentational purposes. The wrapper delegates through method_missing to the decorated object, which is still intact and unmodified.
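
To make the composition idea concrete, here is a minimal sketch of the first two approaches using only the standard library. Article, ShoutingArticle, and TeaserArticle are hypothetical names I made up for illustration; they don't come from HashRocket's or Avdi's posts.

```ruby
require 'delegate'
require 'forwardable'

# Hypothetical plain object that we want to decorate.
class Article
  attr_reader :title

  def initialize(title)
    @title = title
  end
end

# Composition via SimpleDelegator: any method the wrapper doesn't define
# falls through to the wrapped object.
class ShoutingArticle < SimpleDelegator
  def shout
    "#{title.upcase}!"
  end
end

# Composition via Forwardable: only the methods we list are delegated.
class TeaserArticle
  extend Forwardable
  def_delegators :@article, :title

  def initialize(article)
    @article = article
  end

  def teaser
    "Coming soon: #{title}"
  end
end

article = Article.new('Decorators')
loud    = ShoutingArticle.new(article)
teaser  = TeaserArticle.new(article)

puts loud.shout                  # DECORATORS!
puts loud.title                  # Decorators
puts teaser.teaser               # Coming soon: Decorators
puts article.respond_to?(:shout) # false
```

In both cases the new behavior lives on a separate wrapper object, and the original article never learns about it.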

Ok, gotcha, so it’s a composition pattern for adding behavior to instances

How does Nokogiri apply this pattern? Loofah’s Document calls self.decorators, which is defined on the superclass at lib/nokogiri/xml/document.rb:

def decorators key
  @decorators ||= Hash.new
  @decorators[key] ||= []
end

After Loofah calls those two lines in the DocumentDecorator initializer, the resulting @decorators object will look like this hash literal:

@decorators = {
  Nokogiri::XML::Node => [ ScrubBehavior::Node ],
  Nokogiri::XML::NodeSet => [ ScrubBehavior::NodeSet ]
}

This is fairly straightforward ruby, but where do these @decorators get used? (lib/nokogiri/xml/document.rb)

def decorate node
  return unless @decorators
  @decorators.each { |klass,list|
    next unless node.is_a?(klass)
    list.each { |moodule| node.extend(moodule) }
  }
end

Let’s take a look at what this does. We’re iterating over each key and value in that @decorators hash, and if the node is_a? Nokogiri::XML::Node (for example, with the @decorators above), we extend the node instance with each module in the list. In this case, an instance of Nokogiri::XML::Node (or a subclass) is extended with ScrubBehavior::Node, along with any other modules that some other library or Nokogiri itself has already appended to that list.
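
To check my reading of those two methods, here is a small sketch that copies their shape onto toy classes. ToyDocument, ToyNode, ToyNodeSet, and Scrubbable are stand-ins I invented so this runs without Nokogiri installed; only the bodies of decorators and decorate mirror the code quoted above.

```ruby
# Toy stand-ins, NOT Nokogiri's classes: just enough to exercise the two methods.
class ToyNode; end
class ToyNodeSet; end

module Scrubbable
  def scrub!
    :scrubbed
  end
end

class ToyDocument
  # Lazily builds a hash of class => list of modules.
  def decorators(key)
    @decorators ||= {}
    @decorators[key] ||= []
  end

  # Extends the given object with every module registered for its class.
  def decorate(node)
    return unless @decorators
    @decorators.each do |klass, list|
      next unless node.is_a?(klass)
      list.each { |mod| node.extend(mod) }
    end
  end
end

doc = ToyDocument.new
doc.decorators(ToyNode) << Scrubbable

node     = ToyNode.new
node_set = ToyNodeSet.new
doc.decorate(node)
doc.decorate(node_set)

puts node.respond_to?(:scrub!)        # true
puts node_set.respond_to?(:scrub!)    # false
puts ToyNode.new.respond_to?(:scrub!) # false
```

Only the instance that was passed through decorate picks up scrub!; a fresh ToyNode is untouched.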

So at this point I got very confused. Why are we using this elaborate inheritance pattern instead of doing this:

Nokogiri::XML::Node.send(:include, ScrubBehavior::Node)
Nokogiri::XML::NodeSet.send(:include, ScrubBehavior::NodeSet)

At this point I decided that I was in over my head and asked for help, both on twitter and on the pdxruby slack.


The next day, I took a fresh look at where Nokogiri’s decorator API is used. The first reference that seemed like it might be useful is in lib/nokogiri/xml/node.rb:

# class Nokogiri::XML::Node
def decorate!
  document.decorate(self)
end

That doesn’t seem like it’s called anywhere from ruby, but there are references to both decorate and decorate_bang in c (and also in java for the jruby implementation). Here’s ext/nokogiri/xml_node.c:286:

VALUE Nokogiri_wrap_xml_node(VALUE klass, xmlNodePtr node)
{
//...snipped for brevity
  if (node_has_a_document) {
    document = DOC_RUBY_OBJECT(doc);
    node_cache = DOC_NODE_CACHE(doc);
    rb_ary_push(node_cache, rb_node);
    rb_funcall(document, decorate, 1, rb_node);
  }
  return rb_node ;
}

and later in the file:

static VALUE reparent_node_with(VALUE pivot_obj, VALUE reparentee_obj, pivot_reparentee_func prf)
{
//...snipped for brevity
  rb_funcall(reparented_obj, decorate_bang, 0);
  return reparented_obj ;
}

Nokogiri_wrap_xml_node in particular is called all over the place when the c internals emit a node (or node set). Aha! Maybe we’ve found the call point for our decoration. Seeing it in this rb_funcall form helped me realize why Nokogiri wasn’t just include-ing the modules. Remember the definition of decorators — even if Nokogiri isn’t composing their decorators and instead directly mixing in modules, the purpose of decorators is to allow “behavior to be added to an individual object… without affecting the behavior of other objects from the same class.” What is the scope of these added behaviors in this case? The Nokogiri::XML::Document.

Decorators are scoped per document

Because the @decorators instance variable is per Document, the extensions that Loofah adds are not global to all Documents. In fact, Loofah defines several different subclasses of Document, each of which adds different decorators. So although the Nodes and NodeSets that these Documents emit are all of the same classes, the individual instances are extended with behavior that depends both on the root document and on the node/nodeset class.
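
Here is a rough sketch of that scoping, again with invented toy classes rather than Nokogiri or Loofah code. The emit_node method is my stand-in for the C code that wraps a node and calls decorate on its document; ScrubbingDocument and HighlightingDocument are hypothetical subclasses that register different modules in their initializers, the way Loofah's DocumentDecorator does.

```ruby
# Hypothetical names throughout; none of these are Nokogiri or Loofah classes.
class ToyNode; end

class ToyDocument
  def decorators(key)
    @decorators ||= {}
    @decorators[key] ||= []
  end

  # Stand-in for the parser internals: every emitted node gets decorated
  # with whatever this particular document has registered.
  def emit_node
    node = ToyNode.new
    (@decorators || {}).each do |klass, mods|
      mods.each { |mod| node.extend(mod) } if node.is_a?(klass)
    end
    node
  end
end

module Scrub
  def scrub!; :scrubbed; end
end

module Highlight
  def highlight; :highlighted; end
end

class ScrubbingDocument < ToyDocument
  def initialize
    super
    decorators(ToyNode) << Scrub
  end
end

class HighlightingDocument < ToyDocument
  def initialize
    super
    decorators(ToyNode) << Highlight
  end
end

scrub_node     = ScrubbingDocument.new.emit_node
highlight_node = HighlightingDocument.new.emit_node
plain_node     = ToyDocument.new.emit_node

puts scrub_node.class == highlight_node.class   # true
puts scrub_node.respond_to?(:scrub!)            # true
puts highlight_node.respond_to?(:scrub!)        # false
puts plain_node.singleton_class.include?(Scrub) # false
```

All three nodes are plain ToyNode instances, yet each one carries only the behavior its own document registered. A global include on ToyNode could not express that.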

Aside: Nokogiri decorators aren’t really following the decorator pattern

As Avdi discusses, object extension and decoration are different solutions to the problem of adding behavior to instances. The decorator design pattern is a pattern of composition, not conditional extension (which is a form of inheritance). Colloquially, they’re “decorating” the objects, in that the nodes are bedecked with their new behavior, so it’s close enough.
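
The difference is easy to see side by side. This is a small sketch with made-up names (ToyNode, Scrub, ScrubWrapper), not anything from Nokogiri: extension changes the object itself by mixing into its singleton class, while a composed decorator is a second object that holds the first.

```ruby
require 'delegate'

# Hypothetical node class and behavior module.
class ToyNode
  def name; 'pre'; end
end

module Scrub
  def scrub!; :scrubbed; end
end

# Composition: a separate wrapper object holds the node.
class ScrubWrapper < SimpleDelegator
  include Scrub
end

# Extension: the module is mixed into this one object's singleton class.
extended = ToyNode.new
extended.extend(Scrub)

original = ToyNode.new
wrapped  = ScrubWrapper.new(original)

puts extended.is_a?(Scrub)                   # true: the object itself changed
puts extended.singleton_class.include?(Scrub) # true
puts original.respond_to?(:scrub!)           # false: the wrapped node is untouched
puts wrapped.equal?(original)                # false: the wrapper is a different object
puts wrapped.__getobj__.equal?(original)     # true: the node is still inside
puts wrapped.name                            # pre
```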

Why does Nokogiri have this API? (a guess)

A long time ago, the go-to xml parser was something called Hpricot. When Nokogiri first came out, a lot of codebases transitioned from Hpricot to Nokogiri, so Nokogiri included an Hpricot compatibility mode. This mode was implemented as decorators on Nokogiri nodes. Nokogiri also currently includes a slop decorator, which implements method_missing on nodes in order to allow search, traversal, and attribute extraction through a chainable/fluent API. The Nokogiri docs explicitly advise against using this API. So we’ve got one historical reason and one included-but-ill-advised decorator, as well as Loofah and potentially other third party libraries.

Why might we want to extend nodes? So that we can support the following variety of API:

MyLibThatDependsOnNokogiri.parse('<pre><code>hello</code></pre>').
  css('pre').children.first.some_special_method

An XML parser emits a tree of objects that represent nodes (which might actually be Elements or DTDs, etc), attached to a root document. At the time of document definition, we might want to add behavior (like #some_special_method above) to some subset of those nodes emitted by the parser, without making a global monkeypatch to all nodes of a particular class. In order to support the above fluent API, we need to append behavior onto each node in the tree. This can be compared to jQuery wrapping every result with $(), which adds any behavior in $.fn to the result.
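
As a rough illustration of why the behavior has to be attached at emission time, here is a toy tree where every node handed back to the caller passes through its document first. All of the names here (ToyDocument, ToyNode, emit, Special) are invented for the sketch; a real parser does this wrapping in its internals.

```ruby
# Hypothetical toy tree, not a real parser.
class ToyNode
  attr_reader :name, :document

  def initialize(name, document, child_names = [])
    @name = name
    @document = document
    @child_names = child_names
  end

  # Every child handed back to the caller passes through the document first.
  def children
    @child_names.map { |child| document.emit(ToyNode.new(child, document)) }
  end
end

class ToyDocument
  def initialize(*mods)
    @mods = mods
  end

  # Attach this document's extra behavior to a node on its way out.
  def emit(node)
    @mods.each { |mod| node.extend(mod) }
    node
  end

  def root
    emit(ToyNode.new('pre', self, ['code']))
  end
end

module Special
  def some_special_method
    "special <#{name}>"
  end
end

doc = ToyDocument.new(Special)
puts doc.root.children.first.some_special_method # special <code>
```

Because children routes through emit, the chain keeps working however deep the traversal goes, and a document created without Special emits plain nodes.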

Maybe we can understand this as a special type of middleware

Sometimes when reading someone else’s code, it helps to think through how we might approach the same design challenge ourselves. Often, the reasons behind design decisions become apparent that way when they otherwise might not. My first take is that Nokogiri’s decorators resemble a middleware pattern, and in fact decorators could be implemented as one particular middleware. Here’s a sketch of how that might look:

class MyParser::Document
  def initialize(middlewares: nil) #for simplicity of demonstration
    @middlewares = middlewares
  end

  def apply_middlewares_to_node(node)
    if @middlewares
      @middlewares.inject(node) {|n, mw| mw.call(n) }
    else
      node
    end
  end
end

class ConditionalExtensionMiddleware
  def initialize(klass, mod) # `module` is a reserved word, so use `mod`
    @klass, @mod = klass, mod
  end

  def call(node)
    if node.is_a?(@klass)
      node.extend(@mod)
    else
      node
    end
  end
end

MyParser::Document.new(middlewares: [
  ConditionalExtensionMiddleware.new(MyParser::XML::Node, ScrubBehavior::Node),
  ConditionalExtensionMiddleware.new(MyParser::XML::NodeSet, ScrubBehavior::NodeSet)
])

This is a transformation of the extension pattern Nokogiri uses. The middleware solution potentially affords libraries and users more control over how the nodes are extended, which could address some issues with the decorate API, like this one in Loofah:

module ScrubBehavior
  module Node # :nodoc:
    def scrub!(scrubber)
      #
      #  yes. this should be three separate methods. but nokogiri
      #  decorates (or not) based on whether the module name has
      #  already been included. and since documents get decorated
      #  just like their constituent nodes, we need to jam all the
      #  logic into a single module.
      #
      scrubber = ScrubBehavior.resolve_scrubber(scrubber)
      case self
      when Nokogiri::XML::Document
        scrubber.traverse(root) if root
      when Nokogiri::XML::DocumentFragment
        children.scrub! scrubber
      else
        scrubber.traverse(self)
      end
      self
    end
  end
  # ...snipped for brevity
end

With a middleware pattern, instead of using a Nokogiri-provided “decorator” middleware, Loofah could provide its own middleware which gets call-ed for each emitted node on that document. As long as the middleware returns something that quacks like the node on which it was called, the middleware is free to extend, compose, or in some other way modify or replace the passed-in node.

require 'delegate' # SimpleDelegator lives in the 'delegate' library
class NodeDecoratorMiddleware < SimpleDelegator
  include ScrubBehavior::Node
  def self.call(node)
    if MyParser::XML::Node === node
      new(node)
    else
      node
    end
  end
end


MyParser::Document.new(middlewares: [ NodeDecoratorMiddleware ])
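
To convince myself that both flavors of middleware fit the same pipeline, here is a self-contained version of the sketches above that actually runs. The MyParser classes are hypothetical stand-ins for a parser that doesn't exist, Scrub is a placeholder for ScrubBehavior::Node, and ExtendingMiddleware and WrappingMiddleware are names I chose for this example.

```ruby
require 'delegate'

# Hypothetical stand-ins so the middleware idea can be exercised end to end.
module MyParser
  module XML
    class Node
      def name; 'node'; end
    end
  end

  class Document
    def initialize(middlewares: [])
      @middlewares = middlewares
    end

    # Thread the node through each middleware; each one returns the
    # object to pass along to the next.
    def apply_middlewares_to_node(node)
      @middlewares.inject(node) { |n, mw| mw.call(n) }
    end
  end
end

module Scrub
  def scrub!; :scrubbed; end
end

# Extends matching nodes in place and returns them.
class ExtendingMiddleware
  def initialize(klass, mod)
    @klass, @mod = klass, mod
  end

  def call(node)
    node.is_a?(@klass) ? node.extend(@mod) : node
  end
end

# Wraps matching nodes in a delegator instead of modifying them.
class WrappingMiddleware < SimpleDelegator
  include Scrub

  def self.call(node)
    MyParser::XML::Node === node ? new(node) : node
  end
end

extending = MyParser::Document.new(
  middlewares: [ExtendingMiddleware.new(MyParser::XML::Node, Scrub)]
)
wrapping = MyParser::Document.new(middlewares: [WrappingMiddleware])

a = extending.apply_middlewares_to_node(MyParser::XML::Node.new)
b = wrapping.apply_middlewares_to_node(MyParser::XML::Node.new)

puts a.scrub!                     # scrubbed
puts b.scrub!                     # scrubbed
puts b.name                       # node
puts MyParser::XML::Node === b    # false
```

That last line is a tradeoff worth knowing about: the wrapper quacks like a node, but it is not one as far as class checks are concerned, so any later middleware that matches on class would skip it.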

Wrapping Up

Hopefully you’ve enjoyed this exploration of how Loofah conditionally adds methods to Nokogiri nodes attached to a particular document tree. In the next installment, we’ll take a look at how we can implement this in Oga, which does not currently support a notion of decorators or middlewares.

Please let me know if you have any feedback or advice. I’m not enabling comments, but I’ll happily embed intact any replies on twitter or by email. Thanks for reading!


Many thanks to Jesse Cooke, Bernerd Schaefer, Mat Trudel, and Luc Perkins for kindly reading drafts of this and offering feedback.

