Applying FilterIterator to Directory Iteration

I'm currently doing research and prototyping for autoloading alternatives in Zend Framework 2.0. One approach I'm looking at involves creating explicit class/file maps; these tend to be much faster than using the include_path, but do require some additional setup.

My algorithm for generating the maps was absurdly simple:

  • Scan the filesystem for PHP files
  • If the file does not contain an interface, class, or abstract class, skip it.
  • If it does, get its declared namespace and classname

The question was what implementation approach to use.

I'm well aware of RecursiveDirectoryIterator, and planned to use that. However, I also had heard of FilterIterator, and wondered if I could tie that in somehow. In the end, I could, but the solution was non-obvious.

What I Thought I'd Be Able To Do

FilterIterator is an abstract class. When extending it, you must define an accept() method.

class FooFilter extends FilterIterator
{
    public function accept()
    {
    }
}

In that method, you typically will inspect whatever is returned by $this->current(), and then return a boolean true or false, depending on whether you want to keep it or not.

class FooFilter extends FilterIterator
{
    public function accept()
    {
        $item = $this->current();

        if ($someCriteriaIsMet) {
            return true;
        }

        return false;
    }
}

I'll go into the mechanics of my criteria later; what's important now is knowing that a FilterIterator allows you to limit the results returned by your iterator.

I originally thought I'd be able to simply pass a DirectoryIterator or RecursiveDirectoryIterator to my filtering instance. This worked in the former case, as it's only one level deep. However, for the latter, it would only return the first directory level for all classes that matched — i.e., if I ran it over Zend/Controller, I'd get a match for each class under Zend/Controller/Action/Helper/, but it would return simply Zend/Controller/Action as the match. This certainly wasn't useful.

I then discovered RecursiveFilterIterator, which looked like it would solve the recursion problem. However, I found one of two results occurred: either I'd receive an entire subtree if at least one item matched, or it would skip an entire subtree if the first item found failed the criteria. There was no middle ground.

The Solution

The solution was incredibly simple and elegant, once I stumbled upon it: pass my RecursiveIteratorIterator instance to the FilterIterator.

$rdi      = new RecursiveDirectoryIterator($somePath);
$rii      = new RecursiveIteratorIterator($rdi);
$filtered = new FooFilter($rii);

Really. It was that simple — but, as noted, non-obvious. It also required a slight change within my filter — instead of using current(), I'd need to first pull the "inner" iterator instance: $this->getInnerIterator()->current(). I show an example of that below when I go over the filter implementation.

As for my criteria, I had several options. I could require_once the file, and use the Reflection API to inspect the class to determine if it was an interface, abstract class, or class, as well as to determine the namespace. However, I couldn't be 100% sure the file would contain a class, so this seemed like overkill. That, and horribly non-performant, due to using reflection.

The next option was to simply slurp in the file contents into a variable, and use regular expressions. I love regular expressions, but in this case, it felt like I could possibly end up with some false positives. Also, since some of these files could be quite large, I was worried again about performance implications — I don't want to have to wait forever to generate these maps.

The solution I went with was to use the tokenizer to inspect the file. Tokenizing is incredibly fast, and it's also incredibly simple to analyze the tokens.

I decided to store the detected namespace and classnames as public properties of the SplFileInfo objects returned; this makes it simple to iterate over the entire collection and utilize that information. Also, because I have SplFileInfo objects, I already have the paths I need.

My implementation looks like this:

/** @namespace */
namespace Zend\File;

// import SPL classes/interfaces into local scope
use DirectoryIterator,
    FilterIterator,
    RecursiveIterator,
    RecursiveDirectoryIterator,
    RecursiveIteratorIterator;

/**
 * Locate files containing PHP classes, interfaces, or abstracts
 * 
 * @package    Zend_File
 * @license    New BSD {@link http://framework.zend.com/license/new-bsd}
 */
class ClassFileLocater extends FilterIterator
{
    /**
     * Create an instance of the locater iterator
     * 
     * Expects either a directory, or a DirectoryIterator (or its recursive variant) 
     * instance.
     * 
     * @param  string|DirectoryIterator $dirOrIterator 
     * @return void
     */
    public function __construct($dirOrIterator = '.')
    {
        if (is_string($dirOrIterator)) {
            if (!is_dir($dirOrIterator)) {
                throw new InvalidArgumentException('Expected a valid directory name');
            }

            $dirOrIterator = new RecursiveDirectoryIterator($dirOrIterator);
        }
        if (!$dirOrIterator instanceof DirectoryIterator) {
            throw new InvalidArgumentException('Expected a DirectoryIterator');
        }

        if ($dirOrIterator instanceof RecursiveIterator) {
            $iterator = new RecursiveIteratorIterator($dirOrIterator);
        } else {
            $iterator = $dirOrIterator;
        }

        parent::__construct($iterator);
        $this->rewind();
    }

    /**
     * Filter for files containing PHP classes, interfaces, or abstracts
     * 
     * @return bool
     */
    public function accept()
    {
        $file = $this->getInnerIterator()->current();

        // If we somehow have something other than an SplFileInfo object, just 
        // return false
        if (!$file instanceof \SplFileInfo) {
            return false;
        }

        // If we have a directory, it's not a file, so return false
        if (!$file->isFile()) {
            return false;
        }

        // If not a PHP file, skip
        if ($file->getBasename('.php') == $file->getBasename()) {
            return false;
        }

        $contents = file_get_contents($file->getRealPath());
        $tokens   = token_get_all($contents);
        $count    = count($tokens);
        $i        = 0;
        while ($i < $count) {
            $token = $tokens[$i];

            if (!is_array($token)) {
                // single character token found; skip
                $i++;
                continue;
            }

            list($id, $content, $line) = $token;

            switch ($id) {
                case T_NAMESPACE:
                    // Namespace found; grab it for later
                    $namespace = '';
                    $done      = false;
                    do {
                        ++$i;
                        $token = $tokens[$i];
                        if (is_string($token)) {
                            if (';' === $token) {
                                $done = true;
                            }
                            continue;
                        }
                        list($type, $content, $line) = $token;
                        switch ($type) {
                            case T_STRING:
                            case T_NS_SEPARATOR:
                                $namespace .= $content;
                                break;
                        }
                    } while (!$done && $i < $count);

                    // Set the namespace of this file in the object
                    $file->namespace = $namespace;
                    break;
                case T_ABSTRACT:
                case T_CLASS:
                case T_INTERFACE:
                    // Abstract class, class, or interface found

                    // Get the classname
                    $class = '';
                    do {
                        ++$i;
                        $token = $tokens[$i];
                        if (is_string($token)) {
                            continue;
                        }
                        list($type, $content, $line) = $token;
                        switch ($type) {
                            case T_STRING:
                                $class = $content;
                                break;
                        }
                    } while (empty($class) && $i < $count);

                    // If a classname was found, set it in the object, and 
                    // return boolean true (found)
                    if (!empty($class)) {
                        $file->classname = $class;
                        return true;
                    }
                    break;
                default:
                    break;
            }
            ++$i;
        }

        // No class-type tokens found; return false
        return false;
    }
}

Note: the Exceptions thrown in this class are defined in the same namespace; I'll leave how they're implemented to your imagination.

Iterating Faster

The next trick I discovered was in the form of iterator_apply(). Normally when I use iterators, I use foreach, because, well, that's what you do. But in looking through the various iterators for this exercise, I stumbled across this gem.

Basically, you pass the iterator, a callback, and argument(s) you want passed to the callback. Like FilterIterator, you don't get the actual item returned by the iterator, so in most use cases, you pass the iterator itself:

iterator_apply($it, $callback, array($it));

You can then grab the current value and/or key from the iterator itself:

public function process(Iterator $it)
{
    $value = $it->current();
    $key   = $it->key();
    // ...
}

While you can use any valid PHP callback, I found the most interesting solution was to use a closure, as it allows you to define everything up front:

iterator_apply($it, function() use ($it) {
    $value = $it->current();
    $key   = $it->key();
    // ...
});

If you pass in a local value via a use statement, you can do some aggregation:

$map = new \stdClass;
iterator_apply($it, function() use ($it, $map) {
    $file = $it->current();
    $namespace = !empty($file->namespace) ? $file->namespace . '\' : '';
    $classname = $namespace . $file->classname;
    $map->{$classname} = $file->getPathname();
});

Not only is this a nice, concise technique, it's also tremendously fast — I was finding it was 200%–300% faster than using a traditional foreach loop. Clearly it cannot be used in all situations, but if you can use it, you probably should.

So, start playing with FilterIterator and iterator_apply() if you haven't already — the two offer tremendous possibilities and capabilities for your applications.