No real preamble to be made here. Gearman is a distributed job queuing system by the fine folks who brought us memcached. It is nicer than anything else I've looked at. I am attempting to switch one of my projects over to it (replacing a crufty curl + unix sockets + memcached monstrosity that attempted to do the same job).

The documentation is lacking, but if the discussion group is any indication, real docs are a high priority for the project team. Today, I visited the IRC channel to ask for a status update on docs for the PHP extension api (as opposed to the PEAR all-script api, whose auto-generated docs are broken). Turns out my suspicions were right. Documentation is a high priority and none currently exists for the api in question. However... I was informed that the classes support reflection... so :)

A quick grep of the source for the extension tells me that I am looking at four classes: GearmanClient, GearmanWorker, GearmanJob, and GearmanTask. A ridiculously short php script later...

<?
Reflection::export( new ReflectionClass('GearmanWorker') );
Reflection::export( new ReflectionClass('GearmanClient') );
Reflection::export( new ReflectionClass('GearmanJob') );
Reflection::export( new ReflectionClass('GearmanTask') );
?>

And I can at least try to make a human readable list of available methods.

GearmanWorker

  • __construct()
  • clone()
  • error()
  • returnCode()
  • setOptions( $option, $data )
  • addServer( $host, $port ) - both args optional, examples say defaults are localhost on port 4730.
  • addFunction( $function_name, $function, $data, $timeout ) - data and timeout optional
  • work()

GearmanClient

  • __construct()
  • clone()
  • error()
  • setOptions( $option, $data )
  • addServer( $host, $port ) - reflection says both args are REQUIRED, but the provided examples and personal experience say otherwise
  • do( $function_name, $workload, $unique ) - unique is optional
  • doHigh( $function_name, $workload, $unique ) - unique is optional
  • doLow( $function_name, $workload, $unique ) - unique is optional
  • doJobHandle()
  • doStatus()
  • doBackground( $function_name, $workload, $unique ) - unique is optional
  • doHighBackground( $function_name, $workload, $unique ) - unique is optional
  • doLowBackground( $function_name, $workload, $unique ) - unique is optional
  • jobStatus( $job_handle )
  • echo( $workload )
  • addTask( $function_name, $workload, $data, $unique ) - data and unique are optional
  • addTaskHigh( $function_name, $workload, $data, $unique ) - data and unique are optional
  • addTaskLow( $function_name, $workload, $data, $unique ) - data and unique are optional
  • addTaskBackground( $function_name, $workload, $data, $unique ) - data and unique are optional
  • addTaskHighBackground( $function_name, $workload, $data, $unique ) - data and unique are optional
  • addTaskLowBackground( $function_name, $workload, $data, $unique ) - data and unique are optional
  • addTaskStatus( $job_handle, $data ) - data is optional
  • setWorkloadCallback( $callback )
  • setCreatedCallback( $callback)
  • setClientCallback( $callback)
  • setWarningCallback( $callback)
  • setStatusCallback( $callback)
  • setCompleteCallback( $callback)
  • setExceptionCallback( $callback)
  • setFailCallback( $callback)
  • clearCallbacks()
  • data()
  • setData( $data )
  • runTasks()

GearmanJob

  • __construct()
  • returnCode()
  • workload()
  • workloadSize()
  • warning( $warning )
  • status( $numerator, $denominator )
  • handle()
  • unique()
  • data( $data )
  • complete( $result )
  • exception( $exception )
  • fail()
  • functionName()
  • setReturn( $gearman_return_t )

GearmanTask

  • __construct()
  • returnCode()
  • create()
  • free()
  • function()
  • uuid()
  • jobHandle()
  • isKnown()
  • isRunning()
  • taskNumerator()
  • taskDenominator()
  • data()
  • dataSize()
  • takeData( $task_object ) - optional
  • sendData( $data )
  • recvData( $data_len )
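
To show how a few of these methods fit together, here is a minimal sketch of a worker and a client. It only uses calls from the lists above (addServer(), addFunction(), work(), do(), workload()). The function name 'reverse' and the helpers reverse_job(), run_worker() and run_client() are made up for illustration, and the whole thing assumes a gearmand server is listening on the default localhost:4730. I haven't verified return values beyond what reflection shows, so treat it as a starting point rather than gospel.

```php
<?php
// Hypothetical job callback: receives a GearmanJob and returns the result.
function reverse_job( $job ) {
    return strrev( $job->workload() );
}

// Hypothetical worker entry point: registers the callback and blocks, one job at a time.
function run_worker() {
    $worker = new GearmanWorker();
    $worker->addServer();                              // assumed default: localhost:4730
    $worker->addFunction( 'reverse', 'reverse_job' );
    while( $worker->work() );
}

// Hypothetical client entry point: submits one foreground job and waits for the result.
function run_client( $text ) {
    $client = new GearmanClient();
    $client->addServer();
    return $client->do( 'reverse', $text );
}
```

Call run_worker() in one process and run_client( 'Hello' ) in another. Neither function runs on its own, so the file can be included safely on a machine without the extension.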

The extension also appears to expose all constants defined in the C api.
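
If you'd rather not transcribe the export output by hand, the reflection classes can build the list for you. This is a rough sketch, not the script I actually used: describe_class() is a hypothetical helper, and the bullet format is just my guess at something readable. The constants part assumes the extension registers itself under the name 'gearman'.

```php
<?php
// Hypothetical helper: one line per method, noting which arguments are optional.
function describe_class( $class ) {
    $ref = new ReflectionClass( $class );
    $lines = array();
    foreach( $ref->getMethods() as $method ) {
        $args = array();
        $optional = array();
        foreach( $method->getParameters() as $param ) {
            $args[] = '$' . $param->getName();
            if( $param->isOptional() )
                $optional[] = $param->getName();
        }
        $line = $method->getName() . '(' . ( $args ? ' ' . implode( ', ', $args ) . ' ' : '' ) . ')';
        if( $optional )
            $line .= ' - optional: ' . implode( ', ', $optional );
        $lines[] = $line;
    }
    return $lines;
}

// Print the list for whichever of the four classes are actually loaded.
foreach( array( 'GearmanWorker', 'GearmanClient', 'GearmanJob', 'GearmanTask' ) as $class ) {
    if( !class_exists( $class ) )
        continue;
    echo "$class\n\n";
    foreach( describe_class( $class ) as $line )
        echo "  * $line\n";
    echo "\n";
}

// Assumption: the extension name is 'gearman'. Lists the constants it exposes.
if( extension_loaded( 'gearman' ) ) {
    $ext = new ReflectionExtension( 'gearman' );
    foreach( $ext->getConstants() as $name => $value )
        echo "$name = $value\n";
}
```

Keep in mind that reflection only reports what the extension declares, which (as with addServer() above) isn't always what it actually enforces.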

I have since added this to the official wiki - so there are at least SOME docs on the site now ;)

As of version 5.0, PHP has had the ability to dynamically include required classes as needed - without requiring the developer to manually include all possible dependencies beforehand. This means that in cases where your code execution never touches 39 of the 40 classes in the project, it loads, parses, and runs that much faster.

There is a performance hit for actually having to call the __autoload() method, but if you're in a situation where the hit for executing a few extra comparison calls is unacceptable... you probably aren't developing in PHP in the first place ;)

Almost all of the php I've written in the last 2-3 years uses autoloading, and it has probably saved me hundreds of hours of aggravation.

In most of my projects, the first line of any script or class usually looks something like this:

require_once "/var/www/common/lib.php";

Then lib.php usually reads something like this:

<?
function __autoload( $class ) {
    include_once( "$class.php" );
}
?>

And that is all that is strictly required to make the magic happen. It is fast, it is easy to understand, it is easy to use. You can use require_once() or include_once() and there is very little meaningful difference.
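
One caveat for anyone reading this on a newer PHP: __autoload() was deprecated in 7.2 and removed in 8.0. The same one-liner can be written with spl_autoload_register(), which has been around since 5.1.2 (the closure needs 5.3). A minimal sketch, keeping the same "class Frog lives in Frog.php" convention:

```php
<?php
// lib.php, sketched with spl_autoload_register() instead of __autoload()
spl_autoload_register( function( $class ) {
    // same convention as before: look for "$class.php" somewhere on the include path
    include_once( "$class.php" );
} );
```

Everything else in this post applies unchanged. Only the registration mechanism differs.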

I've looked around the net and found several other attempts at improving on this simple mechanism. But they invariably overcomplicate things. They attempt to recurse source directories, cache filename->class differences to the filesystem, and otherwise turn what should be a simple filesystem operation that the php environment supports natively into a mess of exception handling and wheel reinvention.

There are obviously theoretical instances where you might want to have more than the one require_once/include_once line... but I've honestly never encountered one myself.

I mean, you could try to throw an exception if the file didn't exist or otherwise failed to load... but nothing will happen. Failure to instantiate a nonexistent class is a fatal error in PHP, and will be handled as such with or without you - preempting any attempt at throwing an exception.

The only thing you can add is a bit of extra diagnostics or maybe logging to a separate location.

Assume that we have a file 'test.php':

<?
require_once "autoload.php";
$frog = new Frog();
?>

If autoload.php contains a simple autoload function that uses require_once(), and Frog.php doesn't exist anywhere in your include path, the results will look something like this:

ammon@kif:~$ php test.php

Warning: require_once(Frog.php): failed to open stream: No such file or directory in /home/ammon/autoload.php on line 3

Fatal error: require_once(): Failed opening required 'Frog.php' (include_path='.:/usr/share/php:/usr/share/pear') in /home/ammon/autoload.php on line 3

If we had used an include_once() call, the output is similar, but slightly more informative:

ammon@kif:~$ php test.php

Warning: include_once(Frog.php): failed to open stream: No such file or directory in /home/ammon/autoload.php on line 3

Warning: include_once(): Failed opening 'Frog.php' for inclusion (include_path='.:/usr/share/php:/usr/share/pear') in /home/ammon/autoload.php on line 3

Fatal error: Class 'Frog' not found in /home/ammon/test.php on line 4

So that's probably a bit more useful in tracking down the error. A failed require never returns at all - it raises a fatal error on the spot. A failed include, however, returns FALSE, while a successful include (or an include_once of a file that has already been loaded) returns a true value. So you can include_once() and write to a separate logfile (or to the output stream...) if you need more information than the fatal error already provides you.
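
A rough sketch of that logging idea, assuming you're happy with PHP's own error_log() as the destination. The function name autoload_with_log() is made up, and the @ is only there to hide the two include warnings, since the log entry replaces them. The fatal "Class not found" error still follows as usual.

```php
<?php
// Hypothetical autoloader: include the class file, log a diagnostic line if that fails.
function autoload_with_log( $class ) {
    $ok = @include_once "$class.php";
    if( $ok === false ) {
        error_log( "autoload: could not include $class.php (include_path=" . get_include_path() . ")" );
        return false;
    }
    return true;
}
spl_autoload_register( 'autoload_with_log' );
```

Point the error_log ini setting at whatever file you want these lines to land in.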

<rant>

To those who insist on giving your classes and their containing files different names... umm. Wow.

If I have a class called DatabaseConnection, I'm going to put it in a file called DatabaseConnection.php. If I'm working with strange people who somehow don't think that is explicit enough, I might call it DatabaseConnection.class.php and tweak the autoload method ever so slightly to compensate. There's no good reason to put it in a file called projx-database_connection.incl or something. No. There isn't.

If you want to organize your classes into a meaningful directory structure... good for you. Use PHP's built-in include_path ini option. Don't waste time trying to cascade down a directory structure searching for the classes - just make sure your includes are all in a set of reliable locations. You don't actually have to edit the php.ini file and bounce Apache or your php-cgi processes, just define the additional include paths in the same file where you define your autoloader:

set_include_path(
    get_include_path() . PATH_SEPARATOR .
    "/var/www/includes" . PATH_SEPARATOR .
    "/var/www/includes/apple" . PATH_SEPARATOR .
    "/var/www/includes/banana"
);

Naturally, you could turn that into some function calls to dynamically register and unregister directories, etc... but at that point, you're probably hurting yourself again. If your codebase is being reorganized enough to make maintenance of the list of include dirs onerous without full time intervention, something else has probably already gone very wrong. At best, the code probably doesn't work anyway, so any brief delay in updating the list can't hurt any more than whatever else is happening.

</rant>

But seriously. __autoload() is your friend. It will help clean up your code if you let it. It can help enforce naming conventions. It can even improve performance... so long as you refrain from using it to shoot yourself in the foot. ;)

So a fairly longstanding gripe of mine has been that PHP fails to execute registered signal handlers when it receives a signal in the middle of a blocking select call. Today, I finally bumped into a situation where I couldn't just change the spec to avoid the situation... and I've finally figured out how to make it work.

The bug has been reported here, where it was ignored for a few months before being shot down and ignored some more as per php dev team regulations.

The sample code given by the reporter of the bug is markedly similar to the situations in which I've encountered the problem:

pcntl_signal(SIGINT, "sig_handler");
$sock = socket_create_listen($port);
$read_socks = array($sock);
$n = NULL;
$foo = socket_select($read_socks, $n, $n, NULL);

By filling in his blanks, my first test case looks something like this:

<?
function sig_handler($signo) {
        echo "received sig #$signo\n";
}
pcntl_signal( SIGINT, "sig_handler" );

$socket = socket_create_listen( 1234 );
$r = array( $socket );
$n = NULL;
while( true ) {
        $foo = socket_select( $r, $n, $n, NULL );
        echo "select returned '$foo'\n";
}
?>

When executing the script and pressing ^C (which sends SIGINT), the following occurs:

ammon@morbo:~$ php sigtest.php
PHP Warning:  socket_select(): unable to select [4]: Interrupted system call in /home/ammon/sigtest.php on line 13
select returned ''

Ok, so the warning is to be expected, and we can easily squelch that.

The real problem is that the signal handler never runs.

However... for the first time in my life, a response to a php bug report proves enlightening. The dev who answered this ticket provides his own sample code and says he can't duplicate the bug. Comparing his code with the reporter's, only one difference stands out:

declare(ticks=1);

The declare(ticks) directive is deprecated as of php 5.3 and will not be with us in php 6.0. Ticks are an unreliable, unpredictable, and generally bad thing in php. I've neither successfully used them nor seen a successful and justified use.

That being said... turning the tick on but not telling it to do anything appears to address the problem of discarded interrupts:

<?
declare(ticks=1);

function sig_handler($signo) {
        echo "received sig #$signo\n";
}
pcntl_signal( SIGINT, "sig_handler" );

$socket = socket_create_listen( 1234 );
$r = array( $socket );
$n = NULL;
while( true ) {
        $foo = @socket_select( $r, $n, $n, NULL );
        echo "select returned '$foo'\n";
}
?>

And execution:

ammon@morbo:~$ php sigtest.php
received sig #2
select returned ''

Which is precisely the desired behavior.

I don't know what the performance hit for turning ticks on is; I haven't had time to research this. But I can confirm that by declaring ticks globally, it does work in an OO environment as well:

<?
declare(ticks=1);

class signal_tester {
    function __construct() {
        pcntl_signal( SIGINT, array(&$this,"sig_handler") );
        $this->start();
    }

    function sig_handler($signo) {
        echo "received sig #$signo\n";
    }

    function start() {
        $socket = socket_create_listen( 1234 );
        $r = array( $socket );
        $n = NULL;
        while( true ) {
            $foo = @socket_select( $r, $n, $n, NULL );
            echo "select returned '$foo'\n";
        }
    }
}

$test = new signal_tester();
?>

Executing and hitting ^C:

ammon@morbo:~$ php sigtest.php                                               
received sig #2
select returned ''

After a few minutes of largely unscientific testing, it appears that turning ticks on globally costs a whopping 4 bytes of ram and causes the script to occasionally consume more cpu than the top process I used to monitor it. So... at first glance the cost is pretty negligible and all I can say is that if you ever need to handle signals (SIGTERM, SIGHUP, etc...) from within a blocking select call in php, it looks like declare ticks is the only option for now.

I did the initial tests in 5.1.6, but can confirm the same behavior in 5.2.5. I don't know how the behavior is going to be in 5.3, since I don't run alpha releases on my servers, but my gut likes to think that it will continue to work the same for now... and will hopefully not break until 6.0 (when everything else will explode for a few years). Shrug.
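
For what it's worth, here is a sketch of how I'd expect a real service loop to use this: the handler just sets a flag, and the loop checks the flag every time select returns. It assumes the handler fires the way it did in the tests above. The names request_shutdown() and serve_until_signalled() are made up for illustration, and the "service" part simply closes each connection. Unlike my test cases, it rebuilds the arrays on every pass, because socket_select() overwrites them.

```php
<?php
declare(ticks=1);

$shutdown = false;

// Hypothetical handler: only records the request, the loop decides when to stop.
function request_shutdown( $signo ) {
    global $shutdown;
    $shutdown = true;
}

// Hypothetical service loop: blocks in select until a connection or a signal arrives.
function serve_until_signalled( $socket ) {
    global $shutdown;
    $passes = 0;
    while( !$shutdown ) {
        // socket_select() modifies its arguments, so rebuild them on every pass
        $r = array( $socket );
        $w = NULL;
        $e = NULL;
        $ready = @socket_select( $r, $w, $e, NULL );
        $passes++;
        if( $ready === false )
            continue; // interrupted by a signal; the loop condition re-checks $shutdown
        // placeholder for real work: accept the pending connection and drop it
        $conn = socket_accept( $socket );
        if( $conn )
            socket_close( $conn );
    }
    return $passes;
}

if( function_exists( 'pcntl_signal' ) ) {
    pcntl_signal( SIGTERM, "request_shutdown" );
    pcntl_signal( SIGINT, "request_shutdown" );
}
```

Start it with serve_until_signalled( socket_create_listen( 1234 ) ) and the script should exit its loop cleanly on ^C or a plain kill, instead of dying mid-request.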

I have a php script that frequently needs to email me the last few lines of a log file. I can't afford to exec() a binary tail process, so the solution has to be in pure php.

Originally, the files in question never exceeded a few thousand lines. Unfortunately, I am now encountering cases where the files are occasionally 50,000 lines or longer. This causes PHP's memory consumption to explode.

Note: Code snippets provided here are not fully functional standalone shell scripts. The scripts I ran to benchmark the algorithms contain some rudimentary setup logic that is not important here, so has not been included.

My original method:

// tail-file.php
$arr = @file( $fname, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
$arr = array_slice($arr, -$lines);
$buf = implode("\n",$arr);

This is easy to understand and is pretty fast, all things considered. Unfortunately, the memory footprint for loading a file into an array is obscene. Loading a 4400 line log file with this method could consume more than 17mb of ram. 50,000 line files easily stressed the 256mb limit I am able to provide the process.

So, the obvious solution to the memory consumption is to avoid loading the entire file at once. What if we kept a rotating list of lines in the file?

// tail-array.php
$arr = array_fill( 0, $lines, "" );

$fp = fopen($fname, "r");
// fgets() returns false at EOF, so test its result instead of feof()
while( ($line = fgets($fp, 4096)) !== false ) {
    $arr[] = $line; // faster than array_push()
    array_shift($arr);
}
fclose($fp);
$buf = implode("",$arr);

This method works by keeping the $lines-many most recent lines of the file in an array. Memory consumption remains sane, but the performance hit for performing so many array pushes and shifts is bad. Really bad. With small files, I can't notice any difference between this method and the file() method... but with longer files, it adds up quickly.

Given a 51 line, 4kb file, an average execution ($lines = 20) might look like this:

ammon@zapp:~$ time ./tail-file.php a.log >/dev/null

real    0m0.015s
user    0m0.009s
sys     0m0.007s

ammon@zapp:~$ time ./tail-array.php a.log >/dev/null

real    0m0.016s
user    0m0.010s
sys     0m0.006s

Comparable enough. But given a 50,004 line (3.3mb) log file:

ammon@zapp:~$ time ./tail-file.php b.log >/dev/null                  

real    0m0.079s
user    0m0.058s
sys     0m0.021s

ammon@zapp:~$ time ./tail-array.php b.log >/dev/null                 

real    0m0.119s
user    0m0.112s
sys     0m0.007s

The difference becomes quite clear. However... what if my log file grows obscenely large? I've got a 9 million line log file (1.6gb) lying around to test with...

ammon@zapp:~$ time ./tail-file.php c.log >/dev/null

real    0m0.015s
user    0m0.008s
sys     0m0.008s

ammon@zapp:~$ time ./tail-array.php c.log >/dev/null                 

real    0m19.351s
user    0m18.545s
sys     0m0.803s

The file() method crashes almost immediately (hence the 0.015s) because it can't allocate enough ram to hold a 9 million element array, and the array method takes almost 20 seconds to execute. It's slow... but at least it works.

Of course, there are other methods. The one I finally settled on is this:

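// tail-seek.php
$buf = "";
$fp = fopen($fname, "r");
if( $fp !== FALSE ) {
    fseek( $fp, 0, SEEK_END );
    $pos = $eof = ftell($fp);
    $lines_read = 0;
    // walk backward one byte at a time, counting newlines
    while( $pos > 0 ) {
        fseek($fp, --$pos);
        if( fgetc($fp) == "\n" && ++$lines_read > $lines ) {
            $pos++; // start just after the newline that precedes the tail
            break;
        }
    }
    // if we hit the start of the file first, $pos is 0 and we return everything
    fseek($fp, $pos);
    if( $eof > $pos )
        $buf = fread($fp, $eof - $pos);
    fclose($fp);
}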

This method doesn't waste time reading the bulk of the file. It jumps to the end and scans backward until enough newlines have been located. The only problem here is that your average filesystem isn't optimized for reading backwards... but since we're not really reading very much data, it doesn't much matter.

ammon@zapp:~$ time ./tail-seek.php a.log >/dev/null

real    0m0.017s
user    0m0.009s
sys     0m0.008s

ammon@zapp:~$ time ./tail-seek.php b.log >/dev/null                  

real    0m0.017s
user    0m0.008s
sys     0m0.010s

ammon@zapp:~$ time ./tail-seek.php c.log >/dev/null                  

real    0m0.023s
user    0m0.015s
sys     0m0.008s

Performance is a trifle slower on small files, but it's astronomically better on long ones. This is similar to the method used by most unix 'tail' commands, and is the clear winner for actual use in my application.

Of course, it needs a bit of cleanup from the state I've provided it in, and isn't appropriate for all environments... but it's a trifle better than requiring 20 seconds and 20gb of ram to execute ;)
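
As an example of the kind of cleanup I mean, here is a sketch that reads backward in blocks instead of one byte at a time, which cuts the number of seeks considerably. This is not the code I benchmarked above. The function name tail_file() and the 4096 byte default block size are arbitrary choices, and I haven't timed it against the other three.

```php
<?php
// Hypothetical helper: return the last $lines lines of $fname as one string,
// or false if the file can't be opened.
function tail_file( $fname, $lines, $chunk = 4096 ) {
    if( $lines <= 0 )
        return "";
    $fp = @fopen( $fname, "r" );
    if( $fp === false )
        return false;

    fseek( $fp, 0, SEEK_END );
    $pos = ftell( $fp );
    $buf = "";
    $newlines = 0;

    // pull blocks off the end of the file until enough newlines are buffered
    while( $pos > 0 && $newlines <= $lines ) {
        $len = min( $chunk, $pos );
        $pos -= $len;
        fseek( $fp, $pos );
        $block = fread( $fp, $len );
        $newlines += substr_count( $block, "\n" );
        $buf = $block . $buf;
    }
    fclose( $fp );

    // the buffer usually holds a bit too much; keep only the last $lines lines
    $trailing = ( substr( $buf, -1 ) === "\n" ) ? "\n" : "";
    if( $trailing !== "" )
        $buf = substr( $buf, 0, -1 );
    $parts = explode( "\n", $buf );
    return implode( "\n", array_slice( $parts, -$lines ) ) . $trailing;
}
```

Usage is just $buf = tail_file( $fname, 20 ). Memory use is bounded by the size of the tail plus one block, so the 1.6gb file shouldn't behave any differently from the 4kb one.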

Sorry it took me so long to post this, but WordPress 2.5 doesn't seem to like me trying to upload gz/zip files, so I had to upload the source manually.

Well, it's been months since I promised to post some usable socket policy service code, so I will.

The script here is meant to serve as a good starting point for people whose servers need to allow flash clients to make socket connections. I have not actually used this exact code in a production environment, but I have been using code that is 99% identical for a while now. I am confident that any blatant flaws are the result of simple copy-paste errors as I compiled the package. Please let me know if you find any.

I have however, stress tested the heck out of this service. One instance successfully served up over 16000 policy file requests fed into it as rapidly as I could send them. The same networking code has also handled requests from at least 100 different hosts at roughly the same time.

Everything has been combined into a single cli php script that requires no special installation. Just plop it down on the server and run it as root. It will take care of the rest. The config defaults should be safe, but you probably want to specify them more clearly - just to be safe.

The daemon is made of three classes:

  • Logger - A rudimentary log file management class that I copy from project to project in one form or another. The included version is stripped down from some of the other versions I've written, and I'm planning on releasing a more feature-rich version in the future.
  • Daemon - A simple class for daemonizing a process. Adapted and re-adapted countless times from an original php4 class I found on the net a few years ago by some guy named Seth (whose email domain no longer exists).
  • FlashPolicyService - The meat and potatoes, a child of Daemon. Mostly, this is just the requisite networking code and glue to make everything work together.

As with any of my other code, this is licensed under CC Attribution 3.0.

Download:

Source code after the jump.

The other day, I heard a few people talking about needing an easy way to browse images on a remote Apache server that has Indexes disabled.

They had a ~20 line php script that they were dropping into each directory in order to generate indices. The problem came when they started organizing the images into subdirectories. Eventually, it became necessary to copy the new script into a mind-bogglingly large number of directories. Inevitably, dirs were missed, etc...

I interjected that I could probably fix their problem in 30 minutes.

So I did.

<?
$base = getcwd();
// a missing ?dir= parameter means the top level
$subdir = isset($_GET['dir']) ? trim($_GET['dir']) : "";

// resolve the requested path and bail out if it doesn't start with $base
$dir = realpath("$base/$subdir");
if( !$dir || strpos($dir, $base) !== 0 )
    die();

$imgdir = dirname($_SERVER['SCRIPT_NAME']);

// user input and file names are escaped before they reach the page
$sub = htmlspecialchars($subdir, ENT_QUOTES);
echo "<h3>$sub</h3>\n";
$dirs = "";
$imgs = "<hr/>\n";

if( file_exists($dir) && is_dir($dir) ) {
    $dh = opendir($dir);
    while( false !== ($file = readdir($dh)) ) {
        if( $file == "." || $file == ".." || $file == ".svn" || substr($file,-4) == ".php" )
            continue;
        $name = htmlspecialchars($file, ENT_QUOTES);
        if( is_dir("$base/$subdir/$file") ) {
            $dirs .= "<span>|<a href='?dir=$sub/$name'>$name</a>|</span>\n";
        } else {
            $imgs .= "<div style='float:left; margin:15px;'><a href='$imgdir/$sub/$name'><img style='border: none;' src='$imgdir/$sub/$name'/></a></div>\n";
        }
    }
    closedir($dh);
}

echo $dirs;
echo $imgs;
?>

It's not elegant. It's not pretty. It has plenty of room for improvement - it'll generate links to Windows explorer thumbnail db's, etc... But it is fast and should be moderately secure. Just drop it in the root directory of your image structure and you're good.
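
If the thumbnail db links bother you, one possible improvement is to whitelist file extensions instead of blacklisting .php. A small sketch, with a made-up helper name and an extension list you'd want to adjust for your own images:

```php
<?php
// Hypothetical helper: true only for files whose extension is on the whitelist.
function is_image_file( $file, $extensions = array( 'jpg', 'jpeg', 'png', 'gif' ) ) {
    $ext = strtolower( pathinfo( $file, PATHINFO_EXTENSION ) );
    return in_array( $ext, $extensions, true );
}
```

In the loop, the else branch would only emit an img tag when is_image_file( $file ) returns true, which keeps Thumbs.db and friends out of the page.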