Archive for May, 2009

Phonetic search with DBIC

I recently ran across the probem that there are many people with almost identical last names and it is hard to guess every possible spelling until you finally find the person you were looking for.

In german, there are a lot of possible spellings for the name “Maier” which all sound the same: Meyer, Meier, Mayer, Mayr, Meyr etc.

A phonetic algorithm reduces a given word to a digest which is the same for all names, which sound similar. After asking for a proper name on #dbix-class I came up with the DBIx-Class-PhoneticSearch module.

For now it is only avaiable from github. But sometime soon it will be on the CPAN :-)

The usage is pretty easy:

    package MySchema::User;
 
    use base 'DBIx::Class';
 
    __PACKAGE__->load_components(qw(PhoneticSearch Core));
 
    __PACKAGE__->table('user');
 
    __PACKAGE__->add_columns(
      id       => { data_type => 'integer', auto_increment => 1, },
      surname  => { data_type => 'character varying', 
                    phonetic_search => 1 },
      forename => { data_type => 'character varying', 
                    phonetic_search => { algorithm => 'Koeln', 
                                         no_indices => 1 } },
 
    );
 
    __PACKAGE__->set_primary_key('id');
 
    __PACKAGE__->resultset_class('DBIx::Class::ResultSet::PhoneticSearch');

This defines a result class with a forename and surname column. Both are phonetic-enabled. forename uses the Koeln algorithm, which has been optimized for german names and words. Make sure you deploy() that schema again or add the two columns to your schema.

ALTER TABLE `user` ADD COLUMN `surname_phonetic_phonix` CHARACTER VARYING;
ALTER TABLE `user` ADD COLUMN `forename_phonetic_koeln` CHARACTER VARYING;

Now you can search for any user by a similar sounding name:

  $rs = $schema->resultset('User');
  $rs->create({ forename => 'John', surname => 'Night' });
 
  $rs->search_phonetic({ forename => 'Jon' })->first->forename;  # John
  $rs->search_phonetic({ surname => 'Knight' })->first->surname; # Night

The default algorithm is Phonix which is IMHO far superior to the popular Soundex algorithm. E. g. the last example (Knight -> Night) does not work with Soundex.

Currently I use plain SQL statements to create reports which, for instance, show me the number of logins of all users on a 5 minutes scale.

One of those statements looks similar to this:

INSERT INTO report_item 
    (
        SELECT '2009-05-23 02:30:00+0200'::timestamp WITH time zone + (s.a || ' minute')::interval AS timestamp, 
               COUNT(source."created_on") AS value
        FROM generate_series(0,595,5) AS s(a)
 
        LEFT JOIN 
          (SELECT "created_on" FROM "session") AS source
          ON source."created_on" >= '2009-05-23 02:30:00+0200'::timestamp WITH time zone + (s.a  || ' minute')::interval 
             AND source."created_on" < '2009-05-23 02:30:00+0200'::timestamp WITH time zone + (s.a  + 5 || ' minute')::interval
 
        GROUP BY a
    );

This gives me the number of logins (or new sessions) per 5 minute time slot. Now it’s pretty easy to read the data from report_item and create a graph or do other nice things with it.

I was thinking about a more generic way to do these reports so that the user can choose, what kind of report he wants.

So I started to create a result class Report with the following columns:

  • group_by
    The column which is used to group rows together. This is always a timestamp column, e.g. created_on
  • interval_type
    Group by minutes, hours, days, weeks etc. (defaults to minutes)
  • interval_length
    Group by that amount of interval_type (defaults to 5)
  • aggregate_by
    Which SQL function should be used to aggregate data in a time slot (defaults to COUNT)
  • aggregate
    Which column should be aggregated
  • query
    Which data should be aggregated? This accepts a DBIC result set or a plain SQL statement

After you created a report in that table you call $report->create_report on that row. This will execute the query above with the correct values filled in and the result will be stored in a table result_item, which has a foreign key report_id to the report.

It does a decent job so far and I was wondering if anyone else did something like this before?

If anyone is interested in the code I can make a dist and upload it to github. Since the SQL is Postgres only I don’t want to push it to the CPAN.

I was wondering how you design your base classes for DBIC result(set) classes. I think it’s a good idea both in terms of startup speed and ease of use to not use DBIx::Class as base of your result classes but use a custom one instead. The same applies for the resultset classes. Here is how I designed my DBIC classes:

Make sure you use load_namespacesin MySchema.pm!

Directory structure:

MySchema.pm
MySchema/
         Base/
              Result.pm
              ResultSet.pm
         Result/
                User.pm
         ResultSet/
                   User.pm

This has the advantage that the base classes in base are not being loaded. All classes in Result use MySchema::Base::Result as their base class. Classes in ResultSet use MySchema::Base::ResultSet respectively.

My current result base class looks like this:

package MySchema::Base::Result;
use strict;
use warnings;
use base 'DBIx::Class';
 
__PACKAGE__->load_components(qw(RandomColumns InflateColumn::FS TimeStamp Core));
 
__PACKAGE__->table('dummy');
 
__PACKAGE__->add_columns(
	id => {
      data_type => 'character',
      is_random => {size => 10, set => ['a'..'z','A'..'Z'], check => 1},
      size => 10,
    },
    created_on => {
        data_type => 'timestamp with time zone',
        set_on_create => 1,
        is_nullable => 1
    },
    updated_on => {
        data_type => 'timestamp with time zone',
        set_on_create => 1,
        set_on_update => 1,
        is_nullable => 1
    });
 
__PACKAGE__->set_primary_key('id');
 
 
sub sqlt_deploy_hook {
  my ($self, $sqlt_table) = @_;
  map { $sqlt_table->drop_constraint($_) if $_->type eq "FOREIGN KEY" } 
    $sqlt_table->get_constraints;
}
 
 
1;

So what is going on here?

First I use load_components to load all the components which are valuable to all of my result classes. Since I always use a random combination of characters for the primary key column id I load RandomColumns. InflateColumn::FS makes sure that any BLOB is stored on the file system rather than in the database. TimeStamp can set columns to the current time on either update or create. It also loads InflateColumn::DateTime which is used to inflate any date or time columns to a DateTime object.

You need to define a table name, otherwise DBIC fails to compile. This table will never be created or seen in any of your result classes as long as you overwrite it in each class. So just name it dummy or something and you are good.

Next I create a couple of columns which should be avaiable on all result classes. My primary key is always called id so create that column for all my classes. To use the RandomColumns component you have to add that is_random line to the column definition. This will create a random 10-byte long string. Since I’m a little bit paranoid about collision I set the check parameter so the component checks before inserting a new row if a column with that id is already there. Although this is very unlikely because there are 52^10 combinations, quite a lot…

The created_on and updated_on columns use the features of TimeStamp to set the time when the record was created or updated respectively. timestamp with time zone is a PostgreSQL specific column. You might need to the correct data type of your dbms.

I use SQL::Translator, DBIx::Class::Schema::Versioned and a few lines of code to deploy and update my database schema. Foreign key constraints on the database slow things down and DBIC handles them anyway so I decided to drop them from the SQL::Translator output. this is done in the sqlt_deploy_hook method.

My App GrowlsGrowl is a nice notification utility for Mac OS X. There are many applications or plugins avaiable which interact with Growl (e. g. Skype, Mail, VLC etc.).

I thought it would be nice if I could redirect the catalyst debug output to it as well. Log::Dispatch has a nice appender Log::Dispatch::MacGrowl. Load Log::Dispatch as plugin in MyApp.pm (after ConfigLoader):

use Catalyst qw/-Debug
                ConfigLoader
                Log::Dispatch
                Static::Simple/;

and configure your app accordingly:

<Log::Dispatch>
	class     = MacGrowl
	name      = growl
	app_name  = MyAppGrowls
	title     = MyApp
	sticky    = 0
	min_level = notice
	format    = [%p] %m %n
</Log::Dispatch>

I cannot recommend to set min_level to info or below because this will add all start up log statements to be rendered on your screen. Kind of annoying.

Put a log statement anywhere in your code and start up your dev server. A growl notification should appear when you hit the log statement.

You can get a sample application MyAppGrowls from GitHub.

Make sure you have the following modules installed (or run perl Makefile.PL):

  • Catalyst::Plugin::Log::Dispatch
  • Log::Dispatch::Config
  • Log::Dispatch::MacGrowl

Writing documentation is sometimes hard but inevitable. If you write code which is published on the CPAN you can always access the documentation through search.cpan.org. For those cases where you cannot publish your code on the CPAN it would be great if you had a web site like CPAN where you can access the documentation. For this reason I wrote Pod::Browser a while back. Pod::Browser is a browser for all your local modules and all the modules from the CPAN. The interface is rendered by ExtJS, a powerful JavaScript framework.

Pod::Browser

After installing Pod::Browser and running the server (pod_browser_server.pl) you should be able to access the browser via http://localhost:3000. On the right hand side of the page is a tree which contains all the modules installed on your system. The main page contains a search box where you can search the CPAN via it’s XML interface. If you open a module which is also installed on your system, the local pod will be served. Otherwise the pod is read from the CPAN. Each module has its own ExtJS tab. This allows to have multiple documents open at the same time in a single browser window. The table of contents is shown in the upper left panel.

Since this browser has been implemented as a single Catalyst controller, you can simply create an empty controller in your application and use Catalyst::Controller::POD as base class:

package MyApp::Controller::Docs;
 
  use strict;
  use warnings;
  use base 'Catalyst::Controller::POD';
  __PACKAGE__->config(
    inc        => 1,
    namespaces => [qw(Catalyst::Manual*)],
    self       => 1,
  );
  1;

This will make the documentation to your application avaiable from /docs. The pod tree will also contain all the modules which are listed in the namespaces config attribute (in this case Catalyst::Manual).

For more information on Pod::Browser and Catalyst::Controller::POD please read the documentation on the CPAN.

Warning:

If you set inc to 1 and leave the namespaces parameter blank it can take a while until all modules are indexed and presented in your browser. Be either patient or restrict the namespace to a sane subset of your modules.