Tag: dbic

Phonetic search with DBIC

I recently ran across the probem that there are many people with almost identical last names and it is hard to guess every possible spelling until you finally find the person you were looking for.

In german, there are a lot of possible spellings for the name “Maier” which all sound the same: Meyer, Meier, Mayer, Mayr, Meyr etc.

A phonetic algorithm reduces a given word to a digest which is the same for all names, which sound similar. After asking for a proper name on #dbix-class I came up with the DBIx-Class-PhoneticSearch module.

For now it is only avaiable from github. But sometime soon it will be on the CPAN :-)

The usage is pretty easy:

    package MySchema::User;
 
    use base 'DBIx::Class';
 
    __PACKAGE__->load_components(qw(PhoneticSearch Core));
 
    __PACKAGE__->table('user');
 
    __PACKAGE__->add_columns(
      id       => { data_type => 'integer', auto_increment => 1, },
      surname  => { data_type => 'character varying', 
                    phonetic_search => 1 },
      forename => { data_type => 'character varying', 
                    phonetic_search => { algorithm => 'Koeln', 
                                         no_indices => 1 } },
 
    );
 
    __PACKAGE__->set_primary_key('id');
 
    __PACKAGE__->resultset_class('DBIx::Class::ResultSet::PhoneticSearch');

This defines a result class with a forename and surname column. Both are phonetic-enabled. forename uses the Koeln algorithm, which has been optimized for german names and words. Make sure you deploy() that schema again or add the two columns to your schema.

ALTER TABLE `user` ADD COLUMN `surname_phonetic_phonix` CHARACTER VARYING;
ALTER TABLE `user` ADD COLUMN `forename_phonetic_koeln` CHARACTER VARYING;

Now you can search for any user by a similar sounding name:

  $rs = $schema->resultset('User');
  $rs->create({ forename => 'John', surname => 'Night' });
 
  $rs->search_phonetic({ forename => 'Jon' })->first->forename;  # John
  $rs->search_phonetic({ surname => 'Knight' })->first->surname; # Night

The default algorithm is Phonix which is IMHO far superior to the popular Soundex algorithm. E. g. the last example (Knight -> Night) does not work with Soundex.

Currently I use plain SQL statements to create reports which, for instance, show me the number of logins of all users on a 5 minutes scale.

One of those statements looks similar to this:

INSERT INTO report_item 
    (
        SELECT '2009-05-23 02:30:00+0200'::timestamp WITH time zone + (s.a || ' minute')::interval AS timestamp, 
               COUNT(source."created_on") AS value
        FROM generate_series(0,595,5) AS s(a)
 
        LEFT JOIN 
          (SELECT "created_on" FROM "session") AS source
          ON source."created_on" >= '2009-05-23 02:30:00+0200'::timestamp WITH time zone + (s.a  || ' minute')::interval 
             AND source."created_on" < '2009-05-23 02:30:00+0200'::timestamp WITH time zone + (s.a  + 5 || ' minute')::interval
 
        GROUP BY a
    );

This gives me the number of logins (or new sessions) per 5 minute time slot. Now it’s pretty easy to read the data from report_item and create a graph or do other nice things with it.

I was thinking about a more generic way to do these reports so that the user can choose, what kind of report he wants.

So I started to create a result class Report with the following columns:

  • group_by
    The column which is used to group rows together. This is always a timestamp column, e.g. created_on
  • interval_type
    Group by minutes, hours, days, weeks etc. (defaults to minutes)
  • interval_length
    Group by that amount of interval_type (defaults to 5)
  • aggregate_by
    Which SQL function should be used to aggregate data in a time slot (defaults to COUNT)
  • aggregate
    Which column should be aggregated
  • query
    Which data should be aggregated? This accepts a DBIC result set or a plain SQL statement

After you created a report in that table you call $report->create_report on that row. This will execute the query above with the correct values filled in and the result will be stored in a table result_item, which has a foreign key report_id to the report.

It does a decent job so far and I was wondering if anyone else did something like this before?

If anyone is interested in the code I can make a dist and upload it to github. Since the SQL is Postgres only I don’t want to push it to the CPAN.

I was wondering how you design your base classes for DBIC result(set) classes. I think it’s a good idea both in terms of startup speed and ease of use to not use DBIx::Class as base of your result classes but use a custom one instead. The same applies for the resultset classes. Here is how I designed my DBIC classes:

Make sure you use load_namespacesin MySchema.pm!

Directory structure:

MySchema.pm
MySchema/
         Base/
              Result.pm
              ResultSet.pm
         Result/
                User.pm
         ResultSet/
                   User.pm

This has the advantage that the base classes in base are not being loaded. All classes in Result use MySchema::Base::Result as their base class. Classes in ResultSet use MySchema::Base::ResultSet respectively.

My current result base class looks like this:

package MySchema::Base::Result;
use strict;
use warnings;
use base 'DBIx::Class';
 
__PACKAGE__->load_components(qw(RandomColumns InflateColumn::FS TimeStamp Core));
 
__PACKAGE__->table('dummy');
 
__PACKAGE__->add_columns(
	id => {
      data_type => 'character',
      is_random => {size => 10, set => ['a'..'z','A'..'Z'], check => 1},
      size => 10,
    },
    created_on => {
        data_type => 'timestamp with time zone',
        set_on_create => 1,
        is_nullable => 1
    },
    updated_on => {
        data_type => 'timestamp with time zone',
        set_on_create => 1,
        set_on_update => 1,
        is_nullable => 1
    });
 
__PACKAGE__->set_primary_key('id');
 
 
sub sqlt_deploy_hook {
  my ($self, $sqlt_table) = @_;
  map { $sqlt_table->drop_constraint($_) if $_->type eq "FOREIGN KEY" } 
    $sqlt_table->get_constraints;
}
 
 
1;

So what is going on here?

First I use load_components to load all the components which are valuable to all of my result classes. Since I always use a random combination of characters for the primary key column id I load RandomColumns. InflateColumn::FS makes sure that any BLOB is stored on the file system rather than in the database. TimeStamp can set columns to the current time on either update or create. It also loads InflateColumn::DateTime which is used to inflate any date or time columns to a DateTime object.

You need to define a table name, otherwise DBIC fails to compile. This table will never be created or seen in any of your result classes as long as you overwrite it in each class. So just name it dummy or something and you are good.

Next I create a couple of columns which should be avaiable on all result classes. My primary key is always called id so create that column for all my classes. To use the RandomColumns component you have to add that is_random line to the column definition. This will create a random 10-byte long string. Since I’m a little bit paranoid about collision I set the check parameter so the component checks before inserting a new row if a column with that id is already there. Although this is very unlikely because there are 52^10 combinations, quite a lot…

The created_on and updated_on columns use the features of TimeStamp to set the time when the record was created or updated respectively. timestamp with time zone is a PostgreSQL specific column. You might need to the correct data type of your dbms.

I use SQL::Translator, DBIx::Class::Schema::Versioned and a few lines of code to deploy and update my database schema. Foreign key constraints on the database slow things down and DBIC handles them anyway so I decided to drop them from the SQL::Translator output. this is done in the sqlt_deploy_hook method.