=head1 NAME DBIx::DataModel::Doc::Design - Architecture and design principles =head1 DOCUMENTATION CONTEXT This chapter is part of the C manual. =over =item * L =item * DESIGN =item * L =item * L =item * L =item * L =item * L =back This chapter covers the basic architecture of C, and the main motivating principles for proposing yet another ORM. Read it if you are currently evaluating whether C is suitable for your context, or if you want to globally understand how it works. Skip it and jump to the L chapter if you want to directly start using the framework. B : Since version 2.0, a number of important changes have been made in the API (classes renamed, arguments renamed or reorganized, etc. -- see L). If you need backwards compatibility, make sure to load the L<1.0 compatibility layer|DBIx::DataModel::Compatibility::V1>. =head1 GENERAL ARCHITECTURE =head2 Schema definitions centralized in one single module Table and association definitions for C are centralized in one single file; this is in contrast with most other Perl ORM frameworks, where each table has its own C<.pm> file, and association declarations are scattered among such files. Nevertheless, C does create a Perl package for each table; but such packages are directly added into the interpreter's symbol table, whenever a L declarations is met : hence there is no need for a specific file for each table. However, the client code can add its own methods into these packages, using Perl's L directive to switch into the appropriate namespace. =head2 Classes The following picture shows the main classes in C class hierarchy : FRAMEWORK CLASSES ***************** +=================+ | DBIx::DataModel | +=================+ +=================================+ +============================+ | DBIx::DataModel::ConnectedSource| | DBIx::DataModel::Statement | +=================================+ +============================+ +=========================+ +=========================+ | DBIx::DataModel::Schema | | DBIx::DataModel::Source | +==,======================+ +===,=================.===+ | / \ | / \ | +=====================^==+ +=================^=====+ | | DBIx::DataModel::Table | | DBIx::DataModel::Join | | +=================,======+ +====================,==+ | | / ====|=======================|============================/======= | APPLICATION CLASSES | / | ******************* | / | | / +===^======+ +==========^========+ / | MySchema | | MySchema::Table_n |==+ / +==========+ +==+=========.======+ |==+ / +==+=======\==.=====+ | / +========\==\==.====+ / \ \ \ / +=^==^==^======^========+ | MySchema::AutoJoin::* +==+ +==+====================+ |==+ +==+====================+ | +=======================+ The top half of the picture represents basic classes distributed with C. The bottom half represents application-specific subclasses, built through class definition methods (see below). Most objects created during the lifetime of the application will be instances of those application-specific subclasses. The entry class L is just a façade interface to L. Classes L and L implement short-lived objects which gather information for issuing requests to the database. Subclasses of L are created by methods L or L in C; in most cases only one such subclass will be needed, unless the application talks to several databases. Subclasses of L represent tables in the database and are created by methods L or L. Calls to L or L automatically add new methods into these table classes, so that from a row of one table it is possible to reach related rows in associated tables. Subclasses of L represent database queries that join several tables. They are created indirectly through calls to the L method. C subclasses use multiple inheritance : they inherit first from L, but also from tables involved in the database join. As a result, instances of such joins can invoke all methods of their parent tables. In addition, some other classes hold meta-information about the schema, its tables and associations : these will be discussed later. =head2 Instances A schema instance holds a connexion to a database. Most often there is only one schema instance (I); but if necessary the application can switch to to I with several schema instances. Data rows retrieved from the database are encapsulated as instances of the application-specific C and C subclasses. They possess methods for navigating through the associations in the database and retrieve related rows, and methods to modify the data. A SELECT request to the database is encapsulated as an instance of L. This instance has methods for preparing the SQL query, binding parameters to it, executing the query, and retrieving the resulting data rows. Statement instances are usually short-lived and confined to specific internal parts of the application, while data rows (instances of tables or joins) are usually transmitted to the presentation layers of the application, in order to use the data within reports, forms, etc. Data rows know from which source they were created, because they are blessed into corresponding table or join classses. They also know from which schema they were queried, either implicitly (in single-schema mode), or explicitly with a special C<__schema> field (in multi-schema mode); but they do not know from which statement they were queried. The following picture shows relationships between classes and instances : FRAMEWORK CLASSES +============================+ ***************** | DBIx::DataModel::Statement | +========================,===+ | ============================================================|===== APPLICATION CLASSES | ******************* | | +==========+ +===================+ | | MySchema | | MySchema::Table_n |=+ | +==========+ +==+================+ |=+ | | +==+===============+ | | | +===,=============+ | | | | | | +=====================+ | | | | MySchema::AutoJoin +=+ | | | +==+==================+ | | | | +=,==================+ | | | | | =====|=====================|=========|======================|===== | INSTANCES | | | | ========= | | | +--------+ +-----+ +-----+ next() +-----------+ | schema | | row | | row |<==<==<===| statement | +--------+ +-----+ +-----+ +-----------+ =head2 Meta-classes Each application subclass has a I, i.e. an instance of a class in the C namespace. This meta-object is accessible from the class through the L class method. Conversely, metaclasses have a L method to retrieve the application subclass to which they are bound. Metaclasses hold information about application classes, and implement some of their methods. In most cases, this is totally transparent to end users; however, users can interact with metaclasses to get some information about the available tables, associations, etc., or even to change their behaviour. The picture below shows relationships between application classes and the meta-objects to which they are related (classes start with capital letters, instances start with lowercase letters). FRAMEWORK CLASSES ***************** +==============+ +=============+ +============+ +============+ | Meta::Schema | | Meta::Table | | Meta::Join | | Meta::Path | +====,=========+ +==========,==+ +=====,======+ +,===========+ | | | | +===================+ | | | | | Meta::Association | | | | | +=======,===========+ ====|=======================|===========|==========|==========|== | APPLICATION CLASSES | | | | | AND META-OBJECTS | | | | | ******************* | | | | +----^------+ +========+ | | | +-------^----------+ |meta_schema|---|MySchema| | | | | meta_association | +---------x-+ +========+ | | | +----x-------------+ \_________________|___________|__________|_____ / \ | | | / +=================+ +-x---^------+ | +----^----x-+ | MySchema::Table |---| meta_table |----|-----| meta_path | +=================+ +----------x-+ | +x----------+ \ | / \ | / +=======================+ +-x---^---x-+ | MySchema::AutoJoin::* +---| meta_join | +=======================+ +-----------+ =head2 Polymorphic methods Some methods like C or C are heavily I, in the sense that they can be applied to various kinds of objects, with various kinds of arguments, and can return various kinds of results. Polymorphism in this way is not exactly common object-oriented practice, but it has been intentionally designed as such, in a attempt to "do the right thing" in different situations, while hiding inner details from users. This is similar in sprit to the the "do what I mean" (DWIM) principle of Perl design, where similar constructs may mean different things depending on the context. =head3 Polymorphic C The L method, when applied to a L, calls L to define or retrieve a join subclass; then it creates a fresh I related to that join and to the schema. From that object, data can then be retrieved through the L method. When applied to a L L creates a new statement to query one or several tables associated with the connected source. The new statement contains a condition to restrict the results according to the primary key of the initial source. That statement cannot be executed yet, because the values of the primary key are not known until we have an instance of the initial source; but the statement can already be Ld. Later on, we can L the statement to an instance of the initial source, and then execute it. When applied to a I (to an instance of a Table or Join subclass), L is an instance method that works like the class method above (i.e. it creates a statement), but in addition, values of the current object are immediately bound to the appropriated placeholders within the query. This is illustrated below in the L
. So in all of these cases, C is meant to produce a statement from which one can later C method specified which columns to retrieve. =head3 Stepwise parameter binding through named placeholders C objects have their own mechanism of placeholders and parameter binding. Of course this gets ultimately translated into usual placeholders at the C and database layers; but an additional layer was needed here in order to allow for stepwise building of SQL conditions as just demonstrated above. Stepwise binding of values to placeholders requires I, as opposed to usual positional placeholders. Named placeholders are recognized according to a I, which is C by default. Here is an example : $statement->refine(-where => {col1 => '?:foo', col2 => '?:bar', col3 => '?:foo'}); Values are bound to these named parameters (either before or after the C step) through the L method : $statement->bind(foo => 123, bar => 456); If the default placeholder prefix C is inconvenient, another placeholder prefix may be specified as an option to the L. =head1 COLUMN HANDLERS AND TYPES =head2 Column handlers A I is just a pair C<< $handler_name => $handler_body >>; such pairs are associated to column names in tables, either at table definition time (see L), or later through the L method. Given any C<$row> object, a call to C<< $row->apply_column_handler($handler_name) >> will iterate over all columns present in that row, check if these columns have a handler of the corresponding name, and if so, execute the associated code. This can be useful for all sorts of data manipulation : =over =item * converting dates between internal database format and user presentation format =item * converting empty strings into null values =item * inflating scalar values into objects =item * column data validation =back Handler names B and B have a special meaning : they are called automatically just after reading data from the database, or just before writing into the database. Handler name B is used by the method L. =head2 Types as collections of handlers A I is just a collection of handlers, registered under a type name : it is a convenience for defining the same collection of handlers to various columns in various tables. Here is an example from the L : My::Schema->metadm->define_type( name => 'Percent', handlers => { from_DB => sub {$_[0] *= 100 if $_[0]}, to_DB => sub {$_[0] /= 100 if $_[0]}, validate => sub {$_[0] =~ /1?\d?\d/}), }); Note that this notion of "type" is independent from the actual datatypes defined within the database (integer, varchar, etc.). From the Perl side, these are all seen as scalar values. So a column type as defined here is just a way to specify some operations, programmed in Perl, that can be performed on the scalar values. =head1 GENERAL DESIGN PRINCIPLES Material in the previous sections presented the general architecture of C; this should be enough to easily follow the L chapter, or investigate more details in the L chapter. Now we will discuss the motivation for some design features of C, in order to explain not only I it works, but also I it was designed that way. This section can be safely skipped, unless you are interested in comparing various ORMs. =head2 Collaborate with lower-level layers, do not hide them C provides abstractions that help client applications to automate some common tasks; however, access to lower-level layers remains open, for cases where detailed operations are needed : =over =item * Data retrieval methods can return B. By default, the return value is an object or a list of objects corresponding to data rows; however, these methods can also return a handle to the underlying DBI statement, or even just the generated SQL code. Hence, the client code can take control whenever any fine tuning is needed. =item * Client code can insert B at various stages of the statement lifecycle : see parameters C<-post_SQL>, C<-pre_exec>, etc. to the L method. This provides an opportunity for running any driver-specific or application-specific code, at a particular point in the lifecycle. If the same hook is needed in every statement, another possibility is to subclass L and override the C, C or C methods. =item * Data rows exploit the dual nature of Perl objects : on one hand they can be seen as objects, with methods to walk through the data and access related rows from other tables, but on the other hand they can also be seen as B, with usual Perl idioms for extracting keys, values or slices of data. This dual nature is important for passing data to external helper modules, such as XML generators, Perl dumps, javascript JSON, templates of the Template Toolkit, etc. Such modules need to walk on the data tree, exploring keys, values and subtrees; so they cannot work if data columns are implemented as object-oriented methods. =back =head2 Let the database do the work =head3 Use RDBMS tools to create the schema Besides basic SQL data definition statements, RDBMS often come with their own helper tools for creating or modifying a database schema (interactive editors for tables, columns, datatypes, etc.). Therefore C provides no support in this area, and assumes that the database schema is pre-existent. To talk to the database, the framework only needs to know a bare minimum about the schema, namely the table names, primary keys, and UML associations; but no details are required about column names or their datatypes. =head3 Let the RDBMS check data integrity Most RDBMS have facilities for checking or ensuring integrity rules : foreign key constraints, restricted ranges for values, cascaded deletes, etc. C can also do some validation tasks, by setting up column types with a C handler; however, it is recommended to rather use the RDBMS for performing data integrity checks, whenever possible. =head3 Take advantage of database projections through variable-size objects In many ORMs, columns in the table are in 1-to-1 correspondence with attributes in the class; so any transfer between database and memory systematically includes all the columns, both for selects and for updates. Of course this has the advantage of simplicity for the programmer. However, it may be very inefficient if the client program only wants to read two columns from a very_big_table. Furthermore, unexpected concurrency problems may occur : in a scenario such as client1 client2 ======= ======= my $obj = My::Table->fetch($key); my $obj = My::Table->fetch($key); $obj->set(column1 => $val1); $obj->set(column2 => $val2); $obj->update; $obj->update; the final state of the row should theoretically be consistent for any concurrent execution of C and C. However, if the ORM layer blindly updates I columns, instead of just the changed columns, then the final value of C or C is unpredictable. To diminish the efficiency problem, some ORMs offer the possibility to partition columns into several I. The ORM layer then transparently fetches the appropriate groups in several steps, depending on which columns are requested from the client. However, this might be another source of inefficiency, if the client frequently needs one column from the first group and one from the second group. With C, the client code has precise control over which columns to transfer, because these can be specified separately at each method call. Whenever efficiency is not an issue, one can be lazy and specify nothing, in which case the SELECT columns will default to "*". Actually, the schema I, except for primary and foreign keys, and therefore would be unable to transparently decide which columns to retrieve. Consequently, objects from a given class may be of I : my $objs_A = My::Table->select(-columns => [qw/c1 c2/], -where => {name => {-like => "A%"}}; my $objs_B = My::Table->select(-columns => [qw/c3 c4 c5/], -where => {name => {-like => "B%"}}; my $objs_C = My::Table->select(# nothing specified : defaults to '*' -where => {name => {-like => "C%"}}; Therefore the programmer has much more freedom and control, but of course also more responsability : in this example, attempts to access column C in members of C<@$objs_B> would yield an error. =head3 Exploit database products (joins) through multiple inheritance ORMs often have difficulties to exploit database joins, because joins contain columns from several tables at once. If tables are mapped to classes, and rows are mapped to objects of those classes, then what should be the class of a joined row ? Three approaches can be taken =over =item * ignore database joins altogether : all joins are performed within the ORM, on the client side. This is of course the simplest way, but also the less efficient, because many database queries are needed in order to gather all the data. =item * ask a join from the database, then perform some reverse engineering to split each resulting row into several objects (partitioning the columns). =item * create on the fly a new subclass that inherits from all joined tables : data rows then simply become objects of that new subclass. This is the approach taken by C. =back =head2 High-level declarative specifications =head3 UML compositions for handling data trees Compositions are specific kinds of associations, pictured in UML with a black diamond on the side of the I class; in C, those are expressed by calling the schemas's L method instead of L. As a result, the composite class will be able to perform cascaded insertions and deletions on data trees (for example from structured data received through an external XML or JSON file, and inserted into the database in a single method call). The reverse is also true : the composite class is able to automatically call its own methods to gather data from associated classes and build a complete data tree in memory. This is declared through the L method and is useful for passing structured data to external modules, like for example XML or JSON exports. =head3 Views within the ORM C declarations usually map directly to database tables; but it is also possible to map to an SQL query, possibly with a predefined C clause : $schema->metadm->define_table( class => 'View_example', db_name => 'Foo INNER JOIN Bar ON Foo.fk=Bar.pk', where => {col => $special_filter}, primary_key => [qw/some_foo_col some_bar_col/], ); This is exactly the same idea as database views, except that they are implemented within the ORM, not within the database. Such views can join several tables, or can specify WHERE clauses to filter the data. ORM views are useful to implement application-specific or short-lived requests, that would not be worth registering persistently within the database model. They can also be useful if you have no administration rights in the database. Of course it is also possible to access database views, because the ORM sees them as ordinary tables. =head2 Extended SQL::Abstract API Every method involving a SELECT in the database (either when searching rows from a table or collection of tables, or when following associations from an existing row) accepts an number of optional parameters that closely correspond to SQL clauses. The programming interface reuses main concepts from L module, but with some extensions implemented in L. Therefore it is possible for example to specify =over =item * which columns to retrieve =item * which restriction criteria to apply (WHERE clause) =item * how to order the results =item * whether or not to retrieve distinct rows =item * etc. =back All these parameters are specified at the I, and therefore may vary between subsequent calls to the same class. This is in contrast with some other ORMs where the set of columns or the ordering criteria are specified at schema definition time. As already stated above, C gives more freedom to client programs, but also more responsability. =head2 Efficient interaction with the DBI layer Great care has been taken to interact with the database in the most efficient way, and to leave an open access to L fine-tuning options. In particular : =over =item * At the L level, the fastest way to get a large number of data rows from DBI is to retrieve each row into the same memory location, using the L method. C can exploit this feature through a I : my $statement = My::Table->select(-columns => ..., -where => ..., -result_as => 'fast_statement'); while (my $row = $statement->next) { work_with($row); } This code creates a single memory location for storing a data row; at each call to the C method, memory values are updated from the database, and the same location is returned. While being very fast, this approach also has some limitations : for example it is not possible to put such rows into an array (because the array would merely contain multiple references to the last row). So fast statements are not activated by default; regular statements create a fresh hashref for each row. =item * The client code can have fine control on statement preparation and execution, which is useful for writing efficient loops. For example, instead of writing my $list = My::Table->select(...); foreach my $obj (@$list) { my $related_rows = $obj->join(qw/role1 role2/)->select; ... } we can prepare a statement before the loop, and then just execute that statement at each iteration : my $statement = My::Table->join(qw/role1 role2/)->prepare; my $list = My::Table->select(...); foreach my $obj (@$list) { my $related_rows = $statement->execute($obj)->all; ... } =item * C has a L method, that works like C except that the statement handle returned is stored in a hash associated with the C<$dbh>. This can be exploited from C by stating $schema->dbi_prepare_method('prepare_cached'); =back =head1 DEPENDENCIES C depends on =over =item L =item L =item L =item L =item L =item L =back =head1 DESIGN FAQ Here are answers to some design choices, in the form of a FAQ. =head2 Why no accessor methods for columns ? The philosophy of C is that a data row is very lightweight: nothing more than a blessed hashref, where hash keys are column names and hash values are column values. So data access goes through the hashref API; therefore all common Perl idioms can be used, like # inspect hash keys my @column_names = keys @$row; # remove leading spaces in all columns s/^\s+// foreach values @$row; # print a slice print @{$row}{qw/col1 col2 col3/}; # swap values ($row->{col1}, $row->{col2}) = ($row->{col2}, $row->{col1}); # other way to swap values @{$row}{qw/col1 col2/} = @{$row}{qw/col2 col1/}; =head2 Why this C method ? In versions of C prior to 2.0, the schema was a I, not an instance : therefore the schema state (including the C<$dbh> connection to the database) was a global resource, stored as package variables within the schema class. Version 2.0 introduced the possibility to work in I mode, where each schema is an instance that holds its private data separate from other schemas (see L). However, multi-schema mode is a bit more verbose, a bit more costly in resources, and is only needed in some very special situations, like for example when transferring data between several databases; therefore the preferred mode is still the I mode, where a unique I schema instance is shared throughout the whole application, much like other global resources in Perl (for example C, C<%ENV>, or special variables C<$/>, C<$,>, etc.). When used with care, interaction of several components through a global resource can be quite handy : for example C does not need to be explicitly passed to every component, it is always available; furthermore, C can be redirected at one specific place and all collaborating components will immediately change behaviour accordingly. However, this must be done with care, because there is also a risk of undesired "action at distance" --- maybe the other components wanted to continue reading from the real C, not the redirected one ! To avoid undesired interactions through the global state, Perl offers the C construct, also known as I (see L). Using that construct, a component can temporarily redirect C, for the duration of one specific computation, and then automatically restore it to its previous state. C uses a similar approach. The database handle is stored in the schema instance, and can be changed dynamically, which will immediately affect all classes and objects that use this schema. However, state modifications in schemas can be limited to a specific scope through the L method. Furthermore, the L method takes an optional C<$dbh> argument to localize the transaction within a specific database handle. With these methods, sane management of the global state is quite easy, and since nested transactions are supported, it is perfectly possible to program transactions with cross-database operations (copying objects from one database to another, or simultaneously performing the same insert or delete in several databases). =head2 Serialization C includes support for the standard L serialization / deserialization methods C and C : so records and record trees can be written into files or sent to other processes. Dynamic subclasses for database joins are re-created on the fly during deserialization through C. However, there is no support for serializing database connections (this would be hazardous, and also insecure because serialization data would contain database passwords). Therefore the process performing deserialization is responsible for opening the database connection by its own means, before calling the C method. =head1 TO DO Here is a list of points to improve in future versions C : - 'has_invalid_columns' : should be called automatically before insert/update ? - 'validate' record handler (at record level, not only column handlers) - 'normalize' handler : for ex. transform empty string into null - walk through WHERE queries and apply 'to_DB' handler (not obvious!) - more extensive and more organized testing - add PKEYS keyword in -columns, will be automatically replaced by names of primary key columns of the touched tables - design API for easy dynamic association of row objects without dealing with the keys - remove spouse example from doc (because can't have same table twice in roles) - quoting - pre/post callbacks: support arrays of handlers, refine(..) should add to the array - refine(-order_by => ..) should add to the ordering - update with subtrees (insert/update on dependent records. Quid: delete?) - auto-unjoin (API for partioning columns into subobjects). - support for DBI TraceLevel - support DISTINCT ON ... - support find_or_create, update_or_create - doc : Schema has only tables, pk, fk (no cols, triggers, stored procs, domains) - doc : SQL-oriented (like Fey) - copy idea from DBIC:Storage:DBI:MultiColumnIn - Storable - hooks for YAML ? - think about STORABLE_attach # ADD TESTS for table inheritance - readonly tables - declare RT bug for Params::Validate : doc is confused about validate_with / validation_options - savepoints - various policies for guessing foreign keys : e.g proc.id = attr.proc_id - Statement::refine() should be more intelligent - combine (-where, -orderBy, callbacks) - reject overrides when not possible - should add something like a 'reset' method ?? - explain how to design families of tables with an intermediate superclass - add attribute 'db_schema_name' in Schema->new(). Then one can have my $dev = Schema->new(dbh => ..., db_schema_name => 'DEV'); my $prod = Schema->new(dbh => ..., db_schema_name => 'PROD'); and that name is prepended to table names - doc glossary, introduce the 'path' term - find a way to rename _singleInsert > _single_insert, without breaking compatibility - various policies for guessing foreign keys : e.g proc.id = attr.proc_id - Statement::refine() should be more intelligent - combine (-where, -orderBy, callbacks) - reject overrides when not possible - should add something like a 'reset' method ?? - declare RT bug for Params::Validate : doc is confused about validate_with / validation_options - doc : explain Table inheritance $schema->Table(..., {parent_tables => ...}); # ADD TESTS for table inheritance - Storable - hooks for YAML ? - think about STORABLE_attach - think about freeze/thaw in multi-schema mode