Solumslekt

Forays Into Genealogy Data Base Modelling

Part I: The Master Genealogist | Part II: Into the Blue | Part III: Exodus

Part III - Exodus

«What is clear is that everyone agrees that the current data model for GEDCOM and for most genealogy programs is outdated.»
-- Gary B. Hoffman, August 12, 1998 in GEDCOM: The Next Generation

Introduction

In the Humanities, particularly in the branch of History known as Prosopography, there's a certain kind of dissertations collectively branded as "Me and my Database". Although I'm under the distinct impression that the label doesn't have entirely positive connotations, my own present article series probably should be considered of the same kind. But as I'm an aging man, and write this at my leisure, and don't even seriously expect anyone to read it, you may quite frankly call it whatever you want. I do however hope that someone will find some useful nuggets of thought here.

Compared to the popularity of genealogy itself, there's a remarkable lack of interest in genealogy database issues. A Google search for "genealogy database design" will produce a disappointingly short list of relevant hits, most of them displaying shallow and uninteresting thoughts on the matter. In the Open Source world, there's some interest in Web publishing of data entered in any run-of-the-mill program, and transferred via GEDCOM. But that's hardly to the point. In my opinion, genealogy data modelling grew out of its GEDCOM shoes a long time ago. Clinging to GEDCOM as a "standard" is only inhibiting the natural development of better ways to organize the data.

The persistence of GEDCOM should probably not be attributed to its strength as a data interchange format, but rather to its weakness. Naturally, commercial genealogy program vendors have a vested interest in keeping their users locked into their programs. Paying lip service to the "interchangeability" of data by supporting an ancient and badly designed de facto "standard" that actually was abandoned by its developers before the Information Superhighway came crashing through the living room wall, before Unicode was embraced as a standard, and before XML redefined the entire field of data exchange, makes a lot of sense as a strategy for lock-in. It's sad that young developers endorsing open standards think that genealogy software has to be "GEDCOM-compatible" to be of any use. The GRAMPS developer Don Allingham agrees with me that "GEDCOM is a straitjacket", but the users demand it. What the users should demand, in my opinion, is direct import from other popular programs, like TMG has had for ages. Just Say No to GEDCOM. It sucks.

Considering the real and perceived value of their data, genealogists are among the ones who should be genuinely interested in finding the very best tools for their research. Having been employed at the IT department of Norway's national health insurance since 1992, I have aquired a professional attitude towards robust data entry, storage, and retrieval. Indeed, my entire attitude concerning computer software is very similar to the "Hole Hawg" reference succinctly described by Neal Stephenson in his wonderful essay "In The Beginning Was The Command Line". And it worries me deeply that even the most serious of the commercial genealogy programs don't begin to address quality issues in terms of eg. client-server technology, enforced referential integrity constraints, ACID transactions. During my years as a TMG user, I experienced several spectacular crashes, among them the dreaded "Error 41: Missing/invalid memo file" which is a real show-stopper. I should say that Whollygenes, the company making and selling TMG, is providing excellent help in case of data crashes, but still I think that is addressing the symptoms rather than the problem itself. And when I migrated my data from TMG to Postgres, I found several inconsistencies, including records with duplicate id numbers. That is of course impossible in real databases.

A monolithic database application with embedded storage, like what you typically build in FoxPro or Access, has some serious inherent flaws. If the UI crashes for any reason, such as a memory leakage, you frequently end up with corrupted data. As I will try to show in this article, it is actually both safer, in terms of data integrity, as well as easier, to program an interface to one of the industry-standard relational database management systems (RDBMS) in a client-server context, than to build a monolithic shrink-wrapped application. There's a whole world of difference.

Way back when personal computers packed a 20 MHz 80386 CPU along with maybe a couple of Megabytes of RAM, the use of toy database engines such as FoxPro made some kind of perverse sense. But today, when every computer spec is touting Gigas rather than Megas, the migration to real databases is long overdue. Personally, I'm delighted with the free RDBMS Postgres, which is running smoothly in a very comfortable footprint on my 3.0 GHz Pentium 4, with 1 GB RAM and a 400 GB S-ATA HDD. (Back in the early Nineties, those specs were definitely in the IBM mainframe range.) This database engine has almost all the bells and whistles of Oracle, and for an unbeatable price at that. It is mature, has well-documented reliability as well as scalability, and the mail-lists provide excellent support from both developers and very dedicated users. I have found it an ideal choice as the storage system for my genealogy database.

With the Postgres storage engine in the bottom, I have rolled my own genealogy database application. It is built entirely within an Open Source Web technology context, deploying the Linux/Apache/Postgres/PHP (LAPP) stack in a client-server model. (Apache, Postgres, and PHP are also available as Windows executables.) The user interface, or "client", is just a common browser window, and that part will of course be able to run on any computer -- or even on a hand-held device -- with a network connection and a route to the Web server. I have tested it from IE on my Windows XP laptop in the home LAN, and there's no significant difference between interacting with the database from there, or from Mozilla on my Gentoo Linux/KDE workstation that's hosting the entire application stack. And even if the browser window may crash and burn, the database itself is totally unaffected. Postgres is a real database engine, and supports real, ACID-compliant transactions. To avoid the risk of data corruption, any entry involving interdependent changes, with the potential to leave the database in an inconsistent state, is wrapped up in a transaction. That is how the professionals do it, the guys building data systems for banks and insurance companies. Why should genealogy databases, which by their nature are rather complex and contain valuable information, rely on less? What I'm trying to push through here, is that with a sophisticated data storage engine and a little forethought on behalf of the programmer, data integrity becomes the responsibility of the database itself, and the programmer can simply stop worrying about it. With an ACID-compliant RDBMS, if a series of database interactions is started with a "BEGIN" and ended with a "COMMIT", the RDBMS itself will ensure that everything between is handled as one single action unit. If any of the partial updates fail with invalid data, or even if the plug is pulled in the middle of a transaction, it never happens at all.

Now it's time to explain the name of this article. I was groping for a name to the subdirectory of my computer which would hold the PHP scripts interacting with the Postgres database, and I decided to tentatively call it exodus. The name has stuck, and in my mind it has become associated with the way out of TMG. The biblical connotations are very slight; Bob Velke is no evil Pharao, neither am I a great leader of men. And there's certainly no wandering in the desert for forty years. I am already accommodating myself in The Land of Free Software with the specters of my around 14,000 (mostly long-dead) relatives. It is a remarkably open environment.

The Interface

The user interface, which I will label as a "structured document" (or WYSIWYG) approach, has grown quite organically as a collection of PHP scripts. Originally, I was planning a kind of "TMG look-alike" in a traditional GUI. But then a strange synergy phenomenon just happened. I was tentatively entering data from the psql command prompt in order to work out some basic data-entry routines. At the same time, I was modifying the family sheet of my old PHP Web presentation suite to read data directly from the Postgres production database. As an experiment, I added some "Add / Edit / Delete" links to it, and then wrote corresponding forms and actions to update the database, always returning to the family sheet displaying the current person. All of a sudden, I had a working genealogy application, in the guise of a souped-up interactive "Family Group Sheet" in my Web browser. Here's a screenshot of the Main View, and here's one of the Edit Event form. (The interface has for a large part been translated to English with a separate language file, but my own data are in Norwegian.) Note that Event-type ("Type") and Place, as well as the date qualifier, are selected from list boxes.

The PHP layer is deliberately kept as "thin" as possible. There are no objects, no external framework. There are SQL queries scattered all over the place. I'm well aware that this programming style is considered bad practice by the ORM camp. But I, for one, am quite comfortable with SQL, and am very happy to let the database engine do all the heavy work. Whenever I need to do a JOIN, I define a view in the database, and let the PHP code access that. Thus, I can keep the SQL queries in the PHP code very simple and innocent-looking. Most of the files are small, from 2-3 lines of code in the simplest action scripts that issue just one pg_query("DELETE FROM ...") and a header() pointing back to the main view, and upwards to the main view itself (family.php) which is the most complex, and currently contains about 350 lines of code. There's also one common functions.php, where I place every function that is used by more than one script. As on this Web site, all the generated HTML is presented as (hopefully) validating XHTML 1.0. The code is mostly self-explaining, and every function and variable are given descriptive names. I hate code with function names like loop2() and one-letter variable names. I do tend to use $i for simple local iterators, though.

I've written a few command line utilities in Python combined with plain shell scripts where I consider it impractical to use the browser interface, such as the "upload" routine which makes two dumps of the database; one which is a transformed subset for the online Web presentation, and one complete backup containing everything needed to recreate the database from scratch. Both are packed into archive files, and uploaded to a private area on the Web.

While the data model is still mostly recognizable as "TMG-like", the interface is totally different. After using this interface daily for a few months, I experienced a startling revelation of how much the entire mindset about what should be entered, and in which way, is determined by the interface. The "family sheet" will show everything that is entered into the database about the person in view, along with names and lifespan of parents, and basic BMD info on spouses and children. It doesn't depend on any special report to show events, notes and sources; everything is there, right in front of your eyes. This is of enormous help as it becomes extremely easy to elaborate on the textual contents, and to gather new insights. And as this application doesn't try to be everything to everyone, but is tailored specifically for one single presentation form, you can focus on optimizing the contents for that particular view. It may appear terse, but to me for one, this kind of constraint feels like a liberation in disguise. As in TMG, you can enter any number of "events", and an unlimited amount of textual contents attached to each one. But, unlike TMG, the "sources" are part of the main view. You're working directly in the final report, so What You Get Is What You See. There's no way to hide from sloppy source references, a fact which is actively encouraging the entry of every scrap of documentation.

Regarding the business logic, there are still a few things that I have left undecided. The database will happily accept any number of eg. birth events for a person. When I first started using TMG ten years ago, I thought this was a great way of handling conflicting data, and entered multiple birth events for a number of persons where the sources disagreed. Today, I'll rather discuss the veracity of the sources textually, and be content with an approximate birth date where no exact source can be found. Personally, I prefer to be able to enter multiple birth events, and after some occasional cutting and pasting, delete the ones that may be considered preliminary or secondary. As the interface paradigm regards the entire collection of data pertaining to the person in view as one "structured document", I think that is a reasonable approach. To be able to spot multiple births and deaths, I've written a couple of "sanity checks" as SQL functions to be run from the command-line. They are discussed separately in the blog.

The Data Model

Here is a recapitulation of the data model from Part II of the article series, with a few minor modifications. There are three major entities: Persons, Events, and Sources. The Person/Event part is basically an Entity-Attribute-Value model, where the Entity is the Person, the Attribute is the Event type, and the Value is the contents of the Event instance (date, place, co-participants, additional info, as well as its associated source reference). This is in my opinion the only sensible way to associate an essentially unknown number of "facts" to a person. The Source part is a self-referential tree structure, of which I have more to say below. The design relies heavily on many-to-many relationships, implemented as junction tables. The Participants, Relations, Event-Citations, and Relation-Citations are all junction tables.

Persons

A Person has the major attributes PersonID, Name and Gender. Note that simpler data models usually will lump all kinds of information, such as parent ids, birth and death dates etc. in the person table. In this model, Persons are linked to other Persons through Relations. There's only one kind of Person-Person relation, that of Child-Parent. Persons are linked with Events through a many-to-many relation called Participants. Thus, a "person" can be viewed as the set of "events" in which she is a participant, plus her "relations" to parents and children.

Events

An Event is a structured rendering of any direct or indirect evidence found in one or more "sources", and has the attributes Event-type, Date, Place, and Note. All attributes, with the exception of Event-type, are optional. There's one auxiliary attribute called Sort-date, which will allow the researcher to sort events in a chronological order even if the exact date of an event is unknown. I'm also using a lot of free-text notes, which are regular events of the type Note, but usually it's undesirable to associate explicit dates or places with a note. However, the sort-date will place the notes in the appropriate sequence. Event-types ("tags") and Places are kept in separate tables. Most Events have only one participant. Some, most notably the Marriage event type, usually have two, but there are no inherent restrictions in the data model regarding the number of participants in an event.

The Event-types are mostly identical, in nomenclature as well as in functionality, to GEDCOM "tags". In addition to the common GEDCOM tag bestiary, I have included a couple of other useful two-person events: One called "Probably Identical", which is a kind of "preliminary merge", and one called "Confused (with)". I'm using the latter where I find that another researcher has published conclusions about identity based on insufficient or garbled evidence, and mixed up different persons. It's very convenient to link up the persons at issue with this tag, where I often will include considerable textual matter to show "how things really hang together".

Families

Note that this model, like TMG, has no Family entity in its own right. The Marriage event is the only linkage between spouses. A Marriage is structured like any other event, but has a special significance with respect to the "Family View" of the interface (ie. children of married vs unmarried couples will show up in slightly different manners). If you ever need to express an assertion like eg. that William and Robert Brown were brothers, the way to do it in a "family-less" database is to associate a parent of unknown gender to them, who may be called "Common parent of William and Robert Brown".

I'll cite a passage from a short, but interesting article by Barry Leadbeater, titled "Genealogy database design and its genetic basis", where he discusses the GEDCOM Family entity: "The FAM records contain all marriages. But they also unnecessarily contain references to the children of each marriage. The result is a considerable amount of redundant reference data with the potential for conflicts and inconsistencies. That is, the GEDCOM model is fundamentally flawed and therefore should not be used as the basis for a genealogy database design. However, it will still be necessary in the foreseeable future to provide a means of converting data to and from the GEDCOM format to enable export to, and import from other GEDCOM enabled databases."

Sources

Sources are linked with themselves in a tree-structure. In the words of the Gentech GDM document (5.3 Evidence Submodel): "The only logical way to model SOURCE data is as a self-referential hierarchy with an unknown number of levels." I have not implemented the actual structure of the Gentech evidence model, which I find overly complex. Instead, I'm working from this simple postulate: Any source reference can be expressed as a recursive concatenation of strings. Those strings may of course contain HTML, thus hyperlinks to online resources are an integral part of the reference system. The generic source-types are close to the root, and the particularia are at the "leaves". Any "node" or "leaf" of the structure can be associated with Events and Relations through the many-to-many relations Event-Citations and Relation-Citations. Consequently, you can associate any number of source references to any event or relation. The actual citation is generated on the fly by a recursive function that concatenates text from the point of entry towards the root, or "Mother of all sources", which has source_id = 0:

CREATE OR REPLACE FUNCTION get_source_text(INTEGER) RETURNS TEXT AS $$
-- return full source string
DECLARE
    src sources%ROWTYPE;
    mystring TEXT;
BEGIN
    SELECT * FROM sources INTO src WHERE source_id = $1;
    mystring := src.large_text;
    IF src.parent_id <> 0 THEN
        mystring := get_source_text(src.parent_id) || ' ' || mystring;
    END IF;
    RETURN COALESCE(mystring, '[undefined]');
END;
$$ LANGUAGE plpgsql STABLE;

Thus, the structure of the "source tree" is given by how the text will display itself in the citation. The process is entirely data-driven.

This source model has proven itself to be an incredibly powerful asset. I now routinely add every "source fragment" to the note apparatus, and can assert at a glance every known scrap of information. Even if the interface is still clunky, it's a sheer delight to work with it. Formerly, in TMG, I usually only entered the page number of a source in the Citation Detail field. Now, those page numbers have become "parent sources" for the actual source transcripts. Here is a screenshot of my newest addition to the application, the Source Manager. It will show every subsource and citation of any source id, and provides clickable links to the main edit screen for every person referenced by the source. The number triplet in parentheses is my own shorthand to keep count of the number of event-citations, relation-citations, and subsources respectively. It is a great tool for cleaning up sources and citations, as I can keep the source view open in one window, and open the family view of any person in another window. When I'm done editing the person, I can just close that window and return to the source view.

That's one of the great things about the Web paradigm: You're not locked into one fixed view of anything. You may have a virtually unlimited number of views of the database open at the same time. Compare that to your traditional monolithic genealogy app, where you are usually limited to one view at a time in a rigid layout.

The "source entry" is standardized throughout the application, and is calling one common function. There are two fields in the source input. One is a text input field, where you enter an integer for the node you wish to cite, and the other is a "text area" where you can enter supplemental text. If the text area is empty, a citation to the referred node is inserted in the event_citations or relation_citations, depending on where the routine was called from. If the text area is not empty, a new node is inserted in the sources table with the referred node as the parent, and a citation as above. In practice, this is an easy and flexible system. You can see how it works in the Edit Event screen, where the entered source number is identical to the earlier citation displayed below, and the actual source extract has been entered into the text area. This will add a new node in the source tree.

It should be noted that the Family Sheet will display each source text only once. Thus, if I eg. have entered a child living with her parents from the 1801 census, there should be only one citation to the census. This covers both parents, an approximate birth year based on the recorded age, and the census event itself. In the Add Event form, I have a field for the person's age. If a numeric age is entered, the program will automatically add a Birth event with an approximate birth year (year of event minus age), and a citation to the same source as the entered event. This is what currently is implemented of my idea of Birth as a "Meta-event" as outlined in Part 2.

This calls for a major restructuring of the entire event "Memo" field versus "Citations". In my 18th century research, I have mostly recorded Birth, Baptism, Death, and Burial as separate events. But in the new context, as births and deaths mainly are documented from baptism and burial entries in the church records, I find it much tidier to move the source transcripts from the Memo fields of those events into the source section. The separate Baptism and Burial events thus become superfluous, and I'm in the process of phasing them out of the database.

I have discovered that the "inherited" citations tables of Part II weren't such a great idea after all. It turned out that the constraints applied to the parent table weren't inherited, practically rendering the inheritance useless. Here are the current data definitions of the citations tables:

CREATE TABLE relation_citations (
    relation_fk INTEGER REFERENCES relations (relation_id) ON DELETE CASCADE,
    source_fk INTEGER REFERENCES sources (source_id),
    PRIMARY KEY (relation_fk, source_fk)
);

CREATE TABLE event_citations (
    event_fk INTEGER REFERENCES events (event_id) ON DELETE CASCADE,
    source_fk INTEGER REFERENCES sources (source_id),
    PRIMARY KEY (event_fk, source_fk)
);

Note that the surrogate keys have been replaced with a composite primary key.

Participants, Principals and Witnesses

The participants entity has been radically restructured compared to the description in Part II. While I have entered or edited several thousand events during the few months that I've been using this program, I haven't felt any desire to fill in those fields that don't exist in TMG. Moreover, the "Participant Note" carried over from TMG is all but unused in the current implementation.

In the "family sheet" I display each event syntactically described as

EVENT-TYPE[ DATE][ PLACE][ with <a href="family.php?person=xxx">Name of other principal</a>][: NOTE]

The square brackets indicate optional contents. It should be noted that every item apart from the EVENT-TYPE is optional.

There's simply no place to display the "vector in genealogy space" that I discussed in Part 2. And, as everything from the source documents will be literally transcribed into either the event notes or the source citations, I can see no need to store eg. name variants separately. Neither are the "roles" of any significant use in my opinion. During the eight years that I was using TMG, I found that entering eg. witnesses to a baptism as participants in the event mostly was more of a bother than of any real help. It's vastly more useful to see the actual source text immediately in the person display. The main thing that I actually used the "witness" feature for, was the widow / widower in the death event of a spouse. That made it easy to note at a glance at the Person View if I had recorded the death of a spouse, particularly useful in case of a new marriage. But that is simply a consequence of the way events are displayed in TMG, and to no help in my "family sheet" approach. Another frequent usage of non-principals was for the members of a household in Census events, but because I used a lot of sentence variables specific to TMG, it turned out to be a major hurdle to convert them to the new data structure. I have now deleted every non-principal participant carrying over from the "Witness" file of my old TMG database, and have just started entering non-principals in an entirely different context which I will return to below. This may sound like a drastic measure, but I have been putting it off for several months, and have also been browsing the data to see if there was anything worth salvaging. There wasn't. So, now I'm in the process of reviewing all the census data for 3000-4000 persons previously entered into TMG. As I have the transcribed census lists in printed form, tagged with my own ID numbers of every person, this is a great opportunity for revision. I'm entering a lot of new source data, and am finding new relations and reinterpretations all the time.

I have trimmed down the participants table to only:

 CREATE TABLE participants (
    person_fk     INTEGER REFERENCES persons (person_id),
    event_fk      INTEGER REFERENCES events (event_id) ON DELETE CASCADE,
    is_principal  BOOLEAN NOT NULL DEFAULT TRUE,
    PRIMARY KEY (person_fk, event_fk)
); 

These are the parts that I actually use. There's no need for a surrogate key here, as the person / event pair should be unique (in plain English: any person can participate only once in a specific event).

Regarding "principals": Most events have only one participant, consider eg. a birth or a death. That person will of course always be regarded as the principal, or the primary person of that event. Some event types will always have two principals, as a marriage. But is there any need for more principals? The reason I ask, is because of the program logic behind the display of an event. In eg. a marriage, we will always want to display and link to the spouse in the events entence, so that the marriage between John Smith and Jane Doe will show up in his family sheet as something like "Married 12 OCT 1755 in Tusquewaga, Wisconsin with Jane Doe". (I know that "married with" is poor English, but to keep things simple, whenever there is a second principal, the preposition is "with" in my implementation. There are of course ways around this. You could for instance store a preposition in the tags table to be used with a second principal.) You may of course add any number of participants to any event, the program itself won't complain about it. But then it will become impossible to just display the "other" principal in a coherent manner. Which one would that be? I'm also experimenting with events without principals, such as a real estate transaction that will often involve several persons (and the real principal probably is the estate in question). I simply input these as free-form notes, with hard-coded links to each person. Concerning this "hard-coding": It isn't really as "hard" as the name implies. The person links are input on the form [p=xxx|yyy], where the 'xxx' part is the person ID, and the 'yyy' part is the name of that person. This is parsed and expanded to a full link whenever the text is displayed. Because of how my merge routine works (see the section on "Merging Persons"), there should be no risk of these links going stale either.

I have now resolved the entire principal / non-principal mess with the new entity tag_types. Because this is an important change to the data model, I'll include the text from my blog entry describing it:

I've never really come to terms with the TMG notion of "principals" and "witnesses". The idea is that eg. for a christening you can enter the child as "principal", and the godparents as "witnesses". Through the years, I've never done much more than playing around with this concept, as it's both relatively time-consuming to enter witnesses, (IMO) of little use, and potentially cluttering up the person view with irrelevant details.

However, my data model still carries along this notion as a legacy from TMG. The participants table still has the column is_principal, but I've moved further and further away from actually using the value of this column. Instead, I've come to realize that there are really three kinds of events:

  1. The "one-person" kind, like birth and death.
  2. The "two-persons" kind, like marriage and divorce. This is also the category for my own "probably identical", and "confused" events.
  3. The "multiple-persons" kind, where you can have any number of participants. In other programs, this concept may be known as "shared notes" or "history tags".

As a consequence of this realization, I've added another table:

CREATE TABLE tag_types (
    tag_type_id     INTEGER PRIMARY KEY,
    description     VARCHAR(20)
);

INSERT INTO tag_types VALUES (1, 'single');
INSERT INTO tag_types VALUES (2, 'double');
INSERT INTO tag_types VALUES (3, 'multiple');

And a foreign key in the tags table:

ALTER TABLE tags
    ADD COLUMN tag_type_fk INTEGER
        REFERENCES tag_types (tag_type_id);

With this approach, you can forget all about "principals" and "witnesses". A birth will have exactly one participant. A marriage will have two participants -- that is, provided that it's correctly entered in the first place. A "note" may have any number of participants, and it will show up on the person view of any of them.

One benefit of this is that it simplifies the layout of the person view. But even more, it enables me to differentiate the handling of events according to their nature. I've just started to explore the interface issues, but I expect that some aspects of the program will change a lot in the next few months.

Merging persons

Here is another blog entry that deserves being incorporated into the article.

Merging persons is an essential feature of any genealogy database application. For a variety of reasons, it is also one of the most complex operations. However, this is one of the places where an adequate data model really is paying back both to the developer and the user, in terms of ease of programming as well as minimal loss of data.

In the merge script, I have defined the two persons to be merged as the 'source' and the 'target'. All events and relations will be transferred from the 'source' person to the 'target' person. Rather than deleting the 'source' person, I'll keep him/her in the database, with a single link to the 'target' person. Every merge is also written to a table called 'merged', with the id's of the two persons, and the date of the merge. This table does provide a rudimentary audit trail, but it has another important function. In my presentation application, there's a check in the main view: If the person_id is found in the 'old_person_fk' of the 'merged' table, a header is sent that will redirect the browser to the 'new_person_fk' of the same record. This satisfies at least two important objectives:

  1. External or static links to persons will never cease to function.
  2. A 'skeleton' person is preserved, which will make it easier to revert a merge that later may be proven fallacious. (Those things do happen occasionally.)

This approach is a nod to the GDM 'tree-structured' approach to the 'persona', and goes a long way to meet my own objections to the 'single person ID' in my previous articles in this series.

It might be prudent to preserve all events of the 'source' person, and merely copy them to the 'target' person. That, however, is a different can of worms, and is currently not implemented.

The Database

Referential Integrity

You get an astonishing wealth of database features for free in Postgres, of which the most basic is probably the referential integrity. For instance, the participants table has a person_fk foreign key that REFERENCES the persons (person_id). There's no way that you can add a participant that doesn't exist in the persons table to an event. (From the application's point of view it's actually the other way around: The program will prompt you to add an 'event' to a 'person', which is a much more intuitive concept.)

On the other hand, you can't delete a person that is referenced from the participants, and I think that is good design: You should never delete a person that still has any event or relatives associated with it. However, I think it's okay to delete an event without bothering if it still is referenced from participants. This is where the "cascade" feature comes in handy. In the participants table, the event_fk is defined as INTEGER REFERENCES events (event_id) ON DELETE CASCADE. This means that if I delete an event, the associated participants are automatically hosed. (The same goes for source citations.) This design eliminates the classic nuisance of "dangling cross-references" commonly cropping up in databases lacking referential integrity.

Views & Functions

I'm using Postgres views and functions liberally. A database "view" is a kind of pseudo-table; you can write a query that eg. does a join between two or more tables, and then presents the result as just another table. It makes the interface coding a lot more easy, as you can do a simple "SELECT * FROM myview" from any external script, instead of rewriting a complex -- and error-prone -- query of the actual tables. This abstracts the internal design of the database from the interface, which generally is considered a Good Thing[TM]. Here is a short example that recreates the "denormalized" persons table for the Web presentation. For performance and backwards compatibility reasons, it's almost identical to the corresponding TMG entity, hence the name:

CREATE OR REPLACE VIEW tmg_persons AS
SELECT
    person_id,
    get_parent(person_id,1) AS father_id,
    get_parent(person_id,2) AS mother_id,
    last_edit,
    get_pbdate(person_id) AS pb_date,
    get_pddate(person_id) AS pd_date,
    gender AS s,
    living AS l,
    rolle AS r,
    is_public AS p
FROM persons;

A Postgres function works like a function in any old programming language, as the programmer's regular building-block. It will take parameters, and return values. It does assignments, formulas, loops, and conditionals. Besides, it can do almost any internal operation on the database, including full-blown transactions updating any number of tables. The limitations are mostly in the mind of the programmer. In the view above, there are several function calls. Without them, the query would have been vastly more complex, requiring some nasty outer joins. Here's the get_parent() function:

CREATE OR REPLACE FUNCTION get_parent(INTEGER,INTEGER) RETURNS INTEGER AS $$
DECLARE
    person ALIAS FOR $1;    -- person ID
    rel_type ALIAS FOR $2;  -- gender code (1=male, 2=female)
    par INTEGER;            -- person ID of parent, returned by func
BEGIN
    SELECT parent_fk INTO par FROM relations
        WHERE child_fk = person AND relation_type = rel_type;
    RETURN COALESCE(par,0); -- will return parent ID if it exists, 0 otherwise
END;
$$ LANGUAGE plpgsql STABLE;

The PL/pgSQL language is just one of several that you can use for writing functions or any stored procedures in Postgres. I've found it to be well documented, and the syntax comes naturally to an old Pascal hacker. By design, it's very similar to Oracle's PL/SQL.

The benefits of employing the "intelligence" of the database engine to add an abstraction layer should be obvious. It simplifies the actual interface programming, and enables the designer to put the business logic as close as possible to the actual storage. In enterprise system architecture, the database server and the Web server are usually residing on separate machines. When you know what the RDBMS can do for you, you don't want to create an endless chatter between the two systems by sending intermediate data back and forth. By thinking strictly in terms of "database" and "web" as distinct spheres, you're laying the foundation of a solid and scalable design.

Because of this three-tier design implemented through views, functions, and triggers, the PHP scripts that interface with the Postgres database are much shorter and simpler than the ones that I wrote for the MySQL database. On the other hand, as mentioned above, I'm using the same features to "flatten" the table structure of the Web database, which still is running on MySQL with MyISAM tables. That combo will serve up simple SELECTs very fast, which is exactly what a Web application should do. The MySQL Web app will typically generate a family sheet in around 0.1 sec. Initially, the same family sheet in my Postgres production database took about 0.7 sec. But after a restructuring of some key queries to better take advantage of the new data model, along with a couple of indexes, mainly on the foreign key columns of the Events table, both applications now run equally fast. A generation time of about 0.1 seconds is probably faster than the HTML can be loaded into the browser anyway, even if everything is running on the localhost.

But please note that here I'm really comparing apples and potatoes. The underlying data structures are quite different. I don't think that MySQL 4.x would cope very well with the data structure and queries of the production database. Besides, thanks to Postgres' views and functions, it was easier to push out a backport of the database than doing a rewrite of the Web scripts. The thought of reimplementing my internal Postgres code in PHP-MySQL makes me shudder. In the end, it's a question of flexibility versus speed. For instance, in my Postgres production database, the source text is generated on the fly with the recursive function get_source_text() (see above). The sources table of the Web database contains the full concatenated string, output at dump time by the same function, because I never need to edit the source text in the Web app. This is of course one of the reasons why the MySQL Web app will keep up with the Postgres production database. The duplication of parent IDs and BD dates into the persons table as mentioned above is another obvious one. The "tmg_persons" view is almost identical to the TMG persons table, and is what I read from when generating the Web database. It's of course far easier to read parent IDs, birth and death dates directly from one table, than to hunt around for them via separate queries. And this is the core of the difference between the 'logical' and the 'physical' data model. The logical model is the same for the production and the presentation databases, while the physical models are slightly different because they serve different purposes.

Conclusion

In its present state, I don't consider the application fit for the general public, and I will enumerate some of the reasons:

The Entity-Attribute-Value model (se above) is extremely flexible, in many people's view too much so. As David M. points out in his blog:

"The usual advantage noted, is that it allows end-users to modify (Add) attributes to the business model without professional personnel (DBAs & programmers). If we use the analogy of building a house, essentially it is designed to allow the home owner the ability to add more rooms without any architectural or building training. Can you image the house after a couple of changes?. Something Escher would be proud of ..."

Finally, there's what I'll describe as the 'Geek factor'. You've got to be a little weird to dissect the database from a commercial program, dump and convert it to an entirely different format, and then write your own program logic to interact with it from scratch. The application is the digital equivalent of a monster truck: With its heavy-duty engine and wheels, and its somewhat arcane controls, it obeys no one but its creator. And of course it stomps all over the commercial "family" programs. But it will just as easily break through any fence, and wreak havoc and destruction in the hands of a casual user. As Spider-Man observes: "With great power comes great responsibility." But if you like the concept and really want to collaborate on a version at least usable by other "monster truck" aficionados, you are hereby invited to contact me at the email address shown at the bottom of the page.

You can download my Bleeding-edge tarball anytime. The code is released under the GPL license.

The bulk of this article was published in the autumn of 2006. As I'm reviewing it one year later, I notice that just in the past few months, there has been a surge of interest in commercial online genealogy databases with a browser interface, where you enter and store your own data for a fee, and the service provider is responsible for the interface as well as the data storage. Eventually, I'm pretty certain that this approach will transform the entire genealogy software business, and the personal genealogy application where the data is stored on your own HDD (and at your own risk!) will probably become extinct in a few years. The Web browser is the perfect interface to genealogy data, because it's all about links: Internal linkage between persons in the database, as well as links to external resources. Genealogy and the Web is a marriage made in heaven.

Kirkerud, 2007-10-03, Leif Biberg Kristensen

Valid XHTML 1.0!

Back to Contents