Versioning MySQL data

by Arnold Daniels on 11/12/2009

As a developer you’re probably using a versioning control system, like subversion or git, to safeguard your data. Advantages of using a VCS are that you can walk to the individual changes for a document, see who made each change and revert back to specific revision if needed. These are features which would also be nice for data stored in a database. With the use of triggers we can implement versioning for data stored in a MySQL db.

The revisioning table
We will not store the different versions of the records in the original table. We want this solution to be in the database layer instead of putting all the logic in the application layer. Instead we’ll create a new table, which stores all the different versions and lives next to the original table, which only contains the current version of each record. This revisioning table is copy of the original table, with a couple of additional fields.

CREATE TABLE `_revision_mytable` LIKE `mytable`;

ALTER TABLE `_revision_mytable`
  CHANGE `id` `id` int(10) unsigned,
  DROP PRIMARY KEY,
  ADD `_revision` bigint unsigned AUTO_INCREMENT,
  ADD `_revision_previous` bigint unsigned NULL,
  ADD `_revision_action` enum('INSERT','UPDATE') default NULL,
  ADD `_revision_user_id` int(10) unsigned NULL,
  ADD `_revision_timestamp` datetime NULL default NULL,
  ADD `_revision_comment` text NULL,
  ADD PRIMARY KEY (`_revision`),
  ADD INDEX (`_revision_previous`),
  ADD INDEX `org_primary` (`id`);

The most important field is `_revision`. This field contains a unique identifier for a version of a record from the table. Since this is the unique identifier in the revisioning table, the original id field becomes a normal (indexed) field.

We’ll also store some additional information in the revisioning table. The `_revision_previous` field hold the revision nr of the version that was updated to create this revision. Field `_revision_action` holds the action that was executed to create this revision. This field has an extra function that will discussed later. The user id and timestamp are useful for blaming changes on someone. We can add some comment per revision.

The database user is probably always the same. Storing this in the user id field is not useful. Instead, we can set variable @auth_id after logging in and on connecting to the database to the session user.

Altering the original table
The original table needs 2 additional fields: `_revision` and `_revision_comment`. The `_revision` field holds the current active version. The field can also be used to revert to a different revision. The value of `_revision_comment` set on an update or insert will end up in the revisioning table. The field in the original table will always be empty.

ALTER TABLE `mytable`
  ADD `_revision` bigint unsigned NULL,
  ADD `_revision_comment` text NULL,
  ADD UNIQUE INDEX (`_revision`);

The history table
Saving each version is not enough. Since we can revert back to older revisions and of course delete the record altogether, we want to store which version of the record was enabled at what time. The history table only needs to hold the revision number and a timestamp. We’ll add the primary key fields, so it’s easier to query. A user id field is included again to blame.

CREATE TABLE `_revhistory_mytable` (
  `id` int(10) unsigned,
  `_revision` bigint unsigned NULL,
  `_revhistory_user_id` int(10) unsigned NULL,
  `_revhistory_timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
  INDEX (`id`),
  INDEX (_revision),
  INDEX (_revhistory_user_id),
  INDEX (_revhistory_timestamp)
) ENGINE=InnoDB;

How to use
Inserting, updating and deleting data should work as normal, including the INSERT … ON DUPLICATE KEY UPDATE syntax. When updating the _revision field shouldn’t be changed.

To switch to a different version, we would do something like

UPDATE mytable SET _revision=$rev WHERE id=$id;

However if the record has been deleted, there will be no record in the original table, therefore the update won’t do anything. Instead we could insert a record, specifying the revision.

INSERT INTO mytable SET _revision=$rev;

We can combine these two into a statement that works either way.

INSERT INTO mytable SET id=$id, _revision=$rev ON DUPLICATE KEY UPDATE _revision=VALUES(_revision);

The above query shows that there an additional constraint. The only thing that indicates that different versions is of the same record, is the primary key. Therefore value of the primary key can’t change on update. This might mean that some tables need to start using surrogate keys if they are not.

On Insert
Let’s dive into the triggers. We’ll start with before insert. This trigger should get the values of a revision when the _revision field is set, or otherwise add a new row to the revision table.

CREATE TRIGGER `mytable-beforeinsert` BEFORE INSERT ON `mytable`
  FOR EACH ROW BEGIN
    DECLARE `var-id` int(10) unsigned;
    DECLARE `var-title` varchar(45);
    DECLARE `var-body` text;
    DECLARE `var-_revision` BIGINT UNSIGNED;
    DECLARE revisionCursor CURSOR FOR SELECT `id`, `title`, `body` FROM `_revision_mytable` WHERE `_revision`=`var-_revision` LIMIT 1;
  
    IF NEW.`_revision` IS NULL THEN
      INSERT INTO `_revision_mytable` (`_revision_comment`, `_revision_user_id`, `_revision_timestamp`) VALUES (NEW.`_revision_comment`, @auth_uid, NOW());
      SET NEW.`_revision` = LAST_INSERT_ID();
    ELSE
      SET `var-_revision`=NEW.`_revision`;
      OPEN revisionCursor;
      FETCH revisionCursor INTO `var-id`, `var-title`, `var-body`;
      CLOSE revisionCursor;
      
      SET NEW.`id` = `var-id`, NEW.`title` = `var-title`, NEW.`body` = `var-body`;
    END IF;
    
    SET NEW.`_revision_comment` = NULL;
  END

CREATE TRIGGER `mytable-afterinsert` AFTER INSERT ON `mytable`
  FOR EACH ROW BEGIN
    UPDATE `_revision_mytable` SET `id` = NEW.`id`, `title` = NEW.`title`, `body` = NEW.`body`, `_revision_action`='INSERT' WHERE `_revision`=NEW.`_revision` AND `_revision_action` IS NULL;
    INSERT INTO `_revhistory_mytable` VALUES (NEW.`id`, NEW.`_revision`, @auth_uid, NOW());
  END

If the `_revision` field is NULL, we insert a new row into the revision table. This action is primarily to get a revision number. We set the comment, user id and timestamp. We won’t set the values, action and previous id yet. The insert might fail or be converted into an update action by insert on duplicate key update. If the insert action fails, we’ll have an unused row in the revisioning table. This is a problem, since the primary key has not been set, so it won’t show up anywhere. We can clean up these phantom records once in a while to keep the table clean.

When `_revision` is set, we use a cursor to get the values from the revision table. We can’t fetch to values directly into NEW, therefore we first fetch them into variables and than copy that into NEW.

After insert, we’ll update the revision, setting the values and the action. However, the insert might have been an undelete action. In that case `_revision_action` is already set and we don’t need to update the revision. We also add an entry in the history table.

On Update
The before and after update trigger do more or less the same as the before and after insert trigger.

CREATE TRIGGER `mytable-beforeupdate` BEFORE UPDATE ON `mytable`
  FOR EACH ROW BEGIN
    DECLARE `var-id` int(10) unsigned;
    DECLARE `var-title` varchar(45);
    DECLARE `var-body` text;
    DECLARE `var-_revision` BIGINT UNSIGNED;
    DECLARE `var-_revision_action` enum('INSERT','UPDATE','DELETE');
    DECLARE revisionCursor CURSOR FOR SELECT `id`, `title`, `body`, `_revision_action` FROM `_revision_mytable` WHERE `_revision`=`var-_revision` LIMIT 1;
    
    IF NEW.`_revision` = OLD.`_revision` THEN
      SET NEW.`_revision` = NULL;
      
    ELSEIF NEW.`_revision` IS NOT NULL THEN 
      SET `var-_revision` = NEW.`_revision`;
      
      OPEN revisionCursor;
      FETCH revisionCursor INTO `var-id`, `var-title`, `var-body`, `var-_revision_action`;
      CLOSE revisionCursor;
      
      IF `var-_revision_action` IS NOT NULL THEN
        SET NEW.`id` = `var-id`, NEW.`title` = `var-title`, NEW.`body` = `var-body`;
      END IF;
    END IF;

    IF (NEW.`id` != OLD.`id` OR NEW.`id` IS NULL != OLD.`id` IS NULL) THEN
-- Workaround for missing SIGNAL command
      DO `Can't change the value of the primary key of table 'mytable' because of revisioning`;
    END IF;

    IF NEW.`_revision` IS NULL THEN
      INSERT INTO `_revision_mytable` (`_revision_previous`, `_revision_comment`, `_revision_user_id`, `_revision_timestamp`) VALUES (OLD.`_revision`, NEW.`_revision_comment`, @auth_uid, NOW());
      SET NEW.`_revision` = LAST_INSERT_ID();
    END IF;
    
    SET NEW.`_revision_comment` = NULL;
  END

CREATE TRIGGER `mytable-afterupdate` AFTER UPDATE ON `mytable`
  FOR EACH ROW BEGIN
    UPDATE `_revision_mytable` SET `id` = NEW.`id`, `title` = NEW.`title`, `body` = NEW.`body`, `_revision_action`='UPDATE' WHERE `_revision`=NEW.`_revision` AND `_revision_action` IS NULL;
    INSERT INTO `_revhistory_mytable` VALUES (NEW.`id`, NEW.`_revision`, @auth_uid, NOW());
  END

If `_revision` is not set, it has the old value. In that case a new revision should be created. Setting `_revision` to NULL will have the same behaviour of not setting `_revision`. Next to the comment, user id and timestamp, we add also set the previous revision.

As said before, it’s very important that the value of primary key doesn’t change. We need to check this and trigger an error, if it would be changed.

On Delete
Deleting won’t create a new revisiong. However we do want to log that the record has been deleted. Therefore we add an entry to the history table with `_revision` set to NULL.

CREATE TRIGGER `mytable-afterdelete` AFTER DELETE ON `mytable`
  FOR EACH ROW BEGIN
    INSERT INTO `_revhistory_mytable` VALUES (OLD.`id`, NULL, @auth_uid, NOW());
  END

To conclude
Using triggers we can implement the basic versioning functionality to MySQL. Since this is completely done the by database, it can be added to an existing application, without having to change to application code (or with very little changes). Using the history table, we can get the data of the database on any moment in time.

There are some situations where this solution as a bit to basic. A record might span across multiple table, like an invoice with invoice lines. In that case, we don’t want to revision each individual invoice line, but the invoice as a whole. I’ll come around in a follow up with a solution for this. I can tell up front that this solution is unfortunately not as clean as these basics.

Continue reading
Please continue reading the follow up article ‘Versioning MySQL data: Multi-table records‘. At the bottom of that article you’ll find a download link for a script that adds revisioning to existing MySQL tables.

Arnold Daniels

I've spend a big part of my life behind a computer, learning about databases (MySQL), programming (PHP) and system administration (Linux). Currently I playing with HTML5, jquery and node.js.

More Posts

Follow Me:
TwitterLinkedIn

There are 15 comments in this article:

  1. 12 November 2009Maxim says:

    Hi! nice decision.
    Are know instrument, that can versioning DB struture and creating it automaticly?

    ReplyReply
  2. 12 November 2009Arnold Daniels says:

    Maxim: Did you download the PHP script (available at the bottom of this article)?

    ReplyReply
  3. 12 November 2009Ed says:

    The script is empty :(

    ReplyReply
  4. 12 November 2009Baron says:

    It sounds nice, but in practice, it is hell. I can’t tell you how many people’s data I have fixed after replication and other issues cause problems. In fact, three of my colleagues are just discussing a mess on a customer’s systems right now that’s caused by triggers, stored procedures, and replication.

    ReplyReply
  5. 12 November 2009Arnold Daniels says:

    Ed: Sorry for that, please try again.

    Baron: I agree that you have to be careful with using triggers to alter data. Keep it simple and don’t try to develop half your application in triggers and stored procedures.

    However, with this method it is very unlikely to mess up your data (in the original table), since:
    1. A trigger on the original table, can’t update that same table. You can therefore only mess up the data you’re inserting/updating.
    2. The data in the revisioning table is a copy. If an error would mess up that table, you can remove it and only use the history, not the current data.
    3. You should still run a nightly backup and store that off-disk. The data would still be lost I the disk crash or something else terrible would happen. There are no alternatives to making backups, not this, not replication, not RAID, not anything.
    4. Programming error can happen anywhere, here as well as in the application code.

    If you see something potentially unsafe about the above code, please let me know what that is.

    ReplyReply
  6. 12 November 2009Arnold Daniels says:

    I should also mention, that there trigger will slow down insert and update queries. So I’ve you’re doing a lot of writes, don’t use this. Since the original table still contains the same data, this won’t affect the speed of read (select) queries.

    ReplyReply
  7. 13 November 2009Artur Ejsmont says:

    Hi there,
    Very nice article with a lot of details.

    I used to work with similar solution on Postgres and it was really working well.

    The coolest thing we had in our system was that versioning and history were completly separated from application and kept in db layer. The only thing application could use is to manipulate the CURRENT_VIEW_TIME value. Then views on all the tables would pick rows from real table or from history table so that you were transparently accessing data as it was exactly on CURRENT_VIEW_TIME.

    We had begin and end time in every history row to make this calculation easier.

    Suprisyngly it was working well and not really so slow either. It was the collest thing ever! :- )

    But i agree once you add triggers and lots of them forget about replication … well maybe row-level-replication would work for you.

    Very nice article! thanks

    ReplyReply
  8. 15 November 2009Jens Schauder says:

    When collecting history of data one should carefully think about the expected use cases. For example, the presented approach of a version per table becomes a nightmare when you try to find a consistent set of data spanning multple tables.

    It is also less then optimal when trying to analyze what kind of changes happen on the database.

    Another question often asked is: what happend at a certain time in the system, which probably would be better answered by some kind of audit trail.

    So before you implement an approach like this, I’d recommend thinking about some different options.

    I wrote a german article about this some time ago: http://blog.schauderhaft.de/2008/09/14/versionierte-vs-historisierte-objekte/

    ReplyReply
  9. 16 November 2009Arnold Daniels says:

    Jens: This article shows the basics of this method. I do have a solution for revisioning data across multiple tables. I’ll will discuss that in a follow article which I’ll write this week. That solution also has its limits, so I agree that this solution is not universally applicable. Again, more on that later.

    This method is not mutually exclusive with an audit trail. I don’t think it is usually worth to save a diff of the change. The alternative is save a copy of each record, which is what this revisioning system does. The audit trail table can than simply look like

    CREATE TABLE `audit_trail` (
      `table` varchar(255) NOT NULL,
      `id` int(10) unsigned NOT NULL,
      `revision` bigint(20) unsigned DEFAULT NULL,
      `user_id` int(10) unsigned DEFAULT NULL,
      `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
      KEY `id` (`table`,`id`,`revision`),
      KEY `revision` (`table`,`revision`),
      KEY `user_id` (`user_id`)
      KEY `timestamp` (`timestamp`),
    )
    

    This could replace the history tables. Do note that your forced in using surrogate keys for all revisioned tables.

    You should still make a periodic backups, even when using this method. Those backups are basically snapshots.

    As you already state in your article, versioned objects highly complicate the logic of the application layer. I don’t like that, since that is usually the place that is already complex, shouldn’t be over abstracted and where most bugs appear.

    ReplyReply
  10. 20 November 2009Log Buffer says:

    “From Arnold Daniels comes a version of versioning MySQL data, which Arnold introduces thus: [...]”

    Log Buffer #170

    ReplyReply
  11. 21 November 2009Baron says:

    Arnold, the problem is not conceptual. The problem is specifically with MySQL’s triggers and binary logging (replication). It is not occasionally going to fail, it’s basically just going to fail and your slave will end up with different data, or duplicate key errors, or you won’t be able to do point-in-time recovery with binary logs. See http://www.mysqlperformanceblog.com/2008/09/29/why-audit-logging-with-triggers-in-mysql-is-bad-for-replication/

    ReplyReply
  12. 26 November 2009Arnold Daniels says:

    I understand, it’s a race condition. Also actions in the trigger are not send to the slaves, instead they invoke their own trigger. I’ll address in my next post.

    ReplyReply
  13. 9 October 2010Hari K T says:

    Nice . Thanks for sharing . It also helped me to know more about trigger.

    ReplyReply
  14. 13 October 2011Brian says:

    Right now its not really possible to version your data in a RDBMS. You are kind of shoe horning a versioning system on your data model here, because in an RDMS you can only have one version of your data at a time. The best alternative you can have is an LVM snapshot, or some other mountable copy of your data at some point int time. the larger issue you have is that you really cannot separate data from how you interpret it. What every field means and how it used is important as it forms intent and information at that point in time, which is not getting covered in your system. Perhaps if you tied your revision number to the repository tag?

    ReplyReply
  15. 13 October 2011Arnold Daniels says:

    Hi Brian,

    Thanks for commenting :)

    MySQL (and other RDBMS) don’t support versioning natively, which is a pain. This means that you have to do it yourself.

    If we look at the most basic versioning system, it saves a copy of a file (or in this case record) whenever the file is changed. Since there can only be one version of a file on a (standard) file system, a copy needs to be saved elsewhere. We’re doing the same with MySQL (so no shoe horning here).

    Using LVM is a great way to get a snapshot of the complete database at a certain time. However, this is very much a sysadmin tool. You can’t (easily) use it to show a user (the manager of a company for instance) the history of a record and allow him to roll back changes.

    I strongly consider reusing a field for a different purpose to be a bad practise. Away from versioning, let’s say you have an old piece of code that you’ve forgotten about (or a college has written way back), reverencing this field, than it will reek havoc when updating the data incorrectly.

    Instead you should always create a new column (or rename the old one), when giving it a new purpose.

    The revisioning table should have all columns ever used, even if they no longer exist in the main table. At times that the field did not exists value will simply be NULL.

    Note that this is only a solution for versioning the data and not for versioning the db structure. You should do that separately and treat the revisioning tables like any other table.

    ReplyReply

Write a comment: