An alternative way of EAV modelling

by Arnold Daniels on 07/29/2008

I was reading this month’s php|architect. It has a nice article about EAV modeling. I had seen this db structure in other project, but didn’t know that it was called EAV. For those who don’t read php|architect, EAV describes a method of saving a large set of attributes, only some of which apply to an individual entity. Normally you would create a table, with a row for each entity and save each attribute in a column. With EAV you save each attribute as a row.

This makes selecting the data quite tedious. If you can life with some constraints, there is an easier way to do this.

The common way

How php|architect describes it (and how it’s normally described) is to split out the values over different tables for the different data types. The DB structure proposed in php|architect look like:

CREATE TABLE field_names (
    fid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    field_name VARCHAR(50) NOT NULL DEFAULT '',
    field_type ENUM('VARCHAR', 'INTEGER', 'DOUBLE',
        'DATE', 'TEXT') NOT NULL DEFAULT 'VARCHAR',
    UNIQUE KEY (field_name)
);
CREATE TABLE varchar_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value VARCHAR(255) NOT NULL DEFAULT '',
    UNIQUE KEY (value)
);
CREATE TABLE integer_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value INT(11) NOT NULL DEFAULT 0,
    UNIQUE KEY (value)
);
CREATE TABLE double_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value DOUBLE NOT NULL DEFAULT 0,
    UNIQUE KEY (value)
);
CREATE TABLE date_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value DATE NOT NULL DEFAULT '0000-00-00',
    UNIQUE KEY (value)
);
CREATE TABLE text_values (
    vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    value TEXT NOT NULL DEFAULT '',
    UNIQUE KEY (value(100))
);

A small notice might be that using a unique key on the first 100 chars of a text field isn’t wise. You might have 2 texts where the first 100 chars are the same causing trouble.

The alternative way

The suggested previous makes selecting the data quite tedious. If you can life with some constraints, there is an easier way to do this. Data types can be split in 2 categories: scalars and arrays. The difference is very clear when you look at a language like C. A scaler is x bytes of memory containing 1 value. An array is a x*y bytes of data, where x is the number of bytes needed for 1 value and y is the number of values. In this definition a string is an array of characters.

Normally you define the data type in MySQL. When adding data the right number the right number of bytes is allocated by the storage engine. When the field is used in ‘WHERE field=value’, the value in the SQL string so it cast to the byte value. It’s fantastic that this is done outside the user’s scope, but here it’s part of the reason that the data can’t be saved in a single field. We want the way that the data is seen to be based on per-row meta-data. That’s simply not how SQL works.

However, if we don’t mind some pre- and post-processing, we can find a way around this. Let’s see if we can find a data type, to which all the different scalar types can be transformed. It’s important to remember that the data type isn’t only used to display the data, but it also determined the way it’s ordered, indexed and reacts to operator like ‘>’. Since we don’t want to get into issues with this, we’ll choose a signed type. The representation of integers is fairly strait forward, so it seems like a good candidate.

Dates
Dates are often saved as integers using unix time. It represents the number of seconds since 1970. With an unsigned integer this can run up to 2038.
This can work fine for appointments, but will not work for something like the date of birth. However for birthdays, the time part isn’t important. We can repeat this trick, but instead of counting seconds we count the days since ’0000-00-00′.
For time we also count seconds running up to 86399 for 23:59:59.

Numbers
For floating points the point is decided for each value. However this is probably not necessary, since the scope of the values for a field should be pretty much the same. Since we’re deciding the data type per attribute type (the field table), we can make it a fixed point where the position of the point is decided by in the field table. Both fixed point and integer are numbers, with the only difference that an integer has got precision 0.

Text
The php|architect article distinguishes between text and varchar fields. Free text fields for names, etc are saved in varchar fields. Data that would be represented as selection box is saved in the same table. We will make a different division. In a normal db setup, values that aren’t free text would be saved in a second table and the ids would be stored in the original table. This will work fine with the integer set-up, so let’s stick with that.
All free text values will be saved in a TEXT field, which has a variable size. Searching on a text field is extremely slow. However by placing an index (of 255 chars) on the field, the index will always be used unless wildcards are used. This makes searching on this field as fast as it would be with a varchar field. Instead of using wildcards we could use fulltext indexing. More on this choice later.

The DB structure looks like:

CREATE TABLE `field` (
  `fid` int(10) unsigned NOT NULL auto_increment,
  `field_name` varchar(50) NOT NULL default '',
  `field_type` enum('NUMBER','DATE','TIME','ENUM','TEXT') NOT NULL default 'NUMBER',
  `precision` tinyint(4) NOT NULL default '0' COMMENT 'To make fixed point. 10^x',
  `date_format` VARCHAR (50) NULL, 
  PRIMARY KEY  (`fid`),
  UNIQUE KEY `field_name` (`field_name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `field_option` (
  `fid` int(10) unsigned NOT NULL,
  `value` int(11) NOT NULL,
  `description` varchar(255) NOT NULL,
  PRIMARY KEY  (`fid`,`value`),
  KEY `description` (`description`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `value` (
  `item_id` int(10) unsigned NOT NULL default '0',
  `fid` int(10) unsigned NOT NULL default '0',
  `value` int(11) NOT NULL default '0',
  `text` text,
  PRIMARY KEY  (`item_id`,`fid`),
  KEY `value` (`fid`,`value`),
  KEY `text` (`text`(255))
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Selecting data

This setup makes it easier to look through the data, though you need to convert filter values to integers in your script and place those integers in the query. The integer values aren’t good for displaying though. You can do this in the script, however it’s nicer to get the correct data from the DB.

By using a stored procedure we can cast the value into a string which can be displayed. This isn’t terribly fast, but for a single item, it’s surely fast enough.

delimiter |
CREATE FUNCTION `value_display` (`type` enum('NUMBER', 'ENUM', 'DATE', 'TIME', 'TEXT'), `value` INT, `option` VARCHAR(255), `text` TEXT, `precision` INT, `date_format` VARCHAR(50)) RETURNS VARCHAR(255) CHARACTER SET latin1 NO SQL
BEGIN
  CASE type
    WHEN 'NUMBER' THEN RETURN `value` / POW(10, `precision`);
    WHEN 'ENUM' THEN RETURN `option`;
    WHEN 'DATE' THEN RETURN DATE_FORMAT(FROM_DAYS(`value`), `date_format`);
    WHEN 'TIME' THEN RETURN FROM_UNIXTIME(`value`, `date_format`);
    WHEN 'TEXT' THEN RETURN `text`;
    ELSE RETURN NULL;
  END CASE;

  RETURN NULL;
END;
|

Other thoughts

You might have noticed that I’ve choosen InnoDB instead of my regular choice of MyISAM. I expect that the `value` table will be updated a lot. MyISAM only has table locking, which can causes delays. InnoDB doesn’t support fulltext indexes, but I use sphinx anyway, which is faster and more flexible.

Multi value fields, grouping and ranges are discussing in the follow-up article ‘EAV multi-value fields’.

Arnold Daniels

I've spend a big part of my life behind a computer, learning about databases (MySQL), programming (PHP) and system administration (Linux). Currently I playing with HTML5, jquery and node.js.

E-mailTwitterLinkedInGithubGittip

There are 41 comments in this article:

  1. 29 July 2008Thomas Koch says:

    Hi Arnold,

    we’re using eZPublish in our company, which also uses a kind of EAV to store content. This has advantages in flexibility, but also disadvantages in speed and maintainability.
    Therefor I developed a concept for an alternative content store: The main idea is to have one table for each class of content, but to allow the administrator to automatically create, modify and delete those tables when the content classes change.

    I’d be glad, if you’d like to comment on this idea:
    http://www.koch.ro/blog/index.php?/archives/81-My-content-storage-component-project.html
    http://www.koch.ro/blog/index.php?/archives/75-Changing-Database-Schemas.html

    ReplyReply
  2. 29 July 2008Roland Bouman says:

    Hi!

    Just wondering, for what kind of problem are you considering to use EAV? I have used it in the past, but was thinking that it would perhaps make more sense to allow users to extend the database schema in a controlled manner.

    ReplyReply
  3. 29 July 2008Antonin Faltynek says:

    [1] That’s not so easy. When your table of attribute values has millions of rows, every table alter will take some time for which are any updates and inserts are stalled.

    It is better to solve some situations with EAV. It is pretty easy to define new attribute or some existing attribute remove.

    ReplyReply
  4. 29 July 2008p says:

    I don’t think there’s a need to point you to one of the many articles describing the drawbacks of EAV …
    Would you care to share the query for a report on that table structure? Say it contains sales orders, a report giving the number and avg value of sales order per month, customer, district and order status.

    ReplyReply
  5. 29 July 2008rudy says:

    yes, please, let’s see a real-world query against this modified EAV scheme

    storing data isn’t the problem is it — it’s getting it in and getting it back out that’s the challenge (especially doing it without table scans)

    please do not take offence at this, but…

    … “you can put lipstick on a pig, but it’s still a pig”

    ReplyReply
  6. 29 July 2008Arnold Daniels says:

    Hi All,

    I agree with all of you, don’t use EAV unless you really really really need it. I’ve used it for a project that has a db with 100.000 construction products divided in about 500 different categories. Besides the properties the products have in common, a product has a set of properties which are depended on the category. There are about 4000 different properties.

    We’ve looked at the possibility of creating one table for each categories and saving the properties in a serialised string. This solution however turned out to very difficult to implement especially concerning searching.

    Unlike what the php|architect says, I would not store common fields like the name in the values table. I create a table with the common fields for all items (persons, products, etc) in 1 table. This tables is joined to `values` with `item_id`. However I think the example of patients is well chosen. There are hundreds of thousands of illnesses, all with different properties. I would not be a good idea to make a table with millions of fields or make a db with a table for each illness.

    A comment to Thomas. I completely agree with your vision. Often I ask why have you put that in a separate table instead of using an enum, adding a column, etc. You’ll get the answer, because we want to be able to change that. The DB structure is not set in stone. You can do an ‘ALTER TABLE” query, just as easy as doing a “INSERT INTO’.

    ReplyReply
  7. 29 July 2008Mark says:

    The idea of EAV is good, but it shouldn’t be used for ALL modeled attributes of a domain. For certain things, like last_modified_on, and user_id, and group_id, title, etc. these should be modeled traditionally in one database row.

    EAV is really good for letting the end-user customize things in unexpected ways. I feel that the best way to use them is as “extra” attributes that you can’t know about at runtime. Too much EAV in the core of a data domain makes it flabby and undefined.

    ReplyReply
  8. 29 July 2008Antonin Faltynek says:

    Let say we have some kind of evidence of customers. Each customer can be of different type. According to its type it can have different attributes.

    With EAV you can easily create application that lets users do define its own types of customers and all attributes of that customer type without changing DB schema.

    You have table of customers, of their types, table of attribute definitions and table of all values of all attributes for all customers and sure some association table between customer type and attribute definition.

    I know the customer for which I want to retrieve its data (and this is indexed foreign key -> no table scan), I probably know what attributes I want (again foreign key -> no table scans). I mention data retrieval lower ;) .

    Scheme mentioned here is not ideal for sorting, but when you use common scheme (separate tables/columns for each data type) it is not problem too.

    On application level you can create object lazy init in nicer way.

    Another point to discuss is amount of data, but with, let say, adult databases (PostgreSQL, Oracle,…) you can use horizontal partitioning based on indexes. MySQL could not do this in some easy way, you have to use partitioning on application level (it doesn’t matter if it is some code in DB and views or code on client side).

    Again, I’m do not say that EAV should or could be used everywhere. It is good solution when your database is ONLY storage for your application, which covers EAV lacks.

    Searching itself isn’t so big issue because you know what you are looking for (attribute type and its value).

    The really big problem comes when you need retrieve several attributes per row, then you have to use joins. Again you are using indexed keys for joins -> no table scans, but there are many joins -> large temporary space for data that can be saved in normal (read not EAV) schema. This is the real issue of EAV and you have to know how big issue it is for your application.

    I think we all know that EAV is something special with its limits, that can kill your application if you don’t know what you need and what EAV means, but, still, EAV has its place.

    ReplyReply
  9. 29 July 2008Antonin Faltynek says:

    There are faster commenters :-D

    ReplyReply
  10. 29 July 2008Mark says:

    After reading the alternative method more thouroughly, I don’t really understand why the values table is so limiting (also, “values” is a keyword, can we avoid using it?)

    Why should we try to throw floats into an int, and then concatenate and divide by 10 or `precision` whenever we need to access the field as a float? This would make things like MAX(), SUM() AVG() either impossible or piss-poor slow.

    What’s so wrong with just making the values table have
    1 column of each type, and make it take NULLs. Proper supporting of NULLs would keep the table small and not produce any wasted overhead.

    CREATE TABLE `value` (
    `item_id` int(10) UNSIGNED NOT NULL DEFAULT ’0′,
    `fid` int(10) UNSIGNED NOT NULL DEFAULT ’0′,
    `value_int` int(11) SIGNED NULL,
    `value_float` int(13,4) SIGNED NULL,
    `value_char` varchar(255) NULL,
    `value_text` NULL,
    PRIMARY KEY (`item_id`,`fid`),
    KEY `value` (`fid`,`value`),
    KEY `text` (`text`(100))m
    KEY `value_int`… etc
    ) ENGINE=MyISAM DEFAULT CHARSET=ustf8;

    BTW, I don’t see why you’re knocking the 100 char index on the TEXT field, then you go and add a 255 sized index. How is that any different, other than reducing the chance of two records’ indexes having the same value? I’m also not sure why you think that would be a problem if two records shared the same first 100 or 255 characters…

    ReplyReply
  11. 29 July 2008Antonin Faltynek says:

    [6] Arnold, doing ALTER TABLE can be much more expensive then INSERT INTO ;)

    ReplyReply
  12. 29 July 2008Antonin Faltynek says:

    [10] Mark, I agree with you. The alternative solution is not good also for sorting.

    ReplyReply
  13. 29 July 2008Arnold Daniels says:

    [11] Antonin,

    Sure ALTER TABLE is slower, but in most of these cases a query of 50 to 100ms is just fine.

    —–

    [10] Mark,

    About the index on the text field: I’m not knocking the 100 char index, but I’m knocking the fact that it’s a UNIQUE index. It shouldn’t be unique.
    I don’t understand the reason of having both a value_char and a value_text field. With an index on the text field it will behave the same.

    I forgot to take out the concat. It would slow it down and should not be part of this example. I’ve removed that part.
    Doing `value` / pow(10, `field`.`precision`) isn’t that slow. Using the function is fairly slow, but the number of items I’m displaying is never really big.

    My performance concern is on the where part of selection. With the setup you’ve suggested the indexes won’t work to well.

    Let’s say we to get all items made of wood (5) and with a width > 10.50:

    SELECT `item`.*, value_display(`field_type`, `field_option`.`description`, `value_int`, `value_float`, `value_char`, `value_text`, `field`.`date_display`) FROM `item` INNER JOIN `value` ON `item`.`id` = `value`.`item_id` INNER JON `field` ON `value`.`fid` = `field`.`fid` LEFT JOIN `field_option` ON `value`.`fid` = `field_option`.`fid` AND `value`.`value_int` = `field_option`.`value` WHERE (`value`.`fid` = 5 AND `value`.`value_int` = 1) OR (`value`.`fid`=2012 AND `value_float` > 10.50) GROUP BY `item`.`id` HAVING COUNT(*)=2

    I would have to test this to be sure, but I think this would only use the ‘value` index and would use a table scan to filter the widths. This is might be acceptable if the number of items which are made of wood are small, but it isn’t ideal.
    You could get around this by making the value index contain all value fields and changing the where statement to something like:

    WHERE (`value`.`fid` = 5 AND `value`.`value_int` = 1) OR (`value`.`fid`=2012 AND `value_int` IS NULL AND `value_float` > 10.50)

    In all of the 3 cases you need to know the type and other data of the specific field (as in record of the `field` table) to either choose the right field or to cast the value. I don’t see any real disadvantages of casting it and using an int.

    Sorting isn’t a problem it all. The position of the decimal point is not necessary for the sorting. If you know the field type you’re sorting on the implementation isn’t that different between how Mark suggested and how I suggested.

    SELECT `item`.* FROM `item` LEFT JOIN `value` ON `item`.`id` = `value`.`item_id` AND `value`.`fid`=21 ORDER BY `value`.`value`

    I think that casting your number to an int will give you better performance and easier queries. I might be wrong though and Mark’s concept might be better, only benchmarking can tell. Both solutions however seem to be better than the one suggested in php|architect.

    ReplyReply
  14. 30 July 2008Antonin Faltynek says:

    [13] Sorting of numbers that are saved as varchar IS problem.

    create table numbers (number int, string varchar(32));

    Then perform following insert:

    insert into numbers values (20, "20"), (100, "100");

    But what abut sorting:

    select * from numbers order by number asc;
    +--------+--------+
    | number | string |
    +--------+--------+
    | 20 | 20 |
    | 100 | 100 |
    +--------+--------+
    2 rows in set (0.00 sec)

    mysql> select * from numbers order by string asc;
    +--------+--------+
    | number | string |
    +--------+--------+
    | 100 | 100 |
    | 20 | 20 |
    +--------+--------+
    2 rows in set (0.00 sec)

    I think that it is obvious where the problem is. Sorting on data return by some function WILL be slower.

    ReplyReply
  15. 30 July 2008Arnold Daniels says:

    [14] Antonin,

    Please re-read the article. No-one is suggesting saving numbers as varchars. Everything except text is saved as an INTEGER.

    Data does not have to be sorted using a function. Just sorting it on the `value` column, which is an INTEGER, will give you the correct order.

    ReplyReply
  16. 1 August 2008Quinton Parker says:

    Hi all,

    I’ve been in a situation whereby we considered EAV. One obvious problem mentioned by all is what happens when u have plenty of data in ur table. And u need to perform SELECTs with some serious clauses that would require casting fields on-the-fly. Regardless of having indexes any db server will struggle. EXPLAIN confirms it

    Back to the point. If u must go EAV i endorse using Mark’s approach above. A bit ugly and imperfect but its sane and will be lightning fast

    I recommend a sub-table approach. Whereby tables are split into one-to-one relationships yet still maintaining 3NF. Its the fastest most humanly readable method and the database server will love u for it

    See
    http://www.scribd.com/doc/2670985/SQL-Antipatterns somewhere around pg 36/37

    ReplyReply
  17. 1 August 2008Arnold Daniels says:

    Hi Quinton,

    First of all thanks for commenting on my blog and participating in what has become a bit of a discussion.

    Using the result of a function in a WHERE condition with a substantial amount of records, will surely be slow. I don’t think anyone will dispute that. However, I have not found any situation where I need to cast a field on-the-fly. If you do have a situation like that, this method will not work well.

    I only need to do select based on integers, which is quite fast. Converting the filter values into integers before putting them into an SQL statement is is not dependant of the amount of records and is unlikely to make any performance hit.

    Could you please give an example where you need to cast on-the-fly in a WHERE statement?

    I try to free up some time to benchmark the 3 approaches in order to give a conclusion.

    ReplyReply
  18. 1 August 2008Adam Machanic says:

    You’ve reinvented what is referred to as OTLT, or the One True Lookup Table.

    http://www.google.com/search?q=otlt

    ReplyReply
  19. 4 August 2008Arnold Daniels says:

    Hi Adam,

    It’s true it looks somewhat like the OTLT pattern, however it does not have most of the described drawbacks.

    http://decipherinfosys.wordpress.com/2007/02/01/otlt-one-true-lookup-table

    b.) Numeric values are saved as numbers and not as strings.
    c.) Strings are saved in a variable sized column. The index makes sure it’s fairly searchable.
    a.) Values are only joined to one table `field_option`.
    d.) Values are only joined using INTEGERS.

    It’s true though that you have to write complex and relatively slow queries to filter using the EAV table.

    The issue I have with the articles condemning these structures, is that they are comparing very simple DB schemas with EAV. In that case there is no reason to use EAV. This structure is useful when the alternative is using either hundreds of tables or thousands of fields. Wanting to make the DB more flexible is not a valid reason to choose EAV.

    ReplyReply
  20. 26 August 2008MT says:

    This is very much a topical debate for me, as I am facing a similar conundrum with our own database structure.

    Our application offers users the ability to create user defined fields. We have one client who has created 150+!!

    The way I originally tackled this was via a simple EAV mechanism like this:

    table UDF_DEFINITION
    id (integer) PK
    caption (varchar)
    datatype (varchar) constrained by domain to presets such as ‘Text’, ‘Date’, ‘Number’ etc.

    table UDF_VALUE
    id (integer) PK
    id_udfdefinition (integer) FK to UDF_DEFINITION table
    id_client (integer) FK to CLIENT table
    data (varchar) i.e. value

    So I could do something like this:

    SELECT uv.* FROM udf_value uv JOIN client c on c.id = uv.id_client

    This model was OK when we were returning ALL the data into a grid. However, it started to fall apart at the seams when we were asked to return paged search results based on search criteria.

    The problem here as Antonin says is with sorting and paging. The problems were 2 fold:

    1) Having my udf data all as VARCHAR meant that it is a pain in the *ass to order the data without having the correct data type.

    2) Because you cannot apply a sort until you’ve returned the whole results set due to the pivoted nature of the data. So this becomes a cross tab like performance hog on searches. A no no.

    I have rejected altering the schema on the fly because it seems inherently dangerous when you have multiple users connecting concurrently. Also if I have to create a table for each UDF then I’ll end up with 150+ joins in my SQL! Also, for me it’s risky. In my experience, if it can go wrong, it WILL go wrong. :)

    I have rejected using a non RDBMS storage type as all the rest of the app works off flat tables. So I refuse to go to that level of complexity for one feature.

    i am now leaning towards distributing the data into separate columns. The possibilities I see are:

    1) I then i can return the whole row and pick off my values in code using a switch by parsing the type for that definition id.
    2) I can keep a read only data varchar value updated via triggers so that i can still do a search on a plain text field. (I havent thought this one through fully so it may not work)

    Any comments or assistance welcome.

    ReplyReply
  21. 30 September 2008Travis Paxton says:

    I am interested in how you approach the storage of multiple value options and how to retrieve them using your database schema. I’m also interested in how you group your data into “rows” of information. Thanks!

    ReplyReply
  22. 7 February 2009Grant Czerepak says:

    SQL Databases are not the environment for creating an EAV database.

    A correctly functioning EAV database management system has to be built from the ground up.

    Check out http://www.lazysoft.com and the Sentences database

    Sentences provides the architecture, the schema, data, form and query tools you need to create an EAV database that is powerful, scalable, distributable.

    Relational databases are based on lattices not the scalable networks of EAV.

    ReplyReply
  23. 23 October 2009Ivan says:

    Hello, very interesting way of doing EAV modeling. I would like to know how you implemented ranges and multi value fields, maybe it’s time for another article?

    Thanks
    Ivan

    ReplyReply
  24. 23 October 2009Ivan says:

    Also, how do you create SELECT statement to bring back several fields at once?

    ReplyReply
  25. 29 October 2009Arnold Daniels says:

    Ivan: See EAV multi-value fields

    ReplyReply
  26. 29 October 2009Bill Karwin says:

    I commented on “EAV multi-value fields” but Arnold asked me to join the discussion here:

    EAV is a non-relational design and trying to make it fit in an SQL database is a fool’s errand.

    By using EAV, you’ve already sacrificed any benefit using an RDBMS might have given you, so why are you still using an RDBMS?

    If EAV really seems like what you need, you should use something like CouchDB, Apache Cassandra , Project Voldemort, or one of the other emerging key/value storage technology.

    I don’t mean to be cranky, but this is really one of those square-peg-in-a-round-hole situations.

    Arnold replied: “The issue with document oriented databases is that it is very difficult to get results based on referential info.”

    Yes, but neither can you support referential integrity constraints in EAV. A constraint on a column applies to all rows, not just some rows, where the fid is a certain value.

    Arnold: “I would not be a good idea to make a table with millions of fields or make a db with a table for each illness.”

    Yes, the seminal use of the EAV data model is in medical databases in the 1970′s, where a patient entity has thousands of boolean attributes, forming a sparse matrix of the illnesses and conditions that apply. So EAV was used so that the attributes could be stored as rows, instead of columns, because a table with thousands of columns was impractical.

    But the relational model for patients-to-conditions is simply a many-to-many relationship. Yes, we can store rows to represent the boolean association from a patient to a given condition. And we can store the possible conditions in rows too–in a lookup table also referenced by the many-to-many intersection table.

    CREATE TABLE HasCondition (
    patient_id INT REFERENCES Patients,
    condition_id INT REFERENCES Conditions,
    PRIMARY KEY (patient_id, condition_id)
    );

    Arnold: “Wanting to make the DB more flexible is not a valid reason to choose EAV.”

    This I agree with. There are several ways to solve that problem, while keeping the separation between data and metadata. These solutions therefore support other core relational features such as typing and referential integrity, without employing the “Inner Platform Effect” antipattern.

    If you really, *REALLY* need to support custom user-defined attributes, I’d recommend the “Serialized LOB” approach, encoding the custom attributes in XML or JSON or whatever, and stuffing them into a blob. You aren’t sacrificing any query power, because you’ve already given that up as soon as you consider EAV.

    ReplyReply
  27. 26 September 2010Jaimie Sirovich says:

    I know I’m late to the party here, but I found some of this discussion useful.

    I mostly agree with #26/Bill. Most of the time EAV is nonsense. Now I’m not saying we should all bow to the DBA. As programmers we know they’re all about making their lives easy and our lives hard with annoyingly rigid schemas, but —

    The Serialized LOB approach works rather well. MySQL has some XML primitives, and there’s always JSON, CSV, PHP SERIALIZED, etc. If your database has a way to map JSON or XML, it can search parts of a LOB in linear time without doing anything. That’s probably better than a bunch of joins if at least 1 other field is a native column. Evaluating time complexity on an EAV query is a crap shoot at best.

    From there, you can either set up some mechanism whereby columns are extracted to the same table, or take the Friend Feed approach of building external “index tables” if linear time isn’t making you happy.

    Doing everything in EAV, even obvious parts of the table that have no business being abstracted like name or birthday in a users table is dumb.

    EAV is the reason Magento is slow. Any improvement on EAV will still be slow. It doesn’t matter. EAV isn’t bright at all except in certain cases —

    In these cases, you can pivot it into a data warehouse, don’t need to do so quickly in realtime, or have materialized views in your DBMS.

    It’s really only good for stuff that can be queued for rebuild periodically. At that point you can also use something like Sphinx. Common cases where this works is faceted search, as is used in medical diagnosis databases.

    Even hybrid document stores like Mongo don’t do it this way. Mongo is more like the indexed Serialized LOB approach I noted above with Friendfeed-type indexing.

    Anyone considering EAV should seriously consider a standard schema + LOB.

    ReplyReply
  28. 26 September 2010Jaimie Sirovich says:

    Oh, and let’s not forget that most databases either have:

    1. A hard limit on the number of joins. MySQL has a limit of 51.
    2. A soft limit on the number of joins — a point at which the joins become terribly slow.

    This makes EAV limited to scenarios where the EAV data can be periodically pivoted out to a table for fast querying, and even then it’s frustrating, as you’ll need to break, for example, a 102 join query into 2 EAV queries then piece those together.

    At least it’s fast when you’re done, though.

    ReplyReply
  29. 15 November 2010Oleg says:

    I wonder how to overcome problems with using BLOB pattern as an alternative of EAV.
    Let’s assume that we could store all custom fields of the entity in one field as a string forexample in JSON something like tihis:
    {customField1: value1, customField2: value2, …,
    customFieldN: valueN}

    How to overcome the following problems:
    1. How to seach by seperate custom fields, for example, to find entities with the conditions custField1 = value1 AND customField2 = value2?
    2. How to mantain data integrity, for example, if we delete a custom field for the entity how to delete all values oif these custom fields in the entity.

    ReplyReply
  30. 15 November 2010Arnold Daniels says:

    Oleg Never do that! What you are proposing is a well know anti-pattern. You can only use this EAV solution for values that can be converted into integers.

    If you want to store data in a non-relational way (like JSON), look at NO-SQL solutions like Redis or MongoDB.

    ReplyReply
  31. 15 November 2010Oleg says:

    Yes, serialized LOB seems violate even 1NF.
    I don’t know why it’s recommended as an alternative to EAV.

    Most data is store in a usual relational scheme, EAV is going to be used only for custom defined attributes of the subscribers.

    Some suguest using fixed table pattern.
    http://blog.springsource.com/arjen/archives/2008/01/24/storing-custom-fields-in-the-database/
    It looks like we’ll need up to 30 custom fields attributes and using fixed table pattern will require to many indexes per table I suppose.

    ReplyReply
  32. 17 November 2010Oleg says:

    Hi, Arnold.
    Did you look at document-oriented databases like MongDb and CouchDb as an alternative of EAV?
    These products allow storing any number of attributes in the documents.
    Do you think such database could be useful in many cases or they’re still spefic products?
    Best regards, Oleg.

    ReplyReply
  33. 17 November 2010Arnold Daniels says:

    Oleg Yes, see my previous comment.

    I don’t really like CouchDB. It’s to abstract, so you end up writing a lot of code just getting the data from the DB. MongoDB is good for logging type of databases. In general I like Redis best for general purpose solutions http://code.google.com/p/redis/. The project doesn’t have as nice a site as Mongo or Couch, but the software is great ;) .

    ReplyReply
  34. 21 November 2010Oleg says:

    But it looks like Redis can be used only when all data can be in RAM, right?

    ReplyReply
  35. 21 November 2010Arnold Daniels says:

    Yes, this is because the whole of Redis works like an index. My DBs are a couple hundred MBs max, so this isn’t a problem for me. For logging solutions I never use redis.

    Even if you have a lot of data and little memory, loading everything isn’t necessarily a bad thing. Your server will simply store the least used data in swap space. Which is okay if that data is hardly used.

    Make sure you read the FAQ. It has a lot of info about this: http://code.google.com/p/redis/wiki/FAQ

    ReplyReply
  36. 9 December 2011top mistakes says:

    Your site is really cool to me and your topics are very relevant. I was browsing around and came across something you might find interesting. I was guilty of 3 of them with my sites. “99% of site owners are guilty of these five mistakes”. http://tinyurl.com/cwa3tj7 You will be suprised how simple they are to fix.

    ReplyReply
  37. 10 January 2012Jaimie Sirovich says:

    @Arnold, Oleg

    I don’t get your logic. Don’t use serialized fields because they’re an anti-pattern, but EAV is OK because it’s quasi-normalized … but STILL an anti-pattern.

    That logic is flawed. If you don’t need an index, and you’re not terribly interested in atomicity and/or can update the data in place using functions within a lock, it’s just dandy. If you embed JSON manipulation functions inside a RDBMS you’re fine. Same for CSV. I do it all the time. It’s fine.

    In fact, MariaDB has VIRTUAL COLUMNS that are based in a serialized BLOB approach with options to pull it out to a column if you want it indexed. If not, it’s left as a blob.

    Why does the concept of normalization/atomicity change when you go to MongoDB. MongoDB just supports only one table and uses an imperative language instead of a declarative one. Nothing changes. Using MongoDB doesn’t suddenly make not making values atomic correct. The question is whether you care. There are no Gods who will smite you. Just make sure you know what you’re doing.

    If you think otherwise, you’re just plain wrong.

    EAV is an awful idea invented by people that have no idea what Codd meant. It’s stupid slow and won’t scale.

    ReplyReply
  38. 11 January 2012Arnold Daniels says:

    @Jamie

    First of all, I don’t understand why you’re all worked up and flaming on a year old article, but I’ll respond anyway.

    The problem space is the following. Let’s say we’re a hardware warehouse. We have a lot of different products. They have some common properties like a price, but most properties that are specific to that type of hardware. For example a drill has different properties as a door, which is again different from a can of paint.

    Let’s say there are 1000 different kind of products, each product has got about 20 properties, which are not all unique. Still there might be something like 10.000 different properties. Creating a table with 10.000 columns is not feasible, not even it they’re virtual.

    If we just needed to store and display those properties, just serializing them would be a good solution. But in our case we want to filter on these properties.

    If this is the main part of your project and you have the opportunity to move to a document-oriented DB, I would strongly suggest you do so. However, if this is only a small part of your project, moving everything into a doc DB, might just not be worth it. RDBMSs do have some signification advantages.

    So, please take this as a given fact. We need to use an RDBMS and we need to store and filter those 10.000 different kind of properties.

    A lot of people will just create a table with a varchar property field and a varchar value field. While it might work, that will indeed be very slow. Another common way is to create a table or field per type, but this also has mayor drawbacks as discussed above.

    I’m giving a solution where we cast any worth filtering down to an integer. So the idea is basically Java vs C. In Java a variable knows it’s type, while in C a byte is just 8 bits and the meaning of those 8 bits can only be seen in context of the code. The same thing here, whether the integer represents a date, a decimal or a number, can only be seen in context of the code.

    So I agree that we’re abusing instead of using MySQL here. This is a very usable and relatively fast solution though. Much more that the other RDBMS alternatives.

    I hope this answers your question and leaves you satisfied :)

    ReplyReply
  39. 11 January 2012Jaimie Sirovich says:

    @Arnold,

    I don’t believe I was flaming (or even worked up), but you’re still pretty far gone and incorrect. You would probably do better by deconstructing those values into integers and throwing them into a fulltext index. Just use boolean mode and do +100 +51 +600, etc.

    The only thing that bothered me is that you’re recommending people don’t serialize things in a database like it was the word of God and he thinks EAV is better. It’s really not, and EAV indexes are barely indexes in the traditional sense. You’re wasting your time esp. if you don’t need an index.

    You could use Sphinx if you consider that a storage engine, or you could take the FF approach http://backchannel.org/blog/friendfeed-schemaless-mysql

    What you’re doing will not scale for faceted search. Stop recommending it to people :)

    ReplyReply
  40. 13 January 2012Arnold Daniels says:

    Using a full-text index for is another option. You should write an article explaining how you would do that.

    Have a look at the results below.

    mysql> show create table product_property_test;
    +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Table                 | Create Table                                                                                                                                                                                                                                                                                                                                                                                                            |
    +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | product_property_test | CREATE TABLE `product_property_test` (
      `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `product_id` int(10) unsigned NOT NULL,
      `property_id` int(10) unsigned NOT NULL,
      `value` int(11) NOT NULL,
      `text` text,
      PRIMARY KEY (`id`),
      KEY `product_id` (`product_id`,`property_id`),
      KEY `property_id` (`property_id`,`value`)
    ) ENGINE=InnoDB AUTO_INCREMENT=79187 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC |
    +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    mysql> SELECT COUNT(*) FROM `product_property_test`;
    +----------+
    | COUNT(*) |
    +----------+
    |    47340 |
    +----------+
    1 row in set (0.04 sec)
    
    mysql> SELECT product_id FROM `product_property_test` WHERE ( property_id =65854 AND value =5493 ) OR ( property_id =65854 AND value =5491 ) OR ( property_id =65863 AND value =5498 ) GROUP BY product_id HAVING COUNT(*) = 3;
    +------------+
    | product_id |
    +------------+
    |       9541 |
    |      46210 |
    |      74013 |
    |      88603 |
    |      88717 |
    |     209541 |
    |     246210 |
    |     274013 |
    |     288603 |
    |     288717 |
    |     409541 |
    |     446210 |
    |     474013 |
    |     488603 |
    |     488717 |
    |     609541 |
    |     646210 |
    |     674013 |
    |     688603 |
    |     688717 |
    |     809541 |
    |     846210 |
    |     874013 |
    |     888603 |
    |    1046210 |
    |    1246210 |
    |    1446210 |
    |    1646210 |
    +------------+
    28 rows in set (0.00 sec)
    
    mysql> EXPLAIN SELECT product_id FROM `product_property_test` WHERE ( property_id =65854 AND value =5493 ) OR ( property_id =65854 AND value =5491 ) OR ( property_id =65863 AND value =5498 ) GROUP BY product_id HAVING COUNT(*) = 3;
    +----+-------------+-----------------------+-------+---------------+-------------+---------+------+------+----------------------------------------------+
    | id | select_type | table                 | type  | possible_keys | key         | key_len | ref  | rows | Extra                                        |
    +----+-------------+-----------------------+-------+---------------+-------------+---------+------+------+----------------------------------------------+
    |  1 | SIMPLE      | product_property_test | range | property_id   | property_id | 8       | NULL |  124 | Using where; Using temporary; Using filesort |
    +----+-------------+-----------------------+-------+---------------+-------------+---------+------+------+----------------------------------------------+
    1 row in set (0.00 sec)
    

    So it’s finding 124 instances of one of the property/value combinations. It needs to create a temp table and do a filesort for the group by and having. Now if that value was a lot bigger, like > 10000, things would start to slow down. Also if you would filter on a large number of properties, the temp table will grow.

    Remember that there are a lot of different properties, so one of these properties is only used in a small subset of the product. If let’s say > 5% of the products has a specific property, we should just add a column to the product table.

    Now in reality you’d probably not just search using the properties, but filter on fields from the product table as well. This should also limit the number of matching property/value pairs found.

    SELECT product.* FROM product INNER JOIN product_property_test q ON product.id = q.product_id WHERE product.category_id = 200 AND ( ( property_id =65854 AND value =5493 ) OR ( property_id =65854 AND value =5491 ) OR ( property_id =65863 AND value =5498 ) ) GROUP BY product_id HAVING COUNT(*) = 3
    

    So in practise this will scale well up to a couple million products. If you have more than that you should look at a different solution, like combining Redis and MySQL.

    @Jaimie, I hope your questions are hereby answered. I agree that this isn’t the right way to go for all occasions, but it’s definitely useful for the situation as described above. If you want to discuss this further, please come with practical examples of where, why and how problems will arise.

    ReplyReply
  41. 30 December 2012Asad Kamran says:

    @Arnold Daniels:
    I came to this page from google, after a lot of search.
    I am in same situation you already handled, but not discussed in this article ie. multi value fields, grouping or ranges.
    In past i developed a classified site having different categories(almost 700, 11 top level, almost 150 2ndlevel, and then 3rd, 4th level.

    I analyzed and collected all the attributes needed by the products of all categories in 1 comment table CLASSIFIED(almost 100 attributes, 10-15 common , other are collection of not common).

    Now is demand to control the form attributes for each type of category separately and data can be single, or multivalued( select drop down, radio, checkbox).
    As you promise to help, so i am asking you to share your knowledge, so can be able to design a good database.
    I already posted a question here http://stackoverflow.com/questions/13913642/php-mysql-controlling-post-forms-fields with 1 answer, that i accepted, but not fully agreed.

    ReplyReply

Write a comment: