Should We Do Something With Core Field SQL Storage?

Drupal field was part of the Drupal core since version 7. The Field extends her ability to build different kinds of systems. Since it is basic units of each entity, it is one of the most important parts of the open source software. But, when it comes to the efficiency of using SQL storage engine, the field can still do better with efficiency. I sincerely believe that we may not afford to ignore it. Let put it under a microscope had a close look at field SQL storage.

Case study:

I had built a patient scheduling system for a couple clinic offices. The project itself is not complicated. I have attached the patient profile picture on this article. We built a patient profile node type on the form. It is not a complicated form, but there are over 40 fields. It is not difficult to set up a nice patient profile node form. I also created appointment node type that connected patient profile and doctor profile with entity reference fields. Using views with exposed filter for the various reports.

It was the project where I find the issue. I am a little bit uncomfortable after I take a close look at the database. Each field has two almost identical tables. I think fields took too much unnecessary database space. I have dumped one of the fields database information to explain my concern.

1) Base table: field_data_field_initial

+----------------------+------------------+------+-----+---------+-------+
| Field                | Type             | Null | Key | Default | Extra |
+----------------------+------------------+------+-----+---------+-------+
| entity_type          | varchar(128)     | NO   | PRI |         |       |
| bundle               | varchar(128)     | NO   | MUL |         |       |
| deleted              | tinyint(4)       | NO   | PRI | 0       |       |
| entity_id            | int(10) unsigned | NO   | PRI | NULL    |       |
| revision_id          | int(10) unsigned | YES  | MUL | NULL    |       |
| language             | varchar(32)      | NO   | PRI |         |       |
| delta                | int(10) unsigned | NO   | PRI | NULL    |       |
| field_initial_value  | varchar(255)     | YES  |     | NULL    |       |
| field_initial_format | varchar(255)     | YES  | MUL | NULL    |       |
+----------------------+------------------+------+-----+---------+-------+

Base table SQL script:

CREATE TABLE `field_data_field_initial` (
`entity_type` varchar(128) NOT NULL DEFAULT '',
`bundle` varchar(128) NOT NULL DEFAULT '',
`deleted` tinyint(4) NOT NULL DEFAULT '0',
`entity_id` int(10) unsigned NOT NULL,
`revision_id` int(10) unsigned DEFAULT NULL,
`language` varchar(32) NOT NULL DEFAULT '',
`delta` int(10) unsigned NOT NULL,
`field_initial_value` varchar(255) DEFAULT NULL,
`field_initial_format` varchar(255) DEFAULT NULL,
PRIMARY KEY (`entity_type`,`entity_id`,`deleted`,`delta`,`language`),
KEY `entity_type` (`entity_type`),
KEY `bundle` (`bundle`),
KEY `deleted` (`deleted`),
KEY `entity_id` (`entity_id`),
KEY `revision_id` (`revision_id`),
KEY `language` (`language`),
KEY `field_initial_format` (`field_initial_format`)

2) Revision table: field_revision_field_initial

+----------------------+------------------+------+-----+---------+-------+
| Field                | Type             | Null | Key | Default | Extra |
+----------------------+------------------+------+-----+---------+-------+
| entity_type          | varchar(128)     | NO   | PRI |         |       |
| bundle               | varchar(128)     | NO   | MUL |         |       |
| deleted              | tinyint(4)       | NO   | PRI | 0       |       |
| entity_id            | int(10) unsigned | NO   | PRI | NULL    |       |
| revision_id          | int(10) unsigned | NO   | PRI | NULL    |       |
| language             | varchar(32)      | NO   | PRI |         |       |
| delta                | int(10) unsigned | NO   | PRI | NULL    |       |
| field_initial_value  | varchar(255)     | YES  |     | NULL    |       |
| field_initial_format | varchar(255)     | YES  | MUL | NULL    |       |
+----------------------+------------------+------+-----+---------+-------+

Revision table SQL script:

CREATE TABLE `field_revision_field_initial` (
  `entity_type` varchar(128) NOT NULL DEFAULT '',
  `bundle` varchar(128) NOT NULL DEFAULT '',
  `deleted` tinyint(4) NOT NULL DEFAULT '0',
  `entity_id` int(10) unsigned NOT NULL,
  `revision_id` int(10) unsigned NOT NULL,
  `language` varchar(32) NOT NULL DEFAULT '',
  `delta` int(10) unsigned NOT NULL,
  `field_initial_value` varchar(255) DEFAULT NULL,
  `field_initial_format` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`entity_type`,`entity_id`,`revision_id`,`deleted`,`delta`,`language`),
  KEY `entity_type` (`entity_type`),
  KEY `bundle` (`bundle`),
  KEY `deleted` (`deleted`),
  KEY `entity_id` (`entity_id`),
  KEY `revision_id` (`revision_id`),
  KEY `language` (`language`),
  KEY `field_initial_format` (`field_initial_format`)

Here are my concerns.

1) Normalization.

Here is one of the fields' data record.

+-------------+--------+---------+-----------+-------------+----------+-------+---------------------+----------------------+
| entity_type | bundle | deleted | entity_id | revision_id | language | delta | field_initial_value | field_initial_format |
+-------------+--------+---------+-----------+-------------+----------+-------+---------------------+----------------------+
| node        | patient_profile      |       0 |      1497 |        1497 | und      |     0 | w                   | plain_text        |
+-------------+--------+---------+-----------+-------------+----------+-------+---------------------+----------------------+

We have value "W" in the Initial field. One character took 51 bytes for storage that had not included index yet. It took another 51 byte in the revision table and more for index. In this case here, only less than two percents of space are used for real data the initial 'W', and over 98% of space is for other purposes.

For the sake of space, I think we should not use varchar for entity_type, bundle, language, field_format column. Use small int, tiny int or intÎÎ that will only take one to four bytes. The field is a basic unit of a Drupal website. A medium website can hold millions of fields. Saved one byte is equal to multiple megabytes in precious MySQL database.

2) Too complicated primary key

Each field table has a complicated primary key. Base table use `entity_type`, `entity_id`, `deleted`, `delta`, `language` as primary key. Revision table use `entity_type`, `entity_id`, `revision_id`, `deleted`, `delta`, `language` as primary key. "In InnoDB, having a long PRIMARY KEY wastes a lot of disk space because its value must be stored with every secondary index record."ÎÎÎ. It may be worthy to add an auto increasing int as a primary key.

3) Not needed field column

I found bundle type column is not necessary. We can have the system running well without bundle type column. In my clinic project, I named the node type "patient profile". The machine name patient_profile appears in each field record's bundle type column. As varchar (255), it uses 16 bytes for each table record. Let do a quick calculation. if there are 100, 000 nodes and each node have 40 fields, 100,000 x 40 x 2 x 16 = 122MB are taken for this column. Or at least, we use 2 bytes small int that will take only one-eighth of the spaces.

4) Just use revision table.

Remove one of the field's data tables. It may need a little bit more query power to get field data, but it save time when we insert, update and delete field's data. By doing so, we maintain one less table per field, edit content faster. It helps to bring better editor experience and to save on database storage space.

A contributed module field_sql_leanÎÎ addressed some of the concerns here. It still needs a lot of work on itself and if we want other contributed modules compatible with it. After all, it changed the field table structure.

Reference:

1: http://dev.mysql.com/doc/refman/5.1/en/integer-types.html
2: http://dev.mysql.com/doc/refman/5.0/en/innodb-tuning.html
3: Field SQL storage lean solution
4: Patient profile form:medical form

Comments

Great you pointed this out. Current SQL field storage is so over engineered that is not suitable for big/complex systems. We realized this a long time ago and have resorted to storing data in entity properties or attached document storage (XML/JSON). The problem of this approach is that you don't get much of the out of the box field management functionality, but who cares, on a complex application you are not probably relying on much of the site builder oriented automation that comes with Drupal. BTW: You didn't mention that there's a module out there "Field SQL no revisions", that does a good job at keeping Field SQL storage compatible with contrib and core, and reduces insert/update queries by half by not storing revision data.

Thank David pointing it out. Field SQL no revisions is a great module. The field SQL lean module that I had built borrowed some ideas from it.

#3 - yes, I agree. Especially when you think about it form D8 POV where you can no longer share fields between bundles. In D7 you rarely shared them anyway. #4 - I've been saying this for years and still, D8 ships with 4 tables just for the entity!!!!!!

I for one tend to share fields between bundles all the time... It makes working with views (showing more than one bundle type) and view modes loads easier. I cringe a little whenever I see a site that has fields called page_image, article_image, etc.

D8 can share fields between bundles. Just not between entity types.

Yeah, you are right. My bad.

Doing away with revision tables is already being considered here https://www.drupal.org/node/2083451

"Remove one of the field's data tables. It may need a little bit more query power to get field data, but it save time when we insert, update and delete field's data." In all sites I have seen so far it was never the insert performance that was a concern, but the query performance. Also e.g. with a use case like CPS module and with field data for fields being cached, but not for revisions (to save cache space), the partition makes sense to split revision and data tables. "A contributed module field_sql_leanÎÎ addressed some of the concerns here. It still needs a lot of work on itself and if we want other contributed modules compatible with it. After all, it changed the field table structure." Contributed modules should never ever rely on the database structure itself, but use EntityFieldQuery instead. File a bug in modules that make field_sql_lean break. So field_sql_lean should just work, as well as pushing entities to XML storage or MongoDB or anything else. That is why the field storage is flexible and pluggable so that you can use field_sql_lean or whatever if core's structure does not suit your needs. However Core's field_storage works well for many applications and is optimized for huge build outs and usually you have other problems than database size in high performance sites and are rather adding more indexes for different use cases than removing indexes due to space concerns.

Revision tables shouldn't exist unless you explicitly want to use revisioning systems to manage data. If you have a site with a hundred fields or more, which is easy to attain on larger sites, you have a TON of space being used by revision tables that in most cases are rarely used.

The problem is that the whole thing is focused on decoupling fields from entities in order to provide revisionability + multilanguage + multi value. But the truth is that in 99% of use cases revisionability is not used, in 75% multilanguage is not used, and in 50% multivalue is not used. So performance and simplicity are being compromised for the sake of serving those use cases. What can we do about it? Maybe the storage backend could be smart enough to adapt to those 3 things, but that would mean that the storage representation will vary depending on the field settings, and having non homogeneous storage will be a mess. And even with the current design, it could be done better, I mean, just take a look at a field table and the Primary Key size. nvarchar(128) + nvarchar(128) + nvarchar(32) + 3 integers.

Add new comment

Target Image