The Age Of Smart Data

I have read that the modern dishwasher was invented in 1886 by a woman who wanted to prevent her servants from chipping and breaking her china. Today, the dishwasher is a labor-saving device. It is common for us not to see the implications of an invention or of a new technique.

For example, the Java programming language was not invented with the Internet in mind. Even when it did come to the web, it was presented as a language for creating applets that run in the browser. In the early days, nobody imagined J2EE etc.

The same is true about object databases. It is common to find posts that confuse some of the new product offerings with each other. An object persistence framework is NOT a database. Objects are more than data. Objects contain data, but they also contain methods. An object that has been persisted into an object store can tell you about itself in a multiplicity of ways. Data that has been retrieved from a relational database is static – it is just information. An object that has been retrieved is smart data. Or, it should be – that is my way of thinking.

I have been thinking about this for a while. I love relational databases, but the model breaks down in places. For example, breaking an entity into mutually exclusive parts called fields that have a fixed data type is not accurate. A song has lyrics, and it has music. A song has a chorus, and a bridge. The music is composed of notes, which are arranged in bars which depend upon time signatures. Can I use a relational database to store a song? I can, but the result is unsatisfactory.

However, if I store my song as an object, it can contain a method to retrieve the lyrics. Perhaps my song object can contain lyrics in different languages. My object could contain a method to transpose the music into a different key, or to retrieve just the chorus of the song. The song object could also contain links to versions and interpretations of itself stored as ogg files elsewhere on the Net.
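To make the idea concrete, here is a minimal sketch of such a song object in Java. Everything in it is an assumption made for illustration: the class name `Song`, the methods `lyricsIn`, `chorusIn` and `transpose`, and the simplification that a melody is just a list of pitch names without octaves or durations. It is not db4o's API, only the kind of class one might hand to an object store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical "smart data" class; none of these names come from db4o.
final class Song {
    private static final List<String> PITCHES = List.of(
            "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B");

    private final String title;
    private final List<String> melody;         // pitch names without octaves
    private final Map<String, String> lyrics;  // language code -> full lyrics
    private final Map<String, String> chorus;  // language code -> chorus only

    Song(String title, List<String> melody,
         Map<String, String> lyrics, Map<String, String> chorus) {
        for (String pitch : melody) {
            if (!PITCHES.contains(pitch)) {
                throw new IllegalArgumentException("Unknown pitch: " + pitch);
            }
        }
        this.title = title;
        this.melody = List.copyOf(melody);
        this.lyrics = Map.copyOf(lyrics);
        this.chorus = Map.copyOf(chorus);
    }

    String title() { return title; }

    List<String> melody() { return melody; }

    // Returns the lyrics in the requested language, or null if none are stored.
    String lyricsIn(String language) { return lyrics.get(language); }

    // Returns just the chorus in the requested language, or null if none is stored.
    String chorusIn(String language) { return chorus.get(language); }

    // Returns a new Song with every pitch shifted by the given number of semitones.
    Song transpose(int semitones) {
        List<String> shifted = new ArrayList<>();
        for (String pitch : melody) {
            int index = Math.floorMod(PITCHES.indexOf(pitch) + semitones, PITCHES.size());
            shifted.add(PITCHES.get(index));
        }
        return new Song(title, shifted, lyrics, chorus);
    }
}
```

With a class like this, the caller asks the song for what it needs, for example `song.transpose(2).melody()` or `song.chorusIn("en")`, instead of reassembling a song from rows.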

Storing a song in multiple languages presents interesting problems. The lyrics would have to be broken down into syllables, and the syllables would have to be associated with notes so that the user of this information could sing each word at the appropriate pitch and for the correct duration. There is a reason we do not use relational databases to do these things. It is too hard – we do it in code. However, we do not store the code in the database. We use the code to access the database, but it has no intrinsic relationship to the data. Data SHOULD be associated with and be retrieved with tools that make the information useful.
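The syllable-to-note association could be modeled roughly as follows. This is only a sketch under simplifying assumptions: the type names `Note`, `SungSyllable` and `Aligner` are invented for illustration, each syllable gets exactly one note, and rests or syllables stretched over several notes are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only.
final class Note {
    final String pitch;  // e.g. "G4"
    final double beats;  // duration in beats

    Note(String pitch, double beats) {
        this.pitch = pitch;
        this.beats = beats;
    }
}

final class SungSyllable {
    final String text;
    final Note note;

    SungSyllable(String text, Note note) {
        this.text = text;
        this.note = note;
    }
}

final class Aligner {
    // Pairs each syllable with the note at the same position in the melody.
    // A translation fits the melody only if it has one syllable per note.
    static List<SungSyllable> align(List<String> syllables, List<Note> melody) {
        if (syllables.size() != melody.size()) {
            throw new IllegalArgumentException(
                    syllables.size() + " syllables do not fit " + melody.size() + " notes");
        }
        List<SungSyllable> line = new ArrayList<>();
        for (int i = 0; i < syllables.size(); i++) {
            line.add(new SungSyllable(syllables.get(i), melody.get(i)));
        }
        return line;
    }

    // Adds up the note durations of a sung line.
    static double totalBeats(List<SungSyllable> line) {
        double total = 0;
        for (SungSyllable syllable : line) {
            total += syllable.note.beats;
        }
        return total;
    }
}
```

The length check in `align` is where a translation into another language would succeed or fail: the same melody can carry any lyrics that break into the right number of syllables.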

If you are using db4o, or another object store, just to store and retrieve data, that is boring. Consider the following few paragraphs carefully. I have a dream!

I remember attending a demo of the Informix database product during which datablade technology was discussed. Basically, Informix provides the means to build datablades, special extensions that allow you to store, retrieve and use data in special ways. For example, there is a Geodetic datablade that can be used to retrieve and use geospatial data. In other words, data is stored in a relational database, and a special extension can be used to interact with it.

If we go back to my music example, it would be possible to store a song in a relational database using a music datablade that performs all of the functions I described. I find that idea compelling.

However, if we use object stores to store those same songs as objects, then they become “smart data”. Being able to search for and retrieve an object is just the beginning. Then we can use introspection to ask the object to tell us about itself. In fact, we could even design standard interfaces to let a data object tell us about itself. (I have a title, I am composed of chapters that also have titles, but they can also be identified by numbers.)
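One possible shape for such a standard interface is sketched below. The names `SelfDescribing` and `Item` are hypothetical, and the description format is arbitrary; the point is only that any stored object implementing the interface can report its title and its numbered parts without the caller knowing its concrete type.

```java
import java.util.List;

// Hypothetical interface: any stored object could implement it to describe itself.
interface SelfDescribing {
    String kind();                 // e.g. "book" or "chapter"
    String title();
    List<SelfDescribing> parts();  // empty if the object has no parts

    // Builds a one-line description; parts are identified by number and by title.
    default String describe() {
        StringBuilder sb = new StringBuilder(kind() + " \"" + title() + "\"");
        List<SelfDescribing> parts = parts();
        for (int i = 0; i < parts.size(); i++) {
            sb.append(i == 0 ? ": " : "; ")
              .append(i + 1).append(". ")
              .append(parts.get(i).describe());
        }
        return sb.toString();
    }
}

// Minimal illustrative implementation used for both books and chapters.
final class Item implements SelfDescribing {
    private final String kind;
    private final String title;
    private final List<SelfDescribing> parts;

    Item(String kind, String title, List<SelfDescribing> parts) {
        this.kind = kind;
        this.title = title;
        this.parts = List.copyOf(parts);
    }

    @Override public String kind() { return kind; }
    @Override public String title() { return title; }
    @Override public List<SelfDescribing> parts() { return parts; }
}
```

A book built from two chapter items would then answer `describe()` with its own title followed by its chapters, each numbered and titled.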

I can even imagine a new class of products: special data storage classes that provide special methods for adding data to an object, for retrieving data from an object and for specifying business rules. I could be wrong, but this is the future I see. I do not see the mere substitution of object persistence stores for the relational database.

As long as I am imagining things, imagine a data object that had a method that exposed metadata to help the object plug itself into relational databases – basically, an object could make itself searchable by MySQL, or by Postgres. In this case, the RDBMS would be used to create the relationships between objects and to store metadata to make them searchable. However, the point of the exercise would not be the data in the RDBMS. The point of the exercise would be the object the data helps us retrieve – the object would come with functionality. Basically, the data would be smart.

I have another hobby-horse that I hope to write about in the future, but it relates to “smart data”. We store information in word files, spreadsheets and databases, but this hardly makes them useful. I need Microsoft Word, or Open Office, to access the word file. I need special software to access a spreadsheet. It seems that format comes first, and the information comes second. Let’s face this: an essay is an essay no matter how it is stored. Docbook is one of many ways we can view and use an essay. As long as we put the format first and the entity itself last, we are misusing the technologies we have built. We fail to achieve the vision.

Imagine an essay object that knows how to expose itself as an Open Office document, or as xhtml, or as a pdf file. At all times, it remains an essay independent of its format. That is the future. That is the vision of the semantic web. What the semantic web has been missing is an underlying technology – a concrete implementation. I have a special fondness for db4o because I am expecting great things from this type of technology – there are other products out there, but use this one because you can play for free and there is a community.
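As a rough illustration, an essay object might carry its own renderers. The sketch below is hypothetical throughout (`Essay`, `asPlainText` and `asXhtml` are invented names) and covers only plain text and an xhtml fragment, because Open Office and pdf output would need external libraries that I do not want to guess at.

```java
import java.util.List;

// Hypothetical essay object: the content is stored once, the formats are derived.
final class Essay {
    private final String title;
    private final List<String> paragraphs;

    Essay(String title, List<String> paragraphs) {
        this.title = title;
        this.paragraphs = List.copyOf(paragraphs);
    }

    // Renders the essay as plain text, with blank lines between paragraphs.
    String asPlainText() {
        return title + "\n\n" + String.join("\n\n", paragraphs);
    }

    // Renders the essay as an xhtml fragment (not a complete document).
    String asXhtml() {
        StringBuilder sb = new StringBuilder("<h1>" + escape(title) + "</h1>");
        for (String paragraph : paragraphs) {
            sb.append("<p>").append(escape(paragraph)).append("</p>");
        }
        return sb.toString();
    }

    // Escapes the characters that are significant in markup.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The essay itself never changes; each method is just one more view of it, and a new format means one more method rather than a new file.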

Why are we still writing code to retrieve a recordset, and then writing more code to turn the recordset into a table? Store the data such that it knows how to serve itself up as a table and be done with it! If a book were stored by means of an object store, we should have the tools to do the following:

  1. Tell the object store to serve the book as a collection of standard-length pages marked up as xhtml.
  2. Expose a set of REST-style URLs to access the resource by chapter and page, and expose a table of contents.
  3. Use the REST-style URLs to access the resource.
  4. Expose an interface to the resource that can be used by web developers to present the resource to people and services on the web.
  5. The developer should not care that the book is stored in an Oracle database, or as an Open Office document – it’s a book for crying out loud! My four year old knows what a book is! Why doesn’t my computer?
  6. Apply an external stylesheet, or retrieve one from the object store itself.
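
The first few items on that list can be sketched in a few lines of Java. All of it is assumption: the class names `Chapter` and `BookResource`, the URL scheme `/books/{id}/chapters/{n}/pages/{m}`, and the idea that a "standard length" page is a fixed number of words. There is no real object store or web server here, and the markup is not escaped, so treat it as a thought experiment rather than an implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical chapter that knows how to split itself into pages.
final class Chapter {
    final String title;
    final List<String> words;

    Chapter(String title, String text) {
        this.title = title;
        this.words = List.of(text.trim().split("\\s+"));
    }

    // Splits the chapter into pages of at most wordsPerPage words each.
    List<String> pages(int wordsPerPage) {
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < words.size(); i += wordsPerPage) {
            int end = Math.min(i + wordsPerPage, words.size());
            pages.add(String.join(" ", words.subList(i, end)));
        }
        return pages;
    }
}

// Hypothetical book resource addressed by REST-style paths.
final class BookResource {
    private final String id;
    private final List<Chapter> chapters;
    private final int wordsPerPage;

    BookResource(String id, List<Chapter> chapters, int wordsPerPage) {
        this.id = id;
        this.chapters = List.copyOf(chapters);
        this.wordsPerPage = wordsPerPage;
    }

    // Lists one "path title" entry per chapter.
    List<String> tableOfContents() {
        List<String> toc = new ArrayList<>();
        for (int c = 0; c < chapters.size(); c++) {
            toc.add("/books/" + id + "/chapters/" + (c + 1) + " " + chapters.get(c).title);
        }
        return toc;
    }

    // Resolves /books/{id}/chapters/{n}/pages/{m} to an xhtml fragment for that page.
    String get(String path) {
        String[] seg = path.split("/");
        if (seg.length != 7 || !seg[1].equals("books") || !seg[2].equals(id)
                || !seg[3].equals("chapters") || !seg[5].equals("pages")) {
            throw new IllegalArgumentException("Unknown resource: " + path);
        }
        Chapter chapter = chapters.get(Integer.parseInt(seg[4]) - 1);
        String page = chapter.pages(wordsPerPage).get(Integer.parseInt(seg[6]) - 1);
        return "<div class=\"page\"><h2>" + chapter.title + "</h2><p>" + page + "</p></div>";
    }
}
```

A web developer would only ever see the paths and the fragments they return; whether the chapters live in db4o, Oracle or a file is invisible from the outside, which is the whole point.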

Let’s think outside of the box, people! There is so much technology, but there is so little vision. (If you have vision, and you are reading this, please know that I do not mean you.) Discussions about the speed of db4o are interesting. LINQ is interesting. But, this represents nothing more than an improved means to an unimproved end. The age of smart data has arrived. You have nothing to lose but your chains.


