It’s Halloween season and visions of tombstones and cobwebs are all around us. It’s a season for costumes, pumpkins, haunted houses and hayrides, as well as a time for seasonally tinged and pun-inducing blog posts.
In other words, reader beware…
I have been haunted of late by troubling and spooky thoughts. A fiendish beast haunts me in my sleep. It’s not Dracula, The Werewolf, Frankenstein’s Monster or The Mummy. The answer, dear reader, is something more sinister, insidious and seemingly omnipresent.
I am speaking of course about The Blob…
The Blob is described by Wikipedia as “an independently made 1958 American horror/science-fiction film that depicts a growing amoeba-like alien that came from outer space and terrorizes the small community of Downingtown, Pennsylvania.”
However, what Wikipedia fails to mention is something far more frightening and difficult to eradicate … the blob data type.
I’ve come across this on more than one occasion, and each time I’m reminded of its dangers. Nicely formed and organized pieces of data, ready to be used in all kinds of creative ways, suddenly getting sucked into a blob data type, never to be seen again.
In a recent interview, I was asked about key-value stores, and thanks to an inexplicable brain-freeze I missed an opportunity (which I will rectify here) to share my very strong opinions on some of the technical implications of viewing data exclusively through this lens. This was a bit surprising, given how many times I’m asked similar questions, and how reliably each one gets me on my soapbox about opaque data storage patterns and the use of blob data types.
Calling a blob a data type is something of an oxymoron, in that it’s really the absence of a data type. When you blob something, you’re telling the database that, outside of whatever key you used to identify the blob, there is no other way to find out anything about what’s inside it, at least not using the database itself. Putting data into a blob tells the database that “anything goes,” essentially giving it complete license to punt on visibility. All of the grunt work of figuring out what’s inside the blob gets pushed to whatever application you write on top of the database, which devolves the database into a dumbed-down key-value map, an enormous step backward in productivity and data quality.
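To make that concrete, here is a minimal sketch (in Python, with a plain dict standing in for a key-value store; the names and data are hypothetical). Once values are opaque blobs, the only operation the “database” supports is lookup by key, and every other question forces the application to scan and deserialize everything itself:

```python
import json

# Stand-in for a key-value store that treats values as opaque blobs.
store = {}

def put(key, record):
    # The "database" sees only bytes; all structure is invisible to it.
    store[key] = json.dumps(record).encode("utf-8")

put("user:1", {"name": "Alice", "city": "Downingtown"})
put("user:2", {"name": "Bob", "city": "Philadelphia"})

# Lookup by key is the only question the store can answer:
alice = json.loads(store["user:1"])

# Any other question pushes the grunt work up into application code:
# scan every blob, deserialize it, and filter by hand.
matches = [json.loads(v) for v in store.values()
           if json.loads(v).get("city") == "Downingtown"]
print(matches)  # [{'name': 'Alice', 'city': 'Downingtown'}]
```

The scan-and-deserialize loop is the point: it runs in application code, over every record, on every query, because the database was told nothing about what the bytes mean.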
Now, as bad as that is, there is something even worse. It’s one thing to stuff data “eyes open” into a blob inside an otherwise strongly typed database such as an RDBMS (i.e., knowingly declaring blobdom on a chunk of data). It’s quite another to think that you have some visibility into your data, only to find out that the database hasn’t done much with the data type and has effectively blobbed your data anyway.
I’ll give you an example (from when I didn’t miss the opportunity to soapbox).
A few weeks ago, a developer at a customer (at the time still a prospect) told me about a “performance test” he had run with “another document-oriented” database and how blazingly fast it was. When he ran the “same test” with MarkLogic, he found the results to be slower and was disappointed. To simplify the discussion, I momentarily put aside the following facts: a) the other database wasn’t actually guaranteeing that it had saved/committed all of the data, b) the other database didn’t provide any transactional consistency for silly things like point-in-time recovery or deterministic commit and rollback, and c) his test didn’t represent anything resembling a realistic scaling model, such as hundreds of nodes with transactional consistency and high-performance simultaneous reads and writes.
I left all of that out for simplicity’s sake. What I didn’t put aside, however, was the discussion around visibility and the dangerous blob lurking in his test.
“So how did you index the data?”
“I just created a single key for lookup.”
“OK, so just a single unique ID to look up a blob of data?”
“Yes but it’s not a blob, it’s a JSON object.”
“OK, is it set up so that you can search on any of the fields in the JSON object, with as many different combinations of search criteria as you’d like in a single query?”
(Note: I didn’t even bother to ask about going deeper, like being able to index all of the words and phrases inside long text fields for free-text search. I kept it simple.)
“No but this was just a simple test for a simple application.”
“So there’s no other way to look up the data?”
“Not for this example, no. But I can add an additional key or two later on if I need. I’d have to write multiple queries inside my code to leverage the other keys but it’s possible to do as long as I write code to combine the results. Now the additional keys might slow things down a bit…”
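What “add a key or two and combine the results in code” actually looks like is worth spelling out. Here is a hedged sketch of that pattern (hypothetical names; a dict stands in for the store): every extra “key” is a secondary map the application must update on every write and intersect by hand on every query:

```python
import json

store = {}        # primary: doc id -> blob
by_city = {}      # hand-rolled secondary "key"
by_status = {}    # another one, also maintained in app code

def put(doc_id, record):
    store[doc_id] = json.dumps(record)
    # Every write must now update every extra key by hand;
    # forgetting one silently breaks lookups.
    by_city.setdefault(record["city"], set()).add(doc_id)
    by_status.setdefault(record["status"], set()).add(doc_id)

put("o1", {"city": "Downingtown", "status": "open"})
put("o2", {"city": "Downingtown", "status": "closed"})
put("o3", {"city": "Chester", "status": "open"})

# Combining two criteria means running two lookups and
# intersecting the result sets in application code:
ids = by_city["Downingtown"] & by_status["open"]
print([json.loads(store[i]) for i in sorted(ids)])
```

Two criteria already require two maps and a set intersection; each new searchable field multiplies the bookkeeping, which is exactly the “write code to combine the results” overhead the developer was describing.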
We then had a little bit of a discussion about CRUD-based business applications in general, and I mentioned how in my experience I had never built or worked on an application that relied on the user to memorize a huge collection of unique object identifiers in order to find their data. I was a bit more diplomatic (this was a blob-intervention, after all), but I did continue to make the point about the importance of proactively indexing things and the foundational value it adds for end users. Sure, there’s overhead in doing that, but the payoff is huge in terms of application productivity, data quality, and user enablement. It’s how databases have been operating for four decades or so: designed for ad-hoc search and retrieval of data, not simply storing data for no good reason. Call me old-fashioned, I guess…
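The contrast with proactive indexing can be shown with a small, generic sketch using SQLite (an illustration of the principle, not a claim about any particular product; the table and index names are made up). Declare the structure and the index once, and the database, not the application, answers ad-hoc questions:

```python
import sqlite3

# With a typed, indexed store, the database does the grunt work.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.execute("CREATE INDEX idx_users_city ON users (city)")
con.executemany("INSERT INTO users (name, city) VALUES (?, ?)",
                [("Alice", "Downingtown"), ("Bob", "Philadelphia")])

# Any combination of criteria is one declarative query:
# no hand-written merge code, no extra maps to maintain.
rows = con.execute(
    "SELECT name FROM users WHERE city = ? ORDER BY name",
    ("Downingtown",)).fetchall()
print(rows)  # [('Alice',)]
```

Adding another searchable field is one `CREATE INDEX` statement, and combining criteria is just another `WHERE` clause; that is the overhead-for-payoff trade described above.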
Like the victims of the namesake horror movie, developers can easily get sucked in by the blob. They’re tricked into a false sense of control over their data by not having the database worry about silly things like high-performance search. “Look how fast the data loads,” they think to themselves, as the data, un-indexed, gets sucked into the blob, never to be seen again. And the developers, flailing away in vain inside their application code, watch as productivity, visibility, governance, and data quality get swallowed up right behind it.
A tragic sight indeed.
In the movie, the blob is (seemingly) defeated by being put into a deep freeze by what can only be described as a CO2-fire-extinguisher-wielding flash mob (hey, they were ahead of their time, I guess…). Of course, that was only after the protagonist (teenager Steve Andrews, played by Steve McQueen) finally convinced the townspeople of the real danger of the blob. For much of the film, no one believed him.
What’s really troubling about today’s tech blob, however, is that many newer database entrants have embraced this notion of having the database do less with data, particularly on ingestion, under the guise of performance and scalability. It’s as if the blob has become the new gold standard. It’s fool’s gold, of course, given the price paid in overall system performance, including simultaneous search and ingestion/update, not to mention the extra code to manage. And note that I didn’t say read-write: a simple lookup by a key is not a search, and almost never translates to a search operation from a business perspective.
So what should developers and application owners do?
Well, if you like to solve technical problems by writing as many lines of code as possible, then by all means blob to your heart’s content. However, if you’re interested in productivity and expect more from your database than simple key-value lookup, then take a long, hard look at your model and indexing schemes (e.g., if you’re in the RDBMS world), or, if you want blob-like flexibility without the horror show, take a look at the Enterprise NoSQL product known as MarkLogic.
No fire extinguishers necessary.