Maybe you’ve heard people talking about ditching their SQL Servers and other RDBMS entirely. There is a movement out in the software development world called NoSQL.
“Insanity!” you may cry, “for where will people put their data if not in a database? Flat files? Tell me we aren’t going back to flat files.”
No, but in the relational model, something does have to give. The NoSQL movement is about re-evaluating the constraints and scalability of data storage systems in light of the way modern web applications generate and consume data.
The outcry about flat files above is meant to highlight an assumption developers often have about building data-driven applications: data goes in the database (SQL Server, Oracle, or MySQL). Just maybe, if we are really cutting-edge, we might consider storing our data in the cloud, but the choices generally stop there.
The NoSQL movement asks the question:
“Is the relational database (RDBMS) always the right tool for data storage and data access?”
Starting from an RDBMS is virtually an axiom of software development. However, those of us who are excited about NoSQL believe that relational databases are not always the answer. I think this highlights one of the reasons this NoSQL thing is called a movement. People are realizing they have a choice where they thought they had none.
The converse is, of course, also true: NoSQL databases are not always the right choice either. If you look carefully, however, you will find that they are a good choice much of the time. Don’t take my word for it. Ask Facebook, Twitter, Digg, SourceForge, WebEx, Reddit, and a bunch of other companies that are using NoSQL databases.
This move towards NoSQL is driven by pressure from two angles in the web application world:
- Ease-of-use and deployment
- Performance - especially when there are many writers as compared to the number of readers (think Twitter or Facebook).
Choosing NoSQL for Ease-of-Use and Deployment
I cover the programming model in detail and introduce the actual database server below. As a bit of motivation, here is a quick look at how you define the data model and maintain it.
- Define your classes in C# (largely) without regard to putting them in a database. Related classes? Easy - one has a collection of the others.
- Create a simple DataContext-like class which exposes each top-level type that is to be stored in the database. This is only a few lines of code per collection (think of this as a table).
- Interact with the database using LINQ. This creates the collections (think tables), sets the schema, etc.
- Maintain the database and evolve it by maintaining your classes from step 1. *
Why, in the name of all that is right, do we have to model our system twice? Once in the database and once, in parallel, in code? With NoSQL, you have one place to do that - in your C# classes.
*You may have to run a transformation tool if you’re making radical data changes, but that’s true in SQL systems as well.
Choosing NoSQL for Performance
When the number of concurrent clients using your application, and thus your database, is reasonably small (let’s say 500 users as a baseline), an RDBMS can work great. But what if that number grows? And if you are writing a web app, you definitely want that number to grow. At 50,000 users, can you still run on a single instance of SQL Server or MySQL? How powerful does your hardware have to be to handle that? What about at 500,000 or 5,000,000 users? Still good?
I’m sure there are some of you out there thinking, “Wait a minute now! There are plenty of systems with tons of users built upon relational databases.”
It’s true, there are. But how much expensive hardware and software do these require? How easy is it to leverage *commodity* hardware and free software? A basic SQL Server cluster might run you $100,000 just to get it up and running on decent hardware. Rather than leveraging crazy scaling-up options, the NoSQL databases let you scale out. They make this possible (dare I say easy?) by dropping the relational aspects of a database. Some NoSQL systems such as MongoDB get even better scalability by loosening some of the durability guarantees, which they backfill somewhat with redundancy (more on MongoDB shortly).
“Ok, ok. So it’s cheaper and simpler,” you say. “How much faster than the finely tuned system that is SQL Server 2008 can these open source NoSQL systems be?”
The answer is: MUCH MUCH FASTER. Here’s a simple comparison of running a bunch of concurrent inserts into SQL Server 2008 and MongoDB on the same computer.
Under heavy load, I’d say it’s about 100 times faster. I’m sure there is going to be tons of second-guessing of this graph and so on. Hold your comments please! I’ll be posting a full performance comparison with source code soon. Let me just say that I think the comparison was fair, and I’ll back that up in a later post.
NoSQL and a New Programming Model
If we do not have joins and primary / foreign key relationships, how do we associate related data? In NoSQL, there is a way to mimic foreign keys for certain relationships. However, the main answer is that you do not disassociate your data in the first place.
I’m sure that you’ve all heard of the object-relational impedance mismatch. A large part of that mismatch comes from the fact that we normalize the data in our database to the extreme and then use joins to reassemble that data. Not only does that cause this so-called impedance mismatch, but those joins can be really slow and they can be the death of any scale-out solution. The key to many of the NoSQL databases’ scalability is that they do not use joins. You simply save large swaths of your data as a single blob (which, in MongoDB’s case, is still deeply queryable).
Shortly we’ll look at an example where we build out a disconnected, offline RSS reader that uses MongoDB and LINQ to store its data. But just think about how you might structure your data storage if you could save entire object graphs and still query them? Your "row" might be a Blog object which has an array of BlogEntries which contain the entry text, link, date, etc. Then your *entire* query to pull all the details of a single blog would hit a single “table” in the database.
That might look like this query which has one result:
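As a minimal sketch of such a query, the block below runs LINQ over an in-memory list that stands in for the MongoDB collection. The property names (Name, Url, Title, Link, Published), the sample data, and the BlogQueries helper are all assumptions made for illustration; Entries follows the blog.Entries collection mentioned later in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical document shapes; property names are assumptions, not the original listing.
public class RssEntry
{
    public string Title { get; set; }
    public string Link { get; set; }
    public DateTime Published { get; set; }
}

public class Blog
{
    public string Name { get; set; }
    public string Url { get; set; }
    // The entries live inside the blog document itself, so no join is needed.
    public List<RssEntry> Entries { get; set; } = new List<RssEntry>();
}

public static class BlogQueries
{
    // In-memory stand-in for a "Blogs" collection in the database.
    public static List<Blog> SampleBlogs()
    {
        return new List<Blog>
        {
            new Blog
            {
                Name = "Example Blog",
                Url = "http://example.com/rss",
                Entries =
                {
                    new RssEntry { Title = "First post", Link = "http://example.com/1", Published = new DateTime(2010, 4, 1) },
                    new RssEntry { Title = "Second post", Link = "http://example.com/2", Published = new DateTime(2010, 4, 8) }
                }
            },
            new Blog
            {
                Name = "Other Blog",
                Url = "http://example.org/rss",
                Entries =
                {
                    new RssEntry { Title = "Hello", Link = "http://example.org/1", Published = new DateTime(2010, 3, 15) }
                }
            }
        };
    }

    // One query against one "table": the matching blog comes back with its entries attached.
    // Single() throws unless exactly one blog matches.
    public static Blog FindByName(IEnumerable<Blog> blogs, string name)
    {
        return (from b in blogs
                where b.Name == name
                select b).Single();
    }
}
```

Calling BlogQueries.FindByName(BlogQueries.SampleBlogs(), "Example Blog") returns the whole object graph in one step. Against a real database, the same query shape would run through the object mapper's LINQ provider instead of LINQ to Objects.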
There are no joins or anything like that because you’re saving objects, not columns, and those objects contain their collections already (e.g. RssEntries). There is an important distinction to make here. These NoSQL databases generally are *not* the same as object databases. They are what are known as document databases. There’s actually a big difference between the two.
The NoSQL database we are using in this example is MongoDB. This is a free, open-source database which runs on Windows, Linux, and Mac OS X systems. You can access it from many platforms including .NET, Ruby, Java, PHP, and so on.
We’ll be using .NET and C# of course. You have several options when choosing how to access MongoDB from .NET, but generally that means using LINQ and a light-weight object-mapper on top of MongoDB itself. Note that common terminology might categorize the object mapper that moves objects into and out of the database as an ORM. While that’s OK, there is technically no "R" in this ORM because MongoDB is not relational. Hence I’m calling it simply an Object-Mapper (OM).
If you want to learn more about MongoDB you should listen to these Podcast interviews:
Michael Dirolf also has a great book in the works. You can catch a preview of it on Safari Books Online. Here’s the Amazon page:
NoSQL in Action
Let’s write some code. The first step typically in a data-driven application is to spec out the database. Then we’d use LINQ to SQL or Entity Framework to generate the ORM classes. MongoDB is different. MongoDB has no schema or rather its schema is flexible and defined via usage rather than being predefined in the database. So our first step is to define the classes we’d be storing in the DB via NoRM.
We’re going to define 3 classes: Blog, RssEntry, and RssDetail. The Blog object will contain a collection of RssEntry objects. In practice you might just go with the Blog and RssEntry classes. But I wanted to model both the embedded case (Blog + RssEntry) and the loosely defined foreign-key-style relationships that mimic joins (RssEntry + RssDetail). That way we can demonstrate both use cases.
Here’s a taste of the Blog class:
Notice that it contains a collection (List<T> really) of RssEntry objects. That’s the relationship supported by nesting. The Blog class just has this collection as part of its data model.
The RssEntry class has the summary info for a blog entry:
And the larger data is stored in the RssDetail class (for example the text of the post):
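A minimal sketch of how these three classes might fit together. Only the class names and the Entries collection come from the text. Every other property (Id, Name, Title, Link, Published, DetailId, Text) and the DetailLookup helper are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Large per-entry data, stored in its own collection (assumed shape).
public class RssDetail
{
    public Guid Id { get; set; }
    public string Text { get; set; }
}

// Summary info for a blog entry, embedded inside its Blog document.
public class RssEntry
{
    public string Title { get; set; }
    public string Link { get; set; }
    public DateTime Published { get; set; }
    // Foreign-key-style reference: holds the Id of an RssDetail stored elsewhere.
    public Guid DetailId { get; set; }
}

// Top-level document; the nested list is the relationship supported by nesting.
public class Blog
{
    public Guid Id { get; set; }
    public string Name { get; set; }
    public List<RssEntry> Entries { get; set; } = new List<RssEntry>();
}

public static class DetailLookup
{
    // Mimics a join with a second lookup by Id; returns null when no detail matches.
    public static RssDetail DetailFor(RssEntry entry, IEnumerable<RssDetail> details)
    {
        return details.FirstOrDefault(d => d.Id == entry.DetailId);
    }
}
```

The embedded Entries list needs no extra query, while the RssDetail reference costs a second lookup. The trade-off is that the Blog document stays small even when the post text is large.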
Let’s see how we insert an entire set of Blog data into the database. We begin by generating the objects (Blog, RssEntry, etc) in memory and then serializing them via NoRM to MongoDB much as you would in LINQ to SQL. The difference is this will actually generate the collections (analogous to tables) if they don’t already exist and it will define the implicit schema to match our objects:
Here we are using a class called RssDataContext which we wrote manually. It is very similar to what LINQ to SQL and Entity Framework use to do the object-relational mapping. Want to do a query? Do you know LINQ? Well then you’re all set:
How do you add a new entry to an existing blog and update it in the database?
We leverage the fact that the blog.Entries collection is a List and just add to it. Then save will update the record in the DB.
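To make the add-then-save flow concrete, here is a minimal sketch that uses a hypothetical in-memory store in place of the database. InMemoryBlogStore, its Save method, and AddEntryExample are stand-ins invented for this illustration. They are not the NoRM API or the RssDataContext class from the demo app, and the property names are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class RssEntry
{
    public string Title { get; set; }
    public string Link { get; set; }
}

public class Blog
{
    public Guid Id { get; set; }
    public string Name { get; set; }
    public List<RssEntry> Entries { get; set; } = new List<RssEntry>();
}

// Hypothetical in-memory stand-in for a data-context class; not the NoRM API.
public class InMemoryBlogStore
{
    private readonly Dictionary<Guid, Blog> _blogs = new Dictionary<Guid, Blog>();

    // Exposed as IQueryable so callers can use LINQ, as they would against a real provider.
    public IQueryable<Blog> Blogs
    {
        get { return _blogs.Values.AsQueryable(); }
    }

    // Insert-or-replace keyed on Id: the whole document, entries included, is written at once.
    public void Save(Blog blog)
    {
        _blogs[blog.Id] = blog;
    }
}

public static class AddEntryExample
{
    // Load the blog, append to its embedded list, and save the document back.
    public static Blog AddEntry(InMemoryBlogStore store, string blogName, RssEntry entry)
    {
        Blog blog = store.Blogs.Single(b => b.Name == blogName);
        blog.Entries.Add(entry);
        store.Save(blog);
        return blog;
    }
}
```

Because the entries are part of the Blog document, there is no separate insert into an entries table. Saving the blog again is the update.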
All this works great and is highly performant. But do be careful, as not all LINQ operations are fully implemented in NoRM yet and some (like join) may never be added because MongoDB doesn’t support joins.
To get started, download MongoDB (the tools and server) here:
You unzip the zip file and run the mongod.exe program. Be sure that you have created the C:\data\db folder. It appears at first that you have to run MongoDB in a console window. But you can register it as a Windows Service:
Here’s some helpful advice on installing MongoDB as a Windows Service (there is a small bug you have to work around):
There’s also a management console (and I mean "console"):
For a project I’m working on I’ve built a Windows Forms UI that lets me manage the database easily by just adding an object data source and doing some drag-drop magic in Visual Studio. Generally I look down upon that sort of development, but for an admin tool it’s just fine.
Now It’s Your Turn!
Try it out for yourself. Download MongoDB and the NoRM driver and build some apps. You may also want to check out the source code for my demo app:
Here are some other blogs on this subject.
Michael Kennedy is an instructor for DevelopMentor where he specializes in core .NET technologies as well as agile and TDD development methodologies. Keep up with Michael via his Web site and blog at