
DbReaderGenerator Library

23 Feb 2014 · CPOL · 17 min read
Generate DataReaders at run-time and get the best performance by avoiding repetitive code and by using the database's most appropriate Get methods.

Introduction

Last year, I published the article ADO+.NET. In it, I presented a library that in my opinion can "replace" ADO.NET.

In fact, that library uses ADO.NET, it just doesn't expose it to its users. We can say this is very similar to what WinForms does compared to the Windows API (the user32 library). Yet Nicolas Dorier told me that ADO+.NET shouldn't try to replace ADO.NET; it should add to it.

Well, part of me still disagrees, because my purpose was to hide the methods and properties that cause confusion. But I do understand that there are too many projects already using ADO.NET directly, so making something that simply makes it easier to access data the right way, without replacing the ADO.NET components, is a good thing.

So this time, I am presenting a very small library meant to make it easier to actually read values with a data reader and generate a filled record in a very fast and yet configurable manner.

What Does This Library Really Do?

This library starts with a very simple purpose: filling a .NET object from an IDataReader row easily, without manual code to pick the right GetSomething call, without manual conversions for data-type mismatches, and with great speed.

On top of this, it gives syntactic-sugar methods that allow users to easily enumerate a database command, receiving instances of a given type, or to create a list with those results.

And to make things even better, the entire code has extension points that let you tell it how to process a given database column (so you can decide to call some specific method to read that column). If you don't want to go that far, other extension points let you tell it how to find a field or property from the database column name, how to read a database column considering its actual database type or, simpler still, how a needed conversion should be done.

Using the Code

Before talking about performance, explaining how it works internally or explaining how to completely reconfigure this solution, let's understand how to use it for the basic situations.

To start (probably during the application initialization), you need to configure the library with its default values (or maybe with different values, but we will see that later) so, to do that, you must call:

C#
DbReaderGeneratorLibrary.Configure();

Then, at any moment, when you have an IDbCommand already filled and are at the point of calling ExecuteReader(), you can use one of these methods (a usage sketch follows the list):

  • Enumerate<T>() - This is an extension method that will call ExecuteReader() and will create a fast delegate to read each row.

    As an Enumerable, you are free to use constructs like foreach and even LINQ methods (methods like FirstOrDefault(), First(), Single() and SingleOrDefault() are good choices as they don't require reading all the records). But if you need methods that load the entire contents into memory, it is preferable to use the ToList() method.

  • ToList<T>() - This is very similar to the Enumerate() method, but instead of only reading records when requested (enumerated), it loads all the records at once. If you actually want all the records in memory, this method is preferred over the previous one, but it is not indicated for large batches or if you plan to call methods like First(), Single(), etc., as it is useless to read all the records when you only want the first one.

  • EnumerateColumn<T>(columnIndex) - It is not uncommon to create a query that will read a single column from the database. In this case, why should we enumerate records that have a single property? Why not enumerate the columns directly?

    This is what the EnumerateColumn() does. I was unsure if I should call it EnumerateField or EnumerateColumn, but even if the IDataReader has a GetFieldType, I decided it was better to use the "columns" name.

    When enumerating a command to read a column, we generally read the single column the query is actually selecting, so the default value for the columnIndex is zero, yet we are free to read another column index if we really want to select more than one column and decide to ignore all the others.

  • ColumnToList<T>(columnIndex) - As happens with the Enumerate() and ToList() methods, sometimes we may prefer to put all data into memory instead of iterating it. So, the ColumnToList() method will read all the lines resulting from a command and will put those results into a single list.
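For illustration, here is a minimal usage sketch. The Person class, the CreateConnection() helper and the column names are hypothetical; the extension methods are the ones listed above, used after calling DbReaderGeneratorLibrary.Configure() during initialization:

C#
// A minimal usage sketch. Person, CreateConnection() and the column names are
// hypothetical; the extension methods are the ones described above.
public sealed class Person
{
  public int Id;
  public string Name;
}

public static void PrintAllNames()
{
  using (IDbConnection connection = CreateConnection())
  using (IDbCommand command = connection.CreateCommand())
  {
    connection.Open();
    command.CommandText = "SELECT Id, Name FROM Person";

    // Lazily enumerate records, one at a time.
    foreach (Person person in command.Enumerate<Person>())
      Console.WriteLine(person.Name);

    // Or load everything into memory at once:
    // List<Person> people = command.ToList<Person>();

    // Or enumerate a single column (index 0 by default):
    // foreach (string name in command.EnumerateColumn<string>())
    //   Console.WriteLine(name);
  }
}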

The Library Organisation

Personally, I like to have one type per file. I am even an extremist about it, as I usually put a delegate declaration, which is a single line of code, into its own file.

But lately I have been "playing" a little with different ways of presenting code and libraries, and one thing I know is that many users prefer a single file they can add to their projects instead of many files or, even worse, having to add a new project to their solutions and/or a DLL reference. So I decided to divide things into only two files:

  • DbReaderGeneratorDefinition.cs: This file contains all the interfaces, delegates and the "Default" definitions without actually having any implementation. The purpose of this design is to allow you to replace the default implementation if you decide to write your own implementation without having to rewrite code that depends on this library. This is very similar to the design I presented in the article Architecture of a Remoting Framework.
  • DbReaderGeneratorDefaultImplementation.cs: This file is the actual implementation of the DbReaderGenerator. Well, at least the default one for all the interfaces and extension points. I really believe that in your own projects, you will try to replace at least one of the classes presented in this file.

Well... actually there's a third file. The DbReaderExtensions.cs is where the methods that I presented in the Using the Code topic are found. I was really unsure where I should put the class with those methods. Considering it is a default implementation, I thought about putting it in the default implementation file. But considering it is able to work with any configuration, I thought I should put it in the definition file... in the end, I put it into its own file, but I still hope there aren't too many files in the solution, and you can always merge the contents into a single file if you really want to.

Performance

I always hear (and read) people saying that performance is not important. Yet in almost every place I work, the main issue is performance. And the common points where performance is a problem are:

  • Communication
  • Database access

And considering we are not dealing with communication in this article (and databases have their own optimized way of communication), there's only one thing to try to improve: The database access.

Actually, database access using ADO.NET directly suffers from these things (among others):

  • If using DataTables/DataSets, all the value-type columns (int, booleans, tiny ints, chars, etc.) are boxed, which occupies much more memory and also always requires a cast when reading the data
  • If using data readers, considering that some databases change the actual data-type, it is not an uncommon practice to read data using the GetValue() method (which also does boxing) and then use a method from the Convert class to actually receive the data with the right type
  • Before the existence of yield return, it was hard to write enumerators that read one record at a time, so people used to read all the data and put it into lists. Now, even with yield return, it is common practice to keep reading all the data into a list (maybe because people are simply following the old "standard"), and this can be very problematic for large batches

So my purpose was to help users to avoid all those problems. Actually, the code generates a single delegate at run-time to read the entire row filling an already existing object, so we can say that reading an entire line only adds a single virtual method call compared to the code done by hand.

But, differently from what happens when users write the code by hand, it is actually capable of analysing the database's types and using the right GetSomething method. Note that for SQL Server you actually can't use the GetChar() method to read a char; you must read it as a string and then take the first character, but that's not the case for other databases.
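For example (the member name and column index are illustrative), the expression generated for a SQL Server char column is conceptually equivalent to:

C#
// Conceptual, hand-written equivalent (illustrative member and index):
// SqlDataReader does not support GetChar(), so the char column is read as a
// string and its first character is taken; other providers may use GetChar().
instance.Initial = reader.GetString(columnIndex)[0];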

So, with the code generated at run-time, you can benefit from the versatility of using the right access method for different databases without a performance hit and without having to write a different code for every database. And as an extra benefit, you can avoid writing lots of repetitive code even if you don't use different databases, avoiding bugs caused by copy/paste and guaranteeing that the good pattern is always used.

My only performance comparison was with Dapper, reading thousands of records many, many times. When Dapper was taking 1.9 seconds, this solution was taking 1.2 seconds. Yet I did it only to be sure that it is fast: Dapper is more complete in that it helps you fill parameters, while this solution is more complete with respect to user data-types and configuration (in fact, I plan to present the solution for filling query parameters in another article, as an independent solution that can be combined with this one).

EnumerateSingleInstance<T>

I already presented a list of extension methods that you can use with your IDbCommands. But if you really care about performance, there's one extra method, called EnumerateSingleInstance.

This method always returns the same instance for all the database rows (and it can even receive such an instance as a parameter). Its purpose is to be used in foreach blocks where the record is not used outside the block, so it avoids the cost of creating a new instance for every database row, effectively becoming faster and also relieving pressure on the garbage collector.
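A minimal sketch of the intended usage (the OrderLine type and the total being computed are hypothetical):

C#
// Hypothetical example: the same OrderLine instance is refilled for every row,
// so it must only be used inside the loop body and never stored elsewhere.
decimal total = 0;
foreach (var line in command.EnumerateSingleInstance<OrderLine>())
  total += line.Quantity * line.UnitPrice;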

But, as it always returns the same instance, you can't use methods that expect different instances to be returned. For example, it is useless to use LINQ methods like ToArray(), as the returned array will have the right length but will be filled with only one instance (which will contain the values of the last record read).

So, use this method if you want the maximum performance, but use it with caution.

The IDbReaderGenerator

In this library, everything starts with the IDbReaderGenerator. When we call Enumerate() or ToList(), what actually happens is that those methods call ExecuteReader(), ask the IDbReaderGenerator to generate a record filler and then iterate through all the rows, calling the generated delegate and either yield returning the records or adding them to a list.

So we can say that everything starts from the IDbReaderGenerator.

But the default implementation will, in fact, do two things:

  • Decorate an inner solution with a cache of the results, so new executions of the same query don't lose time generating the delegate again;
  • Compile a delegate from an expression generated by the IDbReaderExpressionGenerator.

So, the user's starting point is not the architecture's starting point. The architecture's starting point is the IDbReaderExpressionGenerator.

The IDbReaderExpressionGenerator

This is the real heart of this library. The IDbReaderExpressionGenerator is responsible for generating a single expression that represents the appropriate call. In fact, there are two kinds of expressions that can be generated:

  • Filler: A "filler" expression has the purpose of filling an already existing instance, so it receives that instance as input instead of generating a new result at each call. This has the advantage of letting users avoid creating new records if they only want to keep a single record at a time in memory (a conceptual sketch follows this list);
  • Column Reader: Expressions of this type are optimized to read a single column, so they return that column value directly instead of filling a record instance.
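Conceptually, a generated filler for a hypothetical Person type, read from an int Id column and a string Name column, is equivalent to the following hand-written code (the real delegate is built from expression trees at run-time):

C#
// Conceptual equivalent of a generated "filler" (names are illustrative).
// The generator picks the right Get method per column type, so there is no
// boxing and no GetValue()/Convert round-trip when the types already match.
static void FillPerson(IDataReader reader, Person instance)
{
  instance.Id = reader.GetInt32(0);
  instance.Name = reader.GetString(1);
}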

The Extension Points

Actually, the first extension points of the DbReaderGenerator and the DbReaderExpressionGenerator are the interfaces themselves. As they start as interfaces, their implementations can be completely replaced or decorated.

As already explained, the default DbReaderGenerator is in fact a decorator that caches the results generated by an implementation that redirects to the DbReaderExpressionGenerator. So, let's see the extension points that are available for the DbReaderExpressionGenerator.

The default implementation of the DbReaderExpressionGenerator has two constructors. One of them receives delegates that generate the expressions to:

  • Access the database column
  • Access the .NET member (field or property)
  • Make a conversion from the database type to the .NET member type (if needed)

Note that generating the member access expression is the easiest one and it is actually implemented with this code:

C#
(command, reader, columnIndex, instanceVariable) =>
{
  string memberName = reader.GetName(columnIndex);
  var member = instanceVariable.Type.GetMember(memberName);
  if (member == null || member.Length != 1)
    throw new InvalidOperationException("Can't find a single member named: " + memberName);

  return Expression.MakeMemberAccess(instanceVariable, member[0]);
};

And this is probably the code that you may want to replace if you have a different naming rule.

As you can see in this code, the name of the database column is used to find a .NET member with the same name (so, a field or property with the same name).

But I know many situations where users want to use different names for their properties, be it by using attributes or another application-specific rule. So, it is enough to replace this delegate with one that finds the member using a different rule and everything else keeps working.
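For example, here is a hedged sketch (not part of the library) of a replacement delegate that matches column names to fields or properties case-insensitively, following the same delegate signature shown above (it needs using System.Linq and System.Reflection):

C#
// A sketch only: matches the database column name to a field or property
// case-insensitively instead of requiring an exact match.
(command, reader, columnIndex, instanceVariable) =>
{
  string columnName = reader.GetName(columnIndex);

  var members = instanceVariable.Type
    .GetMembers()
    .Where(m => (m is FieldInfo || m is PropertyInfo) &&
                string.Equals(m.Name, columnName, StringComparison.OrdinalIgnoreCase))
    .ToArray();

  if (members.Length != 1)
    throw new InvalidOperationException("Can't find a single member for column: " + columnName);

  return Expression.MakeMemberAccess(instanceVariable, members[0]);
};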

Also note that if the query has a column name that's not found, an exception is thrown. Actually, if a null expression is returned, the generator will simply ignore that database column but will still fill the object with the other columns, so those are possible extensions that you may want to try.

The Other Two Parameters (readColumnGenerator and conversionGenerator)

The default implementation of the readColumnGenerator will try to discover if the column type, as seen in the database, has an equivalent GetColumnType method on the reader.

That is, an Int32 will use the GetInt32() method, a String will use the GetString() method, etc. If there is, it will use such a method to do the read. It doesn't care if this is not the type of the destination field or property as that's the responsibility of the conversion generator.

And so, if those types don't match, it is the responsibility of the conversion generator to generate the appropriate conversion call. And well, the default implementation uses the Convert class, trying to find an appropriate ToSomeType or, if one is not available, using the ChangeType() method. I plan to present another article showing how you can use a really expandable solution for the data type conversions to support all kinds of conversions (similar to the Expandable IoC Container, but for the conversion expressions).
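To make that concrete, when a column's database type matches the destination member's type the generated read is a direct call, and when it doesn't match a conversion is added (the member names below are illustrative):

C#
// Illustrative only. Matching types: a direct call, no boxing, no conversion.
instance.Name = reader.GetString(0);

// Mismatching types: the column is read with the database's own type and the
// conversion generator emits the appropriate Convert call.
instance.Age = Convert.ToInt32(reader.GetDecimal(1));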

What if the Database Column is Not Expected to Set a Field or Property?

In some situations, it is possible that a column in the database is not meant to directly set a field or property. If this is the case, you can use the alternative constructor of the DbReaderExpressionGenerator. That alternative constructor still needs the readColumnGenerator and a conversionGenerator (which are used by the GenerateColumnReader method), but instead of trying to build the path (reading the database column and setting the field/property), it calls a delegate to do that. This means you can actually read a column and call a method, if that's appropriate for your case.

I can see this happening if the last column is something like an IsReadOnly column, which is responsible for calling a MakeReadOnly() method.
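Conceptually, the delegate given to that constructor could generate code equivalent to this (the column handling is the one from the example above, the rest is illustrative):

C#
// Illustrative: read the IsReadOnly column and call a method instead of
// setting a field or property.
if (reader.GetBoolean(columnIndex))
  instance.MakeReadOnly();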

Cache

Actually, the code generated to read the IDataReaders is pretty fast. The problem is that generating such a reader takes time (not that much, yet we can say it may be a problem). So the best thing to do is to cache the generated readers. The problem is: how do we cache them?

It is possible that you always use the same object (like Person) when reading a table Person, always with the same list of columns in the same order, independently of the ORDER BY or the WHERE used. If this is the case, a generator for the destination type could be reused in different select clauses, as different ORDER BY or WHERE clauses don't affect the returned columns.

But if you use the same target object with different tables or with different column orders in the SELECT clause, you can't reuse the generated reader, so a new one must be created. In fact, if you use different databases, the same select may actually return columns of different types, so the generated reader can't be reused in those cases either.

So, to try to solve those problems, the default cache will only reuse a cached generator for an identical SQL clause, considering it comes from a connection with exactly the same connection string and, of course, for exactly the same destination type. The class that does this caching actually allows you to say that you don't want to consider the SQL clause or the connection string, effectively reusing the cached generators more frequently, but then it is the user's responsibility to ask for that and to guarantee that the same object is not used to read different tables, different databases or simply selects with different column orders.
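As a rough illustration of that rule (this is not the library's actual cache class), a cache keyed by connection string, SQL text and destination type could look like this:

C#
// A rough sketch, not the library's implementation: cached fillers are only
// reused for the same connection string, the same SQL text and the same
// destination type. As noted below, entries are never collected.
static readonly ConcurrentDictionary<Tuple<string, string, Type>, Delegate> _fillers =
  new ConcurrentDictionary<Tuple<string, string, Type>, Delegate>();

static Delegate GetOrCreateFiller(IDbCommand command, Type destinationType, Func<Delegate> generate)
{
  var key = Tuple.Create(
    command.Connection.ConnectionString, command.CommandText, destinationType);

  return _fillers.GetOrAdd(key, _ => generate());
}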

Yet, even if you don't care about reusing the cache so often, there's another problem: cached items will never be collected, so if you are building many different SQL clauses (maybe because the where clauses use string concatenation instead of parameters, which is another problem on its own), the cache can become too big. It is not hard to make a better cache by using a weak dictionary, but creating a good weak dictionary is not that easy and surely not that small, so to avoid shipping a big solution I simply wrote a simple cache that's not weak. Remember this trait when using it: you may prefer to reuse the same SQL clauses, or you may prefer to write your own cache with different rules, as this is something important.

Sample

The sample application is a speed and conversion comparison between this solution and Dapper, using fake commands and data readers. I made it use fake commands and data readers because I don't want to force users to create a real database to do the tests, yet I tried to use all the important methods, so it is possible to see how the library can be used.

The fake reader actually treats the "int" columns as decimal columns, so it is necessary to do some conversions to make the values work and, with enums, Dapper simply fails while this solution works.

It is important to note that this application runs really fast even when processing millions of records, as it doesn't actually lose time querying a real database, so we can see that both mappers are extremely fast. In real situations, most of the time is spent executing the real query and receiving data through TCP/IP, yet I included the speed comparison to show that this solution is not going to make things slower, while it will be useful by using the right get methods and doing the conversions easily, if they are necessary.

The Future

I will not promise anything, but in the future I plan to present an implementation of the ConversionGenerator that will allow users to easily register new conversions without having to provide an entirely new implementation, and I also plan to present a solution to fill the database parameters (which is something that Dapper has but this solution doesn't).

My purpose is to let each of those solutions live in isolation, so you can use any one of them without using the others, while still providing syntactic-sugar methods closer to how Dapper works. That way you can easily execute a query, giving typed parameters and receiving typed results with ease, which of course will use this fast solution for reading and, if needed, the fast solutions for filling parameters and for configurable data-type conversions.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) Microsoft
United States
I started to program computers when I was 11 years old, as a hobbyist, programming in AMOS Basic and Blitz Basic for Amiga.
At 12 I had my first try with assembler, but it was too difficult at the time. Then, in the same year, I learned C and, after learning C, I was finally able to learn assembler (for Motorola 680x0).
Not sure, but probably between 12 and 13, I started to learn C++. I always programmed "in an object oriented way", but using function pointers instead of virtual methods.

At 15 I started to learn Pascal at school and to use Delphi. At 16 I started my first internship (using Delphi). At 18 I started to work professionally using C++ and since then I've developed my programming skills as a professional developer in C++ and C#, generally creating libraries that help other developers do their work easier, faster and with less errors.

Want more info or simply want to contact me?
Take a look at: http://paulozemek.azurewebsites.net/
Or e-mail me at: paulozemek@outlook.com

Codeproject MVP 2012, 2015 & 2016
Microsoft MVP 2013-2014 (in October 2014 I started working at Microsoft, so I can't be a Microsoft MVP anymore).
