Monday, July 21, 2008

Performance Tip for .Net Developers

I don't actually remember how I landed on this one, but today morning, I saw this article about "Tips for performance improvements for .Net Developers and some Do's and Dont's" and thought it would be something really nice to share on the blog for the community.

Mind well, this article is old and like it says in the article, a lot of stuff in the form of optimization suggested in it might have already been pushed into later versions of the Framework (the Framework was v1 at the time of the writing of this article).

The article's 18 pages long, but its every bit worth the time to go through it.
So, here the article.

.NET Development (General) Technical Articles
Performance Tips and Tricks in .NET Applications

Emmanuel Schanzer

Microsoft Corporation

August 2001

Summary: This article is for developers who want to tweak their applications for optimal performance in the managed world. Sample code, explanations and design guidelines are addressed for Database, Windows Forms and ASP applications, as well as language-specific tips for Microsoft Visual Basic and Managed C++. (25 printed pages)

Contents

Overview
Performance Tips for All Applications
Tips for Database Access
Performance Tips for ASP.NET Applications
Tips for Porting and Developing in Visual Basic
Tips for Porting and Developing in Managed C++
Additional Resources
Appendix: Cost of Virtual Calls and Allocations

Overview

This white paper is designed as a reference for developers writing applications for .NET and looking for various ways to improve performance. If you are a developer who is new to .NET, you should be familiar with both the platform and your language of choice. This paper strictly builds on that knowledge, and assumes that the programmer already knows enough to get the program running. If you are porting an existing application to .NET, it's worth reading this document before you begin the port. Some of the tips here are helpful in the design phase, and provide information you should be aware of before you begin the port.

This paper is divided into segments, with tips organized by project and developer type. The first set of tips is a must-read for writing in any language, and contains advice that will help you with any target language on the Common Language Runtime (CLR). A related section follows with ASP-specific tips. The second set of tips is organized by language, dealing with specific tips about using Managed C++ and Microsoft® Visual Basic®.

Due to schedule limitations, the version 1 (v1) run time had to target the broadest functionality first, and then deal with special-case optimizations later. This results in a few pigeonhole cases where performance becomes an issue. As such, this paper covers several tips that are designed to avoid this case. These tips will not be relevant in the next version (vNext), as these cases are systematically identified and optimized. I'll point them out as we go, and it is up to you to decide whether it is worth the effort.

Performance Tips for All Applications

There are a few tips to remember when working on the CLR in any language. These are relevant to everyone, and should be the first line of defense when dealing with performance issues.

Throw Fewer Exceptions

Throwing exceptions can be very expensive, so make sure that you don't throw a lot of them. Use Perfmon to see how many exceptions your application is throwing. It may surprise you to find that certain areas of your application throw more exceptions than you expected. For better granularity, you can also check the exception number programmatically by using Performance Counters.

Finding and designing away exception-heavy code can result in a decent perf win. Bear in mind that this has nothing to do with try/catch blocks: you only incur the cost when the actual exception is thrown. You can use as many try/catch blocks as you want. Using exceptions gratuitously is where you lose performance. For example, you should stay away from things like using exceptions for control flow.

Here's a simple example of how expensive exceptions can be: we'll simply run through a For loop, generating thousands or exceptions and then terminating. Try commenting out the throw statement to see the difference in speed: those exceptions result in tremendous overhead.

public static void Main(string[] args){
int j = 0;
for(int i = 0; i < 10000; i++){
try{
j = i;
throw new System.Exception();
} catch {}
}
System.Console.Write(j);
return;
}
  • Beware! The run time can throw exceptions on its own! For example, Response.Redirect() throws a ThreadAbort exception. Even if you don't explicitly throw exceptions, you may use functions that do. Make sure you check Perfmon to get the real story, and the debugger to check the source.
  • To Visual Basic developers: Visual Basic turns on int checking by default, to make sure that things like overflow and divide-by-zero throw exceptions. You may want to turn this off to gain performance.
  • If you use COM, you should keep in mind that HRESULTS can return as exceptions. Make sure you keep track of these carefully.

Make Chunky Calls

A chunky call is a function call that performs several tasks, such as a method that initializes several fields of an object. This is to be viewed against chatty calls, that do very simple tasks and require multiple calls to get things done (such as setting every field of an object with a different call). It's important to make chunky, rather than chatty calls across methods where the overhead is higher than for simple, intra-AppDomain method calls. P/Invoke, interop and remoting calls all carry overhead, and you want to use them sparingly. In each of these cases, you should try to design your application so that it doesn't rely on small, frequent calls that carry so much overhead.

A transition occurs whenever managed code is called from unmanaged code, and vice versa. The run time makes it extremely easy for the programmer to do interop, but this comes at a performance price. When a transition happens, the following steps needs to be taken:

  • Perform data marshalling
  • Fix Calling Convention
  • Protect callee-saved registers
  • Switch thread mode so that GC won't block unmanaged threads
  • Erect an Exception Handling frame on calls into managed code
  • Take control of thread (optional)

To speed up transition time, try to make use of P/Invoke when you can. The overhead is as little as 31 instructions plus the cost of marshalling if data marshalling is required, and only 8 otherwise. COM interop is much more expensive, taking upwards of 65 instructions.

Data marshalling isn't always expensive. Primitive types require almost no marshalling at all, and classes with explicit layout are also cheap. The real slowdown occurs during data translation, such as text conversion from ASCI to Unicode. Make sure that data that gets passed across the managed boundary is only converted if it needs to be: it may turn out that simply by agreeing on a certain datatype or format across your program you can cut out a lot of marshalling overhead.

The following types are called blittable, meaning they can be copied directly across the managed/unmanaged boundary with no marshalling whatsoever: sbyte, byte, short, ushort, int, uint, long, ulong, float and double. You can pass these for free, as well as ValueTypes and single-dimensional arrays containing blittable types. The gritty details of marshalling [ http://msdn.microsoft.com/en-us/library/aa720205(printer).aspx ] can be explored further on the MSDN Library. I recommend reading it carefully if you spend a lot of your time marshalling.

Design with ValueTypes

Use simple structs when you can, and when you don't do a lot of boxing and unboxing. Here's a simple example to demonstrate the speed difference:

using System;

namespace ConsoleApplication{

  public struct foo{
public foo(double arg){ this.y = arg; }
public double y;
}
public class bar{
public bar(double arg){ this.y = arg; }
public double y;
}
class Class1{
static void Main(string[] args){
System.Console.WriteLine("starting struct loop...");
for(int i = 0; i < 50000000; i++)
{foo test = new foo(3.14);}
System.Console.WriteLine("struct loop complete.
starting object loop...");
for(int i = 0; i < 50000000; i++)
{bar test2 = new bar(3.14); }
System.Console.WriteLine("All done");
}
}
}

When you run this example, you'll see that the struct loop is orders of magnitude faster. However, it is important to beware of using ValueTypes when you treat them like objects. This adds extra boxing and unboxing overhead to your program, and can end up costing you more than it would if you had stuck with objects! To see this in action, modify the code above to use an array of foos and bars. You'll find that the performance is more or less equal.

Tradeoffs ValueTypes are far less flexible than Objects, and end up hurting performance if used incorrectly. You need to be very careful about when and how you use them.

Try modifying the sample above, and storing the foos and bars inside arrays or hashtables. You'll see the speed gain disappear, just with one boxing and unboxing operation.

You can keep track of how heavily you box and unbox by looking at GC allocations and collections. This can be done using either Perfmon externally or Performance Counters in your code.

See the in-depth discussion of ValueTypes in Performance Considerations of Run-Time Technologies in the .NET Framework [ http://msdn.microsoft.com/en-us/library/ms973838(printer).aspx ] .

Use AddRange to Add Groups

Use AddRange to add a whole collection, rather than adding each item in the collection iteratively. Nearly all windows controls and collections have both Add and AddRange methods, and each is optimized for a different purpose. Add is useful for adding a single item, whereas AddRange has some extra overhead but wins out when adding multiple items. Here are just a few of the classes that support Add and AddRange:

  • StringCollection, TraceCollection, etc.
  • HttpWebRequest
  • UserControl
  • ColumnHeader

Trim Your Working Set

Minimize the number of assemblies you use to keep your working set small. If you load an entire assembly just to use one method, you're paying a tremendous cost for very little benefit. See if you can duplicate that method's functionality using code that you already have loaded.

Keeping track of your working set is difficult, and could probably be the subject of an entire paper. Here are some tips to help you out:

  • Use vadump.exe to track your working set. This is discussed in another white paper covering various tools for the managed environment.
  • Look at Perfmon or Performance Counters. They can give you detail feedback about the number of classes that you load, or the number of methods that get JITed. You can get readouts for how much time you spend in the loader, or what percent of your execution time is spent paging.

Use For Loops for String Iteration—version 1

In C#, the foreach keyword allows you to walk across items in a list, string, etc. and perform operations on each item. This is a very powerful tool, since it acts as a general-purpose enumerator over many types. The tradeoff for this generalization is speed, and if you rely heavily on string iteration you should use a For loop instead. Since strings are simple character arrays, they can be walked using much less overhead than other structures. The JIT is smart enough (in many cases) to optimize away bounds-checking and other things inside a For loop, but is prohibited from doing this on foreach walks. The end result is that in version 1, a For loop on strings is up to five times faster than using foreach. This will change in future versions, but for version 1 this is a definite way to increase performance.

Here's a simple test method to demonstrate the difference in speed. Try running it, then removing the For loop and uncommenting the foreach statement. On my machine, the For loop took about a second, with about 3 seconds for the foreach statement.

public static void Main(string[] args) {
string s = "monkeys!";
int dummy = 0;

System.Text.StringBuilder sb = new System.Text.StringBuilder(s);
for(int i = 0; i < 1000000; i++)
sb.Append(s);
s = sb.ToString();
//foreach (char c in s) dummy++;
for (int i = 0; i < 1000000; i++)
dummy++;
return;
}
}
Tradeoffs Foreach is far more readable, and in the future it will become as fast as a For loop for special cases like strings. Unless string manipulation is a real performance hog for you, the slightly messier code may not be worth it.

Use StringBuilder for Complex String Manipulation

When a string is modified, the run time will create a new string and return it, leaving the original to be garbage collected. Most of the time this is a fast and simple way to do it, but when a string is being modified repeatedly it begins to be a burden on performance: all of those allocations eventually get expensive. Here's a simple example of a program that appends to a string 50,000 times, followed by one that uses a StringBuilder object to modify the string in place. The StringBuilder code is much faster, and if you run them it becomes immediately obvious.

namespace ConsoleApplication1.Feedback
  using System;

public class Feedback{
public Feedback(){
text = "You have ordered: \n";
}

public string text;

public static int Main(string[] args) {
Feedback test = new Feedback();
String str = test.text;
for(int i=0;i<50000;i++){
str = str + "blue_toothbrush";
}
System.Console.Out.WriteLine("done");
return 0;
}
}
}

namespace ConsoleApplication1.Feedback
  using System;

public class Feedback{
public Feedback(){
text = "You have ordered: \n";
}

public string text;

public static int Main(string[] args) {
Feedback test = new Feedback();
System.Text.StringBuilder SB =
new System.Text.StringBuilder(test.text);
for(int i=0;i<50000;i++){
SB.Append("blue_toothbrush");
}
System.Console.Out.WriteLine("done");
return 0;
}
}
}

Try looking at Perfmon to see how much time is saved without allocating thousands of strings. Look at the "% time in GC" counter under the .NET CLR Memory list. You can also track the number of allocations you save, as well as collection statistics.

Tradeoffs There is some overhead associated with creating a StringBuilder object, both in time and memory. On a machine with fast memory, a StringBuilder becomes worthwhile if you're doing about five operations. As a rule of thumb, I would say 10 or more string operations is a justification for the overhead on any machine, even a slower one.

Precompile Windows Forms Applications

Methods are JITed when they are first used, which means that you pay a larger startup penalty if your application does a lot of method calling during startup. Windows Forms use a lot of shared libraries in the OS, and the overhead in starting them can be much higher than other kinds of applications. While not always the case, precompiling Windows Forms applications usually results in a performance win. In other scenarios it's usually best to let the JIT take care of it, but if you are a Windows Forms developer you might want to take a look.

Microsoft allows you to precompile an application by calling ngen.exe. You can choose to run ngen.exe during install time or before you distribute you application. It definitely makes the most sense to run ngen.exe during install time, since you can make sure that the application is optimized for the machine on which it is being installed. If you run ngen.exe before you ship the program, you limit the optimizations to the ones available on your machine. To give you an idea of how much precompiling can help, I've run an informal test on my machine. Below are the cold startup times for ShowFormComplex, a winforms application with roughly a hundred controls.

Code StateTime
Framework JITed

ShowFormComplex JITed

3.4 sec
Framework Precompiled, ShowFormComplex JITed2.5 sec
Framework Precompiled, ShowFormComplex Precompiled2.1sec

Each test was performed after a reboot. As you can see, Windows Forms applications use a lot of methods up front, making it a substantial performance win to precompile.

Use Jagged Arrays—Version 1

The v1 JIT optimizes jagged arrays (simply 'arrays-of-arrays') more efficiently than rectangular arrays, and the difference is quite noticeable. Here is a table demonstrating the performance gain resulting from using jagged arrays in place of rectangular ones in both C# and Visual Basic (higher numbers are better):

C#Visual Basic 7
Assignment (jagged)

Assignment (rectangular)

14.16

8.37

12.24

8.62

Neural Net (jagged)

Neural net (rectangular)

4.48

3.00

4.58

3.13

Numeric Sort (jagged)

Numeric Sort (rectangular)

4.88

2.05

5.07

2.06

The assignment benchmark is a simple assignment algorithm, adapted from the step-by-step guide found in Quantitative Decision Making for Business (Gordon, Pressman, and Cohn; Prentice-Hall; out of print). The neural net test runs a series of patterns over a small neural network, and the numeric sort is self-explanatory. Taken together, these benchmarks represent a good indication of real-world performance.

As you can see, using jagged arrays can result in fairly dramatic performance increases. The optimizations made to jagged arrays will be added to future versions of the JIT, but for v1 you can save yourself a lot of time by using jagged arrays.

Keep IO Buffer Size Between 4KB and 8KB

For nearly every application, a buffer between 4KB and 8KB will give you the maximum performance. For very specific instances, you may be able to get an improvement from a larger buffer (loading large images of a predictable size, for example), but in 99.99% of cases it will only waste memory. All buffers derived from BufferedStream allow you to set the size to anything you want, but in most cases 4 and 8 will give you the best performance.

Be on the Lookout for Asynchronous IO Opportunities

In rare cases, you may be able to benefit from Asynchronous IO. One example might be downloading and decompressing a series of files: you can read the bits in from one stream, decode them on the CPU and write them out to another. It takes a lot of effort to use Asynchronous IO effectively, and it can result in a performance loss if it's not done right. The advantage is that when applied correctly, Asynchronous IO can give you as much as ten times the performance.

An excellent example of a program using Asynchronous IO [ http://msdn.microsoft.com/en-us/library/aa719596(printer).aspx ] is available on the MSDN Library.

  • One thing to note is that there is a small security overhead for asynchronous calls: Upon invoking an async call, the security state of the caller's stack is captured and transferred to the thread that'll actually execute the request. This may not be a concern if the callback executes lot of code, or if async calls aren't used excessively

Tips for Database Access

The philosophy of tuning for database access is to use only the functionality that you need, and to design around a 'disconnected' approach: make several connections in sequence, rather than holding a single connection open for a long time. You should take this change into account and design around it.

Microsoft recommends an N-Tier strategy for maximum performance, as opposed to a direct client-to-database connection. Consider this as part of your design philosophy, as many of the technologies in place are optimized to take advantage of a multi-tired scenario.

Use the Optimal Managed Provider

Make the correct choice of managed provider, rather than relying on a generic accessor. There are managed providers written specifically for many different databases, such as SQL (System.Data.SqlClient). If you use a more generic interface such as System.Data.Odbc when you could be using a specialized component, you will lose performance dealing with the added level of indirection. Using the optimal provider can also have you speaking a different language: the Managed SQL Client speaks TDS to a SQL database, providing a dramatic improvement over the generic OleDbprotocol.

Pick Data Reader Over Data Set When You Can

Use a data reader whenever when you don't need to keep the data lying around. This allows a fast read of the data, which can be cached if the user desires. A reader is simply a stateless stream that allows you to read data as it arrives, and then drop it without storing it to a dataset for more navigation. The stream approach is faster and has less overhead, since you can start using data immediately. You should evaluate how often you need the same data to decide if the caching for navigation makes sense for you. Here's a small table demonstrating the difference between DataReader and DataSet on both ODBC and SQL providers when pulling data from a server (higher numbers are better):

ADOSQL
DataSet8012507
DataReader10834585

As you can see, the highest performance is achieved when using the optimal managed provider along with a data reader. When you don't need to cache your data, using a data reader can provide you with an enormous performance boost.

Use Mscorsvr.dll for MP Machines

For stand-alone middle-tier and server applications, make sure mscorsvr is being used for multiprocessor machines. Mscorwks is not optimized for scaling or throughput, while the server version has several optimizations that allow it to scale well when more than one processor is available.

Use Stored Procedures Whenever Possible

Stored procedures are highly optimized tools that result in excellent performance when used effectively. Set up stored procedures to handle inserts, updates, and deletes with the data adapter. Stored procedures do not have to be interpreted, compiled or even transmitted from the client, and cut down on both network traffic and server overhead. Be sure to use CommandType.StoredProcedure instead of CommandType.Text

Be Careful About Dynamic Connection Strings

Connection pooling is a useful way to reuse connections for multiple requests, rather than paying the overhead of opening and closing a connection for each request. It's done implicitly, but you get one pool per unique connection string. If you're generating connection strings dynamically, make sure the strings are identical each time so pooling occurs. Also be aware that if delegation is occurring, you'll get one pool per user. There are a lot of options that you can set for the connection pool, and you can track the performance of the pool by using the Perfmon to keep track of things like response time, transactions/sec, etc.

Turn Off Features You Don't Use

Turn off automatic transaction enlistment if it's not needed. For the SQL Managed Provider, it's done via the connection string:

SqlConnection conn = new SqlConnection(
"Server=mysrv01;
Integrated Security=true;
Enlist=false");

When filling a dataset with the data adapter, don't get primary key information if you don't have to (e.g. don't set MissingSchemaAction.Add with key):

public DataSet SelectSqlSrvRows(DataSet dataset,string connection,string query){
SqlConnection conn = new SqlConnection(connection);
SqlDataAdapter adapter = new SqlDataAdapter();
adapter.SelectCommand = new SqlCommand(query, conn);
adapter.MissingSchemaAction = MissingSchemaAction.AddWithKey;
adapter.Fill(dataset);
return dataset;
}

Avoid Auto-Generated Commands

When using a data adapter, avoid auto-generated commands. These require additional trips to the server to retrieve meta data, and give you a lower level of interaction control. While using auto-generated commands is convenient, it's worth the effort to do it yourself in performance-critical applications.

Beware ADO Legacy Design

Be aware that when you execute a command or call fill on the adapter, every record specified by your query is returned.

If server cursors are absolutely required, they can be implemented through a stored procedure in t-sql. Avoid where possible because server cursor-based implementations don't scale very well.

If needed, implement paging in a stateless and connectionless manner. You can add additional records to the dataset by:

  • Making sure PK info is present
  • Changing the data adapter's select command as appropriate, and
  • Calling Fill

Keep Your Datasets Lean

Only put the records you need into the dataset. Remember that the dataset stores all of its data in memory, and that the more data you request, the longer it will take to transmit across the wire.

Use Sequential Access as Often as Possible

With a data reader, use CommandBehavior.SequentialAccess. This is essential for dealing with blob data types since it allows data to be read off of the wire in small chunks. While you can only work with one piece of the data at a time, the latency for loading a large data type disappears. If you don't need to work the whole object at once, using Sequential Access will give you much better performance.

Performance Tips for ASP.NET Applications

Cache Aggressively

When designing an app using ASP.NET, make sure you design with an eye on caching. On server versions of the OS, you have a lot of options for tweaking the use of caches on the server and client side. There are several features and tools in ASP that you can make use of to gain performance.

Output Caching—Stores the static result of an ASP request. Specified using the <@% OutputCache %> directive:

  • Duration—Time item exists in the cache
  • VaryByParam—Varies cache entries by Get/Post params
  • VaryByHeader—Varies cache entries by Http header
  • VaryByCustom—Varies cache entries by browser
  • Override to vary by whatever you want:
    • Fragment Caching—When it is not possible to store an entire page (privacy, personalization, dynamic content), you can use fragment caching to store parts of it for quicker retrieval later.

      a) VaryByControl—Varies the cached items by values of a control

    • Cache API—Provides extremely fine granularity for caching by keeping a hashtable of cached objects in memory (System.web.UI.caching). It also:

      a) Includes Dependencies (key, file, time)

      b) Automatically expires unused items

      c) Supports Callbacks

Caching intelligently can give you excellent performance, and it's important to think about what kind of caching you need. Imagine a complex e-commerce site with several static pages for login, and then a slew of dynamically-generated pages containing images and text. You might want to use Output Caching for those login pages, and then Fragment Caching for the dynamic pages. A toolbar, for example, could be cached as a fragment. For even better performance, you could cache commonly used images and boilerplate text that appear frequently on the site using the Cache API. For detailed information on caching (with sample code), check out the ASP. NET [ http://www.gotdotnet.com/quickstart/aspplus/ ] Web site.

Use Session State Only If You Need To

One extremely powerful feature of ASP.NET is its ability to store session state for users, such as a shopping cart on an e-commerce site or a browser history. Since this is on by default, you pay the cost in memory even if you don't use it. If you're not using Session State, turn it off and save yourself the overhead by adding <@% EnabledSessionState = false %> to your asp. This comes with several other options, which are explained at the ASP. NET [ http://www.gotdotnet.com/quickstart/aspplus/ ] Web site.

For pages that only read session state, you can choose EnabledSessionState=readonly. This carries less overhead than full read/write session state, and is useful when you need only part of the functionality and don't want to pay for the write capabilities.

Use View State Only If You Need To

An example of View State might be a long form that users must fill out: if they click Back in their browser and then return, the form will remain filled. When this functionality isn't used, this state eats up memory and performance. Perhaps the largest performance drain here is that a round-trip signal must be sent across the network each time the page is loaded to update and verify the cache. Since it is on by default, you will need to specify that you do not want to use View State with <@% EnabledViewState = false %>. You should read more about View State on the the ASP. NET [ http://www.gotdotnet.com/quickstart/aspplus/ ] Web site to learn about some of the other options and settings to which you have access.

Avoid STA COM

Apartment COM is designed to deal with threading in unmanaged environments. There are two kinds of Apartment COM: single-threaded and multithreaded. MTA COM is designed to handle multithreading, whereas STA COM relies on the messaging system to serialize thread requests. The managed world is free-threaded, and using Single Threaded Apartment COM requires that all unmanaged threads essentially share a single thread for interop. This results in a massive performance hit, and should be avoided whenever possible. If you can't port the Apartment COM object to the managed world, use <@%AspCompat = "true" %> for pages that use them. For a more detailed explanation of STA COM [ http://msdn.microsoft.com/en-us/library/ms809311(printer).aspx ] , see the MSDN Library.

Batch Compile

Always batch compile before deploying a large page into the Web. This can be initiated by doing one request to a page per directory and waiting until the CPU idles again. This prevents the Web server from being bogged down with compilations while also trying to serve out pages.

Remove Unnecessary Http Modules

Depending on the features used, remove unused or unnecessary http modules from the pipeline. Reclaiming the added memory and wasted cycles can provide you with a small speed boost.

Avoid the Autoeventwireup Feature

Instead of relying on autoeventwireup, override the events from Page. For example, instead of writing a Page_Load() method, try overloading the public void OnLoad() method. This allows the run time from having to do a CreateDelegate() for every page.

Encode Using ASCII When You Don't Need UTF

By default, ASP.NET comes configured to encode requests and responses as UTF-8. If ASCII is all your application needs, eliminated the UTF overhead can give you back a few cycles. Note that this can only be done on a per-application basis.

Use the Optimal Authentication Procedure

There are several different ways to authenticate a user and some of more expensive than others (in order of increasing cost: None, Windows, Forms, Passport). Make sure you use the cheapest one that best fits your needs.

Tips for Porting and Developing in Visual Basic

A lot has changed under the hood from Microsoft® Visual Basic® 6 to Microsoft® Visual Basic® 7, and the performance map has changed with it. Due to the added functionality and security restrictions of the CLR, some functions are simply unable to run as quickly as they did in Visual Basic 6. In fact, there are several areas where Visual Basic 7 gets trounced by its predecessor. Fortunately, there are two pieces of good news:

  • Most of the worst slowdowns occur during one-time functions, such as loading a control for the first time. The cost is there, but you only pay it once.
  • There are a lot of areas where Visual Basic 7 is faster, and these areas tend to lie in functions that are repeated during run time. This means that the benefit grows over time, and in several cases will outweigh the one-time costs.

The majority of the performance issues come from areas where the run time does not support a feature of Visual Basic 6, and it has to be added to preserve the feature in Visual Basic 7. Working outside of the run time is slower, making some features far more expensive to use. The bright side is that you can avoid these problems with a little effort. There are two main areas that require work to optimize for performance, and few simple tweaks you can do here and there. Taken together, these can help you step around performance drains, and take advantage of the functions that are much faster in Visual Basic 7.

Error Handling

The first concern is error handling. This has changed a lot in Visual Basic 7, and there are performance issues related to the change. Essentially, the logic required to implement OnErrorGoto and Resume is extremely expensive. I suggest taking a quick look at your code, and highlighting all the areas where you use the Err object, or any error-handling mechanism. Now look at each of these instances, and see if you can rewrite them to use try/catch. A lot of developers will find that they can convert to try/catch easily for most of these cases, and they should see a good performance improvement in their program. The rule of thumb is "if you can easily see the translation, do it."

Here's an example of a simple Visual Basic program that uses On Error Goto compared with the try/catch version.

Sub SubWithError(
  On Error Goto SWETrap
Dim x As Integer
Dim y As Integer
x = x / y
SWETrap:
Exit Sub
End Sub


Sub SubWithErrorResumeLabel()
On Error Goto SWERLTrap
Dim x As Integer
Dim y As Integer
x = x / y
SWERLTrap:
Resume SWERLExit
End Sub

SWERLExit:
Exit Sub

Sub SubWithError(
  Dim x As Integer
Dim y As Integer
Try
x = x / y
Catch
Return
End Try
End Sub

Sub SubWithErrorResumeLabel()
Dim x As Integer
Dim y As Integer
Try
x = x / y
Catch
Goto SWERLExit
End Try

SWERLExit:
Return
End Sub

The speed increase is noticeable. SubWithError() takes 244 milliseconds using OnErrorGoto, and only 169 milliseconds using try/catch. The second function takes 179 milliseconds compared to 164 milliseconds for the optimized version.

Use Early Binding

The second concern deals with objects and typecasting. Visual Basic 6 does a lot of work under the hood to support casting of objects, and many programmers aren't even aware of it. In Visual Basic 7, this is an area that out of which you can squeeze a lot of performance. When you compile, use early binding. This tells the compiler to insert a Type Coercion is only done when explicitly mentioned. This has two major effects:

  • Strange errors become easier to track down.
  • Unneeded coercions are eliminated, leading to substantial performance improvements.

When you use an object as if it were of a different type, Visual Basic will coerce the object for you if you don't specify. This is handy, since the programmer has to worry about less code. The downside is that these coercions can do unexpected things, and the programmer has no control over them.

There are instances when you have to use late binding, but most of the time if you're not sure then you can get away with early binding. For Visual Basic 6 programmers, this can be a bit awkward at first, since you have to worry about types more than in the past. This should be easy for new programmers, and people familiar with Visual Basic 6 will pick it up in no time.

Turn On Option Strict and Explicit

With Option Strict on, you protect yourself from inadvertent late binding and enforce a higher level of coding discipline. For a list of the restrictions present with Option Strict, see the MSDN Library. The caveat to this is that all narrowing type coercions must be explicitly specified. However, this in itself may uncover other sections of your code that are doing more work than you had previously thought, and it may help you stomp some bugs in the process.

Option Explicit is less restrictive than Option Strict, but it still forces programmers to provide more information in their code. Specifically, you must declare a variable before using it. This moves the type-inference from the run time into compile time. This eliminated check translates into added performance for you.

I recommend that you start with Option Explicit, and then turn on Option Strict. This will protect you from a deluge of compiler errors, and allow you to gradually start working in the stricter environment. When both of these options are used, you ensure maximum performance for your application.

Use Binary Compare for Text

When comparing text, use binary compare instead of text compare. At run time, the overhead is much lighter for binary.

Minimize the Use of Format()

When you can, use toString() instead of format(). In most cases, it will provide you with the functionality you need, with much less overhead.

Use Charw

Use charw instead of char. The CLR uses Unicode internally, and char must be translated at run time if it is used. This can result in a substantial performance loss, and specifying that your characters are a full word long (using charw) eliminates this conversion.

Optimize Assignments

Use exp += val instead of exp = exp + val. Since exp can be arbitrarily complex, this can result in lots of unnecessary work. This forces the JIT to evaluate both copies of exp, and many times this is not needed. The first statement can be optimized far better than the second, since the JIT can avoid evaluating the exp twice.

Avoid Unnecessary Indirection

When you use byRef, you pass pointers instead of the actual object. Many times this makes sense (side-effecting functions, for example), but you don't always need it. Passing pointers results in more indirection, which is slower than accessing a value that is on the stack. When you don't need to go through the heap, it is best to avoid it.

Put Concatenations in One Expression

If you have multiple concatenations on multiple lines, try to stick them all on one expression. The compiler can optimize by modifying the string in place, providing a speed and memory boost. If the statements are split into multiple lines, the Visual Basic compiler will not generate the Microsoft Intermediate Language (MSIL) to allow in-place concatenation. See the StringBuilder example discussed earlier.

Include Return Statements

Visual Basic allows a function to return a value without using the return statement. While Visual Basic 7 supports this, explicitly using return allows the JIT to perform slightly more optimizations. Without a return statement, each function is given several local variables on stack to transparently support returning values without the keyword. Keeping these around makes it harder for the JIT to optimize, and can impact the performance of your code. Look through your functions and insert return as needed. It doesn't change the semantics of the code at all, and it can help you get more speed from your application.

Tips for Porting and Developing in Managed C++

Microsoft is targeting Managed C++ (MC++) at a specific set of developers. MC++ is not the best tool for every job. After reading this document, you may decide that C++ is not the best tool, and that the tradeoff costs are not worth the benefits. If you aren't sure about MC++, there are many good resources [ http://msdn.microsoft.com/vstudio/techinfo/articles/upgrade/managedext.asp ] to help you make your decision This section is targeted at developers who have already decided that they want to use MC++ in some way, and want to know about the performance aspects of it.

For C++ developers, working Managed C++ requires that several decisions be made. Are you porting some old code? If so, do you want to move the entire thing to managed space or are you instead planning to implement a wrapper? I'm going to focus on the 'port-everything' option or deal with writing MC++ from scratch for the purposes of this discussion, since those are the scenarios where the programmer will notice a performance difference.

Benefits of the Managed World

The most powerful feature of Managed C++ is the ability to mix and match managed and unmanaged code at the expression level. No other language allows you to do this, and there are some powerful benefits you can get from it if used properly. I'll walk through some examples of this later on.

The managed world also gives you huge design wins, in that a lot of common problems are taken care of for you. Memory management, thread scheduling and type coercions can be left to the run time if you desire, allowing you to focus your energies on the parts of the program that need it. With MC++, you can choose exactly how much control you want to keep.

MC++ programmers have the luxury of being able to use the Microsoft Visual C® 7 (VC7) backend when compiling to IL, and then using the JIT on top of that. Programmers that are used to working with the Microsoft C++ compiler are used to things being lightning-fast. The JIT was designed with different goals, and has a different set of strengths and weaknesses. The VC7 compiler, not bound by the time restrictions of the JIT, can perform certain optimizations that the JIT cannot, such as whole-program analysis, more aggressive inlining and enregistration. There are also some optimizations that can be performed only in typesafe environments, leaving more room for speed than C++ allows.

Because of the different priorities in the JIT, some operations are faster than before while others are slower. There are tradeoffs you make for safety and language flexibility, and some of them aren't cheap. Fortunately, there are things a programmer can do to minimize the costs.

Porting: All C++ Code Can Compile to MSIL

Before we go any further, it's important to note that you can compile any C++ code into MSIL. Everything will work, but there's no guarantee of type-safety and you pay the marshalling penalty if you do a lot of interop. Why is it helpful to compile to MSIL if you don't get any of the benefits? In situations where you are porting a large code base, this allows you to gradually port your code in pieces. You can spend your time porting more code, rather than writing special wrappers to glue the ported and not-yet-ported code together if you use MC++, and that can result in a big win. It makes porting applications a very clean process. To learn more about compiling C++ to MSIL, take a look at the /clr compiler option [ http://msdn.microsoft.com/en-us/library/system.xml.xmlvalidatingreader.readtypedvalue(printer).aspx ] .

However, simply compiling your C++ code to MSIL doesn't give you the security or flexibility of the managed world. You need to write in MC++, and in v1 that means giving up a few features. The list below is not supported in the current version of the CLR, but may be in the future. Microsoft chose to support the most common features first, and had to cut some others in order to ship. There is nothing that prevents them from being added later, but in the meantime you will need to do without them:

  • Multiple Inheritance
  • Templates
  • Deterministic Finalization

You can always interoperate with unsafe code if you need those features, but you will pay the performance penalty of marshalling data back and forth. And bear in mind that those features can only be used inside the unmanaged code. The managed space has no knowledge of their existence. If you are deciding to port your code, think about how much you rely on those features in your design. In a few cases, the redesign is too expensive and you will want to stick with unmanaged code. This is the first decision you should make, before you start hacking.

Advantages of MC++ Over C# or Visual Basic

Coming from an unmanaged background, MC++ preserves a lot of the ability to handle unsafe code. MC++'s ability to mix managed and unmanaged code smoothly provides the developer with a lot of power, and you can choose where on the gradient you want to sit when writing your code. On one extreme, you can write everything in straight, unadulterated C++ and just compile with /clr. On the other, you can write everything as managed objects and deal with the language limitations and performance problems mentioned above.

But the real power of MC++ comes when you choose somewhere in between. MC++ allows you to tweak some of the performance hits inherent in managed code, by giving you precise control over when to use unsafe features. C# has some of this functionality in the unsafe keyword, but it's not an integral part of the language and it is far less useful than MC++. Let's step through some examples showing the finer granularity available in MC++, and we'll talk about the situations where it comes in handy.

Generalized "byref" pointers

In C# you can only take the address of some member of a class by passing it to a ref parameter. In MC++, a byref pointer is a first-class construct. You can take the address of an item in the middle of an array and return that address from a function:

Byte* AddrInArray( Byte b[] ) {
return &b[5];
}

We exploit this feature for returning a pointer to the "characters" in a System.String via our helper routine, and we can even loop through arrays using these pointers:

System::Char* PtrToStringChars(System::String*);  
for( Char*pC = PtrToStringChars(S"boo");
pC != NULL;
pC++ )
{
... *pC ...
}

You can also do a linked-list traversal with injection in MC++ by taking the address of the "next" field (which you cannot do in C#):

Node **w = &Head;
while(true) {
if( *w == 0 val < (*w)->val ) {
Node *t = new Node(val,*w);
*w = t;
break;
}
w = &(*w)->next;
}

In C#, you can't point to "Head", or take the address of the "next" field, so you have make a special-case where you're inserting at the first location, or if "Head" is null. Moreover, you have to look one node ahead all the time in the code. Compare this to what a good C# would produce:

if( Head==null  val < Head.val ) {
Node t = new Node(val,Head);
Head = t;
}else{
// we know at least one node exists,
// so we can look 1 node ahead
Node w=Head;
while(true) {
if( w.next == null val < w.next.val ){
Node t = new Node(val,w.next.next);
w.next = t;
break;
}
w = w.next;
}
}

User Access to Boxed Types

A performance problem common with OO languages is the time spent boxing and unboxing values. MC++ gives you a lot more control over this behavior, so you won't have to dynamically (or statically) unbox to access values. This is another performance enhancement. Just place __box keyword before any type to represent its boxed form:

__value struct V {
int i;
};
int main() {
V v = {10};
__box V *pbV = __box(v);
pbV->i += 10; // update without casting
}

In C# you have to unbox to a "v", then update the value and re-box back to an Object:

struct B { public int i; }
static void Main() {
B b = new B();
b.i = 5;
object o = b; // implicit box
B b2 = (B)o; // explicit unbox
b2.i++; // update
o = b2; // implicit re-box
}

STL Collections vs. Managed Collections—v1

The bad news: In C++, using the STL Collections was often just as fast as writing that functionality by hand. The CLR frameworks are very fast, but they suffer from boxing and unboxing issues: everything is an object, and without template or generic support, all actions have to be checked at run time.

The good news: In the long term, you can bet that this problem will go away as generics are added to the run time. Code you deploy today will experience the speed boost without any changes. In the short term, you can use static casting to prevent the check, but this is no longer safe. I recommend using this method in tight code where performance is absolutely critical, and you've identified two or three hot spots.

Use Stack Managed Objects

In C++, you specify that an object should be managed by the stack or the heap. You can still do this in MC++, but there are restrictions you should be aware of. The CLR uses ValueTypes for all stack-managed objects, and there are limitations to what ValueTypes can do (no inheritance, for example). More information [ http://msdn.microsoft.com/en-us/library/34yytbws(printer).aspx ] is available on the MSDN Library.

Corner Case: Beware Indirect Calls Within Managed Code—v1

In the v1 run time, all indirect function calls are made natively, and therefore require a transition into unmanaged space. Any indirect function call can only be made from native mode, which means that all indirect calls from managed code need a managed-to-unmanaged transition. This is a serious problem when the table returns a managed function, since a second transition must then be made to execute the function. When compared to the cost of executing a single Call instruction, the cost is fifty- to one hundred times slower than in C++!

Fortunately, when you are calling a method that resides within a garbage-collected class, optimization removes this. However, in the specific case of a regular C++ file that has been compiled using /clr, the method return will be considered managed. Since this cannot be removed by optimization, you are hit with the full double-transition cost. Below is an example of such a case.

//////////////////////// a.h:    //////////////////////////
class X {
public:
void mf1();
void mf2();
};

typedef void (X::*pMFunc_t)();


////////////// a.cpp: compiled with /clr /////////////////
#include "a.h"

int main(){
pMFunc_t pmf1 = &X::mf1;
pMFunc_t pmf2 = &X::mf2;

X *pX = new X();
(pX->*pmf1)();
(pX->*pmf2)();

return 0;
}


////////////// b.cpp: compiled without /clr /////////////////
#include "a.h"

void X::mf1(){}


////////////// c.cpp: compiled with /clr ////////////////////
#include "a.h"
void X::mf2(){}

There are several ways to avoid this:

  • Make the class into a managed class ("__gc")
  • Remove the indirect call, if possible
  • Leave the class compiled as unmanaged code (e.g. do not use /clr)

Minimize Performance Hits—version 1

There are several operations or features that are simply more expensive in MC++ under version 1 JIT. I'll list them and give some explanation, and then we'll talk about what you can do about them.

  • Abstractions—This is an area where the beefy, slow C++ backend compiler wins heavily over the JIT. If you wrap an int inside a class for abstraction purposes, and you access it strictly as an int, the C++ compiler can reduce the overhead of the wrapper to practically nothing. You can add many levels of abstraction to the wrapper, without increasing the cost. The JIT is unable to take the time necessary to eliminate this cost, making deep abstractions more expensive in MC++.
  • Floating Point—The v1 JIT does not currently perform all the FP-specific optimizations that the VC++ backend does, making floating point operations more expensive for now.
  • Multidimensional Arrays—The JIT is better at handling jagged arrays than multidimensional ones, so use jagged arrays instead.
  • 64 bit Arithmetic—In future versions, 64-bit optimizations will be added to the JIT.

What You Can Do

At every phase of development, there are several things you can do. With MC++, the design phase is perhaps the most important area, as it will determine how much work you end up doing and how much performance you get in return. When you sit down to write or port an application, you should consider the following things:

  • Identify areas where you use multiple inheritance, templates, or deterministic finalization. You will have to get rid of these, or else leave that part of your code in unmanaged space. Think about the cost of redesigning, and identify areas that can be ported.
  • Locate performance hot spots, such as deep abstractions or virtual function calls across managed space. These will also require a design decision.
  • Look for objects that have been specified as stack-managed. Make sure they can be converted into ValueTypes. Mark the others for conversion to heap-managed objects.

During the coding stage, you should be aware of the operations that are more expensive and the options you have in dealing with them. One of the nicest things about MC++ is that you come to grips with all the performance issues up front, before you start coding: this is helpful in paring down work later on. However, there are still some tweaks you can perform while you code and debug.

Determine which areas make heavy use of floating point arithmetic, multidimensional arrays or library functions. Which of these areas are performance critical? Use profilers to pick the fragments where the overhead is costing you most, and pick which option seems best:

  • Keep the whole fragment in unmanaged space.
  • Use static casts on the library accesses.
  • Try tweaking boxing/unboxing behavior (explained later).
  • Code your own structure.

Finally, work to minimize the number of transitions you make. If you have some unmanaged code or an interop call sitting in a loop, make the entire loop unmanaged. That way you'll only pay the transition cost twice, rather than for each iteration of the loop.

Additional Resources

Related topics on performance in the .NET Framework include:

Watch for future articles currently under development, including an overview of design, architectural and coding philosophies, a walkthrough of performance analysis tools in the managed world, and a performance comparison of .NET to other enterprise applications available today.

Appendix: Cost of Virtual Calls and Allocations

Call Type# Calls/sec
ValueType Non-Virtual Call809971805.600
Class Non-Virtual Call268478412.546
Class Virtual Call109117738.369
ValueType Virtual (Obj Method) Call3004286.205
ValueType Virtual (Overridden Obj Method) Call2917140.844
Load Type by Newing (Non-Static)1434.720
Load Type by Newing (Virtual Methods)1369.863

Note The test machine is a PIII 733Mhz, running Windows 2000 Professional with Service Pack 2.

This chart compares the cost associated with different types of method calls, as well as the cost of instantiating a type that contains virtual methods. The higher the number, the more calls/instantiations-per-second can be performed. While these numbers will certainly vary on different machines and configurations, the relative cost of performing one call over another remains significant.

  • ValueType Non-Virtual Call: This test calls an empty non-virtual method contained within a ValueType.
  • Class Non-Virtual Call: This test calls an empty non-virtual method contained within a class.
  • Class Virtual Call: This test calls an empty virtual method contained within a class.
  • ValueType Virtual (Obj Method) Call: This test calls ToString() (a virtual method) on a ValueType, which resorts to the default object method.
  • ValueType Virtual (Overridden Obj Method) Call: This test calls ToString() (a virtual method) on a ValueType that has overridden the default.
  • Load Type by Newing (Static): This test allocates space for a class with only static methods.
  • Load Type by Newing (Virtual Methods): This test allocates space for a class with virtual methods.

One conclusion you can draw is that Virtual Function calls are about two times as expensive as regular calls when you're calling a method in a class. Bear in mind that calls are cheap to begin with, so I wouldn't remove all virtual calls. You should always use virtual methods when it makes sense to do so.

  • The JIT cannot inline virtual methods, so you lose a potential optimization if you get rid of non-virtual methods.
  • Allocating space for an object that has virtual methods is slightly slower than the allocation for an object without them, since extra work must be done to find space for the virtual tables.

Notice that calling a non-virtual method within a ValueType is more than three times as fast as in a class, but once you treat it as a class you lose terribly. This is characteristic of ValueTypes: treat them like structs and they're lighting fast. Treat them like classes and they're painfully slow. ToString() is a virtual method, so before it can be called, the struct must be converted to an object on the heap. Instead of being twice as slow, calling a virtual method on a ValueType is now eighteen times as slow! The moral of the story? Don't treat ValueTypes as classes.

No comments: