Saturday, February 21, 2004 9:28 AM
Hosting
My prior three blogs were supposed to be on Hosting. Each time I got sidetracked, first on Exceptions, then on Application Compatibility and finally on Finalization. I refuse to be sidetracked this time… much.
Also, I need to explain why it’s taken so long to get this blog out. Part of the reason is vacation. I spent Thanksgiving skiing in Whistler. Then I took a quick side trip to Scottsdale for a friend’s surprise birthday party and to visit my parents. Finally, I spent over three weeks on Maui getting a break from the Seattle winter.
Another reason for the delay is writer’s block. This topic is so huge. The internal specification for the Whidbey Hosting Interfaces is over 100 pages. And that spec only covers the hosting interfaces themselves. There are many other aspects of hosting, like how to configure different security policy in different AppDomains, or how to use COM or managed C++ to stitch together the unmanaged host with the managed applications. There’s no way I can cover the entire landscape.
Anyway, here goes.
Mostly I was tourist overhead at the PDC. But one of the places I tried to pay for my ticket was a panel on Hosting. The other panelists included a couple of Program Managers from the CLR, another CLR architect, representatives from Avalon / Internet Explorer, SQL Server, Visual Studio / Office, and – to my great pleasure – a representative from IBM for DB2.
One thing that was very clear at that panel is that the CLR team has done a poor job of defining what hosting is and how it is done. Depending on your definition, hosting could be:
- Mixing unmanaged and managed code in the same process.
- Running multiple applications, each in its own specially configured AppDomain.
- Using the unmanaged hosting interfaces described in mscoree.idl.
- Configuring how the CLR runs in the process, like disabling the concurrent GC through an application config file.
Even though the hosting interfaces described in mscoree.idl are a small part of what could be hosting, I’m going to concentrate on those interfaces.
In V1 and V1.1 of the CLR, we provided some APIs that allowed an unmanaged process host to exercise some limited control over the CLR. This limited control included the ability to select the version of the CLR to load, the ability to create and configure AppDomains from unmanaged code, access to the ThreadPool, and a few other fundamental operations.
Also, we knew we eventually needed to support hosts that manage all the memory in the process and that use non-preemptive scheduling of tasks and perhaps even light-weight fibers rather than OS threads. So we added some rudimentary (and alas inadequate) APIs for fibers and memory control. That's what invariably happens when you add features you think you will eventually need, rather than features that someone is actually using and giving feedback on.
If you look closely at the V1 and V1.1 hosting APIs, you really see what we needed to support ASP.NET and a few other scenarios, like ones involving EnterpriseServices, Internet Explorer or VSA, plus some rudimentary guesses at what we might need to coexist properly inside SQL Server.
Obviously in Whidbey we have refined those guesses about SQL Server into hard requirements. And we tried very hard to generalize each extension that we added for SQL Server, so that it would be applicable to many other hosting scenarios. In fact, it’s amazing that the SQL Server team still talks to us – whenever they ask for anything, we always say No and give them something that works a lot better for other hosts and not nearly so well for SQL Server.
In our next release (Whidbey), we’ve made a real effort to clean up the existing hosting support and to dramatically extend it for a number of new scenarios. Therefore I’m not going to spend any more time discussing those original V1 & V1.1 hosting APIs, except to the extent that they are still relevant to the following Whidbey hosting discussion.
Also I’m going to skip over all the general introductory topics like “When to host” since they were the source of my writer’s block. Instead, I’m going to leap into some of the more technically interesting topics. Maybe after we’ve studied various details we can step back and see some general guidelines.
Threading and Synchronization
One of the most interesting challenges we struggled with during Whidbey was the need to cooperate with SQL Server’s task scheduling. SQL Server can operate in either thread mode or fiber mode. Most customers run in thread mode, but SQL Server can deliver its best numbers on machines with lots of CPUs when it’s running in fiber mode. That gap between thread and fiber mode has been closing as the OS addresses issues with its own preemptive scheduler.
A few years ago, I ran some experiments to see how many threads I could create in a single process. Not surprisingly, after almost 2000 threads I ran out of address space in the process. That’s because the default stack size on NT is 1 MB and the default user address space is 2 GB. (Starting with V1.1, the CLR can load into LARGEADDRESSAWARE processes and use up to 3 GB of address space). If you shrink the default stack size, you can create more than 2000 threads before hitting the address space limit. I see stack sizes of 256 KB in the SQL Server process on my machine, clearly to reduce this impact on process address space.
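To make the stack arithmetic concrete, here's a minimal C# sketch of that kind of experiment. It leans on the Thread constructor overload that takes a maximum stack size (a Whidbey-era addition); my original experiments weren't written this way, so treat it strictly as an illustration.

    using System;
    using System.Threading;

    class StackSizeExperiment
    {
        static void Main()
        {
            // Reserve only 256 KB of stack per thread instead of the 1 MB default,
            // so far more threads fit into the 2 GB (or 3 GB) user address space.
            Thread[] threads = new Thread[2000];
            for (int i = 0; i < threads.Length; i++)
            {
                threads[i] = new Thread(WaitForever, 256 * 1024);
                threads[i].IsBackground = true;
                threads[i].Start();
            }
            Console.WriteLine("Created {0} threads", threads.Length);
        }

        static void WaitForever()
        {
            // Park the thread: it consumes address space but no CPU.
            Thread.Sleep(Timeout.Infinite);
        }
    }

With 256 KB reservations, roughly four times as many threads fit into the same address space as with the 1 MB default, which is consistent with the stack sizes I see in the SQL Server process.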
Of course, address space isn’t the only limit you can hit. Even on the 4 CPU server box I was experimenting with, the real memory on the system was inadequate for the working set being used. With enough threads, I exceeded real memory and experienced paging. (Okay, it was actually thrashing). But nowadays there are plenty of servers with several GB of real – and real cheap – memory, so this doesn’t have to be an issue.
In my experiments, I simulated server request processing using an artificial work load that combined blocking, allocation, CPU-intensive computation, and a reasonable memory reference set using a mixture of both shared and per-request allocations. In the first experiments, all the threads were ready to run and all of them had equal priority. The result of this was that all threads were scheduled in a round-robin fashion on those 4 CPUs. Since the Windows OS schedules threads preemptively, each thread would execute until it either needed to block or it exceeded its quantum. With hundreds or even thousands of threads, each context switch was extremely painful. That’s because most of the memory used by that thread was so cold in the cache, having been fully displaced by the hundreds of threads that ran before it.
As we all know, modern CPUs are getting faster and faster at raw computation. And they have more and more memory available to them. But access to that memory is getting relatively slower each year. By that, I mean that a single memory access costs the equivalent of an increasing number of instructions. One of the ways the industry tries to mitigate that relative slowdown is through a cache hierarchy. Modern X86 machines have L1, L2 and L3 levels of cache, ordered from fastest and smallest to slowest and largest.
(Other ways we try to mitigate the slowdown are by increasing the locality of our data structures and by pre-fetching. If you are a developer, hopefully you already know about locality. In the unmanaged world, locality is entirely your responsibility. In the managed world, you get some locality benefits from our environment – notably the garbage collector, but also the auto-layout of the class loader. Yet even in managed code, locality remains a major responsibility of each developer).
Unfortunately, context switching between such a high number of threads will largely invalidate all those caches. So I changed my simulated server to be smarter about dispatching requests. Instead of allowing 1000 requests to execute concurrently, I would block 996 of those requests and allow 4 of them to run. This makes life pretty easy for the OS scheduler! There are four CPUs and four runnable threads. It’s pretty obvious which threads should run.
Not only will the OS keep those same four threads executing, it will likely keep them affinitized to the same CPUs. When a thread moves from one CPU to another, the new CPU needs to fill all the levels of cache with data appropriate to the new thread. However, if we can remain affinitized, we can enjoy all the benefits of a warm cache. The OS scheduler attempts to run threads on the CPU that last ran them (soft affinity). But in practice this soft affinity is too soft. Threads tend to migrate between CPUs far more than we would like. When the OS only has 4 runnable threads for its 4 CPUs, the amount of migration seemed to drop dramatically.
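To make the dispatching idea concrete, here's a hedged sketch of a gate that admits only one runnable request per CPU; every other request thread sits blocked on the semaphore instead of competing for the OS scheduler. The class and method names are hypothetical.

    using System;
    using System.Threading;

    class GatedServer
    {
        // One runnable request per CPU; everyone else waits here.
        static readonly Semaphore gate =
            new Semaphore(Environment.ProcessorCount, Environment.ProcessorCount);

        // Called on the thread that carries each incoming request.
        public static void ProcessRequest(ThreadStart work)
        {
            gate.WaitOne();       // block until a "CPU slot" frees up
            try
            {
                work();           // run the actual request
            }
            finally
            {
                gate.Release();   // let the next blocked request run
            }
        }
    }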
Incidentally, Windows also supports hard affinity. If a thread is hard affinitized to a CPU, it either runs on that CPU or it doesn’t run. The CLR can take advantage of this when the GC is executing in its server mode. But you have to be careful not to abuse hard affinity. You certainly don’t want to end up in a situation where all the “ready to run” threads are affinitized to one CPU and all the other CPUs are necessarily stalled.
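For completeness, here's what hard affinity looks like from managed code, via ProcessThread.ProcessorAffinity in System.Diagnostics. This is purely to show the mechanism; as I said, carelessly hard-affinitizing threads is an easy way to stall CPUs.

    using System;
    using System.Diagnostics;

    class AffinityExample
    {
        static void Main()
        {
            // Hard-affinitize every OS thread in this process to CPUs 0 and 1.
            // A thread restricted this way either runs on those CPUs or not at all.
            Process current = Process.GetCurrentProcess();
            foreach (ProcessThread t in current.Threads)
            {
                t.ProcessorAffinity = (IntPtr)0x3;   // bit mask: CPU0 | CPU1
            }
        }
    }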
Also, it’s worth mentioning the impact of hyper-threading or NUMA on affinity. On traditional SMP, our choices were pretty simple. Either our thread ran on its ideal processor, where we are most likely to see all the benefits of a warm cache, or it ran on some other processor. All those other processor choices can be treated as equally bad for performance. But with hyper-threading or NUMA, some of those other CPUs might be better choices than others. In the case of hyper-threading, some logical CPUs are combined into a single physical CPU and so they share access to the same cache memory at some level in the cache hierarchy. For NUMA, the CPUs may be arranged in partitions (e.g. hemispheres on some machines), where each partition has faster access to some memory addresses and slower access to other addresses. In all these cases, there’s some kind of gradient from the very best CPU(s) for a thread to execute on, down to the very worst CPU(s) for that particular thread. The world just keeps getting more interesting.
Anyway, remember that my simulated server combined blocking with other operations. In a real server, that blocking could be due to a web page making a remote call to get rows from a database, or perhaps it could be blocking due to a web service request. If my server request dispatcher only allows 4 requests to be in flight at any time, such blocking will be a scalability killer. I would stall a CPU until my blocked thread is signaled. This would be intolerable.
Many servers address this issue by releasing some multiple of the ideal number of requests simultaneously. If I have 4 CPUs dedicated to my server process, then 4 requests is the ideal number of concurrent requests. If there’s “moderate” blocking during the processing of a typical request, I might find that 8 concurrent requests and 8 threads is a good tradeoff between more context switching and not stalling any CPUs. If I pick too high of a multiple over the number of CPUs, then context switching and cache effects will hurt my performance. If I pick too low a multiple, then blocking will stall a CPU and hurt my performance.
If you look at the heuristics inside the managed ThreadPool, you’ll find that we are constantly monitoring the CPU utilization. If we notice that some CPU resources are being wasted, we may be starving the system by not doing enough work concurrently. When this is detected, we are likely to release more threads from the ThreadPool in order to increase concurrency and make better use of the CPUs. This is a decent heuristic, but it isn’t perfect. For instance, CPU utilization is “backwards looking.” You actually have to stall a CPU before we will notice that more work should be executed concurrently. And by the time we’ve injected extra threads, the stalling situation may already have passed.
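The heuristic itself is buried inside the CLR, but the knobs around it are public. A hedged sketch of poking at them (SetMinThreads shipped after V1, so consider this a Whidbey-era illustration):

    using System;
    using System.Threading;

    class PoolTuning
    {
        static void Main()
        {
            int worker, io;

            // How many pool threads could still be dispatched right now?
            ThreadPool.GetAvailableThreads(out worker, out io);
            Console.WriteLine("available: {0} worker, {1} I/O", worker, io);

            // If a workload is known to block early, raising the minimum lets
            // the pool release threads immediately instead of waiting for the
            // backwards-looking CPU-utilization heuristic to notice the stall.
            ThreadPool.SetMinThreads(Environment.ProcessorCount * 2,
                                     Environment.ProcessorCount * 2);
        }
    }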
The OS has a better solution to this problem. IO Completion Ports have a direct link to the blocking primitives in Win32. When a thread is processing a work item from a completion port, if that thread blocks efficiently through the OS, then the blocking primitive will notify the completion port that it should release another thread. (Busy waiting instead of efficient blocking can therefore have a substantial impact on the amount of concurrency in the process). This feedback mechanism with IO Completion Ports is far more immediate and effective than the CLR's heuristic based on CPU utilization. But in fairness I should point out that if a managed thread performs managed blocking via any of the managed blocking primitives (a contended Monitor.Enter, WaitHandle.WaitOne/Any/All, Thread.Join, GC.WaitForPendingFinalizers, etc.), then we have a similar feedback mechanism. We just don't have hooks into the OS, so we cannot track all the blocking operations that occur in unmanaged code.
Of course, in my simulated server I didn’t have to worry about “details” like how to track all OS blocking primitives. Instead, I postulated a closed world where all blocking had to go through APIs exposed by my server. This gave me accurate and immediate information about threads either beginning to block or waking up from a blocking operation. Given this information, I was able to tweak my request dispatcher so it avoided any stalling by injecting new requests as necessary.
Although it’s possible to completely prevent stalling in this manner, it’s not possible to prevent context switches. Consider what happens on a 1 CPU machine. We release exactly one request which executes on one thread. When that thread is about to block, we release a second thread. So far, it’s perfect. But when the first thread resumes from its blocking operation, we now have two threads executing concurrently. Our request dispatcher can “retire” one of those threads as soon as it’s finished its work. But until then we have two threads executing on a single CPU and this will impact performance.
I suppose we could try to get ruthless in this situation, perhaps by suspending one of the threads or reducing its priority. In practice, it’s never a good idea to suspend an executing thread. If that thread holds any locks that are required by other concurrent execution, we may have triggered a deadlock. Reducing the priority might help and I suspect I played around with that technique. To be honest, I can’t remember that far back.
We’ll see that SQL Server can even solve this context switching problem.
Oh yeah, SQL Server
So what does any of this have to do with SQL Server?
Not surprisingly, the folks who built SQL Server know infinitely more than I do about how to get the best performance out of a server. And when the CLR is inside SQL Server, it must conform to their efficient design. Let's look at their thread mode first; fiber mode is really just a refinement of it.
Incoming requests are carried on threads. SQL Server handles a lot of simultaneous requests, so there are a lot of threads in the process. With normal OS “free for all” scheduling, this would result in way too many context switches, as we have seen. So instead those threads are affinitized to a host scheduler / CPU combination. The scheduler tries to ensure that there is one unblocked thread available at any time. All the other threads are ideally blocked. This gives us the nirvana of 100% busy CPUs and minimal context switches. To achieve this nirvana, all the blocking primitives need to cooperate with the schedulers. Even if an event has been signaled and a thread is considered by the application to be “ready to run”, the scheduler may not choose to release it, if the scheduler’s corresponding CPU is already executing another thread. In this manner, the blocking primitive and the scheduler are tightly integrated.
When I built my simulated server, I was able to achieve an ideal “closed world” where all the synchronization primitives were controlled by me. SQL Server attempts the same thing. If a thread needs to block waiting for a data page to be read, or for a page or row latch to be released, that blocking occurs through the SQL Server scheduler. This guarantees that exactly one thread is available to run on each CPU, as we’ve seen.
Of course, execution of managed code also hits various blocking points. Monitor.Enter (‘lock’ in C# and ‘SyncLock’ in VB.NET) is a typical case. Other cases include waiting for a GC to complete, waiting for class construction or assembly loading or type loading to occur, waiting for a method to be JITted, or waiting for a remote call or web service to return. For SQL Server to hit their performance goals and to avoid deadlocks, the CLR must route all of these blocking primitives to SQL Server (or any other similar host) through the new Whidbey hosting APIs.
Leaving the Closed World
But what about synchronization primitives that are used for coordination with unmanaged code and which have precise semantics that SQL Server cannot hope to duplicate? For example, WaitHandle and its subtypes (like Mutex, AutoResetEvent and ManualResetEvent) are thin wrappers over the various OS waitable handles. These primitives provide atomicity guarantees when you perform a WaitAll operation on them. They have special behavior related to message pumping. And they can be used to coordinate activity across multiple processes, in the case of named primitives. It’s unrealistic to route operations on WaitHandle through the hosting APIs to some equivalent host-provided replacements.
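To spell out the OS-specific semantics in question, here's a small hedged example. The atomic WaitAll and the machine-wide named mutex are exactly the behaviors a host scheduler can't reasonably emulate; the mutex name is made up.

    using System;
    using System.Threading;

    class OsSemantics
    {
        static void Main()
        {
            // WaitAll acquires both handles atomically -- we get both or we keep waiting.
            Mutex a = new Mutex();
            Mutex b = new Mutex();
            WaitHandle.WaitAll(new WaitHandle[] { a, b });
            a.ReleaseMutex();
            b.ReleaseMutex();

            // A named mutex coordinates across processes, which is again outside
            // anything a single host process could fake with its own primitives.
            Mutex crossProcess = new Mutex(false, "MyApp.SingleInstance");
            if (crossProcess.WaitOne(0, false))
            {
                Console.WriteLine("This process holds the machine-wide lock.");
                crossProcess.ReleaseMutex();
            }
        }
    }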
This issue with WaitHandle is part of a more general problem. What happens if I PInvoke from managed code to an OS service like CoInitialize or LoadLibrary or CryptEncrypt? Do those OS services block? Well, I know that LoadLibrary will have to take the OS loader lock somewhere. I could imagine that CoInitialize might need to synchronize something, but I have no real idea. One thing I am sure of: if any blocking happens, it isn’t going to go through SQL Server’s blocking primitives and coordinate with their host scheduler. The idealized closed world that SQL Server needs has just been lost.
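For concreteness, this is the flavor of PInvoke I mean. The point is only that whatever locking LoadLibrary does internally happens on OS primitives the host's scheduler never sees:

    using System;
    using System.Runtime.InteropServices;

    class NativeCalls
    {
        // Any blocking inside this call (the OS loader lock, for instance)
        // bypasses the host's blocking primitives entirely.
        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        static extern IntPtr LoadLibrary(string fileName);

        static void Main()
        {
            IntPtr module = LoadLibrary("user32.dll");
            Console.WriteLine(module == IntPtr.Zero ? "load failed" : "loaded");
        }
    }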
The solution here is to alert the host whenever a thread “leaves the runtime”. In other words, if we are PInvoking out, or making a COM call, or the thread is otherwise transitioning out to some unknown unmanaged execution, we tell the host that this is happening. If the host is tracking threads as closely as SQL Server does, it can use this event to disassociate the thread from the host scheduler and release a new thread. This ensures that the CPU stays busy. That’s because even if the disassociated thread blocks, we’ve released another thread. This newly released thread is still inside our closed world, so it will notify before it blocks so we can guarantee that the CPU won’t stall.
Wait a second. The CLR did a ton of work to re-route most of its blocking operations through the host. But we could have saved almost that entire ton of engineering effort if we had just detached the thread from the host whenever SQL Server called into managed code. That way, we could freely block and we wouldn’t disrupt the host’s scheduling decisions.
This is true, but it won’t perform as well as the alternative. Whenever a thread disassociates from a host scheduler, another thread must be released. This guarantees that the CPU is busy, but it has sacrificed our nirvana of only having a single runnable thread per CPU. Now we’ve got two runnable threads for this CPU and the OS will be preemptively context-switching between them as they run out of quantum.
If a significant amount of the processing inside a host is performed through managed code, this would have a serious impact on performance.
Indeed, if a significant amount of the processing inside a host is performed in unmanaged code, called via PInvokes or COM calls or other mechanisms that “leave the runtime”, this too can have a serious impact on performance. But, for practical purposes, we expect most execution to remain inside the host or inside managed code. The amount of processing that happens in arbitrary unmanaged code should be low, especially over time as our managed platform grows to fill in some of the current gaps.
Of course, some PInvokes or COM calls might be to services that were exported from the host. We certainly don’t want to disassociate from the host scheduler every time the in-process ADO provider performs a PInvoke back to SQL Server to get some data. This would be unnecessary and expensive. So there’s a way for the host to control which PInvoke targets perform a “leave runtime” / “return to runtime” pair and which ones are considered to remain within the closed world of our integrated host + runtime.
Even if we were willing to tolerate the substantial performance impact of considering all of the CLR to be outside the host’s closed world (i.e. we disassociated from the host’s scheduler whenever we ran managed code), this approach would be inadequate when running in fiber mode. That’s because of the nasty effects which thread affinity can have on a fiber-based system.
Fiber Mode
As we’ve seen, SQL Server and other “extreme” hosts can ensure that at any time each CPU has only a single thread within the closed world that is ready to run. But when SQL Server is in thread mode, there are still a large number of threads that aren’t ready to run. It turns out that all those blocked threads impose a modest cost upon the OS preemptive scheduler. And that cost becomes an increasing consideration as the number of CPUs increases. For 1, 2, 4 and probably 8 CPU machines, fiber mode isn’t worth the headaches we’re about to discuss. But by the time you get to a larger machine, you might achieve something like a 20% throughput boost by switching to fiber mode. (I haven’t seen real numbers in a year or two, so please take that 20% as a vague ballpark).
Fiber mode simply eliminates all those extra threads from any consideration by the OS. If you stay within the idealized nirvana (i.e. you don’t perform a “leave runtime” operation), there is only one thread for each host scheduler / CPU. Of course, there are many stacks / register contexts and each such stack / register context corresponds to an in-flight request. When a stack is ready to run, the single thread switches away from whatever stack it was running and switches to the new stack. But from the perspective of the OS scheduler, it just keeps running the only thread it knows about.
So in both thread mode and fiber mode, SQL Server uses non-preemptive host scheduling of these tasks. This scheduling happens in user mode, which is a distinct advantage over the OS preemptive scheduling which happens in kernel mode. The only difference is whether the OS scheduler is aware of all the tasks on the host scheduler, or whether they all look like a single combined thread – albeit with different stacks and register contexts.
But the impact of this difference is significant. First, it means that there is an M:N relationship between stacks (logical CLR threads) and OS threads. This is M:N because multiple stacks will execute on a single thread, and because the specially nominated thread that carries those stacks can change over time. This change in the nominated thread occurs as a consequence of those “leave runtime” calls. Remember that when a thread leaves the runtime, we inform the host which disassociates the thread from the host scheduler. A new thread is then created or obtained from a short list of already-created threads. This new thread then picks up the next stack that is ready to run. The effect is that this stack has migrated from the original disassociated thread to the newly nominated thread.
This M:N relationship between stacks and OS threads causes problems everywhere that thread affinity would normally occur. I’ve already mentioned CPU affinity when discussing how threads are associated with CPUs. But now I’m talking about a different kind of affinity. Thread affinity is the association between various programmatic operations and the thread that these operations must run on. For example, if you take an OS critical section by calling EnterCriticalSection, the resulting ownership is tied to your thread. Sometimes developers say that the OS critical section is scoped to your thread. You must call LeaveCriticalSection from that same thread.
None of this is going to work properly if your logical thread is asynchronously and apparently randomly migrating between different physical threads. You’ll successfully take the critical section on one logical thread. If you attempt to recursively acquire this critical section, you will deadlock if a migration has intervened. That’s because it will look like a different physical thread is actually the owner.
Imagine writing some hypothetical code inside the CLR:
    EnterCriticalSection(pCS);
    if (pGlobalBlock == NULL)
        pGlobalBlock = Alloc(count);
    LeaveCriticalSection(pCS);
Obviously any real CLR code would be full of error handling, including a ‘finally’ clause to release the lock. And we don’t use OS critical sections directly since we typically reflect them to an interested host as we’ve discussed. And we instrument a lot of this stuff, including spinning during lock acquisition. And we wrap the locks with lots of logic to avoid deadlocks, including GC-induced deadlocks. But let’s ignore all of the goop that would be necessary for real CLR code.
It turns out that the above code has a thread affinity problem. Even though SQL Server’s fiber scheduling is non-preemptive, scheduling decisions can still occur whenever we call into the host. For reasons that I’ll explain later, all memory allocations in the CLR have the potential to call into the host and result in scheduling. Obviously most allocations will be satisfied locally in the CLR without escalation to the host. And most escalations to the host still won’t cause a scheduling decision to occur. But from a correctness perspective, all allocations have the potential to cause scheduling.
Other places where thread affinity can bite us include:
- The OS Mutex and the managed System.Threading.Mutex wrapper.
- LoadLibrary and DllMain interactions. As I've explained in my blog entry on Shutdown, DllMain notifications occur on a thread which holds the OS loader lock.
- TLS (thread local storage). It's worth mentioning that, starting with Windows Server 2003, there are new FLS (fiber local storage) APIs. These APIs allow you to associate state with the logical rather than the physical thread. When a fiber is associated with a thread for execution (SwitchToFiber), the FLS is automatically moved from the fiber onto the thread. For managed TLS, we now move this automatically. But we cannot do this unconditionally for all the unmanaged TLS.
- Thread culture or locale, the impersonation context or user identity, the COM+ transaction context, etc. In some sense, these are just special cases of thread local storage. However, for historical reasons it isn't possible to solve these problems by moving them to FLS.
- Taking control of a thread for GC, Abort, etc. via the OS SuspendThread() service.
- Any use of ThreadId or Thread Handle. This includes all debugging.
- "Hand-rolled" locks that we cannot discover or reason about, and which you have inadvertently based on the physical OS thread rather than the logical thread or fiber.
- Various PInvokes or COM calls that might end up in unmanaged code with affinity requirements. For instance, MSHTML can only be called on STA threads which are necessarily affinitized. Of course, there is no list of all the APIs that have odd threading behavior. It's a minefield out there.
Solving affinity issues is relatively simple. The hard part is identifying all the places. Note that the last two bullet items are actually the application’s responsibility to identify. Some application code might appear to execute correctly when logical threads and OS threads are 1:1. But when a host creates an M:N relationship, any latent application bugs will be exposed.
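As an illustration of the kind of latent bug I mean, here's a hypothetical hand-rolled lock keyed off the physical OS thread id. It appears to work while logical and physical threads are 1:1, and quietly breaks under an M:N host; the fix is to key off the logical thread (Thread.CurrentThread.ManagedThreadId in Whidbey) or to identify the lock to the host, as discussed below.

    using System;
    using System.Threading;

    class HandRolledLock
    {
        int ownerThreadId;     // physical OS thread id -- this is the bug
        int recursionCount;

        public void Enter()
        {
            // AppDomain.GetCurrentThreadId returns the *physical* thread id.
            // Under an M:N host the same logical task can come back on a
            // different physical thread, so this recursion check silently breaks.
            int me = AppDomain.GetCurrentThreadId();
            if (ownerThreadId == me)
            {
                recursionCount++;
                return;
            }
            while (Interlocked.CompareExchange(ref ownerThreadId, me, 0) != 0)
                Thread.Sleep(1);
            recursionCount = 1;
        }

        public void Exit()
        {
            if (--recursionCount == 0)
                ownerThreadId = 0;
        }
    }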
In many cases, the easiest solution to a thread affinity issue is to disassociate the thread from the host’s scheduler until the affinity is no longer required. The hosting APIs provide for this, and we’ve taken care of it for you in many places – like System.Threading.Mutex.
Before we finish our discussion of locking, there's one more aspect worth mentioning. In an earlier blog, I mentioned the limited deadlock detection and deadlock breaking which the CLR performs when executing class constructors or JITting.
Except for this limited case, the CLR doesn’t concern itself with application-level deadlocks. If you write some managed code that takes a set of locks in random order, resulting in a potential deadlock, we consider that to be your application bug. But some hosts may be more helpful. Indeed, SQL Server has traditionally detected deadlocks in all data accesses. When a deadlock occurs, SQL Server selects a victim and aborts the corresponding transaction. This allows the other requests implicated in the deadlock to proceed.
With the new Whidbey hosting APIs, it's possible for the host to walk all contended managed locks and obtain a graph of the participants. This support extends to locking through our Monitor and our ReaderWriterLock. Clearly, an application could perform locking through other means. For example, an AutoResetEvent can be used to simulate mutual exclusion. But it's not possible for such locks to be included in the deadlock algorithms, since there isn't a strong notion of lock ownership that we can use.
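Here's a sketch of what I mean by simulating mutual exclusion with an AutoResetEvent. Because the event records no owner, there's nothing a deadlock detector could put into a wait graph:

    using System.Threading;

    class EventBasedLock
    {
        // Signaled means "the lock is free".  Nothing records *who* holds it.
        readonly AutoResetEvent available = new AutoResetEvent(true);

        public void Enter()
        {
            available.WaitOne();   // any thread may wait...
        }

        public void Exit()
        {
            available.Set();       // ...and any thread may release
        }
    }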
Once the host has selected a deadlock victim, it must cause that victim to abort its forward progress somehow. If the victim is executing managed code, some obvious ways to do this include failing the lock attempt (since the thread is necessarily blocking), aborting the thread, or even unloading the AppDomain. We’ll return to the implications of this choice in the section on Reliability below.
Finally, it’s interesting to consider how one might get even better performance than what SQL Server has achieved. We’ve seen how fiber mode eliminates all the extra threads, by multiplexing a number of stacks / register contexts onto a single thread. What happens if we then eliminate all those fibers? For a dedicated server, we can achieve even better performance by forcing all application code to maintain its state outside of a thread’s stack. This allows us to use a single thread per CPU which executes user requests by processing them on its single dedicated stack. All synchronous blocking is eliminated by relying on asynchronous operations. The thread never yields while holding its stack pinned. The amount of memory required to hold an in-flight request will be far less than a 256 KB stack reservation. And the cost of processing an asynchronous completion through polling can presumably be less than the cost of a fiber context switch.
If all you care about is performance, this is an excellent way to build a server. But if you need to accommodate 3rd party applications inside the server, this approach is questionable. Most developers have a difficult time breaking their logic into segments which can be separately scheduled with no stack dependencies. It’s a tedious programming model. Also, the underlying Windows platform still contains a lot of blocking operations that don’t have asynchronous variants available. WMI is one example.
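To give a feel for that programming model, here's a hedged sketch of a single request step written in the continuation-passing style this design forces on you. Every scrap of state has to live in the object, because no stack survives across the IO:

    using System;
    using System.IO;
    using System.Text;

    class StacklessRequest
    {
        FileStream file;
        byte[] buffer = new byte[4096];

        public void Start(string path)
        {
            // Kick off the read and return immediately -- no thread (or stack)
            // is held hostage while the IO is in flight.
            file = new FileStream(path, FileMode.Open, FileAccess.Read,
                                  FileShare.Read, 4096, true /* async */);
            file.BeginRead(buffer, 0, buffer.Length, OnReadComplete, null);
        }

        void OnReadComplete(IAsyncResult ar)
        {
            // The continuation runs later, on whatever thread the completion
            // arrives on.
            int bytes = file.EndRead(ar);
            Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, bytes));
            file.Close();
        }
    }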
Memory Management
Servers must not page.
Like all rules, this one isn’t strictly true. It is actually okay to page briefly now and then, when the work load transitions from one steady state to another. But if you have a server that is routinely paging, then you have driven that server beyond its capacity. You need to reduce the load on the server or increase the server’s memory capacity.
At the same time, it's important to make effective use of the memory capacity of a server. Ideally, a database would store the entire database contents in memory. This would allow it to avoid touching the disk, except to write the durable log that protects it from data loss and inconsistency in the face of catastrophic failure. Of course, the 2 or 3 GB limit of Win32 is far too restrictive for most interesting databases. (SQL Server can use AWE to escape this limit, at some cost). And even the address limits of Win64 are likely to be exceeded by databases before long. That's because Win64 does not give you a full 64 bits of addressing and databases are already heading into the petabytes.
So a database needs to consider all the competing demands for memory and make wise decisions about which ones to satisfy. Historically, those demands have included the buffer cache which contains data pages, compiled query plans, and all those thread stacks. When the CLR is loaded into the process, significant additional memory is required for the GC heap, application code, and the CLR itself. I’m not sure what techniques SQL Server uses to trade off the competing demands for memory. Some servers carve memory up based on fixed ratios for the different broad uses, and then rely on LRU within each memory chunk. Other servers assign a cost to each memory type, which indicates how expensive it would be to regenerate that memory. For example, in the case of a data page, that cost is an IO.
Some servers use elaborate throttling of inbound requests, to keep the memory load reasonable. This is relatively easy to do when all requests are comparable in terms of their memory and CPU requirements. But if some queries access a single database page and other queries touch millions of rows, it would be hard to factor this into a throttling decision that is so far upstream from the query processor. Instead, SQL Server tends to accept a large number of incoming requests and process them “concurrently.” We’ve already seen in great detail why this concurrent execution doesn’t actually result in preemptive context switching between all the corresponding tasks. But it is still the case that each request will hold onto some reference set of memory, even when the host’s non-preemptive scheduler has that request blocked.
If enough requests are blocked while holding onto significant unshared memory, then the server process may find itself over-committed on memory. At this point, it could page – which hurts performance. Or it could kill some of the requests and free up the resources they are holding onto. This is an unfortunate situation, because we’ve presumably already devoted resources like the CPU to get the request to its current state of partial completion. If we throw away the request, all that work was wasted. And the client is likely to resubmit the request, so we will have to repeat all that work soon.
Nevertheless, if the server is over-committed and it’s not practical to recover more memory by e.g. shrinking the number of pages devoted to the buffer cache, then killing in-flight requests is a sound strategy. This is particularly reasonable in database scenarios, since the transactional nature of database operations means that we can kill requests at any time and with impunity.
Unfortunately, the world of arbitrary managed execution has no transactional foundation we can rely on. We’ll pick up this issue again below, in the section on Reliability.
It should be obvious that, if SQL Server or any other host is going to make wise decisions about memory consumption on a “whole process” basis, that host needs to know exactly how much memory is being used and for what purposes. For example, before the host unloads an AppDomain as a way of backing out of an over-committed situation, the host needs some idea of how many megabytes this unload operation is likely to deliver.
In the reverse direction, the host needs to be able to masquerade as the operating system. For instance, the CLR’s GC monitors system memory load and uses this information in its heuristics for deciding when to schedule a collection. The host needs a way to influence these collection decisions.
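The host-side mechanism for this lives in the unmanaged hosting interfaces, which I'm not going to show here. The closest public, managed analogue is the Whidbey GC.AddMemoryPressure / RemoveMemoryPressure pair, which lets code tell the GC about memory the GC can't see; a hedged sketch:

    using System;
    using System.Runtime.InteropServices;

    class UnmanagedBuffer : IDisposable
    {
        readonly IntPtr block;
        readonly long size;

        public UnmanagedBuffer(long bytes)
        {
            size = bytes;
            block = Marshal.AllocHGlobal((IntPtr)bytes);

            // Tell the GC that this small object really costs 'bytes' of memory,
            // so its collection heuristics see the true memory load.
            GC.AddMemoryPressure(bytes);
        }

        public void Dispose()
        {
            Marshal.FreeHGlobal(block);
            GC.RemoveMemoryPressure(size);
        }
    }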
SQL Server and ASP.NET
Clearly a lot of work went into threading, synchronization and memory management in SQL Server. One obvious question to ask is how ASP.NET compares. They are both server products from Microsoft and they both execute managed code. Why didn’t we need to add all this support to the hosting interfaces in V1 of the CLR, so we could support ASP.NET?
I think it’s fair to say that ASP.NET took a much simpler approach to the problem of building a scalable server. To achieve efficient threading, they rely on the managed ThreadPool’s heuristics to keep the CPUs busy without driving up too many context switches. And since the bulk of memory allocations are due to the application, rather than the ASP.NET infrastructure (in other words, they aren’t managing large shared buffer pools for data pages), it’s not really possible for ASP.NET to act as a broker for all the different memory consumers. Instead, they just monitor the total memory load, and recycle the worker process if a threshold is exceeded.
(Incidentally, V1 of ASP.NET and the CLR had an unfortunate bug with the selection of this threshold. The default point at which ASP.NET would recycle the process was actually a lower memory load than the point at which the CLR’s GC would switch to a more aggressive schedule of collections. So we were actually killing the worker process before the CLR had a chance to deliver more memory back to the application. Presumably in Whidbey this selection of default thresholds is now coordinated between the two systems.)
How can ASP.NET get away with this simpler approach?
It really comes down to their fundamental goals. ASP.NET can scale out, rather than having to scale up. If you have more incoming web traffic, you can generally throw more web servers at the problem and load balance between them. Whereas SQL Server can only scale out if the data supports this. In some cases, it does. There may be a natural partitioning of the data, like access to the HotMail mailbox for a particular incoming user. But in too many other cases, the data cannot be sufficiently partitioned and the server must be scaled up. On X86 Windows, the practical limit is a 32-way (32-CPU) machine with a hard limit of 3 GB of user address space. If you want to keep increasing your work load on a single box, you need to use every imaginative trick – like fibers or AWE – to eke out all possible performance.
There’s also an availability issue. ASP.NET can recycle worker processes quite quickly. And if they have scaled out, recycling a worker process on one of the computers in the set will have no visible effect on the availability of the set of servers. But SQL Server may be limited to a single precious process. If that process must be recycled, the server is unavailable. And recycling a database is more expensive than recycling a stateless ASP.NET worker process, because transaction logs must be replayed to move the database forwards or backwards to a consistent state.
The short answer is, ASP.NET didn’t have to do all the high tech fancy performance work. Whereas SQL Server was forced down this path by the nature of the product they must build.
Reliability
Well, if you haven't read my earlier blogs on asynchronous exceptions, or if – like me – you read the Reliability blog back in June and don't remember what it said, you might want to review it quickly at http://cbrumme.dev/reliability.
The good news is that we’ve revisited the rules for ThreadAbortException in Whidbey, so that there is now a way to abort a thread without disturbing any backout code that it is currently running. But it’s still the case that asynchronous exceptions can intrude at fairly arbitrary spots in the execution.
Anyway, the availability goals of SQL Server place some rather difficult requirements on the CLR. Sure, we were pretty solid in V1 and V1.1. We ran a ton of stress and – if you avoided stack overflow, running out of memory, and any asynchronous exceptions like Thread.Abort – we could run applications indefinitely. We really were very clean.
One problem with this is that “indefinitely” isn’t long enough for SQL Server. They have a noble goal of chasing 5 9’s and you can’t get there with loose statements like “indefinitely”. Another problem is that we can no longer exclude OutOfMemoryException and ThreadAbortException from our reliability profile. We’ve already seen that SQL Server tries to use 100% of memory, without quite triggering paging. The effect is that SQL Server is always on the brink of being out of memory, so allocation requests are frequently being denied. Along the same lines, if the server is loaded it will allow itself to become over-committed on all resources. One strategy for backing out of an over-commitment is to abort a thread (i.e. kill a transaction) or possibly unload one or more AppDomains.
Despite this stressful abuse, at no time can the process terminate.
The first step to achieve this was to harden the CLR so that it was resilient to any resource failures. Fortunately we have some extremely strong testers. One tester built a system to inject a resource failure in every allocator, for every unique logical call stack. This tests every distinct backout path in the product. This technique can be used for unmanaged and managed (FX) code. That same tester is also chasing any unmanaged leaks by applying the principles of a tracing garbage collector to our unmanaged CLR data structures. This technique has already exposed a small memory leak that we shipped in V1 of the CLR – for the “Hello World” application!
With testers like that, you better have a strong development team too. At this point, I think we’ve annotated the vast majority of our unmanaged CLR methods with reliability contracts. These are a bit like Eiffel pre- and post-conditions and they provide machine-verifiable statements about each method’s behavior with respect to GC, exceptions, and other fundamental operations. These contracts can be used during test coverage (and, in some cases, during static scans of the binary images) to test for conformance.
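The managed face of this idea is the ReliabilityContract attribute in System.Runtime.ConstrainedExecution. A hedged example of what such an annotation looks like (the method itself is hypothetical):

    using System.Runtime.ConstrainedExecution;

    static class CriticalHelpers
    {
        // Declares that this method will not corrupt state and will always
        // succeed, even under thread aborts or out-of-memory conditions --
        // a claim that tooling can then try to verify and enforce.
        [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
        public static void SetSlot(ref int slot, int value)
        {
            slot = value;   // a single int write cannot fail and cannot be torn
        }
    }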
The bottom line is that the next release of the CLR should be substantially more robust in the face of resource errors. Leaving aside stack overflows and focusing entirely on the unmanaged runtime, we are shooting for perfection. Even for stack overflow, we expect to get very, very close. And we have the mechanisms in place that allow us to be rigorous in chasing after these goals.
But what about all of the managed code?
Will FX be as robust as the unmanaged CLR? And how can we possibly hold 3rd party authors of stored procedures or user defined functions to that same high bar? We want to enable a broad class of developers to write this sort of code, and we cannot expect them to perform many hundreds of hours of stress testing and fault injection on each new stored procedure. If we’re chasing 5 9’s by requiring every external developer to write perfect code, we should just give up now.
Instead, SQL Server relies on something other than perfect code. Consider how SQL Server worked before it started hosting the CLR:
The vast majority of execution inside SQL Server was via Transact-SQL, or TSQL. Any application written in TSQL is inherently scalable, fiber-aware, and robust in the face of resource errors. Any computation in TSQL can be terminated with a clean transaction abort.
Unfortunately, TSQL isn’t expressive enough to satisfy all application needs. So the remaining applications were written in extended stored procedures or xprocs. These are typically unmanaged C++. Their authors must be extremely sophisticated, because they are responsible for integrating their execution with the unusual threading environment and resource rules that exist inside SQL Server. Throw in the rules for data access and security (which I won’t be discussing in this blog) and it takes superhuman knowledge and skill to develop a bug-free xproc.
In other words, you had a choice between well-behaved execution with limited expressiveness (TSQL) and arbitrary execution coupled with a very low likelihood that you would get it right (xprocs).
One of the shared goals of the SQL Server and CLR teams in Whidbey was to eliminate the need for xprocs. We wanted to provide a spectrum of choices to managed applications. In Whidbey, that spectrum consists of three buckets for managed code:
- Safe: Code in this bucket is the most constrained. In fact, the host constrains it beyond what the CLR would normally allow to code that's only granted SecurityPermissionFlag.Execution. So this code must be verifiably typesafe and has a reduced grant set. But it is further constrained from defining mutable static fields, from creating or controlling threads, from using the threadpool, etc. The goal here is to guide the code to best practices for scalability and robustness within the SQL Server or similar hosted environments. In the case of SQL Server, this means that all state should be stored in the database and that concurrency is controlled through transactions against the data. However, it's important to realize that these additional constraints are not part of the Security system and they may well be subvertible. The constraints are simply speedbumps (not roadblocks) which guide the application code away from potentially non-scalable coding techniques and which encourage best practices.
- External Access: Code in this bucket should be sufficient for replacing most xprocs. Such code must also be verifiably typesafe, but it is granted some additional permissions. The exact set of permissions is presumably subject to change until Yukon ships, but it's likely to allow access to the registry, the file system, and the network.
- Unsafe: This is the final managed escape hatch for writing code inside SQL Server. This code does not have to be verifiable. It has FullTrust (with the possible exception of UIPermission, which makes no sense within the database). This means that it can do anything the most arbitrary xproc can do. However, it is much more likely to work properly, compared to that xproc. First, it sits on top of a framework that has been designed to work inside the database. Second, the code has all the usual benefits of managed code, like a memory manager that's based on accurate reachability rather than on programmer correctness. Finally, it is executing on a runtime that understands the host's special rules for resource management, synchronization, threading, security, etc.
For code in the Safe bucket, you may be wondering how a host could constrain code beyond SecurityPermissionFlag.Execution. There are two techniques available for this:
- Any assembly in the 'Safe' subset could be scanned by a host-provided pre-verifier, to check for any questionable programming constructs like the definition of mutable static fields, or the use of reflection. This raises the obvious question of how the host can interject itself into the binding process and guarantee that only pre-verified assemblies are loaded. The new Whidbey hosting APIs contain a Fusion loader hook mechanism, which allows the host to abstract the notion of an assembly store, without disturbing all our normal loader policy. You can think of this as the natural evolution of the AppDomain.AssemblyResolve event. SQL Server can use this mechanism to place all application assemblies into the database and then deliver them to the loader on demand. In addition to enabling pre-verification, the loader hooks can also be used to ensure that applications inside the database are not inadvertently broken or influenced by changes outside the database (e.g. changes to the GAC). In fact, you could even copy a database from one machine to another and theoretically this could automatically transfer all the assemblies required by that database.
- The Whidbey hosting APIs provide controls over a new Host Protection Attribute (HPA) feature. Throughout our frameworks, we've decorated various unprotected APIs with an appropriate HPA. These HPAs indicate that the decorated API performs a sensitive operation like Synchronization or Thread Control. For instance, use of the ThreadPool isn't considered a security-sensitive operation. (At some level, it is a risk for Denial of Service attacks, but DOS remains an open design topic for our managed platform). If code is running outside of a host that enables these HPAs, they have no effect. Partially trusted code, including code that only has Execution permission, can still call all these APIs. But if a host does enable these attributes, then code with insufficient trust can no longer call these APIs directly. Indirect calls are still permitted, and in this sense the HPA mechanism is similar to the mechanism for LinkDemands.
Although HPAs use a mechanism that is similar to LinkDemands, it’s very important to distinguish the HPA feature – which is all about programming model guidance – from any Security feature. A great way to illustrate this distinction is Monitor.Enter.
Ignoring HPAs, any code can call Monitor.Enter and use this API to synchronize with other threads. Naturally, SQL Server would prefer that most developers targeting their environment (including all the naïve ones) should rely on database locks under transaction control for this sort of thing. Therefore they activate the HPA on this class:
    [HostProtection(Synchronization=true, ExternalThreading=true)]
    public sealed class Monitor {
        ...
        [MethodImplAttribute(MethodImplOptions.InternalCall)]
        public static extern void Enter(Object obj);
        ...
    }
However, devious code in the 'Safe' bucket could use a Hashtable as an alternate technique for locking. If you create a synchronized Hashtable and then perform inserts or lookups, your Object.Equals and GetHashCode methods will be called within the lock that synchronizes the Hashtable. The BCL developers were smart enough to realize this, and they added another HPA:
    public class Hashtable : IDictionary, ISerializable, IDeserializationCallback, ICloneable {
        ...
        [HostProtection(Synchronization=true)]
        public static Hashtable Synchronized(Hashtable table) {
            if (table == null)
                throw new ArgumentNullException("table");
            return new SyncHashtable(table);
        }
        ...
    }
Are there other places inside the frameworks where it’s possible to trick an API into providing synchronization for its caller? Undoubtedly there are, but we aren’t going to perform exhaustive audits of our entire codebase to discover them all. As we find additional APIs, we will decorate them with HPAs, but we make no guarantees here.
This would be an intolerable situation for a Security feature, but it’s perfectly acceptable when we’re just trying to increase the scalability and reliability of naively written database applications.
Escalation Policy
I chose the HPA on System.Threading.Monitor for a reason, in the above example. If you’ve read my earlier blogs on Thread.Abort, you know that it’s dangerous to asynchronously abort another thread. That thread could be executing a class constructor, in which case that class is now unavailable throughout the AppDomain. That thread could be in the middle of an update to some shared application state, which would leave the application in an inconsistent state.
In V1 & V1.1, it was not really possible to write code that is robust in the face of asynchronous exceptions like Abort. In Whidbey, we're now introducing some constructs (Constrained Execution Regions and Critical Finalization) which make it possible to do this. I'm not going to discuss those constructs in this blog. But suffice it to say that, although these constructs make it possible to write entirely robust code, they don't make it easy. Without a higher level programmatic construct, like transactions, it's very difficult to write entirely robust code. You must acquire all the resources required for forward progress, tolerating exceptions during this acquisition phase. Then you enter a forward progress phase, which either cannot fail or which unconditionally triggers some compensating backout code upon failure. If compensation is triggered, it must guarantee that the system is returned to a consistent state before it completes.
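To make the discipline concrete, here's a hedged sketch of the acquire-then-commit pattern using the Whidbey constrained execution support. This is my own illustration rather than product code, and the Prepare step is hypothetical:

    using System;
    using System.Runtime.CompilerServices;

    class ConsistentUpdate
    {
        public static object state;   // shared state that must never be half-updated

        public static void Update(object newValue)
        {
            // Acquisition phase: everything that can fail (allocation, aborts)
            // happens here, before any shared state is touched.
            object prepared = Prepare(newValue);

            RuntimeHelpers.PrepareConstrainedRegions();   // eagerly prepare the region below
            try { }
            finally
            {
                // Forward-progress phase: runs as a constrained execution region,
                // so asynchronous aborts are held off while we commit.
                state = prepared;
            }
        }

        static object Prepare(object value)
        {
            return value;   // hypothetical: do all fallible work up front
        }
    }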
If you’ve successfully written that sort of code, you know that it’s an onerous discipline. There’s no way that we can expect the greater population of developers to write large bodies of bug-free code based on this plan.
That’s why, in V1 & V1.1, we recommend either using Abort on the current thread (in which case it is not asynchronous) or we recommend using it in conjunction with an AppDomain.Unload (in which case any inconsistent application state is likely to be discarded).
In Whidbey, it is possible to avoid inducing asynchronous Aborts onto threads that are performing backout (i.e. filter, finally, catch or fault blocks) or that hold locks. Our definition of a lock is pretty broad. It includes execution of a class constructor, since all .cctor execution is synchronized according to elaborate rules by the CLR. It also includes Monitor.Enter, Mutex, ReaderWriterLock, etc. Finally, it includes any “hand-rolled” locks that you build, so long as you properly identify them to us.
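As a sketch of what "properly identify them to us" can look like, here's a hypothetical hand-rolled spin lock bracketed with Thread.BeginCriticalRegion and Thread.EndCriticalRegion, the Whidbey calls that tell the CLR (and therefore the host) that an abort in this window could leave shared state inconsistent:

    using System.Threading;

    class IdentifiedLock
    {
        int taken;   // 0 == free, 1 == held

        public void Enter()
        {
            // Let the CLR and the host treat this like any other lock when
            // deciding whether a thread abort must be escalated.
            Thread.BeginCriticalRegion();
            while (Interlocked.CompareExchange(ref taken, 1, 0) != 0)
                Thread.Sleep(0);
        }

        public void Exit()
        {
            Interlocked.Exchange(ref taken, 0);
            Thread.EndCriticalRegion();
        }
    }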
Our rationale here is that any thread holding a lock may be updating shared state. If a thread isn’t holding a lock, then any update it performs against shared state must be atomic or at least it never leaves that shared state in an inconsistent state. This is strictly a heuristic, but it’s a pretty good one.
If we believe this heuristic, it means that we can use Abort without consequently unloading an AppDomain, if that thread doesn’t hold any locks and isn’t performing any backout. And it just so happens that the bulk of all managed code executing inside SQL Server is in the ‘Safe’ subset – which coincidentally is highly discouraged via HPAs from taking or holding locks.
In other words, code in the ‘Safe’ subset can almost always take an asynchronous exception without affecting any of the execution on other threads in the same AppDomain. This is the case, even though that code was written by developers who don’t understand the deep issues involved with asynchronous exceptions. It further means that if we should catch such a thread at a point where it isn’t safe to inject an asynchronous exception without also unloading the AppDomain, we can identify this window. Once this window is identified, we can either hold off from injecting the exception until this unsafe window has closed, or we can unload the entire AppDomain to eliminate the application inconsistency. The host can decide whether to hold off on the injection or alternatively to proceed with an AppDomain unload, based on criteria like how resource-constrained the host is.
The hosting APIs for making these decisions imperatively would be rather complicated. So the Whidbey hosting APIs provide a declarative mechanism called an escalation policy. This allows the host to express transitions and timeouts that take effect during error conditions. For instance, SQL Server might state that any attempt to Abort a thread should delay if the victim thread holds a lock. But if that delay exceeds 30 seconds, the Abort attempt should be escalated to an AppDomain.Unload. Of course, the feature is more general than SQL Server’s needs. Indeed, the V1 ASP.NET process recycling feature should now be expressible as a particular Whidbey escalation policy.
Winding down
As usual, I didn’t get around to many of the interesting topics. For instance, those guidelines on when and how to host are noticeably absent. And I didn’t explain how to do any simple stuff, like picking concurrent vs. non-concurrent vs. server GC. The above text is completely free of any specific details of what our hosting APIs look like (partly because they are subject to change until Whidbey ships). And I didn’t touch on any hosting topics outside of the hosting APIs, like all of the AppDomain considerations. As you can imagine, there’s also plenty I could have said about Security. For instance, the hosting APIs allow the host to participate in role-based security and impersonation of Windows identities… Oh well.
Fortunately, one of the PMs involved in the Whidbey hosting effort is apparently writing a book on the general topic of hosting. Presumably all these missing topics will be covered there. And hopefully he won’t run into the same issues with writer’s block that I experienced on this topic.
(Indeed, the event that ultimately resolved my writer’s block was that my wife got the flu. When she’s not around, my weekends are boring enough for me to think about work. The reason I’m posting two blogs this weekend is that Kathryn has gone to Maui for the week and has left me behind.)
Finally, the above blog talks about SQL Server a lot.
Hopefully it’s obvious that the CLR wants to be a great execution environment for a broad set of servers. In V1, we focused on ASP.NET. Based on that effort, we automatically worked well in many other servers with no additional work. For example, EnterpriseServices dropped us into their server processes simply by selecting the server mode of our GC. Nothing else was required to get us running efficiently. (Well, we did a ton of other work in the CLR to support EnterpriseServices. But that work was related to the COM+ programming model and infrastructure, rather than their server architecture. We had to do that work whether we ran in their server process or were instead loading EnterpriseServices into the ASP.NET worker process or some other server).
In Whidbey we focused on extending the CLR to meet SQL Server’s needs. But at every opportunity we generalized SQL Server’s requirements and tried to build something that would be more broadly useful. Just as our ASP.NET work enabled a large number of base server hosting scenarios, we hope that our SQL Server work will enable a large number of advanced server hosting scenarios.
If you have a “commercially significant” hosting problem, whether on the server or the client, and you’re struggling with how to incorporate managed code, I would be interested in hearing from you directly. Feel free to drop me an email with the broad outline of what you are trying to achieve, and I’ll try to get you supported. That support might be something as lame as some suggestions from me on how I would tackle the problem. Or at the other extreme, I could imagine more formal support and conceivably some limited feature work. That other extreme really depends on how commercially significant your product is and on how well our business interests align. Obviously decisions like that are far outside my control, but I can at least hook you up with the right people if this seems like a sensible approach.
Okay, one more ‘Finally’. From time to time readers of my blog send me emails asking if there are jobs available on the CLR team. At this moment, we do. Drop me an email if you are interested. It’s an extremely challenging team to work on, but the problems are truly fascinating.