# Introduction

I’ve recently wanted to demonstrate the relevance of the .NET ReaderWriterLock[Slim] synchronization primitives.
It’s good to hear from the vendor that they’re better, faster, stronger, but when you can, it’s always good to evaluate them yourself; not that I don’t trust vendors, but I like to have hard numbers, particularly when I assert something that can be critical for my participants’ developments.

So I’ve built a small, simple and, I hope, relevant benchmark to measure the performance impact of ReaderWriterLock[Slim] compared to the naive, uniform use of a Monitor through the C# lock construct.

I wanted to check these two things:

• that the RW locks behave as advertised,
• what the profile of the gain function looks like.

In this article I’ll explain the rationales behind the benchmark, how I’ve implemented it and finally present the results.

# Benchmark presentation

The work of each thread is simulated by suspending it with Thread.Sleep for 1 ms, a middle ground between heavy operations that would last seconds and light ones that would last 1 µs.

The proportion/probability of write operations, p, varies linearly in a given range, typically from 0% (only reads) to 100% (only writes). The benchmark can be tailored to your own workload in several ways:

• by using a sub-range, e.g. if you know the read and write operations are balanced,
• by overweighting either the read or the write duration if you know one is longer,
• or even by randomizing them according to the probability distribution of your workload.
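As an illustration of the second option, overweighting writes could be sketched like this (the durations, the class, and the method name are mine, not part of the original benchmark):

```csharp
using System;
using System.Threading;

static class WeightedWorkload
{
    // Hypothetical durations: writes are assumed 5 times heavier than reads.
    const int ReadDurationMs = 1;
    const int WriteDurationMs = 5;

    // Simulate one operation, drawing its type according to the write proportion p.
    public static void SimulateOperation(Random rand, decimal p)
    {
        bool isWrite = 1.0m * rand.Next() / int.MaxValue <= p;
        Thread.Sleep(isWrite ? WriteDurationMs : ReadDurationMs);
    }
}
```

The same idea extends to the third option: instead of two fixed durations, draw each duration from the distribution observed in your real workload.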

## General workflow

The benchmark scans the range of levels of concurrency.
For each level of concurrency, c, it scans the range of writes proportions, p.
For each pair (c, p) it runs c threads that will each execute a loop of n operations.
For each iteration, the operation to simulate (read or write) is chosen randomly according to the proportion of writes p.

To randomly generate the operation type, a random integer is generated with Random.Next(), then normalized by int.MaxValue to obtain a number uniformly distributed in the range [0, 1].
If this number falls into the [0, p] range then we simulate a write, otherwise we simulate a read.
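Isolated as a small predicate (the class and method names are mine), the draw looks like this:

```csharp
static class OperationDraw
{
    // rand.Next() returns an int in [0, int.MaxValue); dividing by int.MaxValue
    // normalizes it to an approximately uniform value in [0, 1].
    // The operation is a write when that value falls into the [0, p] range.
    public static bool IsWrite(int draw, decimal p)
    {
        return 1.0m * draw / int.MaxValue <= p;
    }
}
```

It would be called as `OperationDraw.IsWrite(rand.Next(), p)` inside the benchmark loop.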

To ensure that, for a given combination (c, p), all the threads start at the same time, and that none can start, run, or even end before the others have been given a chance to run, they are synchronized using a Barrier, another synchronization primitive, which ensures that a set of threads all reach the same point of execution before continuing.

When all the threads are ready, the main thread gives the go-ahead and starts to measure the total time it takes for all the threads to execute all their operations using a Stopwatch instance.
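Stripped of the benchmark specifics, this synchronization pattern can be sketched as follows (the thread count and the 1 ms workload are placeholders):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class BarrierSketch
{
    public static double MeasureMs(int threadCount)
    {
        // One extra participant: the main thread, which gives the go-ahead.
        Barrier barrier = new Barrier(threadCount + 1);

        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; ++i)
        {
            threads[i] = new Thread(() =>
            {
                barrier.SignalAndWait(); // Block until everyone is ready.
                Thread.Sleep(1);         // Simulated work.
            });
            threads[i].Start();
        }

        Stopwatch stopwatch = Stopwatch.StartNew();
        barrier.SignalAndWait(); // Release all the workers at once.

        foreach (Thread t in threads)
            t.Join();
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds;
    }
}
```

Measuring after the main thread’s SignalAndWait, rather than after Start, excludes the thread creation overhead from the timing.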

# Benchmark implementation

## The benchmark component’s code

The benchmark itself has been reified as a class to allow for easy configuration: the range of p, the range of c and n are passed to the benchmark’s single constructor.
The benchmark class produces the results but does not display them, which is the responsibility of the caller.

Here is the benchmark class itself:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

namespace ReaderWriterLockBenchmark
{
class RWLockBenchmark
{
/// <summary>
/// The number of operations to execute for each thread.
/// </summary>
private readonly int n;

/// <summary>
/// The range for the values of p.
/// </summary>
private readonly decimal pMin, pMax, pStep;

/// <summary>
/// The range for the levels of concurrency.
/// </summary>
private readonly int threadCountMin, threadCountMax, threadCountStep;

/// <summary>
/// The current proportion of writes.
/// </summary>
private decimal p;

/// <summary>
/// The primitive used to ensure that all threads start to work at the same time.
/// </summary>
private Barrier barrier;

/// <summary>
/// Object used for the C# lock statement underlying Monitor object.
/// </summary>
private readonly object @lock = new object();

/// <summary>
/// RW lock used for a finer grained locking policy.
/// </summary>
private readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();

/// <summary>
/// Take a full lock for both read and write operations.
/// </summary>
void SimpleLocking()
{
barrier.SignalAndWait();

for (int i = 1; i <= n; ++i)
{
lock (@lock)
{
// Simulate the operation while holding the exclusive lock.
Thread.Sleep(1);
}
}
}

/// <summary>
/// Take a read lock when reading and a write lock when writing.
/// </summary>
void RWLocking()
{
barrier.SignalAndWait();

Random rand = new Random();

for (int i = 1; i <= n; ++i)
{
// If we are in the write range then simulate a write and take a full lock...
if (1.0m * rand.Next() / int.MaxValue <= p)
{
try
{
rwLock.EnterWriteLock();

// Simulate a write operation.
Thread.Sleep(1);
}
finally
{
rwLock.ExitWriteLock();
}
}
else
{
try
{
rwLock.EnterReadLock();

// Simulate a read operation.
Thread.Sleep(1);
}
finally
{
rwLock.ExitReadLock();
}
}
}
}

/// <summary>
/// Initialize a new benchmark.
/// </summary>
/// <param name="n">The number of operations done by each thread.</param>
/// <param name="pMin">The initial proportion of writes.</param>
/// <param name="pMax">The final proportion of writes.</param>
/// <param name="pStep">The increase of the proportion of writes at each step.</param>
/// <param name="threadCountMin">The minimum level of concurrency.</param>
/// <param name="threadCountMax">The maximum level of concurrency.</param>
/// <param name="threadCountStep">The increase of the level of concurrency at each step.</param>
public RWLockBenchmark(int n, decimal pMin, decimal pMax, decimal pStep, int threadCountMin, int threadCountMax, int threadCountStep)
{
this.n = n;
this.pMin = pMin;
this.pMax = pMax;
this.pStep = pStep;
this.threadCountMin = threadCountMin;
this.threadCountMax = threadCountMax;
this.threadCountStep = threadCountStep;
}

/// <summary>
/// Measure the times of the naive uniform locking.
/// </summary>
/// <returns></returns>
private IDictionary<int, TimeSpan> MeasureReferenceTimes()
{
IDictionary<int, TimeSpan> simpleLockingTimes = new Dictionary<int, TimeSpan>();

for (int threadCount = threadCountMin; threadCount <= threadCountMax; threadCount += threadCountStep)
{
// Track the running threads for joining them.
Thread[] threads = new Thread[threadCount];

barrier = new Barrier(threadCount + 1);

for (int i = 0; i < threadCount; ++i)
{
threads[i] = new Thread(SimpleLocking);
threads[i].Start();
}

Stopwatch stopwatch = Stopwatch.StartNew();

barrier.SignalAndWait();

for (int i = 0; i < threadCount; ++i)
{
threads[i].Join();
}
stopwatch.Stop();

simpleLockingTimes[threadCount] = stopwatch.Elapsed;
}

return simpleLockingTimes;
}

/// <summary>
/// Perform the benchmark, comparing the RW finer grained policy times against the uniform locking policy times.
/// </summary>
/// <param name="simpleLockingTimes"></param>
/// <returns></returns>
private IDictionary<int, IDictionary<decimal, double>> Benchmark(IDictionary<int, TimeSpan> simpleLockingTimes)
{
IDictionary<int, IDictionary<decimal, double>> results = new Dictionary<int, IDictionary<decimal, double>>();

for (int threadCount = threadCountMin; threadCount <= threadCountMax; threadCount += threadCountStep)
{
IDictionary<decimal, double> ratios = new Dictionary<decimal, double>();
results[threadCount] = ratios;

Stopwatch stopwatch = new Stopwatch();
for (p = pMin; p <= pMax; p += pStep)
{
barrier = new Barrier(threadCount + 1);

Thread[] threads = new Thread[threadCount];
for (int i = 0; i < threadCount; ++i)
{
threads[i] = new Thread(RWLocking);
threads[i].Start();
}

stopwatch.Restart();
barrier.SignalAndWait();

for (int i = 0; i < threadCount; ++i)
{
threads[i].Join();
}
stopwatch.Stop();

TimeSpan rwLockingTime = stopwatch.Elapsed;

double ratio = 1.0 * rwLockingTime.TotalMilliseconds / simpleLockingTimes[threadCount].TotalMilliseconds;
ratios[p] = ratio;
}
}

return results;
}

public IDictionary<int, IDictionary<decimal, double>> Run()
{
IDictionary<int, TimeSpan> simpleLockingTimes = MeasureReferenceTimes();

IDictionary<int, IDictionary<decimal, double>> results = Benchmark(simpleLockingTimes);

return results;
}
}
}


## The benchmark application’s code

And here is the code that uses the benchmark and renders the results as a CSV stream in the console:

using System;
using System.Collections.Generic;

namespace ReaderWriterLockBenchmark
{
class Program
{
static void Main(string[] args)
{
int n = 200;
decimal pMin = 0m;
decimal pMax = 1m;
decimal pStep = 0.1m;
int threadCountMin = 1;
int threadCountMax = 8;
int threadCountStep = 1;

RWLockBenchmark benchmark = new RWLockBenchmark(n, pMin, pMax, pStep, threadCountMin, threadCountMax, threadCountStep);

IDictionary<int, IDictionary<decimal, double>> results = benchmark.Run();

// Write the CSV header: one column per level of concurrency.
Console.Write("p");
foreach (int threadCount in results.Keys)
{
Console.Write(";" + threadCount);
}

Console.WriteLine();

for (decimal p = pMin; p <= pMax; p += pStep)
{
Console.Write(p);

foreach (int threadCount in results.Keys)
{
Console.Write(";" + results[threadCount][p]);
}

Console.WriteLine();
}
}
}
}


From there the output can be copy-pasted into your favorite tool for analysis, typically Excel.

Note that with this design you could go a step further: instead of updating the parameters in the Main method, you would pass them as command-line arguments of the executable, parse them in Main and forward them to the benchmark instance.
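A minimal sketch of that idea, assuming positional arguments with fallbacks to the current hard-coded defaults (the class and method names are mine):

```csharp
using System.Globalization;

static class ArgumentParsing
{
    // Return the positional argument at the given index, or the fallback
    // when the argument is absent. The invariant culture keeps "0.1"
    // parseable regardless of the machine's locale.
    public static decimal GetDecimal(string[] args, int index, decimal fallback)
    {
        return args.Length > index
            ? decimal.Parse(args[index], CultureInfo.InvariantCulture)
            : fallback;
    }
}
```

In Main, `decimal pStep = ArgumentParsing.GetDecimal(args, 3, 0.1m);` would then replace the corresponding hard-coded assignment.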

# Results

I’ve run the benchmark with 1 to 8 threads and with 50 threads, and with p varying from 0% to 100% in steps of 10%.

Here are the raw results:

| p \ c | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 50 |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0.0 | 1.003 | 0.502 | 0.335 | 0.249 | 0.200 | 0.167 | 0.145 | 0.128 | 0.020 |
| 0.1 | 0.997 | 0.658 | 0.500 | 0.404 | 0.347 | 0.299 | 0.287 | 0.197 | 0.111 |
| 0.2 | 0.995 | 0.741 | 0.606 | 0.540 | 0.424 | 0.410 | 0.400 | 0.360 | 0.185 |
| 0.3 | 0.999 | 0.838 | 0.684 | 0.550 | 0.573 | 0.504 | 0.522 | 0.462 | 0.320 |
| 0.4 | 0.998 | 0.840 | 0.755 | 0.698 | 0.602 | 0.597 | 0.581 | 0.539 | 0.379 |
| 0.5 | 0.999 | 0.898 | 0.811 | 0.761 | 0.750 | 0.688 | 0.650 | 0.655 | 0.500 |
| 0.6 | 0.997 | 0.923 | 0.880 | 0.823 | 0.775 | 0.699 | 0.765 | 0.669 | 0.573 |
| 0.7 | 0.998 | 0.973 | 0.910 | 0.860 | 0.804 | 0.788 | 0.801 | 0.787 | 0.723 |
| 0.8 | 0.995 | 0.960 | 0.928 | 0.894 | 0.890 | 0.882 | 0.853 | 0.854 | 0.779 |
| 0.9 | 0.995 | 0.981 | 0.965 | 0.935 | 0.948 | 0.938 | 0.929 | 0.912 | 0.938 |
| 1.0 | 0.999 | 1.001 | 1.000 | 0.996 | 1.001 | 1.001 | 1.002 | 1.003 | 1.000 |

And plotted:
*(Interactive chart from the original post: the ratio as a function of p, one curve per level of concurrency.)*

A basic analysis shows that the main characteristics were expected:

• when we only perform reads we’re as fast as possible: with c threads it’s c times faster,
• when we only perform writes there is no gain.

What is more interesting, and what I personally didn’t expect, is that this gain function is concave.
It becomes linear only asymptotically, as the level of concurrency c tends towards infinity (and with 50 threads it is already quite linear).

I honestly don’t have the skills in queueing theory to fully explain the results, but I guess they can be modeled this way:
$\frac {t + X(c, p)} {c t}$
And with normalized time (t = 1):
$\frac {1 + X(c, p)} {c}$
With t the duration of a single operation and $X(c, p)$ a random variable representing the time a thread waits in the ReaderWriterLockSlim waiting queue.

If p = 0% then $X(c, p)$ is 0 and the ratio is always $\frac{1}{c}$.
If p = 100% then $X(c, p)$ is $c - 1$, because each time a thread enters the queue it has to wait for all the other threads to finish, and the ratio is always $\frac{1 + (c - 1)}{c} = \frac{c}{c} = 1$.
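These two endpoints are easy to check numerically against the table; a tiny helper (assuming the model above, with x standing for the waiting time):

```csharp
static class LockModel
{
    // Normalized execution time ratio (1 + x) / c, where c is the level of
    // concurrency and x the waiting time in operation units.
    public static double Ratio(int c, double x)
    {
        return (1.0 + x) / c;
    }
}
```

For c = 8, `LockModel.Ratio(8, 0)` gives 0.125, close to the measured 0.128 at p = 0%, and `LockModel.Ratio(8, 7)` gives 1.0, matching the measured 1.003 at p = 100%.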

This is at least consistent with our empirical results.

And finally, according to our results, with increasing c, $X(c, p)$ should tend to $(c - 1)p$, or something similar, so that the ratio tends to p.

# Conclusion

Concerning efficiency, we’ve confirmed that a finer locking policy, distinguishing between read and write operations, allows for a significant performance gain.
Again this is not an absolute result as it could greatly vary depending on your use case but it’s consistent with what we expect for such a synchronization primitive.

As for the gain function, its profile was less expected: it’s not linear, but tends to be with increasing concurrency.
The theoretical gain with infinite concurrency is therefore $\frac{1}{p}$, for any p > 0.

If you have an idea of what the $X(c, p)$ random variable looks like, please share it by leaving a comment; it would be interesting to compute the theoretical results and match them against the empirical results presented above.
I hope there is no flaw in the benchmark that would explain the strange profile, but I’d like to be sure. 🙂

If you catch any typo or mistake, have additional questions or some remarks, feel free to leave a comment.
