Everyone excited to see the new features and enhancements of SQL Server 2016, beside all those improvements, there are some improvements in the architectural design of the database engine to provide better performance, scalability and use of hardware resources. I would like to pin point to some of the major internal improvements that have been done on SQL Server 2016.
[plain]Special thanks to Joe Sack, Bob Dorr, Ajay Jagannathan, Greg Lapinski, Nitin Verma, Steve Lindell, Zhengguo Sun, and other PSSSQL team members to provide such informative blog posts.[/plain]
Extended Events Reader in Management Studio
SQL Server 2016 improves the
XEvent Linq reader scalability and performance. The XEvent UI in SQL Server Management Studio uses the XEvent Linq reader to process the events for display. Careful study of the XEvent Linq reader revealed opportunities for scalability and performance improvements.
Note: The XEvent Linq reader is .NET based and can be used by custom applications.
Various changes were made to the XEvent Linq provider logic to reduce context switching, memory allocations and other aspects.
The following are testing results of the XEvent reader.
Prior 2016: 8000 Events/Sec
2016 and later: 80,000 Events/Sec
Leverages On Demand MSDTC Startup
SQL Server leverages MSDTC for distributed transactions (begin distributed transaction, remote proc trans, etc.) Prior to SQL Server 2016 the MSDTC service must be running (started) prior to any SQL Server, DTC based transaction activity.
SQL Server 2016 enhances SQL distributed transaction capabilities leveraging MSDTC, On Demand startup. The On Demand startup of MSDTC does not start the MSDTC service and consume resources until SQL Server requires MSDTC.
SQL Server 2016 dynamically starts MSDTC as needed allowing resources to be used for other activities until required.
In-Memory Optimized Database Worker Pool
SQL Server 2016 removed the trace flag and updated the design to dynamically adjust the size of the In-Memory Optimized Database worker pool as needed. SQL Server 2016 dynamically adjusts the In-Memory Optimized worker pool to maximize scalability and performance.
AlwaysOn AES-NI Encryption
The SQL Server 2016, AlwaysOn, log transport takes advantage of hardware based encryption to significantly improve scale and performance. SQL Server 2016 development efforts continued focusing on the AlwaysOn log shipping transport. Testing revealed that software based encryption, while fundamentally sound, could be improved using hardware based capabilities.
The exchange of information between AlwaysOn instances takes place over a message protocol. When encryption is configured the messages are encrypted and decrypted at the designated endpoints. Because the messages are delivered across a remote environment the overhead of encryption activities can increase the latency of the communications.
SQL Server 2016 defaults endpoint creation to AES based encryption allowing hardware based AES-NI encryption. The hardware based capabilities improve Always On log shipping scalability and performance by a significant factor.
AlwaysOn Parallel Compression / Improved Algorithms
SQL Server 2016 introduces two distinct changes in the AlwaysOn transport, compression design.
- Improved compression algorithms
- Parallel compression of log block data
Compression can be performed faster, using less resource overhead and maintains compression ratios.
SQL Server 2016 also introduces parallel compression operations. The following chart outlines the performance and resource gains as tested using a TPCC stress load. SQL Server 2016 combines new compression algorithms and parallelization to significantly improve AlwaysOn log transport capabilities.
AlwaysOn Log Transport Reduced Context Switches
The AlwaysOn log transport uses a SQL Broker based design to send and receive messages between the primary and secondary replicas. Studying the transport behavior revealed various improvement opportunities.
Many of you have read the various materials indicating that log transportation between the primary and secondary is commonly 1/3 that of file copy speed. That is no longer the case. A SQL Server 2016 AlwaysOn database can commonly maintain 90% or better transitional throughput rate directly comparing the same workload against a standalone database.
For example, SQL Server 2014 requires ~16, thread context switches to transport a log block. SQL Server 2016 can accomplish the same transport task in 8 context switches. This can lead to a 4x to 5x improvement in log block transport throughput.
BULK INSERT Uses CPU Vector Instructions
SQL Server 2016 enhanced the bulk insert activities (BULK INSERT), leveraging the CPUs vector instructions. The vector based approach allows the bulk data to be processed faster.
For example, when inserting data from a text file the integer conversion leverages SSE instructions when the character length is more than 4 characters and trace flag 4809 is disabled.
Take the value of ‘12345678’ needing to be converted to the integer value 1,2345,678. A common algorithm used to determine the integer value of each position one might loop over the character array subtracting the value of character ‘0’ from each entry so it can be casted to a BYTE value and shifted to accommodate base 10 mathematics.
SQL Server 2016 takes advantage of CPU vector instructions to improve bulk insert performance.
Column Store Uses CPU Vector Instructions
SQL Server’s Column Store feature is designed to accommodate large amounts of data and associated compression capabilities. Microsoft studied numerous data layouts associated with column store and the CPU instructions required to process that data. Various algorithms have been optimized throughout the column store code base.
A specific, SQL Server 2016 enhancement is the use of vector based hardware capabilities. When compressing, building dictionaries and processing data a vectored design is leveraged. SQL Server 2016 detects the level of CPU support for the following extensions and when supported SQL Server 2016 takes full advantage of the hardware scalability and performance capabilities.
SQL Server 2016 detects the CPU capabilities for AVX or SSE and leverages the hardware, based vector capabilities to improve scalability and performance.
Multiple Log Writer Workers
SQL Server 2016 introduces multiple log writer workers to the engine. For many years the log device was limited by the capabilities of spinning media and hardware caches. The advancement of hardware caches and fast storage types (SSD, flash, …) on large installations can saturate a single log write worker.
SQL Server 2016 extended the log writer by allowing up to 4 workers for the instance. You can find out the number of log writers present in your SQL Server using the following command:
select session_id, status, command, scheduler_id, task_address, wait_type
where command = ‘LOG WRITER’
The number of log writers created during the SQL Server instance startup depends on the number of hardware NUMA nodes present on the system. If there is only one NUMA node, then there will be only one log writer created. For systems with multiple NUMA nodes, there can be a maximum of 4 log writer threads.
If you have configured the affinity I/O mask (not recommended) the log writer threads are bound to the CPU mask specified in the affinity I/O mask configuration. Otherwise, the log writer threads will be created on separate hidden schedulers. These hidden schedulers are always bound to NUMA node 0. These decision to cap the maximum number of log writers and their placement is based on extensive performance tests conducted with heavy concurrent activities of multiple log writers accessing the log flush queue.
Larger Data File Writes
SQL Server uses WriteFileGather for the vast majority of data file write requests. The logic is to consolidate dirty pages into a single I/O request. For over a decade hardware performed best with at 64 or 128K requests. The newer SSD and flash implementations often have 1MB to 4MB internal blocking sizes. Increasing the overall size of data file writes slim lines the (R)ead (M)odify (W)rite behavior, increasing scalability and performance.
SQL Server 2016 (X64) increase the number of contiguous, 8K pages from 32 to 128 (1MB) when performing (Lazy, checkpoint, select into, create index and bulk insert write operations.) These write operations encompass 95%+ of the write operations for data file.
Indirect Checkpoint Default
Indirect checkpoint is the recommended configuration, especially on systems with large memory footprints and default for databases created in SQL Server 2016. New databases in SQL Server 2016 use indirect checkpoint, improving performance of checkpoint activities.
There has always been a need to track which pages are dirty. The determination of dirty pages has been extended over the years and SQL Server releases to include a ‘PrepareToDirty’ routine inside SQL Server. This facilitates capabilities such as snapshot databases and page latch enforcement. For example, before any code in the SQL Server is allowed to make changes it must invoke ‘PrepareToDirty.’ If a database snapshot is present for the database and the page has not been copied to the snapshot the image of the page is secured before allowing the change.
Indirect checkpoint leverages ‘PrepareToDirty’ logic. For indirect checkpoint, the SQL Server creates a dirty page manager for the database. Within the dirty page manager exists a set of lists to track dirty pages. As pages become dirty they are added to the highly efficient, partitioned lists which are sorted by LSN and tend to just be append list operations. Adding the pages to partitioned, sorted lists when the page is first dirtied is very low overhead.
The SOS_RWLock is a synchronization primitive used in various places throughout the SQL Server code base. As the name implies the code can have multiple shared (readers) or single (writer) ownership. SQL Server 2016 contains the design changes for the SOS_RWLock using similar design patterns and concepts leveraged by in-memory optimized objects.
As part of a change to improve the execution of sp_reset_connection (logical connection pooling) the SOS_RWLock redesign reduced the duration on a highly contended database by 5%.
Apply SQL Server 2016 and code paths leveraging reader/writer locks use less resources and scale better.
Dynamic Memory Object (CMemThread) Partitioning
Apply SQL Server 2016 and SQL Server dynamically partitions a CMemThread encountering contention, increasing scalability of the instance.
The CMemThread waits (PWAIT_MEMTHREAD) can be a point of contention as machine sizes advance. The CMemThread object type is utilized in 100s of objects throughout the SQL Server code base and can be partitioned globally, by node or by CPU.
The vast majority of CMemThread objects leverage global partitioning. Trace flag -T8048 only forces node based partitioning to cpu based partitioning. Each time a highly contended, global partitioned object was encountered a hotfix was required to partition the object.
SQL Server 2005, 2012 and 2014 contains dynamic latch promotion and demotion logic (Super/Sub-Latch). The concept is to watch the number of acquires on a latch along comparing sampling rates as to how long it should take to acquire a latch without contention. When a shared latch can be promoted (partitioned) in order to improve performance and scalability SQL Server will do so.
Taking a page from the latch design concepts, CMemThread is designed in SQL Server 2016 to dynamically promote a contended CMemThread object. SQL Server 2016 detects contention points on a specific CMemThread object and promotes the object to a per node or per cpu based implementation. Once promoted the object remains promoted until the SQL Server instance is restarted.
The dynamic CMemThread design is a superset of -T8048 behavior, allowing you to deploy SQL Server 2016 on a wide variety of hardware environments and SQL Server 2016 automatically adjusts CMemThread behavior to the applicable load. By partitioning contended CMemThread objects it removes the waits behaviors and allows SQL Server 2016 to scale to your applications needs.
Updated Scheduling Algorithms
SQL Server 2016 scheduling algorithms balance the work load better leading to enhanced scalability. SQL Server 2016 gets a scalability boost from scheduling updates. Testing uncovered issues with the percentile scheduling based algorithms in SQL Server 2012 and 2014. A large, CPU quantum worker and a short, CPU quantum worker can receive unbalanced access to the scheduling resources.
Take the following example. Worker 1 is a large, read query using read ahead and in-memory database pages and Worker 2 is doing shorter activities. Worker 1 finds information already in buffer pool and does not have to yield for I/O operations. Worker 1 can consume its full CPU quantum.
On the other hand, Worker 2 is performing operations that require it to yield. For discussion let’s say Worker 2 yields at 1/20th of its CPU, quantum target. Taking resource governance and other activities out of the picture the scheduling pattern looks like the following.
Worker 1 is getting ~5 times more CPU cycles than Worker 2. In testing we found issues with various workloads and system tasks. If Worker 2 is the log writer it takes longer to harden log records, which holds locks, which can lead to blocking and throughput issues.
SQL Server 2016 and Windows Azure SQL Database (WASD) monitors the quantum usage patterns allowing all workers to get fair treatment. The same pattern described above looks like the following on SQL Server 2016. In this simplistic example Worker 2 is allowed to consume repeated quantum’s preventing Worker 1 from monopolizing the scheduler in an unfriendly pattern.
Note: The scheduler changes were deployed to Windows Azure SQL Server Database in March of 2014.
Automatic Soft NUMA
Apply SQL Server 2016 and SQL Server internally leverages SOFT NUMA partitioning to achieve double digit performance gains
During startup, SQL Server 2016 interrogates the hardware layout and automatically configures Soft NUMA on systems reporting 8 or more CPUs per NUMA node. The partitioning triggers various adjustments throughout the database engine improving scalability and performance. The Automatic Soft NUMA logic considers logical CPU ratios, total CPU counts and other factors, attempting to create soft, logical nodes containing 8 or fewer CPUs each.
When creating or growing the database log file (LDF) a byte pattern is stamped. The pattern establishes the proper log block layout with end of log marker(s.)
SQL Server 7.0 changed the LDF format from the original, 2K, Sybase database page design to sector aligned log blocks stamped with all zeros (0x00). Creation or expansion of the LDF file involves writing the entire series of 0x00’s to the new bytes for the log.
SQL Server 16 changes the stamp to 0xC0’s instead of 0x00s.
Note: The log file stamping pattern change has been leveraged by Window Azure SQL Database since calendar year 2014.[/info]
Question: If the log is stamped with 0xC0’s instead of 0x00’s how is it a performance gain?
Many of the new hardware implementations detect patterns of 0x00’s. The space is acquired and zero’s written to stable media, then a background, hardware based garbage collector reclaims the blocks.
SQL Server needs to expand or create a new segment in the LDF to support active, transaction activity.
- The region is allocated
- Stamped with the 0x00’s pattern
- Hardware starts to reclaim the newly allocated regions
- SQL Server needs to write new transaction log records
As an example, SQL Server grew and stamped the log but when SQL Server attempts to write the log record the hardware may no longer have space. When SQL Server can no longer write log records all transaction activity for the database is halted.
Spatial Index Builds Faster
Index creation and tessellation are often intensive, spatial activities. Along with the native and TVP, spatial enhancements additional work to optimize index creation and tessellation was completed. Testing reveals that building a spatial index on SQL Server 2016, with the improved design, can be more than 2 times faster than SQL Server 2012 or 2014 on the same data and hardware.
It has become common place for spatial tables to be 300 million or more rows and reduction of the index build time by a factor of 2x or more reduces maintenance window(s.)
TVPs with Spatial Column
Table Valued Parameters (TVPs) containing spatial columns can be used as input parameter(s) to stored procedures.
SQL Server 2016 improves the scalability, using native spatial validation(s), increasing performance by 15 times or more.
- Before the Fix: 8000 Rows/Sec
- After the Fix: 120,000 Rows/Sec
Native Spatial Implementation(s)
As the SQL Server Spatial data types have matured, development team uncovered the Unmanaged (SQL Server) to Managed (Microsoft.SqlServer.Types) to SqlServerSpatial###.dll (unmanaged) transitions (PInvoke and PUnInvoke) may become a scalability bottleneck. The truth is that the PInvoke and PUnInvoke is extremely performant. The reality of a database system is that the CPU overhead for such transitions, at 3,000,000+ rows could be used to perform the work of the spatial method instead of the transition. SQL Server 2016 invokes the native implementation of the methods, avoiding the Unmanaged to Managed to Unmanaged transitions, improving performance.
There are no application or database changes just the SQL Server binary updates showing dramatic improvement. Running the same query on the same hardware, the following chart highlights the SQL 2016 implementation compared to SQL Server 2014.
DBCC Extended Checks
Internally DBCC CHECK* uses a page scanning coordinator design (MultiObjectScanner.) SQL Server 2016 changes the internal design to (CheckScanner), applying no lock semantics and a design similar to those used with In-Memory Optimized (Hekaton) objects, allowing DBCC operations to scale far better than previous releases.
Starting with SQL Server 2016, additional checks on filtered indexes, persisted computed columns, and UDT columns will not be run by default to avoid the expensive expression evaluation(s.) This change greatly reduces the duration of CHECKDB against databases containing these objects. However, the physical consistency checks of these objects is always completed. Only when EXTENDED_LOGICAL_CHECKS option is specified will the expression evaluations be performed in addition to already present, logical checks (indexed view, XML indexes, and spatial indexes) as part of the EXTENDED_LOGICAL_CHECKS option. For filtered indexes, CHECKDB has also been improved to skip records that do not qualify as being indexed by target NC index.