Handling Matrix Layout Differences on Different Platforms

Differences in default matrix layout, or storage conventions, between GLSL (OpenGL/Vulkan) and HLSL frequently cause confusion among developers. When writing applications that work on different targets, one important goal that developers frequently seek is to pass the same matrix generated by host code to the same shader code, regardless of which graphics API is being used (e.g. Vulkan, OpenGL or Direct3D). As a solution for shader cross-compilation, Slang provides the tools developers need to navigate the differences between GLSL and HLSL targets.

A high level summary:

  • Default matrix layout in memory for Slang is row-major.
    • Except when running the compiler through the slangc tool, in which case the default is column-major. This default is for legacy reasons and may change in the future.
  • Row-major layout is the only portable layout to use across targets (with significant caveats for non 4x4 matrices).
  • Use setMatrixLayoutMode/spSetMatrixLayoutMode/createSession to set the default
  • Use -matrix-layout-row-major or -matrix-layout-column-major for the command line
    • or via spProcessCommandLineArguments/processCommandLineArguments
  • Depending on your host maths library, matrix sizes and targets, it may be necessary to convert matrices at the host/kernel boundary

On the portability issue, some targets ignore the matrix layout mode, notably CUDA and CPU/C++. For this reason, row-major matrix layout is recommended when aiming for the widest range of targets.

Two conventions of matrix transform math

Depending on the platform a developer is used to, a matrix-vector transform can be expressed as either v*m (mul(v, m) in HLSL), or m*v (mul(m,v) in HLSL). This convention, together with the matrix layout (column-major or row-major), determines how a matrix should be filled out in host code.

In HLSL/Slang the order of the vector and matrix parameters to mul determines how the vector is interpreted. This interpretation is required because a vector does not, in and of itself, differentiate between being a row or a column.

  • mul(v, m) - v is interpreted as a row vector.
  • mul(m, v) - v is interpreted as a column vector.

Through this mechanism a developer is able to write transforms in their preferred style.

These two styles are not directly interchangeable - for a given v and m, generally mul(v, m) != mul(m, v). To get the same result the matrix needs to be transposed:

  • mul(v, m) == mul(transpose(m), v)
  • mul(m, v) == mul(v, transpose(m))
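
The two identities above can be checked with a small host-side sketch. The Vec4/Mat4 aliases and the mulRowVector, mulColumnVector and transpose helpers are hypothetical names chosen for illustration; they are not part of Slang or of any particular maths library, and simply mirror mul(v, m) and mul(m, v) on the CPU.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side types for illustration only (not Slang API).
using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // indexed as m[row][col]

// Mirrors mul(v, m): v is a row vector, so result[c] = sum over r of v[r] * m[r][c].
Vec4 mulRowVector(const Vec4& v, const Mat4& m) {
    Vec4 out{};
    for (std::size_t c = 0; c < 4; ++c)
        for (std::size_t r = 0; r < 4; ++r)
            out[c] += v[r] * m[r][c];
    return out;
}

// Mirrors mul(m, v): v is a column vector, so result[r] = sum over c of m[r][c] * v[c].
Vec4 mulColumnVector(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            out[r] += m[r][c] * v[c];
    return out;
}

// Swaps rows and columns.
Mat4 transpose(const Mat4& m) {
    Mat4 t{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            t[c][r] = m[r][c];
    return t;
}
```

For any non-symmetric m, mulRowVector(v, m) and mulColumnVector(m, v) differ, while mulRowVector(v, m) and mulColumnVector(transpose(m), v) produce the same vector.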

This behavior is independent of how a matrix is laid out in memory. Host code needs to be aware of how shader code will interpret a matrix stored in memory (its layout), as well as the vector interpretation convention used in shader code (i.e. mul(v, m) or mul(m, v)).

Matrix layout can be either row-major or column-major. The difference just determines which elements are contiguous in memory. Row-major means the elements of each row are contiguous. Column-major means the elements of each column are contiguous.

Another way to think about this difference is in terms of where translation terms should be placed in memory when filling a typical 4x4 transform matrix. When transforming a row vector (i.e. mul(v, m)) with a row-major matrix layout, translation will be at element offsets m + 12, 13, 14. For a column-major matrix layout, translation will be at m + 3, 7, 11.
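
The offsets above follow directly from how a (row, column) pair maps to a flat element index in each layout. The sketch below is a minimal illustration under the row vector convention, where translation occupies the last row of the matrix; flatIndex and makeTranslation are hypothetical helper names, not functions from Slang or a specific maths library.

```cpp
#include <array>
#include <cstddef>

// Flat element index of (row, col) in a 4x4 matrix for the given memory layout.
constexpr std::size_t flatIndex(std::size_t row, std::size_t col, bool rowMajor) {
    return rowMajor ? row * 4 + col : col * 4 + row;
}

// Builds a translation matrix for the row vector convention (mul(v, m)):
// the translation terms live in the last row, at (3,0), (3,1) and (3,2).
std::array<float, 16> makeTranslation(float tx, float ty, float tz, bool rowMajor) {
    std::array<float, 16> m{};
    for (std::size_t i = 0; i < 4; ++i)
        m[flatIndex(i, i, rowMajor)] = 1.0f;  // identity diagonal
    m[flatIndex(3, 0, rowMajor)] = tx;
    m[flatIndex(3, 1, rowMajor)] = ty;
    m[flatIndex(3, 2, rowMajor)] = tz;
    return m;
}
```

With rowMajor set to true the translation lands at elements 12, 13 and 14; with rowMajor set to false it lands at elements 3, 7 and 11.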

Note it is an HLSL/Slang convention that the parameter ordering of mul(v, m) means v is a row vector. A host maths library could have a transform function SomeLib::transform(v, m) such that v is interpreted as a column vector. For simplicity’s sake the remainder of this discussion assumes that the equivalent of mul(v, m) in host code follows the interpretation that v is a row vector.

Discussion

There are four variables in play here:

  • Host vector interpretation (row or column) - and therefore effective transform order (column) m * v or (row) v * m
  • Host matrix memory layout
  • Shader vector interpretation (as determined via mul(v, m) or mul(m, v) )
  • Shader matrix memory layout

Since each item can be either row or column there are 16 possible combinations. For simplicity let’s reduce the variable space by making some assumptions.

  1) The same vector convention will be used in host code as in shader code.
  2) The host maths matrix layout is the same as the kernel.

If we accept 1, then we can ignore the vector interpretation, because as long as host and shader are consistent only the matrix layout is significant. If we accept 2, then there are only two possible combinations - either both host and shader are using row-major matrix layout or both are using column-major layout.

This is simple, but is perhaps not the end of the story. First, let’s assume that we want our Slang code to be as portable as possible. As previously discussed, for CUDA and C++/CPU targets Slang ignores the matrix layout settings - the matrix layout is always row-major.

Second, let’s consider performance. The matrix layout in a host maths library is not arbitrary from a performance point of view. A performant host maths library will want to use SIMD instructions. With both x86/x64 SSE and ARM NEON SIMD it makes a performance difference which layout is used, depending on whether column or row is the preferred vector interpretation. If the row vector interpretation is preferred, it is most performant to have row-major matrix layout. Conversely, if column vector interpretation is preferred, column-major matrix layout will be the most performant.

The performance difference comes down to a SIMD implementation having to do a transpose if the layout doesn’t match the preferred vector interpretation.

If we put this all together - best performance, consistent vector interpretation, and platform independence - we get:

  1) Consistency: Same vector interpretation in shader and host code
  2) Platform independence: Kernel uses row-major matrix layout
  3) Performance: Host vector interpretation should match host matrix layout

The only combination that fulfills all aspects is row-major matrix layout and row vector interpretation for both host and kernel.

It’s worth noting that for targets that honor the default matrix layout, that setting can act like a toggle that transposes a matrix layout. If for some reason the combination of choices leads to inconsistent vector transforms, an implementation can perform the transpose in host code at the boundary between host and the kernel. This is not the most performant or convenient scenario, but if supported in an implementation it could be used for targets that do not support kernel matrix layout settings.

If only targeting platforms that honor matrix layout, there is more flexibility. Our constraints are:

  1) Consistency: Same vector interpretation in shader and host code
  2) Performance: Host vector interpretation should match host matrix layout

Then there are two combinations that work:

  1) Row-major matrix layout for host and kernel, and row vector interpretation.
  2) Column-major matrix layout for host and kernel, and column vector interpretation.

If the host maths library is not performance orientated, it may be arbitrary from a performance point of view if a row or column vector interpretation is used. In that case assuming shader and host vector interpretation is the same it is only important that the kernel and maths library matrix layout match.

Another way of thinking about these combinations is to treat each change in row-major/column-major matrix layout and row/column vector interpretation as a transpose. If there is an even number of flips then all the transposes cancel out. Therefore the following combinations work:

Host Vector   Kernel Vector   Host Mat Layout   Kernel Mat Layout
Row           Row             Row               Row
Row           Row             Column            Column
Column        Column          Row               Row
Column        Column          Column            Column
Row           Column          Row               Column
Row           Column          Column            Row
Column        Row             Row               Column
Column        Row             Column            Row

To be clear ‘Kernel Mat Layout’ is the shader matrix layout setting. As previously touched upon, if it is not possible to use the setting (say because it is not supported on a target), then doing a transpose at the host/kernel boundary can fix the issue.
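
The "even number of flips" rule can be written as a small predicate. This is a sketch for reasoning about the table, not part of Slang; the Major enum and the isConsistent and countConsistentCombinations functions are hypothetical names introduced here.

```cpp
#include <array>

enum class Major { Row, Column };

// A mismatch in vector interpretation counts as one transpose, and a mismatch
// in matrix layout counts as another. The host and kernel agree when both
// mismatches are present or both are absent (zero or two transposes).
bool isConsistent(Major hostVector, Major kernelVector,
                  Major hostLayout, Major kernelLayout) {
    const bool vectorFlipped = hostVector != kernelVector;
    const bool layoutFlipped = hostLayout != kernelLayout;
    return vectorFlipped == layoutFlipped;
}

// Enumerates all 16 combinations and counts those that are consistent.
int countConsistentCombinations() {
    const std::array<Major, 2> options = {Major::Row, Major::Column};
    int count = 0;
    for (Major hostVector : options)
        for (Major kernelVector : options)
            for (Major hostLayout : options)
                for (Major kernelLayout : options)
                    if (isConsistent(hostVector, kernelVector, hostLayout, kernelLayout))
                        ++count;
    return count;
}
```

Exactly half of the 16 combinations pass the check, which corresponds to the eight rows listed in the table.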

Matrix Layout

The above discussion is largely around 4x4 32-bit element matrices. For graphics APIs such as Vulkan, GL, and D3D there are typically additional restrictions for matrix layout. One restriction is for 16 byte alignment between rows (for row-major layout) and columns (for column-major layout).

More CPU-like targets such as CUDA and C++/CPU do not have this restriction, and all elements are consecutive.

This being the case only the following matrix types/matrix layouts will work across all targets. (Listed in the HLSL convention of RxC).

  • 1x4 row-major matrix layout
  • 2x4 row-major matrix layout
  • 3x4 row-major matrix layout
  • 4x4 row-major matrix layout

These are all ‘row-major’ because, as previously discussed, only row-major matrix layout currently works across all targets.

NOTE! This only applies to matrices that are transferred between host and kernel - any matrix size will work appropriately for variables in shader/kernel code for example.

The host maths library also plays a part here. The library may hold all elements consecutively in memory. If that’s the case it will match the CPU/CUDA kernels, but will only work on ‘graphics’-like targets that match that layout for the size.

For SIMD based host maths libraries it can be even more convoluted. If a SIMD library is being used that prefers row vector interpretation, and therefore has row-major layout, it may for many sizes not match the CPU-like consecutive layout. For example, a 4x3 matrix will likely be packed with 16 byte row alignment. Additionally, even if a matrix is packed in the same way it may not be the same size. For example a 3x2 matrix may hold the rows consecutively but be 16 bytes in size, as opposed to the 12 bytes that a CPU-like kernel will expect.

If a SIMD based host maths library is being used with graphics-like APIs, there is a good chance (but certainly no guarantee) that the layout of non 4x4 sizes will match, because SIMD typically implies 16 byte alignment.

If your application uses matrix sizes that are not 4x4 across the host/kernel boundary and it wants to work across all targets, it is likely that some matrices will have to be converted at the boundary. This being the case, having to handle transposing matrices at the boundary is a less significant issue.
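
As an illustration of what such a boundary conversion might look like, the sketch below copies a tightly packed row-major matrix into a buffer whose rows each start on a 16 byte boundary. It assumes 32-bit float elements, at most four columns, and a uniform 16 byte row stride; the exact rules for a real target and buffer type may differ, so treat padRowsTo16Bytes as a hypothetical helper rather than a definitive layout implementation.

```cpp
#include <cstddef>
#include <vector>

// Copies a tightly packed row-major matrix (rows x cols floats) into a buffer
// where every row occupies 4 floats (16 bytes). Assumes cols <= 4; the unused
// trailing floats of each row are zero-filled padding.
std::vector<float> padRowsTo16Bytes(const float* packed, std::size_t rows, std::size_t cols) {
    constexpr std::size_t kRowStride = 4;  // 16 bytes / sizeof(float)
    std::vector<float> padded(rows * kRowStride, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            padded[r * kRowStride + c] = packed[r * cols + c];
    return padded;
}
```

For a 3x3 matrix this turns 9 consecutive floats into 12, with one padding float at the end of each row. If a transpose is also required, it can be folded into the same copy loop.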

In conclusion, if your application has to perform matrix conversion work at the host/kernel boundary, the previous observation that “best performance” implies row-major layout and row vector interpretation becomes somewhat moot.

Overriding default matrix layout

Slang allows users to override default matrix layout with a compiler flag. This compiler flag can be specified during the creation of a Session:

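slang::IGlobalSession* globalSession;
...
// Request column-major as the default matrix layout for this session.
slang::SessionDesc slangSessionDesc = {};
slangSessionDesc.defaultMatrixLayoutMode = SLANG_MATRIX_LAYOUT_COLUMN_MAJOR;
...
// Create the session with the layout setting applied.
slang::ISession* session;
globalSession->createSession(slangSessionDesc, &session);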

This makes Slang treat all matrices as being in column-major layout and, for example, emit the column_major qualifier in the resulting HLSL code.

Alternatively, the default layout can be set by:

  • Including a CompilerOptionName::MatrixLayoutColumn or CompilerOptionName::MatrixLayoutRow entry in SessionDesc::compilerOptionEntries.
  • Setting -matrix-layout-row-major or -matrix-layout-column-major command line options to slangc.