To use the functions provided by this module, import it:

```
>>> import HardwareIntrinsics
```

These intrinsic functions are only available if your CPU supports `Sse41` features.

`mm_blend_epi16`

Blend packed 16-bit integers from "a" and "b" using control mask "imm8", and store the results in "dst".

__m128i _mm_blend_epi16 (__m128i a, __m128i b, const int imm8) PBLENDW xmm, xmm/m128, imm8
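As a rough illustration of how the "imm8" control mask picks lanes, the pure-Python model below treats each vector as a list of eight 16-bit values. The name `blend_epi16_model` and the list representation are assumptions made for this sketch; it does not call this module and is not its API.

```python
def blend_epi16_model(a, b, imm8):
    """Model of PBLENDW on two lists of eight 16-bit lane values."""
    # Bit i of imm8 selects lane i from b when set, from a when clear.
    return [b[i] if (imm8 >> i) & 1 else a[i] for i in range(8)]


a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]

# The low four mask bits are set, so lanes 0..3 come from b.
blended = blend_epi16_model(a, b, 0b00001111)
print(blended)  # [10, 11, 12, 13, 4, 5, 6, 7]
```

The `mm_blend_pd` and `mm_blend_ps` variants follow the same pattern with two and four lanes respectively.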

`mm_blend_pd`

Blend packed double-precision (64-bit) floating-point elements from "a" and "b" using control mask "imm8", and store the results in "dst".

__m128d _mm_blend_pd (__m128d a, __m128d b, const int imm8) BLENDPD xmm, xmm/m128, imm8

`mm_blend_ps`

Blend packed single-precision (32-bit) floating-point elements from "a" and "b" using control mask "imm8", and store the results in "dst".

__m128 _mm_blend_ps (__m128 a, __m128 b, const int imm8) BLENDPS xmm, xmm/m128, imm8

`mm_blendv_epi8`

Blend packed 8-bit integers from "a" and "b" using "mask", and store the results in "dst".

__m128i _mm_blendv_epi8 (__m128i a, __m128i b, __m128i mask) PBLENDVB xmm, xmm/m128, xmm
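Unlike the immediate-controlled blends, the variable blend reads its selector from a vector: in the PBLENDVB instruction, only the most significant bit of each mask byte matters. The sketch below models that rule on plain lists of sixteen byte values; `blendv_epi8_model` is a hypothetical name, not a function of this module.

```python
def blendv_epi8_model(a, b, mask):
    """Model of PBLENDVB on three lists of sixteen byte values."""
    # A mask byte with its top bit (0x80) set selects the byte from b.
    return [y if m & 0x80 else x for x, y, m in zip(a, b, mask)]


a = list(range(16))
b = [100 + i for i in range(16)]
# Odd positions have the top bit set; 0x7F has every bit set except the top one.
mask = [0x80 if i % 2 else 0x7F for i in range(16)]

blended = blendv_epi8_model(a, b, mask)
print(blended[:4])  # [0, 101, 2, 103]
```

Note that `0x7F` does not select "b", which shows that the lower seven mask bits are ignored.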

`mm_blendv_pd`

Blend packed double-precision (64-bit) floating-point elements from "a" and "b" using "mask", and store the results in "dst".

__m128d _mm_blendv_pd (__m128d a, __m128d b, __m128d mask) BLENDVPD xmm, xmm/m128, xmm0

`mm_blendv_ps`

Blend packed single-precision (32-bit) floating-point elements from "a" and "b" using "mask", and store the results in "dst".

__m128 _mm_blendv_ps (__m128 a, __m128 b, __m128 mask) BLENDVPS xmm, xmm/m128, xmm0

`mm_ceil_pd`

Round the packed double-precision (64-bit) floating-point elements in "a" up to an integer value, and store the results as packed double-precision floating-point elements in "dst".

__m128d _mm_ceil_pd (__m128d a) ROUNDPD xmm, xmm/m128, imm8(10)

`mm_ceil_ps`

Round the packed single-precision (32-bit) floating-point elements in "a" up to an integer value, and store the results as packed single-precision floating-point elements in "dst".

__m128 _mm_ceil_ps (__m128 a) ROUNDPS xmm, xmm/m128, imm8(10)

`mm_ceil_sd`

Round the lower double-precision (64-bit) floating-point element in "b" up to an integer value, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".

__m128d _mm_ceil_sd (__m128d a, __m128d b) ROUNDSD xmm, xmm/m128, imm8(10)
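The scalar form mixes two inputs, which is easy to misread. A minimal pure-Python model, assuming each vector is a two-element list `[lower, upper]` (the name `ceil_sd_model` is hypothetical, and NaN, infinity, and signed zero are not modeled):

```python
import math


def ceil_sd_model(a, b):
    """Model of the two-operand scalar ceiling on [lower, upper] lists."""
    # The lower result comes from b (rounded up); the upper is copied from a.
    return [float(math.ceil(b[0])), a[1]]


result = ceil_sd_model([1.5, 2.5], [3.2, 9.9])
print(result)  # [4.0, 2.5]
```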

`mm_ceil_sd1`

Round the lower double-precision (64-bit) floating-point element in "a" up to an integer value, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".

__m128d _mm_ceil_sd (__m128d a) ROUNDSD xmm, xmm/m128, imm8(10)

`mm_ceil_ss`

Round the lower single-precision (32-bit) floating-point element in "b" up to an integer value, store the result as a single-precision floating-point element in the lower element of "dst", and copy the upper 3 packed elements from "a" to the upper elements of "dst".

__m128 _mm_ceil_ss (__m128 a, __m128 b) ROUNDSS xmm, xmm/m128, imm8(10)

`mm_cmpeq_epi64`

Compare packed 64-bit integers in "a" and "b" for equality, and store the results in "dst".

__m128i _mm_cmpeq_epi64 (__m128i a, __m128i b) PCMPEQQ xmm, xmm/m128
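The stored result is a per-lane mask rather than a boolean: with PCMPEQQ, a lane that compares equal becomes all ones and any other lane becomes zero. A small pure-Python sketch, with `cmpeq_epi64_model` as a hypothetical name and two-element lists standing in for the vectors:

```python
ALL_ONES_64 = 0xFFFFFFFFFFFFFFFF


def cmpeq_epi64_model(a, b):
    """Model of PCMPEQQ on two lists of two 64-bit lane values."""
    return [ALL_ONES_64 if x == y else 0 for x, y in zip(a, b)]


mask = cmpeq_epi64_model([7, 42], [7, 43])
print([hex(m) for m in mask])  # ['0xffffffffffffffff', '0x0']
```

A mask of this shape can be fed directly to a variable blend such as `mm_blendv_epi8`.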

`mm_cvtepi16_epi32`

Sign extend packed 16-bit integers in "a" to packed 32-bit integers, and store the results in "dst".

__m128i _mm_cvtepi16_epi32 (__m128i a) PMOVSXWD xmm, xmm/m64
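Sign extension keeps the numeric value of each lane while widening it. The following pure-Python sketch models this for the 16-bit to 32-bit case, assuming lanes are given as raw unsigned bit patterns; `sign_extend` and `cvtepi16_epi32_model` are hypothetical helper names.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def cvtepi16_epi32_model(a):
    """Model of PMOVSXWD: only the low four 16-bit lanes of a are read."""
    return [sign_extend(x, 16) for x in a[:4]]


widened = cvtepi16_epi32_model([0xFFFF, 0x8000, 0x7FFF, 1, 9, 9, 9, 9])
print(widened)  # [-1, -32768, 32767, 1]
```

The upper four source lanes are ignored, which matches the `xmm/m64` operand in the signature.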

`mm_cvtepi16_epi64`

Sign extend packed 16-bit integers in "a" to packed 64-bit integers, and store the results in "dst".

__m128i _mm_cvtepi16_epi64 (__m128i a) PMOVSXWQ xmm, xmm/m32

`mm_cvtepi32_epi64`

Sign extend packed 32-bit integers in "a" to packed 64-bit integers, and store the results in "dst".

__m128i _mm_cvtepi32_epi64 (__m128i a) PMOVSXDQ xmm, xmm/m64

`mm_cvtepi8_epi16`

Sign extend packed 8-bit integers in "a" to packed 16-bit integers, and store the results in "dst".

__m128i _mm_cvtepi8_epi16 (__m128i a) PMOVSXBW xmm, xmm/m64

`mm_cvtepi8_epi32`

Sign extend packed 8-bit integers in "a" to packed 32-bit integers, and store the results in "dst".

__m128i _mm_cvtepi8_epi32 (__m128i a) PMOVSXBD xmm, xmm/m32

`mm_cvtepi8_epi64`

Sign extend packed 8-bit integers in the low 2 bytes of "a" to packed 64-bit integers, and store the results in "dst".

__m128i _mm_cvtepi8_epi64 (__m128i a) PMOVSXBQ xmm, xmm/m16

`mm_cvtepu16_epi32`

Zero extend packed unsigned 16-bit integers in "a" to packed 32-bit integers, and store the results in "dst".

__m128i _mm_cvtepu16_epi32 (__m128i a) PMOVZXWD xmm, xmm/m64

`mm_cvtepu16_epi64`

Zero extend packed unsigned 16-bit integers in "a" to packed 64-bit integers, and store the results in "dst".

__m128i _mm_cvtepu16_epi64 (__m128i a) PMOVZXWQ xmm, xmm/m32

`mm_cvtepu32_epi64`

Zero extend packed unsigned 32-bit integers in "a" to packed 64-bit integers, and store the results in "dst".

__m128i _mm_cvtepu32_epi64 (__m128i a) PMOVZXDQ xmm, xmm/m64

`mm_cvtepu8_epi16`

Zero extend packed unsigned 8-bit integers in "a" to packed 16-bit integers, and store the results in "dst".

__m128i _mm_cvtepu8_epi16 (__m128i a) PMOVZXBW xmm, xmm/m64

`mm_cvtepu8_epi32`

Zero extend packed unsigned 8-bit integers in "a" to packed 32-bit integers, and store the results in "dst".

__m128i _mm_cvtepu8_epi32 (__m128i a) PMOVZXBD xmm, xmm/m32

`mm_cvtepu8_epi64`

Zero extend packed unsigned 8-bit integers in the low 2 bytes of "a" to packed 64-bit integers, and store the results in "dst".

__m128i _mm_cvtepu8_epi64 (__m128i a) PMOVZXBQ xmm, xmm/m16
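The `xmm/m16` operand indicates that the 8-bit to 64-bit conversions read only two source bytes, one per 64-bit result lane. The sketch below contrasts the signed and unsigned variants in pure Python; both function names are hypothetical and the input is a list of sixteen byte values.

```python
def cvtepi8_epi64_model(a):
    """Model of PMOVSXBQ: sign extend the two lowest bytes."""
    return [x - 256 if x & 0x80 else x for x in a[:2]]


def cvtepu8_epi64_model(a):
    """Model of PMOVZXBQ: zero extend the two lowest bytes."""
    return [x & 0xFF for x in a[:2]]


source = [0xFF, 0x01] + [0xAA] * 14  # bytes 2..15 are never read

signed = cvtepi8_epi64_model(source)
unsigned = cvtepu8_epi64_model(source)
print(signed, unsigned)  # [-1, 1] [255, 1]
```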

`mm_dp_pd`

Conditionally multiply the packed double-precision (64-bit) floating-point elements in "a" and "b" using bits 4 and 5 of "imm8", sum the two products, and conditionally store the sum in "dst" using bits 0 and 1 of "imm8".

__m128d _mm_dp_pd (__m128d a, __m128d b, const int imm8) DPPD xmm, xmm/m128, imm8

`mm_dp_ps`

Conditionally multiply the packed single-precision (32-bit) floating-point elements in "a" and "b" using the high 4 bits in "imm8", sum the four products, and conditionally store the sum in "dst" using the low 4 bits of "imm8".

__m128 _mm_dp_ps (__m128 a, __m128 b, const int imm8) DPPS xmm, xmm/m128, imm8
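The two halves of "imm8" play different roles, which a short model may make clearer. This is a pure-Python sketch under the assumption that vectors are four-element lists; `dp_ps_model` is a hypothetical name, and because Python floats are double precision the result can differ in the last bits from the single-precision hardware result.

```python
def dp_ps_model(a, b, imm8):
    """Model of DPPS on two lists of four floats."""
    # High four bits: which lane products take part in the sum.
    total = sum((a[i] * b[i] for i in range(4) if (imm8 >> (4 + i)) & 1), 0.0)
    # Low four bits: which result lanes receive the sum (others are zero).
    return [total if (imm8 >> i) & 1 else 0.0 for i in range(4)]


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]

# 0x71: multiply lanes 0..2, write the sum to lane 0 only.
dot3 = dp_ps_model(a, b, 0x71)
print(dot3)  # [38.0, 0.0, 0.0, 0.0]
```

With `0xFF` every product is included and the sum is broadcast to all four lanes.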

`mm_extract_epi32`

Extract a 32-bit integer from "a", selected with "imm8", and store the result in "dst".

int _mm_extract_epi32 (__m128i a, const int imm8) PEXTRD reg/m32, xmm, imm8

`mm_extract_epi8`

Extract an 8-bit integer from "a", selected with "imm8", and store the result in the lower element of "dst".

int _mm_extract_epi8 (__m128i a, const int imm8) PEXTRB reg/m8, xmm, imm8

`mm_extract_ps`

Extract a single-precision (32-bit) floating-point element from "a", selected with "imm8", and store the result in "dst".

int _mm_extract_ps (__m128 a, const int imm8) EXTRACTPS reg/m32, xmm, imm8
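Because the C signature returns `int`, the extracted value is the raw IEEE-754 bit pattern of the selected element, not a converted number. A minimal pure-Python sketch of that reinterpretation, using the standard `struct` module (`extract_ps_model` is a hypothetical name):

```python
import struct


def extract_ps_model(a, imm8):
    """Model of EXTRACTPS: return the bits of one float lane as a signed int."""
    lane = imm8 & 3  # only the low two bits of imm8 select the lane
    return struct.unpack("<i", struct.pack("<f", a[lane]))[0]


vec = [1.0, 2.0, -1.0, 0.5]
bits = extract_ps_model(vec, 0)
print(hex(bits))  # 0x3f800000
```

To get the float back, the integer has to be reinterpreted again, for example with `struct.unpack("<f", struct.pack("<i", bits))`.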

`mm_floor_pd`

Round the packed double-precision (64-bit) floating-point elements in "a" down to an integer value, and store the results as packed double-precision floating-point elements in "dst".

__m128d _mm_floor_pd (__m128d a) ROUNDPD xmm, xmm/m128, imm8(9)

`mm_floor_ps`

Round the packed single-precision (32-bit) floating-point elements in "a" down to an integer value, and store the results as packed single-precision floating-point elements in "dst".

__m128 _mm_floor_ps (__m128 a) ROUNDPS xmm, xmm/m128, imm8(9)

`mm_floor_sd`

Round the lower double-precision (64-bit) floating-point element in "b" down to an integer value, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".

__m128d _mm_floor_sd (__m128d a, __m128d b) ROUNDSD xmm, xmm/m128, imm8(9)

`mm_floor_sd1`

Round the lower double-precision (64-bit) floating-point element in "a" down to an integer value, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".

__m128d _mm_floor_sd (__m128d a) ROUNDSD xmm, xmm/m128, imm8(9)

`mm_floor_ss`

Round the lower single-precision (32-bit) floating-point element in "b" down to an integer value, store the result as a single-precision floating-point element in the lower element of "dst", and copy the upper 3 packed elements from "a" to the upper elements of "dst".

__m128 _mm_floor_ss (__m128 a, __m128 b) ROUNDSS xmm, xmm/m128, imm8(9)

`mm_insert_epi32`

Copy "a" to "dst", and insert the 32-bit integer "i" into "dst" at the location specified by "imm8".

__m128i _mm_insert_epi32 (__m128i a, int i, const int imm8) PINSRD xmm, reg/m32, imm8

`mm_insert_epi8`

Copy "a" to "dst", and insert the lower 8-bit integer from "i" into "dst" at the location specified by "imm8".

__m128i _mm_insert_epi8 (__m128i a, int i, const int imm8) PINSRB xmm, reg/m8, imm8

`mm_insert_ps`

Copy "a" to "tmp", then insert a single-precision (32-bit) floating-point element from "b" into "tmp" using the control in "imm8". Store "tmp" to "dst" using the mask in "imm8" (elements are zeroed out when the corresponding bit is set).

__m128 _mm_insert_ps (__m128 a, __m128 b, const int imm8) INSERTPS xmm, xmm/m32, imm8
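The description above does not spell out how "imm8" is divided. In the INSERTPS instruction, bits 7:6 select the source element of "b", bits 5:4 select the destination lane, and bits 3:0 form the zero mask. The pure-Python sketch below models that layout on four-element lists; `insert_ps_model` is a hypothetical name.

```python
def insert_ps_model(a, b, imm8):
    """Model of INSERTPS on two lists of four floats."""
    source_lane = (imm8 >> 6) & 3  # which element of b to take
    target_lane = (imm8 >> 4) & 3  # where to place it in the copy of a
    tmp = list(a)
    tmp[target_lane] = b[source_lane]
    # A set bit in the low nibble zeroes the corresponding result lane.
    return [0.0 if (imm8 >> i) & 1 else tmp[i] for i in range(4)]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# 0b10_01_0001: take b[2], put it in lane 1, zero lane 0.
inserted = insert_ps_model(a, b, 0b10010001)
print(inserted)  # [0.0, 30.0, 3.0, 4.0]
```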

`mm_max_epi32`

Compare packed 32-bit integers in "a" and "b", and store packed maximum values in "dst".

__m128i _mm_max_epi32 (__m128i a, __m128i b) PMAXSD xmm, xmm/m128

`mm_max_epi8`

Compare packed 8-bit integers in "a" and "b", and store packed maximum values in "dst".

__m128i _mm_max_epi8 (__m128i a, __m128i b) PMAXSB xmm, xmm/m128

`mm_max_epu16`

Compare packed unsigned 16-bit integers in "a" and "b", and store packed maximum values in "dst".

__m128i _mm_max_epu16 (__m128i a, __m128i b) PMAXUW xmm, xmm/m128

`mm_max_epu32`

Compare packed unsigned 32-bit integers in "a" and "b", and store packed maximum values in "dst".

__m128i _mm_max_epu32 (__m128i a, __m128i b) PMAXUD xmm, xmm/m128
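The signed and unsigned maximum differ only in how the same bit patterns are interpreted. The sketch below shows this on raw 32-bit patterns in pure Python; the helper names are hypothetical and four-element lists stand in for the vectors.

```python
def as_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def max_epi32_model(a, b):
    """Model of PMAXSD: compare lanes as signed values, keep the bit pattern."""
    return [x if as_signed32(x) >= as_signed32(y) else y for x, y in zip(a, b)]


def max_epu32_model(a, b):
    """Model of PMAXUD: compare lanes as unsigned values."""
    return [max(x, y) for x, y in zip(a, b)]


a = [0xFFFFFFFF, 5, 0x80000000, 7]
b = [1, 9, 0x7FFFFFFF, 7]

signed_max = max_epi32_model(a, b)
unsigned_max = max_epu32_model(a, b)
print(signed_max)    # [1, 9, 2147483647, 7]
print(unsigned_max)  # [4294967295, 9, 2147483648, 7]
```

In lane 0, `0xFFFFFFFF` is -1 when signed and loses to 1, but it is the largest possible value when unsigned.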

`mm_min_epi32`

Compare packed 32-bit integers in "a" and "b", and store packed minimum values in "dst".

__m128i _mm_min_epi32 (__m128i a, __m128i b) PMINSD xmm, xmm/m128

`mm_min_epi8`

Compare packed 8-bit integers in "a" and "b", and store packed minimum values in "dst".

__m128i _mm_min_epi8 (__m128i a, __m128i b) PMINSB xmm, xmm/m128

`mm_min_epu16`

Compare packed unsigned 16-bit integers in "a" and "b", and store packed minimum values in "dst".

__m128i _mm_min_epu16 (__m128i a, __m128i b) PMINUW xmm, xmm/m128

`mm_min_epu32`

Compare packed unsigned 32-bit integers in "a" and "b", and store packed minimum values in "dst".

__m128i _mm_min_epu32 (__m128i a, __m128i b) PMINUD xmm, xmm/m128

`mm_minpos_epu16`

Horizontally compute the minimum amongst the packed unsigned 16-bit integers in "a", store the minimum and index in "dst", and zero the remaining bits in "dst".

__m128i _mm_minpos_epu16 (__m128i a) PHMINPOSUW xmm, xmm/m128
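A short model of the result layout may help: the minimum lands in the lowest 16-bit lane, its index in the next lane, and the other six lanes are zero. This is a pure-Python sketch with a hypothetical name, `minpos_epu16_model`, operating on a list of eight unsigned 16-bit values.

```python
def minpos_epu16_model(a):
    """Model of PHMINPOSUW on a list of eight unsigned 16-bit values."""
    smallest = min(a)
    index = a.index(smallest)  # on ties, the lowest index is reported
    return [smallest, index, 0, 0, 0, 0, 0, 0]


found = minpos_epu16_model([9, 4, 7, 4, 12, 30, 5, 8])
print(found)  # [4, 1, 0, 0, 0, 0, 0, 0]
```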

`mm_mpsadbw_epu8`

Compute the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in "a" compared to those in "b", and store the 16-bit results in "dst". Eight SADs are performed using one quadruplet from "b" and eight quadruplets from "a". One quadruplet is selected from "b" starting at the offset specified in "imm8". Eight quadruplets are formed from sequential 8-bit integers selected from "a" starting at the offset specified in "imm8".

__m128i _mm_mpsadbw_epu8 (__m128i a, __m128i b, const int imm8) MPSADBW xmm, xmm/m128, imm8
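The sliding-window structure described above can be sketched in a few lines. In the MPSADBW instruction, bit 2 of "imm8" gives the starting offset in "a" (0 or 4 bytes) and bits 1:0 pick one of the four quadruplets of "b". The model below is pure Python with a hypothetical name and sixteen-element byte lists as inputs.

```python
def mpsadbw_epu8_model(a, b, imm8):
    """Model of MPSADBW on two lists of sixteen unsigned bytes."""
    a_offset = ((imm8 >> 2) & 1) * 4
    b_offset = (imm8 & 3) * 4
    quad = b[b_offset:b_offset + 4]  # the single quadruplet taken from b
    # Slide a four-byte window over a, one byte at a time, eight times.
    return [
        sum(abs(a[a_offset + j + k] - quad[k]) for k in range(4))
        for j in range(8)
    ]


a = list(range(16))
b = [1, 1, 1, 1] + [0] * 12

sads = mpsadbw_epu8_model(a, b, 0)
print(sads)  # [4, 6, 10, 14, 18, 22, 26, 30]
```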

`mm_mul_epi32`

Multiply the low 32-bit integers from each packed 64-bit element in "a" and "b", and store the signed 64-bit results in "dst".

__m128i _mm_mul_epi32 (__m128i a, __m128i b) PMULDQ xmm, xmm/m128

`mm_mullo_epi32`

Multiply the packed 32-bit integers in "a" and "b", producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in "dst".

__m128i _mm_mullo_epi32 (__m128i a, __m128i b) PMULLD xmm, xmm/m128
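The two multiplies above are easy to confuse: `mm_mul_epi32` produces two full 64-bit products from lanes 0 and 2, while `mm_mullo_epi32` produces four truncated 32-bit products. A pure-Python sketch of the difference, with hypothetical function names and lists of four signed 32-bit values as inputs:

```python
def wrap_signed32(x):
    """Keep the low 32 bits of x and read them as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def mul_epi32_model(a, b):
    """Model of PMULDQ: full products of lanes 0 and 2 only."""
    return [a[0] * b[0], a[2] * b[2]]


def mullo_epi32_model(a, b):
    """Model of PMULLD: low 32 bits of each of the four products."""
    return [wrap_signed32(x * y) for x, y in zip(a, b)]


a = [100000, 7, -3, 9]
b = [100000, 8, 5, 10]

wide = mul_epi32_model(a, b)
low = mullo_epi32_model(a, b)
print(wide)  # [10000000000, -15]
print(low)   # [1410065408, 56, -15, 90]
```

Lane 0 overflows 32 bits, so only the widening form keeps the exact product.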

`mm_packus_epi32`

Convert packed 32-bit integers from "a" and "b" to packed 16-bit integers using unsigned saturation, and store the results in "dst".

__m128i _mm_packus_epi32 (__m128i a, __m128i b) PACKUSDW xmm, xmm/m128
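"Unsigned saturation" here means each signed 32-bit input is clamped to the range 0..65535 before narrowing. A minimal pure-Python model, assuming four-element lists of signed integers and a hypothetical function name:

```python
def packus_epi32_model(a, b):
    """Model of PACKUSDW: clamp eight signed 32-bit values to 0..0xFFFF."""
    # The four lanes of a fill the low half of the result, b the high half.
    return [max(0, min(0xFFFF, x)) for x in a + b]


packed = packus_epi32_model([-5, 70000, 1234, 65535], [0, 65536, -1, 42])
print(packed)  # [0, 65535, 1234, 65535, 0, 65535, 0, 42]
```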

`mm_round_pd1`

Round the packed double-precision (64-bit) floating-point elements in "a" using the "rounding" parameter, and store the results as packed double-precision floating-point elements in "dst".

__m128d _mm_round_pd (__m128d a, _MM_FROUND_CUR_DIRECTION) ROUNDPD xmm, xmm/m128, imm8(4)

`mm_round_pd1_to_nearest_integer`

Round the packed double-precision (64-bit) floating-point elements in "a" using the "rounding" parameter, and store the results as packed double-precision floating-point elements in "dst".

__m128d _mm_round_pd (__m128d a, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) ROUNDPD xmm, xmm/m128, imm8(8)

`mm_round_pd1_to_negative_infinity`

Round the packed double-precision (64-bit) floating-point elements in "a" using the "rounding" parameter, and store the results as packed double-precision floating-point elements in "dst".

__m128d _mm_round_pd (__m128d a, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC) ROUNDPD xmm, xmm/m128, imm8(9)

`mm_round_pd1_to_positive_infinity`

__m128d _mm_round_pd (__m128d a, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC) ROUNDPD xmm, xmm/m128, imm8(10)

`mm_round_pd1_to_zero`

__m128d _mm_round_pd (__m128d a, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) ROUNDPD xmm, xmm/m128, imm8(11)

`mm_round_ps`

Round the packed single-precision (32-bit) floating-point elements in "a" using the "rounding" parameter, and store the results as packed single-precision floating-point elements in "dst".

__m128 _mm_round_ps (__m128 a, _MM_FROUND_CUR_DIRECTION) ROUNDPS xmm, xmm/m128, imm8(4)

`mm_round_ps_to_nearest_integer`

Round the packed single-precision (32-bit) floating-point elements in "a" using the "rounding" parameter, and store the results as packed single-precision floating-point elements in "dst".

__m128 _mm_round_ps (__m128 a, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) ROUNDPS xmm, xmm/m128, imm8(8)

`mm_round_ps_to_negative_infinity`

Round the packed single-precision (32-bit) floating-point elements in "a" using the "rounding" parameter, and store the results as packed single-precision floating-point elements in "dst".

__m128 _mm_round_ps (__m128 a, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC) ROUNDPS xmm, xmm/m128, imm8(9)

`mm_round_ps_to_positive_infinity`

__m128 _mm_round_ps (__m128 a, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC) ROUNDPS xmm, xmm/m128, imm8(10)

`mm_round_ps_to_zero`

__m128 _mm_round_ps (__m128 a, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) ROUNDPS xmm, xmm/m128, imm8(11)

`mm_round_sd`

Round the lower double-precision (64-bit) floating-point element in "b" using the "rounding" parameter, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".

__m128d _mm_round_sd (__m128d a, __m128d b, _MM_FROUND_CUR_DIRECTION) ROUNDSD xmm, xmm/m128, imm8(4)

`mm_round_sd1`

Round the lower double-precision (64-bit) floating-point element in "a" using the "rounding" parameter, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".

__m128d _mm_round_sd (__m128d a, _MM_FROUND_CUR_DIRECTION) ROUNDSD xmm, xmm/m128, imm8(4)

`mm_round_sd1_to_nearest_integer_scalar`

Round the lower double-precision (64-bit) floating-point element in "a" using the "rounding" parameter, store the result as a double-precision floating-point element in the lower element of "dst", and copy the upper element from "a" to the upper element of "dst".

__m128d _mm_round_sd (__m128d a, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) ROUNDSD xmm, xmm/m128, imm8(8)

`mm_round_sd1_to_negative_infinity_scalar`

__m128d _mm_round_sd (__m128d a, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC) ROUNDSD xmm, xmm/m128, imm8(9)

`mm_round_sd1_to_positive_infinity_scalar`

__m128d _mm_round_sd (__m128d a, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC) ROUNDSD xmm, xmm/m128, imm8(10)

`mm_round_sd1_to_zero_scalar`

__m128d _mm_round_sd (__m128d a, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) ROUNDSD xmm, xmm/m128, imm8(11)

`mm_round_sd_to_nearest_integer_scalar`

__m128d _mm_round_sd (__m128d a, __m128d b, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) ROUNDSD xmm, xmm/m128, imm8(8)

`mm_round_sd_to_negative_infinity_scalar`

__m128d _mm_round_sd (__m128d a, __m128d b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC) ROUNDSD xmm, xmm/m128, imm8(9)

`mm_round_sd_to_positive_infinity_scalar`

__m128d _mm_round_sd (__m128d a, __m128d b, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC) ROUNDSD xmm, xmm/m128, imm8(10)

`mm_round_sd_to_zero_scalar`

__m128d _mm_round_sd (__m128d a, __m128d b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) ROUNDSD xmm, xmm/m128, imm8(11)

`mm_round_ss`

Round the lower single-precision (32-bit) floating-point element in "b" using the "rounding" parameter, store the result as a single-precision floating-point element in the lower element of "dst", and copy the upper 3 packed elements from "a" to the upper elements of "dst".

__m128 _mm_round_ss (__m128 a, __m128 b, _MM_FROUND_CUR_DIRECTION) ROUNDSS xmm, xmm/m128, imm8(4)

`mm_round_ss1`

Round the lower single-precision (32-bit) floating-point element in "a" using the "rounding" parameter, store the result as a single-precision floating-point element in the lower element of "dst", and copy the upper 3 packed elements from "a" to the upper elements of "dst".

__m128 _mm_round_ss (__m128 a, _MM_FROUND_CUR_DIRECTION) ROUNDSS xmm, xmm/m128, imm8(4)

`mm_round_ss1_to_nearest_integer_scalar`

Round the lower single-precision (32-bit) floating-point element in "a" using the "rounding" parameter, store the result as a single-precision floating-point element in the lower element of "dst", and copy the upper 3 packed elements from "a" to the upper elements of "dst".

__m128 _mm_round_ss (__m128 a, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) ROUNDSS xmm, xmm/m128, imm8(8)

`mm_round_ss1_to_negative_infinity_scalar`

__m128 _mm_round_ss (__m128 a, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC) ROUNDSS xmm, xmm/m128, imm8(9)

`mm_round_ss1_to_positive_infinity_scalar`

__m128 _mm_round_ss (__m128 a, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC) ROUNDSS xmm, xmm/m128, imm8(10)

`mm_round_ss1_to_zero_scalar`

__m128 _mm_round_ss (__m128 a, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) ROUNDSS xmm, xmm/m128, imm8(11)

`mm_round_ss_to_nearest_integer_scalar`

__m128 _mm_round_ss (__m128 a, __m128 b, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) ROUNDSS xmm, xmm/m128, imm8(8)

`mm_round_ss_to_negative_infinity_scalar`

__m128 _mm_round_ss (__m128 a, __m128 b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC) ROUNDSS xmm, xmm/m128, imm8(9)

`mm_round_ss_to_positive_infinity_scalar`

__m128 _mm_round_ss (__m128 a, __m128 b, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC) ROUNDSS xmm, xmm/m128, imm8(10)

`mm_round_ss_to_zero_scalar`

__m128 _mm_round_ss (__m128 a, __m128 b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC) ROUNDSS xmm, xmm/m128, imm8(11)

`mm_stream_load_si128`

Load 128 bits of integer data from memory into "dst" using a non-temporal memory hint. "mem_addr" must be aligned on a 16-byte boundary or a general-protection exception may be generated.

__m128i _mm_stream_load_si128 (const __m128i* mem_addr) MOVNTDQA xmm, m128

`mm_testc_si128`

Compute the bitwise AND of 128 bits (representing integer data) in "a" and "b", and set "ZF" to 1 if the result is zero, otherwise set "ZF" to 0. Compute the bitwise NOT of "a" and then AND with "b", and set "CF" to 1 if the result is zero, otherwise set "CF" to 0. Return the "CF" value.

int _mm_testc_si128 (__m128i a, __m128i b) PTEST xmm, xmm/m128

`mm_testnzc_si128`

Compute the bitwise AND of 128 bits (representing integer data) in "a" and "b", and set "ZF" to 1 if the result is zero, otherwise set "ZF" to 0. Compute the bitwise NOT of "a" and then AND with "b", and set "CF" to 1 if the result is zero, otherwise set "CF" to 0. Return 1 if both the "ZF" and "CF" values are zero, otherwise return 0.

int _mm_testnzc_si128 (__m128i a, __m128i b) PTEST xmm, xmm/m128

`mm_testz_si128`

Compute the bitwise AND of 128 bits (representing integer data) in "a" and "b", and set "ZF" to 1 if the result is zero, otherwise set "ZF" to 0. Compute the bitwise NOT of "a" and then AND with "b", and set "CF" to 1 if the result is zero, otherwise set "CF" to 0. Return the "ZF" value.

int _mm_testz_si128 (__m128i a, __m128i b) PTEST xmm, xmm/m128