How to add an alpha channel very fast to an RGB image using SSE2 and C++

Q:

I am writing a YUV420p to RGBA color conversion algorithm in C++ using SSE2. At the moment I have YUV420p to RGB and RGB to RGBA. The results are as follows:

size of image: 1920 x 1200
time of RGBA to YUV conversion: 0.0029011
time of YUV to RGB conversion: 0.0044585
time of RGB to RGBA conversion (approach 1): 0.0064747
time of RGB to RGBA conversion (approach 2): 0.0066194
time of RGB to RGBA conversion (approach 3): 0.0069835

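As you can see, the RGB to RGBA conversion takes longer than YUV420p to RGB or RGBA to YUV420p. I had a lot of trouble interleaving the alpha channel inside the YUV420p to RGB computation, so I am trying a post-processing step (RGB to RGBA) instead. The code so far is as follows: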

Approach 1:

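#include <emmintrin.h> // SSE2 intrinsics
void convertRGB24itoRGBA32i ( int width, int height, const unsigned char *RGB, unsigned char *RGBA ) {
    // Was (width - 1) * (height - 1), which left the final pixels unconverted.
    const size_t numPixels = (size_t)width * (size_t)height;
    // Every 16-byte load and store reaches past the current pixel, so the vector loop
    // stops 5 pixels early and a scalar loop finishes without running off either buffer.
    const size_t vectorPixels = numPixels > 5 ? numPixels - 5 : 0;
    const __m128i alphaChannel = _mm_set1_epi32 ( (int)0xFF000000 ); // alpha = 255 (opaque)
    size_t i = 0;
    for ( ; i < vectorPixels; i++ ) 
    {
        // Only the low 4 bytes of each store are final; later writes overwrite the rest.
        __m128i sourcePixel = _mm_loadu_si128 ( (const __m128i*)&RGB[i * 3] );
        __m128i rgb32Pixel = _mm_or_si128 ( alphaChannel, sourcePixel );
        _mm_storeu_si128 ( (__m128i*)&RGBA[i * 4], rgb32Pixel );
    }
    for ( ; i < numPixels; i++ ) // scalar tail
    {
        RGBA[i * 4 + 0] = RGB[i * 3 + 0];
        RGBA[i * 4 + 1] = RGB[i * 3 + 1];
        RGBA[i * 4 + 2] = RGB[i * 3 + 2];
        RGBA[i * 4 + 3] = 0xFF;
    }
}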

Approach 2:

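#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

void convertRGB24itoRGBA32i ( int width, int height, const RT_UByte *RGB, RT_UByte *RGBA )
{
    // Was (width - 1) * (height - 1), which left the final pixels unconverted.
    const size_t numPixels = static_cast<size_t>(width) * static_cast<size_t>(height);
    // Stop the 16-byte loads and stores 5 pixels early; a scalar loop handles the rest.
    const size_t vectorPixels = numPixels > 5 ? numPixels - 5 : 0;

    // Shuffle control mask that swaps the first and third byte of the pixel (BGR to RGB order).
    // Only lanes 0-3 matter, because later writes overwrite bytes 4-15 of each store.
    const __m128i shuffleMask = _mm_setr_epi8 ( 2, 1, 0, 3, 5, 4, 3, 7, 8, 11, 10, 9, 13, 12, 15, 14 );
    const __m128i alphaChannel = _mm_set1_epi32 ( static_cast<int>(0xFF000000u) );

    size_t i = 0;
    for ( ; i < vectorPixels; i++ ) {
        __m128i sourcePixel = _mm_loadu_si128 ( reinterpret_cast<const __m128i*>(&RGB[i * 3]) );
        __m128i rgbaPixel = _mm_shuffle_epi8 ( sourcePixel, shuffleMask );

        // Merge in the alpha byte
        rgbaPixel = _mm_or_si128 ( alphaChannel, rgbaPixel );

        _mm_storeu_si128 ( reinterpret_cast<__m128i*>(&RGBA[i * 4]), rgbaPixel );
    }
    for ( ; i < numPixels; i++ ) { // scalar tail, same byte order as the mask
        RGBA[i * 4 + 0] = RGB[i * 3 + 2];
        RGBA[i * 4 + 1] = RGB[i * 3 + 1];
        RGBA[i * 4 + 2] = RGB[i * 3 + 0];
        RGBA[i * 4 + 3] = 0xFF;
    }
}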

Approach 3:

inline void convertBGRi24toBGRAi32 ( const ubyte3 *bgri24, ubyte4* bgrai32, t_size size )
{
    for ( ; size != 0; --size, ++bgrai32, ++bgri24 )
    {
        bgrai32->x = bgri24->x;
        bgrai32->y = bgri24->y;
        bgrai32->z = bgri24->z;
        bgrai32->w = 0xff;
    }
}

Tags: C++, image processing, SIMD, SSE, SSE2

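Can you use SSSE3 for pshufb? That would probably be much faster; with SSE2 only, no good strategy comes to mind. But if your data starts out as YUV420p, it already takes some work to convert subsampled YUV to full-depth RGB; if RGBA is what you ultimately want, you may want to produce it directly. Storing and reloading is bad for cache locality unless you work in small chunks, although it could be useful for doing unaligned loads as part of the first shuffle step. (Planar would of course be easier: the alpha channel would just be a separate array.)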

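1 upvote, harold, 10/24/2023

At first I thought this code couldn't possibly be correct (it does no permutation to make room for the extra A byte), but the "trick" is that it works like scalar code, just using vector instructions.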

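1 upvote, Cris Luengo, 10/24/2023

Have you tried simply copying the bytes one at a time and letting the compiler optimize the loop as it sees fit? That way you don't need any explicit computation.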

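2 upvotes, harold, 10/24/2023

When you use pshufb you can of course process 4 pixels at once, that's the whole point, but your approach 2 still does the fake-scalar thing that approach 1 had to do to work around not having pshufb.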

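1 upvote, Peter Cordes, 10/25/2023

"Fast vectorized conversion from RGB to BGRA" has an answer using _mm_shuffle_epi8, working on blocks of 4x4 pixels = 3 input vectors and 4 output vectors. (Unaligned loads may be more efficient than blending, though. It also reverses the order to BGRA, but getting RGBA should just be a matter of adjusting the shuffle control vector.)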

A: No answers yet.
