以cufftPlanMany为例FFT变换中embed,stride,dist的解释与设置

关于FFT的自定义数据分布进行变换，之前每次都是用的写demo，这次搞明白之后记录一下，以便以后查阅。

比如需要对一个二维数组里的每一行或者每一列进行傅里叶变换，那么需要对cufftPlanMany进行设置，然后进行批量处理。

cufftPlanMany的函数声明如下

cufftResult cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed,
int istride, int idist, int *onembed, int ostride,int odist, cufftType type, int batch);

比如对于一个10*5的二维数组，想对每一行进行FFT：

NX=10；NY=5；

rank=1，表明是进行一维傅里叶变换；

n[1]，数组n维度与rank相同，表明每一维的变换数目，本例中n只有一维，n[0]=NX，表示一维傅里叶变换需要10个元素；

inembed[2]，表明原始输入数据的维度，本例中输入数据是一个二维数组，所以inembed维度为2，inembed[0]=NX, inembed[1]=NY;

istride，表明在一个傅里叶变换内部每一个元素相距的距离，本例中 istride=1；

idist，表明在两个傅里叶变换之间第一个元素相距的距离，本例中 idist=NX;

batch 表明多少个独立的傅里叶变换，本例中 batch=NY;

本例中onembed,ostride,odist做相同设置。

本例中我们进行C2C的傅里叶变换，所以输入和输出格式一样。实现代码如下:

#include <stdlib.h>
#include <stdio.h>

#include <string.h>
#include <math.h>
#include "timer.h"

#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#define Ndim 2
#define NX 10
#define NY 5


void testplanmany() {

	int N[2];
	N[0] = NX, N[1] = NY;
	int NXY = N[0] * N[1];
	cufftComplex *input = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	cufftComplex *output = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	int i;
	for (i = 0; i < NXY; i++) {
		input[i].x = i % 1000;
		input[i].y = 0;
	}
	cufftComplex *d_inputData, *d_outData;
	cudaMalloc((void**) &d_inputData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMalloc((void**) &d_outData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMemcpy(d_inputData, input, N[0] * N[1] * sizeof(cufftComplex), cudaMemcpyHostToDevice);
	cufftHandle plan;
	/*
	cufftMakePlanMany(cufftHandle plan, int rank, int *n, int *inembed,
	int istride, int idist, int *onembed, int ostride,
	int odist, cufftType type, int batch, size_t *workSize);
	 */
	int rank=1;
	int n[1];
	n[0]=NX;
	int istride=1;
	int idist = NX;
	int ostride=1;
	int odist = NX;
	int inembed[2];
	int onembed[2];
	inembed[0]=NX;  onembed[0]=NX;
	inembed[1] = NY; onembed[0] = NY;

	cufftPlanMany(&plan,rank,n,inembed, istride ,idist , onembed, ostride,odist, CUFFT_C2C, NY);
	cufftExecC2C(plan, d_inputData, d_outData, CUFFT_FORWARD);
	cudaMemcpy(output, d_outData, NXY * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

	for (i = 0; i < NXY; i++) {
		if(i%NX==0)
			printf("\n");
		printf("%f %f \n", output[i].x, output[i].y);
	}

	cufftDestroy(plan);
	free(input);
	free(output);
	cudaFree(d_inputData);
	cudaFree(d_outData);
}

int main() {

	testplanmany();
}

实际输出如下：

45.000000 0.000000 
-5.000000 15.388418 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632713 
-5.000000 -6.881910 
-5.000000 -15.388418 

145.000000 0.000000 
-5.000000 15.388418 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632713 
-5.000000 -6.881910 
-5.000000 -15.388418 

245.000000 0.000000 
-4.999997 15.388416 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624597 
-5.000000 0.000000 
-5.000000 -1.624597 
-5.000000 -3.632713 
-5.000000 -6.881910 
-4.999997 -15.388416 

345.000000 0.000000 
-4.999998 15.388418 
-4.999999 6.881909 
-4.999999 3.632712 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-4.999999 -3.632712 
-4.999999 -6.881909 
-4.999998 -15.388418 

445.000000 0.000000 
-4.999999 15.388418 
-5.000001 6.881911 
-5.000000 3.632714 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632714 
-5.000001 -6.881911 
-4.999999 -15.388418

对于上述数据每一列进行一次FFT来说：

NX=10；NY=5；

rank=1，表明是进行一维傅里叶变换；

n[1]，数组n维度与rank相同，表明每一维的变换数目，本例中n只有一维，n[0]=NY，表示一维傅里叶变换需要5个元素；

inembed[2]，表明原始输入数据的维度，本例中输入数据是一个二维数组，所以inembed维度为2，inembed[0]=NX, inembed[1]=NY;

istride，表明在一个傅里叶变换内部每一个元素相距的距离，本例中 istride= NX；

idist，表明在两个傅里叶变换之间第一个元素相距的距离，本例中 idist=1; （因为两列傅里叶变换第一个元素相邻）

batch 表明多少个独立的傅里叶变换，本例中 batch=NX;

代码如下：

#include <stdlib.h>
#include <stdio.h>

#include <string.h>
#include <math.h>
#include "timer.h"

#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#define Ndim 2
#define NX 10
#define NY 5


void testplanmany() {

	int N[2];
	N[0] = NX, N[1] = NY;
	int NXY = N[0] * N[1];
	cufftComplex *input = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	cufftComplex *output = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	int i;
	for (i = 0; i < NXY; i++) {
		input[i].x = i % 1000;
		input[i].y = 0;
	}
	cufftComplex *d_inputData, *d_outData;
	cudaMalloc((void**) &d_inputData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMalloc((void**) &d_outData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMemcpy(d_inputData, input, N[0] * N[1] * sizeof(cufftComplex), cudaMemcpyHostToDevice);
	cufftHandle plan;
	/*
	cufftMakePlanMany(cufftHandle plan, int rank, int *n, int *inembed,
	int istride, int idist, int *onembed, int ostride,
	int odist, cufftType type, int batch, size_t *workSize);
	 */
	int rank=1;
	int n[1];
	n[0]=NY;
	int istride=NX;
	int idist = 1;
	int ostride=NX;
	int odist = 1;
	int inembed[2];
	int onembed[2];
	inembed[0]=NX;  onembed[0]=NX;
	inembed[1] = NY; onembed[0] = NY;

	cufftPlanMany(&plan,rank,n,inembed, istride ,idist , onembed, ostride,odist, CUFFT_C2C, NX);
	cufftExecC2C(plan, d_inputData, d_outData, CUFFT_FORWARD);
	cudaMemcpy(output, d_outData, NXY * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

	for (i = 0; i < NXY; i++) {
		if(i%NX==0)
			printf("\n");
		printf("%f %f \n", output[i].x, output[i].y);
	}

	cufftDestroy(plan);
	free(input);
	free(output);
	cudaFree(d_inputData);
	cudaFree(d_outData);
}

int main() {

	testplanmany();
}

结果如下：


100.000000 0.000000 
105.000000 0.000000 
110.000000 0.000000 
115.000000 0.000000 
120.000000 0.000000 
125.000000 0.000000 
130.000000 0.000000 
135.000000 0.000000 
140.000000 0.000000 
145.000000 0.000000 

-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 

-25.000002 8.122991 
-25.000002 8.122991 
-24.999998 8.122991 
-24.999998 8.122991 
-24.999998 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 

-25.000002 -8.122991 
-25.000002 -8.122991 
-24.999998 -8.122991 
-24.999998 -8.122991 
-24.999998 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 

-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550

另外文章参考了这些资源：

https://docs.nvidia.com/cuda/cufft/index.html#data-layout

https://rocfft.readthedocs.io/en/latest/real.html#setting-strides

更新：感谢评论中meng_zhi_xiang的提醒，关于inembed和onembed，针对我这种特殊情况（按行按列进行FFT）可以直接写成数据大小。但是实际上如果数据视图定义不仅仅如此简单，inembed和onembed还需要重新设置。这段话的分析与解释可以参考meng_zhi_xiang的第二段评论。

The inembed and onembed parameters define the number of elements in each dimension in the input array and the output array respectively. The inembed[rank-1] contains the number of elements in the least significant (innermost) dimension of the input data excluding the istride elements; the number of total elements in the least significant dimension of the input array is then istride*inembed[rank-1]. The inembed[0] or onembed[0] corresponds to the most significant (that is, the outermost) dimension and is effectively ignored since the idist or odist parameter provides this information instead. Note that the size of each dimension of the transform should be less than or equal to the inembed and onembed values for the corresponding dimension, that is n[i] ≤ inembed[i], n[i] ≤ onembed[i], where i ∈ { 0, … , rank − 1}.

另外官方提供了一个更复杂的stride与inembed应用的例子：

https://docs.nvidia.com/cuda/cufft/index.html#twod-advanced-data-layout-use

原文链接：https://blog.csdn.net/wzh1026/article/details/100641403