GPU-Enabled Polyphase Filterbanks

FOSDEM 2017

Frequency Division Multiple Access (FDMA) schemes are widely used in existing communication systems and standards. On Software Defined Radio (SDR) platforms, however, separating the channels can be difficult because of the high demands placed on the digital filters. This talk will showcase an implementation of a polyphase filterbank on a graphics processing unit (GPU) that can help overcome the heavy computational load of those filters. In the software, all the partitioned filters can run in parallel. Each of these filters also computes many output samples simultaneously, which provides a second level of parallelism. Furthermore, several rational oversampling factors are supported by this implementation. Because of the large number of hardware threads available on a GPU, the operations for oversampling can likewise run in parallel, which reduces the effect of oversampling on the throughput. On an Nvidia GTX970 GPU, this implementation achieved a throughput of 67.43 MSamples per second, 12 times higher than the (optimized) general purpose processor (GPP) version.

Separating information in the frequency domain is rather simple if one can do it with appropriate hardware. Cycle-accurate operations executed in FPGA and ASIC fabric allow one to carefully create the desired waveform and then control filterbanks and oscillators. This way, the desired information can get into the air at the exact time and frequency one fancies. Unfortunately, controlling a software defined radio (SDR) in that manner can prove to be quite a challenge. Timing constraints when changing the center frequency often turn out to be the main limiting factor [1]. Instead, a pure SDR approach is often the preferred way of generating multifrequency content. To cope with hard latency/timing constraints, the solution is to generate the whole spectrum at once and to position the desired information digitally in the time/frequency matrix.

But this poses quite a challenge, as one usually has to oversample the signal heavily, depending on the number of channels, to generate the aggregated bandwidth. The anti-imaging filter needed on the transmitter side and the separation/anti-aliasing filters on the receiver side can grow to a very large number of filter taps. Coping with this computational load can be demanding, even for high-end general purpose processors (GPP). Using a polyphase filterbank (PFB) for the synthesis/separation of the waveform can reduce the computational load considerably. PFBs do this by breaking the required filter down into several polyphase partitions and filtering on these partitions. Dividing a filter into M polyphase partitions can already reduce the theoretical computational load by a factor of M [2]. Additionally, the Fast Fourier Transform (FFT) can be used to extract or generate all required channels at once, using just one filtering operation. Still, the challenge of separating the channels can prove to be too much for the GPP, even with the help of a PFB channelizer/synthesizer.
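The equivalence behind that saving can be illustrated with a minimal sketch. It is not the implementation presented in the talk: it assumes a critically sampled M-channel receiver (no oversampling), uses plain Python instead of CUDA, and the function names `direct_channelizer` and `polyphase_channelizer` are hypothetical. The first function mixes, filters, and decimates every channel separately; the second filters once through M polyphase partitions and then combines the partition outputs with a DFT (written naively here, where an FFT would normally be used).

```python
import cmath


def direct_channelizer(x, h, M):
    """Reference: mix each channel to baseband, filter with h, keep every M-th sample."""
    n_out = len(x) // M
    out = [[0j] * n_out for _ in range(M)]
    for k in range(M):                      # one full-length filter per channel
        for m in range(n_out):
            acc = 0j
            for n, tap in enumerate(h):
                idx = m * M - n             # input sample seen by tap n
                if idx >= 0:
                    acc += tap * x[idx] * cmath.exp(-2j * cmath.pi * k * idx / M)
            out[k][m] = acc
    return out


def polyphase_channelizer(x, h, M):
    """Same outputs: M short partition filters followed by one DFT per output step."""
    n_out = len(x) // M
    taps_per_branch = -(-len(h) // M)       # ceiling division
    # Zero-pad the prototype so it splits evenly into M partitions.
    h = list(h) + [0.0] * (taps_per_branch * M - len(h))
    out = [[0j] * n_out for _ in range(M)]
    for m in range(n_out):
        v = []
        for r in range(M):                  # partition r uses taps h[r], h[M + r], ...
            acc = 0j
            for p in range(taps_per_branch):
                idx = (m - p) * M - r
                if idx >= 0:
                    acc += h[p * M + r] * x[idx]
            v.append(acc)
        for k in range(M):                  # naive DFT across partitions (FFT in practice)
            out[k][m] = sum(v[r] * cmath.exp(2j * cmath.pi * k * r / M)
                            for r in range(M))
    return out
```

In this sketch the direct version spends M times the filter length in multiplications per output time step, while the polyphase version spends one filter length plus the transform, which is where the factor of M comes from.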

Using a graphics processing unit (GPU) can help immensely by offloading the critical task of separating the dedicated channels, leaving some headroom for the GPP to perform the remaining task of decoding the information imprinted on the individual channels. Filtering itself is an operation that maps well onto many-core architectures, especially if the constraints on latency and buffer sizes are not too stringent. But the operations inside a PFB really seem to be made for massive parallelization. Performing several independent polyphase partitioned filtering operations at once maps easily onto common GPU architecture abstractions, as every polyphase filter partition can run simultaneously.

Most GPUs offer even more computing capability, and part of their resources would lie idle if one parallelized the PFB algorithm only across polyphase partitions. However, the computations of consecutive output samples are independent of each other, so one can parallelize the algorithm further by computing several output samples concurrently. If the signal needs to be oversampled, that too can be exploited: the additional output samples are produced in parallel, which minimizes the extra computational load caused by oversampling.
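The independence of the individual work items can be sketched with a small serial model. This is an illustration in plain Python, not the CUDA code from the talk, and the names `branch_output`, `run_grid`, and `stride` are hypothetical. Each flat index `tid` stands in for one GPU thread (in CUDA it would typically be derived from `blockIdx.x * blockDim.x + threadIdx.x`) and is mapped to one (partition, output sample) pair. The `stride` parameter is an assumption used to model oversampling: a stride smaller than M means more output samples per input block. Only the partition filtering is modelled; the transform and any phase correction that follow it are left out.

```python
def branch_output(x, h, M, r, m, stride):
    """Work of one hypothetical GPU thread: output sample m of partition r."""
    acc = 0.0
    p = 0
    while p * M + r < len(h):               # taps h[r], h[M + r], h[2M + r], ...
        idx = m * stride - p * M - r        # stride == M: critically sampled
        if idx >= 0:
            acc += h[p * M + r] * x[idx]
        p += 1
    return acc


def run_grid(x, h, M, n_out, order, stride):
    """Execute all work items in the given order; each writes only its own cell."""
    v = [[0.0] * n_out for _ in range(M)]
    for tid in order:
        r, m = tid % M, tid // M            # flat thread id -> (partition, output index)
        v[r][m] = branch_output(x, h, M, r, m, stride)
    return v
```

Because no work item reads a value written by another one, running the ids in any order gives the same result, which is the property that lets a GPU schedule them across threads freely.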

This talk will showcase how all operations of the PFB channelizer/synthesizer algorithm can be mapped to a GPU using the CUDA framework. Code examples will be shown to highlight some of the oddities one can encounter when developing code for GPUs or many-core architectures in general. Benchmarks and results of the current implementation will be presented and discussed. As of now, on an Nvidia GTX970, the implementation reaches a throughput of 67.43 MSamples per second (45 channels, 5x oversampling, 1318-tap prototype filter), which is 12 times higher than an optimized GPP (Intel i7-6850K) version. Last but not least, future steps for releasing and maintaining the code will be laid out.

References:
[1] M. Ibrahim and I. Galal, “Improved SDR frequency tuning algorithm for frequency hopping systems,” ETRI Journal, vol. 3, no. 3, Jun 2016. [Online]. Available: http://dx.doi.org/10.4218/etrij.16.0115.0565
[2] F. J. Harris, Multirate Signal Processing for Communication Systems. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2004.

Speakers: Jan Kraemer