2 conferences
2 talks
Years active: 2015 to 2017
I like Software Defined Radio (SDR) and I like architecture-gnostic software development for real-time digital signal processing. Currently I work at the German Aerospace Center (DLR) in Oberpfaffenhofen. My research and work topics include the design and implementation of SDR (software) systems and the architecture-specific optimization of implementations (x86/GPU/FPGA). I am also a big fan of premature optimization and dangling pointers!
Frequency Division Multiple Access (FDMA) schemes are widely used in many existing communication systems and standards. On Software Defined Radio (SDR) platforms, however, separating the channels can prove difficult due to the high demands placed on the digital filters. This talk will showcase an implementation of a polyphase filterbank on a graphics processing unit (GPU) that can help overcome the heavy computational load of those filters. In software, all the partitioned filters can run in parallel. Each of these filters also produces output samples for numerous input samples simultaneously, providing a second axis of parallelism. Furthermore, the implementation supports several rational oversampling factors. Thanks to the massive number of usable hardware threads on a GPU, the oversampling operations can likewise run in parallel, reducing the impact of oversampling on throughput. On an Nvidia GTX970 GPU, this implementation achieved a throughput of 67.43 MSamples per second, 12 times higher than the (optimized) general purpose processor (GPP) version.
Separating information in frequency space is rather simple if one can do it with appropriate hardware. Cycle-accurate operations executed in FPGA and ASIC fabric allow one to carefully create the desired waveform and then control filterbanks and oscillators. This way, the desired information gets into the air at the exact time and frequency one fancies. Unfortunately, controlling a software defined radio (SDR) in that manner can prove quite a challenge. Timing constraints when changing the center frequency often turn out to be the main limiting factor [1]. Instead, a pure SDR approach is often the preferred way of generating multifrequency content. To cope with hard latency/timing constraints, the solution is to generate the whole spectrum at once and to position the desired information digitally in the time/frequency matrix.
But this poses quite a challenge, as one usually has to tremendously oversample the signal, depending on the number of channels, to generate the aggregated bandwidth. The anti-imaging filter needed on the transmitter side and the separation/anti-aliasing filters on the receiver side can grow to an obscene number of filter taps. Coping with this computational load can be demanding, even for high-end general purpose processors (GPP). Using a polyphase filterbank (PFB) for the synthesis/separation of the waveform can help immensely with reducing the computational load. PFBs do this by breaking the needed filter down into several polyphase partitions and filtering on these partitions. Dividing a filter into M polyphase partitions already reduces the theoretical computational load by exactly a factor of M [2]. Additionally, the Fast Fourier Transform (FFT) can be used to extract or generate all needed channels at once, using just one filtering operation. Still, the challenge of separating the channels can prove to be too much for the GPP, even with the help of a PFB channelizer/synthesizer.
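As an illustration of that decomposition (a minimal sketch, not the talk's actual code; the function name and the use of single-precision floats are assumptions), the prototype filter can be split so that branch p holds every M-th tap:

#include <cstddef>
#include <vector>

// Split a prototype filter h of length L into M polyphase partitions,
// where partition p holds the taps h[p], h[p + M], h[p + 2M], ...
// Each output sample of a branch then costs only about L/M
// multiply-accumulates instead of L -- the factor-M reduction.
std::vector<std::vector<float>> polyphase_partitions(
    const std::vector<float>& prototype, std::size_t M)
{
    std::vector<std::vector<float>> parts(M);
    for (std::size_t i = 0; i < prototype.size(); ++i)
        parts[i % M].push_back(prototype[i]);
    return parts;
}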
Using a graphics processing unit (GPU) can help immensely with offloading the critical task of separating the dedicated channels, leaving the GPP some headroom to perform the remaining task of decoding the information imprinted on the individual channels. Filtering itself is an operation that maps pretty well onto many-core architectures, especially if the constraints on latency and buffer sizes are not too stringent. But the operations inside a PFB really seem made for over-the-top parallelization. Performing several independent polyphase-partitioned filtering operations at once maps easily onto common GPU architecture abstractions, since all polyphase filter partitions can run simultaneously.
Most GPUs offer even more computing capability, and their resources would lie idle if one parallelized the PFB algorithm only across polyphase partitions. Additionally, the computations of consecutive output samples are independent of each other, so one can further parallelize the algorithm by computing several output samples concurrently. If the need for oversampling arises, that too can be exploited to produce the additional output samples in parallel, minimizing the additional computational stress caused by oversampling.
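A minimal CUDA sketch of this two-dimensional parallelization could look as follows (kernel name, data layouts, and parameters are illustrative assumptions, not the talk's actual code; the oversampling phase handling and the subsequent batched FFT, e.g. via cuFFT, are omitted):

// One thread per (branch, output sample) pair: blockIdx.y selects the
// polyphase partition, the x dimension enumerates output samples.
// Assumes taps are stored time-reversed and branch-major (M x T), and
// that 'in' contains T - 1 history samples, i.e. (N + T - 1) * M entries.
__global__ void pfb_branch_filter(const float2* in,   // commutated complex input
                                  const float*  taps, // M x T branch filter taps
                                  float2*       out,  // M x N branch outputs
                                  int M, int T, int N)
{
    int branch = blockIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (branch >= M || n >= N) return;

    // Each branch filters the input stream decimated by M; every
    // (branch, n) pair is independent and runs in parallel.
    float2 acc = make_float2(0.0f, 0.0f);
    for (int t = 0; t < T; ++t) {
        float2 x = in[(n + t) * M + branch];
        float  h = taps[branch * T + t];
        acc.x += h * x.x;
        acc.y += h * x.y;
    }
    out[branch * N + n] = acc;  // one column per output time feeds the FFT
}

A launch along the lines of pfb_branch_filter<<<dim3((N + 255) / 256, M), 256>>>(...) would then cover all partitions and output samples at once.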
This talk will showcase how all operations of the PFB channelizer/synthesizer algorithm can be mapped to a GPU using the CUDA framework. Code examples will highlight some of the oddities one can encounter when developing for GPUs, or many-core architectures in general. Benchmarks and results of the current implementation will be presented and discussed. As of now, on an Nvidia GTX970, the implementation reaches a throughput of 67.43 MSamples per second (45 channels, 5x oversampling, 1318-tap prototype filter), which is 12 times higher than an optimized GPP (Intel i7-6850K) version. Last but not least, future steps for releasing and maintaining the code will be laid out.
References:
[1] M. Ibrahim and I. Galal, "Improved SDR frequency tuning algorithm for frequency hopping systems," ETRI Journal, vol. 3, no. 3, Jun. 2016. [Online]. Available: http://dx.doi.org/10.4218/etrij.16.0115.0565
[2] F. J. Harris, Multirate Signal Processing for Communication Systems. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2004.
Forward Error Correction (FEC) is a vital part of every communication scheme. Convolutional error codes can protect the data well enough to drive a communication system close to the Shannon limit. But due to the complexity of the decoders, it is challenging to implement these algorithms in software for use in software defined radios (SDR). Available coprocessors, such as graphics processing units (GPU) and single instruction, multiple data (SIMD) architectures, can dramatically enhance the throughput of such software-based receivers. This talk presents strategies for implementing Viterbi and Maximum A Posteriori (MAP) decoders on these coprocessors, identifies potential tripping hazards, and analyzes the effects on the throughput of these algorithms.
Convolutional codes have been known for a long time: Viterbi established his algorithm for decoding convolutionally encoded data in 1967 [1], and SDR has been established since the late '90s and early '00s. Still, implementing convolutional decoders such as the Viterbi or MAP algorithm in software has always been a problem. Both algorithms rely on a hidden Markov model, as the encoder is simply a Mealy machine. If one surveys every possible state transition caused by a bit stream, a trellis structure emerges: for every encoded information bit, all possible transitions from one state to another have to be considered [1][2]. This leads to high complexity, and implementations of these algorithms suffer from a heavy computational burden.
The Viterbi algorithm relaxes these conditions by applying a dynamic programming approach to the trellis structure [3]. In this approach, only the strongest path into each state survives, which avoids the overhead of tracking all possible paths through the trellis. Still, the computational effort is very high.
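The core of that dynamic programming step is the add-compare-select (ACS) recursion. A minimal scalar sketch follows (function name, state-numbering convention, and the use of distance-based metrics are assumptions for illustration):

// One trellis step for a code with S = 2^(K-1) states. For each
// destination state, only the better of the two incoming paths
// survives (with distance-based metrics, "stronger" means lower).
// Convention assumed: next_state = ((prev << 1) | input_bit) & (S - 1).
void acs_step(const float* metric_in,   // S path metrics at time k
              float*       metric_out,  // S path metrics at time k + 1
              const float* branch,      // 2 * S branch metrics, per transition
              unsigned*    survivor,    // winning predecessor per state
              int S)
{
    for (int s = 0; s < S; ++s) {
        int p0 = s >> 1;              // predecessor with MSB = 0
        int p1 = p0 | (S >> 1);       // predecessor with MSB = 1
        float m0 = metric_in[p0] + branch[2 * s];
        float m1 = metric_in[p1] + branch[2 * s + 1];
        metric_out[s] = (m0 <= m1) ? m0 : m1;
        survivor[s]   = (m0 <= m1) ? (unsigned)p0 : (unsigned)p1;
    }
}

Tracing back through the survivor array then yields the decoded bit sequence.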
Until now, the technology used in SDRs has not been able to handle the computational burden of these algorithms. Implementations generally suffered from a throughput too low for state-of-the-art communication systems (e.g. 3GPP LTE or WLAN). Therefore, these systems still used fixed-function hardware, such as ASICs, to manage the high throughput they require.
With the increasing clock rates of General Purpose Processors (GPP) and the higher density of execution units inside the architectures, implementing these algorithms in software is becoming more feasible. In particular, architectural features such as SIMD units and multiple processor cores on one chip have gained importance for implementing digital signal processing algorithms in software [4].
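As an illustration of how such SIMD units help, here is a minimal SSE sketch (names and data layout are assumptions; survivor bookkeeping is omitted for brevity):

#include <immintrin.h>

// Vectorized ACS: update four path metrics per iteration. The compare
// and select collapse into a single element-wise minimum. Assumes the
// predecessor metrics have been shuffled into contiguous, 16-byte
// aligned arrays and that S is a multiple of 4.
void acs_step_sse(const float* m0, const float* m1,   // predecessor metrics
                  const float* b0, const float* b1,   // branch metrics
                  float* out, int S)
{
    for (int s = 0; s < S; s += 4) {
        __m128 a = _mm_add_ps(_mm_load_ps(m0 + s), _mm_load_ps(b0 + s));
        __m128 b = _mm_add_ps(_mm_load_ps(m1 + s), _mm_load_ps(b1 + s));
        _mm_store_ps(out + s, _mm_min_ps(a, b));
    }
}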
Another interesting field is the use of coprocessors found in common computers; in most cases this will be a GPU. GPU vendors provide libraries and software development kits (SDK) for using the GPU for general computation and signal processing [5]. These specialized processors and libraries can be used efficiently to accelerate algorithms that parallelize massively.
This talk will cover implementation constraints that arise when implementing FEC and DSP algorithms on SIMD processors and GPUs. It will highlight some of the tripping hazards newcomers have to avoid when trying to make efficient use of these coprocessors. Exemplary cases for both architectures are analyzed to show how the proper use of these architectures enhances a software defined communication system.
REFERENCES
[1] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, April 1967.
[2] L. Hanzo, T. H. Liew, and B. L. Yeap, Turbo Coding, Turbo Equalisation and Space-Time Coding for Transmission over Wireless Channels. Wiley, 2002.
[3] G. D. Forney, Jr., "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, March 1973.
[4] U. Santoni and T. Long, Signal Processing on Intel Architecture: Performance Analysis using Intel Performance Primitives, 2014. [Online]. Available: http://www.intel.com/content/dam/doc/whitepaper/signal-processing-on-intel-architecture.pdf
[5] NVIDIA, OpenCL Programming Guide for the CUDA Architecture, Version 2.3, 2009.