A5/1

The A5/1 algorithm has several implementations, which are described in the sections below.

The implementation is specified as an argument to the device option. This makes it possible to use different implementations concurrently. A practical application of this feature is during lookup, where the very last steps in the chain computation are handled by the sharedmem implementation, while the bulk of the chains are produced with the bitslice code.

NVIDIA CUDA

implementation  concurrent chains per SMP  number of A5/1 rounds per second  total number of rounds per second per SMP
bitslice        2048                       6800                              14M
bitslice2       4096                       4500                              18M
bitslice32      1024                       12000                             12M
sharedmem       256                        20000                             5M
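
The figures in the last column are consistent with multiplying the number of concurrent chains by the rounds-per-second figure. That reading is an interpretation of the table, not something the page states explicitly. A short Python check, with the helper name chosen only for illustration:

```python
# Rows copied from the table above:
# (implementation, concurrent chains per SMP, A5/1 rounds per second).
rows = [
    ("bitslice", 2048, 6800),
    ("bitslice2", 4096, 4500),
    ("bitslice32", 1024, 12000),
    ("sharedmem", 256, 20000),
]


def total_rounds_per_second(chains, rounds_per_second):
    """Aggregate rate per SMP, assuming every chain advances at the listed rate."""
    return chains * rounds_per_second


totals = {name: total_rounds_per_second(c, r) for name, c, r in rows}

for name, total in totals.items():
    # Print in millions so the result can be compared with the last column.
    print(f"{name:<11}{total / 1e6:5.1f}M")
```

Rounded to whole millions, the products match the 14M, 18M, 12M and 5M entries of the table.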

bitslice2 is only available on GT200 class GPUs and is the same as bitslice, except that twice as many threads are started. This is possible because the GT200 GPUs have twice as many registers per SMP. The bitslice32 variant uses 32-bit wide slices, which is the native hardware width of the shared memory, and runs faster because it does not have to do 64-bit/32-bit conversions.

Currently not maintained, or for testing purposes only:

  • simple
  • mixedmem
  • interleaved

bitslice

Uses a vertical arrangement of the data. This implementation achieves the highest throughput. Its options are passed as part of the device option, for example:

--device cuda:blocks=4:implementation=bitslice

blocks=integer
The number of blocks to use. Should be the number of Streaming Multiprocessors (SMPs, 8 cores each) for devices before Compute capability 1.0 and twice that number for more sophisticated hardware. The best value is chosen automatically.
threads=integer
Must be 128.
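
The project's CUDA kernels are not reproduced on this page. The following Python sketch only illustrates what a vertical (bitsliced) arrangement means. It assumes the publicly documented A5/1 register lengths, feedback taps and clocking bits, and all function names are hypothetical. Each register bit position is stored as one integer ("slice") in which bit j belongs to chain j, so a single AND/XOR advances every chain at once. On a GPU the slices would be 32-bit or 64-bit words rather than Python integers.

```python
# (length in bits, feedback tap positions, clocking bit position) for R1, R2, R3,
# as given in the public A5/1 descriptions.
REGISTERS = (
    (19, (13, 16, 17, 18), 8),
    (22, (20, 21), 10),
    (23, (7, 20, 21, 22), 10),
)


def to_slices(values, length):
    """Transpose per-chain register values into one integer per bit position.

    Bit j of slices[i] is bit i of values[j].
    """
    slices = [0] * length
    for lane, value in enumerate(values):
        for position in range(length):
            slices[position] |= ((value >> position) & 1) << lane
    return slices


def from_slices(slices, lanes):
    """Inverse of to_slices: recover one register value per chain."""
    values = [0] * lanes
    for position, plane in enumerate(slices):
        for lane in range(lanes):
            values[lane] |= ((plane >> lane) & 1) << position
    return values


def clock_sliced(regs):
    """One majority-controlled A5/1 step for all chains at once.

    regs is a list of three slice lists (R1, R2, R3); a new list is returned.
    """
    c1, c2, c3 = (regs[k][REGISTERS[k][2]] for k in range(3))
    # Bitwise majority of the three clocking bits, computed per chain.
    majority = (c1 & c2) | (c1 & c3) | (c2 & c3)
    result = []
    for slices, cbits, (_, taps, _) in zip(regs, (c1, c2, c3), REGISTERS):
        stay = cbits ^ majority  # chains whose register must NOT move
        feedback = 0
        for tap in taps:
            feedback ^= slices[tap]
        # Shifting a register is just renaming slices: no per-bit work at all.
        shifted = [feedback] + slices[:-1]
        result.append(
            [(new & ~stay) | (old & stay) for new, old in zip(shifted, slices)]
        )
    return result
```

The irregular clocking is the only part that needs a per-chain select. The shift itself costs nothing beyond moving whole slices, which is the main reason this layout suits wide parallel hardware.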

simple (for testing only)

Straightforward, one bit at a time, and very slow. Do not use it.
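
For orientation, a one-bit-at-a-time version of the A5/1 clocking can be sketched in Python as follows. This is not the project's code: it follows the publicly documented register lengths, taps and clocking bits, and the function names are made up for the example.

```python
# (length in bits, feedback tap positions, clocking bit position) for R1, R2, R3,
# as given in the public A5/1 descriptions.
REGISTERS = (
    (19, (13, 16, 17, 18), 8),
    (22, (20, 21), 10),
    (23, (7, 20, 21, 22), 10),
)


def bit(value, position):
    """Return the bit of value at the given position (0 = least significant)."""
    return (value >> position) & 1


def clock_once(state):
    """Advance (r1, r2, r3) by one majority-controlled step and return the new tuple."""
    clock_bits = [bit(reg, cfg[2]) for reg, cfg in zip(state, REGISTERS)]
    majority = 1 if sum(clock_bits) >= 2 else 0
    new_state = []
    for reg, cbit, (length, taps, _) in zip(state, clock_bits, REGISTERS):
        if cbit == majority:  # only registers agreeing with the majority move
            feedback = 0
            for tap in taps:
                feedback ^= bit(reg, tap)
            reg = ((reg << 1) & ((1 << length) - 1)) | feedback
        new_state.append(reg)
    return tuple(new_state)


def output_bit(state):
    """XOR of the most significant bit of each register."""
    return bit(state[0], 18) ^ bit(state[1], 21) ^ bit(state[2], 22)


def keystream(state, count):
    """Clock count times, collecting one output bit per step."""
    bits = []
    for _ in range(count):
        state = clock_once(state)
        bits.append(output_bit(state))
    return bits, state
```

Every step handles a single chain and touches individual bits, which is why an approach of this kind is only useful as a reference for testing faster variants.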

sharedmem

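Uses the shared memory of the GPU. This is the fastest implementation in terms of rounds per second for a single chain (20000 in the table above), although its total throughput per SMP is the lowest. These options can be given to the device option: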

blocks=integer
Should be the number of SMPs.
threads=integer
Defaults to 256. By changing this, one can trade throughput for latency. Values should be multiples of 32.
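
The device option strings on this page follow a colon-separated `name:key=value` pattern. The Python sketch below shows how such a string could be split and how the thread-count rules stated above could be checked. The grammar is inferred from the single example on this page, the function names are hypothetical, the default of 128 threads for bitslice is an assumption, and treating a non-multiple of 32 as an error is a choice made for this example.

```python
def parse_device(spec):
    """Split e.g. 'cuda:blocks=4:implementation=bitslice' into (backend, options)."""
    backend, *pairs = spec.split(":")
    options = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        options[key] = value
    return backend, options


def check_threads(options):
    """Return the thread count implied by the options, or None if unspecified.

    Rules taken from this page: bitslice requires 128 threads; sharedmem
    defaults to 256 and expects a multiple of 32.
    """
    implementation = options.get("implementation")
    if implementation == "bitslice":
        threads = int(options.get("threads", 128))  # default is an assumption
        if threads != 128:
            raise ValueError("bitslice: threads must be 128")
        return threads
    if implementation == "sharedmem":
        threads = int(options.get("threads", 256))
        if threads <= 0 or threads % 32 != 0:
            raise ValueError("sharedmem: threads should be a multiple of 32")
        return threads
    # No rules are documented for the other implementations.
    return int(options["threads"]) if "threads" in options else None


backend, options = parse_device("cuda:blocks=4:implementation=bitslice")
print(backend, options, check_threads(options))
```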

mixedmem (deprecated)

Uses both the shared memory and the global memory. This one is the fastest for a single block, but it runs short of memory bus bandwidth for any significant number of blocks.

interleaved (deprecated)

One can mix sharedmem and mixedmem, so that most cores use shared memory only and some use global memory too. This gives the best improvement on cards with few cores, but even then there is not much room for improvement if you can only speed up 25% of your cores by 25%. Currently not working.
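
The limited gain can be made concrete with a short calculation. The Python sketch below assumes that all cores contribute equally to the total throughput, which is a simplification; the function name is illustrative.

```python
def overall_speedup(fraction_of_cores, per_core_speedup):
    """Relative total throughput when a fraction of equal cores gets faster.

    fraction_of_cores: share of cores that are sped up (0.25 for 25%).
    per_core_speedup:  relative gain on those cores (0.25 for 25%).
    """
    unchanged = 1 - fraction_of_cores
    improved = fraction_of_cores * (1 + per_core_speedup)
    return unchanged + improved


gain = overall_speedup(0.25, 0.25) - 1
print(f"{gain:.2%}")
```

Under that assumption, speeding up 25% of the cores by 25% raises the total throughput by 6.25%.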