The A5/1 algorithm has these implementations:
The implementation is specified as an argument to the device option. This makes it possible to use different implementations concurrently. A practical application of this feature is during lookup, where the very last steps in the chain computation are handled by the sharedmem implementation, while the bulk of the chains is produced with the bitslice code.
NVIDIA CUDA
| implementation | concurrent chains per SMP | A5/1 rounds per second per chain | total rounds per second per SMP |
| bitslice | 2048 | 6800 | 14M |
| bitslice2 | 4096 | 4500 | 18M |
| bitslice32 | 1024 | 12000 | 12M |
| sharedmem | 256 | 20000 | 5M |
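The last column of the table appears to be the product of the two columns before it, which only works out if the rounds-per-second figure is read as a per-chain rate. The following is a minimal sketch of that arithmetic under this assumption; the function name `total_rounds_per_second` is made up for illustration and is not part of the tool.

```python
# Figures copied from the table above:
# (concurrent chains per SMP, A5/1 rounds per second, assumed to be per chain)
TABLE = {
    "bitslice": (2048, 6800),
    "bitslice2": (4096, 4500),
    "bitslice32": (1024, 12000),
    "sharedmem": (256, 20000),
}


def total_rounds_per_second(chains: int, rounds_per_chain: int) -> int:
    """Aggregate A5/1 rounds per second on one SMP (hypothetical helper)."""
    return chains * rounds_per_chain


for name, (chains, rate) in TABLE.items():
    total = total_rounds_per_second(chains, rate)
    print(f"{name:<10} {total / 1e6:5.1f}M rounds/s per SMP")
```

Read this way, sharedmem advances each individual chain the fastest while bitslice2 gets the most total work out of an SMP, which matches the split between the two described for lookups.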
bitslice2 is only available on GT200-class GPUs and is the same as bitslice except that twice as many threads are started, which is possible because the GT200 GPUs have twice as many registers per SMP. The bitslice32 variant uses 32-bit wide slices, which is the native hardware width of the shared memory, and runs faster because it does not have to do 64-bit/32-bit conversions.
Currently not maintained, or for testing purposes only:
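To make the idea of "slices" concrete, here is a small CPU-side sketch of a bitsliced layout in Python. It assumes the usual meaning of bitslicing: word i holds bit i of the state of every chain, so one word-wide operation acts on all chains at once. The 64-bit state size follows from the A5/1 register lengths (19 + 22 + 23); the function names and the layout details are illustrative assumptions, not taken from the CUDA kernels.

```python
WIDTH = 32        # slice width in bits, one bit per chain (as in bitslice32)
STATE_BITS = 64   # A5/1 registers hold 19 + 22 + 23 = 64 bits in total


def to_slices(states):
    """Transpose per-chain states: bit c of slices[i] is bit i of states[c]."""
    assert len(states) <= WIDTH
    slices = [0] * STATE_BITS
    for c, state in enumerate(states):
        for i in range(STATE_BITS):
            slices[i] |= ((state >> i) & 1) << c
    return slices


def from_slices(slices, n_chains):
    """Inverse of to_slices: rebuild one integer state per chain."""
    states = [0] * n_chains
    for i, word in enumerate(slices):
        for c in range(n_chains):
            states[c] |= ((word >> c) & 1) << i
    return states


def xor_bits_all_chains(slices, a, b):
    """One XOR yields (bit a) ^ (bit b) for every chain simultaneously."""
    return slices[a] ^ slices[b]


states = [0x0123456789ABCDEF, 0xFFFFFFFFFFFFFFFF, 0, 0x8000000000000001]
slices = to_slices(states)
assert from_slices(slices, len(states)) == states
```

With this layout the feedback XORs of the shift registers cost one instruction per slice regardless of how many chains are packed into a word, which is where the throughput of the bitslice variants would come from.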
- simple
- mixedmem
- interleaved
bitslice
Uses a vertical arrangement of the data. This implementation achieves the highest throughput. The following options can be given to the device option, for example:
--device cuda:blocks=4:implementation=bitslice
- blocks=integer
- The number of blocks to use. This should be the number of streaming multiprocessors (SMPs, 8 cores each) for devices before compute capability 1.0 and twice that number for more sophisticated hardware. The best value is chosen automatically.
- threads=integer
- Must be 128.
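The device option shown above is a colon-separated string: a backend name followed by `key=value` fields. The following Python sketch only illustrates that syntax. `parse_device_option` is a hypothetical helper written for this page, not the parser used by the tool, and the rule that numeric values become integers is an assumption.

```python
def parse_device_option(value: str):
    """Split e.g. 'cuda:blocks=4:implementation=bitslice' into (backend, options).

    Hypothetical helper for illustration only.
    """
    backend, *fields = value.split(":")
    options = {}
    for field in fields:
        key, sep, val = field.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed option: {field!r}")
        # Assumption: purely numeric values are meant as integers.
        options[key] = int(val) if val.isdigit() else val
    return backend, options


backend, options = parse_device_option("cuda:blocks=4:implementation=bitslice")
print(backend, options)
```

Options that are left out, such as `blocks`, would fall back to their automatically chosen defaults.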
simple (for testing only)
Straightforward, one bit at a time, and very slow. Do not use it.
sharedmem
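Uses the shared memory of the GPU. This is the fastest implementation in terms of operations per second per chain (20000 rounds per second in the table above), although its total throughput per SMP is the lowest. These options can be given to the device option: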
- blocks=integer
- Should be the number of SMPs.
- threads=integer
- Defaults to 256. By changing this, one can trade throughput for latency. Values should be multiples of 32.
mixedmem (deprecated)
Uses both the shared memory and the global memory. This one is the fastest for a single block, but it becomes limited by memory bus bandwidth for any significant number of blocks.
interleaved (deprecated)
Mixes sharedmem and mixedmem, so that most cores use shared memory only and some use global memory as well. This gives the best improvement on cards with few cores, but even then there is not much room for improvement if only 25% of the cores can be sped up by 25%. Currently not working.
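The remark about limited room for improvement can be checked with a short calculation. The sketch below assumes that all cores contribute equally and independently to total throughput, which is a simplification; `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Throughput relative to baseline when `fraction` of the cores
    run `local_speedup` times faster and the rest are unchanged."""
    return (1.0 - fraction) + fraction * local_speedup


gain = overall_speedup(0.25, 1.25) - 1.0
print(f"{gain:.2%}")  # 6.25%
```

Under these assumptions, speeding up a quarter of the cores by a quarter raises total throughput by only about 6%.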
