Now I will analyze the data to show the advantage of this architecture.
| Metric | BitNet (A) | HW (B) | Change |
|---|---|---|---|
| Cycles | 6,032,511 | 2,816,374 | -53.3% |
| Seconds | 0.012065 | 0.005633 | 2.14x faster |
| Insts | 7,871,514 | 2,628,624 | -66.6% |
| CPI | 0.766 | 1.071 | +39.8% |
| D-Cache Hits | 2,358,213 | 521,151 | -77.9% |
| D-Cache Misses | 4,163 | 4,161 | - |
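Our architecture needs fewer cycles and less simulated time, even though its CPI is higher. That is because the optimized program executes far fewer instructions: cycles = instructions × CPI, so a 66.6% drop in instruction count outweighs a 39.8% rise in CPI.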
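The relationship between these rows can be checked with a few lines of arithmetic. The sketch below takes the cycle and instruction counts from the table and recomputes CPI, the relative changes, and the speedup. The variable and function names are illustrative only and are not part of the benchmark.

```python
# Values copied from the table above (A = BitNet baseline, B = HW).
cycles = {"A": 6_032_511, "B": 2_816_374}
insts = {"A": 7_871_514, "B": 2_628_624}


def cpi(run):
    """Cycles per instruction for one run."""
    return cycles[run] / insts[run]


def rel_change(a, b):
    """Relative change from a to b, in percent."""
    return (b - a) / a * 100.0


cycle_change = rel_change(cycles["A"], cycles["B"])
inst_change = rel_change(insts["A"], insts["B"])
cpi_change = rel_change(cpi("A"), cpi("B"))
speedup = cycles["A"] / cycles["B"]

print(f"CPI A = {cpi('A'):.3f}, CPI B = {cpi('B'):.3f}")
print(f"cycles {cycle_change:+.1f}%, insts {inst_change:+.1f}%, CPI {cpi_change:+.1f}%")
print(f"speedup = {speedup:.2f}x")
```

The recomputed values match the table: the instruction count falls by about two thirds, CPI rises by about 40%, and the net effect is roughly a 2.14x reduction in cycles.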
Then I noticed an interesting point in the data. These are the load/store instruction counts of the BNRV BitNet kernel with matrix size 1024:
```
system.cpu.commitStats0.numLoadInsts 524288 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 1024 # Number of store instructions (Count)
```

The original kernel, for comparison:

```
system.cpu.commitStats0.numLoadInsts 2359299 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 1027 # Number of store instructions (Count)
```
The store count of BNRV equals the matrix size!
To check whether this is a coincidence, I changed the matrix to different sizes:
N=2048:

```
BNRV
system.cpu.commitStats0.numLoadInsts 2097152 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 2048 # Number of store instructions (Count)
```

```
origin
system.cpu.commitStats0.numLoadInsts 9437187 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 2051 # Number of store instructions (Count)
```

```
BNRV
system.cpu.commitStats0.numLoadInsts 128 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 16 # Number of store instructions (Count)
```

```
origin
system.cpu.commitStats0.numLoadInsts 636 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 61 # Number of store instructions (Count)
```

```
BNRV
system.cpu.commitStats0.numLoadInsts 8778 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 133 # Number of store instructions (Count)
```

```
origin
system.cpu.commitStats0.numLoadInsts 39504 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 136 # Number of store instructions (Count)
```
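To make the pattern easier to see, the counts for the two sizes that are explicitly labeled (N=1024 and N=2048) can be compared against N directly. This is only a sketch over the numbers quoted above. It treats the second N=1024 listing as the original kernel, which is an assumption based on its N + 3 store pattern, and the names `STATS` and `summarize` are made up for this example.

```python
# (N, kernel) -> (loads, stores), copied from the gem5 stats quoted above.
# Assumption: the second N=1024 listing belongs to the original kernel.
STATS = {
    (1024, "BNRV"): (524_288, 1_024),
    (1024, "origin"): (2_359_299, 1_027),
    (2048, "BNRV"): (2_097_152, 2_048),
    (2048, "origin"): (9_437_187, 2_051),
}


def summarize(n):
    """Compare store counts against N and loads between the two kernels."""
    bnrv_loads, bnrv_stores = STATS[(n, "BNRV")]
    orig_loads, orig_stores = STATS[(n, "origin")]
    return {
        "bnrv_extra_stores": bnrv_stores - n,
        "origin_extra_stores": orig_stores - n,
        "load_ratio": orig_loads / bnrv_loads,
    }


for n in (1024, 2048):
    s = summarize(n)
    print(
        f"N={n}: BNRV stores - N = {s['bnrv_extra_stores']}, "
        f"origin stores - N = {s['origin_extra_stores']}, "
        f"origin/BNRV loads = {s['load_ratio']:.2f}"
    )
```

For both sizes BNRV issues exactly N stores, the original kernel issues N + 3, and the original kernel executes about 4.5 times as many loads.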
Our kernel was compiled with O3 optimization. In this setting, BNRV reaches the theoretical lower bound on store operations for this algorithm (N stores for matrix size N), whereas the original version performs a few redundant ones. This is because BNRV adds a circuit that supports SIMD operation. Each register acts as a column vector in the linear-algebra sense, and the BNRV circuit computes the MatMul directly. The circuit is implemented with MUXes, which is possible because the weights are quantized to ternary values.
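The idea of replacing multiplication with selection can be illustrated in software. The sketch below is not the BNRV circuit or its instruction set, which are not described here. It is a plain Python model of a ternary matrix-vector product in which each weight only selects +x, -x, or 0, and each output element is written back once. The function name `ternary_matvec`, the store counter, and the reading of "one store per output element" are all assumptions made for illustration.

```python
def ternary_matvec(weights, x):
    """Compute y = W @ x for W with entries in {-1, 0, +1}.

    No multiplication is used: each weight acts like a MUX select
    that picks +x[i], -x[i], or 0. Returns (y, number_of_stores).
    """
    y = []
    stores = 0
    for row in weights:
        acc = 0  # accumulator stays in a "register" for the whole row
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # select +x
            elif w == -1:
                acc -= xi      # select -x
            # w == 0 selects 0, so nothing is added
        y.append(acc)          # single write-back per output element
        stores += 1
    return y, stores


# Small usage example with a 3x3 ternary matrix.
W = [
    [1, 0, -1],
    [0, 1, 1],
    [-1, -1, 0],
]
x = [2, 3, 5]
y, stores = ternary_matvec(W, x)
print(y, stores)
```

In this model the number of write-backs equals the number of rows, which mirrors the observation that the BNRV store count equals the matrix size.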