
Now I will analyze the data to show the advantage of this architecture.

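| Metric | BitNet (A) | HW (B) | Change (B vs. A) |
|---|---|---|---|
| Cycles | 6,032,511 | 2,816,374 | -53.3% |
| Seconds | 0.012065 | 0.005633 | 2.14x faster |
| Insts | 7,871,514 | 2,628,624 | -66.6% |
| CPI | 0.766 | 1.071 | +39.8% |
| D-Cache Hits | 2,358,213 | 521,151 | -77.9% |
| D-Cache Misses | 4,163 | 4,161 | - |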

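Our architecture takes fewer cycles and less simulated time, even though its CPI is higher. The CPI rises because the optimized program executes far fewer instructions: the instruction count falls by 66.6% while the cycle count falls by 53.3%, so each remaining instruction accounts for more cycles on average.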
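To make the CPI argument concrete, here is a small sketch that recomputes the relative changes from the raw cycle, instruction, and time counts in the table. The dictionary and function names are my own, chosen just for illustration; only the numbers come from the measurements.

```python
# Values copied from the table above (gem5 statistics).
baseline = {"cycles": 6_032_511, "seconds": 0.012065, "insts": 7_871_514}
bnrv = {"cycles": 2_816_374, "seconds": 0.005633, "insts": 2_628_624}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def cpi(run):
    """Cycles per instruction for one run."""
    return run["cycles"] / run["insts"]


cycles_change = pct_change(baseline["cycles"], bnrv["cycles"])
insts_change = pct_change(baseline["insts"], bnrv["insts"])
cpi_change = pct_change(cpi(baseline), cpi(bnrv))
speedup = baseline["seconds"] / bnrv["seconds"]

print(f"cycles: {cycles_change:+.1f}%")
print(f"insts:  {insts_change:+.1f}%")
print(f"CPI:    {cpi(baseline):.3f} -> {cpi(bnrv):.3f} ({cpi_change:+.1f}%)")
print(f"speedup: {speedup:.2f}x")
```

Since CPI is cycles divided by instructions, it goes up whenever the instruction count shrinks faster than the cycle count, which is exactly the case here.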

Then I noticed an interesting point in the data. These are the load/store instruction counts of the BNRV BitNet kernel with matrix size 1024:

system.cpu.commitStats0.numLoadInsts 524288 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 1024 # Number of store instructions (Count)

For the original version:

system.cpu.commitStats0.numLoadInsts 2359299 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 1027 # Number of store instructions (Count)

The store count of BNRV equals the matrix size!

To check whether this is a coincidence, I changed the matrix to different sizes:

N = 2048:

BNRV:
system.cpu.commitStats0.numLoadInsts 2097152 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 2048 # Number of store instructions (Count)

Original:
system.cpu.commitStats0.numLoadInsts 9437187 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 2051 # Number of store instructions (Count)

N = 16:

BNRV:
system.cpu.commitStats0.numLoadInsts 128 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 16 # Number of store instructions (Count)

Original:
system.cpu.commitStats0.numLoadInsts 636 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 61 # Number of store instructions (Count)

N = 133:

BNRV:
system.cpu.commitStats0.numLoadInsts 8778 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 133 # Number of store instructions (Count)

Original:
system.cpu.commitStats0.numLoadInsts 39504 # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts 136 # Number of store instructions (Count)

In every run (N = 16, 133, 1024, and 2048), the BNRV store count equals the matrix size N. This consistent match strongly suggests a structural characteristic of the BNRV kernel rather than a coincidence.
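The comparison can be checked mechanically. The sketch below tabulates the counts quoted from the gem5 output and tests whether the BNRV store count equals N in each run. The variable names are my own, and the load ratio is only a derived number for context, not a separate measurement.

```python
# (N, BNRV loads, BNRV stores, original loads, original stores), copied from
# the gem5 output quoted above.
runs = [
    (16, 128, 16, 636, 61),
    (133, 8_778, 133, 39_504, 136),
    (1024, 524_288, 1024, 2_359_299, 1027),
    (2048, 2_097_152, 2048, 9_437_187, 2051),
]

# True only if BNRV issues exactly N stores in every run.
all_match = all(bnrv_stores == n for n, _, bnrv_stores, _, _ in runs)

for n, bnrv_loads, bnrv_stores, orig_loads, orig_stores in runs:
    extra_stores = orig_stores - n  # stores beyond N in the original kernel
    load_ratio = orig_loads / bnrv_loads  # how many times more loads it issues
    print(f"N={n:5d}  BNRV stores == N: {bnrv_stores == n}  "
          f"original extra stores: {extra_stores:3d}  "
          f"load ratio: {load_ratio:.2f}x")

print("BNRV stores equal N in every run:", all_match)
```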

Our kernel was compiled with O3 optimization. Under these conditions, BNRV reaches the theoretical lower bound on store operations for this algorithm (one store per output element), whereas the original version carries a few redundant stores (3 extra in the N = 133, 1024, and 2048 runs). This is because BNRV adds a circuit that supports SIMD operation: each register acts as a column vector, and the circuit computes the MatMul directly. The circuit is built from multiplexers (MUXes), which is possible because the weights are quantized to ternary values.
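As a rough software model of that idea (not the actual BNRV circuit), the sketch below shows why ternary weights let a multiplexer replace a multiplier, and why keeping the accumulator in a register gives one write-back per output element. It assumes a matrix-vector product with N outputs; `mux_select` and `ternary_matvec` are hypothetical names used only for this illustration.

```python
def mux_select(weight, activation):
    """Model a 3-way multiplexer: pick +x, 0, or -x based on a ternary weight."""
    if weight == 1:
        return activation
    if weight == -1:
        return -activation
    if weight == 0:
        return 0
    raise ValueError(f"weight must be -1, 0, or +1, got {weight}")


def ternary_matvec(weights, x):
    """Compute y = W @ x for ternary W, counting one store per output element."""
    y = []
    stores = 0
    for row in weights:
        acc = 0  # accumulator stays "in a register" for the whole row
        for w, xi in zip(row, x):
            acc += mux_select(w, xi)  # selection and add, no multiplier
        y.append(acc)  # the single write-back for this output element
        stores += 1
    return y, stores


# Small illustrative input (made-up values, N = 4).
W = [
    [1, 0, -1, 1],
    [0, 0, 1, -1],
    [-1, 1, 1, 0],
    [1, 1, 1, 1],
]
x = [3, -2, 5, 7]
y, stores = ternary_matvec(W, x)
print(y, stores)
```

In this model the store count equals the number of rows, which mirrors the pattern seen in the BNRV measurements.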