BNRV(4)

Now I will analyze the data to demonstrate the advantage of this architecture.

Metric	BitNet (A)	HW (B)	Change (B vs. A)
Cycles	6,032,511	2,816,374	-53.3%
Seconds	0.012065	0.005633	2.14x faster
Insts	7,871,514	2,628,624	-66.6%
CPI	0.766	1.071	+39.8%
D-Cache Hits	2,358,213	521,151	-77.9%
D-Cache Misses	4,163	4,161	-

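Our architecture needs fewer cycles and less simulated time, even though its CPI is higher. That is because the optimized program executes far fewer instructions: cycles = Insts × CPI, so a 66.6% drop in instruction count outweighs a 39.8% rise in CPI.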
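The relationship between the rows can be checked directly from the table. The snippet below is only a sanity check of that arithmetic; the variable and function names are my own and do not come from the simulator.

```python
# Cycle and instruction counts copied from the table above.
A = {"cycles": 6_032_511, "insts": 7_871_514}  # BitNet (A)
B = {"cycles": 2_816_374, "insts": 2_628_624}  # HW (B)


def cpi(run):
    """Cycles per instruction for one run."""
    return run["cycles"] / run["insts"]


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


inst_ratio = B["insts"] / A["insts"]      # fraction of instructions that remain
cpi_ratio = cpi(B) / cpi(A)               # how much more each instruction costs
cycle_ratio = B["cycles"] / A["cycles"]   # net effect on total cycles

print(f"CPI: A = {cpi(A):.3f}, B = {cpi(B):.3f} ({pct_change(cpi(A), cpi(B)):+.1f}%)")
print(f"Insts:  {pct_change(A['insts'], B['insts']):+.1f}%")
print(f"Cycles: {pct_change(A['cycles'], B['cycles']):+.1f}%")
print(f"inst_ratio * cpi_ratio = {inst_ratio * cpi_ratio:.4f}, cycle_ratio = {cycle_ratio:.4f}")
```

The last line prints the same value twice: the cycle reduction is exactly the product of the instruction reduction and the CPI increase.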

Then I noticed something interesting in the data. These are the load/store instruction counts of the BNRV BitNet kernel with matrix size 1024:

system.cpu.commitStats0.numLoadInsts           524288                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts            1024                       # Number of store instructions (Count)

For the original version:

system.cpu.commitStats0.numLoadInsts          2359299                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts            1027                       # Number of store instructions (Count)

The store count of BNRV equals the matrix size!

To check whether this is a coincidence, I changed the matrix to different sizes:

N = 2048:

BNRV
system.cpu.commitStats0.numLoadInsts          2097152                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts            2048                       # Number of store instructions (Count)

origin
system.cpu.commitStats0.numLoadInsts          9437187                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts            2051                       # Number of store instructions (Count)

N = 16:

BNRV
system.cpu.commitStats0.numLoadInsts              128                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts              16                       # Number of store instructions (Count)

origin
system.cpu.commitStats0.numLoadInsts              636                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts              61                       # Number of store instructions (Count)

N = 133:

BNRV
system.cpu.commitStats0.numLoadInsts             8778                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts             133                       # Number of store instructions (Count)

origin
system.cpu.commitStats0.numLoadInsts            39504                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts             136                       # Number of store instructions (Count)

In every run, the store count of BNRV equals the matrix size N exactly. This consistent match strongly suggests a structural characteristic of the BNRV kernel rather than a coincidence.
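For anyone who wants to repeat the comparison on their own runs, here is a minimal sketch that pulls the counters out of the stats text and checks the store counts against N. The regular expression and the helper name `parse_stats` are my own; the numbers in the two dictionaries are copied from the runs above.

```python
import re

# One stats line looks like: <name> <integer value> # <description>
STAT_RE = re.compile(r"^(\S+)\s+(\d+)\s+#", re.MULTILINE)


def parse_stats(text):
    """Return {stat_name: integer value} for each stats line in text."""
    return {name: int(value) for name, value in STAT_RE.findall(text)}


SAMPLE = """\
system.cpu.commitStats0.numLoadInsts             8778                       # Number of load instructions (Count)
system.cpu.commitStats0.numStoreInsts             133                       # Number of store instructions (Count)
"""

# (loads, stores) per matrix size N, copied from the runs above.
BNRV = {16: (128, 16), 133: (8778, 133), 1024: (524288, 1024), 2048: (2097152, 2048)}
ORIGIN = {16: (636, 61), 133: (39504, 136), 1024: (2359299, 1027), 2048: (9437187, 2051)}

stats = parse_stats(SAMPLE)
print(stats["system.cpu.commitStats0.numStoreInsts"])  # store count of the N = 133 BNRV run

for n in sorted(BNRV):
    bnrv_loads, bnrv_stores = BNRV[n]
    origin_loads, origin_stores = ORIGIN[n]
    print(
        f"N = {n:4d}: BNRV stores == N: {bnrv_stores == n}, "
        f"extra stores in origin: {origin_stores - n}, "
        f"load ratio origin/BNRV: {origin_loads / bnrv_loads:.2f}"
    )
```

In these four runs the BNRV store count matches N every time, while the original version always issues some stores beyond N.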

Our kernel was compiled with O3 optimization. Under this setting, BNRV reaches the theoretical lower bound on store operations for this algorithm (N stores), whereas the original version performs a few redundant ones. This is because BNRV adds a circuit that supports SIMD operation. Each register acts as a column vector in the linear-algebra sense, and the BNRV circuit computes the MatMul directly. The circuit is built from MUXes, which is possible because the weights are quantized to ternary values.
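To make the idea concrete, here is a small software model of it. It is not the BNRV circuit or its instruction set, only a sketch under my own assumptions: the kernel is treated as a matrix-vector product with weights in {-1, 0, +1}, each "multiplication" is a MUX-style selection between +x, -x and 0, and each output row is accumulated in a register and written back once. The names `mux_ternary` and `ternary_matvec` are hypothetical.

```python
def mux_ternary(w, x):
    """Select +x, -x or 0 according to the ternary weight w (no multiplier)."""
    if w == 1:
        return x
    if w == -1:
        return -x
    if w == 0:
        return 0
    raise ValueError(f"weight must be -1, 0 or +1, got {w}")


def ternary_matvec(W, x):
    """Compute y = W @ x for ternary W and count the writes to y."""
    y = [0] * len(W)
    stores = 0
    for i, row in enumerate(W):
        acc = 0                      # stays in a "register" for the whole row
        for w, xj in zip(row, x):
            acc += mux_ternary(w, xj)
        y[i] = acc                   # the only write to memory for this row
        stores += 1
    return y, stores


W = [
    [1, 0, -1],
    [0, 0, 1],
    [-1, -1, -1],
]
x = [2, 3, 5]

y, stores = ternary_matvec(W, x)
print(y, stores)  # [-3, 5, -10] 3
```

In this model the number of stores equals the number of output rows, which is the same pattern as the store counts measured above.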

Tian`s Blog
