
Official Review #1

Open
micronet-challenge-submissions opened this issue Oct 23, 2019 · 26 comments

@micronet-challenge-submissions (Collaborator)

Hello! Thanks so much for your entry!

When I try to run eval, I get errors loading the weight_path. It looks like you have a local path hardcoded into the script there. Is that file available somewhere in this repo that I'm not seeing? Or is it not necessary?

Trevor

@micronet-challenge-submissions (Collaborator, Author)

Ping. Please let us know about this issue as soon as you can!

Trevor

@tilmto (Owner) commented Oct 25, 2019

Sorry for the mistake. I forgot to change that local path. I have corrected it to the right one, so you can try to run eval now. Please contact me through my email [email protected] if you encounter any other issues. Thank you very much!

@micronet-challenge-submissions (Collaborator, Author)

Thanks for the fix! I've successfully validated your model accuracy. A few questions about your scoring:

  1. Is the reason you're calling reduce_mean for the parameter counts that the parameters in different channels can have different bit-widths?

params += (model_info[key][0]['expand']*tf.reduce_mean(self.quant_info[new_key]['expand']['weight'])

  2. Why are only the projection parameters divided by 32 here?

+ model_info[key][0]['project']*tf.reduce_mean(self.quant_info[new_key]['project']['weight'])) / 32

  3. For all of your conv/matmul operations, it looks like you're counting both multiplication and addition as being performed in reduced precision:

flops += model_info[key][1]['total'] * 8 / 32

However, from your code it appears that you're performing "fake quantization" and rounding the input weights and activations to each layer before performing these operations in FP32. With this scheme, the additions should be counted as occurring in full precision, because the results of the multiplications will be FP32 and those FP32 values will then be summed without rounding to the reduced-precision format.

  4. Swish activation functions should be counted as four operations (see example here). Also, you round the input operand to reduced precision, but all operations after the first in the swish should be counted as full precision. In summary, swish should be counted as four operations, only one of which is reduced precision (a small sketch of this accounting follows this list).

  5. Why is your FLOP count scaled by 2 here:

self.flops = flops*2 + flops_swish
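As a rough illustration of the accounting in point 4 (a sketch only; swish_op_counts is a hypothetical helper, not something in your scoring script), assuming swish is evaluated as x / (1 + exp(-x)):

def swish_op_counts(num_activations):
    # Four elementwise ops per activation: negate, exp, add 1, divide.
    # Only the negation sees the rounded (reduced-precision) input; the exp,
    # add, and divide all operate on full-precision FP32 intermediates.
    reduced_precision_ops = 1 * num_activations
    full_precision_ops = 3 * num_activations
    return reduced_precision_ops, full_precision_ops

print(swish_op_counts(1000))  # (1000, 3000) for a layer with 1000 activations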

Thanks!
Trevor

@tilmto (Owner) commented Oct 25, 2019

  1. Yes.
  2. I think you misread the brackets there; all of the parameters are divided by 32.
  3. The result of the multiplication is still a quantized integer because all of the weights and activations are quantized to integers. Compared with the original FP32 + FP32 case, I think my way of calculating the metric makes sense. Besides, the latency and energy cost of addition is much lower than that of multiplication.
  4. Swish is net*sigmoid(net). I think the multiplication can be fused into the sigmoid due to the numerator 1, i.e. net/(1+exp(-net)). Here net is quantized and 1 is quantized; the only difference is the exponential part, which would not make a difference if quantized. I once emailed you about this function, and I think I misunderstood that all of the operations could be directly counted as quantized operations, since this would not make a difference to the accuracy.
  5. Because the FLOPs I calculate before this step are actually MACs, i.e. multiply-accumulates. I think multiplication and addition need to be counted separately, right?

Please contact me if you have more questions, thanks!

@micronet-challenge-submissions (Collaborator, Author)

  1. Sounds good.

  2. Ah, you're correct. Thanks!

  3. The cost of addition vs. multiplication depends on the numerical format, which our rules are agnostic of. For simplicity, we count both as equal cost. The output of the multiplication being a quantized integer value would be true if it can be represented in the 23-bit mantissa of an FP32 value. However, the rounding procedure that you apply prior to each operation re-scales the quantized values after rounding and clamping them:

descrete_input_data = tf.div(net, scale_node, name="discrete_data")

This scaling can cause the outputs of your multiplications to be non-representable as reduced-precision integer values. These floating-point values are then all summed together, which does not properly model the error introduced by performing reduced-precision accumulation. According to the competition rules, these additions should be counted as FP32. Please update your scoring accordingly.

  4. You only round the input to the operation, so the negation can be counted as reduced precision. The exponential will then output a full 32-bit value, which you do not round to demonstrate that it does not affect model quality. You then add 1 to a 32-bit value, which counts as a 32-bit operation, and divide a quantized value by a 32-bit value, which also counts as a 32-bit operation. Please update your scoring to reflect this as well.

  5. They do need to be counted separately, thanks for clarifying!

Thanks!
Trevor

@tilmto (Owner) commented Oct 25, 2019

  1. I think you can refer to this paper for why the output can still be low-precision integers. It's a common way of calculating FLOPs in many low-precision inference papers. Please correct me if I'm wrong.
  2. Can I re-upload a new checkpoint with a quantized-exp swish? In my experiments, the precision of this part does not matter. It's totally OK if this is not allowed by your regulations.

Thank you for your careful check!

@micronet-challenge-submissions (Collaborator, Author)

  1. The output can definitely still be low-precision integers, but when we're emulating reduced-precision arithmetic in FP32, as we are here, you need to be careful about rounding so that the evaluation procedure appropriately models true reduced-precision arithmetic. The rules of the competition are designed to take this into account, and in your case there are no explicit rounding steps or checks, prior to the additions being performed, that verify the necessary conditions are met.

  2. To be fair to the other competitors, we will stick with the checkpoint that you submitted prior to the deadline. It's very cool that you verified this works, though, and if you want to upload it anyway we'd be interested to see it!

Thanks for your responses! If you can update your score with these two changes it looks like everything else checks out!

Trevor

@tilmto (Owner) commented Oct 25, 2019

  1. I still haven't got your point. I think Equation 7 in the above paper can be emulated accurately by our quantization procedure. This method is exactly the same as TensorFlow's official fake quant node and has been verified to achieve the same accuracy when converted to TFLite format and executed on mobile devices. You can double-check it. When you mention "explicit rounding", do you mean that after calculating each partial sum in one convolution operation, we need to round it to an integer before summing them up into a new element of the output activation?
  2. Got it!

Sorry for the late response, I was in a meeting.

@micronet-challenge-submissions (Collaborator, Author)

Here's an example: for simplicity, in the quant_info.json file under "Conv", I found an example channel where the weights and activations are both 8-bit. In your evaluation script, I dumped the scales and biases used for this channel during evaluation:

w_bits = a_bits = 8
w_scale = 23.5443649
w_bias = -107
a_scale = 3.59408283
a_bias = -1

We have weight value w and activation value x. During evaluation and prior to the convolution, we compute:

w' = clip(round(w * w_scale) - w_bias, 0, 255) / w_scale
x' = clip(round(x * a_scale) - a_bias, 0, 255) / a_scale

If w has value 1.23, w' will have value:
clip((round(1.23*23.5443649) - -107), 0, 255) / 23.5443649
= 5.776

If x has value 3.21, x' will have value:
clip((round(3.21*3.59408283) - -1), 0, 255) / 3.59408283
= 3.617

When we multiply these two values, we get the value 20.891792. This is not an integer value. When we sum a number of these together, it will be done with full FP32 additions. We don't know what the difference in quality is between this and true integer accumulation, but if your model missed even a single additional image it would no longer meet the accuracy threshold. According to the competition rules, these additions should be counted as full 32-bit additions.
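For reference, here is a minimal sketch reproducing this arithmetic in plain Python (fake_quant is an illustrative helper, not your evaluation code; the values are the ones dumped above):

def fake_quant(x, scale, bias, bits=8):
    q = min(max(round(x * scale) - bias, 0.0), 2.0 ** bits - 1.0)  # round and clip to [0, 255]
    return q / scale                                               # re-scale back to floating point

w_prime = fake_quant(1.23, 23.5443649, -107)  # ~5.776
x_prime = fake_quant(3.21, 3.59408283, -1)    # ~3.617
print(w_prime * x_prime)                      # ~20.89, not an integer; the FP32 accumulator
                                              # then sums such values directly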

@tilmto (Owner) commented Oct 25, 2019

Yes, it's just a simulation, so the values are not true integers. I think we have a misunderstanding here. Are you familiar with this quantization-aware training paper, especially the derivation of Equation 7, and with how the TensorFlow framework inserts fake quant nodes when creating the training graph? This is just a simulation of Equation 7, so the weights/activations are not true integers, but they successfully simulate the quantization effect just as Equation 7 does, which is well supported by TFLite and mobile devices. It's a common simulation tool in many related papers; you can refer to the implementations of papers such as WAGE and Scalable 8-bit training.

@tilmto (Owner) commented Oct 25, 2019

If you have doubts about the gap between fake quant nodes and TFLite deployment on mobile devices, I think both our results and many results from your company have shown no difference. As for the metric/accuracy tradeoff you mentioned, I can definitely improve the accuracy by increasing the metric a little; I just wanted to give you the checkpoint with the best metric.
Thanks for your patience!

@micronet-challenge-submissions (Collaborator, Author)

We understand that this is standard procedure for evaluating the performance of quantized models. It's also standard procedure to use higher-precision accumulators when performing actual quantized inference. From the QaT paper you linked: "Accumulating products of uint8 values requires a 32-bit accumulator".

The competition rules are designed around this system. The additions in this case are considered to be 32-bit, and should be counted as such. Please update your score to reflect this.

@tilmto (Owner) commented Oct 26, 2019

"Accumulating products of uint8 values requires a 32-bit accumulator" is just a way of hardware implementation, doesn't mean the addition is performed between 2 32-bit numbers. You can also use different hardware design like chunked-based accumulation. And for the QaT paper, their precision is 8-bit, which is much higher than us so a 32-bit accumulator is required. What we focus on is algorithm part, and there are many hardware tricks target at this like https://arxiv.org/abs/1901.06588.

We have quantized the weights and activations to 2.94 bits and 4.87 bits on average. Our metric will increase a lot due to this special rule, since most of the savings in FLOPs cannot be taken into account. In this accounting the FLOPs of "addition" are 8 times those of "multiplication", dominating the total metric, which does not make sense at all since multiplication is actually much more expensive than addition. Then the metric can never reflect the true performance of a method on hardware.

I wonder about the other teams' metrics. I believe the teams with low metrics all use quantization, so do they all need to count additions in that way? I hope you can discuss this further with your team and come up with a fair solution.

Thanks a lot for the long discussion!

@celinerice (Collaborator) commented Oct 26, 2019

Hi Trevor,

Thanks a lot for your questions and the long discussion with us! As this challenge emphasizes that "Our goal is for our scoring metric to be indicative of performance that could be achieved in hardware", we would like to argue that 1) using a 32-bit accumulator for all additions and, at the same time, 2) treating the computational cost of a 32-bit accumulator the same as that of a 32-bit multiplier greatly overcounts the models' computational cost. We would greatly appreciate it if you could check our justification below:

  1. As we know, a 1-bit full adder is a canonical building block of arithmetic units, and it is commonly used as a measure of computational cost in both the machine learning and hardware communities. Specifically, multiplying two N-bit numbers requires N^2 full adders while adding two N-bit numbers requires N full adders (N^2/N = 32 when N = 32!), and Eq. 3 in this ICML 2017 paper formulates the required number of full adders for a D-dimensional dot product between activations and weights. Hopefully you agree that your way of 1) assigning 32-bit accumulators to the additions in our model and 2) using the cost of 32-bit multipliers for these 32-bit accumulators can greatly deviate from the model's actual computational cost. I understand that the most accurate way is to quantify the total number of full adders. For example, assuming that we use 32-bit accumulators for all of the additions in our model, the corresponding computational cost of a 32-bit addition is similar to that of a 6-bit multiplication (to be more precise, 5.65bit*5.65bit; see the arithmetic sketch after this list).

  2. The 32-bit accumulator is indeed one potential hardware design choice for the additions in our model; however, such a worst-case design is typically adopted only for ease of design when the addition cost is negligible. When the addition cost is not trivial, it is more common to use adder trees, in which the number of addition blocks halves at each successive adder depth while their width increases by one bit, culminating in the final multi-bit output (see Fig. 6 in this reference). With such a commonly used adder tree design, additions would need about 10 bits on average in our model and thus no longer be the bottleneck of convolutions, which is also consistent with the commonly recognized observation that "multiplications greatly dominate the computational cost of DNN convolutions" (see reference 1 and reference 2).
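For reference, a quick arithmetic check of point 1 under the full-adder cost model above (a sketch of the reasoning only, not a statement of the official scoring rules):

import math

full_adders_per_32bit_add = 32  # cost model: an N-bit add uses ~N full adders,
                                # an N-bit x N-bit multiply uses ~N^2 full adders
equivalent_multiplier_bits = math.sqrt(full_adders_per_32bit_add)
print(equivalent_multiplier_bits)  # ~5.66, i.e. one 32-bit add costs about as much
                                   # as a 5.65-bit x 5.65-bit multiply in this model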

We believe that the above justifications hold for all participants’ models in this challenge. We look forward to your comments.

Yingyan

@micronet-challenge-submissions (Collaborator, Author)

Yonggan & Yingyan,

Apologies for the delay; we have a lot of entries to get through. :)

We agree that the scoring system does not accurately reflect the relative cost of integer multiplications and additions. However, the competition is not limited to integer arithmetic. As we've explained, we decided to keep all operations at the same cost to avoid making the first iteration of the competition too complex.

We allow entries to count their additions as if the minimum number of bits necessary were used, provided they accurately model these additions in their evaluation code. There are two conditions for meeting this requirement:

  1. The number of bits needed to represent the results of multiplications and additions exactly must be less than or equal to 23. This ensures that the values can be exactly represented in the 23-bit IEEE FP32 mantissa. For a given linear operation where the inputs have bit-widths A and B and the reduction dimension is of size K, this means that A + B + log2(K) <= 23. This needs to be verified for every neuron/channel in the model (a small check is sketched after this list).
  2. Weights and activations need to be input into linear operations in their integer form, such that their values are exactly represented in the mantissa. This means that re-scaling to FP32 must occur after the linear operation.
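For concreteness, a minimal sketch of how condition (1) could be checked (the bit-widths and reduction size below are hypothetical examples, not values taken from your quant_info.json):

import math

def products_exact_in_fp32(a_bits, b_bits, reduction_size):
    # The accumulated sum of A-bit x B-bit products over a reduction of size K
    # stays exactly representable in the 23-bit FP32 mantissa only if
    # A + B + log2(K) <= 23.
    return a_bits + b_bits + math.log2(reduction_size) <= 23

print(products_exact_in_fp32(8, 8, 3 * 3 * 32))  # False: 8 + 8 + ~8.2 > 23
print(products_exact_in_fp32(4, 5, 3 * 3 * 32))  # True:  4 + 5 + ~8.2 <= 23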

While you could check condition 1, condition 2 requires code changes and unfortunately we can't allow you to change your entry as it wouldn't be fair to the other competitors.

Trevor

@tilmto (Owner) commented Oct 29, 2019

Hi Trevor,

I think the second point you mentioned needs discussion; it only targets the traditional quantization method k*[w/k], where k is the minimal value in the low-precision representation. As I have mentioned, a quantization-range-based quantization method, i.e. one normalized by the min/max values of the full-precision weights, will always use full-precision inputs to the convolution during the simulation process, but the quantization effect has already been fused in beforehand, so it correctly simulates Equation 7 (the forward process on hardware) in the QaT paper, which definitely takes integer inputs.

Currently, many popular low-precision training methods aiming to simulate integer-input convolutions use this quantization-range-based quantization method, such as WAGE and Scalable 8-bit training. This method is also adopted by TensorFlow's official quantization approach, which is supported by TFLite on mobile devices. These facts and the derivation of (1)-(7) in the QaT paper both show the correctness of this simulation method. I hope you can treat traditional quantization and quantization-range-based quantization equally.

Also, the accumulator implementation we mentioned above works for all fixed-point numbers, not just integers, and it is common sense that addition operations shouldn't be the bottleneck of a neural network.

To make the discussion more efficient, I want to know what your main question is. Our main point is that our simulation method (also the QaT method) can correctly simulate an integer-input convolution on real hardware. So do you have doubts about the quantization-range-based quantization method in the QaT paper itself, or are you only questioning our implementation?

Thanks a lot, and I look forward to your reply.

Yonggan

@micronet-challenge-submissions (Collaborator, Author)

I'm not sure what you mean by (2) targeting only quantization methods of the form k*[w/k]. In your approach, which uses both the min and max, you convert the value to an integer in the range [0, 2**nbits - 1] here:

net = tf.clip_by_value(net - quant_bias, clip_value_min=0., clip_value_max=tf.reshape(2.**bits-1., [1,1,1,-1])) + quant_bias

and then scale it back from the integer representation to the rounded floating-point representation here:

descrete_input_data = tf.div(net, scale_node, name="discrete_data")

If you waited to apply the re-scaling until after the linear operation, as (2) suggests, and your model satisfied point (1), then the simulation of fixed-point arithmetic would be exact. We would thus accept that scoring under the rules of the competition. However, you scale back to the FP representation prior to executing the kernel, so the arithmetic is approximate. Under the rules of the competition this is acceptable for counting the multiplications as reduced precision, but because the inputs to each addition inside the kernel are not rounded prior to executing the addition instructions, we do not allow those additions to be counted as reduced precision.
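For illustration, here is a toy NumPy sketch of the distinction (the integer codes are hypothetical and this is not taken from your evaluation script; the scales are the ones dumped earlier):

import numpy as np

w_scale, x_scale = 23.5443649, 3.59408283
q_w = np.array([136.0, 12.0, 200.0])  # integer codes, exactly representable in FP32
q_x = np.array([13.0, 7.0, 1.0])

# Re-scaling applied after the linear operation: the accumulation runs over exact
# integers, so (given condition 1) it models fixed-point arithmetic exactly.
exact = np.dot(q_w, q_x) / (w_scale * x_scale)

# Re-scaling applied before the linear operation: the products are non-integer
# FP32 values and the additions inside the kernel are full FP32.
approx = np.dot(q_w / w_scale, q_x / x_scale)

print(exact, approx)  # numerically close here, but the second form does not model
                      # the error of true reduced-precision accumulation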

Perhaps this more rigorous standard of proof is a shortcoming of the current rules and we should revisit it in future iterations of the competition. We'd be happy to discuss this with you further. However, for practical purposes we need to complete scoring of the entries by the end of this week and your current score does not comply with the rules. We would appreciate it if you would update your score.

@tilmto (Owner) commented Oct 29, 2019

Hi Trevor,

I think I see your question now. You may have some misunderstanding of the hardware execution process for quantization-range-based quantization. We can refer to the QaT paper to explain this more clearly.

Equation 7 there is the calculation of a convolution during inference on the hardware after applying quantization-range-based quantization; you can see that q1 and q2 are the quantized inputs (weights and activations).
[Equation 7 from the QaT paper]
This is derived from Equation 3, where the operands are S(q - z), full-precision numbers, which is exactly the input in our implementation.
[Equation 3 from the QaT paper]
The only difference between these two equations is M = S1*S2/S3, which the paper claims is maintained with 30 bits of precision. So we can conveniently simulate the hardware situation in Equation 7 based on Equation 3, and that's exactly the core idea of the QaT paper. That's also why TensorFlow uses this method even for TFLite deployment on mobile devices.

You can double check the derivation in the paper to confirm it. Thanks!

Yonggan

tilmto closed this as completed Oct 29, 2019
tilmto reopened this Oct 29, 2019
@micronet-challenge-submissions (Collaborator, Author)

Apologies, I think the confusion stems from the fact that the methodology I mentioned applies to quantization approaches where the zero-point for both weights and activations is 0. You can see in equation 4 from QaT that the product decomposes into what I'm describing, with the scale applied after the inner product. Thus this is not an option for your approach.
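As a small illustration of that decomposition (a sketch with hypothetical symmetric quantization, i.e. zero-points of 0 for both operands, and toy values):

import numpy as np

S1, S2 = 0.05, 0.1
q1 = np.array([3, -7, 12], dtype=np.int64)  # integer weight codes
q2 = np.array([5, 2, -1], dtype=np.int64)   # integer activation codes

r_scaled_after = (S1 * S2) * np.dot(q1, q2)  # inner product stays in the integer domain;
                                             # the scale is applied once, afterwards
r_reference = np.dot(S1 * q1.astype(float), S2 * q2.astype(float))
print(np.isclose(r_scaled_after, r_reference))  # True: with zero zero-points the scales factor out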

The issue with your evaluation approach is not the factorization of the quantized computation; it's the use of floating point to simulate fixed-point computation. We allow this under the circumstances described in the rules, but your model does not meet these criteria, given that you do not round prior to performing the additions inside the convolution kernels. We ask that you please update your score to reflect this.

@celinerice (Collaborator) commented Oct 29, 2019 via email

@micronet-challenge-submissions (Collaborator, Author)

Thank you for the discussion as well.

You are certainly correct that assuming 32 bits for all additions will overestimate the cost of your model. Unfortunately, there isn't anything we can do about this at this point. Your evaluation procedure doesn't comply with the requirements of the competition rules, and we can't change the rules or make an exception while being fair to the other competitors.

Trevor

@celinerice (Collaborator) commented Oct 30, 2019 via email

@micronet-challenge-submissions (Collaborator, Author)

Yingyan & Yonggan,

We're hoping to finalize the results tomorrow so that we can release them early next week. Could you advise whether or not you plan to update your score?

Trevor

@tilmto (Owner) commented Nov 1, 2019

Hi Trevor,

Sorry for the late update. I'm actually still running experiments to verify the correctness of our simulation, but I will provide the metric under your rules first. I added the FLOPs of the swish part and changed the FLOPs of the additions to full precision. Here are our metric and each part's contribution to the final metric:

Final Metric: 0.46779
Params: 0.07742 (0.534M)
Flops of Multiplication: 0.04878 (57.072M)
Flops of Addition: 0.32974 (385.796M, which heavily dominates the metric)
Flops of Swish: 0.01184 (13.855M)

Thank you very much for the reminder!

Yonggan

@micronet-challenge-submissions (Collaborator, Author)

Thanks Yonggan!

Trevor
