Official Review #1
Ping. Please let us know about this issue as soon as you can! Trevor |
Sorry for the mistake. I forgot to change this local path. I have corrected it to the right one, so you can try evaluating it now. Please contact me through my email [email protected] if you encounter any other issues. Thank you very much! |
Thanks for the fix! I've successfully validated your model accuracy. A few questions about your scoring:
However, from your code it appears that you're performing "fake quantization": rounding the input weights and activations to each layer and then performing the layer computations in FP32. With this scheme, the additions should be counted as occurring in full precision, since the results of the multiplications will be FP32 and those FP32 values will then be summed without rounding to the reduced-precision format.
Thanks! |
|
Adjustable-Quantization-MicroNet/adj_quant/scripts/effnetb0/effnetb0_model_adjutable.py Line 148 in 3ffb2ed
This scaling can cause the outputs of your multiplications to be non-representable as reduced-precision integer values. These floating-point values are then all summed together, which does not properly model the error introduced by reduced-precision accumulation. According to the competition rules, these additions should be counted as FP32. Please update your scoring accordingly.
Thanks! |
|
Thanks for your responses! If you can update your score with these two changes it looks like everything else checks out! Trevor |
|
Here's an example. For simplicity, I found an example channel in the quant_info.json file under "Conv" where the weights and activations are both 8-bit, and in your evaluation script I dumped the scales and biases used for these channels during evaluation: w_bits = a_bits = 8. We have the weight value w' = clip(round(w * w_scale) - w_bias, 0, 255) / w_scale. If w has the value 1.23, w' will have the value: If x has the value 3.21, x' will have the value: When we multiply these two values, we get 20.891792, which is not an integer value. When we sum a number of these together, it is done with full FP32 additions. We don't know what the difference in quality is between this and true integer accumulation, but if your model missed even a single additional image it would no longer meet the accuracy threshold. According to the competition rules, these additions should be counted as full 32-bit additions. |
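For illustration, here is a minimal NumPy sketch of this fake-quantization scheme; the per-channel scale and bias values below are hypothetical, not the ones dumped from the submission:

```python
import numpy as np

def fake_quant(x, scale, bias, n_bits=8):
    """Round onto the integer grid, then rescale back to floating point."""
    q = np.clip(np.round(x * scale) - bias, 0, 2 ** n_bits - 1)
    return q / scale

# Hypothetical per-channel scales and biases for weights and activations.
w_scale, w_bias = 17.0, -3.0
a_scale, a_bias = 5.0, 0.0

w_prime = fake_quant(1.23, w_scale, w_bias)  # de-quantized weight (a float)
x_prime = fake_quant(3.21, a_scale, a_bias)  # de-quantized activation (a float)

# The product of two de-quantized floats is generally not a reduced-precision integer...
product = w_prime * x_prime

# ...and accumulating many such products happens in FP32, not in reduced precision.
acc = np.float32(0.0)
for _ in range(64):
    acc += np.float32(product)
print(product, float(acc))
```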
Yes, it's just a simulation, so the values are not true integers. I think we have some misunderstanding here. Are you familiar with this quantization-aware training paper, in particular the derivation of equation 7, and with how the TensorFlow framework inserts fake-quant nodes when creating the training graph? Our code is simply a simulation of equation 7, so the weights/activations are not true integers, but they faithfully simulate the quantization effect exactly as equation 7 does, which is well supported by TFLite and mobile devices. It's a common simulation tool used by many related papers; you can refer to the implementations of WAGE and Scalable 8-bit training. |
If you have doubts about the gap between fake-quant nodes and actual TFLite deployment on mobile devices, I think both our results and many results from your company have shown no difference. As for the metric/accuracy tradeoff you mentioned, I can definitely improve the accuracy by increasing the metric a little; I just wanted to give you the checkpoint with the best metric. |
We understand that this is standard procedure for evaluating the performance of quantized models. It's also standard procedure to use higher precision accumulators when performing actual quantized inference. From the QaT paper you linked "Accumulating products of uint8 values requires a 32-bit accumulator". The competition rules are designed around this system. The additions in this case are considered to be 32-bit, and should be counted as such. Please update your score to reflect this. |
"Accumulating products of uint8 values requires a 32-bit accumulator" is just a way of hardware implementation, doesn't mean the addition is performed between 2 32-bit numbers. You can also use different hardware design like chunked-based accumulation. And for the QaT paper, their precision is 8-bit, which is much higher than us so a 32-bit accumulator is required. What we focus on is algorithm part, and there are many hardware tricks target at this like https://arxiv.org/abs/1901.06588. We have averagely quantized the weights and activations to 2.94 bits and 4.87 bits. Our metric will increase a lot due to this special regulation, since most of the savings in Flops cannot be taken into count. And in this way, the Flops of "addition" is 8 times of "multiplication", dominating the total metric which does not make sense at all since actually multiplication is much more expensive than addition. Then your metric can never reflect the true performance of a method on hardware. I wonder other teams' metric, I believe the teams with low metric all use quantization, they all need to calculate addition in that way? I hope you can further discuss with your team and generate a fair solution. Thanks a lot for the long discussion! |
Hi Trevor, Thanks a lot for your questions and the long discussion with us! Since this challenge emphasizes that “Our goal is for our scoring metric to be indicative of performance that could be achieved in hardware”, we would like to argue that (1) assuming a 32-bit accumulator for all additions while at the same time (2) treating the computational cost of a 32-bit accumulation the same as that of a 32-bit multiplication greatly overcounts the models’ computational cost. We would greatly appreciate it if you could check our justification below:
We believe that the above justifications hold for all participants’ models in this challenge. We look forward to your comments.
|
Yonggan & Yingyan, Apologies for the delay, we have a lot of entries to get through:) We agree that the scoring system does not accurately reflect the relative cost of integer multiplications and additions. However, the competition is not limited to integer arithmetic. As we've explained, we decide to keep all operations the same cost to avoid making the first iteration of the competition too complex. We allow entries to count their additions as if the minimum number of bits necessary were used provided they accurately model these additions in their evaluation code. There are two conditions for meeting this requirement
While you could check condition 1, condition 2 requires code changes and unfortunately we can't allow you to change your entry as it wouldn't be fair to the other competitors. Trevor |
Hi Trevor, I think the second point you mentioned needs discussion; it only targets the traditional quantization method k*[w/k], where k is the minimal value of the low-precision representation. As I have mentioned, a quantization-range-based method, i.e. one normalized by the min/max values of the full-precision weights, will always feed full-precision values into the convolution during the simulation, but the quantization effect has already been fused in beforehand, so it correctly simulates equation 7 of the QaT paper (the forward process on hardware), which does take integer inputs. Many popular low-precision training methods that aim to simulate integer-input convolution use this quantization-range-based method, e.g. WAGE and Scalable 8-bit training. It is also the approach adopted by TensorFlow's official quantization method, which is supported by TFLite on mobile devices. These facts, together with the derivation of (1)-(7) in the QaT paper, both show the correctness of this simulation method. I hope you can treat traditional quantization and quantization-range-based quantization equally. Also, the accumulator implementation we mentioned above works for all fixed-point numbers, not just integers, and it is common sense that additions should not be the bottleneck of a neural network. To make the discussion more efficient, I want to know your main question here. Our main point is that our simulation method (which is also the QaT method) correctly simulates integer-input convolution on real hardware. So do you have doubts about the quantization-range-based method in the QaT paper itself, or are you only questioning our implementation? Thanks a lot, and I look forward to your reply. Yonggan |
I'm not sure what you mean when you say that (2) only targets quantization methods of the form k*[w/k]. In your approach, which uses both the min and max, you convert the value to an integer in the range [0, 2**nbits - 1] here: Adjustable-Quantization-MicroNet/adj_quant/scripts/effnetb0/effnetb0_model_adjutable.py Line 146 in 3ffb2ed
and then scale it back from the integer representation to the rounded floating-point representation here: Adjustable-Quantization-MicroNet/adj_quant/scripts/effnetb0/effnetb0_model_adjutable.py Line 148 in 3ffb2ed
If you waited to apply the re-scaling until after the linear operation, as (2) suggests, and your model satisfies point (1) then the simulation of fixed-point arithmetic is exact. We would thus accept that scoring under the rules of the competition. You scale back to the FP representation prior to executing the kernel, so the arithmetic is approximate. Under the rules of the competition this is acceptable for counting the multiplications as reduced precision, but because the inputs to each addition inside the kernel are not rounded prior to executing the addition instructions we do not allow these to be counted as reduced precision. Perhaps this more rigorous standard of proof is a shortcoming of the current rules and we should revisit it in future iterations of the competition. We'd be happy to discuss this with you further. However, for practical purposes we need to complete scoring of the entries by the end of this week and your current score does not comply with the rules. We would appreciate it if you would update your score. |
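To make the distinction concrete, here is a minimal sketch with made-up integer codes and scales (zero-points are ignored for simplicity): deferring the single rescale until after the integer inner product models fixed-point accumulation exactly, whereas rescaling each operand first means the accumulation happens over non-integer FP32 values.

```python
import numpy as np

# Hypothetical quantized operand codes (already integers) and scales.
q_w = np.array([24, 3, 17, 250], dtype=np.int64)
q_x = np.array([16, 9, 5, 128], dtype=np.int64)
w_scale, a_scale = 17.0, 5.0

# (a) Rescale AFTER the linear operation: every multiply and add is on
#     integers, so this is bit-exact with what fixed-point hardware computes.
exact = float(np.dot(q_w, q_x)) / (w_scale * a_scale)

# (b) Rescale BEFORE the linear operation: the products are non-integer
#     floats and the accumulation runs in FP32.
approx = np.float32(0.0)
for wi, xi in zip(q_w / w_scale, q_x / a_scale):
    approx += np.float32(wi * xi)

print(exact, float(approx))  # equal only up to FP32 rounding of the accumulation
```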
Hi Trevor, I think I know your question now. You may have some misunderstanding of how quantization-range-based quantization executes on hardware. We can refer to QaT to explain this more clearly. Equation 7 there is the computation of a convolution during inference on hardware after applying quantization-range-based quantization; you can see that q1 and q2 are the quantized inputs (weights and activations). You can double-check the derivation in the paper to confirm it. Thanks! Yonggan |
Apologies, I think the confusion stems from the fact that the methodology I mentioned is for quantization approaches where the zero-point for both weights and activations is 0. You can see that in equation 4 from QaT the product decomposes into what I'm describing, with the scale applied after the inner product. Thus this is not an option for your approach. The issue with your evaluation approach is not the factorization of the quantized computation, it's the use of floating point to simulate fixed-point computation. We allow this under the circumstances described in the rules, but your model does not meet these criteria given that you do not round prior to performing the additions inside the convolution kernels. We ask that you please update your score to reflect this. |
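For reference, the factorization being described can be sketched as follows (a sketch assuming symmetric quantization with zero-points of 0; the codes and scales below are made up): when w' = s_w * q_w and x' = s_x * q_x, the scales factor out of the inner product and can be applied once after an all-integer accumulation. With nonzero zero-points, additional cross terms appear and the computation no longer reduces to a single post-sum rescale.

```python
import numpy as np

rng = np.random.default_rng(1)
q_w = rng.integers(-8, 8, size=32)   # symmetric weight codes (zero-point 0)
q_x = rng.integers(0, 16, size=32)   # activation codes (zero-point 0)
s_w, s_x = 0.05, 0.12                # hypothetical scales

# Scales applied to every element before the products are summed...
lhs = np.sum((s_w * q_w) * (s_x * q_x))

# ...equals an all-integer inner product with a single rescale at the end.
rhs = s_w * s_x * np.dot(q_w, q_x)

assert np.isclose(lhs, rhs)
```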
Dear Trevor,
Thank you for taking the time to discuss with us!
I agree that our model could save more hardware resources if we waited to apply the re-scaling until after the linear operation. However, I hope you can also consider the second point of my previous response (copied below for your convenience). Even if we did the rescaling before the linear operation, the additions would not need to be performed in 32-bit precision until the accumulated carries extend the result to 32 bits. Therefore, assuming 32 bits for ALL additions in our model greatly overcounts our model's complexity. We would greatly appreciate it if you could consider this aspect. I believe we share the common motivation of encouraging techniques that can have real benefits in hardware.
The 32-bit accumulator is indeed one possible hardware design for the additions in our model; however, such a worst-case design is usually adopted only for ease of design, when the addition cost is negligible. When the addition cost is not trivial, it is more common to use adder trees, in which the number of addition blocks halves at each successive adder depth while their width increases by one bit, culminating in the final multi-bit output (see Fig. 6 in this reference <https://www.researchgate.net/figure/Two-level-fragment-of-the-adder-tree-structure_fig3_257672163>). With such a commonly used adder-tree design, the additions in our model would need only about 10 bits on average and would thus no longer be the bottleneck of the convolutions, which is also consistent with the commonly recognized observation that multiplications dominate the computational cost of DNN convolutions (see reference 1 <https://arxiv.org/pdf/1905.13298.pdf> and reference 2 <https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8429354>).
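As a back-of-the-envelope check of the adder-tree argument (a sketch under our own assumptions; the layer size and product width below are hypothetical), each tree level halves the number of partial sums while widening them by one bit:

```python
def adder_tree_bits(num_products, product_bits):
    """Return (per-level adder widths, average width over all additions)
    for a balanced adder tree that reduces `num_products` partial products."""
    widths, total_adds, weighted = [], 0, 0
    n, bits = num_products, product_bits
    while n > 1:
        adds = n // 2           # pairs summed at this level
        bits += 1               # the sum of two b-bit values needs b + 1 bits
        widths.append(bits)
        total_adds += adds
        weighted += adds * bits
        n = (n + 1) // 2        # odd leftover passes through to the next level
    return widths, weighted / total_adds

# Hypothetical 3x3 conv over 64 input channels with ~8-bit partial products.
levels, avg = adder_tree_bits(num_products=3 * 3 * 64, product_bits=8)
print(levels, round(avg, 2))    # the average adder width stays near 10 bits
```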
Thanks!
Best regards,
- Yingyan
************************************************************************************
Yingyan Lin
Assistant Professor
Electrical & Computer Engineering | Rice University
Duncan Hall 2040 | 6100 Main Street, MS 380 | Houston, TX 77005
Web: https://eiclab.net/ | Tel. 713-348-3020
************************************************************************************
|
Thank you for the discussion as well. You are certainly correct that assuming 32-bits for all additions will overestimate the cost of your model. Unfortunately there isn't anything we can do about this at this point. Your evaluation procedure doesn't comply with the requirements of the competition rules, and we can't change the rules or make an exception while being fair to the other competitors. Trevor |
Dear Trevor,
In response to your explanation below, we will perform simulations to show you that the outputs of the method in our submitted code and of a real fixed-point implementation are the same given randomly generated inputs.
The issue with your evaluation approach is not the factorization of the
quantized computation, it's with the use of floating-point to simulate
fixed-point computation. We allow this under the circumstances described in
the rules, but your model does not meet these criteria given you do not
round prior to performing additions inside the convolution kernels.
Thank you in advance for your time and patience!
Best regards,
- Yingyan
|
We're certainly interested in your results, but you must understand that we cannot change the rules or make an exception at this point in the competition. Trevor
|
Yingyan & Yonggan, We're hoping to finalize the results tomorrow so that we can release them early next week. Could you advise whether or not you plan to update your score? Trevor |
Hi Trevor, Sorry for the late update. I'm still running the experiments to verify the correctness of our simulation, but in the meantime I will provide the metric under your rules first. I have added the FLOPs of the swish part and counted the FLOPs of the additions at full precision. Here are our metric and each part's contribution to the final metric: Final Metric: 0.46779. Thank you very much for the reminder! Yonggan |
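A rough sketch of this kind of recount (the layer shape, the bit-widths, and the normalization of a reduced-precision multiply to max(w_bits, a_bits)/32 are all assumptions for illustration; this is not the submission's scoring script):

```python
def conv_math_ops(out_elems, fan_in, w_bits, a_bits, baseline_bits=32):
    """Count the multiply and add cost of one conv layer, in 32-bit-op equivalents.

    Multiplications are billed at the reduced precision actually used
    (assumed here to be max(w_bits, a_bits) / 32 of a full operation);
    additions are billed as full 32-bit operations, per the ruling above.
    """
    mults = out_elems * fan_in
    adds = out_elems * (fan_in - 1)
    mult_cost = mults * max(w_bits, a_bits) / baseline_bits
    add_cost = adds * 1.0
    return mult_cost, add_cost

# Hypothetical layer: 112x112x32 outputs, 3x3x3 fan-in, ~3-bit weights, ~5-bit activations.
m, a = conv_math_ops(out_elems=112 * 112 * 32, fan_in=27, w_bits=3, a_bits=5)
print(m, a)  # the full-precision additions dominate the multiply cost
```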
Thanks Yonggan! Trevor |
Hello! Thanks so much for your entry!
When I try to run eval, I get errors loading the weight_path. It looks like you have a local path hardcoded into the script there. Is that file available somewhere in this repo that I'm not seeing? Or is it not necessary?
Trevor