Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple Topic publish when only one is sent #161

Closed
russellgsystems opened this issue Dec 11, 2020 · 30 comments
Closed

Multiple Topic publish when only one is sent #161

russellgsystems opened this issue Dec 11, 2020 · 30 comments
Assignees
Labels
bug Something isn't working

Comments

@russellgsystems
Copy link

@francois-normandin I am seeing multiple topics being received by my subscriber for 45 seconds after the original message was sent. See the log below. You will see that I send the request two times at the beginning but then my process receives it multiple times for up to 45 seconds. This is happening every message I send and causing timeout errors. I am using QoS of 2 (Exactly Once). I tried QoS of 0 (At Most Once) with the same results. This worked back on 10/30/2020 after I made the upgrade to the new version of the MQTT. Is it possible something changed in the way the subscription and publish works?

image

@francois-normandin francois-normandin self-assigned this Dec 11, 2020
@francois-normandin francois-normandin added the bug Something isn't working label Dec 11, 2020
@francois-normandin
Copy link
Member

Thanks @russellgsystems
I just confirmed I see the same with QoS = 1 on a sbRIO, but I don't see it on QoS = 0.
I'll investigate this weekend. This is one that needs to be fixed in high-priority.

Can you tell me which version was working for you on Oct 30th? That will serve as a starting point.
I see that version 3.1.2 was released on Nov 1st. For my initial investigation, I'll assume the bug has been introduced in that version unless you can confirm otherwise.

@russellgsystems
Copy link
Author

Yes that is right I was using verion 3.1.1. Do you recommend that I roll back to the older version 3.1.1 just for my release to my customer today?

@francois-normandin
Copy link
Member

Probably a good idea if you have a delivery to make today. I won't have time to investigate until tomorrow...

@russellgsystems
Copy link
Author

I moved to these versions below. I had to replace the Subscribe (Array).vi and Subscribe (Scalar).vi and Decode Application Message (Type).vi with the older low level ones. I am wondering if those are the problems with the issues I was seeing;

image

@francois-normandin
Copy link
Member

I doubt it very much. Those are simply the public-facing API. I believe the bug is in the logic to handle the message storage while the QoS transaction unfolds.

Here is what happens on a QoS = 1 message
https://github.com/LabVIEW-Open-Source/LV-MQTT-Broker/wiki/Publish-message-with-QoS-=-1

In a nutshell, the server responds to the publisher with a PUBACK message.
It also stores a copy of the message in its session so that it can redirect the message to the subscribers.
Each subscriber has their own QoS setting for the subscription, and the server will keep the message stored until the subscribing clients have received/acknowledged the packet. This is where the server retries to send a duplicate packet every until it is acknowledged or it times out.

There might be an issue with how this message is stored and discarded, not with the subscribe packet itself.

@russellgsystems
Copy link
Author

That makes and I was thinking it was something in the retry logic just hoping it was something more simple.

@francois-normandin
Copy link
Member

OK, I've reproduced the issue in a test and verified the effect of publication QoS and subscription QoS.
The bug only occurs when Subscription QoS > 0, and it does not matter what the QoS is on the publication.

This suggests that the issue is in the subscription process, where it either does not receive the acknowledgement or it does not discard the retained message appropriately.

@francois-normandin
Copy link
Member

Found the problem. If it were a text-based language program, it would be a "1-line fix" type of solution...
The problem is that the Client's "Handle Incoming Request" does not respond with a PUBACK message on QoS = 1 or PUBREL on QoS = 2, but it does respond with a PUBCOMP packet on the second handhake... so only a single VI needs to be dropped to fix the issue.

image

This will be fixed in the client package update after I also tackle this issue#4: LabVIEW-Open-Source/MQTT-Client#4

@francois-normandin
Copy link
Member

Fixed in MQTT Client 3.1.5

@russellgsystems
Copy link
Author

@francois-normandin I updated the driver, but I am still getting duplicate messages so I think there is still an issue. Any thoughts on what might be happening here? See screenshot below.

image

Here are the versions of the software I have installed:

image

@francois-normandin
Copy link
Member

I've created a test VI that is in the MQTT Client package. Can you try it on your system?

@russellgsystems
Copy link
Author

Sure, I want to make sure I am running the right one. Is it one of the DropVIs and if so which one:

image

@francois-normandin
Copy link
Member

Sorry, I should have mentioned that it is in the Unit Tests section of the source code. It is not part of the built package.
https://github.com/LabVIEW-Open-Source/MQTT-Client/tree/main/unit%20tests

I'll have a look a bit later. I'm not around my LV2013 machine at the moment.

I've re-opened the case until we figure it out.

@francois-normandin
Copy link
Member

Sorry, wrong test... that was for the QoS = 0 latency benchmark.

The repetition bug test is here:
https://github.com/LabVIEW-Open-Source/LV-MQTT-Broker/tree/master/Unit%20Tests/Manual%20Dev%20Tests
(labeled "161-Test QoS Repetition Bug.vi")

@francois-normandin
Copy link
Member

If you get a repetition, it will be within three seconds, so the timeout of the loop has been set to 3000ms.
If you receive a copy of the packet, it will increment the "received 2" indicator.
image

The problem was not with the Server, but with the Client. Make sure your clients are updated on all your targets.

@russellgsystems
Copy link
Author

I got a warning when I tried to load this VI:
image

After I run it the "received 2" is always 1.

I am currently running this all localhost on the same computer so there is only one version of the software installed.

@francois-normandin
Copy link
Member

Yes, that warning is because you are running the test VI which is in source folder, and relinking it to your installed copy in vi.lib.
That's all fine, as long as you don't save changes and commit it :-)

OK, so you are seeing duplicates but not repetitions. That would be different. I don't see duplicate messages on my part. Would your subscriber use the "#" filter by any chance?

@russellgsystems
Copy link
Author

My subscriber does not use the # filter. I subscribe to the specific topics.

@francois-normandin
Copy link
Member

francois-normandin commented Dec 17, 2020

I can't reproduce the duplication of values received by a subscriber in QoS = 1 or 2.
(But I do see unexpected disconnections in QoS = 2 because of timeouts... which I'll have to investigate).

Top graph is the publisher. Bottom is the subscriber.

  • Contrary to what the X-axis states, it is not "time", but counter.
  • The QoS on the publisher loop has no effect.

image

If there were duplicates, I'd see each point in double on the bottom graph.
Where do you put your tracer? Can you send me an example that reproduces the problem?

@russellgsystems
Copy link
Author

Let me see if I can reproduce the issue with a simple example. Specifically I log when a message is received or sent. So I can see duplicate messages. One thing that I am doing that is different is that most of my messages are request and response. When my program first starts I launch a client app programmatically and then send it a request message to know when it is running. If I don't get a response within 2 seconds I send another request. By default it can retry up to 120 seconds before it times out. I am also using TestStand and LabVIEW.

@francois-normandin
Copy link
Member

I too use a request-response scheme, but it is asynchronous (the message includes a forwarding address for the response, aka another topic name that the client subscribes to...) and I did see the effect of the problem you had originally reported when I used QoS 1 or 2, namely that the server would be resending duplicate packets because the client didn't send the PUBACK or PUBREC packet correctly.
However, this seems different and I cannot reproduce it yet.

@russellgsystems
Copy link
Author

I haven't been able to try to reproduce the issue yet because I am trying to get back to the state that was working 5 days ago so I can make an update for the customer. However now I can't get that to work. I put back the software to what it was here: https://user-images.githubusercontent.com/68344965/102397988-4c2aec80-3fa4-11eb-98ed-8b0a5eaf255d.png

Now I am getting errors that the forward packet VI is missing:

image

I can find it here, but I am not sure why now these VIs are in a different place:

image

This is so strange because 5 days ago this configuration worked.

@francois-normandin
Copy link
Member

francois-normandin commented Dec 17, 2020

You are not looking at the right file. The Forward Packet method you're looking for is in the Session.Server class.

Look into:
<vi.lib>\LabVIEW Open Source Project\MQTT Broker\Session.Server

Also, if you are having these cross-linking issues after updating or downgrading packages, it is very likely that you either had your project opened during the downgrade, a compile-cache issue with one of your caller classes or a discrepency between on-disk vs in-memory cache (SCC-related or a VI revert is needed).

@russellgsystems
Copy link
Author

Thanks @francois-normandin the forward packet wasn't in <vi.lib>\LabVIEW Open Source Project\MQTT Broker\Session.Server. It was only in the folder I showed above. You are right I did try to downgrade while my project was open last night. So last night I did clear compile cache, get the previous version from SCC, and reinstalled the MQTT packages but I still had this problem. I am going to uninstall everything related to LabVIEW Open Source Project and try to use my last released version of the code rather than the previous version in SCC. I will also clear compile cache. If that doesn't work I will see if someone else can try to load it on their computer. Thanks again for your help.

@russellgsystems
Copy link
Author

I uninstalled and reinstalled everything that was LabVIEW Open Source Project and also went back to my last release source code for my project. Everything is linked properly, but now I can only send one message successfully. Then after that initial message all other messages are never arriving to the subscriber. I also got a 56 error once too. I am using the versions of software here: https://user-images.githubusercontent.com/68344965/102397988-4c2aec80-3fa4-11eb-98ed-8b0a5eaf255d.png

@russellgsystems
Copy link
Author

russellgsystems commented Dec 17, 2020

I noticed I am using this VI:

image

But you are using these two VIs in your example:

image

Would that cause this problem of only getting the first message? My hunch is no, but thought I would ask since it was different. The other difference is you are checking if still connected before publishing, but I am not doing that in my code.

Your example program works so that is why I was asking.

@russellgsystems
Copy link
Author

@francois-normandin It seems the current problem with only getting the first message seems to be related to the QoS. I was able to reproduce it with the attached example. If QoS is 2 you will see that after you subscribe it runs for a little bit and stops.

Main.zip

@francois-normandin
Copy link
Member

I can't reproduce the duplication of values received by a subscriber in QoS = 1 or 2.
(But I do see unexpected disconnections in QoS = 2 because of timeouts... which I'll have to investigate).

Yes, I had seen this two days ago when you reopened the issue.
I don't know how this escaped the unit tests, to be quite truthful. Probably because I stopped the test after a few seconds, so this was missed.

I'll open a new issue, because this is not the same thing. Error 66 is "Disconnected by peer", so there is either a message that is considered a protocol violation by the server, or the session has an unhandled error and terminates prematurely.

@francois-normandin
Copy link
Member

For the use of the variant to data... nah, this is fine the way you do it. There are a few ways to decode the application messages, depending on which class level you are at (base or child) and whether you decode with the serializer or use the raw payload.

@francois-normandin
Copy link
Member

I confirm that this issue is fixed in the current version.
Closing, but referencing #162 for newly identified bug during testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants