ice-zc device initialization randomly fails #933

Closed
cardigliano opened this issue May 6, 2024 · 30 comments

Comments

@cardigliano
Member

When starting applications on ice-zc, device initialization sometimes fails or all packets are dropped.

@gyarom

gyarom commented Jun 20, 2024

Hi Alfredo,

Regarding #933:
We still have the problem, and it is critical for us.
I just want to refine the description and tell you when it happens.
We are using pfring zc in 2 types of applications:

  1. Just receive packets from the interface.
  2. Receive packets from the interface and answer ARP and ping.
    To answer, we also open the device for tx.

We have the problem only when we create a type 2 application with a tx queue.
The first type (receive only) works perfectly and never gets stuck.

I want to add that with the X520 on a Dell G14 we never saw the problem. There we use pfring 7.4.
The problem happens during startup if there is 1-2G of incoming traffic, with the E810 on a Dell G15 (where we use pfring with rx and tx queues). Then all packets are dropped after startup.
I tried many things to solve it: changing the number of buffers in the cluster, and opening the tx device without the zc: prefix even though I open the rx device with the zc: prefix.
Nothing helped.

Now that the problem is refined, do you have any clue about what could be happening?
Can I debug why all packets are dropped? Then I will have some direction for investigating the problem.

Thanks,
Guy

@cardigliano
Member Author

@gyarom the additional info you provided would definitely help reproducing the issue, thank you.

@cardigliano
Member Author

One more question: are you using a single queue or multiple RSS queues in application type 2?

@gyarom

gyarom commented Jun 20, 2024 via email

@cardigliano
Member Author

That's fine, I was just asking to collect all the info to reproduce the issue. Thank you.

@gyarom

gyarom commented Jun 30, 2024

@cardigliano @gyarom
Hi Alfredo,
we are still facing the problem.
We now have fixed steps that reproduce it.
It looks like the problem occurs only if the NIC receives traffic from another Cognyte device that wraps the traffic from the simulator; it inserts a Cognyte private header into the traffic.
If the same traffic arrives at our NIC directly from the simulator, there are no problems.
I don't know the reason for that.
How we reproduce the problem:

  1. Start our application with no traffic from the simulator.
  2. Wait for the keep-alive between our application and the other Cognyte device.
  3. Start injecting traffic from the simulator (TestCenter/Spirent).
    Traffic arrives with no problem.
  4. Stop our application and run the ntop application zbalance, which is similar to our application:
    ./zbalance -i zc:ens3f0 -c 2 -g 1:3:5:7:9:11:13:15 -r 31
    All packets are dropped.
    If you stop the simulator while zbalance is starting, there are no drops.

Can we have a short meeting so that I can demonstrate the problem? Maybe you will have some idea how to continue.

@cardigliano
Member Author

@gyarom I tried running pfcount and pfsend at the same time, while receiving 10Gbit/15Mpps, but I was not able to reproduce the issue.
Could you provide a code snippet (or a sample application source code) for reproducing this?

@gyarom

gyarom commented Jul 5, 2024 via email

@cardigliano
Member Author

cardigliano commented Jul 5, 2024 via email

@gyarom

gyarom commented Jul 5, 2024 via email

@cardigliano
Member Author

In the steps above about "How we reproduce the problem" you wrote:

  1. Start our application with no traffic from the simulator.
  2. Wait for the keep-alive between our application and the other Cognyte device.
  3. Start injecting traffic from the simulator (TestCenter/Spirent).
    Traffic arrives with no problem.
  4. Stop our application and run the ntop application zbalance, which is similar to our application:
    ./zbalance -i zc:ens3f0 -c 2 -g 1:3:5:7:9:11:13:15 -r 31
    All packets are dropped.
    If you stop the simulator while zbalance is starting, there are no drops.

Can we have a short meeting so that I can demonstrate the problem? Maybe you will have some idea how to continue.

But I am a bit confused:

  • This means you run zbalance after stopping your application, so the only active traffic when you open the socket is coming from the simulator. I do not understand how your application can affect zbalance.
  • You said it happens when you receive and transmit at the same time, however zbalance is not transmitting in your configuration. Please clarify.

@gyarom

gyarom commented Jul 5, 2024 via email

@gyarom

gyarom commented Jul 7, 2024 via email

@cardigliano
Member Author

I do not see the color, but I guess you mean "irq 889: Affinity broken due to vector space exhaustion". I will dig a bit; this is the first time I have seen this error.
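
For readers who want to check whether their own system logs the same message, here is a minimal sketch. It assumes the kernel log is available as text (for example from `dmesg`); the function names are hypothetical helpers and are not part of PF_RING.

```shell
# Count kernel log lines that report the IRQ affinity message quoted above.
# Reads log text on stdin, so it works with `dmesg` output or a saved log file.
count_affinity_errors() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    grep -c 'Affinity broken due to vector space exhaustion' || true
}

# List the distinct IRQ numbers named in those messages, one per line.
affected_irqs() {
    sed -n 's/.*irq \([0-9][0-9]*\): Affinity broken due to vector space exhaustion.*/\1/p' | sort -un
}
```

Typical use would be `dmesg | count_affinity_errors` followed by `dmesg | affected_irqs`. The IRQ numbers can then be looked up in `/proc/interrupts` to see whether they belong to the capture interface.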

@gyarom

gyarom commented Jul 8, 2024 via email

@gyarom

gyarom commented Jul 16, 2024 via email

@cardigliano
Member Author

@gyarom that would be useful. I will be available next week in the CET (Italy) timezone.

@cardigliano
Member Author

It seems a4e76ea fixed this, please reopen if it reoccurs.

@gyarom

gyarom commented Aug 28, 2024

Hi Alfredo,

Your fixes are a lot better, but they do not fix everything.
I want to reopen bug #933, but I can't find the option in the GitHub GUI!
Can you assist me in reopening the bug?
For example, sometimes I inject 3 Mpps with the test center, and our application sees 4.77 Mpps and a bandwidth of 10G.
Then I stop our application and start, for example, zbalance or zcount, and it also sees the same 4.77 Mpps and bandwidth of 10G. Something in the driver is bad.
One thing I noticed: when our application is already running and I stop the test center and restart it, everything becomes good.
Maybe there is some pfring interface function to disable/enable the whole NIC (like stopping the test center and restarting it) to work around the problem?
I also saw after a night, because we restart the application every 5 minutes (no license), that it was stuck with all packets dropped.
You still have the same TeamViewer connection. I can simulate the bug for you, if you like.
Can you please assist?

Thx,
Guy

@cardigliano
Member Author

@gyarom please ignore the pps and check the absolute packet count (e.g. send 10 million packets and count how many are captured). If there are more packets than expected, please print or dump those and let us see them to figure out where they are coming from.
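
The absolute-count check suggested above can be sketched roughly as follows. This is an illustration only: it reads a counter from `ethtool -S` output, the counter name `rx_packets` is an assumption (names are driver-specific), and both function names are hypothetical.

```shell
# Extract one named counter from `ethtool -S <iface>` output read on stdin.
# Lines look like "     rx_packets: 12345"; the name must match exactly.
ethtool_counter() {
    awk -F': *' -v name="$1" '
        { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
        key == name { print $2; exit }
    '
}

# Packets seen beyond what was sent.
# Arguments: counter BEFORE the test, counter AFTER the test, packets SENT.
extra_packets() {
    echo $(( $2 - $1 - $3 ))
}
```

One would sample the counter before and after sending a known number of packets, for example `ethtool -S ens1f0 | ethtool_counter rx_packets`, then pass both samples and the sent count to `extra_packets`. A result above zero means more packets reached the adapter than the generator sent.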

@gyarom

gyarom commented Aug 29, 2024

Hi Alfredo,

@cardigliano

I checked your assumption that it is only an issue of “absolute packet count”.
I checked it, and I think it is not only “absolute packet count”.
I changed our code so that it does not use your function pfring_zc_stats().
We calculate the packet rate [pps] and bandwidth ourselves.
When we have the problem, it looks to me as if pf_ring delivers packets (actually we are polling) at the maximum bandwidth, ~10G.
When you stop the traffic in the test center and restart it, everything becomes normal.
In addition, we are running as a Linux service, and because we work without a license, our application crashes every 5 minutes and the service restarts it. I checked yesterday: after 10 restarts all packets were dropped and the application stopped crashing.
Bug 933 is closed, and we do not have permission to reopen it.
Can you please advise?
Guy

@cardigliano
Member Author

@gyarom please ask for an evaluation license to avoid restarting the application every 5 minutes, as application crashes may corrupt data structures. As for the packet count, we cannot do much without evidence of what kind of packets are exceeding the expected count. It is strange for the adapter to produce extra packets; there may be a loop in the network or some other issue.

@gyarom

gyarom commented Aug 29, 2024 via email

@gyarom

gyarom commented Aug 29, 2024

Hi Alfredo,
@cardigliano

I ordered an evaluation license from Maria.
Regarding the unexpected pps and bandwidth, I don’t think that pf_ring generates traffic (-:
I’m not familiar with your code, but I can imagine that if there is a bug and a buffer that was already read is marked as ‘not read’ in its buffer descriptor (BD), then we will continue to read it forever.
I can run a fixed scenario that also causes zcount and zbalance to see the same as our application, with the wrong pps and 10G.

  1. Inject 3 Mpps with the test center. 3 lines out of 4 are checked. Each line carries 1 Mpps.

image

  2. Start vtps (our application):
    systemctl start vtps
    Check with tail how many packets vtps sees:
    tail -f /usr/local/vtps/rtp/Logs/rtpLog_2.0
    In the following example vtps sees 4.77 G but we inject only 3M.

image

  3. Stop vtps:
    systemctl stop vtps

  4. Run zcount:
    /usr/local/vtps/pf_ring/zc/zcount -i zc:ens1f0 -c 2
    zcount sees the same as vtps.

image

image

  5. Stop the test center, wait a few seconds, and restart it.
    Everything is back to normal.
    To stop/start the test center, use the play/stop control at the top, in the menu with the traffic light.

  6. Kill zcount before starting vtps again.

@cardigliano
Member Author

@gyarom what is vtps doing? Is it injecting some traffic perhaps?

@cardigliano
Member Author

@gyarom I connected to your machine, ran vtps, and checked the hardware packet counter on the network interface with ethtool -S ens1f0 at a 1 sec interval. The counter is increasing by 4.7 Mpps. This means there are actually 4.7 Mpps hitting the adapter. I think vtps is creating some loop in the network.
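
A per-second rate check of this kind can be sketched as below. It is an illustration under assumptions, not the exact commands used in this thread: the helper name is hypothetical, and it expects one counter sample per line on stdin, taken at a fixed interval.

```shell
# Print the increase between successive counter samples.
# Input: one counter value per line. Output: one delta per line,
# starting from the second sample. With 1-second sampling the deltas are pps.
counter_deltas() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}
```

Fed with samples taken once per second, for example `while sleep 1; do ethtool -S ens1f0 | awk '/rx_packets:/ {print $2; exit}'; done | counter_deltas`, each output line approximates the packet rate reaching the adapter. The counter name `rx_packets` is an assumption here, since `ethtool -S` names vary by driver.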

@gyarom

gyarom commented Aug 29, 2024

@cardigliano
vtps mainly reads from the network, but it also answers ARP/ping at a very low rate.
I will try to disable the tx.

@gyarom

gyarom commented Aug 29, 2024

Hi Alfredo,
@cardigliano

First, the 4.77 Mpps and 10G input issue is not related to pf_ring; it is the Cognyte environment that is causing the problem.
I’m sorry for that, and thank you for helping me find the problem.
The issue that remains is the stability.
I restarted the vtps service 10 times and checked if the packets were received properly or if all packets were dropped. In 8 out of 10 cases,
the packets were received properly, which leaves us with a 20% wake-up failure rate.
Can we do something about that?
Maybe we could increase the timeout in the places where you inserted timeouts.
I can try it in our version only if you don’t want to apply it to the generic version.

Thanks,
Guy
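
The restart test described above (2 failures in 10 restarts, hence 20%) can be automated along these lines. This is only a sketch: `failure_rate` is a hypothetical helper, and the check command it runs is supplied by the caller, since the thread does not show how "received properly" was verified.

```shell
# Run a check command N times and print the percentage of runs that failed.
# Usage: failure_rate N CHECK_COMMAND [ARGS...]
# The check command should exit 0 when packets are received after a restart.
failure_rate() {
    trials=$1
    shift
    failed=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        "$@" || failed=$((failed + 1))
        i=$((i + 1))
    done
    # Integer percentage, rounded down.
    echo $(( 100 * failed / trials ))
}
```

For example, `failure_rate 10 ./restart_and_check.sh` would print `20` if two of ten runs fail, where `restart_and_check.sh` is a hypothetical script that restarts the service and verifies that packets arrive.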

@cardigliano
Member Author

@gyarom please note that the adapter takes a while to reload when opening/closing the socket. When the service is restarted due to a demo expiration, the socket reset may be too fast, creating this issue. I suggest checking whether the problem also occurs after fixing the license, as in that case you will not have such restarts.
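
One way to test whether restarts that are too fast are the trigger is to pause between stopping and starting the service. The sketch below assumes a systemd service; the helper name and the `SYSTEMCTL` override are hypothetical, and the thread gives no recommended delay, so any value is a guess to experiment with.

```shell
# Stop a service, wait, then start it again, giving the adapter time to reload.
# Usage: restart_with_delay SERVICE SECONDS
# SYSTEMCTL can be overridden (e.g. with `echo`) for a dry run.
restart_with_delay() {
    ctl=${SYSTEMCTL:-systemctl}
    "$ctl" stop "$1" && sleep "$2" && "$ctl" start "$1"
}
```

For instance, `restart_with_delay vtps 10` would leave ten seconds between closing and reopening the device. The 10 is purely illustrative.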

@gyarom

gyarom commented Sep 4, 2024

Hi Alfredo,
@cardigliano
I think that you can now close bug 933 from the Cognyte side as well.
There are still ~10% of cases where all packets are dropped after start-up.
We made a workaround in our application: when we identify the problem, we restart automatically.
In production, where we have a license, it will rarely happen.
Thanks for all the help during this time, and for solving the problem.

Guy
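
The thread does not say how the application identifies the problem. One possible criterion, which is an assumption here and not the Cognyte implementation, is that over a sample window nothing was received while the drop counter kept growing. The helper name is hypothetical.

```shell
# Succeed (exit 0) when a restart looks warranted: no packets were received
# during the sample window, yet drops were counted.
# Arguments: RECV_DELTA DROP_DELTA (differences between two counter samples).
needs_restart() {
    [ "$1" -eq 0 ] && [ "$2" -gt 0 ]
}
```

A watchdog could call `needs_restart "$recv_delta" "$drop_delta"` after each sampling interval and restart the service only when it succeeds. Requiring drops avoids restarting when the link is simply idle.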
