Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task ID wrong in health status event #296

Open
bartjolenovation opened this issue Oct 10, 2019 · 10 comments
Open

Task ID wrong in health status event #296

bartjolenovation opened this issue Oct 10, 2019 · 10 comments

Comments

@bartjolenovation
Copy link

bartjolenovation commented Oct 10, 2019

Hi,

we are looking to upgrade marathon, and also marathon consul.
We are testing with Version: 1.4.2 (from apt) and marathon 1.9.

We see that when an application is stopped (by scaling to 0 instances) and then starting by setting 1 instance, the application is not registered in consul (the unregister works fine though). We see the following in the marathon-consul log where I grep the instance UUID:

`grep b98f6bf3-eb61-11e9-af4c-02423f927891 /var/log/marathon-consul.log

time="2019-10-10T13:27:38Z" level=info msg="Got StatusEvent" Id=demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891._app.1 TaskStatus="TASK_STARTING"

time="2019-10-10T13:27:38Z" level=debug msg="Not handled task status" Id=demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891._app.1 taskStatus="TASK_STARTING"

time="2019-10-10T13:27:38Z" level=info msg="Got StatusEvent" Id=demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891._app.1 TaskStatus="TASK_RUNNING"

time="2019-10-10T13:27:38Z" level=debug msg="Not handled task status" Id=demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891._app.1 taskStatus="TASK_RUNNING"

time="2019-10-10T13:27:41Z" level=info msg="Got HealthStatusEvent" Id=demo_hello-world.b98f6bf3-eb61-11e9-af4c-02423f927891

time="2019-10-10T13:27:41Z" level=error msg="Task not found" Id=demo_hello-world.b98f6bf3-eb61-11e9-af4c-02423f927891

time="2019-10-10T13:27:41Z" level=info msg="Got HealthStatusEvent" Id=demo_hello-world.b98f6bf3-eb61-11e9-af4c-02423f927891

time="2019-10-10T13:27:41Z" level=error msg="Task not found" Id=demo_hello-world.b98f6bf3-eb61-11e9-af4c-02423f927891 `

So the "Got HealthStatusEvent" registers the following instanceID: demo_hello-world.b98f6bf3-eb61-11e9-af4c-02423f927891 where we see that the instance ID is demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891

So I thought that the marathon event endpoint returns an incorrect instanceid and went on to capture the traffic with tcpflow (tcpflow -i eth1 -c port 8080 >> ~/tcpdump2) that goes through the events endpoint:

grep b98f6bf3-eb61-11e9-af4c-02423f927891 ~/tcpdump2 data: {"instanceId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891","condition":"Scheduled","runSpecId":"/demo/hello-world","agentId":null,"host":null,"runSpecVersion":"2019-10-10T13:27:36.878Z","timestamp":"2019-10-10T13:27:36.906Z","eventType":"instance_changed_event"} data: {"instanceId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891","condition":"Provisioned","runSpecId":"/demo/hello-world","agentId":"6b0d8829-4681-4638-bbe2-121d950e241e-S0","host":"10.141.141.10","runSpecVersion":"2019-10-10T13:27:36.878Z","timestamp":"2019-10-10T13:27:37.746Z","eventType":"instance_changed_event"} data: {"slaveId":"6b0d8829-4681-4638-bbe2-121d950e241e-S0","taskId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891._app.1","taskStatus":"TASK_STARTING","message":"","appId":"/demo/hello-world","host":"10.141.141.10","ipAddresses":[{"ipAddress":"127.0.1.1","protocol":"IPv4"}],"ports":[31338],"version":"2019-10-10T13:27:36.878Z","eventType":"status_update_event","timestamp":"2019-10-10T13:27:38.063Z"} data: {"instanceId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891","condition":"Starting","runSpecId":"/demo/hello-world","agentId":"6b0d8829-4681-4638-bbe2-121d950e241e-S0","host":"10.141.141.10","runSpecVersion":"2019-10-10T13:27:36.878Z","timestamp":"2019-10-10T13:27:38.063Z","eventType":"instance_changed_event"} data: {"slaveId":"6b0d8829-4681-4638-bbe2-121d950e241e-S0","taskId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891._app.1","taskStatus":"TASK_RUNNING","message":"","appId":"/demo/hello-world","host":"10.141.141.10","ipAddresses":[{"ipAddress":"127.0.1.1","protocol":"IPv4"}],"ports":[31338],"version":"2019-10-10T13:27:36.878Z","eventType":"status_update_event","timestamp":"2019-10-10T13:27:38.769Z"} data: {"instanceId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891","condition":"Running","runSpecId":"/demo/hello-world","agentId":"6b0d8829-4681-4638-bbe2-121d950e241e-S0","host":"10.141.141.10","runSpecVersion":"2019-10-10T13:27:36.878Z","timestamp":"2019-10-10T13:27:38.769Z","eventType":"instance_changed_event"} data: {"appId":"/demo/hello-world","instanceId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891","version":"2019-10-10T13:27:36.878Z","alive":true,"eventType":"health_status_changed_event","timestamp":"2019-10-10T13:27:41.919Z"} data: {"appId":"/demo/hello-world","instanceId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891","version":"2019-10-10T13:27:36.878Z","alive":true,"eventType":"health_status_changed_event","timestamp":"2019-10-10T13:27:41.929Z"} data: {"instanceId":"demo_hello-world.instance-b98f6bf3-eb61-11e9-af4c-02423f927891","runSpecId":"/demo/hello-world","healthy":true,"runSpecVersion":"2019-10-10T13:27:36.878Z","timestamp":"2019-10-10T13:27:41.929Z","eventType":"instance_health_changed_event"}

And I see no instanceID that lacks the "instance-" prefix. I haven't found anything that resembles this problem and am not sure whether this is a marathon or a marathon-consul problem.

The error is thrown by events/event_handler.go (line 142)

@bartjolenovation
Copy link
Author

apparently it get caught here:

// Regular expression to extract runSpecId from instanceId // See: https://github.com/mesosphere/marathon/blob/v1.4.0-RC4/src/main/scala/mesosphere/marathon/core/instance/Instance.scala#L244 var instanceIDRegex = regexp.MustCompile(^(.+).(instance-|marathon-)([^\.]+)$`)

func (t TaskHealthChange) TaskID() apps.TaskID {
if t.ID != "" {
return t.ID
}
return apps.TaskID(instanceIDRegex.ReplaceAllString(t.InstanceID, "$1.$3"))
}`

If I read that correct, it seems that that the ID is empty and therefor uses the regex replacement

@patrick-jumio
Copy link

patrick-jumio commented Oct 23, 2019

I would like to add to this bug, as we are hitting the same thing.

though I would add some extra information.

time="2019-10-22T23:25:19Z" level=debug msg="Service is running" Id=pyt_my-service-svc.instance-c98bdb62-f522-11e9-9b69-02429c8e109f._app.1_my-service-svc_31650
time="2019-10-22T23:26:08Z" level=info msg="Got StatusEvent" Id=pyt_my-service-svc.instance-c98bdb62-f522-11e9-9b69-02429c8e109f._app.1 TaskStatus=TASK_KILLED
time="2019-10-22T23:26:08Z" level=info msg=Deregistering Address=10.130.37.52 Id=pyt_my-service-svc.instance-c98bdb62-f522-11e9-9b69-02429c8e109f._app.1_my-service-svc_31650
time="2019-10-22T23:26:09Z" level=info msg="Got StatusEvent" Id=pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1 TaskStatus=TASK_STARTING
time="2019-10-22T23:26:09Z" level=debug msg="Not handled task status" Id=pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1 taskStatus=TASK_STARTING
time="2019-10-22T23:26:09Z" level=info msg="Got StatusEvent" Id=pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1 TaskStatus=TASK_RUNNING
time="2019-10-22T23:26:09Z" level=debug msg="Not handled task status" Id=pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1 taskStatus=TASK_RUNNING
time="2019-10-22T23:26:24Z" level=info msg="Got HealthStatusEvent" Id=pyt_my-service-svc.5373f6f3-f523-11e9-9b69-02429c8e109f
time="2019-10-22T23:26:24Z" level=debug msg="Asking Marathon for /pyt/my-service-svc" Location="ip-10-130-37-72:8080"
time="2019-10-22T23:26:24Z" level=debug msg="Sending GET request to marathon" Location="ip-10-xx-xx-72:8080" Protocol=http Uri="/v2/apps//pyt/my-service-svc?embed=apps.tasks"
time="2019-10-22T23:26:24Z" level=info msg="Got StatusEvent" Id=pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1 TaskStatus=TASK_RUNNING
time="2019-10-22T23:26:24Z" level=debug msg="Not handled task status" Id=pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1 taskStatus=TASK_RUNNING
time="2019-10-22T23:26:24Z" level=error msg="Task not found" Id=pyt_my-service-svc.5373f6f3-f523-11e9-9b69-02429c8e109f
time="2019-10-22T23:30:19Z" level=info msg="Ignoring health check of type COMMAND" Address=10.xx.xx.52 Id=/pyt/my-service-svc
time="2019-10-22T23:30:19Z" level=info msg=Registering Address=10.xx.xx.52 EnableTagOverride=false Id=pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1_my-service-svc_31586 Name=my-service-svc Port=31586 Tags="[marathon haproxy_fqdn=my-service-svc.core-us.jumio.link haproxy_http_app_public haproxy_http_app_waf_backend=detection-only haproxy_httpchk=GET /api/internal/monitoring/healthchecks haproxy_http_app haproxy_http_app_waf_backend=detection-only marathon-task:pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1]"

This is the curated logs during a normal marathon deploy.
the task is correctly de-registered, but it cannot re-register, though this only occurs from the register task.
After some minutes, when the sync all task is called, the registration of the service works.

msg=Registering Address=10.xx.xx.52 EnableTagOverride=false Id=pyt_my-service-svc.instance-5373f6f3-f523-11e9-9b69-02429c8e109f._app.1_my-service-svc_31586

Unfortunately, the sync is quite infrequent, and this causes downtime for us. We will no doubt increase the frequency of the sync until this issue can be resolved, but this is not ideal.
(In this example, we are talking about 3.5 minutes before the service gets registered back in consul)

@bartjolenovation
Copy link
Author

now testing with 1.5.1 (marathon 1.9.100 from yum repo http://repos.mesosphere.com/el/7/$basearch/) and we see the same

@bartjolenovation
Copy link
Author

bartjolenovation commented Nov 8, 2019

Just to debug some code changes which make it work (but are no real fix in any sense of that combination of words ;)

diff marathon-consul-master/events/task_health_change.go go/src/github.com/allegro/marathon-consul/events/task_health_change.go
24c24
var instanceIDRegex = regexp.MustCompile(^(.+)\.(instance-|marathon-)([^\.]+)$)

var instanceIDRegex = regexp.MustCompile(^(.+)$)

30c30
return apps.TaskID(instanceIDRegex.ReplaceAllString(t.InstanceID, "$1.$3"))

return apps.TaskID(instanceIDRegex.ReplaceAllString(t.InstanceID, "$1._app.1"))

So essentially I just take the complete instanceID and put ._app.1 after that, which causes the task to be found.
However if the task has more than 1 instance it will fail, also the regex change is just giving back the input, which should be done in a correct way (but hey, I'm an operator, not a programmer).

@bartjolenovation
Copy link
Author

The data given by the health status change event seems to be

data: {"appId":"/demo/hello-world","instanceId":"demo_hello-world.instance-b98f6bf3- eb61-11e9-af4c-02423f927891","version":"2019-10-10T13:27:36.878Z","alive":true,"eventType":"health_status_changed_event","timestamp":"2019-10-10T13:27:41.929Z"}

So if we can match that instanceId to a field in the task list instead of the id field which does include the ._app.# we should be baconbuyer.

@bartjolenovation
Copy link
Author

we are working on a fix and hopefully will submit a merge request semi-soon

JurrianFahner pushed a commit to JurrianFahner/marathon-consul that referenced this issue Nov 14, 2019
Change FindTaskByID so that Marathon-Consul can be used with marathon version > 1.9, this change will fix issue: allegro#296

Refactor test so the output resembles marathon v1.9
Also added 3 tests to test the find tasks method.

This change should be backwards compatible
@patrickduboucher
Copy link

Is there any update on this topic?

@bartjolenovation
Copy link
Author

There is a PR but there is a build fail, which JurrianFahner can't reproduce. Not sure what's the hold-up on the reply to his question.

@pbetkier
Copy link
Contributor

The thing is we stopped developing this project. We still use it in production, but it runs on an older version of Marathon and we only use a subset of features. We don't plan to keep it up to date with Marathon releases.

That being said, we'd be happy to accept PRs and release them for you. We don't want to invest time in debugging the travis build though.

Even better, you could fork this project and continue its development. If you decide so, we can make it official and provide a link in our README.

@patrickduboucher
Copy link

ok thanks for the update. We'll consider our options at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants