
Connectors fail to complete sync #2925

Open
sjors101 opened this issue Oct 29, 2024 · 8 comments
Labels
bug Something isn't working community-driven

Comments

@sjors101
Contributor

Bug Description

We have been using the connector framework for a while now, with over 100 connectors configured. For the past few weeks we have been experiencing connector jobs failing with the following error: connectors.sync_job_runner.ConnectorJobNotRunningError: Connector job (ID: Wsl5npIBp9FxXy_8Hx2C) is not running but in status of JobStatus.ERROR. We can't really pinpoint the issue: some runs fail after a few seconds, others after 90 minutes, and others finish successfully. It seems related to having more than 100 connectors.

We did find a workaround: we noticed that when we run fewer than ~10 active connector containers on our Kubernetes platform, the issue does not occur. This makes me wonder whether some queue on the Elastic side is full. We also tried increasing DEFAULT_PAGE_SIZE in connectors/es/index.py, but this did not solve the issue.

To Reproduce

Steps to reproduce the behavior:

  1. Configure more than 100 connectors
  2. Run > 10 containers
  3. Run multiple jobs in parallel for multiple hours

Environment

Elasticsearch 8.15
ConnectorFW: 8.15.3.0

Logs / config

Attached are the logs of one connector container and the connector config (I replaced the sensitive records). We don't see any logs in Elasticsearch or Enterprise Search. We notice the same behaviour across different connector_types.

container-config.txt
container-logs.txt

@sjors101 sjors101 added the bug Something isn't working label Oct 29, 2024
@seanstory
Member

Congrats on having over 100 connectors at once!
Thanks for reporting. We'll dig into this.

I'm wondering if this is related to elastic/kibana#195127, and if Kibana is marking syncs as "error".
Did you just notice this after an upgrade? Or did you only recently scale up to so many connectors?

@sjors101
Contributor Author

Great, thanks! Each connector we configure gets a dedicated container; we don't run multiple connectors in the same container, so I don't think it's related to elastic/kibana#195127. It's quite likely that it's related to the scale-up, but a lot of development happens in parallel, and in the meantime we also moved from ES stack 8.14 to 8.15.

We have a script to configure multiple connectors at once, which uses the connector APIs (https://www.elastic.co/guide/en/elasticsearch/reference/current/connector-apis.html). As we speak we have 109 connectors configured; I could try to delete 10 and see if the issue still exists.
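For context, a script like ours can be sketched roughly as follows. This is a hedged, illustrative stand-in, not our actual script: the host, auth, and field values are assumptions, and only the `PUT _connector/<id>` endpoint comes from the Connector API docs linked above.

```python
# Illustrative sketch of bulk-configuring connectors via the Connector API.
# Host, API key, and connector fields below are placeholder assumptions.
import json
import urllib.request


def build_create_connector_request(base_url, connector_id, index_name, service_type):
    """Return (method, url, body) for creating/updating one connector."""
    url = f"{base_url}/_connector/{connector_id}"
    body = {"index_name": index_name, "service_type": service_type}
    return "PUT", url, body


def send(method, url, body, api_key):
    # Network call: requires a live Elasticsearch cluster.
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        method=method,
        headers={
            "Authorization": f"ApiKey {api_key}",
            "Content-Type": "application/json",
        },
    )
    return urllib.request.urlopen(req)


# e.g. configuring many connectors in a loop (names are illustrative):
# for i in range(109):
#     m, u, b = build_create_connector_request(
#         "http://localhost:9200", f"conn-{i}", f"idx-{i}", "mysql")
#     send(m, u, b, API_KEY)
```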

@artem-shelkovnikov
Member

Hi @sjors101,

Is there any chance you can collect the logs from all of your connector hosts in one place and grep for the failed job ID there (in your log file that'd be Wsl5npIBp9FxXy_8Hx2C)?

Connectors should not affect each other, but they seem to do it somehow: as if another service is marking the connector sync job as failed. Could it be that you have services running with identical config, so that they attempt to serve the same connector?

@seanstory
Member

seanstory commented Nov 6, 2024

as if another service is marking the connector sync job as failed. Could it be that you have services running with identical config, so that they attempt to serve the same connector?

This got me thinking, what if you had one service, configured to be responsible for more than 100 connectors all at once. Do we correctly fetch all connectors from Elasticsearch to compare against what's configured in YAML?

I don't think we do.

  • note that 100 is our page size. https://github.com/elastic/connectors/blob/main/connectors/es/index.py#L13
    That's sus, given this bug report.
  • This logic looks buggy to me:

        hits = resp["hits"]["hits"]
        total = resp["hits"]["total"]["value"]
        count += len(hits)
        for hit in hits:
            yield self._create_object(hit)
        if count >= total:
            break
        offset += len(hits)

    total gets reset each iteration, but count gets incremented. This probably can't return more than 2 pages of hits, right? Because on the second page, count will be larger than total, so we'll break.

@sjors101 you might be able to test this faster than we can set up an env with 100 connectors. Can you change that hardcoded page size to something like 1000 and see if that fixes things? (Obviously not a good long-term fix, just an investigation step.)

@artem-shelkovnikov
Member

total gets reset each iteration, but count gets incremented. This probably can't return more than 2 pages of hits, right? Because on the second page, count will be larger than total, so we'll break.

I think total is independent: it's just the number of documents matching the query overall, so it's okay to overwrite it each iteration. Although we don't really use PIT here, so modification of the collection mid-pagination can cause weird bugs. On the other hand, if indices are not added/removed during the loop, it should not be a problem, and any inconsistency would be transient.
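To sanity-check that reading, the loop can be simulated against an in-memory result set. This is a simplified stand-in for the real code in connectors/es/index.py (the `paginate` function and `docs` list are illustrative, not the actual class); since total mirrors the overall match count, the break condition alone does not truncate results after the first two pages.

```python
# Stand-in for the pagination loop quoted above; "docs" plays the role
# of the full Elasticsearch result set for the query.
def paginate(docs, page_size):
    count = 0
    offset = 0
    while True:
        hits = docs[offset:offset + page_size]
        # Mirrors resp["hits"]["total"]["value"]: documents matching the
        # query overall, NOT the number of hits on this page.
        total = len(docs)
        count += len(hits)
        for hit in hits:
            yield hit
        if count >= total:
            break
        offset += len(hits)

# With 250 docs and the default page size of 100, all 250 are yielded
# across 3 pages, so the early-break theory does not hold here.
print(len(list(paginate(list(range(250)), 100))))  # → 250
```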

@seanstory
Member

🤦 you're right, total isn't the number of hits in the hits array, it's the total that matches the query. My misread.

I still think the hardcoded DEFAULT_PAGE_SIZE = 100 is sus. Even if I don't spot where the bug is.

@sjors101
Contributor Author

Hi @artem-shelkovnikov

I checked the logs of all our Elastic nodes, but there are no log records with the job ID or connector ID. The only log messages I saw during a crash are the following, but I don't think they are relevant:

{"@timestamp":"2024-11-21T18:07:14.550Z", "log.level": "INFO", "message":"could not get token document [token_weLo5m5WH1O3jKTxoV377C_p_azAkVfCiGm_rZd0eqA] that should have been created, retrying", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-es-coordinating-0][transport_worker][T#15]","log.logger":"org.elasticsearch.xpack.security.authc.TokenService","trace.id":"520380caaa394bac07f6ab93cc2a15c9","elasticsearch.cluster.uuid":"mKB46yE8TTyd9j_87znstg","elasticsearch.node.id":"1tDQZUAPTpKu1-07XRep6A","elasticsearch.node.name":"elasticsearch-es-coordinating-0","elasticsearch.cluster.name":"elasticsearch"}

@artem-shelkovnikov
Member

Hi @sjors101, it seems to be a log from Elasticsearch. We're looking for logs from connector containers :)
