This repository has been archived by the owner on Sep 16, 2024. It is now read-only.

GPY Modem - ESP32 Cannot Communicate, LTE() function error #445

Open
ptcoregon opened this issue May 13, 2020 · 23 comments

@ptcoregon

GPY Modem-ESP32 Communication Issue

This document describes a hardware/firmware issue that severely impacts reliability and limits potential applications.

Issue Description

The fundamental issue revolves around this call in main.py:

lte = LTE()

The ESP32 is unable to connect to the modem, resulting in the error below, raised from pycom-micropython-sigfox/esp32/mods/modlte.c:

OSError: Couldn't connect to Modem (modem_state=disconnected)

After resetting, the problem repeats itself.

machine.reset()

After hundreds of soft resets, the LTE() function occasionally returns successfully and we are able to proceed with full functionality from then on.
But after one (or more) soft resets or spontaneous disconnects, the problem returns. The same happens when resets occur via WDT.

Power cycling will usually cause the module to operate properly on the first try, but not always. And the issue always returns after a period of time.

Firmware Version Tests

This problem is present with many firmware versions, including the experimental Dev branch of pycom-micropython-sigfox.

  • v1.18.2
  • v1.18.2.r7
  • v1.20.1
  • v1.20.1.r2
  • v1.20.2.rc6
  • Development

I have used both the pre-compiled firmware and firmware I built myself with pycom-esp-idf and the toolchain. Nothing fixes the problem.

I am using the latest stable version, CATM1-41065.dup firmware for the modem. It is almost impossible to downgrade because, obviously, I cannot communicate with the modem.

Hardware Tests

This problem is present in two different GPY modules using the Expansion Board 3.1 and the provided cellular antenna.

Version information from one of the modules: (sysname='GPy', nodename='GPy', release='1.20.1.r2', version='v1.20.1.r2-5-gdb7f895-dirty on 2020-03-25', machine='GPy with ESP32')

Fix Attempts

Trying lte = LTE() from the CLI does exactly the same thing.

Adding pycom.lte_modem_en_on_boot(False) does nothing.

Other Instances

I am clearly not the only one with this issue. See these forum threads from 2018:

https://forum.pycom.io/topic/3675/despite-heavy-investment-in-fipy-gpy-not-possible-to-use-board-as-anything-more-than-lte-modem-and-even-that-s-problematic

https://forum.pycom.io/topic/3129/lte-lte-getting-stuck-after-reset-fw-1-17-3-b1-on-fipy

Relevant Code

import pycom
import time
import os
from machine import WDT
from machine import SD
import machine

pycom.wdt_on_boot(True)
pycom.wdt_on_boot_timeout(240000)

wdt = WDT(timeout=240000)  # enable it with a timeout of 240 seconds
wdt.init(240000)
wdt.feed()

import ujson

pycom.wifi_on_boot(False)
pycom.heartbeat(False)

from machine import Pin
from network import WLAN

wlan = WLAN()
wlan.antenna(WLAN.EXT_ANT)
wlan.deinit()

import socket
import ssl
from network import LTE
from network import Bluetooth
from simple import MQTTClient
import ubinascii
import array
from machine import RTC

if pycom.lte_modem_en_on_boot():
    print("LTE on boot was enabled. Disabling.")
    pycom.lte_modem_en_on_boot(False)

print("LTE()")

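try:
    lte = LTE()
except OSError as e:
    # LTE() raises OSError when the ESP32 cannot reach the modem
    print("LTE() failed:", e)
    time.sleep(6)
    machine.reset()

# we rarely get here...
print("LTE() done")
# rest of program...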
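
The code above resets the whole board on the first failed LTE() call. A minimal sketch of an alternative is to retry the constructor a few times before resorting to a reset. The helper name create_with_retries, its parameters, and the idea that an in-process retry helps at all are assumptions for illustration, not something confirmed for the Pycom firmware.

```python
import time


def create_with_retries(factory, attempts=3, delay_s=6, sleep=time.sleep):
    """Call factory() until it succeeds or the attempts run out.

    Returns the created object, or None if every attempt raised OSError.
    'factory' is any zero-argument callable, for example the LTE class.
    """
    for attempt in range(1, attempts + 1):
        try:
            return factory()
        except OSError as exc:
            print("attempt", attempt, "failed:", exc)
            if attempt < attempts:
                # wait before the next try; no sleep after the final failure
                sleep(delay_s)
    return None
```

On the device this might be called as lte = create_with_retries(LTE), falling back to machine.reset() when it returns None.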

Proposed Resolution

There must be a way to reset the communication lines between the ESP32 and the modem without first executing lte = LTE(). I am comfortable building my own updated firmware with a solution from Pycom. However, I need assistance due to the complexity of the LTE-related firmware and processes.

Thank you thank you thank you for any help in solving this issue!

@abatardi

abatardi commented Jun 1, 2020

My company just spent thousands of dollars developing a sensor companion board for use with Pycom after some promising initial tests (trying to move away from Particle). However, it's become unusable lately, and the latest firmware/Sequans firmware releases just seem to make it worse and worse. If we can even get an LTE connection at this point, it lasts maybe 2 minutes before disconnecting. We have tried Hologram and NimbeLink SIMs on Verizon, same problem on both. Meanwhile Particle devices on the same SIMs/networks in the same physical location are making connections perfectly and holding them for days without dropping.

Why is this so bad? I understand there are a lot of differences and problems with cellular communication, but it seems the rest of the world has it figured out. Meanwhile I'm trying to decide if we are going to drop more thousands of dollars moving our new board design back to Particle or continue to throw more wasted time and money at Pycom. Ridiculous.

@abatardi

abatardi commented Jun 1, 2020

LTE connect()
LTE is_connected()
LTE connection established
connect_lte with start_mqtt is now removed please call communication_protocol or start_mqtt directly
MQTT Protocol
Packet sent. (Length: 103)
This is PybytesProtocol.start_MQTT
Packet sent. (Length: 44)
Connected to MQTT mqtt.pybytes.pycom.io
Pybytes connected successfully (using the built-in pybytes library)
This is pack_info_message()
__pack_message: b'310504a00507'
MQTT Protocol
Socket send error [Errno 104] ECONNRESET
This is pack_pybytes_message_variable(5, 0, bytearray(b'\x00\x00\x00\x00\x00'))
__pack_message: b'3e05000000000000'
MQTT Protocol
Socket send error [Errno 104] ECONNRESET
This is pack_pybytes_message_variable(5, 0, bytearray(b'\x00\x00\x00\x01\x00'))
__pack_message: b'3e05000000000100'
MQTT Protocol
Socket send error [Errno 113] ECONNABORTED

@catalinio
Contributor

Hi,

Please drop us a short email here: https://pycom.io/community/contact-support/ and we can provide you with an experimental modem firmware.

Best wishes,
Catalin

@ptcoregon
Author

After working with the Pycom engineers, it appears that the "experimental" new Modem firmware they have fixes this problem. However, I am keeping this ticket open until I see that someone has posted a link to where people can get this firmware.

@abatardi

abatardi commented Jun 24, 2020

Meanwhile, Pycom engineers continue to completely ignore the support request I submitted to them 3 WEEKS ago. Please make these firmware files available.

@tlanier9

tlanier9 commented Sep 7, 2020

I've also seen this issue using V1.20.3.b0.

Has anything been done to attempt to fix this?

Does the GPy have a method to physically reset the modem (not AT command)?

A power down/up fixed the problem in my case.

@amcewen

amcewen commented Jun 15, 2021

Is there any update on the "experimental" new modem firmware? We're seeing the same problem on some of our devices, and don't have an easy way to power-cycle them.

@tlanier9

tlanier9 commented Jun 15, 2021 via email

@abatardi

We had to switch to a different product entirely as the pycom units we had in the field caused countless issues and we had site visits literally on a weekly basis. These are in no way ready for prime time.

@peter-pycom
Contributor

peter-pycom commented Jun 16, 2021

> Is there any update on the "experimental" new modem firmware? We're seeing the same problem on some of our devices, and don't have an easy way to power-cycle them.

CAT-M fw update has been released: https://forum.pycom.io/topic/6881/lte-modem-firmware-release-for-cat-m1-5-2-48829

Regarding the firmware version for the ESP32: if you intend to use our LTE class, I'd strongly suggest 1.20.2.r4, since it contains some LTE fixes. (If you use another solution, like the one mentioned by @tlanier9, this is less relevant.)

@pkharvey

We've been trying for some time to find a fix/workaround to these same LTE() issues you're experiencing while trying to use the Fipy on NB-IoT in both the UK and the US. It would be ideal for us if the issues with the LTE class could be resolved for our existing code and hardware as we are also unable to power-cycle the modem in our intended application, and we're able and willing to help do some testing here with the equipment we have and contribute our findings.

Our focus is ideally on getting the Fipy working on NB-IoT in the US, but the LTE functionality in general is equally important to us. Our UK Fipy can send/receive on NB-IoT whenever the LTE class doesn't error out; however, our US Fipy has not worked once on NB-IoT despite using a known-working SIM.

We have a few Fipys in the UK and US on firmware 1.20.0.rc13 or 1.20.2.r4, running basic NB-IoT Python code from file or REPL. Most are on custom boards with sensors but we also have Pysense/Expansion boards for testing. Modem firmware is LR6.0.0.0-41019 (NB-IoT). We have yet to try 46262, but I'll report back on our findings when we do. For the avoidance of doubt, some tests were performed with a capacitor pair (1000 µF + 100 nF) added to each of the power input pins and to the 3.3 V output, with power from battery, USB, sometimes both, or an external power supply; we found no improvement or difference with or without the capacitors in any combination. (Notably, with the modem active and the Fipy idle in the REPL, we sometimes observed current spikes of around 600 mA for 7 ms. The peaks were reduced slightly with the capacitors added, but behaviour was functionally the same.)
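
As a rough check on the figures above, the sketch below computes how far an ideal capacitor would droop if it alone had to supply a constant-current spike, using dV = I * t / C. This is an idealized back-of-envelope model (no ESR, no contribution from the supply) and the function name is made up for illustration; it is not a measurement from this thread.

```python
def droop_volts(current_a, duration_s, capacitance_f):
    """Voltage drop of an ideal capacitor supplying a constant current.

    Uses dV = I * t / C and ignores ESR and any current from the supply.
    """
    return current_a * duration_s / capacitance_f


# 600 mA for 7 ms drawn from 1000 uF alone
spike_droop = droop_volts(0.6, 0.007, 1000e-6)
print("droop in volts:", spike_droop)
```

With these numbers the result is roughly 4.2 V, more than the rail voltage itself. Under this simple model a 1000 µF capacitor can only shave a small part off such a spike and the supply still has to source most of it, which is consistent with the peaks being reduced only slightly.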

Some of our investigations / findings:

  • We experience the same LTE() module issues with both 1.20.0.rc13 previously and now 1.20.2.r4.
  • We have observed problems with initialization/creation of the LTE() object as well as lte.deinit() and find that either one can become a recurring problem on each machine.reset() cycle.
  • We don't have a PyJTAG, so instead we connected an analyzer to the five accessible LTE UART pins on the Fipy with pycom.lte_modem_en_on_boot() == True and observed the modem responses.
  • As power is applied, and before LTE() is called manually, the firmware tries to put the modem into command mode with +++, sending it a second time if there is no response to the first (as seen in lteppp.c).
  • The same modem initialization sequence happens when lte=LTE() is called. Sometimes, when the call fails, we have observed either:
    • no response from the modem (RX is silent)
    • or the modem ignores the +++ and stays in data mode sending HDLC frames
  • After a successful uplink (and downlink), lte.deinit() is called. The ESP32 sends 'ATH\r', and AT commands then proceed up to AT+CFUN=4, which should turn off the transceiver. Here's where the paths diverge:
    • Where the deinit() call succeeds, the modem sends OK CEREG: 0 and WAKE goes low for 1 second.
    • Where the call fails, we observed that RTS went high and there was no RX data from the modem. WAKE does not change.
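
The three outcomes seen on the analyzer after +++ (a normal reply, silence, or continued HDLC frames) can be told apart from a captured byte string. The sketch below is one way to do that, assuming the data-mode traffic is PPP in HDLC-like framing, where 0x7E is the frame delimiter. The function name and the returned labels are hypothetical, and the check for "OK" is a simplification of real AT response parsing.

```python
HDLC_FLAG = 0x7E  # frame delimiter in HDLC-like framing (as used by PPP)


def classify_escape_response(rx):
    """Classify bytes captured from the modem after '+++' was sent.

    rx is a bytes object; the returned labels are illustrative only.
    """
    if not rx:
        return "silent"            # modem sent nothing at all
    if HDLC_FLAG in rx:
        return "data_mode"         # still emitting framed data
    if b"OK" in rx:
        return "command_mode"      # escape sequence was accepted
    return "unknown"
```

The flag byte is checked before "OK" because a data frame could contain the bytes "OK" by chance, while a plain AT reply would not normally contain 0x7E.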

We also ran some stress tests, repeating the same LTE init/send/receive/deinit/reset code over 150 times to spot any patterns (varying our experimental setup after around 20 tries each). There was no clear difference when varying the power source or with/without capacitors. We did observe one pattern: calling deinit() quickly after receiving an NB-IoT downlink succeeded almost every time, with only a few failures. Adding a few seconds of extra processing delay between the downlink and the deinit() made it fail almost every time, with only a few successful deinits.
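
A repeated-trial test like the one described can be tallied with a small harness. The sketch below is only an outline under stated assumptions: run_trials and the trial callable are hypothetical names, and on a real device the trial would wrap the init/send/receive/deinit cycle and would have to persist its counts across resets, which this sketch does not do.

```python
def run_trials(trial, setups, runs_per_setup=20):
    """Run trial(setup) repeatedly and count successes per setup.

    trial is a callable returning True on success; an OSError is counted
    as a failure. Returns {setup: (successes, runs)}.
    """
    results = {}
    for setup in setups:
        successes = 0
        for _ in range(runs_per_setup):
            try:
                if trial(setup):
                    successes += 1
            except OSError:
                pass  # treated as a failed run
        results[setup] = (successes, runs_per_setup)
    return results
```

Here each setup could be, for example, the number of seconds to wait between the downlink and deinit(), so that the success counts for a short and a long delay can be compared directly.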

It seems the internal functions can't take back control over the modem at times where the modem is expected to be in one given state. Maybe there's another way of interrupting the modem or getting its attention (is DTR wired and does the modem use it?).

Let's hope some clues lead to a better understanding and a solution. We have a lot of data. I can provide more detail on any of these points if needed.

@pkharvey

Had to focus on some procurement here. The LTE module is still important to us - I haven't forgotten. I have some current profile plots that I can put up... I'll be back to post those when I get a moment.

@pkharvey

pkharvey commented Aug 9, 2021

Thanks to you all for your patience. Unfortunately despite best efforts, we haven't been able to get much closer to solving it once and for all, however we made some small steps and I can give a bit more information on what we've learned. I don't have vast experience with LTE modems so might give some awkward descriptions or have missed some clues.

To recap (OP described it best), we were experiencing a problem failing to gain control of the LTE modem at times and so the LTE() object could not be created, and so it could not be lte.deinit()ed etc. We were running the NB1-41019 firmware. We need to deepsleep in our application but we can't power cycle.

This would often happen to us when attempting an lte.attach(), whereby if it failed to attach in allotted time, the program would time out, reset, and be unable to reinitialize the modem with LTE(). The logic analyzer could see the Fipy trying to interrupt the modem with +++ while the modem appears to be stuck in data mode ignoring the interrupt (the PyJTAG is a bit out of reach for us).

We were unable to predict when it would happen or to recreate the conditions (it would usually just happen from time to time). It happened a lot with one of our test devices running NB1-41019, increasingly frequently, until the device became effectively bricked. With nothing to lose, we decided to flash the experimental NB1-46262 available from a Pycom service request. Since flashing 46262, we can regain control over the modem almost every time (it failed twice, but we were able to retry and regain control). We are not sure whether it was the act of flashing the modem firmware that cleared its config or the version itself, but flashing it did mostly unblock the device for us in this instance. Monitoring the current draw, we observed that the modem running 46262 entered deep sleep every time with lte.deinit(reset=True).

Our capabilities will be limited with the equipment we have but if there are any suggestions we can try to get closer to resolving it, we can post our results.

@SebastiaanMerckx

I have that same 46262 firmware and, together with an unofficial 1.20.2.rc11 micropython binary, this is the most stable that I could get so far (using NB-IoT).
So that means we have live setups with unofficial modem firmware and unofficial Pycom firmware, great 😄.

@tlanier9

tlanier9 commented Nov 24, 2021 via email

@jonnerd154

jonnerd154 commented Nov 24, 2021

> We ended up having to add an external watchdog circuit which toggles the +5V power to the GPy chip.
@tlanier9

We did the same. An external micro (ATTINY) with a couple GPIO pins for interface to the GPy, and one to control the regulator's Enable pin. It replaced the expensive pushbutton controller we used for on/off control, so we got more functionality for less money in the end. And the GPy can continue running LTE! 🎉

Feels a bit drastic, but it works. I would recommend planning this into any GPy LTE project from the beginning.
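
The supervisor logic described here (cut power through the regulator's Enable pin when the GPy stops signalling, then restore it) can be modelled in a few lines. The sketch below is a Python model of that behaviour for reasoning about timeouts, not the ATTINY firmware used in this thread; the class name, the timing parameters and the heartbeat interface are all assumptions.

```python
class PowerSupervisor:
    """Model of an external watchdog that power-cycles a board via an EN pin."""

    def __init__(self, timeout_s, off_time_s, set_enable):
        self.timeout_s = timeout_s      # max silence before power is cut
        self.off_time_s = off_time_s    # how long power stays off
        self.set_enable = set_enable    # callable(bool) driving the EN pin
        self.last_beat = 0
        self.off_since = None           # time power was cut, None while on
        set_enable(True)

    def heartbeat(self, now):
        """Record a heartbeat pulse received from the supervised board."""
        self.last_beat = now

    def tick(self, now):
        """Advance the model; call periodically with the current time."""
        if self.off_since is not None:
            if now - self.off_since >= self.off_time_s:
                self.set_enable(True)
                self.off_since = None
                self.last_beat = now    # allow a full timeout for boot
        elif now - self.last_beat > self.timeout_s:
            self.set_enable(False)
            self.off_since = now
```

The timeout has to be longer than the slowest legitimate gap between heartbeats, including boot and any LTE attach wait, otherwise the supervisor would power-cycle a healthy board.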

@curtmiller

Some great insight in here. We have spent the last week or so trying to talk some sense into the GPy LTE modem with little to no success or consistent performance as a result. Going to try implementing an external keepalive circuit as it appears to have worked for a few people on here.

Have there been any further discoveries that will help us achieve a consistent GPy LTE connection? Trying to keep board revs to a minimum. Thanks guys!

@jonnerd154

> Have there been any further discoveries that will help us achieve a consistent GPy LTE connection? Trying to keep board revs to a minimum.

~1 year after we added the external watchdog, we are still very happy we did. It has worked very well, and I still recommend implementing one in every GPy design. There are still some quirks, but having the ability to power cycle the whole Pycom rescues us from most of them.

From a hardware perspective, here's the kitchen sink we threw at it with good results:

  • Add the aforementioned external power supervisor/watchdog
  • Power the GPy from ~4.1 V or higher to provide additional factor of safety (FOS) against brownouts that can cause a reset when the LTE modem draws a lot of current.
  • Noisy supplies, or supplies that brown out during spikes in power demand, cause LTE issues. I had some issues with the routing of a switching supply that caused trouble, but only in certain conditions. Scope both VIN and 3.3 V during LTE init, attach, connect, transfer, detach, deinit. Adjust designs and decoupling caps as required.
  • Add a healthy amount of tank capacitance on the 3.3 V and VIN rails, right next to the GPy. I did this with a bunch of ceramics, but if your application allows, tantalum or aluminum would be great.
  • Reserve the Pycom's 3.3 V supply exclusively for its own internal use. If you need a 3.3 V supply somewhere else in your application, add a separate LDO. The only things connected to the Pycom's 3.3 V net should be the decoupling caps.
  • As you're planning the power supervisor, its control strategy, and potentially a separate 3.3 V supply, consider and prevent the potential for latch-up if/when various supplies are enabled/disabled separately. I'd suggest switching the highest-level/most-global regulator that you can. Of course, the power supervisor itself needs to stay powered separately. Be sure to respect logic levels/shift where required on the comm lines between the GPy (3.3 V) and the supervisor (>=4.01 V). Prefer controlling a regulator EN pin over adding a FET to switch the power on/off.

Also, I wanted to make sure you have seen this: #600 (comment) . Depending on where you are in the product life cycle, it might be good to avoid the GPy in your design. :(

@curtmiller

Fantastic, thank you very much! Great to see there is light at the end of the tunnel for getting these guys up and running. Unfortunately for us, we have a fairly significant stash of these GPys, so we'd like to at least get them reliable for some product testing and customer feedback.

As far as firmware is concerned, has the most reliable combination been 46262/1.20.2.rc11 as @tlanier9 mentioned in their post? We are aiming for CAT-M1, so their implementation may not be ideal as it sounds like it's geared for NB-IoT.

@tlanier9

As indicated previously, we are using the Pycom MicroPython V1.20.0.rc13 release candidate in our application. We do not use the LTE module for connectivity but instead talk directly to the cell modem through the serial port. It is possible that improvements have been made to the LTE module in newer versions that we have not tried. The external watchdog is definitely necessary for long-term reliability.

@ELundby45

ELundby45 commented Jun 23, 2022

@curtmiller, @jonnerd154's information has all been based on CATM1-5.2-48829/1.20.2.r4.

One additional tidbit that has been helpful was from #585: occasionally I have seen that the GPy has been running fine for weeks (with a few automated power cycles from the external watchdog) but then gets into a cycle where it will no longer attach. Sending the AT&F command and then a reset has been able to get the GPy out of this state.
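
A minimal sketch of that recovery step is shown below. It assumes an already constructed LTE object whose send_at_cmd() returns the modem's reply as a string, and it takes the reset action as a parameter because the comment does not say which kind of reset was used. The helper name factory_reset_modem is hypothetical.

```python
def factory_reset_modem(lte, reset):
    """Send AT&F to the modem and, if it answers OK, call reset().

    lte   -- object with a send_at_cmd(cmd) method returning a string
    reset -- zero-argument callable, e.g. machine.reset on the device
    Returns True if the modem acknowledged AT&F, False otherwise.
    """
    reply = lte.send_at_cmd("AT&F")
    print("AT&F reply:", reply.strip())
    if "OK" in reply:
        reset()
        return True
    return False
```

On the device, passing machine.reset as the reset callable means the function never returns on success, since the board restarts.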

@curtmiller

Thank you guys for the input! I really appreciate it.

@curtmiller

We have achieved LTE connection and have since been working out some of the hardware kinks in hopes of better reliability. However, with the recent EOL announcement for the GPy my manager has decided to change course and pursue a more future-proof product.

If those of you who have had success with your GPy designs are interested in more stock, shoot me an email at [email protected]. Again, thank you for the input guys!
