Canbus only works correctly while debugger is connected in Zephyr?

PeeJay · July 12, 2024, 8:01am

I’m currently trying to get a minimal working CAN bus receiver running in Zephyr on a SAM E51 board, but I’ve hit a strange problem: without the debugger (J-Link) attached, I correctly receive only about 1 in 8 packets, and the rest trigger a <err> can_mcan: Message RAM access failure error. When the debugger is attached (meaning gdb is attached, not just the physical connection), the error goes away and everything immediately works perfectly!
I’ve verified the hardware works by making an Arduino sketch using the Adafruit CAN library (which is mostly taken from ASF), and that works fine with and without the debugger. I’ve tried enabling and disabling the MPU and cache, but that makes no difference. I can’t find any extra clocks that need enabling, or anything similar.
I don’t suppose anyone here has experience with the Bosch CAN IP block that is used in the SAM E and a bunch of other micros?

cezar · July 12, 2024, 10:08am

I played a little with CAN bus on some ST boards and an ESP32, but I don’t recall seeing this message. I would enable the CAN shell, configure the device (set the bitrate, accept all packets, etc.) and then check which frames are received. These commands are from an older version of Zephyr, so I’m not sure they are up to date; there has been some refactoring since:

canbus config CAN_1 500000
canbus attach CAN_1 0x000 0x000

… has not completed acceptance filtering or storage …

Did you set filters? Something like this:

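   /* Mask 0 means no ID bits are compared, so each filter accepts every frame. */
   const struct can_filter stdFilter = {
      .id = 0,
      .mask = 0,
      .flags = 0,              /* standard (11-bit) IDs */
   };
   const struct can_filter extFilter = {
      .id = 0,
      .mask = 0,
      .flags = CAN_FILTER_IDE, /* extended (29-bit) IDs */
   };
   int ret = can_set_mode(mDev, CAN_MODE_NORMAL);
   if (ret != 0) {
      LOG_ERR("can_set_mode failed: %d", ret);
   }
   ret = can_set_bitrate(mDev, bitrate);
   if (ret != 0) {
      LOG_ERR("can_set_bitrate failed: %d", ret);
   }

   /* A negative filter id is an error code (for example, no free filter slot). */
   int filter_id = can_add_rx_filter_msgq(mDev, &mQueue, &stdFilter);
   LOG_DBG("Counter filter id: %d", filter_id);

   filter_id = can_add_rx_filter_msgq(mDev, &mQueue, &extFilter);
   LOG_DBG("Counter filter id: %d", filter_id);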
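
For reference, here is a minimal host-side sketch of why a filter with id 0 and mask 0 accepts every frame. It assumes the usual masked comparison, where the frame ID and filter ID must agree on every bit that is set in the mask. The name filter_matches is hypothetical and is not part of the Zephyr API; the real matching is done by the CAN controller hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a Zephyr function: returns true when frame_id
 * agrees with filter_id on every bit that is set in mask. */
bool filter_matches(uint32_t frame_id, uint32_t filter_id, uint32_t mask)
{
    /* CAN IDs are at most 29 bits wide, so a wider mask is a caller bug. */
    assert((mask & ~UINT32_C(0x1FFFFFFF)) == 0);

    /* XOR marks the differing bits; the mask keeps only the bits we care about. */
    return ((frame_id ^ filter_id) & mask) == 0;
}
```

With a mask of 0 the masked difference is always 0, so any frame ID matches. With a mask of 0x7FF every bit of a standard ID has to match exactly.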

PeeJay · July 12, 2024, 10:37am

I can transmit fine, I can even receive fine while gdb is running. The moment I stop gdb is when the issue happens.

[00:00:06.004,000] <inf> main: Received CAN frame
[00:00:06.007,000] <inf> main: Received CAN frame
[00:00:06.008,000] <inf> main: Received CAN frame
[00:00:06.009,000] <inf> main: Received CAN frame
[00:00:06.012,000] <inf> main: Received CAN frame
[00:00:06.014,000] <inf> main: Received CAN frame
Quit GDB...
[00:00:06.017,000] <inf> main: Received CAN frame
[00:00:06.018,000] <err> can_mcan: Message lost on FIFO0
[00:00:06.020,000] <inf> main: Received CAN frame
[00:00:06.163,000] <err> can_mcan: Message RAM access failure
[00:00:06.171,000] <err> can_mcan: Message RAM access failure
[00:00:06.311,000] <err> can_mcan: Message RAM access failure
[00:00:06.319,000] <err> can_mcan: Message RAM access failure
[00:00:06.459,000] <err> can_mcan: Message RAM access failure
[00:00:06.465,000] <inf> main: Received CAN frame
[00:00:06.470,000] <inf> main: Received CAN frame
[00:00:06.617,000] <err> can_mcan: Message RAM access failure
[00:00:06.623,000] <inf> main: Received CAN frame
[00:00:06.770,000] <err> can_mcan: Message RAM access failure
[00:00:06.918,000] <err> can_mcan: Message RAM access failure
[00:00:06.925,000] <err> can_mcan: Message RAM access failure
[00:00:06.927,000] <inf> main: Received CAN frame
[00:00:06.930,000] <inf> main: Received CAN frame
[00:00:07.071,000] <err> can_mcan: Message RAM access failure
[00:00:07.079,000] <err> can_mcan: Message RAM access failure
[00:00:07.219,000] <err> can_mcan: Message RAM access failure
[00:00:07.225,000] <inf> main: Received CAN frame

mbaker335 · July 12, 2024, 6:24pm

I know nothing about your setup, but could this be caused by termination issues? Those problems arise from reflections and impedance mismatches. By adding the debugger you are perhaps adding an internal terminator that stops the reflections. This is all handwaving, but the debugger is changing the environment. Just an idea.

mattway · July 12, 2024, 8:56pm

I’ve got no idea if this helps or not.
Some time ago I had code that would only work with the debugger.
There was actually a bug in my code: it did not return a value in some edge cases.
With a normal compile, GCC was turning on optimisations. With the debugger there was no optimisation, and the code ran.

Good luck

PeeJay · July 13, 2024, 1:11am

No, I’m not changing anything physical. The wires are only 30cm anyway.

I’ve tried all the optimisation settings, no change. The thing is that when I detach gdb the exact same code keeps running without even a reboot, but with all the errors.

mbaker335 · July 14, 2024, 6:31pm

I misread the issue with my earlier comment, sorry. I have seen code run OK in a debugger but not on its own. This was caused by faulty memory management. A pointer to memory that had already been freed was used. In the debugger the memory environment was safe, outside the debugger the workstation crashed. Another symptom is it ran ok on one workstation but crashed on another next to it but only sometimes. This was on SunOS rather than a low level environment but a simple mistake with pointers had weird symptoms.
I am probably talking rubbish again but these issues do bring back memories.

coflynn · July 15, 2024, 8:05pm

Normally for me, when stuff works under the debugger but not “native”, it’s due to initialization the debugger did, since debuggers normally do some minor setup of the SP/PC/etc. (ignoring more obvious cases, like the debugger adding delays while you interact with it).

If it was me - I’d try another GUI, like using Ozone (you mentioned a J-Link) and seeing if anything changes. You could also check the J-Link logger to see what commands are being sent in case it does anything on attach/detach that flags it as interesting.

There ARE still a bunch of changes that happen with the debug probe attached, so you may need to spend some effort narrowing this down. More complex ones are described in the Datasheet & Errata (search ‘debug’ and you’ll see many hits in both). Important things that I’ve seen before:

Cache behavior changes with debug enabled. Try disabling the cache in your initialization code (startup.S or equivalent normally), and see if that changes things. May need to test this with both ICache & DCache depending what the device has.
The debugger may be refreshing registers or RAM. There can be issues with that if you are watching certain registers (or it’s refreshing an I/O status type window). Note some of these for example come up in the E51 errata.
Some modules have a debug mode - normally this is only to do stuff like stop the RTC when processor is halted or similar, so I doubt this is the case for you. Less often do they change modes just when debug is attached.

ToyBuilder · July 15, 2024, 9:59pm

Does your GDB session use signal lines or peripherals that might overlap with your CAN bus hardware in some fashion?

Even things like IRQ servicing can be affected by debug facilities. I/O pin configuration (pullups, pin direction, interrupts configuration) could possibly be affected, too, depending on the nature of the GDB interface.

One thing that might be instructive is to look at the actual data exchange on the bus itself to see if using GDB is affecting the conversation on the interface (subtle shift in timing, or signal levels, for example) or if it is indeed identical.

Some debug interfaces will intercept memory accesses on the target and modify the access – profiling a memory loop access test might hint at whether something like that might be in play?

DanielS · July 16, 2024, 9:23pm

Don’t dismiss CAN bus termination before doing a simple check: without the debugger plugged in, use a DMM to measure the DC resistance from CANH to CANL anywhere on the bus. It should be about 60 ohms (a 120 ohm terminator at each “end” of the bus, which appear in parallel). Some CAN transceiver chips are pretty tolerant and can handle half-termination (120 ohms at one end only). Then connect your debugger and make sure the termination value doesn’t change much.
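
The 60 ohm figure is just two 120 ohm terminators in parallel. A small sketch of that arithmetic follows. The function names and the idea of a tolerance band are illustrative assumptions, not taken from any CAN specification or library.

```c
#include <assert.h>

/* Resistance of two resistors in parallel, in ohms. */
double parallel_ohms(double r1, double r2)
{
    assert(r1 > 0.0 && r2 > 0.0);
    return (r1 * r2) / (r1 + r2);
}

/* Hypothetical check: is a CANH-to-CANL reading within tol ohms of the
 * expected value? The tolerance is the caller's choice. */
int termination_ok(double measured, double expected, double tol)
{
    double diff = measured - expected;
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff <= tol;
}
```

Here parallel_ohms(120.0, 120.0) evaluates to 60.0. A reading near 120 ohms suggests only one terminator is present, and a much higher reading suggests none.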

mzapp · June 24, 2025, 3:33pm

In case you’re still having this problem, I have now come across the exact same problem and have narrowed it down to power management. From the data sheet, “When the CPU is halted in debug mode, the PM continues normal operation.” If I remove WFI() the error goes away and CAN works as expected without the debugger. Would love to find a configuration that allows WFI(); off to look at the CAN low-power mode next.
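
One way to experiment with this, short of removing the sleep everywhere, is to skip WFI only while CAN reception has to keep working. The sketch below is a host-side illustration of that gating logic and has not been verified on a SAM E51. Every name in it (idle_policy, can_in_use, idle_once, fake_wfi) is hypothetical, and on a real target the sleep hook would wrap the WFI instruction instead of a counting stub.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sleep hook. On a Cortex-M target it would wrap the WFI
 * instruction; a counting stub stands in here so the logic runs on a host. */
typedef void (*sleep_hook_t)(void);

unsigned int fake_wfi_calls;

void fake_wfi(void)
{
    fake_wfi_calls++;
}

struct idle_policy {
    bool can_in_use;    /* assumed flag: set while CAN reception must keep working */
    sleep_hook_t sleep; /* what to call when sleeping is allowed */
};

/* One pass of an idle loop. Returns true if the sleep hook was called,
 * false if the pass was skipped because CAN is in use. */
bool idle_once(const struct idle_policy *policy)
{
    assert(policy != NULL && policy->sleep != NULL);

    if (policy->can_in_use) {
        return false; /* stay awake instead of entering WFI */
    }
    policy->sleep();
    return true;
}
```

The trade-off is power: while can_in_use is set the CPU spins instead of sleeping, so this is a way to narrow down the problem and not a substitute for finding a power-management configuration that keeps the CAN peripheral working across WFI.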