The Reference Implementation for AWS IoT FleetWise ("FWE") includes a TraceModule. The TraceModule provides a set of metrics that serve as an entry point for diagnosing issues efficiently, saving you time because you no longer need to review the full logs of every running FWE instance.
RFrames0 to RFrames19 are monotonic counters of the number of raw CAN frames read on each bus. If these counters stay at zero or stop increasing for a longer runtime, the system either has no CAN bus traffic or there is no CAN-bound data collection campaign (for example, an OBD2-only campaign).

ConInt and ConRes enable you to monitor the number of MQTT connection interruptions and connection resumptions. Whether and how quickly a connection loss is detected depends on the kernel configuration parameters /proc/sys/net/ipv4/tcp_keepalive* and the compile-time constants of FWE: MQTT_CONNECT_KEEP_ALIVE_SECONDS and MQTT_PING_TIMEOUT_MS. If the value of the metric ConInt is not zero, the internet coverage in the tested environment might be unreliable, or MQTT_PING_TIMEOUT_MS, which defaults to 3 seconds, needs to be increased because there is high latency to the AWS IoT Core endpoint. Changing the AWS Region can help to decrease latency.
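To inspect the relevant kernel parameters on the device, a minimal sketch like the following can be used (it only assumes a standard Linux /proc filesystem):

# Print the kernel TCP keepalive parameters that influence how quickly a
# broken MQTT connection is detected.
from pathlib import Path

for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
    value = (Path("/proc/sys/net/ipv4") / name).read_text().strip()
    print(f"{name} = {value}")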
CeTrgCnt is a monotonic counter of the number of triggers (inspection rules) detected since the FWE process started. Triggers are detected when one or more data collection campaign conditions are true. If this counter is larger than zero but no data appears in the cloud, either no actual data was collected (such as a time-based data collection campaign with no bus activity), or the data was ingested to the cloud but there was an error processing it. To debug this, enable cloud logging in the AWS IoT FleetWise settings.

QUEUE_CONSUMER_TO_INSPECTION_SIGNALS monitors the current count of signals queued for the signal history buffer. If this value is close to the value defined in the static config decodedSignalsBufferSize, increase that static config value, decrease inspectionThreadIdleTimeMs, reduce the bus load, or reduce the number of decoded signals in the decoder manifest in the cloud.

ConRej monitors the number of MQTT connection rejects. If this is not zero, check the certificates and make sure you use a unique client ID for each vehicle.

ConFail monitors the number of MQTT connection failures. This can have multiple root causes. If this is not zero, check the logs and search for Connection failed with error.
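For example, assuming FWE logs are managed by journald as in the examples later in this document (the systemd unit name fwe below is a placeholder), a quick check could look like this:

# Count MQTT connection failures in the FWE logs read from journald.
import subprocess

logs = subprocess.run(
    ["journalctl", "-u", "fwe", "--no-pager", "-o", "cat"],
    capture_output=True, text=True, check=True,
).stdout
failures = [line for line in logs.splitlines() if "Connection failed with error" in line]
print(f"{len(failures)} MQTT connection failures, most recent:")
for line in failures[-3:]:
    print(line)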
FWE_STARTUP and FWE_SHUTDOWN provide the amount of time it takes to start and stop the FWE process. If either value is more than 5 seconds, review the logs and make sure all required resources, such as internet connectivity and buses, are available before starting the process.

ObdE0 to ObdE3 monitor errors related to the OBD session. If you see non-zero values, make sure you are connected to a compatible OBD vehicle that is powered on. Otherwise, turn off OBD signal collection in the cloud.

PmE3 provides hints on whether the data persistency framework (a mechanism used to store and forward vehicle data when no connectivity is available) has an error. If this error counter is not zero, make sure that the directory defined in persistencyPath is writable and that there is space available in the filesystem.
SysKerTimeDiff shows the difference between the CAN frame RX timestamp from the kernel and the system time. If this is significantly higher than socketCANThreadIdleTimeMs, which is 50 milliseconds in the default configuration, the timestamps from the kernel are out of sync. Make sure an up-to-date SocketCAN driver for your CAN device is used. Alternatively, consider switching timestampType in the static config to Polling. This will affect timestamp precision; consider reducing the polling time socketCANThreadIdleTimeMs to mitigate this.

CeSCnt is a monotonic counter of the signals decoded and processed since startup. It can be used for performance evaluations.

CpuPercentageSum and CpuThread_* track the CPU usage of the complete process and of each thread. In multi-core systems this can be above 100%. FWE uses the Linux /proc/ directory to calculate this information.

MemoryMaxResidentRam gives the maximum number of bytes of resident RAM used by the process. If this is above 50 MB, consider switching from a CMake Debug build to a Release build. The queue sizes in the static config can also be reduced.
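As a cross-check of this metric, the kernel's peak resident set size of the FWE process can be read from /proc. A minimal sketch follows; the process name aws-iot-fleetwise-edge used for the lookup is an assumption and may differ in your integration:

# Read the peak resident RAM (VmHWM) of the running FWE process from /proc.
import re
import subprocess
from pathlib import Path

pid = subprocess.run(["pidof", "aws-iot-fleetwise-edge"],
                     capture_output=True, text=True, check=True).stdout.split()[0]
status = Path(f"/proc/{pid}/status").read_text()
vm_hwm_kb = int(re.search(r"VmHWM:\s+(\d+)\s+kB", status).group(1))
print(f"Peak resident RAM: {vm_hwm_kb} kB")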
CampaignFailures monitors errors related to campaign activation. If you see non-zero values, check the logs. Make sure not to deploy more campaigns in parallel than defined in MAX_NUMBER_OF_ACTIVE_CONDITION, which defaults to 256. Also check that the maxSampleCount of all collected signals fits into the memory used for the signal history buffer defined in MAX_SAMPLE_MEMORY, which defaults to 20 MB.

CampaignRxToDataTx provides the amount of time it takes from changing the set of active campaigns to the first signal data being published. If at least one time-based collection scheme is active, this should be at most the time period of that collection scheme.
There are multiple ways to collect metrics, depending on how FWE is integrated. We describe two methods: using the RemoteProfiler, and collecting the logs and extracting metrics from them (for example, through AWS Systems Manager).
Each method incurs charges for different AWS services such as AWS IoT Core, Amazon CloudWatch, AWS Systems Manager and more. For example, using the RemoteProfiler method, FWE uploads its metrics, currently ~300 of them, at your configured interval. For every 10 metric data points uploaded, at least one message is published to AWS IoT Core and one AWS IoT Rules Engine action is executed. If profilerPrefix is different for every vehicle, ~300 new Amazon CloudWatch metrics will be used per vehicle.
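As a rough, back-of-the-envelope illustration of the resulting message volume per vehicle (assuming the ~300 metrics and the 10-data-points-per-message packing mentioned above, and the 60-second upload interval used in the example configuration below):

# Rough estimate of RemoteProfiler MQTT message volume per vehicle; the input
# numbers come from the paragraph above and are approximations.
metrics_per_upload = 300          # ~300 metrics per upload interval
data_points_per_message = 10      # at least one message per 10 metric data points
upload_interval_s = 60            # metricsUploadIntervalMs = 60000 in the example config

messages_per_upload = metrics_per_upload / data_points_per_message      # 30
messages_per_day = messages_per_upload * 24 * 3600 / upload_interval_s  # 43200
print(f"~{messages_per_upload:.0f} messages per upload, ~{messages_per_day:.0f} messages per day")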
The RemoteProfiler module is provided as part of the FWE C++ code base. If activated, it regularly ingests the metrics and logs to AWS IoT Core topics, which have underlying AWS IoT Core rules and actions that route the data to Amazon CloudWatch. The same MQTT connection used to ingest the data collection campaign data is reused for this purpose. To activate the RemoteProfiler, add the following parameters to your config file:
{
    ...
    "staticConfig": {
        ...
        "mqttConnection": {
            ...
            "metricsUploadTopic": "aws-iot-fleetwise-metrics-upload",
            "loggingUploadTopic": "aws-iot-fleetwise-logging-upload"
        },
        "remoteProfilerDefaultValues": {
            "loggingUploadLevelThreshold": "Warning",
            "metricsUploadIntervalMs": 60000,
            "loggingUploadMaxWaitBeforeUploadMs": 60000,
            "profilerPrefix": "TestVehicle1"
        },
    }
}
In the above example configuration, a plain-text JSON document with metrics will be uploaded to the AWS IoT Core topic aws-iot-fleetwise-metrics-upload, and log messages of level Warning and Error to the topic aws-iot-fleetwise-logging-upload. If profilerPrefix is unique for every vehicle, for example if it is the same as clientId, there will be separate Amazon CloudWatch metrics for each vehicle. If all vehicles have the same profilerPrefix, the Amazon CloudWatch metrics are aggregated.
Two AWS IoT Core rule actions are needed for these topics to forward the data to Amazon CloudWatch metrics and logs. They can be created by using the following AWS CloudFormation stack template: fwremoteprofiler.yml. Click here to launch the CloudFormation template.
After the first vehicle uploads metrics, they can be found under the namespace AWSIotFleetWiseEdge. The format is
{profilerPrefix}_(variableMaxSinceStartup|variableMaxSinceLast|)_{name}
for variables and
{profilerPrefix}_(sectionAvgSinceStartup|sectionCountSinceStartup|sectionMaxSinceLast|sectionMaxSinceStartup)_{name}
for measuring the time in seconds needed for certain code sections. After running a vehicle with the above config, the metric TestVehicle1_variableMaxSinceStartup_RFrames0 and ~300 more will appear in Amazon CloudWatch. New values will appear every minute, as metricsUploadIntervalMs is set to 60000.
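To verify the upload from the cloud side, a minimal sketch using boto3 (assuming credentials with cloudwatch:ListMetrics permission are configured) could list the metrics of one vehicle:

# List the Amazon CloudWatch metrics uploaded by the RemoteProfiler for the
# profilerPrefix used in the example configuration.
import boto3

cloudwatch = boto3.client("cloudwatch")
prefix = "TestVehicle1_"  # matches the profilerPrefix in the example config

for page in cloudwatch.get_paginator("list_metrics").paginate(Namespace="AWSIotFleetWiseEdge"):
    for metric in page["Metrics"]:
        if metric["MetricName"].startswith(prefix):
            print(metric["MetricName"])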
For the direct upload of every log message above the specified threshold (loggingUploadLevelThreshold), log messages are cached at the edge for a maximum of 60 seconds (loggingUploadMaxWaitBeforeUploadMs) before being uploaded over MQTT.
The RemoteProfiler module does not cache any metrics or logs during a loss of connectivity. The local system log file can be used in that case; see the following section.
This method uses remote access, such as SSH via AWS Systems Manager or AWS IoT secure tunneling, to access the logs and metrics. In our examples, we use journald to manage the FWE logs. This has the benefit of log rotation, which might be necessary because FWE logging at TRACE level under high load can produce multiple gigabytes of logs per day. These logs can be collected in full or in aggregated form, for example over SSH from individual vehicles when debugging is needed, or cyclically from the whole fleet. To easily manage remote connections to multiple vehicles, AWS Systems Manager or AWS IoT secure tunneling can be used. For aggregation, custom scripts can be used to filter certain log levels. The log levels in the FWE logs range from [ERROR] to [TRACE]. To make the metrics easier to parse, you can set the parameter .staticConfig.internalParameters.metricsCyclicPrintIntervalMs in the static config to an interval like 60000. This causes the metrics to be printed to the log in an easily parsable format every minute. The following regular expressions can be used by any log/metrics aggregator/uploader that supports Python. For lines that start with TraceModule-ConsoleLogging-TraceAtomicVariable or TraceModule-ConsoleLogging-Variable:
import re

# Captures the named groups used below: metric name, id, current value,
# temporary maximum (temp_max) and overall maximum (max).
regex_variable = re.compile(
    r".*\'(?P<name>.*?)\'"
    r" \[(?P<id>.*?)\]"
    r" .*\[(?P<current>.*?)\]"
    r" .*\[(?P<temp_max>.*?)\]"
    r" .*\[(?P<max>.*?)\]"
)
For lines starting with TraceModule-ConsoleLogging-Section:
regex_section = re.compile(
r".*\'(?P<name>.*?)\'"
r" \[(?P<id>.*?)\]"
r" .*\[(?P<times>.*?)\]"
r" .*\[(?P<avg_time>.*?)\]"
r" .*\[(?P<tmp_max_time>.*?)\]"
r" .*\[(?P<max_time>.*?)\]"
r" .*\[(?P<avg_interval>.*?)\]"
r" .*\[(?P<tmp_max_interval>.*?)\]"
r" .*\[(?P<max_interval>.*?)\]"
)
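A minimal sketch of how these regular expressions could be applied, assuming the FWE log lines are piped in on stdin (for example via journalctl -u fwe -o cat, where the unit name fwe is a placeholder) and the regex definitions above are in scope:

# Parse the cyclically printed TraceModule metrics from FWE log lines on stdin
# and keep the latest maximum per metric name.
import sys

variables = {}
sections = {}
for line in sys.stdin:
    if ("TraceModule-ConsoleLogging-TraceAtomicVariable" in line
            or "TraceModule-ConsoleLogging-Variable" in line):
        match = regex_variable.match(line)
        if match:
            variables[match.group("name")] = match.group("max")
    elif "TraceModule-ConsoleLogging-Section" in line:
        match = regex_section.match(line)
        if match:
            sections[match.group("name")] = match.group("max_time")

print(f"parsed {len(variables)} variables and {len(sections)} timed sections")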
After the metrics are parsed from the local log files, a local health monitoring program can decide if and how to upload them to the cloud.
Adding new metrics requires changing the C++ code and recompiling FWE. Add the metric to the TraceVariable enum in TraceModule.h and assign it a short name in the function getVariableName of TraceModule.cpp. Then you can set the metric anywhere by using:
TraceModule::get().setVariable( TraceVariable::MAX_SYSTEMTIME_KERNELTIME_DIFF, observedNewValue);
The metric will be automatically included in both methods described above. There are no changes needed in the cloud, and the new metric will just show up in the same namespace.