
Troubleshooting Runtime Fabric on VMs / Bare Metal

This topic describes common errors and steps to resolve them when installing and managing Anypoint Runtime Fabric on VMs / Bare Metal.

Obtain a Full Network Assessment

Run the following command for an overall health assessment of the network:

/opt/anypoint/runtimefabric/rtfctl status

Troubleshoot Network Connectivity Using rtfctl

Every Anypoint Runtime Fabric cluster requires connectivity with the Anypoint control plane. Any interference with this connectivity can limit functionality, resulting in application deployment failures or a degraded status in Anypoint Runtime Manager.

You can use rtfctl to verify that Runtime Fabric has the required outbound connectivity as well as troubleshoot connectivity issues.

Verify Outbound Connectivity

On each node, follow the instructions in Install rtfctl to install rtfctl. Then run the following command on all controller and worker nodes in the cluster to verify the required outbound connectivity:

sudo ./rtfctl test outbound-network

Sample output:

[root@rtf-controller-1 runtimefabric]# sudo ./rtfctl test outbound-network
Using proxy configuration from Runtime Fabric (proxy "", no proxy "")

Using 'US' region
transport-layer.prod.cloudhub.io:443 ✔
https://anypoint.mulesoft.com ✔
https://worker-cloud-helm-prod.s3.amazonaws.com ✔
https://exchange2-asset-manager-kprod.s3.amazonaws.com ✔
https://ecr.us-east-1.amazonaws.com ✔
https://494141260463.dkr.ecr.us-east-1.amazonaws.com ✔
https://prod-us-east-1-starport-layer-bucket.s3.amazonaws.com ✔
https://runtime-fabric.s3.amazonaws.com ✔
tcp://dias-ingestor-nginx.prod.cloudhub.io:443 ✔

If you have outbound connectivity issues that prevent Runtime Fabric from reaching any of the required Anypoint control plane services, work with your network team to verify that you have added the required ports, IP addresses, and hostnames to the allowlist, as described in Port IP Addresses and Hostnames to Add to the Allowlist.

Troubleshoot an Unknown Certificate Authority Error

When running a network test to verify outbound connectivity, you might see an error that says, x509: certificate signed by unknown authority.

This error occurs when traffic is intercepted by a firewall or proxy that uses its own root CA, which Runtime Fabric does not support.

To verify the certificate issuer, run:

curl -v https://rtf-runtime-registry.kprod.msap.io

Ensure the result shows that Amazon is listed as the issuer.

Server certificate:

subject: CN=*.prod.cloudhub.io
start date: Nov 12 00:00:00 2021 GMT
expire date: Dec 10 23:59:59 2022 GMT
subjectAltName: host "rtf-runtime-registry.kprod.msap.io" matched cert's "*.kprod.msap.io"
issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
SSL certificate verify ok.

If any other issuer is listed, traffic is being intercepted.
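
If you want to check only the issuer field, a quick alternative is openssl (assuming openssl is installed on the node); the issuer line should again list Amazon:

# Print only the issuer of the certificate presented by the control plane registry
openssl s_client -connect rtf-runtime-registry.kprod.msap.io:443 -servername rtf-runtime-registry.kprod.msap.io </dev/null 2>/dev/null | openssl x509 -noout -issuer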

Troubleshoot Connectivity Issues on Anypoint Monitoring Sidecar Container for Application Pods

If you experience connectivity issues in the monitoring sidecar container of application deployment pods, review the networking port prerequisites and unblock all required ports. If you choose not to allow the Anypoint Monitoring and Anypoint Visualizer port (5044), the sidecar may report errors even though application functionality is not affected. For more details, see: "Failed to connect to failover" ERROR logs in RTF application pod sidecar container.
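
To inspect the sidecar directly, you can view its logs with kubectl. The namespace and pod name below are placeholders for your own application deployment; the container name anypoint-monitoring matches the sidecar container shown later in this topic:

# Replace <app-namespace> and <app-pod-name> with your application's namespace and pod name
kubectl -n <app-namespace> logs <app-pod-name> -c anypoint-monitoring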

Troubleshoot Chrony Synchronization Check Failure

Runtime Fabric requires chrony for its time synchronization daemon. If chrony is not found by the init.sh installation script, Runtime Fabric installation fails with the following error:

============================================================================================
1 / 10: Install required packages
================================================
chrony-3.4-1.el7.x86_64
Checking chrony sync status...Retrying in 30 seconds...
Retrying in 30 seconds...
Error: chrony sync check failed 3 times, giving up.

***********************************************************
** Oh no! Your installation has stopped due to an error. **
***********************************************************
1. Visit the troubleshooting guide for help:
Troubleshoot Install Package Issues (later in this topic).

2. Resume installation by running /opt/anypoint/runtimefabric/init.sh

Additional information: Error code: 1; Step: install_required_packages; Line: -;

Perform the following steps to fix this issue. A consolidated check is sketched after this list.

  1. Ensure that the chronyd service is enabled:

    systemctl enable chronyd

  2. Stop and disable the Network Time Protocol daemon (ntpd) if it is running:

    systemctl stop ntpd; systemctl disable ntpd

  3. Contact your network team to verify that the time servers in /etc/chrony.conf are reachable.

  4. Verify that chrony is synced with sources:

    chronyc sourcestats -v
    
    Number of sources = 4
    Name/IP Address            NP  NR  Span  Frequency  Freq Skew  Offset  Std Dev
    ==============================================================================
    ns3.turbodns.co.uk         16  12   58m     -0.006      0.014   +370us    14us
    test.diarizer.com          16   9  327m     -0.184      0.144  -1618us   799us
    ntp1.vmar.se                6   3   21m     +0.117      0.074   +228us    10us
    time.cloudflare.com         6   5   86m     +0.014      0.214  +2040ns    94us
  5. Verify that the Leap status value of chronyc tracking is Normal:

    chronyc tracking
    
    Reference ID : A29FC87B (time.cloudflare.com)
    Stratum : 4
    Ref time (UTC) : Mon Jul 20 15:42:57 2020
    System time : 0.000000003 seconds slow of NTP time
    Last offset : +0.000344024 seconds
    RMS offset : 0.000172783 seconds
    Frequency : 2.265 ppm slow
    Residual freq : +0.149 ppm
    Skew : 0.124 ppm
    Root delay : 0.003362593 seconds
    Root dispersion : 0.000759320 seconds
    Update interval : 1031.1 seconds
    Leap status : Normal
  6. Retry the installation by running init.sh.
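
As a convenience, the following sketch bundles the checks above into one pass. It assumes a RHEL/CentOS-style host with systemd, chrony, and (optionally) ntpd installed; run it as root on the affected node:

# 1. chronyd should be enabled, ntpd should not be running
systemctl is-enabled chronyd          # expect: enabled
systemctl is-active ntpd              # expect: inactive (or the unit may not exist)

# 2. chrony should be syncing with reachable sources
chronyc sourcestats -v

# 3. Leap status should be Normal
chronyc tracking | grep 'Leap status'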

Troubleshoot Cluster Issues

When filing a support case, the support team might ask you to run one or both of the following commands to generate debugging information:

  • The rtfctl report command to generate an archive containing only Kubernetes objects and logs.

  • The rtfctl appliance report command to collect diagnostics from all cluster nodes.

The support team might also ask you to download information through Ops Center as described in Download Debug Info.

Troubleshoot Cluster Shrinking or Expanding Errors

If you add or remove a node from a Runtime Fabric cluster (expand or shrink the cluster), and then immediately try to add a new node, you might encounter a cluster status is "shrinking" or a cluster status is "expanding" error. If this occurs, wait four minutes or longer and then add the new node.

Troubleshoot Application Deployment Issues

Anypoint Monitoring Agent Issues

In rare situations, the Anypoint Monitoring agent might prevent an application from deploying. In these situations, you might see one of the following:

  • The application remains in the Deploying state, or

  • Error starting monitoring agent (code -1)

In this situation, redeploy your application and set the following custom property:

anypoint.platform.config.analytics.agent.enabled=false

The Anypoint Monitoring agent might also change the state of a deployed application. If you see one of the following:

  • The application transitions from Running to Applying, or

  • Monitoring agent has exited with code -1

This indicates that the agent is restarting. There should be no impact to the running application. Application metrics are queued and collected again after the agent restarts.

VMware VMXNET Generation 3 Issues

If you host your Runtime Fabric appliance on a VMware environment and are unable to deploy any applications, refer to the following MuleSoft knowledge base article: RTF Networking Issues with VMware and "[Runtime Fabric] java.net.UnknownHostException:" Errors.

Troubleshoot "CrashLoopBackOff - Couldn’t initialize inotify" Errors

If inotify values are too low for your environment, you might encounter the following error message:

[Kubernetes] Container "anypoint-monitoring" - CrashLoopBackOff - Couldn't initialize inotify

You might also see the following alerts in the gravity status command output:

        * rtf-worker-1 (172.31.3.5, worker_node)
            Status:		degraded
            [×]			Unable to initialize inotify (too many open files)
            Remote access:	online

Two Linux kernel values control inotify usage:

  • fs.inotify.max_user_watches

  • fs.inotify.max_user_instances

The fs.inotify.max_user_watches setting should be set to 1048576. The Runtime Fabric installer performs this configuration automatically, so the most likely cause is that another process or user changed this setting to a lower value.

To correct the setting, run the following command:

$ sysctl -w fs.inotify.max_user_watches=1048576

To make the change persist across node reboots, add the setting to a file in the /etc/sysctl.d directory, for example:

$ cat /etc/sysctl.d/inotify.conf
fs.inotify.max_user_watches=1048576

Most OS distributions set fs.inotify.max_user_instances to 128 by default. This value is too low for worker nodes running many application deployments and replicas. If you encounter this error, increase the value of fs.inotify.max_user_instances by following the same steps, as in the sketch below.
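
As a sketch, both settings can be raised and persisted together. The fs.inotify.max_user_instances value of 512 shown below is an illustrative assumption rather than a documented requirement; confirm an appropriate value for your workload with your system administrator:

# Apply immediately (max_user_watches per the documented requirement; max_user_instances is an example value)
sudo sysctl -w fs.inotify.max_user_watches=1048576
sudo sysctl -w fs.inotify.max_user_instances=512

# Persist the settings across reboots and reload them
printf 'fs.inotify.max_user_watches=1048576\nfs.inotify.max_user_instances=512\n' | sudo tee /etc/sysctl.d/inotify.conf
sudo sysctl --system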

In either case, review this situation with your system administrator to make sure that these values are appropriate for your usage and are not reduced in the future.

Troubleshoot Application Runtime Issues

If any of the following Runtime Fabric alert messages are reported, you might need to recover one or more controller nodes.

  • Management plane is unreachable

  • InfluxDB is down or no connection between Kapacitor and InfluxDB

  • Node is down

  • CRITICAL / Kubernetes node is not ready: <ip_address>

  • CRITICAL / etcd: cluster is unhealthy

Open a terminal and run the gravity status command to obtain the health status of the cluster as well as individual components.

To recover a node, follow the instructions provided in Add or Remove a Node from a Runtime Fabric.

Troubleshoot Environment Variable Issues

This step detects the variables that the installation process needs. How these variables are provided to the installer depends on where the installation is running.

  • For AWS, the variables are set in the Terraform script and written to a file located at /opt/anypoint/runtimefabric/env.

  • For Azure, the variables are set when running the ARM template, and are retrieved as tags on the Virtual Machine instances.

  • For manual installations, you create a file containing the values at /opt/anypoint/runtimefabric/env.

After these properties are retrieved, the installer connects to Anypoint Platform and retrieves additional values based on the RTF_ACTIVATION_DATA value.
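
For reference, the env file for a manual installation is a plain list of KEY=value pairs. The sketch below is illustrative only: the device paths are examples, the activation data placeholder must be replaced with the value from Runtime Manager, and your installation may require additional variables:

# /opt/anypoint/runtimefabric/env (illustrative, not exhaustive)
RTF_ACTIVATION_DATA=<activation data copied from Runtime Manager>
RTF_DOCKER_DEVICE=/dev/xvdb    # example block device for Docker
RTF_ETCD_DEVICE=/dev/xvdc      # example block device for etcd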

Common Errors

The following error may occur if there is an issue with the activation data value or if the instance cannot reach the internet:

curl: (7) Failed connect to anypoint.mulesoft.com:443; Operation now in progress
Error: Failed to fetch registration properties. (000). Please check your token is valid


============ Error ============
Exit code: 1
Line:

If this error is observed, try the following:

  • Ensure your instance has outbound internet connectivity. A simple way to validate this is to run the following command and verify that a 301 response is returned (a status-code-only variant is sketched after this list):

    curl https://anypoint.mulesoft.com
  • Retry the installation procedure in case network connectivity had not finished initializing.

    • On Azure, run /opt/anypoint/runtimefabric/script.sh foreground

    • On AWS and manual installations, run /opt/anypoint/runtimefabric/init.sh foreground

  • Validate that the activation data value is correct by comparing it with the value for the Runtime Fabric created in Anypoint Runtime Manager.
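
If you only want the HTTP status code from the connectivity check above, a minimal variant is:

# Prints only the HTTP status code; expect 301
curl -s -o /dev/null -w "%{http_code}\n" https://anypoint.mulesoft.com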

If you are still encountering issues, file a support ticket for further assistance.

Resume a Failed Installation

You can resume an installation at the point where it failed by running the init script:

  • AWS and manual installations:

    /opt/anypoint/runtimefabric/init.sh
  • Azure installations:

    /opt/anypoint/runtimefabric/script.sh

Troubleshoot Install Package Issues

This step installs the required packages on the instance by downloading them from the yum package repository.

Common Errors

If there is a failure on this step, verify the following:

  • Ensure your instance has outbound internet connectivity. A simple way to validate is to run the following command and verify a 301 response is returned:

    curl https://anypoint.mulesoft.com
  • If running a manual installation, ensure the init.sh script is run with root privileges:

    sudo ./init.sh foreground
  • Manually install one of the required packages to determine whether it succeeds outside of the installation script:

    sudo yum install -y chrony
    • If not successful, work with your operations team for assistance. You may need to ask for elevated access to the instance.

If manually installing a package succeeds but the installer still fails, or if you still require assistance, file a support ticket.

Format and Mount Disks

This step performs the following tasks on the block devices or disks provided with the RTF_DOCKER_DEVICE and/or RTF_ETCD_DEVICE variables. A verification sketch follows this list.

  1. Performs a check to confirm the values map to block devices available on the instance.

  2. Unmounts the disks in case they were previously mounted.

  3. Formats the disks.

  4. Adds a mount entry to the /etc/fstab file.

  5. Creates directories based upon the values in $DOCKER_MOUNT and/or $ETCD_MOUNT.

  6. Mounts the disks to the expected directories created above.
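
The following sketch shows non-destructive checks you can run to confirm the outcome of this step. It is an illustration that assumes the variables used by the installer are set in your shell; it is not the installer's exact commands:

lsblk "$RTF_DOCKER_DEVICE" "$RTF_ETCD_DEVICE"              # values should map to block devices
grep -E "$RTF_DOCKER_DEVICE|$RTF_ETCD_DEVICE" /etc/fstab   # mount entries added by this step
mount | grep -E "$DOCKER_MOUNT|$ETCD_MOUNT"                # disks mounted at the expected directories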

Install RTF Components

This step connects to Anypoint Platform to download and install the Runtime Fabric components on the cluster.

In some cases, this step returns an error if the deployment fails to complete within the time limit:

...
[OK]
Installing Runtime Fabric Agent. This may take several minutes...
configmap "grafana-dashboards" deleted
configmap "kapacitor-alerts" deleted
Release "runtime-fabric" does not exist. Installing it now.
The following deployments failed to become ready within the time limit:
monitor ---
Name:           monitor-79c7564b77-wb9c6
Namespace:      rtf
Node:           10.165.12.87/10.165.12.87
Start Time:     Thu, 13 Dec 2018 20:23:59 +0000
Labels:         app=monitor
                pod-template-hash=3573120633
Annotations:    checksum/config=4c4aac48d9cc8b24828b38ba0eb587840bc17b2449a54d593f74e2d58e5c12ae
                kubernetes.io/psp=privileged
                seccomp.security.alpha.kubernetes.io/pod=docker/default
Status:         Running
IP:             10.244.82.17
Controlled By:  ReplicaSet/monitor-79c7564b77
Containers:
...
<< More information displayed that describes the deployment manifest and stack trace >>

If this error is observed:

  • Verify that outbound TCP port 5672 is open to the Internet. Connections must be allowed from the controller VMs running in your internal network to the required Anypoint control plane hostnames on the Internet.

  • A TCP proxy may be needed to establish a connection to the Internet. Check with your network team to verify and configure if needed. Refer to Anypoint Runtime Fabric Installation Prerequisites.

Troubleshoot Expired Client Certificates

If your cluster uses Mutual Authentication with Anypoint Platform, a cron job called certificate-renewal in the rtf namespace periodically renews the client certificate before it expires. Unexpected circumstances can cause the certificate to expire without renewal. In such cases, the certificate-renewal job logs the following error:

time="2020-03-11T12:42:05Z" level=warning msg="{\"timestamp\":\"2020-03-11T12:42:05.035+00:00\",\"status\":409,\"error\":\"Conflict\",\"message\":\"Certificate was already exipired, please renew cert!\",\"path\":\"/api/v1/organizations/4d70ccf6f0ce/agents/74f0df29/renew\"}"

To renew the certificate, use the rtfctl appliance renew-expired-client-cert command.
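
Assuming rtfctl is installed in /opt/anypoint/runtimefabric, as in the earlier examples in this topic, the invocation looks like this:

cd /opt/anypoint/runtimefabric
sudo ./rtfctl appliance renew-expired-client-cert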

Troubleshoot Inconsistencies After Backing Up and Restoring Runtime Fabric

After restoring a cluster from a backup, you may see some inconsistencies in Runtime Manager, for example:

  • An application you deleted shows as running

  • A recently deployed application is listed with 0 replicas

  • A deployment’s replica count does not match the number of replicas you specified in Runtime Manager

In each of these circumstances, verify that the deployment's status appears as Applying and that a corresponding API call shows that the status and desiredState fields do not match. See The Back Up and Restore Process Does Not Affect the Control Plane for an explanation of the root cause.

To resolve these conflicts, you must manually apply the necessary changes to the cluster so that it matches the expected status. You cannot automatically re-sync these changes or override them with new changes. The current status of the cluster must match the expected values before you can apply any further changes from the control plane. If you need to review changes you applied after creating a backup, use audit logs.

Troubleshoot Ops Center Monitoring and Logs Issues

If Ops Center monitoring and logging fails to restart after restarting one or more nodes, ensure port forwarding rules are applied on all VMs so that traffic can communicate with the Kubernetes pods running on the VMs. Refer to Enable Forwarding When Using firewalld for additional information.

Troubleshoot Ops Center Metric Dashboard Missing Applications

Scenario: Ops Center metric dashboards have the following symptoms:

  1. Metric data shows frequent and unexpected spikes

  2. Applications are missing from the drop-down menu

  3. InfluxDB logs show a max-series-per-database limit exceeded error.

    To get the <influxdb_pod_name>, run: kubectl -n monitoring get pod -lcomponent=influxdb

This scenario can occur when the number of series reaches the preconfigured limit of 1 million.

To resolve this issue, increase the series limit to 10 million or a larger number.

  1. Edit the config map:

    kubectl edit cm influxdb -n monitoring
  2. Modify the max-series-per-database value to 10 million:

    max-series-per-database = 10000000
  3. Save the changes.

  4. Restart the InfluxDB pod to apply the changes:

    kubectl -n monitoring delete pod <influxdb_pod_name>

To delete the database series and modify what is logged, follow the steps in RTF - Error Reporting Heap Metrics: InfluxDBException: partial write: max-series-per-database limit exceeded.

Troubleshoot Agent Version Mismatch

Scenario: After a Runtime Fabric upgrade, you encounter the following error status in Runtime Manager: Degraded : Runtime Fabric agent version mismatch detected from control plane. This error is caused by a discrepancy between the Runtime Fabric agent version stored on Anypoint Platform for that cluster and the version reported by the agent status probe.

Common causes of this issue include:

  • Failed or rolled back Runtime Fabric agent upgrades

  • Restoration of pre-upgrade backups

  • Disabled mutual authentication

Failed Agent Upgrades

In this situation, a Runtime Fabric agent upgrade fails and gets rolled back, but the desired (upgraded) version is listed in Anypoint Platform. To correct this issue, apply the procedure described in Degraded Status as Runtime Fabric Agent Version Mismatch Detected.

Restoration of Pre-Upgrade Backups

If you restore a backup that was taken on an earlier Runtime Fabric agent version than the latest version applied from the control plane for that cluster, you must repeat the upgrade procedure. For more information, refer to Changes Made After Performing a Backup Are Not Restored.

Disabled Mutual Authentication

Runtime Fabric versions 1.11 and later require mutual authentication. If you upgrade from a version before 1.11 to 1.11 or later, you might encounter this error message if mutual authentication is not enabled. To fix this, enable mutual authentication and run the upgrade again.

Troubleshoot Agent Pod

If one or more Runtime Fabric agent pod containers do not start, check their logs and look for an error message matching the following scenario, in which the metrics server is not running:

If the rtfd container crashes on every start attempt, it might log the following error message:

2023/02/23 16:35:20 instantiating client: fetching server resource preferences: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

This error can occur when the Kubernetes metrics server is installed but not running. Runtime Fabric appliance versions 1.1.1636064094-8b70d2d and later include the metrics server as a deployment in the monitoring namespace. Investigate the events and logs for this deployment and its pods for further troubleshooting. After the metrics server is up and running, the rtfd container starts as expected.
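
A sketch of those checks follows. The deployment name metrics-server is an assumption; confirm the actual name with the first command before running the others:

# Find the metrics server deployment in the monitoring namespace
kubectl -n monitoring get deployments

# Inspect its events and pod logs (deployment name assumed to be metrics-server)
kubectl -n monitoring describe deployment metrics-server
kubectl -n monitoring logs deployment/metrics-server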

Contact the MuleSoft Support team with a full appliance report if you require assistance to restore the metrics server.

Troubleshoot Missing Email Alert Customizations

Scenario: After an appliance upgrade, you lose customizations made to your email alerts.

This is a limitation in Runtime Fabric on VMs / Bare Metal, as documented in Before You Upgrade the Runtime Fabric Appliance. The suggested workaround is to back up any customizations before you run the upgrade and then reapply them after the upgrade completes.

Troubleshoot External Log Forwarding

Log Aggregators Must Use TLS v1.2

Runtime Fabric external log forwarding libraries do not support TLS v1.3.

If your log aggregator does not use TLS v1.2, your logs won’t be forwarded, and you’ll see warnings on your external log forwarder pods, similar to:

[ warn] [engine] failed to flush chunk '8-1652967809.949249428.flb', retry in 10 seconds: task_id=5, input=tail.0 > output=es.0 (out_id=0)
[ warn] [engine] chunk '8-1652967799.665591977.flb' cannot be retried: task_id=1, input=tail.0 > output=es.0

On your log aggregator, you may see an error message similar to:

[WARN ][o.e.h.AbstractHttpServerTransport] [] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=0.0.0.0/0.0.0.0:9200, remoteAddress=/*.*.*.*:1787}

io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Client requested protocol TLSv1.2 is not enabled or supported in server context

...

Caused by: javax.net.ssl.SSLHandshakeException: Client requested protocol TLSv1.2 is not enabled or supported in server context

In such cases, either enable TLS v1.2 on your log aggregator, or configure your networking infrastructure so that an intermediate node terminates the TLS v1.2 session and creates a new TLS v1.3 session to the final destination.
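
To check whether your log aggregator accepts a TLS v1.2 handshake, one quick probe is openssl s_client; the host and port below are placeholders for your aggregator's endpoint:

# A successful handshake prints the server certificate chain; a handshake failure suggests TLS v1.2 is not enabled
openssl s_client -connect <log-aggregator-host>:<port> -tls1_2 </dev/null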