Troubleshooting Runtime Fabric on VMs / Bare Metal
This topic describes common errors and steps to resolve them when installing and managing Anypoint Runtime Fabric on VMs / Bare Metal.
Obtain a Full Network Assessment
Run the following command for an overall health assessment of the network:
/opt/anypoint/runtimefabric/rtfctl status
Troubleshoot Network Connectivity Using rtfctl
Every Anypoint Runtime Fabric cluster requires connectivity with the Anypoint control plane. Any interference with that connectivity can limit functionality, resulting in application deployment failures or a degraded status in Anypoint Runtime Manager. You can use rtfctl to verify that Runtime Fabric has the required outbound connectivity and to troubleshoot connectivity issues.
Verify Outbound Connectivity
On each node, follow the instructions in Install rtfctl to install rtfctl.
Run the following command on all controller and worker nodes in the cluster to verify the required outbound connectivity:
sudo ./rtfctl test outbound-network
Sample output:
[root@rtf-controller-1 runtimefabric]# sudo ./rtfctl test outbound-network
Using proxy configuration from Runtime Fabric (proxy "", no proxy "")
Using 'US' region
transport-layer.prod.cloudhub.io:443 ✔
https://anypoint.mulesoft.com ✔
https://worker-cloud-helm-prod.s3.amazonaws.com ✔
https://exchange2-asset-manager-kprod.s3.amazonaws.com ✔
https://ecr.us-east-1.amazonaws.com ✔
https://494141260463.dkr.ecr.us-east-1.amazonaws.com ✔
https://prod-us-east-1-starport-layer-bucket.s3.amazonaws.com ✔
https://runtime-fabric.s3.amazonaws.com ✔
tcp://dias-ingestor-nginx.prod.cloudhub.io:443 ✔
If you have outbound connectivity issues that prevent Runtime Fabric from reaching any of the required Anypoint control plane services, work with your network team to verify that you have added the required ports, IP addresses, and hostnames to the allowlist as described in Port IP Addresses and Hostnames to Add to the Allowlist.
Troubleshoot an Unknown Certificate Authority Error
When running a network test to verify outbound connectivity, you might see an error that says x509: certificate signed by unknown authority. This error occurs when traffic is intercepted by a firewall or proxy using its own root CA, which Runtime Fabric does not support.
To verify the certificate issuer, run:
curl -v https://rtf-runtime-registry.kprod.msap.io
Ensure the result shows that Amazon is listed as the issuer.
Server certificate:
 subject: CN=*.prod.cloudhub.io
 start date: Nov 12 00:00:00 2021 GMT
 expire date: Dec 10 23:59:59 2022 GMT
 subjectAltName: host "rtf-runtime-registry.kprod.msap.io" matched cert's "*.kprod.msap.io"
 issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
SSL certificate verify ok.
If any other issuer is listed, traffic is being intercepted.
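As an additional check, you can print just the certificate issuer with openssl (assuming openssl is available on the node):
# Print only the issuer of the certificate presented by the registry endpoint
openssl s_client -connect rtf-runtime-registry.kprod.msap.io:443 \
  -servername rtf-runtime-registry.kprod.msap.io < /dev/null 2>/dev/null \
  | openssl x509 -noout -issuer
# An issuer other than Amazon indicates that traffic is being intercepted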
Troubleshoot Connectivity Issues on Anypoint Monitoring Sidecar Container for Application Pods
If you experience connectivity issues in the monitoring sidecar container of application deployment pods, review the networking port prerequisites and unblock all required ports. If you decide not to allow Anypoint Monitoring, the Anypoint Visualizer port 5044 might report errors even though functionality is not affected. For more details, see "Failed to connect to failover" ERROR logs in RTF application pod sidecar container.
Troubleshoot Chrony Synchronization Check Failure
Runtime Fabric requires chrony as its time synchronization daemon. If the init.sh installation script does not find chrony, Runtime Fabric installation fails with the following error:
============================================================================================
1 / 10: Install required packages
================================================
chrony-3.4-1.el7.x86_64
Checking chrony sync status...Retrying in 30 seconds...
Retrying in 30 seconds...
Error: chrony sync check failed 3 times, giving up.
***********************************************************
** Oh no! Your installation has stopped due to an error. **
***********************************************************
1. Visit the troubleshooting guide for help: Troubleshoot Install Package Issues.
2. Resume installation by running /opt/anypoint/runtimefabric/init.sh
Additional information: Error code: 1; Step: install_required_packages; Line: -;
Perform the following steps to fix this issue:
-
Verify that chrony is enabled:
systemctl enable chronyd
-
Verify that Network Time Protocol (NTP) is disabled and is not running:
systemctl stop ntpd; systemctl disable ntpd
-
Contact your network team to verify that the time servers in /etc/chrony.conf are reachable.
-
Verify that chrony is synced with sources:
chronyc sourcestats -v
Number of sources = 4
Name/IP Address            NP  NR  Span  Frequency  Freq Skew  Offset  Std Dev
==============================================================================
ns3.turbodns.co.uk         16  12   58m     -0.006      0.014   +370us    14us
test.diarizer.com          16   9  327m     -0.184      0.144  -1618us   799us
ntp1.vmar.se                6   3   21m     +0.117      0.074   +228us    10us
time.cloudflare.com         6   5   86m     +0.014      0.214  +2040ns    94us
-
Verify that the Leap status value of chronyc tracking is Normal:
chronyc tracking
Reference ID    : A29FC87B (time.cloudflare.com)
Stratum         : 4
Ref time (UTC)  : Mon Jul 20 15:42:57 2020
System time     : 0.000000003 seconds slow of NTP time
Last offset     : +0.000344024 seconds
RMS offset      : 0.000172783 seconds
Frequency       : 2.265 ppm slow
Residual freq   : +0.149 ppm
Skew            : 0.124 ppm
Root delay      : 0.003362593 seconds
Root dispersion : 0.000759320 seconds
Update interval : 1031.1 seconds
Leap status     : Normal
-
Retry the installation by running init.sh.
Troubleshoot Cluster Issues
When filing a support case, the support team might ask you to run one or both of the following commands to generate debugging information:
-
The rtfctl report command, which generates an archive containing only Kubernetes objects and logs.
-
The rtfctl appliance report command, which collects diagnostics from all cluster nodes.
The support team might also ask you to download information through Ops Center as described in Download Debug Info.
Troubleshoot Cluster Shrinking or Expanding Errors
If you add or remove a node from a Runtime Fabric cluster (expand or shrink the cluster) and then immediately try to add a new node, you might encounter a cluster status is "shrinking" or cluster status is "expanding" error. If this occurs, wait four minutes or longer, and then add the new node.
Troubleshoot Application Deployment Issues
Anypoint Monitoring Agent Issues
In rare situations, the Anypoint Monitoring agent might prevent an application from deploying. In these situations, you might see the following messages:
-
The application remains in the Deploying state, or
-
Error starting monitoring agent (code -1)
In this situation, redeploy your application and set the following custom property:
anypoint.platform.config.analytics.agent.enabled=false
The Anypoint Monitoring agent might also change the state of a deployed application. If you see one of the following:
-
The application transitions from Running to Applying, or
-
Monitoring agent has exited with code -1
This indicates that the agent is restarting. There should be no impact to the running application. Application metrics are queued and are collected again after the agent restarts.
VMware VMXNET Generation 3 Issues
If you host your Runtime Fabric appliance on a VMware environment and are unable to deploy any applications, refer to the following MuleSoft knowledge base article: RTF Networking Issues with VMware and "[Runtime Fabric] java.net.UnknownHostException:" Errors.
Troubleshoot "CrashLoopBackOff - Couldn’t initialize inotify" Errors
If inotify values are too low for your environment, you might encounter the following error message:
[Kubernetes] Container "anypoint-monitoring" - CrashLoopBackOff - Couldn't initialize inotify
You might also see the following alerts displayed in the gravity status command output:
* rtf-worker-1 (172.31.3.5, worker_node)
    Status:         degraded
        [×]         Unable to initialize inotify (too many open files)
    Remote access:  online
Two Linux kernel values control inotify usage:
-
inotify.max_user_watches
-
inotify.max_user_instances
The inotify.max_user_watches setting should be set to 1048576. The Runtime Fabric installer performs this configuration automatically, so the most likely cause is that another process or user changed the setting to a lower value.
To correct the setting, run the following command:
$ sysctl -w fs.inotify.max_user_watches=1048576
To make the change persistent so that it survives node reboots, set the value in a file inside the /etc/sysctl.d directory, for example:
$ cat /etc/sysctl.d/inotify.conf
fs.inotify.max_user_watches=1048576
Most OS distributions set inotify.max_user_instances to 128 by default. This value is too low for worker nodes with many application deployments and replicas. If you encounter this error, increase the value of inotify.max_user_instances by following the steps described above.
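For example, here is a minimal sketch that follows the same pattern as the max_user_watches fix above. The value 1024 is an illustrative choice rather than an official recommendation, and the file name reuses the inotify.conf example from this section:
# Apply a higher instance limit immediately (example value; confirm with your system administrator)
$ sysctl -w fs.inotify.max_user_instances=1024

# Persist the setting so it survives node reboots, then reload sysctl configuration
$ echo "fs.inotify.max_user_instances=1024" >> /etc/sysctl.d/inotify.conf
$ sysctl --system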
In either case, review this situation with your system administrator to make sure that these values are appropriate for your usage and are not reduced in the future.
Troubleshoot Application Runtime Issues
If any of the following Runtime Fabric alert messages are reported, you might need to recover one or more controller nodes.
Management plane is unreachable
InfluxDB is down or no connection between Kapacitor and InfluxDB
Node is down
CRITICAL / Kubernetes node is not ready: <ip_address>
CRITICAL / etcd: cluster is unhealthy
Open a terminal and run the gravity status command to obtain the health status of the cluster and of its individual components.
To recover a node, follow the instructions provided in Add or Remove a Node from a Runtime Fabric.
Troubleshoot Environment Variable Issues
This step detects the variables that the installation process needs to carry out its procedures. How these variables are provided to the installation depends on where the installation is running.
-
For AWS, the variables are set in the Terraform script and are output to a file located at /opt/anypoint/runtimefabric/env.
-
For Azure, the variables are set when running the ARM template, and are retrieved as tags on the Virtual Machine instances.
-
For manual installations, you create a file with the values at /opt/anypoint/runtimefabric/env.
After these properties are retrieved, a procedure runs to connect to Anypoint Platform and retrieve additional values based on the RTF_ACTIVATION_DATA value.
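For reference, the env file for a manual installation is a plain file of KEY=value pairs. The following is a hypothetical example limited to the variables mentioned in this guide; actual installations typically include additional values provided when the Runtime Fabric is created in Anypoint Runtime Manager, and the device paths shown are placeholders:
# /opt/anypoint/runtimefabric/env (illustrative values only)
RTF_ACTIVATION_DATA='<activation data copied from Anypoint Runtime Manager>'
RTF_DOCKER_DEVICE=/dev/xvdb   # block device dedicated to Docker
RTF_ETCD_DEVICE=/dev/xvdc     # block device dedicated to etcd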
Common Errors
The following error may occur if there is an issue with the activation data value or if the instance cannot reach the internet:
curl: (7) Failed connect to anypoint.mulesoft.com:443; Operation now in progress
Error: Failed to fetch registration properties. (000). Please check your token is valid
============ Error ============
Exit code: 1
Line:
If this error is observed, try the following:
-
Ensure your instance has outbound internet connectivity. A simple way to validate is to run the following command and verify that a 301 response is returned (see the example check after this list):
curl https://anypoint.mulesoft.com
-
Retry running the installation procedure, in case the network connectivity had not finished initializing:
-
On Azure, re-run the installation script:
/opt/anypoint/runtimefabric/script.sh foreground
-
On AWS and manual installations, re-run the installation script:
/opt/anypoint/runtimefabric/init.sh foreground
-
Validate that the activation data value is correct by comparing it with the value for the Runtime Fabric created in Anypoint Runtime Manager.
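To see the HTTP status code explicitly when checking connectivity, you can ask curl to print only the response code; a 301 indicates the request reached Anypoint Platform:
# Print only the HTTP status code returned by the control plane (expected: 301)
curl -s -o /dev/null -w '%{http_code}\n' https://anypoint.mulesoft.com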
If you are still encountering issues, file a support ticket for further assistance.
Resume a Failed Installation
You can resume an installation at the point where it failed by running the init script:
-
AWS and manual installations:
/opt/anypoint/runtimefabric/init.sh
-
Azure installations:
/opt/anypoint/runtimefabric/script.sh
Troubleshoot Install Package Issues
This step installs the required packages on the instance, using the yum package repository to download and install them.
Common Errors
If there is a failure on this step, verify the following:
-
Ensure your instance has outbound internet connectivity. A simple way to validate is to run the following command and verify a 301 response is returned:
curl https://anypoint.mulesoft.com
-
If running a manual installation, ensure the init.sh script is run with root privileges:
sudo ./init.sh foreground
-
Manually install one of the required packages to determine if it is successful outside of the installation script:
sudo yum install -y chrony
-
If not successful, work with your operations team for assistance. You may need to ask for elevated access to the instance.
-
If manual installation of a package was successful, or if you still require assistance, file a support ticket.
Format and Mount Disks
This step performs the following tasks on the block devices or disks provided with the RTF_DOCKER_DEVICE and/or RTF_ETCD_DEVICE variables (an illustrative sketch follows the list):
-
Performs a check to confirm the values map to block devices available on the instance.
-
Unmounts the disks in case they were previously mounted.
-
Formats the disks.
-
Adds a mount entry to the /etc/fstab file.
-
Creates directories based upon the values in $DOCKER_MOUNT and/or $ETCD_MOUNT.
-
Mounts the disks to the expected directories created above.
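The following shell sketch illustrates these tasks for a single device. It is not the installer script itself; the device path, filesystem type, and mount point are placeholder assumptions:
DEVICE=/dev/xvdb                  # example value of RTF_DOCKER_DEVICE
MOUNT_POINT=/var/lib/rtf-docker   # example value of DOCKER_MOUNT

lsblk "$DEVICE"                        # confirm the value maps to an available block device
umount "$DEVICE" 2>/dev/null || true   # unmount the disk in case it was previously mounted
mkfs.xfs -f "$DEVICE"                  # format the disk (filesystem type is an assumption)
echo "$DEVICE $MOUNT_POINT xfs defaults 0 0" >> /etc/fstab   # add a mount entry to /etc/fstab
mkdir -p "$MOUNT_POINT"                # create the directory for the mount
mount "$MOUNT_POINT"                   # mount the disk using the new fstab entry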
Install RTF Components
This step connects to Anypoint Platform to download and install the Runtime Fabric components on the cluster.
In some cases, this step may return an error if the deployment failed to complete within the time limit:
...
[OK] Installing Runtime Fabric Agent. This may take several minutes...
configmap "grafana-dashboards" deleted
configmap "kapacitor-alerts" deleted
Release "runtime-fabric" does not exist. Installing it now.
The following deployments failed to become ready within the time limit: monitor
---
Name:               monitor-79c7564b77-wb9c6
Namespace:          rtf
Node:               10.165.12.87/10.165.12.87
Start Time:         Thu, 13 Dec 2018 20:23:59 +0000
Labels:             app=monitor
                    pod-template-hash=3573120633
Annotations:        checksum/config=4c4aac48d9cc8b24828b38ba0eb587840bc17b2449a54d593f74e2d58e5c12ae
                    kubernetes.io/psp=privileged
                    seccomp.security.alpha.kubernetes.io/pod=docker/default
Status:             Running
IP:                 10.244.82.17
Controlled By:      ReplicaSet/monitor-79c7564b77
Containers:
...
<< More information displayed that describes the deployment manifest and stack trace >>
If this error is observed:
-
Verify outbound TCP port 5672 is open to the Internet. Connections should be allowed from the controller VM(s) running in your internal network to this hostname on the Internet (see the connection test after this list).
-
A TCP proxy may be needed to establish a connection to the Internet. Check with your network team to verify and configure if needed. Refer to Anypoint Runtime Fabric Installation Prerequisites.
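One quick way to test whether an outbound TCP connection on port 5672 can be established is with nc (netcat). The hostname is intentionally a placeholder; use the transport-layer hostname listed in the installation prerequisites for your control plane region:
# Attempt a TCP connection on port 5672 without sending data (-z), with verbose output (-v)
nc -vz <hostname> 5672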
Troubleshoot Expired Client Certificates
If your cluster uses mutual authentication with Anypoint Platform, a cron job called certificate-renewal in the rtf namespace periodically renews the client certificate before it expires. Unexpected circumstances may cause the certificate to expire without renewal. In such cases, the certificate-renewal job logs the following error:
time="2020-03-11T12:42:05Z" level=warning msg="{\"timestamp\":\"2020-03-11T12:42:05.035+00:00\",\"status\":409,\"error\":\"Conflict\",\"message\":\"Certificate was already exipired, please renew cert!\",\"path\":\"/api/v1/organizations/4d70ccf6f0ce/agents/74f0df29/renew\"}"
To renew the certificate, use the rtfctl appliance renew-expired-client-cert command. For more information, refer to:
Troubleshoot Inconsistencies After Backing Up and Restoring Runtime Fabric
After restoring a cluster from a backup, you may see some inconsistencies in Runtime Manager, for example:
-
An application you deleted shows as running
-
A recently deployed application is listed with 0 replicas
-
A deployment’s replica count does not match the number of replicas you specified in Runtime Manager
In each of these circumstances, verify that the deployment status appears as Applying and that a corresponding API call shows that the status and desiredState fields do not match. See The Back Up and Restore Process Does Not Affect the Control Plane for an explanation of the root cause.
To resolve these conflicts, you must manually apply the necessary changes to the cluster to match the expected status. You cannot automatically re-sync these changes or override them with new changes. The current status of the cluster must match the expected values before you can apply any further changes from the control plane. To review changes you applied after creating a backup, use audit logs.
Troubleshoot Ops Center Monitoring and Logs Issues
If Ops Center monitoring and logging fails to restart after restarting one or more nodes, ensure port forwarding rules are applied on all VMs so that traffic can reach the Kubernetes pods running on the VMs. Refer to Enable Forwarding When Using firewalld for additional information.
Troubleshoot Ops Center Metric Dashboard Missing Applications
Scenario: Ops Center metric dashboards have the following symptoms:
-
Metric data shows frequent and unexpected spikes
-
Applications are missing from the drop-down menu
-
InfluxDB logs show a max-series-per-database limit exceeded error. To get the influxdb_pod_name, run:
kubectl -n monitoring get pod -lcomponent=influxdb
This scenario can occur when the number of series hits the preconfigured limit of 1 million.
To resolve this issue, increase the series limit to 10 million or a larger number.
-
Edit the config map:
kubectl edit cm influxdb -n monitoring
-
Modify the max-series-per-database value to 10 million:
max-series-per-database = 10000000
-
Save the changes.
-
Restart the InfluxDB pod to apply the changes:
kubectl -n monitoring delete pod <influxdb_pod_name>
To delete the database series and modify what is logged, follow the steps in RTF - Error Reporting Heap Metrics: InfluxDBException: partial write: max-series-per-database limit exceeded.
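If you want to confirm how close the series count is to the configured limit, you can query the series cardinality from inside the InfluxDB pod. The database name is a placeholder; adjust it to the database used by your monitoring stack:
# Report the current series cardinality for a database (InfluxDB 1.x)
kubectl -n monitoring exec -it <influxdb_pod_name> -- influx -execute 'SHOW SERIES CARDINALITY ON <database>'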
Troubleshoot Agent Version Mismatch
Scenario: After a Runtime Fabric upgrade, you encounter the following error status in Runtime Manager: Degraded : Runtime Fabric agent version mismatch detected from control plane. This error is caused by a discrepancy between the Runtime Fabric agent version stored on Anypoint Platform for that cluster and the version reported by the agent status probe.
Common causes of this issue include:
-
Failed or rolled back Runtime Fabric agent upgrades
-
Restoration of pre-upgrade backups
-
Disabled mutual authentication
Failed Agent Upgrades
In this situation, a Runtime Fabric agent upgrade fails and gets rolled back, but the desired (upgraded) version is listed in Anypoint Platform. To correct this issue, apply the procedure described in Degraded Status as Runtime Fabric Agent Version Mismatch Detected.
Restoration of Pre-Upgrade Backups
If you restore a backup that was taken on an earlier Runtime Fabric agent version than the latest version applied from the control plane for that cluster, you must repeat the upgrade procedure. For more information, refer to Changes Made After Performing a Backup Are Not Restored.
Disabled Mutual Authentication
Runtime Fabric version 1.11 and later requires mutual authentication. If you upgrade from a version before 1.11 to 1.11 or later, you might encounter this error message if mutual authentication is not enabled. To fix this, enable mutual authentication and run the upgrade again.
Troubleshoot Agent Pod
If one or more Runtime Fabric agent pod containers do not start, check their logs and look for an error message matching the following metrics server not running troubleshooting scenario.
If the rtfd container crashes on every restart attempt, it can return the following error message:
2023/02/23 16:35:20 instantiating client: fetching server resource preferences: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
The error can occur because the Kubernetes metrics server is installed but is not running. Runtime Fabric appliance version 1.1.1636064094-8b70d2d and later include the metrics server as a deployment in the monitoring namespace. Investigate the events and logs for this deployment and its associated pods for further troubleshooting. After the metrics server is up and running, the rtfd container starts as expected.
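To inspect the metrics server, you can review its deployment, events, and logs in the monitoring namespace. The deployment name metrics-server is an assumption; list the deployments first if it differs in your cluster:
# List deployments in the monitoring namespace and locate the metrics server
kubectl -n monitoring get deployments

# Review the deployment's events and recent pod logs
kubectl -n monitoring describe deployment metrics-server
kubectl -n monitoring logs deployment/metrics-server --tail=100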
Contact the MuleSoft Support team with a full appliance report if you require assistance to restore the metrics server.
Troubleshoot Missing Email Alert Customizations
Scenario: After an appliance upgrade, you lose customizations made to your email alerts.
This is a limitation of Runtime Fabric on VMs / Bare Metal, as documented in Before You Upgrade the Runtime Fabric Appliance. The suggested workaround is to back up any customizations before you run the upgrade and then reapply them after the upgrade completes.
Troubleshoot External Log Forwarding
Log Aggregators Must Use TLS v1.2
Runtime Fabric external log forwarding libraries do not support TLS v1.3.
If your log aggregator does not use TLS v1.2, your logs won’t be forwarded, and you’ll see warnings on your external log forwarder pods, similar to:
[ warn] [engine] failed to flush chunk '8-1652967809.949249428.flb', retry in 10 seconds: task_id=5, input=tail.0 > output=es.0 (out_id=0)
[ warn] [engine] chunk '8-1652967799.665591977.flb' cannot be retried: task_id=1, input=tail.0 > output=es.0
On your log aggregator, you may see an error message similar to:
[WARN ][o.e.h.AbstractHttpServerTransport] [] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=0.0.0.0/0.0.0.0:9200, remoteAddress=/*.*.*.*:1787}
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Client requested protocol TLSv1.2 is not enabled or supported in server context
...
Caused by: javax.net.ssl.SSLHandshakeException: Client requested protocol TLSv1.2 is not enabled or supported in server context
In such cases, either enable TLS v1.2 on the log aggregator, or configure your networking infrastructure so that an intermediate node terminates the TLS v1.2 session and creates a new TLS v1.3 session to the final destination.
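To check whether your log aggregator accepts a TLS v1.2 handshake, you can run a quick test with openssl; the host and port are placeholders for your aggregator's endpoint:
# Force a TLS v1.2 handshake against the log aggregator; a failure here confirms TLS v1.2 is not enabled
openssl s_client -connect <aggregator-host>:<aggregator-port> -tls1_2 < /dev/null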