Troubleshooting Guide for Runtime Fabric

This topic describes common errors and steps to resolve them when installing Anypoint Runtime Fabric on VMs / Bare Metal.

Obtain a Full Network Assessment

Run the following command for an overall health assessment of the network:

/opt/anypoint/runtimefabric/rtfctl status

Troubleshoot Network Connectivity Using rtfctl

Every Anypoint Runtime Fabric cluster requires connectivity with Anypoint control plane, and any interference with connectivity can limit functionality, resulting in application deployment failures or degraded status in Anypoint Runtime Manager.

You can use rtfctl to verify that Runtime Fabric has the required outbound connectivity as well as troubleshoot connectivity issues.

Verify Outbound Connectivity

On each node, follow the instructions in Install rtfctl to install rtfctl. Run the following command in all controller and worker nodes on the cluster to verify the required outbound connectivity:

sudo ./rtfctl test outbound-network

Sample output:

[root@rtf-controller-1 runtimefabric]# sudo ./rtfctl test outbound-network
Using proxy configuration from Runtime Fabric (proxy "", no proxy "")

Using 'US' region
transport-layer.prod.cloudhub.io:443 ✔
https://anypoint.mulesoft.com ✔
https://worker-cloud-helm-prod.s3.amazonaws.com ✔
https://exchange2-asset-manager-kprod.s3.amazonaws.com ✔
https://ecr.us-east-1.amazonaws.com ✔
https://494141260463.dkr.ecr.us-east-1.amazonaws.com ✔
https://prod-us-east-1-starport-layer-bucket.s3.amazonaws.com ✔
https://runtime-fabric.s3.amazonaws.com ✔
tcp://dias-ingestor-nginx.prod.cloudhub.io:443 ✔

If you have outbound connectivity issues that prevent Runtime Fabric from reaching any of the required Anypoint control plane services, work with your network team to verify that you have added the required port IPs and hostnames to the allowlist as described in Port IP Addresses and Hostnames to Add to the Allowlist.

Troubleshoot Chrony Synchronization Check Failure

Runtime Fabric requires chrony for its time synchronization daemon. If chrony is not found by the init.sh installation script, Runtime Fabric installation fails with the following error:

============================================================================================
1 / 10: Install required packages
================================================
chrony-3.4-1.el7.x86_64
Checking chrony sync status...Retrying in 30 seconds...
Retrying in 30 seconds...
Error: chrony sync check failed 3 times, giving up.

***********************************************************
** Oh no! Your installation has stopped due to an error. **
***********************************************************
1. Visit the troubleshooting guide for help:
xref:runtime-fabric::troubleshoot-guide.adoc#troubleshoot-install-package-issues[Troubleshoot Install Package Issues].

2. Resume installation by running /opt/anypoint/runtimefabric/init.sh

Additional information: Error code: 1; Step: install_required_packages; Line: -;

Perform the following steps to fix this issue:

Verify that chrony is enabled:

systemctl enable chronyd
Verify that Network Time Protocol (NTP) is disabled and is not running:

systemctl stop ntpd; systemctl disable ntpd
Contact your network team to verify that the time servers in /etc/chrony.conf are reachable.

Verify that chrony is synced with sources:

chronyc sourcestats -v

Number of sources = 4
Name/IP Address            NP  NR  Span  Frequency  Freq Skew  Offset  Std Dev
==============================================================================
ns3.turbodns.co.uk         16  12   58m     -0.006      0.014   +370us    14us
test.diarizer.com          16   9  327m     -0.184      0.144  -1618us   799us
ntp1.vmar.se                6   3   21m     +0.117      0.074   +228us    10us
time.cloudflare.com         6   5   86m     +0.014      0.214  +2040ns    94us

Verify that the Leap status value of chronyc tracking is Normal:

chronyc tracking

Reference ID : A29FC87B (time.cloudflare.com)
Stratum : 4
Ref time (UTC) : Mon Jul 20 15:42:57 2020
System time : 0.000000003 seconds slow of NTP time
Last offset : +0.000344024 seconds
RMS offset : 0.000172783 seconds
Frequency : 2.265 ppm slow
Residual freq : +0.149 ppm
Skew : 0.124 ppm
Root delay : 0.003362593 seconds
Root dispersion : 0.000759320 seconds
Update interval : 1031.1 seconds
Leap status : Normal

Retry the installation by running init.sh.

Troubleshoot Cluster Issues

When filing a support case, the support team might ask you to run one or both of the following commands to generate debugging information:

The rtfctl report command to generate an archive containing only Kubernetes objects and logs.
The rtfctl appliance report command to collect diagnostics from all cluster nodes.

The support team might also ask you to download information through Ops Center as described in Download Debug Info.

Troubleshoot Application Deployment Issues

In rare situations, the Anypoint Monitoring agent might prevent an application from deploying. In these situations, you might see the following messages:

The application remains in the Deploying state, or
Error starting monitoring agent (code -1)

In this situation, redeploy your application and set the following custom property:

anypoint.platform.config.analytics.agent.enabled=false

The Anypoint Monitoring agent might also change the state of a deployed application. If you see one of the following:

The application transitions from Running to Applying, or
Monitoring agent has exited with code -1

This indicates that the agent is restarting. There should be no impact to the running application. Application metrics are queued, and are again collected after the agent restarts.

Troubleshoot Application Runtime Issues

If any of the following Runtime Fabric alert messages are reported, you might need to recover one or more controller nodes.

Management plane is unreachable

InfluxDB is down or no connection between Kapacitor and InfluxDB

Node is down

CRITICAL / Kubernetes node is not ready: <ip_address>

CRITICAL / etcd: cluster is unhealthy

Open a terminal and run the gravity status command to obtain the health status of the cluster as well as individual components.

To recover a node, follow the instructions provided in Add or Remove a Node from a Runtime Fabric.

Troubleshoot Environment Variable Issues

This step detects the variables which the installation process needs to carry out its procedures. The methods for providing these variables to the installation vary based upon where the installation is running.

For AWS, the variables are set in the terraform script, and outputted to a file located in /opt/anypoint/runtimefabric/env.
For Azure, the variables are set when running the ARM template, and are retrieved as tags on the Virtual Machine instances.
For manual installations, the user creates a file with the values located in /opt/anypoint/runtimefabric/env.

After these properites are retrieved, a procedure will run to connect to Anypoint Platform and retrieve additional values based upon the RTF_ACTIVATION_DATA value.

Common errors

The following error may occur if there is an issue with the activation data value, or if there is trouble reaching the internet on the instance:

curl: (7) Failed connect to anypoint.mulesoft.com:443; Operation now in progress
Error: Failed to fetch registration properties. (000). Please check your token is valid


============ Error ============
Exit code: 1
Line:

If this error is observed, try the following:

Ensure your instance has outbound internet connectivity. A simple way to validate is to run the following command and verify a 301 response is returned:
```
curl https://anypoint.mulesoft.com
```
Re-try running the installation procedure, in case the network connectivity was not finished initalizing.
- On Azure, the script should be located at /opt/anypoint/runtimefabric/script.sh foreground
- On AWS and manual installatons, the script should be located at /opt/anypoint/runtimefabric/init.sh foreground
Validate the activation data value is correct by comparing with the Runtime Fabric created in Anypoint Runtime Manager.

If you are still encountering issues, file a support ticket for further assistance.

Resume a Failed Installation

You can resume an installation at the point where it failed by running the init script:

AWS and manual installations:
```
/opt/anypoint/runtimefabric/init.sh
```
Azure installations:
```
/opt/anypoint/runtimefabric/script.sh
```

Troubleshoot Install Package Issues

This step will install required packages on the instance. It uses the yum package repository to download and install the required packages.

Common errors

If there is a failure on this step, verify the following:

Ensure your instance has outbound internet connectivity. A simple way to validate is to run the following command and verify a 301 response is returned:
```
curl https://anypoint.mulesoft.com
```
If running a manual installation, ensure the init.sh script is run with root privledges:
```
sudo ./init.sh foreground
```
Manually install one of the required packages to determine if it is successful outside of the installation script:
```
sudo yum install -y chrony
```
- If not successful, work with your operations team for assistance. You may need to ask for elevated access to the instance.

If manual installation of a package was successful, or if you still require assistance, file a support ticket.

Troubleshoot Ops Center Monitoring and Logs Issues

If Ops Center monitoring and logging fails to restart after restarting one or more nodes, ensure port forwarding rules are applied on all VMs so that traffic can communicate with the Kubernetes pods running on the VMs. Refer to Enable Forwarding When Using firewalld for additional information.

Format and Mount Disks

This step performs the following tasks on the block devices or disks provided with the RTF_DOCKER_DEVICE and/or RTF_ETCD_DEVICE variables:

Performs a check to confirm the values map to block devices available on the instance.
Unmounts the disks in case they were previously mounted.
Formats the disks.
Adds an mount entry in the /etc/fstab file.
Creates directories based upon the values in $DOCKER_MOUNT and/or $ETCD_MOUNT.
Mounts the disks to the expected directories created above.

Install RTF Components

This step connects to Anypoint Platform to download and install the Runtime Fabric components on the cluster.

In some cases, this step may return an error if the deployment failed to complete within the time limit:

...
[OK]
Installing Runtime Fabric Agent. This may take several minutes...
configmap "grafana-dashboards" deleted
configmap "kapacitor-alerts" deleted
Release "runtime-fabric" does not exist. Installing it now.
The following deployments failed to become ready within the time limit:
monitor ---
Name:           monitor-79c7564b77-wb9c6
Namespace:      rtf
Node:           10.165.12.87/10.165.12.87
Start Time:     Thu, 13 Dec 2018 20:23:59 +0000
Labels:         app=monitor
                pod-template-hash=3573120633
Annotations:    checksum/config=4c4aac48d9cc8b24828b38ba0eb587840bc17b2449a54d593f74e2d58e5c12ae
                kubernetes.io/psp=privileged
                seccomp.security.alpha.kubernetes.io/pod=docker/default
Status:         Running
IP:             10.244.82.17
Controlled By:  ReplicaSet/monitor-79c7564b77
Containers:
...
<< More information displayed that describes the deployment manifest and stack trace >>

If this error is observed:

Verify outbound TCP port 5672 is open to the Internet. Connections should be allowed from the controller VM(s) running in your internal network to this hostname on the Internet.
A TCP proxy may be needed to establish a connection to the Internet. Check with your network team to verify and configure if needed. Refer to Anypoint Runtime Fabric Installation Prerequisites.