How to decode VMware PSOD (Purple Screen of Death) crashes?

If you have ever lost a host to a PSOD (Purple Screen of Death), you know that the screen may as well be written in Greek. Sure, you used to think that reading a Windows BSOD was a pain, but the VMware PSOD takes this to a new level.

In the past, working out on your own what caused your host to crash was a big task. Most folks needed to call VMware support if the host did not come back up afterwards. VMware has since written a nice document that breaks down the many sections of the PSOD screen. This will give you some insight into what might be going on with your ESX host.

1. Troubleshooting a VMFS resource volume that is corrupted

The event indicates the reported VMFS volume is corrupted.
Example
If 4976b16c-bd394790-6fd8-00215aaf0626 represents the UUID and san-lun-100 represents the associated
volume label, you see:
For Event: vmfs.lock.corruptondisk
Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock
detected at offset 0
For Event: vmfs.resource.corruptondisk
Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Resource cluster
metadata corruption detected
Impact
The scope of the corruption may vary. It might affect just one file or corrupt the whole volume. Do not use
the affected VMFS any longer.
Solution
To recover from this issue:
Back up all data on the volume.
Run the following command to save the VMFS3 metadata region and provide it to VMware customer support:
dd if=/vmfs/devices/disks/<disk> of=/root/dump bs=1M count=1200 conv=notrunc

where <disk> is the partition that contains the volume. If you have a spanned volume, <disk> is the head
partition.
This provides information on the extent of the volume corruption and can assist in recovering the volumes.

2. VMFS Lock Volume is Corrupted

Details
You may observe the following events in the /var/log/vmkernel logs on your VMware ESX host:
Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock
detected at offset 0
Note: In this example 4976b16c-bd394790-6fd8-00215aaf0626 represents the UUID of the VMFS datastore
and san-lun-100 represents the name of the VMFS datastore.
You may also observe the following events in the /var/log/vmkernel logs on your VMware ESX host:
Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Resource cluster
metadata corruption detected
Note: In this example 4976b16c-bd394790-6fd8-00215aaf0626 represents the UUID of the VMFS datastore
and san-lun-100 represents the name of the VMFS datastore.
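
As a rough sketch of how you might pull the affected UUID and volume label out of the log, the function below filters log text for the "may be damaged on disk" message shown above. The function name find_damaged_volumes is made up for this example, and it assumes each event sits on a single log line in the same wording as the samples in this article.

```shell
#!/bin/sh
# Hypothetical helper: print "UUID label" once for every volume that a
# "may be damaged on disk" message mentions. Log text is read from stdin.
# Assumes each event is on one line, worded like the samples above.
find_damaged_volumes() {
    sed -n 's/.*Volume \([0-9a-f-]*\) (\([^)]*\)) may be damaged on disk.*/\1 \2/p' | sort -u
}

# Example using the sample message from this article.
printf '%s\n' 'Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock detected at offset 0' \
    | find_damaged_volumes
```

On a host you would feed it the real log instead, for example: find_damaged_volumes < /var/log/vmkernel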
Solution
The events indicate that the reported VMFS volume is corrupt. The scope and the cause of the corruption may
vary. The corruption may affect just one file or the entire volume.
Create a new datastore and restore any information that may have been compromised to the new datastore
from existing backups. Do not use the corrupt VMFS datastore any longer.
Note: If some information is still accessible on the datastore that is reportedly corrupt, you may attempt to
migrate the information off of the datastore with the use of the vCenter migrate feature, vmkfstools, or the
datastore browser. If you are able to migrate any information off of the corrupt datastore, validate the
information to ensure that it has not been affected by the corruption.

Determining the cause of the corruption
If you would like assistance in determining the cause of the corruption, VMware technical support can provide
assistance in a best effort capacity.
To collect the appropriate information to diagnose the issue:
Note: More information about support service terms and conditions can be found here.
Log in to the service console as root.
Find the partition that contains the volume. In the case of a spanned volume, this is the head partition. Run
the following command to find the value of the partition:
vmkfstools -P /vmfs/volumes/<volumeUUID>
For example, run the following command to find the partition for 4976b16c-bd394790-6fd8-00215aaf0626:
# vmkfstools -P /vmfs/volumes/4976b16c-bd394790-6fd8-00215aaf0626
File system label (if any): san-lun-1000
Mode: public
Capacity 80262201344 (76544 file blocks * 1048576), 36768317440 (35065 blocks) avail
UUID: 49767b15-1f252bd1-1e57-00215aaf0626
Partitions spanned (on "lvm"): naa.60060160b4111600826120bae2e3dd11:1
Make note of the first device listed in the output for the Partitions spanned list. This is the value for the
partition. In the above example, the first device is:

naa.60060160b4111600826120bae2e3dd11:1
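
If you prefer not to copy the value by hand, the following is a minimal sketch that picks the first entry of the "Partitions spanned" list out of vmkfstools -P output. The function name first_partition is hypothetical. It assumes the list either follows the colon on the same line, as in the output above, or starts on the next non-empty line; check it against the output of your own ESX version.

```shell
#!/bin/sh
# Hypothetical helper: read `vmkfstools -P` output on stdin and print the
# first device:partition entry of the "Partitions spanned" list.
first_partition() {
    awk '
        /^Partitions spanned/ {
            sub(/^[^:]*:[ \t]*/, "")          # drop the label up to the colon
            if ($0 != "") { print $1; exit }  # entry is on the same line
            found = 1; next                   # otherwise look at later lines
        }
        found && NF { print $1; exit }
    '
}

# Example using the last line of the sample output above.
printf '%s\n' 'Partitions spanned (on "lvm"): naa.60060160b4111600826120bae2e3dd11:1' \
    | first_partition
```

On the service console the input would come from the command itself, for example: vmkfstools -P /vmfs/volumes/<volumeUUID> | first_partition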
Using the partition value noted above, run the following command to save the VMFS3 metadata region and provide it to
VMware customer support:

dd if=/vmfs/devices/disks/<disk:partition> of=/root/dump bs=1M count=1200 conv=notrunc

Note: The variable <disk:partition> is the partition value noted above.
Caution: The resulting file is approximately 1200 MB in size. Ensure that you have adequate space on the
destination. The destination in the above example is the /root/ folder. To compress the file, you can use an
open source utility called gzip. The following is an example of the command:
# gzip /root/dump
Note: For more information on the gzip utility, type man gzip at the console.
Create a new support request. For more information, see How to Submit a Support Request. Upload the
resulting file along with a full support bundle to VMware technical support.
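
The caution about the roughly 1200 MB dump file can be checked before you run dd. This is only a sketch: the function name has_free_mb and the DEST variable are invented for the example, and it relies on df -Pk printing the available space in 1 KB blocks in the fourth column.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the filesystem that holds directory $1
# has at least $2 MB available, according to `df -Pk`.
has_free_mb() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 )) ]
}

# DEST is an assumption for this sketch; the article writes the dump to /root.
DEST=${DEST:-/root}
if has_free_mb "$DEST" 1200; then
    echo "Enough space in $DEST for the 1200 MB dump"
else
    echo "Not enough space in $DEST; pick another destination" >&2
fi
```

If the check fails, point the of= argument of the dd command at a directory on a filesystem with more room.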

3. Troubleshooting virtual machine performance issues

Symptoms
The guest operating system boots slowly
Applications running in virtual machines perform poorly
Applications running in virtual machines take a long time to launch
Applications running in virtual machines frequently become unresponsive
Multi-user services have long transaction times or can handle fewer simultaneous users than expected

Purpose
This article discusses identifying and resolving various issues that affect the performance of virtual machines
running on VMware hosted products.

Resolution
Validate that each troubleshooting step below is true for your environment. Each step provides instructions,
or a link to a document, for validating the step and taking corrective action as necessary. The steps are
ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Do not
skip a step.
1. Verify that the reduced performance is unexpected behavior. When a workload is virtualized it is common to
see some performance reduction due to virtualization overhead. Troubleshoot a performance problem if you
experience the following conditions:

   The virtual machine was previously working at acceptable performance levels but has since degraded
   The virtual machine performs significantly slower than a similar setup on a physical computer
   You want to optimize your virtual machines for the best performance possible

2. Verify that you are running the most recent version of the VMware product being used. For download
information, see the VMware Download Center.

3. Check that VMware Tools is installed in the virtual machine and running the correct version. The version listed
in the toolbox application must match the version of the product hosting the virtual machine. To access the
toolbox, double-click the VMware icon in the notification area on the task bar, or run vmware-toolbox in
Linux. Some VMware products indicate when the version does not match by displaying a message below the
console view. For more information on installing VMware Tools

4. Review the virtual machine's virtual hardware settings and verify that you have provided enough resources
to the virtual machine, including memory and CPU resources. Use the average hardware requirements
typically used in a physical machine for that operating system as a guide. Adjust the settings to
factor in the application load: higher for larger loads such as databases or multi-user services,
and lower for less intense usage such as casual single-user applications like e-mail or web clients.

5. Ensure that any antivirus software installed on the host is configured to exclude the virtual machine files
from active scanning. Install antivirus software inside the virtual machine for proper virus protection. For
more information, see Investigating busy hosted virtual machine files.
