Troubleshooting

How to decode VMware PSOD (Purple Screen of Death) crashes


If you have ever lost a host to a PSOD (Purple Screen of Death), you know that the screen may as well be written in Greek. You may have thought that reading a Windows BSOD was a pain, but the VMware PSOD takes this to a new level.
In the past, working out on your own what caused a host to crash was a big task, and most people needed to call VMware support if the host did not come back up afterwards. VMware has since written a nice document that breaks down the many sections of the PSOD screen, which gives you some insight into what might be going on with your ESX host.

1. Troubleshooting a VMFS resource volume that is corrupted

The event indicates the reported VMFS volume is corrupted. 
Example 
If 4976b16c-bd394790-6fd8-00215aaf0626 represents the UUID and san-lun-100 represents the associated 
volume label, you see: 
For Event: vmfs.lock.corruptondisk
Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock detected at offset 0
For Event: vmfs.resource.corruptondisk
Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Resource cluster metadata corruption detected
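A quick way to see which volumes a host is complaining about is to pull the UUID and label out of these messages. The following is a minimal sketch, not a VMware tool: find_damaged_volumes is a hypothetical helper name, and the pattern assumes the message wording matches the samples above.

```shell
#!/bin/sh
# Sketch: list VMFS volumes that log messages report as possibly damaged.
# find_damaged_volumes is a hypothetical helper name, not a VMware command.
# It reads log text on stdin and prints "UUID label" once per volume.
find_damaged_volumes() {
    sed -n 's/.*Volume \([0-9a-f-]*\) (\([^)]*\)) may be damaged on disk.*/\1 \2/p' | sort -u
}

# Demo on the two sample messages; both name the same volume,
# so only one line is printed.
printf '%s\n' \
    'Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock detected at offset 0' \
    'Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Resource cluster metadata corruption detected' \
    | find_damaged_volumes
```

On a host, the same function could be fed the vmkernel log instead of the sample text, for example find_damaged_volumes < /var/log/vmkernel.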
Impact 
The scope of the corruption may vary. It might affect just one file or corrupt the whole volume. Do not use 
the affected VMFS any longer. 
Solution 
To recover from this issue: 
Back up all data on the volume. 
Run the following command to save the VMFS3 metadata region and provide it to VMware customer support: 
dd if=/vmfs/devices/disks/<disk> of=/root/dump bs=1M count=1200 conv=notrunc

where <disk> is the partition that contains the volume. If you have a spanned volume, <disk> is the head 
partition. 
This provides information on the extent of the volume corruption and can assist in recovering the volumes. 
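Because bs=1M count=1200 writes 1200 MiB, it is worth confirming that the destination has room before running dd. The following is a rough sketch under that assumption: has_room_for_dump is a hypothetical helper name, and /root is simply the destination used in the command above.

```shell
#!/bin/sh
# Sketch: check free space on the destination before saving the metadata dump.
# bs=1M count=1200 writes 1200 MiB, which is 1200 * 1024 = 1228800 KiB.
DUMP_KIB=$((1200 * 1024))

# has_room_for_dump is a hypothetical helper name.
# $1 = destination directory, $2 = required KiB (defaults to DUMP_KIB).
# Returns success only if df reports at least that much available space.
has_room_for_dump() {
    need=${2:-$DUMP_KIB}
    # df -P prints one line per filesystem; column 4 is available KiB with -k.
    avail=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ -n "$avail" ] && [ "$avail" -ge "$need" ]
}

if has_room_for_dump /root; then
    echo "enough space in /root for the dump"
else
    echo "not enough space in /root (need ${DUMP_KIB} KiB)"
fi
```

The check only reports; it does not run dd, so it is safe to try on any machine.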

2. VMFS Lock Volume is Corrupted

Details 
You may observe the following event in the /var/log/vmkernel log on your VMware ESX host:
Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock 
detected at offset 0 
Note: In this example 4976b16c-bd394790-6fd8-00215aaf0626 represents the UUID of the VMFS datastore 
and san-lun-100 represents the name of the VMFS datastore. 
You may also observe the following event in the /var/log/vmkernel log on your VMware ESX host:
Resource cluster metadata corruption detected
Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk.
Solution 
The events indicate that the reported VMFS volume is corrupt. The scope and the cause of the corruption may 
vary. The corruption may affect just one file or the entire volume. 
Create a new datastore and restore any information that may have been compromised to the new datastore 
from existing backups. Do not use the corrupt VMFS datastore any longer. 
Note: If some information is still accessible on the datastore that is reportedly corrupt, you may attempt to 
migrate the information off of the datastore with the use of the vCenter migrate feature, vmkfstools, or the 
datastore browser. If you are able to migrate any information off of the corrupt datastore, validate the 
information to ensure that it has not been affected by the corruption. 

Determining the cause of the corruption 
If you would like assistance in determining the cause of the corruption, VMware technical support can provide 
assistance in a best effort capacity. 
Note: For more information, see VMware's support service terms and conditions.
To collect the appropriate information to diagnose the issue:
1. Log in to the service console as root.
2. Find the partition that contains the volume. In the case of a spanned volume, this is the head partition. Run
the following command to find the value of the partition:
vmkfstools -P /vmfs/volumes/<volumeUUID>
For example, run the following command to find the partition for 4976b16c-bd394790-6fd8-00215aaf0626:
# vmkfstools -P /vmfs/volumes/4976b16c-bd394790-6fd8-00215aaf0626
File system label (if any): san-lun-1000
Mode: public
Capacity 80262201344 (76544 file blocks * 1048576), 36768317440 (35065 blocks) avail
UUID: 49767b15-1f252bd1-1e57-00215aaf0626
Partitions spanned (on "lvm"): naa.60060160b4111600826120bae2e3dd11:1
3. Make note of the first device listed in the output for the Partitions spanned list. This is the value for the
partition. In the above example, the first device is:

naa.60060160b4111600826120bae2e3dd11:1
4. Using the value from step 3, run the following command to save the VMFS3 metadata region and provide it to
VMware customer support:

dd if=/vmfs/devices/disks/<disk:partition> of=/root/dump bs=1M count=1200 conv=notrunc

Note: The variable <disk:partition> is the value recorded in step 3.
Caution: The resulting file is approximately 1200 MB in size. Ensure that you have adequate space on the
destination. The destination in the above example is the /root/ folder. To compress the file, you can use an
open source utility called gzip. The following is an example of the command:
# gzip /root/dump
Note: For more information on the gzip utility, type man gzip at the console.
5. Create a new support request. For more information, see How to Submit a Support Request. Upload the
resulting file along with a full support bundle to VMware technical support.
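The lookup of the head partition and the dd command that uses it can be tied together in a small script. This is only a sketch: first_spanned_partition and sample_output are hypothetical helper names, and the pattern assumes the Partitions spanned line has the same layout as the sample output above (other ESX versions may print the device list differently).

```shell
#!/bin/sh
# first_spanned_partition is a hypothetical helper name.
# It reads `vmkfstools -P` output on stdin and prints the first device in the
# "Partitions spanned" list (the head partition for a spanned volume).
first_spanned_partition() {
    sed -n 's/^Partitions spanned[^:]*:[[:space:]]*\([^,[:space:]]*\).*/\1/p' | head -n 1
}

# Stand-in for real command output, copied from the example above.
sample_output() {
    cat <<'EOF'
UUID: 49767b15-1f252bd1-1e57-00215aaf0626
Partitions spanned (on "lvm"): naa.60060160b4111600826120bae2e3dd11:1
EOF
}

part=$(sample_output | first_spanned_partition)

# Print the dd command for review instead of running it.
echo "dd if=/vmfs/devices/disks/${part} of=/root/dump bs=1M count=1200 conv=notrunc"
```

On a host, sample_output would be replaced by the real vmkfstools -P /vmfs/volumes/<volumeUUID> call. Printing the command first gives you a chance to confirm the partition before anything is written.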

3. Troubleshooting virtual machine performance issues

Symptoms 
The guest operating system boots slowly 
Applications running in virtual machines perform poorly 
Applications running in virtual machines take a long time to launch 
Applications running in virtual machines frequently become unresponsive 
Multi-user services have long transaction times or can handle fewer simultaneous users than expected

Purpose 
This article discusses identifying and resolving various issues that affect the performance of virtual machines
running on VMware hosted products.


Resolution 
Validate that each troubleshooting step below is true for your environment. Each step provides instructions,
or a link to a document, for validating the step and taking corrective action as necessary. The steps are
ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not
skip a step.
1. Verify that the reduced performance is unexpected behavior. When a workload is virtualized it is common to
see some performance reduction due to virtualization overhead. Troubleshoot a performance problem if you
experience any of the following conditions:

The virtual machine was previously working at acceptable performance levels but has since degraded
The virtual machine performs significantly slower than a similar setup on a physical computer
You want to optimize your virtual machines for the best performance possible

2. Verify that you are running the most recent version of the VMware product being used. For download
information, see the VMware Download Center.

3. Check that VMware Tools is installed in the virtual machine and is running the correct version. The version listed
in the toolbox application must match the version of the product hosting the virtual machine. To access the
toolbox, double-click the VMware icon in the notification area on the task bar, or run vmware-toolbox in
Linux. Some VMware products indicate when the version does not match by displaying a message below the
console view. For more information on installing VMware Tools

4. Review the virtual machine's virtual hardware settings and verify that you have provided enough resources
to the virtual machine, including memory and CPU resources. Use the average hardware requirements
typically used in a physical machine for that operating system as a guide. Adjust the settings to factor in
the application load: higher for larger loads such as databases or multi-user services,
and lower for less intense usage such as casual single-user applications like e-mail or web clients.

5. Ensure that any antivirus software installed on the host is configured to exclude the virtual machine files 
from active scanning. Install antivirus software inside the virtual machine for proper virus protection. For 
more information, see Investigating busy hosted virtual machine files.
