Hi,
today one of our machines showed the same problem. It became unresponsive and started using 100% vCPU.
Guest OS is SLES 11 SP1, Kernel 2.6.32.
I didn't find the problem that caused the crash, but I wanted to share what I have found out so far; maybe it is helpful:
I examined /var/log/messages and found several messages repeating over and over again until the crash:
Dec 28 08:54:36 vmguest01 kernel: [83801.848812] update_cpu_power: cpu_power = 83814858285515; scale_rt = 18428813490058355147
Dec 28 08:54:36 vmguest01 kernel: [83801.849402] update_cpu_power: cpu_power = 83814858881712; scale_rt = 18428813490058951344
Dec 28 08:54:36 vmguest01 kernel: [83801.851079] update_cpu_power: cpu_power = 83814860558860; scale_rt = 18428813490060628492
Dec 28 08:54:36 vmguest01 kernel: [83801.851757] update_cpu_power: cpu_power = 83814861225898; scale_rt = 18428813490061295530
Dec 28 08:54:36 vmguest01 kernel: [83801.852086] update_cpu_power: cpu_power = 83814861566880; scale_rt = 18428813490061636512
I went back to the first occurrence of such a line and found something interesting about 24 hours before the crash:
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849328] ------------[ cut here ]------------
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849345] WARNING: at /usr/src/packages/BUILD/kernel-default-2.6.32.36/linux-2.6.32/kernel/sched.c:3847 update_cpu_power+0x151/0x160()
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849348] Hardware name: VMware Virtual Platform
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849349] Modules linked in: iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables tun nfs lockd fscache nfs_acl auth_rpcgss sunrpc af_packet st ide_cd_mod joydev vsock(X) vmci(X) vmmemctl(X) acpiphp microcode fuse loop dm_mod rtc_cmos ppdev rtc_core parport_pc rtc_lib parport ac shpchp container pci_hotplug floppy intel_agp button tpm_tis tpm tpm_bios sg mptctl i2c_piix4 pcspkr e1000 i2c_core sr_mod cdrom sd_mod crc_t10dif edd ext3 mbcache jbd fan processor ide_pci_generic piix ide_core ata_generic ata_piix libata mptspi mptscsih mptbase scsi_transport_spi thermal thermal_sys hwmon vmw_pvscsi scsi_mod vmxnet3
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849405] Supported: Yes
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849407] Pid: 11077, comm: export2online-e Tainted: G X 2.6.32.36-0.5-default #1
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849408] Call Trace:
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849426] [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849436] [<ffffffff81397d88>] dump_stack+0x69/0x71
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849445] [<ffffffff8104d804>] warn_slowpath_common+0x74/0xd0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849453] [<ffffffff8103d771>] update_cpu_power+0x151/0x160
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849458] [<ffffffff8103e3b3>] find_busiest_group+0xa83/0xce0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849461] [<ffffffff810460dd>] load_balance_newidle+0xcd/0x380
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849465] [<ffffffff81398b7b>] thread_return+0x2a7/0x34c
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849468] [<ffffffff81399b9d>] do_nanosleep+0x8d/0xc0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849475] [<ffffffff81068768>] hrtimer_nanosleep+0xa8/0x140
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849483] [<ffffffff8108b0e6>] compat_sys_nanosleep+0xd6/0x150
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849488] [<ffffffff81033c19>] sysenter_dispatch+0x7/0x27
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849496] [<00000000ffffe430>] 0xffffe430
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849497] ---[ end trace d164ede915d60e57 ]---
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849498] update_cpu_power: cpu_power = 2149207707; scale_rt = 18428729677349277339
Dec 27 09:35:18 vmguest01 kernel: [18446744062.850151] update_cpu_power: cpu_power = 2150067177; scale_rt = 18428729677350136809
The most interesting part is the timestamp: 18446744062.850151.
I went through the Linux kernel sources, and this value appears to be the number of seconds since boot. That would mean the machine booted about 18 billion seconds ago, which is nearly 585 years. Internally the value seems to be handled in nanoseconds. Expressed in nanoseconds, it is about 1.84 * 10^19, which is close to the maximum value an unsigned 64-bit integer can hold (2^64 - 1). And a few lines later, it seems to have overflowed:
Dec 27 09:35:28 vmguest01 kernel: [18446744072.839931] update_cpu_power: cpu_power = 12139859978; scale_rt = 18428729687339929610
Dec 27 09:35:28 vmguest01 kernel: [18446744072.840340] update_cpu_power: cpu_power = 12140269170; scale_rt = 18428729687340338802
Dec 27 09:35:33 vmguest01 kernel: [ 4.123417] __ratelimit: 909 callbacks suppressed
Dec 27 09:35:33 vmguest01 kernel: [ 4.123420] update_cpu_power: cpu_power = 17132887892; scale_rt = 18428729692332957524
The values of cpu_power and scale_rt kept increasing monotonically throughout the log file. My guess is that the machine crashes when one of these values reaches a critical maximum and causes an overflow and/or negative values somewhere within the kernel.
Maybe the "high" timestamp value is caused by an adjustment to the guest OS's clock done by VMware? Maybe this adjustment causes the time to become a negative value (causing an underflow ending in a very high positive value)?
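To illustrate the underflow idea, this Python sketch reinterprets the first warning's timestamp as a signed 64-bit nanosecond value. It only shows that the numbers are consistent with a small negative time; it says nothing about what VMware or the kernel actually did. The helper as_signed_64 is made up for this example.

```python
def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    value &= 2**64 - 1
    return value - 2**64 if value >= 2**63 else value


# [18446744062.849328] from the first WARNING line, converted to nanoseconds
# (assumption: the counter behind the timestamp is 64 bits wide, in ns).
first_warning_ns = 18446744062849328 * 1000

signed_ns = as_signed_64(first_warning_ns)
print(f"as signed value: {signed_ns} ns = {signed_ns / 1e9:.3f} s")
```

Read this way, the "585 years" timestamp is the same bit pattern as a clock value roughly 10.86 seconds below zero.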
I hope someone finds this helpful.
Best regards
Thorsten