Re: VM becomes unresponsive, some vCPUs are 100% utilized

Hi,

Today one of our machines showed the same problem: it became unresponsive and started using 100% vCPU.

Guest OS is SLES 11 SP1, Kernel 2.6.32.

I haven't found the root cause of the crash, but I wanted to share what I have found so far in case it is helpful:

I examined /var/log/messages and found several messages repeating over and over again until the crash:

Dec 28 08:54:36 vmguest01 kernel: [83801.848812] update_cpu_power: cpu_power = 83814858285515; scale_rt = 18428813490058355147
Dec 28 08:54:36 vmguest01 kernel: [83801.849402] update_cpu_power: cpu_power = 83814858881712; scale_rt = 18428813490058951344
Dec 28 08:54:36 vmguest01 kernel: [83801.851079] update_cpu_power: cpu_power = 83814860558860; scale_rt = 18428813490060628492
Dec 28 08:54:36 vmguest01 kernel: [83801.851757] update_cpu_power: cpu_power = 83814861225898; scale_rt = 18428813490061295530
Dec 28 08:54:36 vmguest01 kernel: [83801.852086] update_cpu_power: cpu_power = 83814861566880; scale_rt = 18428813490061636512

I went back to the first occurrence of such a line and found something interesting about 24 hours before the crash:

Dec 27 09:35:18 vmguest01 kernel: [18446744062.849328] ------------[ cut here ]------------
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849345] WARNING: at /usr/src/packages/BUILD/kernel-default-2.6.32.36/linux-2.6.32/kernel/sched.c:3847 update_cpu_power+0x151/0x160()
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849348] Hardware name: VMware Virtual Platform
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849349] Modules linked in: iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables tun nfs lockd fscache nfs_acl auth_rpcgss sunrpc af_packet st ide_cd_mod joydev vsock(X) vmci(X) vmmemctl(X) acpiphp microcode fuse loop dm_mod rtc_cmos ppdev rtc_core parport_pc rtc_lib parport ac shpchp container pci_hotplug floppy intel_agp button tpm_tis tpm tpm_bios sg mptctl i2c_piix4 pcspkr e1000 i2c_core sr_mod cdrom sd_mod crc_t10dif edd ext3 mbcache jbd fan processor ide_pci_generic piix ide_core ata_generic ata_piix libata mptspi mptscsih mptbase scsi_transport_spi thermal thermal_sys hwmon vmw_pvscsi scsi_mod vmxnet3
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849405] Supported: Yes
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849407] Pid: 11077, comm: export2online-e Tainted: G X 2.6.32.36-0.5-default #1
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849408] Call Trace:
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849426] [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849436] [<ffffffff81397d88>] dump_stack+0x69/0x71
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849445] [<ffffffff8104d804>] warn_slowpath_common+0x74/0xd0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849453] [<ffffffff8103d771>] update_cpu_power+0x151/0x160
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849458] [<ffffffff8103e3b3>] find_busiest_group+0xa83/0xce0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849461] [<ffffffff810460dd>] load_balance_newidle+0xcd/0x380
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849465] [<ffffffff81398b7b>] thread_return+0x2a7/0x34c
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849468] [<ffffffff81399b9d>] do_nanosleep+0x8d/0xc0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849475] [<ffffffff81068768>] hrtimer_nanosleep+0xa8/0x140
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849483] [<ffffffff8108b0e6>] compat_sys_nanosleep+0xd6/0x150
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849488] [<ffffffff81033c19>] sysenter_dispatch+0x7/0x27
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849496] [<00000000ffffe430>] 0xffffe430
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849497] ---[ end trace d164ede915d60e57 ]---
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849498] update_cpu_power: cpu_power = 2149207707; scale_rt = 18428729677349277339
Dec 27 09:35:18 vmguest01 kernel: [18446744062.850151] update_cpu_power: cpu_power = 2150067177; scale_rt = 18428729677350136809

The most interesting part is the timestamp: 18446744062.850151.

I went through the Linux kernel sources, and it seems that this value is the number of seconds since boot. That would mean the machine booted about 18 billion seconds ago, which is nearly 585 years. Internally this value seems to be handled in nanoseconds. Expressed in nanoseconds, the value is about 1.84 * 10^19, which is close to the maximum an unsigned 64-bit integer can hold (2^64 - 1). And a few lines later, it seems to have overflowed:

Dec 27 09:35:28 vmguest01 kernel: [18446744072.839931] update_cpu_power: cpu_power = 12139859978; scale_rt = 18428729687339929610
Dec 27 09:35:28 vmguest01 kernel: [18446744072.840340] update_cpu_power: cpu_power = 12140269170; scale_rt = 18428729687340338802
Dec 27 09:35:33 vmguest01 kernel: [ 4.123417] __ratelimit: 909 callbacks suppressed
Dec 27 09:35:33 vmguest01 kernel: [ 4.123420] update_cpu_power: cpu_power = 17132887892; scale_rt = 18428729692332957524

The values of cpu_power and scale_rt kept increasing throughout the log file. My guess is that the machine crashes when one of these values reaches a critical maximum and causes an overflow and/or negative values somewhere within the kernel.
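
To see how the two values move, the first logged pair can be compared with one of the pairs logged shortly before the crash. This is only a sketch on the numbers quoted in this post; treating them as nanoseconds is an assumption on my side.

```python
# Values copied from the first update_cpu_power line (Dec 27 09:35:18)
# and from one of the lines logged shortly before the crash (Dec 28 08:54:36).
first = {"cpu_power": 2149207707, "scale_rt": 18428729677349277339}
last = {"cpu_power": 83814858285515, "scale_rt": 18428813490058355147}

delta_cpu_power = last["cpu_power"] - first["cpu_power"]
delta_scale_rt = last["scale_rt"] - first["scale_rt"]

print(delta_cpu_power, delta_scale_rt)
print("same increase:", delta_cpu_power == delta_scale_rt)

# Assumption: the values are in nanoseconds, like the timestamps.
print("hours, if nanoseconds:", delta_cpu_power / 1e9 / 3600)
```

Both values grew by exactly the same amount. Read as nanoseconds, that is roughly 23.3 hours, which is close to the wall-clock time between the two log lines, so both seem to simply follow the clock.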

Maybe the "high" timestamp value is caused by an adjustment to the guest OS's clock made by VMware? Maybe this adjustment makes the time negative (an underflow that shows up as a very high positive value)?
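
The underflow idea can at least be shown to be numerically possible. The sketch below is hypothetical: it assumes a signed 64-bit nanosecond clock that is slightly below zero and gets printed as unsigned. The function names and the output format are my own and are not taken from the kernel.

```python
MASK64 = 2**64 - 1

def as_unsigned64(value: int) -> int:
    """Reinterpret a signed 64-bit value as unsigned (two's complement)."""
    return value & MASK64

def format_like_log(ns: int) -> str:
    """Format nanoseconds as 'seconds.microseconds', the layout seen in the log."""
    return f"{ns // 10**9}.{(ns % 10**9) // 1000:06d}"

# Hypothetical: the clock reads about 10.86 seconds *below* zero.
negative_ns = -10_860_223_616
print(format_like_log(as_unsigned64(negative_ns)))  # prints 18446744062.849328
```

So a clock value of about -10.86 seconds would show up exactly as the timestamp of the first warning. This does not prove that this is what happened, only that the numbers fit.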

I hope someone finds this helpful.

Best regards

Thorsten
