git.baikalelectronics.ru Git - kernel.git/commit
mm/hwpoison: fix error page recovered but reported "not recovered"
author Naoya Horiguchi <naoya.horiguchi@nec.com>
Tue, 22 Mar 2022 21:44:06 +0000 (14:44 -0700)
committer Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sun, 15 May 2022 18:18:53 +0000 (20:18 +0200)
commit e7263d8914f8f76347a33b516e130e9d1e2be96c
tree 5009979a659f77429fabd6342529e715d2cb1ac8
parent 34f3a1e73482427ceccf038257aeb07397e52d94

commit 4d887c10de4ec1b8a05f3d9ab53227f785dfddcc upstream.

When an uncorrected memory error is consumed, there is a race between the
CMCI from the memory controller reporting an uncorrected error with a
UCNA signature and the core reporting an SRAR signature machine check
when the data is about to be consumed.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure(), and the machine check
processing code then finds the page already poisoned.  It calls
kill_accessing_process() to make sure a SIGBUS is sent, but that
function returns the wrong error code.

The console log looks like this:

  mce: Uncorrected hardware memory error in user-access at 3710b3400
  Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
  Memory failure: 0x3710b3: already hardware poisoned
  Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
  mce: Memory error not recovered

kill_accessing_process() is supposed to return -EHWPOISON to notify the
caller that SIGBUS has already been sent to the process, so that
kill_me_maybe() doesn't have to send it again.  The current code fails
to do this, so fix it to work as intended.  This change avoids the
spurious "Memory error not recovered" message and skips the duplicate
SIGBUS.
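
The return-code contract described above can be illustrated with a small
userspace model.  This is only a sketch, not the kernel code: the
model_* functions, the sigbus_count counter and the "fixed" flag are
hypothetical names invented for illustration, and the use of -EFAULT as
the pre-fix return value is an assumption made to show the effect of a
wrong error code.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* Linux value, provided for libcs that lack it */
#endif

/* Hypothetical counter standing in for SIGBUS deliveries to the task. */
static int sigbus_count;

/*
 * Model of kill_accessing_process(): it always signals the task once.
 * With fixed == true it reports -EHWPOISON ("SIGBUS already sent").
 * With fixed == false it reports a different error (assumed -EFAULT),
 * which is the situation the commit message describes.
 */
static int model_kill_accessing_process(bool fixed)
{
	sigbus_count++;		/* SIGBUS sent by this function */
	return fixed ? -EHWPOISON : -EFAULT;
}

/*
 * Model of the caller (kill_me_maybe() in the text): it must not signal
 * again when told that SIGBUS was already sent.
 */
static const char *model_kill_me_maybe(int ret)
{
	if (ret == 0)
		return "recovered";
	if (ret == -EHWPOISON)
		return "already signalled";	/* no message, no second SIGBUS */

	sigbus_count++;				/* duplicate SIGBUS */
	return "Memory error not recovered";
}
```

Passing the result of model_kill_accessing_process(false) to
model_kill_me_maybe() yields the noisy message and two signals in this
model, while the fixed variant yields a single signal and no message.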

[tony.luck@intel.com: reword some parts of commit message]

Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
Fixes: cb847afd775e ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Youquan Song <youquan.song@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mm/memory-failure.c