1. 27 Jun, 2012 2 commits
  2. 12 Jun, 2012 4 commits
  3. 11 Jun, 2012 25 commits
    • edac: remove arch-specific parameter for the error handler · 03f7eae8
      Mauro Carvalho Chehab authored
      Remove the arch-dependent parameter, as it was not used:
      the MCE tracepoint was never implemented. It probably doesn't
      make sense to have an MCE-specific tracepoint, as it would
      add more bytes to each trace event, and tracepoints are not free.
      
      The changes to the EDAC drivers were made by this small perl script:
      
      	$file .=$_ while (<>);
      	$file =~ s/(edac_mc_handle_error)\s*\(([^\;]+)\,([^\,\)]+)\s*\)/$1($2)/g;
      	print $file;
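
A rough Python equivalent of the substitution above, for readers less used to Perl. This is only a sketch: the function name `drop_last_argument` and the sample call are illustrative assumptions, not part of the patch.

```python
import re

# Same pattern as the Perl script: capture the call name, then everything up
# to the last comma, and drop the final argument before the closing paren.
PATTERN = re.compile(r'(edac_mc_handle_error)\s*\(([^;]+),([^,)]+)\s*\)')


def drop_last_argument(source: str) -> str:
    """Remove the last argument from every edac_mc_handle_error() call."""
    return PATTERN.sub(r'\1(\2)', source)


# Hypothetical call, shortened for illustration.
before = 'edac_mc_handle_error(type, mci, "msg", "", NULL);'
print(drop_last_argument(before))
```

As in the original, `[^;]+` keeps a match from crossing a statement boundary, and the pattern assumes the last argument contains no comma or closing parenthesis.
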
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • amd64_edac: Don't pass driver name as an error parameter · 075f3090
      Mauro Carvalho Chehab authored
      The EDAC driver name doesn't help to handle EDAC errors. So,
      remove it from the EDAC error messages, preserving only the
      error_message.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac_mc: check for allocation failure in edac_mc_alloc() · 08a4a136
      Dan Carpenter authored
      Add a check for kzalloc() failure here.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Increase version to 3.0.0 · 5156a5f4
      Mauro Carvalho Chehab authored
      Enough changes were introduced to justify bumping the version to
      3.0.0:
      
        - the EDAC core was redesigned to represent all types of
          memory controllers;
      
        - the EDAC API was redesigned to properly represent the memory
          controller hierarchy;
      
        - a tracepoint-based API was added to report memory errors.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac_mc: Cleanup per-dimm_info debug messages · 6e84d359
      Mauro Carvalho Chehab authored
      The edac_mc_alloc() routine allocates one dimm_info device for every
      possible memory slot, including the unpopulated ones. The debug messages
      there are somewhat confusing. So, clean them up by moving the code
      that prints the memory location to edac_mc, and using it in both
      edac_mc_sysfs and edac_mc.
      
      Also, only dump information when DIMMs/ranks are actually
      populated.
      
      After this patch, a dimm-based memory controller will print the debug
      info as:
      
      [ 1011.380027] EDAC DEBUG: edac_mc_dump_csrow: csrow->csrow_idx = 0
      [ 1011.380029] EDAC DEBUG: edac_mc_dump_csrow:   csrow = ffff8801169be000
      [ 1011.380031] EDAC DEBUG: edac_mc_dump_csrow:   csrow->first_page = 0x0
      [ 1011.380032] EDAC DEBUG: edac_mc_dump_csrow:   csrow->last_page = 0x0
      [ 1011.380034] EDAC DEBUG: edac_mc_dump_csrow:   csrow->page_mask = 0x0
      [ 1011.380035] EDAC DEBUG: edac_mc_dump_csrow:   csrow->nr_channels = 3
      [ 1011.380037] EDAC DEBUG: edac_mc_dump_csrow:   csrow->channels = ffff8801149c2840
      [ 1011.380039] EDAC DEBUG: edac_mc_dump_csrow:   csrow->mci = ffff880117426000
      [ 1011.380041] EDAC DEBUG: edac_mc_dump_channel:   channel->chan_idx = 0
      [ 1011.380042] EDAC DEBUG: edac_mc_dump_channel:     channel = ffff8801149c2860
      [ 1011.380044] EDAC DEBUG: edac_mc_dump_channel:     channel->csrow = ffff8801169be000
      [ 1011.380046] EDAC DEBUG: edac_mc_dump_channel:     channel->dimm = ffff88010fe90400
      ...
      [ 1011.380095] EDAC DEBUG: edac_mc_dump_dimm: dimm0: channel 0 slot 0 mapped as virtual row 0, chan 0
      [ 1011.380097] EDAC DEBUG: edac_mc_dump_dimm:   dimm = ffff88010fe90400
      [ 1011.380099] EDAC DEBUG: edac_mc_dump_dimm:   dimm->label = 'CPU#0Channel#0_DIMM#0'
      [ 1011.380101] EDAC DEBUG: edac_mc_dump_dimm:   dimm->nr_pages = 0x40000
      [ 1011.380103] EDAC DEBUG: edac_mc_dump_dimm:   dimm->grain = 8
      [ 1011.380104] EDAC DEBUG: edac_mc_dump_dimm:   dimm->nr_pages = 0x40000
      ...
      
      (a rank-based memory controller would print, instead of "dimm?", "rank?"
       on the above debug info)
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Convert debugfX to edac_dbg(X, · 956b9ba1
      Joe Perches authored
      Use a more common debugging style.
      
      Remove __FILE__ uses, add missing newlines,
      coalesce formats and align arguments.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Use more normal debugging macro style · 7e881856
      Joe Perches authored
      Convert macros to a simpler style and enforce appropriate
      format checking when not CONFIG_EDAC_DEBUG.
      
      Use fmt and __VA_ARGS__, neaten macros.
      
      Move some string arrays to the debugfx uses and remove the
      now unnecessary CONFIG_EDAC_DEBUG variable block definitions.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Don't add __func__ or __FILE__ for debugf[0-9] msgs · dd23cd6e
      Mauro Carvalho Chehab authored
      The debug macro already adds that. Most of the work here was
      done by this small script:
      
      $f .=$_ while (<>);
      
      $f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*": /\1"/g;
      $f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*/\1/g;
      $f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*"MC: /\1"/g;
      
      $f =~ s/(debugf[0-9]\s*\(\")\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+)__func__\s*\,\s*/\1\2/g;
      $f =~ s/(debugf[0-9]\s*\(\")\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+),\s*__func__\s*\)/\1\2)/g;
      $f =~ s/(debugf[0-9]\s*\(\"MC\:\s*)\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+)__func__\s*\,\s*/\1\2/g;
      $f =~ s/(debugf[0-9]\s*\(\"MC\:\s*)\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+),\s*__func__\s*\)/\1\2)/g;
      
      $f =~ s/\"MC\: \\n\"/"MC:\\n"/g;
      
      print $f;
      
      After running the script, manual cleanups were done to fix the
      remaining places.
      
      While here, remove __LINE__ in most places, as it rarely gives
      useful information.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • Edac: Add ABI Documentation for the new device nodes · 2639c3ee
      Mauro Carvalho Chehab authored
      The EDAC ABI was extended to add support for per-DIMM or per-rank
      information and silkscreen labels. Document them properly.
      
      Most of the comments there came from the edac.txt descriptions of the
      fields that are part of the legacy csrowX ABI (e.g.
      /sys/devices/system/edac/mc/mc*/csrow*/*).
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: move documentation ABI to ABI/testing/sysfs-devices-edac · 8b6f04ce
      Mauro Carvalho Chehab authored
      The EDAC MC ABI documentation is currently stored in the wrong place.
      Move the parts of the EDAC MC ABI that will be kept to
      ABI/testing/sysfs-devices-edac.
      
      The Date: fields were added based on the git timestamps of the
      commits that added the functionality to edac.txt.
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • i7core_edac: change the mem allocation scheme to make Documentation/kobject.txt happy · 356f0a30
      Mauro Carvalho Chehab authored
      Kernel kobjects have rigid rules: each container object should be
      dynamically allocated on its own, and several of them can't share a
      single kmalloc'ed block.
      
      EDAC never obeyed this rule: it has a single allocation function that
      puts all needed data into one kzalloc'ed block.
      
      As this is no longer accepted, change the allocation scheme of the
      EDAC *_info structs to follow this kernel standard.
      
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: change the mem allocation scheme to make Documentation/kobject.txt happy · de3910eb
      Mauro Carvalho Chehab authored
      Kernel kobjects have rigid rules: each container object should be
      dynamically allocated on its own, and several of them can't share a
      single kmalloc'ed block.
      
      EDAC never obeyed this rule: it has a single allocation function that
      puts all needed data into one kzalloc'ed block.
      
      As this is no longer accepted, change the allocation scheme of the
      EDAC *_info structs to follow this kernel standard.
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Greg K H <gregkh@linuxfoundation.org>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Tim Small <tim@buttersideup.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Egor Martovetsky <egor@pasemi.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hitoshi Mitake <h.mitake@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Only expose csrows/channels on legacy API if they're populated · e39f4ea9
      Mauro Carvalho Chehab authored
      This patch fixes a bug in the legacy API: within the same csrow,
      different channels may hold different DIMMs. This can happen
      on FB-DIMM/RAMBUS and modern Intel controllers.
      
      This is the case, for example, on Nehalem machines:
      
      $ ./edac-ctl --layout
             +-----------------------------------+
             |                mc0                |
             | channel0  | channel1  | channel2  |
      -------+-----------------------------------+
      slot2: |     0 MB  |     0 MB  |     0 MB  |
      slot1: |  1024 MB  |     0 MB  |     0 MB  |
      slot0: |  1024 MB  |  1024 MB  |  1024 MB  |
      -------+-----------------------------------+
      
      Before this patch, unpopulated memories were also shown. Now, only
      what is populated is there:
      
      grep . /sys/devices/system/edac/mc/mc0/csrow*/ch?*
      /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
      /sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label:CPU#0Channel#0_DIMM#0
      /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
      /sys/devices/system/edac/mc/mc0/csrow0/ch1_dimm_label:CPU#0Channel#0_DIMM#1
      /sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
      /sys/devices/system/edac/mc/mc0/csrow1/ch0_dimm_label:CPU#0Channel#1_DIMM#0
      /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
      /sys/devices/system/edac/mc/mc0/csrow2/ch0_dimm_label:CPU#0Channel#2_DIMM#0
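
The filtering can be pictured with a small Python sketch. It assumes, based only on the labels in the listing above, that on this machine the legacy csrow index follows the channel and the chN index follows the slot. The names `LAYOUT_MB` and `populated_legacy_nodes` are made up for illustration and do not come from the kernel code.

```python
# (channel, slot) -> size in MB, transcribed from the edac-ctl layout above
LAYOUT_MB = {
    (0, 0): 1024, (0, 1): 1024, (0, 2): 0,
    (1, 0): 1024, (1, 1): 0,    (1, 2): 0,
    (2, 0): 1024, (2, 1): 0,    (2, 2): 0,
}


def populated_legacy_nodes(layout):
    """Return (legacy node, label) pairs for populated slots only."""
    nodes = []
    for (channel, slot), size_mb in sorted(layout.items()):
        if size_mb > 0:  # skip empty slots, as the patch does
            node = f"csrow{channel}/ch{slot}"
            label = f"CPU#0Channel#{channel}_DIMM#{slot}"
            nodes.append((node, label))
    return nodes


for node, label in populated_legacy_nodes(LAYOUT_MB):
    print(f"{node}_dimm_label:{label}")
```

With the layout above, the sketch yields the same four csrow/channel pairs that appear in the grep output.
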
      
      Thanks-to: Aristeu Rozanski Filho <arozansk@redhat.com>
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Move grain/dtype/edac_type calculus to be out of channel loop · fd63312d
      Mauro Carvalho Chehab authored
      The 3e7bddc changeset (edac: move dimm properties to struct memset_info)
      moved the calculation inside a loop. However, as those values are
      common to all channels in several drivers, it is better to do the
      calculation outside the loop, to optimize the code.
      Reported-by: Aristeu Rozanski Filho <arozansk@redhat.com>
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Add debufs nodes to allow doing fake error inject · 452a6bf9
      Mauro Carvalho Chehab authored
      Sometimes, it is useful to have a mechanism that generates fake
      errors, in order to test the EDAC core code and the userspace
      tools.
      
      Provide such a mechanism by adding a few debugfs nodes.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: add a sysfs node to report the maximum location for the system · 8ad6c78a
      Mauro Carvalho Chehab authored
      The userspace tools need to know the maximum location on each
      system, as it helps to create nice maps showing how the memory is
      populated on the system.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: add a new per-dimm API and make the old per-virtual-rank API obsolete · 19974710
      Mauro Carvalho Chehab authored
      The old EDAC API is broken. It only works fine for systems manufactured
      before 2005 and for AMD 64. The reason is that it forces all memory
      controller drivers to discover rank info.
      
      Also, it doesn't allow grouping the several ranks into a DIMM.
      
      So, what almost all modern drivers do is create fake virtual-rank
      information, and use it to trick the EDAC core into accepting the driver.
      
      While this works if the user has enough time to discover which DIMM slot
      corresponds to each "virtual-rank", it prevents EDAC usage
      by users with less time available. It also makes life hard for vendors
      that may want to provide a table for their motherboards to the userspace
      tool (edac-utils), as each driver has its own logic for the virtual
      mapping.
      
      So, the old API should be removed, in favor of a more flexible API that
      allows newer drivers to not lie to the EDAC core.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Hui Wang <jason77.wang@gmail.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Get rid of the old kobj's from the edac mc code · d90c0089
      Mauro Carvalho Chehab authored
      Now that all users of the old kobj raw access are gone,
      we can get rid of the legacy kobj-based structures and
      data.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • i7core_edac: convert it to use struct device · 5c4cdb5a
      Mauro Carvalho Chehab authored
      Instead of relying on a complex logic inside the edac core to create
      a "device tree-like" sysfs struct, just use device_add.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • amd64_edac: convert sysfs logic to use struct device · c5608759
      Mauro Carvalho Chehab authored
      Now that the EDAC core supports struct device, there's no sense
      in having any logic in the EDAC core to simulate it. So, instead
      of adding such logic there, change the logic in amd64_edac to
      use it.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • mpc85xx_edac: convert sysfs logic to use struct device · ba004239
      Mauro Carvalho Chehab authored
      Now that the EDAC core supports struct device, there's no sense in
      having any logic in the EDAC core to simulate it. So, instead of adding
      such logic there, change the logic in mpc85xx_edac to use it.
      
      Compile-tested only.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: rewrite the sysfs code to use struct device · 7a623c03
      Mauro Carvalho Chehab authored
      The EDAC subsystem uses the old struct sysdev approach,
      creating all nodes using the raw sysfs API. This is bad,
      as the API is deprecated.
      
      As we'll be changing the EDAC API, let's first port the existing
      code to struct device.
      
      There's one drawback to this patch: driver-specific sysfs
      nodes, used by mpc85xx_edac, amd64_edac and i7core_edac,
      won't be created anymore. While it would be possible to
      also port the device-specific code, that would mix kobj with
      struct device, which is not recommended. Also, it is easier and nicer
      to move the code to the drivers instead, as the core can get rid
      of some complex logic that just emulates what device_add()
      and device_create_file() already do.
      
      The next patches will convert the driver-specific code to use
      the device-specific calls. Then, the remaining bits of the old
      sysfs API will be removed.
      
      NOTE: a per-MC bus is required, otherwise devices with more than
      one memory controller will hit a bug like the one below:
      
      [  819.094946] EDAC DEBUG: find_mci_by_dev: find_mci_by_dev()
      [  819.094948] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device() idx=1
      [  819.094952] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device(): creating device mc1
      [  819.094967] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device creating dimm0, located at channel 0 slot 0
      [  819.094984] ------------[ cut here ]------------
      [  819.100142] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xc1/0xf0()
      [  819.107282] Hardware name: S2600CP
      [  819.111078] sysfs: cannot create duplicate filename '/bus/edac/devices/dimm0'
      [  819.119062] Modules linked in: sb_edac(+) edac_core ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun kvm microcode pcspkr iTCO_wdt iTCO_vendor_support igb i2c_i801 i2c_core sg ioatdma dca sr_mod cdrom sd_mod crc_t10dif ahci libahci isci libsas libata scsi_transport_sas scsi_mod wmi dm_mod [last unloaded: scsi_wait_scan]
      [  819.175748] Pid: 10902, comm: modprobe Not tainted 3.3.0-0.11.el7.v12.2.x86_64 #1
      [  819.184113] Call Trace:
      [  819.186868]  [<ffffffff8105adaf>] warn_slowpath_common+0x7f/0xc0
      [  819.193573]  [<ffffffff8105aea6>] warn_slowpath_fmt+0x46/0x50
      [  819.200000]  [<ffffffff811f53d1>] sysfs_add_one+0xc1/0xf0
      [  819.206025]  [<ffffffff811f5cf5>] sysfs_do_create_link+0x135/0x220
      [  819.212944]  [<ffffffff811f7023>] ? sysfs_create_group+0x13/0x20
      [  819.219656]  [<ffffffff811f5df3>] sysfs_create_link+0x13/0x20
      [  819.226109]  [<ffffffff813b04f6>] bus_add_device+0xe6/0x1b0
      [  819.232350]  [<ffffffff813ae7cb>] device_add+0x2db/0x460
      [  819.238300]  [<ffffffffa0325634>] edac_create_dimm_object+0x84/0xf0 [edac_core]
      [  819.246460]  [<ffffffffa0325e18>] edac_create_sysfs_mci_device+0xe8/0x290 [edac_core]
      [  819.255215]  [<ffffffffa0322e2a>] edac_mc_add_mc+0x5a/0x2c0 [edac_core]
      [  819.262611]  [<ffffffffa03412df>] sbridge_register_mci+0x1bc/0x279 [sb_edac]
      [  819.270493]  [<ffffffffa03417a3>] sbridge_probe+0xef/0x175 [sb_edac]
      [  819.277630]  [<ffffffff813ba4e8>] ? pm_runtime_enable+0x58/0x90
      [  819.284268]  [<ffffffff812f430c>] local_pci_probe+0x5c/0xd0
      [  819.290508]  [<ffffffff812f5ba1>] __pci_device_probe+0xf1/0x100
      [  819.297117]  [<ffffffff812f5bea>] pci_device_probe+0x3a/0x60
      [  819.303457]  [<ffffffff813b1003>] really_probe+0x73/0x270
      [  819.309496]  [<ffffffff813b138e>] driver_probe_device+0x4e/0xb0
      [  819.316104]  [<ffffffff813b149b>] __driver_attach+0xab/0xb0
      [  819.322337]  [<ffffffff813b13f0>] ? driver_probe_device+0xb0/0xb0
      [  819.329151]  [<ffffffff813af5d6>] bus_for_each_dev+0x56/0x90
      [  819.335489]  [<ffffffff813b0d7e>] driver_attach+0x1e/0x20
      [  819.341534]  [<ffffffff813b0980>] bus_add_driver+0x1b0/0x2a0
      [  819.347884]  [<ffffffffa0347000>] ? 0xffffffffa0346fff
      [  819.353641]  [<ffffffff813b19f6>] driver_register+0x76/0x140
      [  819.359980]  [<ffffffff8159f18b>] ? printk+0x51/0x53
      [  819.365524]  [<ffffffffa0347000>] ? 0xffffffffa0346fff
      [  819.371291]  [<ffffffff812f5896>] __pci_register_driver+0x56/0xd0
      [  819.378096]  [<ffffffffa0347054>] sbridge_init+0x54/0x1000 [sb_edac]
      [  819.385231]  [<ffffffff8100203f>] do_one_initcall+0x3f/0x170
      [  819.391577]  [<ffffffff810bcd2e>] sys_init_module+0xbe/0x230
      [  819.397926]  [<ffffffff815bb529>] system_call_fastpath+0x16/0x1b
      [  819.404633] ---[ end trace 1654fdd39556689f ]---
      
      This happens because the bus is not being properly initialized.
      Instead of putting the memory sub-devices inside the memory controller,
      it is putting everything under the same directory:
      
      $ tree /sys/bus/edac/
      /sys/bus/edac/
      ├── devices
      │   ├── all_channel_counts -> ../../../devices/system/edac/mc/mc0/all_channel_counts
      │   ├── csrow0 -> ../../../devices/system/edac/mc/mc0/csrow0
      │   ├── csrow1 -> ../../../devices/system/edac/mc/mc0/csrow1
      │   ├── csrow2 -> ../../../devices/system/edac/mc/mc0/csrow2
      │   ├── dimm0 -> ../../../devices/system/edac/mc/mc0/dimm0
      │   ├── dimm1 -> ../../../devices/system/edac/mc/mc0/dimm1
      │   ├── dimm3 -> ../../../devices/system/edac/mc/mc0/dimm3
      │   ├── dimm6 -> ../../../devices/system/edac/mc/mc0/dimm6
      │   ├── inject_addrmatch -> ../../../devices/system/edac/mc/mc0/inject_addrmatch
      │   ├── mc -> ../../../devices/system/edac/mc
      │   └── mc0 -> ../../../devices/system/edac/mc/mc0
      ├── drivers
      ├── drivers_autoprobe
      ├── drivers_probe
      └── uevent
      
      On a multi-memory controller system, the names "csrow%d" and "dimm%d"
      should be under "mc%d", and not at the main hierarchy level.
      
      So, we need to create a per-MC bus, in order to have its own namespace.
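
The namespace problem can be illustrated with a toy model in Python. This is not kernel code: `FlatBus`, `PerMcBus` and `register` are hypothetical names, and the dictionaries merely stand in for the sysfs directories discussed above.

```python
class FlatBus:
    """Toy model of one shared bus: device names must be globally unique."""

    def __init__(self):
        self.devices = {}

    def add(self, mc: int, name: str) -> None:
        if name in self.devices:
            raise FileExistsError(
                f"cannot create duplicate filename '/bus/edac/devices/{name}'")
        self.devices[name] = f"mc{mc}/{name}"


class PerMcBus:
    """Toy model of one bus per memory controller: names are unique per MC."""

    def __init__(self):
        self.buses = {}

    def add(self, mc: int, name: str) -> None:
        bus = self.buses.setdefault(mc, {})
        if name in bus:
            raise FileExistsError(f"duplicate device '{name}' on mc{mc}")
        bus[name] = f"mc{mc}/{name}"


def register(bus, controllers: int = 2) -> None:
    """Register a 'dimm0' device for each memory controller."""
    for mc in range(controllers):
        bus.add(mc, "dimm0")


try:
    register(FlatBus())
except FileExistsError as err:
    print("flat bus:", err)

per_mc = PerMcBus()
register(per_mc)
print("per-MC bus:", sorted(per_mc.buses))
```

In the flat model the second controller's `dimm0` collides with the first one, mirroring the warning in the trace; with one namespace per controller both registrations succeed.
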
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Greg K H <gregkh@linuxfoundation.org>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: use Documentation-nano format for some data structs · b0610bb8
      Mauro Carvalho Chehab authored
      No functional changes. Just comment improvements.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • edac: Rename the parent dev to pdev · fd687502
      Mauro Carvalho Chehab authored
      As EDAC doesn't use struct device itself, it created a parent dev
      pointer called "dev". Now that we'll be converting it to use
      struct device instead of struct sysdev, this pointer needs to be
      renamed to "pdev".
      
      No functional changes.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Jason Uhlenkott <juhlenko@akamai.com>
      Cc: Tim Small <tim@buttersideup.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Egor Martovetsky <egor@pasemi.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hitoshi Mitake <h.mitake@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Niklas Söderlund" <niklas.soderlund@ericsson.com>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • RAS: Add a tracepoint for reporting memory controller events · 53f2d028
      Mauro Carvalho Chehab authored
      Add a new tracepoint-based hardware events report method for
      reporting Memory Controller events.
      
      Part of the description below is shamelessly copied from Tony
      Luck's notes about the Hardware Error BoF during LPC 2010 [1].
      Tony, thanks for your notes and for the discussions that produced the
      h/w error reporting requirements.
      
      [1] http://lwn.net/Articles/416669/
      
          We have several subsystems & methods for reporting hardware errors:
      
          1) EDAC ("Error Detection and Correction").  In its original form
          this consisted of a platform specific driver that read topology
          information and error counts from chipset registers and reported
          the results via a sysfs interface.
      
          2) mcelog - x86 specific decoding of machine check bank registers
          reporting in binary form via /dev/mcelog. Recent additions make use
          of the APEI extensions that were documented in version 4.0a of the
          ACPI specification to acquire more information about errors without
          having to rely reading chipset registers directly. A user level
          programs decodes into somewhat human readable format.
      
          3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
          decodes errors reported via machine check bank registers in AMD
          processors to the console log using printk();
      
          Each of these mechanisms has a band of followers ... and none
          of them appear to meet all the needs of all users.
      
      As part of a RAS subsystem, let's encapsulate the memory error hardware
      events into a trace facility.
      
      The tracepoint printk will be displayed like:
      
      mc_event: [quant] (Corrected|Uncorrected|Fatal) error:[error msg] on [label] ([location] [edac_mc detail] [driver detail])
      
      Where:
      	[quant] is the quantity of errors;
      	[error msg] is the driver-specific error message
      		    (e.g. "memory read", "bus error", ...);
      	[location] is the location in terms of memory controller and
      		   branch/channel/slot, channel/slot or csrow/channel;
      	[label] is the memory stick label;
      	[edac_mc detail] describes the address location of the error
      			 and the syndrome;
      	[driver detail] is driver-specific error message details,
      			when needed/provided (e.g. "area:DMA", ...).
      
      For example:
      
      mc_event: 1 Corrected error:memory read on memory stick DIMM_1A (mc:0 location:0:0:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)
      
      Of course, any userspace tools meant to handle errors should not parse
      the above data. They should, instead, use the binary fields provided by
      the tracepoint, mapping them directly into their Management Information
      Base.
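
The field layout described above can be sketched in Python by starting from structured fields and rendering the text line from them, which is the direction the advice above recommends. The class `McEvent`, its attribute names and the `render` function are assumptions made for this illustration; they are not the tracepoint's actual binary layout. The sketch also treats "memory stick DIMM_1A" as the whole label so that it reproduces the example line.

```python
from dataclasses import dataclass


@dataclass
class McEvent:
    quant: int             # quantity of errors
    severity: str          # "Corrected", "Uncorrected" or "Fatal"
    error_msg: str         # driver-specific error message
    label: str             # memory stick label
    location: str          # memory controller and position
    edac_mc_detail: str    # address location and syndrome
    driver_detail: str = ""  # optional driver-specific details


def render(ev: McEvent) -> str:
    """Build the human-readable line from the structured fields."""
    details = " ".join(
        part for part in (ev.location, ev.edac_mc_detail, ev.driver_detail)
        if part)
    return (f"mc_event: {ev.quant} {ev.severity} error:{ev.error_msg} "
            f"on {ev.label} ({details})")


example = McEvent(
    quant=1,
    severity="Corrected",
    error_msg="memory read",
    label="memory stick DIMM_1A",
    location="mc:0 location:0:0:0",
    edac_mc_detail="page:0x586b6e offset:0xa66 grain:32 syndrome:0x0",
    driver_detail="area:DMA",
)
print(render(example))
```

A tool built this way would keep the fields and only produce the text for display, so nothing ever has to parse the formatted line.
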
      
      NOTE: The original patch provided an additional mechanism for
      MCA-based trace events that also contained MCA error register data.
      However, as no agreement has been reached so far on the MCA-based trace
      events, for now, let's add events only for memory errors.
      A later patch is planned to change the tracepoint for those types
      of event.
      
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
  4. 28 May, 2012 9 commits
    • i7core: fix ranks information at the per-channel struct · 0bf09e82
      Mauro Carvalho Chehab authored
      There is a flag in the per-channel struct that indicates whether there
      are any 4R DIMMs on it. The way this flag was reported
      is not ok, as it might give the false idea that the channel was filled
      with 2R memories:
      
      [  580.588701] EDAC DEBUG: get_dimm_config: Ch1 phy rd1, wr1 (0x063f7431): 2 ranks, UDIMMs
      [  580.588704] EDAC DEBUG: get_dimm_config: 	dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
      
      (in this case, just one 1R memory is filled on channel 1)
      
      So, use a better way to represent the per-channel ranks information.
      After the patch, it will show:
      
      [ 2002.233978] EDAC DEBUG: get_dimm_config: Ch0 phy rd0, wr0 (0x063f7431): UDIMMs
      [ 2002.233982] EDAC DEBUG: get_dimm_config: 	dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
      [ 2002.233988] EDAC DEBUG: get_dimm_config: 	dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
      
      (in this case, there aren't any 4R memories)
      Reported-by: Borislav Petkov <borislav.petkov@amd.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
    • i5000: Fix the fatal error handling · 486dfb16
      Mauro Carvalho Chehab authored
      The fatal error channel bits point to a single channel, and not
      to a range of channels. Fix the code to properly report it,
      instead of printing messages like:
      	kernel: EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      486dfb16
    • i5100_edac: Fix a warning when compiled with 32 bits · 9f70d08a
      Mauro Carvalho Chehab authored
      drivers/edac/i5100_edac.c: In function ‘i5100_init_csrows’:
      drivers/edac/i5100_edac.c:862:3: warning: format ‘%zd’ expects argument of type ‘signed size_t’, but argument 5 has type ‘long unsigned int’ [-Wformat]
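
The warning appears because `%zd` is meant for the signed counterpart of size_t, while the value passed is an unsigned long. On 32-bit builds those two types differ in size class. The following is only a minimal sketch of matching format specifiers to argument types, not the change made in the driver; the helper names `format_pages` and `format_size` are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: format a page count held in an unsigned long. */
static int format_pages(char *buf, size_t len, unsigned long npages)
{
	/* %lu matches unsigned long on both 32-bit and 64-bit builds. */
	return snprintf(buf, len, "%lu pages", npages);
}

/* Hypothetical helper: format a byte count held in a size_t. */
static int format_size(char *buf, size_t len, size_t bytes)
{
	/* %zu is the specifier for size_t; %zd is for its signed counterpart. */
	return snprintf(buf, len, "%zu bytes", bytes);
}
```

Either approach silences the warning: pick the specifier that matches the variable's type, or cast the argument to the type the specifier expects.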
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: "Niklas Söderlund" <niklas.soderlund@ericsson.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      9f70d08a
    • i82975x_edac: Test nr_pages earlier to save a few CPU cycles · 36683aab
      Mauro Carvalho Chehab authored
      Avoid testing nr_pages twice and initializing some data that won't
      be used.
      
      Cleanup patch only.
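
The idea of testing the page count first can be sketched as below. This is an illustrative example only, not the i82975x driver code: the `row_info` struct, its fields, and `init_rows()` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical row descriptor; the field names are illustrative only. */
struct row_info {
	unsigned long first_page;
	unsigned long last_page;
	unsigned long nr_pages;
};

/*
 * Fill in one descriptor per populated row. The emptiness test comes
 * first, so nothing is computed or stored for rows without memory.
 * Returns the number of rows that were initialized.
 */
static size_t init_rows(struct row_info *rows, const unsigned long *pages,
			size_t n_rows)
{
	unsigned long next_page = 0;
	size_t filled = 0;
	size_t i;

	for (i = 0; i < n_rows; i++) {
		if (pages[i] == 0)
			continue;	/* empty row: skip before any setup */

		rows[i].first_page = next_page;
		rows[i].last_page = next_page + pages[i] - 1;
		rows[i].nr_pages = pages[i];
		next_page += pages[i];
		filled++;
	}
	return filled;
}
```

With a single early test, empty rows cost one comparison and no stores.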
      Reported-by: Aristeu Rozanski Filho <arozansk@redhat.com>
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      36683aab
    • e752x_edac: provide more info about how DIMMS/ranks are mapped · 805afb69
      Mauro Carvalho Chehab authored
      No functional changes here. Only the comments were updated.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      805afb69
    • i5000_edac: Fix the logic that retrieves memory information · 64e1fdaf
      Mauro Carvalho Chehab authored
      The logic there is broken: it creates two csrows for each DIMM
      and assumes that all DIMMs are dual rank. Only one of the
      csrows will contain the entire DIMM size. If single-rank
      memories are found, they'll be marked with 0 bytes.
      
      The check for whether the AMB is present was also wrong.
      
      Yet, as the error reports don't use the memory size in order to
      credit an error to the right DIMM, that part of the driver seems
      to work. That's probably why nobody has detected the issue yet.
      
      After this patch, the memory layout is now properly reported,
      when debug mode is enabled, and the number of ranks per dimm is
      now shown:
      
      calculate_dimm_size: ----------------------------------------------------------
      calculate_dimm_size: slot  3       0 MB   |    0 MB   |    0 MB   |    0 MB   |
      calculate_dimm_size: slot  2       0 MB   |    0 MB   |    0 MB   |    0 MB   |
      calculate_dimm_size: ----------------------------------------------------------
      calculate_dimm_size: slot  1       0 MB   |    0 MB   |    0 MB   |    0 MB   |
      calculate_dimm_size: slot  0     512 MB 1R|  512 MB 1R|  512 MB 1R|  512 MB 1R|
      calculate_dimm_size: ----------------------------------------------------------
      calculate_dimm_size:            channel 0 | channel 1 | channel 2 | channel 3 |
      calculate_dimm_size:                   branch 0       |        branch 1       |
      
      (1R above means that all memories on my test machine are single-ranked)
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      64e1fdaf
    • i5400_edac: improve debug messages to better represent the filled memory · 68d086f8
      Mauro Carvalho Chehab authored
      Improve the debug output messages so that they better represent the
      memory controller hierarchy.
      
      No functional changes when debug is disabled.
      Reviewed-by: Aristeu Rozanski <arozansk@redhat.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      68d086f8
    • edac: Cleanup the logs for i7core and sb edac drivers · e17a2f42
      Mauro Carvalho Chehab authored
      Remove some information that is duplicated in the MCE log
      and is of little use for the error. That data will be
      added again when a trace function that outputs both
      memory errors and MCE fields is created.
      
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      e17a2f42
    • edac: Initialize the dimm label with the known information · 5926ff50
      Mauro Carvalho Chehab authored
      While userspace doesn't fill the DIMM labels, fill them with the DIMM
      location, as described by the memory model in use. This could eventually
      match what dmidecode reports, making it easier for people to identify
      the memory.
      
      For example, on an Intel motherboard where the DMI table is reliable,
      the first memory stick is described as:
      
      Memory Device
      	Array Handle: 0x0029
      	Error Information Handle: Not Provided
      	Total Width: 64 bits
      	Data Width: 64 bits
      	Size: 2048 MB
      	Form Factor: DIMM
      	Set: 1
      	Locator: A1_DIMM0
      	Bank Locator: A1_Node0_Channel0_Dimm0
      	Type: <OUT OF SPEC>
      	Type Detail: Synchronous
      	Speed: 800 MHz
      	Manufacturer: A1_Manufacturer0
      	Serial Number: A1_SerNum0
      	Asset Tag: A1_AssetTagNum0
      	Part Number: A1_PartNum0
      
      The memory named as "A1_DIMM0" is physically located at the first
      memory controller (node 0), at channel 0, dimm slot 0.
      
      After this patch, the memory label will be filled with:
      	/sys/devices/system/edac/mc/csrow0/ch0_dimm_label:mc#0channel#0slot#0
      
      And (after the new EDAC API patches) as:
      	/sys/devices/system/edac/mc/mc0/dimm0/dimm_label:mc#0channel#0slot#0
      
      So, even if the memory label is not initialized by userspace, useful
      information about the error location is filled in there, especially since
      several systems/motherboards provide enough info to map from
      channel/slot (or branch/channel/slot) to the DIMM label. So, letting the
      EDAC core fill it by default is a good thing.
      
      It should be noted that, as the label is filled in
      edac_mc_alloc(), drivers can override it to better describe the memories
      (and some actually do).
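
A default label of the form shown above (for example "mc#0channel#0slot#0") could be assembled roughly as follows. This is a minimal userspace sketch under stated assumptions, not the code in edac_mc_alloc(): the `layer_pos` struct and `format_dimm_label()` are hypothetical names, and the layer names are passed in by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical description of a DIMM's position in one hierarchy layer. */
struct layer_pos {
	const char *name;	/* layer name, e.g. "channel" or "slot" */
	unsigned int index;	/* position of the DIMM within that layer */
};

/*
 * Hypothetical helper: write "mc#<mc_idx>" followed by "<name>#<index>"
 * for each layer. Returns the label length, or -1 if it does not fit.
 */
static int format_dimm_label(char *buf, size_t len, unsigned int mc_idx,
			     const struct layer_pos *layers, size_t n_layers)
{
	size_t used;
	size_t i;
	int n;

	assert(buf != NULL && len > 0);

	n = snprintf(buf, len, "mc#%u", mc_idx);
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (i = 0; i < n_layers; i++) {
		n = snprintf(buf + used, len - used, "%s#%u",
			     layers[i].name, layers[i].index);
		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* label would be truncated */
		used += (size_t)n;
	}
	return (int)used;
}
```

Calling it with memory controller 0 and the layers {"channel", 0} and {"slot", 0} yields the label from the example; a driver with better information would simply overwrite the buffer afterwards.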
      
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      5926ff50