1. 01 Mar, 2016 5 commits
    • Filipe Manana's avatar
      Btrfs: fix deadlock between direct IO reads and buffered writes · ade77029
      Filipe Manana authored
      While running a test with a mix of buffered IO and direct IO against
      the same files I hit a deadlock reported by the following trace:
      
      [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
      [11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
      [11642.149771]     0 15282      2 0x00000000
      [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
      [11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
      [11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
      [11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
      [11642.161403] Call Trace:
      [11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
      [11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
      [11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
      [11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
      [11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
      [11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
      [11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
      [11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
      [11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
      [11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
      [11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
      [11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.226295] 2 locks held by kworker/u32:3/15282:
      [11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
      [11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
      [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
      [11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
      [11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
      [11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
      [11642.243715] Call Trace:
      [11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
      [11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
      [11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
      [11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
      [11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
      [11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
      [11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
      [11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
      [11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
      [11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
      [11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.282604] 3 locks held by kworker/u32:8/15289:
      [11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
      [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
      [11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
      [11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
      [11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
      [11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
      [11642.298433] Call Trace:
      [11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
      [11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
      [11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
      [11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
      [11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
      [11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
      [11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
      [11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
      [11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
      [11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.310403] 3 locks held by fdm-stress/26848:
      [11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
      [11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
      [11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
      [11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
      [11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
      [11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
      [11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
      [11642.325096] Call Trace:
      [11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
      [11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
      [11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
      [11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
      [11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
      [11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
      [11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
      [11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
      [11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
      [11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
      [11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
      [11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
      [11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
      [11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
      [11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
      [11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
      [11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
      [11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
      [11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
      [11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
      [11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.350222] 4 locks held by fdm-stress/26849:
      [11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
      [11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
      [11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
      [11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
      [11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
      [11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
      [11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
      [11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
      [11642.373430] Call Trace:
      [11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
      [11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
      [11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
      [11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
      [11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
      [11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
      [11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
      [11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
      [11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
      [11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
      [11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
      [11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
      [11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
      [11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
      [11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
      [11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
      [11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
      [11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
      [11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
      [11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.416572] 1 lock held by fdm-stress/26850:
      [11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
      [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
      [11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
      [11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
      [11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
      [11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
      [11642.426591] Call Trace:
      [11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
      [11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
      [11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
      [11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
      [11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
      [11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
      [11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
      [11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
      [11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
      [11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.441533] 3 locks held by fdm-stress/26851:
      [11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
      [11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
      [11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      
      This happened because under specific timings the path for direct IO reads
      can deadlock with concurrent buffered writes. The diagram below shows how
      this happens for an example file that has the following layout:
      
           [  extent A  ]  [  extent B  ]  [ ....
           0K              4K              8K
      
           CPU 1                                               CPU 2                             CPU 3
      
      DIO read against range
       [0K, 8K[ starts
      
      btrfs_direct_IO()
        --> calls btrfs_get_blocks_direct()
            which finds the extent map for the
            extent A and leaves the range
            [0K, 4K[ locked in the inode's
            io tree
      
                                                         buffered write against
                                                         range [4K, 8K[ starts
      
                                                         __btrfs_buffered_write()
                                                           --> dirties page at 4K
      
                                                                                           a user space
                                                                                           task calls sync
                                                                                           for e.g or
                                                                                           writepages() is
                                                                                           invoked by mm
      
                                                                                           writepages()
                                                                                             run_delalloc_range()
                                                                                               cow_file_range()
                                                                                                 --> ordered extent X
                                                                                                     for the buffered
                                                                                                     write is created
                                                                                                     and
                                                                                                     writeback starts
      
        --> calls btrfs_get_blocks_direct()
            again, without submitting first
            a bio for reading extent A, and
            finds the extent map for extent B
      
        --> calls lock_extent_direct()
      
            --> locks range [4K, 8K[
            --> finds ordered extent X
                covering range [4K, 8K[
            --> unlocks range [4K, 8K[
      
                                                        buffered write against
                                                        range [0K, 8K[ starts
      
                                                        __btrfs_buffered_write()
                                                          prepare_pages()
                                                            --> locks pages with
                                                                offsets 0 and 4K
                                                          lock_and_cleanup_extent_if_need()
                                                            --> blocks attempting to
                                                                lock range [0K, 8K[ in
                                                                the inode's io tree,
                                                                because the range [0, 4K[
                                                                is already locked by the
                                                                direct IO task at CPU 1
      
            --> calls
                btrfs_start_ordered_extent(oe X)
      
                btrfs_start_ordered_extent(oe X)
      
                  --> At this point writeback for ordered
                      extent X has not finished yet
      
                  filemap_fdatawrite_range()
                    btrfs_writepages()
                      extent_writepages()
                        extent_write_cache_pages()
                          --> finds page with offset 0
                              with the writeback tag
                              (and not dirty)
                          --> tries to lock it
                               --> deadlock, task at CPU 2
                                   has the page locked and
                                   is blocked on the io range
                                   [0, 4K[ that was locked
                                   earlier by this task
      
      So fix this by falling back to a buffered read in the direct IO read path
      when an ordered extent for a buffered write is found.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ade77029
    • Filipe Manana's avatar
      Btrfs: fix extent_same allowing destination offset beyond i_size · f4dfe687
      Filipe Manana authored
      When using the same file as the source and destination for a dedup
      (extent_same ioctl) operation we were allowing it to dedup to a
      destination offset beyond the file's size, which doesn't make sense and
      it's not allowed for the case where the source and destination files are
      not the same file. This made de deduplication operation successful only
      when the source range corresponded to a hole, a prealloc extent or an
      extent with all bytes having a value of 0x00. This was also leaving a
      file hole (between i_size and destination offset) without the
      corresponding file extent items, which can be reproduced with the
      following steps for example:
      
        $ mkfs.btrfs -f /dev/sdi
        $ mount /dev/sdi /mnt/sdi
      
        $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
        wrote 404990/404990 bytes at offset 304457
        395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)
      
        $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
        Deduping 2 total files
        (28672, 24576): /mnt/sdi/foobar
        (929792, 24576): /mnt/sdi/foobar
        1 files asked to be deduped
        i: 0, status: 0, bytes_deduped: 24576
        24576 total bytes deduped in this operation
      
        $ umount /mnt/sdi
        $ btrfsck /dev/sdi
        Checking filesystem on /dev/sdi
        UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 257 errors 100, file extent discount
        Found file extent holes:
                start: 712704, len: 217088
        found 540673 bytes used err is 1
        total csum bytes: 400
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 123675
        file data blocks allocated: 671744
          referenced 671744
        btrfs-progs v4.2.3
      
      So fix this by not allowing the destination to go beyond the file's size,
      just as we do for the same where the source and destination files are not
      the same.
      
      A test for xfstests follows.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f4dfe687
    • Filipe Manana's avatar
      Btrfs: fix file loss on log replay after renaming a file and fsync · 2be63d5c
      Filipe Manana authored
      We have two cases where we end up deleting a file at log replay time
      when we should not. For this to happen the file must have been renamed
      and a directory inode must have been fsynced/logged.
      
      Two examples that exercise these two cases are listed below.
      
        Case 1)
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir -p /mnt/a/b
        $ mkdir /mnt/c
        $ touch /mnt/a/b/foo
        $ sync
        $ mv /mnt/a/b/foo /mnt/c/
        # Create file bar just to make sure the fsync on directory a/ does
        # something and it's not a no-op.
        $ touch /mnt/a/bar
        $ xfs_io -c "fsync" /mnt/a
        < power fail / crash >
      
        The next time the filesystem is mounted, the log replay procedure
        deletes file foo.
      
        Case 2)
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/a
        $ mkdir /mnt/b
        $ mkdir /mnt/c
        $ touch /mnt/a/foo
        $ ln /mnt/a/foo /mnt/b/foo_link
        $ touch /mnt/b/bar
        $ sync
        $ unlink /mnt/b/foo_link
        $ mv /mnt/b/bar /mnt/c/
        $ xfs_io -c "fsync" /mnt/a/foo
        < power fail / crash >
      
        The next time the filesystem is mounted, the log replay procedure
        deletes file bar.
      
      The reason why the files are deleted is because when we log inodes
      other then the fsync target inode, we ignore their last_unlink_trans
      value and leave the log without enough information to later replay the
      rename operations. So we need to look at the last_unlink_trans values
      and fallback to a transaction commit if they are greater than the
      id of the last committed transaction.
      
      So fix this by looking at the last_unlink_trans values and fallback to
      transaction commits when needed. Also, when logging other inodes (for
      case 1 we logged descendants of the fsync target inode while for case 2
      we logged ascendants) we need to care about concurrent tasks updating
      the last_unlink_trans of inodes we are logging (which was already an
      existing problem in check_parent_dirs_for_sync()). Since we can not
      acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes
      deadlocks with other concurrent operations that acquire the i_mutex of
      2 inodes (other fsyncs or renames for example), we need to serialize on
      the log_mutex of the inode we are logging. A task setting a new value for
      an inode's last_unlink_trans must acquire the inode's log_mutex and it
      must do this update before doing the actual unlink operation (which is
      already the case except when deleting a snapshot). Conversely the task
      logging the inode must first log the inode and then check the inode's
      last_unlink_trans value while holding its log_mutex, as if its value is
      not greater then the id of the last committed transaction it means it
      logged a safe state of the inode's items, while if its value is not
      smaller then the id of the last committed transaction it means the inode
      state it has logged might not be safe (the concurrent task might have
      just updated last_unlink_trans but hasn't done yet the unlink operation)
      and therefore a transaction commit must be done.
      
      Test cases for xfstests follow in separate patches.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2be63d5c
    • Filipe Manana's avatar
      Btrfs: fix unreplayable log after snapshot delete + parent dir fsync · 1ec9a1ae
      Filipe Manana authored
      If we delete a snapshot, fsync its parent directory and crash/power fail
      before the next transaction commit, on the next mount when we attempt to
      replay the log tree of the root containing the parent directory we will
      fail and prevent the filesystem from mounting, which is solvable by wiping
      out the log trees with the btrfs-zero-log tool but very inconvenient as
      we will lose any data and metadata fsynced before the parent directory
      was fsynced.
      
      For example:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ mkdir /mnt/testdir
        $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
        $ btrfs subvolume delete /mnt/testdir/snap
        $ xfs_io -c "fsync" /mnt/testdir
        < crash / power failure and reboot >
        $ mount /dev/sdc /mnt
        mount: mount(2) failed: No such file or directory
      
      And in dmesg/syslog we get the following message and trace:
      
      [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
      [192066.363010] ------------[ cut here ]------------
      [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
      [192066.367250] BTRFS: Transaction aborted (error -2)
      [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G        W       4.4.0-rc6-btrfs-next-20+ #1
      [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [192066.380889]  0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
      [192066.382561]  ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
      [192066.384191]  ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
      [192066.385827] Call Trace:
      [192066.386373]  [<ffffffff81257570>] dump_stack+0x4e/0x79
      [192066.387387]  [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2
      [192066.388429]  [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
      [192066.389236]  [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50
      [192066.389884]  [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
      [192066.390621]  [<ffffffff81184b55>] ? iput+0xb0/0x266
      [192066.391200]  [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
      [192066.391930]  [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs]
      [192066.392715]  [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs]
      [192066.393510]  [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs]
      [192066.394241]  [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs]
      [192066.394958]  [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs]
      [192066.395628]  [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs]
      [192066.396790]  [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs]
      [192066.397891]  [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs]
      [192066.398897]  [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs]
      [192066.399823]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
      [192066.400739]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
      [192066.401700]  [<ffffffff811714b9>] mount_fs+0x67/0x131
      [192066.402482]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
      [192066.403930]  [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs]
      [192066.404831]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
      [192066.405726]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
      [192066.406621]  [<ffffffff811714b9>] mount_fs+0x67/0x131
      [192066.407401]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
      [192066.408247]  [<ffffffff8118ae36>] do_mount+0x893/0x9d2
      [192066.409047]  [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c
      [192066.409842]  [<ffffffff8118b187>] SyS_mount+0x75/0xa1
      [192066.410621]  [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
      [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
      [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
      [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
      [192066.444613] BTRFS: open_ctree failed
      
      This happens because when we are replaying the log and processing the
      directory entry pointing to the snapshot in the subvolume tree, we treat
      its btrfs_dir_item item as having a location with a key type matching
      BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
      BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
      object id refers to a root number and not to an inode in the root
      containing the parent directory.
      
      So fix this by triggering a transaction commit if an fsync against the
      parent directory is requested after deleting a snapshot. This is the
      simplest approach for a rare use case. Some alternative that avoids the
      transaction commit would require more code to explicitly delete the
      snapshot at log replay time (factoring out common code from ioctl.c:
      btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
      log tree of the snapshot's root from the log root of the root of tree
      roots, amongst other steps.
      
      A test case for xfstests that triggers the issue follows.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create a snapshot at the root of our filesystem (mount point path), delete it,
        # fsync the mount point path, crash and mount to replay the log. This should
        # succeed and after the filesystem is mounted the snapshot should not be visible
        # anymore.
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
        _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT
        _flakey_drop_and_remount
        [ -e $SCRATCH_MNT/snap1 ] && \
            echo "Snapshot snap1 still exists after log replay"
      
        # Similar scenario as above, but this time the snapshot is created inside a
        # directory and not directly under the root (mount point path).
        mkdir $SCRATCH_MNT/testdir
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
        _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
        _flakey_drop_and_remount
        [ -e $SCRATCH_MNT/testdir/snap2 ] && \
            echo "Snapshot snap2 still exists after log replay"
      
        _unmount_flakey
      
        echo "Silence is golden"
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1ec9a1ae
    • Chris Mason's avatar
      Merge tag 'for-chris' of... · c05c5ee5
      Chris Mason authored
      Merge tag 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6
      
      Btrfs patchsets for 4.6
      c05c5ee5
  2. 28 Feb, 2016 15 commits
    • Linus Torvalds's avatar
      Linux 4.5-rc6 · fc77dbd3
      Linus Torvalds authored
      fc77dbd3
    • Linus Torvalds's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1b9540ce
      Linus Torvalds authored
      Pull perf fixes from Thomas Gleixner:
       "A rather largish series of 12 patches addressing a maze of race
        conditions in the perf core code from Peter Zijlstra"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf: Robustify task_function_call()
        perf: Fix scaling vs. perf_install_in_context()
        perf: Fix scaling vs. perf_event_enable()
        perf: Fix scaling vs. perf_event_enable_on_exec()
        perf: Fix ctx time tracking by introducing EVENT_TIME
        perf: Cure event->pending_disable race
        perf: Fix race between event install and jump_labels
        perf: Fix cloning
        perf: Only update context time when active
        perf: Allow perf_release() with !event->ctx
        perf: Do not double free
        perf: Close install vs. exit race
      1b9540ce
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4b696dcb
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "This update contains:
      
         - Hopefully the last ASM CLAC fixups
      
         - A fix for the Quark family related to the IMR lock which makes
           kexec work again
      
         - A off-by-one fix in the MPX code.  Ironic, isn't it?
      
         - A fix for X86_PAE which addresses once more an unsigned long vs
           phys_addr_t hickup"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mpx: Fix off-by-one comparison with nr_registers
        x86/mm: Fix slow_virt_to_phys() for X86_PAE again
        x86/entry/compat: Add missing CLAC to entry_INT80_32
        x86/entry/32: Add an ASM_CLAC to entry_SYSENTER_32
        x86/platform/intel/quark: Change the kernel's IMR lock bit to false
      4b696dcb
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 76c03f0f
      Linus Torvalds authored
      Pull scheduler fixlet from Thomas Gleixner:
       "A trivial printk typo fix"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/deadline: Fix trivial typo in printk() message
      76c03f0f
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f055ae04
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "Four small fixes for irqchip drivers:
      
         - Add missing low level irq handler initialization on mxs, so
           interrupts can acutally be delivered
      
         - Add a missing barrier to the GIC driver
      
         - Two fixes for the GIC-V3-ITS driver, addressing a double EOI write
           and a cache flush beyond the actual region"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/gic-v3: Add missing barrier to 32bit version of gic_read_iar()
        irqchip/mxs: Add missing set_handle_irq()
        irqchip/gicv3-its: Avoid cache flush beyond ITS_BASERn memory size
        irqchip/gic-v3-its: Fix double ICC_EOIR write for LPI in EOImode==1
      f055ae04
    • Linus Torvalds's avatar
      Merge tag 'staging-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 8da51430
      Linus Torvalds authored
      Pull staging/android fix from Greg KH:
       "Here is one patch, for the android binder driver, to resolve a
        reported problem.  Turns out it has been around for a while (since
        3.15), so it is good to finally get it resolved.
      
        It has been in linux-next for a while with no reported issues"
      
      * tag 'staging-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        drivers: android: correct the size of struct binder_uintptr_t for BC_DEAD_BINDER_DONE
      8da51430
    • Linus Torvalds's avatar
      Merge tag 'usb-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 62718e30
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a few USB fixes for 4.5-rc6
      
        They fix a reported bug for some USB 3 devices by reverting the recent
        patch, a MAINTAINERS change for some drivers, some new device ids, and
        of course, the usual bunch of USB gadget driver fixes.
      
        All have been in linux-next for a while with no reported issues"
      
      * tag 'usb-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        MAINTAINERS: drop OMAP USB and MUSB maintainership
        usb: musb: fix DMA for host mode
        usb: phy: msm: Trigger USB state detection work in DRD mode
        usb: gadget: net2280: fix endpoint max packet for super speed connections
        usb: gadget: gadgetfs: unregister gadget only if it got successfully registered
        usb: gadget: remove driver from pending list on probe error
        Revert "usb: hub: do not clear BOS field during reset device"
        usb: chipidea: fix return value check in ci_hdrc_pci_probe()
        usb: chipidea: error on overflow for port_test_write
        USB: option: add "4G LTE usb-modem U901"
        USB: cp210x: add IDs for GE B650V3 and B850V3 boards
        USB: option: add support for SIM7100E
        usb: musb: Fix DMA desired mode for Mentor DMA engine
        usb: gadget: fsl_qe_udc: fix IS_ERR_VALUE usage
        usb: dwc2: USB_DWC2 should depend on HAS_DMA
        usb: dwc2: host: fix the data toggle error in full speed descriptor dma
        usb: dwc2: host: fix logical omissions in dwc2_process_non_isoc_desc
        usb: dwc3: Fix assignment of EP transfer resources
        usb: dwc2: Add extra delay when forcing dr_mode
      62718e30
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 12b9fa6a
      Linus Torvalds authored
      Pull vfs fixes from Al Viro.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        do_last(): ELOOP failure exit should be done after leaving RCU mode
        should_follow_link(): validate ->d_seq after having decided to follow
        namei: ->d_inode of a pinned dentry is stable only for positives
        do_last(): don't let a bogus return value from ->open() et.al. to confuse us
        fs: return -EOPNOTSUPP if clone is not supported
        hpfs: don't truncate the file when delete fails
      12b9fa6a
    • Linus Torvalds's avatar
      Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 340b3a5b
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "We didn't have a batch last week, so this one is slightly larger.
      
        None of them are scary though, a handful of fixes for small DT pieces,
        replacing properties with newer conventions.
      
        Highlights:
         - N900 fix for setting system revision
         - onenand init fix to avoid filesystem corruption
         - Clock fix for audio on Beaglebone-x15
         - Fixes on shmobile to deal with CONFIG_DEBUG_RODATA (default y in 4.6)
      
        + misc smaller stuff"
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        MAINTAINERS: Extend info, add wiki and ml for meson arch
        MAINTAINERS: alpine: add a new maintainer and update the entry
        ARM: at91/dt: fix typo in sama5d2 pinmux descriptions
        ARM: OMAP2+: Fix onenand initialization to avoid filesystem corruption
        Revert "regulator: tps65217: remove tps65217.dtsi file"
        ARM: shmobile: Remove shmobile_boot_arg
        ARM: shmobile: Move shmobile_smp_{mpidr, fn, arg}[] from .text to .bss
        ARM: shmobile: r8a7779: Remove remainings of removed SCU boot setup code
        ARM: shmobile: Move shmobile_scu_base from .text to .bss
        ARM: OMAP2+: Fix omap_device for module reload on PM runtime forbid
        ARM: OMAP2+: Improve omap_device error for driver writers
        ARM: DTS: am57xx-beagle-x15: Select SYS_CLK2 for audio clocks
        ARM: dts: am335x/am57xx: replace gpio-key,wakeup with wakeup-source property
        ARM: OMAP2+: Set system_rev from ATAGS for n900
        ARM: dts: orion5x: fix the missing mtd flash on linkstation lswtgl
        ARM: dts: kirkwood: use unique machine name for ds112
        ARM: dts: imx6: remove bogus interrupt-parent from CAAM node
      340b3a5b
    • Al Viro's avatar
      do_last(): ELOOP failure exit should be done after leaving RCU mode · 5129fa48
      Al Viro authored
      ... or we risk seeing a bogus value of d_is_symlink() there.
      
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      5129fa48
    • Al Viro's avatar
      should_follow_link(): validate ->d_seq after having decided to follow · a7f77542
      Al Viro authored
      ... otherwise d_is_symlink() above might have nothing to do with
      the inode value we've got.
      
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      a7f77542
    • Al Viro's avatar
      namei: ->d_inode of a pinned dentry is stable only for positives · d4565649
      Al Viro authored
      both do_last() and walk_component() risk picking a NULL inode out
      of dentry about to become positive, *then* checking its flags and
      seeing that it's not negative anymore and using (already stale by
      then) value they'd fetched earlier.  Usually ends up oopsing soon
      after that...
      
      Cc: stable@vger.kernel.org # v3.13+
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      d4565649
    • Al Viro's avatar
      do_last(): don't let a bogus return value from ->open() et.al. to confuse us · c80567c8
      Al Viro authored
      ... into returning a positive to path_openat(), which would interpret that
      as "symlink had been encountered" and proceed to corrupt memory, etc.
      It can only happen due to a bug in some ->open() instance or in some LSM
      hook, etc., so we report any such event *and* make sure it doesn't trick
      us into further unpleasantness.
      
      Cc: stable@vger.kernel.org # v3.6+, at least
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      c80567c8
    • Christoph Hellwig's avatar
      fs: return -EOPNOTSUPP if clone is not supported · 0fcbf996
      Christoph Hellwig authored
      -EBADF is a rather confusing error if an operations is not supported,
      and nfsd gets rather upset about it.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      0fcbf996
    • Mikulas Patocka's avatar
      hpfs: don't truncate the file when delete fails · b6853f78
      Mikulas Patocka authored
      The delete opration can allocate additional space on the HPFS filesystem
      due to btree split. The HPFS driver checks in advance if there is
      available space, so that it won't corrupt the btree if we run out of space
      during splitting.
      
      If there is not enough available space, the HPFS driver attempted to
      truncate the file, but this results in a deadlock since the commit
      7dd29d8d ("HPFS: Introduce a global mutex
      and lock it on every callback from VFS").
      
      This patch removes the code that tries to truncate the file and -ENOSPC is
      returned instead. If the user hits -ENOSPC on delete, he should try to
      delete other files (that are stored in a leaf btree node), so that the
      delete operation will make some space for deleting the file stored in
      non-leaf btree node.
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarMikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: stable@vger.kernel.org	# 2.6.39+
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      b6853f78
  3. 27 Feb, 2016 17 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 691429e1
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "10 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        dax: move writeback calls into the filesystems
        dax: give DAX clearing code correct bdev
        ext4: online defrag not supported with DAX
        ext2, ext4: only set S_DAX for regular inodes
        block: disable block device DAX by default
        ocfs2: unlock inode if deleting inode from orphan fails
        mm: ASLR: use get_random_long()
        drivers: char: random: add get_random_long()
        mm: numa: quickly fail allocations for NUMA balancing on full nodes
        mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
      691429e1
    • Linus Torvalds's avatar
      Merge tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 1c271479
      Linus Torvalds authored
      Pull ext2/4 DAX fix from Ted Ts'o:
       "This fixes a file system corruption bug with DAX"
      
      * tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()
      1c271479
    • Linus Torvalds's avatar
      Merge tag 'pci-v4.5-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · a9f8094a
      Linus Torvalds authored
      Pull PCI fixes from Bjorn Helgaas:
       "Enumeration:
          Revert x86 pcibios_alloc_irq() to fix regression (Bjorn Helgaas)
      
        Marvell MVEBU host bridge driver:
          Restrict build to 32-bit ARM (Thierry Reding)"
      
      * tag 'pci-v4.5-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: mvebu: Restrict build to 32-bit ARM
        Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()"
        Revert "PCI: Add helpers to manage pci_dev->irq and pci_dev->irq_managed"
        Revert "x86/PCI: Don't alloc pcibios-irq when MSI is enabled"
      a9f8094a
    • Ross Zwisler's avatar
      ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite() · 1e9d180b
      Ross Zwisler authored
      As it is currently written ext4_dax_mkwrite() assumes that the call into
      __dax_mkwrite() will not have to do a block allocation so it doesn't create
      a journal entry.  For a read that creates a zero page to cover a hole
      followed by a write that actually allocates storage this is incorrect.  The
      ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
      get_blocks() to allocate storage.
      
      Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
      as this function already has all the logic needed to allocate a journal
      entry and call __dax_fault().
      
      Also update the ext2 fault handlers in this same way to remove duplicate
      code and keep the logic between ext2 and ext4 the same.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      1e9d180b
    • Linus Torvalds's avatar
      Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · b9ea44bf
      Linus Torvalds authored
      Pull clk fix from Stephen Boyd:
       "One small fix to keep OMAP platforms working across a suspend/resume
        cycle"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: ti: omap3+: dpll: use non-locking version of clk_get_rate
      b9ea44bf
    • Ross Zwisler's avatar
      dax: move writeback calls into the filesystems · 7f6d5b52
      Ross Zwisler authored
      Previously calls to dax_writeback_mapping_range() for all DAX filesystems
      (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
      
      dax_writeback_mapping_range() needs a struct block_device, and it used
      to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
      mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
      block devices and for XFS real-time files.
      
      Instead, call dax_writeback_mapping_range() directly from the filesystem
      ->writepages function so that it can supply us with a valid block
      device.  This also fixes DAX code to properly flush caches in response
      to sync(2).
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f6d5b52
    • Ross Zwisler's avatar
      dax: give DAX clearing code correct bdev · 20a90f58
      Ross Zwisler authored
      dax_clear_blocks() needs a valid struct block_device and previously it
      was using inode->i_sb->s_bdev in all cases.  This is correct for normal
      inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
      DAX raw block devices and for XFS real-time devices.
      
      Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
      its arguments to take a bdev and a sector instead of an inode and a
      block.  This better reflects what the function does, and it allows the
      filesystem and raw block device code to pass in an appropriate struct
      block_device.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      20a90f58
    • Ross Zwisler's avatar
      ext4: online defrag not supported with DAX · 73f34a5e
      Ross Zwisler authored
      Online defrag operations for ext4 are hard coded to use the page cache.
      See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
      
      When combined with DAX I/O, which circumvents the page cache, this can
      result in data corruption.  This was observed with xfstests ext4/307 and
      ext4/308.
      
      Fix this by only allowing online defrag for non-DAX files.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73f34a5e
    • Ross Zwisler's avatar
      ext2, ext4: only set S_DAX for regular inodes · 0a6cf913
      Ross Zwisler authored
      When S_DAX is set on an inode we assume that if there are pages attached
      to the mapping (mapping->nrpages != 0), those pages are clean zero pages
      that were used to service reads from holes.  Any dirty data associated
      with the inode should be in the form of DAX exceptional entries
      (mapping->nrexceptional) that is written back via
      dax_writeback_mapping_range().
      
      With the current code, though, this isn't always true.  For example,
      ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
      data stored as dirty page cache entries.  For these types of inodes,
      having S_DAX set doesn't really make sense since their I/O doesn't
      actually happen through the DAX code path.
      
      Instead, only allow S_DAX to be set for regular inodes for ext2 and
      ext4.  This allows us to have strict DAX vs non-DAX paths in the
      writeback code.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a6cf913
    • Dan Williams's avatar
      block: disable block device DAX by default · 03cdadb0
      Dan Williams authored
      The recent *sync enabling discovered that we are inserting into the
      block_device pagecache counter to the expectations of the dirty data
      tracking for dax mappings.  This can lead to data corruption.
      
      We want to support DAX for block devices eventually, but it requires
      wider changes to properly manage the pagecache.
      
         dump_stack+0x85/0xc2
         dax_writeback_mapping_range+0x60/0xe0
         blkdev_writepages+0x3f/0x50
         do_writepages+0x21/0x30
         __filemap_fdatawrite_range+0xc6/0x100
         filemap_write_and_wait+0x4a/0xa0
         set_blocksize+0x70/0xd0
         sb_set_blocksize+0x1d/0x50
         ext4_fill_super+0x75b/0x3360
         mount_bdev+0x180/0x1b0
         ext4_mount+0x15/0x20
         mount_fs+0x38/0x170
      
      Mark the support broken so its disabled by default, but otherwise still
      available for testing.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03cdadb0
    • Guozhonghua's avatar
      ocfs2: unlock inode if deleting inode from orphan fails · a4a8481f
      Guozhonghua authored
      When doing append direct io cleanup, if deleting inode fails, it goes
      out without unlocking inode, which will cause the inode deadlock.
      
      This issue was introduced by commit cf1776a9 ("ocfs2: fix a tiny
      race when truncate dio orohaned entry").
      Signed-off-by: default avatarGuozhonghua <guozhonghua@h3c.com>
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: default avatarGang He <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4a8481f
    • Daniel Cashman's avatar
      mm: ASLR: use get_random_long() · 5ef11c35
      Daniel Cashman authored
      Replace calls to get_random_int() followed by a cast to (unsigned long)
      with calls to get_random_long().  Also address shifting bug which, in
      case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.
      Signed-off-by: default avatarDaniel Cashman <dcashman@android.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ef11c35
    • Daniel Cashman's avatar
      drivers: char: random: add get_random_long() · ec9ee4ac
      Daniel Cashman authored
      Commit d07e2259 ("mm: mmap: add new /proc tunable for mmap_base
      ASLR") added the ability to choose from a range of values to use for
      entropy count in generating the random offset to the mmap_base address.
      
      The maximum value on this range was set to 32 bits for 64-bit x86
      systems, but this value could be increased further, requiring more than
      the 32 bits of randomness provided by get_random_int(), as is already
      possible for arm64.  Add a new function: get_random_long() which more
      naturally fits with the mmap usage of get_random_int() but operates
      exactly the same as get_random_int().
      
      Also, fix the shifting constant in mmap_rnd() to be an unsigned long so
      that values greater than 31 bits generate an appropriate mask without
      overflow.  This is especially important on x86, as its shift instruction
      uses a 5-bit mask for the shift operand, which meant that any value for
      mmap_rnd_bits over 31 acts as a no-op and effectively disables mmap_base
      randomization.
      
      Finally, replace calls to get_random_int() with get_random_long() where
      appropriate.
      
      This patch (of 2):
      
      Add get_random_long().
      Signed-off-by: default avatarDaniel Cashman <dcashman@android.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec9ee4ac
    • Mel Gorman's avatar
      mm: numa: quickly fail allocations for NUMA balancing on full nodes · 8479eba7
      Mel Gorman authored
      Commit 4167e9b2 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
      flag combination due to confusing semantics.  It noted that
      alloc_misplaced_dst_page() was one such user after changes made by
      commit e97ca8e5 ("mm: fix GFP_THISNODE callers and clarify").
      
      Unfortunately when GFP_THISNODE was removed, users of
      alloc_misplaced_dst_page() started waking kswapd and entering direct
      reclaim because the wrong GFP flags are cleared.  The consequence is
      that workloads that used to fit into memory now get reclaimed which is
      addressed by this patch.
      
      The problem can be demonstrated with "mutilate" that exercises memcached
      which is software dedicated to memory object caching.  The configuration
      uses 80% of memory and is run 3 times for varying numbers of clients.
      The results on a 4-socket NUMA box are
      
      mutilate
                                  4.4.0                 4.4.0
                                vanilla           numaswap-v1
      Hmean    1      8394.71 (  0.00%)     8395.32 (  0.01%)
      Hmean    4     30024.62 (  0.00%)    34513.54 ( 14.95%)
      Hmean    7     32821.08 (  0.00%)    70542.96 (114.93%)
      Hmean    12    55229.67 (  0.00%)    93866.34 ( 69.96%)
      Hmean    21    39438.96 (  0.00%)    85749.21 (117.42%)
      Hmean    30    37796.10 (  0.00%)    50231.49 ( 32.90%)
      Hmean    47    18070.91 (  0.00%)    38530.13 (113.22%)
      
      The metric is queries/second with the more the better.  The results are
      way outside of the noise and the reason for the improvement is obvious
      from some of the vmstats
      
                                       4.4.0       4.4.0
                                     vanillanumaswap-v1r1
      Minor Faults                1929399272  2146148218
      Major Faults                  19746529        3567
      Swap Ins                      57307366        9913
      Swap Outs                     50623229       17094
      Allocation stalls                35909         443
      DMA allocs                           0           0
      DMA32 allocs                  72976349   170567396
      Normal allocs               5306640898  5310651252
      Movable allocs                       0           0
      Direct pages scanned         404130893      799577
      Kswapd pages scanned         160230174           0
      Kswapd pages reclaimed        55928786           0
      Direct pages reclaimed         1843936       41921
      Page writes file                  2391           0
      Page writes anon              50623229       17094
      
      The vanilla kernel is swapping like crazy with large amounts of direct
      reclaim and kswapd activity.  The figures are aggregate but it's known
      that the bad activity is throughout the entire test.
      
      Note that simple streaming anon/file memory consumers also see this
      problem but it's not as obvious.  In those cases, kswapd is awake when
      it should not be.
      
      As there are at least two reclaim-related bugs out there, it's worth
      spelling out the user-visible impact.  This patch only addresses bugs
      related to excessive reclaim on NUMA hardware when the working set is
      larger than a NUMA node.  There is a bug related to high kswapd CPU
      usage but the reports are against laptops and other UMA hardware and is
      not addressed by this patch.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8479eba7
    • Andrea Arcangeli's avatar
      mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED · ad33bb04
      Andrea Arcangeli authored
      pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
      introduced to locklessy (but atomically) detect when a pmd is a regular
      (stable) pmd or when the pmd is unstable and can infinitely transition
      from pmd_none() and pmd_trans_huge() from under us, while only holding
      the mmap_sem for reading (for writing not).
      
      While holding the mmap_sem only for reading, MADV_DONTNEED can run from
      under us and so before we can assume the pmd to be a regular stable pmd
      we need to compare it against pmd_none() and pmd_trans_huge() in an
      atomic way, with pmd_trans_unstable().  The old pmd_trans_huge() left a
      tiny window for a race.
      
      Useful applications are unlikely to notice the difference as doing
      MADV_DONTNEED concurrently with a page fault would lead to undefined
      behavior.
      
      [akpm@linux-foundation.org: tidy up comment grammar/layout]
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad33bb04
    • Thierry Reding's avatar
      PCI: mvebu: Restrict build to 32-bit ARM · 61d9e854
      Thierry Reding authored
      This driver uses PCI glue that is only available on 32-bit ARM.  This used
      to work fine as long as ARCH_MVEBU and ARCH_DOVE were exclusively 32-bit,
      but there's a patch in the pipe to make ARCH_MVEBU also available on 64-bit
      ARM.
      
      [bhelgaas: changelog; patch is coming but not merged yet]
      Signed-off-by: default avatarThierry Reding <treding@nvidia.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Acked-by: default avatarThomas Petazzoni <thomas.petazzoni@free-electrons.com>
      61d9e854
    • Bjorn Helgaas's avatar
      Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()" · 6c777e87
      Bjorn Helgaas authored
      991de2e5 ("PCI, x86: Implement pcibios_alloc_irq() and
      pcibios_free_irq()") appeared in v4.3 and helps support IOAPIC hotplug.
      
      Олег reported that the Elcus-1553 TA1-PCI driver worked in v4.2 but not
      v4.3 and bisected it to 991de2e5.  Sunjin reported that the RocketRAID
      272x driver worked in v4.2 but not v4.3.  In both cases booting with
      "pci=routirq" is a workaround.
      
      I think the problem is that after 991de2e5, we no longer call
      pcibios_enable_irq() for upstream bridges.  Prior to 991de2e5, when a
      driver called pci_enable_device(), we recursively called
      pcibios_enable_irq() for upstream bridges via pci_enable_bridge().
      
      After 991de2e5, we call pcibios_enable_irq() from pci_device_probe()
      instead of the pci_enable_device() path, which does *not* call
      pcibios_enable_irq() for upstream bridges.
      
      Revert 991de2e5 to fix these driver regressions.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=111211
      Fixes: 991de2e5 ("PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()")
      Reported-and-tested-by: default avatarОлег Мороз <oleg.moroz@mcc.vniiem.ru>
      Reported-by: default avatarSunjin Yang <fan4326@gmail.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Acked-by: default avatarRafael J. Wysocki <rafael@kernel.org>
      CC: Jiang Liu <jiang.liu@linux.intel.com>
      6c777e87
  4. 26 Feb, 2016 3 commits