xfs: remote attribute headers contain an invalid LSN

commit e3c32ee9 upstream. In recent testing, a system that crashed failed log recovery on restart with a bad symlink buffer magic number: XFS (vda): Starting recovery (logdev: internal) XFS (vda): Bad symlink block magic! XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060 On examination of the log via xfs_logprint, none of the symlink buffers in the log had a bad magic number, nor were any other types of buffer log format headers mis-identified as symlink buffers. Tracing was used to find the buffer the kernel was tripping over, and xfs_db identified it's contents as: 000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e 020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d 040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d ..... This is a remote attribute buffer, which are notable in that they are not logged but are instead written synchronously by the remote attribute code so that they exist on disk before the attribute transactions are committed to the journal. The above remote attribute block has an invalid LSN in it - cycle 0xd002000, block 0 - which means when log recovery comes along to determine if the transaction that writes to the underlying block should be replayed, it sees a block that has a future LSN and so does not replay the buffer data in the transaction. Instead, it validates the buffer magic number and attaches the buffer verifier to it. It is this buffer magic number check that is failing in the above assert, indicating that we skipped replay due to the LSN of the underlying buffer. The problem here is that the remote attribute buffers cannot have a valid LSN placed into them, because the transaction that contains the attribute tree pointer changes and the block allocation that the attribute data is being written to hasn't yet been committed. Hence the LSN field in the attribute block is completely unwritten, thereby leaving the underlying contents of the block in the LSN field. It could have any value, and hence a future overwrite of the block by log recovery may or may not work correctly. Fix this by always writing an invalid LSN to the remote attribute block, as any buffer in log recovery that needs to write over the remote attribute should occur. We are protected from having old data written over the attribute by the fact that freeing the block before the remote attribute is written will result in the buffer being marked stale in the log and so all changes prior to the buffer stale transaction will be cancelled by log recovery. Hence it is safe to ignore the LSN in the case or synchronously written, unlogged metadata such as remote attribute blocks, and to ensure we do that correctly, we need to write an invalid LSN to all remote attribute blocks to trigger immediate recovery of metadata that is written over the top. As a further protection for filesystems that may already have remote attribute blocks with bad LSNs on disk, change the log recovery code to always trigger immediate recovery of metadata over remote attribute blocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Kamal Mostafa <kamal@canonical.com>

xfs: remote attribute headers contain an invalid LSN
commit e3c32ee9 upstream. In recent testing, a system that crashed failed log recovery on restart with a bad symlink buffer magic number: XFS (vda): Starting recovery (logdev: internal) XFS (vda): Bad symlink block magic! XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060 On examination of the log via xfs_logprint, none of the symlink buffers in the log had a bad magic number, nor were any other types of buffer log format headers mis-identified as symlink buffers. Tracing was used to find the buffer the kernel was tripping over, and xfs_db identified it's contents as: 000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e 020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d 040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d ..... This is a remote attribute buffer, which are notable in that they are not logged but are instead written synchronously by the remote attribute code so that they exist on disk before the attribute transactions are committed to the journal. The above remote attribute block has an invalid LSN in it - cycle 0xd002000, block 0 - which means when log recovery comes along to determine if the transaction that writes to the underlying block should be replayed, it sees a block that has a future LSN and so does not replay the buffer data in the transaction. Instead, it validates the buffer magic number and attaches the buffer verifier to it. It is this buffer magic number check that is failing in the above assert, indicating that we skipped replay due to the LSN of the underlying buffer. The problem here is that the remote attribute buffers cannot have a valid LSN placed into them, because the transaction that contains the attribute tree pointer changes and the block allocation that the attribute data is being written to hasn't yet been committed. Hence the LSN field in the attribute block is completely unwritten, thereby leaving the underlying contents of the block in the LSN field. It could have any value, and hence a future overwrite of the block by log recovery may or may not work correctly. Fix this by always writing an invalid LSN to the remote attribute block, as any buffer in log recovery that needs to write over the remote attribute should occur. We are protected from having old data written over the attribute by the fact that freeing the block before the remote attribute is written will result in the buffer being marked stale in the log and so all changes prior to the buffer stale transaction will be cancelled by log recovery. Hence it is safe to ignore the LSN in the case or synchronously written, unlogged metadata such as remote attribute blocks, and to ensure we do that correctly, we need to write an invalid LSN to all remote attribute blocks to trigger immediate recovery of metadata that is written over the top. As a further protection for filesystems that may already have remote attribute blocks with bad LSNs on disk, change the log recovery code to always trigger immediate recovery of metadata over remote attribute blocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Kamal Mostafa <kamal@canonical.com>
20149e85 · Dave Chinner · Kamal Mostafa · 410f970d · 20149e85 · 20149e85
Commit 20149e85 authored Jul 29, 2015 by Dave Chinner Committed by Kamal Mostafa Aug 28, 2015
Hide whitespace changes
Inline Side-by-side

Showing with 31 additions and 9 deletions

fs/xfs/libxfs/xfs_attr_remote.c fs/xfs/libxfs/xfs_attr_remote.c +23 -6

fs/xfs/xfs_log_recover.c fs/xfs/xfs_log_recover.c +8 -3

No files found.
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -159,11 +159,10 @@ xfs_attr3_rmt_write_verify(
 	struct xfs_buf	*bp)
 {
 	struct xfs_mount *mp = bp->b_target->bt_mount;
-	struct xfs_buf_log_item	*bip = bp->b_fspriv;
+	int		blksize = mp->m_attr_geo->blksize;
 	char		*ptr;
 	int		len;
 	xfs_daddr_t	bno;
-	int		blksize = mp->m_attr_geo->blksize;
 	/* no verification of non-crc buffers */
 	if (!xfs_sb_version_hascrc(&mp->m_sb))
@@ -175,16 +174,22 @@ xfs_attr3_rmt_write_verify(
 	ASSERT(len >= blksize);
 	while (len > 0) {
+		struct xfs_attr3_rmt_hdr *rmt = (struct xfs_attr3_rmt_hdr *)ptr;
 		if (!xfs_attr3_rmt_verify(mp, ptr, blksize, bno)) {
 			xfs_buf_ioerror(bp, -EFSCORRUPTED);
 			xfs_verifier_error(bp);
 			return;
 		}
-		if (bip) {
-			struct xfs_attr3_rmt_hdr *rmt;
-			rmt = (struct xfs_attr3_rmt_hdr *)ptr;
+		/*
-			rmt->rm_lsn = cpu_to_be64(bip->bli_item.li_lsn);
+		 * Ensure we aren't writing bogus LSNs to disk. See
+		 * xfs_attr3_rmt_hdr_set() for the explanation.
+		 */
+		if (rmt->rm_lsn != cpu_to_be64(NULLCOMMITLSN)) {
+			xfs_buf_ioerror(bp, -EFSCORRUPTED);
+			xfs_verifier_error(bp);
+			return;
 		}
 		xfs_update_cksum(ptr, blksize, XFS_ATTR3_RMT_CRC_OFF);
@@ -221,6 +226,18 @@ xfs_attr3_rmt_hdr_set(
 	rmt->rm_owner = cpu_to_be64(ino);
 	rmt->rm_blkno = cpu_to_be64(bno);
+	/*
+	 * Remote attribute blocks are written synchronously, so we don't
+	 * have an LSN that we can stamp in them that makes any sense to log
+	 * recovery. To ensure that log recovery handles overwrites of these
+	 * blocks sanely (i.e. once they've been freed and reallocated as some
+	 * other type of metadata) we need to ensure that the LSN has a value
+	 * that tells log recovery to ignore the LSN and overwrite the buffer
+	 * with whatever is in it's log. To do this, we use the magic
+	 * NULLCOMMITLSN to indicate that the LSN is invalid.
+	 */
+	rmt->rm_lsn = cpu_to_be64(NULLCOMMITLSN);
 	return sizeof(struct xfs_attr3_rmt_hdr);
 }

--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1887,9 +1887,14 @@ xlog_recover_get_buf_lsn(
 		uuid = &((struct xfs_dir3_blk_hdr *)blk)->uuid;
 		break;
 	case XFS_ATTR3_RMT_MAGIC:
-		lsn = be64_to_cpu(((struct xfs_attr3_rmt_hdr *)blk)->rm_lsn);
+		/*
-		uuid = &((struct xfs_attr3_rmt_hdr *)blk)->rm_uuid;
+		 * Remote attr blocks are written synchronously, rather than
-		break;
+		 * being logged. That means they do not contain a valid LSN
+		 * (i.e. transactionally ordered) in them, and hence any time we
+		 * see a buffer to replay over the top of a remote attribute
+		 * block we should simply do so.
+		 */
+		goto recover_immediately;
 	case XFS_SB_MAGIC:
 		lsn = be64_to_cpu(((struct xfs_dsb *)blk)->sb_lsn);
 		uuid = &((struct xfs_dsb *)blk)->sb_uuid;