    xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation · a9248538
    Darrick J. Wong authored
    Back in the old days, the "ascii-ci" feature was created to implement
    case-insensitive directory entry lookups for latin1-encoded names and
    remove the large overhead of Samba's case-insensitive lookup code.  UTF8
    names were not allowed, but nobody explicitly wrote in the documentation
    that this was only expected to work if the system used latin1 names.
    The kernel tolower function was selected to prepare names for hashed
    lookups.
    
    There's a major discrepancy in the function that computes directory entry
    hashes for filesystems that have ASCII case-insensitive lookups enabled.
    The root of this is that the kernel and glibc's tolower implementations
    have differing behavior for extended ASCII accented characters.  I wrote
    a program to spit out characters for which the tolower() return value is
    different from the input:
    
    glibc tolower:
    65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N
    79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z
    
    kernel tolower:
    65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N
    79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z 192:À 193:Á
    194:Â 195:Ã 196:Ä 197:Å 198:Æ 199:Ç 200:È 201:É 202:Ê 203:Ë 204:Ì 205:Í
    206:Î 207:Ï 208:Ð 209:Ñ 210:Ò 211:Ó 212:Ô 213:Õ 214:Ö 215:× 216:Ø 217:Ù
    218:Ú 219:Û 220:Ü 221:Ý 222:Þ
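
A program of the kind described above could be as small as the following sketch. It is not the author's tool: the function name dump_tolower_changes is made up for illustration, and it prints only the numeric byte values rather than the number:character pairs shown in the lists. Built in userspace with the default "C" locale it exercises the C library's tolower(); reproducing the kernel list would need the same loop run against the kernel's own tolower().

```c
#include <ctype.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the patch): write every byte value whose
 * tolower() result differs from the input to "out", and return how many
 * such values were found.
 */
int dump_tolower_changes(FILE *out)
{
	int count = 0;

	for (int c = 0; c < 256; c++) {
		if (tolower(c) != c) {
			fprintf(out, "%d ", c);
			count++;
		}
	}
	fputc('\n', out);
	return count;
}
```

Calling dump_tolower_changes(stdout) from main() prints the changed byte values on one line.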
    
    This means that the kernel and userspace do not agree on the hash value
    for a directory entry name that contains any of those bytes above 127.
    The hash values are written into the leaf index block of directories
    that are larger than two blocks in size, so xfs_repair will flag these
    directories as having corrupted hash indexes and rewrite the index with
    hash values that the kernel will then not recognize.
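
The disagreement can be illustrated with a small sketch. Everything here is an assumption made for illustration: the rotate-and-xor hash is a stand-in rather than a copy of the XFS hash function, the names are hypothetical, and the two folding functions simply encode the two lists printed above (A-Z only, versus A-Z plus bytes 192 to 222).

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 < n < 32). */
uint32_t sketch_rol32(uint32_t v, unsigned int n)
{
	return (v << n) | (v >> (32 - n));
}

/* Fold only A-Z, matching the glibc list in this message. */
unsigned char fold_ascii(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? c + 32 : c;
}

/* Fold A-Z plus bytes 192..222, copied from the kernel list. */
unsigned char fold_latin1(unsigned char c)
{
	if ((c >= 'A' && c <= 'Z') || (c >= 192 && c <= 222))
		return c + 32;
	return c;
}

/* Illustrative hash: xor each folded byte into a rotating accumulator. */
uint32_t ci_hash(const unsigned char *name, size_t len,
		 unsigned char (*fold)(unsigned char))
{
	uint32_t hash = 0;

	for (size_t i = 0; i < len; i++)
		hash = fold(name[i]) ^ sketch_rol32(hash, 7);
	return hash;
}
```

With these definitions a pure-ASCII name hashes identically under both folds, while a name containing byte 201 (É in latin1) does not, which is the kind of mismatch xfs_repair would report.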
    
    Because the ascii-ci feature is not frequently enabled and the kernel
    touches filesystems far more frequently than xfs_repair does, fix this
    by encoding the kernel's isupper predicate and tolower functions into
    libxfs.  Give the new functions less provocative names to make it really
    obvious that this is a pre-hash name preparation function, and nothing
    else.  This change makes userspace's behavior consistent with the
    kernel.
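
A minimal sketch of what such a locale-independent predicate and transform pair might look like follows. The function names are hypothetical and are not the ones introduced by this patch. The byte ranges are copied from the "kernel tolower" list in this message rather than from the kernel sources, so they should be checked against the kernel's ctype table before being relied on. The offset of 32 is the latin1 distance between an upper-case letter and its lower-case form.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical predicate: does this byte change during pre-hash name
 * preparation? Uses fixed ranges, so the answer never depends on the
 * C library or the current locale.
 */
bool name_byte_needs_xfrm(unsigned char c)
{
	return (c >= 65 && c <= 90) || (c >= 192 && c <= 222);
}

/* Hypothetical transform: map a byte to the form that gets hashed. */
unsigned char name_byte_xfrm(unsigned char c)
{
	if (name_byte_needs_xfrm(c))
		c += 32;
	return c;
}

/* Prepare a whole name in place before hashing it. */
void name_xfrm(unsigned char *name, size_t len)
{
	for (size_t i = 0; i < len; i++)
		name[i] = name_byte_xfrm(name[i]);
}
```

Because the ranges are spelled out as constants, the same source compiled in the kernel and in userspace yields the same prepared bytes, and therefore the same hash.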
    
    Found by auditing obfuscate_name in xfs_metadump as part of working on
    parent pointers, wondering how it could possibly work correctly with ci
    filesystems, writing a test tool to create a directory with
    hash-colliding names, and watching xfs_repair flag it.
    
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>