• Varun Gupta's avatar
    MDEV-11563: GROUP_CONCAT(DISTINCT ...) may produce a non-distinct list · 81a08c54
    Varun Gupta authored
     Backported from MYSQL
     Bug #25331425: DISTINCT CLAUSE DOES NOT WORK IN GROUP_CONCAT
        Issue:
        ------
        The problem occurs when:
        1) GROUP_CONCAT (DISTINCT ....) is used in the query.
        2) Data size greater than value of system variable:
        tmp_table_size.
    
        The result would contain values that are non-unique.
    
        Root cause:
        -----------
        An in-memory structure is used to filter out non-unique
        values. When the data size exceeds tmp_table_size, the
        overflow is written to disk as a separate file. The
        expectation here is that when all such files are merged,
        the full set of unique values can be obtained.
    
        But the Item_func_group_concat::add function is in a bit of
        hurry. Even as it is adding values to the tree, it wants to
        decide if a value is unique and write it to the result
        buffer. This works fine if the configured maximum size is
        greater than the size of the data. But since tmp_table_size
        is set to a low value, the size of the tree is smaller and
        hence requires the creation of multiple copies on disk.
    
        Item_func_group_concat currently has no mechanism to merge
        all the copies on disk and then generate the result. This
        results in duplicate values.
    
        Solution:
        ---------
        In case of the DISTINCT clause, don't write to the result
        buffer immediately. Do the merge and only then put the
        unique values in the result buffer. This has be done in
        Item_func_group_concat::val_str.
    
        Note regarding result file changes:
        -----------------------------------
        Earlier when a unique value was seen in
        Item_func_group_concat::add, it was dumped to the output.
        So result is in the order stored in SE. But with this fix,
        we wait until all the data is read and the final set of
        unique values are written to output buffer. So the data
        appears in the sorted order.
    
        This only fixes the cases when we have DISTINCT without ORDER BY clause
        in GROUP_CONCAT.
    81a08c54
func_gconcat.test 33.1 KB