Handle encoding in non-binary Blob instances

gitlab_git 10.6.4 relies on Rugged to mark blobs as binary or not, instead of relying on Linguist. Linguist in turn would mark text blobs as binary whenever they contained byte sequences that could not be encoded using UTF-8. However, marking such blobs as binary is not correct. If one pushes a Markdown document with invalid character sequences, it is still a text-based Markdown document and not some random binary blob.

This commit overrides Blob#data so it automatically converts text-based content to UTF-8 (the encoding we use everywhere else) while replacing any invalid sequences with the Unicode replacement character (U+FFFD). The data of binary blobs is left as-is.
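As a rough illustration of the conversion described above, the following standalone sketch shows how `String#encode` with `invalid: :replace` and `undef: :replace` behaves on a text payload containing invalid bytes. The helper name `to_utf8_text` and the sample strings are hypothetical and not part of this commit; only the `encode` call mirrors what the change uses.

```ruby
# Hypothetical helper (not part of the commit): convert text content to
# UTF-8, replacing anything that cannot be represented.
def to_utf8_text(raw)
  raw.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# A Markdown-like payload tagged as UTF-8 but containing invalid bytes.
broken = "# Title\n\xFF\xB9\xC3".dup.force_encoding(Encoding::UTF_8)
puts broken.valid_encoding?   # false: the trailing bytes are not valid UTF-8

clean = to_utf8_text(broken)
puts clean.valid_encoding?    # true: invalid bytes became U+FFFD
puts clean.inspect

# Content in another text encoding is transcoded rather than replaced.
latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
puts to_utf8_text(latin1)
```

The result is still recognisably the original document, which is why treating such a blob as text rather than binary seems the more useful behaviour.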

Commit 0bc443e3 authored Sep 12, 2016 by Yorick Peterse

Showing with 32 additions and 0 deletions

app/models/blob.rb +12 -0

spec/models/blob_spec.rb +20 -0

--- a/app/models/blob.rb
+++ b/app/models/blob.rb
@@ -22,6 +22,18 @@ class Blob < SimpleDelegator
    new(blob)
  end
+  # Returns the data of the blob.
+  #
+  # If the blob is a text based blob the content is converted to UTF-8 and any
+  # invalid byte sequences are replaced.
+  def data
+    if binary?
+      super
+    else
+      @data ||= super.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
+    end
+  end
  def no_highlighting?
    size && size > 1.megabyte
  end
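
The override above relies on `super` inside a `SimpleDelegator` subclass falling through to the wrapped object. A minimal sketch of that pattern outside Rails, assuming a stand-in `RawBlob` struct in place of the real gitlab_git blob and a hypothetical class name `TextSafeBlob`:

```ruby
require 'delegate'

# Stand-in for the wrapped Git blob (an assumption for this sketch only).
RawBlob = Struct.new(:data, :binary) do
  def binary?
    binary
  end
end

# Hypothetical wrapper mirroring the shape of the Blob#data override.
class TextSafeBlob < SimpleDelegator
  def data
    # Binary content is passed through untouched.
    return super if binary?

    # Text content is converted once and memoized on the wrapper.
    @data ||= super.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
end

raw = "\n\xFF\xB9\xC3".dup.force_encoding(Encoding::UTF_8)

text_blob = TextSafeBlob.new(RawBlob.new(raw, false))
puts text_blob.data.valid_encoding?   # true

binary_blob = TextSafeBlob.new(RawBlob.new(raw, true))
puts binary_blob.data.equal?(raw)     # true: same object, no conversion
```

Memoizing only in the text branch means the converted string is built once per wrapper, while binary data is never copied.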

--- a/spec/models/blob_spec.rb
+++ b/spec/models/blob_spec.rb
+# encoding: utf-8
 require 'rails_helper'
 describe Blob do
@@ -7,6 +8,25 @@ describe Blob do
    end
  end
+  describe '#data' do
+    context 'using a binary blob' do
+      it 'returns the data as-is' do
+        data = "\n\xFF\xB9\xC3"
+        blob = described_class.new(double(binary?: true, data: data))
+        expect(blob.data).to eq(data)
+      end
+    end
+    context 'using a text blob' do
+      it 'converts the data to UTF-8' do
+        blob = described_class.new(double(binary?: false, data: "\n\xFF\xB9\xC3"))
+        expect(blob.data).to eq("\n���")
+      end
+    end
+  end
  describe '#svg?' do
    it 'is falsey when not text' do
      git_blob = double(text?: false)