ipset: Optimise hash table size

ipset uses a hash table internally whose size can be tuned to trade space efficiency against performance. Prior to this patch, we always set the size of the hash table to 1024 buckets. For large sets with almost half a million entries, this does not perform well, since a lot of time is spent searching the linked list in each bucket. This will probably be even slower on systems with smaller caches, like the IPFire Mini Appliance.

Having more buckets that are sparsely filled results in fewer memory fetches at the cost of more wasted memory. Across the whole IPv4 set, memory usage ranges from about 50 MB for a factor of 4 to about 100 MB for a factor of 0.75.

Since memory of this quantity is cheap and since we want to increase throughput, I have chosen to set the fill factor to 0.75.

Logistically, it is a little complicated to know the number of entries in advance when we have to write the header, so we write the entire file first and then come back to write the header again. This is required to keep memory consumption down during the export.

Signed-off-by: Michael Tremer <michael.tremer@ipfire.org>
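
The sizing rule from the commit message (round the number of networks divided by the fill factor up to a power of two) can be sketched on its own as follows. This is an illustration, not the code from the patch: the function name `optimal_hashsize` is hypothetical, and `math.log2` is used here instead of `math.log(x, 2)` as an assumption that it is a suitable substitute.

```python
import math

# Values taken from the patch; the function itself is a hypothetical sketch
DEFAULT_HASHSIZE = 64
HASHSIZE_FACTOR = 0.75

def optimal_hashsize(networks, factor=HASHSIZE_FACTOR):
    """
        Returns the smallest power of two that is at least
        networks / factor, or the default if the set is empty
    """
    # Fall back to the default if we don't know the size of the set
    if not networks:
        return DEFAULT_HASHSIZE

    # Number of buckets needed to reach the desired fill factor
    exponent = math.log2(networks / factor)

    # Round up to the next power of two
    return 2 ** math.ceil(exponent)

for count in (0, 1000, 500000):
    print(count, optimal_hashsize(count))
```

With a factor of 4 instead of 0.75, the same half a million networks would fit into a table eight times smaller, which is the memory-versus-speed trade-off the message describes.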

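The "write the file first, then come back to the header" approach relies on the header having a fixed length, so that rewriting it does not overwrite any of the lines that follow. A minimal sketch, assuming an in-memory file and a hypothetical `export_ipset` helper (neither name is part of the patch), and assuming both numbers stay within eight digits:

```python
import io
import math

# Fixed-width fields keep the header length constant between both writes
HEADER = "create %s hash:net family inet hashsize %8d maxelem %8d -exist\n"

def hashsize_for(count, factor=0.75):
    # Hypothetical helper: power-of-two bucket count for the given fill factor
    if not count:
        return 64
    return 2 ** math.ceil(math.log2(count / factor))

def export_ipset(f, name, networks):
    # First pass: write a placeholder header, as the set size is still unknown
    f.write(HEADER % (name, hashsize_for(0), 0))
    f.write("flush %s\n" % name)

    # Stream all networks to the file and only keep a counter in memory
    count = 0
    for network in networks:
        f.write("add %s %s\n" % (name, network))
        count += 1

    # Second pass: jump back and overwrite the header in place
    f.seek(0)
    f.write(HEADER % (name, hashsize_for(count), count))

    return count

buf = io.StringIO()
export_ipset(buf, "CC_XX", iter(["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]))
print(buf.getvalue())
```

Because the networks are consumed from an iterator and never collected in a list, memory use stays flat regardless of how large the set is, which is the reason the message gives for this design.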
47de14b0 · Michael Tremer · 181220ac · 47de14b0
Commit 47de14b0 authored Mar 01, 2022 by Michael Tremer

Showing with 62 additions and 1 deletion

src/python/export.py +62 -1

--- a/src/python/export.py
+++ b/src/python/export.py
@@ -20,6 +20,7 @@
 import io
 import ipaddress
 import logging
+import math
 import os
 import socket

@@ -43,9 +44,18 @@ class OutputWriter(object):
 	def __init__(self, f, prefix=None):
 		self.f, self.prefix = f, prefix

+		# Call any custom initialization
+		self.init()
+
 		# Immediately write the header
 		self._write_header()

+	def init(self):
+		"""
+			To be overwritten by anything that inherits from this
+		"""
+		pass
+
 	@classmethod
 	def open(cls, filename, **kwargs):
 		"""
@@ -89,13 +99,64 @@ class IpsetOutputWriter(OutputWriter):
 	"""
 	suffix = "ipset"

+	# The value is being used if we don't know any better
+	DEFAULT_HASHSIZE = 64
+
+	# We aim for this many networks in a bucket on average. This allows us to choose
+	# how much memory we want to sacrifice to gain better performance. The lower the
+	# factor, the faster a lookup will be, but it will use more memory.
+	# We will aim for only using three quarters of all buckets to avoid any searches
+	# through the linked lists.
+	HASHSIZE_FACTOR = 0.75
+
+	def init(self):
+		# Count all networks
+		self.networks = 0
+
+	@property
+	def hashsize(self):
+		"""
+			Calculates an optimized hashsize
+		"""
+		# Return the default value if we don't know the size of the set
+		if not self.networks:
+			return self.DEFAULT_HASHSIZE
+
+		# Find the nearest power of two that is larger than the number of networks
+		# divided by the hashsize factor.
+		exponent = math.log(self.networks / self.HASHSIZE_FACTOR, 2)
+
+		# Return the size of the hash
+		return 2 ** math.ceil(exponent)
+
+	@property
+	def maxelem(self):
+		"""
+			Tells ipset how large the set will be.
+
+			Since these are considered immutable, we will use the total number of networks.
+		"""
+		return self.networks
+
 	def _write_header(self):
-		self.f.write("create %s hash:net family inet hashsize 1024 maxelem 65536 -exist\n" % self.prefix)
+		# This must have a fixed size, because we will write the header again in the end
+		self.f.write("create %s hash:net family inet "
+			"hashsize %8d maxelem %8d -exist\n" % (self.prefix, self.hashsize, self.maxelem))
 		self.f.write("flush %s\n" % self.prefix)

 	def write(self, network):
 		self.f.write("add %s %s\n" % (self.prefix, network))

+		# Increment network counter
+		self.networks += 1
+
+	def _write_footer(self):
+		# Jump back to the beginning of the file
+		self.f.seek(0)
+
+		# Rewrite the header with better configuration
+		self._write_header()
+

 class NftablesOutputWriter(OutputWriter):
 	"""