TBH, I find the whole premise a bit silly. The one use case the original video talks about where this might be a good idea is when you have something like a MonsterId(String), and it argues for MonsterId(Arc<str>) instead - which I think is an anti-pattern. Use MonsterId(usize) and it becomes far cheaper to clone than even an Arc<str>. There is no need to lug around strings as ids - which are presumably going to be used to look up values in a hashmap or similar (which again, I would suspect it is faster to hash a usize than a string of any type).
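Roughly what I have in mind (the Monster fields here are just made up for illustration):

```rust
use std::collections::HashMap;

// A Copy newtype id: cloning is just copying a machine word, and hashing
// a usize is cheaper than hashing any string type.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct MonsterId(usize);

// Invented stand-in for whatever the id points at.
struct Monster {
    name: String,
    hp: u32,
}

fn main() {
    let mut monsters: HashMap<MonsterId, Monster> = HashMap::new();
    monsters.insert(MonsterId(42), Monster { name: "Goblin".into(), hp: 7 });

    let id = MonsterId(42);
    let same = id; // Copy, so no clone() needed at all
    if let Some(m) = monsters.get(&same) {
        println!("{} has {} hp", m.name, m.hp);
    }
}
```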
Most of the rest of the time, cloning a string is not a huge issue, and when it is, a &str is often good enough. I find it rare that you would ever need both the clone performance and the multiple ownership of an Arc<str>.
There are also crates like smallstring, smallstr, etc., which are stack-allocated strings up to a certain length before they start heap-allocating. They would be worth a look, to see how they perform compared to Arc or String. I bet most of the things they were using Strings for would fit inside these types just fine without the extra allocation.
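I haven't checked their exact APIs, but the rough idea behind those crates is something like this simplified sketch (not the real smallstr implementation):

```rust
// Short strings live inline on the stack, longer ones fall back to a
// heap-allocated String. The 22-byte limit is arbitrary for this sketch.
enum SmallStringSketch {
    Inline { len: u8, buf: [u8; 22] },
    Heap(String),
}

impl SmallStringSketch {
    fn new(s: &str) -> Self {
        if s.len() <= 22 {
            let mut buf = [0u8; 22];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStringSketch::Inline { len: s.len() as u8, buf }
        } else {
            SmallStringSketch::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            SmallStringSketch::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SmallStringSketch::Heap(s) => s.as_str(),
        }
    }
}

fn main() {
    let short = SmallStringSketch::new("goblin");        // stays on the stack
    let long = SmallStringSketch::new(&"x".repeat(40));  // falls back to the heap
    assert_eq!(short.as_str(), "goblin");
    assert_eq!(long.as_str().len(), 40);
}
```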
It is a neat trick to have in your toolbag - but not the first thing I would jump for when dealing with strings.
The main use case is “uuid,” but those can be represented as 128-bit integers instead of character sequences if cloning is a concern. This emphasizes your point: usually there’s a more obvious, good enough solution to performance issues than moving a string to an Arc.
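For example, here is a sketch of what carrying an id around as a plain u128 newtype could look like (names made up; I believe the uuid crate also offers 128-bit conversions if you need the real thing):

```rust
// An id carried around as a plain 128-bit integer. Copy makes clones
// trivial, and you only format it as text at the edges of the system.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Id(u128);

impl Id {
    // Render as 32 hex digits (the hyphens in a UUID are presentational).
    fn to_hex(self) -> String {
        format!("{:032x}", self.0)
    }
}

fn main() {
    let id = Id(0x0123_4567_89ab_cdef_0123_4567_89ab_cdef);
    let copy = id; // a 16-byte copy, no allocation
    assert_eq!(copy.to_hex(), id.to_hex());
    assert_eq!(id.to_hex().len(), 32);
}
```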
It is a neat trick, and something I love reading about, but I would very much expect a lengthy comment in the code if someone does this in practice because it is not obvious at all.
The main use case is “uuid,” but those can be represented as 128-bit integers
Oh man. I’ve seen so much software that treats UUIDs as strings internally. I’ve also seen things like IPv4 addresses being used in dotted notation as strings, and then the developers asking themselves why calculating “is this addr in this subnet” is so complicated.
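For anyone curious, this is roughly what the subnet check looks like once the address is an integer - just a sketch using std's Ipv4Addr:

```rust
use std::net::Ipv4Addr;

// With addresses as integers, "is this addr in this subnet" is a mask and
// a comparison; doing the same on dotted strings is where the pain starts.
fn in_subnet(addr: Ipv4Addr, network: Ipv4Addr, prefix_len: u32) -> bool {
    // /0 matches everything; also avoids the invalid `<< 32` shift.
    let mask = if prefix_len == 0 { 0 } else { u32::MAX << (32 - prefix_len) };
    (u32::from(addr) & mask) == (u32::from(network) & mask)
}

fn main() {
    let addr: Ipv4Addr = "192.168.1.42".parse().unwrap();
    let net: Ipv4Addr = "192.168.1.0".parse().unwrap();
    assert!(in_subnet(addr, net, 24));
    assert!(!in_subnet("10.0.0.1".parse().unwrap(), net, 24));
}
```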
I blame it on many people only learning high-level scripting languages.
Also, you might be interested in ULID: https://github.com/ulid/spec
Yes! This bothers me as well!
The last team I worked with did a lot with addresses and we used Go, and way too many of my coworkers were confused when the standard library used a 4-byte array to represent IP addresses instead of a string. I’ve even had to read IP addresses in hex (e.g. in raw packet dumps), so I’m used to counting octet offsets.
ULID
Ew, that violates much of the point of an ID, which should be entirely opaque. One of the best parts about a UUID is that it can be exposed to the user without fear of providing any data to an attacker (we use UUIDv4 for that reason).
But maybe it’s useful for distributed systems where you trust the machines providing the IDs to have accurate timestamps, but then why not just use a tuple of a random ID and a timestamp? That way you can pick if you want lexicographical sorting or random distribution by swapping the parts of the tuple, and you can use standard 64-bit timestamps (and why milliseconds? Every time system I’ve used uses nanoseconds or seconds, only JS does the silly ms thing; maybe Windows does?). I guess compatibility with UUID is useful kinda, but I honestly don’t see a ton of value here (saves you a column in the DB and an import I guess).
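Something like this sketch is all I would reach for (field names invented; the derived Ord compares fields in declaration order, which is what gives you the choice):

```rust
// Swapping the two fields is all it takes to choose between time-sorted
// and randomly-distributed ids.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TimeFirstId {
    unix_nanos: u64, // compared first: ids sort roughly by creation time
    random: u64,
}

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct RandomFirstId {
    random: u64,     // compared first: ids scatter, no time ordering
    unix_nanos: u64,
}

fn main() {
    let a = TimeFirstId { unix_nanos: 1, random: 999 };
    let b = TimeFirstId { unix_nanos: 2, random: 0 };
    assert!(a < b); // the earlier timestamp wins regardless of the random part

    let c = RandomFirstId { random: 999, unix_nanos: 1 };
    let d = RandomFirstId { random: 0, unix_nanos: 2 };
    assert!(c > d); // here only the random part decides
}
```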
So to me, this seems like it’s going to be misused a ton. I’m really scratching my head over using base32, because that’s only useful for preventing transcription errors, which means it’s intended to be seen and used.
I hope I’m missing something because this seems like an obvious “don’t do this” situation.
Great points about ULID. To be honest, I haven’t looked into it in detail; I just saw it mentioned somewhere and glanced over its self-description. The main reason I wouldn’t use it is that Postgres already has great built-in support for UUIDs.
which are presumably going to be used to look up values in a hashmap or similar (which again, I would suspect it is faster to hash a usize than a string of any type).
Yeah, for any primitive type, you can just use the value itself as the hash value here, so it’s effectively a no-op. I assume Rust’s implementation makes use of this…
Probably not, for DoS/security reasons. You would need to use something like nohasher to get no-op hashing.
Hmm, I was wondering if there’s an overlap with using hashes for security stuff. Do you happen to know of an exploit that makes use of something like predictable placement in a hashmap?
Or is your assumption rather that they wouldn’t include special treatment for primitive types in the hashmap implementation?
It definitely feels a bit freaky to me, too, since you’d rob users of the ability to customize the Hasher implementation, but I also felt like they almost have to do it, because it might make a massive difference in performance.
…but after thinking about it some more, I guess you’d typically use a BTreeMap when your keys are primitives. So now I’m on board with the guess that they wouldn’t include special treatment in HashMap. 🙃
It is talked about in the HashMap docs:
By default, HashMap uses a hashing algorithm selected to provide resistance against HashDoS attacks. The algorithm is randomly seeded, and a reasonable best-effort is made to generate this seed from a high quality, secure source of randomness provided by the host without blocking the program.
The default hashing algorithm is currently SipHash 1-3, though this is subject to change at any point in the future. While its performance is very competitive for medium sized keys, other hashing algorithms will outperform it for small keys such as integers as well as large keys such as long strings, though those algorithms will typically not protect against attacks such as HashDoS.
Basically, if the attacker has control over the keys inserted into a hashmap, then with a simple hashing algorithm they can force collisions, which results in the hashmap falling back to a much slower linear lookup. This can be enough to stress a server and slow down all requests going through it, or even cause it to crash. So a lot of effort is made in the default hasher to mitigate this. There are faster hashing implementations out there that you can opt into if you are not worried about this. But the default is to be secure.
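For illustration, here is a minimal std-only sketch of opting into a no-op hasher for integer keys (I believe crates like nohash-hasher package this up properly; only do this when attackers cannot choose your keys):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Opting out of SipHash for integer keys: the "hash" is just the key's
// value. Only reasonable when an attacker can't pick your keys.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("this sketch only supports integer keys");
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
    fn write_usize(&mut self, n: usize) {
        self.0 = n as u64;
    }
}

type IntMap<V> = HashMap<u64, V, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let mut monsters: IntMap<&str> = IntMap::default();
    monsters.insert(42, "Goblin");
    assert_eq!(monsters.get(&42), Some(&"Goblin"));
}
```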
Thanks. :)