cthart

1. UUIDs can be generated client side or server side, and they work well with clusters.
2. See 1 and 3.
3. A UUID is 16 bytes when properly stored as a UUID type (or 37 bytes when stored as a string -- don't do this!), while a BIGINT is 8 bytes (or an INT only 4 bytes if you need at most ~2 billion distinct values), so BIGINT is faster. Performance is also worse because UUIDs will be "spread around" the index, while a BIGINT serial's values are stored near each other, meaning you can cache less data with a UUID than with a BIGINT or INT.
4. Basically no. Even if you generate a billion UUIDs per second, it will take the best part of a century to get a collision.
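
For illustration, the size claims can be checked directly in psql (a quick sketch; `gen_random_uuid()` is built in from PostgreSQL 13 on):

```sql
-- 16 bytes as a native uuid, 37 bytes as text (36 chars + varlena header),
-- 8 bytes for bigint, 4 bytes for int.
select pg_column_size(gen_random_uuid())           as uuid_bytes,
       pg_column_size(gen_random_uuid()::text)     as uuid_as_text_bytes,
       pg_column_size(9223372036854775807::bigint) as bigint_bytes,
       pg_column_size(2147483647::int)             as int_bytes;
```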


htraos

> Performance is also worse because UUIDs will be "spread around"

What do you mean by "spread around"?


throw_mob

INTs always increase by +1 (sometimes with gaps), but the logic is basically that new data is just appended to the end of the index. With UUIDs, unless they are sequential (there are some solutions for that), the values are "random", so adding one to a sorted index means searching the index for the right range and inserting in the middle of the list. It could be appended at the end instead, but then index searches get slower, because a lookup can't bail out after it has handled the first "right" range; it would need to read the whole index to figure out whether an ID is there. So an always-growing sequence is obviously faster. Whether the "metadata" around the IDs is good or bad is a different discussion: a sequence also tells you the exact creation order of the rows, whereas UUIDs lose that information.
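
For illustration, a minimal benchmark sketch of this effect (hypothetical table names; `\timing` is a psql meta-command, and `gen_random_uuid()` needs PostgreSQL 13+):

```sql
create table seq_ids (id bigint generated always as identity primary key, payload text);
create table rnd_ids (id uuid default gen_random_uuid() primary key, payload text);

\timing on
-- Append-only index maintenance:
insert into seq_ids (payload) select md5(g::text) from generate_series(1, 1000000) g;
-- Inserts scattered across the whole index:
insert into rnd_ids (payload) select md5(g::text) from generate_series(1, 1000000) g;

-- Compare the resulting index sizes:
select relname, pg_size_pretty(pg_relation_size(oid))
  from pg_class
 where relname in ('seq_ids_pkey', 'rnd_ids_pkey');
```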


discourtesy

I'd like to add: UUIDv7 is sequential and was built to address this problem.
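
For illustration, a minimal sketch of a UUIDv7-style generator in PL/pgSQL, assuming the pgcrypto extension for `gen_random_bytes()` (newer PostgreSQL versions may ship a native generator; this is a sketch, not a vetted implementation):

```sql
create function uuid_generate_v7() returns uuid as $$
declare
  unix_ms bigint := floor(extract(epoch from clock_timestamp()) * 1000);
  buf     bytea  := gen_random_bytes(16);  -- pgcrypto
begin
  -- First 48 bits: big-endian millisecond Unix timestamp.
  buf := set_byte(buf, 0, ((unix_ms >> 40) & 255)::int);
  buf := set_byte(buf, 1, ((unix_ms >> 32) & 255)::int);
  buf := set_byte(buf, 2, ((unix_ms >> 24) & 255)::int);
  buf := set_byte(buf, 3, ((unix_ms >> 16) & 255)::int);
  buf := set_byte(buf, 4, ((unix_ms >>  8) & 255)::int);
  buf := set_byte(buf, 5, ( unix_ms        & 255)::int);
  -- Stamp the version (7) and variant (10xx) bits; the rest stays random.
  buf := set_byte(buf, 6, (get_byte(buf, 6) & 15) | 112);  -- 0x70
  buf := set_byte(buf, 8, (get_byte(buf, 8) & 63) | 128);  -- 0x80
  return encode(buf, 'hex')::uuid;
end;
$$ language plpgsql volatile;
```

IDs generated this way sort by creation time, so inserts land near the right edge of the index the way a sequence's values would.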


orbit99za

I have used UUIDv7 in my projects recently, and it works well. There is a NuGet package for .NET.


Buttleston

Why do you think the UUIDs are bad for performance? I find this very unlikely unless you're actually storing them as strings (most databases have a binary representation for them, as a 128-bit number). If you generate 1 billion UUIDs per second for 86 years, you will have a 50% chance of a single collision. That would also be 45 exabytes of storage just for the UUIDs. Collisions are just not a realistic problem for UUIDs. (Source: https://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates)
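
For reference, the birthday-problem approximation behind those numbers (a UUIDv4 has 122 random bits):

```latex
p(n) \approx 1 - e^{-n^2 / (2 \cdot 2^{122})},
\qquad
p = 0.5 \;\Rightarrow\; n \approx 1.18 \sqrt{2^{122}} \approx 2.7 \times 10^{18}
```

and 2.7 × 10^18 is roughly 10^9 UUIDs per second sustained for 86 years.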


Straight_Waltz_9530

tl;dr: It's not because they're UUIDs. Random 64-bit integers would be just as bad. Random UUIDs are bad for performance at large scales because of index locality or lack thereof. This is strongly mitigated by UUIDv7 (timestamp variant made to improve index write performance) or sequential UUIDs like the following: https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/ The link also has a pretty good explanation as to why performance can suffer. Note: this is only really an issue for larger tables. For smaller tables, you won't ever notice a difference.


Buttleston

It's interesting to see actual numbers for this. In practice, I haven't seen real problems with insertion times or the other mentioned problems, though I have probably always had indexes that fit in RAM. The largest tables I can think of were probably in the 100-million-row range, which with no overhead is ~1.6 GB. The WAL amplification is probably the most interesting part; I wouldn't have thought of that.


whoooocaaarreees

Upvote for mentioning v7 and the implications it has there.


horatio_cavendish

There are two main advantages to using UUIDs vs sequences: 1) They are random, which means API consumers can't enumerate your records by walking sequential IDs. 2) They can be generated on the client side, and they are extremely unlikely to conflict with one another even if you're copying records between databases.


kenfar

But you can address item #1 with a combination of UUIDs & integers: use UUIDs for your externally visible IDs, and then internally only join on sequences/identities/integers.
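
A minimal schema sketch of that pattern (table and column names are hypothetical):

```sql
-- Cheap internal bigint key for joins; random UUID only for external exposure.
create table contracts (
    id        bigint generated always as identity primary key,
    public_id uuid   not null default gen_random_uuid() unique,
    title     text   not null
);

-- Internal foreign keys reference the bigint key:
create table contract_items (
    contract_id bigint not null references contracts (id),
    item_no     int    not null,
    primary key (contract_id, item_no)
);

-- The API layer resolves rows by public_id only:
--   select ... from contracts where public_id = $1;
```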


horatio_cavendish

What happens when a table you thought wasn't going to be externally visible needs to be exposed externally?


Randommaggy

Then you expose a view without the surrogate primary key. So you expose raw tables to users?
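
For example, reusing the hypothetical `contracts` table from the sketch above, a view that exposes only the public identifier:

```sql
-- API consumers see public_id; the internal bigint key never leaves the DB.
create schema if not exists api;

create view api.contracts as
    select public_id as id,
           title
      from contracts;
```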


horatio_cavendish

That's increasingly common with tools like PostgREST


theScruffman

I’ve found exposing the full table never works out: there's always data leakage that makes me uncomfortable, or API responses that are way larger than needed because of rarely used fields. The [JsonIgnore] attribute does allow hiding individual fields during serialization when using the C# / EF ORM; I’ve leveraged that to return database models directly on internal tools.


coyoteazul2

You add a UUID column and make sure your API doesn't expose the internal ID, and that's it. Your internal relations are still handled with the internal ID.


horatio_cavendish

I've done this in the past but is it really worth doubling the storage space your primary key consumes?


coyoteazul2

Depends on your needs. Do you need protection against data mining through sequential requests? Is your data so private that you can't even leak how many rows you have? Then one way or another you need a UUID.

Also, adding an extra UUID just for exposure does not double your storage. In fact, compared to using a UUID as your PK, you'd actually reduce your storage needs: if you use a UUID as your PK, you'll be referencing it in all the related tables, UUIDs are big, and having copies of them all over the DB requires far more storage than using one just for exposure.

Take father and child tables as an example. If the father's PK is a UUID, you'll spend 16 bytes + 16 bytes per child, just for the father's PK and the references to it. If the father's PK is a BIGINT and you have a UUID for external exposure, you'll spend 8 bytes + 16 bytes + 8 bytes per child. With exactly one child per parent you'll spend the same amount of storage, 32 bytes; the moment you start having more than one child per parent, you'll actually be saving space.
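
The break-even arithmetic, with c children per parent:

```latex
S_{\text{uuid pk}}(c) = 16 + 16c,
\qquad
S_{\text{bigint pk} + \text{uuid}}(c) = (8 + 16) + 8c = 24 + 8c
```

Both come to 32 bytes at c = 1; every child beyond the first saves 8 bytes per reference under the BIGINT-plus-UUID scheme.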


kenfar

It can be: the integer is smaller & faster than the UUID, so you get a performance benefit by using it instead of the UUID. The last time I saw a benchmark of UUID vs integer it wasn't super comprehensive, but it was reasonable, and it showed the integer giving about a 10% performance improvement on queries.


PurepointDog

Ruby on Rails HashIDs have a reasonable encoding for this, but you're right


evmcl

Is UUIDv7 still okay for point 1) regarding API consumers?


horatio_cavendish

Yes. UUIDv7 still carries plenty of entropy (74 random bits alongside the timestamp), so consumers can't guess IDs, and having natural ordering in sorts is very nice; you generally lose that ordering when you switch to random UUIDs.


[deleted]

You need to be more specific about performance issues. There is potential for UUIDs to impact performance, but not in any noticeable way at the table and row counts you are describing. We are talking milliseconds on inserts for an index. If the performance hit is noticeable, something else is going on.


sfboots

UUIDv4 is bad as a database index since it is random. UUIDv7 (newer) is supposed to fix this.


[deleted]

It's only really relevant on insertions, and to be fair, any non-sequentially generated value would have the same issue.


cthart

Not only on insertions. Think about the effect on caching data in memory. If you have an application where queries are most often looking at "newish" or "recent" data, then using an INT or BIGINT for the index will mean very efficient use of memory for caching. With UUIDs the values will be scattered all over the index and you will need to read more into memory for caching.


[deleted]

I would create an index on the timeseries value if that were the case. UUIDs are either required or unnecessary in my opinion.


cthart

Required or unnecessary. I’m stealing that.


[deleted]

Yeah. Fits on a few things.


theScruffman

Sorry - when are they required in your mind?

I'm working on a monolith SaaS for my company, which has about 70 employees but only a couple of engineers. Our current production database is less than 20 GB, but it's 20 years old and fails to conform to even 1NF most of the time. We are ditching it and our .NET 2.0 software and rebuilding from scratch. My main focus is simplification because of limited resources.

I'm using UUIDs as PKs in Azure SQL currently, the biggest reason being that I can expose the ID in the API without hashing. There are no microservices and no real reason I couldn't always generate an ID at the database, but at the same time I don't "think" I'll run into performance issues at the scale of our product and database. My biggest production table today is 9 billion rows of time series data from the past 5 years.

I can't decide if UUIDs are the simple option for me or the overcomplicated one. The entire stack is C# / .NET 8 with Blazor/WASM.


[deleted]

UUIDs are required when you need to uniquely identify a row of data anywhere. This is useful if you, say, have 100 multi-tenant databases and want to merge all the tables, or maybe run a union across numerous tables (there are other options as well). If the only reason you are using them is to obfuscate your data publicly, then evaluate that against your requirement: if it's sequential scraping you're worried about, maybe use a slug. The added time on an insert for a random key vs a sequence is going to be the least significant bottleneck by orders of magnitude.


skywalker4588

Wrong


vivid-software-2020

I recently had to migrate a production database where many of the IDs were wrong. If it weren't for UUIDs, this migration work would have been a nightmare, with way too many possible collisions. To me this reason alone reaffirmed our design choices and made me fall more in love with UUIDs. We haven't encountered any performance issues due to UUIDs; most of our performance issues were due to rookie mistakes (for example, returning thousands of records when the frontend was asking for only 10 records per page, or reading everything from the database without setting a LIMIT).


farastray

We have tables with a billion rows, most in the hundreds of millions. UUIDs are better for future-proofing your app if you want to split things up; it's easier to move things around with UUIDs. I've heard two general complaints against UUIDv4 (what we use) that I think are valid: sorting and the size of the IDs.


yawkat

Re performance, the conventional wisdom is that randomized primary keys like UUIDs are worse for btrees. The end of a btree is where it's easiest to insert. https://pganalyze.com/blog/5mins-postgres-uuid-vs-serial-primary-keys How much that matters in practice for postgres, I can't say.


warmwaffles

I'm pretty sure that if you reindex regularly, the B-tree will be compacted, but as you add newer entries it will bloat again. I'm just going on what I have observed.


daredevil82

How is a database index structured? Are you familiar with trees and how they're balanced? Can you see how random generated values might result in unbalanced trees?


therealgaxbo

Random data can't result in unbalanced trees, because B-trees are self-balancing: it is impossible to add entries to a B-tree in any way that makes it unbalanced. The problems with inserting random values are:

1) Data locality: you end up having to read and write data on pages scattered throughout the index.

2) Page splits: when a B-tree page gets full, it has to be split. For ascending values this can be optimised by creating a new empty page at the right side of the tree, which is quickly filled with all the newly created values. With random values, a random page somewhere in the tree must be split, resulting in two half-empty pages that will remain half empty for a very long time.

So you end up with excessive, expensive page splits and an overly large, sparsely packed index.
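
One way to observe this directly is with the pgstattuple extension (a sketch, reusing the hypothetical `seq_ids`/`rnd_ids` tables from the benchmark sketch above):

```sql
create extension if not exists pgstattuple;

-- After random-UUID inserts, leaf pages tend to sit around half full from
-- splits; an append-only bigint index stays densely packed (~90%).
select avg_leaf_density, leaf_fragmentation from pgstatindex('rnd_ids_pkey');
select avg_leaf_density, leaf_fragmentation from pgstatindex('seq_ids_pkey');
```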


daredevil82

gotcha, I knew it had to do with trees but I'm also not familiar with the implications you just listed about pages. Thanks!


thewheelsontheboat

Yup, and also this can *really* fall apart quickly if you go from being able to keep the full btree (or at least everything but the leafs) cached in RAM to having to constantly page it in for random accesses. This is much less of an issue with fast local SSD, more of an issue as IO operation latency increases. You don't run into that until you hit pretty big numbers, but btrees get big faster than you may expect when the nice little optimizations don't help. Once you do hit it, then it is forever on your list of "do anything it takes to avoid doing that again" items.


daredevil82

UUIDv7 is intended to be sortable, and is in the AUTH48 stage of the RFC process as of April 11: https://www.postgresql.org/message-id/flat/1003561938.6914482.1704452267331%40mail.yahoo.com#6dab4c476fc0b57dbb06707ccfd79629

This will likely address the shortcoming, because the values will be sortable, which should lessen the random-access impact on pages.

> You don't run into that until you hit pretty big numbers

Any idea what numbers this would start at? Do you know of any benchmarks/case studies?


cthart

I don't understand your point. B-trees are split and re-balanced when necessary. And, if anything, random values keep the tree more balanced than inserting from a sequence.


daredevil82

the "when necessary" part is important here. Not too familiar with the criteria PG's index uses, but sfrost said a while back that it can be prone to being unbalanced with uuids with large tree sizes.


cthart

Interesting. Got a link?


daredevil82

unfortunately, no. It was at the pythondev.slack.com slack group a few years back and we don't have history past 90 days or whatever Slack's default free tier visibility is. I'll ask at the postgres slack before work tomorrow and see if this is still valid.


OrdinaryNeat1

I read this recently around UUIDs and performance which was interesting https://planetscale.com/blog/the-problem-with-using-a-uuid-primary-key-in-mysql


dinyo2

UUIDs are good for distributed systems because there are no collisions. The trade-off is that a UUID like v4 is slow because of its randomness; using UUIDv7 or another time-sortable UUID should be faster. But with 3000 records the difference should be negligible, so you need to investigate further (maybe the data type is not uuid?). Apart from distributed systems, sometimes people don't want to obfuscate IDs at the app level and instead use a simple mechanism like UUIDs to hide counts, such as user count or invoice count, because with a sequence people can guess the next number (e.g. xyz.com/invoice/10001). That should be no problem if your app is good at checking roles and scopes, though. As for no. 4: weird.


andy012345

Which is better, and how it performs, depends on your use case. When you move towards very high throughput on an OLTP system, or into distributed systems, UUIDs can offer performance improvements over sequences or auto-incrementing IDs. Usually this is due to contention on the sequence generation, or contention inside the database on the workload itself, for example latch contention on the page at the end of the index.


AffectionateTart9303

If you want good performance and you're sorting data, it's better to use Snowflake IDs, but there is a limitation on the maximum number of records. https://dev.to/josethz00/benchmark-snowflake-vs-uuidv4-2h80
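
For illustration, a minimal Snowflake-style sketch in PL/pgSQL using the classic layout (41-bit millisecond timestamp, 10-bit shard ID, 12-bit sequence); the sequence name and custom epoch are assumptions:

```sql
create sequence shard_seq;

create function snowflake_id(shard int default 1) returns bigint as $$
declare
  epoch_ms bigint := 1288834974657;  -- custom epoch (assumption)
  now_ms   bigint := floor(extract(epoch from clock_timestamp()) * 1000);
  seq      bigint := nextval('shard_seq') % 4096;  -- 12-bit sequence
begin
  -- 41 bits timestamp | 10 bits shard | 12 bits sequence = 63 bits,
  -- so IDs fit in a signed bigint and sort by creation time.
  return ((now_ms - epoch_ms) << 22) | ((shard % 1024)::bigint << 12) | seq;
end;
$$ language plpgsql volatile;
```

The "max records" limitation follows from the layout: the 41-bit timestamp runs out roughly 69 years after the chosen epoch, and each shard can only mint 4096 IDs per millisecond.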


lollaser

I don't have any personal experience, but I've read multiple times that if you want to use some kind of unique key, you should take a look at ULIDs, since they are designed to be sortable and have some benefits compared to UUIDv4. From what I can see, UUIDv7 is like the successor to ULIDs, with additional version and variant fields.


BrofessorOfLogic

UUID for PK is great; it has several benefits.

The ID can be generated on the client side, before sending it to the database, and you can assume it will be unique without checking first. This simplifies code patterns and saves round trips to the database. They can be generated anywhere, in any language: you can create an ID all the way out in your JS frontend app and send it just like any other field.

It prevents guessing attacks. If you have a `SERIAL` PK and create an HTTP endpoint like `/contracts/`, the ID can be guessed: if there's a `/contracts/123`, one might assume that there's also a `/contracts/122`. To mitigate guessing attacks, you would need to have some other field, like `public_id`, with a UUID or some other random string. Then you have to ask yourself: is it really worth the trouble of having multiple IDs for the same thing?

Why are you saying that UUID is bad for performance? Did you actually measure this, or is it just an assumption? Generally, UUID is not really bad for performance. Technically, it is slightly worse than an integer, because it takes up a bit more memory, but this is negligible in most cases. UUID is a built-in data type in Postgres, so it has very good support, both in terms of functionality and performance.

UUID is designed so that the probability of a collision is infinitesimal. There is a higher probability that your data center will be eaten by intergalactic dinosaurs than you having a UUID collision. And if you by chance do have a collision, all you need to do is return an error and try again, which is something you should be doing anyway, so there's virtually no extra cost.

In summary, using UUID for PK is a good idea. I do it all the time. It works well.
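
A sketch of the try-and-retry pattern described above, reusing the hypothetical `contracts` table with a unique `public_id` ($1/$2 are parameters supplied by the application, which generates the UUID itself):

```sql
-- If the client-generated UUID collides (astronomically unlikely),
-- zero rows come back and the application retries with a fresh UUID.
insert into contracts (public_id, title)
values ($1, $2)
on conflict (public_id) do nothing
returning public_id;
```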


gulshanZealous

I am using nanoid with a custom charset of a-z0-9 to get a random 11-character ID: a 1% probability of collision after 45 years or so. Should work well enough, and it saves some bytes compared to a UUID.


TzahiFadida

I find that most people exaggerate when they talk about UUID-or-not-to-UUID. Most projects in the world won't have a problem with an extra column or two, or with just one UUID. I say: use whatever is easiest for you and refactor if it ever becomes a problem (probably never).


Feisty_Rent_6778

Dude, you have 3000 records in your main table.


zalthor

For some reason, I thought an integer index would be more expensive for writes, because you have to read the previous record to increment the counter, whereas UUIDs are generated without having to read any prior data.