[enhancement] never display sequential primary keys: postgresql solution

Question

netvigator opened this issue 6 years ago

Location within the Book 365-366

Chapter or Appendix: Chapter 26
Section: 26.27
Subsection: to be implemented

You advise never to display sequential primary keys. There is a solution that might be worth mentioning: in the tables where, by default, Django displays the key in the URL to access the record, use non-sequential integer primary keys.

Databases in general have an easier time with integer keys, so non-sequential integer keys might be a better option than slugs or UUIDs. Making the integer primary key itself non-sequential also avoids the overhead of adding an extra field and index.

The idea has come up before, and there are solutions out there. Here are the links I found:

Pseudo_encrypt
Pseudo_encrypt_constrained_to_an_arbitrary_range

Here is the code I used to make integer keys at least 7 digits in length:

CREATE OR REPLACE FUNCTION pseudo_encrypt(VALUE int) returns int AS $$
DECLARE
l1 int;
l2 int;
r1 int;
r2 int;
i int:=0;
BEGIN
 l1:= (VALUE >> 16) & 65535;
 r1:= VALUE & 65535;
 WHILE i < 3 LOOP
   l2 := r1;
   r2 := l1 # ((((1366 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
   l1 := l2;
   r1 := r2;
   i := i + 1;
 END LOOP;
 RETURN ((r1 << 16) + l1);
END;
$$ LANGUAGE plpgsql strict immutable;

CREATE OR REPLACE FUNCTION randomized(VALUE int) returns int AS $$
BEGIN
  LOOP
    VALUE := pseudo_encrypt(VALUE);
    EXIT WHEN VALUE >= 1000000;
  END LOOP;
  RETURN VALUE;
END
$$ LANGUAGE plpgsql strict immutable;
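
To see why the two functions above give unique, non-sequential keys, here is a rough Python port of the same idea. It is a sketch, not the PostgreSQL implementation: the names `round_fn`, `MULTIPLIER`, `INCREMENT`, `MODULUS`, and `max_steps` are made up for illustration, Python's `round()` may round differently from PostgreSQL's `::int` cast, and the result is treated as an unsigned 32-bit number, so individual outputs are not guaranteed to match the database. What carries over is the structure: a three-round Feistel network is a permutation (no two inputs collide), and with identical rounds plus the final swap it is also its own inverse.

```python
MULTIPLIER = 1366   # "secret sauce" constants from the wiki version
INCREMENT = 150889
MODULUS = 714025    # used as the integer modulus and as the float divisor


def round_fn(r: int) -> int:
    """Map a 16-bit half to a pseudo-random value in 0..32767."""
    return round(((MULTIPLIER * r + INCREMENT) % MODULUS) / float(MODULUS) * 32767)


def pseudo_encrypt(value: int) -> int:
    """Three-round Feistel permutation of a 32-bit value."""
    left = (value >> 16) & 0xFFFF
    right = value & 0xFFFF
    for _ in range(3):
        left, right = right, left ^ round_fn(right)
    # Swap the halves on output, as the SQL version does.
    return (right << 16) + left


def randomized(value: int, minimum: int = 1_000_000, max_steps: int = 10) -> int:
    """Re-apply the permutation until the result is at least `minimum`.

    `max_steps` is an assumption of this sketch; the SQL loop has no cap.
    """
    for _ in range(max_steps):
        value = pseudo_encrypt(value)
        if value >= minimum:
            return value
    raise ValueError("no value >= minimum reached")


keys = [pseudo_encrypt(n) for n in range(1, 5001)]
print(len(set(keys)) == len(keys))  # True when there are no collisions
print(all(pseudo_encrypt(k) == n for n, k in enumerate(keys, start=1)))  # round trip
```

One design note on the cap: because this permutation is its own inverse, an input below `minimum` whose image is also below `minimum` would bounce between those two values indefinitely. The sketch raises an error in that case; since the SQL loop has the same structure, it may be worth testing for that case with whatever constants you choose.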

Make your own "secret sauce"! Tweak the numbers:
(((1366 * r1 + 150889) % 714025) / 714025.0)
As explained here:
sql-keys-in-depth

When tweaking the numbers, note that the last two must stay the "same" value: the last one (714025.0) is just the float version of the integer before it (714025).

Yes, in my implementation, I tweaked the numbers. But for the following example, I used the non-tweaked numbers:

-- using secret sauce values from https://wiki.postgresql.org/wiki/Pseudo_encrypt
-- r2 := l1 # ((((1366 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
create table test ( id serial, text character ) ;
insert into test (text) values ('a');
insert into test (text) values ('b');
insert into test (text) values ('c');
ALTER TABLE test ALTER COLUMN id SET DEFAULT randomized(nextval('test_id_seq')::int);
insert into test (text) values ('d');
insert into test (text) values ('e');
insert into test (text) values ('f');
select * from test ;
     id     | text 
------------+------
          1 | a
          2 | b
          3 | c
  483424269 | d
 1905133426 | e
  971249312 | f

This solution can be implemented as a retrofit: just make the minimum non-sequential key larger than the largest sequential key already in your tables by changing this line:

EXIT WHEN VALUE >= 1000000;

Also note that the secret sauce values are stored in the database, so keeping them out of version control is feasible.

I would not recommend this for the user table, as there is utility in having a recognizable alpha user name. This option can be a good fit for any other table.

Real-world example: eBay item numbers

Nathan Cox · Answer 1 · Wed May 16 2018 06:23:39 GMT+0800 (China Standard Time)

Part of the reason not to display sequential primary keys comes down to user experience; numeric identifiers convey no meaning to a user and are virtually impossible to use as a reference point from memory to find something later. If you're going to be displaying something in the URL on a public facing page, you should really be using a slug.

The other half of not displaying sequential keys is to give you security-by-depth. Your users should not be able to just iterate through a range of numbers to figure out what the shape of your dataset is. Random number sets are a step in the right direction, but I don't feel like they go far enough. UUIDv4 is complex and random enough that it makes any kind of experimentation to discover the shape of your dataset practically infeasible. I.e., the cost and time of attempting aren't generally worth the payoff.

For the above reasons, I'd strongly advise simply using UUIDv4 as a PK and providing a user-facing slug whenever possible. That being said, there are a couple of Django-specific corner cases where switching the PK to something generated at runtime isn't really feasible; e.g., if you want to check to see if an object exists by referencing the pk (which won't have been set until the model instance has been saved). I personally feel like this is a bit of an anti-pattern, but it certainly exists in the wild. If you're dealing with a situation like this that isn't particularly feasible to fix, randomizing the PK pool as above is probably a decent compromise.
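
The difference in scale between random integers and UUIDv4 can be put in numbers. The sketch below is back-of-the-envelope arithmetic only, assuming an attacker who guesses IDs uniformly at random, a hypothetical table of one million rows, positive 32-bit integer keys at or above the 1000000 floor used earlier in the thread, and the 122 random bits of a version 4 UUID. The function name `hit_probability` is made up for the example.

```python
def hit_probability(rows: int, keyspace: int) -> float:
    """Chance that one uniformly random guess lands on an existing key."""
    return rows / keyspace


ROWS = 1_000_000                  # hypothetical table size
INT_KEYSPACE = 2**31 - 1_000_000  # positive int4 values >= 1000000
UUID4_KEYSPACE = 2**122           # random bits in a version 4 UUID

p_int = hit_probability(ROWS, INT_KEYSPACE)
p_uuid = hit_probability(ROWS, UUID4_KEYSPACE)

print(f"random int key: about 1 guess in {round(1 / p_int):,}")
print(f"UUIDv4 key:     about 1 guess in {1 / p_uuid:.2e}")
```

Under these assumptions a random guess hits an existing integer key roughly once in two thousand tries, against roughly once in 5e30 tries for UUIDv4. The arithmetic says nothing about index size or lookup speed, which is the other side of the trade-off.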

Rick Graves · Answer 2 · Wed May 16 2018 06:53:11 GMT+0800 (China Standard Time)

Nathan, I was proceeding based on the content of the book. The book covers slugs and UUIDs, and mentions they both have disadvantages. The book also advises against mere obfuscation. (Randomized integer primary keys are not mere obfuscation because you can pick your own secret sauce.) Your objections to randomized primary keys are largely or wholly outside of the book content.

"UUIDv4 is complex and random enough that it makes any kind of experimentation to discover the shape of your dataset practically infeasible. I.e., the cost and time of attempting aren't generally worth the payoff."

I would say that applies equally to randomized integer keys. Because you can pick your own secret sauce, maybe randomized integer primary keys are better than UUIDs. Note the cited real-world example: eBay item numbers.

Q: Why doesn't eBay just use an alphanumeric scheme for IDs?
A: While alphanumeric IDs would work from a functional perspective, there are major performance reasons that favor the use of numeric IDs. eBay's scalability challenges are tremendous. Numeric IDs are more space-efficient than alphanumeric IDs. In larger scale tables, indexes on alphanumeric columns are slower than on numeric columns.

https://ebaydts.com/eBayKBDetails?KBid=468

Databases handle integer keys better and faster. UUIDs and slugs introduce extra overhead. The book lists UUIDs and slugs as options, but randomized integer primary keys are arguably better.

Rick