How we fix database “Frozen” while alter tables in production

Nitin goyal
2 min readFeb 25, 2020

This evening for a while when altering a table, adding a column actually our production environment just froze.

Table which we were altering was having only 1 Lac rows. But still below query taking very long time (Approx 1hour) still not completed.

ALTER TABLE users ADD column_1 TINYINT;

It was very surprize to us. that a very small table taking this much time. so we got lots of thing in our mind i.e.

  1. Disable CONSTRAINTS
  2. Disable TRIGGERS (later we find we don’t have any)
  3. Other Database Tuning

The command which we wish to run does take an ‘ACCESS EXCLUSIVE’ lock on the table, which will prevent all other access to that table. But duration of this lock should be just a few milliseconds, as adding a column. After a while finally we start digging the database query stats and we find something which saves our night.

The problem was is in lock priorities. Some user has a weak lock like ACCESS SHARE lock, on the table, and they are camping on it indefinitely (in our case was an idle-in-tranasaction) connection which has been leaked.

The ADD COLUMN tries to take the ACCESS EXCLUSIVE it needs, and it queues up behind the first lock which was ACCESS SHARE .

Now all of our future lock reques queue up behind the waiting ACCESS EXCLUSIVE request.

Conceptually, incoming lock request which are compatible with the already-granted lock could jump over the waiting ACCESS EXCLUSIVE and be granted out of turn, but that is no how PostgreSQL does it.

So we find the prcoess which is holding the long-lived weak lock by quering the pg_locks table.

SELECT * FROM pg_locks WHERE granted AND relation = 'users'::regclass;

We did this while everything was locked up, we get only one answer (you may get multiple long-lived culprits). After running above query a couple of times we found one query (in your case in may many) which are staying around each time.

we then pick the PID that we got from pg_lock , and query with that into pg_stat_activity to see what the offender is doing:

SELECT * FROM pg_stat_activity WHERE pid=123;

Query Results

backend_start    | 2020-02-25 19:58:30.849405-07
xact_start | 2020-02-25 19:58:36.797703-07
query_start | 2020-02-25 19:58:36.799021-07
state_change | 2020-02-25 19:58:36.824369-07
waiting | f
state | idle in transaction
backend_xid |
backend_xmin |
query | select * from users limit 1;

So, it ran a query, inside a transaction, and then went idle without ever closing the transaction. It is now 20:03, so they have been idle for 5 minutes. so we killed the idle transaction using the PID what we get above. After killing cluprit(s) ALTER query finishes within miliseconds.

If you face such issue you can use above way to figure out the cluprits and by killing them you can save your day as we did our night.

--

--

Nitin goyal

working as DevSecOps Engineer at Pixlevide Solutions. Loves to solve everyday problems with Automation.