Update millions of rows in RDBMS
The goal of this post is to explain how to handle updates on extremely large database tables.
Prérequis
Modifications are not atomic, which means you don't need to make sure all modifications are completed or none are.
When should I use it ?
- 1.000 rows: You should have that in mind
- 1.000.000 rows: You start stressing the locking system which introduce latency and burn db freespace or tmp files. It starts costing money.
- >100.000.000: you'll face technical issues.
- >1.000.000.000: it's far too late
Why should I care ?
Let's say resources (time, disk space, memory, $ € £...) are never unlimited or you may need your change to end before your retirement, without preventing users to run the application, until the change is complete.
How
Perform a query which retrieve the Nth first line to modify and modify them, adn then run it as long as there are results.
| Base de données | Solution SQL | Requête d'update avec pagination | Requête de suppression avec pagination | 
|---|---|---|---|
| MySQL | sql SELECT * FROM table ORDER BY column LIMIT 10 OFFSET 20; | sql UPDATE table_to_update JOIN ( SELECT id FROM table_to_update ORDER BY some_column LIMIT 10 OFFSET 20 ) AS subquery ON table_to_update.id = subquery.id SET some_column = 'new_value'; | sql DELETE FROM table_to_update WHERE id IN (SELECT id FROM table_to_update ORDER BY some_column LIMIT 10 OFFSET 20); | 
| PostgreSQL | sql SELECT * FROM table ORDER BY column LIMIT 10 OFFSET 20; | sql UPDATE table_to_update SET some_column = 'new_value' FROM ( SELECT id FROM table_to_update ORDER BY some_column LIMIT 10 OFFSET 20 ) AS subquery WHERE table_to_update.id = subquery.id; | sql DELETE FROM table_to_update WHERE id IN (SELECT id FROM table_to_update ORDER BY some_column LIMIT 10 OFFSET 20); | 
| SQL Server | sql SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY column) AS row_num FROM table) AS numbered_rows WHERE row_num BETWEEN 21 AND 30; | sql UPDATE table_to_update SET some_column = 'new_value' FROM ( SELECT id, ROW_NUMBER() OVER (ORDER BY some_column) AS row_num FROM table_to_update ) AS numbered_rows WHERE numbered_rows.row_num BETWEEN 21 AND 30 AND table_to_update.id = numbered_rows.id; | sql DELETE FROM table_to_update WHERE id IN (SELECT id FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY some_column) AS row_num FROM table_to_update) AS numbered_rows WHERE row_num BETWEEN 21 AND 30); | 
| Oracle | sql SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY column) AS row_num FROM table) WHERE row_num BETWEEN 21 AND 30; | sql UPDATE ( SELECT id, some_column FROM table_to_update ORDER BY some_column OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY ) SET some_column = 'new_value'; | sql DELETE FROM table_to_update WHERE id IN (SELECT id FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY some_column) AS row_num FROM table_to_update) WHERE row_num BETWEEN 21 AND 30); | 
| DB2 | sql SELECT * FROM table ORDER BY column FETCH FIRST 10 ROWS ONLY OFFSET 20; | sql UPDATE ( SELECT id, some_column FROM table_to_update ORDER BY some_column FETCH FIRST 10 ROWS ONLY OFFSET 20 ) SET some_column = 'new_value'; | sql DELETE FROM table_to_update WHERE id IN (SELECT id FROM table_to_update ORDER BY some_column FETCH FIRST 10 ROWS ONLY OFFSET 20); | 
Loop
Running a query as long as there are result, means looping and so stored procedure or any other programming language with loops.
- make sure your modification statement exclude already modified rows
- don't loop forever, just loop a maximum number of time. Even more if running on a Cloud system where you pay CPU !
Optimize
There are no rules about the size of the page. I would say between a 1.000 and 10.000 would be a good first step. Keep this value outside of the SQL and adjust it accordingly to the duration of your statement.