Here comes the scenario, we have a bug in our app which introduced duplicate rows into the table where we don't have unique key on the table and How do we find and safely delete the duplicate rows and have a copy of rows what we deleted.
Preparing a test setup with scenario, create table, insert some duplicates as below and do a select of rows...
Now, we have a table called test with duplicate rows in the table.
Preparing a test setup with scenario, create table, insert some duplicates as below and do a select of rows...
CREATE TABLE `test` (
`bid` int(11) NOT NULL AUTO_INCREMENT,
`addtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`refid` varchar(255) NOT NULL,
PRIMARY KEY (`bid`),
KEY `addtime` (`addtime`),
KEY `refid` (`refid`)
) ENGINE=InnoDB AUTO_INCREMENT=128 DEFAULT CHARSET=utf8
MariaDB [test]> insert into test(refid) values ('suresh'),('kumar'),('kuna') ;
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
MariaDB [test]> insert into test(refid) values ('suresh'),('kumar'),('kuna') ;
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
MariaDB [test]> insert into test(refid) values ('suresh'),('kumar'),('kuna') ;
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
MariaDB [test]> insert into test(refid) values ('suresh'),('kumar'),('kuna') ;
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
MariaDB [test]> select * from test ;
+-----+---------------------+--------+
| bid | addtime | refid |
+-----+---------------------+--------+
| 84 | 2015-10-05 14:31:36 | test |
| 85 | 2015-10-05 14:31:36 | test1 |
| 113 | 2015-10-05 14:31:50 | suresh |
| 114 | 2015-10-05 14:31:50 | kumar |
| 115 | 2015-10-05 14:31:50 | kuna |
| 116 | 2015-10-05 15:35:39 | suresh |
| 117 | 2015-10-05 15:35:39 | kumar |
| 118 | 2015-10-05 15:35:39 | kuna |
| 119 | 2015-10-05 15:35:40 | suresh |
| 120 | 2015-10-05 15:35:40 | kumar |
| 121 | 2015-10-05 15:35:40 | kuna |
| 122 | 2015-10-05 15:35:40 | suresh |
| 123 | 2015-10-05 15:35:40 | kumar |
| 124 | 2015-10-05 15:35:40 | kuna |
| 125 | 2015-10-05 15:35:41 | suresh |
| 126 | 2015-10-05 15:35:41 | kumar |
| 127 | 2015-10-05 15:35:41 | kuna |
+-----+---------------------+--------+
17 rows in set (0.00 sec)
Now, we have a table called test with duplicate rows in the table.
We can delete the duplicates in several ways...and the most easier way if we have only one table with few rows to delete is to a simple select with where condition as below and delete using subquery select. Run the delete statement in for loop until we see the zero rows left in the where condition.
MariaDB [test]> select refid, count(1) from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1;
+--------+----------+
| refid | count(1) |
+--------+----------+
| kumar | 5 |
| kuna | 5 |
| suresh | 5 |
+--------+----------+
3 rows in set (0.00 sec)
MariaDB [test]> delete from test where refid in (select * from (select refid from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1 ) a) limit 1 ;
Query OK, 1 row affected, 2 warnings (0.00 sec)
MariaDB [test]> select refid, count(1) from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1;
+--------+----------+
| refid | count(1) |
+--------+----------+
| kumar | 5 |
| kuna | 5 |
| suresh | 4 |
+--------+----------+
3 rows in set (0.00 sec)
MariaDB [test]> delete from test where refid in (select * from (select refid from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1 ) a) limit 1 ;
Query OK, 1 row affected, 2 warnings (0.00 sec)
MariaDB [test]> select refid, count(1) from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1;
+--------+----------+
| refid | count(1) |
+--------+----------+
| kumar | 4 |
| kuna | 5 |
| suresh | 4 |
+--------+----------+
3 rows in set (0.00 sec)
MariaDB [test]> delete from test where refid in (select * from (select refid from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1 ) a) limit 1 ;
Query OK, 1 row affected, 2 warnings (0.01 sec)
MariaDB [test]> select refid, count(1) from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1;
+--------+----------+
| refid | count(1) |
+--------+----------+
| kumar | 4 |
| kuna | 4 |
| suresh | 4 |
+--------+----------+
3 rows in set (0.00 sec)
For that, we need to have the percona tool called "pt-archiver". Download the tool on the server where you have the databases & tables which have duplicate and make it runnable or executable file.
Create a new table as same as the table which have duplicates to store the deleted duplicate row ( just in case if we need the data later to compare or to see for some or other reason ). Altered the table and drop the the auto_increament key as we will have duplicates to insert into when we have multiple tables and dropped the primary key.
Finally, use the pt-archiver to store the duplicate rows in a separate table and start deleting from the bunch of tables. Verify the data in the test_duplicates table which has got deleted.
NOTE : Please use the --dry-run before running the pt-archiver to check on what is happening with the created statement.
wget percona.com/get/pt-archiver
chmod +x pt-archiver
create table test_duplicates like test ;
alter table test_duplicates modify bid int(11) NOT NULL ;
alter table test_duplicates drop primary key ;
* Inserted a bunch of rows again into the table and executing the pt-archiver.
for i in `mysql -u root -pxxxx test -s -N -e "show tables like 'test%';"` ; do /root/pt-archiver --source h=localhost,D=test,t=$i,u=root,p=xxxx --dest D=test,t=test_duplicates --where " refid in (select max(refid) from $i where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1 ) " --purge --no-check-columns --no-check-charset --limit=1 --commit-each --statistics --progress=1 ; done
TIME ELAPSED COUNT
2015-10-05T14:35:35 0 0
2015-10-05T14:35:35 0 1
2015-10-05T14:35:35 0 2
2015-10-05T14:35:35 0 3
2015-10-05T14:35:35 0 4
2015-10-05T14:35:35 0 5
2015-10-05T14:35:35 0 6
2015-10-05T14:35:35 0 7
2015-10-05T14:35:35 0 8
2015-10-05T14:35:35 0 9
2015-10-05T14:35:35 0 10
2015-10-05T14:35:35 0 11
2015-10-05T14:35:35 0 12
2015-10-05T14:35:35 0 13
2015-10-05T14:35:35 0 14
2015-10-05T14:35:35 0 15
2015-10-05T14:35:35 0 16
2015-10-05T14:35:35 0 17
2015-10-05T14:35:35 0 18
2015-10-05T14:35:35 0 19
2015-10-05T14:35:35 0 20
2015-10-05T14:35:35 0 21
2015-10-05T14:35:35 0 22
2015-10-05T14:35:35 0 23
2015-10-05T14:35:35 0 24
2015-10-05T14:35:35 0 25
2015-10-05T14:35:35 0 26
2015-10-05T14:35:35 0 27
2015-10-05T14:35:35 0 28
2015-10-05T14:35:35 0 29
2015-10-05T14:35:35 0 30
2015-10-05T14:35:35 0 31
2015-10-05T14:35:35 0 32
2015-10-05T14:35:35 0 32
Started at 2015-10-05T14:35:35, ended at 2015-10-05T14:35:35
Source: D=test,h=localhost,p=...,t=test,u=root
Dest: D=test,h=localhost,p=...,t=test5,u=root
SELECT 32
INSERT 32
DELETE 32
Action Count Time Pct
select 33 0.0150 37.79
commit 66 0.0075 18.95
deleting 32 0.0058 14.56
inserting 32 0.0057 14.32
other 0 0.0057 14.37
MariaDB [test]> select * from test_duplicates limit 32 ;
+-----+---------------------+--------+
| bid | addtime | refid |
+-----+---------------------+--------+
| 1 | 2015-10-05 12:13:25 | suresh |
| 2 | 2015-10-05 12:13:25 | kumar |
| 3 | 2015-10-05 12:13:25 | kuna |
| 4 | 2015-10-05 12:13:28 | suresh |
| 5 | 2015-10-05 12:13:28 | kumar |
| 6 | 2015-10-05 12:13:28 | kuna |
| 7 | 2015-10-05 12:13:29 | suresh |
| 8 | 2015-10-05 12:13:29 | kumar |
| 9 | 2015-10-05 12:13:29 | kuna |
| 10 | 2015-10-05 12:13:29 | suresh |
| 11 | 2015-10-05 12:13:29 | kumar |
| 12 | 2015-10-05 12:13:29 | kuna |
| 13 | 2015-10-05 12:13:30 | suresh |
| 14 | 2015-10-05 12:13:30 | kumar |
| 15 | 2015-10-05 12:13:30 | kuna |
| 16 | 2015-10-05 12:13:31 | suresh |
| 17 | 2015-10-05 12:13:31 | kumar |
| 18 | 2015-10-05 12:13:31 | kuna |
| 19 | 2015-10-05 12:13:31 | suresh |
| 20 | 2015-10-05 12:13:31 | kumar |
| 21 | 2015-10-05 12:13:31 | kuna |
| 22 | 2015-10-05 12:13:32 | suresh |
| 23 | 2015-10-05 12:13:32 | kumar |
| 24 | 2015-10-05 12:13:32 | kuna |
| 25 | 2015-10-05 12:13:32 | suresh |
| 26 | 2015-10-05 12:13:32 | kumar |
| 27 | 2015-10-05 12:13:32 | kuna |
| 28 | 2015-10-05 12:13:33 | suresh |
| 29 | 2015-10-05 12:13:33 | kumar |
| 30 | 2015-10-05 12:13:33 | kuna |
| 31 | 2015-10-05 12:13:33 | suresh |
| 32 | 2015-10-05 12:13:33 | kumar |
+-----+---------------------+--------+
32 rows in set (0.00 sec)
MariaDB [test]> select * from test;
+-----+---------------------+--------+
| bid | addtime | refid |
+-----+---------------------+--------+
| 33 | 2015-10-05 XX:XX:XX | suresh |
| 34 | 2015-10-05 XX:XX:XX | kumar |
| 35 | 2015-10-05 XX:XX:XX | kuna |
+-----+---------------------+--------+
5 rows in set (0.00 sec)
Now, we have all the duplicates deleted from all the tables. The process with take time as per your number of tables and duplicates. It does take really very long time....
Due to the time contraint and with some options available in hand, started looking for another option. Select the duplicates from all the tables and insert into a second table with the table name and then do a delete.
Create a table with additional column tablename, with out primary keys or unique keys.
create table test5 (
`bid` int(11) NOT NULL,
`addtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`refid` varchar(255) NOT NULL,
`tablename` varchar(255) NOT NULL,
KEY `addtime` (`addtime`),
KEY `refid` (`refid`)
) ENGINE=InnoDB
For a bunch of tables with duplicate rows, let's use s simple for loop to select and insert the duplicates into second table and then do a delete with join.
for i in `mysql -u root -proot test -s -N -e " show tables like 'test%';"` ; do `mysql -u root -pxxxx test -e "insert into test1.test5(bid,addtime,refid,tablename) select bid,addtime,refid,'$i' from $i where addtime between '2015-09-23 00:00:00' and '2015-10-11 00:00:00' group by refid HAVING count(*) > 1 ; select sleep(1) ; delete $i from $i inner join test1.test5 b on $i.refid=b.refid and $i.addtime=b.addtime and $i.bid=b.bid where 1=1 ; ” ` ; done
This one worked pretty well and we will have the backup of deleted rows. Go ahead and try...