I am working on a database with a couple of tables. They are a
districts table
PK district_id
student_data table
PK study_id
FK district_id
ga_data table
PK study_id
district_id
The ga_data table is data that I am adding in. Both the student_data table and ga_data have 1.3 million records. The study_id's are 1 to 1 between the two tables, but the ga_data.district_id's are NULL and need to be updated. I am having trouble with the following PL/SQL:
update ga_data
set district_id = (select district_id from student_data
where student_data.study_id = ga_data.study_id)
where ga_data.district_id is null and rownum < 100;
I need to do it incremently so that's why I need rownum. But am I using it correctly? After running the query a bunch of times, it only updated about 8,000 records o开发者_如何学编程f the 1.3 million (should be about 1.1 million updates since some of the district_ids are null in student_data). Thanks!
ROWNUM just chops off query after the first n rows. You have some rows in STUDENT_DATA which have a NULL for DISTRICT_ID. So after a number of runs your query is liable to get stuck in a rut, returning the same 100 QA_DATA records, all of which match one of those pesky STUDENT_DATA rows.
So you need some mechanism for ensuring that you are working your way progressively through the QA_DATA table. A flag column would be one solution. Partitioning the query so it hits a different set of STUDENT_IDs is another.
It's not clear why you have to do this in batches of 100, but perhaps the easiest way of doing this would be to use BULK PROCESSING (at least in Oracle: this PL/SQL syntax won't work in MySQL).
Here is some test data:
SQL> select district_id, count(*)
2 from student_data
3 group by district_id
4 /
DISTRICT_ID COUNT(*)
----------- ----------
7369 192
7499 190
7521 192
7566 190
7654 192
7698 191
7782 191
7788 191
7839 191
7844 192
7876 191
7900 192
7902 191
7934 192
8060 190
8061 193
8083 190
8084 193
8085 190
8100 193
8101 190
183
22 rows selected.
SQL> select district_id, count(*)
2 from qa_data
3 group by district_id
4 /
DISTRICT_ID COUNT(*)
----------- ----------
4200
SQL>
This anonymous block uses the Bulk processing LIMIT clause to batch the result set into chunks of 100 rows.
SQL> declare
2 type qa_nt is table of qa_data%rowtype;
3 qa_recs qa_nt;
4
5 cursor c_qa is
6 select qa.student_id
7 , s.district_id
8 from qa_data qa
9 join student_data s
10 on (s.student_id = qa.student_id);
11 begin
12 open c_qa;
13
14 loop
15 fetch c_qa bulk collect into qa_recs limit 100;
16 exit when qa_recs.count() = 0;
17
18 for i in qa_recs.first()..qa_recs.last()
19 loop
20 update qa_data qt
21 set qt.district_id = qa_recs(i).district_id
22 where qt.student_id = qa_recs(i).student_id;
23 end loop;
24
25 end loop;
26 end;
27 /
PL/SQL procedure successfully completed.
SQL>
Note that this construct allows us to do additional processing on the selected rows before issuing the update. This is handy if we need to apply complicated fixes programmatically.
As you can see, the data in QA_DATA now matches that in STUDENT_DATA
SQL> select district_id, count(*)
2 from qa_data
3 group by district_id
4 /
DISTRICT_ID COUNT(*)
----------- ----------
7369 192
7499 190
7521 192
7566 190
7654 192
7698 191
7782 191
7788 191
7839 191
7844 192
7876 191
7900 192
7902 191
7934 192
8060 190
8061 193
8083 190
8084 193
8085 190
8100 193
8101 190
183
22 rows selected.
SQL>
It is kind of an odd requirement to only update 100 rows at a time. Why is that?
Anyway, since district_id in student_data can be null, you might be updating the same 100 rows over and over again.
If you extend your query to make sure a non-null district_id exists, you might end up where you want to be:
update ga_data
set district_id = (
select district_id
from student_data
where student_data.study_id = ga_data.study_id
)
where ga_data.district_id is null
and exists (
select 1
from student_data
where student_data.study_id = ga_data.study_id
and district_id is not null
)
and rownum < 100;
If this is a one-time conversion you should consider a completely different approach. Recreate the table as the join of your two tables. I promise you will laugh out loud when you realise how fast it is compared to all kinds of funny 100-rows-at-a-time updates.
create table new_table as
select study_id
,s.district_id
,g.the_remaining_columns_in_ga_data
from student_data s
join ga_data g using(study_id);
create indexes, constraints etc
drop table ga_data;
alter table new_table rename to ga_data;
Or if it isn't a one time conversion or you can't re-create/drop tables or you just feel like spending a few extra hours on data loading:
merge
into ga_data g
using student_data s
on (g.study_id = s.study_id)
when matched then
update
set g.district_id = s.district_id;
The last statement can also be rewritten as an updatable-view, but I personally never use them.
Drop/disable indexes/constraints on ga_data.district_id
before running the merge and recreate them afterward will improve on the performance.
精彩评论