PL/SQL rownum updates

I am working on a database with a couple of tables. They are a

      districts table
      PK district_id

      student_data table
      PK study_id
      FK district_id

      ga_data table
      PK study_id
      district_id

The ga_data table is data that I am adding in. Both the student_data table and ga_data have 1.3 million records. The study_id's are 1 to 1 between the two tables, but the ga_data.district_id's are NULL and need to be updated. I am having trouble with the following PL/SQL:

update ga_data
set district_id = (select district_id from student_data
where student_data.study_id = ga_data.study_id)
where ga_data.district_id is null and rownum < 100;

I need to do it incrementally, so that's why I need rownum. But am I using it correctly? After running the query a bunch of times, it has only updated about 8,000 of the 1.3 million records (it should be about 1.1 million updates, since some of the district_ids are null in student_data). Thanks!


ROWNUM just chops off the query after the first n rows. You have some rows in STUDENT_DATA which have a NULL DISTRICT_ID. So after a number of runs your query is liable to get stuck in a rut, returning the same 100 QA_DATA records, all of which match one of those pesky STUDENT_DATA rows.

So you need some mechanism for ensuring that you are working your way progressively through the QA_DATA table. A flag column would be one solution. Partitioning the query so it hits a different set of STUDENT_IDs is another.
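
For example, here is a minimal sketch of the flag-column idea (the PROCESSED column is an assumption, not something in your current schema):

alter table ga_data add (processed char(1));

update ga_data
   set district_id = (select district_id
                      from   student_data
                      where  student_data.study_id = ga_data.study_id)
     , processed   = 'Y'
 where ga_data.processed is null
   and rownum <= 100;

commit;

Each run marks the rows it has just touched, so the next run picks up a fresh batch instead of revisiting the same ones.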

It's not clear why you have to do this in batches of 100, but perhaps the easiest way of doing this would be to use BULK PROCESSING (at least in Oracle: this PL/SQL syntax won't work in MySQL).

Here is some test data:

SQL> select district_id, count(*)
  2  from student_data
  3  group by district_id
  4  /

DISTRICT_ID   COUNT(*)
----------- ----------
   7369        192
   7499        190
   7521        192
   7566        190
   7654        192
   7698        191
   7782        191
   7788        191
   7839        191
   7844        192
   7876        191
   7900        192
   7902        191
   7934        192
   8060        190
   8061        193
   8083        190
   8084        193
   8085        190
   8100        193
   8101        190
               183

22 rows selected.

SQL> select district_id, count(*)
  2  from qa_data
  3  group by district_id
  4  /

DISTRICT_ID   COUNT(*)
----------- ----------
                  4200

SQL>

This anonymous block uses the BULK COLLECT ... LIMIT clause to batch the result set into chunks of 100 rows.

SQL> declare
  2      cursor c_qa is
  3          select qa.student_id
  4                 , s.district_id
  5          from qa_data qa
  6                  join student_data s
  7                      on (s.student_id = qa.student_id);
  8
  9      type qa_nt is table of c_qa%rowtype;
 10      qa_recs qa_nt;
 11  begin
 12      open c_qa;
 13
 14      loop
 15          -- fetch the next 100 rows (fewer on the final iteration)
 16          fetch c_qa bulk collect into qa_recs limit 100;
 17          exit when qa_recs.count() = 0;
 18
 19          for i in qa_recs.first()..qa_recs.last()
 20          loop
 21              update qa_data qt
 22                  set qt.district_id = qa_recs(i).district_id
 23                  where qt.student_id = qa_recs(i).student_id;
 24          end loop;
 25
 26      end loop;
 27
 28      close c_qa;
 29  end;
 30  /

PL/SQL procedure successfully completed.

SQL>

Note that this construct allows us to do additional processing on the selected rows before issuing the update. This is handy if we need to apply complicated fixes programmatically.
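
For instance, a hypothetical clean-up rule could be slotted in just before the update (the 9999 catch-all value below is made up purely for illustration):

        for i in qa_recs.first()..qa_recs.last()
        loop
            -- made-up rule: rows with no known district get a catch-all code
            if qa_recs(i).district_id is null then
                qa_recs(i).district_id := 9999;
            end if;

            update qa_data qt
                set qt.district_id = qa_recs(i).district_id
                where qt.student_id = qa_recs(i).student_id;
        end loop;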

As you can see, the data in QA_DATA now matches that in STUDENT_DATA:

SQL> select district_id, count(*)
  2  from qa_data
  3  group by district_id
  4  /

DISTRICT_ID   COUNT(*)
----------- ----------
   7369        192
   7499        190
   7521        192
   7566        190
   7654        192
   7698        191
   7782        191
   7788        191
   7839        191
   7844        192
   7876        191
   7900        192
   7902        191
   7934        192
   8060        190
   8061        193
   8083        190
   8084        193
   8085        190
   8100        193
   8101        190
               183

22 rows selected.

SQL>


It is kind of an odd requirement to only update 100 rows at a time. Why is that?

Anyway, since district_id in student_data can be null, you might be updating the same 100 rows over and over again.

If you extend your query to make sure a non-null district_id exists, you might end up where you want to be:

update ga_data
set district_id = (
  select district_id 
  from student_data
  where student_data.study_id = ga_data.study_id
)
where ga_data.district_id is null 
and exists (
  select 1
  from student_data
  where student_data.study_id = ga_data.study_id
  and district_id is not null
)
and rownum < 100;
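
To know when the batched update has finished, a count along these lines shows how many rows are still waiting for a non-null district_id:

select count(*)
from   ga_data g
where  g.district_id is null
and    exists (
  select 1
  from   student_data s
  where  s.study_id = g.study_id
  and    s.district_id is not null
);

When it reaches zero there is nothing left for the batched update to do.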


If this is a one-time conversion you should consider a completely different approach: recreate the table as the join of your two tables. I promise you will laugh out loud when you realise how fast it is compared to all kinds of funny 100-rows-at-a-time updates.

create table new_table as
   select study_id
         ,s.district_id
         ,g.the_remaining_columns_in_ga_data
    from student_data s
    join ga_data      g using(study_id);

-- create indexes, constraints etc.
drop table ga_data;
alter table new_table rename to ga_data;

Or, if it isn't a one-time conversion, or you can't re-create/drop tables, or you just feel like spending a few extra hours on data loading:

merge
 into ga_data      g
using student_data s
   on (g.study_id  = s.study_id)
when matched then
   update
      set g.district_id = s.district_id;
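
If ga_data might already hold some populated district_ids, the update branch can be restricted so only the NULL rows are touched (the WHERE clause on a MERGE update needs Oracle 10g or later):

merge
 into ga_data      g
using student_data s
   on (g.study_id  = s.study_id)
when matched then
   update
      set g.district_id = s.district_id
      where g.district_id is null;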

The MERGE can also be rewritten as an update against a join view, but I personally never use them.
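
For reference, that version would look roughly like this; it works because student_data.study_id is a primary key, so ga_data is key-preserved in the join:

update (select g.district_id as ga_district_id
             , s.district_id as sd_district_id
        from   ga_data g
               join student_data s
                 on (s.study_id = g.study_id))
set    ga_district_id = sd_district_id;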

Dropping or disabling the indexes and constraints on ga_data.district_id before running the merge, and recreating them afterwards, will also improve performance.
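
The statements for that would look something like this (ga_district_idx and ga_district_fk are made-up names; substitute whatever actually exists on your table):

alter table ga_data disable constraint ga_district_fk;
drop index ga_district_idx;

-- run the merge here

create index ga_district_idx on ga_data (district_id);
alter table ga_data enable constraint ga_district_fk;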
