I am fetching an RDBMS table over JDBC with some 10-20 partitions using ROW_NUM. From each of these partitions I want to process/format the data and write one or more files out to file storage, splitting on file size: each file must be less than 500MB. How do I write multiple files out from a single partition? The Spark config property 'spark.sql.files.maxRecordsPerFile' won't work for me because rows vary widely in size: each row contains blob data that may be anywhere from a few hundred bytes to 50MB, so I cannot limit the write by record count.
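For context, the read looks roughly like this (a sketch; the URL, table name, and bounds are placeholders for my actual values):

```scala
// Minimal sketch of the partitioned JDBC read; connection details are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-export").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SVC")  // placeholder URL
  .option("dbtable",
    "(SELECT t.*, ROW_NUMBER() OVER (ORDER BY id) AS ROW_NUM FROM my_table t) q")
  .option("partitionColumn", "ROW_NUM")  // numeric column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "10000000")      // placeholder total row count
  .option("numPartitions", "20")         // the 10-20 read partitions mentioned above
  .load()
```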
How do I further split each DB partition into smaller partitions and then write out the files? If I do a repartition, it shuffles data across all executors. I am trying to keep all data within the same executor to avoid the shuffle. Is it possible to repartition within the same executor core (i.e., repartition just the current partition) and then write a single file from each sub-partition? Something like the sketch below is what I have in mind.
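The best idea I have so far is to do the rolling manually inside each partition: track the bytes written and start a new file once the 500MB threshold would be exceeded. A sketch of that, where formatRow and the output path scheme are hypothetical placeholders for my own formatting and storage logic:

```scala
// Sketch: roll output files by accumulated byte size within one partition,
// so no shuffle is needed. formatRow and the /mnt/export path are placeholders.
import java.io.{BufferedOutputStream, FileOutputStream}
import org.apache.spark.TaskContext

val maxBytes = 500L * 1024 * 1024  // 500MB cap per output file

df.rdd.foreachPartition { rows =>
  // Hypothetical row formatter; real logic would handle the blob column.
  def formatRow(row: org.apache.spark.sql.Row): Array[Byte] =
    row.mkString("|").getBytes("UTF-8")

  val partitionId = TaskContext.getPartitionId()
  var fileIndex = 0
  var bytesWritten = 0L
  var out = new BufferedOutputStream(
    new FileOutputStream(s"/mnt/export/part-$partitionId-$fileIndex"))

  // Close the current file and open the next one in the sequence.
  def roll(): Unit = {
    out.close()
    fileIndex += 1
    bytesWritten = 0L
    out = new BufferedOutputStream(
      new FileOutputStream(s"/mnt/export/part-$partitionId-$fileIndex"))
  }

  for (row <- rows) {
    val bytes = formatRow(row)
    // Roll before writing if this row would push the file past the cap.
    if (bytesWritten > 0 && bytesWritten + bytes.length > maxBytes) roll()
    out.write(bytes)
    bytesWritten += bytes.length
  }
  out.close()
}
```

This keeps each partition's data on its executor, but I'm not sure it's the idiomatic way, so I'd welcome a cleaner approach.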
Read JDBC DB and write files of size 500MB