Parallel processing and temporary files

Developer https://www.devze.com 2023-02-16 14:15 Source: web
I'm using the mclapply function in the multicore package to do parallel processing. It seems that all the child processes it starts get the same names for temporary files from the tempfile function. For example, with four processors,

library(multicore)
mclapply(1:4, function(x) tempfile())

will give four identical filenames. Obviously I need the temporary files to be different so that the child processes don't overwrite each other's files. When tempfile is used indirectly, i.e. when I call some function that itself calls tempfile, I have no control over the filename.

Is there a way around this? Do other parallel processing packages for R (e.g. foreach) have the same problem?

Update: This is no longer an issue since R 2.14.1.

CHANGES IN R VERSION 2.14.0 patched:

[...]

o tempfile() on a Unix-alike now takes the process ID into account.
  This is needed with multicore (and as part of parallel) because
  the parent and all the children share a session temporary
  directory, and they can share the C random number stream used to
  produce the unique part.  Further, two children can call
  tempfile() simultaneously.
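
On an R version that includes this change, the behaviour can be checked with a sketch like the one below. It assumes a Unix-alike system and uses the parallel package, which the NEWS entry mentions as the new home of the multicore code; mc.cores = 2 and the variable names are arbitrary choices for illustration.

```r
library(parallel)

# Ask forked children for one temporary file name per task.
# Assumes a Unix-alike: mclapply() cannot fork on Windows.
# mc.cores = 2 is an arbitrary choice for illustration.
names_from_children <- unlist(
  mclapply(1:4, function(i) tempfile(), mc.cores = 2)
)

# anyDuplicated() returns 0 when every element is distinct.
all_distinct <- anyDuplicated(names_from_children) == 0
print(all_distinct)
```

If all_distinct is FALSE, the R version in use probably predates the fix, and one of the workarounds in the answers applies.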


I believe multicore spins off a separate process for each subtask. If that assumption is correct, then you should be able to use Sys.getpid() to "seed" tempfile:

tempfile(pattern=paste("foo", Sys.getpid(), sep=""))
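
A minimal sketch of how that could look inside mclapply, assuming a Unix-alike and the parallel package in place of multicore. The helper name pid_tempfile is hypothetical, not part of any package.

```r
library(parallel)

# Hypothetical helper: wraps tempfile() and puts the PID of the
# calling process into the file name prefix.
pid_tempfile <- function(prefix = "foo") {
  tempfile(pattern = paste(prefix, Sys.getpid(), "-", sep = ""))
}

# Two tasks on two cores, so each task runs in its own forked child
# and therefore sees a different PID.
pid_names <- unlist(mclapply(1:2, function(i) pid_tempfile(), mc.cores = 2))
print(basename(pid_names))
```

This only helps when you control the call to tempfile yourself; code that calls tempfile internally is not affected by the wrapper.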


Use the x in your function:

mclapply(1:4, function(x) tempfile(pattern=paste("file",x,"-",sep="")))


Because the parallel jobs all run at the same time, and because the random seed comes from the system time, running four instances of tempfile in parallel will typically produce the same results (if you have four cores, that is; with only two cores, you'll get two pairs of identical temp file names).

Better to generate the tempfile names first and give them to your function as an argument:

filenames <- tempfile( rep("file",4) )
mclapply( filenames, function(x){})
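
Filled in with a body, the same idea might look like the sketch below. It assumes a Unix-alike and the parallel package; the text each task writes is a placeholder, and mc.cores = 2 is arbitrary.

```r
library(parallel)

# Generate all names once, in the parent process, before forking.
filenames <- tempfile(rep("file", 4))

# Each child writes only to the file name it was handed.
# The written text is a placeholder for real results.
written <- mclapply(seq_along(filenames), function(i) {
  writeLines(paste("result from task", i), filenames[i])
  filenames[i]
}, mc.cores = 2)

# Back in the parent: read each one-line file, then clean up.
contents <- vapply(filenames, readLines, character(1), USE.NAMES = FALSE)
unlink(filenames)
print(contents)
```

Because the names are fixed before any child starts, this does not depend on how tempfile behaves inside the children.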

If you're using someone else's function that has a tempfile call in it, then working the PID into the tempfile name by modifying the tempfile function, as previously suggested, is probably the simplest plan:

tempfile <- function( pattern = "file", tmpdir = tempdir(), fileext = ""){
   .Internal(tempfile(paste("pid", Sys.getpid(), pattern, sep=""), tmpdir, fileext))}
mclapply( 1:4, function(x) tempfile() )


At least for now, I chose to monkey-patch my way around this by putting the following code in my .Rprofile, following Daniel's advice to use PID values.

assignInNamespace("tempfile.orig", tempfile, ns="base")
.tempfile = function(pattern="file", tmpdir=tempdir())
    tempfile.orig(paste(pattern, Sys.getpid(), sep=""), tmpdir)
assignInNamespace("tempfile", .tempfile, ns="base")

Obviously it's not a good option for any package you'd distribute, but for a single user's need it's the best option thus far since it works in all cases.
