How to find the relation between a number of input variables and a resulting output

I have a set of three (or more) known variables that are related to the input of a process. I also have the (measured) results of the process, in this case the time it took for the process to complete.

In order to give an estimated duration and create a progress indicator based on the input, I would need to find the relation (if any) between the variables and the results.

What is the best way to determine whether there is a relation and, if one exists, to derive a formula from it?

I have a number of data sets to work with (input variable values and resulting time).

Any suggestions or links related to this? A hint on how to solve this using code or a pointer to some theory would be helpful.

Some added background:

The process consists of a number of files to be processed (the main input) with an additional secondary input consisting of another set of (reference) files directly related to the main input's contents. Currently the progress is indicated by showing the overall file progress (related to total number of main inputs) combined with the in-file progress based on the position in the contents of the current input file. Since the overall time required per file (set) can be rather long (depending on the contents) I would like to add some kind of "time left" or "expected finish time" indicator.

The actual code merges a subset of data from a list (Excel format) with XML files into a legacy format file. The "time consuming" part is the parsing of the Excel files, but this is greatly affected by the actual size of the file, the number of items that need to be processed and the number of files that need to be created as output. In some cases a large file results in one output whereas in other cases a small file can result in a large number of outputs. Since a lot of file access is performed, a secondary factor (that is hard to put into numbers) is the number of identical processes running at the same time.

The idea is to be able to give an estimated throughput based on the input.

@Heatsink's answer is right, but selecting the function family and the number of free parameters requires some experience. It is called "modelling", and physicists are the masters of the trade. Also, the general (non-linear) regression problem is not always trivial to solve.

Perhaps you may try this software package that sometimes is smart enough to select the right function and parameters. I've had a few nice experiences with it.


HTH!

BTW ... If you can post your 4D data somewhere we could research a bit more

The first step is to restrict the relation you're looking for to some family of functions. To do that, you need to come up with a model of how the inputs affect the measured results. Once you've picked the family of functions, the next step is to figure out which member of the family best represents your data.

For example, you might decide the system you're measuring can be modeled with a linear relationship, time = a*x + b*y + c*z; then you can use a linear regression to find parameters a, b, c that best fit your data.
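To make the linear case concrete, here is a minimal sketch in plain Python that fits a, b and c by least squares using the normal equations. The function names and the sample numbers are made up for illustration (the data was generated from a=2, b=0.5, c=3 so the fit is easy to check); they are not measurements from the process described in the question.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix so that row operations apply to both sides at once.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Use the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("inputs are linearly dependent; cannot fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_linear(samples, times):
    """Least-squares coefficients for time = a*x + b*y + c*z (no constant term)."""
    n_vars = len(samples[0])
    # Normal equations: (X^T X) * coeffs = X^T * times
    xtx = [[sum(s[i] * s[j] for s in samples) for j in range(n_vars)]
           for i in range(n_vars)]
    xtt = [sum(s[i] * t for s, t in zip(samples, times)) for i in range(n_vars)]
    return solve_linear_system(xtx, xtt)


def predict(coeffs, sample):
    """Predicted time for one set of input values."""
    return sum(c * v for c, v in zip(coeffs, sample))


def r_squared(coeffs, samples, times):
    """Share of the variance in the measured times explained by the fit."""
    mean_t = sum(times) / len(times)
    ss_res = sum((t - predict(coeffs, s)) ** 2 for s, t in zip(samples, times))
    ss_tot = sum((t - mean_t) ** 2 for t in times)
    return 1.0 - ss_res / ss_tot


# Hypothetical runs: (x, y, z) input values and the measured time for each.
samples = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (2, 1, 0)]
times = [2.0, 0.5, 3.0, 5.5, 4.5]

coeffs = fit_linear(samples, times)
print("a, b, c =", coeffs)
print("R^2 =", r_squared(coeffs, samples, times))
```

An R^2 close to 1 suggests the linear family describes the data well; a low value is a hint that either there is no usable relation or the family of functions is the wrong one. For real data a library routine (for instance a least-squares solver from a numerical package) would be the more robust choice.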

How will you be updating the progress meter as the computation progresses? If it splits up into a large number of steps that run in roughly equal time, then you can just report how many steps have been completed as a percentage of the total, without needing any prior knowledge.
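For the equal-steps case, a rough sketch of the bookkeeping could look like this. It assumes each step takes about the same time, so the average time per completed step is extrapolated over the remaining ones; the function names are hypothetical.

```python
def percent_complete(steps_done, steps_total):
    """Progress as a percentage of the total number of steps."""
    return 100.0 * steps_done / steps_total


def estimate_remaining(steps_done, steps_total, elapsed_seconds):
    """Seconds left, assuming the remaining steps run at the average pace so far.

    Returns None until at least one step has finished, because there is
    nothing to extrapolate from yet.
    """
    if steps_done <= 0:
        return None
    seconds_per_step = elapsed_seconds / steps_done
    return seconds_per_step * (steps_total - steps_done)


# Example: 25 of 100 steps finished after 50 seconds.
print(percent_complete(25, 100))          # 25.0
print(estimate_remaining(25, 100, 50.0))  # 150.0
```

In practice you would call this with the time elapsed since the start (for example from a monotonic clock) each time a step finishes.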

If the computation has a few distinct phases, then you would have to estimate the total contribution of each by some formula, as you say. Still for each phase, you will need a model for how far the computation has progressed through that phase, and that requires some knowledge of the code itself.
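With distinct phases, one possible approach is to weight each phase by its estimated duration and combine that with the progress inside the current phase. The sketch below assumes you already have a duration estimate per phase (for example from a fitted formula) and some measure of how far along the current phase is; the names and numbers are hypothetical.

```python
def overall_progress(phase_estimates, current_phase, fraction_in_phase):
    """Overall progress between 0 and 1 for a computation split into phases.

    phase_estimates   -- estimated duration of each phase, in any consistent unit
    current_phase     -- index of the phase that is running now
    fraction_in_phase -- how far the current phase has progressed (0 to 1)
    """
    total = sum(phase_estimates)
    # Phases before the current one count in full.
    finished = sum(phase_estimates[:current_phase])
    running = phase_estimates[current_phase] * fraction_in_phase
    return (finished + running) / total


# Example: three phases estimated at 10, 60 and 30 seconds,
# currently halfway through the second phase.
print(overall_progress([10, 60, 30], 1, 0.5))
```

The quality of the result depends entirely on the per-phase estimates, so a badly estimated phase will make the indicator speed up or stall while that phase runs.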

Knowing more about the nature of the input variables here would help. Do you have computational bounds on the code itself, i.e., can you prove that it runs linearly or quadratically in each dimension of the input? Is it a brute-force type of method that is factorial or exponential in one of the inputs? Trying to derive a formula for the running time of the code based on the choice of algorithm would likely be more accurate than an empirical regression alone, and may lead you to find a faster algorithm.
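If you suspect polynomial growth in one input but cannot prove it, one way to sanity-check the suspicion against measurements is to fit a straight line to log(time) versus log(size): the slope approximates the exponent. This is an empirical check, not a substitute for analysing the algorithm, and it only makes sense when the other inputs are held fixed. The timings below are synthetic, generated from a quadratic, purely to illustrate the idea.

```python
import math


def growth_exponent(sizes, times):
    """Slope of log(time) against log(size), i.e. k in time ~ size**k."""
    log_n = [math.log(n) for n in sizes]
    log_t = [math.log(t) for t in times]
    mean_n = sum(log_n) / len(log_n)
    mean_t = sum(log_t) / len(log_t)
    # Ordinary least-squares slope for a single explanatory variable.
    covariance = sum((a - mean_n) * (b - mean_t) for a, b in zip(log_n, log_t))
    variance = sum((a - mean_n) ** 2 for a in log_n)
    return covariance / variance


# Synthetic timings that grow with the square of the input size.
sizes = [100, 200, 400, 800]
times = [0.0001 * n * n for n in sizes]

print(growth_exponent(sizes, times))  # close to 2 for this quadratic data
```

A slope near 1 points to linear behaviour and near 2 to quadratic; a slope that keeps increasing as the sizes grow suggests something faster than polynomial, such as exponential growth.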