I'm trying to speed up some general data processing in C. I've written several subroutines of the form:
double *do_something(double *arr_in, ...) {
double *arr_out;
arr_out = malloc(...)
for (...) {
do the something on arr_in and put into arr_out
}
return arr_out;
}
I like this style because it's easy to read and use, but often I call it as:
array = do_something(array,...);
Would it make for faster code (and maybe prevent memory leaks) if I were to instead use void subfunctions as:
void do_something(double *arr_in, ...) {
for (...) {
arr_in = do that something;
}
return;
}
update 1: I ran valgrind --leak-check=full on the executable and it appears there were no memory leaks using the first method. However, the executable links to a library which contains all the subroutines I made with this form, so it might not catch leaks from the library.
I'm not sure how I would write the wrappers to free the memory, or what the ** really does (i.e. what a pointer to a pointer is), so I'm avoiding the ** route (that, and maybe I did it wrong, because it didn't compile on my Mac).
Here's one current subroutine:
double *cos_taper(double *arr_in, int npts)
{
int i;
double *arr_out;
double cos_taper[npts];
int M;
M = floor( ((npts - 2) / 10) + .5);
arr_out = malloc(npts*sizeof(arr_out));
for (i=0; i<npts; i++) {
if (i<M) {
cos_taper[i] = .5 * (1-cos(i * PI / (M + 1)));
}
else if (i<npts - M - 2) {
cos_taper[i] = 1;
}
else if (i<npts) {
cos_taper[i] = .5 * (1-cos((npts - i - 1) * PI / (M + 1)));
}
arr_out[i] = arr_in[i] * cos_taper[i];
}
return arr_out;
}
From the advice I've gotten here, it sounds like a better method would be:
void cos_taper(double *arr_in, double *arr_out, int npts)
{
int i;
double cos_taper[npts];
int M;
M = floor( ((npts - 2) / 10) + .5);
for (i=0; i<npts; i++) {
if (i<M) {
cos_taper[i] = .5 * (1-cos(i * PI / (M + 1)));
}
else if (i<npts - M - 2) {
cos_taper[i] = 1;
}
else if (i<npts) {
cos_taper[i] = .5 * (1-cos((npts - i - 1) * PI / (M + 1)));
}
arr_out[i] = arr_in[i] * cos_taper[i];
}
return;
}
call:
int main() {
    int npts = 1000;   /* for example */
    double *data, *cos_tapered;
    data = malloc(sizeof(double) * npts);
    cos_tapered = malloc(sizeof(double) * npts);
    //fill data
    cos_taper(data, cos_tapered, npts);
    free(data);
    ...
    free(cos_tapered);
    ...
    return 0;
}
The malloc can be expensive relative to the processing you are doing, depending on what it is. Rather than restrict yourself to in-place processing, just use two parameters, in and out, and leave allocation to the caller. This gives the caller the option to reuse memory without allocating a new array for each call.
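For example, a minimal sketch of that pattern; the scale() function here is just a made-up stand-in for whatever processing you are doing:
#include <stdlib.h>

/* stand-in operation: any out-of-place processing of npts doubles */
void scale(const double *in, double *out, int npts)
{
    int i;
    for (i = 0; i < npts; i++)
        out[i] = 2.0 * in[i];
}

int main(void)
{
    int i, pass, npts = 1024;
    double *in  = malloc(npts * sizeof(double));
    double *out = malloc(npts * sizeof(double));
    if (in == NULL || out == NULL) { free(in); free(out); return 1; }

    for (i = 0; i < npts; i++) in[i] = 1.0;

    for (pass = 0; pass < 1000; pass++) {
        scale(in, out, npts);   /* same two buffers reused; no malloc in the loop */
    }

    free(in);
    free(out);
    return 0;
}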
If you can do your operation in place, doing so will probably help prevent bugs (at least memory related ones) and will be faster by at least the time taken to do the malloc()
operation. The actual return type of your function probably doesn't affect the speed in any way.
Returning the double * itself doesn't cost you much in terms of execution time.
Much more significant is the allocation of memory each time you come into the function. If you can pre-allocate, or store the result in place as you suggested, that should greatly improve the speed.
Another thing to consider is whether you actually need all of the precision that a double provides (vs. a float type). In many cases, floats are much faster.
I'd opt for letting the caller allocate the memory if they want to, but also letting them choose to have the operation done in place, or to have you do the allocation.
For operations that can't be done in place, you can manually check if the caller has given you the same input and output locations, and make a copy of the input yourself. Then process using that copy as input. This makes it look in place to the function caller.
For example, suppose you want to create a function that shuffles an array of indexes such that output[i] == input[ input[i] ]
(a silly function, true, but one that's nontrivial to do in place).
#include <stdlib.h>
#include <string.h>
int shuffle(size_t const * input, size_t const size, size_t ** p_output)
{
int retval = 0;
size_t i;
char in_place = 0;
char cleanup_output = 0;
if (size == 0)
{
return 0; // nothing to do
}
// make sure we can read our input and write our output
else if (input == NULL || p_output == NULL)
{
return -2; // illegal input
}
// allocate memory for the output
else if (*p_output == NULL)
{
*p_output = malloc(size * sizeof(size_t));
if (*p_output == NULL) return -1; // memory allocation problem
cleanup_output = 1; // free this memory if we run into errors
}
// use a copy of our input, since the algorithm doesn't operate in place.
// and the input and output overlap at least partially
else if (*p_output - size < input && input < *p_output + size)
{
size_t * const input_copy = malloc(size * sizeof(size_t));
if (input_copy == NULL) return -1; // memory allocation problem
memcpy( input_copy, input, size * sizeof(size_t));
input = input_copy;
in_place = 1;
}
// shuffle
for (i = 0; i < size; i++)
{
if (input[i] >= size)
{
retval = -2; // illegal input
break;
}
(*p_output)[i] = input[ input[i] ];
}
// cleanup
if (in_place)
{
free((size_t *) input);
}
if (retval != 0 && cleanup_output)
{
free(*p_output);
*p_output = NULL;
}
return retval;
}
This makes your function more robust - the function caller can allocate memory for the output (if they want to keep the input around), or have the output appear in the same place as the input, or have you allocate the memory for the output. This is especially nice if they got the input and output locations from somewhere else themselves, and aren't sure whether they're distinct. They don't have to know anything about the workings of the function.
// caller allocated memory
my_allocated_mem = malloc( my_array_size * sizeof(size_t) );
if(my_allocated_mem == NULL) { /*... */ }
shuffle( my_array, my_array_size, &my_allocated_mem );
// function allocated memory
my_allocated_mem = NULL;
shuffle( my_array, my_array_size, &my_allocated_mem );
// in place calculation
shuffle( my_array, my_array_size, &my_array);
// (naughty user isn't checking the function for error values, but you get the idea...)
You can see a full example of use here.
Since C doesn't have exceptions, it's fairly standard to use the return value of a function to report errors, and to pass calculated values back through pointer parameters.
The first invocation can easily leak memory if there is no other pointer to the original memory allocation - as you are probably aware since you are asking.
Yes, if you can sensibly write the second version of the called function without memory allocation, it will likely be faster, because memory allocation takes time. If you just modify the called function so it has pre-allocated input and output arrays, it might just transfer the memory allocation cost to the calling function.
But disciplined use of the first version is fine; the function allocates space, and as long as you keep track of both the original space passed in and the new space passed back and are able to release both, there is no problem.
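In other words, a fragment like this is perfectly safe (do_something here stands for any function with the first signature from the question):
double *raw      = malloc(npts * sizeof(double));
/* ... fill raw ... */
double *filtered = do_something(raw, npts);   /* returns a new allocation */

/* both pointers are still reachable, so both blocks can be released */
free(raw);
free(filtered);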
You can run yourself into the 'same' problem with:
xyz = realloc(xyz, newsize);
If xyz is the only pointer to the allocated memory, that leaks memory on an allocation failure because you've just clobbered xyz with a null pointer. If there's another pointer that you will use to release the original space, this idiom does not matter - but be cautious with it.
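The usual way to guard against that is to realloc into a temporary pointer first:
double *tmp = realloc(xyz, newsize);
if (tmp == NULL) {
    /* allocation failed, but xyz still points to the old block; */
    /* handle the error without losing the original data */
} else {
    xyz = tmp;
}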
I've not fully digested the additional information in the question since writing the original version of this answer.
I just ran your code (after fixing a number of small errors). Then I took several stackshots. The people who said malloc
would be your culprit were right. Nearly all of your time is spent in there. Compared to that, the rest of your code is not very significant. Here's the code:
#include <math.h>
#include <stdlib.h>
const double PI = 3.1415926535897932384626433832795;
void cos_taper(double *arr_in, double *arr_out, int npts){
int i;
// double taper[npts];
double* taper = (double*)malloc(sizeof(double) * npts);
int M;
M = (int)floor( ((npts - 2) / 10) + .5);
for (i=0; i<npts; i++){
if (i<M) {
taper[i] = .5 * (1-cos(i * PI / (M + 1)));
}
else if (i<npts - M - 2) {
taper[i] = 1;
}
else if (i<npts) {
taper[i] = .5 * (1-cos((npts - i - 1) * PI / (M + 1)));
}
arr_out[i] = arr_in[i] * taper[i];
}
free(taper);
return;
}
void doit(){
int i;
int npts = 100;
double *data, *cos_tapered;
data = (double*)malloc(sizeof(double) * npts);
cos_tapered = (double*)malloc(sizeof(double) * npts);
//fill data
for (i = 0; i < npts; i++) data[i] = 1;
cos_taper(data, cos_tapered, npts);
free(data);
free(cos_tapered);
}
int main(int argc, char* argv[]){
int i;
for (i = 0; i < 1000000; i++){
doit();
}
return 0;
}
EDIT: I timed the above code, which took 22us on my machine (mostly in malloc
). Then I modified it to do the mallocs only once on the outside. That dropped the time to 5.0us, which was mostly in the cos
function. Then I switched from Debug to Release build, which dropped the time to 3.7us (now even more in the cos
function, obviously). So if you really want to make it fast, I recommend stackshots to find out what you're mostly doing, and then see if you can avoid doing it.
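For reference, the hoisted version can look roughly like this, reusing cos_taper() and the includes from the listing above (note that the taper buffer inside cos_taper() would need the same treatment to get every allocation out of the loop):
int main(void){
    int i, j;
    int npts = 100;
    double *data = (double*)malloc(sizeof(double) * npts);
    double *cos_tapered = (double*)malloc(sizeof(double) * npts);
    if (data == NULL || cos_tapered == NULL) return 1;

    //fill data
    for (j = 0; j < npts; j++) data[j] = 1;

    for (i = 0; i < 1000000; i++){
        cos_taper(data, cos_tapered, npts);  /* no malloc/free per iteration */
    }

    free(data);
    free(cos_tapered);
    return 0;
}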
In your function
void do_something(double *arr_in, ...) {
    for (...) {
        arr_in = do_that_something;
    }
}
That would be incorrect, as there is no parameter passed by reference through which to hand the array back out once the do_something function returns. It should look something like this:
void do_something(double **arr_in, ...) {
    for (...) {
        *arr_in = do_that_something;
    }
}

/*
** Would be called like this:
** do_something(&array, ...);
*/
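If the pointer-to-pointer part is still unclear, here is a small self-contained sketch of that calling convention (the names are made up purely for illustration):
#include <stdio.h>
#include <stdlib.h>

/* Allocates a new array and hands it back through the double pointer.
 * Returns 0 on success, -1 on allocation failure. */
int make_squares(double **arr_out, int npts)
{
    int i;
    double *tmp = malloc(npts * sizeof(double));
    if (tmp == NULL) return -1;
    for (i = 0; i < npts; i++)
        tmp[i] = (double)i * i;
    *arr_out = tmp;          /* write the new pointer into the caller's variable */
    return 0;
}

int main(void)
{
    double *arr = NULL;
    if (make_squares(&arr, 10) != 0) return 1;
    printf("%f\n", arr[3]);  /* prints 9.000000 */
    free(arr);               /* the caller owns the memory and frees it */
    return 0;
}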
Stick to the first example as it is easier to read. You do need to add error checking to the first example in case the call to malloc fails, so that you don't continue processing with a NULL pointer...
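For instance, the check right after the allocation in the first example might look like this:
arr_out = malloc(npts * sizeof(double));
if (arr_out == NULL) {
    /* allocation failed; bail out rather than dereference NULL */
    return NULL;
}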
Hope this helps, Best regards, Tom.
You also have the option to pass a second parameter as your out parameter. For instance
int do_something(double *in, double *out) {
    /*
     * Do stuff here
     */
    if (out is modified)
        return 1;
    return 0;
}
Or similar without the return.
I would suggest that if you allocate memory within a sub-function, you either create a corresponding wrapper that cleans up (frees) the allocated memory, or make it blindingly obvious that the function is allocating memory, to prevent forgetting to free it.
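For example, a (hypothetical) pair of wrappers makes the ownership obvious at the call site:
#include <stdlib.h>

/* Hypothetical naming convention: the _alloc suffix says "I malloc the
 * result", and every *_alloc call is paired with the matching *_free. */
double *cos_taper_alloc(const double *arr_in, int npts);   /* defined elsewhere; mallocs the result */

void cos_taper_free(double *tapered)
{
    free(tapered);
}

/* usage:
 *   double *tapered = cos_taper_alloc(data, npts);
 *   ...
 *   cos_taper_free(tapered);
 */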
In regards to memory footprint, the second approach would use less memory, but it only works if the functions don't modify the size of the initial array. Depending on the usage this is not always true.
In regards to speed, the second approach should be theoretically faster, because one less pointer is passed back at the end of the function call (do_something), but a single pointer makes a minimal difference unless there is heavy, repeated usage, in which case inlining should already be a consideration. So unless you have actually measured the function call's overhead as an issue (by profiling), I wouldn't bother with such micro-optimizations without a good reason (memory footprint or profiling).
You would be saving a small amount of time by not having the malloc, but this may add up quickly and make a noticeable difference if you call do_something in a tight loop. You would also save a small amount of time by not having to return the double *, but again, this can add up if do_something is called frequently.
As for the processing itself, there would be no difference, since both cases operate on a double *.
Since you are not using dynamic memory allocation in your proposed method there is no longer a possibility of memory leaks.
The type of the function determines the interface between the function and the places in the code that call it, which is to say that there are likely to be important code design issues involved in the choice. As a rule, these are more worth thinking about than speed (provided the speed issue isn't memory leaks so large that the application suffers a DoS through thrashing...).
The second type pretty much indicates the intent to mutate the array. The first is ambiguous: maybe you will always mutate the array, maybe you will always provide a freshly allocated result, maybe your code sometimes does one and sometimes does the other. The freedom comes with a minefield of difficulties in making sure the code is correct. If you go this route, the effort of putting a liberal sprinkling of assert()s through your code, to assert invariants about the freshness and sharedness of pointers, will likely pay for itself with ample interest when debugging.
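A sketch of what those asserts might look like, assuming a contract where the output is a separate, non-aliasing buffer:
#include <assert.h>
#include <stddef.h>

/* Hypothetical contract: arr_out must be a distinct, valid buffer of npts
 * doubles; the asserts document and check that assumption in debug builds. */
void do_something(const double *arr_in, double *arr_out, int npts)
{
    int i;
    assert(arr_in != NULL && arr_out != NULL);
    assert(arr_in != arr_out);               /* no aliasing allowed here */
    for (i = 0; i < npts; i++)
        arr_out[i] = 2.0 * arr_in[i];        /* placeholder operation */
}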
Well, you started your question talking about speed, and I don't believe that subject was really answered. The first thing to say is that working on parameter passing is probably not the best way to speed things up...
I agree with the other answers: the first proposal using malloc is a highway to memory leaks (and is probably slower anyway); the other proposal you came up with is much better. Following ergosys's suggestion in the comments, you can easily improve it and get good C code.
Now, with a little math, you can still do better.
First, there is no need to use double and a floor call to compute integers. You get the same M without floor or adding 0.5 by just writing M = (npts - 2) / 10; (hint: integer division truncates to an integer).
If you also notice that you always have M < npts - M - 2 < npts (well, you certainly already know it), you can avoid testing limits inside the loop by splitting it into three independent parts. And this can still be optimized in the case where the in array is the same as the out array.
Your function could become something like this:
void cos_taper(double *arr_in, double *arr_out, int npts)
{
int i;
int M;
M = (npts - 2) / 10;
if (arr_out == arr_in) {
for (i=0; i<M; i++) {
arr_out[i] *= .5 * (1-cos(i * PI / (M + 1)));
}
for (i = npts - M - 2; i<npts; i++) {
arr_out[i] *= .5 * (1-cos((npts - i - 1) * PI / (M + 1)));
}
}
else {
for (i=0; i<M; i++) {
arr_out[i] = arr_in[i] * (.5 * (1-cos(i * PI / (M + 1))));
}
for (; i<npts - M - 2; i++) {
arr_out[i] = arr_in[i];
}
for (; i<npts; i++) {
arr_out[i] = arr_in[i] * (.5 * (1-cos((npts - i - 1) * PI / (M + 1))));
}
}
}
That's definitely not the end of it, and with some more thought further optimizations are possible. For instance, expressions like .5 * (1 - cos(i * PI / (M + 1))) look like they can only take a relatively small number of distinct values (they depend only on i and M, where M is itself a function of npts, and cos is periodic), so they could be computed once and reused. But it all depends on what level of performance you need.
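One way to exploit that (just a sketch, not measured) is to compute the taper weights once for a given npts and reuse them across as many arrays as you like:
#include <math.h>
#include <stdlib.h>

static const double PI = 3.1415926535897932384626433832795;

/* Compute the npts taper weights once; the caller frees the result. */
double *make_taper(int npts)
{
    int i, M = (npts - 2) / 10;
    double *w = malloc(npts * sizeof(double));
    if (w == NULL) return NULL;
    for (i = 0; i < M; i++)
        w[i] = .5 * (1 - cos(i * PI / (M + 1)));
    for (; i < npts - M - 2; i++)
        w[i] = 1;
    for (; i < npts; i++)
        w[i] = .5 * (1 - cos((npts - i - 1) * PI / (M + 1)));
    return w;
}

/* Apply precomputed weights: no trig calls and no allocation per call. */
void apply_taper(const double *w, const double *arr_in, double *arr_out, int npts)
{
    int i;
    for (i = 0; i < npts; i++)
        arr_out[i] = arr_in[i] * w[i];
}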