Why does Java seem to be executing faster than C++ - Part 2 [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 11 years ago.

Introduction

This is a follow-up question to the one I asked previously: Java seems to be executing bare-bones algorithms faster than C++. Why?. Through that post, I learned a few important things:

  1. I was not using Ctrl + F5 to compile and run C++ code in Visual Studio C++ Express, and this resulted in debugging overhead that slowed down the code execution.
  2. Vectors are as good (if not better) than pointers at handling arrays of data.
  3. My C++ is terrible. ^_^
  4. A better test of execution time would be iteration, rather than recursion.

I tried to write a simpler program, which does not use pointers (or arrays in the Java equivalent), and which is pretty straightforward in its execution. Even then, the Java execution is faster than the C++ execution. What am I doing wrong?

Code:

Java:

 public class PerformanceTest2
 {
      public static void main(String args[])
      {
           //Number of iterations
           double iterations = 1E8;
           double temp;

           //Create the variables for timing
           double start;
           double end;
           double duration; //end - start

           //Run performance test
           System.out.println("Start");
           start = System.nanoTime();
           for(double i = 0;i < iterations;i += 1)
           {
                //Overhead and display
                temp = Math.log10(i);
                if(Math.round(temp) == temp)
                {
                     System.out.println(temp);
                }
           }
           end = System.nanoTime();
           System.out.println("End");

           //Output performance test results
           duration = (end - start) / 1E9;
           System.out.println("Duration: " + duration);
      }
 }

C++:

#include <iostream>
#include <cmath>
#include <windows.h>
using namespace std;

double round(double value)
{
    return floor(0.5 + value);
}

int main()
{
    //Number of iterations
    double iterations = 1E8;
    double temp;

    //Create the variables for timing
    LARGE_INTEGER start; //Starting time
    LARGE_INTEGER end;   //Ending time
    LARGE_INTEGER freq;  //Rate of time update
    double duration;     //end - start
    QueryPerformanceFrequency(&freq); //Determine the frequency of the performance counter (high precision system timer)

    //Run performance test
    cout << "Start" << endl;
    QueryPerformanceCounter(&start);
    for(double i = 0; i < iterations; i += 1)
    {
        //Overhead and display
        temp = log10(i);
        if(round(temp) == temp)
        {
            cout << temp << endl;
        }
    }
    QueryPerformanceCounter(&end);
    cout << "End" << endl;

    //Output performance test results
    duration = (double)(end.QuadPart - start.QuadPart) / (double)(freq.QuadPart);
    cout << "Duration: " << duration << endl;

    //Dramatic pause
    system("pause");
    return 0;
}

Observations:

For 1E8 iterations:

C++ Execution = 6.45 s

Java Execution = 4.64 s

Update:

According to Visual Studio, my C++ command-line arguments are:

/Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /GL /D "_MBCS" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\C++.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

Update 2:

I changed the C++ code to use the new round function, and I updated the execution time.

Update 3:

I found the answer to the problem, with thanks to Steve Townsend and Loduwijk. After compiling my code to assembly and examining it, I found that the C++ assembly was generating far more memory movement than the Java assembly. This is because my JDK was using an x64 compiler, while my Visual Studio C++ Express could not target the x64 architecture and was thus inherently slower. So I installed the Windows SDK 7.1 and used its compilers to build my code (in Release, run with Ctrl + F5). The timings are now:

C++: ~2.2 s

Java: ~4.6 s

Now I can compile all my code in C++, and finally get the speed that I require for my algorithms. :)


It's a safe assumption that any time you see Java outperforming C++, especially by such a huge margin, you're doing something wrong. Since this is the second question dedicated to such micro-micro-optimizations, I feel I should suggest finding a less futile hobby.

That answers your question: you are using C++ (really, your operating system) wrong. As to the implied question (how?), it's easy: endl flushes the stream, whereas Java keeps buffering it. Replace your cout line with:

cout << temp << "\n";

You do not understand benchmarking well enough to compare this kind of thing (and by this I mean comparing a single math function). I recommend buying a book on testing and benchmarking.


You surely don't want to time the output. Remove the output statements inside each loop and rerun, to get a better comparison of what you are actually interested in. Otherwise you are also benchmarking the output functions and your video driver. The resulting speed could actually depend on whether the console window you run in is obscured or minimized at the time of the test.

Make sure you are not running a Debug build in C++. That will be a lot slower than Release, independent of how you start up the process.

EDIT: I've reproduced this test scenario locally and cannot get the same results. With your code modified (below) to remove the output, Java takes 5.40754388 seconds.

public static void main(String args[]) {
    // Number of iterations
    double iterations = 1E8;
    double temp;

    // Create the variables for timing
    double start;
    double end;
    double duration; // end - start
    int matches = 0;

    // Run performance test
    System.out.println("Start");
    start = System.nanoTime();
    for (double i = 0; i < iterations; i += 1) {
        // Overhead and display
        temp = Math.log10(i);
        if (Math.round(temp) == temp) {
            ++matches;
        }
    }
    end = System.nanoTime();
    System.out.println("End");

    // Output performance test results
    duration = (end - start) / 1E9;
    System.out.println("Duration: " + duration);
}

C++ code below takes 5062 ms. This is with JDK 6u21 on Windows, and VC++ 10 Express.

unsigned int count(1E8);
DWORD end;
DWORD start(::GetTickCount());

int matches(0);
for (unsigned int i = 0; i < count; ++i)
{
    double temp = log10(double(i));
    if (temp == floor(temp + 0.5))
    {
        ++matches;
    }
}

end = ::GetTickCount();
std::cout << end - start << "ms for " << 100000000 << " log10s" << std::endl;

EDIT 2: If I reinstate your logic from Java a little more precisely, I get almost identical times for C++ and Java, which is what I'd expect given the dependency on the log10 implementation.

5157ms for 100000000 log10s

5187ms for 100000000 log10s (double loop counter)

5312ms for 100000000 log10s (double loop counter, round as fn)


As @Mat commented, your C++ round isn't the same as Java's Math.round. Oracle's Java documentation says that Math.round is equivalent to (long)Math.floor(a + 0.5d).

Note that not casting to long will be faster in C++ (and possibly in Java as well).
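
For illustration, a hedged sketch of the two variants side by side (java_round and cpp_round are names introduced here for clarity; they are not part of either program):

#include <cmath>

// Hypothetical helper mirroring what Oracle documents for Math.round,
// i.e. (long)Math.floor(a + 0.5d); long long matches Java's 64-bit long.
inline long long java_round(double a)
{
    return static_cast<long long>(std::floor(a + 0.5));
}

// The question's round: the same rounding rule, but the result stays a
// double, so no integer conversion (and no conversion back for the
// comparison against temp) is needed.
inline double cpp_round(double value)
{
    return std::floor(0.5 + value);
}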


It's because of the printing of the values. Nothing to do with the actual loop.


Perhaps you should use the fast floating-point mode of MSVC.

The fp:fast Mode for Floating-Point Semantics

When the fp:fast mode is enabled, the compiler relaxes the rules that fp:precise uses when optimizing floating-point operations. This mode allows the compiler to further optimize floating-point code for speed at the expense of floating-point accuracy and correctness. Programs that do not rely on highly accurate floating-point computations may experience a significant speed improvement by enabling the fp:fast mode.

The fp:fast floating-point mode is enabled using a command-line compiler switch as follows:

  • cl -fp:fast source.cpp or
  • cl /fp:fast source.cpp

On my Linux box (64 bit) the timings are about equal:

oracle openjdk 6

sehe@natty:/tmp$ time java PerformanceTest2 

real    0m5.246s
user    0m5.250s
sys 0m0.000s

gcc 4.6

sehe@natty:/tmp$ time ./t

real    0m5.656s
user    0m5.650s
sys 0m0.000s

Full disclosure: I threw in every optimization flag in the book; see the Makefile below.


Makefile
all: PerformanceTest2 t

PerformanceTest2: PerformanceTest2.java
    javac $<

t: t.cpp
    g++ -g -O2 -ffast-math -march=native $< -o $@

t.cpp
#include <stdio.h>
#include <cmath>

inline double round(double value)
{
    return floor(0.5 + value);
}
int main()
{
    //Number of iterations
    double iterations = 1E8;
    double temp;

    //Run performance test
    for(double i = 0; i < iterations; i += 1)
    {
        //Overhead and display
        temp = log10(i);
        if(round(temp) == temp)
        {
            printf("%F\n", temp);
        }
    }
    return 0;
}

PerformanceTest2.java
public class PerformanceTest2
{
    public static void main(String args[])
    {
        //Number of iterations
        double iterations = 1E8;
        double temp;

        //Run performance test
        for(double i = 0; i < iterations; i += 1)
        {
            //Overhead and display
            temp = Math.log10(i);
            if(Math.round(temp) == temp)
            {
                System.out.println(temp);
            }
        }
    }
}


Just to sum up what others have stated here: C++ iostream functionality is implemented differently than in Java. In C++, output to IOStreams creates an inner type called sentry before outputting each character. E.g. ostream::sentry uses the RAII idiom to ensure that the stream is in a consistent state. In a multi-threaded environment (which is the default in many cases), the sentry is also used to lock a mutex object and unlock it after each character is printed, to avoid race conditions. Mutex lock/unlock operations are very expensive, and this is the reason why you see that kind of slowdown.

Java takes another direction and locks/unlocks the mutex only once for the whole output string. That's why, if you were to output to cout from multiple threads, you would see really messed-up output, but all the characters would be there.

You can make C++ IOStreams performant if you work directly with stream buffers and only flush the output occasionally. To test this behaviour, just switch off thread support for your test and your C++ executable should run much faster.
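
A minimal sketch of that idea (my own simplification; a more precise version using the num_put facet appears further down in this answer):

#include <cstdio>
#include <iostream>

// Format into a local buffer and push it through the stream buffer,
// bypassing operator<< and its per-call sentry.
inline void write_value(double value)
{
    char buf[64];
    int len = std::sprintf(buf, "%g\n", value);
    std::cout.rdbuf()->sputn(buf, len);
}

int main()
{
    for (int i = 0; i <= 7; ++i)
        write_value(double(i));
    std::cout.flush(); // flush occasionally (here: once, at the end)
    return 0;
}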

I played a bit with the stream and the code. Here are my conclusions: First of all, there is no single-threaded runtime library available starting with VC++ 2008. See the link below, where MS states that single-threaded runtime libs are no longer supported: http://msdn.microsoft.com/en-us/library/abx4dbyh.aspx

Note LIBCP.LIB and LIBCPD.LIB (via the old /ML and /MLd options) have been removed. Use LIBCPMT.LIB and LIBCPMTD.LIB instead via the /MT and /MTd options.

The MS IOStreams implementation in fact locks for each output operation (not per character). Therefore writing:

cout << "test" << '\n';

produces two locks: one for "test" and a second for '\n'. This becomes obvious if you debug into the operator<< implementation:

_Myt& __CLR_OR_THIS_CALL operator<<(double _Val)
    {// insert a double
    ios_base::iostate _State = ios_base::goodbit;
    const sentry _Ok(*this);
    ...
    }

Here the operator call constructs the sentry instance, which is derived from basic_ostream::_Sentry_base. The _Sentry_base ctor acquires a lock on the stream buffer:

template<class _Elem,   class _Traits>
class basic_ostream
  {
  class _Sentry_base
  {
    ///...

  __CLR_OR_THIS_CALL _Sentry_base(_Myt& _Ostr)
        : _Myostr(_Ostr)
        {   // lock the stream buffer, if there
        if (_Myostr.rdbuf() != 0)
          _Myostr.rdbuf()->_Lock();
        }

    ///...
  };
};

Which results in a call to:

template<class _Elem, class _Traits>
void basic_streambuf::_Lock()
    {   // set the thread lock
    _Mylock._Lock();
    }

Results in:

void __thiscall _Mutex::_Lock()
    {   // lock mutex
    _Mtxlock((_Rmtx*)_Mtx);
    }

Results in:

void  __CLRCALL_PURE_OR_CDECL _Mtxlock(_Rmtx *_Mtx)
    {   /* lock mutex */
  // some additional stuff which is not called...
    EnterCriticalSection(_Mtx);
    }

Executing your code with the std::endl manipulator gives the following timings on my machine:

Multithreaded DLL/Release build:

Start
-1.#INF
0
1
2
3
4
5
6
7
End
Duration: 4.43151
Press any key to continue . . .

With '\n' instead of std::endl:

Multithreaded DLL/Release with '\n' instead of endl

Start
-1.#INF
0
1
2
3
4
5
6
7
End
Duration: 4.13076
Press any key to continue . . .

Replacing cout << temp << '\n'; with direct stream buffer serialization to avoid locks:

inline bool output_double(double const& val)
{
  typedef num_put<char> facet;
  facet const& nput_facet = use_facet<facet>(cout.getloc());

  if(!nput_facet.put(facet::iter_type(cout.rdbuf()), cout, cout.fill(), val).failed())
    return cout.rdbuf()->sputc('\n')!='\n';
  return false;
}

This improves the timing a bit more:

Multithreaded DLL/Release without locks by directly writing to streambuf

Start
-1.#INF
0
1
2
3
4
5
6
7
End
Duration: 4.00943
Press any key to continue . . .

Finally, changing the type of the iteration variable from double to size_t and converting it to a fresh double each iteration improves the runtime as well:

size_t iterations = 100000000; //=1E8
...
//Run performance test
size_t i;
cout << "Start" << endl;
QueryPerformanceCounter(&start);
for(i=0; i<iterations; ++i)
{
    //Overhead and display
    temp = log10(double(i));
    if(round(temp) == temp)
      output_double(temp);
}
QueryPerformanceCounter(&end);
cout << "End" << endl;
...

Output:

Start
-1.#INF
0
1
2
3
4
5
6
7
End
Duration: 3.69653
Press any key to continue . . .

Now try my suggestions together with the suggestions made by Steve Townsend. How are the timings now?


You might want to take a look here.

There can be a whole host of factors that could explain why your Java code is running faster than the C++ code. One of those factors could simply be that for this test case, the Java code is faster. I wouldn't even consider using that as a blanket statement for one language being faster than the other though.

If I were to make one change in the way you're doing things, I'd port the code over to Linux and time it with the time command. Congrats, you just eliminated the whole windows.h header.


Your C++ program is slow because you don't know your tool (Visual Studio) well enough. Look at the row of icons below the menu. You will find the word "Debug" in the project configuration box. Switch it to "Release". Make sure you rebuild the project completely via the menu: Build | Clean Project and Build | Build All, or Ctrl+Alt+F7. (The names in your menu may be slightly different, since my installation is in German.) It's not about starting with F5 or Ctrl+F5.

In "Release mode" your C++ program is about twice as fast as your Java program.

The perception that C++ programs are slower than Java or C# programs comes from building them in Debug mode (the default). Even Cay Horstman, a well-regarded C++ and Java book author, fell into this trap in "Core Java 2", Addison-Wesley (2002).

The lesson is: know your tools, especially when you try to judge them.


The JVM can do runtime optimizations. For this simple example, I guess the only relevant optimization is method inlining of Math.round(). Some method invocation overhead is saved, and further optimization is possible after inlining flattens the code.

Watch this presentation to fully appreciate how powerful JVM inlining can be:

http://www.infoq.com/presentations/Towards-a-Universal-VM

This is nice. It means we can structure our logic with methods, and they don't cost anything at runtime. When they argued about GOTO vs. procedures in the 70s, they probably didn't see this coming.
