Java string optimizations - load-in-place algorithm

Developer https://www.devze.com 2023-04-08 20:37 Source: Internet
I need to optimize the actual loading/parsing of a CSV file (strings). The best approach I know of is a load-in-place algorithm, and I used it successfully via JNI and a C++ DLL that loads the data directly from a file built from the parsed CSV data.

It would have been fine if it stopped there, but that scheme only made it 15% faster (the data no longer needs parsing). One of the reasons it is not as fast as I first expected is that the Java client uses jstring, so I have to convert the data again from char* to jstring.

The best would be to skip that conversion step and load the data in place directly into the jstring objects (no more conversion). So instead of duplicating the loaded-in-place data, the jstring would point directly into that chunk of memory (note that the data would be made of jchars instead of chars). The really bad part is that we would need to make sure the garbage collector doesn't collect that data (by keeping a reference to it, maybe?), but it should be feasible... no?

I think I have two options to do that:

1- Load the data in Java (no more JNI) and use chars that point to the loaded data to create the strings... but I need to find a way to prevent the data from being duplicated when creating a String.

2- Continue using JNI to "manually" create and set the jstring variable, and make sure that the garbage collector options are set properly to prevent it from doing anything to it. For instance:

jstring str; 
str.data = loadedinplacedata;  // assign data pointer
return str;

Not sure if that's possible, but I wouldn't mind just saving the jstring directly into the file and reloading it like this:

jstring * str = (jstring *)&loadedinplacedata[someoffset];
return * str;

I'm aware that this is not the usual Java thing, but I'm pretty sure Java is extensible enough to be able to do that. And it's not like I really have a choice in the matter... the project is already 3 years old and it needs to work. =S

This is the JNI code (C++):

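const jchar* data = GetData(id, row, col); // pointer to a \0-terminated UTF-16 string
// wcslen is only correct where wchar_t is 16 bits wide (e.g. Windows);
// on other platforms, count the jchars up to the terminator instead.
jsize len = static_cast<jsize>(wcslen(reinterpret_cast<const wchar_t*>(data)));
// The best would be to prevent this function from duplicating the data.
jstring str = env->NewString(data, len);
return str;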

Note: The code above made it 20% faster (instead of 15%) by using Unicode data instead of UTF-8 (NewString instead of NewStringUTF). This shows that if I can remove or optimize that step, I'd get a substantial performance increase.


I've never worked with JNI, but... would it make sense to have it return a custom class implementing CharSequence, and maybe a few other interfaces like Comparable<CharSequence>, instead of a String? It seems like you'd be less likely to run into data corruption problems that way.
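
As a rough illustration of that idea, here is a minimal sketch of such a class. The name CharSlice and the assumption that all cells live in one shared char[] are made up for the example; how the buffer gets filled (JNI or otherwise) is left out.

```java
/** Hypothetical read-only view over a slice of a shared char[]; no per-string copy is made. */
final class CharSlice implements CharSequence, Comparable<CharSequence> {
    private final char[] data;  // shared buffer holding all loaded cells (assumed layout)
    private final int offset;   // index of the first char of this slice
    private final int length;   // number of chars in this slice

    CharSlice(char[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset > data.length - length) {
            throw new IndexOutOfBoundsException("bad slice: " + offset + "+" + length);
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return data[offset + index];
    }

    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        return new CharSlice(data, offset + start, end - start); // still no copy
    }

    /** Compares by UTF-16 code unit, the same ordering String.compareTo uses. */
    @Override public int compareTo(CharSequence other) {
        int n = Math.min(length, other.length());
        for (int i = 0; i < n; i++) {
            int diff = charAt(i) - other.charAt(i);
            if (diff != 0) return diff;
        }
        return length - other.length();
    }

    /** Copies the chars; call only when a real String is unavoidable. */
    @Override public String toString() { return new String(data, offset, length); }
}
```

The catch is that any API that insists on a String forces a toString() copy, and the sketch does not define equals/hashCode, so it would need those before being used as a map key.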


I think first you have to understand why the C++ version runs 15% faster, and why that performance improvement is not directly translatable into Java. Why can't you write the code 15% faster in Java?

Let's look at your problem. You've eliminated the parsing by using a C++ DLL. (Why could this not have been done in Java?) And then, as I understand it:

  1. You're proposing to manipulate the contents of the jstrings directly
  2. You want to prevent the garbage collector from touching these modified jstrings (by keeping a reference to them), and therefore potentially modifying the behaviour of the JVM and screwing with the garbage collector when it does eventually garbage collect.

Will you 'fix' these references before you allow them to be garbage collected?

If you propose doing your own memory management, why are you using java at all? Why not just do it in pure C++?

Assuming that you wish to continue in Java: when you create a String, the String itself is a new object, but the data it points to is not necessarily new. You can test this by calling String.intern(), using the following code:

public class InternDemo {
    public static void main(String[] args) {
        String s3 = "foofoo";

        String s1 = call("foo");
        String s2 = call("foo");

        System.out.println("s1 == s2=" + (s1 == s2));
        System.out.println("s1.intern() == s2.intern()=" + (s1.intern() == s2.intern()));
        System.out.println("s1.intern() == s3.intern()=" + (s1.intern() == s3.intern()));

        System.out.println("s1.substring(3) == s2.substring(3)=" + (s1.substring(3) == s2.substring(3)));
        System.out.println("s1.substring(3).intern() == s2.substring(3).intern()=" + (s1.substring(3).intern() == s2.substring(3).intern()));
    }

    // Concatenation with a non-constant operand builds a new String at run time.
    public static String call(String s) {
        return s + "foo";
    }
}

This produces:

s1 == s2=false
s1.intern() == s2.intern()=true
s1.intern() == s3.intern()=true
s1.substring(3) == s2.substring(3)=false
s1.substring(3).intern() == s2.substring(3).intern()=true

So you can see that although s1 and s2 are different String objects, intern() resolves equal contents to a single shared instance, so the data need not be held twice. Your modifications may therefore not be that relevant; the JVM may already be doing some of this for you. It's also worth saying that if you start modifying the internals of jstrings, you may well break this.

My suggestion would be to find out what you can do in terms of algorithms. Development in pure Java is always quicker than Java and JNI combined. You've got a much better chance of finding a better solution with pure Java.
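
One pure-Java direction, purely as a sketch: memory-map the pre-parsed file and view it as a CharBuffer, which already implements CharSequence, so no per-cell String is created. The class name MappedChars, the helper names, and the assumption that the file holds raw little-endian UTF-16 code units are all made up for this example; the real file layout would need its own offsets.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedChars {
    /**
     * Maps a file of raw UTF-16LE code units (assumed format) and views it as chars.
     * The contents are not copied onto the Java heap. Files over 2 GB would need
     * several mappings, which this sketch does not handle.
     */
    static CharBuffer map(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A mapping stays valid after the channel that created it is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                     .order(ByteOrder.LITTLE_ENDIAN)
                     .asCharBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a view of chars [start, end) that shares the mapped memory. */
    static CharSequence cell(CharBuffer all, int start, int end) {
        return all.subSequence(start, end);
    }
}
```

Whether this beats the JNI route would have to be measured; it only removes the copy for callers that can work with CharSequence rather than String.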


Well... it seems that what I wanted to do is not "supported" by Java unless I hack it. I believe it would be possible by using GetStringCritical to get the address of the actual char array and then working out the number of characters and so on, but that is way beyond "safe" programming.

The best workaround I found was to create a hash table in Java and use a unique identifier computed while creating my data file (acting similarly to .intern()). If the string is not in the hash table, it is queried through the DLL and saved in the hash table.

Data file layout:

  1. numrows, numcols
  2. For each cell, an integer value (in my case, the offset in memory pointing to the string)
  3. For each cell, the string, terminated by \0

By using the offset value, I can somewhat minimize the number of string creations and string queries. I tried using a global reference to keep the string inside the DLL, but that made it 4 times slower.
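
The Java side of that workaround looks roughly like the sketch below. The class name OffsetStringCache is made up, and the loader parameter stands in for the native call into the DLL, which is not shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Caches strings by their offset in the data file so each one crosses JNI at most once. */
final class OffsetStringCache {
    private final Map<Integer, String> byOffset = new HashMap<>();
    private final IntFunction<String> loader; // stands in for the native lookup (hypothetical)

    OffsetStringCache(IntFunction<String> loader) {
        this.loader = loader;
    }

    /** Returns the cached String for this offset, calling the loader only on a miss. */
    String get(int offset) {
        return byOffset.computeIfAbsent(offset, loader::apply);
    }

    /** Number of distinct strings loaded so far. */
    int size() {
        return byOffset.size();
    }
}
```

Cells that share an offset then share one String object, which is what keeps the number of NewString calls down. The sketch is not thread-safe; a ConcurrentHashMap would be the obvious swap if several threads read cells.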
