Sort a list of Hungarian strings in the Hungarian alphabetical order

I am currently working with some Hungarian data, and I have to sort a list of Hungarian strings.

According to this Collation Sequence page

Hungarian alphabetic order is: A=Á, B, C, CS, D, DZ, DZS, E=É, F, G, GY, H, I=Í, J, K, L, LY, M, N, NY, O=Ó, Ö=Ő, P, Q, R, S, SZ, T, TY, U=Ú, Ü=Ű, V, W, X, Y, Z, ZS

So accented and unaccented vowels are treated as the same letter (A=Á, ...), and when sorting with Collator the result can look like this:

Abdffg
Ádsdfgsd
Aegfghhrf

Up to here, no problem :)

But now I am required to sort according to the full Hungarian alphabet:

A Á B C Cs D Dz Dzs E É F G Gy H I Í J K L Ly M N Ny O Ó Ö Ő P (Q) R S Sz T Ty U Ú Ü Ű V (W) (X) (Y) Z Zs

Here A is considered a different letter from Á.

Changing the strength of the Collator doesn't change the order of the output. A and Á are still mixed up.

Are there any libraries or tricks to sort a list of strings according to the Hungarian alphabetical order?

So far what I am doing is:

Sort with Collator so that C/CS, D/DZ/DZS, ... are ordered correctly
Sort again by comparing the first character of each word based on a map

This looks like too much hassle for the task, no?

import java.text.Collator;
import java.util.*;

List<String> words = Arrays.asList(
        "Árfolyam", "Az",
        "Állásajánlatok", "Adminisztráció",
        "Zsfgsdgsdfg", "Qdfasfas"
);

final Map<String, Integer> map = new HashMap<String, Integer>();
      map.put("A",0);
      map.put("Á",1);
      map.put("E",2);
      map.put("É",3);

      map.put("O",4);
      map.put("Ó",5);
      map.put("Ö",6);
      map.put("Ő",7);

      map.put("U",8);
      map.put("Ú",9);
      map.put("Ü",10);
      map.put("Ű",11);


      final Collator c = Collator.getInstance(new Locale("hu"));
      c.setStrength(Collator.TERTIARY);
      Collections.sort(words, c);

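      // Second pass: order by the map when both words start with different
      // mapped vowels (A before Á, ...); otherwise defer to the Collator.
      Collections.sort(words, new Comparator<String>() {
          public int compare(String s1, String s2) {

              int f = c.compare(s1, s2);
              if (f == 0 || s1.isEmpty() || s2.isEmpty()) return f;

              Integer a = map.get(s1.substring(0, 1));
              Integer b = map.get(s2.substring(0, 1));

              // Compare Integer values, not references (== on Integer is unsafe).
              if (a != null && b != null && !a.equals(b)) {
                  return Integer.compare(a, b);
              }

              // Same first letter, or a letter not in the map: keep the Collator order.
              return f;
          }
      });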

Thanks for your input

I found a good solution: you can use a RuleBasedCollator.

Source: http://download.oracle.com/javase/tutorial/i18n/text/rule.html

And here is the Hungarian rule:

 < a,A < á,Á < b,B < c,C < cs,Cs,CS < d,D < dz,Dz,DZ < dzs,Dzs,DZS 
 < e,E < é,É < f,F < g,G < gy,Gy,GY < h,H < i,I < í,Í < j,J
 < k,K < l,L < ly,Ly,LY < m,M < n,N < ny,Ny,NY < o,O < ó,Ó 
 < ö,Ö < ő,Ő < p,P < q,Q < r,R < s,S < sz,Sz,SZ < t,T 
 < ty,Ty,TY < u,U < ú,Ú < ü,Ü < ű,Ű < v,V < w,W < x,X < y,Y < z,Z < zs,Zs,ZS
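
A minimal sketch of how the rule above could be passed to a RuleBasedCollator. The class and method names (HungarianRuleSort, sortHungarian) are made up for illustration; only the rule string comes from this answer. In the rule syntax, '<' marks a primary (letter) difference and ',' a tertiary (case) difference, so A and Á become distinct letters.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HungarianRuleSort {

    // The rule from the answer, joined into a single string.
    static final String HUNGARIAN_RULES =
          "< a,A < á,Á < b,B < c,C < cs,Cs,CS < d,D < dz,Dz,DZ < dzs,Dzs,DZS "
        + "< e,E < é,É < f,F < g,G < gy,Gy,GY < h,H < i,I < í,Í < j,J "
        + "< k,K < l,L < ly,Ly,LY < m,M < n,N < ny,Ny,NY < o,O < ó,Ó "
        + "< ö,Ö < ő,Ő < p,P < q,Q < r,R < s,S < sz,Sz,SZ < t,T "
        + "< ty,Ty,TY < u,U < ú,Ú < ü,Ü < ű,Ű < v,V < w,W < x,X < y,Y < z,Z < zs,Zs,ZS";

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sortHungarian(List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(HUNGARIAN_RULES);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator); // Collator implements Comparator<Object>
            return sorted;
        } catch (ParseException e) {
            // Only reachable if the rule string itself is malformed.
            throw new IllegalStateException("Invalid collation rule", e);
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Árfolyam", "Az", "Állásajánlatok",
                "Adminisztráció", "Zsfgsdgsdfg", "Qdfasfas");
        System.out.println(sortHungarian(words));
    }
}
```

With the word list from the question, the words starting with A should come out before those starting with Á, without a second sorting pass.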

With streams you can sort like this:

public List<String> sortBy(List<String> sortable) {

  // Needs java.text.Collator, java.util.Locale, java.util.List
  // and java.util.stream.Collectors.
  Collator coll = Collator.getInstance(new Locale("hu","HU"));

  return sortable.stream()
                 .sorted(coll)   // a Collator is itself a Comparator
                 .collect(Collectors.toList());
}

Will any of the solutions order the strings (names) 'Czár' and 'Csóka' as Czár, Csóka? This would be the correct order, since the CS in Csóka is considered one letter and comes after C. However, recognizing double-character consonants reliably is impossible even with a list of all Hungarian words: two words can look exactly the same character by character, while in one the two characters are separate consonants and in the other they represent a single letter at the very same place.
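
One way to check the Czár/Csóka case is to ask a RuleBasedCollator directly. The sketch below uses a deliberately reduced rule containing only C and the digraph CS (an assumption made for brevity, not a complete Hungarian rule), and the class name DigraphCheck is hypothetical. The contraction is purely textual: every "cs" sequence is treated as the single letter CS, so it cannot tell apart the ambiguous cases described above.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class DigraphCheck {

    // Reduced rule for illustration only: C first, then CS as its own letter.
    static final String RULES = "< c,C < cs,Cs,CS";

    // Negative result: left sorts before right under RULES.
    static int compare(String left, String right) {
        try {
            return new RuleBasedCollator(RULES).compare(left, right);
        } catch (ParseException e) {
            throw new IllegalStateException("Invalid collation rule", e);
        }
    }

    public static void main(String[] args) {
        // "Cz..." starts with the letter C, "Cs..." with the letter CS.
        System.out.println(compare("Czár", "Csóka") < 0);
    }
}
```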

Change the order of your map.

Put the numeric representation as the key and the letter as the value. This will allow you to use a TreeMap which will be sorted by key.

You can then just do map.get(1) and it will return the first letter of the alphabet.
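
A rough sketch of that reversed map, assuming 1-based positions and the alphabet as listed in the question. The class and method names (AlphabetIndex, build) are invented for the example. Note that such a map only answers "which letter is at position n"; it does not sort the words by itself.

```java
import java.util.TreeMap;

class AlphabetIndex {

    // Hungarian alphabet in the order given in the question.
    static final String[] LETTERS = {
        "A", "Á", "B", "C", "Cs", "D", "Dz", "Dzs", "E", "É", "F", "G", "Gy",
        "H", "I", "Í", "J", "K", "L", "Ly", "M", "N", "Ny", "O", "Ó", "Ö", "Ő",
        "P", "Q", "R", "S", "Sz", "T", "Ty", "U", "Ú", "Ü", "Ű", "V", "W", "X",
        "Y", "Z", "Zs"
    };

    // Position (starting at 1) -> letter; TreeMap keeps the keys sorted.
    static TreeMap<Integer, String> build() {
        TreeMap<Integer, String> map = new TreeMap<>();
        for (int i = 0; i < LETTERS.length; i++) {
            map.put(i + 1, LETTERS[i]);
        }
        return map;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> map = build();
        System.out.println(map.get(1));                // first letter of the alphabet
        System.out.println(map.lastEntry().getValue()); // last letter of the alphabet
    }
}
```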