
JavaScript: remove Arabic text diacritics dynamically

Developer https://www.devze.com 2023-02-15 20:53 Source: Internet

How can I remove Arabic diacritics dynamically? I'm designing an ebook (CHM) with multiple HTML pages that contain Arabic text, but sometimes the search engine won't highlight some of the Arabic words because of their diacritics. Is it possible to use JavaScript functions on page load that strip the diacritics from the Arabic text? There must be an option to enable them again, so I don't want to remove them from the HTML permanently, only temporarily.

The thing is, I don't know where to start or which function is the right one to use.

thank you :)

For Example

Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين


I wrote this function which handles strings with mixed Arabic and English characters, removing special characters (including diacritics) and normalizing some Arabic characters like converting all ة's into ه's.

var normalize_text = function(text) {

  //remove special characters
  text = text.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');

  //normalize Arabic
  text = text.replace(/(آ|إ|أ)/g, 'ا');
  text = text.replace(/(ة)/g, 'ه');
  text = text.replace(/(ئ|ؤ)/g, 'ء')
  text = text.replace(/(ى)/g, 'ي');

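  //convert Arabic-Indic digits (U+0660..U+0669) to their ASCII counterparts.
  var starter = 0x660;
  for (var i = 0; i < 10; i++) {
    //replace() returns a new string, so assign it back; the 'g' flag replaces every occurrence
    text = text.replace(new RegExp(String.fromCharCode(starter + i), 'g'), String.fromCharCode(48 + i));
  }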

  return text;
}
<input value="الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ" type="text" id="input">
<button onclick="document.getElementById('input').value = normalize_text(document.getElementById('input').value)">Normalize</button>


Try this

Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين 

http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/

The code is C# not javascript though. Still trying to figure out how to achieve this in javascript

EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate "letters" and can be removed quite easily.

var CHARCODE_SHADDA = 1617;
var CHARCODE_SUKOON = 1618;
var CHARCODE_SUPERSCRIPT_ALIF = 1648;
var CHARCODE_TATWEEL = 1600;
var CHARCODE_ALIF = 1575;

function isCharTashkeel(letter)
{
    if (typeof(letter) == "undefined" || letter == null)
        return false;

    var code = letter.charCodeAt(0);
    //1648 - superscript alif
    //1619 - madd: ~
    //1611..1631 (U+064B..U+065F) - tashkeel; the range starts at 1611 so that fathatan is included
    return (code == CHARCODE_TATWEEL || code == CHARCODE_SUPERSCRIPT_ALIF || (code >= 1611 && code <= 1631));
}

function stripTashkeel(input)
{
  var output = "";
  //todo consider using a stringbuilder to improve performance
  for (var i = 0; i < input.length; i++)
  {
    var letter = input.charAt(i);
    if (!isCharTashkeel(letter)) //tashkeel
      output += letter;                                
  }


  return output;
}

Edit: Here is another way to do it, using BuckData: http://qurandev.github.com/

Advantages:
Buck uses less bandwidth.
In JavaScript, you can search through the entire Buck Quran text in one shot, which is intuitive compared to Arabic search.
Converting Buck to Arabic and Arabic to Buck is a simple JS call. Play with a live sample here: http://jsfiddle.net/BrxJP/
You can strip all vowels from Buck text in a few milliseconds. Why do this? You can search in JavaScript while ignoring the tashkeel differences (fathah, dammah, kasrah), which leads to more hits.
Regex plus Buck text can lead to significant optimizations, and all the searches can be run locally: http://qurandev.appspot.com
How is the data generated? By a simple one-to-one mapping using http://corpus.quran.com/java/buckwalter.jsp
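
The idea of searching while ignoring tashkeel can also be sketched directly on Arabic text, without converting to Buck first. The snippet below is only an illustration: the helper names buildLooseArabicRegex and escapeRegExp are hypothetical, and the set of marks treated as optional (U+064B to U+0652, U+0670, and tatweel U+0640) is an assumption that may need widening for fully vocalized Quranic text.

```javascript
// Marks allowed between letters: basic harakat, superscript alif, tatweel (assumed set).
const OPTIONAL_MARKS = '[\\u064B-\\u0652\\u0670\\u0640]*';

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a regex that matches `term` whether or not
// the searched text carries diacritics.
function buildLooseArabicRegex(term) {
  // Drop any marks typed in the search term itself.
  const bare = term.replace(/[\u064B-\u0652\u0670\u0640]/g, '');
  // After every remaining character, allow any number of marks.
  const pattern = Array.from(bare)
    .map(ch => escapeRegExp(ch) + OPTIONAL_MARKS)
    .join('');
  return new RegExp(pattern, 'g');
}

const text = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
const matches = text.match(buildLooseArabicRegex('الحمد'));
console.log(matches); // the vocalized word exactly as it appears in the text
```

Because the match is made against the original text, the page content does not have to be modified for highlighting to work.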


Here's a javascript code that can handle removing Arabic diacritics nearly all the time.

var arabicNormChar = {
    'ك': 'ک', 'ﻷ': 'لا', 'ؤ': 'و', 'ى': 'ی', 'ي': 'ی', 'ئ': 'ی', 'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ٱ': 'ا', 'ٳ': 'ا', 'ة': 'ه', 'ء': '', 'ِ': '', 'ْ': '', 'ُ': '', 'َ': '', 'ّ': '', 'ٍ': '', 'ً': '', 'ٌ': '', 'ٓ': '', 'ٰ': '', 'ٔ': '', '�': ''
}

var simplifyArabic  = function (str) {
    return str.replace(/[^\u0000-\u007E]/g, function(a){ 
        var retval = arabicNormChar[a]
        if (retval == undefined) {retval = a}
        return retval; 
    }).normalize('NFKD').toLowerCase();
}

//now you can use simplifyArabic(str) on Arabic strings to remove the diacritics

Note: you may override the arabicNormChar to your own preferences.


Use this regex to catch all tashkeel

[ؐ-ًؚٟ]
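
Character classes made of literal combining marks are hard to read and easy to corrupt when copied, so the same idea may be easier to handle with Unicode escapes. This is a sketch under the assumption that the class is meant to cover U+0610 to U+061A and U+064B to U+065F; the names TASHKEEL_CLASS and stripByTashkeelClass are made up for the example.

```javascript
// Assumed ranges: U+0610..U+061A (honorific and Quranic annotation signs)
// and U+064B..U+065F (fathatan through wavy hamza below).
const TASHKEEL_CLASS = /[\u0610-\u061A\u064B-\u065F]/g;

// Hypothetical helper: remove every character matched by the class above.
function stripByTashkeelClass(text) {
  return text.replace(TASHKEEL_CLASS, '');
}

console.log(stripByTashkeelClass('الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ'));
```

Note that these two ranges do not include the superscript alif (U+0670), so it would need to be added to the class if it should be stripped as well.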


I tried the following solution and it works fine:

const str = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
const withoutDiacs = str.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
console.log(withoutDiacs); //الحمد لله رب العالمين
Reference: https://www.overdoe.com/javascript/2020/06/18/arabic-diacritics.html


This site has some routines for Javascript Unicode normalization which could be used to do what you're attempting. If nothing else it could provide a good starting point.

If you can preprocess the data, Python has good Unicode routines that make easy work of these sorts of transformations. This might be a good option if you can preprocess your CHM file to produce a separate index file, which could then be merged into your CHM:

import unicodedata

def _strip(text):
    return ''.join([c for c in unicodedata.normalize('NFD', text) \
        if unicodedata.category(c) != 'Mn'])

composed = u'\xcd\xf1\u0163\u0115\u0155\u0148\u0101\u0163\u0129\u014d' \
    u'\u0146\u0105\u013c\u012d\u017e\u0119'

print(_strip(composed))  # prints: Internationalize
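
The same decompose-then-filter technique can probably be done in the browser too, since modern JavaScript engines provide String.prototype.normalize and Unicode property escapes in regular expressions. A minimal sketch, where stripMarks is a hypothetical name:

```javascript
// Decompose to NFD, then drop every nonspacing mark (Unicode category Mn).
function stripMarks(text) {
  return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}

// Same sample string as the Python example.
const composed = '\u00CD\u00F1\u0163\u0115\u0155\u0148\u0101\u0163\u0129\u014D' +
  '\u0146\u0105\u013C\u012D\u017E\u0119';
console.log(stripMarks(composed)); // Internationalize
```

For Arabic this is more aggressive than a tashkeel-only regex: NFD splits letters such as أ into a bare alif plus a combining hamza, and the hamza is then removed along with the vowel marks.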


A shorter approach to remove the Arabic diacritics (either the 8 Basic diacritics or the full 52 diacritics) could be as follows:

Remove Basic Diacritics

function removeTashkeelBasic(s) {return s.replace(/[ً-ْ]/g,'');}



//===================
//     Test Cases
//===================
console.log(removeTashkeelBasic('حِسَابٌ وَحِسَابًا مِنْ ثَلَاثُمِئَةِ رِيَالٍ قَطَرِيٍّ'));
console.log(removeTashkeelBasic('بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ'));

Remove All Arabic Diacritics

function removeTashkeelAll(s) {return s.replace(/[ؐ-ًؕ-ٖٓ-ٟۖ-ٰٰۭ]/g,'');}


//===================
//     Test Cases
//===================
console.log(removeTashkeelAll('حِسَابٌ وَحِسَابًا مِنْ ثَلَاثُمِئَةِ رِيَالٍ قَطَرِيٍّ'));
console.log(removeTashkeelAll('بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ'));
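
The question also asks for the diacritics to be restorable. One way this might be done is to keep the original strings in memory and swap between the two versions. The sketch below is an assumption about how that could look, not something taken from the answers: createTashkeelToggle and collectTextNodes are hypothetical helpers, and only the 8 basic diacritics are stripped.

```javascript
const BASIC_TASHKEEL = /[\u064B-\u0652]/g; // the 8 basic diacritics

// Hypothetical helper. `nodes` is an array of objects with a writable
// `nodeValue` string, for example DOM text nodes.
function createTashkeelToggle(nodes) {
  const originals = nodes.map(node => node.nodeValue); // saved once, up front
  let stripped = false;
  return {
    strip() {
      nodes.forEach((node, i) => {
        node.nodeValue = originals[i].replace(BASIC_TASHKEEL, '');
      });
      stripped = true;
    },
    restore() {
      nodes.forEach((node, i) => { node.nodeValue = originals[i]; });
      stripped = false;
    },
    toggle() {
      if (stripped) { this.restore(); } else { this.strip(); }
    },
    isStripped() { return stripped; }
  };
}

// Browser only: collect every text node under `root` with a TreeWalker.
function collectTextNodes(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const found = [];
  while (walker.nextNode()) found.push(walker.currentNode);
  return found;
}
```

In a page this could be wired up on load with createTashkeelToggle(collectTextNodes(document.body)), calling strip() before a search and restore() afterwards. The HTML source itself is never changed.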
