Best way to find illegal characters in a bunch of ISO-889-1 web pages?_问答_开发者

Best way to find illegal characters in a bunch of ISO-889-1 web pages?

开发者 https://www.devze.com 2022-12-10 01:15 出处：网络

I have a bunch of html files in a site that were created in the year 2000 and have been maintained to this day. We\'ve recently began an effort to replace illegal charac开发者_运维知识库ters with thei

I have a bunch of html files in a site that were created in the year 2000 and have been maintained to this day. We've recently began an effort to replace illegal charac开发者_运维知识库ters with their html entities. Going page to page looking for copyright symbols and trademark tags seems like quite a chore. Do any of you know of an app that will take a bunch of html files and tell me where I need to replace illegal characters with html entities?

You could write a PHP script (if you can; if not, I'd be happy to help), but I assume you already converted some of the "special characters", so that does make the task a little harder (although I still think it's possible)...

Any good text editor will do a file contents search for you and return a list of matches.

I do this with EditPlus. There are several editors like Notepad++, TextPad, etc that will easily help you do this.

You do not have to open the files. You just specify a path where the files are stored and the Mask (*.html) and the contents to search for "©" and the editor will come back with a list of matches and when you double click, it opens the file and brings up the matching line.

I also have a website that needs to regularly convert large numbers of file names back and forth between character sets. While a text editor can do this, a portable solution using 2 steps in php was preferrable. First, add the filenames to an array, then do the search and replace. An extra piece of code in the function excludes certain file types from the array.

Function listdir($start_dir='.') {                                                           
  $nonFilesArray=array('index.php','index.html','help.html'); //unallowed files & subfolders 
  $filesArray = array() ; // $filesArray holds new records and $full[$j] holds names         
  if (is_dir($start_dir)) {                                                                  
    $fh = opendir($start_dir);                                                               
    while (($tmpFile = readdir($fh)) !== false) { // get each filename without its path      
      if (strcmp($tmpFile, '.')==0 || strcmp($tmpFile, '..')==0) continue; // skip . & ..    
      $filepath = $start_dir . '/' . $tmpFile; // name the relative path/to/file             
      if (is_dir($filepath)) // if path/to/file is a folder, recurse into it                 
        $filesArray = array_merge($filesArray, listdir($filepath));                          
      else // add $filepath to the end of the array                                          

      $test=1 ; foreach ($nonFilesArray as $nonfile) {                                       
        if ($tmpFile == $nonfile) { $test=0 ; break ; } }                                    
      if ( is_dir($filepath) ) { $test=0 ; }                                                 
      if ($test==1 && pathinfo($tmpFile, PATHINFO_EXTENSION)=='html') {                      
        $filepath = substr_replace($filepath, '', 0, 17) ; // strip initial part of $filepath
        $filesArray[] = $filepath ; }                                                        
    }                                                                                        
    closedir($fh);                                                                           
  } else { $filesArray = false; } # no such folder                                           
  return $filesArray ;                                                                       
}                                                                                            

$filesArray = listdir($targetdir); // call the function for this directory                   
$numNewFiles = count($filesArray) ; // get number of records                                 

for ($i=0; $i<$numNewFiles; $i++) { // read the filenames and replace unwanted characters    
  $tmplnk = $linkpath .$filesArray[$i] ;                                                     
  $outname = basename($filesArray[$i],".html") ; $outname = str_replace('-', ' ', $outname); 
}