
Weird Characters on Webpage after HTML Tidy

Developer https://www.devze.com 2023-02-14 22:09 Source: web

I'm getting content via Amazon Web Services (e.g. product descriptions). Since the content from Amazon is often marked up very poorly, it ends up messing up the layout of my web pages. So, I have come up with a function to "sanitize" the content using HTML Tidy.

The weird thing is that when I test it separately from my application, everything seems to work fine. But in my application (running on CodeIgniter), the function seems to return odd characters.

The code below is my test script. It's outputting what I think I need.

In my application, I grab the description from my database, sanitize it, then display it on my webpage. After sanitization, for example, document’s (you can see this word in the example below) becomes documentâ€™s (again, only in the actual application, not in the test code; both functions are identical).

Any ideas why? Here's my test function:

    $amazon_content = <<<AMAZON
JavaScript is the brains of your Web page—it enables you to modify a document’s structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin' with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br><br>Today’s application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br><br>Scriptin' with JavaScript and Ajax will teach you how to:<br><ul><li>Start developing with JavaScript fast!</li></ul><ul><li>Write lightweight but powerful object-oriented code </li></ul><ul><li>Modify the Document Object Model </li></ul><ul><li>“Progressively enhance” your pages with JavaScript to provide the highest levels of accessibility to all users</li></ul><ul><li>Learn sophisticated techniques for making your pages respond to user actions</li></ul><ul><li>Use the downloadable Scriptin’ library of helper functions to speed development and ensure cross-browser compatibility</li></ul><ul><li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li></ul><ul><li>Create powerful interface interactions, such as sliding panels and tree menus</li></ul><ul><li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li></ul><ul><li>Build an online application that looks and responds like a regular desktop application</li></ul><ul><li>Easily adapt the Scriptin’ code examples for use in your own projects—download them at www.scriptinwithajax.com</li></ul><br>
AMAZON;

    echo '<textarea cols="150" rows="12">' . $amazon_content . '</textarea>';
    echo '<textarea cols="150" rows="12">' . get_sanitized_amazon_content($amazon_content) . '</textarea>';
    echo  get_sanitized_amazon_content($amazon_content);

    function get_sanitized_amazon_content($amazon_content)
    {
        $tidy_config             = array(
            'bare' => TRUE,
            'clean' => TRUE,
            'drop-empty-paras' => TRUE,
            'drop-font-tags' => TRUE,
            'drop-proprietary-attributes' => TRUE,
            'enclose-text' => TRUE,
            'fix-backslash' => TRUE,
            'fix-bad-comments' => TRUE,
            'fix-uri' => TRUE,
            'hide-comments' => TRUE,
            'hide-endtags' => TRUE,
            'logical-emphasis' => TRUE,
            'lower-literals' => TRUE,
            'merge-divs' => TRUE,
            'output-xhtml' => TRUE,
            'quote-ampersand' => TRUE,
            'quote-marks' => TRUE,
            'show-body-only' => TRUE,
            'word-2000' => TRUE
        );
        $tidy                    = new tidy();
        $sanitized_amazon_markup = $tidy->repairString($amazon_content, $tidy_config);

        // Replace carriage returns, line feeds, tabs with single space
        $sanitized_amazon_markup = preg_replace('/\r|\n|\t/', ' ', $sanitized_amazon_markup);

        // Removes unnecessary tags
        // TODO: get complete list; put in an array
        $sanitized_amazon_markup = strip_tag($sanitized_amazon_markup, 'div');
        $sanitized_amazon_markup = strip_tag($sanitized_amazon_markup, 'span');

        // Replace double spaces with single space
        $sanitized_amazon_markup = preg_replace('/ {2,}/i', ' ', $sanitized_amazon_markup);

        // Remove leading and trailing space
        $sanitized_amazon_markup = trim($sanitized_amazon_markup);

        return $sanitized_amazon_markup;
    }

    function strip_tag($tagged_content, $tag_name)
    {
        return preg_replace('%<[ \t\r\n]*/?[ \t\r\n]*' . $tag_name . '.*?>%i', '', $tagged_content);
    }

UPDATE:

This is what I get in my application:

<p>JavaScript is the brains of your Web page&acirc;&euro;&quot;it enables you to modify a document&acirc;&euro;&trade;s structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin&#39; with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br /> <br /> Today&acirc;&euro;&trade;s application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br /> <br /> Scriptin&#39; with JavaScript and Ajax will teach you how to:<br /></p> <ul> <li>Start developing with JavaScript fast!</li> </ul> <ul> <li>Write lightweight but powerful object-oriented code</li> </ul> <ul> <li>Modify the Document Object Model</li> </ul> <ul> <li>&acirc;&euro;&oelig;Progressively enhance&acirc;&euro; your pages with JavaScript to provide the highest levels of accessibility to all users</li> </ul> <ul> <li>Learn sophisticated techniques for making your pages respond to user actions</li> </ul> <ul> <li>Use the downloadable Scriptin&acirc;&euro;&trade; library of helper functions to speed development and ensure cross-browser compatibility</li> </ul> <ul> <li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li> </ul> <ul> <li>Create powerful interface interactions, such as sliding panels and tree menus</li> </ul> <ul> <li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li> </ul> <ul> <li>Build an online application that looks and responds like a regular desktop application</li> </ul> <ul> <li>Easily adapt the Scriptin&acirc;&euro;&trade; code examples for use in your own 
projects&acirc;&euro;&quot;download them at www.scriptinwithajax.com</li> </ul> <p><br /></p>

This is what I get when outside of my application:

<p>JavaScript is the brains of your Web page-it enables you to modify a document's structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin' with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br /> <br /> Today's application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br /> <br /> Scriptin' with JavaScript and Ajax will teach you how to:<br /></p> <ul> <li>Start developing with JavaScript fast!</li> </ul> <ul> <li>Write lightweight but powerful object-oriented code</li> </ul> <ul> <li>Modify the Document Object Model</li> </ul> <ul> <li>"Progressively enhance" your pages with JavaScript to provide the highest levels of accessibility to all users</li> </ul> <ul> <li>Learn sophisticated techniques for making your pages respond to user actions</li> </ul> <ul> <li>Use the downloadable Scriptin' library of helper functions to speed development and ensure cross-browser compatibility</li> </ul> <ul> <li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li> </ul> <ul> <li>Create powerful interface interactions, such as sliding panels and tree menus</li> </ul> <ul> <li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li> </ul> <ul> <li>Build an online application that looks and responds like a regular desktop application</li> </ul> <ul> <li>Easily adapt the Scriptin' code examples for use in your own projects-download them at www.scriptinwithajax.com</li> </ul> <p><br /></p>


The - between "page" and "it" is not a plain hyphen-minus (ASCII 0x2d) but a long dash (specifically U+2014, the em dash). Encoded in UTF-8, it is a three-byte sequence: 0xe2 0x80 0x94.

If you interpret that sequence in Windows-1252 encoding, that gives you:

0xe2 => â => &acirc;
0x80 => € => &euro;
0x94 => ” (right double quotation mark, which ends up as a plain double quote) => &quot;
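
The mapping above can be reproduced in a few lines. This is only a sketch: misread_as_cp1252() is a hypothetical helper name, and it assumes PHP's iconv extension is available. It takes UTF-8 bytes, decodes them as if they were Windows-1252, and shows the entities that result for the apostrophe in "document’s".

```php
<?php
// Hypothetical helper (not part of the original code): decode the given bytes
// as Windows-1252 and return the result re-encoded as UTF-8.
function misread_as_cp1252(string $bytes): string
{
    $converted = iconv('CP1252', 'UTF-8', $bytes);
    // iconv() returns false for bytes that Windows-1252 leaves undefined.
    return $converted === false ? '' : $converted;
}

$em_dash    = "\xE2\x80\x94"; // U+2014 encoded in UTF-8
$apostrophe = "\xE2\x80\x99"; // U+2019 encoded in UTF-8

echo bin2hex($em_dash), "\n"; // e28094

// Three characters come back instead of one: U+00E2, U+20AC, U+201D.
echo misread_as_cp1252($em_dash), "\n";

// The same misreading, then entity-encoded for display.
echo htmlentities(misread_as_cp1252("document{$apostrophe}s"), ENT_QUOTES, 'UTF-8'), "\n";
// document&acirc;&euro;&trade;s
```

The last line matches the entities in the output from the application, which is consistent with the bytes being UTF-8 but decoded as Windows-1252 somewhere along the way.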

So what you have is an encoding issue. You're getting UTF-8 as input, but it is being interpreted as Windows-1252. Your tidying up then turns the resulting non-ASCII characters into HTML entities, just as it should.
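
If the data really is UTF-8, one possible remedy is to tell Tidy so explicitly. The sketch below assumes PHP's tidy extension, whose repairString() accepts an encoding name as its third argument. tidy_repair_utf8() is a hypothetical wrapper, and the single config option is only a placeholder for the full option list used in the question.

```php
<?php
// Hypothetical wrapper (an assumption, not the asker's code): repair markup
// while declaring that the input and output bytes are UTF-8.
function tidy_repair_utf8(string $html, array $config = []): string
{
    // Placeholder default; merge in whatever options the application needs.
    $config += ['show-body-only' => true];

    $tidy = new tidy();
    // 'utf8' is Tidy's name for UTF-8; without it Tidy falls back to its
    // default encoding and may treat each byte as a separate character.
    $repaired = $tidy->repairString($html, $config, 'utf8');

    // repairString() can return false on failure; fall back to the input.
    return $repaired === false ? $html : $repaired;
}
```

In the question's function this would amount to changing the call to $tidy->repairString($amazon_content, $tidy_config, 'utf8'). Whether that is sufficient depends on the database connection and the page itself also using UTF-8.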

As for why this is happening inside your app and not outside, there are a few possibilities. One is that you don't have the same locale/encoding configuration outside and inside. Another is that when you're testing outside your app, you're not getting the data exactly as it is coming from the web - i.e. the encoding you're getting is different (possibly altered).
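
To find out which of these applies, it can help to look at the raw bytes in both environments before Tidy touches them. The following is a minimal diagnostic sketch; describe_bytes() is a hypothetical name, and it relies only on PCRE's /u modifier rejecting invalid UTF-8.

```php
<?php
// Hypothetical diagnostic (not from the original post): classify a string's
// bytes as plain ASCII, valid UTF-8, or something else.
function describe_bytes(string $s): string
{
    // Only bytes 0x00-0x7f: encoding cannot matter.
    if (preg_match('/^[\x00-\x7F]*\z/', $s) === 1) {
        return 'ascii';
    }
    // With the /u modifier, preg_match() returns false on invalid UTF-8.
    if (preg_match('//u', $s) === 1) {
        return 'utf-8';
    }
    return 'not utf-8';
}

// Usage: run this on the same description in the test script and in the app.
$sample = "page\xE2\x80\x94it"; // em dash as UTF-8
echo describe_bytes($sample), ' ', bin2hex($sample), "\n";
```

An em dash shows up as e28094 in the hex dump when the string is UTF-8 and as the single byte 97 when it is Windows-1252, so differing dumps would point to the input differing between the two environments rather than the configuration.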
