
How do I extract elements from HTML source and convert them to readable text?

Developer https://www.devze.com 2023-02-07 10:57 Source: web

I have a page which displays "HeLLo 54292" in ASCII art, using + characters inside <table> tags to produce block letters. I'm generating this with PHP. You can check the page's HTML source to see how the ASCII art is constructed.

I want to convert the ASCII-art letters back to actual text, so that I can parse the HTML source and end up with the string "HeLLo 54292". How would I accomplish this?


Step 1: Write an HTML rendering engine in PHP. It will parse the HTML, lay out the page and render it to an image.

Step 2: Write an optical character recognition library in PHP. It will take an image as input, and identify letters in that image by their shapes.

Step 3: Combine those programs and you can convert your tables back to text.

Estimated time for full solution: 1-2 years.


I believe you could package this as a task on Mechanical Turk. This exactly fits the profile of solving problems which are presented via browser rendering.

https://www.mturk.com/mturk/welcome

The latency would be pretty good, probably just a little bit faster than Stack Overflow.

Actually, OK, if you hook it up to SO... No, seriously: those of you reading this, would you rather get three pennies or 10 rep points? Mmmmm?


Wow, I'm gonna go with impossible. Why would you need to convert it to text? Do you have a program generating text in such a format? If so, what's stopping you from getting the original variable?


Deconstruct the HTML by using the same patterns you used to produce it.

You used PHP to create that HTML from a string. Reverse the process to convert the HTML back into a string. You have the source code, so it should be easy.

Do a reverse replace of each string representing a pixel to recreate the pattern of each letter. Then compare that pattern to the ones your generator produces for each character to find the sequence.
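
A minimal sketch of that idea, written in Python for brevity (the same structure should port to PHP). Everything specific here is an assumption, because the asker's generator isn't shown: the 3x5 glyph shapes in `FONT`, the function names `render_glyph`, `encode`, and `decode`, the choice of one `<table>` per character, and the requirement that the decoder sees exactly the markup the generator emitted.

```python
import re

# Hypothetical font: the real page's glyph shapes are unknown.
# Each glyph is a tuple of rows; '+' is a filled cell, ' ' an empty one.
FONT = {
    "H": ("+ +", "+ +", "+++", "+ +", "+ +"),
    "L": ("+  ", "+  ", "+  ", "+  ", "+++"),
    "1": (" + ", "++ ", " + ", " + ", "+++"),
}


def render_glyph(ch):
    """Forward direction: one character -> one HTML table of '+' cells."""
    rows = []
    for line in FONT[ch]:
        cells = "".join(
            "<td>%s</td>" % ("+" if c == "+" else "&nbsp;") for c in line
        )
        rows.append("<tr>%s</tr>" % cells)
    return "<table>%s</table>" % "".join(rows)


def encode(text):
    """Render a whole string as a sequence of glyph tables."""
    return "".join(render_glyph(ch) for ch in text)


# Reverse index built from the generator itself: rendered HTML -> character.
REVERSE = {render_glyph(ch): ch for ch in FONT}


def decode(html):
    """Reverse direction: split into tables and look each one up."""
    tables = re.findall(r"<table>.*?</table>", html, flags=re.S)
    # Unknown shapes become '?' instead of raising.
    return "".join(REVERSE.get(t, "?") for t in tables)


if __name__ == "__main__":
    print(decode(encode("HLL1")))
```

Building `REVERSE` from `render_glyph` means the lookup table cannot drift out of sync with the generator, which is the main reason to reuse the generating code rather than write a separate parser.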


I voted to close this as not a real question. But, on the off chance that this is somehow a real question, I'll try to provide a real answer.

What I would suggest, assuming that the characters are not always the same and your goal here is to convert any ASCII art text to a string representation, would be to render the page to an image and run some sort of [OCR program](http://en.wikipedia.org/wiki/Optical_character_recognition) over it to recognize the characters and determine what the original text was.

Of course, if the ASCII art always uses the same characters, you could parse this using regular expressions or other string manipulation.
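
For the fixed-shape case, a regex-based sketch might look like the following (Python here, though PHP's `preg_match_all` could play the same role). It assumes one `<table>` per character with `<tr>`/`<td>` cells that contain either a `+` or filler such as `&nbsp;`; the `KNOWN` shapes and the names `table_to_grid` and `html_to_text` are hypothetical, not taken from the asker's page.

```python
import re
from html import unescape

# Tolerate attributes and whitespace in the tags; assumes tables are not nested.
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", re.I | re.S)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.I | re.S)
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td>", re.I | re.S)


def table_to_grid(table_html):
    """Reduce one table to a tuple of row strings made of '+' and ' '."""
    grid = []
    for row in ROW_RE.findall(table_html):
        cells = CELL_RE.findall(row)
        # A cell counts as filled if its decoded text contains '+'.
        grid.append("".join("+" if "+" in unescape(c) else " " for c in cells))
    return tuple(grid)


# Hypothetical glyph shapes; the real ones would come from the PHP generator.
KNOWN = {
    ("+ +", "+ +", "+++", "+ +", "+ +"): "H",
    ("+++", "+  ", "+++", "+  ", "+++"): "E",
}


def html_to_text(html):
    """Map every table in the page to a character, '?' if unrecognized."""
    return "".join(
        KNOWN.get(table_to_grid(t), "?") for t in TABLE_RE.findall(html)
    )
```

Normalizing each table to a grid before the lookup keeps the match independent of attributes, indentation, and the exact filler used for empty cells, so only the shape of the letter matters.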
