I am working on a project which requires me to detect and extract the embed code of videos on a web page.
I know the <object> tag is used to embed videos; however, the specification says that it can also be used for other things, such as images.
So how do I know deterministically that an <object> tag contains a video? Or is there some other way to find this out?
Historically, the <object> tag was intended as a way to embed media such as video and audio in an HTML document. But as web video evolved, it turned out that you can't provide a reasonable user experience without integrating video controls into your web app, so the de facto standard for embedding video in an HTML page became to embed a Flash player (using <embed> or <object>) and to access the video from within that Flash presentation. (In HTML5, you have the <video> element for that purpose, but I guess you don't have that kind of control over the HTML files you need to process.)
So usually, when you see an <object> element used for playing video, the object being referenced is actually an SWF file (a Flash presentation), which runs its own code that links to the video file. But a Flash presentation may or may not contain a video, and it can contain many other things as well. So if you want to detect videos in <object> elements, your options are:
- Keep a list of all SWF files/URLs that are known to be video players. This method is the easiest, but bear in mind that you will get a lot of false negatives, since any player missing from the list goes undetected.
- Programmatically evaluate the HTML you're parsing in a sandboxed browser, and detect the video from the screen capture. This is probably a huge effort but will solve your problem perfectly.
- Download and decompile the SWF files referenced by the <object> tags, and implement a heuristic to figure out whether they contain an embedded video. I say heuristic because an SWF is basically a program, and if you can figure out a deterministic method to know if a program plays video, you might as well try to figure out whether the program halts.
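
To make the first option a bit more concrete, here is a minimal sketch in Python using only the standard library. It collects the URLs and MIME types referenced by each <object>/<embed> and flags an entry as a video if it declares a video/* type or points at a host on an allow-list. The host list, the class name EmbedCollector, and the helper names are all made up for illustration; you would need to build and maintain your own list, and anything not on it will be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allow-list of hosts assumed to serve video players.
# This is an illustration only; a real list has to be curated by hand.
KNOWN_VIDEO_HOSTS = {"www.youtube.com", "player.vimeo.com"}


class EmbedCollector(HTMLParser):
    """Collects URLs and MIME types from <object> and <embed> elements."""

    def __init__(self):
        super().__init__()
        self.found = []   # one {"urls": [...], "types": [...]} per embed
        self._depth = 0   # nesting depth of currently open <object> tags

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "object":
            # Only an outermost <object> starts a new entry; nested
            # fallbacks are folded into the entry of their parent.
            if self._depth == 0:
                self.found.append({"urls": [], "types": []})
            self._depth += 1
            self._add(a.get("data"), a.get("type"))
        elif tag == "param" and self._depth:
            # Flash embeds usually name the SWF in <param name="movie">.
            if (a.get("name") or "").lower() in ("movie", "src"):
                self._add(a.get("value"), None)
        elif tag == "embed":
            # A bare <embed> outside any <object> is its own entry.
            if self._depth == 0:
                self.found.append({"urls": [], "types": []})
            self._add(a.get("src"), a.get("type"))

    def handle_endtag(self, tag):
        if tag == "object" and self._depth:
            self._depth -= 1

    def _add(self, url, mime):
        entry = self.found[-1]
        if url:
            entry["urls"].append(url)
        if mime:
            entry["types"].append(mime.lower())


def looks_like_video(entry):
    """Heuristic: explicit video/* type, or a URL on an allow-listed host."""
    if any(t.startswith("video/") for t in entry["types"]):
        return True
    return any(urlparse(u).hostname in KNOWN_VIDEO_HOSTS for u in entry["urls"])


def find_video_embeds(html):
    """Returns the collected entries that the heuristic flags as video."""
    parser = EmbedCollector()
    parser.feed(html)
    parser.close()
    return [e for e in parser.found if looks_like_video(e)]
```

Calling find_video_embeds(page_source) returns the flagged entries, not the raw markup. If you need the original embed code, you would also have to record the source text of each element while parsing, which this sketch leaves out.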