Regular Expression to extract (video) names from html tags

Developer https://www.devze.com 2023-02-14 22:23 Source: Internet
I have a webpage that contains the following code snippet with links to videos:

<a href="video.php?video=sampel1.mov">
<a href="video.php?video=anothersample.mov">
<a href="video.php?video=yetanothersample.mov">

I want to use sed and a regular expression to extract the video names, e.g.:

sampel1.mov 
anothersample.mov 
yetanothersample.mov

so I can use wget to download them.

Thanks a lot!


Give this a try:

sed -n 's/.*video=\([^"]*\)">/\1/p' inputfile

With GNU grep:

grep -Po '(?<=video=).*?(?=">)' inputfile

Pipe either of those commands through xargs:

command | xargs wget ...
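As a rough sketch of how the pieces fit together, the snippet below runs the sed command on the three sample lines from the question and pipes the result through xargs. The `html` and `cmds` variables are only illustrative names, and `echo wget` stands in for the real download so nothing is fetched.

```shell
# Sample markup copied from the question (held in a variable instead of inputfile).
html='<a href="video.php?video=sampel1.mov">
<a href="video.php?video=anothersample.mov">
<a href="video.php?video=yetanothersample.mov">'

# Extract the names with the sed command above, then build one
# wget invocation per name. "echo" makes this a dry run.
cmds=$(printf '%s\n' "$html" |
  sed -n 's/.*video=\([^"]*\)">/\1/p' |
  xargs -n 1 echo wget)

printf '%s\n' "$cmds"
```

Dropping `echo` would run wget for real. Since the extracted names are bare file names, wget would presumably need the site address in front of each one.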


You could do something simple like

grep -o 'video.php?video=[^"]\+' inputfile | sed -e 's/^video.php?video=//'
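The `\+` in that pattern is a GNU extension to basic regular expressions. If that is a concern, a variant using extended regular expressions might look like the sketch below. It swaps in `grep -E`, where `+` is a standard operator, but the dot and the question mark then have to be escaped. The `html` and `names` variables are just illustrative.

```shell
# Sample markup copied from the question.
html='<a href="video.php?video=sampel1.mov">
<a href="video.php?video=anothersample.mov">
<a href="video.php?video=yetanothersample.mov">'

# -o prints only the matched part of each line; -E selects extended
# regular expressions, so "." and "?" are escaped to match literally.
# sed then strips the fixed prefix, leaving just the file name.
names=$(printf '%s\n' "$html" |
  grep -oE 'video\.php\?video=[^"]+' |
  sed -e 's/^video\.php?video=//')

printf '%s\n' "$names"
```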


You can use sed to retrieve your movie names.

Create a file, e.g. movie_string.txt, containing all the strings with the movie names.

Now create a sed script file, say movie_name.sed, with the following:

s/\"//g
s/<//g
s/>//g
s/\(.*=\)\([a-z]\)/ \2/

Save and quit.

Now, from the terminal, you just need to issue the following command to redirect the result to another file, movie.txt:

sed -f movie_name.sed movie_string.txt > movie.txt
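To try the substitutions quickly without creating the two files, they can also be passed inline with `-e`, as in the sketch below. One difference from the script file: this version omits the space before `\2` in the last replacement, so the name comes out without a leading blank. The `line` and `name` variables are only illustrative.

```shell
# One sample line copied from the question.
line='<a href="video.php?video=sampel1.mov">'

# Strip the quotes and angle brackets, then drop everything up to and
# including the last "=" that is followed by a lowercase letter.
name=$(printf '%s\n' "$line" |
  sed -e 's/"//g' -e 's/<//g' -e 's/>//g' -e 's/\(.*=\)\([a-z]\)/\2/')

printf '%s\n' "$name"   # prints: sampel1.mov
```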


A word of warning: parsing HTML/XML using regular expressions is usually not a good idea. Instead, use a language like Ruby or Python that has an XML parser library that can intelligently interpret the page structure.

Here are a few questions that might help you out (many more are only a quick search away):

  • retrieve links from web page using python and BeautifulSoup
  • What's the easiest way to extract the links on a web page using python without BeautifulSoup?
  • Parse XHTML using Ruby

Update:

In your comment, you mentioned that you already know how to do the link extraction in Python but that you don't want to use a Python script that invokes wget directly. You can still solve this with Python (which is probably the easiest solution since you already know how to do it). If your Python script prints the extracted filenames to standard output with a newline following each name, you can use either of the following shell commands to do what you want to do:

python your_script.py >filenames.txt
wget -i filenames.txt

or

python your_script.py | wget -i -

This will pass the data extracted by your script to wget without requiring your script to invoke wget via a system call.


cat yourlinks.txt | cut -f2 -d\" | cut -f2 -d=
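To see why the two cut calls work, here is a small sketch that applies them one at a time to a single sample line from the question. The `line`, `href`, and `name` variables are only illustrative.

```shell
# One sample line copied from the question.
line='<a href="video.php?video=sampel1.mov">'

# Splitting on the double quote, field 2 is the href value.
href=$(printf '%s\n' "$line" | cut -f2 -d'"')
printf '%s\n' "$href"   # prints: video.php?video=sampel1.mov

# Splitting that on "=", field 2 is the file name.
name=$(printf '%s\n' "$href" | cut -f2 -d=)
printf '%s\n' "$name"   # prints: sampel1.mov
```

This relies on every line having the same layout as the sample. A tag with extra attributes before href, or a name containing "=", would shift the fields.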