Regular Expression to extract (video) names from html tags

开发者 https://www.devze.com 2023-02-14 22:23 Source: web

I have a webpage which contains the following code snippet containing links to videos:

<a href="video.php?video=sampel1.mov">
<a href="video.php?video=anothersample.mov">
<a href="video.php?video=yetanothersample.mov">

I want to use sed and a regular expression to extract the video names, e.g.:

sampel1.mov 
anothersample.mov 
yetanothersample.mov

so I can use wget to download them.

Thanks a lot!

Give this a try:

sed -n 's/.*video=\([^"]*\)">/\1/p' inputfile

With GNU grep:

grep -Po '(?<=video=).*?(?=">)' inputfile

Pipe either of those commands through xargs:

command | xargs wget ...
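
For anyone who wants to check what the two patterns above actually match, here is a minimal Python sketch that mirrors them with the standard re module. The variable names are made up for illustration, and the input is just the three lines from the question; it is not meant as a replacement for the sed or grep commands.

```python
import re

# The three sample lines from the question
html = '''<a href="video.php?video=sampel1.mov">
<a href="video.php?video=anothersample.mov">
<a href="video.php?video=yetanothersample.mov">'''

# Capture-group form, mirroring the sed command:
# keep only what the parentheses matched.
capture = re.compile(r'video=([^"]*)">')

# Lookaround form, mirroring grep -Po:
# assert the surrounding text without making it part of the match.
lookaround = re.compile(r'(?<=video=).*?(?=">)')

names_capture = capture.findall(html)
names_lookaround = lookaround.findall(html)

# One name per line, the same shape the shell commands produce
print("\n".join(names_capture))
```

Both patterns give the same three names for the sample. The capture group returns only the text inside the parentheses, while the lookbehind and lookahead check for video= and "> without including them in the result.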

You could do something simple like:

grep -o 'video.php?video=[^"]\+' inputfile | sed -e 's/^video.php?video=//'

You can use sed to retrieve your movie names.

Create a file, e.g. movie_string.txt, containing all the lines that hold the movie names.

Now create a sed script file, say movie_name.sed, with the following:

s/\"//g
s/<//g
s/>//g
s/\(.*=\)\([a-z]\)/\2/

Save and quit.

Now from the terminal, you just need to issue the following command to redirect the result to another file movie.txt:

sed -f movie_name.sed movie_string.txt > movie.txt

A word of warning: parsing HTML/XML using regular expressions is usually not a good idea. Instead, use a language like Ruby or Python that has an XML parser library that can intelligently interpret the page structure.

Here are a few questions that might help you out (many more are only a quick search away):

retrieve links from web page using python and BeautifulSoup
What's the easiest way to extract the links on a web page using python without BeautifulSoup?
Parse XHTML using Ruby

Update:

In your comment, you mentioned that you already know how to do the link extraction in Python but that you don't want to use a Python script that invokes wget directly. You can still solve this with Python (which is probably the easiest solution since you already know how to do it). If your Python script prints the extracted filenames to standard output with a newline following each name, you can use either of the following shell commands to do what you want to do:

python your_script.py >filenames.txt
wget -i filenames.txt

python your_script.py | wget -i -

This will pass the data extracted by your script to wget without requiring your script to invoke wget via a system call.
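
In case it helps, here is a rough sketch of what such a script could contain, using only Python's standard library (html.parser and urllib.parse) rather than BeautifulSoup. The class and function names (VideoLinkParser, extract_video_names) and the hard-coded SAMPLE string are assumptions for illustration; a real script would read the actual page instead.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class VideoLinkParser(HTMLParser):
    """Collect the `video` query parameter from every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # parse_qs maps each query key to a list of its values
        query = parse_qs(urlparse(href).query)
        self.names.extend(query.get("video", []))


def extract_video_names(html):
    """Return the video file names found in the given HTML text."""
    parser = VideoLinkParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Hypothetical stand-in for the real page: the snippet from the question
SAMPLE = '''<a href="video.php?video=sampel1.mov">
<a href="video.php?video=anothersample.mov">
<a href="video.php?video=yetanothersample.mov">'''

# One name per line on standard output, as the shell commands above expect
for name in extract_video_names(SAMPLE):
    print(name)
```

Run on the snippet from the question, this prints the three file names, one per line. Because it reads the href attribute and its query string rather than matching raw text, it does not depend on the attribute order or on the exact characters that follow the file name.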

cat yourlinks.txt | cut -f2 -d\" | cut -f2 -d=
