youtube regex swallows remaining text

Developer https://www.devze.com 2023-04-06 12:48 Source: web
I'm doing preg_match_all and str_replace on a block of text to grab YouTube-urls and replace them with the correct embed code.

Let's say I have the following block of text:

"bla bla bla bla <-youtube-url-> last few words"

Everything works fine - the youtube-url is replaced with the embed code etc. However, the "last few words" disappears from the final output after str_replace is run. I'm suspecting that the regex is swallowing everything after the url... This is what I'm using to match and extract YouTube IDs:

%(?:youtube\.com/(?:[^/]+/.+/|(?:v|e(?:mbed)?)/|.*[?&]v=)|youtu\.be/)([^"&?/ ]{11})%i

Any help would be greatly appreciated!

Update:

I just discovered that the problem only happens if the youtube url has trailing parameters. With the following input, the last few words are swallowed:

'www.youtube.com/watch?v=XXXXXXXXX&parameter=data last few words'

But if the input is like this:

'www.youtube.com/watch?v=XXXXXXXXX last few words'

it works fine. Can anyone help with the needed adjustments for the regular expression?


I usually break up complicated alternations to find out what's going on.
It appears you might have trouble with the last term, [^"&?/ ]{11}, but I'm not sure
what you are trying to do. (The example below is in Perl.)

$samp = 'www.youtube.com/watch?v=XXXXXXXXX&parameter=data last few words';

$regex = qr%

(?:
    youtube\.com/
    (?:
        ( [^/]+/.+/ )    # 1
      | 
        (                # 2
            (?: v | e(?:mbed)? ) /   # slash applies to both, as in the original pattern
        )
      |
        ( .*[?&]v= )     # 3
    )
  |

    ( youtu\.be/ )     #4
)

( [^"&?/ ]{1,11} )     # 5, was {11}

(.*)$                  # 6 the remainder

%xi;


if ( $samp =~ /$regex/ )
{
    # just print what matched
    print "all: '$&' \n";
    print "1:   '$1' \n";
    print "2:   '$2' \n";
    print "3:   '$3' \n";
    print "4:   '$4' \n";
    print "5:   '$5' \n";
    print "6:   '$6' \n";
}

Output:

all: 'youtube.com/watch?v=XXXXXXXXX&parameter=data last few words'
1:   ''
2:   ''
3:   'watch?v='
4:   ''
5:   'XXXXXXXXX'
6:   '&parameter=data last few words'
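
For comparison, here is a minimal JavaScript sketch of the same kind of check, using the original pattern with its {11} quantifier. The input string and the 11-character placeholder ID are assumptions for illustration, not data from the question. It suggests that the match stops right after the ID, so the trailing parameter and the last few words are still in the text after the match:

```javascript
// The pattern from the question, with "/" escaped for a JavaScript regex literal.
const YOUTUBE_ID = /(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=)|youtu\.be\/)([^"&?\/ ]{11})/i;

// Hypothetical input; the ID is an 11-character placeholder.
const text = 'bla bla www.youtube.com/watch?v=XXXXXXXXXXX&parameter=data last few words';

const match = text.match(YOUTUBE_ID);
const matched = match[0];   // the whole span the regex consumed
const videoId = match[1];   // the captured 11-character ID
// Everything after the match is untouched by the regex.
const remainder = text.slice(match.index + matched.length);

console.log('matched:  ', matched);
console.log('video id: ', videoId);
console.log('remainder:', remainder);
```

If the remainder is intact here but missing in the final output, the text was probably already truncated before it reached the regex.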


Change the .+ to \S+ so that whitespace is not consumed as part of the match.

%(?:youtube\.com/(?:[^/]+/\S+/|(?:v|e(?:mbed)?)/|.*[?&]v=)|youtu\.be/)([^"&?/ ]{11})%i

The .* was capturing the entire line, and the rest of your regex wasn't doing anything.


I'm not clear on what exactly you are trying to do, but I suggest that you try a regex tester tool - like this one, though there are others. It lets you visually examine the results of a regex.


My bad. There was no problem with the regex, as I first suspected.

I was passing the user input to the PHP handler without escaping the input via encodeURIComponent() first. Thus, the handler assumed &parameter=data was the next input parameter - resulting in a broken POST variable.

Sorry for my incompetence, and thanks for all the help!
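
For anyone hitting the same thing, a rough sketch of what went wrong on the client side. The field name "text" and the placeholder ID are made up for illustration, and URLSearchParams only stands in for the server-side form parser, on the assumption that the PHP handler splits the body on "&" in the same way:

```javascript
// Hypothetical user input; the ID is an 11-character placeholder.
const userInput = 'www.youtube.com/watch?v=XXXXXXXXXXX&parameter=data last few words';

// Unescaped: the "&" inside the URL is read as a separator between fields.
const rawBody = 'text=' + userInput;
const rawFields = new URLSearchParams(rawBody);

// Escaped: "&", "?", "=" and spaces are percent-encoded, so the value stays in one field.
const safeBody = 'text=' + encodeURIComponent(userInput);
const safeFields = new URLSearchParams(safeBody);

console.log('raw text field:     ', rawFields.get('text'));
console.log('raw parameter field:', rawFields.get('parameter'));
console.log('safe text field:    ', safeFields.get('text'));
```

In the unescaped case the "last few words" end up in a separate "parameter" field, which matches the symptom: the handler never sees them as part of the text.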
