Scrape ASIN from Amazon URL using JavaScript

开发者 (Developer) https://www.devze.com 2022-12-12 13:18 Source: web
Assuming I have an Amazon product URL like so

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846

How could I scrape just the ASIN using JavaScript? Thanks!


Amazon's detail pages can have several forms, so to be thorough you should check for them all. These are all equivalent:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C

They always look like either this or this:

http://www.amazon.com/<SEO STRING>/dp/<VIEW>/ASIN
http://www.amazon.com/gp/product/<VIEW>/ASIN

This should do it:

var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
// Dots in the host are escaped so they match literally; group 4 captures the 10-character ASIN.
var regex = RegExp("http://www\\.amazon\\.com/([\\w-]+/)?(dp|gp/product)/(\\w+/)?(\\w{10})");
var m = url.match(regex);
if (m) {
    alert("ASIN=" + m[4]);
}
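
As a variation on the pattern above, here is a minimal sketch that parses the URL first and then matches only the path, so the host (amazon.com, amazon.de, ...) and the query string do not have to appear in the regex. The function name `asinFromPath` is hypothetical, the rule "10 uppercase letters or digits" is an assumption drawn from the example URLs in this thread, and the sketch relies on the standard `URL` constructor being available (modern browsers, Node.js).

```javascript
// Hypothetical helper: extract the ASIN from the path of a product URL.
// Assumes the ASIN is 10 uppercase letters/digits following /dp/ or /gp/product/,
// optionally with one extra path segment (the <VIEW> part) in between.
function asinFromPath(rawUrl) {
  let path;
  try {
    path = new URL(rawUrl).pathname; // drops host and query string
  } catch (e) {
    return null; // not a parseable absolute URL
  }
  const m = path.match(/\/(?:dp|gp\/product)\/(?:\w+\/)?([A-Z0-9]{10})(?:\/|$)/);
  return m ? m[1] : null;
}

console.log(asinFromPath("http://www.amazon.com/gp/product/glance/B0015T963C")); // "B0015T963C"
console.log(asinFromPath("https://www.amazon.de/dp/B01N32MQOA?psc=1"));          // "B01N32MQOA"
```

Tying the match to the /dp/ and /gp/product/ markers is stricter than matching any 10-character segment, so it returns null for URL shapes not listed above.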


Since the ASIN is always a sequence of 10 letters and/or numbers immediately after a slash, try this:

url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)")

The additional (?:[/?]|$) after the ASIN is to ensure that only a full path segment is taken.
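
To illustrate why the trailing group matters, here is a small sketch comparing the pattern with and without it. The URL with the longer final segment is a made-up input for demonstration, not a real Amazon URL, and `firstGroup` is a hypothetical helper.

```javascript
// Pattern from the answer above, written as a regex literal.
const withBoundary = /\/([a-zA-Z0-9]{10})(?:[\/?]|$)/;
// Same pattern without the trailing group (for comparison only).
const withoutBoundary = /\/([a-zA-Z0-9]{10})/;

const real = "http://www.amazon.com/gp/product/B0015T963C";
// Hypothetical URL whose last segment is longer than 10 characters.
const tooLong = "http://www.amazon.com/gp/product/B0015T963CXYZ";

// Hypothetical helper: return capture group 1, or null when nothing matches.
function firstGroup(url, re) {
  const m = url.match(re);
  return m ? m[1] : null;
}

console.log(firstGroup(real, withBoundary));       // "B0015T963C"
console.log(firstGroup(tooLong, withoutBoundary)); // "B0015T963C" (a truncated segment)
console.log(firstGroup(tooLong, withBoundary));    // null
```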


Actually, the top answer doesn't work if it's something like amazon.com/BlackBerry... (since BlackBerry is also 10 characters).

One workaround (assuming the ASIN is always capitalized, as it always is when taken from Amazon) is (in Ruby):

        url.match("/([A-Z0-9]{10})")

I've found it to work on thousands of URLs.
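
A minimal sketch of the failure described above and of the uppercase-only workaround, written in JavaScript rather than Ruby. The URL is a hypothetical example built around a 10-letter title segment. The trailing boundary group on the workaround is an extra assumption, added so that only whole path segments match.

```javascript
// Hypothetical URL with a 10-letter, mixed-case path segment before the ASIN.
const url = "http://www.amazon.com/BlackBerry/dp/B0015T963C";

// Mixed-case character set: the title segment is picked up first.
const loose = url.match(/\/([a-zA-Z0-9]{10})(?:[\/?]|$)/);

// Uppercase letters and digits only: the title segment no longer qualifies.
const strict = url.match(/\/([A-Z0-9]{10})(?:[\/?]|$)/);

console.log(loose && loose[1]);   // "BlackBerry"
console.log(strict && strict[1]); // "B0015T963C"
```

An all-uppercase 10-character title segment would still be mistaken for an ASIN, so this narrows the problem rather than eliminating it.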


None of the above works in all cases. I tried matching the following URLs with the examples above:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C

https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8&pd_rd_i=B00LGAQ7NW&pd_rd_r=5GP2JGPPBAXXP8935Q61&pd_rd_w=gzhaa&pd_rd_wg=HBg7f&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_s=&pf_rd_r=GA7GB6X6K6WMJC6WQ9RB&pf_rd_t=36701&pf_rd_p=c210947d-c955-4398-98aa-d1dc27e614f1&pf_rd_i=desktop

https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3?_encoding=UTF8&psc=1&refRID=NMR7SMXJAKC4B3MH0HTN

https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5?_encoding=UTF8&psc=1&refRID=7QM8MPC16XYBAZMJNMA4

https://www.amazon.de/dp/B01N32MQOA?psc=1

This is the best I could come up with: (?:[/dp/]|$)([A-Z0-9]{10}). The full match also includes the character preceding the ASIN (the / in all of the cases above), which can be removed afterwards.

You can test it on: http://regexr.com/3gk2s


This worked perfectly for me. I tried all the links on this page and some other links:

function ExtractASIN(url){
    var ASINreg = new RegExp(/(?:\/)([A-Z0-9]{10})(?:$|\/|\?)/);
    var  cMatch = url.match(ASINreg);
    if(cMatch == null){
        return null;
    }
    return cMatch[1];
}
ExtractASIN('http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846');
  • I assumed that the ASIN is 10 characters long and consists of capital letters and digits
  • I assumed that the ASIN is followed by the end of the link, a question mark, or a slash
  • I assumed that the ASIN is preceded by a slash


Try using this regex:

(?:[/dp/]|$)([A-Z0-9]{10})

Check out the demo: https://regexr.com/3gk2s


@Gumbo: Your code works great!

// JS test: run it in Firebug.

url = window.location.href;
url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)");

I have added a PHP function that does the same thing.

function amazon_get_asin_code($url) {
    global $debug;

    $result = "";

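    // Explicit "~" delimiters for preg_match; the leading "/" restricts the
    // match to a full path segment, as in the JavaScript version above.
    $pattern = '~/([a-zA-Z0-9]{10})(?:[/?]|$)~';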

    preg_match($pattern, $url, $matches);

    if($debug) {
        var_dump($matches);
    }

    if($matches && isset($matches[1])) {
        $result = $matches[1];
    } 

    return $result;
}


This is my universal Amazon ASIN regexp:

~(?:\b)((?=[0-9a-z]*\d)[0-9a-z]{10})(?:\b)~i
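
For readers using JavaScript, the same pattern can be written as a regex literal (the `~` delimiters are PHP-style). The sketch below shows what the lookahead `(?=[0-9a-z]*\d)` adds: a candidate must contain at least one digit, so 10-letter words are skipped. One caveat, offered as an observation rather than something stated in the answer: the pattern is not tied to the path, so it can also match a 10-character token elsewhere in the URL.

```javascript
// JavaScript form of the pattern above; the "i" flag makes it case-insensitive.
const asinRe = /\b((?=[0-9a-z]*\d)[0-9a-z]{10})\b/i;

// "Generation" is 10 letters long but has no digit, so the lookahead rejects it.
const url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
const m = url.match(asinRe);
console.log(m ? m[1] : null); // "B0015T963C"

// A 10-letter word on its own does not match at all.
console.log(asinRe.test("Generation")); // false
```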


This may be a simplistic approach, but I have yet to find an error in it with any of the URLs in this thread that people reported as problematic.

Simply put, I take the URL and split it on "/" to get the discrete parts, then loop through the resulting array and test each part against the regex. In my case, the variable i is an object with a property RawURL, which holds the raw URL I am working with, and a property VendorSKU, which I populate.

try
{
    string[] urlParts = i.RawURL.Split('/');
    Regex regex = new Regex(@"^[A-Z0-9]{10}");

    foreach (string part in urlParts)
    {
        Match m = regex.Match(part);
        if (m.Success)
        {
            i.VendorSKU = m.Value;
        }
    }
}
catch (Exception) { }

So far, this has worked perfectly.
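
For comparison with the JavaScript answers in this thread, here is a rough port of the same idea. It is a sketch, not the author's code: the function name `asinFromSegments` is hypothetical, and unlike the C# version it strips the query string first, requires the whole segment to match, and stops at the first hit.

```javascript
// Hypothetical port of the split-and-test approach.
// Assumes an ASIN is a path segment of exactly 10 uppercase letters/digits.
function asinFromSegments(rawUrl) {
  const withoutQuery = rawUrl.split(/[?#]/)[0]; // ignore query string and fragment
  const segmentRe = /^[A-Z0-9]{10}$/;           // the whole segment must match
  for (const part of withoutQuery.split("/")) {
    if (segmentRe.test(part)) {
      return part; // first matching segment wins
    }
  }
  return null;
}

console.log(asinFromSegments("https://www.amazon.de/dp/B01N32MQOA?psc=1"));          // "B01N32MQOA"
console.log(asinFromSegments("http://www.amazon.com/gp/product/glance/B0015T963C")); // "B0015T963C"
```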


A small change to the regex of the first answer, and it works on all the URLs I have tested.

var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
var m = url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)");
print(m);
if (m) { 
    print("ASIN=" + m[1]);
}


Inspired by many of the answers here, I found that

(?:[/])([A-Z0-9]{10})(?:[/?&\s]|$)

let url="https://www.amazon.com/Why-We-Sleep-Science-Dreams-ebook/dp/B06Y649387/ref=pd_sim_351_4/131-0417603-5732106?_encoding=UTF8&pd_rd_i=B06Y649387&pd_rd_r=5ebbfdd5-a2f6-4ee3-ad13-5036b5e20827&pd_rd_w=LBo2H&pd_rd_wg=OBomS&pf_rd_p=3c412f72-0ba4-4e48-ac1a-8867997981bd&pf_rd_r=TN0WDV3AC7ED4Y7EKNVP&psc=1&refRID=TN0WDV3AC7ED4Y7EKNVP"
// Regex literal: \s keeps its meaning, and $ sits outside the character class so it means end-of-string.
url.match(/(?:\/)([A-Z0-9]{10})(?:[\/?&\s]|$)/)

>> Array [ "/B06Y649387/", "B06Y649387" ]

works really well for extracting the ASIN from anywhere in the URL. You can try it out here: https://regexr.com/56jm7

Edit: Added end-of-string as one of the stopping checks. This is needed when the regex is used in Python.


Something like this should work (not tested):

var match = /\/dp\/(.*?)\/ref=amb_link/.exec(amazon_url);
var asin = match ? match[1] : '';


The Wikipedia article on ASIN (which I've linkified in your question) gives the various forms of Amazon URLs. You can fairly easily create a regular expression (or series of them) to fetch this data using the match() method.


You can scrape ASIN codes from the data-asin attribute in the search results using XPath.

For example, $x('//@data-asin').map(function(v,i){return v.nodeValue}) can be run in Chrome's console.


Used both methods in a single function:

const extractASIN = (url: string) => {
  // Regex literal: the ASIN must be followed by /, ?, &, whitespace, or the end of the string.
  const regex = /\/([A-Z0-9]{10})(?:[\/?&\s]|$)/;
  const m = url.match(regex);
  if (m) {
    return m[1];
  }
  return url.split('/ref')[0].split('/dp/')[1];
};


// function to find the nth instance of character in string
function nthIndex(str, pat, n) {
  var L = str.length,
    i = -1;
  while (n-- && i++ < L) {
    i = str.indexOf(pat, i);
    if (i < 0) break;
  }
  return i;
}
// this function takes a string and split string list as parameters and slices off entirely after that character is found
function splitSliceFunc(splitStr, splitStrList) {
  for (let i = 0; i < splitStrList.length; i++) {
    splitStr = splitStr.split(splitStrList[i])[0];
  }
  return splitStr;
}
try {
  const amzUrl = 'https://www.amazon.com/Encyclopedia-Country-Living-50th-Anniversary/dp/1632172895/ref=sr_1_1?keywords=survival+encyclopedia&pd_rd_r=8e62738c-ae2b-46c0-b477-db5cf23a6b0a&pd_rd_w=0Eazc&pd_rd_wg=E51TF&pf_rd_p=54cea6b7-0efb-45a3-b68b-8c1ccfbfa553&pf_rd_r=EE9X3J3QBPCDQAVQJ9FQ&qid=1651929404&sr=8-1';
  const sliceUptoAsinList = ["/dp/", "/gp/product/"]; // list for slice occurrences before asin
  const sliceAfterAsinList = ["/", "?"]; // list for slice occurrences after the asin
  let sliceUptoAsin; // variable to store index of slice occurrence before asin
  let shortenedUrl;
  // if else statements for all the possible slice occurrences before asin
  if (amzUrl.includes(sliceUptoAsinList[0])) {
    sliceUptoAsin = nthIndex(amzUrl, "/dp/", 1);
    shortenedUrl = amzUrl.slice(sliceUptoAsin + 4); // + 4 to remove /dp/ also
    console.log(sliceUptoAsin, shortenedUrl);
  } else if (amzUrl.includes(sliceUptoAsinList[1])) {
    sliceUptoAsin = nthIndex(amzUrl, "/gp/product/", 1);
    shortenedUrl = amzUrl.slice(sliceUptoAsin + 12); // + 12 to remove /gp/product/ also
    console.log(sliceUptoAsin, shortenedUrl);
  } else {
    throw "url format not supported";
  }
  // removes everything after the asin following 'sliceAfterAsinList'
  shortenedUrl = splitSliceFunc(shortenedUrl, sliceAfterAsinList);
  console.log(shortenedUrl);
} catch (error) {
  console.log(error)
}

I opted for a non-regex approach because regexes become harder to maintain. If you are treating the URLs as simple strings, split and slice can also do the job.

The above code assumes that the ASIN is preceded by "/dp/" or "/gp/product/" (but it is not limited to these: the 'sliceUptoAsinList' array can hold as many prefixes as you want, each with a corresponding else-if condition).

The code works irrespective of whether the ASIN has 10 or more characters, because it looks only for the first occurrence of each character in the 'sliceAfterAsinList' array and removes everything from that character onwards (including the character itself).

I have built a tool for this purpose (GitHub repo).


If the ASIN is always in that position in the URL:

var asin= decodeURIComponent(url.split('/')[5]);

though there's probably little chance of an ASIN getting %-escaped.
