How are services like Alexa and Google Analytics capable of tracking visitors' age, gender, college education, and so forth?
http://www.alexa.com/siteinfo/stackoverflow.com
Alexa definitely gets its traffic info from its toolbar users. Since that is a relatively small and self-selecting group of people, this inevitably leads to a biased sample (which is why Alexa traffic doesn't match measured traffic on the sites I run). Even with the best statistical techniques for reducing bias, you can never get rid of it entirely when the sampling distribution is not uniform.
It is unclear how Google does it, although it might involve tracking cookies.
A project I have been working on recently has bearing on this question.
Another way to do this (one that also has biases, but different ones) would be to use an IP-to-location service to find the approximate latitude and longitude of each visitor to your site, and then use my project (full disclosure: I run that site and it is commercial) to get demographic information for that location:
http://askgeo.com
AskGeo actually provides demographic information on several geographic levels: state, county, county subdivision, city, ZIP code, census tract (a few thousand people), and census block group (about a thousand people). You'd presumably want to use the lowest level (i.e., census block group) for a given latitude and longitude.
The site returns a huge number of demographic variables. The idea would be to use soft counts from the demographic variables provided on the block group level. To take an example, if you are trying to track the age distribution of your users, then you'd use the age ranges provided in the AskGeo response and for a given sample, you'd add a fractional soft count to each range that corresponds to the percentage of the population in that block group from the corresponding age range. For example, take my neighborhood in San Francisco. It has the following age distribution:
- CensusAgePercent0To4: 7.3%
- CensusAgePercent5To9: 3.5%
- CensusAgePercent10To: 3.2%
... (skipping a bit, as you probably get the idea) ...
- CensusAgePercentOver85: 1.5%
If you got an IP address that you tracked to that census block group, you'd add each of those percentages (as a fraction from 0 to 1) to your (soft) counters for those age ranges. (A soft counter is just a counter that allows for non-integer counts.)
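A minimal sketch of that soft-counting step in Python, assuming you have already resolved a visitor's IP to a block group and have its age percentages as a dictionary. The function name `add_soft_counts` and the dictionary layout are my own illustration, not the AskGeo response format; the keys and values are the ones quoted above, with the skipped ranges left out.

```python
from collections import defaultdict


def add_soft_counts(counters, percentages):
    """Add one visitor's block-group distribution to the running soft counters.

    `percentages` maps an age-range name to the percentage (0 to 100) of the
    block group's population in that range.
    """
    for age_range, percent in percentages.items():
        # Convert the percentage to a fraction from 0 to 1 before adding.
        counters[age_range] += percent / 100.0
    return counters


# Soft counters allow non-integer counts, so floats are used.
counters = defaultdict(float)

# Only the ranges quoted above; the real distribution has more entries.
block_group = {
    "CensusAgePercent0To4": 7.3,
    "CensusAgePercent5To9": 3.5,
    "CensusAgePercentOver85": 1.5,
}

# Two visits traced to the same block group.
add_soft_counts(counters, block_group)
add_soft_counts(counters, block_group)
```

If every age range is included, each visitor contributes exactly 1.0 in total, so the counters sum to the number of visitors.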
You could do the same with race, gender, income level, house values, etc.
This method also has biases, for sure, since it assumes that all the people in a given block group are equally likely to visit your site. But it is something that you can do on your own site, not just Google and Alexa, and it would still give you a relative sense of who is visiting your site if your soft counts in a given category are higher than the national average in that category.
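To make the comparison with the national average concrete, here is a rough sketch that turns soft counts into shares and divides each by the national share for the same category. The function name `relative_index`, the category names, and every number are made-up placeholders for illustration, not real census figures.

```python
def relative_index(soft_counts, national_share):
    """Return each category's share of your soft counts divided by its national share.

    A value above 1.0 suggests the category is over-represented among your
    visitors relative to the national average; below 1.0, under-represented.
    """
    total = sum(soft_counts.values())
    if total == 0:
        return {}
    return {
        category: (count / total) / national_share[category]
        for category, count in soft_counts.items()
        if national_share.get(category)  # skip missing or zero national shares
    }


# Hypothetical soft counts accumulated over 100 visitors.
soft_counts = {"under_18": 20.0, "18_to_64": 70.0, "65_plus": 10.0}

# Hypothetical national shares (fractions summing to 1).
national_share = {"under_18": 0.25, "18_to_64": 0.60, "65_plus": 0.15}

index = relative_index(soft_counts, national_share)
```

With these placeholder numbers, the "18_to_64" group comes out above 1.0 and the other two below it.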
It is also possible that a more sophisticated technique than simple direct counts could lead to a much richer result.
I did some research, and apparently these demographics are tracked the same way TV audience demographics are tracked. Some people browse with Alexa's toolbar installed, which keeps track of the sites they visit. These people willingly (?) supply information like age and gender, and Alexa extrapolates the general demographics from this sample. This of course leaves room for bias, but that's a problem with statistics.
Alexa gets its information from browser toolbars that you install on purpose or as part of a bundle with some other software. The toolbar asks questions to learn demographic parameters and also tracks the sites you visit. If you know that 80% of a site's visitors are women and a new visitor comes to that site, then you can infer that there is a high probability that this person is a woman. If you know many of the sites this person visits, you can guess a lot.
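One way to picture how several visited sites can be combined into one guess is a naive Bayes update. This is only an illustration under my own assumptions (sites treated as independent evidence, and a 50/50 prior), not a description of what Alexa actually does. The function name `probability_woman` and the second site's 70% figure are hypothetical.

```python
def probability_woman(site_shares, prior=0.5):
    """Estimate P(visitor is a woman) from the female share of each visited site.

    `site_shares` holds, for each site the person visits, the fraction of that
    site's visitors who are women (strictly between 0 and 1).
    `prior` is the assumed share of women among all users.
    """
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for share in site_shares:
        site_odds = share / (1.0 - share)
        # Each site multiplies the odds by its likelihood ratio,
        # which is the site's odds relative to the prior odds.
        odds *= site_odds / prior_odds
    return odds / (1.0 + odds)


one_site = probability_woman([0.8])        # just the 80% site
two_sites = probability_woman([0.8, 0.7])  # plus a hypothetical 70% site
```

With a single 80% site the estimate stays at 0.8, and each additional site that leans the same way pushes it higher, which is the sense in which knowing more sites lets you guess more.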
But as http://netberry.co.uk/alexa-rank-explained.htm says, you can rely only on Alexa's information for the top 100,000 sites, because only for those sites does Alexa have enough data from the small number of its users who visit them. They say "millions", but that is a small share of the total.