I'm working on a TextToCodeRatio function for my SeoTools Excel Plugin and would like some input on my approach:
I'm using HtmlAgilityPack to get all text nodes, discard those that have a script or style tag as their parent node, and perform some additional text manipulation:
public static int CalculateTextSize(HtmlDocument doc)
{
    int size = 0;
    foreach (HtmlNode node in
        doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']"))
    {
        HtmlNode parentNode = node.ParentNode;
        if (parentNode != null)
        {
            if (parentNode.Name.Equals("script",
                    StringComparison.CurrentCultureIgnoreCase)
                || parentNode.Name.Equals("style",
                    StringComparison.CurrentCultureIgnoreCase))
            {
                continue;
            }
        }

        string text = node.InnerText.Trim();

        // Just in case Agility Pack gets it wrong...
        text = StringUtils.StripTags(text);

        // Replaces "&amp;" => "&" etc.
        text = HttpUtility.HtmlDecode(text);

        // All whitespace is reduced to a single space, i.e.
        // "Foo\r\nBar\t Hello" => "Foo Bar Hello"
        text = StringUtils.NormalizeWhitespace(text);

        size += text.Trim().Length;
    }
    return size;
}
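For context, this is roughly how I call it (the URL is just a placeholder):

// Example usage (sketch): load a page with HtmlAgilityPack's HtmlWeb
// and run the text-size calculation on the resulting document.
var web = new HtmlAgilityPack.HtmlWeb();
HtmlDocument doc = web.Load("http://example.com/");
int textSize = CalculateTextSize(doc);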
What do you think? It's quite a restrictive approach; for example, on aftonbladet.se my method returns 23722 while the SeoChat tool returns 28671. Am I doing it wrong?
UPDATE: As pointed out by Oskar Kjellin, I'm counting characters while SeoChat is counting bytes. Which is better, counting characters or bytes? I think this metric shouldn't be affected by the encoding the page is written in.
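For reference, a quick illustration of how the two counts diverge (UTF-8 here is just an example; the actual page encoding may differ):

// Characters vs. bytes: the same text gives different byte counts
// depending on the encoding the page is served in.
string text = "Räksmörgås";
int charCount = text.Length;                                  // 10 characters
int utf8Bytes = System.Text.Encoding.UTF8.GetByteCount(text); // 13 bytes (å, ä, ö take two bytes each)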
The reason for the difference is that he is counting bytes and you are counting characters.
I would say that counting bytes is best, since the whole point is to see what percentage of the loaded page is text. That means you need the total page size in bytes and calculate against that; you cannot use a character count for it.
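If you go the byte route, a rough sketch of the ratio could look like this (the method and variable names are made up, and UTF-8 is an assumption; ideally use the actual downloaded byte count and the page's real encoding):

// Sketch: text-to-code ratio as bytes of visible text over total page bytes.
// rawHtml is the full HTML exactly as downloaded; visibleText is the
// concatenated text produced by a routine like CalculateTextSize.
public static double TextToCodeRatio(string rawHtml, string visibleText)
{
    var encoding = System.Text.Encoding.UTF8; // assumption: page is UTF-8
    double totalBytes = encoding.GetByteCount(rawHtml);
    double textBytes = encoding.GetByteCount(visibleText);
    return textBytes / totalBytes; // e.g. 0.25 means 25% of the page is text
}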
Not sure how the search engines do this, but your approach is quite easy to fool: you can just put everything inside a big div of text and use CSS to hide the div. It depends on how thorough you want to be.
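If you want a (very partial) guard against that, you could skip text whose ancestors are hidden via an inline style attribute. This is only a sketch and cannot see rules coming from external or embedded stylesheets:

// Sketch: returns true if the node or any ancestor carries an inline
// style="display:none" or style="visibility:hidden". External CSS is
// not considered, so this only catches the crudest hiding tricks.
private static bool IsInlineHidden(HtmlNode node)
{
    for (HtmlNode current = node; current != null; current = current.ParentNode)
    {
        string style = current.GetAttributeValue("style", "")
            .Replace(" ", "")
            .ToLowerInvariant();
        if (style.Contains("display:none") || style.Contains("visibility:hidden"))
        {
            return true;
        }
    }
    return false;
}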