Liv Longley 2013/06/28

Having doubts about the Searchmetrics’ US Google Ranking Factors study?


ATTENTION THIS HAS BEEN UPDATED WITH MORE INFORMATION ABOUT SEARCHMETRICS’ US GOOGLE RANKING FACTORS STUDY:

OK, so you have doubts about Searchmetrics’ US Google Ranking Factors – Rank Correlation 2013 study…

Earlier this week, we announced our US Google Ranking Factors – Rank Correlation 2013 study. This tries to identify the key factors that well-positioned pages have in common and that separate them from lower ranking pages in Google searches. We performed our analysis by looking at those factors that correlate with pages that rank well.

Of course a lot of very clever SEO experts have been looking at this whole area for a very long time and have strong opinions about what factors are and are not important. Having looked at our data, many of them have asked questions and expressed doubts about what we’re trying to say. So here’s a response to some of the main concerns we’ve seen. Hopefully it’s useful.

How did you collate the data and run the study?

The study analysed Google US (Google.com) search results for 10,000 keywords and 300,000 websites featuring in the top 30 positions, as well as billions of backlinks, Tweets, Google plus ones, Pins and Facebook likes, shares and comments.

The data was collected in March 2013 and again in June 2013 to take account of Google’s recent Penguin 2.0 algorithm update. Altogether we looked at over 70 different factors (although not all were included in the final analysis) and we calculated the correlations between them and the Google search results using Spearman’s rank correlation coefficient.

So those factors that have a high correlation coefficient have the biggest impact on rankings?

No, we can’t say that for sure. This is about correlation, not causation. With correlation we can say that more highly positioned pages are more likely to have certain factors (or have more of those factors), but we can’t assume those factors definitively influence or cause high rankings. It’s impossible to be certain, unless you are Google!

But why do factors such as keyword in H1 and title have 0 (or near 0) correlation – are you saying they’re useless?

No, please don’t make hasty, literal interpretations from a quick look at the data. We’re not trying to say that.

With Spearman’s coefficient a score of +1 implies a perfect positive correlation and a score of -1 implies a perfect negative correlation. A factor gets a high positive correlation coefficient if higher ranking pages have that feature (or more of it) while lower ranking pages do not (or have less of it).

But certain factors, such as keyword in H1, tended to have a very low correlation because they are present on nearly all pages that appear in the top 30 search results. In this case there is little difference in the way these factors relate to high ranking pages and low ranking pages. They’re always there – which actually results in a low or zero Spearman correlation coefficient!
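To make the last two answers concrete, here is a minimal sketch in Python. It is not the code used for the study. The factor values are invented, and the sign convention (negating the position so that a bigger number means a better ranking) is an assumption made so that "more of a factor on better ranked pages" comes out positive, as described above.

```python
def average_ranks(values):
    """Rank values 1..n (smallest first); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # +1 because ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a variable never changes")
    return cov / (var_x * var_y) ** 0.5


positions = list(range(1, 31))       # 1 = top result, 30 = last result looked at
goodness = [-p for p in positions]   # assumed convention: larger = better ranked

# Invented factor values, one per position.
more_at_top = [31 - p for p in positions]                     # shrinks down the page
more_at_bottom = list(positions)                              # grows down the page
top_half_only = [1 if p <= 15 else 0 for p in positions]      # only on positions 1-15
almost_everywhere = [0 if p == 15 else 1 for p in positions]  # missing on one page

rho_top = spearman(more_at_top, goodness)
rho_bottom = spearman(more_at_bottom, goodness)
rho_half = spearman(top_half_only, goodness)
rho_everywhere = spearman(almost_everywhere, goodness)

print(f"more at the top:    {rho_top:+.2f}")
print(f"more at the bottom: {rho_bottom:+.2f}")
print(f"top half only:      {rho_half:+.2f}")
print(f"almost everywhere:  {rho_everywhere:+.2f}")
```

The factor that sits on 29 of the 30 pages comes out close to zero, even though nearly every ranking page has it. If a factor were on every single page there would be no variation left at all and the coefficient is strictly undefined, which is why the sketch raises an error in that case.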

It may seem a little bit absurd and confusing, but this issue of zero or near zero correlation occurs for some very basic on-page factors (such as the existence of H1 headings, a keyword in the meta description and site speed). These factors are almost ever-present and should absolutely not be disregarded by SEO teams. You can find out more about this issue by reading our in-depth report about the 2013 ranking factors.

Do social signals have an impact on rankings?

OK, there is a huge debate about this. Many people are convinced Google is not using social data as part of its algorithm. Indeed, many believe it’s impossible because Google can’t even access Facebook and Twitter (but if you do a site query in Google for the number of pages that were indexed, you’ll see that Google has more than 5 billion pages of facebook.com indexed – these are pages that are accessible, and Google absolutely knows what’s going on on these sites).

Please understand we do appreciate the arguments against social signals being a ranking factor – we’re not denying them. Our data simply shows that social signals do correlate with rankings. And of course we know that from Google’s perspective this makes perfect sense, i.e. good content is shared often and Google tries to rank good content. We can’t say from the data in this study whether there’s a causal relationship. So you can interpret it how you choose to.

If you don’t know which links pass value in the search indexes then your conclusions are highly dubious….

Nobody knows which links pass value – except Google, who determine the value of links themselves. And as we said: we are not Google! Furthermore, the value of links is influenced by several factors (“value” of the link source, quality of the link etc.), which we have also taken into account. And, moreover, it’s more than just link attributes that influence rankings. All we can do is look at the features that well-ranking pages share and interpret the data. And we are not guessing, since our interpretations are based on an extremely large data set.

Your data says exact match domains have decreased in importance, yet many EMDs rank very highly – so why is that?

We did not claim that Google punished all the exact match domains. What we discovered was that EMDs seem to have lost their “ranking bonus”. If you’ve read our Ranking Factors 2012 edition, then you know that EMDs had some kind of bonus until last year. This era ended with Google’s Penguin and EMD updates.

Until 2012, there were lots of keyword domains ranking well in the SERPs that did not provide any value for the user except having the keyword in the domain name – together with ads on the page. Most of them were absolutely irrelevant to the user’s query and requirement for information. What Google seems to have done now is devalue these irrelevant domains.

Of course this does not mean that all the EMDs are irrelevant now. In fact, there are still EMDs ranking well. We know that. But these are largely the domains that offer some kind of relevance for the user. The irrelevant ones are more or less gone. For the US, our data indicates that there are about 25% fewer EMDs in the top 10 now, in comparison to 2012.

You will always find exceptions. But this is normal, because if keyword domains have great content, why shouldn’t they rank?

But the correlations some factors show are very low – under 0.4. Is that a typo?

No, this is not a typo. The absolute value of the correlation coefficient should be interpreted as an indicator of the relative strength of the correlation of the corresponding factor with top 30 rankings, in relation to the other factors and our data set. Since there are no comparable studies, we cannot really say whether a correlation coefficient value of 0.4 is high or low. Given the high variability of the data, our best guess is that a coefficient of 0.4 for a single factor indicates a “good” correlation, while a coefficient of less than 0.1 – 0.2 indicates a “low correlation”. The high variability of the data is also the reason why we did not publish the results of statistical significance tests. In a high variability setting, such tests tend to accept the null hypothesis of “no correlation”, since considerable variance and heteroscedasticity lead to overestimated standard errors, which in turn increases the chance of type II errors (i.e. accepting that there is no significant correlation, while in fact there is).

Are they important for us to consider?

Yes, you should take them into consideration, but do not forget the zero-correlation issue described above.

What about the reliability of the published correlation coefficients – aren’t Google’s rankings highly variable? What about factors influencing each other?

Some have argued that our conclusions are flawed because we did not consider multi-collinearity and possible detrimental effects of high variance on correlation values.

To address the first issue: In this study, we analyzed the correlation between rank and an individual factor, i.e. we performed a simple linear regression with a single explanatory variable for each of the factors. Collinearity, in contrast, refers to multiple regression, where a dependent variable is explained in terms of several input variables. In that latter case, and only then, the estimates of the coefficients (i.e. the correlation values) of the input variables are dependent on effects of collinearity. Thus, if we had analyzed, say, the effect of having a keyword in the title AND a high number of Facebook Likes on a good ranking, we would have had to make sure that the correlation values obtained from such a model were valid with respect to any collinearity. The argument is thus simply invalid for the conclusions drawn from this study.

We mitigated the effects of high variance and of heteroscedasticity by filtering the observed values for extreme outliers before computing correlation values. To quote from Wikipedia: “Regression analysis using heteroscedastic data will still provide an unbiased estimate for the relationship between the predictor variable and the outcome, but standard errors and therefore inferences obtained from data analysis are suspect.” Thus, even though the factors we analyzed may exhibit considerable variance, even after filtering, the estimate of the relationship between factor and rank is still correct. Any further conclusions, however, such as a statistical hypothesis test of the significance of the observed estimate, may suffer due to the overestimation of standard errors.
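As a rough illustration of what "filtering for extreme outliers" can look like, here is a small Python sketch. The filter actually used in the study is not described here, so the rule below (Tukey-style fences at three interquartile ranges), the function name filter_extreme_outliers and the Facebook Like counts are all assumptions made for the example.

```python
import statistics


def filter_extreme_outliers(pairs, k=3.0):
    """Drop (factor_value, position) pairs whose factor value lies outside
    the fences q1 - k*IQR and q3 + k*IQR (assumed rule, not the study's)."""
    values = [value for value, _ in pairs]
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(value, position) for value, position in pairs if low <= value <= high]


# Invented data: (Facebook Likes, search position). One page went viral.
observations = [
    (120, 1), (95, 2), (110, 3), (80, 4), (2_000_000, 5),
    (60, 6), (55, 7), (40, 8), (30, 9), (20, 10),
]

filtered = filter_extreme_outliers(observations)
print(len(observations), "observations before,", len(filtered), "after filtering")
```

Only the single extreme page is removed. The remaining pairs would then go into the correlation calculation as usual.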

Social signals’ effect on ranking – a small addendum:

Again, remember that correlation is not causation. For example, a page could gain a large number of social signals simply because it appears prominently in Google’s SERPs, and hence can come to the attention of many people who may in turn share it. In this case, there would be a high correlation between social signals and a good rank, but the large number of social signals would be the effect, not the cause, of Google’s ranking strategies.
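The scenario above can be played through with a toy simulation in Python. Everything in it is made up for illustration: positions are simply given, and shares are generated purely from how visible a position is (the 1000 / position exposure curve and the noise range are assumptions). Shares never feed back into the ranking, yet the correlation still comes out high.

```python
import random


def average_ranks(values):
    """Rank values 1..n (smallest first); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # +1 because ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a variable never changes")
    return cov / (var_x * var_y) ** 0.5


random.seed(2013)  # fixed seed so the sketch is repeatable

positions, shares = [], []
for _ in range(100):                 # 100 hypothetical keywords
    for position in range(1, 31):    # top 30 results per keyword
        exposure = 1000 / position   # assumed: visibility falls with position
        positions.append(position)
        # Shares depend only on exposure plus noise; they never affect position.
        shares.append(exposure * random.uniform(0.5, 1.5))

goodness = [-p for p in positions]   # assumed convention: larger = better ranked
rho = spearman(shares, goodness)
print(f"shares vs. ranking: {rho:+.2f}")
```

A strongly positive coefficient here says nothing about shares driving rankings, because in this toy world they cannot.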

Hopefully the above will have clarified our position about the areas that people are unsure of when looking at our data. We’ve put a lot of time and effort into collating, analyzing and presenting the correlation data. And we think what we’re presenting is of value. But SEO experts have different opinions and we’re happy to have a discussion about this.

One thing we would urge people to do, however, is read the detailed report rather than make assumptions from a quick look at the infographic and charts. A lot of the issues people may be unsure of are explained in the report.