Rand Fishkin terrible shoes
Depending on where you stand (scratch that: no matter where you stand), you will be able to see Rand's choice of footwear from pretty much anywhere on the planet, thanks to Google Earth and the glimmering sheen that blinds us all. In all seriousness, Rand Fishkin is one of those who learned his trade through years of exploration and experimentation, which makes him stand out (ahem) from the crowd. Like myself right now, he was a serial lurker of the SEO forums, and he has broken into the SEM world by laying solid bricks with SEOmoz.org.
The latest posts from (SEOmoz.org)
Posted by Dr. Pete
The past few years have seen an explosion of usability and Conversion Rate Optimization (CRO) tools hit the market. There have been many good roundup posts about these tools, but I want to focus today on a more in-depth approach to putting just 3 of these tools to work: (1) Five Second Test, (2) Crazy Egg, and (3) UserTesting.com. Total cost to do one round of testing: $224.
(1) Five Second Test ($20)
The premise behind Five Second Test is incredibly simple – show a visitor your site for 5 seconds and see what they remember (or, alternatively, where they click). This is a great starting point for gathering some initial observations about your visitors.
How It Works
Setup is easy – just submit a screenshot of your web page or prototype (great for design comparisons) and the replies start coming in. You can view them individually or grouped by concepts. Five Second Test is actually free, but the $20/month package means you'll get a larger response rate. It's worth the extra cash, IMO. You can also earn credits ("karma") by taking other people's tests – it's kind of fun and can be informative.
What to Test
Think about the kinds of things you want your visitors to know within 5 seconds, the big questions: Who, What, Why. Here are a few uses I recommend:
- Do visitors recognize your brand?
- Do people get what you do?
- Is your tagline descriptive and effective?
- Is your page too visually noisy?
- Is Concept B better than Concept A?
- Can people find your call to action?
If people are remembering things like "blue", "blonde girl", and "ugly site", you know you've got some work to do (those aren't far from real examples of what I've seen).
(2) Crazy Egg ($9)
Heat-mapping tools like Crazy Egg take user activity and translate it into visual maps, helping you to easily visualize how people interact with your site. Crazy Egg was founded by SEO wonder kid Neil Patel, and is an amazing bargain at $9/month. If you can't bother to spend $9 on improving your website, feel free to stop reading this post. I'm serious – go buy a Venti Iced Mocha and a cookie instead of spending money on your business.
How It Works
This one's a little bit trickier – you'll have to install a JavaScript snippet similar to Google Analytics and other tools. Then, Crazy Egg starts tracking clicks on your specified page (try to stick to one page, as jumping pages can produce odd results).
What to Test
Crazy Egg not only allows you to create visual heat maps, but also has a "confetti" mode that lets you visualize clicks by segments, such as referring sources and new vs. returning visitors. Here are a few questions a heat-mapping tool can help you answer:
- Are people clicking where you want them to click?
- Is your navigation effective?
- Do you have too many choices?
- Do search visitors behave differently?
- Is your call to action getting clicks?
Although some heat-mapping tools can get bogged down in the visuals, I think that Crazy Egg has a very simple, elegant reporting approach that can give you solid insights quickly. Once you've gathered some initial impressions from Five Second Test and Crazy Egg, it's time to do some real user testing...
(3) UserTesting.com ($195)
It used to be that user testing required a lab, expensive equipment, and a difficult recruiting process. Now, you can use remote testing services like UserTesting.com to get quick, inexpensive user feedback. While I won't say it compares apples-to-apples to laboratory testing, I often find that the insights from even a handful of remote testing subjects can be incredibly useful.
How It Works
Setup is pretty straightforward, but doing it right can take a little bit of time. Technically, you just need to submit your URL and a few instructions to visitors. You pay $39 per visitor and receive both written feedback and an online video of the user walking through your site (with voice-over). Although this is a topic of some debate in the usability community, 5 users is a good number for uncovering core insights and getting solid bang for your buck.
What to Test
Take some time setting up your questions. Traditional usability tests are task-oriented – you tell someone to try to complete a task in a fairly open-ended fashion and watch them go to work. Be specific about the task and ask follow-up questions, like "Would you trust this site enough to make a purchase?" (I generally ask 3-4 follow-ups). A few questions this kind of qualitative testing can help you answer:
- Can people complete the task?
- How long does task completion take?
- Do users experience common stumbling blocks?
- What are visitors thinking out loud about?
- Does your search/navigation work as expected?
- Are you missing features people might be looking for?
- Do visitors get frustrated using your site?
Qualitative testing can be a great precursor to quantitative (A/B and multivariate) testing. Don't throw design changes at the wall and see what sticks – put user testing to work to uncover hidden issues on your site. We all need a fresh pair (or 5 pairs) of eyes from time to time.
Here's to $224 Well Spent
I'm an entrepreneur and a Bohemian – I understand that parting with money isn't easy. The insights you'll gain from just over $200, though, will, in my experience, easily yield 10X or even 100X back in online sales improvement. Solid qualitative data collection will also prevent you from making costly mistakes and will better inform how you look at your analytics and quantitative testing. There are plenty of good tools out there – choose a couple of them, and really put the effort into understanding how they work. You'll be well rewarded.
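For anyone who wants to double-check the math behind the $224 figure, here's a quick sketch using only the prices quoted in this post ($20 for Five Second Test, $9 for Crazy Egg, and 5 UserTesting.com participants at $39 each). The variable names are just for illustration.

```python
# Prices as quoted in the post; variable names are illustrative only.
five_second_test = 20      # one month of the paid package
crazy_egg = 9              # one month
user_testing = 5 * 39      # 5 participants at $39 each

total = five_second_test + crazy_egg + user_testing
print(f"UserTesting.com: ${user_testing}, one full round: ${total}")
```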
Update: We just published a YOUmoz post about Crazy Egg that should be an interesting read for anyone who enjoyed this article. David gives some nice examples and a case study of how heat-mapping got one of his clients an 87% conversion boost.
Posted by randfish
Just a short post tonight.
First off, I'm honored to be interviewed by Aaron Wall. We've had our differences and maintain some divergent opinions on a few topics, but we both have an insane passion for helping make SEO professionals better at their job and work hard to grow the credibility of SEO as a whole.
Second - we've got a lot of reasons to be thankful. SEOmoz was recently named the 334th fastest growing company in the US by Inc Magazine. I was named to Seattle's 40 Under 40 List (I'm guessing it's a typo) and we've recently passed 6,000 PRO subscribers (actually, we're up over 6,300 as of today).

As amazing as all that is, nearly everyone at SEOmoz is thinking not about these milestones, but about one of our own - Jen Lopez - who noted on her Twitter feed that she's out battling cancer. We are all with you Jen - every last one of us, with all our hearts. And we agree: #fuckcancer
Latent Dirichlet Allocation (LDA) and Google's Rankings are Remarkably Well Correlated
Posted by randfish
Last week at our annual mozinar, Ben Hendrickson gave a talk on a unique methodology for improving SEO. The reception was overwhelming - I've never previously been part of a professional event where thunderous applause broke out not once but multiple times in the midst of a speaker's remarks.

Ben Hendrickson speaking last fall at the Distilled/SEOmoz PRO Training London
(he'll be returning this year)
I doubt I can recreate the energy and excitement of the room filled with 320 people that day, but my goal in this post is to help explain the concepts of topic modeling, vector space models as they relate to information retrieval, and the work we've done on LDA (Latent Dirichlet Allocation). I'll also try to explain the relationship and potential applications to the practice of SEO.
A Request: Curiously, prior to the release of this post and our research publicly, there have been a number of negative remarks and criticisms from several folks in the search community suggesting that LDA (or topic modeling in general) is definitively not used by the search engines. We think there's a lot of evidence to suggest engines do use these, but we'd be excited to see contradicting evidence presented. If you have such work, please do publish!
The Search Rankings Pie Chart
Many of us are likely familiar with the ranking factors survey SEOmoz conducts every two years (we'll have another one next year and I expect some exciting/interesting differences). Of course, we know that this aggregation of opinion is likely missing out on many factors and may over or under-emphasize the ones it does show.
Here's an illustration I created for a presentation recently to help illustrate the major categories in the overall results:

This suggests that many SEOs don't ascribe much weight to on-page optimization
I myself have often felt that, based on all the metrics, tests and observations of Google's ranking results, the importance of on-page factors like keyword usage or TF*IDF (explained below) is fairly small. Certainly, I've not observed many results, even in low-competition spaces, where one can simply add in a few more repetitions of the keyword, maybe toss in a few synonyms or "related searches" and improve rankings. This experience, which many SEOs I've talked to share, has led me to believe that linking signals are an overwhelming majority of how the engines order results.
But, I love to be wrong.
Some of the work we've been doing around topic modeling, specifically using a process called LDA (Latent Dirichlet Allocation), has shown some surprisingly strong results. This has made me (and I think a lot of the folks who attended Ben's talk last Tuesday) question whether it was simply a naive application of the concept of "relevancy" or "keyword usage" that gave us this biased perspective.
Why Search Engines Need Topic Modeling
Some queries are very simple - a search for "wikipedia" is non-ambiguous, straightforward and can be effectively returned by even a very basic web search engine. Other searches aren't nearly as simple. Let's look at how engines might order two results - a simple problem most of the time that can be somewhat complex depending on the situation.




For complex queries or when relating large quantities of results with lots of content-related signals, search engines need ways to determine the intent of a particular page. The fact that a page mentions the keyword 4 or 5 times in prominent places, or even mentions similar phrases/synonyms, doesn't necessarily mean that it's truly relevant to the searcher's query.
Historically, lots of SEOs have put effort into this process, so what we're doing here isn't revolutionary, and topic models, LDA included, have been around for a long time. However, no one in the field, to our knowledge, has made a topic modeling system public or compared its output with Google rankings (to help see how potentially influential these signals might be). The work Ben presented, and the really exciting bit (IMO), is in those numbers.
Term Vector Spaces & Topic Modeling
Term vector spaces, topic modeling and cosine similarity sound like tough concepts, and when Ben first mentioned them on stage, a lot of the attendees (myself included) felt a bit lost. However, Ben (along with Will Critchlow, whose Cambridge mathematics degree came in handy) helped explain these to me, and I'll do my best to replicate that here:

In this imaginary example, every word in the English language is related to either "cat" or "dog," the only topics available. To measure whether a word is more related to "dog," we use a vector space model that creates those relationships mathematically. The illustration above does a reasonable job showing our simplistic world. Words like "bigfoot" are perfectly in the middle with no more closeness to "cat" than to "dog." But words like "canine" and "feline" are clearly closer to one than the other and the degree of the angle in the vector model illustrates this (and gives us a number).
BTW - in an LDA vector space model, topics wouldn't have exact label associations like "dog" and "cat" but would instead be things like "the vector around the topic of dogs."
Unfortunately, I can't really visualize beyond this step, as it relies on taking the simple model above and scaling it to thousands or millions of topics, each of which would have its own dimension (and anyone who's tried knows that drawing more than 3 dimensions in a blog post is pretty hard). Using this construct, the model can compute the similarity between any word or groups of words and the topics it's created. You can learn more about this from Stanford University's posting of Introduction to Information Retrieval, which has a specific section on Vector Space Models.
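To make the "angle gives us a number" idea a bit more concrete, here's a minimal sketch of cosine similarity in the two-topic cat/dog world. The coordinates for each word are made up purely for illustration (they are not output from our model), and a real topic model would have far more than two dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction, so treat as "no relationship"
    return dot / (norm_a * norm_b)

# Hypothetical coordinates: (cat-ness, dog-ness). Made-up numbers.
topics = {"cat": (1.0, 0.0), "dog": (0.0, 1.0)}
words = {
    "feline": (0.9, 0.1),
    "canine": (0.1, 0.9),
    "bigfoot": (0.5, 0.5),
}

for word, vec in words.items():
    to_cat = cosine_similarity(vec, topics["cat"])
    to_dog = cosine_similarity(vec, topics["dog"])
    print(f"{word:8s} cat={to_cat:.2f} dog={to_dog:.2f}")
```

A word that sits exactly between the two topics (like "bigfoot" here) gets the same score against both, while "canine" scores much higher against "dog" than against "cat."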
Correlation of our LDA Results w/ Google.com Rankings
Over the last 10 months, Ben (with help from other SEOmoz team members) has put together a topic modeling system based on a relatively simple implementation of LDA. While it's certainly challenging to do this work, we doubt we're the first SEO-focused organization to do so, though possibly the first to make it publicly available.
When we first started this research, we didn't know what kind of influence LDA/topic modeling might have on search engine rankings. Thus, on completion, we were pretty excited (maybe even ecstatic) to see the following results:
Correlation Between Google.com Rankings and Various Single Metrics

(the vertical blue bars indicate standard error in the diagram, which is relatively low thanks to the large sample set)
Using the same process we did for our release of Google vs. Bing correlation/ranking data at SMX Advanced (we posted much more detail on the process here), we've shown the Spearman correlations for a set of metrics familiar to most SEOs against some of the LDA results, including:
- TF*IDF - the classic term weighting formula, TF*IDF measures keyword usage in a more accurate way than a more primitive metric like keyword density. In this case, we just took the TF*IDF score of the page content that appeared in Google's rankings
- Followed IPs - this is our highest correlated single link-based metric, and shows the number of unique IP addresses hosting a website that contains a followed link to the URL. As we've shown in the past, with metrics like Page Authority (which uses machine learning to build more complex ranking models) we can do even better, but it's valuable in this context to just think and compare raw link numbers.
- LDA Cosine - this is the score produced from the new LDA labs tool. It measures the cosine similarity of topics between a given page or content block and the topics produced by the query.
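For readers who haven't worked with TF*IDF before, here's a rough sketch of the idea. There are many variants of the formula; this one assumes term frequency is the term's share of the document's words and inverse document frequency is log(N / number of documents containing the term). The tiny corpus is invented for illustration and this is not necessarily the exact variant used in our correlation data.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF*IDF of `term` in one tokenized document, relative to a corpus."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Invented three-document corpus, already lowercased and split into words.
corpus = [
    "the rolling stones tour dates".split(),
    "the best gemstones and the rarest rubies".split(),
    "the stones played the hits".split(),
]
doc = corpus[0]

print(tf_idf("the", doc, corpus))     # appears in every document
print(tf_idf("stones", doc, corpus))  # appears in two of three documents
print(tf_idf("tour", doc, corpus))    # appears in only one document
```

The point of the weighting is that a word used everywhere (like "the") contributes nothing, while a word that is rare across the corpus counts for more than raw keyword density would suggest.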
The correlation of the LDA scores with rankings is uncanny. Certainly, it's not a perfect correlation, but that shouldn't be expected given the supposed complexity of Google's ranking algorithm and the many factors therein. But seeing LDA scores show this dramatic result made us seriously question whether there was causation at work here (and we hope to do additional research via our ranking models to attempt to show that impact). Perhaps good links are more likely to point to pages that are more "relevant" via a topic model, or perhaps some other aspect of Google's algorithm that we don't yet understand naturally biases towards these.
However, given that many SEO best practices (e.g. keywords in title tags, static URLs and ) have dramatically lower correlations and the same difficulties proving causation, we suspect a lot of SEO professionals will be deeply interested in trying this approach.
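Since all of these numbers are Spearman correlations, here's a small sketch of how that statistic can be computed for a single set of results: rank both lists, then take the ordinary (Pearson) correlation of the ranks. The positions and scores below are hypothetical, and this illustrates the statistic only, not our actual pipeline.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank lists.

    Assumes neither list is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

positions = [1, 2, 3, 4, 5]               # position in the results
scores = [0.62, 0.70, 0.41, 0.45, 0.20]   # hypothetical scores for each result

# Negate position so "ranks higher" and "scores higher" point the same way.
rho = spearman([-p for p in positions], scores)
print(round(rho, 2))
```

A value of 1.0 would mean the scores fall in exactly descending order down the results page; anything lower means the ordering only partly agrees.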
The LDA Labs Tool Now Available; Some Recommendations for Testing & Use
We've just recently made the LDA Labs tool available. You can use this to input a word, phrase, chunk of text or an entire page's content (via the URL input box) along with a desired query (the keyword term/phrase you want to rank for) and the tool will give back a score that represents the cosine similarity in a percentage form (100% = perfect, 0% = no relationship).
When you use the tool, be aware of a few issues:
- Scores Change Slightly with Each Run
This is because, like a pollster interviewing 100 voters in a city to get a sense of the local electorate, we check a sample of the topics a content+query combo could fit with (checking every possibility would take an exceptionally long time). You can, therefore, expect the percentage output to fluctuate 1-5% each time you check a page/content block against a query.
- Scores are for English Only
Unfortunately, because our topics are built from a corpus of English language documents, we can't currently provide scores for non-English queries.
- LDA isn't the Whole Picture
Remember that while the average correlation is in the 0.33 range, we shouldn't expect scores for any given set of search results to go in precisely descending order (a correlation of 1.0 would suggest that behavior).
- The Tool Currently Runs Against Google.com in the US only
You should be able to see the same results the tool extracts from by using a personalization-agnostic search string like http://www.google.com/xhtml?q=my+search&pws=0
- Using Synonyms, "Related Searches" or Wonder Wheel Suggestions May Not Help
Term vector models are more sophisticated representations of "concepts" and "topics," so while many SEOs have long recommended using synonyms or adding "related searches" as keywords on their pages and others have suggested the importance of "topically relevant content," there haven't been great ways to measure these or show their correlation with rankings. The scores you see from the tool will be based on a much less naive interpretation of the connections between words than these classic approaches.
- Scores are Relative (20% might not be bad)
Don't presume that getting a 15% or a 20% is always a terrible result. If the folks ranking in the top 10 all have LDA scores in the 10-20% range, you're likely doing a reasonable job. Some queries simply won't produce results that fit remarkably well with given topics (which could be a weakness of our model or a weirdness about the query itself).
- Our Topic Models Don't Currently Use Phrases
Right now, the topics we construct are around single word concepts. We imagine that the search engines have probably gone above and beyond this into topic modeling that leverages multi-word phrases, too, and we hope to get there someday ourselves.
- Keyword Spamming Might Improve Your LDA Score, But Probably Not Your Rankings
Like anything else in the SEO world, manipulatively applying the process is probably a terrible idea. Even if this tool worked perfectly to measure keyword relevance and topic modeling in Google, it would be unwise to simply stuff 50 words over and over on your page to get the highest LDA score you could. Quality content that real people actually want to find should be the goal of SEO, and Google's almost certainly sophisticated enough to determine the difference between junk content that matches topic models and real content that real users will like (even if the tool's scoring can't do that).
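If you want to build the personalization-agnostic search string mentioned in the list above for your own queries, a small sketch like this will do the URL encoding for you. The base URL and the `q` and `pws` parameters come straight from the example string above; the helper name `search_url` is just something I made up for illustration.

```python
from urllib.parse import urlencode

def search_url(query):
    """Build the search string from the example above for a given query."""
    params = {"q": query, "pws": 0}  # pws=0 as in the example string
    return "http://www.google.com/xhtml?" + urlencode(params)

url = search_url("my search")
print(url)
```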
If you're trying to do serious SEO analysis and improvement, my suggested methodology is to build a chart something like this:

SERPs analysis of "SEO" in Google.com w/ Linkscape Metrics + LDA
Right now, you can use Keyword Difficulty's export function and then add in some of these metrics manually (though in the future, we're working towards building this type of analysis right into the web app beta).
Once you've got a chart like this, you can get a better sense of what's propping up your competitors' rankings - anchor text, domain authority, or maybe something related to topic modeling relevancy (which the LDA tool could help with).
Undoubtedly, Google's More Sophisticated than This
While the correlations are high, and the excitement around the tool both inside SEOmoz and from a lot of our members and community is equally high, this is not us "reversing the algorithm." We may have built a great tool for improving the relevancy of your pages and helping to judge whether topic modeling is another component in the rankings, but it remains to be seen if we can simply improve scores on pages and see them rise in the results.
What's exciting to us isn't that we've found a secret formula (LDA has been written about for years and vector space models have been around for decades), but that we're making a potentially valuable addition to the parts of SEO we've traditionally had little measurement around.
BTW - Thanks to Michael Cottam, who suggested the reference of research work by a number of Googlers on pLDA. There are hundreds of papers from Google and Microsoft (Bing) researchers around LDA-related topics, too, for those interested. Reading through some of these, you can see that major search engines have almost certainly built more advanced models to handle this problem. Our correlation and testing of the tool's usefulness will show whether a naive implementation can still provide value for optimizing pages.
For those who'd like to investigate more, we've made all of our raw data available here (in XLS format, though you'll need a more sophisticated model to do LDA). If you have interest in digging into this, feel free to email Ben at SEOmoz dot org.
How Do I Explain this to the Boss/Client?
The simplest method I've found is to use an analogy like:
If we want to rank well for "the rolling stones" it's probably a really good idea to use words like "Mick Jagger," "Keith Richards," and "tour dates." It's also probably not super smart to use words like "rubies," "emeralds," "gemstones," or the phrase "gathers no moss," as these might confuse search engines (and visitors) as to the topic we're covering.
This tool tries to give a best guess number about how well we're doing on this front vs. other people on the web (or sample blocks of words or content we might want to try). Hopefully, it can help us figure out when we've done something like writing about the Stones but forgetting to mention Keith Richards.
As always, we're looking forward to your feedback and results. We've already had some folks write in to us saying they've used the tool to optimize the contents of some pages and seen dramatic rankings boosts. As we know, that might not mean anything about the tool itself or the process, but it certainly has us hoping for great things.
p.s. The next step, obviously, is to produce a tool that can make recommendations on words to add or remove to help improve this score. That's certainly something we're looking into.
p.p.s. We're leaving the Labs LDA tool free for anyone to use for a while, as we'd love to hear what the community thinks of the process and want to get as broad input as possible. Future iterations may be PRO-only.
Two Quick, Simple Social Media Tips
Posted by RobOusbey
Today, I want to share two pieces of advice that are particularly useful to certain types of business - and will be exceptionally quick to implement. I've also created a free download that might help some people implement one of these ideas even more quickly.
About two years ago, I made a recommendation to a client in the UK, and I've just seen it used by a hotel in the USA. If your business offers public computers with internet access - such as those in hotel lobbies, libraries, etc - this is for you:
Tip 1: Put up a sign, next to your public computers, with a call to action; typically this could be something like 'Find us on Facebook' or 'Follow us on Twitter'.
Here's such a poster in use, at the Ledgestone Hotel in Yakima. (Click the image to embiggen.)
Sadly, it doesn't look like the Ledgestone is doing much with their Twitter account; this probably disappoints people who go to their page, and so they don't end up with as many followers as they could do. Remember - getting people to your Twitter page (or Facebook, or whatever else you're asking them to do) is only the first stage - there has to be something there for them when they arrive.
The second tip is more for people who offer wi-fi - this could be all manner of hotels, conference venues, airports, aeroplanes, train stations, coffee shops, etc. For places that offer free wi-fi, this can work even better:
Tip 2: You control the first page visitors see after logging on to your wi-fi. Don't waste this with a dull message; make the page interesting, and put some calls to action on there.
People have probably logged on to do something - but many will welcome a distraction - particularly if you keep the request brief. Create a nicely styled, but simple page, and add a couple of messages on there. Some examples could include:
- Follow us on Twitter / Like us on Facebook: you could incentivize this, for example: if you're a coffee shop, then offer a free latte to new followers
- Sign up to our email newsletter: this will only take them a second if you make sure the form is right there on the page, and again this can be incentivized
- Don't forget to check in on foursquare: ideal for almost any location, and this is as good a time as any to remind them to check in
- If you're enjoying your stay, please review us: particularly useful for hotels, where online reviews can increase visibility; I'll go into a little more detail about this below.
There can be some issues with sites noticing that a lot of people from the same IP are visiting, particularly when it comes to review services. Local search expert David Mihm advised me that he's heard Yelp in particular does try to filter out multiple reviews from the same IP, and that TripAdvisor's fraud rules do include clauses that might get you into trouble (for example, offering incentives for people to write reviews is not permitted).
I'd recommend two steps to get around this type of issue:
- Try to appeal for reviews only from people who already have accounts on those sites (e.g.: "If you're a Yelp member, please review us here...." or "If you have a Google account, please leave a review here...")
- Make this 'post-wifi-login' page available on the public internet; review sites should be able to recognize that lots of people are being referred to your page from the same URL - if it's public then they'll be able to visit that page, and should figure out what is going on.
I've built a quick free template for you to download as a starting point. You can visit the file, or download it, by clicking this link: free wifi login CTA page.
(That was created based on a template from LayoutGala; I'm not going to add any licence to it, other than use it however you want. You should change the images that are in it to be local files at the very least.)
Honestly, it doesn't take long to print off a couple of small posters (or even to publish a nice wifi login page) so I'll hope to see social-media CTAs cropping up all over the place soon. :)
LDA - Is On-Page Optimization the SEO Secret?
Posted by Dana Lookadoo
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author's views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
How do I recap the SEOmoz PRO Seminar session on Uncovering a Hidden Technique for SEO? The title is so attractive that it produces Pavlovian symptoms as we salivate at the thought of uncovering a hidden SEO treasure. Ben Hendrickson of SEOmoz presented a model which appears to show how Google may be assigning relevance to keyword terms based on context - topical relevance.
Is Latent Dirichlet Allocation (LDA) that hidden jackpot?
1st - LDA is not new nor something SEOmoz invented. The Information Retrieval model has been around for 7 or 8 years, and IR geeks have talked about it before. There are a number of resources, as well as nay saying, about LDA and Google's possible use of it.
2nd - What is new is SEOmoz's LDA Topics Tool that produces a relevancy score based off a query (search term). It enables one to play with words that may increase a page's relevancy in the eyes of Google. It shows words that help Google determine how relevant the page is to a user's search query.
Game Changer?
Kyle Stone tweeted that the LDA tool is a game changer, and many retweeted.

Is SEOmoz's LDA tool a game changer? That's yet to be seen. The goal is to report Ben's research as presented at the Mozinar and how a layman (myself) interprets such. Rand is going to do a follow-up post to explain more.
Why all the hype?
The SEO Challenge
SEOs face the continual challenge of figuring out Google's hidden ranking algorithms. How do we rank higher? Which signals are the most important? We know search engines are "learning models" that attempt to understand the "context" of words. Google has said for years that webmasters should concentrate most on providing good, relevant (contextual) content.
There are ways to rank higher. Is it as easy as 1, 2, 3?
- Create quality copy with keyword(s) on the page along with associated anchor text links.
- Get good links.
- What Ben talked about in this session.
LDA - Topic Modeling & Analysis
Latent Dirichlet Allocation, in layman's terms, translates to "topic modeling." In search geek terms, LDA is the following formula:

(Did you digest that? Don't worry; Mozzers groaned and laughed at the same time. PLUS: Scientist Hendrickson delivered this session after lunch!)
LDA Simplified - Here is Ben's way of explaining topic modeling:

(Okay, I was once proud that I got an A in Logic and Combinatorics - discrete math/set theory. However, that computer science class now feels like basic math compared to this formula.)
It made more sense when Rand Fishkin joined Ben on stage and when Todd Friesen moderated and deciphered during Q&A. (Manuela Sanches of Brazil was sitting next to me and said that Ben's "presentation needed subtitles!")
The objective of LDA, from my deciphering of Greek, is to understand how Google is using semantic contextual analysis combined with other signals, to define topics/concepts. It's how Google analyzes the words on a page to determine the "set" to which a word belongs - how relevant a search query is to pages in its database.
For example: How does Google assign relevance to the word "orange" on a page? They determine orange is related to the fruit set or to the color set by page context.
LDA Defined:
"Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning algorithm for automatically and jointly clustering words into "topics" and documents into mixtures of topics. It has been successfully applied to model change in scientific fields over time (Griffiths and Steyver, 2004; Hall, et al. 2008).
A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over "topics", which are in turn distributions over words."
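As a layman's illustration of that definition, here's a toy sketch in which each topic is a distribution over words and each document is a mixture of topics. The two topics and their probabilities are invented (borrowing the "orange" example), and this only shows the forward, generative direction. Real LDA works the other way around, inferring the topics and mixtures from documents, and also places Dirichlet priors on those distributions, which this sketch leaves out.

```python
import random

# Hypothetical topics: each is a probability distribution over words.
topics = {
    "fruit": {"orange": 0.4, "juice": 0.3, "peel": 0.3},
    "color": {"orange": 0.4, "red": 0.3, "paint": 0.3},
}

def generate_document(topic_mixture, length, rng):
    """Draw each word by first picking a topic, then a word from that topic."""
    names = list(topic_mixture)
    weights = [topic_mixture[t] for t in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights)[0]
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# A document that is mostly about fruit, with a little color mixed in.
doc = generate_document({"fruit": 0.8, "color": 0.2}, length=10, rng=rng)
print(doc)
```

Notice that "orange" can come out of either topic; it's the company it keeps ("juice" and "peel" vs. "red" and "paint") that hints at which set a page belongs to.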
Bayesian - ah, a term I recognize!! Bayesian spam filtering is a method used to detect spam. It draws off a database and learns the meaning of words. It's "trained" by us when we mark an email as spam. It looks at incoming emails and calculates the probability that the content of an email is contextually spammy.
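Here's my rough sketch of how that kind of filter can work, assuming a naive Bayes approach with add-one smoothing. The "training" messages are invented, and real spam filters are far more elaborate than this.

```python
import math
from collections import Counter

# Tiny hypothetical training set: messages a user has marked.
spam_msgs = ["win cash now", "cheap pills win big"]
ham_msgs = ["meeting notes attached", "lunch at noon"]

def word_counts(messages):
    """Count how often each word appears across a list of messages."""
    return Counter(word for m in messages for word in m.split())

spam_counts, ham_counts = word_counts(spam_msgs), word_counts(ham_msgs)
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(words, counts):
    """Sum of log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + len(vocab))) for w in words
    )

def spam_probability(message):
    """Estimated probability that a message is spam, given its words."""
    words = message.split()
    # Equal priors, because the toy training set is balanced.
    log_spam = math.log(0.5) + log_likelihood(words, spam_counts)
    log_ham = math.log(0.5) + log_likelihood(words, ham_counts)
    # Convert the two log scores back into a probability.
    return 1 / (1 + math.exp(log_ham - log_spam))

print(round(spam_probability("win cheap cash"), 3))
print(round(spam_probability("lunch meeting"), 3))
```

Every time we mark another email as spam, the word counts shift, which is the "training" part.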
I found a PowerPoint presentation about Bayesian Inference Techniques by Microsoft Research from 2004 that presents the possibility of using LDA. Go to slide 54 and read:
"Can we build a general-purpose inference engine which automates these procedures?"
Microsoft has been looking at LDA models. Do search engines use it as one of their primary methods?
Ben sampled over 8 million documents with approx. 1,000 queries. He believes Google is using LDA topic modeling to determine (learn) what words mean by their associations with, and relevance to, other words on the page. (Other factors are included.) Ben called the results a "co-occurrence explanation" that uses "cosine similarity."
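Cosine similarity itself is simple enough to sketch. The snippet below is only an illustration in Python: it compares two vectors stored as dictionaries, and the topic names and weights for the query and the pages are invented. How the SEOmoz tool actually builds its vectors is not described here, so treat the inputs as an assumption.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {key: weight} dicts."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical topic mixtures for a query and two pages.
query = {"seo": 0.7, "software": 0.3}
page_on_topic = {"seo": 0.6, "software": 0.3, "marketing": 0.1}
page_off_topic = {"fashion": 0.8, "marketing": 0.2}

print(cosine_similarity(query, page_on_topic))
print(cosine_similarity(query, page_off_topic))
```

A score near 1 means the two vectors point in nearly the same direction (the page covers the same topics as the query), and a score of 0 means they share nothing. The measure ignores vector length, so a long page and a short query can still score as highly similar.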
SEO Takeaway:
- Results that are higher in Google SERPs, in general, have more topical content.
- Search engines do APPEAR to apply semantic analysis... when indexing a page and determining the intent of the words on the page.
Rand tweeted an explanation (in 140 x 4) as follows:

Dana's LDA Catwalk Metaphor for Topic Modeling:
Imagine the words on your page as walking down the fashion runway in Paris. Your keyword phrase is "dressed" in semantic accessories, words that correlate to and dress up your topic. Associated words bring meaning to and highlight the fashion model's outfit. Adjectives, modifiers and synonyms are like jewelry, hats, and shoes. The combination can transform your base layers (your target terms) from casual or conservative business attire into a sexy night-on-the-town ensemble.
Combinations and permutations of words on a page "dress" your skinny or curvy fashion model. Relevant words provide Google with an image of what she is wearing and the catwalk upon which she struts. LDA refers back to what Google already knows about these "accessories" (words) and their previous association with the topic terms related to fashion.
Enter Topical Ambiguity - I just broke the "rules" for context with the catwalk metaphor by referring to modeling in two contexts on this page:
- I used "modeling" terms that relate to the "fashion industry" set.
- The catwalk metaphor is irrelevant content that is off-topic for discussing "LDA topic modeling."
Google Algorithm Exposed?
Ben clearly said that LDA is an ATTEMPT to explain the SERPs. His scenario, a quote from his presentation slides, follows:
One of us needs to implement it so we can:
1) See how it applies to pages
2) See if it helps explain SERPs
One-two-three-not-it.
LDA is not LSI.
Some tweets claimed SEOmoz was bringing back LSI, or snake oil. Ben clarified that LDA is not LSI, which deals more with keyword density. He explained that he is NOT talking about loading a page with keywords but about the relevance of the topics within the page. He said that:
"LSI doesn’t have the same bias toward simple explanations. LSI breaks down as you try to scale up the number of topics."
The LDA tool deals with context, semantic relevancy, not density - in addition to some other random factors. Example:
If SEOmoz has a page all about "SEO" and "tools," and another word on the page is closely related to the SEO topic, that related word counts toward the same topic. Meaning, "seo tools" doesn't have to be repeated over and over, and the related word would be interpreted by Google as being relevant.
Ben, who appears to have the brain of a search engine, noted that it "appears" LDA is what Google is heading for in the near future. He said (paraphrased):
If they are not doing it, they seem to be doing something that has the same output. They are probably already using it.
Rand deciphered:
It’s a super weird coincidence if Google is not using it.
Are On-Page Signals Stronger than Links?
Are we heading toward more emphasis on on-page topic modeling? I'm not an IR geek, but I do plan to spend more energy focusing on understanding how search engines retrieve information. We are dealing with a semantic Web. LDA may indicate that good old on-page optimization sends stronger signals than links.
SEOmoz's LDA tool attempts to show how relevant content is to a chosen keyword. It computes relevance of queries.
The following compares how relevant SEOmoz's Tools page and Aaron Wall's SEO Book Tools page are to the same query.

The score at the top is an indicator of how relevant the content on that page is according to LDA.
- Aaron's content is 72%* relevant for the query "seo tools."
- SEOmoz's tools page is 40%* relevant.
*NOTE: (I inserted the logos.) You can run the same pages and get different results. The results are similar in that SEO Book always scored as more topically relevant, but the percentage varies. Is this the random Monte Carlo algorithm at work? Ben?
Mozinar Question:
"How do we execute this for SEO?"
Ben's Answer:
"I don't actually do SEO. I write code."
That's up to us, the SEOs, to play and test in our Google playground.
Use the tool to decide if you can win with LDA to optimize your on-page signals.
- Use the LDA Topics Tool to return words that could be used on a page for a query.
- Then determine who is ranking for that term.
- Simply write content that is highly on-topic based off the findings you observe.
If you are not performing that well in the SERPs, think about classic on-page optimization. In the example above, rather than putting another instance of "seo tools" on the page, LDA shows there are better ways to tell Google that you are about that topic. The tool provides a way to measure that.
IMPORTANT: There is a threshold beyond which too many related words will look spammy. LDA is not something to be used to game Google.
Test the LDA Tool out for yourself, and draw your own conclusions.
***
DISCLAIMER: I'm not claiming this methodology has uncovered hidden SEO treasures. Time, testing and playing around with a new SEOmoz tool while observing the SERPs will reveal the answer. In the meantime, I'm going to dress up my pages and accessorize them with relevant terms that make them dazzle so they look good climbing the Google catwalk.
If you can not read the posts above, then here is the easy link to SEOmoz
Rand on Twitter (@randfish)
randfish: RT @RobOusbey: Smart tip from @MikeCP: If you're curious to know which searches Gg Instant is registering, enable Google Web History, th ... - Thu, 09 Sep 2010
randfish: Thanks everyone - sounds like a query selection from the dropdown, 3 secs of inactivity or a SERPs click will trigger AdWords impression. - Thu, 09 Sep 2010
randfish: Anyone know if GG Instant counts partial typed-searches or words-as-you-type as full queries for KW research in AdWords? Could be very weird - Thu, 09 Sep 2010
randfish: @w00tert Also have to consider all the things we could possibly work on and whether this is the most valuable to SEOs, for now, probably no - Thu, 09 Sep 2010
randfish: @w00tert I presume many don't want their link buying sources "outed" or perhaps they operate these sites/pages and don't want it disclosed. - Thu, 09 Sep 2010



