If you wonder why a certain message was flagged as spam by Gmail, you can find the reason in Gmail's web interface. When you open a message from the spam folder, there's a new section titled "Why is this message in Spam?" which offers "a brief explanation about why that particular message was placed in Spam".
Here are some of the explanations you might see:
1. "You previously marked messages from info@example.com as spam."
2. "You clicked 'Report spam' for this message."
3. "It's written in a different language than your messages typically use."
4. "It contains content that's typically used in spam messages."
5. "It's similar to messages that were detected by our spam filters."
6. "Many people marked similar messages as spam."
7. "We've found that lots of messages from info@example.com are spam."
8. "Be careful with this message. Our systems couldn't verify that this message was really sent by amazon.com. You might want to avoid clicking links or replying with personal information."
9. [Phishing] "Be careful with this message. Similar messages were used to steal people's personal information. Unless you trust the sender, don't click links or reply with personal information."
#4 and #5 are the most common explanations and they're rather vague. It's interesting to notice that the messages written "in a different language than your messages typically use" could be flagged as spam, but this shouldn't be the only explanation.
"We hope that this is not only interesting, but also helps you learn about scams and other harmful messages that Gmail filters out. Whether you prefer to leave your spam folder untouched or do some educational digging, the information will be there for you. And if you're interested in learning more, check out our new series of spam articles in the Gmail help center," informs Google.
{ Thanks, Venkat. }
March 21, 2012
March 17, 2008
Evaluating Google Search Quality
A leaked copy of Google's quality rater guidelines (PDF), used internally to evaluate the quality of search results, reveals some interesting things about Google's approach to search. "According to the document, which is dated April 2007 and at least looks legitimate, a quality rater has the job to first research and understand a specific search query – say [cell phones] –, to then look at the quality of a website returned for this query," notes Philipp Lenssen.
Queries can be navigational (the user has a single site in mind, the official homepage of a company/product), informational (searching for information), transactional (trying to obtain something: buy a product, download a video) or a combination of these categories. Depending on the query, search results can be: vital (for navigational queries with a dominant interpretation), useful (comprehensive, authoritative resources), relevant (pages that don't cover all the aspects of a query), not relevant (marginally related to the query) or off-topic.
Google thinks that there must be a connection between queries and search results in terms of generality: broad queries are best matched by broad pages, specific queries by specific pages. Search results must take into account the dominant interpretation of a query in a certain location and at a certain moment.
Spam is treated separately from search results evaluation. A web page may be spammy even if it's considered "vital" for some queries or it's very authoritative. "Webspam is the term for web pages that are designed by webmasters to trick search engine robots and direct traffic to their websites," explains Google. Web pages that include ads and scraped content from other sites, but don't bring any original information are considered spam. "When trying to decide if a page is Spam, it is helpful to ask yourself this question: If I remove the scraped (copied) content, the ads, and the links to other pages, is there anything of value left? If the answer is no, the page is probably Spam."
{ via Google Blogoscoped and SEO Book }
March 11, 2008
More Spam Originating from Gmail
The email security vendor MessageLabs published a report about the increasing number of spam messages originating from Gmail. "Analysis of spam shows that 4.6 percent of all spam originates from Web mail-based services and the proportion of spam from Gmail increased two-fold from 1.3 percent in January to 2.6 percent in February, mainly promoting adult-oriented websites. Yahoo! Mail was the most abused Web mail service responsible for sending 88.7 percent of all Web mail-based spam."
Spammers create accounts at free mail services like Yahoo Mail or Gmail, but to make the process more efficient, they need to automate it. The major challenge is that most web mail providers use CAPTCHAs ("Completely Automated Public Turing test to tell Computers and Humans Apart"), which are difficult to solve automatically. Last month, Websense Security Labs discovered that spammers managed to create bots that automatically sign up for new Gmail accounts with a success rate of 20%.
"We discovered that the CAPTCHA breaking process for Gmail is sophisticated when compared to the Live Mail CAPTCHA break up which was reported in our recent blogs. It is observed that two separate hosts active on same domain are contacted during the entire process. These two hosts work collaboratively during the CAPTCHA break process. Unlike Live Mail CAPTCHA breaking, which involved just one botted host doing the entire job (signing up, filling in details, getting the CAPTCHA request), the Gmail signing process involves two botted hosts (or CAPTCHA breaking hosts)," explains Websense.
Jeff Atwood thinks that "there's simply too much money to be made in email spam for the commercial CAPTCHA algorithms, regardless of how good they may be, to survive forever." He suggests diversifying the tests and using more difficult tasks like distinguishing dogs from cats or solving failed OCR inputs, but making the test more complicated will frustrate users.
Update: there's a program called Jiffy Gmail Creator that promises to automatically create Gmail accounts. "Normally, the average amount of time it takes to create a GMail account on a fast connection is approximately 4 minutes. With this software you can create a single account in under 10 seconds, and 10 accounts in under 2 minutes. Obviously this saves you loads of time," explains the site (I think you need less than a minute to create a Gmail account manually). The program costs $57, but I'm sure it's not the only one.
October 22, 2007
Remove Spam from Google Blog Search
Even if Google Blog Search doesn't have too many interesting features, I still use it more often than Technorati because it's faster, it's not down for hours, it's much more comprehensive and it has features not available in any other important blog search engine. I still use Technorati for finding backlinks, because Google does a poor job in this area (compare Technorati with Google Blog Search). Unfortunately, Google Blog Search indexes a lot of spam posts that steal content and use it for lucrative purposes.
Google has two features that reduce the number of splogs (spam blogs) from search results. Like in web search, there's a duplicate filter that removes some of the posts that are almost identical. But it doesn't exclude all of them and it doesn't find posts that duplicate articles from news sites like Business Week.
The second feature is the option to sort results by relevancy, which is enabled by default. It may seem counterintuitive to sort blog search results by relevancy and not chronologically, but that's a great way to filter splogs or at least move them to the bottom of Google's search results. Google uses a lot of signals to rank blog posts, including PageRank, the number of feed subscriptions and the amount of duplicate content. But if you sort the results by relevancy, you'll find both recent and old posts, and that's not always what you want. A better way is to restrict the results to a recent period of time in the sidebar (to the last day or the last hour, depending on the volume of posts).
If you see a "References" link after the snippet, that's an indication that Google found (a significant number of) backlinks, so the result should be a little more reliable.
Many blogs use Google Alerts to pollute the web and make money, so you could also add [-"google alert"] to your query (a search for "google alert" returns more than 200,000 results). A lot of spam blogs are hosted by Google's Blog*Spot, so removing the posts from blogspot.com could increase the quality of your results, but it would also remove non-spammy blogs like this one or Google's official blogs. I also noticed that many spam blogs use the .info TLD. A recent study showed that, when searching for commercial keywords, 75% of the results from blogspot.com and 68% of the results from .info sites are spam.
It's also a great idea to restrict the results to English (or another language) in "Advanced blog search".
So here's a summary:
1. sort the results by relevancy
2. restrict the results to a recent period (last day)
3. restrict the results to English (or another language)
4. if you really have to sort the results by date, remove the posts that follow a spammy pattern (for example, add -"google alert" -site:blogspot.com -site:.info to your query), but make sure you don't remove important results
5. check the posts that contain "References"
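The exclusion pattern from step 4 is easy to build programmatically. Here is a minimal Python sketch; the function name `build_blog_query` and the sample search terms are made up for illustration, and only the `-"phrase"` and `-site:` operators come from the tips above.

```python
def build_blog_query(terms, exclude_phrases=(), exclude_sites=()):
    """Append negative filters to a blog search query string."""
    parts = [terms.strip()]
    # Exclude exact phrases, e.g. -"google alert"
    parts += [f'-"{phrase}"' for phrase in exclude_phrases]
    # Exclude hosts or TLDs, e.g. -site:blogspot.com or -site:.info
    parts += [f"-site:{site}" for site in exclude_sites]
    return " ".join(parts)


# Hypothetical search terms; the filters are the ones suggested in the post.
query = build_blog_query(
    "gmail spam filter",
    exclude_phrases=["google alert"],
    exclude_sites=["blogspot.com", ".info"],
)
print(query)  # gmail spam filter -"google alert" -site:blogspot.com -site:.info
```

Keeping the exclusions in separate lists makes it easy to drop one of them (for instance blogspot.com) when it hides results you care about.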
Google should do a better job at detecting spam in Blog Search results and at identifying results from sites that happen to have feeds but aren't blogs. It should also make it more difficult for spammers to use sites like Blogger or Google Alerts to pollute the search results.
June 14, 2007
Comment Spam in Blogger
"Hi, Added a new value add to my blog this weekend - a news widget from www.widgetmate.com. I always wanted to show latest news for my keywords in my sidebar. It was very easy with this widget. Just a small copy paste and it was done. Great indeed."
Blogger doesn't offer an option to detect spam comments. The only options you have are to add a captcha to prevent automated spam or to moderate the comments, but moderation takes away the value of instant feedback. Even though Blogger adds rel=nofollow to all the links from comments, so spammers don't improve their ranking in search engines, "bloggers" like Addison post the same spam comment every 5 minutes to promote some mediocre widget site. Because Addison posts the comments manually, he can enter the captcha correctly.
But Blogger could at least check if similar comments were posted to a single blog multiple times. Or use the Akismet model.
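A check like this doesn't need much code. The following Python sketch is only an illustration of the idea, not how Blogger or Akismet works: it normalizes each comment and compares it with earlier comments on the same blog using the standard library's `difflib.SequenceMatcher`. The function names and the 0.9 similarity threshold are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def is_repeat(new_comment, earlier_comments, threshold=0.9):
    """Return True if new_comment is nearly identical to an earlier comment.

    The threshold is an arbitrary value chosen for this sketch.
    """
    candidate = normalize(new_comment)
    for old in earlier_comments:
        ratio = SequenceMatcher(None, candidate, normalize(old)).ratio()
        if ratio >= threshold:
            return True
    return False


earlier = [
    "Hi, Added a new value add to my blog this weekend - "
    "a news widget from www.widgetmate.com."
]
repeat = is_repeat(
    "Hi, added a new value add to my blog this weekend: "
    "a news widget from www.widgetmate.com!",
    earlier,
)
fresh = is_repeat("Thanks for the tip about the duplicate filter.", earlier)
print(repeat, fresh)
```

Comparing against every earlier comment is slow for busy blogs; a real system would more likely hash the normalized text or look only at recent comments.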
It's always surprising to see how a company that actively fights against web spam is defeated by some comment spammers that use cheap methods to promote their latest widgets.
February 10, 2007
BlogSpot Redirect Spam Floods Search Results
Spammers found a new way to drive traffic to their spammy sites or affiliates: create tons of free BlogSpot blogs, put some content in the templates, create links between all the blogs and redirect the visitors to the spammy sites. Apparently, this scheme works and the redirect seems smart too.
For a search like "how students loans affect fico score", BlogSpot redirect spam dominates the top 100 results (all the top 35 results are BlogSpot blogs).
Here's what you see when you click on the top result:
... and what Google sees:
Another query that shows a lot of spam BlogSpot blogs is "how to transfer music to blackberry". Most of the blogs don't contain any post and they only use the homepage, by changing the template.
{ Via Digg. }
December 6, 2006
Email Spam Skyrocketing
The beautiful New York Times confirms something everybody notices: spam is doing very well. "In the last six months, the problem has gotten measurably worse. Worldwide spam volumes have doubled from last year, according to Ironport, a spam filtering firm, and unsolicited junk mail now accounts for more than 9 of every 10 e-mail messages sent over the Internet."
The techniques responsible for this growth are image spam and the use of "vast networks of computers belonging to users who unknowingly downloaded viruses and other rogue programs".
Spammers try to fool message filters by using excerpts from books or articles. OCR software is not very effective because image spams use special effects, background images and weird fonts.
Traditional spam is still effective, as many mail services don't have powerful spam filters.
October 21, 2006
The Terror Storm of Spam Comments
Here are the comments for "Funny Commercial of a girl taking a guys shorts", a video proudly presented by Google Video:
1. Great video, but Terror Storm is a better video. You must open your eyes and see the truth.
2. Bush is responsible for 9/11, and you watch funny videos instead of doing something about it.
3. Your govt. wants to occupy you with stupidity like this.
4. Terror Storm is a lie.
5. OMG LOL!!!
6. Hey, check my site for more great videos like this. www.more-great-videos.com.
7. Don't watch terror storm. It's nothing but a crock of lies... no intelligent person would waste any time listening to that conspiracy theory bulls**t.
8. OPPOSE GOOGLE CENSORSHIP! Watch the most popular video: Terror Storm.
9. DESTROY THE NEW WORLD ORDER!
10. Put on some clothes, please!
If you hate spam comments like most of these, use the new feature from Google Video: "Mark as spam". The immediate effect is that the comment will disappear for you. If more people mark a comment as spam, the comment will be removed.
Google's idea is great (and continues the label cleanup), but I would have added a link that says "Mark as not spam", as most comments are off-topic and spammy.
{ Thanks, Kent Dodds. The comments are slightly modified. }
October 2, 2006
Gmail Improves False-Positive Spam Detection
The New York Times quotes a report from Lyris, an email marketing firm that tracked 57,000 messages sent from 57 businesses and nonprofit organizations. Only 3% of these messages were mistakenly labeled as spam by Gmail. In the first quarter of 2006, a similar study concluded that 44.1% of the business messages were sent to the Spam folder, even though customers chose to receive them.
Gmail's spam filters are adaptable, so they get better over time. In 2004, Slashdot asked "How good is Gmail's spam filter?" and someone responded:
"So far, no spam whatsoever has found its way into my inbox. However, the amount of false positives filtered into the spam folder is overwhelming.
For a while I wondered why I only got reports by email about 30-40% of my finished online auctions (link omitted, no free advertising here). Last week I accidentaly clicked on the spam folder, and there it was, dozens of FALSE POSITIVES."
August 28, 2006
A New Breed Of Spam
Who said spam can't be interesting? A new breed of spam hit my mail box. They aren't the usual mails that claim they'll help solve your medical problems or win a lot of money. Not at all: they contain excerpts from famous books. I could read texts from Stendhal, Galsworthy and they are very good.
"Yes, old man, I've been washing them ever since, but I cant get them clean. The first remark from Smither confirmed the uneasiness which had taken him forth.
It was HERE we came on your mother, Jon, and our stars were crossed.
She could call tomorrow, of course, openly at Green Street, and probably NOT see him. Could so short a sound mean so much, say so much, be so startling?"
As any quality content has a price, the mail comes with an ad attached as an image: it tells me to buy stocks from PPTL. I'll take that into account the next time I decide to make investments.
The mail was marked as spam by Gmail and didn't come alone. All the mails had text from Project Gutenberg, although with some parsing errors. Another interesting thing is that the phrases are in random order, so it's pretty hard to make out something meaningful.
Related: Spam art
July 28, 2006
Spam Art
While most people think spam is something unpleasant, Alex Dragulescu, a visual artist (he studied film, photography, art history, computer animation and programming) uses the characters found in spam messages to create virtual spam plants.
CNet explains how he created this form of art: "For the Spam Plants, he parsed the data within junk e-mail--including subject lines, headers and footers--to detect relationships between that data. Then he visually represents those relationships.
For example, the program draws on the numeric address of an e-mail sender and matches those numbers to a color chart, from 0 to 225. It needs three numbers to define a color, such as teal, so the program breaks down the IP address to three numbers so it can determine the color of the plant. The time a message is sent also plays a role. If it's sent in the early morning, the plant is smaller, or the time might stunt the plant's ability to grow, Dragulescu said.
The size of the message might determine how bushy the plant is. Certain keywords, such as Nigerian might trigger more branches."
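To get a feel for the color step described above, here is a small Python sketch. It is a guess at one way to do it, not Dragulescu's actual program: each octet of an IPv4 address is a number from 0 to 255, so the sketch simply assumes the first three octets become the red, green and blue channels. The function names and the sample address (from a range reserved for documentation) are made up.

```python
import ipaddress


def ip_to_rgb(address):
    """Map an IPv4 address to an (r, g, b) tuple.

    Assumption for this sketch: the first three octets are used as
    color channels and the fourth octet is ignored.
    """
    octets = ipaddress.IPv4Address(address).packed  # 4 bytes, each 0-255
    return tuple(octets[:3])


def rgb_to_hex(rgb):
    """Format an (r, g, b) tuple as a #rrggbb string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


color = ip_to_rgb("192.0.2.17")
print(color, rgb_to_hex(color))
```

The same kind of mapping could drive the other properties mentioned in the article, such as plant size from the sending time or bushiness from the message size.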
Alex Dragulescu has many projects connected to the Internet: spam architecture (that translates "various patterns, keywords and rhythms found in the text into three-dimensional modeling gestures"), blogbot ("a software agent in development that generates experimental graphic novels based on text harvested from web blogs"), algorithms of the absurd ("heap sort algorithm as a hidden structure and illustration of the desire for efficiency and the modes of production of the capitalist system").
It's really interesting to see how close art and technology can be and how easy it is to transform ugliness into something beautiful and revealing. "Spam is a random piece of literature, it has unseen effects, it changes all the time. And it's led me to see text differently," says Alex Dragulescu.
June 18, 2006
Billions Of Spam Pages Indexed By Google
I reported yesterday about two spam sites that have billions of pages in Google's index. It seems other people have spotted them too and the story is on Digg.
Here are some queries for which the sites rank really well:
pizza sauce recipe (#10)
profiling racial war (#1)
san jose mission earthquake (#1)
google product manager interview (#4)
grill steak cooking (#3)
Update: Google has removed the sites from the index. Hopefully, they tweaked the algorithms to prevent other similar situations.
June 17, 2006
Google Index Flooded With Spam
Google doesn't show the number of pages in the index on its homepage, but a good way to find it is to search for "* *". The resulting number is 25,270,000,000. Google says its index is more than 3 times larger than any other search engine's. But what kind of pages are included in this huge index? Let's see:
Search for [site:t1ps2see.com]: 2,460,000,000 results and every page is a search result from a spam site.
Search for [site:eiqz2q.org]: 5,100,000,000 results that redirect to the same t1ps2see.com spam site.
Here's Alexa traffic:
And the list of spam sites is really long. So that means Google has upgraded its index with spam sites.
Note: Yahoo and MSN have very few pages from the sites mentioned (MSN: 59 results, Yahoo: 11,800 results for [site:eiqz2q.org]).
Also read:
Examples of queries for which these sites rank well
Search Google searches (where you can see how Google indexes search results from Blog Search, Google Maps or Google Books).
April 5, 2006
Detecting Spam Pages Using Statistics
Spam web pages that are machine generated tend to differ in a number of ways from most other web pages, and can possibly be identified through statistical analysis. A paper from Microsoft Research titled "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages" (pdf) [by Dennis Fetterly, Mark Manasse, Marc Najork] looks at some ways of finding those pages. Spam web pages tend to have these characteristics:
* Host names with many characters, dots, dashes, and digits are likely to be spam web sites.
* "One piece of folklore among the SEO community is that search engines (and Google in particular), given a query q, will rank a result URL u higher if u’s host component contains q. SEOs try to exploit this by populating pages with URLs whose host components contain popular queries that are relevant to their business, and by setting up a DNS server that resolves those host names. The latter is quite easy, since DNS servers can be configured with wildcard records that will resolve any host name within a domain to the same IP address. For example, at the time of this writing, any host within the domain highriskmortgage.com resolves to the IP address 65.83.94.42."
* Linkage properties: looking at the number of links embedded on a page compared to the number of links pointing to those pages. Are they similar to what is seen on other pages on other sites?
* Content properties: A large number of automatically generated pages contain the exact same number of words, though individual words will differ from page to page.
* "Overall, the web evolves slowly, 65% of all pages will not change at all from one week to the next, and only about 0.8% of all pages will change completely." Spam pages tend to fall in the last category, because many of them are generated at each request.
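The host name signal from the first bullet is the easiest one to try out. Below is a minimal Python sketch that counts the same properties (length, dots, dashes, digits) and flags a host when any count is unusually high. The cutoff values and the function names are assumptions made up for this example, not numbers from the paper, and the sample hosts use the reserved example.com domain.

```python
def host_features(host):
    """Count the host name properties listed above as spam signals."""
    return {
        "length": len(host),
        "dots": host.count("."),
        "dashes": host.count("-"),
        "digits": sum(ch.isdigit() for ch in host),
    }


def looks_suspicious(host, max_length=45, max_dots=4, max_dashes=3, max_digits=5):
    """Flag a host when any count exceeds its cutoff.

    The cutoffs are illustrative guesses; a real filter would derive
    them from the distribution of host names in a crawl.
    """
    f = host_features(host)
    return (
        f["length"] > max_length
        or f["dots"] > max_dots
        or f["dashes"] > max_dashes
        or f["digits"] > max_digits
    )


for host in ("www.example.com",
             "cheap-loans-4-you-2006.best-mortgage-rates.example.com"):
    print(host, host_features(host), looks_suspicious(host))
```

A single feature like this produces false positives on its own, which is why the paper combines it with linkage, content and change-rate statistics.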
Related:
Detecting near-duplicate documents