Search Engine Optimization (SEO): Spamdexing


Introduction


Spamdexing, in computer jargon, is a combination of the words "spamming" and "indexing". It emerged as search engines became increasingly important to commercial web sites; in that sense, spamdexing is a by-product of the service-providing web. Spamming on the web refers to web pages that exist only to mislead search engines into directing users to specific sites. Spamdexing, also known as search engine poisoning, is the intentional manipulation of search engine indexes. Because of it, users have a harder time retrieving the information they need, and search engines have to cope with an inflated corpus, which in turn increases their cost per query.

Spamdexing covers a variety of methods and techniques for manipulating search engine indexes, and in practice it concerns linkage structure and page content. Some consider it part of search engine optimization (SEO), although the term is used to describe unethical and illegitimate methods of optimizing a web page. The SEO industry is a domain where search engine optimizers promise to help commercial web sites by boosting their relevance ranking. Stuffing a page with popular query terms was the first step toward this goal; later, some optimizers automatically generated web pages with little or copied content, all redirecting to the page intended to receive the traffic. Doing so means injecting a large corpus of web pages into search engine indexes and setting up DNS servers that map and alter IP addresses. The earliest known reference to the term spamdexing is attributed to Eric Convey in his article "Porn sneaks way back on Web", published in the Boston Herald in May 1996.

Search engines depend on web crawlers and content providers to supply the information from which they compile their indexes. The problem arises when individuals abuse their knowledge of how the web works in order to achieve better placement in search results. Pages that contain spamdexing can be identified in several ways (statistical analysis, variations of algorithms that locate spam patterns), and such pages, which tamper with search engine indexes and cause problems for users, can be excluded from ranking or treated differently from valid web pages.




Keywords: Information Retrieval, Spamdexing, Search Engine Optimization



Spamdexing classes

The first step toward countering spamdexing, and toward improving indexing and the relevance of results, is understanding it: analyzing the methods and techniques that constitute it. Spamdexing methods can be divided into two main categories, content spam and link spam, along with a number of other techniques. All of these methods are described in the sections below and summarized in Table 1.




Content spam         Link spam                  Other types of spamdexing
Keyword stuffing     Link-building software     Mirror websites
Hidden text          Link farms                 URL redirection
Meta tags            Hidden links               Cloaking
Doorway pages        Sybil attack
Scraper sites        Spam blogs
Article spinning     Page hijacking
                     Expired domains
                     Cookie stuffing
                     World-writable pages

Table 1. Spamdexing classes and methods.

1. Content spam

These techniques involve altering the logical view that a search engine has of a page's contents. They all target variants of the vector space model, an algebraic model for representing text documents (and objects in general) used for information retrieval on text collections.
The vector space model is used in information filtering, information retrieval, indexing and relevance ranking. Documents and queries are represented as vectors, where each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several methods for computing these values, also called weights, have been developed; one of the best-known schemes is tf-idf weighting. Typically terms are single words, keywords, or longer phrases. If words are chosen as the terms, the dimensionality of the vector equals the size of the vocabulary (the number of distinct words occurring in the corpus of texts). Vector operations can also be used to compare documents with queries.
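To make the model concrete, the short Python sketch below builds tf-idf vectors over a tiny invented corpus and scores each document against a query with cosine similarity; the documents and query are illustrative only, and real engines use far more elaborate weighting and normalization.

```python
# A minimal sketch of the vector space model with tf-idf weighting,
# using a made-up three-document corpus purely for illustration.
import math
from collections import Counter

docs = [
    "cheap flights cheap hotels",
    "flights to athens",
    "hotel reviews and flights",
]

def tokenize(text):
    return text.lower().split()

# Document frequency: in how many documents each term appears.
df = Counter()
for doc in docs:
    df.update(set(tokenize(doc)))

N = len(docs)

def tfidf_vector(text):
    """Map a text to a dict of term -> tf-idf weight."""
    tf = Counter(tokenize(text))
    return {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

query_vec = tfidf_vector("cheap flights")
for doc in docs:
    print(round(cosine(query_vec, tfidf_vector(doc)), 3), "-", doc)
```

Note how the idf factor drives the weight of a term that appears in every document to zero, which is why spammers target rare, high-value keywords rather than common words.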


Content spam methods

1.1. Keyword stuffing

Keyword stuffing involves the calculated placement of keywords within a page to raise its keyword count, variety, and density. This makes the page appear more relevant to a web crawler and therefore more likely to be found. Older indexing programs simply counted how often a keyword appeared and used that to determine relevance. Most modern search engines can analyze a page for keyword stuffing and determine whether the keyword frequency is consistent with sites created specifically to attract search engine traffic.
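As a rough illustration of the kind of signal an engine might compute, the sketch below measures the density of a single keyword on a page; the 8% threshold is an arbitrary value chosen for the example, not a figure any search engine has published.

```python
# A rough sketch of a keyword-density check over the plain text of a page.
def keyword_density(text, keyword):
    """Fraction of the page's words that are the given keyword."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

page = ("cheap flights cheap flights book cheap flights now "
        "cheap flights deals cheap flights today")

density = keyword_density(page, "cheap")
print(f"density of 'cheap': {density:.0%}")
if density > 0.08:               # illustrative threshold, not a real engine's rule
    print("page looks stuffed with this keyword")
```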

1.2. Hidden or invisible text

Unrelated hidden text is disguised by making it the same color as the background, using a tiny font size, or hiding it within HTML code such as "noframes" sections, alt attributes, zero-sized DIVs, and "noscript" sections. People screening websites for a search-engine company might temporarily or permanently block an entire website for having invisible text on some of its pages. However, hidden text is not always spamdexing, as it can also be used to enhance accessibility.
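A very naive detector for inline-styled hidden text could look like the sketch below; it only inspects inline style attributes and ignores external stylesheets and rendered colors, so it illustrates the idea rather than being a workable tool.

```python
# A naive sketch of spotting text hidden via inline CSS.
import re

HIDDEN_STYLE = re.compile(
    r'style\s*=\s*"[^"]*(display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0)',
    re.IGNORECASE,
)

def flag_hidden_text(html):
    """Return True if the markup contains obviously hidden inline-styled elements."""
    return bool(HIDDEN_STYLE.search(html))

sample = '<div style="display:none">cheap flights cheap hotels cheap cars</div>'
print(flag_hidden_text(sample))  # True
```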

1.3. Meta-tag stuffing

This involves repeating keywords in the meta tags and using meta keywords that are unrelated to the site's content. Meta tags are codes embedded invisibly in a web page to "help" search engines index it. This technique has been ineffective since 2005.

1.4. Doorway pages

Doorway pages are low-quality web pages created with very little content; instead, they are stuffed with very similar keywords and phrases. They are designed to rank highly within search results but serve no purpose to visitors looking for information. A doorway page will generally have "click here to enter" on the page. In 2006, Google removed BMW's German site, BMW.de, from its index for using doorway pages.

1.5. Scraper sites

Scraper sites are created using programs designed to "scrape" search-engine results and assemble "content" for a website. The presentation of content on these sites is unique, but it is simply a fusion of content taken from other sources, often without permission. Such websites are generally full of advertising (such as pay-per-click ads), or they redirect the user to other sites. Scraper sites can even outrank the original websites for their own information and organization names.


1.6. Article spinning

Article spinning involves rewriting existing articles in different ways, always preserving keywords and query terms, in order to avoid the penalties search engines impose for duplicate content. This process is undertaken by hired writers or by automated agents using thesaurus databases or neural networks.
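A toy version of thesaurus-based spinning is sketched below; the synonym table is invented for the example, and real spinners rely on much larger thesaurus databases or trained language models, as noted above.

```python
# A toy illustration of thesaurus-based article spinning.
import random

SYNONYMS = {
    "cheap": ["inexpensive", "affordable", "budget"],
    "buy": ["purchase", "order", "get"],
    "fast": ["quick", "rapid", "speedy"],
}

def spin(sentence, keep=("flights",)):
    """Replace words with random synonyms while preserving protected keywords."""
    out = []
    for word in sentence.split():
        if word in keep or word not in SYNONYMS:
            out.append(word)
        else:
            out.append(random.choice(SYNONYMS[word]))
    return " ".join(out)

print(spin("buy cheap flights fast"))
# e.g. "purchase budget flights quick"
```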



2. Link spam

Link spam is defined as link building between pages that exists for reasons other than recommendation or merit. It takes advantage of link-based ranking algorithms, which rank a site more highly the more highly ranked sites link to it. These techniques also aim at influencing other link-based ranking techniques such as the HITS algorithm. Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities, is a precursor to PageRank: a link analysis algorithm that rates web pages.
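The hub/authority iteration at the heart of HITS can be sketched in a few lines of Python; the link graph below is made up for the example, and a production implementation would restrict the graph to a query-dependent neighborhood and handle convergence more carefully.

```python
# A compact sketch of the HITS hub/authority iteration on a tiny made-up
# link graph (adjacency list of page -> pages it links to).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "spam": ["c"],
}

hubs = {p: 1.0 for p in graph}
auths = {p: 1.0 for p in graph}

for _ in range(50):  # iterate until the scores stabilise
    # authority score: sum of hub scores of pages linking to you
    auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in graph}
    # hub score: sum of authority scores of pages you link to
    hubs = {p: sum(auths[q] for q in graph[p]) for p in graph}
    # normalise so the values do not blow up
    a_norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
    h_norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
    auths = {p: v / a_norm for p, v in auths.items()}
    hubs = {p: v / h_norm for p, v in hubs.items()}

print({p: round(v, 3) for p, v in auths.items()})
```

A link farm exploits exactly this mutual reinforcement: a dense cluster of pages pointing at one another inflates both hub and authority scores inside the cluster.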


Link spam methods

2.1. Link-building software

A common form of link spam is the use of link-building software to automate the search engine optimization process. Some well-known packages are All-in-One Submission, SEO Suite, IBP, etc.

2.2. Link farms

Link farms are closed communities of web pages with common interests, in the same domain or constructed by the same organization or company. These pages reference one another in order to provide information and services. They are also known as mutual admiration societies.


2.3. Hidden links
This method involves the strategic placement of hyperlinks in the "empty spaces" of a web page, where visitors cannot see them. Popularity is increased this way because highlighted link text may help a webpage rank higher for queries matching that phrase.

2.4. Sybil attack

A Sybil attack is the forging of multiple identities for malicious intent, named after the famous multiple-personality patient "Sybil" (Shirley Ardell Mason). A spammer may create multiple web sites at different domain names, often expired ones, that all link to each other.

2.5. Spam blogs

Spam blogs are blogs created purely for commercial promotion and for passing link authority to target sites. They are designed to give the appearance of a legitimate website, but upon close inspection they are often written with spinning software, and their content is barely readable.

2.6. Page hijacking

Page hijacking is achieved by creating a copy of a popular website (a service provider) which web crawlers parse as similar to the original, but which redirects users to unrelated or malicious websites.

2.7. Expired domains

This technique involves monitoring DNS records for domains that will expire soon or have expired. Spammers then buy them and replace their content with links to the target site. However, Google resets the link data on expired domains. Some of these techniques can be applied to create a Google bomb, that is, the cooperation of many users to boost the ranking of a particular page for a particular query.

2.8. Cookie stuffing

Cookie stuffing involves placing an affiliate tracking cookie on a website visitor's computer without their knowledge, which then generates revenue for the person doing the stuffing. It not only produces fraudulent affiliate data but is also dangerous because of its potential to overwrite other affiliates' cookies, stealing their legitimately earned commissions.

2.9. Using world-writable pages

2.9.1. Forum spam
Web sites that can be edited by users can be used by spamdexers to insert links to their target websites.
Automated spam bots can quickly make the user-editable parts of a site unusable, so websites have developed a variety of automated spam prevention techniques to track or block them.

2.9.2. Spam in blogs
Guest books, forums, blogs, and any site that accepts visitors' comments are possible targets, and they are often victims of drive-by spamming, where automated software creates nonsense posts with links that are usually irrelevant and unwanted. Many blog providers, such as WordPress and Blogger, make their comment sections nofollow by default because of concerns over spam.


2.9.3. Comment spam
Comment spam is a form of link spam in web pages that allow dynamic user interaction, such as wikis, blogs and forums.

2.9.4. Wiki spam
Wiki spam is another form of link spamdexing, targeting wiki pages. The open editability of wiki systems is exploited to place links pointing to the target site, whose content is usually unrelated to the topic of the wiki article. In 2005 Wikipedia implemented a default nofollow value on external links in order to discourage spamdexing; links with this attribute are ignored by Google's PageRank algorithm.


2.9.5. Referrer log spamming
Some websites keep a referrer log showing which pages link to them. Spamdexing here is achieved by having a robot repeatedly access many sites while presenting a chosen message or address as the referrer. That message or address then appears in the referrer logs, and search engines that take such logs into consideration may increase the ranking of the corresponding page.

3. Other types of spamdexing

3.1. Mirror websites

Mirroring is the hosting of multiple websites with conceptually identical content under different URLs. Some search engines rank results higher when the keyword searched for appears in the URL.

3.2. URL redirection

URL redirection takes the user to another page without his or her intervention, for example using meta refresh tags, Flash, JavaScript, Java or server-side redirects.

3.3. Cloaking

Cloaking refers to any of several means of serving a page to a search-engine spider that is different from the one users see. It is considered an attempt to mislead search engines about the content of a web site. However, cloaking can also be used to increase the accessibility of a site to users with disabilities, or to provide human users with content that search engines cannot parse. Google itself uses IP delivery, a form of cloaking, to deliver results. Another form of cloaking is code swapping, which means optimizing a page for a top ranking and then swapping in another page once that ranking is achieved.
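To show how simple the mechanism is, here is a minimal sketch of user-agent-based cloaking built on Python's standard http.server module; the crawler hints and response bodies are invented for the example, and real cloaking setups often key on IP ranges rather than user-agent strings.

```python
# A minimal sketch of user-agent cloaking: keyword-heavy text is served to
# anything that looks like a crawler, a normal page to everyone else.
from http.server import BaseHTTPRequestHandler, HTTPServer

CRAWLER_HINTS = ("googlebot", "bingbot", "crawler", "spider")

class CloakingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "").lower()
        if any(hint in agent for hint in CRAWLER_HINTS):
            body = b"cheap flights cheap hotels cheap flights deals"
        else:
            body = b"<html><body>Welcome to our travel site</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CloakingHandler).serve_forever()
```

This is also why a crawler that identifies itself as an ordinary browser, as mentioned in the conclusion, would receive the same page a human visitor sees.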





Search Engine Optimization (SEO)

Search Engine Optimization (SEO) is the process of modifying a targeted website so that it is indexed with a higher rank than it would normally have in search engine results. The site is intended to receive more traffic than others in the same domain and therefore has a better chance to provide information on a subject, advertise, sell products or services, and so on. Some of the methods used to achieve higher rankings are commonly referred to in the SEO industry as "Black Hat SEO". "Black Hat" search engine optimization is customarily defined as the set of techniques used to obtain higher search rankings in an unethical and inappropriate manner, using content spam and link-building software.

Search Engine Processing

A web search engine is designed to search for information on the World Wide Web. Search results are generally presented as a list, often referred to as search engine result pages. The information may consist of web pages, images, sound and other types of data files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines maintain real-time information by running algorithms over data gathered by web crawlers.

1. Web Crawlers (spiders)

This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a web site, such as checking links or validating HTML code, or to gather specific types of information from web pages, such as harvesting e-mail addresses (usually for sending spam).
A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks on each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler revisits a page, it might already have been updated or even deleted.
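A bare-bones version of this seed-and-frontier loop is sketched below; the example.com seed is a placeholder, and a real crawler would add politeness rules (robots.txt, rate limiting), URL canonicalization, and prioritization of the frontier.

```python
# A bare-bones crawler sketch: seed URLs, a crawl frontier, and a visited set.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="(http[^"#]+)"')

def crawl(seeds, max_pages=20):
    frontier = deque(seeds)          # URLs still to visit
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                 # skip unreachable or non-text pages
        visited.add(url)
        for link in LINK_RE.findall(html):
            frontier.append(urljoin(url, link))   # grow the crawl frontier
    return visited

print(crawl(["https://example.com/"]))
```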

2. Indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process in the context of search engines designed to find web pages on the Internet is web indexing.
Popular engines focus on full-text indexing of online, natural-language documents. Media types such as video, audio and graphics are also searchable.
Meta-search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.
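At the core of full-text indexing is the inverted index, a mapping from each term to the documents containing it; the following sketch builds one over a toy document collection and answers a simple AND query.

```python
# A small sketch of building an inverted index (term -> postings of doc ids),
# the core structure behind full-text search.
from collections import defaultdict

documents = {
    1: "cheap flights to athens",
    2: "athens hotel reviews",
    3: "flights and hotel deals",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """AND-query: documents containing every term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("flights", "athens"))   # {1}
```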

3. Search analytics

Search analytics is the analysis and aggregation of search engine statistics for use in search engine marketing (SEM) and search engine optimization (SEO). In other words, search analytics helps website owners understand and improve their performance on search engines. It includes search volume trends and analysis, reverse searching (entering websites to see their keywords), keyword monitoring, search result and advertisement history, advertisement spending statistics, website comparisons, affiliate marketing statistics and other methods of statistical analysis concerning the Web.

Conclusion

Identifying instances of spamdexing, that is, specific pages that contain types of spam, and then stopping the crawl, excluding them from the index, or indexing them only after removing the spam segments, is a difficult task. Preventing spamdexing in practice means making specific spamdexing techniques impossible, or at least not worth using. For instance, a search engine's crawler could identify itself as a regular web browser in order to defeat cloaking. Search engines today use variations of the fundamental ranking methods, which makes them increasingly resilient to known spamming methods.

Addressing spamdexing as a whole is hard because of the differences among individual spamming techniques. Even so, such an approach could be based on the common features that spam pages share. A more robust and adequate algorithm is therefore needed, or a wholly different logic of how indexing works, in order to deal with all forms of spamming without having to handle each technique individually. Deconstructing spamdexing involves more than describing the existing methods, which is only the first step toward understanding the problem; it also requires some thought on why spammers spam, which leads to an analysis of how indexes work, how the Web works, how services are provided to clients, and so on.

There has also been some discussion about the logic by which information is represented and treated by search engines and indexes. Bringing AI techniques into search engines would mean improving or inventing other ways of reasoning. Research is growing in the field of the Semantic Web, with OWL, ontologies and linked data, while others try to build fuzzy and probabilistic theories into search engine algorithms, with the goal of question-answering systems replacing relevance ranking and statistical analysis. Spamdexing has caused a great deal of trouble in the indexing domain, along with various side effects, but it has also been a reason for further research and improvement in how search engines work, leading to more robust algorithms and novel methods for improving information retrieval.

References

  1. Zoltan Gyongyi, Hector Garcia-Molina.
    "Web Spam Taxonomy", Computer Science Department, Stanford University.
  2. Dennis Fetterly, Mark Manasse, Marc Najork.
    "Spam, Damn Spam, and Statistics", Microsoft Research.
  3. Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
    "Modern Information Retrieval", ACM Press, New York, 1999.
  4. Monika R. Henzinger, Rajeev Motwani, Craig Silverstein.
    "Challenges in Web Search Engines", 2002.
  5. Ryan Flores.
    "How Black Hat SEO Became Big", Trend Micro, 2010.
  6. Peter A. Hamilton.
    "Google-bombing - Manipulating the Page Rank Algorithm".
  7. Lotfi A. Zadeh.
    "From Search Engines to Question Answering Systems".
  8. "Search Engine Optimization Starter Guide", Google, 2010.
  9. Gerrit Vandendriessche.
    "A few legal comments on Spamdexing", 2007.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License