RE40683 Process for maintaining ongoing registration for pages on a given search engine
ABSTRACT – A process for maintaining ongoing registration for pages on a given search engine is disclosed. It is a method to actively cause an updating of a specific Internet search engine database regarding a particular WWW resource. The updated information can encompass changed, added, or deleted content of a specific WWW site. The process comprises the steps of having software tools at a local WWW site manually and/or automatically keep an index of added, changed, or deleted content to a particular WWW site since that WWW site was last indexed by a specific Internet search engine. The software tools will notify a specific Internet search engine of the URLs of specific WWW site resources that have been added, changed, or deleted. The Internet search engine will process the list of indices of changes, additions or deletions provided by a web site, or add the URL of resources that require indexing or re-indexing to a database and visit the WWW site to index added or re-index changed content when possible. The benefit to the Internet is the creation of an exception-based, distributed updating system to the Internet search engine as opposed to the cyclical and repetitive inquiring by the Internet search engine to visit all WWW sites to find added, changed, or deleted content. Overall Internet transmissions are reduced by distributing the update and indexing functions locally to web sites and away from the central Internet search engine.
FIELD OF THE INVENTION
The present invention relates to the process of developing and maintaining the content of Internet search engine databases.
BACKGROUND OF THE INVENTION
An internet (including, but not limited to, the Internet, intranets, extranets and similar networks) is a network of computers, with each computer being identified by a unique address. The addresses are logically subdivided into domains or domain names (e.g. ibm.com, pbs.org, and oranda.net) which allow a user to reference the various addresses. A web (including, but not limited to, the World Wide Web (WWW)) is a group of these computers accessible to each other via common communication protocols, or languages, including but not limited to Hypertext Transfer Protocol (HTTP). Resources on the computers in each domain are identified with unique addresses called Uniform Resource Locator (URL) addresses (e.g. http://www.ibm.com/products/laptops.htm). A web site is any destination on a web. It can be an entire individual domain, multiple domains, or even a single URL.
Resources can be of many types. Resources with a “.htm” or “.html” URL suffix are text files, or pages, formatted in a specific manner called Hypertext Markup Language (HTML). HTML is a collection of tags used to mark blocks of text and assign meaning to them. A specialized computer application called a browser can decode the HTML files and display the information contained within. A hyperlink is a navigable reference in any resource to another resource on the internet.
An internet Search Engine is a web application consisting of:
- 1. Programs which visit and index the web pages on the internet.
- 2. A database of pages that have been indexed.
- 3. Mechanisms for a user to search the database of pages.
Agents are programs that can travel over the internet and access remote resources. The internet search engine uses agent programs called Spiders, Robots, or Worms, among other names, to inspect the text of resources on web sites. Navigable references to other web resources contained in a resource are called hyperlinks. The agents can follow these hyperlinks to other resources. The process of following hyperlinks to other resources, which are then indexed, and following the hyperlinks contained within the new resource, is called spidering.
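By way of illustration only, the spidering process described above can be sketched as a breadth-first walk over hyperlinks. The sketch below is not taken from any particular search engine; the names `LinkExtractor`, `spider`, `fetch` and the in-memory `SITE` dictionary are hypothetical stand-ins for an agent and for real HTTP retrieval.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(start, fetch):
    """Index a resource, then follow its hyperlinks to further resources.

    `fetch` is a hypothetical callable that maps a URL to HTML text and
    returns None when the resource cannot be retrieved.
    """
    indexed = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable resources are skipped, not indexed
        indexed.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


# Toy in-memory "site" standing in for real network requests (assumption).
SITE = {
    "/index.htm": '<a href="/a.htm">A</a> <a href="/b.htm">B</a>',
    "/a.htm": '<a href="/index.htm">Home</a> <a href="/c.htm">C</a>',
    "/b.htm": "no links here",
    "/c.htm": '<a href="/missing.htm">gone</a>',
}

order = spider("/index.htm", SITE.get)
```

The `seen` set keeps the agent from revisiting a resource when pages link back to one another, which is what allows the walk to terminate.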
The main purpose of an internet search engine is to provide users the ability to query the database of internet content to find content that is relevant to them. A user can visit the search engine web site with a browser and enter a query into a form (or page), including but not limited to an HTML form, provided for the task. The query may be in several different forms, but most common are words, phrases, or questions. The query data is sent to the search engine through a standard interface, including but not limited to the Common Gateway Interface (CGI). The CGI is a means of passing data between a client (a computer requesting data or processing) and a program or script on a server (a computer providing data or processing). The combination of form and script is hereinafter referred to as a script application. The search engine will inspect its database for the URLs of resources most likely to relate to the submitted query. The list of URL results is returned to the user, with the format of the returned list varying from engine to engine. Usually it will consist of ten or more hyperlinks per search engine page, where each hyperlink is described and ranked for relevance by the search engine by means of various information such as the title, summary, language, and age of the resource. The returned hyperlinks are typically sorted by relevance, with the highest rated resources near the top of the list.
The World Wide Web consists of thousands of domains and millions of pages of information. The indexing and cataloging of content on an Internet search engine takes large amounts of processing power and time to perform. With millions of resources on the web, and some of the content on those resources changing rapidly (by the day, or even minute), a single search engine cannot possibly maintain a perfect database of all Internet content. Spiders and other agents are continually indexing and re-indexing WWW content, but a single World Wide Web site may be visited by an agent once, then not be visited again for months as the queue of sites the search engine must index grows. A site owner can speed up the process by manually requesting that resources on a site be re-indexed, but this process can get unwieldy for large web sites and is, in fact, a guarantee of nothing.
Many current internet search engines support two methods of controlling the resource files that are added to their database. These are the robots.txt file, which is a site-wide, search engine specific control mechanism, and the ROBOTS META HTML tag, which is resource file specific, but not search engine specific. Most internet search engines respect both methods, and will not index a file if robots.txt, the ROBOTS META tag, or both instruct the internet search engine not to index a resource. The use of robots.txt, the ROBOTS META tag and other methods of index control is advocated for the purposes of the present invention.
Commonly, when an internet search engine agent visits a web site for indexing, it first checks the existence of robots.txt at the top level of the site. If the search agent finds robots.txt, it analyses the contents of the file for records such as:
- User-agent: *
- Disallow: /cgi-bin/SRC
- Disallow: /stats
The above example would instruct all agents not to index any file in directories named /cgi-bin/SRC or /stats. Each search engine agent has its own agent name. For example, AltaVista (currently the largest Internet search engine) has an agent called Scooter. To allow only AltaVista access to directory /avstuff, the following robots.txt file would be used:
- User-agent: Scooter
- Disallow:
- User-agent: *
- Disallow: /avstuff
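As a non-authoritative illustration, the effect of such records can be checked with the `urllib.robotparser` module of the Python standard library. The second rule set below gives Scooter its own record with an empty Disallow value (meaning nothing is disallowed), which is an assumption about how the Scooter-only access described above would be expressed; the agent name `OtherBot` and the file paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# First example: every agent is kept out of /cgi-bin/SRC and /stats.
site_wide = RobotFileParser()
site_wide.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/SRC",
    "Disallow: /stats",
])

# Second example (assumed form): Scooter may fetch everything,
# while all other agents are kept out of /avstuff.
scooter_only = RobotFileParser()
scooter_only.parse([
    "User-agent: Scooter",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /avstuff",
])

stats_allowed = site_wide.can_fetch("Scooter", "/stats/today.htm")
home_allowed = site_wide.can_fetch("Scooter", "/index.htm")
scooter_av = scooter_only.can_fetch("Scooter", "/avstuff/page.htm")
other_av = scooter_only.can_fetch("OtherBot", "/avstuff/page.htm")
```

Under these rules `stats_allowed` and `other_av` are False, while `home_allowed` and `scooter_av` are True.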
The ROBOTS META tag is found in the file itself. When the internet search engine agent indexes the file, it will look for a HTML tag like one of the following:
- <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
- <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
- <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
- <META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
INDEX and NOINDEX indicate to all agents whether or not the file should be indexed by that agent. FOLLOW and NOFOLLOW indicate to all agents whether or not they should spider hyperlinks in this document.
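The following minimal sketch, offered only as an illustration, shows how an agent might read these directives from a file using the `html.parser` module of the Python standard library. The names `RobotsMetaParser` and `robots_directives` are hypothetical, and treating a file without a ROBOTS META tag as indexable and followable is an assumption of this sketch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads INDEX/NOINDEX and FOLLOW/NOFOLLOW from a ROBOTS META tag."""

    def __init__(self):
        super().__init__()
        # Assumed defaults when no ROBOTS META tag is present.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        tokens = {token.strip().upper() for token in content.split(",")}
        if "NOINDEX" in tokens:
            self.index = False
        if "NOFOLLOW" in tokens:
            self.follow = False


def robots_directives(html):
    """Return (index_allowed, follow_allowed) for one HTML file."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
index_allowed, follow_allowed = robots_directives(page)
```

For the sample `page`, the file itself is not to be indexed but its hyperlinks may still be spidered.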
For current internet search engines, the present invention process uses the CGI program(s) provided by the search engine in order to add, modify and remove files from the search engine index. However, the process can generally only remove a file from the search engine index if the file no longer exists or if the site owner (under the direction of the process) has configured the site, through the use of robots.txt, the ROBOTS META tag or other methods of index control, so that the search engine will remove the file from its index.
The duration of time between the first time a site is indexed and the next time that information is updated has led to several key problems:
- A. A resource that is modified or removed by its owner after it is indexed by a search engine could be incorrectly listed in that search engine for months until an agent visits the site to register the change.
- B. A resource may be modified since the last time it was indexed, in which case a user may never be directed to the new content, or incorrectly directed to content that is no longer present.
- C. Deleted resources can create the impression for a search engine user that a whole web site has shut down, that the information the user is looking for is removed, or that the web site is not being maintained, when the resources may have simply been moved to another location on the site as part of regular site maintenance.
- D. Automated tools such as search engines apply their own criteria in order to determine the relevancy of a particular resource for a particular query. These automated criteria can lead to the search engine returning spurious, misleading, or irrelevant results to a particular query. For example, a recent search for the nursery rhyme “Rub a dub dub, three men in a tub” on a particular search engine resulted in the top ten search results containing discussions of various issues among consenting males.
- E. Automated agents are not always able to understand the context of the pages they index, as illustrated by the example above. As such, their one-dimensional capabilities allow web masters to create the impression that the resources on a particular site contain information they do not. This is done to direct traffic to sites by providing incorrect or misleading information, a process called spamming.
- F. Most automated agents are incapable of processing the content of resources that are binary in nature, such as applications written in the programming language Java. These applications can display text data, but do not use text or HTML files to do so. Instead, the information is encoded in binary form in the application. As such, an agent cannot determine the content of a resource coded in this manner.
The present invention provides a mechanism for search engine and web site managers to maintain as perfect a registration of web site content as is possible. By augmenting or replacing existing agents and manual registration methods with specialized tools on the local web site (and, when feasible, at the search engine), the current problems with search engine registration and integrity can be eliminated.
SUMMARY OF THE INVENTION
The present invention defeats the key problems with automated agents and manual registration and replaces them with an exception based, distributed processing system. Instead of making the search engine do all the work necessary to index a site, the web site owner is now responsible for that operation. By distributing the work, the search engine is improved in these ways:
- 1. The search engine can maintain perfect ongoing registration and indexing of pages by re-indexing at a set interval, as frequently as the web site owner chooses.
- 2. The search engine can maintain an intelligent database, not limited by the conditions that automated agents have imposed on them and not easily corruptible by web site owners with less ethical practices.
- 3. The search engine provides a guarantee of integrity to all users, ultimately providing a more valuable service to both users and web site owners.
The process is begun by distributing a set of search engine update software tools to the web site owner. These tools can be implemented in one of three ways. The first way is to implement the tools on the web server of the site owner. The software can run automatically, having direct access to all resources on the web site. The second way is to install the software tools on a surrogate server. This surrogate is a computer with proper permissions and access to the resources of the web site and automatically accesses those resources over the network. The third way is through the use of client-side tools. The software will run on each client’s computer, check the client’s web server via internet protocols, and relay the information on the web server to the search engine.
The software could be written in a variety of different programming languages and would be applicable for as many client and server computers as needed.
Upon initial execution, the software builds a database of the resources on the web site. The resources catalogued can be specified by the user or discovered automatically through spidering functions of the software. The database consists of one record per resource indexed on the site. Each record contains fields including:
- A. The search engines the owner of the web site would like the resource to be indexed by.
- B. The date and time of the last index by each search engine.
- C. The date and time a resource was last modified according to the local indexing engine.
- D. Flags to indicate whether a specific resource requires updating, inclusion, or removal from a particular search engine database.
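One possible in-memory shape for such a record is sketched below, purely as an illustration. The class and field names (`ResourceRecord`, `EngineStatus`, `Action`), the engine name `ExampleEngine` and the sample URL are hypothetical and are not prescribed by the process; the comments map each field back to items A through D above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    """Field D: what a given search engine still needs to be told."""
    NONE = "none"
    ADD = "add"
    UPDATE = "update"
    REMOVE = "remove"


@dataclass
class EngineStatus:
    last_indexed: Optional[datetime] = None  # field B, per search engine
    pending: Action = Action.NONE            # field D, per search engine


@dataclass
class ResourceRecord:
    url: str
    last_modified: Optional[datetime] = None  # field C
    # Field A: the keys are the search engines chosen by the site owner.
    engines: Dict[str, EngineStatus] = field(default_factory=dict)


record = ResourceRecord(
    url="http://www.example.com/products.htm",
    last_modified=datetime(1999, 5, 1, 12, 0),
)
record.engines["ExampleEngine"] = EngineStatus(
    last_indexed=datetime(1999, 4, 1),
    pending=Action.UPDATE,
)
```

Keeping the last-index time and the flag per search engine, rather than per resource, allows the same resource to be current on one engine and pending on another.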
Upon each subsequent execution the software tools inspect the current state of the web site against the content of the database. When altered, removed, or additional content is found, the software tools make the appropriate changes to the database and then notify the search engine of those changes (see FIG. 1, Box 206a, 207b-c). Changes to the database are made as follows:
- A. A resource is marked as deleted if the resource is listed in the current database, but cannot be retrieved.
- B. A resource is marked as modified if the date and time of last modification in the current database is earlier than the date and time of last modification provided by the web server for the resource.
- C. A resource is added and marked as added if it is present on the web server, but not yet in the database and the web site manager has opted to add it either manually or automatically.
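The three rules above can be sketched, under simplifying assumptions, as a comparison between two mappings. In this hypothetical example `database` maps each URL to the last-modified time recorded locally, `site` maps each retrievable URL to the time currently reported by the web server, and the `auto_add` parameter stands in for the web site manager's choice to add new resources automatically.

```python
from datetime import datetime


def detect_changes(database, site, auto_add=True):
    """Classify resources as 'deleted', 'modified' or 'added'.

    A URL missing from `site` is treated as one that cannot be retrieved.
    """
    changes = {}
    for url, recorded in database.items():
        if url not in site:
            changes[url] = "deleted"       # rule A
        elif recorded < site[url]:
            changes[url] = "modified"      # rule B
    if auto_add:
        for url in site:
            if url not in database:
                changes[url] = "added"     # rule C
    return changes


database = {
    "/index.htm": datetime(1999, 1, 1),
    "/old.htm": datetime(1999, 1, 1),
    "/news.htm": datetime(1999, 1, 1),
}
site = {
    "/index.htm": datetime(1999, 1, 1),
    "/news.htm": datetime(1999, 2, 1),
    "/new.htm": datetime(1999, 2, 1),
}
changes = detect_changes(database, site)
```

Here `/index.htm` is unchanged and therefore absent from `changes`, which reflects the exception-based nature of the process: only differences are reported.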
Through application of the present invention, the following improvements are made in search engine administration:
- 1. The task of spidering the web site has been distributed to the web site owner (see FIG. 1, Box 205c).
- 2. The web site owner has the capability to protect brand image from being injured by a search engine pointing potential visitors to deleted, irrelevant, or incorrect resource information.
- 3. The search engine owner has a higher degree of database integrity. Less information storage space is wasted on spurious, nonexistent or incorrect data.
- 4. The web site owner can directly indicate the keywords and other descriptions that are most appropriate for each resource in the site, as opposed to using the cumbersome HTML ‘Meta’ tag to specify the keywords for the agent. Keywords are words that are particularly relevant to a particular resource and might be used on a search engine to locate that resource.
- 5. The search engine can create a reverse index of keywords that the individual site owners have identified for each resource. For example, a user could query for a list of all web sites that have listed ‘dog’ as an appropriate keyword.
- 6. The internet search engine could be used by users to query the content of a particular web site, as opposed to requiring a web site based search engine to index the content. This saves administration effort and computing resources at the web site.
The main aspect of the present invention is to provide a method to index locally at a web site all changes to that site’s resource content database which have occurred since the last search engine indexing.
Another aspect of the present invention is to actively transmit said changes to an internet search engine.
Another aspect of the present invention is to automatically transmit batches of updates (a list of content that has changed since the last search engine index), in a predetermined manner.
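As a hedged illustration of such a batch, the sketch below packs a list of changed resources into a single form-encoded body of the kind a CGI program could accept. The function name `build_batch` and the field names `added` and `deleted` are hypothetical; the actual interface would be whatever a given search engine provides.

```python
from urllib.parse import parse_qsl, urlencode


def build_batch(changes):
    """Encode a mapping of URL -> change type as one form-encoded body.

    URLs are sorted so that the same set of changes always yields the
    same batch.
    """
    pairs = [(action, url) for url, action in sorted(changes.items())]
    return urlencode(pairs)


body = build_batch({"/new.htm": "added", "/old.htm": "deleted"})

# The receiving side can recover the list of (change type, URL) pairs.
decoded = parse_qsl(body)
```

Sending one body per run, rather than one request per resource, keeps the number of transmissions proportional to the number of runs instead of the number of changes.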
Other objects of this invention will appear from the following description and appended claims, reference being had to the accompanying drawings forming a part of this specification wherein like reference characters designate corresponding parts in the several views.