Search is one of the core components of the SharePoint platform. Recently we have seen a lot of issues in our farm. While troubleshooting them, I learned the points below about SharePoint Search, which I would like to share.
SharePoint Search is comprised of three main functional process components:
Crawling (Gathering): Collecting content to be processed
Indexing: Organizing the processed content into a structured/searchable index
Query Processing: Retrieving a relevant result set relative to a given user query
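To make the three stages concrete, here is a toy sketch in Python. It is not how SharePoint implements search; the function names and the sample pages are made up for illustration, and the "index" is just a simple inverted index held in memory.

```python
from collections import defaultdict

def crawl(source):
    """Gathering: yield (url, text) pairs from a content source."""
    for url, text in source.items():
        yield url, text

def build_index(documents):
    """Indexing: map each lowercase term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in documents:
        for term in text.lower().split():
            index[term].add(url)
    return index

def query(index, user_query):
    """Query processing: return URLs that contain every term in the query."""
    terms = user_query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

# Hypothetical content standing in for pages served by a WFE.
SAMPLE_SOURCE = {
    "http://foo/pages/a.aspx": "SharePoint search crawl overview",
    "http://foo/pages/b.aspx": "Troubleshooting search index issues",
}

index = build_index(crawl(SAMPLE_SOURCE))
print(query(index, "search crawl"))
```

Only the first sample page contains both "search" and "crawl", so it is the only result for that query.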
How Search Works:
A server running a SharePoint Search Crawl Component (e.g. a “Crawl Server”) performs the act of crawling by making web requests such as HTTP GET to the applicable web front ends (WFEs) hosting the content. More simply, the Crawler makes requests for content and the WFE responds with the requested documents/pages/data. After the content has been gathered by the Crawler, it is filtered and processed into the final end product, an index. Whether you use SharePoint Search or FAST Search for SharePoint 2010, the component responsible for crawling (aka “gathering”) the content to be indexed is the SharePoint Crawl Component.
By default, the SharePoint Server 2010 crawler crawls all available Web front-end computers in a SharePoint farm through the network load balancer in that farm. Therefore, when a crawl is occurring, the crawler can cause increased network traffic, increased usage of hard disk and processor resources on Web front-end computers, and increased usage of resources on database servers. Putting this additional load on all Web front-end computers at the same time can decrease performance across the SharePoint farm.
This decrease in performance occurs only on the SharePoint farm that is serving user requests, and not on the SharePoint search farm. This decreased performance can cause delayed response times on the Web front-end computers and delayed response times for the overall farm. The decreased performance might not be diagnosed by specific logs, resource counters, or standard monitoring.
You can reduce the effect of crawling on SharePoint performance by doing the following:
· Redirect all crawl traffic to a single SharePoint Web front-end computer in a small environment or a specific group of computers in a large environment. This prevents the crawler from using the same resources that are being used to render and serve Web pages and content to active users.
· Limit search database usage in Microsoft SQL Server (SQL Server 2008 R2, SQL Server 2008 with Service Pack 1 (SP1) and Cumulative Update 2, or SQL Server 2005 with SP3 and Cumulative Update 3) to prevent the crawler from using shared SQL Server disk and processor resources during a crawl.
Enumerating Content from the WFE
Assume http://foo:80 is the URL of a Web Application in a SharePoint Farm and a Search Content Source specifies the start address http://foo. The Crawl Component would gather items from this content source by essentially browsing each item from the web server hosting this content. When starting the crawl of SharePoint content, the Crawler first looks for http://foo/robots.txt to determine whether any items are disallowed. It then browses the default page for the URL and looks in the response for the response header ‘MicrosoftSharePointTeamServices’. If this response header does not exist, the crawler proceeds as if this were generic web content and performs a “spider crawl”. If the response header does exist, the SharePoint crawler begins enumerating all of the items in this site by targeting the Site Data web service for this URL, in this case http://foo/_vti_bin/sitedata.asmx. Put simply, the Crawler uses the Site Data web service to ask the WFE, “Hey, WFE… what content do you have for this URL?” The Crawler stores the enumerated items in a queue and then begins retrieving each of the items from the WFE.
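The decision made at the start of a crawl can be sketched roughly as below. This is only an illustration in Python, not the crawler's real code: the function names, the "*" user agent, and the sample header value are assumptions, and no network request is made (the robots.txt text and the response headers are passed in).

```python
from urllib.robotparser import RobotFileParser

SHAREPOINT_HEADER = "MicrosoftSharePointTeamServices"

def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules (user agent is a placeholder)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def choose_crawl_mode(start_address, response_headers):
    """Return (mode, endpoint) based on the default page's response headers.

    If the SharePoint header is present, enumerate through the Site Data
    web service; otherwise fall back to a generic spider crawl.
    """
    header_names = {name.lower() for name in response_headers}
    if SHAREPOINT_HEADER.lower() in header_names:
        endpoint = start_address.rstrip("/") + "/_vti_bin/sitedata.asmx"
        return "sharepoint", endpoint
    return "spider", None

# Illustrative inputs only; the header value is a made-up sample.
robots = "User-agent: *\nDisallow: /private/"
print(is_allowed(robots, "http://foo/default.aspx"))
print(choose_crawl_mode("http://foo", {SHAREPOINT_HEADER: "14.0.0.4762"}))
print(choose_crawl_mode("http://foo", {"Content-Type": "text/html"}))
```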
More specifically, the Crawler enumerates the content through a series of iterative SOAP calls to the Site Data web service. It first asks the WFE, “For this URL [Virtual Server], what Content Databases do you have?” Through a SOAP response, the WFE enumerates the applicable Content DB[s] along with the identifying GUID for each Content DB. The Crawler then asks the WFE, “For the Content DB with the GUID [XYZ], what Site Collections do you have?” Again, the WFE responds with another SOAP response containing each of the Site Collection[s] along with other metadata applicable to each Site Collection. The Crawler and WFE continue this conversation to drill down into each of the Webs (e.g. the top-level site and all sub-sites) in the Site Collection, then each of the lists/libraries in a Web, until all of the items from the lists/libraries have been enumerated for the URL.
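The drill-down order can be pictured with a small Python sketch. Everything here is hypothetical: the nested dictionary stands in for the answers the WFE would give over SOAP, the site, list and item names are invented, and each loop level corresponds to one "what do you have under this?" question.

```python
from collections import deque

# Hypothetical stand-in for what the WFE reports at each level.
FARM = {
    "http://foo": {                                   # virtual server
        "5e4720e3-3d6b-4217-8247-540aa1e3b90a": {     # content DB GUID
            "http://foo/sites/team": {                # site collection
                "http://foo/sites/team": {            # top-level web
                    "Documents": ["plan.docx", "notes.txt"],
                },
                "http://foo/sites/team/sub": {        # sub-site
                    "Tasks": ["task1"],
                },
            },
        },
    },
}

def enumerate_items(farm, start_address):
    """Walk virtual server -> content DBs -> site collections -> webs
    -> lists -> items, and queue every item URL for later retrieval."""
    queue = deque()
    for db_guid, site_collections in farm[start_address].items():
        for site_url, webs in site_collections.items():
            for web_url, lists in webs.items():
                for list_name, items in lists.items():
                    for item in items:
                        queue.append(f"{web_url}/{list_name}/{item}")
    return queue

for url in enumerate_items(FARM, "http://foo"):
    print(url)
```

The queue is filled first and drained afterwards, which mirrors the way the Crawler enumerates items before it starts retrieving them.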
It’s worth noting that for a Full Crawl, all of the items within this Web Application will be enumerated and queued by the Crawler for processing. With an Incremental Crawl, however, the Crawler also passes a Change Log Cookie (which it received from the WFE on the previous crawl) to the WFE. A Change Log Cookie resembles the following: 1;0;5e4720e3-3d6b-4217-8247-540aa1e3b90a;634872101987500000;10120. It contains both the identifying GUID for the applicable Content DB (in this case, 5e4720e3-3d6b-4217-8247-540aa1e3b90a) and the row ID from the EventCache table (e.g. row 10120) in the specified Content DB. With this, the WFE can identify all events that have occurred for this Content DB since that point and thus identify any content that has changed since the last crawl.
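As a rough illustration, the cookie can be split on semicolons to pull out the two values described above. This Python sketch assumes the GUID is always the third field and the row ID the fifth, as in the sample; the other fields are kept as raw strings and not interpreted. The events_since helper and its rows are invented to show the "everything after this row" idea and do not reflect the real EventCache schema.

```python
import uuid
from typing import NamedTuple

class ChangeLogCookie(NamedTuple):
    content_db_id: uuid.UUID   # GUID of the Content DB
    last_event_row: int        # row ID in the EventCache table
    raw_fields: tuple          # all fields, uninterpreted

def parse_change_log_cookie(cookie):
    """Split a cookie shaped like the sample into its fields."""
    fields = tuple(cookie.split(";"))
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return ChangeLogCookie(uuid.UUID(fields[2]), int(fields[4]), fields)

def events_since(event_rows, cookie):
    """Hypothetical filter: keep rows recorded after the cookie's row ID."""
    return [row for row in event_rows if row["id"] > cookie.last_event_row]

cookie = parse_change_log_cookie(
    "1;0;5e4720e3-3d6b-4217-8247-540aa1e3b90a;634872101987500000;10120"
)
print(cookie.content_db_id, cookie.last_event_row)

# Made-up event rows: only those after row 10120 count as changes.
rows = [{"id": 10119}, {"id": 10120}, {"id": 10121}, {"id": 10125}]
print(events_since(rows, cookie))
```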