Search is one of the core components of the SharePoint platform. Recently we have seen a lot of search-related issues in our farm. While troubleshooting those issues I learned the points below about SharePoint search, which I would like to share.
SharePoint Search is composed of three main functional process components:
Crawling (Gathering):
Collecting content to be processed
Indexing:
Organizing the processed content into a structured/searchable index
Query Processing:
Retrieving a relevant result set relative to a given user query
How Search Works:
A server running a SharePoint Search Crawl Component (e.g. a “Crawl Server”) performs the act of crawling by making web requests such as HTTP GET to the applicable web front ends (WFEs) hosting the content. Put more simply, the Crawler makes requests for content and the WFE responds with the requested documents/pages/data. After the content has been gathered by the Crawler, it is filtered and processed into the final end product: an index. Whether you use SharePoint Search or FAST Search Server 2010 for SharePoint, the component responsible for crawling (aka “gathering”) the content to be indexed is the SharePoint Crawl Component.
By default, the SharePoint Server 2010 crawler crawls all
available Web front-end computers in a SharePoint farm through the network load
balancer in that farm. Therefore, when a crawl is occurring, the crawler can
cause increased network traffic, increased usage of hard disk and processor
resources on Web front-end computers, and increased usage of resources on
database servers. Putting this additional load on all Web front-end computers
at the same time can decrease performance across the SharePoint farm.
This decrease in performance occurs only on the SharePoint
farm that is serving user requests, and not on the SharePoint search farm. This
decreased performance can cause delayed response times on the Web front-end
computers and delayed response times for the overall farm. The decreased
performance might not be diagnosed by specific logs, resource counters, or
standard monitoring.
You can reduce the effect of crawling on SharePoint
performance by doing the following:
· Redirect all crawl traffic to a single SharePoint Web front-end computer in a small environment, or to a specific group of computers in a large environment. This prevents the crawler from using the same resources that are being used to render and serve Web pages and content to active users (see the hosts-file sketch after this list).
· Limit search database usage in Microsoft SQL Server 2008 R2, SQL Server 2008 with Service Pack 1 (SP1) and Cumulative Update 2, or SQL Server 2005 with SP3 and Cumulative Update 3, to prevent the crawler from consuming shared SQL Server disk and processor resources during a crawl.
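The crawl-traffic redirection described in the first bullet is commonly implemented with a hosts-file override on the crawl server, so that the crawler resolves the Web Application URL to a dedicated WFE instead of going through the load balancer. A minimal sketch, assuming the Web Application host name foo used later in this post and a hypothetical dedicated crawl-target WFE at 192.168.1.50:

# C:\Windows\System32\drivers\etc\hosts on the crawl server (illustrative entry only)
192.168.1.50    foo

User traffic keeps resolving foo through the load balancer as before; only the crawl server's name resolution is changed.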
Enumerating Content from the WFE
Assuming http://foo:80 is the URL of a Web Application in a SharePoint Farm and a Search Content Source specifies the start address http://foo, the Crawl Component gathers items from this content source by essentially browsing each item from the web server hosting the content. When starting the crawl of SharePoint content, the Crawler first looks for http://foo/robots.txt to determine whether any items are disallowed. It then browses the default page for the URL and, from the response, looks for the response header ‘MicrosoftSharePointTeamServices’. If this response header does not exist, the crawler proceeds as if this were generic web content and performs a “spider crawl”. If the response header does exist, the SharePoint crawler begins enumerating all of the items in this site by targeting the Site Data web service for this URL - in this case, http://foo/_vti_bin/sitedata.asmx. Simplistically, the Crawler leverages the Site Data web service to ask the WFE, “Hey, WFE… what content do you have for this URL?” The Crawler stores the enumerated items in a queue and then begins retrieving each of the items from the WFE.
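The sequence above can be sketched with a short script. This is only an illustration of the requests the crawler issues, not the crawler's actual code; it assumes the Python requests library, the hypothetical start address http://foo, and that authentication is already handled:

import requests

start_address = "http://foo"  # start address from the content source (example value)

# 1. Check robots.txt to see whether any items are disallowed.
robots = requests.get(start_address + "/robots.txt")
print("robots.txt status:", robots.status_code)

# 2. Browse the default page and look for the SharePoint response header.
home = requests.get(start_address)
if "MicrosoftSharePointTeamServices" in home.headers:
    # SharePoint content: enumerate items through the Site Data web service.
    sitedata_url = start_address + "/_vti_bin/sitedata.asmx"
    print("SharePoint detected; enumerating via", sitedata_url)
else:
    # Generic web content: fall back to a link-following "spider" crawl.
    print("No SharePoint header found; spider crawl")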
More specifically, the Crawler enumerates the content through a series of iterative SOAP calls to the Site Data web service. It first asks the WFE, “For this URL [Virtual Server], what Content Databases do you have?” Through a SOAP response, the WFE enumerates the applicable Content DB[s] along with the identifying GUID for each Content DB. The Crawler then asks the WFE, “For the Content DB with the GUID [XYZ], what Site Collections do you have?” Again, the WFE responds with another SOAP response containing each of the Site Collection[s] along with other metadata applicable to each Site Collection. The Crawler and WFE continue this conversation to drill down into each of the Webs (e.g. the top-level site and all sub-sites) in the Site Collection, each of the lists/libraries in a Web, and each of the items in those lists/libraries, until all items have been enumerated for the URL.
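One step of that conversation can be illustrated with an approximate SOAP request. The Site Data web service exposes a GetContent operation used for this kind of enumeration, but treat the exact envelope, parameters, and ObjectType values shown here as an approximation rather than a verbatim capture of what the crawler sends:

import requests

SITEDATA_URL = "http://foo/_vti_bin/sitedata.asmx"  # example URL from above

# Roughly: "For this Virtual Server, what Content Databases do you have?"
envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetContent xmlns="http://schemas.microsoft.com/sharepoint/soap/">
      <objectType>VirtualServer</objectType>
      <retrieveChildItems>true</retrieveChildItems>
      <securityOnly>false</securityOnly>
    </GetContent>
  </soap:Body>
</soap:Envelope>"""

response = requests.post(
    SITEDATA_URL,
    data=envelope,
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": "http://schemas.microsoft.com/sharepoint/soap/GetContent",
    },
)

# The response lists each Content DB with its identifying GUID; the crawler
# then repeats the call with ObjectType values further down the hierarchy
# (Content DB, Site Collection, Web, List, and so on).
print(response.status_code)
print(response.text[:500])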
It’s worth noting that for a Full Crawl, all of the items
within this Web Application will be enumerated and queued by the Crawler for
processing. However, with an Incremental Crawl, the Crawler will also pass along
a Change Log Cookie (that it received from the WFE on the previous crawl) to
the WFE. A Change Log Cookie resembles the following: 1;0;5e4720e3-3d6b-4217-8247-540aa1e3b90a;634872101987500000;10120. It contains both the identifying GUID of the applicable Content DB (in this case, 5e4720e3-3d6b-4217-8247-540aa1e3b90a) and the row ID from the EventCache table (e.g. row 10120) in the specified Content DB. With this, the WFE can identify all events that have occurred for this Content DB since that point and thus identify any content that has changed since the last crawl.
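Because the cookie is just a semicolon-delimited string, its parts are easy to pull apart. A minimal Python sketch using the example cookie above (the variable names are descriptive labels, not official field names):

# Example change log cookie from the text above.
cookie = "1;0;5e4720e3-3d6b-4217-8247-540aa1e3b90a;634872101987500000;10120"

parts = cookie.split(";")
content_db_guid = parts[2]  # identifying GUID of the applicable Content DB
change_id = parts[4]        # row ID in that Content DB's EventCache table

print("Content DB:", content_db_guid)
print("Last processed change (EventCache row):", change_id)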