New profile wizard
Profiles define which web pages and servers the crawler should index, and
how. To create a new profile, select New profile wizard and follow the
online instructions.
Below, the basic configuration variables for a profile are described; the
more 'advanced' variables are described later, on the Advanced profile
configuration page.
- Profile id
  A unique identification for the profile. It should be a short
  identifying text, and must not contain any spaces. For example, the id
  could be: my_profile.
- Profile name
  The profile name is shown on the selection tab of the search page
  presented to the outside world. For example, this could be
  "My test search".
- Activated
  Should be left at "yes" for now.
- Storage directory
  A path in your file system, ending with a "/". IntraSeek has
  automatically created a special directory for storage of the databases,
  but you can change this to any path in the file system. For example,
  this could be: /usr/www/roxen/platform/intraseek/databases/.
- Working directory
  A path in your file system, ending with a "/". This is where the data
  gathered by the crawlers is stored. Because of how the database works,
  it is advantageous for this directory to be on a local disk (i.e. one
  not accessed through NFS); this can speed up the process by several
  hundred per cent. For example: /tmp/ or a similar locally mounted disk
  may be used.
- Startpages
  Specifies the set of pages the crawler starts at. It is usually
  sufficient to state the URL of the main page of the site you are about
  to index, since an IntraSeek crawler follows all links it finds (a
  minimal sketch of this link following appears after this list).
  Separate the URLs by putting them on separate lines. For example:
  http://my.server.com/~sysadm/
- Accept pattern
  Specifies which pages are to be accepted by the crawler. There are some
  very important things to consider here:
    - Always limit the crawler to stay within your site. If you don't,
      it will, without any warning, crawl out onto the World Wide Web.
    - Since the accept and avoid patterns really are regexps, they
      should read "^http://www.foo.com/*" instead of "www.foo.com/*" if
      you want to make sure not to index "http://gazonk.www.foo.com/"
      (see the pattern-matching sketch after this list).
    - Separate the various accept patterns by putting them on separate
      lines. For example, this could be "my.server.com/~webmaster/*".
- Avoid pattern
  Specifies which pages the crawler should avoid. Already specified are
  file types that contain information the crawler shouldn't index (e.g.
  source files and the like). If inappropriate, these may be removed in
  order to have the crawler include such files.
  For example, if you specify "*/~webmaster/non-public/" here, the
  crawler will avoid ~webmaster/non-public/ on all servers. If you
  specify "*my.server.com/~root/*", "/~root/" will not be indexed on the
  server my.server.com.
  Remember to check arguments to CGI scripts and the like. For instance,
  directory listings can sometimes enter infinite loops. If any such are
  present, it is recommended that "*?*" be added here.
  Check up on the crawler while it is running (check the log etc.), so
  that it doesn't go into a loop, run amok, etc.
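To see why a single start page is usually enough, here is a minimal,
hypothetical sketch (plain Python, not IntraSeek code) of the kind of
breadth-first link following a crawler performs: every page reachable
through links from the start pages is eventually visited. The function
name and the fetch_links callback are illustrative assumptions only.

    from collections import deque

    def crawl(startpages, fetch_links):
        """Visit every page reachable by following links from the start pages.

        `fetch_links(url)` stands in for downloading a page and extracting
        the URLs it links to; a real crawler would also apply the accept
        and avoid patterns before queueing a link.
        """
        queue = deque(startpages)
        seen = set(startpages)
        while queue:
            url = queue.popleft()
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen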
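The following pattern-matching sketch (again plain Python, not IntraSeek
code) illustrates how accept and avoid patterns might interact. It assumes
the simple rule that a URL is crawled only if it matches at least one
accept pattern and no avoid pattern, and it renders the wildcard examples
above as ordinary regular expressions (a literal "?" for CGI arguments,
and an anchored pattern for staying on one server). IntraSeek's actual
matching rules may differ in detail.

    import re

    # Regex renderings of the examples above; r"\?" corresponds to "*?*".
    ACCEPT = [r"^http://my\.server\.com/~webmaster/"]
    AVOID  = [r"\?", r"/~webmaster/non-public/"]

    def should_crawl(url):
        # Assumed rule: at least one accept pattern must match,
        # and no avoid pattern may match.
        if not any(re.search(p, url) for p in ACCEPT):
            return False
        return not any(re.search(p, url) for p in AVOID)

    # An unanchored pattern also matches hosts such as gazonk.www.foo.com,
    # which is why the manual recommends anchoring with "^":
    print(re.search(r"www\.foo\.com/", "http://gazonk.www.foo.com/") is not None)          # True
    print(re.search(r"^http://www\.foo\.com/", "http://gazonk.www.foo.com/") is not None)  # False

    print(should_crawl("http://my.server.com/~webmaster/index.html"))         # True
    print(should_crawl("http://my.server.com/~webmaster/ls?sort=name"))       # False: CGI arguments
    print(should_crawl("http://my.server.com/~webmaster/non-public/x.html"))  # False: avoided directory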
Finally, on the last page of the New profile wizard, press OK to save the
new profile. Technical notes: all profiles are saved in the text file
ENGINE_HOME/profiles.txt. If no ID is specified, a new unique ID will be
generated.