Solr Plugin
Enterprise Search Engine for Foswiki based on Solr
About Solr
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface.
Installation
The installation procedure below assumes that you are going to install Solr and Foswiki on the same Linux server.
Foswiki plugin installation
You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.
Open configure, and open the "Extensions" section. "Extensions Operation and Maintenance" Tab → "Install, Update or Remove extensions" Tab. Click the "Search for Extensions" button.
Enter part of the extension name or description and press search. Select the desired extension(s) and click install. If an extension is already installed, it will not show up in the search results.
You can also install from the shell by running the extension installer as the web server user: (Be sure to run as the webserver user, not as root!)
cd /path/to/foswiki
perl tools/extension_installer <NameOfExtension> install
If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command-line. See https://foswiki.org/Support/ManuallyInstallingExtensions for more help.
Download Solr
The current plugin requires Solr 5.0.0 or later. Download it from the Apache Archive.
tar xzf solr-5.x.x.tgz solr-5.x.x/bin/install_solr_service.sh --strip-components 2
./install_solr_service.sh ./solr-5.x.x.tgz
service solr stop
Secure Solr access
Restrict Solr so that it only listens on the local loopback interface:
- edit /var/solr/solr.in.sh
- add SOLR_OPTS="$SOLR_OPTS -Djetty.host=localhost"
Optionally relocate logs
… from /var/solr/logs to /var/log/solr
mv /var/solr/logs /var/log/solr
- edit /var/solr/solr.in.sh
- disable garbage collection logs … GC_LOG_OPTS
- set SOLR_LOGS_DIR=/var/log/solr
- edit /var/solr/log4j.properties
- set solr.log=/var/log/solr
- set log level from INFO to WARN: log4j.rootLogger=WARN, file, CONSOLE
Install Foswiki configuration set
cd /var/solr/data
cp -r <foswiki-dir>/solr/cores .
mkdir configsets
cd configsets
ln -s <foswiki-dir>/solr/configsets/foswiki_configs
chown -R solr:solr /var/solr
Updating from a previous configuration set
An updated SolrPlugin might come with a newer configuration set, i.e. newer schema.xml or solrconfig.xml files. Make sure that the files coming with an update are installed on the Solr server as well. This is taken care of when the foswiki_configs directory is linked into the Solr server's configsets directory. Note however that any local changes you made to these files will be overwritten by the update. You might either create a config set of your own and adjust the core definition accordingly to make use of the newly created config set, or merge your changes into the standard foswiki_configs set of files.
Start solr service again
service solr start
Test
cd <foswiki-dir>/tools
./solrindex topic=Main.WebHome
… should produce
Indexing Main.WebHome
cd <foswiki-dir>/bin
./rest /SolrPlugin/search
… should return a JSON response from Solr showing the recently indexed topic
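If you want to check the result in a script rather than by eye, the JSON can be inspected with a few lines of Python. This is only a sketch: it assumes that you have captured just the JSON body, that the REST handler passes through Solr's standard response layout (a response object holding numFound and a docs list), and that each document carries web and topic fields. The function name list_hits and the sample data are made up for illustration.

```python
import json


def list_hits(raw_json):
    """Return (numFound, ["Web.Topic", ...]) from a Solr JSON response string."""
    data = json.loads(raw_json)
    response = data.get("response", {})
    docs = response.get("docs", [])
    # 'web' and 'topic' are assumed field names; missing ones show up as '?'
    names = ["{}.{}".format(d.get("web", "?"), d.get("topic", "?")) for d in docs]
    return response.get("numFound", 0), names


# Hypothetical sample of what the handler might return after indexing Main.WebHome
sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"web": "Main", "topic": "WebHome"}]}}'
print(list_hits(sample))
```

An empty docs list after a successful solrindex run would suggest that the topic never reached the index.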
Skin integration
SolrPlugin comes with a skin overlay, called solr, that replaces the upper left search boxes in PatternSkin with a Solr-driven auto-suggest search box. To switch it on, use
* Set SKIN = solr, pattern
in your SitePreferences.
Note that you don't need to enable the solr skin overlay if you are using NatSkin, as it comes with support for SolrPlugin out of the box.
Commandline scripts
There is a set of tools to interact with the Solr index from the commandline. These can be used to index Foswiki manually, as we did in the above tests, as well as to search for or delete specific documents in the index.
The set of tools comes in two variants, one for normal single-host Foswiki installations and one for virtual hosting using VirtualHostingContrib. The virtual-hosting aware scripts have a prefix virtualhost-... and take an optional host=<domain> parameter to specify the virtual domain to interact with. When it is not specified, the script is executed for each domain in turn as configured in VirtualHostingContrib. The only exception is solrjob (see below).
solrindex / virtualhosts-index
cd <foswiki-dir>/tools
./solrindex ...
Parameter | Description | Default |
web="..." | the web to be indexed; if undefined all webs will be indexed | all |
topic="<web>.<topic>" | the topic to be indexed; use this parameter to index one specific topic | |
mode="full/delta" | mode of operation: full will unconditionally index all content as specified by web or topic; delta will only index content that has changed since the last time the script was run | delta |
optimize="on/off" | optimize the Solr database by de-fragmenting its internal segments for better performance; this is normally not required unless a full indexing of larger chunks of content is performed; note that optimizing the Solr index might require considerable time and I/O resources on the filesystem of the server | off |
solrdelete / virtualhosts-delete
cd <foswiki-dir>/tools
./solrdelete ...
For instance to empty your index completely use:
./solrdelete '*:*'
solrjob
cd <foswiki-dir>/tools
./solrjob ...
This tool is a wrapper around solrindex and will use either solrindex or virtualhost-solrindex depending on the host commandline parameter. It is mainly used in cronjobs or iwatch (see below).
In contrast to solrindex, a locking & throttling strategy is used to prevent multiple indexers from being started simultaneously. This is useful when firing up the indexer as part of iwatch monitoring filesystem changes in the Foswiki store. As these events often come in bundles, firing rapidly in a short period of time, only one indexing process will be spawned for a given time span defined by the throttle parameter to solrjob.
Parameter | Description | Default |
-f / --file <file-path> | index the topic that the given file points to | |
-h / --host <virtual-domain> | specifies the virtual domain to operate on (only makes sense when running VirtualHostingContrib); or specify all to perform the operation on all known virtual hosts | |
-m / --mode full/delta | mode of operation (see solrindex above) | delta |
-t / --throttle <seconds> | number of seconds to wait until the indexing process is started; note that any other calls to solrjob are prevented from entering the indexing loop as well | 5 |
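The locking & throttling idea described above can be pictured roughly as follows. This is not the actual solrjob code, which is not shown in this document; it is a minimal Python sketch that assumes a lock file guarded with flock, and the names try_lock and run_throttled are made up for illustration.

```python
import fcntl
import os
import tempfile
import time


def try_lock(lock_path):
    """Open lock_path and take an exclusive lock; return None if someone else holds it."""
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def run_throttled(lock_path, job, throttle=5, sleep=time.sleep):
    """Run job() after waiting `throttle` seconds, unless another run is already pending.

    Returns True if job() was run, False if the call was turned away.
    """
    handle = try_lock(lock_path)
    if handle is None:
        return False  # another indexer is already waiting or running
    try:
        sleep(throttle)  # let the burst of events pass before indexing once
        job()
        return True
    finally:
        fcntl.flock(handle, fcntl.LOCK_UN)
        handle.close()


if __name__ == "__main__":
    lock = os.path.join(tempfile.mkdtemp(), "indexer.lock")
    print(run_throttled(lock, lambda: print("indexing ..."), throttle=0))
```

The point of the design is that a burst of filesystem events results in a single indexing run: the first caller takes the lock and waits, and every caller arriving in the meantime returns immediately.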
rest /SolrPlugin/search
cd <foswiki-dir>/bin
./rest /SolrPlugin/search ...
Setting up an indexing strategy
Before you can use SolrSearch and get back results, you need to index your content completely, and then keep doing so repeatedly to keep up with changes in the Foswiki content base.
This can be achieved in various ways:
- full indexing: index all of the content from start to end
- delta indexing: index topics that changed since the last time (delta) indexing was performed
- realtime indexing: monitor changes in the Foswiki store and fire up indexing as close to the actual change event as possible
- online indexing: index content changes as part of the content being saved
We will discuss these strategies and outline their advantages. A combination of some of them will then make up the recommended indexing strategy for Foswiki content.
Full indexing
./solrindex mode=full optimize=on
This will crawl all webs, topics and attachments and submit them to the Solr server, which will build up the search index. This can take a considerable amount of time depending on the amount of content and the number of users registered on your site, so you may prefer to do it at a quiet time.
Note that full indexing is required the first time you install SolrPlugin. From then on you can use delta indexing to update the index incrementally as content changes in Foswiki.
It is recommended to perform a full indexing only once a week or, preferably, at longer intervals.
Delta indexing
./solrindex
This will inspect all of the content base and check for changes since the last time the content was added to Solr. Any updated content will be added to the index as required. The delta indexing procedure will also go through the whole index and delete those documents whose original topic has been removed from the Foswiki content base.
Delta indexing is a relatively fast operation that is best performed every 15 minutes or so. Don't shorten the interval too much, as that would create additional load on the server even when no content is found to be delta-indexed.
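Conceptually, a delta run makes two decisions: what to (re-)index and what to delete. The following Python sketch illustrates just that logic with plain dictionaries. It is not how solrindex is implemented; the function name plan_delta and the timestamps are invented for the example.

```python
def plan_delta(content, index):
    """Decide what a delta run has to do.

    content: {"Web.Topic": last_modified_epoch} for topics that currently exist
    index:   {"Web.Topic": indexed_epoch} for documents currently in the index
    Returns (to_index, to_delete) as sorted lists of topic names.
    """
    to_index = sorted(
        name for name, modified in content.items()
        if name not in index or modified > index[name]
    )
    # documents whose topic no longer exists must be removed from the index
    to_delete = sorted(name for name in index if name not in content)
    return to_index, to_delete


content = {"Main.WebHome": 200, "Main.NewTopic": 150, "Main.Unchanged": 50}
index = {"Main.WebHome": 100, "Main.Unchanged": 100, "Main.Removed": 100}
print(plan_delta(content, index))
```

Note that both sides have to be walked completely even when nothing has changed, which is why very short delta intervals cost load without any benefit.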
Realtime indexing
This mode of operation requires a separate service to be installed such as iwatch. Iwatch is a tool using the inotify kernel service of Linux systems to trigger a script based on events happening on the filesystem such as "file-open", "file-delete", "file-changed", "file-moved" etc. Iwatch lets us hook in the solrjob script (see above) while watching events in the Foswiki data store at <foswiki-dir>/data/.
Note that this is only "near-realtime" indexing, as the script used to perform the indexing is configured to throttle the procedure for a given amount of time, defaulting to 5 seconds. So a change to the content shows up in the index a little more than 5 seconds after the event, once the throttle period has passed and the topic has been indexed.
The service is configured by placing the configuration below at /etc/iwatch/iwatch.xml:
<?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >
<config>
<guard email="root@localhost" name="IWatch"/>
<watchlist>
<title>Foswiki</title>
<contactpoint email="root@localhost" name="Administrator"/>
<path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob'"><foswiki-dir>/data</path>
<path type="regexception">\.tmp|\.sw\w|\.svn|\.lease|\.lock|,$|\.changes|,v|^_[0-9]|^log|^Temporary|^UnitTestCheck</path>
</watchlist>
</config>
For VirtualHostingContrib use:
<?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >
<config>
<guard email="root@localhost" name="IWatch"/>
<watchlist>
<title>Foswiki</title>
<contactpoint email="root@localhost" name="Administrator"/>
<!-- watch directories shared among all virtual domains -->
<path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob --host all'"><foswiki-dir>/data/System</path>
<!-- <path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob --host all'"><foswiki-dir>/data/Applications</path> -->
<!-- watch each virtual domain for changes -->
<path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob --host <domain1>'"><vhosts-dir>/<domain1>/data</path>
<!-- <path type="recursive" filter=".*\.txt$" alert="off" syslog="on" exec="su <httpd-user> -c '<foswiki-dir>/tools/solrjob --host <domain2>'"><vhosts-dir>/<domain2>/data</path> -->
<path type="regexception">\.tmp|\.sw\w|\.svn|\.lease|\.lock|,$|\.changes|,v|^_[0-9]|^log|^Temporary|^UnitTestCheck</path>
</watchlist>
</config>
Make sure to replace
- <foswiki-dir>
- <httpd-user>
- <domainX>
- <vhosts-dir>
with the appropriate values on your platform.
Note that in the latter iwatch.xml example for virtual hosting, the webs shared among all domains (via soft links) must be watched separately, as changes to those directories don't appear as changes to the domains' directories. These are typically the System web, and the Applications web in case you installed WikiWorkbenchContrib and would like to share all wiki apps among all virtual domains.
Online indexing
This mode of operation updates the search index immediately as part of the save operation performed by Foswiki on behalf of the user.
The biggest advantage here is that changes to the content base show up in the search index immediately, reflecting the exact changes being made.
There are a couple of flags to switch online indexing on or off in your configuration.
Enable/disable indexing content as part of a save operation:
$Foswiki::cfg{SolrPlugin}{EnableOnSaveUpdates} = 0;
Enable/disable updates when a new attachment has been uploaded:
$Foswiki::cfg{SolrPlugin}{EnableOnUploadUpdates} = 0;
Enable/disable updates when a topic or attachment has been moved or deleted:
$Foswiki::cfg{SolrPlugin}{EnableOnRenameUpdates} = 1;
Setting up cronjobs
The following crontab entries perform
- a full indexing every Saturday midnight and
- a delta indexing every 15 minutes
0 0 * * 6 <foswiki-dir>/tools/solrjob --mode full
*/15 * * * * <foswiki-dir>/tools/solrjob --mode delta
Add --host all to index all virtual hosts, or --host <hostname> to index a single virtual host.
Recommendations
We have now seen several ways to keep up with changes in Foswiki while indexing it into an external database such as Solr. Each of the above methods has its own pros and cons. Your own business requirements may also significantly influence how and when to schedule crawling the content. Some of the criteria to keep in mind are:
- size of content base
- speed of indexing content determined by server resources
- interactive performance as perceived by the user
- real-time requirements for updates in search results
- changes in access control structures such as:
- new users being registered to Foswiki,
- changing membership in user groups,
- changing clearance of user groups for specific content
What to keep in mind for full indexing
Changes in access control structures in particular might affect clearance to content on a broader scale. The indexing procedure caches the current authorization for a specific piece of content along with it. A change to access control, independent of any change to the content itself, therefore renders the access control information cached in the Solr index incorrect until this content is indexed again. This is not a problem when the ACL of a single document is altered, as this document is re-indexed as part of the change event. No such re-indexing is triggered automatically when a user group changes or is granted more or less authorization for content. This will only be reflected the next time a full indexing is performed.
Access control structures might change entirely outside of Foswiki when using LdapContrib, where users and groups are defined in a remote LDAP database. These user and group records immediately affect how Foswiki grants access to documents (there is some caching involved here as well, but let's ignore this for now). Only after the affected documents have been indexed again will a search on the index exclude or include the same content that users can or cannot access when visiting the page directly.
Therefore a regular full indexing is required, presumably once a week or once a day during off times.
The runtime of a full indexing run depends on the size of your content base as well as the size of the user base. Both directly affect indexing throughput. It is strongly recommended to schedule full indexing during off times when the system isn't otherwise in use. Also, make sure that two full indexing runs don't overlap, as that would steadily increase load on the servers involved.
In those cases where a full indexing run over all of the content base exceeds off times (e.g. it starts Friday night and hasn't finished by Monday morning), you will need to add more server resources. There are multiple ways to do so. Step one would be to use separate servers for Foswiki and Solr. Please read up on how to scale Solr beyond the single-node installation outlined in the above configuration.
Correctness of search index
A search index might show "incorrect" results for example when the content it indexes doesn't actually exist anymore. So users get a positive search hit but won't be able
to access the content anymore: both content base and search index are out of sync. Keeping the search index "correct" is of importance for any indexing strategy.
A search index might also be "incorrect" when it doesn't reflect the access rights a user has on the content itself. That is: the search engine shall only return search results for content that the user has clearance for. No search result shall ever be returned for content that the user isn't allowed to access or even to know exists.
In SolrPlugin any Foswiki ACLs are added to the Solr database while content is indexed. So ACLs are checked as an additional filter on any search operation that an
authenticated user might perform.
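To picture what "an additional filter" means here, the following Python sketch builds a Lucene filter query for one user. The exact filter SolrPlugin generates is not shown in this document; the sketch assumes the access_granted field described in the schema section, holding either the special value all or individual user names, and the helper names quote_term and acl_filter are hypothetical.

```python
def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes for the Lucene parser."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def acl_filter(user, field="access_granted"):
    """Build a filter query matching documents open to everybody or to this user."""
    return "{}:({} OR {})".format(field, quote_term("all"), quote_term(user))


print(acl_filter("JohnDoe"))
```

Because the filter is evaluated against values stored at indexing time, it can only be as current as the last indexing of each document, which is the reason for the regular full indexing recommended above.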
Correctness of the search index as we discuss it now is more concerned with the time it takes to bring a content change in Foswiki into the Solr database, i.e. how long content and index stay out of sync.
There are two general categories for indexing content that we want to compare now:
- online indexing: index content as part of the interaction performed by the user
- offline indexing: perform content indexing independent from the user interacting with the system online
Offline indexing is performed by the solrindex script as well as the solrjob wrapper. Both might be used in a cronjob or iwatch as described above.
Online indexing comes at a price that we should keep in mind before switching it on. Indexing will be part of a save, delete or rename operation performed by the user and thus directly increases the perceived time for the user to interact with the system while applying content changes.
You have to decide for yourself how to trade interactive performance against the negative side-effects of an "incorrect" search index. It is recommended to accept a short period of time during which the search index is not quite up-to-date rather than slowing down the interactive performance of the system by hooking the indexing procedure into the online operations of Foswiki.
It is recommended to replace Foswiki's default AutoViewTemplatePlugin with AutoTemplatePlugin. This will allow you to replace the default WebSearch, WebChanges and SiteChanges as well as WikiUsers with a Solr-driven interface for better usability and performance.
Configure AutoTemplatePlugin by adding the following {ViewTemplateRules}:
$Foswiki::cfg{Plugins}{AutoTemplatePlugin}{ViewTemplateRules} = {
...
'WebChanges' => 'WebChangesView',
'SiteChanges' => 'SiteChangesView',
'WebSearch' => 'SolrSearchView',
'WikiUsers' => 'SolrWikiUsersView',
...
};
The SolrWikiUsersViewTemplate implements a person search driven by Solr. It allows you to facet on properties as defined in the UserForm such as:
- filter by location
- filter by profession
- filter by organization
There is a specific configuration option for Foswiki to detect which topics are actually user profile pages.
$Foswiki::cfg{SolrPlugin}{PersonDataForm} = '(*UserForm)';
Any topic that has a UserForm attached to it will participate in the person search interface at %USERWEB%.WikiUsers. Note that the value at {SolrPlugin}{PersonDataForm} specifies a Solr filter query that might be customized and extended as required. For example, to also include any topic that has a PersonTopic DataForm attached to it use:
$Foswiki::cfg{SolrPlugin}{PersonDataForm} = '(*PersonTopic OR *UserForm)';
Finally, you'll need to make this configuration accessible in wiki applications such as the WikiUsers view template. Add '{SolrPlugin}{PersonDataForm}' to the {AccessibleCFG} list as in
$Foswiki::cfg{AccessibleCFG} = [
'{ScriptSuffix}',
'{LoginManager}',
'{AuthScripts}',
...
'{SolrPlugin}{PersonDataForm}',
];
Macros
SolrPlugin comes with a set of search macros tailored to the extensive capabilities of Solr's responses to search queries. All of them make use of the same set of options to render a response as listed in SOLRSEARCH.
SOLRSEARCH
This is the most important macro. It allows you to interact with the Solr server and display results within wiki applications.
An example search looks like this:
%SOLRSEARCH{"test"
format=" 1 $web.$topic$n"
sort="date desc"
}%
This will list the 10 most recently changed topics that match the string "test".
To list the 20 most recently changed topics that have the string "test" in their name use:
%SOLRSEARCH{"topic_search:test"
format=" 1 $web.$topic$n"
sort="date desc"
rows="20"
}%
SOLRSEARCH allows you to use the full power of the Lucene query language. This works with syntactically correct boolean queries like "title:foo OR body:foo". Consult the Lucene Query Syntax guide to learn more about how to form more complicated queries.
SOLRSEARCH also allows you to run a query in dismax mode. The dismax query parser only supports a subset of the Lucene syntax, but is highly tolerant of all sorts of strange user input. The query syntax it uses is familiar to many search engine users, and supports +/- and quotes for grouping words. The edismax mode adds several more powerful features, though still short of what is offered by the full Lucene syntax.
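When debugging a query it can help to send it to Solr directly. The Python sketch below only builds the URL for Solr's select handler using standard Solr request parameters (q, defType, sort, rows, start, wt). Several things are assumptions rather than facts from this document: the base URL with port 8983 and a core named foswiki is a placeholder, the mapping of the SOLRSEARCH type values to defType is a guess at the equivalent, and the function name select_url is made up. Note also that querying Solr directly bypasses SolrPlugin and therefore any access control filter it adds.

```python
from urllib.parse import urlencode

# Assumed mapping from the SOLRSEARCH type values to Solr's defType parameter
DEF_TYPES = {"standard": "lucene", "dismax": "dismax", "edismax": "edismax"}


def select_url(query, search_type="standard", sort=None, rows=10, start=0,
               base="http://localhost:8983/solr/foswiki"):
    """Build a URL for Solr's /select handler.

    The base URL (host, port and a core named 'foswiki') is a placeholder;
    adjust it to your installation.
    """
    params = {
        "q": query,
        "defType": DEF_TYPES[search_type],
        "rows": rows,
        "start": start,
        "wt": "json",
    }
    if sort:
        params["sort"] = sort
    return base + "/select?" + urlencode(params)


# Roughly the second SOLRSEARCH example above, expressed as a raw Solr request
url = select_url("topic_search:test", sort="date desc", rows=20)
print(url)
```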
Parameter | Description | Default |
id | a search can optionally be cached for the duration of the current request, for example using id="solr1"; further calls to %SOLRFORMAT can make use of the cached solr response to render it independently of the location of the %SOLRSEARCH call on the wiki page | |
search | query string: depending on the search type this can either be a free-form text (type=dismax), a valid lucene query (type=standard) or a combination of both (edismax) | *:* |
type | dismax/edismax/standard: query type | standard |
fields | list of fields to be returned in the result; by default all fields in solr documents are returned; communication between Foswiki and the solr search can be optimized by specifying only those fields that you are interested in while rendering the response | *, score |
Flags: |
jump | on/off: jump to the topic specified explicitly in the search string | on |
lucky | on/off: jump to the first result found | off |
highlight | switch on/off highlighting of found terms | off |
spellcheck | switch on/off spellchecking to propose alternative spellings in case no search result was found | off |
Pagination: |
start | integer index within the result from where to start listing results | 0 |
rows | maximum number of documents to return | 10 |
Filter parameters: |
web | filter by web: this can be any webname | all |
contributor | filter by contributor to a topic | |
filter | lucene query to filter results | |
extrafilter | additional lucene filter query (see SolrSearchBaseTemplate for the difference between filter and extrafilter) | |
reverse | on/off: reverses sorting if switched on; note: this overrides the sorting order specified in sort | off |
sort | sorting expression; examples: score desc, date desc, createdate, topic_sort | |
Dismax Parameter: |
boostquery | a raw query string (in the solr query syntax) that will be included with the user's query to influence the score. example: type:topic^1000 will boost results of type topic | see solrconfig.xml and SolrSearchBaseTemplate |
queryfields | list of fields and their boosts giving each field a significance when a term was found in them. the format supported is fieldOne^2.3 fieldTwo fieldThree^0.4, which indicates that fieldOne has a boost of 2.3, fieldTwo has the default boost, and fieldThree has a boost of 0.4 … this indicates that matches in fieldOne are much more significant than matches in fieldTwo, which are more significant than matches in fieldThree | see solrconfig.xml and SolrSearchBaseTemplate |
phrasefields | list of fields and their boosts similar to queryfields. this parameter may contain fields and boosts that phrases (specified in quotes) are matched against. boosting those fields higher than their counterpart in queryfields allows you to prefer phrase matches over separate word matches | see solrconfig.xml and SolrSearchBaseTemplate |
Faceting: |
facets | list of facets to be rendered during search; each facet can be a title=name pair specifying the facet name and the title label used to display it in the result; example: %MAKETEXT{"Webs"}%=web, %MAKETEXT{"Topic type"}%=field_TopicType_lst | |
facetquery | query to be used for a facet query | |
facetoffset | used to page through a list of facets being returned by a search | |
facetlimit | maximum number of values to be displayed per facet; this is a list of pairs name=integer specifying a per-facet limit; example: 50, tag=100, contributor=10, category=10 will constrain the global limit of facet values to be returned to 50, tags to 100, list the top 10 contributors in the hit set as well as the 10 most used categories | 100 |
facetmincount | minimum frequency of a facet to be included in the result | 1 |
facetprefix | prefix string of a facet to be included | |
facetdatestart | part of a date facet describing the start of a time interval | NOW/DAY-7DAYS |
facetdateend | part of a date facet describing the end of a time interval | NOW/DAY+1DAYS |
facetdateother | part of a date facet describing the time intervals excluding the one specified with facetdatestart and facetdateend | before |
hidesingle | comma separated list of facets to be hidden if there's only one choice left | |
disjunctivefacets | list of facets that are queried using OR; so searching within one facet will expand the search instead of drilling down | facet values are combined using AND |
combinedfacets | list of facets where values are queried in each of them using OR; for example listing field_ProjectMembers_lst and field_ProjectManager_s will result in a lucene filter of the form field_ProjectMembers_lst:WikiGuest OR field_ProjectManager_s:WikiGuest | |
Formatting results: |
correction | format string for corrections proposed by the spellchecker | Did you mean <a href='$url'>$correction</a> |
header | format string prepended to the result | |
format | format string used to render each hit in the result set | |
separator | format string used to separate hit results rendered using format | |
footer | format string appended to the result | |
header_interesting | format string prepended to more-like-this queries (see %SOLRSIMILAR) | |
format_interesting | format string used to render more-like-this results | |
separator_interesting | format string used to separate hit results in more-like-this queries | |
footer_interesting | format string appended to more-like-this queries | |
include_interesting | regular expression that terms must match in a more-like-this result | |
exclude_interesting | regular expression that terms must not match in a more-like-this result | |
header_<facet> | format string prepended to a facet result | |
format_<facet> | format string used to render a facet value | |
separator_<facet> | format string used to separate facet values | |
footer_<facet> | format string appended to facet results | |
include_<facet> | regular expression that facet values must match to be displayed | |
exclude_<facet> | regular expression that facet values must not match to be displayed | |
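The boost notation used by queryfields and phrasefields in the table above can be illustrated with a few lines of Python. This is only a sketch of how to read the notation, not how Solr or SolrPlugin parse it internally. The function name parse_boosts is made up, and the default boost of 1.0 for fields without an explicit value follows Solr's convention.

```python
def parse_boosts(spec, default=1.0):
    """Parse 'fieldOne^2.3 fieldTwo fieldThree^0.4' into {field: boost}."""
    boosts = {}
    for item in spec.split():
        name, sep, boost = item.partition("^")
        boosts[name] = float(boost) if sep else default
    return boosts


boosts = parse_boosts("fieldOne^2.3 fieldTwo fieldThree^0.4")
# rank the fields from most to least significant
print(sorted(boosts, key=boosts.get, reverse=True))
```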
When a solr response has been cached using the id parameter to SOLRSEARCH, it can be reused by subsequent calls to %SOLRFORMAT.
%SOLRSEARCH{"test"
id="solr1"
facets="web,contributor"
facetlimit="web=10, contributor=10"
}%
<noautolink>
*Results:*
%SOLRFORMAT{"solr1"
format=" 1 [[$web.$topic][$topic]]$n"
}%
*Webs:*
%SOLRFORMAT{"solr1"
format_web=" * $key ($count)$n"
}%
*Contributors:*
%SOLRFORMAT{"solr1"
format_contributor=" * $key ($count)$n"
exclude_contributor="UnknownUser|AdminGroup|AdminUser|RegistrationAgent|TestUser"
}%
</noautolink>
SOLRSIMILAR
SOLRSIMILAR returns a list of topics similar to the current one.
Parameter | Description | Default |
"..." | query string referencing the document(s) to return similar ones for | id:System.SolrPlugin |
like | list of fields used to compute similarity | category, tag |
fields | list of fields and their boost value to be included in result items | web, topic, title, score |
filter | restricts results to those matching this filter | type:topic |
include | switches on/off inclusion of the matched document found in the query parameter | off |
limit | maximum number of results to return | 100 |
boost | | |
mintermfrequency | | |
mindocumentfrequency | | |
mindwordlength | | |
maxdwordlength | | |
SOLRSCRIPTURL
Returns a link to a SolrSearch with the given parameters pre-set.
Parameter | Description | Default |
"..." or search | search string to render a link for | |
id | get a link to the search defined by SOLRSEARCH | |
topic | name of the search topic to jump to | WebSearch |
union | a list of fields whose values can be selected in a union (using an "or" operator) | |
multivalued | a list of fields that may be searched by multiple values | |
start | | |
sort | | |
<field_name> | any field defined in solr's schema.xml | |
Solr indexing schema
SolrPlugin comes with a custom schema to index general Foswiki data as defined in the <solr-home-dir>conf/schema.xml file. It offers support for generic DataForm values, so adding any new DataForm definition will allow you to use those formfields for faceting directly without changing configurations or having to reindex the content.
The process of indexing content is configured on the Foswiki side, which will crawl all webs, topics and their attachments, thus creating lucene documents which are then sent over to the solr server. A lucene document is made up of fields of a certain type which defines the way the document should be processed by the solr server. This is configured in the schema.xml file.
While the schema is able to cover all Foswiki related data it is still kept generic enough to be used for non-wiki content as well.
Field types
This is the list of the most common field types used in the default schema. See the schema.xml for more exotic field types like point and location, useful for spatial search.
Type | Description |
string | not analyzed, but indexed/stored verbatim |
boolean | boolean values (true, false) |
binary | the data should be sent/retrieved as Base64 encoded strings |
int, float, long, double | default numeric field types; for faster range queries, consider the tint/tfloat/tlong/tdouble types |
date | the format for this date field is of the form 1995-12-31T23:59:59Z, and is a more restricted form of the canonical representation of dateTime. The trailing "Z" designates UTC time and is mandatory. Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z. All other components are mandatory. Note: for faster range queries, consider the tdate type |
text_ws | a text field that only splits on whitespace for exact matching of words |
text | a general text field that has reasonable, generic cross-language defaults: it tokenizes with StandardTokenizer, removes stop words from case-insensitive "stopwords.txt", and lowercases. At query time only, it also applies synonyms. |
text_generic | same as text but also splits words on case change while generating word parts. a general unstemmed text field - good if one does not know the language of the field. this field type is useful when searching for parts of a WikiWord |
text_prefix | substring decomposition starting at the front of the string |
text_suffix | substring decomposition starting at the back of the string |
text_spell | generic text analysis for spell checking |
text_sort | this is a text field suitable for sorting alphabetically |
text_rev | a general unstemmed text field that indexes tokens normally and also reversed, to enable more efficient leading wildcard queries. |
type | a list of strings used to analyse different media types. these are analysed using the system's mime types table and generating meaningful values; for example a gif image would be of type "gif", "image" and "attachment" |
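The date format from the table above is easy to get wrong when feeding dates from other tools. As a rough sketch (the helper name solr_date is made up for illustration), a timezone-aware Python datetime can be turned into that form like this:

```python
from datetime import datetime, timezone


def solr_date(moment):
    """Format an aware datetime as a Solr date string in UTC, e.g. 1995-12-31T23:59:59Z."""
    utc = moment.astimezone(timezone.utc)
    text = utc.strftime("%Y-%m-%dT%H:%M:%S")
    if utc.microsecond:
        # optional fractional seconds, trimmed to milliseconds
        text += ".{:03d}".format(utc.microsecond // 1000)
    return text + "Z"


print(solr_date(datetime(1995, 12, 31, 23, 59, 59, tzinfo=timezone.utc)))
```

Converting to UTC before formatting matters because the trailing "Z" is mandatory and always means UTC; a local time with a "Z" appended would silently be off by the zone offset.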
Fields
Name | Type | Multivalued | Stored | Description |
---|
access_granted | string | multivalued | | this field controls access of users to this topic or attachment in the search index; every query is augmented with an ACL check against this field; only users listed in this field are allowed view rights; special value is "all" when there are no view restrictions |
attachment | string | multivalued | stored | list of all attachment names of this topic |
author | string | | stored | the name of the person that changed the document most recently |
author_title | string | | stored | title name of the person that changed the document most recently |
catchall | text_generic | multivalued | stored | copy-field that gathers content from (almost) all fields; this is the default search field for the "standard" query parser; note that fields to be queried can be configured per request using the "dismax" handler |
category | string | multivalued | stored | list of categories this document is in; note: this field will only be used if Foswiki:Extensions/ClassificationPlugin is installed; it will populate it with the list of all categories up to TopCategory; content of this field is copied to category_search as well (see generic fields below) |
comment | text_generic | | stored | comment field of an attachment |
concept | string | multivalued | stored | support for uima processing chain |
container_id | string | | stored | id of containing document, e.g. the topic this is a comment or attachment for |
container_title | string | | stored | title name of containing document |
container_topic | string | | stored | topic of containing document |
container_url | string | | stored | url of containing document |
container_web | string | | stored | web of containing document |
contributor | string | multivalued | stored | list of users that contributed to this topic at some point in time |
createauthor | string | | stored | author of the initial version of this document |
createauthor_title | string | | stored | title name of the initial author of this document |
createdate | tdate | | stored | date when the initial version of this document was created |
date | tdate | | stored | time the document was changed last |
form | string | | stored | name of the form attached to the current topic |
icon | string | | stored | icon to identify the rendition for this document |
id | string | | stored | unique identifier for each document; this is the external id usable in applications; there's an internal solr document id not related to this field |
language | string | | stored | language of the current document; this may be specified explicitly using the CONTENT_LANGUAGE preference, or set to "detect" to let the solr update chain detect the language automatically |
macro | string | multivalued | | list of wiki macros being used in this topic |
name | string | | stored | filename of an attachment |
outgoing | string | multivalued | stored | list of all outgoing links; this information is used to detect backlinks |
parent | string | | stored | parent topic of the current topic |
phonetic | phonetic | multivalued | | holds the phonetic analysis of the most important search fields |
charnorm | text_charnorm | multivalued | | result of the character normalization analysis |
preference | string | multivalued | stored | this field catches all topic preferences; each preference is captured in a dynamic field as well (see dynamic fields below) |
sentence | text_generic | multivalued | stored | support for uima processing chain |
size | tint | | stored | size of an attachment in bytes |
spell | text_spell | multivalued | | used for spellchecking |
state | string | | | used by comments or any other application that tracks specific states of a document, such as "new", "unapproved", "approved", "draft", "unpublished", "published", … |
text_prefix | text_text_prefix | multivalued | | holds substring analysis of the most important search fields, starting at the front |
text_suffix | text_text_suffix | multivalued | | holds substring analysis of the most important search fields, starting at the back |
summary | text_generic | | stored | this is a plainified summary of the topic text |
tag | string | multivalued | stored | list of tags assigned to this document; note: this field will only be used if Foswiki:Extensions/ClassificationPlugin is installed; content of this field is copied to tag_search as well (see generic fields below) |
text | text_generic | | | document text |
thumbnail | string | | stored | url to thumbnail representation of this document; mostly used for images |
timestamp | tdate | | stored | time when the document was added to the index |
title | string | | stored | title of a document; a topic title is read from a TopicTitle formfield, a TOPICTITLE preference variable or defaults to the topic name itself; for attachments this is the filename with the extension stripped off |
topic | string | | stored | name of the topic |
type | type | | stored | holds the type facet of the document; this is "image" for all kinds of images, "video" for all kinds of videos, "topic" for Foswiki topics and the verbatim file extension for everything else; note: plugins like Foswiki:Extensions/MetaCommentPlugin might use specific types as well (like "comment" in this case) |
url | string | | stored | url used to access the document being indexed |
version | float | | | current version of the topic |
webcat | string | | stored | combined web-category facet |
web | string | | stored | name of the web this document is located in |
webtopic | string | | stored | concatenation of the web and topic part |
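The ACL check on access_granted described in the table above could be expressed as a Solr filter query along the lines of the following sketch. The exact query that SolrPlugin generates is not documented here, so both the helper name acl_filter_query and the query syntax are assumptions made for illustration only.

```python
def acl_filter_query(wiki_name: str) -> str:
    """Build a filter query matching unrestricted documents ("all")
    or documents that list the given user in access_granted."""
    # Escape backslashes and double quotes so the name is a valid phrase.
    escaped = wiki_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'access_granted:("all" OR "{escaped}")'


print(acl_filter_query("JohnDoe"))
```

A real installation may also need to take group memberships into account; this sketch only covers a single user name.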
Dynamic fields
Dynamic fields are generated based on the content properties of the document to
be indexed. They are declared in schema.xml using a wildcard pattern.
When a document is indexed, the wildcard is expanded to create a proper
field name. Dynamic fields make it possible to analyze fields in a specific way
based on their name, and to cover fields that aren't known in advance, such as
the names of all formfields of any DataForm that could ever be invented.
When SolrPlugin is about to index a DataForm attached to a topic, it tries to
guess the data type of each formfield. Normally, Foswiki does not specify any
type information within a DataForm definition. Exceptions are (1) date: these
are mapped to a *_dt field and (2) checkbox, select, radio, textboxlist: these
are potentially multi-value fields and are thus indexed in a *_lst field.
Every other formfield is stored in a *_s field as well as in a *_search field.
The former captures the exact content while the latter analyses the text more thoroughly
and is optimized for fuzzy searching.
DataForm formfields are mapped to lucene document fields by prepending the field_*
prefix to prevent name clashes with other dynamic fields generated on the fly.
So for example a formfield ProjectManager will be stored in field_ProjectManager_s
and field_ProjectManager_search. Likewise a select+multi formfield ProjectMembers
will be stored in field_ProjectMembers_lst as it is a multivalued field.
If a formfield name already comes with one of the below suffixes (_i, _l, _f, _dt, etc.)
then this suffix will be used instead of any heuristics trying to derive the best
field type for the lucene field. That way DataForm fields, although untyped by Foswiki,
can still be indexed in a type-specific way.
Similarly, topic preferences are indexed using a preference_* prefix.
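The naming rules described above can be summarized in a short sketch. This is not the plugin's actual implementation (SolrPlugin itself is written in Perl); the function name solr_field_names and the list of recognized suffixes are assumptions for illustration, as is the guess that a formfield which already carries a suffix simply gets the field_ prefix.

```python
# Formfield types that may hold several values (indexed in a *_lst field).
MULTI_VALUE_TYPES = {"checkbox", "select", "radio", "textboxlist"}

# Assumed set of suffixes that are taken over verbatim from the formfield name.
KNOWN_SUFFIXES = ("_i", "_l", "_f", "_d", "_b", "_s", "_std", "_t",
                  "_dt", "_lst", "_search", "_sort")


def solr_field_names(name: str, formfield_type: str) -> list[str]:
    """Return the Solr field names a DataForm formfield would be indexed in."""
    if name.endswith(KNOWN_SUFFIXES):
        # An explicit suffix overrides any type guessing.
        return [f"field_{name}"]
    # Strip modifiers such as "+multi" from types like "select+multi".
    base_type = formfield_type.split("+")[0]
    if base_type == "date":
        return [f"field_{name}_dt"]
    if base_type in MULTI_VALUE_TYPES:
        return [f"field_{name}_lst"]
    # Everything else: exact copy plus a copy analysed for fuzzy search.
    return [f"field_{name}_s", f"field_{name}_search"]


print(solr_field_names("ProjectManager", "text"))
print(solr_field_names("ProjectMembers", "select+multi"))
```

With these rules, ProjectManager ends up in field_ProjectManager_s and field_ProjectManager_search, while ProjectMembers ends up in field_ProjectMembers_lst, matching the examples in the text.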
Name | Type | Multivalued | Stored | Description |
---|
*_i | tint | | stored | fields with a _i suffix are indexed as an integer number |
*_l | tlong | | stored | fields with a _l suffix are indexed as a long integer |
*_f | tfloat | | stored | fields with a _f suffix are indexed as a float |
*_d | tdouble | | stored | fields with a _d suffix are indexed as a double precision float |
*_b | boolean | | stored | true, false |
*_s | string | | stored | dynamic field for unanalyzed text |
*_std | string | | not stored | dynamic field for standard analysis, i.e. stopwords not being removed |
*_t | text_generic | | stored | generic text |
*_dt | tdate | | stored | a dateTime value |
*_lst | string | multivalued | stored | this field is used for any multi-valued formfield in DataForms, like select, radio, checkbox, textboxlist |
preference_* | string | | stored | preference values such as preference_NAMEOFPREFERENCE_t |
*_search | text_generic | | stored | generic text, optimized for searching |
*_sort | text_sort | | stored | text optimized for sorting alphabetically |
Copy fields
Finally, after having defined all field types, there are some fields that are created by copying a
source field to a destination field using the copyField feature of solr. So while most of a lucene document
to be indexed is created by the crawler and indexer explicitly, some more fields are created automatically to facilitate
specific search applications. The destination fields are then analysed using the dynamic field definitions as given above.
Source | Destination |
---|
attachment | catchall |
attachment | charnorm |
attachment | phonetic |
attachment | spell |
category | catchall |
category | category_search |
category | charnorm |
category | phonetic |
comment | catchall |
comment | charnorm |
comment | phonetic |
comment | spell |
concept | catchall |
concept | charnorm |
concept | phonetic |
concept | spell |
field_* | catchall |
field_* | charnorm |
field_* | phonetic |
field_* | spell |
form | catchall |
form | charnorm |
form | phonetic |
form | spell |
name | catchall |
name | charnorm |
name | phonetic |
name | spell |
name | name_std |
name | name_search |
tag | catchall |
tag | charnorm |
tag | phonetic |
tag | tag_search |
text | catchall |
text | charnorm |
text | phonetic |
text | spell |
text | text_prefix |
text | text_std |
text | text_suffix |
title | catchall |
title | charnorm |
title | phonetic |
title | spell |
title | title_first_letter |
title | title_prefix |
title | title_search |
title | title_sort |
title | title_std |
title | title_suffix |
topic | catchall |
topic | charnorm |
topic | phonetic |
topic | spell |
topic | topic_search |
topic | topic_sort |
topic | topic_std |
type | catchall |
type | charnorm |
type | phonetic |
web | spell |
webtopic | webtopic_search |
web | web_search |
web | web_sort |
web | web_std |
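To illustrate how the copy rules in the table above fan a single source field out to several destinations, here is a minimal sketch. It only simulates the matching of source names, including the field_* wildcard, against a small excerpt of the table; the function name copy_destinations is hypothetical, and solr performs the actual copying itself while indexing.

```python
from fnmatch import fnmatchcase

# Excerpt of the (source, destination) pairs listed in the table above.
COPY_RULES = [
    ("title", "catchall"),
    ("title", "title_sort"),
    ("field_*", "catchall"),
    ("field_*", "spell"),
]


def copy_destinations(source_field: str) -> list[str]:
    """Return all destination fields whose source pattern matches the field."""
    return [dest for pattern, dest in COPY_RULES
            if fnmatchcase(source_field, pattern)]


print(copy_destinations("title"))
print(copy_destinations("field_ProjectManager_s"))
```

A field without a matching rule, such as url, yields an empty list and is therefore indexed only under its own name.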
Templates
Structure of SolrSearchBaseTemplate
Replacing WebSearch and WebChanges
Creating custom search interfaces
Dependencies
Name | Version | Description |
---|
Foswiki::Contrib::JQMomentContrib | >=1.0 | Required |
Foswiki::Contrib::JQPhotoSwipeContrib | >=1.0 | Required |
Foswiki::Contrib::JQSerialPagerContrib | >=2.0 | Required |
Foswiki::Contrib::JQTwistyContrib | >=1.0 | Required |
Foswiki::Contrib::StringifierContrib | >=1.20 | Required |
Foswiki::Plugins::AutoTemplatePlugin | >=1.0 | Optional |
Foswiki::Plugins::ClassificationPlugin | >=1.0 | Optional |
Foswiki::Plugins::DBCachePlugin | >=1 | Optional |
Foswiki::Plugins::FilterPlugin | >=2.0 | Required |
Foswiki::Plugins::FlexWebListPlugin | >=1.91 | Required |
Foswiki::Plugins::ImagePlugin | >=3.0 | Required |
Foswiki::Plugins::JQueryPlugin | >=6.00 | Required |
Cache::Cache | >0 | Required |
HTML::Entities | >=3.64 | Required |
JSON::XS | >=2.231 | Required |
LWP::UserAgent | >=5.820 | Required |
Moo | >=2.00 | Required |
Types::Standard | >=1.00 | Required |
XML::Easy | >0 | Required |
Foswiki::Plugins::TopicTitlePlugin | >1.00 | Required for Foswiki < 2.2 |
Change History
31 Jan 2019: | reduce amount of presumably unrelated search results; improved language detection in solr; added fields name_std and name_search for better searchability of attachments; don't display wiki markup in search result summaries; added field macro to capture use of wiki macros |
10 Oct 2018: | mime types are now multivalued, e.g. an image is now tagged type: ["gif", "image", "attachment"]; better support for attachments listed in the autosuggest drop down box; the rudimentary type mapping is now based on the system mime types table and not using a typemap file in solr's config anymore; removed dependency on Image::Magick; fixed error exceeding the max string length in solr; the form name will now be used when no TopicType field is present to construct the TopicType facet; fixed support for ALLOWWEBVIEW = * |
13 Aug 2018: | new alphabetical navigation for wiki users; fixed searching for summary; replaced jquery.scrollto with native scroll api; make number of items suggested configurable in jquery.autosuggest drop-down box |
07 Jun 2018: | new index fields author_title, createauthor_title, title_first_letter; added support for indexing arbitrary meta data; added support for ListyPlugin; added toggle "exact search" to search interface; depending on new TopicTitlePlugin now; fixed keyboard interaction of autosuggest box; fixed sorting facet values by title; much improved relevancy sorting |
09 Jan 2018: | added support for jquery.i18n; improved solr schema for better findability; fixed solr sidebar in subwebs |
18 Sep 2017: | replacing text_substring with text_prefix and text_suffix to improve substring matching; truncate document values larger than 32k to prevent solr from crashing; use flexbox for people search interface; fixed creating urls to ImagePlugin rest interface to generate thumbnail previews |
23 Jan 2017: | converted WebServices::Solr to Moo; fixed documentation for iwatch realtime indexing; documentation of SOLRSCRIPTURL macro; using jquery.i18n for javascript translations now; new facet filter to search in facet values; improved indexing of user profile pages and their thumbnail image; indexing image geometry now; improved jquery.autosuggest widget; improved ToggleFacetWidget; improved boosting of query ingredients; mapping all office documents to a combined attachment type (document, presentation, spreadsheet, chart, …); better support for plenv in system services and cron jobs |
18 Oct 2015: | fixed backwards compatibility with pre-unicode Foswiki; bring back solr::queryfields in SolrSearchBaseTemplate; fixed language facet to properly match language tags to their name; improved layout of search results as well as autosuggestion widget; removed workflow facet from default search; fixed icon mapping for topics that don't come with an icon defined in their TopicType; don't try to encode html entities without a code point in utf8; don't remove all macros from topic text, just some; removed dependency on MimeIconsPlugin as we are using fontawesome now; improved formula for sorting results by reference; fixed sorting in ajax-solr; fixed exposing/hiding parameters in ajax-solr; improved findability of content, i.e. when containing stop words only in the title; removed unused /browse search handler from solr config |
01 Oct 2015: | improve default layout of search results; moved unsafe inline-javascript into a js file of its own |
21 Sep 2015: | cache stringified attachments using Cache::FileCache now and added api to purge/clear cache regularly; removed IndexExtensions config parameter to let the stringifier decide on supported file formats; added support for Foswiki:Extensions/LikePlugin boosting search results by social preferences |
17 Jul 2015: | added support for Foswiki-2.0; indexing workflow and state facets supporting Foswiki:Extensions/WorkflowPlugin; added author_url to solr schema; added google image and video mime types mapping them to "image" while indexing |
27 Feb 2015: | upgraded to solr-5.0.0 |
29 Sep 2014: | moved to jsrender for templating, replacing the deprecated jquery.tmpl |
29 Aug 2014: | fix mailto links in WikiUsers view template; fully specify rest security; fixed creating of working area for timestamps db; improved indexing of list values; fixed encoding error in SOLRSEARCH/FORMAT; use SOLR_EXTRAFILTER preference setting in auto-suggest widget as well; fixed applying strings and defaults in solrDictionary class; fixed applying extra-filters in SolrSearch; harvest facet headings for translations |
28 May 2014: | implemented new ACL style compatible with Foswiki >= 1.2 |
14 Jul 2013: | added support for PiwikPlugin |
14 Mar 2013: | improved indexing performance; added configurable http timeouts when talking to the solr backend; fixed language mappings for multilingual content; fixes due to latest changes in jquery.moment |
17 Oct 2011: | fixed WebServices::Solr to only encode to utf8 if needed; fixed handling character encoding on a pure utf8 foswiki; fixed schema for spell correction |
29 Sep 2011: | improved schema.xml: replaced StandardTokenizer with WhitespaceTokenizer, using new ClassicTokenizer and ClassicFilter to feed the spellchecker, switched spellchecker to JaroWinklerDistance and lowered the frequency threshold for a term to be added to the spellchecker; building the spellchecker when optimizing the index now; fixed detecting the content language |
28 Sep 2011: | added multilanguage support per document; fixed default values in %SOLRSIMILAR; speeding up indexing by better caching ACLs; implemented mapping facet values to any other label during query time; added Language facet to default search interface |
26 Sep 2011: | improved default boosting in dismax to prefer topic hits a lot stronger than attachments; improved default cache settings for better default performance; added support to distribute updates and search in a master-slave setup; added boostquery, queryfields, phrasefields parameters to customize boosting and sorting; improved default schema while documenting it |
21 Sep 2011: | upgrading to solr-3.4.0; fixed utf8 handling; added jump and i-feel-lucky options; made hidesingle configurable per facet; added disjunctivefacets and combinedfacets; fixed handling of date fields; support new ui::autocomplete in JQueryPlugin; using type-specific icons in Foswiki:Extensions/MimeIconPlugin if installed; fixed quoting lucene queries; indexing outgoing links to support fast backlinks; adding fields createauthor, language and collection to schema; disabling phonetic boost in schema by default; be more robust in case of malformed DataForm definitions; copying every string field into a search field also to allow exact as well as fuzzy search; enhancing normalizeWebTopicName to create uniform web names using dots, not slashes everywhere; fixed parsing inline topic permissions; externalized sidebar pager into a new plugin of its own: Foswiki:Extensions/JQSerialPagerContrib; upgrading to WebService::Solr-0.14 … which now requires CPAN:XML::Easy instead of CPAN:XML::Generator; lots of improvements to SolrSearchBaseTemplate; now supporting Foswiki:Extensions/InfiniteScrollContrib in SolrSearch; documentation improvements |
19 Apr 2011: | shipping a multicore setup by default; added support for Foswiki:Extensions/VirtualHostingContrib; fixed utf8 recoding; some usability improvements to faceted search interface; fixing illegal control characters in output (Oliver Schaub) |
16 Dec 2010: | added state field to schema used for approval workflows; added solrjob to ease cronjobbing indexing; added docu how to use iwatch for almost-realtime indexing; fixed dependencies to include Foswiki:Extensions/FilterPlugin as well; fixed mapping facet values to their display title in search interface; fixed delta updates not properly removing outdated attachment entries when these were moved/renamed; and some minor html improvements |
03 Dec 2010: | fixed solr-based WebChanges and SiteChanges using PatternSkin |
01 Dec 2010: | adjustments due to changes in stringifier api; fixed removal of deleted webs from search index |
22 Nov 2010: | fixed integration with pattern skin |
18 Nov 2010: | initial public release |