Search¶
Warning
This section of documentation may be outdated.
Kitsune uses Elasticsearch to power its on-site search facility.
It gives us a number of advantages over MySQL’s full-text search or Google’s site search.
Much faster than MySQL.
And reduces load on MySQL.
We have total control over what results look like.
We can adjust searches with non-visible content.
We don’t rely on Google reindexing the site.
We can fine-tune the algorithm and scoring.
Installing Elasticsearch¶
There’s an installation guide on the Elasticsearch site:
https://www.elastic.co/guide/en/elasticsearch/reference/1.3/setup-service.html
We’re currently using 1.2.4 in production.
The directory you install Elasticsearch in will hereafter be referred
to as ELASTICDIR
.
You can configure Elasticsearch with the configuration file at
ELASTICDIR/config/elasticsearch.yml
.
Elasticsearch uses several settings in kitsune/settings.py
that you
need to override in kitsune/settings_local.py
. Here’s an example:
# Connection information for Elastic
ES_URLS = ['http://127.0.0.1:9200']
ES_INDEXES = {'default': 'sumo_dev'}
ES_WRITE_INDEXES = ES_INDEXES
These settings explained:
ES_URLS
Defaults to
['http://127.0.0.1:9200']
.Points to the url for your Elasticsearch instance.
Warning
The url must match the host and port in
ELASTICDIR/config/elasticsearch.yml
. So if you change it in one place, you must also change it in the other.
ES_INDEXES
Mapping of
'default'
to the name of the index used for searching.The index name must be prefixed with the value of
ES_INDEX_PREFIX
.Examples if
ES_INDEX_PREFIX
is set to'sumo'
:ES_INDEXES = {'default': 'sumo'} ES_INDEXES = {'default': 'sumo_20120213'} ES_INDEXES = {'default': 'tofurkey'} # WRONG!
ES_WRITE_INDEXES
Mapping of
'default'
to the name of the index used for indexing.The index name must be prefixed with the value of
ES_INDEX_PREFIX
.Examples if
ES_INDEX_PREFIX
is set to'sumo'
:ES_WRITE_INDEXES = ES_INDEXES ES_WRITE_INDEXES = {'default': 'sumo'} ES_WRITE_INDEXES = {'default': 'sumo_20120213'} ES_WRITE_INDEXES = {'default': 'tofurkey'} # WRONG!Note
The separate roles for indexes allows us to push mapping changes to production. In the first push, we’ll push the mapping change and give
ES_WRITE_INDEXES
a different value. Then we reindex into the new index. Then we push a change updatingES_INDEXES
to equalES_WRITE_INDEXES
allowing the search code to use the new index.If you’re a developer, the best thing to do is have your
ES_WRITE_INDEXES
be the same asES_INDEXES
. That way you can reindex and search and you don’t have to fiddle with settings in between.
There are a few other settings you can set in your
kitsune/settings_local.py
file that override ElasticUtils defaults. See
the ElasticUtils docs
for details.
Other things you can change:
ES_INDEX_PREFIX
Defaults to
'sumo'
.All indexes for this application must start with the index prefix. Indexes that don’t start with the index prefix won’t show up in index listings and cannot be deleted through the esdelete subcommand and the search admin.
Note
The index names in both
ES_INDEXES
andES_WRITE_INDEXES
must start with this prefix.
ES_LIVE_INDEXING
Defaults to False.
You can also set
ES_LIVE_INDEXING
in yourkitsune/settings_local.py
file. This affects whether Kitsune does Elasticsearch indexing when data changes in thepost_save
andpre_delete
hooks.For tests,
ES_LIVE_INDEXING
is set toFalse
except for Elasticsearch specific tests so we’re not spending a ton of time indexing things we’re not using.
ES_TIMEOUT
Defaults to 5.
This affects timeouts for search-related requests.
If you’re having problems with ES being slow, raising this number might be helpful.
Using Elasticsearch¶
Running¶
Start Elasticsearch by:
$ ELASTICDIR/bin/elasticsearch
That launches Elasticsearch in the background.
Indexing¶
Do a complete reindexing of everything by:
$ ./manage.py esreindex
This will delete the existing index specified by ES_WRITE_INDEXES
,
create a new one, and reindex everything in your database. On my
machine it takes under an hour.
If you need to get stuff done and don’t want to wait for a full indexing, you can index a percentage of things.
For example, this indexes 10% of your data ordered by id:
$ ./manage.py esreindex --percent 10
This indexes 50% of your data ordered by id:
$ ./manage.py esreindex --percent 50
I use this when I’m fiddling with mappings and the indexing code.
Another way of specifying a smaller number of things to index is by indicating how recently updated things should be to be included:
$ ./manage.py esreindex --hours-ago 2
$ ./manage.py esreindex --minutes-ago 20
$ ./manage.py esreindex --seconds-ago 90
Those options can be combined as well if you wish. Different indexes have different ways of determining how long ago something was updated, but as a whole this should reindex everything in every index (or those specified in the –mapping_types option) that was updated less than or equal to how long ago you say.
You can also specify which mapping_types to index:
$ ./manage.py esreindex --mapping_types questions_question,wiki_document
See --help
for more details:
$ ./manage.py esreindex --help
Note
Once you’ve indexed everything, if you have ES_LIVE_INDEXING
set to True
, you won’t have to do it again unless indexing code
changes. The models have post_save
and pre_delete
hooks
that will update the index as the data changes.
Note
If you kick off indexing with the admin, then indexing gets done in chunks by celery tasks. If you need to halt indexing, you can purge the tasks with:
$ celery -A kitsune purge
If you do this often, it helps to write a shell script for it.
Health/statistics¶
You can see Elasticsearch index status with:
$ ./manage.py esstatus
This lists the indexes, tells you which ones are set to read and write, and tells you how many documents are in the indexes by mapping type.
Deleting indexes¶
You can use the search admin to delete the index.
On the command line, you can do:
$ ./manage.py esdelete <index-name>
Implementation details¶
Kitsune uses elasticutils and pyelasticsearch.
Most of our code is in the search
app in kitsune/search/
.
Models in Kitsune that are indexable use SearchMixin
defined in
models.py
.
Utility functions are implemented in es_utils.py
.
Sub commands for manage.py
are implemented in
management/commands/
.
Searching on the site¶
Scoring¶
These are the default weights that apply to all searches:
wiki (aka kb):
document_title__match 6
document_content__match 1
document_keywords__match 8
document_summary__match 2
questions (aka support forums):
question_title__match 4
question_content__match 3
question_answer_content__match 3
forums (aka contributor forums):
post_title__match 2
post_content__match 1
Elasticsearch is built on top of Lucene so the Lucene documentation on scoring covers how a document is scored in regards to the search query and its contents. The weights modify that—they’re query-level boosts.
Additionally, this blog post from 2006 is really helpful in terms of providing insight on the implications of the way things are scored.
Filters¶
We use a series of filters on document_tag, question_tag, and other properties of documents like has_helpful, is_locked, is_archived, etc.
In ElasticSearch, filters remove items from the result set, but don’t affect the scoring.
We cannot apply weights to filtered fields.
Regular search¶
A regular search is any search that doesn’t start from the Advanced Search form.
You could start a regular search from the front page or from the search form on any article page.
Regular search does the following:
searches only kb and support forums
(filter) kb articles are tagged with the product (e.g. “desktop”)
(filter) kb articles must not be archived
(filter) kb articles must be in Troubleshooting (10) and How-to (20) categories
(filter) support forum posts tagged with the product (e.g. “desktop”)
(filter) support forum posts must have an answer marked as helpful
(filter) support forum posts must not be archived
It scores as specified above.
Ask A Question search¶
An Ask a question or AAQ search is any search that is performed within the AAQ workflow. The only difference to regular search is that AAQ search shows forum posts that have no answer marked as helpful.
Advanced search¶
The advanced search is any search that starts from the Advanced Search form.
The advanced search is defined by whatever you specify in the Advanced Search form.
For example, if you search for knowledge base articles in the Troubleshooting category, then we add a filter where the result has to be in the Troubleshooting category.
Link to the Elasticsearch code¶
Here’s a link to the search view in the master branch:
https://github.com/mozilla/kitsune/blob/master/kitsune/search/views.py