ShihabYasin / simple-django-full-text-searching

Basic and Full-text Search in Django

Project Setup

Code is here.

You'll use Docker to simplify setting up and running Postgres along with Django.

From the project root, create the images and spin up the Docker containers:

$ docker-compose up -d --build

Next, apply the migrations and create a superuser:

$ docker-compose exec web python manage.py makemigrations
$ docker-compose exec web python manage.py migrate
$ docker-compose exec web python manage.py createsuperuser

Once done, navigate to http://127.0.0.1:8011/quotes/ to ensure the app works as expected.

Take note of the Quote model in quotes/models.py:

from django.db import models


class Quote(models.Model):
    name = models.CharField(max_length=250)
    quote = models.TextField(max_length=1000)

    def __str__(self):
        return self.quote

Next, run the following management command to add 10,000 quotes to the database:

$ docker-compose exec web python manage.py add_quotes

This will take a couple of minutes. Once done, navigate to http://127.0.0.1:8011/quotes/ to see the data.

The output of the view is cached for five minutes, so you may want to comment out the @method_decorator in quotes/views.py to load the quotes. Make sure to remove the comment once done.

In the quotes/templates/quote.html file, you have a basic form with a search input field:

<form action="{% url 'search_results' %}" method="get">
<input
type="search"
name="q"
placeholder="Search by name or quote..."
class="form-control"
/>
</form>

On submit, the form sends the data to the backend. A GET request is used rather than a POST so that the query string is available both in the URL and in the Django view, which lets users share search results as links.
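To see why a GET form makes results shareable, here is a minimal standard-library sketch of how the q parameter travels in the URL. The /search/ path and both helper names are assumptions made for illustration; the project's real route is whatever the search_results URL name resolves to.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, query):
    """Append the search term to the URL the way a GET form would."""
    return f"{base_url}?{urlencode({'q': query})}"


def read_query(url):
    """Recover the search term from a shared link."""
    params = parse_qs(urlsplit(url).query)
    return params.get("q", [""])[0]


# Hypothetical path; only the host and port come from this project.
shared_url = build_search_url("http://127.0.0.1:8011/search/", "the middle")
recovered = read_query(shared_url)
```

Anyone who opens shared_url sends the same q value to the view, which is what self.request.GET.get("q") reads on the Django side.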

Before proceeding further, take a quick look at the project structure and the rest of the code.

With Django, you'll typically start with the contains or icontains lookups, which match an exact substring of a field. The Q object can be used as well to add AND (&) or OR (|) logical operators.

For instance, using the OR operator, override the SearchResultsList's default QuerySet in quotes/views.py like so:

class SearchResultsList(ListView):
    model = Quote
    context_object_name = "quotes"
    template_name = "search.html"

    def get_queryset(self):
        query = self.request.GET.get("q")
        return Quote.objects.filter(
            Q(name__icontains=query) | Q(quote__icontains=query)
        )

Here, we used the filter method to filter against the name or quote fields. We also used the icontains lookup to check, case-insensitively, whether the query is present in either field. A quote is returned if either field matches.

Don't forget the import:

from django.db.models import Q

For small data sets, this is a great way to add basic search functionality to your app. If you're dealing with a large data set or want search functionality that feels like an Internet search engine, you'll want to move to full-text search.

The basic search that we saw earlier has several limitations especially when you want to perform complex lookups.

As mentioned, with basic search, you can only perform exact matches.

Another limitation is that of stop words. Stop words are words such as "a", "an", and "the". They are so common and carry so little meaning that a search should ignore them. To test, try searching for a word with "the" in front of it. Say you searched for "the middle". In this case, you'll only see results for "the middle", so you won't see any results that have the word "middle" without "the" before it.

Say you have these two sentences:

  1. I am in the middle.
  2. You don't like middle school.

Searching for "the middle", basic search returns only the first sentence, while full-text search ignores the stop word and returns both.

    Another issue is that of ignoring similar words. With basic search, only exact matches are returned. However, with full-text search, similar words are accounted for. To test, try to find some similar words like "pony" and "ponies". With basic search, if you search for "pony" you won't see results that contain "ponies" -- and vice versa.

    Say you have these two sentences:

      1. I am a pony.
      2. You don't like ponies.

    Searching for "pony", basic search returns only the first sentence, while full-text search returns both. The same holds in reverse for "ponies".

    With full-text search, both of these issues are mitigated. However, keep in mind that depending on your goal, full-text search may actually decrease precision (quality) and recall (quantity of relevant results). Typically, full-text search is less precise than basic search, since basic search yields exact matches. That said, if you're searching through large data sets with large blocks of text, full-text search is preferred since it's usually much faster.
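The two limitations above can be reproduced without a database. The following is a toy sketch only: the stop-word set, the naive_stem rule, and every function name are assumptions made for illustration, and Postgres uses a much richer dictionary and stemmer.

```python
import re

# Tiny illustrative stop-word list; Postgres ships a far longer one.
STOP_WORDS = {"a", "an", "the"}


def naive_stem(word):
    """Very rough plural handling: 'ponies' -> 'pony', 'quotes' -> 'quote'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stop words, and stem what is left."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {naive_stem(t) for t in tokens if t not in STOP_WORDS}


def basic_search(docs, query):
    """Case-insensitive substring match, similar in spirit to icontains."""
    return [d for d in docs if query.lower() in d.lower()]


def toy_full_text_search(docs, query):
    """Match when every normalized query term occurs in the document."""
    terms = normalize(query)
    return [d for d in docs if terms and terms <= normalize(d)]


docs = [
    "I am in the middle.",
    "You don't like middle school.",
    "I am a pony.",
    "You don't like ponies",
]
```

With this data, basic_search(docs, "the middle") finds only the first sentence, while toy_full_text_search(docs, "the middle") finds the first two. The same pattern shows up for "pony" and "ponies".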

    Full-text search is an advanced searching technique that examines all the words in every stored document as it tries to match the search criteria. In addition, with full-text search, you can employ language-specific stemming on the words being indexed. Stemming is the process of reducing words to their word stem, base, or root form. For instance, the words "drives" and "driving" are recorded under the single stem "drive". Irregular forms such as "drove" are generally not unified by a rule-based stemmer.

    It suffices to say that full-text search is not perfect. It's likely to retrieve many documents that are not relevant (false positives) to the intended search query. However, there are some techniques based on Bayesian algorithms that can help reduce such problems.

    To take advantage of Postgres full-text search with Django, add django.contrib.postgres to your INSTALLED_APPS list:

    INSTALLED_APPS = [
        ...
        "django.contrib.postgres",  # new
    ]

    Next, let's look at two quick examples of full-text search, on a single field and on multiple fields.

    Update the get_queryset method of the SearchResultsList view like so:

    class SearchResultsList(ListView):
        model = Quote
        context_object_name = "quotes"
        template_name = "search.html"

        def get_queryset(self):
            query = self.request.GET.get("q")
            return Quote.objects.filter(quote__search=query)

    Here, we set up full-text search against a single field -- the quote field.

    As you can see, it takes similar words into account. In the above example, "ponies" and "pony" are treated as similar words.

    To search against multiple fields and on related models, you can use the SearchVector class.

    Again, update SearchResultsList:

    class SearchResultsList(ListView):
        model = Quote
        context_object_name = "quotes"
        template_name = "search.html"

        def get_queryset(self):
            query = self.request.GET.get("q")
            return Quote.objects.annotate(
                search=SearchVector("name", "quote")
            ).filter(search=query)

    To search against multiple fields, you annotated the queryset using a SearchVector. The vector is the data that you're searching for, which has been converted into a form that is easy to search. In the example above, this data is the name and quote fields in your database.

    Make sure to add the import:

    from django.contrib.postgres.search import SearchVector
    

    Try some searches out.

    Stemming and Ranking

    In this section, you'll combine several methods such as SearchVector, SearchQuery, and SearchRank to produce a very robust search that uses both stemming and ranking.

    Again, stemming is the process of reducing words to their word stem, base, or root form. With stemming, words like "pony" and "ponies" will be treated as similar words. Ranking, on the other hand, allows us to order results by relevancy.

    Update SearchResultsList:

    class SearchResultsList(ListView):
        model = Quote
        context_object_name = "quotes"
        template_name = "search.html"

        def get_queryset(self):
            query = self.request.GET.get("q")
            search_vector = SearchVector("name", "quote")
            search_query = SearchQuery(query)
            return (
                Quote.objects.annotate(
                    search=search_vector,
                    rank=SearchRank(search_vector, search_query),
                )
                .filter(search=search_query)
                .order_by("-rank")
            )

    What's happening here?

    1. SearchVector - again you used a search vector to search against multiple fields. The data is converted into another form since you're no longer just searching the raw text like you did when icontains was used. As a result, you can search plurals easily. For example, searching for "flask" and "flasks" will yield the same results because both reduce to the same stem.
    2. SearchQuery - takes the words submitted through the form, passes them through a stemming algorithm, and then looks for matches for all of the resulting terms.
    3. SearchRank - allows us to order the results by relevancy. It takes into account how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document is where they occur.
    Add the imports:

    from django.contrib.postgres.search import SearchVector, SearchQuery, SearchRank
    

    Compare the results from the basic search to those of the full-text search. There's a clear difference. In the full-text search, the result with the highest rank is shown first. This is the power of SearchRank. Combining SearchVector, SearchQuery, and SearchRank is a quick way to produce a much more powerful and precise search than the basic search.

    Adding Weights

    Full-text search gives us the ability to add more importance to some fields in our table in the database over other fields. We can achieve this by adding weights to our queries.

    The weight should be one of the following letters D, C, B, A. By default, these weights refer to the numbers 0.1, 0.2, 0.4, and 1.0, respectively.

    Update SearchResultsList:

    class SearchResultsList(ListView):
        model = Quote
        context_object_name = "quotes"
        template_name = "search.html"

        def get_queryset(self):
            query = self.request.GET.get("q")
            search_vector = SearchVector("name", weight="B") + SearchVector(
                "quote", weight="A"
            )
            search_query = SearchQuery(query)
            return (
                Quote.objects.annotate(rank=SearchRank(search_vector, search_query))
                .filter(rank__gte=0.3)
                .order_by("-rank")
            )

    Here, you added weights to the SearchVector using both the name and quote fields. Weights of 0.4 and 1.0 were applied to the name and quote fields, respectively. Therefore, quote matches will prevail over name matches. Finally, you filtered the results to display only the ones with a rank greater than or equal to 0.3.
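To build intuition for how the letters influence ordering, here is a deliberately simplified sketch. The toy_rank function and the sample rows are assumptions made for illustration; Postgres computes rank with a more involved formula, so only the relative ordering is meant to carry over.

```python
# Default weight letters and their numeric values, as listed above.
WEIGHTS = {"D": 0.1, "C": 0.2, "B": 0.4, "A": 1.0}


def toy_rank(row, term, field_weights):
    """Add up the weight of every field that contains the term as a word."""
    score = 0.0
    for field, letter in field_weights.items():
        words = [w.strip(".,!?") for w in row[field].lower().split()]
        if term.lower() in words:
            score += WEIGHTS[letter]
    return score


field_weights = {"name": "B", "quote": "A"}
rows = [
    {"name": "Middle Man", "quote": "Stay calm."},
    {"name": "Jane", "quote": "Stuck in the middle again."},
    {"name": "John", "quote": "Nothing to see here."},
]

# Keep rows at or above the 0.3 threshold, best match first.
ranked = sorted(
    (r for r in rows if toy_rank(r, "middle", field_weights) >= 0.3),
    key=lambda r: toy_rank(r, "middle", field_weights),
    reverse=True,
)
```

The row whose quote matches sorts ahead of the row whose name matches, and the row with no match falls below the threshold and is dropped.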

    Adding a Preview to the Search Results

    In this section, you'll add a short preview of each search result via the SearchHeadline class. This will highlight the matched search terms.

    Update SearchResultsList again:

    class SearchResultsList(ListView):
        model = Quote
        context_object_name = "quotes"
        template_name = "search.html"

        def get_queryset(self):
            query = self.request.GET.get("q")
            search_vector = SearchVector("name", "quote")
            search_query = SearchQuery(query)
            search_headline = SearchHeadline("quote", search_query)
            return (
                Quote.objects.annotate(
                    search=search_vector,
                    rank=SearchRank(search_vector, search_query),
                )
                .annotate(headline=search_headline)
                .filter(search=search_query)
                .order_by("-rank")
            )

    SearchHeadline takes the field you want to preview, in this case the quote field, along with the query. Matched terms in the preview are wrapped in bold tags.
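The idea behind a headline can be sketched in plain Python. The toy_headline helper below is an assumption made for illustration, not how Postgres builds headlines: it matches whole words only, without stemming, and it escapes the surrounding text so that the result could be marked safe in a template.

```python
import html
import re


def toy_headline(text, term):
    """Wrap whole-word matches of term in <b> tags and escape everything else."""
    if not term:
        return html.escape(text)
    pattern = re.compile(rf"\b({re.escape(term)})\b", re.IGNORECASE)
    parts = pattern.split(text)
    # With one capture group, odd indexes hold the matched terms.
    return "".join(
        f"<b>{html.escape(part)}</b>" if i % 2 else html.escape(part)
        for i, part in enumerate(parts)
    )


preview = toy_headline("I am in the Middle.", "middle")
```

Escaping matters here because the highlighted string is rendered as HTML, so any markup that is stored inside a quote would otherwise be rendered too.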

    Make sure to add the import:

    from django.contrib.postgres.search import SearchVector, SearchQuery, SearchRank, SearchHeadline
    

    Before trying out some searches, update the <li></li> in quotes/templates/search.html like so:

    <li>{{ quote.headline | safe }} - <b>By <i>{{ quote.name }}</i></b></li>
    

    Now, instead of showing the quotes as you did before, only a preview of the full quote field is displayed along with the highlighted search query.

    Boosting Performance

    Full-text search is an intensive process. To combat slow performance, you can:

    1. Save the search vectors to the database with SearchVectorField. In other words, rather than converting the strings to search vectors on the fly, we'll create a separate database field that contains the processed search vectors and update the field any time there's an insert or update to either the quote or name fields.
    2. Create a database index, which is a data structure that enhances the speed of the data retrieval processes on a database. It, therefore, speeds up the query. Postgres gives you several indexes to work with that might be applicable for different situations. The GinIndex is arguably the most popular.
    To learn more about performance with full-text search, review the Performance section from the Django docs.
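GIN stands for Generalized Inverted Index, and the reason an inverted index speeds up term lookups can be shown with a toy in-memory version. This is only a sketch of the idea; the function names and sample data are assumptions made for illustration, and the real index stores stemmed lexemes rather than raw words.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,!?")].add(doc_id)
    return index


def lookup(index, term):
    """Find matching ids with a single dictionary access, no full scan."""
    return sorted(index.get(term.lower(), set()))


sample_docs = {
    1: "I am in the middle.",
    2: "You don't like middle school.",
}
sample_index = build_inverted_index(sample_docs)
```

Building the index costs time and storage up front, which mirrors the trade-off of maintaining a stored search vector plus an index: writes get slower so that reads get faster.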

    Search Vector Field

    Start by adding a new SearchVectorField field to the Quote model in quotes/models.py:

    from django.contrib.postgres.search import SearchVectorField  # new
    from django.db import models
    

    class Quote(models.Model):
        name = models.CharField(max_length=250)
        quote = models.TextField(max_length=1000)
        search_vector = SearchVectorField(null=True)  # new

        def __str__(self):
            return self.quote

    Create the migration file:

    $ docker-compose exec web python manage.py makemigrations
    

    The field can only be populated for rows that already exist in the database, so you also need a trigger that updates search_vector whenever a row is inserted or its name or quote field changes. To achieve this, create a custom migration file in "quotes/migrations" called 0003_search_vector_trigger.py:

    from django.contrib.postgres.search import SearchVector
    from django.db import migrations


    def compute_search_vector(apps, schema_editor):
        Quote = apps.get_model("quotes", "Quote")
        Quote.objects.update(search_vector=SearchVector("name", "quote"))


    class Migration(migrations.Migration):

        dependencies = [
            ("quotes", "0002_quote_search_vector"),
        ]

        operations = [
            migrations.RunSQL(
                sql="""
                CREATE TRIGGER search_vector_trigger
                BEFORE INSERT OR UPDATE OF name, quote, search_vector
                ON quotes_quote
                FOR EACH ROW EXECUTE PROCEDURE
                tsvector_update_trigger(
                    search_vector, 'pg_catalog.english', name, quote
                );
                UPDATE quotes_quote SET search_vector = NULL;
                """,
                reverse_sql="""
                DROP TRIGGER IF EXISTS search_vector_trigger
                ON quotes_quote;
                """,
            ),
            migrations.RunPython(
                compute_search_vector, reverse_code=migrations.RunPython.noop
            ),
        ]

    Depending on your project structure, you may need to update the name of the previous migration file in dependencies.

    Apply the migrations:

    $ docker-compose exec web python manage.py migrate
    

    To use the new field for searches, update SearchResultsList like so:

    class SearchResultsList(ListView):
        model = Quote
        context_object_name = "quotes"
        template_name = "search.html"

        def get_queryset(self):
            query = self.request.GET.get("q")
            return Quote.objects.filter(search_vector=query)

    Update the <li></li> in quotes/templates/search.html again:

    <li>{{ quote.quote | safe }} - <b>By <i>{{ quote.name }}</i></b></li>
    

    Index

    Finally, let's set up a GIN index on the search_vector field with GinIndex.

    Update the Quote model:

    from django.contrib.postgres.indexes import GinIndex  # new
    from django.contrib.postgres.search import SearchVectorField
    from django.db import models
    

    class Quote(models.Model):
        name = models.CharField(max_length=250)
        quote = models.TextField(max_length=1000)
        search_vector = SearchVectorField(null=True)

        def __str__(self):
            return self.quote

        # new
        class Meta:
            indexes = [
                GinIndex(fields=["search_vector"]),
            ]

    Create and apply the migrations one last time:

    $ docker-compose exec web python manage.py makemigrations
    $ docker-compose exec web python manage.py migrate
    
