Building a search service from scratch

In one of my previous posts I fiddled around with client-side search powered by Lunr.js. Since then I have been wanting to experiment more with this; it feels like it should be possible to build a hosted search service with Node.js - without writing a whole lot of code, that is.

So, that’s on today’s agenda 😊

For this project I’ll use Umbraco as the data source once again, but this time I’ll feed the search service using the built-in Umbraco webhooks. I’ll also use the same blog content as I did in the previous post, so the search results should be comparable.

The final result can be found in this GitHub repo.

Waitaminute! But… why?

Given the number of excellent search services out there, why would you ever consider building your own? The answer is… you probably wouldn’t 😛

Unless you’re on a tight budget and really love hosting Node.js apps in production, this project ranks pretty high on the “useless” scale.

But it’s a fun one all the same 🤓

Choosing a search engine

The search service needs to be able to continuously update its index, as posts are created, updated or deleted.

As it turns out, Lunr.js generates immutable indexes, so that’s not going to be a great choice for this project.

Enter MiniSearch, which ticks off all the boxes for the search service:

  • It is lightweight.
  • It supports altering the index at runtime.
  • It is feature-rich (although not as feature-rich as Lunr.js).
  • It has a comprehensive query language (sketched briefly below).
  • It is extendable.
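
As a quick taste of that query language, here’s a small sketch of a few of the search options MiniSearch supports - fuzzy matching, prefix search and per-field boosting (the document and field names here are just examples):

import MiniSearch from 'minisearch';

const miniSearch = new MiniSearch({fields: ['title', 'excerpt']});
miniSearch.add({id: 1, title: 'Building a search service', excerpt: 'Search services are fun'});

// 'serch' still matches 'search' thanks to fuzzy matching
const results = miniSearch.search('serch', {
    fuzzy: 0.2,        // allow terms to differ by up to 20% of their length
    prefix: true,      // treat terms as prefixes ('ser' would match 'service')
    boost: {title: 2}  // weigh matches in 'title' twice as high
});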

The requirements

Here’s what I want my search service to handle:

  1. The search results should be immediately consumable by clients.
  2. The Umbraco webhooks must be able to power the index.
  3. Changes to the index must require authorization.
  4. The index must be continuously persisted on disk.

Let’s break those down one by one.

Immediately consumable search results

Unlike Lunr.js, MiniSearch allows for storing raw property values alongside the indexed properties in the index. This means I can somewhat effortlessly produce search results that can be rendered directly on the consumer side.

As an added bonus, the indexed and stored properties are declared separately, so a field can be stored for output without necessarily being indexed:

import MiniSearch from 'minisearch';

// define the index:
// - index the fields 'title', 'excerpt', 'tags'
// - store the fields 'title', 'excerpt', 'tags', 'path' (for search result generation later on)
const indexOptions = {
    fields: ['title', 'excerpt', 'tags'],
    storeFields: ['title', 'excerpt', 'tags', 'path']
};

const index = new MiniSearch(indexOptions);

With the index in place, the storeFields become immediately available in the search results. This is quite convenient for building the consumable search result output:

const index = [index from previous code snippet]
const query = [search query from client]

// execute the search against the index
const results = index.search(
    query,
    {
        // use AND for multi-word queries (default is OR)
        combineWith: 'AND',
        // use trailing wildcard for terms that are 3 chars or longer
        prefix: term => term.length > 2
    }
);

// create an output that is immediately consumable by clients
const items = results.map((result) => ({
    id: result.id,
    path: result.path,
    title: result.title,
    excerpt: result.excerpt,
    tags: result.tags.split(' ')
}));
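
For completeness, here’s a minimal sketch of how this mapping could be exposed as a public search endpoint, assuming an Express app like the one set up in the next section (the route and query parameter names are my own, so the repo may differ):

// a minimal search endpoint sketch (route and parameter names are assumptions)
app.get('/search', (req, res) => {
    const query = req.query.q;
    if (!query) {
        res.status(400).send('Missing query parameter (q)');
        return;
    }

    const results = index.search(query, {
        combineWith: 'AND',
        prefix: term => term.length > 2
    });

    // reuse the mapping from above to produce immediately consumable items
    res.json(results.map((result) => ({
        id: result.id,
        path: result.path,
        title: result.title,
        excerpt: result.excerpt,
        tags: result.tags.split(' ')
    })));
});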

Indexing content from Umbraco webhooks

At first glance, the built-in webhooks in Umbraco seem like a great fit for pushing content to the search service. But they do come with a few limitations that require some workarounds:

  1. All webhooks are executed as POST requests. Thus, all indexing operations will be handled by the same POST endpoint in the search service, using the custom umb-webhook-event header to identify the concrete operation. While this is definitely doable, it means that the indexing part of the search service won’t be terribly RESTful.
  2. The webhooks do not have explicit support for authorization. But they do support custom request headers, so I’ll just have to settle for using basic auth.

With these things in mind, the indexing endpoint ends up something along these lines:

import express from 'express';

const index = [index from previous code snippet]
const app = express();

// parse incoming JSON request bodies (the webhook payloads)
app.use(express.json());

app.post('/index', (req, res) => {
    // the 'umb-webhook-event' header value contains the instruction on what to do
    const event = req.headers['umb-webhook-event'];
    if (!event) {
        res.status(400).send('Malformed request (missing umb-webhook-event header)');
        return;
    }

    // get the request body and sanity check
    const data = req.body;
    if (!data || !data.Id) {
        res.status(400).send('Malformed request body (missing id)');
        return;
    }

    // figure out what to do based on the event header value
    switch (event) {
        // content published => create/update the document in the index 
        case 'Umbraco.ContentPublish':
            // construct a document for the index
            const doc = {
                id: data.Id,
                path: data.Route.Path,
                title: data.Name,
                excerpt: data.Properties.excerpt,
                tags: data.Properties.tags
            };
            // add/update the index
            if (index.has(doc.id)) {
                index.replace(doc);
            } else {
                index.add(doc);
            }
            storeIndex().then(() => res.status(200).send());
            break;
        // content unpublished or deleted => delete the document from the index 
        case 'Umbraco.ContentUnpublish':
        case 'Umbraco.ContentDelete':
            index.discard(data.Id);
            storeIndex().then(() => res.status(200).send());
            break;
    }
});

// stores the index on disk
async function storeIndex() {
    [will be discussed later]
}
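
For reference, the endpoint only relies on a handful of fields from the webhook payload. A trimmed-down Umbraco.ContentPublish payload could look something like this (the actual payload contains many more fields, and the values here are made up):

{
    "Id": "a1b2c3d4-0000-0000-0000-000000000000",
    "Name": "Some blog post",
    "Route": {
        "Path": "/posts/some-blog-post/"
    },
    "Properties": {
        "excerpt": "A short excerpt of the post",
        "tags": "nodejs search minisearch"
    }
}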

Authorization for index changes

Since the Umbraco webhooks only support basic auth, this one is pretty straightforward; I’ll enable basic auth for the Express server using express-basic-auth middleware.

There is a little trick to it, though. The middleware should only be applied for the indexing operations - the search endpoint must remain publicly available. Fortunately this is entirely doable because of how the middleware works in Express:

import express from 'express';
import basicAuth from 'express-basic-auth';

// the users allowed to modify the index (using basic auth)
const indexUsers = {'someone': 'some-secret-password'};

// define the express app and setup basic auth middleware for the indexing operations
const app = express();
const basicAuthMiddleware = basicAuth({users: indexUsers});
app.use((req, res, next) =>
    req.originalUrl.startsWith('/index') ? basicAuthMiddleware(req, res, next) : next()
);
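
On the Umbraco side, the webhook then needs an Authorization request header with matching credentials. Basic auth is simply 'Basic ' followed by the base64-encoded 'username:password' pair, so the header value can be generated with a Node.js one-liner (using the credentials from the snippet above):

// generate the Authorization header value for the Umbraco webhook configuration
const headerValue = 'Basic ' + Buffer.from('someone:some-secret-password').toString('base64');
console.log(headerValue);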

Persisting the index on disk

In order for the index to survive a restart (or re-deploy) of the search service, the index must be persisted on disk. There are two caveats here.

Firstly, concurrency is a thing. I am of course expecting a supermassive high load on my search service 🤘

Apparently, Node.js is not really great at file locking. But of course, there is an NPM package for that 🤦 - proper-lockfile.

Secondly, because of the expected supermassive high load, I don’t want to persist the index on disk with every single update; this would degrade performance when handling bulk updates.

To get around this I’ve chosen to defer the persistence by a second, so multiple successive indexing operations within a second only yield a single file operation.

The following code snippet shows all this in action. Note that it is a slightly simplified version of the final code, in order to emphasize the file locking and the deferred file writing.

import MiniSearch from 'minisearch';
import {promises as fs} from 'fs';
import lockfile from 'proper-lockfile';

const indexOptions = [index options from previous code snippet]
// note: 'let' rather than 'const', as restoreIndex() below reassigns the index
let index = [index from previous code snippet]

// handle for the deferred index write in storeIndex()
let storeIndexTimeout;

// this is the name of the index file on disk
const indexFile = 'index.idx';

// stores the index on disk
async function storeIndex() {
    // defer writing by a second, in case multiple indexing requests are made within a short period
    clearTimeout(storeIndexTimeout);
    storeIndexTimeout = setTimeout(async () => {
        // lock the index file before writing
        // (realpath: false lets proper-lockfile lock a file that does not exist yet)
        const release = await lockfile.lock(indexFile, {realpath: false});
        try {
            const data = JSON.stringify(index);
            await fs.writeFile(indexFile, data);
        } catch (err) {
            // ...add some error handling here
        } finally {
            // release the index file lock
            await release();
        }
    }, 1000);
}

// restores the index from disk (creates a new index if no index file is found)
async function restoreIndex() {
    try {
        // lock the index file before reading (this throws if no index file exists yet)
        const release = await lockfile.lock(indexFile);
        try {
            const data = await fs.readFile(indexFile, {encoding: 'utf8'});
            // IMPORTANT: the index must be loaded with the same options as it was originally created with
            index = MiniSearch.loadJSON(data, indexOptions);
        } finally {
            // release the index file lock
            await release();
        }
    } catch (err) {
        // ...add some error handling here - fallback to creating a new index
        index = new MiniSearch(indexOptions);
    }
}
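
The last bit of glue (not shown above) is to restore the index before the server starts accepting requests. A minimal sketch, assuming the Express app from earlier and an arbitrary port:

// restore the index from disk (or create a new one) before accepting traffic
await restoreIndex();

app.listen(3000, () => console.log('Search service listening on port 3000'));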

Time to try it out 🚀

That was a whole lot of code! If you have cloned down the GitHub repo, here’s how you can try it out:

First start the search service. Open a terminal in /src/Service and run:

npm install
npm run service

The Umbraco site is pre-configured with a webhook that pushes content to the search service when a post is published. To start Umbraco, open a terminal in /src/Cms and run dotnet run.

Once Umbraco starts up, you can log in to the backoffice (the credentials are displayed on the login screen) and start publishing the blog posts. Unfortunately, publishing the posts root with descendants won’t trigger the webhook for all the descendant posts, so you have to publish all the posts one by one.

If you don’t care to do that, I have added a Node.js script that pulls all posts from the Delivery API and pushes them to the search service. You can execute it by running npm run seed in /src/Service.

With the posts indexed by the search service, you can now test the search endpoint. I have tweaked the test page from the previous post, so it consumes the search service instead of the local Lunr.js index. Fire up yet another terminal in /src/Client and run:

npm install
npm run client

This starts an Express server, which hosts the test page. If all went well, the published posts should be readily searchable from the test page 🤞

Pushing the envelope a bit

The test dataset is just six blog posts in the Umbraco database, which arguably isn’t a whole lot. So I figure it makes sense to try with a little more test data 😉

To that end I have created a console app to seed the index with 10,000 posts. The posts are made up of random values, generated by the brilliant Bogus package. Size-wise they are comparable to the test dataset in Umbraco - possibly a little larger.

The results are quite positive:

  1. Query execution time is seemingly unaffected; searches still execute in a matter of milliseconds.
  2. The resulting index is only around 9 MB on disk, and roughly half of that is made up of the stored fields.

If you’re curious about Bogus or want to recreate my test, you’ll find the console app in the GitHub repo as well, under /src/Seeder.

Takeaways from this project

First and foremost - MiniSearch is impressive. Just as with Lunr.js in my previous post, it’s super easy to get started with MiniSearch, and it JustWorks™ without any fuss.

I’m particularly impressed with all the work that has clearly gone into optimizing the performance of the MiniSearch implementation. If you dissect the generated index, you’ll see things like optimized document IDs being generated automatically.

Add to that a wealth of options for tweaking and fine-tuning both indexing and searching - impressive indeed 👏

I know there’s a lot of code in this post… but in reality, the entire search service implementation spans less than 200 lines of JavaScript - including comments. I think that’s pretty cool 😎 and it goes to show that a lot can be accomplished with fairly little code.

If you ever consider using this for anything more than fun and games, here are a few things that should be amended in the search service:

  1. The deferred index file writing is great for performance, but it is also a point of failure. If the service crashes before it has time to write the index file, all non-committed index changes are lost. One workaround might be to write delta files, so the index can be spooled back into the last known state after a crash (a rough sketch follows below).
  2. The indexing operations should be split into proper RESTful operations - PUT for updates, POST for creations and DELETE for removals. This of course only works if your data source is able to consume the service using those HTTP verbs.
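
To illustrate the delta file idea from the first point, here’s a rough, untested sketch: append every index operation to a journal file as it happens, and replay the journal on startup. The function names are made up, and a real implementation would also truncate the journal after each successful full index write:

import {promises as fs} from 'fs';

// append one JSON line per index operation, e.g. {"op":"add","doc":{...}}
const deltaFile = 'index.delta';

async function appendDelta(op, doc) {
    await fs.appendFile(deltaFile, JSON.stringify({op, doc}) + '\n');
}

// replay the journal against a restored index after a crash
async function replayDeltas(index) {
    const data = await fs.readFile(deltaFile, {encoding: 'utf8'});
    for (const line of data.split('\n').filter(Boolean)) {
        const {op, doc} = JSON.parse(line);
        if (op === 'discard') {
            if (index.has(doc.id)) index.discard(doc.id);
        } else if (index.has(doc.id)) {
            index.replace(doc);
        } else {
            index.add(doc);
        }
    }
}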

Something tells me the geek in me is not entirely finished with this project just yet. I am weirdly fascinated by this whole thing, even if I can’t really imagine myself ever using it “for real”.

Anyway… happy hacking 💜