Scaling an online search engine to thousands of physical stores – ElasticON

10/03/2023

Author(s):

Roudy Khoury

Reading time: 6 minute(s)

A summary of the talk "Scaling an online search engine to thousands of physical stores" by Roudy Khoury and Aline Paponaud at ElasticON 2023.

We presented the English version of this talk at Berlin Buzzwords last year. This year, on March 8, we participated in the ElasticON Global 2023 conference, where we presented our search engine solution for managing a large number of online stores. This was an opportunity to generalize and update the presentation, and also to share our experience in French.

Online shopping has been an opportunity for some e-commerce stores during the Covid-19 crisis. But for almost all businesses, it has brought a multitude of new challenges. Today, almost all physical stores want to go online.

The search engine plays a very important role in the success of e-commerce stores, so users expect certain features from it:

With online search, we need to find the products we want to buy
We need an autocomplete system to help us reformulate our search and quickly find what we are looking for
Ability to navigate through sections
If we don’t find the product we’re looking for, the search engine should suggest similar products
If we misspell a word, the search engine must be intelligent and know what we mean (“Did you mean?”)
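
As an illustration of the last two expectations, here is a minimal sketch of request bodies for Elasticsearch's completion suggester (autocomplete) and phrase suggester ("Did you mean?"). The talk does not say which mechanism was used, so this is only one possible approach. The field names `suggest` and `title` and the suggestion labels are assumptions, and the completion suggester requires the target field to be mapped with the `completion` type.

```python
def autocomplete_body(prefix: str, field: str = "suggest", size: int = 5) -> dict:
    """Build a completion-suggester request body for a typed prefix.

    `field` is a hypothetical field name; it must be mapped as type `completion`.
    """
    return {
        "suggest": {
            "autocomplete": {
                "prefix": prefix,
                "completion": {"field": field, "size": size},
            }
        }
    }


def did_you_mean_body(text: str, field: str = "title") -> dict:
    """Build a phrase-suggester request body that proposes one corrected query.

    `field` is a hypothetical text field holding product titles.
    """
    return {
        "suggest": {
            "text": text,
            "did_you_mean": {"phrase": {"field": field, "size": 1}},
        }
    }


# Example bodies that could be sent to the _search endpoint of an index
autocomplete_request = autocomplete_body("choc")
correction_request = did_you_mean_body("chocolat bar")
```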

When a store wants to go online, it will need to manage a lot of data. This data can take several forms: in files, in a database, or obtained via streams from different sources.

Data maturity differs from one store to another: the data is heterogeneous, varies between stores, and comes from many sources. The search engine platform has to be built on top of that. It must handle all the complexity that comes with the data, hide it, and present the data in a clean way.

Managing Indexes

There are several approaches for managing data in Elasticsearch. Because each store’s offers have their own specifics (for example, promotions that exist in one store but not in another), the first idea that comes to mind is to have one large index containing all the product and offer data, duplicated by store.

This could be a good solution, since the approach is compact and unified and the cluster state remains small. But there is also another idea.

We can create an index with product references, i.e., product data that does not change between stores, and then put all offers into separate indexes, one index per store. In this case, we have a model that looks like a relational data model.

However, these approaches have drawbacks:

We lose performance because too much information needs to be retrieved. The search engine must traverse large data models, and everything that is store-specific has to be resolved at search time.
Any configuration error will break the search on all stores. For example, if a product contains bad information, it will be replicated in other stores.
Updates will be very frequent on the same index. If a store wants to change a price, for example, it must update the index it shares with the others.
In this case, there are also joins. We don’t like that in search!

To address these points, we set up one index per store.

In the store index, we store all products and offers for that store, along with its configuration. This introduces duplication, because some data is common to several stores. But since we have an index per store, we can really fine-tune each store. With this approach:

The cluster state will likely increase as we have many indexes, but performance and response time will be the best.
Data will be independent per store, which gives better isolation.
There will be fewer updates per index, since they are distributed across multiple indexes.
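
A minimal sketch of what this denormalization could look like before indexing: shared product data is copied into one document per store offer, and each document is assigned to that store's index. The `store-<id>` naming convention and all field names are assumptions made for illustration. The only Elasticsearch constraint relied on here is that index names must be lowercase.

```python
from typing import Dict, Iterator, List, Tuple


def store_index_name(store_id: str) -> str:
    """Hypothetical naming convention: one index per store, lowercase."""
    return f"store-{store_id.lower()}"


def denormalize(
    products: Dict[str, dict], offers: List[dict]
) -> Iterator[Tuple[str, dict]]:
    """Yield (index_name, document) pairs.

    Shared product data is duplicated into every store's document, so a
    search on one store never needs a join with a reference index.
    """
    for offer in offers:
        product = products[offer["product_id"]]
        document = {
            **product,
            "price": offer["price"],
            "promotion": offer.get("promotion"),  # None when the store has no promotion
        }
        yield store_index_name(offer["store_id"]), document


# Hypothetical sample data: one product sold by two stores at different prices
products = {"p1": {"product_id": "p1", "name": "Espresso beans 1kg", "brand": "Acme"}}
offers = [
    {"store_id": "PARIS-01", "product_id": "p1", "price": 12.9, "promotion": "-10%"},
    {"store_id": "LYON-02", "product_id": "p1", "price": 13.5},
]
actions = list(denormalize(products, offers))
```

A price change in one store then only touches that store's index, which is the isolation property described above.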

Maturity and Competition

Everyone who has managed to put their store online will be in direct competition with the big players (Amazon, Rakuten, etc.). The search engine will be able to show products from stores, and if there are none, it can show products from the marketplace. The marketplace can be a third party that provides products with delivery options. So the search will include physical stores as well as the marketplace, and the idea is that the search engine hides this complexity.

We can have thousands of physical store products and millions of marketplace products. Since marketplace products will not have specific features, they can be grouped into a single index.

There is still one problem to solve: products do not all have the same data types or structure. Food and non-food products, for example, do not necessarily have the same structure and fields. So we need some intelligence at indexing time to reconcile them. For this, we proposed a common schema for product data. We prepare the data before indexing so that it has a well-defined structure, and in this way the search can work on multiple indexes at the same time, both marketplace and store.
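
The idea of a common schema can be sketched as a normalization step applied before indexing. The field names, the alias table, and the index naming below are all assumptions for illustration, not the schema presented in the talk. The comma-separated target at the end uses Elasticsearch's standard multi-index search syntax.

```python
# Hypothetical aliases: different sources may name the same field differently
FIELD_ALIASES = {
    "id": ("id", "sku"),
    "title": ("title", "name", "label"),
}
# Hypothetical family-specific fields, kept together under "attributes"
OPTIONAL_ATTRIBUTES = ("allergens", "weight_g", "color", "size")


def first_present(raw: dict, keys: tuple) -> object:
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        if raw.get(key) not in (None, ""):
            return raw[key]
    raise KeyError(f"none of {keys} found in record")


def to_common_schema(raw: dict, source: str) -> dict:
    """Map a raw product record to one shared structure."""
    return {
        "id": str(first_present(raw, FIELD_ALIASES["id"])),
        "title": str(first_present(raw, FIELD_ALIASES["title"])).strip(),
        "category": raw.get("category", "uncategorized"),
        "price": float(raw["price"]),
        "source": source,  # "store" or "marketplace"
        "attributes": {k: raw[k] for k in OPTIONAL_ATTRIBUTES if k in raw},
    }


def search_target(store_id: str, marketplace_index: str = "marketplace") -> str:
    """Comma-separated index list, so one request covers store and marketplace."""
    return f"store-{store_id.lower()},{marketplace_index}"


food = to_common_schema(
    {"sku": 42, "name": " Organic milk ", "price": "1.20", "allergens": ["milk"]}, "store"
)
non_food = to_common_schema(
    {"id": "tv-1", "title": "TV", "price": 499, "color": "black"}, "marketplace"
)
```

Because both documents end up with the same top-level fields, a single query can be run against `search_target("PARIS-01")` without per-index special cases.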

To ensure the scalability of our search engine, there are a few points to guarantee:

We are in a mobile world, so we need a very minimal response time.
If the user searches for a product and the search takes a few seconds to display the results, they will probably go somewhere else.
Handle thousands of product updates.
Manage multiple entry points for mobile applications, websites, crawlers, etc.
Maintain search security and robustness by isolating data.

A quick view of such a solution:

Elasticsearch with its configuration, different mappings of fields, and settings for each index.
An index that stores the configurations of stores or a group of stores.
A business console to manage these configurations: to apply boosts, facets, create dictionaries, or any other changes to the configuration.
A module that prepares the data according to the defined structures before indexing.

Since we have a lot of indexed data spread over many indexes, the cluster state can quickly become large. So we need to make sure we have a multi-cluster setup.

We create multiple clusters on which we distribute the indexes and duplicate the common indexes such as the marketplace index. Therefore, we will need a router that will have a table containing information on which cluster each store is on.
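
A minimal sketch of such a router, assuming an in-memory routing table. The cluster names, URLs, and the `store-<id>` index naming are hypothetical. Since the marketplace index is duplicated on every cluster, the same multi-index search path works whichever cluster hosts the store.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClusterRouter:
    """Resolve which cluster hosts a given store's index."""

    store_to_cluster: Dict[str, str]  # routing table: store id -> cluster name
    cluster_urls: Dict[str, str]      # cluster name -> base URL

    def search_url(self, store_id: str) -> str:
        """Return the search endpoint for a store, or raise KeyError if unknown."""
        cluster = self.store_to_cluster[store_id]
        base_url = self.cluster_urls[cluster]
        # The marketplace index exists on every cluster, so it can always be added
        return f"{base_url}/store-{store_id},marketplace/_search"


# Hypothetical routing table with two clusters
router = ClusterRouter(
    store_to_cluster={"paris-01": "cluster-a", "lyon-02": "cluster-b"},
    cluster_urls={
        "cluster-a": "http://es-a.internal:9200",
        "cluster-b": "http://es-b.internal:9200",
    },
)
paris_url = router.search_url("paris-01")
```

In practice the table would live somewhere shared and be updated when stores are moved between clusters. A plain dictionary is used here only to keep the sketch self-contained.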

Points to consider for monitoring the search engine:

Track user search behavior
What are the most searched terms? The most applied filters?
Handle the case of zero results for a query
Track clicks made on the second page, and the position of each clicked product, in order to boost or deboost these products
Keep an eye on response time histograms
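
Several of these indicators can be computed from a simple search log. The sketch below assumes a hypothetical event format (`query`, `hits`, and an optional `clicked_position`) and a page size of 10. It is an illustration of the metrics, not the monitoring stack used in the talk.

```python
from collections import Counter
from statistics import mean
from typing import List


def summarize(events: List[dict], page_size: int = 10) -> dict:
    """Compute basic search-quality indicators from logged search events."""
    terms = Counter(event["query"].strip().lower() for event in events)
    zero_result_queries = sorted(
        {event["query"].strip().lower() for event in events if event["hits"] == 0}
    )
    zero_result_count = sum(1 for event in events if event["hits"] == 0)
    click_positions = [
        event["clicked_position"]
        for event in events
        if event.get("clicked_position") is not None
    ]
    return {
        "top_terms": terms.most_common(3),
        "zero_result_rate": zero_result_count / len(events) if events else 0.0,
        "zero_result_queries": zero_result_queries,
        "mean_click_position": mean(click_positions) if click_positions else None,
        # Clicks past the first page hint at products that deserve a boost
        "clicks_beyond_first_page": sum(1 for p in click_positions if p > page_size),
    }


# Hypothetical log: four searches, one misspelled query with no results
events = [
    {"query": "Milk", "hits": 12, "clicked_position": 1},
    {"query": "milk", "hits": 12, "clicked_position": 14},
    {"query": "mlik", "hits": 0},
    {"query": "tv", "hits": 5, "clicked_position": 3},
]
report = summarize(events)
```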

To conclude, here is a checklist to consider when building an e-commerce search engine:

It is very important to adapt to the physical store and its specifics.
Use a schema to work with large datasets from different sources.
Response time is not negotiable, and it must be optimized by using a good indexing approach.
It is important to learn by monitoring the system and preparing for scalability in advance.
And finally, security matters: not just protection against hackers, but also data isolation to prevent the spread of errors or index corruption.
