What we learned from Data Indexing System here at VTEX

Rodrigo Abinader May 21, 2020

The science and technology of working with data are now among the main drivers of success in the e-commerce business. With the ability to generate, organize and read massive volumes of data stored across multiple servers and database clusters, companies can improve their customers’ user experience and create effective marketing strategies.

This article aims to show the journey of VTEX’s software engineers to implement and improve a complex data indexing system, with advanced search engine functionalities, in a single multi-tenant platform for hundreds of millions of users on a global scale.

Improving search intelligence with Solr, a data indexing tool

The end-user search experience in e-commerce is not as straightforward as an SQL database query. It must take into account the complexities of natural human language, which is prone to variation and error, and that requires an intelligence that standard database systems simply do not possess.

Take the simple example of a computer. How many related words could a user type to retrieve the desired results? Pc, laptop, mac, macbook, notebook… Once you consider possible typing mistakes, the number of variations is practically unlimited. The customer, however, should not have to care about any of this, and the system’s failure to retrieve data can lead to retries or, even worse, an increased dropout rate.
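To make the problem concrete, here is a minimal Python sketch of what a search layer has to do before it even touches the index: map synonyms to one concept and tolerate small typos. The synonym table and the `normalize_term` helper are hypothetical and only illustrate the idea; a real search engine does this with analyzers and far richer dictionaries.

```python
import difflib

# Hypothetical synonym table: every known variant maps to one canonical concept.
SYNONYMS = {
    "computer": "computer",
    "pc": "computer",
    "laptop": "computer",
    "mac": "computer",
    "macbook": "computer",
    "notebook": "computer",
}


def normalize_term(term, cutoff=0.8):
    """Map a user-typed term to a canonical concept, tolerating small typos."""
    word = term.strip().lower()
    if word in SYNONYMS:
        return SYNONYMS[word]
    # Fuzzy match against the known variants to absorb typing mistakes.
    close = difflib.get_close_matches(word, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else word


print(normalize_term("Laptop"))
print(normalize_term("notbook"))
```

Terms that match nothing in the table are returned unchanged, so the search can still fall back to a literal lookup.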

The solution I found for this problem, back in the early days of the VTEX platform development, some eleven years ago, was to adopt Solr, a data indexing system and enterprise search engine built on Apache Lucene. With Solr, it became possible to index the relevant data from our databases and to query and manipulate it through a REST API by means of HTTP GET requests, while at the same time profiting from all the advanced data intelligence it provides.
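As an illustration of what querying Solr over HTTP GET looks like, the sketch below builds a request URL for Solr's standard `/select` handler with the `q`, `rows` and `wt` parameters. The host, the `products` collection and the `name` field are assumptions made up for this example, not our actual setup.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, collection, query, rows=10):
    """Build the URL of an HTTP GET search request against Solr's /select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection and field names.
url = build_solr_select_url("http://localhost:8983", "products", "name:notebook")
print(url)
```

Any HTTP client can then issue a GET request to this URL and read the JSON response.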

Benefits of the Solr system

By adding this third-party software to the VTEX e-commerce platform, users could count on a full range of improved search and BI tools, such as:

Semantic analysis of synonyms and correlated terms
Autocomplete and autocorrect
Geospatial search
Categorization and counters
Faceted search
Real-time indexing
Dynamic fields
Database integration

The Solr system is built with support for distributed indexing, replication, load-balanced querying, and automated failover and recovery, making it reliable, scalable, and fault-tolerant. All these features, in conjunction with its fast response, resulted in a great increase in performance, making search results at the same time faster and more detailed.

First challenge: data creation and update on Solr

Despite all these benefits, the rapid growth of the VTEX platform’s user base (and therefore of the stores’ databases) led our team to a bottleneck: while Solr responded very well to data retrieval, it was not so great when it came to creating and updating data in a multi-server infrastructure. As the database grew to a volume in the order of terabytes and beyond, crashes started to happen. The problem was in great part due to indexing and search competing concurrently for the same server hardware resources (memory, CPU and so on), which gravely stressed the infrastructure.

In order to solve this issue, after studying the Solr documentation, analyzing our complex data structure and reading use cases from other companies, we modified our system so that updates would occur in batches, therefore avoiding the problem of concurrent processes overloading the server. This change in our system’s architecture proved effective, allowing us to greatly scale the amount of indexed data from millions to a volume of billions.
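The batching idea can be sketched in a few lines of Python. This is not our production code: the `BatchIndexer` class, the `send_batch` callback and the batch size are hypothetical, and the callback stands in for whatever actually writes a group of documents to the index.

```python
class BatchIndexer:
    """Buffer document updates and hand them to the index in batches."""

    def __init__(self, send_batch, batch_size=500):
        self._send_batch = send_batch  # hypothetical callable that writes one batch
        self._batch_size = batch_size
        self._buffer = []

    def add(self, doc):
        self._buffer.append(doc)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single batch, if anything is pending."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)


sent = []  # stands in for the index: each element is one batch that was written
indexer = BatchIndexer(sent.append, batch_size=3)
for i in range(7):
    indexer.add({"id": i})
indexer.flush()  # push the final, partially filled batch
print([len(batch) for batch in sent])
```

Instead of seven separate writes competing with searches, the index receives three, and the batch size becomes a knob for trading update latency against load.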

Second challenge: dynamic fields

This method of dealing with problems would set the standard workflow of the VTEX software engineering team throughout the company’s growth journey: adopting a technology, fine-tuning it to the needs of the user base, stressing the system with more input to test its limits, and developing solutions to the bugs and crashes that appear.

One interesting feature of the indexing systems which is quite useful for e-commerce, especially in the context of a large multi-tenant application, is the dynamic field. Dynamic fields make it possible to customize the data beyond the constraints of a basic universal schema, allowing users to add more fields according to their needs and particular specifications. For example, a shoe store may want to further specify products of the same category in terms of sizes, colors, collections; a computer store, on the other hand, would categorize its products in terms of hardware specifications and operating systems.

However, just as in the case of writing data, dynamic fields start to become an issue once the platform’s user base increases. Solr sets a limit on the number of dynamic fields that can be used across the entire application. To make matters worse, dynamic fields greatly increase the complexity and variability of the data in the application.

Our apparently simple but greatly effective solution to this problem was to create specific indexes for dynamic fields. By removing complex data structures from the global index, we were able to create new instances of indexing only when required and decrease the impact of variability.
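One way to picture this separation is the Python sketch below, which splits a product document into the fields every store shares and the store-specific attributes, and routes each part to a different index. The field names, the index names and the `split_document` helper are all assumptions for illustration and do not describe our real schema.

```python
# Hypothetical universal schema shared by every store on the platform.
CORE_FIELDS = {"id", "name", "price", "category"}


def split_document(tenant, product):
    """Return (index_name, document) pairs: core fields go to the global index,
    store-specific attributes go to a separate index for that tenant."""
    core = {k: v for k, v in product.items() if k in CORE_FIELDS}
    custom = {k: v for k, v in product.items() if k not in CORE_FIELDS}
    operations = [("products", core)]
    if custom:
        # The tenant-specific index is only written to when it is needed.
        operations.append((f"specs-{tenant}", {"id": product["id"], **custom}))
    return operations


shoe = {"id": "1", "name": "Runner", "price": 99.0, "category": "shoes",
        "size": 42, "color": "red"}
for index_name, doc in split_document("shoestore", shoe):
    print(index_name, doc)
```

The global index keeps a small, predictable schema, while the variability introduced by each store stays contained in its own index.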

Radical change: transition to the Elasticsearch system

The rapid growth of data can prove to be a real challenge to the infrastructure. This is something we experienced first-hand. While at one point the company was growing at a rate of 50% per year, the volume of data grew at an even larger rate. When our total number of users was around 30 million, suddenly a new client would start using the platform, bringing with it some 150 million users. It is no surprise that such an event caused unprecedented stress to the system, in addition to crashes and downtimes from which we needed upwards of three hours to recover.

But one fundamental aspect of the development process at VTEX is the independence and autonomy employees have for making decisions and defining the next steps. What matters to us is not who proposes the idea, but its efficacy to solve the problem at hand. In other words, every member of the software engineering team has the power to contribute decisively to the project, regardless of his or her position.

When a challenge appeared that could not be solved with our usual methods, a radical change was proposed by the then junior back-end developer Ygor Santos. While he was working on improvements and bug fixes for our Solr setup, Ygor studied a new technology, Elasticsearch, built on the same library as the system in use (Apache Lucene), but more robust and made to withstand massive loads of data, like the sort that the company was facing. After we adapted the models used on Solr to the new technology and put it to the test, Elasticsearch proved able to handle large volumes of data and to be especially fit for scalability.

Thanks to his engagement and learning, Ygor Santos grew within VTEX and became one of the go-to people on this subject. He now also leads one of the teams responsible for handling the largest volume of data that reaches the platform, dealing with data indexing so that the customer can perform simple queries or calculations with aggregations.

Benefits of the Elasticsearch system

Indexes in Elasticsearch are distributed and divided into shards (a partitioning unit from Lucene, also used in Solr) that can be replicated. Nodes are responsible for storing shards, and the system manages data operations in this structure, with safety mechanisms to protect index creation, recover primary shards by using allocation IDs, guard against overload and avoid crashes. Some of the system’s main features include:

Distributed infrastructure
Better support for multi-tenant systems
Data analytics
Grouping and aggregation
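To give an idea of how documents end up spread across shards, here is a simplified Python sketch of hash-based routing: a document's routing value (by default its id) is hashed and the result is taken modulo the number of primary shards. Elasticsearch uses a Murmur3 hash and, in recent versions, a slightly more elaborate formula; `zlib.crc32` is only a stand-in here, and the `pick_shard` helper is hypothetical.

```python
import zlib


def pick_shard(doc_id, num_primary_shards):
    """Simplified routing: hash the document id, then take it modulo the shard count.
    crc32 is a stand-in for the Murmur3 hash that Elasticsearch actually uses."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


for doc_id in ["sku-1", "sku-2", "sku-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the result depends on the number of primary shards, that number cannot be changed freely once an index holds data, which is why it is worth planning ahead.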

The Elasticsearch documentation describes the mechanism behind this resiliency:

“A master node in Elasticsearch continuously monitors the cluster nodes and removes any node from the cluster that doesn’t respond to its pings in a timely fashion. If the master is left with too few nodes, it will step down and a new master election will start.”
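The "too few nodes" rule in the quote is usually understood as a majority (quorum) check over the master-eligible nodes. The Python sketch below shows that arithmetic under this assumption; the function names are hypothetical, and the real election logic in Elasticsearch is considerably more involved.

```python
def quorum(master_eligible_nodes):
    """Smallest number of nodes that forms a majority."""
    return master_eligible_nodes // 2 + 1


def master_should_step_down(visible_nodes, master_eligible_nodes):
    """A master that can no longer see a majority of nodes gives up its role."""
    return visible_nodes < quorum(master_eligible_nodes)


print(quorum(3))
print(master_should_step_down(visible_nodes=1, master_eligible_nodes=3))
```

Requiring a majority means two isolated halves of a cluster can never both elect a master at the same time.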

Differences in technology

Each of these systems is suited to a particular kind of workload. Elasticsearch is a more modern tool than Solr, and we found it easier to work with. In our performance tests, it proved to be more efficient, with better indexing times and richer query features.

The differences that can greatly impact businesses, however, are those concerning the software’s orientation and architecture. In that sense, Solr specializes in linguistic operations and text interpretation, while Elasticsearch specializes in data analysis, with great features for grouping and querying.
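As an example of the grouping features, the Python sketch below builds the request body of an Elasticsearch `terms` aggregation, which counts documents per distinct value of a field. The `category` field is an assumption and would need to be mapped as a keyword field for this to work; the `terms_aggregation` helper is hypothetical.

```python
import json


def terms_aggregation(field, size=10):
    """Request body that groups documents by `field` and counts each group."""
    return {
        "size": 0,  # return only the aggregation buckets, not the matching documents
        "aggs": {
            f"by_{field}": {
                "terms": {"field": field, "size": size}
            }
        },
    }


body = terms_aggregation("category")
print(json.dumps(body, indent=2))
```

Sent to an index's `_search` endpoint, a body like this returns one bucket per category with its document count, which is the kind of calculation that analytics and faceted navigation rely on.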

A question that often comes up when comparing software is how each option performs. Our tests revealed that Elasticsearch not only has a better infrastructure for a large distributed application but also performs better at indexing, making it the better option for scalability in the cloud and in distributed environments.

That does not mean Solr does not have its advantages. As an older system, Solr has very detailed and complete documentation, lots of available open-source resources and an active community of users, making its implementation safe and versatile.

What we have learned

We used one specific indexing system for some eleven years, and during this time we had to change and tweak its configuration and our own systems architecture. Sometimes we did so to create new features and improve the user experience in our platform, other times to fix the bugs and crashes caused by the massive data overloads that are bound to happen in a company experiencing rapid growth. Then, after a crisis that appeared impossible to solve, we discovered that the best way forward was not to tweak our then-current system, but to transition to a new and better one.

So, in this journey to develop the VTEX data indexing system, what have we learned?

We have learned that not all problems are solvable by the same method. That sometimes, improving a tool is the best way to work with it, but other times, a better tool is needed for bigger tasks. We learned the true impact of innovation. And, most of all, we learned that, with data systems, there is no limit to what we can do, as long as we apply intelligence and build correctly.

If you liked this story and want to be a part of its next chapter, come join the VTEX development team.

We believe in the power of talent, and we are hiring.

DISCLAIMER: It is important to note that historical financial information or operational KPIs may not be comparable with publicly-filed information at SEC, since VTEX did not report its financials in accordance with International Financial Reporting Standards (IFRS) prior to 2019 and certain KPI definitions may differ from publicly-filed information. You are cautioned not to place undue reliance on figures published before July 21st, 2021 as they may not be comparable to the metrics disclosed from the IPO onwards.

Written by Rodrigo Abinader
