Elasticsearch is a powerful distributed search engine that has evolved over the years into a more general NoSQL storage and analysis tool. While Elasticsearch requires some oversight to run efficiently at scale, it also has features that simplify ELK operations. One area that deserves particular attention is Elasticsearch indexes and index management.
The way data is organized across the nodes of an Elasticsearch cluster has a huge impact on performance and reliability, and index management is one of the areas users ask about most.
Suboptimal or incorrect configurations can degrade both. While traditional Elasticsearch index management best practices still apply, Elasticsearch has added several features that further optimize and automate index management. In this article, we'll explore different ways to get the most out of your indexes, combining traditional advice with a look at newly released features.
The concepts explained in this article also apply to OpenSearch, a fork of Elasticsearch released and maintained by AWS after Elastic changed the licensing of the ELK stack, which provides the same functionality.
If you are interested in the ELK Stack/OpenSearch but don't want to think about Elasticsearch index management at all, check out Logz.io. It provides a fully managed OpenSearch cluster and OpenSearch Dashboards instance, allowing you to focus on things other than managing Elasticsearch.
With all that said, this is an article about Elasticsearch, so let's get started.
Data in Elasticsearch is stored in one or more indexes. Since those of us using Elasticsearch typically deal with large amounts of data, the data in an index is split into shards to make storage more manageable. An index can be too large to fit on a single disk, but its shards are smaller and can be allocated across different nodes as needed.
Another advantage of proper sharding is that searches can run in parallel across shards, speeding up query processing. The number of primary shards in an index is fixed when the index is created and cannot be changed later.
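As a minimal sketch of how this looks in practice, the number of primary shards is set in the index settings at creation time (the index name and shard count below are illustrative, not a recommendation):

```json
PUT /logs-2024-01
{
  "settings": {
    "number_of_shards": 3
  }
}
```

Once the index is created, `number_of_shards` cannot be changed without reindexing or using the Shrink/Split APIs.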
Sharding an index is useful, but even then there is still only one copy of each document in the index, meaning there is no protection against data loss.
To solve this problem, we can use replication. Each shard can have multiple replicas, which are configured at index creation and can be changed later. The primary shard is the main shard that handles document indexing as well as query processing.
Replica shards process queries but do not index documents directly. They are always assigned to a different node than their primary shard, and if the primary shard fails, a replica shard can be promoted to take its place.
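Unlike the shard count, the replica count can be adjusted on a live index. A sketch of updating it via the settings API (index name illustrative):

```json
PUT /logs-2024-01/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}
```

With two replicas, each primary shard has two copies on other nodes, so the cluster can survive the loss of up to two nodes holding copies of the same shard.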
While more replicas provide a higher level of availability in the event of a failure, it is also important not to have too many. Each shard has state that must be kept in memory for fast access. The more shards you use, the more this overhead accumulates, affecting resource usage and performance.
Optimization for Time Series Data
Using Elasticsearch to store and analyze time-series data, such as application logs or Internet of Things (IoT) events, requires managing large volumes of data over long periods of time.
Time series data is usually spread across many indexes. A simple approach is to use a different index for each arbitrary period of time, such as one index per day. Another way is to use the Rollover API, which can automatically create a new index when the main index becomes too old, too large, or contains too many documents.
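A sketch of a rollover request against a write alias (the alias name and thresholds here are illustrative): if any condition is met, Elasticsearch creates a new index and points the alias at it.

```json
POST /logs-write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_size": "50gb",
    "max_docs": 100000000
  }
}
```

Applications keep writing to the `logs-write` alias and never need to know the name of the current backing index.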
As indexes age and their data becomes less relevant, there are steps you can take to make them use fewer resources so that the most active indexes have more resources available. One of them is the Shrink API, which shrinks an index down to a single primary shard.
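A sketch of the shrink workflow (index and node names illustrative): before shrinking, the index must be made read-only and all of its shards relocated to one node.

```json
PUT /logs-2024-01/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "shrink-node-1",
    "index.blocks.write": true
  }
}

POST /logs-2024-01/_shrink/logs-2024-01-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}
```

The shrunken copy is a new index; the original can be deleted once you've verified the result.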
Having a larger number of shards is generally a good thing for active indexes, but it can become a burden for older indexes that only receive occasional requests. This, of course, depends a lot on the structure of the data.
For very old indexes that are rarely accessed, it makes sense to free the memory they use entirely. The Freeze API, available in Elasticsearch 6.6 and above, allows you to do just that. When an index is frozen, it becomes read-only and its resources are no longer kept active in memory.
The tradeoff is that searching frozen indexes is slower, since their resources now have to be allocated on demand and then released again.
To prevent the random query slowdowns that could result, frozen indexes are excluded from searches by default; the query parameter ignore_throttled=false must be used to explicitly indicate that frozen indexes should be included when processing a search query.
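A sketch of freezing an old index and then explicitly including frozen indexes in a search (index names illustrative):

```json
POST /logs-2023-01/_freeze

GET /logs-*/_search?ignore_throttled=false
{
  "query": {
    "match": { "message": "error" }
  }
}
```

Without `ignore_throttled=false`, the frozen `logs-2023-01` index would simply be skipped by the wildcard search.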
Index Lifecycle Management
The previous two sections showed how, over the long term, index management moves through a series of phases, from actively accepting new data for indexing to no longer receiving new data at all.
The Index Lifecycle Management (ILM) feature, introduced in Elasticsearch 6.7, brings these all together and lets you automate transitions that, in previous versions of the ELK Stack, had to be performed manually or with external processes.
ILM, available under the Elastic Basic license rather than the Apache 2.0 license, allows users to define policies that determine when these transitions occur and which actions apply during each stage.
With ILM, we can configure the hot-warm-cold architecture, where stages and actions are optional and can be configured as needed:
- Hot: the index receives live data for indexing and often serves queries. Typical actions at this stage include:
- Set recovery priority to high.
- Specify a rollover policy to create a new index when the current one becomes too large, too old, or contains too many documents.
- Warm: the index no longer receives new data for indexing, but queries are still processed. Typical actions at this stage include:
- Set recovery priority to medium.
- Optimize indexes by shrinking them, forcing them to be merged, or making them read-only.
- Allocate the index to less performant hardware.
- Cold: the index is only rarely queried. Typical actions at this stage include:
- Set recovery priority to low.
- Freeze the index.
- Allocate the index to less performant hardware.
- Delete: the index is older than the required retention period and is removed.
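The phases above can be sketched as a single ILM policy (policy name, thresholds, and ages are illustrative, not recommendations):

```json
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {},
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

The policy is then attached to an index (or index template) via the `index.lifecycle.name` setting, and Elasticsearch moves the index through the phases automatically.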
ILM policies can be defined using the Elasticsearch REST API or even directly in Kibana.
Organize data in an Elasticsearch index
When managing Elasticsearch indexes, your primary concerns are stability and performance. However, the structure of the data that actually goes into those indexes is also very important to the usability of the overall system.
This structure affects the accuracy and flexibility of search queries on data that may come from multiple sources, and therefore affects how the data is analyzed and visualized.
In fact, the recommendation to explicitly create mappings for indexes has been around for a long time. While Elasticsearch can guess data types based on the input it receives, its guesses are based on a small sample of the dataset and may not be accurate. Explicitly creating mappings prevents data type conflicts within indexes.
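A sketch of creating an index with an explicit mapping (index and field names illustrative), so that field types are fixed up front instead of inferred from the first documents:

```json
PUT /app-logs
{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "text" },
      "status":     { "type": "keyword" },
      "bytes":      { "type": "long" }
    }
  }
}
```

Declaring `status` as `keyword`, for example, ensures it can be used reliably in aggregations and exact-match filters rather than being analyzed as free text.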
Even with mappings, extracting insights from large amounts of data stored in an Elasticsearch cluster can be a tedious task.
Data from different sources that have a similar structure (e.g., IP addresses from IIS, NGINX, and application logs) can end up indexed into fields with completely different names or data types.
The Elastic Common Schema (ECS), released with Elasticsearch 7.x, is a new development in this area. By defining a standard that unifies field names and data types, it suddenly becomes much easier to search and visualize data from a variety of sources. This enables users to get a single, unified view in Kibana of the disparate systems they maintain.
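As a sketch, a mapping that follows ECS field naming so that web logs from different sources land in the same fields (index name illustrative; the field names shown are standard ECS fields):

```json
PUT /web-logs
{
  "mappings": {
    "properties": {
      "@timestamp":                { "type": "date" },
      "source.ip":                 { "type": "ip" },
      "url.path":                  { "type": "keyword" },
      "http.response.status_code": { "type": "long" }
    }
  }
}
```

Whether a client IP originally arrived as `clientip` (IIS) or `remote_addr` (NGINX), normalizing it to `source.ip` at ingest time means one Kibana query covers all sources.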
Properly configuring index sharding and replication has a direct impact on the stability and performance of your Elasticsearch cluster.
The features described above are useful tools to help you manage your Elasticsearch indexes. However, this task is still one of the most demanding parts of working with Elasticsearch, requiring knowledge of the Elasticsearch data model and specific indexed datasets.
As mentioned earlier, if you'd rather focus on other tasks,Logz.io Offers Fully Managed OpenSearchexperience. This lets you use the most popular open source log management stacks without having to manage indexes or any other data pipeline components yourself.
For time series data, the Rollover and Shrink APIs help handle basic index turnover and optimization. The newer frozen indexes feature lets you deal with another class of aging indexes.
The recently added ILM feature allows complete automation of index lifecycle transitions. As indexes age, they can be modified and moved to use fewer resources, leaving more for the most active indexes.
Finally, creating explicit mappings for indexed data and mapping fields to the Elastic Common Schema can help you get the most out of the data in your Elasticsearch cluster.