Log management may never go away, but the way it is done may change. While indexes have long been the backbone of most log management solutions, Kresten Krab Thorup, CTO and co-founder of Humio, explains that’s not the only way to manage log data.
According to Krab Thorup, indexing has been an accepted approach for collecting and analyzing log data, but it’s time to move on. Humio is not the only vendor offering log management based on an index-free architecture; others, such as Grafana Labs’ Loki project and Scalyr, offer similar solutions.
In a talk given today at QCon in New York City, Thorup described what index-free logging is and how it is beneficial.
When using an index-based system, organizations pay upfront for disk space and CPU usage. A downside of this is that “if the time and space to build the index grows out of proportion with the real data that you are actually interested in, then you have lost,” said Thorup.
“Database indexes provide a trade off suitable for systems with low ingest rate and high query frequency. The core activity with logs is to write a lot and only search parts of them in specific time ranges, when an incident occurs. Indexes are good for many things but not for logging.”
The benefits of index-free logging include lower ingest latency, near-real-time alerts and dashboards, lower disk space requirements, and much lower hardware requirements, Thorup explained. “The key interesting thing is just to be able to do 10 times as much with the same hardware,” he said.
According to Thorup, there are two major issues with using indexes. The first is the “high-cardinality” problem. High cardinality occurs when data contains a large set of distinct keys and values. Users often create keys for log data, user-defined events, and traces, but indexing all of that property data tends to produce indexes that are larger than the data being put into the system. As a result, the index takes a long time to compute, and the data may be irrelevant by the time it is complete.
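To make the cardinality point concrete, here is a minimal sketch that counts how many distinct entries a per-value index would need for each field. The event shape, the field names (`level`, `request_id`), and the helper `index_entries_per_field` are all hypothetical and chosen only for illustration; they are not taken from the talk or from any product.

```python
from collections import Counter

# Hypothetical events: one low-cardinality field and one high-cardinality field.
events = [{"level": "info", "request_id": f"req-{i}"} for i in range(1000)]

def index_entries_per_field(events):
    """Count the distinct (field, value) entries a per-value index would hold."""
    distinct = {(key, value) for event in events for key, value in event.items()}
    return Counter(key for key, _ in distinct)

entries = index_entries_per_field(events)
print(entries["level"], entries["request_id"])
```

In this toy data set the `level` field needs a single index entry, while `request_id` needs one entry per event, so that part of the index grows in step with the data itself.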
Another potential issue is a lack of coverage. Thorup explained that when a user queries log and metrics data, the index returns matches from all time, not just the time frame the user is interested in. This is because the index maps a domain of keys that corresponds to the data set, but doesn’t carry other information over. According to Thorup, this issue can be mitigated in SQL by creating a covering index, but that’s not possible with full-text search, such as with Apache Lucene. “Without a covering index that includes the time index of the log message, the only way to find the matches in the time frame of the query is by doing extra work,” said Thorup.
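The coverage problem can be sketched with a toy term index that stores only event positions and no timestamps. This is an illustrative assumption about how such an index behaves, not a model of Lucene or of any specific product; the sample events and the functions `build_term_index` and `search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical log events: (timestamp in seconds, message).
EVENTS = [
    (100, "login failed for alice"),
    (205, "disk full on node7"),
    (310, "login failed for bob"),
    (420, "login ok for alice"),
]

def build_term_index(events):
    """Map each term to the positions of the events that contain it."""
    index = defaultdict(list)
    for pos, (_, message) in enumerate(events):
        for term in set(message.split()):
            index[term].append(pos)
    return index

def search(index, events, term, start, end):
    """Return (candidates_examined, matches with start <= timestamp < end)."""
    candidates = index.get(term, [])  # the index yields matches from all time
    # Extra work: fetch every candidate event just to check its timestamp.
    hits = [events[p] for p in candidates if start <= events[p][0] < end]
    return len(candidates), hits

index = build_term_index(EVENTS)
examined, hits = search(index, EVENTS, "failed", 300, 400)
print(examined, hits)
```

Because the index holds no time information, every event containing the term has to be fetched and checked, even though only one of them falls inside the queried time frame.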
Thorup believes that index-free logging solves these issues. With index-free logging, data is stored in buckets, which are then labeled with information that enables the query engine to decide whether data could be in a given bucket or not.
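A minimal sketch of the bucket idea follows, assuming each bucket is labeled with its time range and the set of sources it has seen. These labels, the `Bucket` class, and the `query` function are hypothetical choices for illustration; the talk does not specify which labels Humio or other systems actually use.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

@dataclass
class Bucket:
    """A batch of raw events plus coarse labels (assumed: time range, sources)."""
    events: List[Tuple[int, str, str]] = field(default_factory=list)
    min_ts: Optional[int] = None
    max_ts: Optional[int] = None
    sources: Set[str] = field(default_factory=set)

    def append(self, ts: int, source: str, message: str) -> None:
        self.events.append((ts, source, message))
        self.min_ts = ts if self.min_ts is None else min(self.min_ts, ts)
        self.max_ts = ts if self.max_ts is None else max(self.max_ts, ts)
        self.sources.add(source)

    def may_contain(self, start: int, end: int, source: str) -> bool:
        """Labels can only rule a bucket out; they never prove a match."""
        if self.min_ts is None:
            return False
        return self.min_ts < end and self.max_ts >= start and source in self.sources

def query(buckets, start, end, source, needle):
    """Skip buckets by label, then brute-force scan the remaining ones."""
    scanned, hits = 0, []
    for bucket in buckets:
        if not bucket.may_contain(start, end, source):
            continue
        scanned += 1
        for ts, src, msg in bucket.events:
            if start <= ts < end and src == source and needle in msg:
                hits.append((ts, msg))
    return scanned, hits

older, newer = Bucket(), Bucket()
older.append(100, "web", "login failed for alice")
older.append(205, "db", "disk full on node7")
newer.append(310, "web", "login failed for bob")
newer.append(420, "web", "login ok for alice")

scanned, hits = query([older, newer], 300, 400, "web", "failed")
print(scanned, hits)
```

In this sketch the older bucket is skipped on its time label alone, and only the newer bucket is scanned. No individual key or value is indexed; the cost of ingest is limited to updating a few labels per bucket.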
He recognizes that this technique could also be viewed as a sort of index, but says the term “index-free logging” refers to the fact that individual keys and values are not indexed.
“Index-free Logging provides for a different set of tradeoffs uniquely better suited for logs, events, and traces than solutions based on exact term indexing,” he said. In short, index-free logging gives users more time to understand and debug their solution instead of troubleshooting the logging platform itself, Thorup stated.