Last Updated on October 7, 2020
Many people have heard of summary indexing, yet have not made use of it. One reason is that you may not realize you need a summary index until you need one. Summary indexes, as the name implies, allow for the storage of summarized data over time. A good use case for a summary index is any query that requires summarizing or trending a large amount of data over a long period of time.
For example, with the onset of increased teleworking, my customer had an increased need to monitor the activity and resources used around teleworking (VPN activity, login/logoff, concurrent users, Zoom activity, etc.). I was required to track hourly metrics over a longer period of time, such as hourly metrics over a period of weeks. Even with good SPL structure and tuning, the query failed to return results in a reasonable amount of time. This type of query is a great candidate for summary indexing.
There are two parts to this: 1) define a query that populates a summary index using smaller chunks of data, and 2) define a query that runs against the summary index to calculate your final results.
The summary indexing solution allows us to take these bite-size calculations of our data and store the results in a separate index. The smaller chunks not only give us a head start on our calculations, but also leave a smaller amount of data to query through.
What is a summary index?
Summary indexes are no different from other indexes; however, an advantage to using a separate index is that it allows you to modify retention times for the data, segmenting the summarized data from the source data. Consider that the source of your data is housed within an index with 90 days of retention and utilizes a large amount of disk space; once summarized to a separate index, you can hold on to the key pieces of that data for a longer period of time while also saving on disk space. By default, Splunk provides an index named “summary”; however, depending on an organization’s needs or structure, you may create additional indexes based on security requirements, retention, etc.
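As a sketch of that retention split, a dedicated summary index could be defined in indexes.conf. The index name, paths, and retention value below are illustrative, not from this article's example:

```
# indexes.conf -- illustrative stanza for a dedicated summary index
[summary_telework]
homePath   = $SPLUNK_DB/summary_telework/db
coldPath   = $SPLUNK_DB/summary_telework/colddb
thawedPath = $SPLUNK_DB/summary_telework/thaweddb
# Keep summarized data for ~2 years (in seconds), vs. 90 days on the source index
frozenTimePeriodInSecs = 63072000
```

Because the summary index holds only the aggregated results, a much longer frozenTimePeriodInSecs costs comparatively little disk space.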
How is the data stored, and what are the licensing implications?
All events in a summary index use the sourcetype stash by default, and as a result summary indexing does not count against your license, no matter how many summary indexes your environment has allocated. However, licensing is impacted if you make use of the “collect” command and change the sourcetype to something other than stash.
Does this cost me in licensing?
In short: events written with the default stash sourcetype are not metered, while events written with any other sourcetype count against your license just like ordinary indexed data.
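For instance, the collect command can write results to a summary index manually. The query below (time range and search are illustrative) leaves the sourcetype at its stash default, so the resulting events are not metered; adding a sourcetype= option with any value other than stash would make the same events count against your license:

```
index=_internal earliest=-1h@h latest=@h
| stats count by sourcetype
| collect index=summary
```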
The basic steps to make use of summary indexing
The process to implement summary indexing is fairly straightforward:
- Identify the index you would like to utilize for summary indexing
- Identify your report requirements (what data to report on and the frequency)
- Create a scheduled savedsearch
- Develop and test the query that you will use to populate the summary index
- Make use of Splunk’s summary indexing commands.
- Schedule the query
- Enable summary indexing
- Develop and test a query used to view the summary index results
A Basic Example
For the purposes of this article, we will make use of a summary index in order to summarize data found within the _internal index. We will implement summary indexing via the web interface.
Step 1: Identify the index that should hold the summarized data
For this example, we will make use of the default ‘summary’ index.
Step 2: Identify the report requirements
Before populating the summary index, you’ll need to know what the end-goal of this specific data is; identify what data is to be reported on, and for what time slices.
I’d like to report on the number of events that are indexed per sourcetype within the _internal index, with the ability to view the data on an hourly basis per day.
Step 3: Develop and test the index-populating query
An easy way to create a query to populate the summary-index is to write a query similar to what you want to report on in the end; take into account all fields that you may want to include, summarize the data using the stats command, and test the output.
As mentioned above, I’d like to count the number of events indexed per hour within the _internal index by sourcetype:
index=_internal
|bin _time span=10m
|stats count by _time, sourcetype
As a result, we get a list of counts by sourcetype divided into 10-minute bins. Even though I will only run this report once per hour, I have decided to ‘bin’ the data into 10-minute chunks in case my requirements change in the future. Remember, you can always ‘bin’ the _time field in future queries to roll up data by larger time spans.
Note: In order to report on data using specific time slices, you must include _time as part of your stats command. If you are working with data that does not include _time, you may create a _time field by appending the following to the end of your query: “|eval _time=now()”
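As an illustration of that flexibility, the 10-minute counts can be re-binned and summed into an hourly view. This sketch runs against the raw _internal data from Step 3:

```
index=_internal
| bin _time span=10m
| stats count by _time, sourcetype
| bin _time span=1h
| stats sum(count) as count by _time, sourcetype
```

The same roll-up pattern applies later against the summarized data, which is why capturing the finer 10-minute granularity now costs nothing in flexibility.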
Step 4: Edit the query to utilize a summary-indexing command (an si-* command)
Now that we’ve tested the query, we need to format it to write to a summary index appropriately. To do so, we make use of one of the si-* commands. Splunk provides a number of commands specifically designed for use with summary indexing (sometimes referenced as the “si-*” commands). Each is simply an ‘si’ version of another SPL reporting command: sichart (chart), sitimechart (timechart), sistats (stats), sitop (top), and sirare (rare).
In our example, we replace ‘stats’ with ‘sistats’ as shown below:
index=_internal
|bin _time span=10m
|sistats count by _time, sourcetype
Step 5: Save, and Schedule the Query
Save your query as a saved search using the typical Splunk Web UI navigation:
Settings | Searches, reports, and alerts | New Report
From the reports listing, schedule the report to run once every hour
Actions| Edit | Edit Schedule
Step 6: Enable Summary Indexing
Access the Summary Indexing Dialogue:
Actions | Edit Summary Indexing
Choose to enable summary indexing
In the dialogue box, we will select to store the data in the “summary” index, as we determined in Step 1 above (the only indexes shown in this dialogue box are those that your userid has write permissions to).
I have chosen to utilize a new field, “report”, which helps to distinguish the data within the summary index from other queries that are developed. The name “report” is arbitrary – you may create your own fields as you choose.
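For reference, Steps 5 and 6 together translate to settings like the following in savedsearches.conf. This is a sketch based on this article's example; the stanza name, schedule, and field value assume the names used above:

```
# savedsearches.conf -- illustrative equivalent of Steps 5 and 6
[sourcetype_eventcount]
search = index=_internal | bin _time span=10m | sistats count by _time, sourcetype
enableSched = 1
cron_schedule = 0 * * * *
# Summary indexing settings from the Edit Summary Indexing dialogue
action.summary_index = 1
action.summary_index._name = summary
# Any action.summary_index.<field> setting adds that field to each summary event
action.summary_index.report = sourcetype_eventcount
```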
Step 7: Develop and test a query used to view the summary index results.
This is the final step: run a query against the summary index to calculate the final results. Because the summary was populated with sistats, the reporting query uses the matching non-si command (stats), and Splunk reconstructs the counts from the prestats fields it stored. Assuming I’d like to count today’s number of events per hour, by sourcetype:
earliest=-0d@d index=summary report=sourcetype_eventcount
|bin _time span=1h
|stats count by _time, orig_sourcetype
A few interesting items at this point:
- report: this is the custom field that was added in Step 6 to identify the data; you may use it within the query to pull back the appropriate dataset.
- orig_sourcetype: note that we are required to query based on a new field, “orig_sourcetype”, in order to avoid conflicting with the sourcetype that has been assigned to this data (stash).
- source: the source of the data is set to the name of the savedsearch that was executed to populate the index.
- search_name: also set to the name of the savedsearch that was executed to populate the index.
- psrsvd* fields: these are special “prestats reserved” fields which Splunk has added when you have used any of the si* commands. These are fields that are not usually directly referenced, but are used by Splunk when using reporting commands such as chart, timechart, and stats with this data.
- When setting up the query, ensure that you include a _time field in order to be able to make use of “stats latest()” functionality. Try to include all fields and granularity that you may need in the future. Best practice, per Splunk, is to capture the lowest granularity needed to achieve your goal while still performing appropriately.
- Not covered within this article is the manual population of summary indexes using the addinfo and collect commands: https://docs.splunk.com/Documentation/Splunk/8.0.4/Knowledge/Configuresummaryindexes#Manually_configure_a_report_to_populate_a_summary_index
- Consider backfilling data from the past. This can be done via the Splunk CLI and the use of the python script fill_summary_index.py https://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Managesummaryindexgapsandoverlaps#Use_the_backfill_script_to_add_other_data_or_fill_summary_index_gaps
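A sketch of invoking the backfill script from the CLI, following the Splunk documentation linked above; the saved-search name, time range, and credentials below are placeholders:

```
splunk cmd python fill_summary_index.py -app search -name "sourcetype_eventcount" -et -30d@d -lt now -j 8 -dedup true -auth admin:changeme
```

The -dedup option skips time ranges that already have summary events, so re-running the script does not double-count data.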
Aditum’s Splunk Professional Services consultants can assist your team with best practices to optimize your Splunk deployment and get more from Splunk.
Our certified Splunk Architects and Splunk Consultants manage successful Splunk deployments, environment upgrades and scaling, dashboard, search, and report creation, and Splunk Health Checks. Aditum also has a team of accomplished Splunk Developers that focus on building Splunk apps and technical add-ons.
Contact us directly to learn more.