Browsing Splunk buckets on an S3 bucket

As a Splunk admin managing storage for a Splunk deployment, it is common to browse the index directories on the indexers, either to gather the total storage usage across the hot/warm and cold tiers or to perform backup and restore activities.  In a traditional Splunk Enterprise deployment with direct-attached storage, or enterprise storage attached to the indexers, the indexed files can be browsed under the index directories.  Every hot/warm or cold bucket of an index occupies a sub-directory under the index directory.

For example, the hot/warm buckets for the audit index can be found under the db directory below, and the cold buckets under the colddb directory.

# ls -l
total 28
drwx------   2 splunk splunk     6 Sep  2 15:08 colddb
drwx------   2 splunk splunk    36 Sep  2 15:08 datamodel_summary
drwx------ 207 splunk splunk 28672 Sep  8 13:41 db
drwx------   2 splunk splunk    36 Sep  2 15:08 summary
drwx------   2 splunk splunk     6 Apr 11 09:13 thaweddb

However, once you switch to Splunk SmartStore, the warm buckets of the indexed data are placed on an S3-compatible remote object store like Pure Storage FlashBlade.  While you can still browse the hot and cached data under the index directories hosted on direct-attached storage, you cannot get to all the indexed data, because the master copy of the data resides on the remote object store, which is not mounted on the indexer for browsing.

If so, how can Splunk admins navigate or browse the indexed data residing on the remote object store, for example to calculate the total storage used?

There are two ways to accomplish this:

  1. Splunk's rfs command
  2. S3 command-line tools

Splunk RFS command

Splunk offers the remote file system (rfs) command to perform object management operations like list, get, or put on the remote object store.  The command is invoked through the administrative Splunk CLI.

splunk cmd splunkd rfs --help

To list all objects of a given volume, you can run the command

splunk cmd splunkd rfs -- ls --starts-with volume:<remote-store-name>

Notice that there is no reference to the remote object store, access key/secret key, or S3 bucket name in the command.  The rfs command extracts these details from the indexes.conf file based on the remote store volume specified in the command.
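For reference, a minimal remote volume definition in indexes.conf might look like the following sketch (the volume name, bucket name, and endpoint below are placeholders, not taken from an actual deployment):

```ini
# indexes.conf (fragment) -- hypothetical volume named remote_store
[volume:remote_store]
storageType = remote
path = s3://splunk-data
remote.s3.access_key = XXXXX
remote.s3.secret_key = XXXXX
remote.s3.endpoint = https://s3.example.com
```

With a stanza like this in place, `rfs -- ls --starts-with volume:remote_store` has everything it needs to reach the object store.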

If you have not configured indexes.conf with the remote store details, you can still run the rfs command by manually providing all the necessary details, like the access key, secret key, endpoint, and bucket name, on the command line as below.

splunk cmd splunkd rfs -- --access-key XXXXX --secret-key XXXXX --endpoint <endpoint-url> ls --starts-with s3://test-bucket/

Note: The rfs command can be run even if the Splunk instance on that server is not running.

To list all objects of a given index, you can run the following command. 

splunk cmd splunkd rfs -- ls --starts-with index:<index-name>

For example, the following command lists the object details from the volume remote_store with the prefix _metrics.

# splunk cmd splunkd rfs -- ls --starts-with volume:remote_store/_metrics
. . .
<output truncated>

As you can see from the above output, all the standard files in a Splunk warm bucket directory (the .data files, the tsidx files, the rawdata) are indeed stored in the remote object store, but in a specific folder layout.  See this community post if you want to know how to map a SmartStore bucket back to the local Splunk bucket.

Total Space Usage

Because the rfs command lists the objects along with their sizes, you can pipe its output through awk to get the total size of all objects on the remote object store, as follows.

splunk cmd splunkd rfs -- ls --starts-with volume:remote_store |grep -v 'size,name'| awk -F, '{ sum+= $1; ct+=1} END { print "Object count: "ct, "Total size: "sum/1024/1024/1024" GB"}'

Object count: 153525 Total size: 6135.83 GB
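The summing stage of that pipeline can be sanity-checked locally by feeding it a couple of mocked rfs ls lines in place of the real listing (the sizes and object names below are invented for illustration):

```shell
# Mocked "size,name" CSV lines in the shape the rfs ls output takes (values invented)
printf 'size,name\n1073741824,_internal/db/foo\n2147483648,apache/db/bar\n' |
grep -v 'size,name' |
awk -F, '{ sum += $1; ct += 1 } END { print "Object count: "ct, "Total size: "sum/1024/1024/1024" GB" }'
# prints: Object count: 2 Total size: 3 GB
```

In real use, the printf is replaced by the `splunk cmd splunkd rfs -- ls --starts-with volume:remote_store` command shown above.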

If you want to get the space usage at the index level, you can extend the awk processing as shown below.

splunk cmd splunkd rfs -- ls --starts-with volume:remote_store |grep -v 'size,name' | awk -F, ' { print $1"/"$2 } '  |awk -F/ ' { print $2 " " $1} '|awk '{ sum[$1]+= $2; ct[$1]+=1} END { for (i in sum) { print i", " sum[i]/1024/1024/1024 " GB, "ct[i]} }'

_internal, 50.8643 GB, 8287
_introspection, 20.3804 GB, 730
_metrics, 13.612 GB, 3117
apache, 199.419 GB, 459
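As a side note, the three chained awk invocations can be collapsed into a single pass.  This is a sketch, assuming (as the pipeline above does) that the index name is the first path component of the object name; the mocked input sizes are invented:

```shell
# One-pass variant: split the object name on "/" and key the sums by the index
printf 'size,name\n1073741824,_internal/db/foo\n1073741824,_internal/db/baz\n2147483648,apache/db/bar\n' |
grep -v 'size,name' |
awk -F, '{ split($2, p, "/"); sum[p[1]] += $1; ct[p[1]]++ }
         END { for (i in sum) print i", " sum[i]/1024/1024/1024 " GB, " ct[i] }'
# prints one line per index, e.g. "_internal, 2 GB, 2" (order not guaranteed)
```

Again, replace the printf with the real rfs ls command to run this against the object store.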

Note: When there are millions of Splunk objects on the S3 bucket, the above commands can take some time to run, and the duration is certainly affected by the type of S3 object store.  Pure Storage FlashBlade, being an all-flash file and object store, tends to perform better than other object stores.

Setting remote.s3.list_objects_version to "v2" can make a big difference when there are numerous objects in the object store.  For Splunk SmartStore on FlashBlade, we recommend setting it to v2.  For more best practices for Splunk SmartStore on FlashBlade, please see this article.
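In indexes.conf this is a one-line setting on the remote volume stanza (the volume name here is illustrative):

```ini
# indexes.conf (fragment) -- enable ListObjectsV2 for the remote volume
[volume:remote_store]
remote.s3.list_objects_version = v2
```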

S3 Command line tools

There are various S3 command-line tools available, such as s3cmd, aws-cli, s4cmd, goofys, and s5cmd, to name a few.  Among these tools, we have found s5cmd to be the most performant and robust for browsing and managing objects on an S3-compatible object store like Pure Storage FlashBlade.

Please see this blog post by my ex-colleague Joshua Robinson, who has covered the performance of s5cmd extensively.

Please see the official s5cmd page for more details on how to install the tool.  The easiest option is to download the pre-built binary from the releases page that is relevant to your environment. 

s5cmd uses the official AWS SDK to access S3, which requires credentials.  They can be provided through:

  • Environment variables
  • A credentials file (~/.aws/credentials by default, or one specified with the --credentials-file option)
  • The --profile command-line option, to use a named profile
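For the environment-variable route, the standard AWS SDK variables apply; the values below are placeholders to substitute with your own account details:

```shell
# Placeholder credentials -- substitute your FlashBlade/S3 account values
export AWS_ACCESS_KEY_ID="XXXXX"
export AWS_SECRET_ACCESS_KEY="XXXXX"
# Optional: s5cmd also honors this endpoint variable for non-AWS object stores
export S3_ENDPOINT_URL="https://s3.example.com"
```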

To list the objects for a specific index on a Splunk S3 bucket, you can run the following command.

s5cmd ls s3://splunk-data/_audit/* 
2022/09/01 23:06:27                 7 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/.rawSize
2022/09/01 23:06:24                 6 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/.sizeManifest4.1
2022/09/01 23:04:47             72224 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/1655851968-1655838884-5862769335757404721.tsidx
2022/09/01 23:06:02               266 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/
2022/09/01 23:04:57               105 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/
2022/09/01 23:05:50               101 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/
2022/09/01 23:05:20              2936 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/bloomfilter
2022/09/01 23:06:29                67 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/bucket_info.csv
2022/09/01 23:06:20             16746 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/rawdata/journal.gz
2022/09/01 23:05:13                93 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/rawdata/slicemin.dat
2022/09/01 23:05:07               610 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/guidSplunk-4B72571E-D509-4020-8EFA-ED88B7216978/rawdata/slicesv2.dat
2022/09/01 23:05:04              1536 db/00/60/47~4B72571E-D509-4020-8EFA-ED88B7216978/receipt.json
. . .
<output truncated>

In the above command, if the S3_ENDPOINT_URL environment variable is not set, specify the endpoint on the command line as follows.

s5cmd --endpoint-url <endpoint-url> ls s3://splunk-data/_audit/*

Total Space Usage

To get the total usage of the Splunk data on an S3 bucket, run the following command.

s5cmd du --humanize s3://splunk-data/*
6.0T bytes in 154581 objects: s3://splunk-data/*

As with the rfs command, to get the total space usage and the number of objects per index from the Splunk S3 bucket, you can run the following command, which uses s5cmd along with awk.

s5cmd ls s3://splunk-data/* |awk '{ print $3"/"$4}' |awk -F/ '{arr[$2]+= $1; ct[$2]+=1} END {for (i in arr) { print i, arr[i], ct[i]}}'
_introspection 21883276762 730
_internal 59760464974 8787
_audit 4158332603 11663
apache 269875080796 5764

The s5cmd tool uses ListObjects v2 by default.  If you want s5cmd to use v1, pass the --use-list-objects-v1 option when invoking it.

The s5cmd tool is actively developed; as of this writing, a sync feature has been added that synchronizes directories, files, and objects between S3 buckets and prefixes.

Hopefully, this blog post gave you a good idea of what is possible with different command-line tools for browsing Splunk buckets in a SmartStore environment.
