Highly scalable event logging on AWS


Most applications generate configuration events and access events, and administrators need visibility into them. The Barracuda Email Security Service provides transparency and visibility into many of these events to help administrators fine-tune and understand the system: for example, knowing who logged into the account and when, or who added, changed, or deleted the configuration of a particular policy.

Building this distributed, searchable system raises many questions, such as:

  • How should you write these logs from all the applications, services, and machines to a central location?
  • What should the standard format of the log files be?
  • How long should you retain these logs?
  • How should you correlate events from different applications?
  • How do you provide a simple and quick searching mechanism via a user interface for the administrator?
  • How do you make these logs available via an API?

When you think of a distributed search engine, the first thing that comes to mind is Elasticsearch. It is highly scalable, offers near-real-time search, and is available as a fully managed service on AWS. So the journey started with the idea of storing these event logs in Elasticsearch, with all the different applications pushing logs to Elasticsearch through Kinesis Data Firehose.

Components involved in this architecture

  1. Kinesis Agent – Amazon Kinesis Agent is a standalone Java application that offers an easy way to collect and send data to Kinesis Data Firehose. The agent continuously monitors event log files on EC2 Linux instances and sends them to the configured Kinesis Data Firehose delivery stream. It handles file rotation, checkpointing, and retry upon failure, delivering all of your data in a reliable and timely manner. (A sample agent configuration is sketched after this list.) Note: if the application that needs to write to Kinesis Data Firehose is a Fargate container, you will need a Fluentd container instead. This article, however, focuses on applications running on Amazon EC2 instances.
  2. Kinesis Data Firehose – With the direct PUT method, applications write records straight to the delivery stream, and Kinesis Data Firehose delivers the JSON-formatted data into Elasticsearch. Unlike a Kinesis Data Stream, Firehose does not durably store the data itself. (See the direct-put sketch after this list.)
  3. S3 – An S3 bucket can be used to back up either all records or only the records that fail to be delivered to Elasticsearch. Lifecycle policies can also be set to auto-archive logs. (A lifecycle-policy sketch follows this list.)
  4. Elasticsearch – The Elasticsearch cluster, hosted by Amazon. Kibana access can be enabled to help query and search the logs for debugging purposes.
  5. Curator – AWS recommends using Lambda and Curator to manage the indices and snapshots of the Elasticsearch cluster, and provides details and a sample implementation in its documentation. (A condensed sketch of the idea appears after this list.)
  6. REST API interface – You can create an API as an abstraction over Elasticsearch that integrates well with the user interface. API-driven microservice architectures also work well for security, compliance, and integration with other services. (A minimal search-endpoint sketch appears after this list.)
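
A few short sketches of these pieces follow. First, the Kinesis Agent is driven by a JSON configuration file; the snippet below (kept in Python for consistency with the later examples) writes a minimal hypothetical agent.json. The region, file pattern, and delivery stream name are placeholders, not values from a real deployment.

```python
import json

# Hypothetical /etc/aws-kinesis/agent.json: watch an application's event
# log files (each line already a JSON document) and ship them to a
# Firehose delivery stream. All paths and names are placeholders.
agent_config = {
    "firehose.endpoint": "firehose.us-east-2.amazonaws.com",
    "flows": [
        {
            "filePattern": "/var/log/myapp/events.log*",
            "deliveryStream": "event-logs",
        }
    ],
}

with open("/etc/aws-kinesis/agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)
```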
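
For a service that writes to Firehose directly rather than through the agent, the direct PUT path is a single SDK call. Here is a minimal boto3 sketch; the stream name, region, and event shape are assumptions:

```python
import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-2")

def put_event(event: dict) -> None:
    """Send one JSON-formatted event log record to the delivery stream."""
    firehose.put_record(
        DeliveryStreamName="event-logs",  # hypothetical stream name
        # Firehose concatenates records, so a trailing newline keeps the
        # delivered documents line-delimited.
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

put_event({
    "timestamp": "2021-01-15T10:22:33Z",
    "user": "admin@example.com",
    "event_type": "policy_update",
    "detail": "outbound policy changed",
})
```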
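
Retention and archival on the S3 backup bucket can be expressed declaratively as a lifecycle policy. A sketch, assuming a hypothetical bucket and a 90-day archive / one-year expiry schedule:

```python
import boto3

s3 = boto3.client("s3")

# Archive backed-up log records to Glacier after 90 days and expire them
# after a year. The bucket name and time windows are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="event-log-backup",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```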
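
The index-management approach AWS describes pairs a scheduled Lambda function with the Curator library. Below is a condensed sketch of that idea, assuming daily indices named event-logs-YYYY.MM.DD and a 14-day retention window; against an IAM-protected Amazon ES domain the client would also need signed requests (for example via requests-aws4auth):

```python
import curator
from elasticsearch import Elasticsearch

# Hypothetical endpoint, index prefix, and retention window.
ES_ENDPOINT = "https://search-event-logs.us-east-2.es.amazonaws.com"

def handler(event, context):
    """Scheduled Lambda: delete event-log indices older than 14 days."""
    client = Elasticsearch(ES_ENDPOINT)
    indices = curator.IndexList(client)
    indices.filter_by_regex(kind="prefix", value="event-logs-")
    indices.filter_by_age(
        source="name", direction="older",
        timestring="%Y.%m.%d", unit="days", unit_count=14,
    )
    curator.DeleteIndices(indices).do_action()
```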
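
Finally, the REST API layer can start as little more than a search endpoint that forwards a query to Elasticsearch. A minimal Flask sketch, where the endpoint, index pattern, and field names are assumptions and signed requests are again omitted for brevity:

```python
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
ES_ENDPOINT = "https://search-event-logs.us-east-2.es.amazonaws.com"  # placeholder

@app.route("/events")
def search_events():
    """Return recent event logs, optionally filtered by user."""
    user = request.args.get("user")
    body = {
        # Match a specific user if requested, otherwise return everything.
        "query": {"match": {"user": user}} if user else {"match_all": {}},
        "sort": [{"timestamp": {"order": "desc"}}],
        "size": 50,
    }
    resp = requests.get(f"{ES_ENDPOINT}/event-logs-*/_search", json=body)
    hits = resp.json().get("hits", {}).get("hits", [])
    return jsonify([h["_source"] for h in hits])
```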


Scaling

  • Kinesis Data Firehose: By default, a Firehose delivery stream in US East (Ohio) can ingest up to 1,000 records/sec or 1 MiB/sec. This is a soft, region-specific limit and can be raised to as much as 10,000 records/sec through a service quota increase.
  • Elasticsearch: The Elasticsearch cluster can be scaled on AWS in terms of both storage and compute power, and version upgrades are also possible. Amazon ES uses a blue/green deployment process when updating domains, which means the number of nodes in the cluster might temporarily increase while your changes are applied. (A sketch of such a scaling call follows this list.)
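
On the compute side, resizing an Amazon ES domain is a single configuration update. A boto3 sketch with a hypothetical domain name and instance type:

```python
import boto3

es = boto3.client("es", region_name="us-east-2")

# Scale the domain out to four data nodes. Amazon ES applies the change
# via a blue/green deployment, so extra nodes may appear temporarily.
es.update_elasticsearch_domain_config(
    DomainName="event-logs",  # hypothetical domain name
    ElasticsearchClusterConfig={
        "InstanceType": "r5.large.elasticsearch",
        "InstanceCount": 4,
    },
)
```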

Advantages of this Architecture

  1. The pipeline is almost entirely managed and requires very little maintenance.
  2. If the Elasticsearch cluster fails, Kinesis Data Firehose can retain records for up to 24 hours, and records that fail to deliver are also backed up to S3. With these options, the chances of data loss are low.
  3. Fine-grained access control to both Kibana and the Elasticsearch API is possible through IAM policies.

Shortcomings

  1. Pricing needs to be carefully considered and monitored. Kinesis Data Firehose handles large volumes of data ingestion with ease, so if a rogue agent starts logging excessive amounts of data, Firehose will deliver it all without complaint, and that can incur large costs.
  2. The Kinesis Data Firehose integration with Elasticsearch is only supported for non-VPC Elasticsearch clusters.
  3. Kinesis Data Firehose currently cannot deliver logs to Elasticsearch clusters that are not hosted by AWS, so this setup will not work if you want to self-host Elasticsearch.

Conclusion

If you are looking for a solution that is completely managed and (mostly) scales without intervention, this would be a good option to consider. The automatic backup to S3 with lifecycle policies also solves the log retention and archival problem easily.
