Analyze and Query AWS CloudFront Logs through Amazon Athena
I have been working a lot with Amazon CloudFront recently for many of the applications I manage for my organization. CloudFront has been quite helpful in enhancing the customer experience of these applications, and I have seen a noticeable performance improvement. However, the downside of using a CDN is that you have another layer to troubleshoot when something goes wrong :(
When a user accesses a web application, CloudFront generates access logs from each of the configured global edge locations and saves them to an S3 bucket. However, these logs are not optimized for easy querying, and combing through them for the potential cause of an application failure or anomaly can take a large amount of time.
So, we need a solution to this! I could think of two ways to do it.
1. Amazon Elasticsearch service to feed data to Kibana and build a custom dashboard
2. Amazon Athena to query the data in S3
As I have already worked with the ELK stack for quite a long time, I went ahead with Athena as a POC. Who knows, it might give me a better result than ELK!
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
So, our workflow consists of ‘CloudFront → S3 → Amazon Athena’
During my R&D work, I realized that Athena’s scanning efficiency can be vastly improved by optimizing the input instead of querying the raw logs. A large pile of small log files is extremely hard to scan and process, so partitioning the data is needed to improve Athena performance. An AWS Lambda function can easily be injected into the middle of our workflow to move the log files into partition directories. Then an Athena CTAS query can be used to convert the log data into the Apache Parquet format. These partitions will be the input for Athena. Our overall finalized workflow is as below.
1. Understanding & Partitioning the Log Files
First, you need to make sure that access logging is enabled for your CloudFront distribution.
CloudFront writes the user access log files to S3 as gzip-compressed, tab-delimited text files. The file names follow the pattern below:
distribution_id.yyyy-mm-dd-hh.unique_id.gz
It should look like this in your S3 bucket:
Now we look into partitioning these log files.
For this, we can use an AWS Lambda function. It will be responsible for moving each log file into a partition directory based on its date (e.g., year=yyyy/month=mm/day=dd/).
Once done, you should have an output like this in your S3.
2. Converting the Access Logs to Apache Parquet
Now we look into converting the partitioned logs to the Apache Parquet format. This further improves Athena query performance, and since Athena can skip the data it does not need, the cost of each query is also relatively lower.
Amazon Athena CTAS (Create Table As Select) can be utilized to transform the data and convert it to the Parquet format. So, I’ll use a CTAS query to create a new table from the result of a SELECT query run on the current access logs (a sketch of such a query is shown after the table definition below).
The create table query for the Parquet table should look like the one below:
CREATE EXTERNAL TABLE `cdai_wdev_logs`.`cf_useraccess_opt_parquet` (
`time` timestamp,
`location` string,
`bytes` bigint,
`requestip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referrer` string,
`useragent` string,
`querystring` string,
`cookie` string,
`resulttype` string,
`requestid` string,
`hostheader` string,
`requestprotocol` string,
`requestbytes` bigint,
`timetaken` double,
`xforwardedfor` string,
`sslprotocol` string,
`sslcipher` string,
`responseresulttype` string,
`httpversion` string
) PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://cdai_wdev_logs/'
TBLPROPERTIES ('has_encrypted_data'='false');
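For reference, a CTAS query of this kind might look roughly like the sketch below. The source table cf_useraccess_raw, the target table name, and the output S3 prefix are assumed names for illustration only; point them at your own raw access-log table and bucket. Note that the partition columns must appear last in the SELECT list.
-- Minimal CTAS sketch (assumed names): reads the raw access logs and
-- writes them back to S3 as Parquet, partitioned by year, month and day.
CREATE TABLE cdai_wdev_logs.cf_useraccess_ctas_parquet
WITH (
format = 'PARQUET',
external_location = 's3://cdai_wdev_logs/parquet/', -- assumed output prefix
partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT "time", location, bytes, requestip, method, host, uri, status,
referrer, useragent, querystring, cookie, resulttype, requestid,
hostheader, requestprotocol, requestbytes, timetaken, xforwardedfor,
sslprotocol, sslcipher, responseresulttype, httpversion,
year, month, day
FROM cdai_wdev_logs.cf_useraccess_raw; -- assumed raw access-log table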
The data is now tailor-made for Athena! :)
3. Querying Data using Athena
First, we’ll go through the initial setup steps.
- Go to the Athena query editor and select an S3 bucket as the query result location. Encrypting the query results is encouraged, but that depends on your organization’s policies.
- Go to Data Sources and select 'S3 - AWS Glue Data Catalog' as the source. Since we are creating the table manually, select the 'Create a table manually' option in Step 2.
- The create table query should look like the one below:
CREATE EXTERNAL TABLE IF NOT EXISTS `cdai_wdev_logs`.`cf_logs` (
`time` timestamp,
`location` string,
`bytes` bigint,
`requestip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referrer` string,
`useragent` string,
`querystring` string,
`cookie` string,
`resulttype` string,
`requestid` string,
`hostheader` string,
`requestprotocol` string,
`requestbytes` bigint,
`timetaken` double,
`xforwardedfor` string,
`sslprotocol` string,
`sslcipher` string,
`responseresulttype` string,
`httpversion` string
) PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://cdai_wdev_logs/'
TBLPROPERTIES ('has_encrypted_data'='false');
Now we need to scan the files in S3 for Hive-compatible partitions. The MSCK REPAIR TABLE command compares the partitions in the table metadata with the partitions in S3. If new partitions are present in the S3 location that you specified when you created the table, it adds those partitions to the metadata and to the Athena table.
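Assuming the table name from the create table query above, that is simply:
-- Run with the cdai_wdev_logs database selected in the query editor;
-- loads any new year=/month=/day= partitions found under the table's S3 location.
MSCK REPAIR TABLE cf_logs;
Individual partitions can also be registered explicitly with ALTER TABLE ... ADD PARTITION if you prefer not to scan the whole prefix.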
Now we can simply query anything we want, just like a database, without manually digging through the logs!
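For example, here are a couple of hypothetical queries against the cf_logs table defined above (the date values are placeholders):
-- Hypothetical example: the 20 slowest requests served on a given day
SELECT "time", uri, status, location, timetaken
FROM cdai_wdev_logs.cf_logs
WHERE year = '2021' AND month = '05' AND day = '20'
ORDER BY timetaken DESC
LIMIT 20;

-- Hypothetical example: 5xx error counts per URI for the same day
SELECT uri, status, COUNT(*) AS requests
FROM cdai_wdev_logs.cf_logs
WHERE year = '2021' AND month = '05' AND day = '20'
AND status >= 500
GROUP BY uri, status
ORDER BY requests DESC;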