Analyze and Query AWS CloudFront Logs through Amazon Athena

Udaara Jayawardana
5 min read · Oct 3, 2021


I have been working a lot with Amazon CloudFront recently for many of the applications I manage for my organization. CloudFront has been quite helpful in enhancing the customer experience of these applications, and I have seen a noticeable performance improvement. However, the downside of using a CDN is that you have another layer to troubleshoot when something goes wrong :(


When a user accesses a web application, CloudFront generates access logs from each of the configured global edge locations and typically saves them to an S3 bucket. However, these logs are not optimized for easy querying, and combing through them for the potential cause of an app failure or anomaly can take a large amount of time.
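For context, each standard log file is a gzipped, tab-separated text file that starts with a version line and a field-list header roughly like the one below (field list truncated here for brevity):

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query …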

So, we need a solution to this! I could think of two ways to do it.

1. Amazon Elasticsearch Service to feed data to Kibana and build a custom dashboard
2. Amazon Athena to query the data in S3

As I have worked with the ELK stack for quite a long time, I went ahead with Athena as a POC for a change. Who knows, it might give me a better result than ELK!

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

So, our workflow consists of ‘CloudFront → S3 → Amazon Athena’

During my R&D, I realized that Athena’s scanning efficiency can be vastly improved by optimizing the input instead of pointing it at the raw logs. A huge pile of small log files is slow to scan and process, so to improve Athena performance we need to partition the data. An AWS Lambda function can easily be injected into the middle of our workflow to move the log files into partitions, and an Athena CTAS query can then be used to convert the files from their raw text format into the Apache Parquet format. These partitions will be the input for Athena. Our overall finalized workflow is as below.

Finalized Flow of Querying Logs through Athena

1. Understanding & Partitioning the Log Files

First, you need to make sure that your CloudFront logs are enabled.

CloudFront configured and enabled to record Logs to S3

CloudFront delivers the user access log files to S3 as gzipped, tab-delimited text files. They follow the naming pattern below

distribution_id.yyyy-mm-dd-hh.unique_id.gz

In your S3 bucket, this should look like the example below

CloudFront user access logs in S3

Now we look into partitioning these log files.

For this, we can use an AWS Lambda function. It will be responsible for moving each log file into a partitioned directory structure.
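As a rough illustration (the ‘partitioned’ prefix here is just a placeholder, not the exact layout from my setup), the function would rewrite each flat key into a Hive-style year/month/day prefix, which is what makes partition discovery work later on:

distribution_id.2021-10-03-01.unique_id.gz
→ partitioned/year=2021/month=10/day=03/distribution_id.2021-10-03-01.unique_id.gz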

Once done, you should have an output like this in your S3.

Partitioned files in S3

2. Converting the Access Logs to Apache Parquet

Now we look into converting the partitioned logs to the Apache Parquet format. This further improves Athena query performance, and since Athena can skip unwanted data, the cost will be relatively lower.

Amazon Athena CTAS (Create Table As Select) can be used to transform the data and convert it to the Parquet format. So, I’ll use a CTAS query to create a new table from the result of a SELECT query run over the current access logs.

The table definition for the Parquet-formatted table should look like below (a sketch of the CTAS statement itself follows the definition)

CREATE EXTERNAL TABLE `cdai_wdev_logs`.`cf_useraccess_opt_parquet` (
`time` timestamp,
`location` string,
`bytes` bigint,
`requestip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referrer` string,
`useragent` string,
`querystring` string,
`cookie` string,
`resulttype` string,
`requestid` string,
`hostheader` string,
`requestprotocol` string,
`requestbytes` bigint,
`timetaken` double,
`xforwardedfor` string,
`sslprotocol` string,
`sslcipher` string,
`responseresulttype` string,
`httpversion` string
)
PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://cdai_wdev_logs/'
TBLPROPERTIES ('has_encrypted_data'='false');
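For reference, the CTAS statement that performs the actual conversion could be sketched as below. Treat this as an outline only: the _ctas table name, the parquet/ output prefix, and the cf_useraccess_raw source table (a table defined over the raw access logs) are all placeholder assumptions, and note that in an Athena CTAS the partition columns must come last in the SELECT list.

CREATE TABLE cdai_wdev_logs.cf_useraccess_parquet_ctas
WITH (
format = 'PARQUET',
external_location = 's3://cdai_wdev_logs/parquet/', -- placeholder output prefix
partitioned_by = ARRAY['year', 'month', 'day']
)
AS SELECT
time, location, bytes, requestip, method, host, uri, status,
referrer, useragent, querystring, cookie, resulttype, requestid,
hostheader, requestprotocol, requestbytes, timetaken, xforwardedfor,
sslprotocol, sslcipher, responseresulttype, httpversion,
year, month, day -- partition columns must come last
FROM cdai_wdev_logs.cf_useraccess_raw; -- hypothetical table over the raw access logs

The advantage of CTAS is that it creates the table and writes the Parquet files in a single statement.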

The data is now tailor-made for Athena! :)

3. Querying Data using Athena

First, we’ll look into the initial steps.

  1. Go to the Athena query editor and select an S3 bucket as the query result location. Encrypting the query results is encouraged, but it depends on your organization’s policies
  2. Go to Data Sources and select ‘S3 - AWS Glue Data Catalog’ as the source. Since we are creating the table manually, select the ‘Create a table manually’ option in Step 2
  3. The Preview Table Query should look like below
CREATE EXTERNAL TABLE IF NOT EXISTS `cdai_wdev_logs`.`cf_logs` (
`time` timestamp,
`location` string,
`bytes` bigint,
`requestip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referrer` string,
`useragent` string,
`querystring` string,
`cookie` string,
`resulttype` string,
`requestid` string,
`hostheader` string,
`requestprotocol` string,
`requestbytes` bigint,
`timetaken` double,
`xforwardedfor` string,
`sslprotocol` string,
`sslcipher` string,
`responseresulttype` string,
`httpversion` string
)
PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
LOCATION 's3://cdai_wdev_logs/'
TBLPROPERTIES ('has_encrypted_data'='false');

Now we need to scan the files in S3 for Hive-compatible partitions. The MSCK REPAIR TABLE command compares the partitions in the table metadata with the partitions in S3. If new partitions are present in the S3 location that you specified when you created the table, it adds them to the metadata and to the Athena table.
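The command itself is a one-liner, run against the table created above:

MSCK REPAIR TABLE cdai_wdev_logs.cf_logs;

Note that it only discovers partitions whose S3 prefixes follow the Hive key=value convention (e.g. year=2021/month=10/day=03/), which is what the Lambda partitioning step needs to produce.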

Scan S3 for HIVE compatible data partitions through ‘MSCK Repair Table’

Now we can simply query anything we want, just as with a database, without combing through the raw logs!

Querying through CloudFront access logs
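For example, a query along the lines of the sketch below (placeholder partition values, table name taken from the definition above) lists the most frequently hit error URIs for a given day:

SELECT uri, status, COUNT(*) AS hits
FROM cdai_wdev_logs.cf_logs
WHERE year = '2021' AND month = '10' AND day = '03'
AND status >= 400
GROUP BY uri, status
ORDER BY hits DESC
LIMIT 20;

Restricting the WHERE clause to the partition columns keeps the amount of data Athena scans, and therefore the cost, to a minimum.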


Udaara Jayawardana

A DevOps Engineer who specialises in the design and implementation of AWS and Containerized Infrastructure.