What Happens in Vegas Stays in Vegas, Or Not

2017-02-21, 11:44 Posted by: Aki Karjalainen

AWS re:Invent

A great way to end the year: AWS re:Invent 2016, check. A lot of interesting topics were covered, and new products and services were announced. I had a great time, though moving between sessions was a bit cumbersome and time-consuming. I heard Amazon is rethinking the venue for next year; they may need a bigger one so the sessions aren't split across two different resorts on opposite sides of the Las Vegas Strip. Well, we'll see what they come up with. In the meantime, I wanted to share a couple of tips, tricks and ideas from re:Invent you may find useful, basically things I picked more or less at random from my notes. Instead of the usual "do not try this at home", I encourage everyone to try this at home: play around with the new (or old) AWS services and give them a try if you haven't done so already.

Amazon Redshift

A free Redshift tip. Typically you want things spread evenly, whether it's the household chores you share with your spouse, the workload on your EC2 instances, or the distribution of data across a database cluster. Redshift is a column-oriented database: data is stored by column, which makes it well suited for analytics on tables with a large number of columns. Redshift is also distributed, which means that each compute node in the cluster (there's also a leader node) stores a subset of the data. Each node is divided into slices, and the number of slices depends on the node type.

When you create a table, you must also think about how you want your data spread. Even distribution is the default, but for performance reasons you may want to distribute by a specific column. This is much like what is called data partitioning in the IBM DB2 or Oracle world. So, about choosing a distribution style. KEY suits large fact tables and rapidly changing tables used in joins. EVEN suits tables that aren't frequently joined, or large tables without candidate keys; data is spread across the slices in round-robin fashion regardless of the values in any column. ALL suits slowly changing data of reasonable size, since a full copy goes to every node.

The beef is that the data redistribution Redshift performs may account for a big portion of the cost of a query plan and of network traffic. It's important to minimize data movement during query execution: if the rows that participate in joins are already collocated on the same nodes as their joining rows in other tables, life will be good. Depending on your application and use case, one scenario would be to distribute by user id, whether it's a generated unique id or an email address. So you would do something like "create table MyTable (UserId int) diststyle key distkey (UserId);". OK, life looks good and your EDW applications and queries run smoothly. Then over time you notice that data is not spread evenly across the compute nodes, and you might wonder why that is.
A typical reason is that your application is actually storing data from anonymous user sessions (that is, users who are not logged in), and in those cases the column value might be something like null or "unauthenticated". Those rows all land on a single slice and lead to significant skew. Beware of similar scenarios where the data at first seems evenly distributed but there's a hidden anomaly.
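To see why a placeholder value wrecks KEY distribution, here's a small simulation, not Redshift's actual internal hash, just MD5 modulo the slice count as a stand-in, with made-up counts for logged-in and anonymous sessions:

```python
import hashlib
from collections import Counter

def slice_for(value, num_slices):
    """Map a distribution-key value to a slice, roughly how a KEY
    diststyle spreads rows (Redshift's real hash is internal)."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_slices

NUM_SLICES = 8  # assumption: e.g. a small cluster

# 10,000 logged-in users with unique ids spread nicely...
assignments = [slice_for(user_id, NUM_SLICES) for user_id in range(10_000)]
# ...plus 5,000 anonymous sessions that all carry the same placeholder,
# so every one of them hashes to the same slice.
assignments += [slice_for("unauthenticated", NUM_SLICES) for _ in range(5_000)]

per_slice = Counter(assignments)
print(sorted(per_slice.values(), reverse=True))
```

The unique ids land at roughly 1,250 rows per slice, while one unlucky slice carries an extra 5,000 rows on top of that. Scale that up and the node holding that slice becomes the hot spot for storage and query work.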


A few words about design patterns with WAF and serverless architectures. WAF stands for Web Application Firewall, a service that protects your applications against some common exploits. Nothing new as such; other big players like F5 with their Application Security Manager offer similar features such as dynamic learning and threat analysis. What's cool about AWS WAF is that it's not just an on/off service with predefined exploits: you can define your own customizable rules. And what's even cooler is that you can of course modify those rules using Lambda functions. That opens up some nice possibilities, like altering your rules dynamically based on specific patterns you look for in your logs or elsewhere. You might need to keep your SSH port open but still want to block bad actors, say, sources with too many failed logins. You could enable VPC Flow Logs and, based on the CloudWatch Logs they generate, twiddle with your WAF rules accordingly. Previously WAF was integrated only with the CDN (CloudFront), but now it can be used with Application Load Balancers as well, which makes WAF quite a compelling option for typical web applications too. Combining WAF with "security business" logic run as Lambda functions is one of the cool things here, and I'd expect to hear about some nice patterns built around that combination in the future.
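The detection half of that pattern can be sketched in a few lines. This is a hypothetical Lambda-style helper: the log format, port, and threshold are my assumptions, and the actual WAF update is only described in a comment rather than implemented:

```python
from collections import Counter

REJECT_THRESHOLD = 5  # assumption: tune for your own environment

def bad_actors(log_records, threshold=REJECT_THRESHOLD):
    """Count rejected SSH connection attempts per source IP from
    (hypothetical) pre-parsed flow-log records of the form
    'srcaddr dstport action', and return the offenders."""
    rejects = Counter()
    for record in log_records:
        srcaddr, dstport, action = record.split()
        if dstport == "22" and action == "REJECT":
            rejects[srcaddr] += 1
    return {ip for ip, count in rejects.items() if count >= threshold}

# In a real Lambda handler these records would arrive via a CloudWatch
# Logs subscription, and you would push the offenders into a WAF IPSet
# (classic WAF's UpdateIPSet API, with a change token) -- sketch only.
sample = ["198.51.100.7 22 REJECT"] * 6 + ["203.0.113.9 22 ACCEPT"] * 3
print(bad_actors(sample))
```

The point is that the rule logic is just code: whatever pattern you can express over your logs, you can turn into a WAF rule change.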


One more security-related reminder for the holidays. If you access your AWS resources through a bastion host (a.k.a. jump server, or whatever you call it), remember to destroy and rebuild it periodically to mitigate the risk of a compromised bastion going unnoticed for a long time. The most effective housekeeping action is, after all, to recreate the resource from scratch. And since you have your automation / infrastructure-as-code implementation in place, recreating critical resources like bastion hosts even daily should be a breeze, right? By the way, I had the privilege to attend a nice session by AWS and Experian about immutable infrastructure in production and life without SSH. Experian is in the credit reporting business, operates in 80 countries and runs a significant number of workloads in AWS. Guess how many times they have actually had to log in to an instance over SSH in the two years since they disabled it? Zero. At first, disabling SSH altogether might sound like infrastructure as code taken too far, but honestly, why would you ever want to log in to any of your instances? Logs can be examined elsewhere. Basically the only time you would want to is after you've been hacked and need to do forensic analysis directly on the box. So an even better alternative is to disallow SSH access altogether.
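The bastion-recycling idea is easy to automate. Here's a minimal sketch of the selection step, assuming a daily maximum age and instance dicts shaped like the per-instance entries boto3's describe_instances returns; the actual terminate/recreate part is left to your IaC tooling:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=1)  # assumption: rebuild bastions daily

def stale_bastions(instances, now=None, max_age=MAX_AGE):
    """Return the ids of bastion instances older than max_age.
    Each instance is a dict with 'InstanceId' and a timezone-aware
    'LaunchTime', mimicking boto3's describe_instances output."""
    now = now or datetime.now(timezone.utc)
    return [i["InstanceId"] for i in instances
            if now - i["LaunchTime"] > max_age]

now = datetime(2016, 12, 20, 12, 0, tzinfo=timezone.utc)
fleet = [
    {"InstanceId": "i-old", "LaunchTime": now - timedelta(days=3)},
    {"InstanceId": "i-new", "LaunchTime": now - timedelta(hours=2)},
]
# A scheduled Lambda could terminate these and let automation
# recreate them from scratch.
print(stale_bastions(fleet, now=now))
```

Run this on a schedule, terminate whatever it returns, and a freshly built bastion is never more than a day old.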

Kudos if you read this far. :)
