One of the major reasons organizations migrate to the AWS cloud is to gain the elasticity that can grow and shrink on demand, allowing them to pay only for resources they use. But the freedom to provide on-demand resources can sometimes lead to very high costs if they aren’t carefully monitored. Cost Optimization is one of the five pillars of the AWS Well-Architected Framework, and with good reason. When you optimize your costs, you build a more efficient cloud that helps focus your cloud spend where it’s needed most while freeing up resources to invest in things like more headcount, innovative projects or developing competitive differentiators.
Additionally, considering the cost implementation in mind, we will try to optimise our own cost of AWS usage by visualising it with AWS Quicksight. We will look into the complete setup of viewing the AWS cost and usage reports. Furthermore, we will look to implement our goal using S3 and Athena.
What is AWS Cost and Usage Service?
The AWS Cost and Usage report tracks your AWS usage and provides estimated charges associated with your AWS account. The report contains line items for each unique combination of AWS product, usage type, and operation that your AWS account uses. You can customize the AWS Cost and Usage report to aggregate the information either by the hour or by the day. AWS delivers the report files to an Amazon S3 bucket that you specify in your account and updates the report up to three times a day. You can also call the AWS Billing and Cost Management API Reference to create, retrieve, or delete your reports. You can download the report from the Amazon S3 console, upload the report into Amazon Redshift or Amazon QuickSight, or query the report in Amazon S3 using Amazon Athena.
What is AWS QuickSight?
Amazon QuickSight is an Amazon Web Services utility that allows a company to create and analyze visualizations of its customers’ data. The business intelligence service uses AWS’ Super-fast, Parallel, In-memory Calculation Engine (SPICE) to quickly perform data calculations and create graphs. Amazon QuickSight reads data from AWS storage services to provide ad-hoc exploration and analysis in minutes. Amazon QuickSight collects and formats data, moves it to SPICE and visualizes it. By quickly visualizing data, QuickSight removes the need for AWS customers to perform manual Extract, Transform, and Load operations.
Amazon QuickSight pulls and reads data from Amazon Aurora, Amazon Redshift, Amazon Relational Database Service, Amazon Simple Storage Service (S3), Amazon DynamoDB, Amazon Elastic MapReduce and Amazon Kinesis. The service also integrates with on-premises databases, file uploads and API-based data sources, such as Salesforce. QuickSight allows an end user to upload incremental data in a file or an S3 bucket. The service can also transform unstructured data using a Prepare Data option
What is AWS Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s fully-managed ETL capabilities to transform data or convert it into columnar formats to optimize cost and improve performance.
Setting up AWS S3 and Cost Service
The very first task requires you to set up an S3 bucket. S3 bucket is the location where we will be putting our amazon cost and usage data. Go to your Amazon console and select S3. Click on create bucket button to initialise the setup
Once the create bucket menu pops up, you will see the different options to fill. You need to write the bucket name, mention region and select access settings for the bucket in this step.
Click on create after filling all the fields. Open S3 and navigate to the Permissions tab in the console. We need to copy the access policy from here to access this bucket from quicksight. Furthermore, this policy will help in connecting the bucket with AWS cost and usage service.
Click on bucket policy. A JSON file will come up with some default settings. We do not need to change much of the things in this file. You can directly copy the code their
We are able to set up our S3 bucket till now. Additionally, we need to create our cost and usage report now. Go to AWS Cost and usage reports tab from the console. Click on create the report to create a new report on cost and usage.
After clicking on create report, a form will pop up. Mention all the necessary details here. The form includes the report name, cost and usage time level. You can directly access these reports for Redshift and Quicksight. In this tutorial, we are storing the data in the S3 bucket first. After storing it in S3 bucker, we will connect it with AWS Quicksight.
In the second part, we need to select a delivery option. I will mention the name of the final delivery S3 bucket which we created in the previous step.
Fill the form and click next. After clicking the next, we have created a report on AWS cost and billing. Click on the newly generated report now.
We need to set up access policies for the report. Click on create a new policy and sample editor will pop up
You can choose to edit the policy depending upon your requirement. Edit the resource section here and mention the correct name of your S3 bucket here. Click on done to complete the policy initialisation
Congratulations! Till this part, we have done most of our work. We have an S3 bucket to store cost and usage data. Also, we have set up cost and usage reports to access our S3 bucket and store the results there.
Setting up Athena (Cloud formation template) and Running Queries
Now we need to setup Athena using a cloud formation template. Go to cloud formation console and click on select “Create New Stack”. Once you click on create a new stack, a sample popup will come.
Here you need to fill the form for creating the template. You can choose to select an existing Amazon S3 or can mention a template URL. Once you fill all the fields, click on next. This will create the Athena stack for you using cloud formation template.
Following is the query command, to access the cost and usage statistics. Also, you can try running the following code on the Athena editor to view the results.
Setting up Quicksight
Now we have our Athena and S3 setup completely. We need to setup Quicksight now. Go to Quicksight section, and click on setup. There can be the case when you need to enable or signup for the Quicksight again. In case the below pop up appears, click on signup to create the QuickSight account
A sample form like below will pop up for you. Mention the account name, email address and the services you want to enable for Quicksight. Once you have filled all the entries, click on Finish to complete the setup
You can choose to connect your s3 account with Quicksight. In the following popup, a sample of already existing buckets will pop for you. You can select the pre-existing buckets and it will automatically get connected with your Quicksight. Here you can easily connect the bucket which holds your AWS cost and usage report. With bucket already connected, you can easily pull the cost and usage report into Quicksight and analyse it.
After setting up the Quicksight, a sample popup will come. You can click on Next to finish the setup.
Now the basic setup of Quicksight is complete. All you want to do now is connect your S3 bucket with Quicksight using Athena. Click the Athena option and run the code to extract the usage report into the AWS S3 storage.
You can then select the column names present in the left sidebar panel to plot the charts in the right panel. Quick sight is a drag-and-drop visualisation tool. You can search for the columns and quick sight will show you the suggested visualisations. You can choose the visualisation and drop it onto the right canvas.
It will automatically plot the charts for you. As you can see, below image contains cost by product visualisation of the AWS services. It also depicts the costs distribution of different instances running on AWS.
Data-driven decision making is essential throughout an organization. It is no longer prohibitively expensive to ensure access to BI to employees at all levels. Amazon’s QuickSight lets you create and publish interactive dashboards that can be accessed from browsers or mobile devices. You can embed dashboards into your applications, providing your customers with powerful self-service analytics. It easily scales to tens of thousands of users without any software to install, servers to deploy, or infrastructure to manage.
QuickSight is an innovative and cloud-hosted BI platform that addresses the shortfalls of traditional BI systems. Furthermore, its low pay-per-session pricing is a great alternative to the competition. QuickSight can get data from various sources including relational databases, files, streaming, and NoSQL databases. QuickSight also comes with an in-memory caching layer that can cache and calculate aggregates on the fly. With QuickSight, data analysts are truly empowered and can build intuitive reports in minutes without any significant set up by IT.
Enterprises have long relied on BI to help them move their businesses forward. Years ago, translating BI into actionable information required the help of data experts. Today, technology supports BI which is accessible to people at all levels of an enterprise.
All that BI data needs to live somewhere. The data storage solution you choose for enterprise app development positions your business to access, secure, and use data in different ways. That’s why it’s helpful to understand the basic options, how they’re different, and which use cases are suitable for each.
In this blog, we will be looking at the key differences between Data lakes and Data warehouses. We will understand their basics and will try to see their implementation in different fields with different tools.
What is Data Lake?
A data lake is a central location in which you can store all your data, regardless of its source or format. It is typically, although not always, built using Hadoop. The data can be structured or unstructured. You can then use a variety of storage and processing tools — typical tools in the extended Hadoop ecosystem — to extract value quickly and inform key organizational decisions. Because of the growing variety and volume of data, data lakes are an emerging and powerful architectural approach, especially as enterprises turn to mobile, cloud-based applications, and the Internet of Things (IoT) as right-time delivery mediums for big data.
What is a Data Warehouse?
A data warehouse is a large collection of business data used to help an organization make decisions. The concept of the data warehouse has existed since the 1980s when it was developed to help transition data from merely powering operations to fueling decision support systems that reveal business intelligence.
A large amount of data in data warehouses comes from different places such as internal applications such as marketing, sales, and finance; customer-facing apps; and external partner systems, among others. On a technical level, a data warehouse periodically pulls data from those apps and systems; then, the data goes through formatting and import processes to match the data already in the warehouse. The data warehouse stores this processed data so itʼs ready for decision-makers to access. How frequently data pulls occur, or how data is formatted, etc., will vary depending on the needs of the organization.
1. Data Types
Data warehouses store structured organizational data such as financial transactions, CRM and ERP data. Other data sources such as social media, web server logs, and sensor data, not to mention documents and rich media, are not storable because they are more difficult to model, and their sheer volume makes them expensive and difficult to manage. These types of data are more appropriate for a data lake.
In a data warehouse, data is organized, defined, and metadata is applied before the data is written and stored. We call this process as ‘schema on writeʼ. A data lake consumes everything, including data types considered inappropriate for a data warehouse. Data is present in raw form; information is present to the schema as we extract data from the data source, not when we write it to storage. We call this as a ‘schema on readʼ.
3. Storage and Data Retention
Before we can load data to a data warehouse, data engineers work hard to analyze the data and how to use it for business analysis. They design transformations to summarize and transform the data to enable the extraction of relevant insights. They do not consider the data which doesnʼt answer concrete business questions in the data warehouse. In order to reduce storage space and improve performance — a traditional data warehouse is an expensive and scarce enterprise resource. In a data lake, data retention is less complex, because it retains all data — raw, structured, and unstructured. Data is never going in the deletion phase, permitting analysis of past, current and future information. Data lakes run on commodity servers using inexpensive storage devices, removing storage limitations.
Data warehouses store historical data. Incoming data conforms to a predefined structure. This is useful for answering specific business questions, such as “what is our revenue and profitability across all 124 stores over the past week”. However, if business questions are evolving, or the business wants to retain all data to enable in-depth analysis, data warehouses are insufficient. The development effort to adapt the data warehouse and ETL process to new business questions is a huge burden. A data lake stores data in its original format, so it is immediately accessible for any type of analysis. Information can be retrieved and reused — a user can apply a formalized schema to the data, store it, and share it with others. If the information is not useful, the copy can be discarded without affecting the data stored in the data lake. All this is done with no development effort.
5. Security, Maturity, and Usage
Data warehouses have been around for two decades and are a secure, enterprise-ready technology. Data lakes are getting there, but are newer and have a shorter enterprise track record. A large enterprise cannot buy and implement a data lake like it would a data warehouse — it must consider which tools to use, open source or commercial, and how to piece them together to meet requirements. The end users of each technology are different: a data warehouse is used by business analysts, who query the data via pre-integrated reporting and BI. Business users cannot use a data lake as easily, because data requires processing and analysis to be useful. Data scientists, data engineers, or sophisticated business users, can extract insights from massive volumes of data in the data lake.
Benefits of Data lakes
1. The Historical Legacy Data Architecture Challenge
Some reasons why data lakes are more popular are historical. Traditional legacy data systems are not that open, to say the least, if you want to start integrating, adding and blending data together to analyze and act. Analytics with traditional data architectures weren’t that obvious nor cheap either (with the need for additional tools, depending on the software). Moreover, they weren’t built with all the new and emerging (external) data sources which we typically see in big data in mind.
2. Faster Big Data Analytics as a Driver of Data Lake Adoption
Another important reason to use data lakes is the fact that big data analytics can be done faster. In fact, data lakes are designed for big data analytics if you want and, more important than ever, for real-time actions based on real-time analytics. Data lakes are fit to leverage big quantities of data in a consistent way with algorithms to drive (real-time) analytics with fast data.
3. Mixing and Converging Data: Structured and Unstructured in One Data Lake
A benefit we more or less already mentioned is the possibility to acquire, blend, integrate and converge all types of data, regardless of sources and format. Hadoop, one of the data lake architectures, can also deal with structured data on top of the main chunk of data: the previously mentioned unstructured data coming from social data, logs and so forth. On a side note: unstructured data is the fastest growing form of all data (even if structured data keeps growing too) and is predicted to reach about 90 percent of all data.
Benefits of Data Warehousing
Organizations that use a data warehouse to assist their analytics and business intelligence to see a number of:
Better data, hence adding data sources to a data warehouse enables organizations to ensure that they are collecting consistent and relevant data from that source. They donʼt need to wonder whether the data will be accessible or inconsistent as it comes into the system. This ensures higher data quality and data integrity for sound decision making.
Data in a warehouse is in always consistent analyzable formats. It also provides analytical power and a more complete dataset to base decisions on hard facts. Therefore, decision-makers no longer need to rely on hunches, incomplete data, or poor quality data and risk delivering slow and inaccurate results.
Tools for Data Warehousing
1. Amazon Redshift
Amazon Redshift is an excellent data warehouse product which is a very critical part of Amazon Web Services — a very famous cloud computing platform. Redshift is a fast, well-managed data warehouse that analyses data using the existing standard SQL and BI tools. It is a simple and cost-effective tool that allows running complex analytical queries using smart features of query optimization. It handles analytics workload pertaining to big data sets by utilizing columnar storage on high-performance disks and massively parallel processing concepts. One of its very powerful features is Redshift spectrum, that allows the user to run queries against unstructured data directly in Amazon S3. It eliminates the need for loading and transformation. It automatically scales query computing capacity depending on data. Hence the queries run fast. Official URL: Amazon Redshift
Teradata is another market leader when it comes to database services and products. Most of the competitive enterprise organizations use Teradata DWH for insights, analytics & decision making. Teradata DWH is a relational database management system by Teradata organization. It has two divisions i.e. data analytics & marketing applications. It works on the concept of parallel processing and allows users to analyze data in a simple yet efficient manner. An interesting feature of this data warehouse is its data segregation into hot & cold data. Here cold data refers to less frequently used data and this is the tool in the market these days. Official URL: Teradata
Tools for Data lakes
1. Amazon S3
The Amazon S3-based data lake solution uses Amazon S3 as its primary storage platform. Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability. You can seamlessly and nondisruptively increase storage from gigabytes to petabytes of content, paying only for what you use. Amazon S3 has 99.999999999% durability. It has scalable performance, ease-of-use features, and native encryption and access control capabilities. Amazon S3 integrates with a broad portfolio of AWS and third-party ISV data processing tools.
2. Azure Data lake
Azure Data Lake Storage Gen2 is a highly scalable and cost-effective data lake solution for big data analytics. It combines the power of a high-performance file system with massive scale and economy to help you speed your time to insight. Data Lake Storage Gen2 extends Azure Blob Storage capabilities and can handle analytics workloads. Data Lake Storage Gen2 is the most comprehensive data lake available.
So Which is Better? Data Lake or the Data Warehouse? Both! Instead of a Data Lake vs Data Warehouse decision, it might be worthwhile to consider a target state for your enterprise that includes a Data Lake as well as a Data Warehouse. Just like the advanced analytic processes that apply statistical and machine learning techniques on vast amounts of historical data, the Data Warehouse can also take advantage of the Data Lake. Newly modeled facts and slowly changing dimensions can now be loaded with data from the time the Data Lake was built instead of capturing only new changes.
This also takes the pressure off the data architects to create each and every data entity that may or may not be used in the future. They can instead focus on building a Data Warehouse exclusively on current reporting and analytical needs, thereby allowing it to grow naturally.
Follow this link, if you are looking to learn more about data science online!