Methods to Manage Apache Spark Application Memory on Amazon EMR



It is said that most of the world's data has been generated in the past few years. The advancement of technology and the growth of the internet are the sources of this voluminous unstructured data, which is rapidly changing the dynamics of our daily lives. No wonder Prof. Andrew Ng said, 'Data is the new electricity'.



We refer to data that exhibits the properties of volume, variety, veracity, and value as Big Data. Big Data developers have built several frameworks to get the best out of this big data; two such frameworks are Hadoop and Spark.

In this blog post, I will give a brief overview of Apache Spark and Amazon EMR, and then we will learn how the memory of an Apache Spark application can be managed on Amazon EMR.


What is Apache Spark?

Though Hadoop was the most popular Big Data framework, it lacked the real-time data processing capabilities that Spark provides. Unlike Hadoop's MapReduce, which persists intermediate results to disk, Spark uses both memory and disk for data processing.

Spark applications are written mainly in Scala, Java, or Python. Libraries such as Spark SQL, MLlib for machine learning, Spark Streaming, and so on are also included, which makes Spark flexible for a variety of use cases.

Many big organizations have incorporated Spark into their applications to speed up data engineering. Spark builds a directed acyclic graph (DAG) of operations, which boosts performance. A detailed description of Spark's architecture is beyond the scope of this article.


What is Amazon EMR?

The processing and analysis of Big Data in Amazon Web Services is done with a tool known as Amazon Elastic MapReduce (EMR). As an alternative to in-house cluster computing, Amazon EMR provides an expandable, low-configuration service.

Amazon EMR is based on Apache Hadoop, and big data is processed across a Hadoop cluster running on Amazon EC2 and Amazon S3. The "elastic" property refers to the ability to resize the cluster dynamically, which allows flexible usage of resources based on need.

Amazon EMR is used for data analysis in scientific simulation, data warehousing, financial analysis, machine learning, and so on. It also supports Apache Spark-based workloads, which we will see later in this blog.


Managing Apache Spark Application Memory on Amazon EMR

As mentioned earlier, Amazon EMR simplifies running big data frameworks like Hadoop and Spark. ETL (Extract, Transform, Load) is one of the most common modern use cases for generating insights from data, and Amazon EMR is one of the best cloud solutions for analysing data.

Through parallel processing, various business intelligence and data engineering workloads are managed using Amazon EMR. It reduces the time, effort, and cost involved in establishing and scaling a cluster.

Apache Spark, a fast, open-source framework, is widely used for the distributed processing of big data; it relies on RAM, performing parallel computing in memory to reduce input/output time. To run a Spark application on Amazon EMR, the following steps are performed:

  • Upload the Spark application package to Amazon S3.
  • Configure and launch the Amazon EMR cluster with Apache Spark.
  • Install the application package from Amazon S3 onto the cluster and run the application.
  • Terminate the cluster after the application completes.
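As a rough sketch, the four steps can be driven from the AWS CLI; the bucket name, script, instance type, and release label below are illustrative assumptions, not values from this post.

```shell
# 1. Upload the Spark application package to S3.
aws s3 cp my_spark_job.py s3://my-demo-bucket/jobs/my_spark_job.py

# 2-3. Launch an EMR cluster with Spark installed and run the
#      application as a step; --auto-terminate covers step 4.
aws emr create-cluster \
  --name "spark-demo" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --steps Type=Spark,Name="MyJob",ActionOnFailure=CONTINUE,Args=[s3://my-demo-bucket/jobs/my_spark_job.py] \
  --auto-terminate
```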


Based on the requirements, the Spark application needs to be configured appropriately. There can be physical and virtual memory issues if the default settings are kept. Below are some of the memory-related issues in Apache Spark.

  • Exceeding physical memory – when the limits are exceeded, YARN kills the container and throws an error message such as: ExecutorLostFailure: 4.5 GB of 3 GB physical memory used. It may also ask you to boost spark.yarn.executor.memoryOverhead.
  • Exceeding virtual memory – if the virtual memory limit is exceeded, YARN kills the container.
  • Java heap space – a very common memory issue, faced when the Java heap runs out of memory. The error thrown is java.lang.OutOfMemoryError: Java heap space.
  • Exceeding executor memory – when the memory an executor requires is above a threshold, we get an out-of-memory error.


Some of the reasons why these issues occur are:

  • Inappropriate settings for the executor memory, the number of executor instances, the number of cores, and so on, relative to the volume of data being handled.
  • The Spark executor's physical memory exceeding the memory allocated by YARN, which creates issues during memory-intensive operations.
  • Not enough memory in the Spark executor instance to perform operations such as garbage collection.


To successfully configure a Spark application on Amazon EMR, the following steps should be performed.

Step 1: Determine the number and type of instances based on the application's needs

There are three types of nodes in Amazon EMR.

  • The one master node, which manages the cluster and acts as the resource manager.
  • The core nodes, which are managed by the master node. They run the MapReduce tasks and the NodeManager daemons, which execute tasks, manage storage, and report back to the master on their liveness.
  • The task nodes, which only perform tasks and do not store data.

R-type instances are preferred for memory-intensive applications, while C-type instances are preferable for compute-intensive applications; M-type instances provide a balance between compute and memory. After the instance type is selected, the number of instances is decided based on the application's execution time, the input data sets, and so on.


Step 2: Determine the configuration parameters of Spark




There are multiple memory compartments in the Spark executor container, of which only one executes the task. Some of the parameters that need to be configured carefully are:

  • spark.executor.memory – the amount of memory each executor needs to run its tasks. A rough formula is the per-instance RAM divided by the number of executors per instance.
  • spark.executor.cores – the number of virtual cores per executor. Too many virtual cores per executor reduces parallelism, while too few results in high I/O overhead; five virtual cores per executor is generally considered optimal.
  • spark.driver.memory – the driver memory size. This should be set equal to spark.executor.memory.
  • spark.driver.cores – the number of the driver's virtual cores. Setting it equal to spark.executor.cores is recommended.
  • spark.executor.instances – the number of executors. This is calculated by multiplying the number of executors per instance by the number of core instances and subtracting one from the product (leaving a slot for the driver).
  • spark.default.parallelism – the default number of RDD partitions. To set its value, multiply spark.executor.instances by spark.executor.cores and by two.
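The formulas above can be folded into a small sizing sketch. The instance figures in the example (a 48-vCPU, 384 GB node such as an r5.12xlarge, and 19 core nodes) are illustrative assumptions, not recommendations from this post:

```python
def executor_config(vcpus_per_node, ram_gb_per_node, core_nodes,
                    cores_per_executor=5):
    """Sketch of the sizing formulas described above (not an official tool)."""
    # Leave one vCPU per node for the OS and Hadoop daemons.
    executors_per_node = (vcpus_per_node - 1) // cores_per_executor
    # Per-instance RAM divided by the number of executors per instance...
    mem_per_executor = ram_gb_per_node // executors_per_node
    # ...reserving roughly 10% of it for memory overhead.
    executor_memory = int(mem_per_executor * 0.9)
    # Executors per node times core nodes, minus one slot for the driver.
    executor_instances = executors_per_node * core_nodes - 1
    # Default parallelism: executors * cores * 2.
    parallelism = executor_instances * cores_per_executor * 2
    return {
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{executor_memory}g",
        "spark.executor.instances": executor_instances,
        "spark.driver.memory": f"{executor_memory}g",
        "spark.driver.cores": cores_per_executor,
        "spark.default.parallelism": parallelism,
    }

# Example: 19 core nodes with 48 vCPUs and 384 GB RAM each.
print(executor_config(48, 384, 19))
```

With these inputs the sketch reproduces the 37 GB per executor, 170 executor instances, and 1700 parallelism quoted later in this post for the 10-terabyte example.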


Some of the parameters that need to be set in the default configuration file to avoid memory and timeout-related issues are:

  • spark.network.timeout – the timeout for all network transactions.
  • spark.executor.heartbeatInterval – the interval between each executor's heartbeats to the driver.
  • spark.rdd.compress – set to true to save space by compressing serialized RDD partitions.
  • spark.shuffle.compress – set to true to compress the map output during shuffles.
  • spark.sql.shuffle.partitions – sets the number of partitions for joins and aggregations.
  • spark.shuffle.spill.compress – set to true to compress data spilled to disk during shuffles.
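A spark-defaults.conf fragment covering these properties might look as follows; the timeout, interval, and partition values are illustrative assumptions, not tuned recommendations:

```
spark.network.timeout             800s
spark.executor.heartbeatInterval  60s
spark.rdd.compress                true
spark.shuffle.compress            true
spark.shuffle.spill.compress      true
spark.sql.shuffle.partitions      400
```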


Step 3: Implement a garbage collector to clear memory

In certain scenarios, garbage collection leads to out-of-memory errors. Such cases occur when there are multiple RDDs in the application, or when there is interference between the RDD cache memory and the task execution memory.

Multiple garbage collectors can be used to replace old objects with new ones in memory. The limitations of the older garbage collectors can be overcome with the Garbage First Garbage Collector (G1GC).
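One hedged way to switch the executors and driver to G1GC is through the JVM options in spark-defaults; the occupancy threshold below is an illustrative assumption:

```
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35
spark.driver.extraJavaOptions    -XX:+UseG1GC
```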


Step 4: Set the configuration parameters of YARN

The YARN site settings should be adjusted to prevent virtual out-of-memory errors: the physical and virtual memory check flags should be set to false.
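On EMR these flags are typically applied through a yarn-site classification; the equivalent yarn-site.xml entries are:

```xml
<!-- yarn-site.xml: disable the physical and virtual memory checks -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```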


Step 5: Perform monitoring and debugging

The Spark UI and Ganglia should be used to monitor the progress of the Spark application, network I/O, and so on. As an illustration, to successfully manage 10 terabytes of data in Spark, we need on the order of 8 terabytes of total RAM, 170 executor instances, 37 GB of memory per executor, 960 total virtual CPUs, a parallelism of 1700, and so on.

The default Spark configuration could lead to out-of-physical-memory errors, as it is incapable of processing 10 terabytes of data. Moreover, the default garbage collector does not clear memory efficiently, which often leads to failures.


The future of Cloud

Technology is gradually moving from traditional infrastructure setups to all cloud-based environments. Amazon Web Services is certainly a frontrunner in cloud services, letting you build and deploy scalable applications at a reasonable cost. You also pay only for the services you actually use.

As the size of data will increase exponentially in the future, it is pertinent that companies and professionals master the art of using cloud platforms like AWS to build robust applications.



Apache Spark is one of the most sought-after big data frameworks in the modern world, and Amazon EMR undoubtedly provides an efficient means to manage applications built on Spark. In this blog post, we learned about the memory issues in an Apache Spark application and the measures taken to prevent them.

If you are willing to learn more about Big Data or Data Science in general, follow the blogs and courses of Dimensionless.

Follow this link, if you are looking to learn more about data science online!

Additionally, if you are interested in learning Data Science, take our online Data Science course to boost your career in Data Science.

Furthermore, if you want to read more about data science, you can read our blogs here

Different Ways to Manage Apache Spark Applications on Amazon EMR



Technology has been advancing rapidly in the last few years, and so has the amount of data being generated. There are a plethora of sources generating unstructured data that carries a huge amount of information if mined correctly. This variety of voluminous data is known as Big Data, which traditional computers and storage systems are incapable of handling.

To mine big data, the concept of parallel computing on clusters came into place, popularly known as Hadoop. Hadoop has several components which not only store the data in the form of clusters but also process it in parallel. HDFS, the Hadoop storage file system, stores the big data, while the data is processed using the MapReduce technique.

However, most applications nowadays generate data in real time, which requires real-time analysis. Hadoop doesn't allow real-time data storage or analysis, as data in Hadoop is processed in batches. To resolve this, Apache introduced Spark, which is faster than Hadoop and allows data to be processed in real time. More and more companies have since transitioned from Hadoop to Spark as their applications depend on real-time data analytics. You can also perform Machine Learning operations on Spark using the MLlib library.

Computation in Spark is done in memory, unlike Hadoop, which relies on disk. Spark offers an elegant and expressive development API that allows fast and efficient SQL and ML operations on iterative datasets. Because Spark runs on Apache Hadoop YARN, applications can be created everywhere and the power of Spark exploited, deriving insights and enriching data science workloads within a single dataset in Hadoop.

The Hadoop YARN-based architecture provides the foundation that lets Spark and other applications share a common cluster while maintaining consistent service and response. Spark is now one of the many data access engines that work with YARN in HDP. Apache Spark consists of Spark Core and a set of libraries.

The abstractions in Spark make data science easier. Machine learning is a technique where algorithms learn from data, and Spark speeds up data processing by caching the dataset, which is ideal for implementing such iterative algorithms. Spark's Machine Learning Pipeline API provides a high-level abstraction for modelling an entire data science workflow. Abstractions like Transformer and Estimator, provided by Spark's machine learning pipeline package, increase the productivity of a data scientist.

So far we have discussed Big Data and how it can be processed using Apache Spark. However, to run Apache Spark applications, proper infrastructure needs to be in place, and Amazon EMR provides a platform to manage applications built on Apache Spark.


Managing Spark Applications on Amazon EMR




Amazon EMR is one of the most popular cloud-based solutions to extract and analyze huge volumes of data from a variety of sources. On AWS, frameworks such as Apache Hadoop and Apache Spark can be run with the help of Amazon EMR. In a matter of minutes, organizations can spin up a cluster with multiple instances using Amazon EMR. Through parallel processing, it enables various data engineering and business intelligence workloads to be processed, reducing the effort, cost, and time of the data processing involved in setting up the cluster.

As Apache Spark is a fast, open-source framework, it is used for processing big data. To reduce I/O, Apache Spark performs parallel computing in memory across nodes, and it therefore relies heavily on cluster memory (RAM). To run a Spark application on Amazon EMR, the following steps need to be performed:

  • Upload the Spark application package to Amazon S3.
  • Configure and launch the Amazon EMR cluster with Apache Spark.
  • Install the application package from Amazon S3 onto the cluster and run the application.
  • Terminate the cluster after the application completes.


For successful operation, the Spark application needs to be configured based on the data and processing requirements. There can be memory issues if Spark runs with the default settings. Below are some of the memory errors that occur while running Apache Spark on Amazon EMR with a default configuration:

  • An out-of-memory error when the Java heap space is exhausted – java.lang.OutOfMemoryError: Java heap space
  • An out-of-memory error when the physical memory is exceeded – ExecutorLostFailure Reason: Container killed by YARN for exceeding limits
  • If the virtual memory limit is exceeded, you also get an out-of-memory error.
  • The executor memory also gives an out-of-memory error if it is exceeded.


Some of the reasons why these issues occur are –

  • Inappropriate settings for the number of cores, the executor memory, or the number of Spark executor instances while handling large volumes of data.
  • The Spark executor's physical memory exceeding the memory allocated by YARN. In such cases, the Spark executor memory plus the overhead together are not enough to handle memory-intensive operations.
  • Not enough memory in the Spark executor instance to handle operations such as garbage collection.


Below are the ways in which Apache Spark can be successfully configured and maintained on Amazon EMR.

Based on the needs of the application, the number and type of instances should be determined. There are three types of nodes in Amazon EMR:

  • The master node, which acts as the resource manager.
  • The core nodes, managed by the master, which execute tasks and manage storage.
  • The task nodes, which only perform tasks and provide no storage.

The right instance type should be chosen based on whether the application is memory-intensive or compute-intensive: R-type instances are preferred for memory-intensive applications, while C-type instances are preferred for compute-intensive ones. For each node type, the number of instances is decided after the instance type is chosen; the number depends on the frequency requirements, the execution time of the application, and the input dataset size.

Next, the Spark configuration parameters need to be determined. Below is a diagram representing the executor container memory.


[Diagram: Spark executor container memory layout – source: Amazon Web Services]


There are multiple memory compartments in the executor container; however, only one is used for task execution, and these need to be configured properly for tasks to run seamlessly.

Based on the task and core instance types, values for the Spark parameters are set automatically in spark-defaults. The maximizeResourceAllocation option can be set to true to use all the resources in the cluster. Spark on YARN can also dynamically scale the number of executors based on the workload. In most cases, using the right number of executors in an application requires tuning the sub-properties, which takes a lot of trial and error, and memory is often wasted when that tuning is not right.
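A sketch of the EMR configuration JSON that enables this option (passed with --configurations at cluster creation; treating the exact file layout as an assumption):

```json
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
```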

  • Memory should be cleared effectively by implementing a proper garbage collector. In certain cases, out-of-memory errors occur due to garbage collection, especially when there are multiple RDDs in the application, or when there is interference between the RDD cache memory and the task memory. Multiple garbage collectors can be used to replace old objects with new ones in memory; the latency limitations of the older collectors are overcome by the newer Garbage First Garbage Collector (G1GC).
  • The configuration parameters of YARN should be set. Because the operating system bumps up virtual memory aggressively, virtual out-of-memory errors can still occur even if all Spark properties are configured correctly. The virtual memory and physical memory check flags should be set to false to prevent such application failures.
  • Monitoring and debugging should be performed. Run spark-submit with the --verbose option to see the Spark configuration details, and use the Spark UI and Ganglia to monitor network I/O and application progress. As an illustration, a Spark application can process ten terabytes of data successfully if configured with around 170 executor instances, 37 GB of memory per executor, eight terabytes of total RAM, five virtual CPUs per executor, 12xlarge master and core nodes, and a parallelism of 1700.



Apache Spark is being used by most industries these days, and thus building a flawless application using Spark is a necessity that can help businesses in their day-to-day activities.

Amazon EMR is one of the most popular cloud-based solutions to extract and analyse huge volumes of data from a variety of sources. On AWS, frameworks such as Apache Hadoop and Apache Spark can be run with the help of Amazon EMR. This blog post covered various memory errors, their causes, and how to prevent them when running Spark applications on Amazon EMR.

Dimensionless has several blogs and training to get started with Python, and Data Science in general.

Follow this link, if you are looking to learn more about data science online!

Additionally, if you are interested in AWS and Big Data, learn our AWS Big Data course online.

Furthermore, if you want to read more about data science, you can read our blogs here

Also Read:

What is Map Reduce Programming and How Does it Work

How to Visualize AWS Cost and Usage Data Using Amazon Athena and QuickSight



What is Map Reduce Programming and How Does it Work



Data Science is the study of extracting meaningful insights from data using various tools and techniques for the growth of a business. Despite its inception at the time when computers first came into the picture, the recent hype is a result of the huge amount of unstructured data being generated and the unprecedented computational capacity that modern computers possess.

However, there is a lot of misconception among the masses about the true meaning of this field, with many of the opinion that it is about predicting future outcomes from data. Though predictive analytics is a part of Data Science, it is certainly not all of what Data Science stands for. In an analytics project, the first and foremost task is to build the pipeline and get the relevant data, so that predictive analytics can be performed later on. The professional responsible for building such ETL pipelines and creating the systems for flawless data flow is the Data Engineer, and this field is known as Data Engineering.

Over the years, the role of Data Engineers has evolved a lot. Previously it was about building Relational Database Management Systems using Structured Query Language or running ETL jobs. These days, the plethora of unstructured data from a multitude of sources has resulted in the advent of Big Data: different forms of voluminous data which carry a lot of information if mined properly.

Now, the biggest challenge professionals face is analysing these huge terabytes of data, which traditional file storage systems are incapable of handling. This problem was resolved by Hadoop, an open-source Apache framework built to process large data in the form of clusters. Hadoop has several components that take care of the data, and one such component is known as MapReduce.


What is Hadoop?

Created by Doug Cutting and Mike Cafarella in 2006, Hadoop facilitates distributed storage and processing of huge data sets in the form of parallel clusters. HDFS, or Hadoop Distributed File System, is the storage component of Hadoop, where different file formats can be stored to be processed using the MapReduce programming which we cover later in this article.

HDFS runs on large clusters and follows a master/slave architecture. The metadata of each file, i.e., information about the relative position of the file in the nodes, is managed by the NameNode, which is the master; several DataNodes store the data. Some of the other components of Hadoop are:

  • YARN – It manages the resources and performs job scheduling.
  • Hive – It allows users to write SQL-like queries to analyse the data.
  • Sqoop – Used for to-and-fro structured data transfer between the Hadoop Distributed File System and Relational Database Management Systems.
  • Flume – Similar to Sqoop, but it facilitates the transfer of unstructured and semi-structured data between HDFS and the source.
  • Kafka – A messaging platform commonly used with Hadoop.
  • Mahout – It is used to run Machine Learning operations on big data.

Hadoop is a vast concept, and a detailed explanation of each component is beyond the scope of this blog. However, we will dive into one of its components, MapReduce, and understand how it works.


What is Map Reduce Programming

MapReduce is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster: if you write a job using the MapReduce framework and a thousand machines are available, the job can potentially run on all thousand machines.

Big Data is not stored traditionally in HDFS. The data is divided into chunks of small blocks, which are stored in the respective DataNodes. No complete copy of the data is present in one centralized location, so a native client application cannot process the information right away. A framework is therefore needed that can handle the data residing as blocks in the DataNodes, so that the processing can go to the data and bring back the result. In a nutshell, the data is processed in parallel, which makes processing faster.

To improve performance and efficiency, the idea of parallelization was developed: the process is automated and executed concurrently. The fragmented instructions can run on a single machine or on different CPUs. To gain direct disk access, multiple computers use SANs (Storage Area Networks), a common type of clustered file system, unlike distributed file systems, which send the data over the network.

One term that is common in this master/slave architecture of data processing is load balancing, where tasks are spread among the processors to avoid overloading any DataNode. Dynamic balancers provide more flexibility than static balancers.

The MapReduce algorithm operates in three phases: the Map phase, the Shuffle and Sort phase, and the Reduce phase. It was designed to give engineers an abstraction for performing basic computation while hiding the details of fault tolerance, parallelization, and load balancing.

  • Map Phase – In this stage, the input data is mapped into intermediate key-value pairs on all the mappers assigned to the data.
  • Shuffle and Sort Phase – This phase acts as a bridge between the Map and Reduce phases to decrease the computation time. The data is shuffled and sorted simultaneously based on the keys, i.e., all intermediate values from the mapper phase are grouped together with respect to their keys and passed on to the Reduce function.
  • Reduce Phase – The sorted data is the input to the Reducer, which aggregates the values corresponding to each key and produces the desired output.
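The three phases can be sketched in miniature with a single-process word count. This is a toy illustration; a real MapReduce job distributes these functions across many workers:

```python
from itertools import groupby

def map_phase(document):
    # Map: emit an intermediate <word, 1> pair for every word.
    return [(word, 1) for word in document.split()]

def shuffle_and_sort(pairs):
    # Shuffle and sort: sort by key, then group all values per key.
    pairs.sort(key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

def reduce_phase(grouped):
    # Reduce: aggregate the values corresponding to each key.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big insights", "big clusters"]
intermediate = [pair for doc in docs for pair in map_phase(doc)]
result = reduce_phase(shuffle_and_sort(intermediate))
print(result)  # {'big': 3, 'clusters': 1, 'data': 1, 'insights': 1}
```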


How Map Reduce works

  • The Map invocations are distributed across multiple machines, and the input data is automatically partitioned into M pieces of sixteen to sixty-four megabytes each. Many copies of the program are then started up on a cluster of machines.
  • Among the copies, one is the master while the rest are workers. The master assigns the M map and R reduce tasks to the workers; any idle worker is assigned a task by the master.
  • A map worker reads the contents of its input split and passes key-value pairs to the user-defined Map function. The intermediate key-value pairs are buffered in memory.
  • The buffered pairs are periodically written to local disk, where the partitioning function divides them into R regions. The master forwards the locations of the buffered key-value pairs to the reduce workers.
  • The reduce workers read the buffered data after getting the locations from the master. Once read, the data is sorted on the intermediate keys, grouping identical keys together.
  • The user-defined Reduce function receives each unique intermediate key it encounters along with the corresponding set of intermediate values. The final output files consist of the appended output from the Reduce function.
  • The master wakes up the user program once all the Map and Reduce tasks are completed. The output of a successful MapReduce execution can be found in the R output files.
  • The master checks every worker's liveness after assignment by sending periodic pings. If a worker does not respond to the pings after a certain point in time, it is marked as failed and its in-progress work is reset.
  • In case of failure, completed map tasks are re-executed, as their output resides on the failed machine's local disk and becomes inaccessible. Completed reduce tasks, whose output is stored in the global file system, need not be re-executed.


Some examples of MapReduce programming are:

  • MapReduce programming can count the frequencies of URL accesses. The web page logs are processed by the Map function and emitted as pairs such as <URL, 1>, which the Reduce function processes by adding the counts for each URL and outputting the totals.
  • MapReduce programming can also be used to parse documents and count the number of words in each document.
  • For a given URL, the list of all associated source URLs can be obtained with the help of MapReduce.
  • MapReduce programming can be used to calculate a per-host term vector. The Map function creates a hostname and term-vector pair for each document; the Reduce function then removes the less frequent terms and emits a final hostname and term-vector pair.



Data Engineering is a key step in any Data Science project, and MapReduce is undoubtedly an essential part of it. In this article we gave a brief intuition about Big Data, provided an overview of Hadoop, explained MapReduce programming and its workflow, and gave a few real-life applications of MapReduce programming as well.

Dimensionless has several blogs and training to get started with Python, and Data Science in general.

Follow this link, if you are looking to learn more about data science online!

Additionally, if you are interested in learning Data Science, take our online Data Science course to boost your career in Data Science.

Furthermore, if you want to read more about data science, you can read our blogs here

Top 50 AWS Interview Questions and Answers with Dimensionless

Top 50 AWS Interview Questions and Answers with Dimensionless


Launched back in 2006, AWS has succeeded in becoming the leading provider of on-demand cloud computing services, securing a staggering 32% of the cloud computing market share as of the last quarter of 2018.

Every aspiring developer looking to make it big in the cloud computing ecosphere must have a stronghold on AWS. If you're eyeing the role of an AWS Developer, then these important AWS interview questions will help you take a step further towards your desired job. So let us kickstart your AWS learning with Dimensionless!

AWS Interview Questions with Answers

1. What is AWS?

AWS stands for Amazon Web Services; it is a collection of remote computing services, also known as cloud computing platforms. This realm of cloud computing is also recognized as IaaS, or Infrastructure as a Service.

2. What are the Key Components of AWS?

The fundamental elements of AWS are

  • Route 53: A DNS web service
  • Simple Email Service: It permits sending email using a RESTful API request or through normal SMTP
  • Identity and Access Management: It provides heightened security and identity control for your AWS account
  • Simple Storage Service (S3): It is a storage service and the most widely utilized AWS service
  • Elastic Compute Cloud (EC2): It affords on-demand computing resources for hosting purposes, and is extremely valuable for variable workloads
  • Elastic Block Store (EBS): It provides persistent storage volumes that attach to EC2, enabling data to persist beyond the lifespan of a single EC2 instance
  • CloudWatch: It permits administrators to observe and collect key metrics for AWS resources; additionally, one can produce a notification alert in a state of crisis


3. What is S3?

S3 stands for Simple Storage Service. You can use the S3 interface to store and retrieve any amount of data, at any time and from anywhere on the web. For S3, the payment model is "pay as you go".

4. What is the Importance of Buffer in Amazon Web Services?

An Elastic Load Balancer ensures that incoming traffic is distributed optimally across various AWS instances. A buffer synchronizes different components and makes the arrangement more elastic to a burst of load or traffic. Without it, components tend to receive and process requests at unstable rates. The buffer creates equilibrium between the components and makes them work at the same rate to supply faster services.
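To make the buffering idea concrete, here is a small, self-contained Python sketch in which a thread-safe queue stands in for a managed buffer such as Amazon SQS (the queue and names are illustrative, not an AWS API):

```python
import queue
import threading

# A bounded queue stands in for a buffer such as Amazon SQS:
# producers can burst, while the consumer drains at its own steady rate.
buffer = queue.Queue(maxsize=100)

def producer(n_requests):
    # Simulates a component receiving a burst of requests.
    for i in range(n_requests):
        buffer.put(f"request-{i}")  # blocks if the buffer is full

def consumer(results):
    # Simulates a component processing requests at its own pace.
    while True:
        item = buffer.get()
        if item is None:          # sentinel: no more work
            break
        results.append(item)

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer(5)
buffer.put(None)
t.join()
print(len(results))  # all 5 buffered requests were processed
```

Because the queue absorbs the burst, the producer and consumer never need to run at the same speed, which is exactly the equilibrium the answer describes.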

5. What Does an AMI Include?

An AMI comprises the following elements:

  1. A template for the root volume of the instance
  2. Launch permissions that determine which AWS accounts can use the AMI to launch instances
  3. A block device mapping that defines the volumes to attach to the instance when it is launched.

6. How Can You Send the Request to Amazon S3?

Amazon S3 is a REST service, so you can send requests by using the REST API directly or the AWS SDK wrapper libraries that wrap the underlying Amazon S3 REST API.
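For illustration, here is a sketch of the virtual-hosted-style URL such a REST request targets. The bucket, key, and region names are hypothetical, and a real request would additionally need an AWS Signature Version 4 Authorization header, which the SDKs compute for you:

```python
# Sketch of the REST endpoint an S3 SDK ultimately calls.
# Bucket and key names here are hypothetical placeholders; a real
# request must also carry a Signature Version 4 "Authorization" header.

def s3_object_url(bucket, key, region="us-east-1"):
    """Virtual-hosted-style URL for a GET/PUT on an S3 object."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

url = s3_object_url("my-example-bucket", "reports/2019/usage.csv")
print(url)
```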

7. How many Buckets can you Create in AWS by Default?

By default, you can create up to 100 buckets in each of your AWS accounts.

8. List the Components Required to Build Amazon VPC?

Subnet, Internet Gateway, NAT Gateway, HW VPN Connection, Virtual Private Gateway, Customer Gateway, Router, Peering Connection, VPC Endpoint for S3, Egress-only Internet Gateway.

9. What is the Way to Secure Data for Carrying in the Cloud?

It must be ensured that no one can intercept the information in the cloud while data is moving from one point to another, and also that there is no leakage of the security keys from any of the storage locations in the cloud. Segregating your information from other companies’ information and then encrypting it by means of approved methods is one of the options.

10. Name the Several Layers of Cloud Computing?

Here is the list of the layers of cloud computing:

PaaS — Platform as a Service
IaaS — Infrastructure as a Service
SaaS — Software as a Service

11. Explain: Can You Vertically Scale an Amazon Instance? How?

  • Yes, you can vertically scale an Amazon instance. To do so:
  • Spin up a new, larger instance than the one you are currently running
  • Pause that instance and detach the root EBS volume from the old server
  • Then stop your existing instance and detach its root volume
  • Note the unique device ID, attach that root volume to your new server, and start it again

12. What are the Components Involved in Amazon Web Services?

There are 4 components involved, described below.

Amazon S3: with this, one can retrieve the key information involved in creating the cloud architecture, and the amount of produced data can also be stored in this component.

Amazon EC2: helpful to run a large distributed system on the Hadoop cluster. Automatic parallelization and job scheduling can be achieved with this component.

Amazon SQS: this component acts as a mediator between different controllers. It is also used for buffering the requests received by the managing component.

Amazon SimpleDB: helps in storing the intermediate status logs and the tasks executed by the users.

13. What is Lambda@Edge in AWS?

  • In AWS, the Lambda@Edge utility is used to reduce network latency for end users.
  • In Lambda@Edge there is no need to provision or manage servers. We can just upload our Node.js code to AWS Lambda and create functions that will be triggered on CloudFront requests.
  • When a request for content is received by CloudFront edge location, the Lambda code is ready to execute.
  • This is a very good option for scaling up the operations in CloudFront without managing servers.
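As a sketch, a minimal viewer-request handler might look like the following. Python is also a supported Lambda@Edge runtime (the article mentions Node.js); the event shape follows CloudFront's documented `Records[0].cf.request` structure, and the redirect rule is a made-up example:

```python
# Minimal Lambda@Edge viewer-request handler sketch. The event shape
# follows CloudFront's documented structure: the request lives under
# Records[0].cf.request. The "/old/" redirect rule is purely illustrative.

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    # Example transformation: redirect legacy paths at the edge,
    # without the request ever reaching the origin.
    if request["uri"].startswith("/old/"):
        return {
            "status": "301",
            "statusDescription": "Moved Permanently",
            "headers": {
                "location": [{"key": "Location",
                              "value": request["uri"].replace("/old/", "/new/", 1)}]
            },
        }
    return request  # pass the request through to CloudFront unchanged

# Local smoke test with a hand-built event
event = {"Records": [{"cf": {"request": {"uri": "/old/index.html"}}}]}
result = handler(event, None)
print(result["status"])
```

Returning a response dict short-circuits the request at the edge location; returning the request object lets CloudFront continue processing it normally.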

14. Distinguish Between Scalability and Flexibility?

The ability of any system to increase the workload it handles on its current hardware resources, to cope with variability in demand, is known as scalability. The ability of a system to increase the workload it handles on its current and additional hardware resources is known as flexibility, enabling the business to meet demand without investing in infrastructure up front. AWS has several configuration management solutions for AWS scalability, flexibility, availability, and management.

15. Name the Various Layers of the Cloud Architecture?

There are 5 layers and are listed below

  • CC- Cluster Controller
  • SC- Storage Controller
  • CLC- Cloud Controller
  • Walrus
  • NC- Node Controller

16. What is the Difference Between Azure and AWS?

AWS and Azure are both cloud computing platforms used to build and host applications. Azure has helped many companies by offering a platform as a service (PaaS). … Storage: AWS has temporary storage that is assigned when an instance is started and destroyed when the instance is terminated.

17. Explain: What are T2 Instances?

T2 instances are designed to provide a moderate baseline performance and the ability to burst to higher performance as required by the workload.

18. In VPC with Private and Public Subnets, Database Servers should ideally be launched into which Subnet?

Of the private and public subnets in a VPC, database servers should ideally be launched into the private subnet.

19. What is AWS SageMaker?

Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don’t have to manage servers. It also provides common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment.

20. While Connecting to your Instance What are the Possible Connection Issues one Might Face?

The possible connection issues one might face while connecting to instances are:

  • Connection timed out
  • User key not recognized by the server
  • Host key not found, permission denied
  • An unprotected private key file
  • Server refused our key, or no supported authentication methods available
  • Error using MindTerm on the Safari browser
  • Error using the Mac OS X RDP client

21. Explain Elastic Block Storage? What Type of Performance can you Expect? How do you Back it Up? How do you Improve Performance?

EBS is RAID storage to begin with, so it’s redundant and fault tolerant. If disks die in the RAID you don’t lose data. Excellent! It is also virtualized, so you can provision and allocate storage, and attach it to your server with various API calls. No calling the storage specialist and asking him or her to run specific commands from the hardware vendor.

Performance on EBS can exhibit variability. That means it can run above the SLA performance level, then suddenly drop below it. The SLA gives you an average disk I/O rate you can expect. That can frustrate some groups, particularly performance specialists who expect stable and consistent disk throughput on a server. Traditional physically hosted servers behave that way. Virtual AWS instances do not.

Back up EBS volumes by using the snapshot facility via an API call or through a GUI interface like ElasticFox.

Improve performance by using Linux software RAID and striping across four volumes.

22. Which Automation Gears can Help with Spinup Services?

The API tools can be used for spin-up services and also for writing scripts. Those scripts could be coded in Perl, bash, or another language of your preference. There is one more option: configuration management and provisioning tools such as Puppet or its improved descendant Chef. A tool called Scalr can also be used, and finally you can go with a managed solution like RightScale.

23. What is an Ami? How Do I Build One?

AMI stands for Amazon Machine Image. It is effectively a snapshot of the root filesystem. Commodity hardware servers have a BIOS that points to the master boot record of the first block on a disk. A disk image, though, can sit anywhere physically on a disk, so Linux can boot from an arbitrary location on the EBS storage network.

Build a new AMI by first spinning up an instance from a trusted AMI, then adding packages and components as required. Be wary of putting sensitive data on an AMI. For instance, your access credentials should be added to an instance after spinup. With a database, mounting an external volume that holds your MySQL data after spinup is also advisable.

24. What are the Main Features of Amazon Cloud Front?

Some of the main features of Amazon CloudFront are as follows:

  • Device detection
  • Protocol detection
  • Geo targeting
  • Cache behavior
  • Cross-origin resource sharing
  • Multiple origin servers
  • HTTP cookies
  • Query string parameters
  • Custom SSL

25. What is the Relation Between an Instance and Ami?

An AMI (Amazon Machine Image) is basically a template consisting of a software configuration, for example an OS, applications, and an application server. When you launch an instance, a copy of the AMI runs as a virtual server in the cloud.

26. What is Amazon Ec2 Service?

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable (scalable) computing capacity in the cloud. You can use Amazon EC2 to launch as many virtual servers as you need. In Amazon EC2 you can configure security and networking as well as manage storage. The Amazon EC2 service also helps in obtaining and configuring capacity with minimal friction.

27. What are the Features of the Amazon Ec2 Service?

As the Amazon EC2 service is a cloud service so it has all the cloud features. Amazon EC2 provides the following features:

  • The virtual computing environment (known as instances)
  • Pre-configured templates for your instances (known as Amazon Machine Images — AMIs)
  • Amazon Machine Images (AMIs) is a complete package that you need for your server (including the operating system and additional software)
  • Amazon EC2 provides various configurations of CPU, memory, storage and networking capacity for your instances (known as instance type)
  • Secure login information for your instances using key pairs (AWS stores the public key and you can store the private key in a secure place)
  • Storage volumes for temporary data that are deleted when you stop or terminate your instance (known as instance store volumes)
  • Amazon EC2 provides persistent storage volumes (using Amazon Elastic Block Store — EBS)
  • A firewall that enables you to specify the protocols, ports, and source IP ranges that can reach your instances using security groups
  • Static IP addresses for dynamic cloud computing (known as Elastic IP address)
  • Amazon EC2 provides metadata (known as tags)
  • Amazon EC2 provides virtual networks that are logically isolated from the rest of the AWS cloud, and that you can optionally connect to your own network (known as virtual private clouds — VPCs)

28. What is AWS Kinesis?

Amazon Kinesis Data Streams can collect and process large streams of data records in real time. You can create data-processing applications, known as Kinesis Data Streams applications. A typical Kinesis Data Streams application reads data from a data stream as data records. These applications can use the Kinesis Client Library, and they can run on Amazon EC2 instances. You can send the processed records to dashboards, use them to generate alerts, dynamically change pricing and advertising strategies, or send data to a variety of other AWS services.
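The routing of a record to a shard can be illustrated with a simplified sketch: Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a contiguous slice of that hash space. The even split below is an assumption for illustration (real shard ranges can be arbitrary):

```python
import hashlib

# Simplified illustration of Kinesis shard routing: the partition key is
# MD5-hashed to a 128-bit integer, and each shard owns a contiguous slice
# of that hash range. Shard ranges here are synthetic even splits.

def hash_key(partition_key: str) -> int:
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)

def shard_for(partition_key: str, n_shards: int) -> int:
    # Evenly split the 2**128 hash space among n_shards shards.
    space = 2 ** 128
    return hash_key(partition_key) * n_shards // space

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering.
assert shard_for("user-42", 4) == shard_for("user-42", 4)
print(shard_for("user-42", 4))
```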


30. What are the Different Types of Events Triggered By Amazon Cloud Front?

Different types of events triggered by Amazon CloudFront are as follows:

Viewer Request: When an end user or a client program makes an HTTP/HTTPS request to CloudFront, this event is triggered at the Edge Location closest to the end user.

Viewer Response: When a CloudFront server is ready to respond to a request, this event is triggered.

Origin Request: When CloudFront server does not have the requested object in its cache, the request is forwarded to the Origin server. At this time this event is triggered.

Origin Response: When CloudFront server at an Edge location receives the response from the Origin server, this event is triggered.

31. Explain Storage for Amazon EC2 Instances?

Amazon EC2 provides many data storage options for your instances. Each option has a unique combination of performance and durability. These storages can be used independently or in combination to suit your requirements.

There are mainly four types of storages provided by AWS:

  • Amazon EBS: provides durable, block-level storage volumes that can be attached to a running Amazon EC2 instance. An Amazon EBS volume persists independently from the running life of an Amazon EC2 instance. After an EBS volume is attached to an instance, you can use it like any other physical hard drive. Amazon EBS volumes also support encryption.
  • Amazon EC2 Instance Store: Storage disk that is attached to the host computer is referred to as instance store. The instance storage provides temporary block-level storage for Amazon EC2 instances. The data on an instance store volume persists only during the life of the associated Amazon EC2 instance; if you stop or terminate an instance, any data on instance store volumes is lost.
  • Amazon S3: Amazon S3 provides access to reliable and inexpensive data storage infrastructure. It is designed to make web-scale computing easier by enabling you to store and retrieve any amount of data, at any time, from within Amazon EC2 or anywhere on the web.
  • Adding Storage: Every time you launch an instance from an AMI, a root storage device is created for that instance. The root storage device contains all the information necessary to boot the instance. You can specify storage volumes in addition to the root device volume when you create an AMI or launch an instance using block device mapping.

32. What are the Security Best Practices for Amazon Ec2?

There are several best practices for securing Amazon EC2. Following are a few of them:

  • Use AWS Identity and Access Management (IAM) to control access to your AWS resources.
  • Restrict access by only allowing trusted hosts or networks to access ports on your instance.
  • Review the rules in your security groups regularly, and ensure that you apply the principle of least privilege: only open up the permissions that you require.
  • Disable password-based logins for instances launched from your AMI. Passwords can be found or cracked, and are a security risk.

33. Explain Stopping, Starting, and Terminating an Amazon Ec2 Instance?

Stopping and Starting an instance: When an instance is stopped, the instance performs a normal shutdown and then transitions to a stopped state. All of its Amazon EBS volumes remain attached, and you can start the instance again at a later time. You are not charged for additional instance hours while the instance is in a stopped state.

Terminating an instance: When an instance is terminated, the instance performs a normal shutdown, then the attached Amazon EBS volumes are deleted unless the volume’s deleteOnTermination attribute is set to false. The instance itself is also deleted, and you can’t start the instance again at a later time.
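The deleteOnTermination behaviour is controlled through the block device mapping supplied at launch. A sketch of that parameter shape, as accepted by EC2's RunInstances API (for example via boto3's `run_instances`), with an illustrative device name and volume size:

```python
# Block device mapping that keeps the root EBS volume after termination.
# Device name and sizes are illustrative; "/dev/sda1" is the root device
# for many Linux AMIs.
block_device_mappings = [
    {
        "DeviceName": "/dev/sda1",
        "Ebs": {
            "DeleteOnTermination": False,   # volume survives instance termination
            "VolumeSize": 8,                # GiB
            "VolumeType": "gp2",
        },
    }
]
print(block_device_mappings[0]["Ebs"]["DeleteOnTermination"])
```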

34. What is S3? What is it used for? Should Encryption be Used?

S3 stands for Simple Storage Service. You can think of it like FTP storage, where you can transfer files to and from it, but not mount it like a filesystem. AWS automatically puts your snapshots there, as well as AMIs. Sensitive data should be treated with encryption, as S3 is a proprietary technology developed by Amazon itself and is still unproven from a security standpoint.

35. What is AWS CloudSearch?

Amazon CloudSearch is a managed service in the AWS Cloud that makes it simple and cost-effective to set up, manage, and scale a search solution for your website or application.

Amazon CloudSearch supports 34 languages and popular search features such as highlighting, autocomplete, and geospatial search.

36. What is Qlik Sense Charts?

Qlik Sense Charts is another software as a service (SaaS) offering from Qlik which allows Qlik Sense visualizations to be easily shared on websites and social media. Charts have limited interaction and allow users to explore and discover.

37. Define Auto Scaling?

Auto-scaling is one of the conspicuous features of AWS: it allows you to automatically provision and spin up new instances, without manual intervention, whenever your application needs them. This can be accomplished by setting thresholds and metrics to watch. If those thresholds are crossed, a new instance of your choice will be configured, spun up, and copied into the load balancer pool.

38. Which Automation Gears can Help with Spinup Services?

Spin-up services can be driven through the API tools and written scripts. These scripts could be coded in bash, Perl, or any other language of your choice. There is one more alternative: configuration management and provisioning tools such as Puppet or its improved descendant Chef. A tool called Scalr can likewise be utilized, and ultimately you can proceed with a managed solution like RightScale.

39. Explain what EC2 Instance Metadata is. How does an EC2 instance get its IAM access key and Secret key?

EC2 instance metadata is a service accessible from within EC2 instances, which allows querying or managing data about a given running instance.

It is possible to retrieve an instance’s IAM access key by accessing the iam/security-credentials/role-name metadata category. This returns a temporary set of credentials that the EC2 instance automatically uses for communicating with AWS services.
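A sketch of how code on an instance would query that metadata category follows. The link-local address is the fixed metadata endpoint; the role name `my-role` is hypothetical, and because the actual network call only succeeds from inside an EC2 instance, it is kept behind a function rather than executed here:

```python
import json
import urllib.request

# The instance metadata service lives at a fixed link-local address,
# reachable only from inside an EC2 instance.
METADATA_BASE = "http://169.254.169.254/latest/meta-data"

def credentials_url(role_name: str) -> str:
    return f"{METADATA_BASE}/iam/security-credentials/{role_name}"

def fetch_credentials(role_name: str) -> dict:
    # Returns keys such as AccessKeyId, SecretAccessKey, Token, Expiration.
    # Only works when run on an EC2 instance with that role attached.
    with urllib.request.urlopen(credentials_url(role_name), timeout=2) as resp:
        return json.loads(resp.read())

print(credentials_url("my-role"))
```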

40. What is AWS snowball?

Snowball is a petabyte-scale data transport solution that uses devices designed to be secure to transfer large amounts of data into and out of the AWS Cloud. Using Snowball addresses common challenges with large-scale data transfers, including high network costs, long transfer times, and security concerns. Customers today use Snowball to migrate analytics data, genomics data, video libraries, image repositories, and backups, and as part of data center shutdown, tape replacement, or application migration projects. Transferring data with Snowball is simple, fast, more secure, and can be as little as one-fifth the cost of transferring data via high-speed Internet.

41. Explain in Detail the Function of Amazon Machine Image (AMI)?

An Amazon Machine Image (AMI) is a template that comprises a software configuration (for instance, an operating system, an application server, and applications). From an AMI, we launch an instance, which is a copy of the AMI running as a virtual server in the cloud. We can even launch multiple instances of an AMI.

42. If I’m expending Amazon Cloud Front, can I custom Direct Connect to handover objects from my own data center?

Certainly. Amazon CloudFront supports custom origins, including origins outside of AWS. With AWS Direct Connect, you will be charged the corresponding data transfer rates.

43. If My AWS Direct Connect flops, will I lose my Connection?

If a backup AWS Direct Connect has been configured, then in the event of a failure it will fail over to the second one. It is recommended to enable Bidirectional Forwarding Detection (BFD) when configuring your connections to ensure faster detection and failover. On the other hand, if you have configured a backup IPsec VPN connection instead, all VPC traffic will fail over to the backup VPN connection automatically.

44. What is AWS Certificate Manager?

AWS Certificate Manager (ACM) manages the complexity of creating, provisioning, and managing certificates issued through ACM (ACM Certificates) for your AWS-based websites and applications. You use ACM to request and manage the certificate, and then use other AWS services to provision the ACM Certificate for your website or application. ACM Certificates are currently available for use with only Elastic Load Balancing and Amazon CloudFront. You cannot use ACM Certificates outside of AWS.

45. Explain What is Redshift?

Redshift is a fully managed, fast, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools.

46. Mention What are the Differences Between Amazon S3 and EC2?

S3: Amazon S3 is simply a storage service, typically used to store large binary files. Amazon also has other storage and database services, such as RDS for relational databases and DynamoDB for NoSQL.

EC2: An EC2 instance is like a remote computer running Linux or Windows on which you can install whatever software you need, including a web server running PHP code and a database server.

47. Explain What is C4 Instances?

C4 instances are ideal for compute-bound applications that benefit from high-performance processors.

48. Explain What is DynamoDB in AWS?

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. You can use Amazon DynamoDB to create a database table that can store and retrieve any amount of data and serve any level of request traffic. Amazon DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent and fast performance.

49. Explain What is ElastiCache?

ElastiCache is a web service that makes it easy to set up, manage, and scale distributed in-memory cache environments in the cloud.

50. What is the AWS Key Management Service?

The AWS Key Management Service (AWS KMS) is a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data.


The above questions will give you a fair idea of how to get ready for an AWS interview. You are required to have all the concepts relating to AWS at your fingertips to crack the interview with ease. These questions and answers will boost your confidence level in attending interviews.

Learn AWS Course online with Dimensionless

Additionally, Read our blogs here Data Science Blogs


How to Visualize AWS Cost and Usage Data Using Amazon Athena and QuickSight

How to Discover and Classify Metadata using Apache Atlas on Amazon EMR

Visualize AWS Cost and Usage Data Using Amazon Athena and QuickSight



One of the major reasons organizations migrate to the AWS cloud is to gain the elasticity that can grow and shrink on demand, allowing them to pay only for resources they use. But the freedom to provide on-demand resources can sometimes lead to very high costs if they aren’t carefully monitored. Cost Optimization is one of the five pillars of the AWS Well-Architected Framework, and with good reason. When you optimize your costs, you build a more efficient cloud that helps focus your cloud spend where it’s needed most while freeing up resources to invest in things like more headcount, innovative projects or developing competitive differentiators.

Additionally, with cost optimisation in mind, we will try to optimise our own cost of AWS usage by visualising it with AWS QuickSight. We will look into the complete setup for viewing AWS cost and usage reports. Furthermore, we will implement our goal using S3 and Athena.

What is AWS Cost and Usage Service?

The AWS Cost and Usage report tracks your AWS usage and provides estimated charges associated with your AWS account. The report contains line items for each unique combination of AWS product, usage type, and operation that your AWS account uses. You can customize the AWS Cost and Usage report to aggregate the information either by the hour or by the day. AWS delivers the report files to an Amazon S3 bucket that you specify in your account and updates the report up to three times a day. You can also call the AWS Billing and Cost Management API Reference to create, retrieve, or delete your reports. You can download the report from the Amazon S3 console, upload the report into Amazon Redshift or Amazon QuickSight, or query the report in Amazon S3 using Amazon Athena.

What is AWS QuickSight?

Amazon QuickSight is an Amazon Web Services utility that allows a company to create and analyze visualizations of its customers’ data. The business intelligence service uses AWS’ Super-fast, Parallel, In-memory Calculation Engine (SPICE) to quickly perform data calculations and create graphs. Amazon QuickSight reads data from AWS storage services to provide ad-hoc exploration and analysis in minutes. Amazon QuickSight collects and formats data, moves it to SPICE and visualizes it. By quickly visualizing data, QuickSight removes the need for AWS customers to perform manual Extract, Transform, and Load operations.

Amazon QuickSight pulls and reads data from Amazon Aurora, Amazon Redshift, Amazon Relational Database Service, Amazon Simple Storage Service (S3), Amazon DynamoDB, Amazon Elastic MapReduce, and Amazon Kinesis. The service also integrates with on-premises databases, file uploads, and API-based data sources such as Salesforce. QuickSight allows an end user to upload incremental data in a file or an S3 bucket. The service can also transform unstructured data using a Prepare Data option.

What is AWS Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s fully-managed ETL capabilities to transform data or convert it into columnar formats to optimize cost and improve performance.

Setting up AWS S3 and Cost Service

The very first task requires you to set up an S3 bucket. The S3 bucket is the location where we will be putting our Amazon cost and usage data. Go to your Amazon console and select S3. Click on the create bucket button to initialise the setup.

image result for button click initialization


Once the create bucket menu pops up, you will see the different options to fill. You need to write the bucket name, mention region and select access settings for the bucket in this step.

image result for create bucket

Click on create after filling in all the fields. Open S3 and navigate to the Permissions tab in the console. We need to copy the access policy from here to access this bucket from QuickSight. Furthermore, this policy will help in connecting the bucket with the AWS cost and usage service.

image result for access bucket from quicksight



Click on bucket policy. A JSON file will come up with some default settings. We do not need to change much in this file. You can directly copy the code there.
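For reference, the bucket policy for Cost and Usage Report delivery has roughly this shape (following AWS's documented example): it grants the billing service principal permission to check the bucket and write report objects. `my-cur-bucket` is a placeholder for your own bucket name:

```python
import json

# Bucket policy sketch for Cost and Usage Report delivery. The bucket
# name is a placeholder; the service principal and actions follow AWS's
# documented example policy.
BUCKET = "my-cur-bucket"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Let the billing service verify bucket ownership and policy.
            "Effect": "Allow",
            "Principal": {"Service": "billingreports.amazonaws.com"},
            "Action": ["s3:GetBucketAcl", "s3:GetBucketPolicy"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            # Let the billing service deliver report objects into the bucket.
            "Effect": "Allow",
            "Principal": {"Service": "billingreports.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}
print(json.dumps(policy, indent=2))
```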

image result for JSON File



We have now set up our S3 bucket. Next, we need to create our cost and usage report. Go to the AWS Cost and Usage Reports tab in the console. Click on create report to create a new report on cost and usage.

image result for amazon cost and usage report


After clicking on create report, a form will pop up. Mention all the necessary details here. The form includes the report name and the cost and usage time granularity. You can directly access these reports from Redshift and QuickSight. In this tutorial, we are storing the data in the S3 bucket first. After storing it in S3, we will connect it with AWS QuickSight.

image result for creating report


In the second part, we need to select a delivery option. Mention the name of the final delivery S3 bucket which we created in the previous step.

final delivery in bucket


Fill in the form and click next. With that, we have created a report on AWS cost and billing. Now click on the newly generated report.

image result for report created


We need to set up access policies for the report. Click on create a new policy, and a sample editor will pop up.

image result for report check policy


You can choose to edit the policy depending upon your requirements. Edit the resource section and mention the correct name of your S3 bucket. Click on done to complete the policy initialisation.

image result for create policy in json


Congratulations! Till this part, we have done most of our work. We have an S3 bucket to store cost and usage data. Also, we have set up cost and usage reports to access our S3 bucket and store the results there.

Setting up Athena (Cloud formation template) and Running Queries

Now we need to set up Athena using a CloudFormation template. Go to the CloudFormation console and click on “Create New Stack”. Once you click on create a new stack, a sample popup will come up.

image result for create a stack

Here you need to fill in the form for creating the template. You can choose to select an existing Amazon S3 template or mention a template URL. Once you fill in all the fields, click on next. This will create the Athena stack for you using the CloudFormation template.


You can run a query in the Athena editor to access the cost and usage statistics and view the results.
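The exact query depends on the database and table names the CloudFormation template created in your account; a representative example, with placeholder names (`cur_db`, `cur_table`) and column names following the standard Cost and Usage Report schema, might be:

```python
# Representative Athena query over a Cost and Usage Report table.
# "cur_db"."cur_table" and the month filter are placeholders; the column
# names follow the standard CUR schema used by the Athena integration.
QUERY = """
SELECT line_item_product_code,
       SUM(line_item_blended_cost) AS total_cost
FROM "cur_db"."cur_table"
WHERE month = '12'
GROUP BY line_item_product_code
ORDER BY total_cost DESC;
"""
print(QUERY.strip())
```

This groups blended cost by AWS product, which is the same cost-by-product view that is later charted in QuickSight.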

Setting up Quicksight

Now we have our Athena and S3 set up completely. We need to set up QuickSight next. Go to the QuickSight section and click on setup. There can be cases where you need to enable or sign up for QuickSight again. In case the pop-up below appears, click on sign up to create the QuickSight account.

image result for sign up for quicksight


A sample form like the one below will pop up. Mention the account name, email address, and the services you want to enable for QuickSight. Once you have filled in all the entries, click on Finish to complete the setup.

image result for create a Quicksight account


You can choose to connect your S3 account with QuickSight. In the following popup, a list of already existing buckets will appear. You can select the pre-existing buckets and they will automatically get connected with your QuickSight. Here you can easily connect the bucket which holds your AWS cost and usage report. With the bucket connected, you can easily pull the cost and usage report into QuickSight and analyse it.

[Image: Select bucket]

After setting up QuickSight, a confirmation pop-up appears. Click Next to finish the setup.

[Image: Final setup of account]


Now the basic setup of QuickSight is complete. All that remains is to connect your S3 bucket to QuickSight using Athena. Click the Athena option and run the query against the usage report stored in S3.


[Image: Basic QuickSight setup]


You can then select the column names in the left sidebar panel to plot charts in the right panel. QuickSight is a drag-and-drop visualisation tool: search for a column and QuickSight will show suggested visualisations. Choose a visualisation and drop it onto the canvas on the right.

[Image: Visualisation]


It will automatically plot the charts for you. The image below shows a cost-by-product visualisation of AWS services; it also depicts the cost distribution across the different instances running on AWS.

[Image: Final data set creation]

[Image: Cost analysis after setup]



Data-driven decision making is essential throughout an organization. It is no longer prohibitively expensive to give employees at all levels access to BI. Amazon QuickSight lets you create and publish interactive dashboards that can be accessed from browsers or mobile devices. You can embed dashboards into your applications, providing your customers with powerful self-service analytics. It easily scales to tens of thousands of users without any software to install, servers to deploy, or infrastructure to manage.

QuickSight is an innovative, cloud-hosted BI platform that addresses the shortfalls of traditional BI systems, and its low pay-per-session pricing is a great alternative to the competition. QuickSight can get data from various sources including relational databases, files, streaming, and NoSQL databases. It also comes with an in-memory caching layer that can cache and calculate aggregates on the fly. With QuickSight, data analysts are truly empowered and can build intuitive reports in minutes without any significant setup by IT.

Follow this link, if you are looking to learn data science online!

You can follow this link for our Big Data course, which is a step further into advanced data analysis and processing!

Additionally, if you are interested in learning Data Science, click here to start the Online Data Science Course

Furthermore, if you want to read more about data science, read our Data Science Blogs

How to Discover and Classify Metadata using Apache Atlas on Amazon EMR

What is Data Lake and How to Improve Data Lake Quality 

The Key Difference Between a Data Warehouse and a Data Lake

The Key Difference Between a Data Warehouse and a Data Lake


Enterprises have long relied on BI to help them move their businesses forward. Years ago, translating BI into actionable information required the help of data experts. Today, technology makes BI accessible to people at all levels of an enterprise.

All that BI data needs to live somewhere. The data storage solution you choose for enterprise app development positions your business to access, secure, and use data in different ways. That’s why it’s helpful to understand the basic options, how they’re different, and which use cases are suitable for each.

In this blog, we will look at the key differences between data lakes and data warehouses. We will cover the basics of each and examine how they are implemented in different fields with different tools.

What is a Data Lake?

A data lake is a central location in which you can store all your data, regardless of its source or format. It is typically, although not always, built using Hadoop. The data can be structured or unstructured. You can then use a variety of storage and processing tools — typical tools in the extended Hadoop ecosystem — to extract value quickly and inform key organizational decisions. Because of the growing variety and volume of data, data lakes are an emerging and powerful architectural approach, especially as enterprises turn to mobile, cloud-based applications, and the Internet of Things (IoT) as right-time delivery mediums for big data.

What is a Data Warehouse?

A data warehouse is a large collection of business data used to help an organization make decisions. The concept of the data warehouse has existed since the 1980s when it was developed to help transition data from merely powering operations to fueling decision support systems that reveal business intelligence.

The data in a data warehouse comes from many places: internal applications such as marketing, sales, and finance; customer-facing apps; and external partner systems, among others. On a technical level, a data warehouse periodically pulls data from those apps and systems; the data then goes through formatting and import processes to match the data already in the warehouse. The data warehouse stores this processed data so itʼs ready for decision-makers to access. How frequently data pulls occur, and how the data is formatted, will vary depending on the needs of the organization.
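The periodic pull-format-import cycle described above can be sketched in a few lines. This is a toy illustration, not a real warehouse loader; the source feed, field names, and record shapes are all hypothetical.

```python
# Toy sketch of the warehouse load cycle: pull raw records from a source
# app, reformat them to match the warehouse schema, then import them.
# All field names and the in-memory "warehouse" are purely illustrative.

def transform(record):
    """Reformat a raw source record to match the warehouse schema."""
    return {
        "customer_id": int(record["id"]),
        "amount_usd": round(float(record["amount"]), 2),
        "channel": record.get("channel", "unknown").lower(),
    }

def load(source_records, warehouse):
    """Import transformed records into the (in-memory) warehouse."""
    warehouse.extend(transform(rec) for rec in source_records)
    return warehouse

warehouse = []
sales_feed = [{"id": "101", "amount": "19.991", "channel": "Web"}]
load(sales_feed, warehouse)
print(warehouse[0])  # {'customer_id': 101, 'amount_usd': 19.99, 'channel': 'web'}
```

In a real pipeline the `transform` step is where types are enforced and units normalised, so that everything landing in the warehouse already matches what is there.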


1. Data Types

Data warehouses store structured organizational data such as financial transactions, CRM, and ERP data. Other data sources such as social media, web server logs, and sensor data, not to mention documents and rich media, are generally not stored in a warehouse because they are more difficult to model and their sheer volume makes them expensive and difficult to manage. These types of data are better suited to a data lake.

2. Processing

In a data warehouse, data is organized, defined, and given metadata before it is written and stored. This process is called ʻschema on writeʼ. A data lake, by contrast, consumes everything, including data types considered inappropriate for a data warehouse. Data is stored in raw form, and a schema is applied only when the data is read for analysis, not when it is written to storage. This is called ʻschema on readʼ.
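The contrast can be sketched in a few lines of Python. The helper names and record shapes are purely illustrative: the warehouse path enforces the schema before storing, while the lake path stores the raw record and applies the schema on read.

```python
import json

# Schema on write: shape and validate the record *before* storing it.
def write_to_warehouse(store, record, schema):
    store.append({field: cast(record[field]) for field, cast in schema.items()})

# Schema on read: store the raw data untouched; apply a schema only on read.
def write_to_lake(store, record):
    store.append(json.dumps(record))

def read_from_lake(store, schema):
    for raw in store:
        record = json.loads(raw)
        yield {field: cast(record[field]) for field, cast in schema.items()}

schema = {"user": str, "clicks": int}
event = {"user": "alice", "clicks": "3", "device": "kept only in the lake"}

warehouse, lake = [], []
write_to_warehouse(warehouse, event, schema)  # 'device' dropped at write time
write_to_lake(lake, event)                    # everything retained as-is

print(warehouse[0])                        # {'user': 'alice', 'clicks': 3}
print(next(read_from_lake(lake, schema)))  # {'user': 'alice', 'clicks': 3}
```

Note that the lake kept the `device` field even though the current schema ignores it; a future schema can recover it, which is exactly the agility discussed below.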

3. Storage and Data Retention

Before data can be loaded into a data warehouse, data engineers work hard to analyze it and decide how it will be used for business analysis. They design transformations that summarize and reshape the data to enable the extraction of relevant insights, and they leave out data that doesnʼt answer concrete business questions, since a traditional data warehouse is an expensive and scarce enterprise resource where storage space and performance matter. In a data lake, data retention is simpler, because the lake retains all data: raw, structured, and unstructured. Data is never deleted, permitting analysis of past, current, and future information. Data lakes run on commodity servers using inexpensive storage devices, removing storage limitations.

4. Agility

Data warehouses store historical data. Incoming data conforms to a predefined structure. This is useful for answering specific business questions, such as “what is our revenue and profitability across all 124 stores over the past week”. However, if business questions are evolving, or the business wants to retain all data to enable in-depth analysis, data warehouses are insufficient. The development effort to adapt the data warehouse and ETL process to new business questions is a huge burden. A data lake stores data in its original format, so it is immediately accessible for any type of analysis. Information can be retrieved and reused — a user can apply a formalized schema to the data, store it, and share it with others. If the information is not useful, the copy can be discarded without affecting the data stored in the data lake. All this is done with no development effort.

5. Security, Maturity, and Usage

Data warehouses have been around for two decades and are a secure, enterprise-ready technology. Data lakes are getting there, but are newer and have a shorter enterprise track record. A large enterprise cannot buy and implement a data lake like it would a data warehouse — it must consider which tools to use, open source or commercial, and how to piece them together to meet requirements. The end users of each technology are different: a data warehouse is used by business analysts, who query the data via pre-integrated reporting and BI. Business users cannot use a data lake as easily, because data requires processing and analysis to be useful. Data scientists, data engineers, or sophisticated business users, can extract insights from massive volumes of data in the data lake.

Benefits of Data lakes

1. The Historical Legacy Data Architecture Challenge

Some reasons why data lakes are more popular are historical. Traditional legacy data systems are not that open, to say the least, if you want to start integrating, adding and blending data together to analyze and act. Analytics with traditional data architectures weren’t that obvious nor cheap either (with the need for additional tools, depending on the software). Moreover, they weren’t built with all the new and emerging (external) data sources which we typically see in big data in mind.

2. Faster Big Data Analytics as a Driver of Data Lake Adoption

Another important reason to use data lakes is the fact that big data analytics can be done faster. In fact, data lakes are designed for big data analytics if you want and, more important than ever, for real-time actions based on real-time analytics. Data lakes are fit to leverage big quantities of data in a consistent way with algorithms to drive (real-time) analytics with fast data.

3. Mixing and Converging Data: Structured and Unstructured in One Data Lake

A benefit we more or less already mentioned is the possibility to acquire, blend, integrate and converge all types of data, regardless of sources and format. Hadoop, one of the data lake architectures, can also deal with structured data on top of the main chunk of data: the previously mentioned unstructured data coming from social data, logs and so forth. On a side note: unstructured data is the fastest growing form of all data (even if structured data keeps growing too) and is predicted to reach about 90 percent of all data.

Benefits of Data Warehousing

Organizations that use a data warehouse to support their analytics and business intelligence see a number of substantial benefits:

  1. Better Data
    Adding data sources to a data warehouse lets organizations ensure they are collecting consistent and relevant data from each source. They donʼt need to wonder whether the data will be accessible or inconsistent as it comes into the system. This ensures higher data quality and data integrity for sound decision making.
  2. Faster Decisions
    Data in a warehouse is always kept in consistent, analyzable formats. It also provides the analytical power and the more complete datasets needed to base decisions on hard facts. Decision-makers no longer need to rely on hunches or on incomplete, poor-quality data, and so avoid slow and inaccurate results.

Tools for Data Warehousing

1. Amazon Redshift

Amazon Redshift is an excellent data warehouse product and a critical part of Amazon Web Services, the well-known cloud computing platform. Redshift is a fast, fully managed data warehouse that analyses data using standard SQL and existing BI tools. It is a simple and cost-effective tool that can run complex analytical queries using smart query optimization. It handles analytics workloads on big datasets by combining columnar storage on high-performance disks with massively parallel processing. One of its most powerful features is Redshift Spectrum, which allows users to run queries against data directly in Amazon S3, eliminating the need for loading and transformation. Redshift automatically scales query compute capacity to the data, so queries run fast. Official URL: Amazon Redshift
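The way Spectrum reaches into S3 is by defining an external table whose location points at a bucket. A minimal, hedged sketch follows: it assumes an external schema (here `spectrum_schema`) has already been created, and the bucket name, columns, and file format are placeholders.

```python
# Sketch only: Redshift Spectrum reads S3 through an external table.
# 'spectrum_schema' must already exist as an external schema; the bucket,
# column names, and CSV format below are illustrative placeholders.

def spectrum_external_table_ddl(bucket="my-logs-bucket", prefix="events/"):
    """Return DDL for a Spectrum external table over CSV files in S3."""
    return f"""
        CREATE EXTERNAL TABLE spectrum_schema.events (
            event_time timestamp,
            user_id    varchar(64),
            detail     varchar(4096)
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION 's3://{bucket}/{prefix}';
    """

print(spectrum_external_table_ddl())
```

Once the external table exists, ordinary `SELECT` statements against `spectrum_schema.events` are executed by Spectrum directly over the S3 files, with no load step.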

2. Teradata

Teradata is another market leader when it comes to database services and products. Many competitive enterprise organizations use the Teradata DWH for insights, analytics, and decision making. Teradata DWH is a relational database management system from the Teradata organization, which has two divisions: data analytics and marketing applications. It works on the concept of parallel processing and allows users to analyze data in a simple yet efficient manner. An interesting feature of this data warehouse is its segregation of data into hot and cold tiers, where cold data is the less frequently used data; this has made Teradata one of the leading tools in the market today. Official URL: Teradata

Tools for Data lakes

1. Amazon S3

The Amazon S3-based data lake solution uses Amazon S3 as its primary storage platform. Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability. You can seamlessly and nondisruptively increase storage from gigabytes to petabytes of content, paying only for what you use. Amazon S3 has 99.999999999% durability. It has scalable performance, ease-of-use features, and native encryption and access control capabilities. Amazon S3 integrates with a broad portfolio of AWS and third-party ISV data processing tools.
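A common convention in S3-based data lakes is to partition objects by source and date so that query engines such as Athena, EMR, and Redshift Spectrum can prune partitions instead of scanning everything. The prefix scheme sketched below is one such convention, not a requirement.

```python
from datetime import date

# Build partitioned S3 object keys for a data lake. The raw/<source>/
# year=/month=/day= layout is a widely used convention that lets query
# engines skip irrelevant partitions; adjust it to your own needs.

def lake_key(source, event_date, filename):
    """Return a key like raw/<source>/year=YYYY/month=MM/day=DD/<file>."""
    return (f"raw/{source}/year={event_date.year}"
            f"/month={event_date.month:02d}/day={event_date.day:02d}/{filename}")

print(lake_key("clickstream", date(2019, 6, 12), "part-0.json"))
# raw/clickstream/year=2019/month=06/day=12/part-0.json
```

Writing every object under such a layout from day one keeps the lake queryable later, even for data whose schema you have not yet decided on.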

2. Azure Data lake

Azure Data Lake Storage Gen2 is a highly scalable and cost-effective data lake solution for big data analytics. It combines the power of a high-performance file system with massive scale and economy to help you speed your time to insight. Data Lake Storage Gen2 extends Azure Blob Storage capabilities and can handle analytics workloads, making it one of the most comprehensive data lake solutions available.


So Which is Better? Data Lake or the Data Warehouse? Both! Instead of a Data Lake vs Data Warehouse decision, it might be worthwhile to consider a target state for your enterprise that includes a Data Lake as well as a Data Warehouse. Just like the advanced analytic processes that apply statistical and machine learning techniques on vast amounts of historical data, the Data Warehouse can also take advantage of the Data Lake. Newly modeled facts and slowly changing dimensions can now be loaded with data from the time the Data Lake was built instead of capturing only new changes.

This also takes the pressure off the data architects to create each and every data entity that may or may not be used in the future. They can instead focus on building a Data Warehouse exclusively on current reporting and analytical needs, thereby allowing it to grow naturally.

Follow this link, if you are looking to learn more about data science online!

You can follow this link for our Big Data course!

Additionally, if you are interested in learning Data Science, click here to start the Best Online Data Science Courses

Furthermore, if you want to read more about data science, you can read our blogs here

How to Discover and Classify Metadata using Apache Atlas on Amazon EMR

What is Data Lake and How to Improve Data Lake Quality