The Key Difference Between a Data Warehouse and a Data Lake

Introduction

Enterprises have long relied on business intelligence (BI) to help them move their businesses forward. Years ago, translating BI into actionable information required the help of data experts. Today, technology makes BI accessible to people at all levels of an enterprise.

All that BI data needs to live somewhere. The data storage solution you choose for enterprise app development positions your business to access, secure, and use data in different ways. That’s why it’s helpful to understand the basic options, how they’re different, and which use cases are suitable for each.

In this blog, we will look at the key differences between data lakes and data warehouses. We will cover the basics of each, then look at their benefits and some of the tools used to implement them.

What is a Data Lake?

A data lake is a central location in which you can store all your data, regardless of its source or format. It is typically, although not always, built using Hadoop. The data can be structured or unstructured. You can then use a variety of storage and processing tools, typically tools in the extended Hadoop ecosystem, to extract value quickly and inform key organizational decisions. Because of the growing variety and volume of data, data lakes are an emerging and powerful architectural approach, especially as enterprises turn to mobile, cloud-based applications, and the Internet of Things (IoT) as right-time delivery mediums for big data.

What is a Data Warehouse?

A data warehouse is a large collection of business data used to help an organization make decisions. The concept of the data warehouse has existed since the 1980s when it was developed to help transition data from merely powering operations to fueling decision support systems that reveal business intelligence.

A large amount of the data in a data warehouse comes from different places: internal applications for marketing, sales, and finance; customer-facing apps; and external partner systems, among others. On a technical level, a data warehouse periodically pulls data from those apps and systems; the data then goes through formatting and import processes to match the data already in the warehouse. The data warehouse stores this processed data so it's ready for decision-makers to access. How frequently data pulls occur and how data is formatted will vary depending on the needs of the organization.
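The pull, format, and import cycle described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the source names, the field names, and the in-memory list standing in for the warehouse are all hypothetical.

```python
from datetime import date

def extract(sources):
    """Pull raw rows from every source system (here, plain in-memory lists)."""
    for source_name, rows in sources.items():
        for row in rows:
            yield source_name, row

def transform(source_name, row):
    """Format a raw row so it matches the layout already used in the warehouse."""
    return {
        "source": source_name,
        "day": date.fromisoformat(row["date"]),    # text -> date
        "amount": round(float(row["amount"]), 2),  # text -> number
    }

def load(warehouse, records):
    """Import the formatted records into the warehouse table."""
    warehouse.extend(records)

def run_pull(sources, warehouse):
    """One periodic pull: extract, transform, then load."""
    load(warehouse, [transform(name, row) for name, row in extract(sources)])

# Hypothetical source systems, each exporting text fields.
sources = {
    "sales": [{"date": "2019-03-01", "amount": "120.50"}],
    "finance": [{"date": "2019-03-01", "amount": "75"}],
}
warehouse = []
run_pull(sources, warehouse)
```

In a real deployment `run_pull` would be triggered by a scheduler at whatever frequency the organization needs.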

Differences

1. Data Types

Data warehouses store structured organizational data such as financial transactions, CRM data, and ERP data. Other data sources such as social media, web server logs, and sensor data, not to mention documents and rich media, are generally a poor fit because they are more difficult to model, and their sheer volume makes them expensive and difficult to manage. These types of data are more appropriate for a data lake.

2. Processing

In a data warehouse, data is organized and defined, and metadata is applied, before the data is written and stored. We call this process ‘schema on write’. A data lake consumes everything, including data types considered inappropriate for a data warehouse. Data is kept in raw form; a schema is applied when the data is read from storage for analysis, not when it is written. We call this ‘schema on read’.
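The difference between the two approaches can be sketched as follows. This is a toy model under stated assumptions: the `Warehouse` and `Lake` classes, the `SCHEMA` mapping, and the field names are hypothetical and do not correspond to any real product's API.

```python
import json

# Hypothetical schema: field name -> function used to validate and convert it.
SCHEMA = {"store_id": int, "revenue": float}

def apply_schema(record):
    """Coerce a record to SCHEMA, raising if a field is missing or malformed."""
    return {name: cast(record[name]) for name, cast in SCHEMA.items()}

class Warehouse:
    """Schema on write: records are validated before they are stored."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(apply_schema(record))  # bad data is rejected up front

class Lake:
    """Schema on read: raw text is stored as-is and parsed only when queried."""

    def __init__(self):
        self.blobs = []

    def write(self, raw):
        self.blobs.append(raw)  # anything is accepted

    def read(self):
        rows = []
        for raw in self.blobs:
            try:
                rows.append(apply_schema(json.loads(raw)))
            except (ValueError, KeyError, TypeError):
                continue  # does not fit this schema; it stays in the lake
        return rows
```

Note that a blob skipped by `read` is not lost: it remains in `blobs` and can be read later with a different schema.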

3. Storage and Data Retention

Before data can be loaded into a data warehouse, data engineers work hard to analyze the data and decide how it will be used for business analysis. They design transformations to summarize and transform the data to enable the extraction of relevant insights. Data that doesn't answer concrete business questions is left out of the data warehouse in order to reduce storage space and improve performance, because a traditional data warehouse is an expensive and scarce enterprise resource. In a data lake, data retention is less complex, because it retains all data: raw, structured, and unstructured. Data is never deleted, permitting analysis of past and current information, and of information that arrives in the future. Data lakes run on commodity servers using inexpensive storage devices, which eases storage limitations.

4. Agility

Data warehouses store historical data. Incoming data conforms to a predefined structure. This is useful for answering specific business questions, such as “what is our revenue and profitability across all 124 stores over the past week?”. However, if business questions are evolving, or the business wants to retain all data to enable in-depth analysis, data warehouses are insufficient. The development effort to adapt the data warehouse and ETL process to new business questions is a huge burden. A data lake stores data in its original format, so it is immediately accessible for any type of analysis. Information can be retrieved and reused: a user can apply a formalized schema to the data, store it, and share it with others. If the information is not useful, the copy can be discarded without affecting the data stored in the data lake. All this is done with little development effort.

5. Security, Maturity, and Usage

Data warehouses have been around for decades and are a secure, enterprise-ready technology. Data lakes are getting there, but are newer and have a shorter enterprise track record. A large enterprise cannot buy and implement a data lake like it would a data warehouse; it must consider which tools to use, open source or commercial, and how to piece them together to meet requirements. The end users of each technology are different: a data warehouse is used by business analysts, who query the data via pre-integrated reporting and BI tools. Business users cannot use a data lake as easily, because the data requires processing and analysis to be useful. Data scientists, data engineers, and sophisticated business users can extract insights from massive volumes of data in the data lake.

Benefits of Data lakes

1. The Historical Legacy Data Architecture Challenge

Some of the reasons data lakes have become more popular are historical. Traditional legacy data systems are not very open, to say the least, when you want to start integrating, adding, and blending data together to analyze and act on it. Analytics on traditional data architectures was neither straightforward nor cheap, often requiring additional tools depending on the software. Moreover, these systems weren't built with the new and emerging (external) data sources typical of big data in mind.

2. Faster Big Data Analytics as a Driver of Data Lake Adoption

Another important reason to use data lakes is that big data analytics can be done faster. Data lakes are designed for big data analytics and, more important than ever, for real-time actions based on real-time analytics. They are suited to applying algorithms to large quantities of data in a consistent way to drive (real-time) analytics on fast data.

3. Mixing and Converging Data: Structured and Unstructured in One Data Lake

A benefit we have already touched on is the ability to acquire, blend, integrate, and converge all types of data, regardless of source and format. Hadoop, one of the data lake architectures, can also deal with structured data on top of the main chunk of data: the previously mentioned unstructured data coming from social data, logs, and so forth. On a side note, unstructured data is the fastest growing form of all data (even if structured data keeps growing too) and is often predicted to reach about 90 percent of all data.

Benefits of Data Warehousing

Organizations that use a data warehouse to assist their analytics and business intelligence see a number of substantial benefits:

  1. Better Data
    Adding data sources to a data warehouse enables organizations to ensure that they are collecting consistent and relevant data from each source. They don't need to wonder whether the data will be inaccessible or inconsistent as it comes into the system. This ensures higher data quality and data integrity for sound decision making.
  2. Faster Decisions
    Data in a warehouse is always in consistent, analyzable formats. It also provides analytical power and a more complete dataset, so decisions can be based on hard facts. Decision-makers therefore no longer need to rely on hunches, incomplete data, or poor-quality data and risk delivering slow and inaccurate results.

Tools for Data Warehousing

1. Amazon Redshift

Amazon Redshift is a data warehouse product that forms a key part of Amazon Web Services, a widely used cloud computing platform. Redshift is a fast, fully managed data warehouse that analyzes data using standard SQL and existing BI tools. It is a simple and cost-effective tool that runs complex analytical queries using query optimization. It handles analytics workloads on big data sets by using columnar storage on high-performance disks and massively parallel processing. One of its most powerful features is Redshift Spectrum, which allows the user to run queries directly against data stored in Amazon S3, without first loading and transforming it. Spectrum automatically scales query compute capacity depending on the data being queried, so queries run fast.
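To see why columnar storage suits analytical queries, consider this small Python sketch. It is a conceptual illustration only and makes no claim about how Redshift is implemented internally; the table and its column names are made up. When data is laid out by column, an aggregate over one column only has to scan that column's values.

```python
# The same hypothetical table, first as a list of rows.
rows = [
    {"store": "A", "region": "north", "revenue": 100.0},
    {"store": "B", "region": "south", "revenue": 250.0},
    {"store": "C", "region": "north", "revenue": 175.0},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

columns = to_columns(rows)

# Row layout: every full row is visited just to read a single field.
total_from_rows = sum(row["revenue"] for row in rows)

# Column layout: only the "revenue" values are scanned.
total_from_columns = sum(columns["revenue"])
```

Both totals are the same; the difference is how much data has to be read to compute them.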

2. Teradata

Teradata is another market leader in database services and products. Many competitive enterprise organizations use the Teradata data warehouse (DWH) for insights, analytics, and decision making. Teradata DWH is a relational database management system from the Teradata organization, which has two divisions: data analytics and marketing applications. It works on the concept of parallel processing and allows users to analyze data in a simple yet efficient manner. An interesting feature of this data warehouse is its segregation of data into hot and cold data, where hot data is accessed frequently and cold data is accessed less frequently.

Tools for Data lakes

1. Amazon S3

The Amazon S3-based data lake solution uses Amazon S3 as its primary storage platform. Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability. You can seamlessly and nondisruptively increase storage from gigabytes to petabytes of content, paying only for what you use. Amazon S3 is designed for 99.999999999% durability. It has scalable performance, ease-of-use features, and native encryption and access control capabilities. Amazon S3 integrates with a broad portfolio of AWS and third-party ISV data processing tools.
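To put the 99.999999999% durability figure in perspective, here is a short back-of-the-envelope calculation in Python. It assumes the figure is an annual, per-object probability and that objects are lost independently, which is a simplification; the function names are our own.

```python
from decimal import Decimal

# Eleven nines of durability, taken from the figure quoted above.
DURABILITY = Decimal("0.99999999999")

def expected_annual_losses(object_count):
    """Expected number of objects lost per year under the stated assumptions."""
    return object_count * (1 - DURABILITY)

def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)

# Storing ten million objects.
losses = expected_annual_losses(10_000_000)
years = years_per_lost_object(10_000_000)
```

Under these assumptions, ten million stored objects correspond to an expected loss of 0.0001 objects per year, or roughly one object every 10,000 years.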

2. Azure Data Lake

Azure Data Lake Storage Gen2 is a highly scalable and cost-effective data lake solution for big data analytics. It combines the power of a high-performance file system with massive scale and economy to help you speed your time to insight. Data Lake Storage Gen2 extends Azure Blob Storage capabilities and can handle analytics workloads. Microsoft describes Data Lake Storage Gen2 as the most comprehensive data lake available.

Summary

So Which is Better? Data Lake or the Data Warehouse? Both! Instead of a Data Lake vs Data Warehouse decision, it might be worthwhile to consider a target state for your enterprise that includes a Data Lake as well as a Data Warehouse. Just like the advanced analytic processes that apply statistical and machine learning techniques on vast amounts of historical data, the Data Warehouse can also take advantage of the Data Lake. Newly modeled facts and slowly changing dimensions can now be loaded with data from the time the Data Lake was built instead of capturing only new changes.

This also takes the pressure off the data architects to create each and every data entity that may or may not be used in the future. They can instead focus on building a Data Warehouse exclusively on current reporting and analytical needs, thereby allowing it to grow naturally.

What is Cloud Computing & Which is Better, AWS or GCP

Introduction

The advent of Cloud computing has made it possible for many organizations to rapidly scale their current analytics operations with very little maintenance overhead. This has, in turn, created a need to build strategies for migration to the Cloud. In this blog, we will discuss the various factors to consider while evaluating different Cloud technologies.

What is cloud computing?

Cloud Computing is an Information Technology (IT) paradigm that enables ubiquitous access to shared pools of configurable system resources. It also provides higher-level services that can be rapidly provisioned with minimal management effort, often over the Internet. Cloud Computing relies on the sharing of resources to achieve coherence and economy of scale, similar to a utility. By using a Cloud-based solution for computing, organizations can significantly reduce their IT infrastructure costs while focusing on their core business.

Advantages of cloud computing

  1. Scalability
    With the advent of Cloud infrastructure, it has become virtually effortless to scale an organization’s infrastructure up or down. This is due to the infrastructure essentially being the responsibility of the Cloud service provider. The customer only needs to specify the required configuration of the application or service without worrying about procuring the necessary infrastructure.
  2. Reliability
    Since cloud providers handle the infrastructure and its maintenance, any periodic or immediate maintenance activities adhere to the predefined SLA, essentially creating a highly reliable system.
  3. High availability
    Providers generally have servers located in physical locations across the world and ensure highly available data and services through multiple replication strategies.
  4. Reduced operational costs
    When opting for a Cloud vendor, the infrastructure becomes the vendor's responsibility, eliminating most of the cost associated with operations and maintenance for the customer. This can bring the customer's operational cost down substantially.
  5. Increased IT effectiveness
    The IT team is now able to focus on software development without worrying about hardware limitations or maintenance. Building a platform with almost no hardware constraints allows for more robust platform development. It also increases overall effectiveness.

Cloud services providers

  1. Amazon Web Services
    Amazon Web Services, commonly referred to as AWS, was the starting point for the Cloud Computing paradigm with its launch of EC2 compute instances in 2006. AWS services are well documented and integrate seamlessly with one another, at almost zero cost for transferring data between services. AWS is cost-effective and highly scalable, with high availability. Resources can be provisioned and services used both programmatically and through the UI console. AWS comprises more than 90 different services, spanning a wide range of use cases including computing, storage, networking, database, analytics, application services, deployment, management, mobile, developer tools, and tools for machine learning and the Internet of Things.
  2. Google Cloud Platform
    Google Cloud Platform, also known as GCP, is built with power and simplicity in mind. GCP offers services which can seamlessly integrate with other Google products, providing access to a wide range of services in the domain of computing, data storage, data analytics, and machine learning. It also has a wide set of management tools which work on top of these services.

Where is the difference?

As cloud computing continues to find its way into companies big and small, the choice of the right cloud computing solution has become a talking point for specialists and business owners alike. Among public cloud providers, Amazon Web Services (AWS) seems to have the lead in the competition, with Google Cloud and Microsoft Azure close behind.

Let us focus on some key differences between Google Cloud services and AWS. We can compare the two based on the following factors:

  1. Pricing
  2. Features
  3. Implementation
  4. Security
  5. Support

Pricing

Google Cloud and AWS handle billing differently. To be honest, neither of them provides a straightforward way of calculating costs unless you are very familiar with the platform. In general the difference in pricing is not large, but Google Cloud services can turn out to be a tad cheaper in the long run!

Google Cloud comes out ahead on computing and storage costs. For example, at the time of writing, a 2 CPU/8GB RAM instance costs $69/month with AWS, compared to only $52/month with GCP (about 25% cheaper). As for cloud storage, GCP's regional storage costs only 2 cents/GB/month vs 2.3 cents/GB/month for AWS. Additionally, GCP offers a “multi-regional” cloud storage option, where the data is automatically replicated across several regions for very little added cost (a total of 2.6 cents/GB/month).
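The arithmetic behind these comparisons is easy to reproduce. The Python sketch below uses only the prices quoted above, which are a snapshot and will have changed since; the function and constant names are our own.

```python
# Prices quoted in the text above (USD); treat them as a historical snapshot.
AWS_INSTANCE_PER_MONTH = 69.0
GCP_INSTANCE_PER_MONTH = 52.0
STORAGE_PER_GB_MONTH = {
    "aws": 0.023,
    "gcp_regional": 0.020,
    "gcp_multi_regional": 0.026,
}

def percent_cheaper(baseline, alternative):
    """How much cheaper the alternative is, as a percentage of the baseline."""
    return (baseline - alternative) / baseline * 100

def monthly_storage_cost(gigabytes, option):
    """Monthly storage bill for a given volume under one pricing option."""
    return gigabytes * STORAGE_PER_GB_MONTH[option]

instance_saving = percent_cheaper(AWS_INSTANCE_PER_MONTH, GCP_INSTANCE_PER_MONTH)
one_tb_costs = {
    option: round(monthly_storage_cost(1000, option), 2)
    for option in STORAGE_PER_GB_MONTH
}
```

The instance saving works out to just under 25%, and for 1,000 GB the three storage options differ by only a few dollars per month.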

Both providers offer monthly pricing calculators that are useful if you're just starting.

Estimating monthly spend with both of these cloud providers can be a challenge. There are even entire tools out there, such as reOptimize or Cloudability, which were built to help you understand your bills better. AWS offers a billing dashboard that provides insights into your bill, while Google Cloud Platform can export billing data to BigQuery for analysis. Both providers are also taking steps to decrease costs and make billing easier.

Features

In this section, we divide features into the 3 major categories that are most commonly used and outline how Google Cloud and AWS differ in each.

Let us also have a look at the 3 most common services provided by both of them:

Compute: The first category is how Google Compute Engine and AWS EC2 handle their virtual machines (instances). The technology behind Google Cloud's VMs is KVM, whereas AWS EC2 VMs have historically run on Xen, with newer instance types using the KVM-based Nitro hypervisor. Both offer a variety of predefined instance configurations with specific amounts of virtual CPU, RAM, and network. However, they have different naming conventions, which can at first be confusing. Google Compute Engine refers to them as machine types, whereas Amazon EC2 refers to them as instance types.

Storage: One of the most common use cases for public IaaS cloud computing is storage and that’s for good reason: Instead of buying hardware and managing it, users simply upload data to the cloud and pay for how much they put there.

Networking: Google Cloud and AWS both utilize different networks and partners to interconnect their data centres across the globe and deliver content via ISPs to end users. They offer a variety of different products to accomplish this.

Implementation

AWS provides a nice and easy page to start using their services.

You can see that they break it down by the platform you wish to work on, so whether you are making an iOS app, or writing in PHP, they provide some sample code to begin the integration.

Google's equivalent starting point is named ‘Cloud Launcher’.

It likewise provides starting documentation and lists some useful benefits.

Support

Both Google Cloud and AWS have extensive documentation and community forums which you can take advantage of for free.

However, if you need assistance or support right away, you'll have to pay. Both Google Cloud and AWS have support plans, but you'll definitely want to review the fees involved as they can add up quite fast. Both providers include an unlimited number of account and billing support cases, with no long-term contracts.

Google Cloud Premium Support

  • Google offers three different levels of support: Silver, Gold, and Platinum
  • The cheapest support plan, Silver, starts at a $150/month minimum
  • The next level support plan, Gold, starts at a $400/month minimum, but at this level, GCP will bill you a minimum of 9% of product usage fees (decreases as spend increases)

AWS Support

  • AWS offers four different levels of support: Basic, Developer, Business, and Enterprise
  • The cheapest paid support plan, Developer, costs the greater of $29/month or 3% of monthly AWS usage
  • The next level support plan, Business, starts at a $100/month minimum, but at this level, AWS will bill you a minimum of 10% of product usage fees (decreases as spend increases)
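The "monthly minimum or percentage of usage" pricing in the two lists above can be sketched as a small Python function. This is a simplified model built on assumptions: it treats the fee as whichever is greater of the minimum and the percentage, and it does not model the decreasing rates at higher spend because the text does not give them. The plan keys and function name are our own.

```python
# Plan -> (monthly minimum in USD, share of monthly usage fees).
# Figures are the ones quoted above; tiered discounts are not modeled.
PLANS = {
    "aws_developer": (29.0, 0.03),
    "aws_business": (100.0, 0.10),
    "gcp_gold": (400.0, 0.09),
}

def support_fee(plan, monthly_usage):
    """Estimated monthly support fee: the minimum or the usage share, whichever is greater."""
    minimum, rate = PLANS[plan]
    return max(minimum, rate * monthly_usage)

# Estimated fees for a hypothetical $5,000/month cloud bill.
fees_at_5000 = {plan: round(support_fee(plan, 5000), 2) for plan in PLANS}
```

For small bills the flat minimum dominates; once usage grows, the percentage takes over, which is why support fees can add up quickly.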

Security

In their Second Annual Cloud Computing Survey (2017), Clutch surveyed 283 IT professionals at businesses across the United States that currently use a cloud computing service. With regard to security, they found that almost 70% of professionals were more comfortable storing data in the cloud than on their previous legacy systems.

AWS platform security model includes:

  • All the data stored on EC2 instances is encrypted under 256-bit AES. Each encryption key is also encrypted with a set of regularly changed master keys.
  • Network firewalls built into Amazon VPC and web application firewall capabilities in AWS WAF let you create private networks and control access to your instances and applications.
  • AWS Identity and Access Management (IAM), AWS Multi-Factor Authentication, and AWS Directory Services allow for defining, enforcing, and managing user access policies.
  • AWS has audit-friendly service features for PCI, ISO, HIPAA, SOC and other compliance standards.

Google Cloud security model includes:

  • All data stored on persistent disks is encrypted by default under 256-bit AES, and each encryption key is itself encrypted with a set of regularly changed master keys.
  • Commitment to enterprise security certifications (SSAE16, ISO 27017, ISO 27018, PCI, and HIPAA compliance).
  • The Google storage stack requires that requests coming from other components be authenticated and authorized.
  • Google Cloud Identity and Access Management (Cloud IAM) was launched in September 2017 to provide predefined roles that give granular access to specific Google Cloud Platform resources and prevent unwanted access to other resources.

Conclusion

After going through the different aspects and components of cloud services, we can conclude that:

  1. Google Cloud wins on pricing
  2. AWS wins on market share and offerings
  3. Google Cloud wins on instance configuration
  4. GCP wins on the free trial
  5. Google Cloud wins on UX

Stay tuned for more blogs!