Moving your data lake to the cloud has a number of significant benefits, including cost-effectiveness and agility. However, to see these benefits, it’s important to understand how to structure your data lake architecture in the cloud, which is a bit different from a traditional on-premises architecture. Also, moving to a cloud-based data lake or multi-cloud environment can’t (or really shouldn’t) happen all at once; it’s a journey that happens over time. Let’s explore some key benefits as well as the steps you need to consider to achieve a modern data architecture in the cloud.
1. Agile pay-for-use: Dynamic processing
The beauty of the cloud is its agility and flexibility. The cloud makes it possible to pay for just the compute you use. For example, you can start with a 20-node cluster and then easily increase to 100 nodes as your requirements change. You can also scale down as needed. With other pricing models, you pay only for a specific period of time; for example, if you need compute for two hours to run a batch job, you pay for only those two hours.
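To make the pay-for-use arithmetic concrete, here is a minimal sketch in Python. It reuses the 20-node and 100-node cluster sizes and the two-hour job from the example above; the price per node-hour, the `Phase` type, and both function names are hypothetical and do not reflect any provider's actual pricing or API.

```python
from dataclasses import dataclass

# Hypothetical price per node-hour; not a real provider's rate.
RATE_PER_NODE_HOUR = 0.50


@dataclass
class Phase:
    """A stretch of time during which the cluster runs at a fixed size."""
    nodes: int
    hours: float


def pay_for_use_cost(phases, rate=RATE_PER_NODE_HOUR):
    """Cost when billed only for the node-hours actually consumed."""
    return sum(p.nodes * p.hours for p in phases) * rate


def fixed_cluster_cost(phases, rate=RATE_PER_NODE_HOUR):
    """Cost when the cluster is sized for the peak and kept running throughout."""
    peak_nodes = max(p.nodes for p in phases)
    total_hours = sum(p.hours for p in phases)
    return peak_nodes * total_hours * rate


# A day of mostly light load (20 nodes) with a two-hour burst at 100 nodes.
day = [Phase(nodes=20, hours=22), Phase(nodes=100, hours=2)]
elastic = pay_for_use_cost(day)
fixed = fixed_cluster_cost(day)

print(f"pay-for-use:   {elastic:.2f}")
print(f"sized-to-peak: {fixed:.2f}")
```

Under these assumed numbers, the elastic cluster consumes 640 node-hours, while a cluster sized for the peak and left running consumes 2,400. The gap depends entirely on how spiky your workload is, so substitute your own usage profile and rates.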
2. Cost-effective data storage and compute
When it comes to storage and compute, the cloud is different from an on-premises data lake. On-premises, whether your cluster runs Hortonworks, Cloudera or MapR, storage and compute live on the same nodes. In other words, a 100-node cluster both stores the data and performs the compute. In the cloud, storage and compute are separate services. This is because in the cloud, storage is cheap and compute is expensive, so it pays to keep data in low-cost storage and run compute only when you need it. This separation requires slightly different thinking on your part when it comes to your data lake architecture.
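The difference between coupled and separated storage and compute can be sketched as a small sizing model. This is an illustration only: the storage capacity per node, both rates, and the function names are assumptions made up for the example, not figures from any vendor.

```python
import math

# All figures below are hypothetical and chosen only for illustration.
TB_PER_NODE = 10                 # usable storage per on-premises node
STORAGE_RATE_PER_TB_MONTH = 20.0
COMPUTE_RATE_PER_NODE_HOUR = 0.50


def coupled_nodes_needed(data_tb, compute_nodes, tb_per_node=TB_PER_NODE):
    """On-premises: the same nodes hold data and run jobs, so the cluster
    must be large enough for whichever demand is greater."""
    storage_nodes = math.ceil(data_tb / tb_per_node)
    return max(storage_nodes, compute_nodes)


def decoupled_monthly_cost(data_tb, compute_nodes, compute_hours):
    """Cloud: storage and compute are billed as separate line items."""
    return {
        "storage": data_tb * STORAGE_RATE_PER_TB_MONTH,
        "compute": compute_nodes * compute_hours * COMPUTE_RATE_PER_NODE_HOUR,
    }


# The data volume doubles while the processing requirement stays the same.
before = decoupled_monthly_cost(data_tb=200, compute_nodes=20, compute_hours=100)
after = decoupled_monthly_cost(data_tb=400, compute_nodes=20, compute_hours=100)

print("coupled nodes before:", coupled_nodes_needed(200, 20))
print("coupled nodes after: ", coupled_nodes_needed(400, 20))
print("decoupled before:", before)
print("decoupled after: ", after)
```

In the coupled model, doubling the data forces the cluster to grow even though the processing requirement has not changed. In the separated model, only the storage line item grows and the compute bill stays the same.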
3. Infrastructure agility
In addition to on-demand processing, you get on-demand infrastructure with the cloud. You can start small, grow as needed and, if you encounter a scenario where you need to cut back, easily scale down.
4. Most up-to-date technologies
The cycles to refresh and upgrade a data lake can be long because there are so many dependencies and aspects that have to be planned out, including infrastructure, operations, and software. However, many cloud providers are adding services from vendors that make it easier to upgrade without impacting your overall solution. For example, we have clients with cloud-based data lake architectures that were able to upgrade to a new version of Hadoop in a matter of days.