Demystifying the Rivian Data Lake Architecture

Srihari Bhupathiraju
3 min read · Feb 2, 2024


I recently came across a technical presentation from AWS re:Invent 2023 detailing how Rivian, an electric vehicle manufacturer, uses a data lake to address its data analytics requirements. Curious, I dug into the video's details, focusing on the data lake component. Each Rivian vehicle generates hundreds of data events, which are processed in both real-time and batch mode to serve diverse, critical business needs. The accompanying image shows Rivian's high-level data lake architecture; below, I try to explain the specific purpose and function of each tool or service depicted in it.

Vehicle Telematics Control Module/Unit (TCM/TCU): An embedded system (essentially a microcontroller) fitted into the vehicle to provide telematics services. It captures metrics such as speed, engine data, and temperature directly from the vehicle and transmits them to cloud services, specifically on the AWS platform. This telemetry is indispensable for understanding vehicle performance across different driving and terrain scenarios, and engineers use it to iteratively improve the vehicle software. Every Rivian vehicle actively emits data, whether stationary or in motion, providing a continuous flow of crucial information into the AWS cloud infrastructure.
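To make this concrete, here is a minimal sketch of what a single telemetry event might look like before it is shipped to the cloud. All field names here are hypothetical illustrations, not Rivian's actual schema:

```python
import json
import time

# Hypothetical telemetry event a TCU might emit; field names are
# illustrative assumptions, not Rivian's actual schema.
event = {
    "vehicle_id": "VIN-0123456789",        # assumed vehicle identifier
    "timestamp": int(time.time() * 1000),  # epoch milliseconds
    "speed_kph": 87.4,
    "battery_temp_c": 31.2,
    "motor_rpm": 6400,
    "gps": {"lat": 37.77, "lon": -122.42},
}

# Serialized to JSON before being shipped to the cloud ingestion layer.
payload = json.dumps(event).encode("utf-8")
print(payload)
```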

Amazon MSK: Amazon Managed Streaming for Apache Kafka (MSK) is used to stream the data events sent by the TCUs in real time. MSK is a native AWS service and a good fit here: it scales easily as the number of vehicles and the data volume grow over time, and it integrates readily with other AWS services such as S3.
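As a minimal sketch (not Rivian's actual ingestion code), producing such an event to an MSK topic from Python could look like this, using the open-source kafka-python client. The broker endpoint and topic name are assumptions, and MSK clusters typically also require TLS/IAM authentication, which is omitted here for brevity:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Assumed MSK bootstrap broker and topic name; replace with your
# cluster's endpoints. Auth/TLS config is omitted for brevity.
producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"vehicle_id": "VIN-0123456789", "speed_kph": 87.4}

# Keying by vehicle_id keeps each vehicle's events ordered within a partition.
producer.send("vehicle-telemetry", key=b"VIN-0123456789", value=event)
producer.flush()
```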

Data Lake with Databricks: The architecture diagram makes it evident that Databricks is Rivian's choice for data lake management, with storage backed by AWS S3. Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale. The Databricks Data Intelligence Platform integrates with the cloud storage and security in your cloud account and manages and deploys cloud infrastructure on your behalf.

  • ETL or ELT follows a medallion architecture, and data is exposed to users as Delta tables. In the medallion architecture, data passes through successive layers (bronze, silver, gold) while undergoing transformations such as cleaning and enrichment.
  • Databricks supports Delta Lake, which uses Parquet data files with a transaction log to provide ACID transactions. This is a great feature for file-based platforms.
  • Structured Streaming is likely handled using Delta Live Tables (DLT), which uses Auto Loader and other out-of-the-box Databricks features to process data in near real time (see the sketch after this list).
  • Aggregated data suited to the business needs is loaded into the gold layer (9b in the diagram).
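Here is a minimal PySpark sketch of the bronze-to-silver hop in such a medallion pipeline, assuming a Databricks environment where `spark` is a predefined session and Auto Loader (`cloudFiles`) and Delta are available. The S3 paths, schema, and table names are illustrative assumptions, not Rivian's actual pipeline:

```python
from pyspark.sql import functions as F

# Bronze: incrementally ingest raw JSON telemetry landed in S3 using
# Auto Loader (Databricks' cloudFiles source). Paths are hypothetical.
bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/telemetry")
    .load("s3://bucket/raw/telemetry/")
)

# Silver: basic cleaning -- drop rows without a vehicle id and convert
# the epoch-millisecond timestamp to a proper timestamp column.
silver = (
    bronze.filter(F.col("vehicle_id").isNotNull())
    .withColumn("event_ts", (F.col("timestamp") / 1000).cast("timestamp"))
)

# Write the cleaned stream to a Delta table; the checkpoint location
# gives the stream exactly-once semantics across restarts.
(silver.writeStream.format("delta")
    .option("checkpointLocation", "s3://bucket/_checkpoints/silver_telemetry")
    .toTable("telemetry_silver"))
```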

Reporting: Once the data has been loaded, cleaned, transformed, and aggregated in the data lake, reporting is done using Amazon Athena, Preset, and Databricks SQL Analytics. Business users generally work with the aggregated data in the data lake rather than the lowest grain, relying on business intelligence reports to identify patterns, anomalies, trends, and performance.
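As a toy example of the kind of aggregated query a BI tool like Athena or Databricks SQL might run against the gold layer (table and column names are hypothetical, and `spark` is assumed to be a Databricks session):

```python
# Hypothetical gold-layer query, as a BI tool might issue it; the table
# and column names are illustrative, not Rivian's actual data model.
daily_stats = spark.sql("""
    SELECT event_date,
           vehicle_model,
           AVG(battery_temp_c) AS avg_battery_temp,
           MAX(speed_kph)      AS max_speed
    FROM   telemetry_gold
    GROUP  BY event_date, vehicle_model
    ORDER  BY event_date DESC
""")
daily_stats.show()
```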

References: AWS re:Invent 2023 presentation, https://www.youtube.com/watch?v=io5w08-WKHI

I would truly appreciate your thoughts and suggestions for improving the accuracy of this article.
