Azure Databricks architecture and its components

Architecture

Dataflow:

  1. Azure Databricks ingests raw streaming data from Azure Event Hubs.
  2. Data Factory loads raw batch data into Data Lake Storage.
  3. For data storage:
    • Data Lake Storage stores data of all types, such as structured, unstructured, and semi-structured. It also stores batch and streaming data.
    • Delta Lake forms the curated layer of the data lake. It stores the refined data in an open-source format.
    • Azure Databricks works well with a medallion architecture that organizes data into layers:
      • Bronze: Holds raw data.
      • Silver: Contains cleaned, filtered data.
      • Gold: Stores aggregated data that’s useful for business analytics.
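The layering above can be illustrated with a minimal sketch in plain Python. This is not Azure Databricks or Delta Lake code: the record fields (`device`, `temp_c`) and the function names are hypothetical, and in practice each layer would typically be a table processed with Spark rather than an in-memory list.

```python
from collections import defaultdict
from statistics import mean

# Bronze: raw records exactly as ingested (hypothetical sensor readings).
bronze = [
    {"device": "a", "temp_c": "21.5"},
    {"device": "a", "temp_c": "bad"},    # malformed reading
    {"device": "b", "temp_c": "19.0"},
    {"device": "a", "temp_c": "22.5"},
    {"device": None, "temp_c": "20.0"},  # missing device id
]


def to_silver(rows):
    """Clean and filter: drop rows with a missing device or an unparsable reading."""
    silver = []
    for row in rows:
        if row["device"] is None:
            continue
        try:
            temp = float(row["temp_c"])
        except ValueError:
            continue
        silver.append({"device": row["device"], "temp_c": temp})
    return silver


def to_gold(rows):
    """Aggregate: average temperature per device, ready for reporting."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["device"]].append(row["temp_c"])
    return {device: mean(temps) for device, temps in grouped.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping the bronze records untouched means the silver and gold layers can be rebuilt at any time if the cleaning or aggregation rules change.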

4. The analytical platform ingests data from the disparate batch and streaming sources. Data scientists use this data for these tasks:

  • Data preparation.
  • Data exploration.
  • Model preparation.
  • Model training.

MLflow manages parameter, metric, and model tracking in data science code runs. The coding possibilities are flexible:

  • Code can be written in SQL, Python, R, or Scala.
  • Code can use popular open-source libraries and frameworks such as Koalas, pandas, and scikit-learn, which are preinstalled and optimized.
  • Practitioners can optimize for performance and cost with single-node and multi-node compute options.
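To make the idea of parameter and metric tracking concrete, here is a small stand-in written in plain Python. It does not use MLflow and does not mirror its API: the `Run` and `Tracker` classes are hypothetical, and the accuracy values are placeholders, not measured results. The sketch only shows what a tracking tool records for each run and how those records let you compare runs.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: its parameters and the metrics it produced."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class Tracker:
    """Hypothetical in-memory stand-in for a tracking server."""

    def __init__(self):
        self.runs = []

    def start_run(self, name, **params):
        """Register a new run together with the parameters it was started with."""
        run = Run(name=name, params=dict(params))
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        """Return the run with the best value for the given metric."""
        candidates = [r for r in self.runs if metric in r.metrics]
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


tracker = Tracker()
for depth, accuracy in [(2, 0.81), (4, 0.88), (8, 0.85)]:
    run = tracker.start_run(f"depth-{depth}", max_depth=depth)
    run.metrics["accuracy"] = accuracy  # placeholder value, not a real result

best = tracker.best_run("accuracy")
print(best.name, best.params)
```

A real tracking service adds persistence, artifacts such as serialized models, and a UI on top of this basic record-per-run idea.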

5. Machine learning models are available in several formats:

  • Azure Databricks stores information about models in the MLflow Model Registry. The registry makes models available through batch, streaming, and REST APIs.
  • The solution can also deploy models to Azure Machine Learning web services or Azure Kubernetes Service (AKS).
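As a rough illustration of the REST option, the sketch below builds (but does not send) an HTTP scoring request using only the Python standard library. Everything specific in it is an assumption: the workspace URL, the endpoint name `churn-model`, the feature names, and the `/serving-endpoints/.../invocations` path with a `dataframe_records` payload. Check the serving documentation for your workspace and MLflow version before relying on any of them.

```python
import json
import urllib.request

# Hypothetical values: replace with your workspace URL, endpoint name, and token.
WORKSPACE_URL = "https://adb-0000000000000000.0.azuredatabricks.net"
ENDPOINT_NAME = "churn-model"
TOKEN = "<access-token>"


def build_scoring_request(records):
    """Build (but do not send) a POST request asking a served model to score rows."""
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url=f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
        data=body,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical feature names, for illustration only.
request = build_scoring_request([{"tenure_months": 12, "monthly_charges": 49.9}])
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the sketch stops short of that so it can run without network access or credentials.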

6. Services that work with the data connect to a single underlying data source to ensure consistency. For instance, users can run SQL queries on the data lake with Azure Databricks SQL Analytics. This service:

  • Provides a query editor, a catalog, query history, basic dashboarding, and alerting.
  • Uses integrated security that includes row-level and column-level permissions.
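The difference between row-level and column-level permissions can be shown with a short, self-contained sketch. It is a conceptual model in plain Python, not how Azure Databricks enforces security: the `orders` table, the `eu_analyst` user, and the `policy` structure are all hypothetical.

```python
# Hypothetical table and policy, for illustration only.
orders = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "card_number": "1111"},
    {"order_id": 2, "region": "US", "amount": 75.5, "card_number": "2222"},
    {"order_id": 3, "region": "EU", "amount": 42.0, "card_number": "3333"},
]

policy = {
    "eu_analyst": {
        # Column-level permission: card_number is not in the allowed set.
        "columns": {"order_id", "region", "amount"},
        # Row-level permission: only rows from the EU region are visible.
        "row_filter": lambda row: row["region"] == "EU",
    },
}


def query(user, rows):
    """Return only the rows and columns that the user's policy allows."""
    rule = policy[user]
    return [
        {col: val for col, val in row.items() if col in rule["columns"]}
        for row in rows
        if rule["row_filter"](row)
    ]


visible = query("eu_analyst", orders)
print(visible)
```

Because the filter is applied where the data is read, every tool that queries through the same service sees the same restricted view.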

7. Power BI generates analytical and historical reports and dashboards from the unified data platform. This service uses these features when working with Azure Databricks:

  • A built-in Azure Databricks connector for visualizing the underlying data.
  • Optimized Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers.

8. Users can export gold data sets out of the data lake into Azure Synapse via the optimized Synapse connector. SQL pools in Azure Synapse provide a data warehousing and compute environment.

9. The solution uses Azure services for collaboration, performance, reliability, governance, and security:

  • Microsoft Purview provides data discovery services, sensitive data classification, and governance insights across the data estate.
  • Azure DevOps offers continuous integration and continuous deployment (CI/CD) and other integrated version control features.
  • Azure Key Vault securely manages secrets, keys, and certificates.
  • Azure Active Directory (Azure AD) provides single sign-on (SSO) for Azure Databricks users. Azure Databricks supports automated user provisioning with Azure AD for these tasks:
    1. Creating new users.
    2. Assigning each user an access level.
    3. Removing users and denying them access.
  • Azure Monitor collects and analyzes Azure resource telemetry. By proactively identifying problems, this service maximizes performance and reliability.
  • Azure Cost Management and Billing provides financial governance services for Azure workloads.

Components of Azure Databricks:

Core components

  • Event Hubs is a big data streaming platform. As a platform as a service (PaaS), this event ingestion service is fully managed.
  • Data Factory is a hybrid data integration service. You can use this fully managed, serverless solution to create, schedule, and orchestrate data transformation workflows.
  • Data Lake Storage is a scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits of throughput. The data may be structured, semi-structured, or unstructured. It typically comes from multiple, heterogeneous sources like logs, files, and media.
  • Azure Databricks SQL Analytics runs queries on data lakes. This service also visualizes data in dashboards.
  • Machine Learning is a cloud-based environment that helps you build, deploy, and manage predictive analytics solutions. With these models, you can forecast behavior, outcomes, and trends.
  • Azure Kubernetes Service (AKS) is a highly available, secure, and fully managed Kubernetes service. AKS makes it easy to deploy and manage containerized applications.
  • Azure Synapse is an analytics service for data warehouses and big data systems. This service integrates with Power BI, Machine Learning, and other Azure services.
  • Azure Synapse connectors provide a way to access Azure Synapse from Azure Databricks. These connectors efficiently transfer large volumes of data between Azure Databricks clusters and Azure Synapse instances.
  • Delta Lake is a storage layer that uses an open file format. This layer runs on top of cloud storage such as Data Lake Storage. Delta Lake supports data versioning, rollback, and transactions for updating, deleting, and merging data.
  • MLflow is an open-source platform for the machine learning lifecycle. Its components monitor machine learning models during training and running. MLflow also stores models and loads them in production.
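The Delta Lake features listed above (versioning, rollback, and transactions that update, delete, and merge data) can be sketched with a toy versioned table. This is a conceptual model only: the `VersionedTable` class is hypothetical, and it keeps a full snapshot per version for simplicity, whereas Delta Lake records changes in a transaction log over data files.

```python
import copy


class VersionedTable:
    """Toy table that keeps one snapshot per committed version."""

    def __init__(self):
        self._versions = [{}]  # version 0: empty table, keyed by row id

    @property
    def version(self):
        """Number of the latest committed version."""
        return len(self._versions) - 1

    def read(self, version=None):
        """Return the rows as of a version (latest by default)."""
        idx = self.version if version is None else version
        return copy.deepcopy(self._versions[idx])

    def _commit(self, snapshot):
        self._versions.append(snapshot)
        return self.version

    def merge(self, rows):
        """Upsert: update rows whose id already exists, insert the rest."""
        snapshot = self.read()
        for row in rows:
            snapshot[row["id"]] = dict(row)
        return self._commit(snapshot)

    def delete(self, predicate):
        """Remove every row for which predicate(row) is true."""
        snapshot = {k: v for k, v in self.read().items() if not predicate(v)}
        return self._commit(snapshot)

    def restore(self, version):
        """Roll back by committing an older snapshot as the newest version."""
        return self._commit(self.read(version))


table = VersionedTable()
v1 = table.merge([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
v2 = table.merge([{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}])
v3 = table.delete(lambda row: row["status"] == "new")
v4 = table.restore(v2)
print(sorted(table.read()), table.version)
```

Note that the rollback does not erase history: restoring to an earlier version is itself recorded as a new version, so the deleted state at `v3` remains readable.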

Reporting and governing components

  • Power BI is a collection of software services and apps. These services create and share reports that connect and visualize unrelated sources of data. Together with Azure Databricks, Power BI can provide root cause determination and raw data analysis.
  • Microsoft Purview manages on-premises, multi-cloud, and software as a service (SaaS) data. This governance service maintains data landscape maps. Features include automated data discovery, sensitive data classification, and data lineage.
  • Azure DevOps is a DevOps orchestration platform. This SaaS provides tools and environments for building, deploying, and collaborating on applications.
  • Azure Key Vault stores and controls access to secrets such as tokens, passwords, and API keys. Key Vault also creates and controls encryption keys and manages security certificates.
  • Azure Active Directory offers cloud-based identity and access management services. These features provide a way for users to sign in and access resources.
  • Azure Monitor collects and analyzes data on environments and Azure resources. This data includes app telemetry, such as performance metrics and activity logs.
  • Azure Cost Management and Billing manages cloud spending. By using budgets and recommendations, this service organizes expenses and shows how to reduce costs.