Google Cloud is a cloud computing platform that can be used to build and deploy applications. It allows you to take advantage of the flexibility of development while scaling the infrastructure as needed.
I’m often asked by developers to provide a list of Google Cloud architectures that help to get started on the cloud journey. Last month, I decided to start a mini-series on Twitter called “#13DaysOfGCP”, where I shared the most common use cases on Google Cloud. I have compiled the list of all 13 architectures in this post. Some of the topics covered are hybrid cloud, mobile app backends, microservices, serverless, CI/CD, and more. If you were not able to catch the series, or if you missed a few days, here is the full summary!
Anomaly detection plays a vital role in many industries across the globe, such as fraud detection in the financial industry, health monitoring in hospitals, and fault detection and operating-environment monitoring in the manufacturing, oil and gas, utility, transportation, aviation, and automotive industries.
Anomaly detection is about finding patterns in data that do not conform to expected behavior. It is important for decision-makers to be able to detect such patterns and take proactive action when needed. Using the oil and gas industry as one example, deep-water rigs with various equipment are intensively monitored by hundreds of sensors that send measurements at various frequencies and in various formats. Analysis or visualization is hard using traditional software platforms, and any non-productive time on deep-water oil rig platforms caused by a failure to detect an anomaly could mean large financial losses each day.
Companies need new technologies like Azure IoT, Azure Stream Analytics, Azure Data Explorer and machine learning to ingest, process, and transform data into strategic business intelligence to enhance exploration and production, improve manufacturing efficiency, and ensure safety and environmental protection. These managed services also help customers dramatically reduce software development time, accelerate time to market, provide cost-effectiveness, and achieve high availability and scalability.
While the Azure platform provides lots of options for anomaly detection and customers can choose the technology that best suits their needs, customers also brought questions to field-facing architects about which use cases are most suitable for each solution. We’ll examine the answers to these questions below, but first, you’ll need to know a couple of definitions:
What is a time series? A time series is a series of data points indexed in time order. In the oil and gas industry, most equipment or sensor readings are sequences taken at successive points in time or depth.
What is decomposition of additive time series? Decomposition is the task to separate a time series into components as shown on the graph below.
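To make the idea concrete, here is a simplified, illustrative sketch of additive decomposition in Python, using a centred moving average for the trend and per-position averaging for the seasonal component. This is a teaching sketch, not the algorithm Azure Data Explorer actually uses:

```python
def decompose_additive(series, period):
    """Illustrative additive decomposition: series = trend + seasonal + residual."""
    n = len(series)
    half = period // 2
    # Trend: centred moving average spanning roughly one full period.
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(window) / len(window)
    # Seasonal: average detrended value at each position within the period.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    seasonal = [sum(b) / len(b) if b else 0.0 for b in buckets]
    # Residual: what remains after trend and seasonality are removed.
    residual = [series[i] - trend[i] - seasonal[i % period]
                if trend[i] is not None else None
                for i in range(n)]
    return trend, seasonal, residual

# A flat series decomposes into a flat trend with no seasonality or residual.
trend, seasonal, residual = decompose_additive([10.0] * 12, period=4)
print(trend[2], seasonal, residual[2])  # → 10.0 [0.0, 0.0, 0.0, 0.0] 0.0
```

Real implementations handle even periods and edge effects more carefully; the point here is only the three-way split of the signal.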
Time-series forecasting and anomaly detection
Anomaly detection is the process of identifying observations that differ significantly from the majority of the data.
This is an anomaly detection example with Azure Data Explorer.
The red line is the original time series.
The blue line is the baseline (seasonal + trend) component.
The purple points are anomalous points on top of the original time series.
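As a rough, hypothetical illustration of this baseline-and-residual idea (not Azure Data Explorer's actual implementation), points can be flagged wherever they deviate from the baseline by more than a few standard deviations of the residual:

```python
def flag_anomalies(series, baseline, threshold=2.5):
    # Flag points whose deviation from the baseline exceeds `threshold`
    # standard deviations of the residual (a simple z-score rule).
    residuals = [x - b for x, b in zip(series, baseline)]
    mean = sum(residuals) / len(residuals)
    var = sum((r - mean) ** 2 for r in residuals) / len(residuals)
    std = var ** 0.5 or 1.0  # avoid dividing by zero on a perfect fit
    return [abs(r - mean) / std > threshold for r in residuals]

series = [10, 11, 10, 12, 11, 30, 10, 11]
baseline = [10.5] * len(series)  # a flat baseline, for illustration
print([i for i, hit in enumerate(flag_anomalies(series, baseline)) if hit])  # → [5]
```

Production detectors use more robust statistics, but the shape is the same: decompose, subtract the baseline, then threshold the residual.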
To detect anomalies, either Azure Stream Analytics or Azure Data Explorer can be used for real-time analytics and detection as illustrated in the diagram below.
Azure Stream Analytics is an easy-to-use, real-time analytics service that is designed for mission-critical workloads. You can build an end-to-end serverless streaming pipeline with just a few clicks, go from zero to production in minutes using SQL, or extend it with custom code and built-in machine learning capabilities for more advanced scenarios.
Azure Data Explorer is a fast, fully managed data analytics service for near real-time analysis on large volumes of data streaming from applications, websites, IoT devices, and more. You can ask questions and iteratively explore data on the fly to improve products, enhance customer experiences, monitor devices, boost operations, and quickly identify patterns, anomalies, and trends in your data.
Azure Stream Analytics or Azure Data Explorer?
Data Explorer is for on-demand or interactive near real-time analytics, data exploration on large volumes of data streams, seasonality decomposition, ad hoc work, dashboards, and root cause analyses on data from near real-time to historical. It will not suit your use case if you need to deploy analytics onto the edge.
Data Explorer provides a native function for forecasting time series based on the same decomposition model. Forecasting is useful for many scenarios, like preventive maintenance, resource planning, and more.
Stream Analytics does not provide seasonality support, owing to the limitation on sliding window size.
Data Explorer provides functions to automatically detect the periods in a time series, or to verify that a metric has the specific distinct period(s) you expect if you already know them.
Stream Analytics does not support decomposition.
Data Explorer provides a function that takes a set of time series and automatically decomposes each one into its seasonal, trend, residual, and baseline components.
Filtering and Analysis
Stream Analytics provides functions to detect spikes and dips or change points.
Data Explorer provides analysis to find anomalous points across a set of time series, plus a root cause analysis (RCA) function to run after an anomaly is detected.
Stream Analytics provides filtering against reference data, whether slow-moving or static.
Data Explorer provides two generic functions:
Finite impulse response (FIR), which can be used for moving average, differentiation, and shape matching
Infinite impulse response (IIR), for exponential smoothing and cumulative sum
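To sketch the difference between the two filter families (purely illustrative, not the Data Explorer implementations): an FIR filter's output depends on a finite window of recent inputs, while an IIR filter's output feeds back into itself, so every past input retains some influence.

```python
def fir_moving_average(x, k):
    # FIR: each output is the unweighted mean of up to the last k inputs.
    out = []
    for i in range(len(x)):
        window = x[max(0, i - k + 1):i + 1]
        out.append(sum(window) / len(window))
    return out

def iir_exponential_smoothing(x, alpha=0.5):
    # IIR: the previous output is fed back in (exponential smoothing).
    out = [float(x[0])]
    for v in x[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

print(fir_moving_average([1, 2, 3, 4], k=2))         # → [1.0, 1.5, 2.5, 3.5]
print(iir_exponential_smoothing([2, 4], alpha=0.5))  # → [2.0, 3.0]
```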
Stream Analytics provides detections for:
Spikes and dips (temporary anomalies)
Change points (persistent anomalies such as level or trend change)
Data Explorer provides detections for:
Spikes and dips, based on an enhanced seasonal decomposition model (supporting automatic seasonality detection and robustness to anomalies in the training data)
Change points (level shift, trend change) by segmented linear regression
Extensibility with other models implemented in Python or R, via the KQL inline Python/R plugins
Azure data analytics, in general, brings you best-of-breed technologies for each workload. The new real-time analytics architecture (shown above) lets you leverage the best technology for each type of workload for stream and time-series analytics, including anomaly detection. The following is a list of resources that may help you get started quickly:
Multi-language speech transcription was recently introduced into Microsoft Video Indexer at the International Broadcasters Conference (IBC). It is available as a preview capability and customers can already start experiencing it in our portal. More details on all our IBC2019 enhancements can be found here.
Multi-language videos are common media assets in the globalization context; global political summits, economic forums, and sports press conferences are examples of venues where speakers use their native language to convey their statements. Those videos pose a unique challenge for companies that need to provide automatic transcription for large volumes of video archives. Automatic transcription technologies expect users to explicitly specify the video language in advance in order to convert speech to text. This manual step becomes a scalability obstacle when transcribing multi-language content, as one would have to manually tag audio segments with the appropriate language.
Microsoft Video Indexer provides a unique capability of automatic spoken language identification for multi-language content. This solution allows users to easily transcribe multi-language content without going through tedious manual preparation steps before triggering it. By that, it can save anyone with a large archive of videos both time and money, and enable discoverability and accessibility scenarios.
Multi-language audio transcription in Video Indexer
The multi-language transcription capability is available as part of the Video Indexer portal. Currently, it supports four languages: English, French, German, and Spanish, and it expects up to three different languages in an input media asset. While uploading a new media asset you can select the “Auto-detect multi-language” option as shown below.
Additionally, each instance in the transcription section will include the language in which it was transcribed.
Customers can view the transcript and identified languages by time, jump to the specific places in the video for each language, and even see the multi-language transcription as video captions. The resulting transcription is also available as closed caption files (VTT, TTML, SRT, TXT, and CSV).
Language identification from an audio signal is a complex task. Acoustic environment, speaker gender, and speaker age are among a variety of factors that affect this process. We represent the audio signal using a visual representation, such as a spectrogram, assuming that different languages induce unique visual patterns which can be learned using deep neural networks.
Our solution has two main stages to determine the languages used in multi-language media content. First, it employs a deep neural network to classify audio segments at very high granularity, in other words, just a few seconds each. While a good model will successfully identify the underlying language, it can still misidentify some segments due to similarities between languages. Therefore, we apply a second stage that examines these misses and smooths the results accordingly.
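The details of the second stage are not public, but one hypothetical sketch of such smoothing is a sliding-window majority vote over the per-segment language labels, which suppresses isolated misidentifications:

```python
from collections import Counter

def smooth_labels(labels, window=2):
    # Replace each per-segment language label with the majority label in a
    # window around it, suppressing isolated misidentifications.
    out = []
    for i in range(len(labels)):
        lo, hi = max(0, i - window), min(len(labels), i + window + 1)
        out.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return out

# The isolated "fr" segment is likely a misidentification and is smoothed away.
segments = ["en", "en", "fr", "en", "en", "en", "de", "de", "de"]
print(smooth_labels(segments))  # → ['en', 'en', 'en', 'en', 'en', 'en', 'de', 'de', 'de']
```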
We introduced a differentiated capability for multi-language speech transcription. With this unique capability in Video Indexer, you can work more effectively with the content of your videos, as it allows you to immediately start searching across videos for segments in different languages. Over the coming few months, we will be improving this capability by adding support for more languages and improving the model’s accuracy.
Earlier this year, we announced a preview of built-in Jupyter notebooks for Azure Cosmos DB. These notebooks, running inside Azure Cosmos DB, are now available.
Cosmic notebooks are available for all data models and APIs including Cassandra, MongoDB, SQL (Core), Gremlin, and Spark to enhance the developer experience in Azure Cosmos DB. These notebooks are directly integrated into the Azure Portal and your Cosmos accounts, making them convenient and easy to use. Developers, data scientists, engineers and analysts can use the familiar Jupyter notebooks experience to:
Interactively run queries
Explore and analyze data
Build, train, and run machine learning and AI models
In this blog post, we’ll explore how notebooks make it easy for you to work with and visualize your Azure Cosmos DB data.
Easily query your data
With notebooks, we’ve included built-in commands to make it easy to query your data for ad-hoc or exploratory analysis. From the Portal, you can use the %%sql magic command to run a SQL query against any container in your account, no configuration needed. The results are returned immediately in the notebook.
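For illustration, a notebook cell using the magic command might look like the following (the database and container names here are hypothetical):

```
%%sql --database RetailDemo --container WebsiteData
SELECT TOP 10 * FROM c
```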
Improved developer productivity
We’ve also bundled in version 4 of our Azure Cosmos DB Python SDK for SQL API, which has our latest performance and usability improvements. The SDK can be used directly from notebooks without having to install any packages. You can perform any SDK operation, including creating new databases and containers, importing data, and more.
Visualize your data
Azure Cosmos DB notebooks come with a built-in set of packages, including Pandas, a popular Python data analysis library, Matplotlib, a Python plotting library, and more. You can customize your environment by installing any package you need.
For example, to build interactive visualizations, we can install bokeh and use it to build an interactive chart of our data.
Users with geospatial data in Azure Cosmos DB can also use the built-in GeoPandas library, along with their visualization library of choice to more easily visualize their data.
Follow our documentation to create a new Cosmos account with notebooks enabled or enable notebooks on an existing account.
Start with one of the notebooks included in the sample gallery in Azure Cosmos Explorer or Data Explorer.
Python support for Azure Functions is now generally available and ready to host your production workloads across data science and machine learning, automated resource management, and more. You can now develop Python 3.6 apps to run on the cross-platform, open-source Functions 2.0 runtime. These can be published as code or Docker containers to a Linux-based serverless hosting platform in Azure. This stack powers the solution innovations of our early adopters, with customers such as General Electric Aviation and TCF Bank already using Azure Functions written in Python for their serverless production workloads. Our thanks to them for their continued partnership!
In the words of David Havera, blockchain Chief Technology Officer of the GE Aviation Digital Group, “GE Aviation Digital Group’s hope is to have a common language that can be used for backend Data Engineering to front end Analytics and Machine Learning. Microsoft have been instrumental in supporting this vision by bringing Python support in Azure Functions from preview to life, enabling a real world data science and Blockchain implementation in our TRUEngine project.”
Throughout the Python preview for Azure Functions we gathered feedback from the community to build easier authoring experiences, introduce an idiomatic programming model, and create a more performant and robust hosting platform on Linux. This post is a one-stop summary for everything you need to know about Python support in Azure Functions and includes resources to help you get started using the tools of your choice.
Bring your Python workloads to Azure Functions
Many Python workloads align very nicely with the serverless model, allowing you to focus on your unique business logic while letting Azure take care of how your code is run. We’ve been delighted by the interest from the Python community and by the productive solutions built using Python on Functions.
Workloads and design patterns
While this is by no means an exhaustive list, here are some examples of workloads and design patterns that translate well to Azure Functions written in Python.
Simplified data science pipelines
Python is a great language for data science and machine learning (ML). You can leverage the Python support in Azure Functions to provide serverless hosting for your intelligent applications. Consider a few ideas:
Use Azure Functions to deploy a trained ML model along with a scoring script to create an inferencing application.
Leverage triggers and data bindings to ingest, move, prepare, transform, and process data using Functions.
Use Functions to introduce event-driven triggers to re-training and model update pipelines when new datasets become available.
Automated resource management
As an increasing number of assets and workloads move to the cloud, there’s a clear need to provide more powerful ways to manage, govern, and automate the corresponding cloud resources. Such automation scenarios require custom logic that can be easily expressed using Python. Here are some common scenarios:
Process Azure Monitor alerts generated by Azure services.
React to Azure events captured by Azure Event Grid and apply operational requirements on resources.
Leverage Azure Logic Apps to connect to external systems like IT service management, DevOps, or monitoring systems while processing the payload with a Python function.
Perform scheduled operational tasks on virtual machines, SQL Server, web apps, and other Azure resources.
Powerful programming model
To power accelerated Python development, Azure Functions provides a productive programming model based on event triggers and data bindings. The programming model is supported by a world class end-to-end developer experience that spans from building and debugging locally to deploying and monitoring in the cloud.
The programming model is designed to provide a seamless experience for Python developers so you can quickly start writing functions using code constructs that you’re already familiar with, or import existing .py scripts and modules to build the function. For example, you can implement your functions as asynchronous coroutines using the async def qualifier or send monitoring traces to the host using the standard logging module. Additional dependencies to pip install can be configured using the requirements.txt file.
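For example, a minimal async entry point might look like the sketch below. To keep it runnable anywhere, a plain dict stands in for the azure.functions.HttpRequest object a real Function App would receive:

```python
import asyncio
import logging

async def main(req: dict) -> str:
    # In a real Function App this coroutine would receive an
    # azure.functions.HttpRequest; a dict stands in for this sketch.
    name = req.get("name", "world")
    logging.info("Processing request for %s", name)  # traces flow to the host
    return f"Hello, {name}!"

print(asyncio.run(main({"name": "Azure"})))  # → Hello, Azure!
```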
With the event-driven programming model in Functions, based on triggers and bindings, you can easily configure the events that will trigger the function execution and any data sources the function needs to orchestrate with. This model helps increase productivity when developing apps that interact with multiple data sources by reducing the amount of boilerplate code, SDKs, and dependencies that you need to manage and support. Once configured, you can quickly retrieve data from the bindings or write back using the method attributes of your entry-point function. The Python SDK for Azure Functions provides a rich API layer for binding to HTTP requests, timer events, and other Azure services, such as Azure Storage, Azure Cosmos DB, Service Bus, Event Hubs, or Event Grid, so you can use productivity enhancements like autocomplete and Intellisense when writing your code. By leveraging the Azure Functions extensibility model, you can also bring your own bindings to use with your function, so you can also connect to other streams of data like Kafka or SignalR.
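As an illustration of how triggers and bindings are configured, the function.json for an HTTP-triggered Python function might look roughly like this (file and binding names are just examples):

```json
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "type": "httpTrigger",
      "direction": "in",
      "name": "req",
      "authLevel": "function",
      "methods": [ "get", "post" ]
    },
    {
      "type": "http",
      "direction": "out",
      "name": "$return"
    }
  ]
}
```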
As a Python developer, you can use your preferred tools to develop your functions. The Azure Functions Core Tools will enable you to get started using trigger-based templates, run locally to test against real-time events coming from the actual cloud sources, and publish directly to Azure, while automatically invoking a server-side dependency build on deployment. The Core Tools can be used in conjunction with the IDE or text editor of your choice for an enhanced authoring experience.
You can also choose to take advantage of the Azure Functions extension for Visual Studio Code for a tightly integrated editing experience to help you create a new app, add functions, and deploy, all within a matter of minutes. The one-click debugging experience enables you to test your functions locally, set breakpoints in your code, and evaluate the call stack, simply with the press of F5. Combine this with the Python extension for Visual Studio Code, and you have an enhanced Python development experience with auto-complete, Intellisense, linting, and debugging.
For a complete continuous delivery experience, you can now leverage the integration with Azure Pipelines, one of the services in Azure DevOps, via an Azure Functions-optimized task to build the dependencies for your app and publish them to the cloud. The pipeline can be configured using an Azure DevOps template or through the Azure CLI.
Advanced observability and monitoring through Azure Application Insights is also available for functions written in Python, so you can monitor your apps using the live metrics stream, collect data, query execution logs, and view the distributed traces across a variety of services in Azure.
The Consumption plan is now generally available for Linux-based hosting and ready for production workloads. This serverless plan provides event-driven dynamic scale and you are charged for compute resources only when your functions are running. Our Linux plan also now has support for managed identities, allowing your app to seamlessly work with Azure resources such as Azure Key Vault, without requiring additional secrets.
The Consumption plan for Linux hosting also includes a preview of integrated remote builds to simplify dependency management. This new capability is available as an option when publishing via the Azure Functions Core Tools and enables you to build in the cloud on the same environment used to host your apps as opposed to configuring your local build environment in alignment with Azure Functions hosting.
Workloads that require advanced features such as more powerful hardware, the ability to keep instances warm indefinitely, and virtual network connectivity can benefit from the Premium plan with Linux-based hosting now available in preview.
With the Premium plan for Linux hosting you can choose between bringing only your app code or bringing a custom Docker image to encapsulate all your dependencies, including the Azure Functions runtime as described in the documentation “Create a function on Linux using a custom image.” Both options benefit from avoiding cold start and from scaling dynamically based on events.
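A custom image of this kind might start from the Functions base image, roughly as follows (the image tag and paths are assumptions to verify against the linked documentation):

```dockerfile
# Base image that bundles the Azure Functions runtime for Python
FROM mcr.microsoft.com/azure-functions/python:2.0

# Copy the function app and install its dependencies
COPY . /home/site/wwwroot
RUN cd /home/site/wwwroot && pip install -r requirements.txt
```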
Here are a few resources you can leverage to start building your Python apps in Azure Functions today:
On the Azure Functions team, we are committed to providing a seamless and productive serverless experience for developing and hosting Python applications. With so much being released now and coming soon, we’d love to hear your feedback and learn more about your scenarios. You can reach the team on Twitter and on GitHub. We actively monitor StackOverflow and UserVoice as well, so feel free to ask questions or leave your suggestions. We look forward to hearing from you!
In a world where data volume, variety, and type are exponentially growing, organizations need to collaborate with data of any size and shape. In many cases data is at its most powerful when it can be shared and combined with data that resides outside organizational boundaries with business partners and third parties. For customers, sharing this data in a simple and governed way is challenging. Common data sharing approaches using file transfer protocol (FTP) or web APIs tend to be bespoke development and require infrastructure to manage. These tools do not provide the security or governance required to meet enterprise standards, and they often are not suitable for sharing large datasets. To enable enterprise collaboration, we are excited to unveil Azure Data Share Preview, a new data service for sharing data across organizations.
Simple and safe data sharing
Alongside governance, security is fundamental to Azure Data Share, which leverages core Azure security measures to help protect your data.
Enabling data collaboration
Azure Data Share maximizes access to simple and safe data sharing for organizations in many industries. For example, retailers can leverage Azure Data Share to easily share sales inventory and demographic data for demand forecasting and price optimization with their suppliers.
In the finance industry, Microsoft collaborated with Finastra, a multi-billion dollar company and provider of the broadest portfolio of financial services software in the world today, spanning retail banking, transaction banking, lending, and treasury and capital markets. Finastra is fully integrating Azure Data Share with its open platform, FusionFabric.cloud, to enable seamless distribution of premium datasets to a wider ecosystem of application developers across the FinTech value chain. These datasets have been curated by Finastra over several years, and by leveraging the data distribution capabilities of Azure Data Share, ingestion by app developers and other partners requires only simple wrangling, significantly reducing the go-to-market timeframe and unlocking net new revenue potential for Finastra.
“Our decision to integrate Azure Data Share with Finastra’s FusionFabric.cloud platform is now a great way to further accelerate innovation via an expanded open ecosystem. Our partnership with Microsoft truly provides us with limitless opportunities to drive transformation in Financial Services.”
– Eli Rosner, Chief Product and Technology Officer, Finastra
Industries of all types need a simple and safe way to share data. Azure Data Share opens up new opportunities for innovation and insights to drive greater business impact.
This blog was co-authored by Jordan Edwards, Senior Program Manager, Azure Machine Learning
This year at Microsoft Build 2019, we announced a slew of new releases as part of the Azure Machine Learning service focused on MLOps. These capabilities help you automate and manage the end-to-end machine learning lifecycle.
Historically, Azure Machine Learning service’s management plane has been via its Python SDK. To make our service more accessible to IT and app development customers unfamiliar with Python, we have delivered an extension to the Azure CLI focused on interacting with Azure Machine Learning.
While it’s not a replacement for the Azure Machine Learning service Python SDK, it is a complementary tool optimized to handle highly parameterized tasks which lend themselves well to automation. With this new CLI, you can easily perform a variety of automated tasks against the machine learning workspace, including:
Compute target management
Experiment submission and job management
Model registration and deployment
Combining these commands enables you to train, register, package, and deploy your model as an API. To help you quickly get started with MLOps, we have also released a predefined template in Azure Pipelines. This template allows you to easily train, register, and deploy your machine learning models. Data scientists and developers can work together to build a custom application for their scenario, built from their own data set.
The Azure Machine Learning service command-line interface is an extension to the Azure CLI, the cross-platform command-line interface for the Azure platform. This extension provides commands for working with the Azure Machine Learning service from the command line and allows you to automate your machine learning workflows. Some key scenarios include:
Running experiments to create machine learning models
Registering machine learning models for customer usage
Packaging, deploying, and tracking the lifecycle of machine learning models
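Put together, a session covering those scenarios might look roughly like the following. The exact command and flag names should be checked against the extension's help for your version; the experiment, model, and file names below are hypothetical:

```
# Install the machine learning extension for the Azure CLI
az extension add -n azure-cli-ml

# Submit a training run to an experiment in the workspace
az ml run submit-script -e myexperiment -c sklearn train.py

# Register the resulting model, then deploy it as a web service
az ml model register -n mymodel -p ./outputs/model.pkl
az ml model deploy -n myservice -m mymodel:1 --ic inferenceconfig.json --dc deploymentconfig.json
```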
This blog was co-authored by Shweta Mishra, Senior Solutions Architect, CitiusTech and Vinil Menon, Chief Technology Officer, CitiusTech
CitiusTech is a specialist provider of healthcare technology services that helps its customers accelerate innovation in healthcare. CitiusTech used Azure Cosmos DB to simplify the real-time collection and movement of healthcare data from a variety of sources in a secure manner. With the proliferation of patient information from established and current sources, accompanied by stringent regulations, healthcare systems today are gradually shifting towards near real-time data integration. To realize such performance, healthcare systems not only need low latency and high availability, but must also be highly responsive. Furthermore, they need to scale effectively to manage the inflow of high-speed, large volumes of healthcare data.
The rise of the Internet of Things (IoT) has enabled ordinary medical devices, wearables, and traditional hospital-deployed medical equipment to collect and share data. Within a wide area network (WAN) there are well-defined standards and protocols, but with the ever-increasing number of devices getting connected to the internet, there is a general lack of standards compliance and consistency of implementation. Moreover, data collation and generation from IoT-enabled medical and mobile devices need specialized applications to cope with increasing volumes of data.
Document-oriented stores take a free-form approach that provides a great deal of flexibility, since differently shaped data can be stored as business requirements change. Relational databases aren’t efficient at performing CRUD operations on such data, but are essential for handling transactional data where consistent data integrity is necessary. Different databases are designed to solve different problems; using a single database engine for multiple purposes usually leads to non-performant solutions, while managing multiple types of databases adds operational overhead.
Development of distributed, global-scale solutions is challenged by the capability and complexity of scaling databases across multiple regions without compromising performance, while also complying with data sovereignty needs. This often leads to inefficient management of multiple regional databases, underperformance, or both.
Azure Cosmos DB offers polyglot persistence, which allows it to use a mix of data store technologies without compromising on performance. It is a multi-model, highly available, globally scalable database which supports proven low-latency reads and writes. Azure Cosmos DB has enterprise-grade security features and keeps all data encrypted at rest.
Azure Cosmos DB is suited for distributed global scale solutions as it not only provides a turnkey global distribution feature but can geo-fence a database to specific regions to manage data sovereignty compliance. Its multi-master feature allows writes to be made and synchronized across regions with guaranteed consistency. In addition, it supports multi-document transactions with ACID guarantees.
Use cases in healthcare
Azure Cosmos DB works very well for the following workloads.
1. Global scale secure solutions
Organizations like CitiusTech that offer mission-critical, global-scale solutions should consider Azure Cosmos DB a critical component of their solution stack. For example, an ISV developing a non-drug treatment delivered to patients through a medical device at a facility can develop web or mobile applications which store the treatment information and medical device metadata in Azure Cosmos DB. Treatment information can then be pushed to medical devices at facilities around the globe. ISVs can meet data sovereignty requirements by using the geo-fencing feature.
Azure Cosmos DB can also be used as a multi-tenant database with a carefully designed strategy. For instance, if tenants have different scaling requirements, separate Azure Cosmos containers can be created for those tenants. In Azure Cosmos DB, containers serve as logical units of distribution and scalability. Multi-tenancy may be possible at a partition level within an Azure Cosmos container, but it needs to be designed carefully to avoid creating hot spots and compromising overall performance.
2. Real-time location system, Internet of Things
Azure Cosmos DB is effective for building solutions for real-time tracking and management of medical devices and patients, which often require rapid data velocity, scale, and resilience. Azure Cosmos DB supports low-latency writes and reads, and all data is replicated across multiple fault and update domains in each region for high availability and resilience. It supports session consistency, one of its five consistency levels, which is suitable for such scenarios: session consistency guarantees strong consistency within a session.
Using Azure Cosmos DB also allows processing power to scale, which is useful for burst scenarios, and provides elastic scaling to petabytes of storage. Request units (RUs) can be adjusted programmatically to match the workload.
CitiusTech worked with a leading provider of medical grade vital signs and physiological monitoring solution to build a medical IoT based platform with the following requirements:
Monitor vitals with medical quality
Provide solutions for partners to integrate custom solutions
Deliver personalized, actionable insights
Messages and/or device-generated data don’t have a fixed structure and may change in the future
Data producer(s) to simultaneously upload data for at least 100 subjects in less than two seconds per subject, receiving no more than 40*21=840 data points per subject, per request
Data consumer(s) to read simultaneously, data of at least 100 subjects in less than two seconds, producing no more than 15,000 data points per data consumer
Data for most recent 14 days shall be ready to be queried, and data older than 14 days to be moved to a cold storage
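The ingestion and consumption requirements above reduce to a simple budget check. The helper below is hypothetical, written only to make the stated numbers (100 subjects, 40*21=840 points per subject, under two seconds) explicit:

```python
def within_ingest_budget(subjects: int, points_per_subject: int,
                         seconds_per_subject: float) -> bool:
    """Check a batch against the stated requirement: at least 100
    subjects, no more than 40 * 21 = 840 data points per subject per
    request, and under two seconds per subject."""
    return (subjects >= 100
            and points_per_subject <= 40 * 21
            and seconds_per_subject < 2.0)

ok = within_ingest_budget(100, 840, 1.5)       # exactly at the limits
too_many = within_ingest_budget(100, 900, 1.5) # too many points per subject
```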
CitiusTech used Azure Cosmos DB as hot storage for health data, since it enabled low-latency writes and reads of the health data generated continuously by the wearable sensors. Azure Cosmos DB provided schema-agnostic, flexible storage for documents of different shapes and sizes at scale, and allowed enterprise-grade security with Azure compliance certification.
The time to live (TTL) feature in Azure Cosmos DB automatically deleted expired items based on their TTL value. The database was geo-distributed, with its geo-fencing feature used to address data sovereignty compliance requirements.
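The TTL semantics can be modeled in a few lines. This toy, stdlib-only sketch (the `purge_expired` helper is hypothetical) mirrors the Cosmos DB behavior in which an item's own `ttl` field overrides the container default and expiry is measured from the item's last-modified timestamp `_ts`:

```python
def purge_expired(items, default_ttl_seconds, now):
    """Toy model of Cosmos DB TTL: an item expires once
    now - item["_ts"] reaches its ttl; a per-item "ttl" field
    overrides the container default."""
    return [item for item in items
            if now - item["_ts"] < item.get("ttl", default_ttl_seconds)]

items = [
    {"id": "fresh", "_ts": 1_900_000},
    {"id": "stale", "_ts": 0},
    {"id": "pinned", "_ts": 0, "ttl": 10_000_000},  # per-item override
]
# 14-day default TTL, matching the hot/cold split described above.
live = purge_expired(items, default_ttl_seconds=14 * 24 * 3600, now=2_000_000)
print([i["id"] for i in live])  # → ['fresh', 'pinned']
```

In the real service no application code runs: expired items simply stop being returned and are deleted in the background, so the 14-day hot window needs no cleanup jobs.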
Architecture of data flow in CitiusTech’s solution using Azure Cosmos DB
Azure Cosmos DB unlocks the potential of polyglot persistence for healthcare systems to integrate healthcare data from multiple systems of record. It also ensures the need for flexibility, adaptability, speed, security and scale in healthcare is addressed while maintaining low operational overheads and high performance.
CitiusTech is a specialist provider of healthcare technology services and solutions to healthcare technology companies, providers, payers and life sciences organizations. CitiusTech helps customers accelerate innovation in healthcare through specialized solutions, healthcare technology platforms, proficiencies and accelerators. Find out more about CitiusTech.
Data scientists have a dynamic role. They need environments that are fast and flexible while upholding their organization’s security and compliance policies.
Data scientists working on machine learning projects need a flexible environment in which to run experiments, train and iterate on models, and innovate. They want to focus on building, training, and deploying models without getting bogged down in prepping virtual machines (VMs), laboriously entering parameters, and constantly going back to IT to make changes to their environments. Moreover, they need to remain within the compliance and security policies outlined by their organizations.
Organizations seek to empower their data scientists to do their job effectively, while keeping their work environment secure. Enterprise IT pros want to lock down security and have a centralized authentication system. Meanwhile, data scientists are more focused on having direct access to virtual machines (VMs) to tinker at the lower level of CUDA drivers and special versions of the latest machine learning frameworks. However, direct access to the VM makes it hard for IT pros to enforce security policies. Azure Machine Learning service is developing innovative features that allow data scientists to get the most out of their data and spend time focusing on their business objectives while maintaining their organizations’ security and compliance posture.
Azure Machine Learning service’s Notebook Virtual Machine (VM), announced in May 2019, resolves these conflicting requirements while simplifying the overall experience for data scientists. Notebook VM is a cloud-based workstation created specifically for data scientists. Notebook VM based authoring is directly integrated into Azure Machine Learning service, providing a code-first experience for Python developers to conveniently build and deploy models in the workspace. Developers and data scientists can perform every operation supported by the Azure Machine Learning Python SDK using a familiar Jupyter notebook in a secure, enterprise-ready environment. Notebook VM is secure and easy-to-use, preconfigured for machine learning, and fully customizable.
Let’s take a look at how Azure Machine Learning service Notebook VMs are:
Secure and easy to use
Preconfigured for machine learning
Fully customizable
1. Secure and easy to use
When a data scientist creates a notebook on a standard infrastructure-as-a-service (IaaS) VM, it requires a lot of intricate, IT-specific parameters. They need to name the VM and specify image names, security parameters (virtual network, subnet, and more), storage accounts, and a variety of other IT-specific settings. If incorrect parameters are given, or details are overlooked, this can open an organization up to serious security risks.
Compared to an IaaS VM, the Notebook VM creation experience has been streamlined, as it takes just two parameters – a VM name and a VM type. Once the Notebook VM is created it provides access to Jupyter and JupyterLab – two popular notebook environments for data science. The access to the notebooks is secured out-of-the-box with HTTPS and Azure Active Directory, which makes it possible for IT pros to enforce a single sign-on environment with strong security features like Multi-Factor Authentication, ensuring a secure environment in compliance with organizational policies.
2. Preconfigured for machine learning
Setting up GPU drivers and deploying libraries on a traditional IaaS VM can be cumbersome and require substantial amounts of time. It can also get complicated finding the right drivers for given hardware, libraries, and frameworks. For instance, the latest versions of PyTorch may not work with the drivers a data scientist is currently using. Installation of client libraries for services such as Azure Machine Learning Python SDK can also be time-consuming, and some Python packages can be incompatible with others, depending on the environment where they are installed.
Notebook VM has the most up-to-date, compatible packages preconfigured and ready to use. This way, data scientists can use any of the latest frameworks on Notebook VM without versioning issues and with access to all the latest functionality of Azure Machine Learning service. Inside the VM, along with Jupyter and JupyterLab, data scientists will find a fully prepared environment for machine learning. Notebook VM draws its pedigree from the Data Science Virtual Machine (DSVM), a popular IaaS VM offering on Azure. Similar to DSVM, it comes equipped with preconfigured GPU drivers and a selection of machine learning and deep learning frameworks.
Notebook VM is also integrated with its parent, Azure Machine Learning workspace. The notebooks that data scientists run on the VM have access to the data stores and compute resources of the workspace. The notebooks themselves are stored in a Blob Storage account of the workspace. This makes it easy to share notebooks between VMs, as well as keeps them safely preserved when the VM is deleted.
3. Fully customizable
In environments where IT pros prepare virtual machines for data scientists, there is a very rigorous process for this preparation, and there are limitations on what can be done on these machines. Data scientists, by contrast, are very dynamic and need the ability to customize VMs to fit their ever-changing needs. This often means going back to IT pros to have them make the necessary changes to the VMs. Even then, data scientists hit blockers when iterations don’t meet their needs or take too long. Some data scientists will resort to using their personal laptops to run jobs their corporate VMs don’t support, breaking compliance policies and putting the organization at risk.
While Notebook VM is a managed VM offering, it retains full access to hardware capabilities. Data scientists can create VMs of any type supported by Azure. This way they can customize them to their heart’s desire by adding custom packages and drivers. For example, data scientists can quickly create the latest NVIDIA V100-powered VM to perform step-by-step debugging of novel neural network architectures.
If you are working with code, Notebook VM will offer you a smooth experience. It includes a set of tutorials and samples which make every capability of the Azure Machine Learning service just one-click away. Give it a try and let us know your feedback.
This blog post was authored by Jordan Edwards, Senior Program Manager, Microsoft Azure.
At Microsoft Build 2019 we announced MLOps capabilities in Azure Machine Learning service. MLOps, also known as DevOps for machine learning, is the practice of collaboration and communication between data scientists and DevOps professionals to help manage the production of the machine learning (ML) lifecycle.
Azure Machine Learning service’s MLOps capabilities provide customers with asset management and orchestration services, enabling effective ML lifecycle management. With this announcement, Azure is reaffirming its commitment to help customers safely bring their machine learning models to production and solve their business’s key problems faster and more accurately than ever before.
Here is a quick look at some of the new features:
Azure Machine Learning Command Line Interface (CLI)
Azure Machine Learning’s management plane has historically been via the Python SDK. With the new Azure Machine Learning CLI, you can easily perform a variety of automated tasks against the ML workspace including:
Compute target management
Model registration and deployment
Azure Machine Learning service introduced new capabilities to help manage the code, data, and environments used in your ML lifecycle.
Git repositories are commonly used in industry for source control management and as key assets in the software development lifecycle. We are including our first version of Git repository tracking – any time you submit code artifacts to Azure Machine Learning service, you can specify a Git repository reference. This is done automatically when you are running from a CI/CD solution such as Azure Pipelines.
Data set management
With Azure Machine Learning data sets you can version, profile, and snapshot your data to enable you to reproduce your training process by having access to the same data. You can also compare data set profiles and determine how much your data has changed or if you need to retrain your model.
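A minimal sketch of the "compare profiles, decide whether to retrain" idea, using only the standard library (the `profile` and `needs_retraining` helpers and the 10% tolerance are assumptions for illustration, not the Azure Machine Learning data set API):

```python
from statistics import mean, stdev

def profile(values):
    """A minimal dataset profile: count, mean, and standard deviation."""
    return {"count": len(values), "mean": mean(values), "stdev": stdev(values)}

def needs_retraining(old_profile, new_profile, tolerance=0.10):
    """Flag retraining when the new data's mean shifts by more than
    `tolerance` (relative) -- a simplistic stand-in for real drift tests."""
    shift = abs(new_profile["mean"] - old_profile["mean"])
    return shift > tolerance * abs(old_profile["mean"])

v1 = profile([10, 11, 9, 10, 10])   # snapshot used for the deployed model
v2 = profile([13, 14, 12, 13, 13])  # newly arrived data
print(needs_retraining(v1, v2))  # → True: the mean moved by ~30%
```

Real drift detection would compare full distributions per feature, but the workflow is the same: version the snapshots, diff the profiles, and gate retraining on the result.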
Azure Machine Learning Environments are shared across Azure Machine Learning scenarios, from data preparation to model training to inferencing. Shared environments help to simplify handoff from training to inferencing as well as the ability to reproduce a training environment locally.
Environments provide automatic Docker image management (and caching!), plus tracking to streamline reproducibility.
Simplified model debugging and deployment
Some data scientists have difficulty getting an ML model prepared to run in a production system. To alleviate this, we have introduced new capabilities to help you package and debug your ML models locally, prior to pushing them to the cloud. This should greatly reduce the inner loop time required to iterate and arrive at a satisfactory inferencing service, prior to the packaged model reaching the datacenter.
Model validation and profiling
Another challenge that data scientists commonly face is guaranteeing that models will perform as expected once they are deployed to the cloud or the edge. With the new model validation and profiling capabilities, you can provide sample input queries to your model. We will automatically deploy and test the packaged model on a variety of inference CPU/memory configurations to determine the optimal performance profile. We also check that the inference service is responding correctly to these types of queries.
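The selection step at the end of profiling can be sketched as a pure function: given measured latency per candidate configuration, pick the cheapest one that meets the latency SLA. Everything here (the `choose_config` helper, the configuration names, and the prices) is hypothetical:

```python
def choose_config(measured_latency_ms, hourly_cost, sla_ms=100.0):
    """Given per-configuration latency measurements from profiling,
    return the cheapest configuration that meets the latency SLA,
    or None when nothing qualifies."""
    candidates = [(cost, cfg) for cfg, cost in hourly_cost.items()
                  if measured_latency_ms[cfg] <= sla_ms]
    return min(candidates)[1] if candidates else None

latency = {"1cpu-1gb": 180.0, "2cpu-4gb": 90.0, "4cpu-8gb": 40.0}
cost = {"1cpu-1gb": 0.05, "2cpu-4gb": 0.10, "4cpu-8gb": 0.20}
print(choose_config(latency, cost))  # → '2cpu-4gb'
```

The smallest config is too slow and the largest one overshoots the SLA at double the cost, so the middle configuration wins.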
Data scientists want to know why models predict in a specific manner. With the new model interpretability capabilities, we can explain why a model is behaving a certain way during both training and inferencing.
ML audit trail
Azure Machine Learning is used for managing all of the artifacts in your model training and deployment process. With new audit trail capabilities, we are enabling automatic tracking of the experiments and datasets that correspond to your registered ML model. This helps answer the question, “What code/data was used to create this model?”
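Conceptually, an audit trail is a registry keyed by model name and version that records lineage at registration time. The sketch below is a toy in-memory version (the `register_model` and `lineage` helpers and all values are hypothetical, not the Azure Machine Learning API):

```python
registry = {}

def register_model(name, version, git_commit, dataset_version, metrics):
    """Record the lineage of a trained model so 'what code/data was
    used to create this model?' is always answerable."""
    registry[(name, version)] = {
        "git_commit": git_commit,
        "dataset_version": dataset_version,
        "metrics": metrics,
    }

def lineage(name, version):
    """Look up the code and data that produced a registered model."""
    entry = registry[(name, version)]
    return entry["git_commit"], entry["dataset_version"]

register_model("churn", 3, git_commit="a1b2c3d",
               dataset_version="customers-v7", metrics={"auc": 0.91})
print(lineage("churn", 3))  # → ('a1b2c3d', 'customers-v7')
```

The key design point is that lineage is captured automatically at registration, not reconstructed after the fact, so the audit trail cannot drift out of sync with the models it describes.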
Azure DevOps extension for machine learning
Azure DevOps provides commonly used tools data scientists leverage to manage code, work items, and CI/CD pipelines. With the Azure DevOps extension for machine learning, we are introducing new capabilities to make it easy to manage your ML CI/CD pipelines with the same tools you use for software development processes. The extension includes the abilities to trigger an Azure Pipelines release on model registration, easily connect an Azure Machine Learning workspace to an Azure DevOps project, and perform a series of tasks designed to make interacting with Azure Machine Learning as easy as possible from your existing automation tooling.
Get started today
These new MLOps features in the Azure Machine Learning service aim to enable users to bring their ML scenarios to production by supporting reproducibility, auditability, and automation of the end-to-end ML lifecycle. We’ll be publishing more blogs that go in-depth with these features in the following weeks, so follow along for the latest updates and releases.
The Jupyter Notebook on HDInsight Spark clusters is useful when you need to quickly explore data sets, perform trend analysis, or try different machine learning models. Not being able to track the status of Spark jobs and intermediate data can make it difficult for data scientists to monitor and optimize what they are doing inside the Jupyter Notebook.
To address these challenges, we are adding cutting edge job execution and visualization experiences into the HDInsight Spark in-cluster Jupyter Notebook. Today, we are delighted to share the release of the real time Spark job progress indicator, native matplotlib support for PySpark DataFrame, and the cell execution status indicator.
Spark job progress indicator
When you run an interactive Spark job inside the notebook, a Spark job progress indicator with a real time progress bar appears to help you understand the job execution status. You can also switch tabs to see a resource utilization view for active tasks and allocated cores, or a Gantt chart of jobs, stages, and tasks for the overall workload.
Native matplotlib support for PySpark DataFrame
Previously, PySpark did not support matplotlib. If you wanted to plot something, you first needed to export the PySpark DataFrame out of the Spark context, convert it into a local Python session, and plot from there. In this release, we provide native matplotlib support for PySpark DataFrame. You can use matplotlib directly on the PySpark DataFrame just as you would locally, with no need to transfer data back and forth between the cluster Spark context and the local Python session.
Cell execution status indicator
Step-by-step cell execution status is displayed beneath the cell to help you see its current progress. Once the cell run is complete, an execution summary with the total duration and end time will be shown and kept there for future reference.
These features have been built into the HDInsight Spark Jupyter Notebook. To get started, access HDInsight from the Azure portal. Open the Spark cluster and select Jupyter Notebook from the quick links.
We look forward to your comments and feedback. If you have any feature requests, asks, or suggestions, please send us a note to [email protected]. For bug submissions, please open a new ticket.
DevOps is the union of people, processes, and products to enable the continuous delivery of value to end users. DevOps for machine learning brings DevOps-style lifecycle management to machine learning: teams can easily manage, monitor, and version models while simplifying workflows and the collaboration process.
Effectively managing the Machine Learning lifecycle is critical for DevOps’ success. And the first piece to machine learning lifecycle management is building your machine learning pipeline(s).
What is a Machine Learning Pipeline?
DevOps for Machine Learning includes data preparation, experimentation, model training, model management, deployment, and monitoring while also enhancing governance, repeatability, and collaboration throughout the model development process. Pipelines allow for the modularization of phases into discrete steps and provide a mechanism for automating, sharing, and reproducing models and ML assets. They create and manage workflows that stitch together machine learning phases. Essentially, pipelines allow you to optimize your workflow with simplicity, speed, portability, and reusability.
There are four steps involved in deploying machine learning that data scientists, engineers and IT experts collaborate on:
Data Ingestion and Preparation
Model Training and Retraining
Together, these steps make up the machine learning pipeline. Below is an excerpt from the documentation on building machine learning pipelines with Azure Machine Learning service, which explains it well.
“Using distinct steps makes it possible to rerun only the steps you need, as you tweak and test your workflow. A step is a computational unit in the pipeline. As shown in the preceding diagram, the task of preparing data can involve many steps. These include, but aren’t limited to, normalization, transformation, validation, and featurization. Data sources and intermediate data are reused across the pipeline, which saves compute time and resources.”
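The "rerun only the steps you need" behavior described in the excerpt boils down to caching a step's output, keyed by the step name and a fingerprint of its inputs. A toy, stdlib-only sketch (the `Pipeline` class here is hypothetical, not the Azure Machine Learning SDK):

```python
import hashlib
import json

class Pipeline:
    """Toy pipeline that reruns a step only when its inputs change."""
    def __init__(self):
        self._cache = {}
        self.runs = []  # names of steps that actually executed

    def run_step(self, name, fn, inputs):
        # Fingerprint the inputs; identical inputs reuse the cached output.
        digest = hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest()
        key = (name, digest)
        if key not in self._cache:
            self.runs.append(name)
            self._cache[key] = fn(inputs)
        return self._cache[key]

pipe = Pipeline()
normalize = lambda d: [x / max(d["raw"]) for x in d["raw"]]
result = pipe.run_step("normalize", normalize, {"raw": [1, 2, 4]})
pipe.run_step("normalize", normalize, {"raw": [1, 2, 4]})  # reused, no rerun
print(pipe.runs)  # → ['normalize']
```

Because intermediate results are addressed by content rather than by run, tweaking only the final training step leaves every upstream data-preparation step cached, which is exactly the compute saving the excerpt describes.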
4 benefits of accelerating Machine Learning pipelines for DevOps
1. Collaborate easily across teams
Data scientists, data engineers, and IT professionals using machine learning pipelines need to collaborate on every step involved in the machine learning lifecycle: from data prep to deployment.
Azure Machine Learning service workspace is designed to make the pipelines you create visible to the members of your team. You can use Python to create your machine learning pipelines and interact with them in Jupyter notebooks, or in another preferred integrated development environment.
2. Simplify workflows
Data prep and modeling can last days or weeks, taking time and attention away from other business objectives.
The Azure Machine Learning SDK offers imperative constructs for sequencing and parallelizing the steps in your pipelines when no data dependency is present. You can also templatize pipelines for specific scenarios and deploy them to a REST endpoint, so you can schedule batch-scoring or retraining jobs. You only need to rerun the steps you need, as you tweak and test your workflow when you rerun a pipeline.
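Deciding which steps can run in parallel is a property of the data-dependency graph: steps in the same stage depend only on already-completed stages, never on each other. The `parallel_stages` helper below is a hypothetical sketch of that scheduling idea, not an SDK function:

```python
def parallel_stages(dependencies):
    """Group pipeline steps into stages: steps in the same stage have
    no data dependency on one another and can run in parallel."""
    remaining = dict(dependencies)
    stages, done = [], set()
    while remaining:
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle in pipeline dependencies")
        stages.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return stages

deps = {
    "ingest": set(),
    "featurize_a": {"ingest"},
    "featurize_b": {"ingest"},      # independent of featurize_a
    "train": {"featurize_a", "featurize_b"},
}
print(parallel_stages(deps))
# → [['ingest'], ['featurize_a', 'featurize_b'], ['train']]
```

Here the two featurization steps share no data dependency, so a scheduler is free to run them concurrently before the training step consumes both outputs.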
3. Centralized Management
Tracking models and their version histories is a hurdle many DevOps teams face when building and maintaining their machine learning pipelines.
The Azure Machine Learning service model registry tracks models, their version histories, their lineage, and their artifacts. Once a model is in production, the Application Insights service collects both application and model telemetry that allows the model to be monitored in production for operational and model correctness. The data captured during inferencing is presented back to the data scientists, who can use this information to determine model performance, data drift, and model decay, alongside the tools to train, manage, and deploy machine learning experiments and web services in one central view.
The Azure Machine Learning SDK also allows you to submit and track individual pipeline runs. You can explicitly name and version your data sources, inputs, and outputs instead of manually tracking data and result paths as you iterate. You can also manage scripts and data separately for increased productivity. For each step in your pipeline, Azure coordinates between the various compute targets you use, so that your intermediate data can be shared with the downstream compute targets easily. You can track the metrics for your pipeline experiments directly in the Azure portal.
4. Track your experiments easily
DevOps capabilities for machine learning further improve productivity by enabling experiment tracking and management of models deployed in the cloud and on the edge. All these capabilities can be accessed from any Python environment running anywhere, including data scientists’ workstations. The data scientist can compare runs, and then select the “best” model for the problem statement.
The Azure Machine Learning workspace keeps a list of compute targets that you can use to train your model. It also keeps a history of the training runs, including logs, metrics, output, and a snapshot of your scripts. You can create multiple workspaces, or use common workspaces shared by multiple people.
As you can see, DevOps for machine learning can be streamlined across the ML pipeline with more visibility into training, experiment metrics, and model versions. Azure Machine Learning service seamlessly integrates with Azure services to provide end-to-end capabilities for the entire machine learning lifecycle, making it simpler and faster than ever.
This is part two of a four-part series on the pillars of Azure Machine Learning services. Check out part one if you haven’t already, and be sure to look out for our next blog, where we’ll be talking about ML at scale.