Microsoft is a leader in The Forrester Wave™: Streaming Analytics, Q3 2019

Processing Big data in real-time is an operational necessity for many businesses. Azure Stream Analytics is Microsoft’s serverless real-time analytics offering for complex event processing.

We are excited and humbled to announce that Microsoft has been named a leader in The Forrester Wave™: Streaming Analytics, Q3 2019. Microsoft believes this report truly reflects the market momentum of Azure Stream Analytics, satisfied customers, a growing partner ecosystem and the overall strength of our Azure cloud platform. You can access the full report here.

The Forrester Wave™: Streaming Analytics, Q3 2019, positioning Microsoft as a Leader in the category.

The Forrester Wave™: Streaming Analytics, Q3 2019 report evaluated streaming analytics offerings from 11 different solution providers, and we are honored to share that Forrester has recognized Microsoft as a Leader in this category. Azure Stream Analytics received the highest possible score in 12 different criteria, including Ability to execute, Administration, Deployment, Solution roadmap, Customer adoption, and many more.

The report states, “Microsoft Azure Stream Analytics has strengths in scalability, high availability, deployment, and applications. Azure Stream Analytics is an easy on-ramp for developers who already know SQL. Zero-code integration with over 15 other Azure services makes it easy to try and therefore adopt, making the product the real-time backbone for enterprises needing real-time streaming applications on the Azure cloud. Additionally, through integration with IoT Hub and Azure Functions, it offers seamless interoperability with thousands of devices and business applications.”

Key Differentiators for Azure Stream Analytics

Fully integrated with the Azure ecosystem: Build powerful pipelines with a few clicks

Whether you have millions of IoT devices streaming data to Azure IoT Hub or have apps sending critical telemetry events to Azure Event Hubs, it only takes a few clicks to connect multiple sources and sinks to create an end-to-end pipeline.

Developer productivity

One of the biggest advantages of Stream Analytics is the simple SQL-based query language with its powerful temporal constraints to analyze data in motion. Familiarity with SQL language is enough to author powerful queries. Additionally, Azure Stream Analytics supports language extensibility via C# and JavaScript user-defined functions (UDFs) or user-defined aggregates to perform complex calculations as part of a Stream Analytics query.

Analytics prowess

Stream Analytics contains a wide array of analytic capabilities such as native support for geospatial functions, built-in callouts to custom machine learning (ML) models for real-time scoring, built-in ML models for Anomaly Detection, Pattern matching, and more to help developers easily tackle complex scenarios while staying in a familiar context.

Intelligent edge

Azure Stream Analytics helps bring real-time insights and analytics capabilities closer to where your data originates. Customers can easily enable new scenarios with true hybrid architectures for stream processing and run the same query in the cloud or on the IoT edge.

Best-in-class financially backed SLA by the minute

We understand it is critical for businesses to prevent data loss and have business continuity. Stream Analytics guarantees event processing with a 99.9 percent availability service-level agreement (SLA) at the minute level, which is unparalleled in the industry.

Scale instantly

Stream Analytics is a fully managed serverless (PaaS) offering on Azure. There is no infrastructure to worry about, and no servers, virtual machines, or clusters to manage. We do all the heavy lifting for you in the background. You can instantly scale up or scale out the processing power from one to hundreds of streaming units for any job.

Mission critical

Stream Analytics guarantees “exactly once” event processing and at least once delivery of events. It has built-in recovery capabilities in case the delivery of an event fails. So, you never have to worry about your events getting dropped.

Try it today

There is a strong and growing developer community that supports Stream Analytics. Learn how to get started and build a real-time fraud detection system.

Simply unmatched, truly limitless: Announcing Azure Synapse Analytics

Today, businesses are forced to maintain two types of analytical systems, data warehouses and data lakes. Data warehouses provide critical insights on business health. Data lakes can uncover important signals on customers, products, employees, and processes. Both are critical, yet operate independently of one another, which can lead to uninformed decisions. At the same time, businesses need to unlock insights from all their data to stay competitive and fuel innovation with purpose. Can a single cloud analytics service bridge this gap and enable the agility that businesses demand?

Azure Synapse Analytics

Today, we are announcing Azure Synapse Analytics, a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs.

A diagram showing how Azure Synapse Analytics connects Power BI, Azure Machine Learning, and your ecosystem.

Simply put, Azure Synapse is the next evolution of Azure SQL Data Warehouse. We have taken the same industry-leading data warehouse to a whole new level of performance and capabilities. In fact, it’s the first and only analytics system to have run all TPC-H queries at petabyte-scale. Businesses can continue running their existing data warehouse workloads in production today with Azure Synapse and will automatically benefit from the new capabilities which are in preview. Businesses can put their data to work much more quickly, productively, and securely, pulling together insights from all data sources, data warehouses, and big data analytics systems. Partners can continue to build with us as Azure Synapse will offer a rich and vibrant ecosystem of partners like Databricks, Informatica, Accenture, Talend, Panoply, Attunity, Pragmatic Works, and Adatis.

With Azure Synapse, data professionals of all types can collaborate, build, manage, and analyze their most important data with ease, all within the same service. From Apache Spark integration with the powerful and trusted SQL engine to code-free data integration and management, Azure Synapse is built for every data professional.

That is why companies like Unilever are choosing Azure Synapse.

Our adoption of the Azure Analytics platform has revolutionized our ability to deliver insights to the business. We are very excited that Azure Synapse Analytics will streamline our analytics processes even further with the seamless integration, the way all the pieces have come together so well.

Nallan Sriraman, Global Head of Technology, Unilever

Limitless scale

Azure Synapse delivers insights from all your data, across data warehouses and big data analytics systems, with blazing speed. With Azure Synapse, data professionals can query both relational and non-relational data at petabyte-scale using the familiar SQL language. For mission-critical workloads, they can easily optimize the performance of all queries with intelligent workload management, workload isolation, and limitless concurrency.

Powerful insights

With Azure Synapse, enabling business intelligence and machine learning is a breeze. It is deeply integrated with Power BI and Azure Machine Learning to greatly expand the discovery of insights from all your data and apply machine learning models to all your intelligent apps. Significantly reduce project development time for business intelligence and machine learning projects with a limitless analytics service that enables you to seamlessly apply intelligence over all your most important data — from Dynamics 365 to Office 365, to SaaS services that support the Open Data Initiative — and easily share data with just a few clicks.

Unified experience

Build end-to-end analytics solutions with a unified experience. The Azure Synapse studio provides a unified workspace for data prep, data management, data warehousing, big data, and AI tasks. Data engineers can use a code-free visual environment for managing data pipelines. Database administrators can automate query optimization. Data scientists can build proofs of concept in minutes. Business analysts can securely access datasets and use Power BI to build dashboards in minutes, all while using the same analytics service.

Unmatched security

Azure has the most advanced security and privacy features in the market. These features are built into the fabric of Azure Synapse, such as automated threat detection and always-on data encryption. And for fine-grained access control, businesses can help ensure data stays safe and private using column-level security and native row-level security, as well as dynamic data masking to automatically protect sensitive data in real-time.

 

Get started today

Businesses can continue running their existing data warehouse workloads in production today with generally available features on Azure Synapse.


Azure. Invent with purpose. 

The key to a data-driven culture: Timely insights

A data-driven culture is critical for businesses to thrive in today’s environment. In fact, a brand-new Harvard Business Review Analytic Services survey found that companies who embrace a data-driven culture experience a 4x improvement in revenue performance and better customer satisfaction.

Foundational to this culture is the ability to deliver timely insights to everyone in your organization across all your data. At our core, that is exactly what we aim to deliver with Azure Analytics and Power BI, and our work is paying off in value for our customers. According to a recent commissioned Forrester Consulting Total Economic Impact™ study, Azure Analytics and Power BI deliver incredible value to customers with a 271 percent ROI, while increasing satisfaction by 60 percent.

Our position in the Leaders quadrant of Gartner’s 2019 Magic Quadrant for Analytics and Business Intelligence Platforms, coupled with our undisputed performance in analytics, provides you with the foundation you need to implement a data-driven culture.

But what are the three key attributes needed to establish a data-driven culture?

First, it is vital to get the best performance from your analytics solution across all your data, at the best possible price.

Second, it is critical that your data is accurate and trusted, with all the security and privacy rigor needed for today’s business environment.

Finally, a data-driven culture necessitates self-service tools that empower everyone in your organization to gain insights from your data.

Let’s take a deeper look into each one of these critical attributes.

Performance

When it comes to performance, Azure has you covered. An independent study by GigaOm found that Azure SQL Data Warehouse is up to 14x faster and costs 94 percent less than other cloud providers. This unmatched performance is why leading companies like Anheuser-Busch InBev adopt Azure.

“We leveraged the elasticity of SQL Data Warehouse to scale the instance up or down, so that we only pay for the resources when they’re in use, significantly lowering our costs. This architecture performs significantly better than the legacy on-premises solutions it replaced, and it also provides a single source of truth for all of the company’s data.” – Chetan Kundavaram, Global Director, Anheuser-Busch InBev

Security

Azure is the most secure cloud for analytics. This is according to Donald Farmer, a well-respected thought leader in the data industry, who recently stated, “Azure SQL Data Warehouse platform offers by far the most comprehensive set of compliance and security capabilities of any cloud data warehouse provider”. Since then, we announced Dynamic Data Masking and Data Discovery and Classification to automatically help protect and obfuscate sensitive data on-the-fly to further enhance your data security and privacy.

Insights for all

Only when everyone in your organization has access to timely insights can you achieve a truly data-driven culture. Companies drive results when they break down data silos and establish a shared context of their business based on trusted data. Customers that use Azure Analytics and Power BI do exactly that. According to the same Forrester study, customers stated:

“Azure Analytics has helped with a culture change at our company. We are expanding into other areas so that everyone can make informed business decisions.” — Study interviewee

“Power BI was a huge success. We’ve added 25,000 users organically in three years.” — Study interviewee

Only Azure Analytics and Power BI together can unlock the performance, security, and insights for your entire organization. We are uniquely positioned to empower you to develop the data-driven culture needed to thrive. We are excited to see customers like Reckitt Benckiser choose Azure for their analytics needs.

“Data is most powerful when it’s accessible and understandable. With this Azure solution, our employees can query the data however they want versus being confined to the few rigid queries our previous system required. It’s very easy for them to use Power BI Pro to integrate new data sets to deliver enormous value. When you put BI solutions in the hands of your boots on the ground—your sales force, marketing managers, product managers—it delivers a huge impact to the business.” — Wilmer Peres, Information Services Director, Reckitt Benckiser

When you add it all up, Azure Analytics and Power BI are simply unmatched.

Get started today

To learn more about Azure’s insights for all advantage, get started today!

Gartner, Magic Quadrant for Analytics and Business Intelligence Platforms, 11 February 2019, Cindi Howson, James Richardson, Rita Sallam, Austin Kronz

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, express or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Over 100 Azure services support PROTECTED Australian government data

Today Microsoft published an independent security assessment of 113 Microsoft Azure services for their suitability to handle official and PROTECTED Australian government information. This assessment, carried out under the Information Security Registered Assessor Program (IRAP), is now available for customers and partners to review and use as they plan for increasing the use of cloud in government.

This milestone significantly expands the ability of the Australian government to leverage Microsoft Azure to drive digital transformation. The expanded scope of this IRAP assessment includes cognitive services, machine learning, IoT, advanced cybersecurity, open source database management, and serverless and application development technologies. This enables the full range of innovation within Azure Australia to be utilized for government applications, further reinforcing our commitment to achieving the broadest range of accreditations and assurances to meet the needs of government customers.

This assurance is critical for customers such as the Victorian Government, using ICT shared services provider Cenitex in partnership with Canberra-based OOBE to deploy VicCloud Protect, a ground-breaking and highly secure service that enables its government customers to safely manage applications and data rated up to PROTECTED level.

“VicCloud Protect is a first for the Victorian Government and our customers can now confidently store their classified data in the cloud with peace of mind that the platform meets both the Australian Cyber Security Centre guidelines and the Victorian Protection Data Security Framework to handle Protected level information.” – Nigel Cadywould, Cenitex Service Delivery Director

This is just one of many examples of Australian governments and partners building on the secure foundations of Azure to build transformative solutions for government. Microsoft is one of the only global cloud providers to operate cloud regions in Canberra specifically designed and secured to meet the strict security compliance requirements of Australian government and national critical infrastructure, including:

  • Data center facilities within CDC, a datacenter provider based in Canberra that specializes in government and national critical infrastructure and meets the stringent sovereignty and transparent ownership controls required by the Australian government’s hosting policy.
  • Leading physical and personnel security within the Canberra facilities designed for the even higher requirements of handling secret government data.
  • Direct connection within the data center to the federal government’s intragovernment communications network (ICON) for enhanced security and performance.
  • Unmatched flexibility for colocation of critical systems in the same facilities as Microsoft Azure in Canberra and access to the ecosystem of solution providers deployed within CDC.

Microsoft delivers the Azure Australia Central regions in Canberra as the first and best home of Australian government data and applications. The assessment released today covers not just the Central regions but all regions of Microsoft Azure in Australia, including Australia East (Sydney) and Australia Southeast (Melbourne). Also, as Microsoft has introduced further capacity and capabilities into the Australia Central regions, we have streamlined the process for customers to deploy services into our Canberra regions. Customers no longer need to manually request access to deploy services to the Australia Central region and can now deploy directly from the portal.

Because the Australian Government has designed the IRAP program to follow a risk-based approach, each customer decides whether to operate that service at the PROTECTED level or lower. To assist customers with their authorization decision, Microsoft makes the IRAP assessment report and supporting documents available to customers and partners on an Australia-specific page of the Microsoft Service Trust Portal.

For government customers who want to get started building solutions for PROTECTED level data, we’ve published Australia PROTECTED Blueprint guidance with reference architectures for IaaS and PaaS web applications along with threat model and control implementation guidance. This Blueprint enables customers to more easily deploy Azure solutions suitable for processing, storage, and transmission of sensitive and official information classified up to and including PROTECTED.

Learn more about our latest IRAP assessment

Azure Cosmos DB recommendations keep you on the right track

The tech world is fast-paced, and cloud services like Azure Cosmos DB get frequent updates with new features, capabilities, and improvements. It’s important—but also challenging—to keep up with the latest performance and security updates and assess whether they apply to your applications. To make it easier, we’ve introduced automatic and tailored recommendations for all Azure Cosmos DB users. A broad range of personalized recommendations now shows up in the Azure portal when you browse your Azure Cosmos DB accounts.

Some of the recommendations we’re currently dispatching cover the following topics:

  • SDK upgrades: When we detect the usage of an old version of our SDKs, we recommend upgrading to a newer version to benefit from our latest bug fixes and performance improvements.
  • Fixed to partitioned collections: To fully leverage Azure Cosmos DB’s massive scalability, we encourage users of legacy, fixed-sized containers that are approaching the limit of their storage quota to migrate these containers to partitioned ones.
  • Query page size: We recommend using a query page size of -1 (letting the service decide how many items to return per page) when we detect that a specific value has been set instead (see the client sketch after this list).
  • Composite indexes: Composite indexes can dramatically improve the performance and RU consumption of some queries, so we suggest their usage whenever our telemetry detects queries that can benefit from them.
  • Incorrect SDK usage: It’s possible for us to detect when our SDKs are incorrectly used, like when a client instance is created for each request instead of being used as a singleton throughout the application; corresponding recommendations are provided in these cases.
  • Lazy indexing: The purpose of Azure Cosmos DB’s lazy indexing mode is rather limited and can impact the freshness of query results in some situations. We advise using the (default) consistent indexing mode instead of lazy indexing.
  • Transient errors: In rare occurrences, some transient errors can happen when a database or collection gets created. SDKs usually retry operations whenever a transient error occurs, but if that’s not the case, we notify our users that they can safely retry the corresponding operation.
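As a minimal sketch of the SDK-related recommendations above (using the azure-cosmos v4 Python SDK; the account URI, key, and database and container names are placeholders), the snippet below creates a single client for reuse and issues a query with a page size of -1:

```python
from azure.cosmos import CosmosClient

ACCOUNT_URI = "https://<your-account>.documents.azure.com:443/"  # placeholder
ACCOUNT_KEY = "<your-account-key>"                               # placeholder

# Create the client once and reuse it as a singleton for the whole application,
# rather than instantiating a new client per request.
client = CosmosClient(ACCOUNT_URI, credential=ACCOUNT_KEY)
container = client.get_database_client("telemetry").get_container_client("events")

# max_item_count=-1 lets the service choose the page size instead of a fixed value.
items = container.query_items(
    query="SELECT * FROM c WHERE c.deviceId = @id",
    parameters=[{"name": "@id", "value": "device-42"}],
    enable_cross_partition_query=True,
    max_item_count=-1,
)
for item in items:
    print(item["id"])
```

Reusing one client instance and letting the service pick the page size mirrors the “incorrect SDK usage” and “query page size” recommendations above.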

Each of our recommendations includes a link that brings you directly to the relevant section of our documentation, so it’s easy for you to take action.

3 ways to find your Azure Cosmos DB recommendations

1. Click on this message at the top of the Azure Cosmos DB blade:

A pop-up message in Azure Cosmos DB saying that new notifications are available.

2. Head directly to the new “Notifications” section of your Azure Cosmos DB accounts:

The Notifications section showing all received Cosmos DB recommendations.

3. Or find them through Azure Advisor, which makes it easier for users who don’t routinely visit the Azure portal to receive our recommendations.

Over the coming weeks and months, we’ll expand the coverage of these notifications to include topics like partitioning, indexing, network security, and more. We also plan to surface general best practices to ensure you’re making the most out of Azure Cosmos DB.

Have ideas or suggestions for more recommendations? Email us or leave feedback using the smiley on the top-right corner of the Azure portal!

HDInsight support in Azure CLI now out of preview

We are pleased to share that support for HDInsight in Azure CLI is now generally available. The addition of the az hdinsight command group allows you to easily manage your HDInsight clusters using simple commands while taking advantage of all that Azure CLI has to offer, such as cross-platform support and tab completion.

Key Features

  • Cluster CRUD: Create, delete, list, resize, show properties, and update tags for your HDInsight clusters.
  • Script actions: Execute script actions, list and delete persistent script actions, promote ad-hoc script executions to persistent script actions, and show the execution history of script action runs.
  • Manage Azure Monitor integration: Enable, disable, and show the status of Azure Monitor integration on HDInsight clusters.
  • Applications: Create, delete, list, and show properties for applications on your HDInsight clusters.
  • Core usage: View available core counts by region before deploying large clusters.

A gif showing the creation of an HDInsight cluster using a single, simple Azure CLI command.

Create an HDInsight cluster using a single, simple Azure CLI command

Azure CLI benefits

  • Cross platform: Use Azure CLI on Windows, macOS, Linux, or the Azure Cloud Shell in a browser to manage your HDInsight clusters with the same commands and syntax across platforms.
  • Tab completion and interactive mode: Autocomplete command and parameter names as well as subscription-specific details like resource group names, cluster names, and storage account names. Don’t remember your 88-character storage account key off the top of your head? Azure CLI can autocomplete that as well!
  • Customize output: Make use of Azure CLI’s globally available arguments to show verbose or debug output, filter output using the JMESPath query language, and change the output format among JSON, tab-separated values, ASCII tables, and more.

Getting started

You can get up and running to start managing your HDInsight clusters using Azure CLI in three easy steps.

  1. Install Azure CLI for Windows, macOS, or Linux. Alternatively, you can use Azure Cloud Shell to use Azure CLI in a browser.
  2. Log in using the az login command.
  3. Take a look at our reference documentation for “az hdinsight” or run az hdinsight -h to see a full list of supported HDInsight commands and descriptions, and start using Azure CLI to manage your HDInsight clusters.

About HDInsight

Azure HDInsight is an easy, cost-effective, enterprise-grade service for open source analytics that enables customers to easily run popular open source frameworks, such as Apache Hadoop, Spark, Kafka, and more. The service is available in 28 public regions and in Azure Government and national clouds in the US, Germany, and China. Azure HDInsight powers mission-critical applications in a wide variety of sectors and enables a wide range of use cases including ETL, streaming, and interactive querying.

Monitoring on Azure HDInsight part 4: Workload metrics and logs

This is the fourth blog post in a four-part series on monitoring on Azure HDInsight. Monitoring on Azure HDInsight Part 1: An Overview discusses the three main monitoring categories: cluster health and availability, resource utilization and performance, and job status and logs. Part 2 centered on the first topic, monitoring cluster health and availability. Part 3 discussed monitoring performance and resource utilization. This post covers the third of those topics, workload metrics and logs, in more depth.


During normal operations when your Azure HDInsight clusters are healthy and performing optimally, you will likely focus your attention on monitoring the workloads running on your clusters and viewing relevant logs to assist with debugging. Azure HDInsight offers two tools that can be used to monitor cluster workloads: Apache Ambari and integration with Azure Monitor logs. Apache Ambari is included with all Azure HDInsight clusters and provides an easy-to-use web user interface that can be used to monitor the cluster and perform configuration changes. Azure Monitor collects metrics and logs from multiple resources, such as HDInsight clusters, into an Azure Monitor Log Analytics workspace. An Azure Monitor Log Analytics workspace presents your metrics and logs as structured, queryable tables that can be used to configure custom alerts. Azure Monitor logs provide an excellent overall experience for monitoring workloads and interacting with logs, especially if you have multiple clusters.

Azure Monitor logs

Azure Monitor logs enable data generated by multiple resources such as HDInsight clusters to be collected and aggregated in one place to achieve a unified monitoring experience. As a prerequisite, you will need a Log Analytics workspace to store the collected data. If you have not already created one, you can follow these instructions for creating an Azure Monitor Log Analytics workspace. You can then easily configure an HDInsight cluster to send a host of logs and metrics to Azure Monitor Log Analytics.

HDInsight monitoring solutions

Azure HDInsight offers pre-made monitoring dashboards in the form of solutions that can be used to monitor the workloads running on your clusters. There are solutions for Apache Spark, Hadoop, Apache Kafka, Interactive Query (LLAP), Apache HBase, and Apache Storm available in the Azure Marketplace. Please see our documentation to learn how to install a monitoring solution. These solutions are workload-specific, allowing you to monitor metrics like central processing unit (CPU) time, available YARN memory, and logical disk writes across multiple clusters of a given type. Selecting a graph takes you to the query used to generate it, shown in the logs view.

An example of the job graph showing stages 0 through 3 for a Spark job.

 

The HDInsight Spark monitoring solutions provide a simple pre-made dashboard where you can monitor workload-specific metrics for multiple clusters on a single pane of glass.

The pre-made dashboard for Kafka we offer as part of HDInsight for monitoring Kafka clusters.

The HDInsight Kafka monitoring solution enables you to monitor all of your Kafka clusters on a single pane of glass.

Query using the logs blade

You can also use the logs view in your Log Analytics workspace to query the metrics and tables directly.

HDInsight clusters emit several workload-specific tables of logs, such as log_resourcemanager_CL, log_spark_CL, log_kafkaserver_CL, log_jupyter_CL, log_regionserver_CL, and log_hmaster_CL.

On the metrics side, clusters emit several metrics tables, including metrics_sparkapps_CL, metrics_resourcemanager_queue_root_CL, metrics_kafka_CL, and metrics_hmaster_CL. For more information, please see our documentation, Query Azure Monitor logs to monitor HDInsight clusters.
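As a rough, hedged illustration of querying these tables programmatically (the workspace ID and Azure AD bearer token are placeholders, and the table name should be adjusted to whatever your clusters actually emit), the sketch below posts a Kusto query to the Log Analytics REST API:

```python
import requests

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder
AAD_TOKEN = "<azure-ad-bearer-token>"          # placeholder, scoped to api.loganalytics.io

# Kusto query against one of the workload tables emitted by HDInsight clusters;
# adjust the table name and time range to match your workspace.
kusto_query = "log_spark_CL | where TimeGenerated > ago(1h) | take 10"

response = requests.post(
    f"https://api.loganalytics.io/v1/workspaces/{WORKSPACE_ID}/query",
    headers={"Authorization": f"Bearer {AAD_TOKEN}"},
    json={"query": kusto_query},
)
response.raise_for_status()

# The API returns one or more tables, each with column metadata and rows.
for table in response.json()["tables"]:
    print(table["name"], "-", len(table["rows"]), "rows")
```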

The log blade in a Log Analytics workspace used to query metrics and logs tables.

The Logs blade in a Log Analytics workspace lets you query collected metrics and logs across many clusters.

Azure Monitor alerts

You can also set up Azure Monitor alerts that will trigger when the value of a metric or the results of a query meet certain conditions. You can condition on a query returning a record with a value that is greater than or less than a certain threshold, or even on the number of results returned by a query. For example, you could create an alert to send an email if a Spark job fails or if Kafka disk usage exceeds 90 percent.

There are several types of actions you can choose to trigger when your alert fires, such as an email, an SMS message, a push notification, a voice call, an Azure Function, an Azure Logic App, a webhook, an IT service management (ITSM) ticket, or an automation runbook. You can set multiple actions for a single alert, and find more information about these different types of actions by visiting our documentation, Create and manage action groups in the Azure Portal.

Finally, you can specify a severity for the alert in addition to the name. The ability to specify severity is a powerful tool that can be used when creating multiple alerts. For example, you could create an alert to raise a Sev 1 warning alert if a single head node becomes unavailable and another alert that raises a Sev 0 critical alert in the unlikely event that both head nodes go down. Alerts can be grouped by severity when viewed later.

Apache Ambari

The Apache Ambari dashboard provides links to several different views for monitoring workloads on your cluster.

ResourceManager user interface

The ResourceManager user interface provides several views to monitor jobs on a YARN-based cluster. Here, you can see multiple views, including an overview of finished or running apps and their resource usage, a view of scheduled jobs by queue, and a list of job execution history and the status of each. You can click on an individual application ID to view more details about that job.

The Applications tab in YARN UI, which shows a list of application execution history for a cluster.

Spark History Server

The Apache Spark History Server shows detailed information for completed Spark jobs, allowing for easy monitoring and debugging. In addition to the traditional tabs across the top (jobs, stages, executors, etc.), you will find additional data, graph, and diagnostic tabs to help with further debugging.

The pre-made dashboard for Spark we offer as part of HDInsight for monitoring Spark clusters.

Cluster logs

YARN log files are available on HDInsight clusters and can be accessed through the ResourceManager logs link in Apache Ambari. For more information about cluster logs, please see our documentation, Manage logs for an HDInsight cluster.

Next steps

If you haven’t read the other blog posts in this series, check out parts 1 through 3.

About Azure HDInsight

Azure HDInsight is an easy, cost-effective, enterprise-grade service for open source analytics that enables customers to easily run popular open source frameworks including Apache Hadoop, Spark, Kafka, and others. The service is available in 36 regions and Azure Government and national clouds. Azure HDInsight powers mission-critical applications in a wide variety of sectors and enables a wide range of use cases including extract, transform, and load (ETL), streaming, and interactive querying.

Azure HPC Cache: Reducing latency between Azure and on-premises storage

Today we’re previewing the Azure HPC Cache service, a new Azure offering that empowers organizations to more easily run large, complex high-performance computing (HPC) workloads in Azure. Azure HPC Cache reduces latency for applications where data may be tethered to existing data center infrastructure because of dataset sizes and operational scale.

Scale your HPC pipeline using data stored on-premises or in Azure. Azure HPC Cache delivers the performant data access you need to be able to run your most demanding, file-based HPC workloads in Azure, without moving petabytes of data, writing new code, or modifying existing applications.

For users familiar with the Avere vFXT for Azure application available through the Microsoft Azure Marketplace, Azure HPC Cache offers similar functionality in a more seamless experience—meaning even easier data access and simpler management via the Azure Portal and API tools. The service can be driven with Azure APIs and is proactively monitored on the back end by the Azure HPC Cache support team and maintained by Azure service engineers. What is the net benefit? The Azure HPC Cache service delivers all the performance benefits of the Avere vFXT caching technology at an even lower total cost of ownership.

Azure HPC Cache works by automatically caching active data in Azure that is located both on-premises and in Azure, effectively hiding latency to on-premises network-attached storage (NAS), Azure-based NAS environments using Azure NetApp Files, or Azure Blob Storage. The cache delivers high-performance, seamless network file system (NFSv3) access to files in Portable Operating System Interface (POSIX) compliant directory structures. The cache can also aggregate multiple data sources into a single namespace, presenting one directory structure to clients. Azure compute clients can then access data as though it all originated on a single NAS filer.

Ideal for cloud-bursting applications or hybrid NAS environments, Azure HPC Cache lets you keep your data on existing datacenter-resident NetApp or Dell EMC Isilon arrays. Whether you need to store data on-premises while you develop your cloud strategy for security and compliance reasons, or because you simply have so much data on-premises that you don’t want to move it, you can still take full advantage of Azure compute services and do it sooner rather than later. Once you are ready or able to shift data to Azure Storage resources, you can still run file-based workloads with ease. Azure HPC Cache provides the performance you need to lift and shift your pipeline.

Azure HPC Cache provides high-performance file caching for HPC workloads running in Azure.  

To the cloud in days, not months

Combined with other Azure services such as the Azure HB- and HC-series virtual machines (VMs) for HPC and the Azure CycleCloud HPC workload manager, Azure HPC Cache lets you quickly reproduce your on-premises environment in the cloud and access on-premises data without committing to a large-scale migration. You can also expect to run your HPC workloads in Azure at performance levels similar to your on-premises infrastructure.

The Azure HPC Cache service is easy to initiate and manage from the Azure Portal. Once your network has been set up and your on-premises environment has IP connectivity to Azure, you can typically turn on the Azure HPC Cache service in about ten minutes. Imagine being able to do HPC jobs in days rather than waiting for months while your IT team fine-tunes data migration strategies and completes all required data moves and synchronization processes.

From burst to all-in: Your choice, your pace

The high-performance Azure HPC Cache delivers the scale-out file access required by HPC applications across an array of industries, from finance to government, life sciences, manufacturing, media, and oil and gas. The service is ideally suited for read-heavy workloads running on 1,000 to 50,000 compute cores. Because Azure HPC Cache is a metered service with usage charges included on your Azure bill, you can turn it off—and stop the meter—when you’re done.

In demanding workloads, Azure HPC Cache provides efficient file access to data stored on-premises or in Azure Blob and can be used with cloud orchestration technologies for management.

Azure HPC Cache helps HPC users access Azure resources more simply and economically. You can deliver exactly the performance needed for computationally intensive workloads, in time to meet demand. Start by using Azure capacity for short-term demand, and enabling a hybrid NAS environment, or go all-cloud and make Azure your permanent IT infrastructure. Azure HPC Cache provides the seamless data access you need to leverage cloud resources in a manner and at a pace that suits your unique business needs and use cases.

Proven technology maintained by Azure experts

Azure HPC Cache service is the latest innovation in a continuum of high-performance caching solutions built on Avere Systems FXT Edge Filer foundational technology. Who uses this technology? A diverse, global community that includes post-production studio artists in the UK, weather researchers in Poland, animators in Toronto, investment bankers in New York City, bioinformaticists in Cambridge and Switzerland, and many, many more of the world’s most demanding HPC users. Azure HPC Cache combines this most sought-after technology with the technical expertise and deep-bench support of the Microsoft Azure team.

Can’t wait to try it?

Ready to get off the sidelines and start running your HPC workloads in Azure? We have a few opportunities for customers to preview Azure HPC Cache. Just complete a short survey, and we’ll review your submission for suitability.

The Azure HPC Cache team is committed to helping deliver on Microsoft’s “Cloud for all” mission and will work with you to design a cloud that you can use to quickly turn your ideas into solutions. Have questions? Email them to [email protected].

MileIQ and Azure Event Hubs: Billions of miles streamed

This post was co-authored by Shubha Vijayasarathy, Program Manager, Azure Messaging (Event Hubs)

With billions of miles logged, MileIQ provides stress-free logging and accurate mileage reports for millions of drivers. Logging and reporting miles driven is a necessity for everyone from independent contractors to organizations with employees who need to drive for work. MileIQ automates mileage logging to create accurate records of miles driven, minimizing the effort and time needed for manual calculations. Real-time mileage tracking produces over a million location signal events per hour, requiring fast and resilient event processing that scales.

MileIQ leverages Apache Kafka to ingest massive streams of data:

  • Event processing: Events that demand time-consuming processing are put into Kafka, and multiple processors consume and process these asynchronously.
  • Communication among micro-services: Events are published by the event-owning micro-service on Kafka topics. The other micro-services, which are interested in these events, subscribe to these topics to consume the events.
  • Data Analytics: As all the important events are published on Kafka, the data analytics team subscribes to the topics it is interested in and pulls all the data it requires for data processing.

Growth Challenges

As with any successful venture, growth introduces operational challenges as infrastructure struggles to support the growing demand. In MileIQ’s case, the effort and resources required to maintain Apache Kafka clusters multiplied exponentially with adoption. A seemingly simple task, like modifying a topic’s retention configuration, becomes an operational burden as the number of Kafka clusters scales to meet the increase in data.

Leveraging a managed service enabled MileIQ to shift resources from operations and maintenance and focus on new ways to drive business impact. A couple of reasons why the MileIQ team selected Azure Event Hubs for Kafka:

  • Fully managed platform as a service (PaaS): With little configuration or management overhead, Event Hubs for Kafka provides a PaaS Kafka experience without the need to manage, configure, or run Kafka clusters.
  • Supports multiple Kafka use cases: Event Hubs for Apache Kafka provides support at the protocol level, enabling integration of existing Kafka applications with no code changes and minimal configuration change (see the producer sketch after this list). MileIQ’s existing Kafka producers and consumers, as well as other streaming applications like Apache Kafka MirrorMaker and Apache Spark, integrated seamlessly with the Kafka-enabled event hub.
  • Deliver streaming data to Azure Blob storage: The Capture feature of Event Hubs automatically sends data from Azure Event Hubs for Kafka to Blob storage. MileIQ uses the data in Blob storage for data analytics and backup.
  • Enterprise performance: The Dedicated-tier cluster offers single-tenant deployments with a guaranteed 99.99% SLA. MileIQ performance tests showed the Dedicated-tier cluster was able to consistently produce a throughput rate of 6,000 events per second.*

* Testing was based on producing one event at a time synchronously, to address specific use cases that favor consistency over throughput. Testing with batching and asynchronous produce calls resulted in much higher throughput.
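As a minimal sketch of the protocol-level compatibility described in the list above (the namespace, event hub name, and connection string are placeholders, and confluent-kafka is just one of several Kafka clients that work unchanged), an existing producer only needs its connection settings pointed at the Event Hubs Kafka endpoint:

```python
from confluent_kafka import Producer

# Point an ordinary Kafka producer at the Event Hubs Kafka endpoint (port 9093).
producer = Producer({
    "bootstrap.servers": "<namespace>.servicebus.windows.net:9093",  # placeholder namespace
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "<event-hubs-connection-string>",               # placeholder
})

# Topics map to event hubs; the application code itself does not change.
producer.produce(
    "drive-events",
    key="device-42",
    value=b'{"lat": 47.61, "lon": -122.33, "speed_kmh": 42}',
)
producer.flush()
```

The `$ConnectionString` username with the namespace connection string as the password is the standard SASL/PLAIN configuration documented for Event Hubs for Kafka; consumers are configured the same way.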

Set up for success

As a result of migrating Apache Kafka to a managed service, MileIQ now has the infrastructure needed to support future growth.

“To sum up, our experience switching over to Azure Event Hubs Kafka has been excellent. To start with, the onboarding was straightforward, integration was seamless, and we continue to receive great help and support from the Azure Event Hubs Kafka team. In the near future, we look forward to the release of new features that the Azure Event Hubs Kafka team is working on – Geo Replication, Idempotent Producers, Kafka Streams, etc.”

“Migrating to Azure Event Hubs Kafka was a painless experience: straightforward onboarding, seamless integration, and support from the Event Hubs team every step of the way. We’re excited to see what’s next and look forward to a continued partnership.” – MileIQ

Start streaming data

Data is valuable only when there is an easy way to process and get timely insights from data sources. Azure Event Hubs provides a fully managed distributed stream processing platform with low latency and seamless integration with Apache Kafka applications.

What are you waiting for? Time to get event-ing!

Enjoyed this blog? Follow us as we update the list of features we will support next. Leave us your valuable feedback, questions, or comments below.

Happy event-ing!

New capabilities in Stream Analytics reduce development time for big data apps

Azure Stream Analytics is a fully managed PaaS offering that enables real-time analytics and complex event processing on fast moving data streams. Thanks to zero-code integration with over 15 Azure services, developers and data engineers can easily build complex pipelines for hot-path analytics within a few minutes. Today, at Inspire, we are announcing various new innovations in Stream Analytics that help further reduce time to value for solutions that are powered by real-time insights. These are as follows:

Bringing the power of real-time insights to Azure Event Hubs customers

Today, we are announcing one-click integration with Event Hubs. Available as a public preview feature, this allows an Event Hubs customer to visualize incoming data and start writing a Stream Analytics query with one click from the Event Hubs portal. Once the query is ready, they can operationalize it in a few clicks and start deriving real-time insights. This will significantly reduce the time and cost to develop real-time analytics solutions.

GIF showing the one-click integration between Event Hubs and Azure Stream Analytics

One-click integration between Event Hubs and Azure Stream Analytics

Augmenting streaming data with SQL reference data support

Reference data is a static or slow-changing dataset used to augment real-time data streams to deliver more contextual insights. An example scenario is a table of currency exchange rates that is regularly updated to reflect market trends and used to convert a stream of billing events in different currencies to a common currency of choice.

Now generally available (GA), this feature provides out-of-the-box support for Azure SQL Database as reference data input. This includes the ability to automatically refresh your reference dataset periodically. Also, to preserve the performance of your Stream Analytics job, we provide the option to fetch incremental changes from your Azure SQL Database by writing a delta query. Finally, Stream Analytics leverages versioning of reference data to augment streaming data with the reference data that was valid at the time the event was generated. This ensures repeatability of results.

New analytics functions for stream processing

  • Pattern matching: With the new MATCH_RECOGNIZE function, you can easily define event patterns using regular expressions and aggregate methods to verify and extract values from the match. This enables you to easily express and run complex event processing (CEP) on your streams of data. For example, this function enables users to easily author a query that detects “head and shoulders” patterns in a stock market feed.
  • Use of analytics functions as aggregates: You can now use aggregates such as SUM, COUNT, AVG, MIN, and MAX directly with the OVER clause, without having to define a window. Analytics functions as aggregates enable users to easily express queries such as “Is the latest temperature greater than the maximum temperature reported in the last 24 hours?”

Egress to Azure Data Lake Storage Gen2

Azure Stream Analytics is a central component within the big data analytics pipelines of Azure customers. While Stream Analytics focuses on the real-time or hot-path analytics, services like Azure Data Lake help enable batch processing and advanced machine learning. Azure Data Lake Storage Gen2 takes core capabilities from Azure Data Lake Storage Gen1, such as a Hadoop-compatible file system, Azure Active Directory integration, and POSIX-based ACLs, and integrates them into Azure Blob Storage. This combination enables best-in-class analytics performance along with storage tiering and data lifecycle management capabilities and the fundamental availability, security, and durability capabilities of Azure Storage.

Azure Stream Analytics now offers native zero-code integration with Azure Data Lake Storage Gen2 output (preview).

Enhancements to blob output

  • Native support for Apache Parquet format: Native support for egress in Apache Parquet format into Azure Blob Storage is now generally available. Parquet is a columnar format enabling efficient big data processing. By outputting data in Parquet format into a blob store or a data lake, you can take advantage of Azure Stream Analytics to power large-scale streaming extract, transform, and load (ETL), to run batch processing, to train machine learning algorithms, or to run interactive queries on your historical data.
  • Managed identities (formerly MSI) authentication: Azure Stream Analytics now offers full support for managed identity based authentication with Azure Blob Storage on the output side. Customers can continue to use the connection string based authentication model. This feature is available as a public preview.

Many of these features have just started rolling out worldwide and will be available in all regions within several weeks.

Feedback

The Azure Stream Analytics team is highly committed to listening to your feedback and letting the user voice influence our future investments. We welcome you to join the conversation and make your voice heard via our UserVoice page.

Event-driven analytics with Azure Data Lake Storage Gen2

Most modern-day businesses employ analytics pipelines for real-time and batch processing. A common characteristic of these pipelines is that data arrives at irregular intervals from diverse sources. This adds complexity in terms of having to orchestrate the pipeline such that data gets processed in a timely fashion.

The answer to these challenges lies in a decoupled, event-driven pipeline built from serverless components that responds to changes in data as they occur.

An integral part of any analytics pipeline is the data lake. Azure Data Lake Storage Gen2 provides secure, cost-effective, and scalable storage for the structured, semi-structured, and unstructured data arriving from diverse sources. Azure Data Lake Storage Gen2’s performance, global availability, and partner ecosystem make it the platform of choice for analytics customers and partners around the world. Next comes the event processing aspect. With Azure Event Grid, a fully managed event routing service, Azure Functions, a serverless compute engine, and Azure Logic Apps, a serverless workflow orchestration engine, it is easy to perform event-based processing and workflows responding to events in real time.

Today, we’re very excited to announce that Azure Data Lake Storage Gen2 integration with Azure Event Grid is in preview! This means that Azure Data Lake Storage Gen2 can now generate events that can be consumed by Event Grid and routed to subscribers with webhooks, Azure Event Hubs, Azure Functions, and Logic Apps as endpoints. With this capability, individual changes to files and directories in Azure Data Lake Storage Gen2 can automatically be captured and made available to data engineers for creating rich big data analytics platforms that use event-driven architectures.

Modern data warehouse

The diagram above shows a reference architecture for the modern data warehouse pipeline built on Azure Data Lake Storage Gen2 and Azure serverless components. Data from various sources lands in Azure Data Lake Storage Gen2 via Azure Data Factory and other data movement tools. Azure Data Lake Storage Gen2 generates events for new file creation, updates, renames, or deletes, which are routed via Event Grid and an Azure Function to Azure Databricks. An Azure Databricks job processes the file and writes the output back to Azure Data Lake Storage Gen2. When this happens, Azure Data Lake Storage Gen2 publishes a notification to Event Grid, which invokes an Azure Function to copy data to Azure SQL Data Warehouse. Data is finally served via Azure Analysis Services and Power BI.

The events that will be made available for Azure Data Lake Storage Gen2 are BlobCreated, BlobDeleted, BlobRenamed, DirectoryCreated, DirectoryDeleted, and DirectoryRenamed. Details on these events can be found in the documentation “Azure Event Grid event schema for Blob storage.”
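To make the event-driven flow concrete, here is a minimal sketch of a Python Azure Function with an Event Grid trigger reacting to these events; the binding configuration is assumed to be set up separately, and the downstream actions are only hinted at in comments:

```python
import logging
import azure.functions as func


def main(event: func.EventGridEvent):
    """React to Azure Data Lake Storage Gen2 events routed through Event Grid."""
    data = event.get_json()

    if event.event_type == "Microsoft.Storage.BlobCreated":
        # A new file landed in the data lake; kick off downstream processing here,
        # for example triggering a Databricks job or a copy to the data warehouse.
        logging.info("New file created: %s", data.get("url"))
    elif event.event_type == "Microsoft.Storage.BlobDeleted":
        # Unexpected deletions could feed an alerting or auditing workflow.
        logging.warning("File deleted: %s", data.get("url"))
```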

Some key benefits include:

  • Seamless integration to automate workflows enables customers to build an event-driven pipeline in minutes.
  • Enable alerting with rapid reaction to creation, deletion, and renaming of files and directories. A myriad of scenarios would benefit from this – especially those associated with data governance and auditing. For example, alert and notify of all changes to high business impact data, set up email notifications for unexpected file deletions, and detect and act upon suspicious activity from an account.
  • Eliminate the complexity and expense of polling services and integrate events coming from your data lake with third-party applications using webhooks, such as billing and ticketing systems.

Next steps

Azure Data Lake Storage Gen2 integration with Azure Event Grid is now available in West Central US and West US 2. Subscribing to Azure Data Lake Storage Gen2 events works the same as it does for Azure Storage accounts. To learn more, see the documentation “Reacting to Blob storage events.” We would love to hear more about your experiences with the preview and get your feedback at [email protected]

Build more accurate forecasts with new capabilities in automated machine learning

We are excited to announce new capabilities that are part of time-series forecasting in the Azure Machine Learning service. We launched the preview of forecasting in December 2018, and we have been excited by the strong customer interest. We listened to our customers and appreciate all the feedback. Your responses helped us reach this milestone. Thank you.

Featured image: general availability of automated machine learning time-series forecasting.

Building forecasts is an integral part of any business, whether it’s revenue, inventory, sales, or customer demand. Building machine learning models is time-consuming and complex, with many factors to consider, such as iterating through algorithms, tuning hyperparameters, and feature engineering. These choices multiply with time series data, which adds considerations of trends, seasonality, holidays, and effectively splitting training data.

Forecasting within automated machine learning (ML) now includes new capabilities that improve the accuracy and performance of our recommended models:

  • New forecast function
  • Rolling-origin cross validation
  • Configurable lags
  • Rolling window aggregate features
  • Holiday detection and featurization

                  Expanded forecast function

                  We are introducing a new way to retrieve prediction values for the forecast task type. When dealing with time series data, several distinct scenarios arise at prediction time that require more careful consideration. For example, are you able to re-train the model for each forecast? Do you have the forecast drivers for the future? How can you forecast when you have a gap in historical data? The new forecast function can handle all these scenarios.

                  Let’s take a closer look at common configurations of train and prediction data scenarios, when using the new forecasting function. For automated ML the forecast origin is defined as the point when the prediction of forecast values should begin. The forecast horizon is how far out the prediction should go into the future.

                  In many cases training and prediction do not have any gaps in time. This is the ideal because the model is trained on the freshest available data. We recommend you set your forecast this way if your prediction interval allows time to retrain, for example in more fixed data situations such as financial forecasts rate or supply chain applications using historical revenue or known order volumes.

                  Ideal use case when training and prediction data have no gaps in time.

                  When forecasting you may know future values ahead of time. These values act as contextual information that can greatly improve the accuracy of the forecast. For example, the price of a grocery item is known weeks in advance, which strongly influences the “sales” target variable. Another example is when you are running what-if analyses, experimenting with future values of drivers like foreign exchange rates. In these scenarios the forecast interface lets you specify forecast drivers describing time periods for which you want the forecasts (Xfuture). 

If the training and prediction data have a gap in time, the trained model becomes stale. For example, in high-frequency applications like IoT, it is impractical to retrain the model constantly due to the high velocity of change from sensors, with dependencies on other devices or external factors such as weather. You can provide prediction context with recent values of the target (ypast) and the drivers (Xpast) to improve the forecast. The forecast function will gracefully handle the gap, imputing values from the training data and prediction context where necessary.

                  Using contextual data to assist forecast when training and prediction data have gaps in time.

                  In other scenarios, such as sales, revenue, or customer retention, you may not have contextual information available for future time periods. In these cases, the forecast function supports making zero-assumption forecasts out to a “destination” time. The forecast destination is the end point of the forecast horizon. The model maximum horizon is the number of periods the model was trained to forecast and may limit the forecast horizon length.

                  Use case when no gap in time exists between training and prediction data and no contextual data is available.

                  The forecast model enriches the input data (e.g. adds holiday features) and imputes missing values. The enriched and imputed data are returned with the forecast.
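To make these scenarios concrete, here is a minimal pandas sketch of how the prediction inputs could be laid out for each case. The column names, dates, and the fitted_model.forecast call shown in the final comment are illustrative assumptions rather than the exact API; the notebooks linked below show the authoritative usage.

```python
# A minimal sketch (pandas only) of prediction inputs for the three scenarios above.
# The trained model object and its forecast() call are assumptions for illustration.
import numpy as np
import pandas as pd

freq = "D"
train_end = pd.Timestamp("2019-06-30")   # last date seen during training (hypothetical)

# Scenario 1: known future drivers (Xfuture), e.g. a planned price known in advance.
X_future = pd.DataFrame({
    "date": pd.date_range(train_end + pd.Timedelta(days=1), periods=7, freq=freq),
    "price": [2.99, 2.99, 2.49, 2.49, 2.99, 2.99, 2.99],
})

# Scenario 2: a gap since training. Supply recent actuals as context (ypast, Xpast),
# then NaN for the periods the model should forecast.
X_pred = pd.DataFrame({
    "date": pd.date_range("2019-07-10", periods=10, freq=freq),
    "price": 2.99,
})
y_pred = np.concatenate([
    [112.0, 98.0, 105.0],      # recent known target values (context)
    np.full(7, np.nan),        # unknown values to be forecast
])

# Scenario 3: no context at all. Forecast out to a "destination" date using NaNs.
X_dest = pd.DataFrame({"date": pd.date_range(train_end + pd.Timedelta(days=1),
                                             "2019-07-14", freq=freq)})
y_dest = np.full(len(X_dest), np.nan)

# Hypothetical call shape (assumption): the fitted forecasting model returns the
# forecast together with the enriched and imputed feature table.
# y_fcst, X_trans = fitted_model.forecast(X_pred, y_pred)
```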

                  Notebook examples for sales forecast, bike demand and energy forecast can be found on GitHub.

                  Rolling-origin cross validation

Cross-validation (CV) is a vital procedure for estimating and reducing a model’s out-of-sample error. For time series data, we need to ensure that training only uses values that occur before those in the test data. Partitioning the data without regard to time does not match how data becomes available in production and can lead to incorrect estimates of the forecaster’s generalization error.

                  To ensure correct evaluation, we added rolling-origin cross validation (ROCV) as the standard method to evaluate machine learning models on time series data. It divides the series into training and validation data using an origin time point. Sliding the origin in time generates the cross-validation folds.

As an example of what happens when we do not use ROCV, consider a hypothetical time series containing 40 observations. Suppose the task is to train a model that forecasts the series up to four time points into the future. A standard 10-fold cross validation (CV) strategy is shown in the image below. The y-axis in the image delineates the CV folds, while the colors distinguish training points (blue) from validation points (orange). In the 10-fold example below, notice how folds one through nine result in the model training on dates later than some of those included in the validation set, producing inaccurate training and validation results.

                  Cross validation showing training points spread across folds and distributed across time points causing data leakage in validation

This scenario should be avoided for time series. Instead, when we use a ROCV strategy as shown below, we preserve the integrity of the time series data and eliminate the risk of data leakage.

                  Rolling-Origin Cross Validation (ROCV) showing training points distributed on each fold at the end of the time period to eliminate data leakage during validation
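To see the rolling-origin idea in isolation, the sketch below uses scikit-learn’s TimeSeriesSplit, which implements an analogous splitting scheme. It is only an illustration of the split behavior, not what automated ML runs internally.

```python
# Illustration of rolling-origin splitting: every validation fold lies strictly
# after its training fold, so no future data leaks into training.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(40)                  # 40 time-ordered observations
splitter = TimeSeriesSplit(n_splits=9)  # 9 rolling-origin folds, 4 validation points each

for fold, (train_idx, val_idx) in enumerate(splitter.split(series), start=1):
    print(f"fold {fold}: train ends at t={train_idx[-1]}, "
          f"validate on t={val_idx[0]}..{val_idx[-1]}")
```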

                  ROCV is used automatically for forecasting. You simply pass the training and validation data together and set the number of cross validation folds. Automated machine learning (ML) will use the time column and grain columns you have defined in your experiment to split the data in a way that respects time horizons. Automated ML will also retrain the selected model on the combined train and validation set to make use of the most recent and thus most informative data, which under the rolling-origin splitting method ends up in the validation set.
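As a rough sketch, the relevant experiment settings might look like the dictionary below. The setting names (time_column_name, grain_column_names, n_cross_validations) are based on the SDK documentation of this period and should be treated as assumptions to verify against the current reference.

```python
# Hypothetical sketch of the settings that drive ROCV in an automated ML
# forecasting experiment; names are assumptions, confirm against the current SDK.
rocv_settings = {
    "time_column_name": "date",        # column used to order the data in time
    "grain_column_names": ["store"],   # one series per grain/group
    "n_cross_validations": 5,          # number of rolling-origin folds
}
```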

                  Lags and rolling window aggregates

Often the best information a forecaster can have is the recent value of the target. Creating lags and cumulative statistics of the target can therefore increase the accuracy of your predictions.

In automated ML, you can now specify target lags as model features. The lag length identifies how many rows to lag based on your time interval. For example, if you want to lag by two units of time, you set the lag length parameter to two.

The table below illustrates how a lag length of two would be treated. Green columns are engineered features with lags of sales by one day and two days. The blue arrows indicate how each of the lags is generated from the training data. Not-a-number (NaN) values are created when sample data does not exist for that lag period.

Table illustrating how a lag length of two would be treated
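The same lag featurization can be sketched in a few lines of pandas, using a hypothetical daily sales series. Automated ML generates these columns for you when target lags are configured; this only shows the idea.

```python
# Lag features of the target: previous-day and two-days-ago sales. Rows without
# enough history are filled with NaN, matching the table above.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=6, freq="D"),
    "sales": [120, 135, 128, 150, 141, 160],
})

df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_2"] = df["sales"].shift(2)
print(df)
```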

In addition to lags, there may be situations where you need to add rolling aggregations of data values as features. For example, when predicting energy demand you might add a rolling window feature of three days to account for thermal changes in heated spaces. The table below shows the feature engineering that occurs when window aggregation is applied. Columns for minimum, maximum, and sum are generated over a sliding window of three based on the defined settings. Each row gets new calculated features; for example, for January 4, 2017, the maximum, minimum, and sum values are calculated using the temp values for January 1, 2017 through January 3, 2017. This window of three then shifts along to populate data for the remaining rows.

                  Table showing feature engineering that occurs when window aggregation is applied.
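Here is a small pandas sketch of the three-period window aggregation described above, using a hypothetical daily temp series. Note the one-period shift, so that each row’s window covers the three preceding days, matching the January 4, 2017 example.

```python
# Rolling-window aggregates over the three previous days of a temperature series.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=7, freq="D"),
    "temp": [34, 31, 25, 28, 30, 29, 27],
})

# shift(1) excludes the current day, so the window spans the three prior days.
window = df["temp"].shift(1).rolling(window=3)
df["temp_min_3d"] = window.min()
df["temp_max_3d"] = window.max()
df["temp_sum_3d"] = window.sum()
print(df)
```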

                  Generating and using these additional features as extra contextual data helps with the accuracy of the trained model. This is all possible by adding a few parameters to your experiment settings.

                  Holiday features

For many time series scenarios, holidays have a strong influence on how the modeled system behaves. The time before, during, and after a holiday can modify the series’ patterns, especially in scenarios such as sales and product demand. Automated ML will create additional features as input for model training on daily datasets. Each holiday generates a window over your existing dataset to which the learner can assign an effect. With this update, we support over 2,000 holidays in over 110 countries. To use this feature, simply pass the country code as a part of the time series settings. The example below shows the input data in the left table and the updated dataset with holiday featurization applied in the right table. The additional generated features add more context when models are trained, improving accuracy.

Training data on the left is shown without holiday features applied; the table on the right shows the dataset with holiday features applied.
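To illustrate the idea (not the internal implementation), the sketch below derives simple holiday features with the open-source holidays package; automated ML performs its own featurization when you pass a country code in the time series settings.

```python
# Simple holiday indicator and holiday-name features for a daily date range.
# The 'holidays' package is used here purely for illustration.
import pandas as pd
import holidays

us_holidays = holidays.US(years=2017)

df = pd.DataFrame({"date": pd.date_range("2017-07-01", periods=7, freq="D")})
df["is_holiday"] = df["date"].dt.date.map(lambda d: d in us_holidays)
df["holiday_name"] = df["date"].dt.date.map(lambda d: us_holidays.get(d, ""))
print(df)  # July 4, 2017 is flagged as Independence Day
```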

                  Get started with time-series forecasting in automated ML

With these new capabilities, automated ML increases support for more complex forecasting scenarios, provides more control over configuring training data using lags and window aggregation, and improves accuracy with the new holiday featurization and ROCV. Azure Machine Learning aims to enable data scientists of all skill levels to use powerful machine learning technology that simplifies their processes and reduces the time spent training models. Get started by visiting our documentation and let us know what you think – we are committed to making automated ML better for you!
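As a starting point, a forecasting experiment configuration that brings these settings together might look like the sketch below. The parameter names (max_horizon, target_lags, target_rolling_window_size, country_or_region, and the time-series settings) follow the SDK documentation from this period and are assumptions to verify against the current azureml-train-automl reference; train_dataset is a placeholder for your registered tabular dataset.

```python
# Hedged configuration sketch for a forecasting run with automated ML.
from azureml.train.automl import AutoMLConfig

time_series_settings = {
    "time_column_name": "date",
    "grain_column_names": ["store"],
    "max_horizon": 14,                # forecast 14 periods ahead
    "target_lags": 2,                 # configurable lags on the target
    "target_rolling_window_size": 3,  # rolling-window aggregate features
    "country_or_region": "US",        # enables holiday featurization
}

automl_config = AutoMLConfig(
    task="forecasting",
    primary_metric="normalized_root_mean_squared_error",
    training_data=train_dataset,      # placeholder: a tabular dataset with the series
    label_column_name="sales",
    n_cross_validations=5,            # rolling-origin cross validation folds
    **time_series_settings,
)
# The config would then be submitted to an Azure ML experiment for training.
```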

                  Learn more about the Azure Machine Learning service and get started with a free trial.