Six things to consider when using Video Indexer at scale

Your archive of videos to index is ever-expanding, so you have been evaluating Microsoft Video Indexer and have decided to take your relationship with it to the next level by scaling up.
In general, scaling shouldn’t be difficult, but when you first face such a process you might not be sure of the best way to do it. Questions like “Are there any technological constraints I need to take into account?”, “Is there a smart and efficient way of doing it?”, and “Can I avoid spending excess money in the process?” may cross your mind. So, here are six best practices for using Video Indexer at scale.

1. When uploading videos, prefer URL over sending the file as a byte array

Video Indexer gives you the choice of uploading videos from a URL or directly by sending the file as a byte array, but remember that the latter comes with some constraints.

First, it has file size limitations: a byte array file is limited to 2 GB, compared to the 30 GB upload size limit when using a URL.

Second, and more importantly for scaling, sending files as multi-part uploads makes you highly dependent on your network. Service reliability, connectivity, upload speed, and lost packets somewhere on the web are just some of the issues that can affect your performance and hence your ability to scale.

Illustration of the different issues that can affect reliability of uploading a file using multi-part

When you upload videos using a URL, you just need to provide a path to the location of the media file and we will take care of the rest (see the field from the upload-video API below).

To upload videos using a URL via the API, you can check this short code sample, or you can use AzCopy for a fast and reliable way to get your content into a storage account and then submit it to Video Indexer using a SAS URL.

URL address field in the uploadVideo API
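For illustration, here is a minimal sketch of an upload-by-URL call. The location, account id, and token values are placeholders (assumptions, not values from this article); substitute your own, and check the upload-video API reference for the full parameter list.

```python
import json
import urllib.parse
import urllib.request

# Placeholder values (assumptions) -- substitute your own account details.
LOCATION = "trial"
ACCOUNT_ID = "your-account-id"

def build_upload_url(video_url, name, access_token):
    """Build the upload-video call: a POST whose videoUrl query parameter
    points at the media file, instead of a multi-part byte-array body."""
    base = f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}/Videos"
    query = urllib.parse.urlencode({
        "accessToken": access_token,
        "name": name,
        "videoUrl": video_url,  # e.g. a SAS URL to a blob in your storage account
    })
    return f"{base}?{query}"

def upload_from_url(video_url, name, access_token):
    request = urllib.request.Request(
        build_upload_url(video_url, name, access_token), method="POST")
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)  # includes the new video's id and processing state
```

Because the heavy transfer happens between your storage account and Video Indexer, your own network connection is no longer the bottleneck.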

2. Increase media reserved units if needed

Usually, in the proof-of-concept stage when you are just starting to use Video Indexer, you don’t need a lot of computing power. When you scale up your usage, you have a larger archive of videos to index and you want the process to run at a pace that fits your use case. Therefore, you should consider increasing the number of compute resources you use if the current amount of computing power is not enough.

In Azure Media Services, computing power and parallelization are measured in media reserved units (RUs), the compute units that determine the parameters of your media processing tasks. The number of RUs determines how many media tasks can be processed concurrently in each account, and their type determines the speed of processing; one video might require more than one RU if its indexing is complex. When all of your RUs are busy, new tasks are held in a queue until another resource becomes available.

We know you want to operate efficiently and don’t want resources that stay idle part of the time. For that reason, we offer an auto-scale system that spins RUs down when less processing is needed and spins RUs up during your rush hours (up to the full number of RUs you have). You can easily enable this functionality by turning on autoscale in the account settings or by using the Update-Paid-Account-Azure-Media-Services API.

Autoscale button in the account settings

API sample to update a paid account on AMS with autoScale = true

To minimize indexing duration and avoid low throughput, we recommend you start with 10 RUs of type S3. Later, if you scale up to support more content or higher concurrency and need more resources to do so, you can contact us through the support system (paid accounts only) to request a larger RU allocation.

3. Respect throttling

Video Indexer is built to deal with indexing at scale, but to get the most out of it you should also be aware of the system’s capabilities and design your integration accordingly. You don’t want to send an upload request for a batch of videos only to discover that some of them didn’t upload and you are receiving an HTTP 429 response code (too many requests). This happens when you send more requests per minute than the service supports. Don’t worry: in the HTTP response we add a Retry-After header that specifies when you should attempt your next retry. Make sure you respect it before trying your next request.

The documentation of the HTTP 429 response the user receives
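As a minimal sketch, a client that respects the Retry-After header on a 429 response (rather than retrying immediately and being throttled again) could look like this; the default wait and attempt count are illustrative assumptions:

```python
import time
import urllib.error
import urllib.request

def seconds_to_wait(headers, default=10):
    """Read the Retry-After header (in seconds) from a 429 response;
    the 10-second default is an illustrative fallback."""
    return int(headers.get("Retry-After", default))

def post_with_throttle_retry(url, data=None, max_attempts=5):
    """POST a request, honoring Retry-After on HTTP 429 instead of
    hammering the service with immediate retries."""
    for _ in range(max_attempts):
        try:
            req = urllib.request.Request(url, data=data, method="POST")
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:  # only back off for "too many requests"
                raise
            time.sleep(seconds_to_wait(err.headers))
    raise RuntimeError(f"still throttled after {max_attempts} attempts")
```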

4. Use callback URL

Have you ever called customer service and been told, “I’m now processing your request, it will take a few minutes. You can leave your phone number and we’ll get back to you when it is done”? The cases where you leave your number and they call you back the second your request is processed are exactly the concept behind a callback URL.

So, instead of constantly polling the status of your request from the second you send it, you can simply add a callback URL and wait for us to update you. As soon as there is any status change in your upload request, we will send a POST notification to the URL you provided.

You can add a callback URL as one of the parameters of the upload-video API (see the description from the API below). If you are not sure how to do it, you can check the code samples in our GitHub repo. For the callback URL, you can also use Azure Functions, a serverless event-driven platform that can be triggered by HTTP, to implement the follow-up flow.

callback URL address field in the uploadVideo API
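For local experimentation, here is a hypothetical sketch of a handler for that POST notification (in production an Azure Function HTTP trigger is a more natural fit). It assumes the service appends `id` and `state` query parameters to your callback URL, as described in the upload-video documentation:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def parse_notification(path):
    """Extract the video id and state appended to the callback URL
    as query parameters."""
    query = parse_qs(urlparse(path).query)
    return query.get("id", [None])[0], query.get("state", [None])[0]

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        video_id, state = parse_notification(self.path)
        print(f"video {video_id} is now {state}")  # e.g. fetch the index when done
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Local test server only; the port is an arbitrary choice.
    HTTPServer(("", 8080), CallbackHandler).serve_forever()
```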

5. Use the right indexing parameters for you

Probably the first thing to do when using Video Indexer, and especially when trying to scale, is to think about how to get the most out of it with the right parameters for your needs. Think about your use case: by choosing the right parameters you can save money and make the indexing process for your videos faster.

We give you the option to customize your usage of Video Indexer by choosing these indexing parameters. Don’t set the preset to streaming if you don’t plan to watch the video, and don’t index video insights if you only need audio insights; it is that easy.

Before uploading and indexing your video, read this short documentation and check the indexingPreset and streamingPreset sections to get a better idea of your options.
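As a sketch, assuming the parameter and preset names match the upload-video API reference (indexingPreset and streamingPreset), an audio-only workload that skips the streaming encode might build its query string like this:

```python
from urllib.parse import urlencode

def upload_params(video_url, access_token,
                  indexing_preset="AudioOnly", streaming_preset="NoStreaming"):
    """Query parameters for upload-video, trimmed to what this use case needs:
    audio-only insights and no streaming encode. Verify the preset names
    against the current API documentation before relying on them."""
    return urlencode({
        "accessToken": access_token,
        "videoUrl": video_url,
        "indexingPreset": indexing_preset,    # skip visual insights entirely
        "streamingPreset": streaming_preset,  # don't encode for playback
    })
```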

6. Index in optimal resolution, not highest resolution

Not too long ago, HD video didn’t exist. Now we have videos of varied qualities, from HD to 8K. The question is: what video quality do you need for indexing? The higher the quality of the movie you upload, the larger the file, and the more computing power and time are needed to upload and process it.

Our experience shows that, in many cases, there is almost no difference in indexing performance between HD (720p) videos and 4K videos. Eventually, you’ll get almost the same insights with the same confidence.

For example, for the face detection feature, a higher resolution can help with the scenario where there are many small but contextually important faces. However, this will come with a quadratic increase in runtime (and therefore higher COGS) and an increased risk of false positives.

Therefore, we recommend that you verify you get the right results for your use case and test it locally first: upload the same video in 720p and in 4K and compare the insights you get. Remember, no need to use a cannon to kill a fly.

Have questions or feedback? We would love to hear from you. Use our UserVoice page to help us prioritize features, leave a comment below or email [email protected] for any questions.

We want to hear about your use case, and we can help you scale.

Combine the Power of Video Indexer and Computer Vision

We are pleased to introduce the ability to export high-resolution keyframes from Azure Media Services’ Video Indexer. Whereas keyframes were previously exported at reduced resolution compared to the source video, high-resolution keyframe extraction gives you original-quality images and allows you to make use of the image-based artificial intelligence models provided by the Microsoft Computer Vision and Custom Vision services to gain even more insights from your video. This unlocks a wealth of pre-trained and custom model capabilities. You can use the keyframes extracted from Video Indexer, for example, to identify logos for monetization and brand safety needs, to add scene descriptions for accessibility needs, or to accurately identify very specific objects relevant to your organization, like a type of car or a place.

Let’s look at some of the use cases we can enable with this new introduction.

Using keyframes to get image description automatically

You can automate the process of “captioning” different visual shots of your video through the image description model within Computer Vision, in order to make the content more accessible to people with visual impairments. This model provides multiple description suggestions along with confidence values for an image. You can take the descriptions of each high-resolution keyframe and stitch them together to create an audio description track for your video.

Image description within Computer Vision
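As an illustrative sketch, a keyframe could be sent to the Computer Vision describe operation like this. The endpoint and key are placeholders, and the response field names follow the v3.2 Analyze/Describe documentation; verify them against the API version you use:

```python
import json
import urllib.request

# Placeholders (assumptions) -- use your own Computer Vision endpoint and key.
ENDPOINT = "https://your-resource.cognitiveservices.azure.com"
KEY = "your-subscription-key"

def describe_keyframe(image_url, max_candidates=3):
    """Send one high-resolution keyframe to the 'describe' operation and
    return its caption candidates with confidence scores."""
    req = urllib.request.Request(
        f"{ENDPOINT}/vision/v3.2/describe?maxCandidates={max_candidates}",
        data=json.dumps({"url": image_url}).encode(),
        headers={"Ocp-Apim-Subscription-Key": KEY,
                 "Content-Type": "application/json"},
        method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["description"]["captions"]

def best_caption(captions):
    """Pick the highest-confidence caption candidate for the audio track."""
    return max(captions, key=lambda c: c["confidence"])["text"]
```

Running `best_caption(describe_keyframe(url))` over each timestamped keyframe gives you the per-shot descriptions to stitch into an audio description track.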

Using Keyframes to get logo detection

While Video Indexer detects brands in speech and visual text, it does not yet support brand detection from logos. Instead, you can run your keyframes through Computer Vision’s logo-based brand detection model to detect instances of logos in your content.

This can also help with brand safety, since you now know, and can control, which brands show up in your content. For example, you might not want to showcase the logo of a company directly competing with yours. You can also monetize the brands appearing in your content through sponsorship agreements or contextual ads.

Furthermore, you can cross-reference the results of this model for your keyframes with their timestamps to determine exactly when a logo is shown in your video and for how long. For example, if you have a sponsorship agreement with a content creator to show your logo for a certain period of time in their video, this can help determine whether the terms of the agreement have been upheld.
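The cross-referencing step can be sketched as a small helper. It assumes you have already collected, per keyframe, the `brands` array that Computer Vision returns (each entry with a `name` and `confidence` field) alongside the keyframe’s timestamp:

```python
def logo_screen_time(keyframe_results, logo_name):
    """keyframe_results: list of (timestamp_seconds, brands), where brands is
    the detection array for that keyframe. Returns the timestamps at which
    logo_name appears; with a known keyframe sampling interval, this gives an
    estimate of total on-screen duration."""
    return [ts for ts, brands in keyframe_results
            if any(b["name"] == logo_name for b in brands)]
```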

Computer Vision’s logo detection model can detect and recognize thousands of different brands out of the box. However, if you are working with logos that are specific to your use case or otherwise might not be a part of the out of the box logos database, you can also use Custom Vision to build a custom object detector and essentially train your own database of logos by uploading and correctly labeling instances of the logos relevant to you.

Computer Vision's logo detector, detecting the Microsoft logo.

Using keyframes with other Computer Vision and Custom Vision offerings

The Computer Vision APIs provide different insights in addition to image description and logo detection, such as object detection, image categorization, and more. The possibilities are endless when you use high-resolution keyframes in conjunction with these offerings.

For example, the object detection model in Computer Vision gives bounding boxes for common out of the box objects that are already detected as part of Video Indexer today. You can use these bounding boxes to blur out certain objects that don’t meet your standards.

Object detection model
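To make the blur-out idea concrete, here is a simplified sketch that redacts a detected region. It treats the image as a 2-D grid of pixel values and uses the x/y/w/h rectangle format that object detection results use; a real pipeline would apply a blur filter with an imaging library instead of a flat fill:

```python
def redact_region(image, box, fill=0):
    """Black out a bounding box returned by the object detection model.
    image: mutable 2-D list of pixel values; box: dict with x, y, w, h."""
    for row in range(box["y"], box["y"] + box["h"]):
        for col in range(box["x"], box["x"] + box["w"]):
            image[row][col] = fill
    return image
```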

High-resolution keyframes in conjunction with Custom Vision can be leveraged to achieve many different custom use cases. For example, you can train a model to determine what type of car (or even what breed of cat) is showing in a shot. Maybe you want to identify the location or the set where a scene was filmed for editing purposes. If you have objects of interest that may be unique to your use case, use Custom Vision to build a custom classifier to tag visuals or a custom object detector to tag and provide bounding boxes for visual objects.

Try it for yourself

These are just a few of the new opportunities enabled by the availability of high-resolution keyframes in Video Indexer. Now it is up to you to get additional insights from your video by taking the keyframes from Video Indexer and running additional image processing with any of the Vision models we have just discussed. You can start by uploading your video to Video Indexer and taking the high-resolution keyframes once the indexing job is complete, and then creating an account and getting started with the Computer Vision API and Custom Vision.


A year of bringing AI to the edge

This post is co-authored by Anny Dow, Product Marketing Manager, Azure Cognitive Services.

In an age where low-latency and data security can be the lifeblood of an organization, containers make it possible for enterprises to meet these needs when harnessing artificial intelligence (AI).

Since introducing Azure Cognitive Services in containers this time last year, businesses across industries have unlocked new productivity gains and insights. The combination of both the most comprehensive set of domain-specific AI services in the market and containers enables enterprises to apply AI to more scenarios with Azure than with any other major cloud provider. Organizations ranging from healthcare to financial services have transformed their processes and customer experiences as a result.


These are some of the highlights from the past year:

Employing anomaly detection for predictive maintenance

Airbus Defense and Space, one of the world’s largest aerospace and defense companies, has tested Azure Cognitive Services in containers for developing a proof of concept in predictive maintenance. The company runs Anomaly Detector for immediately spotting unusual behavior in voltage levels to mitigate unexpected downtime. By employing advanced anomaly detection in containers without further burdening the data scientist team, Airbus can scale this critical capability across the business globally.

“Innovation has always been a driving force at Airbus. Using Anomaly Detector, an Azure Cognitive Service, we can solve some aircraft predictive maintenance use cases more easily.”  —Peter Weckesser, Digital Transformation Officer, Airbus

Automating data extraction for highly-regulated businesses

As enterprises grow, they accumulate thousands of hours of repetitive but critically important work every week. High-value domain specialists spend too much of their time on it. Today, innovative organizations use robotic process automation (RPA) to help manage, scale, and accelerate processes, and in doing so free people to create more value.

Automation Anywhere, a leader in robotic process automation, partners with these companies eager to streamline operations by applying AI. IQ Bot, their unique RPA software, automates data extraction from documents of various types. By deploying Cognitive Services in containers, Automation Anywhere can now handle documents on-premises and at the edge for highly regulated industries:

“Azure Cognitive Services in containers gives us the headroom to scale, both on-premises and in the cloud, especially for verticals such as insurance, finance, and health care where there are millions of documents to process.” —Prince Kohli, Chief Technology Officer for Products and Engineering, Automation Anywhere

For more about Automation Anywhere’s partnership with Microsoft to democratize AI for organizations, check out this blog post.

Delighting customers and employees with an intelligent virtual agent

Lowell, one of the largest credit management services in Europe, wants credit to work better for everybody. So, it works hard to make every consumer interaction as painless as possible with AI. Partnering with Crayon, a global leader in cloud services and solutions, Lowell set out to fix the outdated processes that kept the company’s highly trained credit counselors too busy with routine inquiries and created friction in the customer experience. Lowell turned to Cognitive Services to create an AI-enabled virtual agent that now handles 40 percent of all inquiries, making it easier for service agents to deliver greater value to consumers and better outcomes for Lowell clients.

With GDPR requirements, chatbots weren’t an option for many businesses before containers became available. Now companies like Lowell can ensure data handling meets stringent compliance standards by running Cognitive Services in containers. As Carl Udvang, Product Manager at Lowell, explains:

“By taking advantage of container support in Cognitive Services, we built a bot that safeguards consumer information, analyzes it, and compares it to case studies about defaulted payments to find the solutions that work for each individual.”

One-to-one customer care at scale in data-sensitive environments has become easier to achieve.

Empowering disaster relief organizations on the ground

A few years ago, there was a major Ebola outbreak in Liberia. A team from USAID was sent to help mitigate the crisis. Their first task on the ground was to find and categorize information such as the state of healthcare facilities, Wi-Fi networks, and population density centers. They tracked this information manually and had to extract insights from a complex corpus of data to determine the best course of action.

With the rugged versions of Azure Stack Edge, teams responding to such crises can carry a device running Cognitive Services in their backpack. They can upload unstructured data like maps, images, pictures of documents and then extract content, translate, draw relationships among entities, and apply a search layer. With these cloud AI capabilities available offline, at their fingertips, response teams can find the information they need in a matter of moments. In Satya’s Ignite 2019 keynote, Dean Paron, Partner Director of Azure Storage and Edge, walks us through how Cognitive Services in Azure Stack Edge can be applied in such disaster relief scenarios (starting at 27:07): 

Transforming customer support with call center analytics

Call centers are a critical customer touchpoint for many businesses, and being able to derive insights from customer calls is key to improving customer support. With Cognitive Services, businesses can transcribe calls with Speech to Text, analyze sentiment in real-time with Text Analytics, and develop a virtual agent to respond to questions with Text to Speech. However, in highly regulated industries, businesses are typically prohibited from running AI services in the cloud due to policies against uploading, processing, and storing any data in public cloud environments. This is especially true for financial institutions.

A leading bank in Europe addressed regulatory requirements and brought the latest transcription technology to their own on-premises environment by deploying Cognitive Services in containers. Through transcribing calls, customer service agents could not only get real-time feedback on customer sentiment and call effectiveness, but also batch process data to identify broad themes and unlock deeper insights on millions of hours of audio. Using containers also gave them flexibility to integrate with their own custom workflows and scale throughput at low latency.

What’s next?

These stories touch on just a handful of the organizations leading innovation by bringing AI to where data lives. As running AI anywhere becomes more mainstream, the opportunities for empowering people and organizations will only be limited by the imagination.

Visit the container support page to get started with containers today.

For a deeper dive into these stories, visit the following

Multi-language identification and transcription in Video Indexer

Multi-language speech transcription was recently introduced in Microsoft Video Indexer at the International Broadcasters Conference (IBC). It is available as a preview capability, and customers can already start experiencing it in our portal. More details on all our IBC 2019 enhancements can be found here.

Multi-language videos are common media assets in a globalized context: global political summits, economic forums, and sports press conferences are examples of venues where speakers use their native language to convey their statements. Such videos pose a unique challenge for companies that need to provide automatic transcription for large video archives. Automatic transcription technologies expect users to explicitly specify the video’s language in advance in order to convert speech to text. This manual step becomes a scalability obstacle when transcribing multi-language content, as one would have to manually tag audio segments with the appropriate language.

Microsoft Video Indexer provides a unique capability of automatic spoken language identification for multi-language content. This solution allows users to easily transcribe multi-language content without going through tedious manual preparation steps before triggering it. In doing so, it saves anyone with a large archive of videos both time and money, and enables discoverability and accessibility scenarios.

Multi-language audio transcription in Video Indexer

The multi-language transcription capability is available as part of the Video Indexer portal. Currently, it supports four languages (English, French, German, and Spanish) and expects up to three different languages in an input media asset. While uploading a new media asset, you can select the “Auto-detect multi-language” option as shown below.

1.	A new multi-language option available in the upload page of Video Indexer portal

Our application programming interface (API) supports this capability as well by enabling users to specify ‘multi’ as the language in the upload API. Once the indexing process is completed, the index JavaScript object notation (JSON) will include the underlying languages. Refer to our documentation for more details.

Additionally, each instance in the transcription section will include the language in which it was transcribed.

2.	A transcription snippet from Video Indexer timeline presenting different language segments

Customers can view the transcript and identified languages by time, jump to the specific places in the video for each language, and even see the multi-language transcription as video captions. The resulting transcription is also available as closed caption files (VTT, TTML, SRT, TXT, and CSV).
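To act on the per-segment languages programmatically, you can walk the index JSON once indexing completes. This sketch assumes the transcript lines carry a `language` field alongside the usual fields, following the index JSON shape in the documentation; verify the field names against your API version:

```python
def spoken_languages(index_json):
    """Collect the distinct languages identified in a multi-language index
    from the per-segment 'language' field of the transcript."""
    transcript = index_json["videos"][0]["insights"]["transcript"]
    return sorted({line["language"] for line in transcript})
```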



Language identification from an audio signal is a complex task. Acoustic environment, speaker gender, and speaker age are among the many factors that affect it. We represent the audio signal using a visual representation, such as spectrograms, assuming that different languages induce unique visual patterns which can be learned using deep neural networks.

Our solution has two main stages for determining the languages used in multi-language media content. First, it employs a deep neural network to classify audio segments at very high granularity, that is, a few seconds each. While a good model will successfully identify the underlying language, it can still misidentify some segments due to similarities between languages. Therefore, we apply a second stage that examines these misses and smooths the results accordingly.

3.	A new insight pane showing the detected spoken languages and their exact occurrences on the timeline

Next steps

We introduced a differentiated capability for multi-language speech transcription. With this unique capability in Video Indexer, you can gain better insight into the content of your videos, since it allows you to immediately start searching across videos for segments in different languages. Over the coming months, we will improve this capability by adding support for more languages and improving the model’s accuracy.

For more information, visit Video Indexer’s portal or the Video Indexer developer portal, and try this new capability. Read more about the new multi-language option and how to use it in our documentation.


Start building with Azure Cognitive Services for free

This post was co-authored by Tina Coll, Sr Product Marketing Manager, Azure Cognitive Services.

Innovate at no cost to you, with out-of-the-box AI services that are newly available to Azure free account users. Join the 1.3 million developers who have been using Cognitive Services to build AI-powered apps. With the broadest offering of AI services in the market, Azure Cognitive Services can unlock AI for more scenarios than other cloud providers. Give your apps, websites, and bots the ability to see, understand, and interpret people’s needs through natural methods of communication, with nothing more than an API call. Businesses across industries have transformed how they operate using the very same Cognitive Services now available to you with an Azure free account.

Get started with an Azure free account today, and learn more about Cognitive Services.

These examples are just a small handful of what you can make possible with these services:

  • Improve app security with face detection: With Face API, detect and compare human faces. See how Uber uses Face API to authenticate drivers.
  • Automatically extract text and detect languages: Easily and accurately detect the language of any text string, simplifying development processes and allowing you to quickly translate and serve localized content. Learn how Chevron applied Form Recognizer for robotic process automation, quickly extracting text from documents.
  • Personalize your business’ homepage: Use Personalizer to deliver the most relevant content and experiences to each user on your homepage.
  • Develop your own computer vision model in minutes: Use your own images to teach Custom Vision Service the concepts you want it to learn and build your own model. Find out how Minsur, the largest tin mine in the western hemisphere, harnesses Custom Vision for sustainable mining practices.
  • Create inclusive apps: With Computer Vision and Immersive Reader, your camera becomes an inclusive tool that turns pictures into spoken words for low vision users.
  • Build conversational experiences for your customers: Give your bot the ability to interact with your users with Azure Cognitive Services. See how LaLiga, the Spanish men’s soccer league, engages hundreds of millions of fans with its chatbot using LUIS, QnAMaker, and more.

It’s easy to get started

1. Create an Azure free account.

2. Visit the Azure portal to deploy services.

3. Find step-by-step guidance for deploying Cognitive Services.

Leveraging Cognitive Services to simplify inventory tracking

The team of interns at the New England Research and Development Center in Cambridge
Who spends their summer at the Microsoft Garage New England Research & Development Center (or “NERD”)? The Microsoft Garage internship seeks out students who are hungry to learn, not afraid to try new things, and able to step out of their comfort zones when faced with ambiguous situations. The program brought together Grace Hsu from Massachusetts Institute of Technology, Christopher Bunn from Northeastern University, Joseph Lai from Boston University, and Ashley Hong from Carnegie Mellon University. They chose the Garage internship because of the product focus—getting to see the whole development cycle from ideation to shipping—and learning how to be customer obsessed.

Microsoft Garage interns take on experimental projects in order to build their creativity and product development skills through hacking new technology. Typically, these projects are proposals that come from our internal product groups at Microsoft, but when Stanley Black & Decker asked if Microsoft could apply image recognition for asset management on construction sites, this team of four interns accepted the challenge of creating a working prototype in twelve weeks.

Starting with a simple request for leveraging image recognition, the team conducted market analysis and user research to ensure the product would stand out and prove useful. They spent the summer gaining experience in mobile app development and AI to create an app that recognizes tools at least as accurately as humans can.

The problem

In the construction industry, it’s not unusual for contractors to spend over 50 hours every month tracking inventory, which can lead to unnecessary delays, overstocking, and missing tools. Altogether, large construction sites can lose more than $200,000 worth of equipment over the course of a long project. Attempts to address this problem are an unstandardized mix that typically involves barcodes, Bluetooth, RFID tags, and QR codes. The team at Stanley Black & Decker asked, “Wouldn’t it be easier to just take a photo and have the tool automatically recognized?”

Because there are many tool models with minute differences, recognizing a specific drill, for example, requires reading a model number like DCD996. Tools can also be assembled in multiple configurations, such as with or without a bit or battery pack attached, and can be viewed from different angles. You also need to take into account the variety of lighting conditions and possible backgrounds you’d come across on a typical construction site. It quickly becomes a very interesting problem to solve using computer vision.

Four different DeWalt drills that look very similar

How they hacked it

Classification algorithms can easily be trained to strong accuracy when identifying distinct objects, like differentiating between a drill, a saw, and a tape measure. The team instead wanted to know whether a classifier could accurately distinguish between very similar tools, like the four drills shown above. In the first iteration of the project, the team explored PyTorch and Microsoft’s Custom Vision service. Custom Vision appeals to users by not requiring a high level of data science knowledge to get a working model off the ground, and with enough images (roughly 400 per tool), it proved to be an adequate solution. However, it immediately became apparent that manually gathering this many images would be hard to scale across a product line with thousands of tools. The focus quickly shifted to finding ways of synthetically generating the training images.

For their initial approach, the team did both three-dimensional scans and green screen renderings of the tools. These images were then overlaid with random backgrounds to mimic a real photograph. While this approach seemed promising, the quality of the images produced proved challenging.

In the next iteration, in collaboration with Stanley Black & Decker’s engineering team, the team explored a new approach using photo-realistic renders from computer-aided design (CAD) models. They were able to use relatively simple Python scripts to resize, rotate, and randomly overlay these images on a large set of backgrounds. With this technique, the team could generate thousands of training images within minutes.

    Image generated in front of a green screen vs an image rendered from CAD

On the left is an image generated in front of a green screen versus an extract from CAD on the right.
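The randomization behind those scripts can be sketched without an imaging library: for each synthetic training image, draw a scale, rotation, and position, then hand the values to whatever renderer composites the CAD image onto a background. All parameter ranges below are illustrative assumptions, not the team’s actual values:

```python
import random

def synthetic_placements(n, canvas=(1024, 768), scale_range=(0.3, 1.0)):
    """Generate n random (scale, rotation_degrees, x, y) placements for
    overlaying a rendered tool image onto a background -- the randomization
    step behind the synthetic training set."""
    placements = []
    for _ in range(n):
        scale = random.uniform(*scale_range)
        rotation = random.uniform(0, 360)
        x = random.randint(0, canvas[0] // 2)  # keep the overlay on-canvas
        y = random.randint(0, canvas[1] // 2)
        placements.append((scale, rotation, x, y))
    return placements
```

With a fast compositing step downstream, a loop like this yields thousands of labeled training images within minutes.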

Benchmarking the iterations

The Custom Vision service offers reports on the accuracy of the model as shown below.

Exemplary report extracted from the custom vision service
For a classification model that targets visually similar products, a confusion matrix like the one below is very helpful. A confusion matrix visualizes the performance of a prediction model by comparing the true label of a class (rows) with the label output by the model (columns). The higher the scores on the diagonal, the more accurate the model. High values off the diagonal show the data scientists which two classes the trained model confuses with each other.

Existing Python libraries can be used to quickly generate a confusion matrix with a set of test images.
Confusion matrix for 10 products from DeWalt 
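With scikit-learn, the matrix takes only a few lines; the drill model names below are illustrative placeholders for the test-set labels:

```python
from sklearn.metrics import confusion_matrix

# True labels from the test set vs. labels predicted by the classifier.
y_true = ["DCD771", "DCD771", "DCD791", "DCD996", "DCD996"]
y_pred = ["DCD771", "DCD791", "DCD791", "DCD996", "DCD771"]

labels = ["DCD771", "DCD791", "DCD996"]
matrix = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true classes, columns are predicted classes;
# off-diagonal cells show which drills are being confused.
print(matrix)
```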

The result

The team developed a React Native application that runs on both iOS and Android and serves as a lightweight asset management tool with a clean and intuitive UI. The app adapts to various degrees of Wi-Fi availability and when a reliable connection is present, the images taken are sent to the APIs of the trained Custom Vision model on Azure Cloud. In the absence of an internet connection, the images are sent to a local computer vision model.

These local models can be obtained using Custom Vision, which exports models to Core ML for iOS, TensorFlow for Android, or as a Docker container that can run on a Linux App Service in Azure. An easy framework for the addition of new products to the machine learning model can be implemented by exporting rendered images from CAD and generating synthetic images.

Captures of the user interface of the inventory app
Images in order from left to right: inventory checklist screen, camera functionality to send a picture to Custom Vision service, display of machine learning model results, and a manual form to add a tool to the checklist.


What’s next

Looking for an opportunity for your team to hack on a computer vision project? Search for an OpenHack near you.

Microsoft OpenHack is a developer-focused event where a wide variety of participants (Open) learn through hands-on experimentation (Hack), using challenges based on real-world customer engagements designed to mimic the developer journey. OpenHack is a premium Microsoft event that provides a unique upskilling experience for customers and partners. Rather than a traditional presentation-based conference, OpenHack offers a hands-on coding experience for developers.

The learning paths can also help you get hands on with the cognitive services.

Enable receipt understanding with Form Recognizer’s new capability

One of the newest members of the Azure AI portfolio, Form Recognizer, applies advanced machine learning to accurately extract text, key-value pairs, and tables from documents. With just a few samples, it tailors its understanding to supplied documents, both on-premises and in the cloud. 

Introducing the new pre-built receipt capability

Form Recognizer focuses on making it simpler for companies to utilize the information hiding in business documents such as forms. Now we are making it easier to handle one of the most commonplace documents in a business, receipts, “out of the box.” Form Recognizer’s new pre-built receipt API identifies and extracts key information from sales receipts, such as the time and date of the transaction, merchant information, tax amounts, totals, and more, with no training required.

Sample receipt with extracted information from Form Recognizer’s new prebuilt receipt feature
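As an illustration, a receipt image can be submitted to the pre-built receipt API over REST. The sketch below assumes the v1.0 preview endpoint path and uses placeholder region and key values; the exact path and response shape may change as the preview evolves:

```python
import time

import requests

# Endpoint and key are placeholders; the URL path reflects the
# v1.0 preview of the prebuilt receipt API and may change.
ENDPOINT = "https://westus2.api.cognitive.microsoft.com"
ANALYZE_URL = f"{ENDPOINT}/formrecognizer/v1.0-preview/prebuilt/receipt/asyncBatchAnalyze"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "<your-subscription-key>",
    "Content-Type": "image/jpeg",
}

def analyze_receipt(image_bytes):
    """Submit a receipt image and poll until the analysis completes."""
    submit = requests.post(ANALYZE_URL, headers=HEADERS, data=image_bytes)
    submit.raise_for_status()
    # The service processes asynchronously; poll the returned operation URL.
    operation_url = submit.headers["Operation-Location"]
    while True:
        result = requests.get(
            operation_url,
            headers={"Ocp-Apim-Subscription-Key": HEADERS["Ocp-Apim-Subscription-Key"]},
        ).json()
        if result["status"] in ("Succeeded", "Failed"):
            return result
        time.sleep(1)
```

The returned JSON carries the extracted fields (merchant, transaction date and time, taxes, total) for the submitted receipt.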

Streamlining expense reporting

Business expense reporting can be cumbersome for everyone involved in the process. Manually filling out and approving expense reports is a significant time sink for both employees and managers. Aside from productivity lost to expense reporting, there are also pain points around auditing expense reports. A solution to automatically extract merchant and transaction information from receipts can significantly reduce the manual effort of reporting and auditing expenses.

Given the proliferation of mobile cameras, modern expense reports often contain images of receipts that are faded, crumpled up, or taken in suboptimal lighting conditions. Existing receipt solutions often target high quality scanned images and are not robust enough to handle such real-world conditions.

Enhance your expense reporting process using the pre-built capability

Form Recognizer eases common pain points in expense reporting, delivering real value back to the business. By using the receipt API to extract merchant and transaction information from receipts, developers can unlock new experiences in the workforce. And since the pre-built model for receipts works off the shelf without training, it shortens the time to deployment.

For employees, expense applications leveraging Form Recognizer can pre-populate expense reports with key information extracted from receipts. This saves employees time managing expenses and travel, letting them focus on their core roles. For central teams within a company, like finance, it also helps with expense auditing by using the key data extracted from receipts for verification. The optical character recognition (OCR) technology behind the service can handle receipts captured in a wide variety of conditions, including smartphone cameras, reducing the amount of manual searching and reading of transaction documents required by auditors.

Our customer: Microsoft’s internal finance operations

The pre-built receipt functionality of Form Recognizer has already been deployed by Microsoft’s internal expense reporting tool, MSExpense, to help auditors identify potential anomalies. Using the data extracted, receipts are sorted into low, medium, or high risk of potential anomalies. This enables the auditing team to focus on high risk receipts and reduce the number of potential anomalies that go unchecked.

MSExpense also plans to leverage receipt data extraction and risk scoring to modernize the expense reporting process. Instead of identifying risky expenses during auditing, such automated processing can flag potential issues earlier in the process during the reporting or approval of the expenses. This reduces the turnaround time for processing the expense and any reimbursement.

“The pre-built receipt feature of Form Recognizer enables our application not only to scale from sampling 5 percent of receipts to 100 percent, but more importantly to streamline employee expense report experience by auto-populating/creating expense transactions, creating happy path to payment, receipt data insights to approver managers, giving employees time back to be used in value-add activities for our company. The service was simple to integrate and start seeing value.“

—Luciana Siciliano, Microsoft FinOps (MSExpense)

Learn more

To learn more about Form Recognizer and the rest of the Azure AI ecosystem, please visit our website and read the documentation.

Get started by contacting us.

For additional questions please reach out to us at [email protected]

Leveraging complex data to build advanced search applications with Azure Search

Data is rarely simple. Not every piece of data we have can fit nicely into a single Excel worksheet of rows and columns. Data has many diverse relationships, such as the multiple locations and phone numbers for a single customer or the multiple authors and genres of a single book. Of course, relationships are typically even more complex than this, and as we start to leverage AI to understand our data, the additional learnings only add to the complexity of those relationships. For that reason, expecting customers to flatten their data so it can be searched and explored is often unrealistic. We heard this often, and it quickly became our most requested Azure Search feature. Because of this, we were excited to announce the general availability of complex types support in Azure Search. In this post, I want to take some time to explain what complex types support adds to Azure Search and the kinds of things you can build using this capability.

Azure Search is a platform as a service that helps developers create their own cloud search solutions.

What is complex data?

Complex data consists of data that includes hierarchical or nested substructures that do not break down neatly into a tabular rowset. For example, a book with multiple authors, where each author can have multiple attributes, can’t be represented as a single row of data unless there is a way to model the authors as a collection of objects. Complex types provide this capability, and they can be used when the data cannot be modeled in simple field structures such as strings or integers.
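As an illustration, an index definition for the book example can declare the authors field as a collection of complex objects. The field names below are illustrative; Edm.ComplexType and Collection(Edm.ComplexType) are the types that enable nesting:

```python
# Sketch of an Azure Search index definition using complex types.
# This dict would be sent as the JSON body of a create-index request.
index_definition = {
    "name": "books",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "title", "type": "Edm.String", "searchable": True},
        {
            # One book, many authors: a collection of complex objects,
            # each with its own searchable/filterable sub-fields.
            "name": "authors",
            "type": "Collection(Edm.ComplexType)",
            "fields": [
                {"name": "name", "type": "Edm.String", "searchable": True},
                {"name": "country", "type": "Edm.String", "filterable": True},
            ],
        },
    ],
}
```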

Complex types applicability

At Microsoft Build 2019,  we demonstrated how complex types could be leveraged to build out an effective search application. In the session we looked at the Travel Stack Exchange site, one of the many online communities supported by StackExchange.

The StackExchange data was modeled in a JSON structure to allow easy ingestion into Azure Search. If we look at the first post made to this site and focus on the first few fields, we see that all of them can be modeled using simple datatypes, including tags, which can be modeled as a collection (array) of strings.

    "id": "1",
    "CreationDate": "2011-06-21T20:19:34.73",
    "Score": 8,
    "ViewCount": 462,
    "BodyHTML": "My fiancée and I are looking for a good Caribbean cruise in October and were wondering which …",
    "Body": "my fiancée and i are looking for a good caribbean cruise in october and were wondering which islands …",
    "OwnerUserId": 9,
    "LastEditorUserId": 101,
    "LastEditDate": "2011-12-28T21:36:43.91",
    "LastActivityDate": "2012-05-24T14:52:14.76",
    "Title": "What are some Caribbean cruises for October?",
    "Tags": [ "caribbean", "cruising", "vacations" ],
    "AnswerCount": 4,
    "CommentCount": 4,
    "CloseDate": "0001-01-01T00:00:00",

However, as we look further down this dataset, we see that the data quickly gets more complex and cannot be mapped into a flat structure. For example, there can be numerous comments and answers associated with a single document. Even votes is defined here as a complex type (although it technically could have been flattened, doing so would add work to transform the data).

    "CloseDate": "0001-01-01T00:00:00",
    "Comments": [
        {
            "Score": 0,
            "Text": "To help with the cruise line question: Where are you located? My wife and I live in New Orlea…",
            "CreationDate": "2011-06-21T20:25:14.257",
            "UserId": 12
        },
        {
            "Score": 0,
            "Text": "Toronto, Ontario. We can fly out of anywhere though.",
            "CreationDate": "2011-06-21T20:27:35.3",
            "UserId": 9
        },
        {
            "Score": 3,
            "Text": "\"Best\" for what? Please read [this page](…",
            "UserId": 20
        },
        {
            "Score": 2,
            "Text": "What do you want out of a cruise? To relax on a boat? To visit islands? Culture? Adventure?…",
            "CreationDate": "2011-06-24T05:07:16.643",
            "UserId": 65
        }
    ],
    "Votes": {
        "UpVotes": 10,
        "DownVotes": 2
    },
    "Answers": [
        {
            "IsAcceptedAnswer": "True",
            "Body": "This is less than an answer, but more than a comment…\n\nA large percentage of your travel b…",
            "Score": 7,
            "CreationDate": "2011-06-24T05:12:01.133",
            "OwnerUserId": 74
        }
    ]

All of this data is important to the search experience.

In fact, we could even improve on the existing StackExchange search interface by leveraging Cognitive Search to extract key phrases from the answers to supply potential phrases for autocomplete as the user types in the search box.

All of this is now possible because not only can you map this data to a complex structure, but the search queries can support this enhanced structure to help build out a better search experience.

Next Steps

If you would like to learn more about Azure Search complex types, please visit the documentation, or check out the video and associated code I made which digs into this Travel StackExchange data in more detail.

Using Azure Search custom skills to create personalized job recommendations

This blog post was co-authored by Kabir Khan, Software Engineer II, Learning Engineering Research and Development.

The Microsoft Worldwide Learning Innovation lab is an idea incubation lab within Microsoft that focuses on developing personalized learning and career experiences. One of the recent experiences that the lab developed focused on offering skills-based personalized job recommendations. Research shows that job search is one of the most stressful times in someone’s life. Everyone remembers at some point looking for their next career move and how stressful it was to find a job that aligns with their various skills.

Harnessing Azure Search custom skills together with our library of technical capabilities, we were able to build a feature that offers personalized job recommendations based on capabilities identified in resumes. The feature parses a resume to identify technical skills (highlighted and checkmarked in the figure below). It then ranks jobs based on the skills most relevant to the capabilities in the resume. The UI layout also helps users see the gaps in their skills (the non-highlighted skills in the figure below) for jobs they’re interested in, so they can work toward building those skills.

An image of the Worldwide Learning personalized jobs search demo UI

Figure one: Worldwide Learning Personalized Jobs Search Demo UI

In this example, our user is interested in transitioning from a Software Engineering role to Program Management. In the image above, you can see that the top jobs for our user are in Program Management, ranked by the user’s unique capabilities in areas like AI, machine learning, and cloud computing; the top-ranked job, on the Bing Search and AI team, involves all three.

How we used Azure Search

Image of the Worldwide Learning personalized jobs search architecture.

Figure two: Worldwide Learning Personalized Jobs Search Architecture

The above architecture diagram shows the data flow for our application. We started with around 2000 job openings pulled directly from the Microsoft Careers website as an example. We then indexed these jobs, adding a custom Azure Search cognitive skill to extract capabilities from the descriptions of each job. This allows a user to search for a job based on a capability like “Machine Learning”. Then, when a user uploads a resume, we upload it to Azure Blob storage and run an Azure Search indexer. Leveraging a mix of cognitive skills provided by Azure and our custom skill to extract capabilities, we end up with a good representation of the user’s capabilities.
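A custom skill is just a Web API that speaks the cognitive search enrichment contract: input records arrive under a "values" array, each with a recordId that must be echoed back alongside the outputs. The sketch below uses a toy stand-in for the team's capabilities extractor, so the skill names and logic are illustrative:

```python
import json

def extract_capabilities(text):
    """Toy stand-in for the skills-extraction library."""
    known = {"machine learning", "cloud computing", "python"}
    return sorted(s for s in known if s in text.lower())

def custom_skill(request_body):
    """Handle the custom skill Web API contract: each input record
    carries a recordId, and the response must echo that recordId
    with the skill's outputs under "data"."""
    records = json.loads(request_body)["values"]
    results = []
    for record in records:
        capabilities = extract_capabilities(record["data"].get("text", ""))
        results.append({
            "recordId": record["recordId"],
            "data": {"capabilities": capabilities},
            "errors": None,
            "warnings": None,
        })
    return json.dumps({"values": results})
```

In production this handler would sit behind an HTTP endpoint (for example an Azure Function) that the skillset calls during indexing.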

To personalize the job search, we leverage the tag boosting scoring profile built into Azure Search. Tag boosting ranks search results by the user’s search query and the number of matching “tags” (in this case capabilities) with the target index. So, in our example, we pass the user’s capabilities along with their search query and get jobs that best match our user’s unique set of capabilities.
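As a sketch, the scoring profile attaches a tag function to the capabilities field, and each query passes the user's extracted capabilities as the matching scoring parameter. The field and parameter names below are illustrative:

```python
# Sketch of a tag-boosting scoring profile for the jobs index.
scoring_profile = {
    "name": "personalized",
    "functions": [
        {
            "type": "tag",
            "fieldName": "capabilities",
            "boost": 5,
            "tag": {"tagsParameter": "userCapabilities"},
        }
    ],
}

# At query time, the user's capabilities ride along with the search
# query, so jobs matching more of them rank higher.
query = {
    "search": "program manager",
    "scoringProfile": "personalized",
    "scoringParameters": ["userCapabilities-machine learning,cloud computing"],
}
```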

With Azure Search skills, our team was able to demonstrate personalized job search, a capability in demand among job seekers and recruiters, through this proof of concept. You can follow the same process to achieve the same goal for your own careers site. We open sourced the Skills Extractor Library that we used in this example and made it available in a container.

Please be aware that before running this sample, you must have the following:

  • Install the Azure CLI. This article requires Azure CLI version 2.0 or later. Run az --version to find the version you have.
  • You can also use the Azure Cloud Shell.

To learn more about this feature, you can view the live demo (starts at timecode 00:50:00) and read more in our GitHub repository.

Feedback and support

We’re eager to improve, so please take a couple of minutes to answer some questions about your experience using this survey. For support requests, please contact us at [email protected].

Using Text Analytics in call centers

Azure Cognitive Services provides Text Analytics APIs that simplify extracting information from text data using natural language processing and machine learning. These APIs wrap pre-built language processing capabilities, for example, sentiment analysis, key phrase extraction, entity recognition, and language detection.

Using Text Analytics, businesses can draw deeper insights from interactions with their customers. These insights can be used to create management reports, automate business processes, perform competitive analysis, and more. One source of such insights is recorded customer service calls, which provide the necessary data to:

  • Measure and improve customer satisfaction
  • Track call center and agent performance
  • Look into performance of various service areas

In this blog, we will look at how we can gain insights from these recorded customer calls using Azure Cognitive Services.

Using a combination of these services, such as Text Analytics and Speech APIs, we can extract information from the content of customer and agent conversations. We can then visualize the results and look for trends and patterns.

Diagram showing how combination of Cognitive Services can extract information

The sequence is as follows:

  • Using Azure Speech APIs, we can convert the recorded calls to text. With the text transcriptions in hand, we can then run Text Analytics APIs to gain more insight into the content of the conversations.
  • The sentiment analysis API provides information on the overall sentiment of the text in three categories: positive, neutral, and negative. At each turn of the conversation between the agent and customer, we can:
    • See how the customer sentiment is improving, staying the same, or declining.
    • Evaluate the call and the agent for effectiveness in handling customer complaints at different times.
    • See when an agent is consistently able to turn negative conversations into positive or vice versa and identify opportunities for training.
  • Using the key phrase extraction API, we can extract the key phrases in the conversation. This data, in combination with the detected sentiment, can assign categories to a set of key phrases during the call. With this data in hand, we can:
    • See which phrases carry negative or positive sentiment.
    • Evaluate shifts in sentiment over time or during product and service announcements.

Table showing overall sentiment in three text categories

  • Using the entity recognition API, we can extract entities such as person, organization, location, date time, and more. We can use this data, for example, to:
    • Tie the call sentiment to specific events such as product launches or store openings in an area.
    • Use customer mentions of competitors for competitive intelligence and analysis.
  • Lastly, Power BI can help visualize the insights and communicate the patterns and trends that drive action.

Power BI graph visualizing the insights and communicating the patterns and trends

Using Azure Cognitive Services Text Analytics, we can gain deeper insights into customer interactions, going beyond simple customer surveys into the content of the conversations themselves.
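As a rough sketch of the sentiment step above, the snippet below scores each turn of a transcribed call with the Text Analytics sentiment API (v2.1 at the time of writing); the endpoint and key are placeholders:

```python
import requests

# Endpoint and key are placeholders; the URL path follows the
# Text Analytics v2.1 API current when this was written.
ENDPOINT = "https://westus2.api.cognitive.microsoft.com"
KEY = "<your-subscription-key>"

def build_documents(turns):
    """Shape transcribed conversation turns into the API's documents payload."""
    return [
        {"id": str(i), "language": "en", "text": text}
        for i, text in enumerate(turns)
    ]

def sentiment_per_turn(turns):
    """Score each turn of a call: 0 is most negative, 1 is most positive."""
    response = requests.post(
        f"{ENDPOINT}/text/analytics/v2.1/sentiment",
        headers={"Ocp-Apim-Subscription-Key": KEY},
        json={"documents": build_documents(turns)},
    )
    response.raise_for_status()
    scores = {d["id"]: d["score"] for d in response.json()["documents"]}
    # Pair each turn with its score so a trend over the call is visible.
    return [(turns[int(i)], scores[i]) for i in sorted(scores, key=int)]
```

Plotting the per-turn scores over the course of a call shows whether the agent is improving or losing customer sentiment.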

A sample code implementation of the above workflow can be found on GitHub.

Accelerate bot development with Bot Framework SDK and other updates

Conversational experiences have become the norm, whether you’re looking to track a package or to find out a store’s hours of operation. At Microsoft Build 2019, we highlighted a few customers who are building such conversational experiences using the Microsoft Bot Framework and Azure Bot Service to transform their customer experience.

As users become more familiar with bots and virtual assistants, they will invariably expect more from their conversational experiences. For this reason, Bot Framework SDK and tools are designed to help developers be more productive in building conversational AI solutions. Here are some of the key announcements we made at Build 2019:

Bot Framework SDK and tools

Adaptive dialogs

The Bot Framework SDK now supports adaptive dialogs (preview). An adaptive dialog dynamically updates the conversation flow based on context and events. Developers can define actions, each with a series of steps driven by events in the conversation, to dynamically adjust to context. This is especially handy when dealing with context switches and interruptions in the middle of a conversation. An adaptive dialog combines input recognition, event handling, a model of the conversation (dialog), and output generation into one cohesive, self-contained unit. The diagram below depicts how adaptive dialogs allow a user to switch contexts. In this example, a user is looking to book a flight but switches context by asking for weather information that may influence travel plans.

An image depicting the flow of adaptive dialogs and context switching from book flights to weather requests.

You can read more about adaptive dialogs here.


Skills (preview)

Developers can compose conversational experiences by stitching together re-usable conversational capabilities, known as skills. Implemented as Bot Framework bots, skills include language models, dialogs, and cards that are reusable across applications. Current skills, available in preview, include Email, Calendar, and Points of Interest.

 Images of the UI for skills such as Mail, Calendar, and Point of Interest.

Within an enterprise, skills let you integrate multiple sub-bots owned by different teams into a central bot, or more broadly leverage common capabilities provided by other developers. With the preview of skills, developers can create a new bot (from the Virtual Assistant template) and add or remove skills with one command-line operation that incorporates all dispatch and configuration changes. Get started with skill developer templates (.NET, TS).

Virtual assistant solution accelerator

The Enterprise Template is now the Virtual Assistant Template, allowing developers to build a virtual assistant out of the box with skills, adaptive cards, a TypeScript generator, updated conversational telemetry and PowerBI analytics, and ARM-based automated Azure deployment. It also provides a simplified C# template aligned to the ASP.NET MVC pattern with dependency injection. Developers who have already made use of the Enterprise Template and want to use the new capabilities can follow these steps to get started quickly.


Bot Inspector (preview)

The Bot Framework Emulator has released a preview of the new Bot Inspector feature: a way to debug and test your Bot Framework SDK v4 bots on channels like Microsoft Teams, Slack, Cortana, Facebook Messenger, and Skype. As you have the conversation, messages are mirrored to the Bot Framework Emulator, where you can inspect the message data that the bot received. Additionally, a snapshot of the bot state for any given turn between the channel and the bot is rendered as well. You can inspect this data by clicking the “Bot State” element in the conversation mirror. Read more about Bot Inspector.

Language generation (preview)

Language generation streamlines the creation of smart and dynamic bot responses by constructing meaningful, variable, and grammatically correct responses that a bot can send back to the user. Visit the GitHub repo for more details.

QnA Maker

Easily handle multi-turn conversation

With QnA Maker, you can now handle a predefined set of multi-turn question and answer flows. For example, you can configure QnA Maker to help troubleshoot a product with a customer by preconfiguring a set of questions and follow up question prompts to lead users to specific answers. QnA Maker supports extraction of hierarchical QnA pairs from a URL, .pdf, or .docx files. Read more about QnA Maker multi-turn in our docs, check out the latest samples, and watch a short video.
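As an illustration, a multi-turn QnA pair carries its follow-up prompts in a context object. The shape below follows the QnA Maker knowledge base API at the time of writing; the IDs and text are illustrative:

```python
# Sketch of a QnA pair with follow-up prompts for multi-turn flows.
# Selecting a prompt leads the user to the QnA pair with that qnaId.
qna_pair = {
    "id": 1,
    "answer": "Which product are you having trouble with?",
    "questions": ["I need help troubleshooting"],
    "context": {
        "isContextOnly": False,
        "prompts": [
            {"displayOrder": 0, "displayText": "Drill", "qnaId": 2},
            {"displayOrder": 1, "displayText": "Saw", "qnaId": 3},
        ],
    },
}
```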

Simplified deployment

We’ve simplified the process of deploying a bot. Using a pre-defined Bot Framework v4 template, you can create a bot from any published QnA Maker knowledge base. Not only can you create a complex QnA Maker knowledge base in minutes, you can also deploy it to supported channels like Teams, Skype, or Slack just as quickly.

Language Understanding (LUIS)

Language Understanding has added several features that let developers extract more detailed information from text, so users can now build more intelligent solutions with less effort.

Roles for any entity type

We have extended roles to all entity types, which allows the same entities to be classified with different subtypes based on context.

New visual analytics dashboard

There’s now a more detailed, visually rich, comprehensive analytics dashboard. Its user-friendly design highlights common issues users face when designing applications and provides simple explanations on how to resolve them, helping users gain insight into their models’ quality, potential data problems, and best practices.

Dynamic lists

Data is ever-changing and different from one end-user to another. Developers now have more granular control of what they can do with Language Understanding, including being able to identify and update models at runtime through dynamic lists and external entities. Dynamic lists are used to append to list entities at prediction time, permitting user-specific information to get matched exactly.
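As a sketch, a v3 prediction request can carry a dynamicLists section that appends user-specific values to a list entity for that prediction only. The entity and values below are illustrative, and the field names follow the v3 preview API:

```python
# Sketch of a LUIS v3 prediction request body with a dynamic list.
# The "PlaylistName" list entity is extended at prediction time with
# this particular user's playlists, so they match exactly.
prediction_request = {
    "query": "play my workout playlist",
    "dynamicLists": [
        {
            "listEntityName": "PlaylistName",
            "requestLists": [
                {
                    "name": "user playlists",
                    "canonicalForm": "workout",
                    "synonyms": ["workout playlist", "gym mix"],
                }
            ],
        }
    ],
}
```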

Read more about the new Language Understanding features, available through our new v3 API, in our docs. Customers like BMW, Accenture, Vodafone, and LaLiga are using Azure to build sophisticated bots faster and find new ways to connect with their customers.

Get started

With these enhancements, we are delivering value across the Microsoft Bot Framework SDKs and tools, Language Understanding, and QnA Maker to help developers become more productive in building a variety of conversational experiences.

We look forward to seeing what conversational experiences you will build for your customers. Get started today!

Watch the on-demand sessions from Microsoft Build 2019.

AI-first content understanding, now across more types of content for even more use cases

This post is authored by Elad Ziklik, Principal Program Manager, Applied AI.

Today, data isn’t the barrier to innovation, usable data is. Real-world information is messy and carries valuable knowledge in ways that are not readily usable and require extensive time, resources, and data science expertise to process. With Knowledge Mining, it’s our mission to close the gap between data and knowledge.

We’re making it easier to uncover latent insights across all your content with:

  • Azure Search’s cognitive search capability (general availability)
  • Form Recognizer (preview)

Cognitive search and expansion into new scenarios

Announced at Microsoft Build 2018, Azure Search’s cognitive search capability uniquely helps developers apply a set of composable cognitive skills to extract knowledge from a wide range of content. Deep integration of cognitive skills within Azure Search enables the application of facial recognition, key phrase extraction, sentiment analysis, and other skills to content with a single click. This knowledge is organized and stored in a search index, enabling new experiences for exploring the data.

Cognitive search, now generally available, delivers:

  • Faster performance – Improved throughput with processing speeds up to 30 times faster than in preview, completing previously hour-long tasks in only a couple of minutes.
  • Support of complex data types – Now natively supported, extending the types of data that can be stored and searched (this has been the most requested Azure Search feature). Raw datasets can include hierarchical or nested substructures that do not break down neatly into a tabular rowset, for example multiple locations and phone numbers for a single customer.
  • New skills – Extended library of pre-built skills based on customer feedback. Improved support for processing images, added ability to create conditional skills, and shaper skills that allow for better control and management of multiple skills in a skillset. Plus, entity recognition provides additional information to each entity identified, such as the Wikipedia URL.
  • Easy implementation – The solution accelerator provides all the resources needed to quickly build a prototype, including templates for deploying Azure resources, a search index, custom skills, a web app, and PowerBI reports. Use the accelerator to jump start development efforts and apply cognitive search to your business needs.

See what’s possible when you apply cognitive search to unstructured content, like art:

Tens of thousands of customers use Azure Search today, processing over 260 billion files each month. Now with cognitive search, millions of enrichments are performed over data ranging from PDFs to Office documents, from JSON files to JPEGs. This is possible because cognitive search reduces the complexity of orchestrating enrichment pipelines containing custom and prebuilt skills, resulting in deeper insight into content. Customers across industries including healthcare, legal, media, and manufacturing use this capability to solve business challenges.

“Complex customer needs and difficult markets are our daily business. Cognitive search enables us to augment expert knowledge and experience for reviewing complex technical requirements into an automated solution that empowers knowledge workers throughout our organization.”

—Chris van Ravenswaay, Business Solution Manager, Howden

Extending AI-driven content understanding beyond search

Many scenarios outside of search require insights extracted from messy, complicated information. To expand cognitive search into these scenarios, we are excited to announce the preview of the knowledge store capability within cognitive search, which provides access to AI-generated annotations in table and JSON format for use in non-search scenarios like PowerBI dashboards, machine learning models, organized data repositories, bots, and other custom applications.

Form Recognizer, a new Cognitive Service

The Form Recognizer Cognitive Service, available in preview, applies advanced machine learning to accurately extract text, key-value pairs, and tables from documents.

With as few as 5 samples, Form Recognizer tailors its understanding to your documents. You can also use the REST interface of the Form Recognizer API to then integrate into cognitive search indexes, automate business processes, and create custom workflows for your business. You can turn forms into usable data at a fraction of the time and cost, so you can focus more time acting on the information rather than compiling it.
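As a rough sketch, training against your own sample forms is a single REST call pointing at a blob container; the endpoint path below follows the v1.0 preview, and the region, key, and SAS URL are placeholders:

```python
import requests

# Endpoint and key are placeholders; the path follows the v1.0
# preview of the custom training API and may change.
ENDPOINT = "https://westus2.api.cognitive.microsoft.com"
KEY = "<your-subscription-key>"

def train_model(sas_url):
    """Train a custom Form Recognizer model from a handful of sample
    forms stored in a blob container reachable via the SAS URL."""
    response = requests.post(
        f"{ENDPOINT}/formrecognizer/v1.0-preview/custom/train",
        headers={"Ocp-Apim-Subscription-Key": KEY},
        json={"source": sas_url},
    )
    response.raise_for_status()
    # The returned model ID is then used in subsequent analyze calls.
    return response.json()["modelId"]
```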

Container support lets Form Recognizer run on the edge, on-premises, and in the cloud. The portable architecture can be deployed directly to Azure Kubernetes Service, to Azure Container Instances, or to a Kubernetes cluster deployed to Azure Stack.

Organizations like Chevron and Starbucks are using Form Recognizer to accelerate extraction of knowledge from forms and make faster decisions.

We look forward to seeing how you leverage these products to drive impact for your business.

Getting Started