Load non-text file from Azure Blob Storage

March 23, 2016


Access Azure Blob Storage files that cannot be imported via the Reader module using Python and a Shared Access Signature.
# Description

The **Reader** module can be used to import selected file types from Azure Blob Storage into Azure Machine Learning Studio. The **Execute Python Script** module can be used to access files in other formats, including compressed files and images, using a Shared Access Signature (SAS). This experiment demonstrates how to generate an SAS using Microsoft Azure Storage Explorer and employ the SAS to access and decompress a compressed csv file.

## Data files

### Input file to be loaded

Our example input file, `test.csv.gz`, is a comma-separated value (csv) file that has been compressed using `gzip`. The contents of the file in uncompressed form are:

    Name,Quest,Favorite Color
    Sir Lancelot,To seek the Holy Grail,Blue
    Sir Galahad of Camelot,I seek the Grail,Blue -- no wait yellow

The compressed file was uploaded as a `BlockBlob` to Azure Blob Storage using [Microsoft Azure Storage Explorer](http://storageexplorer.com/), into a previously-created Azure storage account and container. (Please see Robin Shahan's [article about Azure storage accounts](https://azure.microsoft.com/en-us/documentation/articles/storage-create-storage-account/) for more information on creating storage accounts.)

### Additional Python packages

The experiment also includes a zip file containing the `azure-storage` Python package and its dependencies. It was generated by creating a new Python virtual environment in Cygwin (see [this tutorial](http://anythingsimple.blogspot.com/2010/04/using-pip-virtualenv-and.html) by David L. for more information), installing the `azure-storage` package via `pip`, and compressing the contents of the virtual environment's `site-packages` directory. This file was uploaded as a dataset in Azure Machine Learning Studio.

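
The compression step described above could also be scripted rather than done by hand. The following is a minimal sketch, assuming Python 3 and that the packages should sit at the root of the archive (which is what compressing the *contents* of `site-packages` produces). The function name `zip_directory_contents` and the paths in the usage comment are hypothetical and not part of the original experiment.

```python
import os
import zipfile


def zip_directory_contents(source_dir, zip_path):
    """Zip the contents of source_dir so that its entries sit at the archive root.

    zip_path should be located outside source_dir so the archive does not
    try to include itself.
    """
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store each file under its path relative to source_dir
                archive.write(full_path, os.path.relpath(full_path, source_dir))


# Hypothetical usage; adjust both paths to your own virtual environment:
# zip_directory_contents('myenv/lib/python2.7/site-packages', 'azure_storage_bundle.zip')
```

The resulting zip file can then be uploaded as a dataset in the same way as the hand-made archive.
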
## Obtaining a Shared Access Signature

### Overview

Your Azure Blob Storage account name and key can be used to access files from within Python, but you may not wish to store this sensitive information within a shared or published Azure Machine Learning workspace. A Shared Access Signature (SAS) can be used in lieu of an account key to provide limited access to specific containers or blobs. Several options for generating an SAS are available; for this experiment, we used [Microsoft Azure Storage Explorer](http://storageexplorer.com/), a Windows/Mac/Linux utility that is also handy for uploading files to Blob Storage. (For more information on SAS and generation options, please see Tamra Myers's [Shared Access Signatures](https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-shared-access-signature-part-1/) articles.)

### Generating an SAS with Microsoft Azure Storage Explorer

After installing and launching [Microsoft Azure Storage Explorer](http://storageexplorer.com/), navigate to the desired storage account and container using the directory tree at left, then right-click the container of interest and select the "Get Shared Access Signature" option.

![Location of Get Shared Access Signature... option][1]

In the dialog box that appears, select an appropriate expiration date for the signature. Read permissions will be sufficient to access the contents of the blob storage container from within Azure ML. Click the Create button to generate the signature.

![Shared Access Signature properties dialog box][2]

A window containing the new SAS URL will then open. The latter part of this URL (everything after the question mark) is called the SAS token.

![Generated SAS URL][3]

## Using the Shared Access Signature token within Azure Machine Learning

![Experiment overview][4]

### Manual input of account access information

Our experiment contains three modules.
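
The account name, container name, and SAS token used by the experiment can all be read off a container-level SAS URL. The following is a minimal sketch, assuming Python 3 and the usual `https://<account>.blob.core.windows.net/<container>?<token>` layout. The helper name `split_sas_url` and the URL below are placeholders for illustration, not values from this experiment.

```python
from urllib.parse import urlsplit


def split_sas_url(sas_url):
    """Return (account_name, container_name, sas_token) from a container SAS URL."""
    parts = urlsplit(sas_url)
    # The account name is the first label of the host name
    account_name = parts.hostname.split('.')[0]
    # The container name is the first segment of the path
    container_name = parts.path.strip('/').split('/')[0]
    # The SAS token is the query string: everything after the '?',
    # left percent-encoded exactly as generated
    sas_token = parts.query
    return account_name, container_name, sas_token


# Placeholder URL for illustration only; this is not a working signature
example_url = ('https://exampleaccount.blob.core.windows.net/examplecontainer'
               '?sv=2015-04-05&sr=c&sig=EXAMPLE%3D&se=2016-04-01T00%3A00%3A00Z&sp=r')
account_name, container_name, sas_token = split_sas_url(example_url)
```
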
At upper left, an **Input Data Manually** module contains the account name, SAS token, container name, and file name (blob name) to be used in this example. This data is written in csv format.

    account_name,sas_token,container_name,file_name
    mwahlamlgallery,st=2016...&c...vHI%3D,compressedfileexample,test.csv.gz

### Non-standard Python packages

The **Saved Datasets** module at upper right is the zip file containing non-standard Python packages used for connection to Azure Blob Storage. This module is connected to the Script Bundle port of the **Execute Python Script** module.

### Python script

The **Execute Python Script** module copies the file from blob storage to its local workspace, then uses the [Pandas](http://pandas.pydata.org/) package to load the compressed csv file as a data frame. This requires loading a few non-standard packages:

    from azure.storage.blob import BlockBlobService
    import pandas as pd

The script begins by accessing the necessary information provided by the **Input Data Manually** module:

    def azureml_main(input_df = None, unused_middle_port = None):
        # Reformat input data required to access blob storage
        account_name = list(input_df['account_name'])[0]
        sas_token = list(input_df['sas_token'])[0]
        container_name = list(input_df['container_name'])[0]
        file_name = list(input_df['file_name'])[0]

Next, a connection to Azure Blob Storage is opened and the file is copied locally:

        sas_service = BlockBlobService(account_name = account_name,
                                       sas_token = sas_token,
                                       protocol = 'http')
        sas_service.get_blob_to_path(container_name, file_name, file_name)

For this example, the data are exported from the module by reading the compressed gzip file into a data frame using Pandas.

        output_df = pd.read_csv(file_name, compression='gzip')
        return output_df

If your use case requires you to export the data to other modules in Azure Machine Learning Studio, but your data do not lend themselves to the dataframe format (e.g. images), you may consider using serialization.

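
One possible way to serialize binary content so that it can pass through a data frame is to base64-encode it into a text column. The following is a minimal sketch of that idea, assuming Python 3; it is not the method used in this experiment. The helper names and the column names `file_name` and `content_b64` are hypothetical.

```python
import base64

import pandas as pd


def bytes_to_dataframe(data, file_name):
    """Wrap raw bytes in a one-row data frame as base64-encoded text."""
    encoded = base64.b64encode(data).decode('ascii')
    return pd.DataFrame({'file_name': [file_name], 'content_b64': [encoded]})


def dataframe_to_bytes(df):
    """Recover the raw bytes from a data frame built by bytes_to_dataframe."""
    return base64.b64decode(df['content_b64'].iloc[0])


# Round trip with stand-in bytes; a real script would instead read the
# file downloaded from blob storage in binary mode
payload = b'\x89PNG\r\n\x1a\n' + bytes(range(16))
frame = bytes_to_dataframe(payload, 'example.png')
restored = dataframe_to_bytes(frame)
```

A downstream **Execute Python Script** module could apply the decoding step to the data frame it receives. Base64 text is roughly a third larger than the original bytes, so this approach suits small files better than large ones.
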
After running the experiment, you can view the imported data by right-clicking the left output port and selecting the Visualize option.

[1]: http://i.imgur.com/R2P4FXL.png
[2]: http://i.imgur.com/8cYgzHQ.png
[3]: http://i.imgur.com/WYSAJqQ.png
[4]: http://i.imgur.com/UgHn6v0.png