Sample 1: Download dataset from UCI: Adult 2 class dataset

February 13, 2015



This sample demonstrates how to download a dataset from an HTTP location, add column names to the dataset, examine it, and compute some basic statistics.
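
For readers who want to try the same idea outside Azure ML Studio, here is a minimal Python sketch of reading headerless CSV data and attaching column names. It is not the R script used in this experiment. It assumes the `adult.data` file at the UCI location linked in this sample and the attribute names from the UCI data dictionary; the name `income` for the class column, the helper names `download_text` and `attach_header`, and the two sample rows are illustrative assumptions.

```python
import csv
import io
import urllib.request

# Assumed location of the raw file (the URL this sample links to).
ADULT_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# Attribute names from the UCI data dictionary; "income" is an assumed
# name for the class column, which the dictionary leaves unnamed.
COLUMN_NAMES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def download_text(url=ADULT_URL):
    """Fetch the raw CSV text over HTTP (not called in the offline demo below)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def attach_header(csv_text, column_names):
    """Parse headerless CSV text and return one dict per row, keyed by column name."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        if len(fields) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(column_names, fields)))
    return rows


# Two illustrative rows in the same layout as the source file,
# so the demo runs without network access.
SAMPLE = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

rows = attach_header(SAMPLE, COLUMN_NAMES)
print(rows[0]["age"], rows[0]["income"])
```

To run against the real file, replace `SAMPLE` with `download_text()`. The field-count check is a design choice that surfaces malformed lines early instead of silently misaligning columns.
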
##Download Dataset##

This experiment demonstrates how to use the **Reader** module to read data into Azure ML using HTTP, and then add a header to the data by using the **Enter Data** module.

##Data##

The dataset we want to use in our experiment contains income and demographics extracted from the public census data. We obtained the dataset from the [UCI repository](http://archive.ics.uci.edu/ml/machine-learning-databases/adult) by using the **Reader** module to specify the location of the source data.

From the data dictionary, we know that the data is in CSV format, without a header row, so we will specify those options in the **Reader** module and use the following modules to improve the data:

- Using the **Enter Data** module, we will manually create a header row.
- Using the **Execute R Script** module, we will insert the header row into the dataset.

Finally, we will output some basic statistics for the dataset using the **Descriptive Statistics** module.

##Creating the Experiment##

First we need to configure the **Reader** module:

1. For **Data source** we select _Web URL via HTTP_.
2. In the **URL** text box, we provide the URL for the source data, including the name and file extension of the CSV data file.
3. For the **Data format** option, we select _CSV_.
4. The data file does not have a header row, so we leave the **CSV or TSV has header row** option unchecked.

![][image1]

![][image2]

One way to change column names would be to use the **Metadata Editor** module and provide a comma-separated list of names for the columns. However, a long list of column names is hard to see in the text box, so we will show you an alternate way to rename the columns.

1. First, use the **Enter Data** module to type a list of column names to be used as the header row. The illustration above shows the column names we typed in. (You can get a full list of the columns in the census data from the UCI repository.)
2. Next, use the **Execute R Script** module to insert the header row into the dataset. The following diagram shows the example code. (To get a copy of this sample R code, you can create a copy of this experiment, and then edit the **Execute R Script** module.)

![][image3]

##Data Visualization##

You can review the format of the original data on the UCI Machine Learning Repository, at [http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data). Note that the original data has no column names. However, when you import the data into Azure ML Studio using the **Reader** module, default column names are assigned to all the columns. To view the auto-generated column names, right-click the output port of the **Reader** module and select **Visualize**.

![][image4]

![][image5]

After you use the **Execute R Script** module to insert the header row created by using the **Enter Data** module, the modified dataset is as shown in the diagram below. This small change makes the data much easier to read and work with.

![][image6]

Finally, we use the **Descriptive Statistics** module to compute some basic statistics on the dataset, and use the **Visualize** option from the output port to view the results.

![][image7]

<!-- Images -->
[image1]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/reader_parameters.PNG
[image2]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/enter_data.PNG
[image3]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/r_code.PNG
[image4]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/reader_visualize.PNG
[image5]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/reader_output.PNG
[image6]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/header_data.PNG
[image7]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/desc_stats.PNG
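
As a rough analogue of the basic statistics step, the following Python sketch computes a handful of summary values for one numeric column of a dataset whose rows are keyed by column name. It is only an illustration under stated assumptions: the function name `describe_column`, the choice of statistics, and the sample rows are hypothetical, and the **Descriptive Statistics** module may report different or additional measures.

```python
import statistics


def describe_column(rows, column):
    """Return count, mean, median, min, max, and sample std dev for a numeric column."""
    # Values arrive as strings when read from CSV, so convert them first.
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # The sample standard deviation needs at least two values.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Hypothetical rows keyed by column name, with string values as read from CSV.
sample_rows = [
    {"age": "39", "hours-per-week": "40"},
    {"age": "50", "hours-per-week": "13"},
    {"age": "38", "hours-per-week": "40"},
]

summary = describe_column(sample_rows, "age")
print(summary)
```

Calling `describe_column` once per numeric column gives a per-column summary; categorical columns such as `workclass` would need counts of distinct values instead.
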