This experiment demonstrates prediction of corn yield and futures price based on weather and market data, using a feature selection strategy.
## Corn yield and futures price prediction

Futures are contracts to buy or sell commodities on a future date at a specified price. Corporations can use futures to hedge against price increases and ensure access to limited goods, but accurate prediction of future commodity values is essential for avoiding unwise purchases. Corn grain poses a special challenge because futures prices reflect weather-dependent supply as well as variable demand. This experiment demonstrates how publicly available meteorological and economic data can be used to predict the corn futures price. Our approach to corn yield prediction is inspired by Westcott and Jewison's ["Weather Effects on Expected Corn and Soybean Yields"](http://www.ers.usda.gov/publications/fds-feed-outlook/fds-13g-01.aspx) (2013).

## Feature selection

In this experiment, the number of available features (types of weather and market data recorded in each state in each year) exceeds the size of the training set (number of planting seasons). This experiment demonstrates the sequential forward selection approach to choosing a subset of features for use in prediction.

# Data

## Sources

Daily beginning-of-trade prices for a continuous corn futures index ([Corn Futures, Continuous Contract #1, Front Month](https://www.quandl.com/data/CHRIS/CME_C1-Corn-Futures-Continuous-Contract-1-C1-Front-Month)) were obtained with permission from [Quandl](https://www.quandl.com). Adjustments for inflation were performed using consumer price indices available in [The World Bank](http://www.worldbank.org/)'s [DataBank of World Development Indicators](http://databank.worldbank.org/data/reports.aspx?source=world-development-indicators). Seasonal corn yield (1866-2015) and weekly planting progress (1979-2015) were retrieved from the [United States Department of Agriculture](http://www.usda.gov/wps/portal/usda/usdahome)'s [National Agricultural Statistics Service](http://www.nass.usda.gov/) [Quick Stats](http://quickstats.nass.usda.gov/) tool.
Monthly temperature, precipitation, and Palmer Drought Severity Index information were obtained from the [National Oceanic and Atmospheric Administration](http://www.noaa.gov/)'s [National Climatic Data Center](http://www.ncdc.noaa.gov/) via the [nClimDiv FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/cirs/climdiv/). Beginning-of-year corn grain stockpiles and total corn grain supply data were provided by the [USDA Feed Grains Database](http://www.ers.usda.gov/data-products/feed-grains-database.aspx).

## Preparation

Several reformatting and merging steps are required to assemble these datasets into a shared data table that can be used for predictions with Azure ML's standard tools. These actions are performed within **Execute R Script** and **Execute Python Script** modules. Some of the common procedures used are briefly summarized below; see the module contents for more details.

### Casting features into columns

The input file `planting_progress.csv` contains separate rows for planting progress (percentage of planned acreage planted) in each week of the planting season, e.g.

      Year Period State state_id Value
    1 2015 WEEK08 TEXAS       48     2
    2 2015 WEEK09 TEXAS       48     4
    3 2015 WEEK10 TEXAS       48     6

We use the `dcast()` function from R's `reshape2` package to collect the weekly planting progress values for each state and year combination into a single row, e.g.

      Year State state_id Week08 Week09 Week10
    1 2015 TEXAS       48      2      4      6

### Handling missing data

States do not typically report planting progress of 0% (planting not yet begun) or 100% (planting completed). These missing entries for early/late weeks of the planting season are filled in by an **Execute R Script** module.

### Feature creation

Severe weather at both extremes (drought and flood, high and low temperatures) can adversely impact yield. Using **Execute R Script** modules, we add binary and nonlinear features to account for these effects in a linear regression model.
Specifically, we create binary indicators from the Palmer Drought Severity Index that signify whether water availability was extreme (either drought or flood) in each month. We also add new features representing the square of monthly precipitation to account for nonlinear effects of precipitation.

### Weighted averaging of features

Following the approach of Westcott and Jewison (2013), we focus our model on eight states which together account for three quarters of all corn production in the United States: Iowa, Illinois, Nebraska, Minnesota, Indiana, South Dakota, Ohio, and Missouri. Producing acreage-weighted averages of the features in our model over these eight states reduces both noise and the number of missing values. (Unfortunately, it also reduces the number of observations available for training and validation.) This step is performed within an **Execute R Script** module.

### Adjusting for inflation

Some variation in corn grain price with time is attributable to inflation. We adjust futures prices to 2010 US dollar equivalents using The World Bank's consumer price index (The World Bank: World Development Indicators). We also summarize the daily price information by computing a monthly average. For convenience with string manipulations, we perform this step within an **Execute Python Script** module.

# Feature Selection and Model Training

## Motivation for use of Sequential Forward Selection

Our goal is to predict, earlier in the year, the season's corn yield and the December corn futures price. As an example, we have chosen the end of July as the prediction time: we hide information that would only have been available later in the year using a **Project Columns** module.
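The column-hiding step can also be pictured in code. The following is a minimal pandas sketch, not the **Project Columns** configuration used in the experiment. It assumes a hypothetical naming scheme in which month-specific columns end in a two-digit month number (e.g. `precip_07`); the function name `hide_later_months` and the toy values are likewise made up for illustration.

```python
import re

import pandas as pd

# Assumed (hypothetical) naming scheme: month-specific columns end in "_MM".
MONTH_SUFFIX = re.compile(r"_(\d{2})$")


def hide_later_months(df, last_month, targets=()):
    """Keep target columns, columns with no month suffix, and columns for
    months up to and including last_month; drop everything else."""
    keep = []
    for col in df.columns:
        match = MONTH_SUFFIX.search(col)
        if col in targets or match is None or int(match.group(1)) <= last_month:
            keep.append(col)
    return df[keep]


# Toy table with made-up values, for illustration only.
data = pd.DataFrame({
    "year": [2014, 2015],
    "precip_06": [4.1, 5.0],
    "precip_07": [3.2, 4.4],
    "precip_08": [2.9, 3.1],
    "futures_price_07": [3.9, 4.0],
    "futures_price_12": [3.8, 3.7],
})

# Predicting at the end of July: August precipitation is hidden, while the
# December price is retained because it is the value to be predicted.
visible = hide_later_months(data, last_month=7, targets=("futures_price_12",))
print(list(visible.columns))
# ['year', 'precip_06', 'precip_07', 'futures_price_07', 'futures_price_12']
```

Listing the prediction targets explicitly keeps the December price in the table even though its month falls after the cutoff.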
Our candidate features for corn yield prediction include:

* Year (crop yields have risen gradually with adoption of modern agricultural practices)
* Monthly precipitation, March through July
* Monthly precipitation squared, March through July
* Mean temperature, March through July
* Maximum temperature, March through July
* Palmer Drought Severity Index (Boolean indicator of extreme conditions), March through July
* Planting progress, by week in planting season

Candidate features for corn futures price prediction include all of the above, as well as:

* Same-year average corn futures price, January through July
* Fraction of yearly corn crop stockpiled at the beginning of the year
* Total acres planted

It is difficult to guess in advance which of these features will be most valuable for predicting corn yield or futures prices. However, our dataset contains too few observations to support a model that uses all of them: we must choose a subset to use for our predictor.

One approach to feature selection would be to compute the correlation between each feature and the variable we would like to predict. This type of feature selection can be performed with the built-in **Filter Based Feature Selection** module. Unfortunately, this approach may lead to selection of features which are strongly correlated with one another and therefore redundant.

Sequential forward selection, by contrast, adds a feature only if it improves the model's performance beyond what the already chosen features provide. The approach is iterative: in each round, a model's current performance is compared to the same model with one candidate feature added. The candidate feature which best improves performance is added to the model; the process continues until either (i) no candidate feature improves the model quality or (ii) an upper limit on the number of features is reached.

## Implementation

We use a **Linear Regression** model for both feature selection and predictor training.
This choice reflects:

* the small number of observations in our dataset (which would lead to overfitting by most nonlinear models)
* the speed of parameter estimation in linear models (each round of sequential forward selection requires building many candidate models)

Feature selection is performed within an **Execute R Script** module; pre-existing Azure ML modules are used for predictor training, scoring, and evaluation.

# Performance

## Crop yield prediction

We separate weather and crop data from the years 1950-2015 into training (n=46) and validation (n=20) sets using the **Split Data** module. The five features selected for prediction of crop yield were the year, planting progress for two weeks in March/April, and the maximum and mean temperatures in July. The coefficients fit from this model suggest that early planting and unusually high July temperatures are negatively correlated with yield.

In the validation set, average corn yield was 106 +/- 37 bushels/acre. Our corn yield predictor performed relatively well: its mean absolute error was 7.5 bushels/acre, and the coefficient of determination between the predicted and observed yields was R^2=0.92.

## Corn futures price prediction

Availability of corn futures price and consumer price index data (needed to adjust futures prices for inflation) limited the observations available for corn futures price prediction to the years 1975-2014. A **Split Data** module was used to separate these years into training (n=28) and validation (n=12) data sets. Sequential forward selection resulted in five features: the price of corn per bushel in June and July, precipitation in July, and planting progress for two weeks in late April. Midsummer corn prices and early planting were positively correlated with December prices, while July precipitation was negatively correlated with December corn price.

The average price of corn per bushel in December was $5.92 +/- $2.58 in the validation set.
The mean absolute error in predicted price was $0.62, and the coefficient of determination between the predicted and observed prices was R^2=0.85.
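The sequential forward selection loop described under Implementation, together with the two metrics reported above, can be sketched in Python as follows. This is an illustrative sketch only, not the R code run in the experiment's **Execute R Script** module. The selection criterion (cross-validated mean absolute error on the training data), the number of folds, the function names, and the synthetic data are all assumptions made for the example; the source does not state which criterion the experiment uses.

```python
import numpy as np


def fit_predict(X_train, y_train, X_test):
    """Fit ordinary least squares with an intercept and predict on X_test."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true, y_pred):
    """Coefficient of determination between observed and predicted values."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def cv_mae(X, y, cols, n_folds=5):
    """Mean of per-fold MAE for a linear model restricted to columns `cols`
    (assumed criterion; an empty `cols` gives an intercept-only model)."""
    X_sub = X[:, np.array(cols, dtype=int)]
    indices = np.arange(len(y))
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        pred = fit_predict(X_sub[train], y[train], X_sub[held_out])
        errors.append(mean_absolute_error(y[held_out], pred))
    return float(np.mean(errors))


def sequential_forward_selection(X, y, max_features=5):
    """Greedily add the feature that most reduces the cross-validated MAE."""
    selected = []
    remaining = list(range(X.shape[1]))
    best_error = cv_mae(X, y, selected)  # baseline: intercept only
    while remaining and len(selected) < max_features:
        # Score the current model plus each candidate feature in turn.
        scores = {j: cv_mae(X, y, selected + [j]) for j in remaining}
        best_j = min(scores, key=scores.get)
        if scores[best_j] >= best_error:
            break  # stopping rule (i): no candidate improves the model
        selected.append(best_j)
        remaining.remove(best_j)
        best_error = scores[best_j]
    return selected  # stopping rule (ii) is enforced by the while condition


# Synthetic data: only columns 0 and 3 carry signal, the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=60)

chosen = sequential_forward_selection(X, y, max_features=5)
print(chosen)  # columns 0 and 3 are expected to be picked first
```

Because every round refits one small linear model per remaining candidate, the cost of each round grows with the number of candidates, which is why fast parameter estimation matters here.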