The original data (merged from multiple files) had 131,932 rows and 17 columns (attributes). After cleaning and pivoting on unique trials, we ended up with a data set of 23,280 rows (each row a unique farming experiment) and 155 columns (each column describing an attribute of that experiment).
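As a rough illustration of what "pivoting on unique trials" means, the sketch below reshapes long-format records (one row per trial and attribute) into one row per trial. The field names `trial_id`, `attribute`, and `value`, and the sample records, are hypothetical and not taken from the actual data set.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per (trial, attribute) measurement.
long_rows = [
    {"trial_id": "T1", "attribute": "grain_yield", "value": 4.2},
    {"trial_id": "T1", "attribute": "row_spacing_cm", "value": 20.0},
    {"trial_id": "T2", "attribute": "grain_yield", "value": 3.1},
    {"trial_id": "T2", "attribute": "nitrogen_kg_ha", "value": 60.0},
]


def pivot_on_trial(rows):
    """Turn long-format rows into one dict per trial, keyed by attribute."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["trial_id"]][row["attribute"]] = row["value"]
    # Collect every attribute seen so that all trials share the same columns.
    columns = sorted({row["attribute"] for row in rows})
    # Attributes a trial never recorded become None (a null value).
    return {
        trial: {col: attrs.get(col) for col in columns}
        for trial, attrs in wide.items()
    }


wide_table = pivot_on_trial(long_rows)
```

Each trial ends up with the same set of columns, and attributes that a trial never recorded show up as `None`, which is why a pivoted table of this kind tends to contain many nulls.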
Data Analytics in Agriculture
Data analytics is about getting to know your data and preparing it for models. For any project applying data science to agriculture, or any data science project in general, exploratory data analysis (EDA) is the crucial first step. EDA presents insights from the data in the form of graphs, charts, and statistical analysis. This helps in understanding the features in the data and their possible correlations and trends, which in business terms is also called descriptive analytics.
Of the 155 columns, some were outputs, such as grain yield, and the rest were inputs. Inputs ranged from controlled independent variables (e.g. amount of fertilizer) and uncontrolled independent variables (e.g. precipitation) to uncontrolled dependent variables (e.g. severity of diseases). Some columns were considered junk for our purpose (e.g. the email address of the contact person in charge of the experiment).
For the EDA we focused on the visualization of the main features and also the interaction between features.
a. Visualization of Main Features
We cannot visualize all of the features in this post, but some of the important features from each category are presented to get a sense of the data set.
Because many farms did not collect data for some of the attributes, null values were present in most of the features. Accordingly, for most features the number of valid values is smaller than the total number of farming trials.
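A minimal sketch of how the number of valid values per feature could be counted, assuming each trial is stored as a dictionary with `None` for missing entries. The feature names and values are made up for illustration.

```python
def valid_counts(trials):
    """Count non-missing values per feature across a list of trial dicts."""
    counts = {}
    for trial in trials:
        for feature, value in trial.items():
            counts.setdefault(feature, 0)
            if value is not None:
                counts[feature] += 1
    return counts


# Hypothetical trials; None marks a value the farm did not record.
trials = [
    {"grain_yield": 4.2, "soil_moisture": None},
    {"grain_yield": 3.1, "soil_moisture": 0.27},
    {"grain_yield": None, "soil_moisture": None},
]
counts = valid_counts(trials)
```

Comparing each count with `len(trials)` shows how much of a feature is actually usable before plotting or modeling it.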
b. Interaction Between Features
To further illustrate the complexity of the otherwise ‘well established’ task of adding fertilizers, these graphs show the effect of adding high and low amounts of nitrogen fertilizer in the first and second rounds of fertilizing.
Among the farmers who added fertilizer in the first round, some added it in the second and third rounds as well. The graph compares the 1000-grain weight for the top 15% and bottom 15% of nitrogen fertilizer amounts added in the first and second rounds. In the first round, farms that received a relatively high amount of nitrogen (top 15%) produced smaller grains (left image). In the second round, on the other hand, high nitrogen fertilizer improved grain quality compared to farms that received too little or no nitrogen (bottom 15%).
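The top 15% versus bottom 15% comparison can be sketched as follows. This is an illustrative reconstruction rather than the code behind the graphs, and the records are synthetic (nitrogen amount, 1000-grain weight) pairs invented for the example.

```python
from statistics import mean, quantiles


def compare_extremes(records, fraction=0.15):
    """Mean 1000-grain weight for the lowest and highest `fraction` of nitrogen."""
    nitrogen = [n for n, _ in records]
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(nitrogen, n=100, method="inclusive")
    low_cut = cuts[round(fraction * 100) - 1]
    high_cut = cuts[round((1 - fraction) * 100) - 1]
    low = [w for n, w in records if n <= low_cut]
    high = [w for n, w in records if n >= high_cut]
    return mean(low), mean(high)


# Synthetic data: 21 farms, weight falls as nitrogen rises (first-round pattern).
records = [(n, 40 - n / 10) for n in range(0, 101, 5)]
low_mean, high_mean = compare_extremes(records)
```

Running the same function on the second-round amounts would give the second panel of the comparison.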
The variance in the inputs determines to what extent they play a role in the outputs. On top of that, some features are inherently more important. For example, variance in soil moisture plays a critical role in yield, while variance in wind direction does not have such an impact. This is where variation in one important attribute can completely overshadow changes in less important ones. See the effect of the space between rows (higher than 50 cm or lower than 15 cm) on the yield and the 1000-grain weight. In both cases the p-value is much smaller than 0.01.
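The post does not say which statistical test produced these p-values. As one plausible option, the sketch below computes Welch's t statistic for two groups (wide versus narrow row spacing) and approximates the two-sided p-value with a normal distribution, which is only reasonable for large samples. The yield values are synthetic.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_test(a, b):
    """Welch's t statistic with a normal-approximation two-sided p-value."""
    # Standard error of the difference in means, allowing unequal variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Assumption: samples are large enough that t is close to standard normal.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Synthetic yields for farms with wide (> 50 cm) and narrow (< 15 cm) row spacing.
wide_spacing = [3.0 + 0.01 * i for i in range(50)]
narrow_spacing = [4.0 + 0.01 * i for i in range(50)]
t_stat, p_value = welch_test(wide_spacing, narrow_spacing)
```

For small groups, an exact t distribution (for example from a statistics library) would be the safer choice than the normal approximation used here.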
The space between rows shows that each country followed a protocol (for farming practice) that could be radically different from other countries. For example, the average space between rows for Ukraine was almost half of South Africa’s, which was almost half of Japan’s.
Different farming practices, in addition to environmental factors, resulted in outputs that were affected more by location than by genes.
With large, complex datasets that have a high number of attributes, statistical tools and data analytics are essential initial steps but are not enough to produce predictive insights. That is where machine learning models play a critical role as predictive analysis tools.
Machine Learning in Agriculture
Data science in agriculture is about taking the insights extracted with data analytics techniques to the next level and adding predictive power through machine learning algorithms. This approach gives a deeper understanding of farming data and how to use it. Despite the huge potential of data science in agriculture, its practical applications at an industry level are relatively limited. In this case study, we show the capability of data science in predicting yield and disease severity in wheat, and focus on the most important features for prediction.
1. Predicting Yield in Wheat
In data science, supervised machine learning is applicable when the dataset has labels. Labels are what the ML model learns to predict. In this case, labels could be yield, 1000-grain weight, or the occurrence of a specific disease. The graph below shows a schematic of the neural network model we used for predictive analysis.
We used the same model to predict the 1000-grain weight. The neural net could predict 1000-grain weight with an R² score of 48% and a mean absolute error of 6 g. The relatively small error compared to the yield prediction is partly due to the smaller relative variance of the 1000-grain weight values compared to that of yield.
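For reference, the two metrics quoted above can be computed as in this minimal sketch. The true and predicted weights are made-up numbers, not outputs of the model described in this post.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical 1000-grain weights in grams.
y_true = [30, 35, 40, 45, 50]
y_pred = [32, 33, 41, 44, 52]
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

Because R² divides by the total variation in the labels while the mean absolute error does not, a target with little spread can show a small absolute error alongside a modest R².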
2. Predicting Diseases in Wheat
The performance of the classifier is shown as a confusion matrix, where the diagonal represents the cases the model classified correctly. A reliable classification model should produce much larger values on the diagonal of the confusion matrix than in the other cells.
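A confusion matrix of this kind can be built as sketched below. The three severity classes and the label lists are hypothetical, since the post does not list the actual disease classes.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical disease-severity classes and predictions.
labels = ["low", "medium", "high"]
y_true = ["low", "low", "medium", "high", "high", "medium"]
y_pred = ["low", "medium", "medium", "high", "high", "medium"]

matrix = confusion_matrix(y_true, y_pred, labels)
# Accuracy is the share of samples that fall on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
```

Off-diagonal cells show which classes get confused with each other; in this toy example one "low" case is predicted as "medium".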
3. Feature Importance for Data Science in Agriculture
Top 5 important attributes that affected the yield of wheat farms
Data Science in Agriculture and Data Leakage
Machine learning models combined with descriptive analysis make data science a powerful tool in agriculture, helping farmers reduce cost, predict yield and diseases, and act on the information unlocked by data science techniques.