Article type: Research article
Authors
Department of Biosystems Engineering, Faculty of Agriculture, Shahid Chamran University of Ahvaz, Ahvaz, Iran
Abstract
Accurate prediction of the yield of strategic crops such as sugarcane plays a key role in optimal resource management and ensuring food security. The main objective of this study is to develop a robust and interpretable model based on the K-Nearest Neighbors (KNN) regression algorithm to predict sugarcane yield prior to harvest. To this end, Sentinel-2 satellite imagery was integrated with engineered agronomic features, including indicators of water and fertilizer use efficiency. In addition, the K-means clustering algorithm was employed to partition fields into homogeneous groups, enabling the KNN model to better capture spatial heterogeneity and improve prediction accuracy. Key vegetation indices were extracted from Sentinel-2 time-series imagery, and engineered features were generated to enrich the dataset. The proposed model achieved a coefficient of determination (R²) of 0.8706 and a root mean square error (RMSE) of 7.80 t ha⁻¹ on the test dataset. Feature importance analysis revealed that the engineered variables—particularly water productivity—were among the main predictors of yield. The results demonstrate that integrating satellite data with a simple yet effective KNN model provides a transparent and practical tool to support decision-making in precision agriculture.
EXTENDED ABSTRACT
Sugarcane is a globally important crop that serves as a primary source of sugar and a vital feedstock for biofuels. Traditional yield estimation methods are often labor-intensive, time-consuming, and limited in spatial coverage, highlighting the need for more scalable, data-driven approaches. The advent of remote sensing technologies has transformed agricultural monitoring by providing extensive, multi-temporal, and high-resolution datasets. While machine learning algorithms show promise, many existing models rely solely on spectral data and often overlook the predictive power of engineered agro-technical features that quantify farm management efficiency. The main objective of this study is to develop a robust and interpretable K-Nearest Neighbors (KNN) regression model for pre-harvest sugarcane yield prediction by integrating Sentinel-2 imagery with engineered agro-technical features. Key features, such as water and fertilizer use efficiency, are incorporated to enhance model performance. Additionally, K-means clustering is applied to group farms into homogeneous categories, enabling the KNN model to better capture spatial heterogeneity and improve prediction accuracy. Overall, the study aims to provide a reliable and transparent framework for estimating pre-harvest sugarcane yield using the synergy of satellite imagery and machine learning techniques.
This study was conducted on sugarcane farms managed by the Dehkhoda Sugarcane Agro-industry over the 2017 to 2024 crop years.
Agronomic and management data: The ground dataset consisted of records from 2,417 unique farm plots. Key variables included final gross yield (t/ha), water consumption (m³/ha), total fertilizer application (kg/ha), and soil electrical conductivity (EC).
Satellite data and feature engineering: Multi-temporal Sentinel-2 Level-2A (atmospherically corrected) images were used. Standard vegetation indices such as NDVI, EVI, and GNDVI were calculated from the Sentinel-2 bands. In addition, five engineered features were developed to quantify agro-technical efficiency and interactions: Water Use Efficiency, Fertilizer Efficiency, a Vegetation Health Index, the Irrigation-to-Fertilizer Ratio, and a Soil-Vegetation Interaction metric. The 25 most informative features were then selected using the SelectKBest algorithm with an F-regression scoring function.
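As an illustration of the index calculation mentioned above, the following minimal sketch computes NDVI, EVI, and GNDVI with their standard formulas. It assumes surface reflectance scaled to the range 0 to 1 and the usual Sentinel-2 band mapping (B2 blue, B3 green, B4 red, B8 near-infrared); the function name `vegetation_indices` and the sample reflectance values are hypothetical and are not taken from the study.

```python
import numpy as np


def vegetation_indices(blue, green, red, nir):
    """Return NDVI, EVI and GNDVI for reflectance arrays scaled to [0, 1].

    Assumed band mapping for Sentinel-2: blue=B2, green=B3, red=B4, nir=B8.
    """
    blue, green, red, nir = (
        np.asarray(band, dtype=float) for band in (blue, green, red, nir)
    )
    # Normalized Difference Vegetation Index
    ndvi = (nir - red) / (nir + red)
    # Enhanced Vegetation Index (standard coefficients 2.5, 6, 7.5, 1)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    # Green NDVI: same form as NDVI, with the green band replacing red
    gndvi = (nir - green) / (nir + green)
    return {"NDVI": ndvi, "EVI": evi, "GNDVI": gndvi}


# Hypothetical reflectance values for one dense-canopy pixel (illustration only).
indices = vegetation_indices(blue=[0.04], green=[0.08], red=[0.05], nir=[0.45])
for name, values in indices.items():
    print(name, np.round(values, 3))
```

In practice the same function can be applied to whole band rasters, and the per-date index values can then be aggregated per plot to build the time-series features.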
Model evaluation: The dataset was split into training (80%) and testing (20%) sets, and 5-fold cross-validation was applied to the training data to assess model robustness. Model performance was evaluated using three standard metrics: the coefficient of determination (R²), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
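The evaluation protocol can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the data are synthetic (generated with `make_regression`), and the feature scaling step, the number of neighbors, and the random seeds are assumptions that the text does not specify. Only the 80/20 split, the 5-fold cross-validation, the selection of 25 features with F-regression, and the three metrics come from the description above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the plot-level feature table (NOT the study's data).
X, y = make_regression(
    n_samples=500, n_features=40, n_informative=10, noise=10.0, random_state=0
)

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling and n_neighbors=5 are assumptions; k=25 follows the text.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=25),
    KNeighborsRegressor(n_neighbors=5),
)

# 5-fold cross-validation on the training data only.
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

# Final fit on the training set, then evaluation on the held-out test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))
print(f"CV R2 mean: {cv_r2.mean():.3f}")
print(f"Test R2: {r2:.3f}  RMSE: {rmse:.3f}  MAE: {mae:.3f}")
```

Placing the feature selection inside the pipeline means it is refitted within each cross-validation fold, so the validation folds do not influence which features are kept.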
The statistical summary of the dataset revealed wide variability in final gross yield, which ranged from 14.58 to 147.13 t/ha. K-means clustering partitioned the farm plots into four distinct agro-technical profiles, as confirmed by the mean feature values of each cluster. For example, Cluster 1 contained high-yielding plots with low soil salinity and high fertilizer use, while Cluster 2 included low-yielding plots with the lowest average fertilizer input. This spatial stratification was a crucial step because it enabled the KNN model to learn from a more homogeneous set of neighboring plots, thereby enhancing predictive accuracy. The proposed KNN model performed strongly on the unseen test data, achieving an R² of 0.8706, an RMSE of 7.80 t/ha, and an MAE of 6.59 t/ha. The R² value indicates that the model explains approximately 87% of the variance in final gross yield. A scatter plot of predicted versus actual yields showed data points closely aligned along the 1:1 line, consistent with these metrics. Feature importance analysis revealed that the engineered features, particularly water use efficiency, were significant predictors of yield, highlighting the value of integrating on-farm management indicators with remote sensing features.
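The clustering step and the per-cluster profile check can be sketched as below. The text does not state which variables were clustered or how the cluster labels were passed to the KNN model, so this sketch assumes clustering on three standardized management variables (water, fertilizer, and soil EC) and only reproduces the profiling by cluster means. All values are randomly generated, and the variable names and magnitudes are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_plots = 400

# Hypothetical management variables per plot (synthetic, NOT the study's data).
feature_names = ["water_m3_ha", "fertilizer_kg_ha", "soil_ec"]
features = np.column_stack(
    [
        rng.normal(25000.0, 4000.0, n_plots),  # water consumption
        rng.normal(350.0, 80.0, n_plots),      # total fertilizer application
        rng.normal(4.0, 1.5, n_plots),         # soil electrical conductivity
    ]
)

# Standardize so that no variable dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(features)

# Four clusters, as reported in the text; n_init and the seed are assumptions.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Mean of each original (unscaled) variable per cluster: the cluster profile.
cluster_means = np.vstack(
    [features[labels == c].mean(axis=0) for c in range(4)]
)
for c, row in enumerate(cluster_means):
    profile = ", ".join(f"{n}={v:.1f}" for n, v in zip(feature_names, row))
    print(f"Cluster {c} (n={np.sum(labels == c)}): {profile}")
```

One possible way to use the result, again an assumption, is to append the cluster label to the feature table or to fit a separate KNN regressor within each cluster, so that neighbors are drawn from plots with a similar management profile.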
This study developed and validated a robust and highly interpretable framework for pre-harvest sugarcane yield prediction. By integrating unsupervised K-means clustering with KNN regression and enhancing it with engineered agro-technical features, the proposed model achieved an R² of 0.8706 on unseen test data. The findings demonstrate that a relatively simple, instance-based algorithm, when coupled with effective feature engineering and spatial stratification, can perform competitively with more complex black-box models. The superior interpretability and computational efficiency of the KNN approach make it a practical and transparent tool for decision-making in real-world precision agriculture applications. Future work should focus on integrating dynamic meteorological data and exploring multi-task learning frameworks to predict multiple crop parameters simultaneously.
Feryal Jaderi: Data collection and analysis, and writing of the initial draft
Nasim Monjezi: Study design, text revision, and provision of expert opinions and reviews
Not applicable
The authors would like to thank Shahid Chamran University of Ahvaz for providing funding for this research.
The authors avoided data fabrication, falsification, plagiarism, and misconduct.
The authors declare no conflict of interest.