Predicting Sugarcane Yield Using Sentinel-2 Vegetation Indices, K-Means Clustering, and K-Nearest Neighbors (KNN) Regression

Article type: Research article

Authors

  • Feryal Jaderi
  • Nasim Monjezi
Department of Biosystems Engineering, Faculty of Agriculture, Shahid Chamran University of Ahvaz, Ahvaz, Iran

DOI: 10.22059/ijbse.2025.403177.665620

Abstract

Accurate prediction of the yield of strategic crops such as sugarcane plays a key role in optimal resource management and in ensuring food security. The main objective of this study is to develop a robust and interpretable model based on the K-Nearest Neighbors (KNN) regression algorithm to predict sugarcane yield prior to harvest. To this end, Sentinel-2 satellite imagery was integrated with engineered agronomic features, including indicators of water and fertilizer use efficiency. In addition, the K-means clustering algorithm was employed to partition fields into homogeneous groups, enabling the KNN model to better capture spatial heterogeneity and improve prediction accuracy. Key vegetation indices were extracted from Sentinel-2 time-series imagery, and engineered features were generated to enrich the dataset. The proposed model achieved a coefficient of determination (R²) of 0.8706 and a root mean square error (RMSE) of 7.80 t/ha on the test dataset. Feature importance analysis revealed that the engineered variables, particularly water productivity, were among the main predictors of yield. The results demonstrate that integrating satellite data with a simple yet effective KNN model provides a transparent and practical tool to support decision-making in precision agriculture.

Keywords

  • K-Means clustering
  • KNN regression
  • Sentinel-2
  • Sugarcane
  • Yield prediction

EXTENDED ABSTRACT

 

Introduction

Sugarcane is a globally important crop that serves as a primary source of sugar and a vital feedstock for biofuels. Traditional yield estimation methods are often labor-intensive, time-consuming, and limited in spatial coverage, highlighting the need for more scalable, data-driven approaches. The advent of remote sensing technologies has transformed agricultural monitoring by providing extensive, multi-temporal, and high-resolution datasets. While machine learning algorithms show promise, many existing models rely solely on spectral data and often overlook the predictive power of engineered agro-technical features that quantify farm management efficiency. The main objective of this study is to develop a robust and interpretable K-Nearest Neighbors (KNN) regression model for pre-harvest sugarcane yield prediction by integrating Sentinel-2 imagery with engineered agro-technical features. Key features, such as water and fertilizer use efficiency, are incorporated to enhance model performance. Additionally, K-means clustering is applied to group farms into homogeneous categories, enabling the KNN model to better capture spatial heterogeneity and improve prediction accuracy. Overall, the study aims to provide a reliable and transparent framework for estimating pre-harvest sugarcane yield using the synergy of satellite imagery and machine learning techniques.

Materials and Methods

This study was conducted on sugarcane farms managed by the Dehkhoda Sugarcane Agro-industry during the 2017 to 2024 crop years.

Agronomic and management data: The ground dataset consisted of records from 2,417 unique farm plots. Key variables included final gross yield (t/ha), water consumption (m³/ha), total fertilizer application (kg/ha), and soil electrical conductivity (EC).

Satellite data and feature engineering: Multi-temporal Sentinel-2 Level-2A (atmospherically corrected) images were used, and standard vegetation indices such as NDVI, EVI, and GNDVI were calculated from the Sentinel-2 bands. In addition, five engineered features were developed to quantify agro-technical efficiency and interactions: Water Use Efficiency, Fertilizer Efficiency, a Vegetation Health Index, the Irrigation-to-Fertilizer Ratio, and a Soil-Vegetation Interaction metric. The 25 most informative features were then selected using the SelectKBest algorithm with an F-regression scoring function.
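The three vegetation indices named above have standard definitions, which the following minimal sketch illustrates. It assumes surface reflectance already scaled to the 0 to 1 range and the usual Sentinel-2 band roles (B2 blue, B3 green, B4 red, B8 near-infrared); the study does not state which bands or coefficients were used, so the band mapping, the EVI coefficients, and the example pixel values are assumptions.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)


def gndvi(nir, green):
    """Green NDVI: (NIR - Green) / (NIR + Green)."""
    return (nir - green) / (nir + green)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the commonly used coefficients
    (gain 2.5, C1 = 6, C2 = 7.5, L = 1); assumed, not taken from the paper."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


# Illustrative reflectances for a single vegetated pixel (hypothetical values).
blue, green, red, nir = 0.05, 0.10, 0.10, 0.50

print(f"NDVI  = {ndvi(nir, red):.3f}")
print(f"GNDVI = {gndvi(nir, green):.3f}")
print(f"EVI   = {evi(nir, red, blue):.3f}")
```

Because the functions use only arithmetic operators, they also work element-wise when the arguments are NumPy arrays holding whole band rasters.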

Two-step modeling approach:

  1. The K-means algorithm was applied for unsupervised clustering of farm plots based on combined spectral and agro-technical features. The optimal number of clusters (k=4) was determined using the Elbow Method, and the resulting cluster labels were added as a new categorical feature to the dataset.
  2. The K-Nearest Neighbors (KNN) regression algorithm was then trained on the cluster-augmented dataset to predict yield.
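The two steps above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the authors' code: the feature matrix, the standardization step, the one-hot encoding of the cluster label, and the KNN settings (`n_neighbors=5`, distance weighting) are all assumptions. Only the choice of four clusters comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the plot-level feature table (NOT the study's data).
X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardize so that no single feature dominates the Euclidean distances
# used by both K-means and KNN (assumed preprocessing step).
scaler = StandardScaler().fit(X_train)
Xs_train, Xs_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 1: unsupervised clustering with k = 4, fitted on training plots only
# so that no information from the test plots leaks into the clusters.
N_CLUSTERS = 4
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(Xs_train)


def add_cluster_feature(features):
    """Append each plot's cluster label as one-hot columns."""
    labels = kmeans.predict(features)
    return np.hstack([features, np.eye(N_CLUSTERS)[labels]])


Z_train = add_cluster_feature(Xs_train)
Z_test = add_cluster_feature(Xs_test)

# Step 2: KNN regression on the cluster-augmented features.
knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
knn.fit(Z_train, y_train)
y_pred = knn.predict(Z_test)

print("Augmented training matrix shape:", Z_train.shape)
print("Number of test predictions:", y_pred.shape[0])
```

One-hot encoding is used here because a raw integer label would impose an arbitrary ordering on the clusters in the distance computation. Plots in different clusters are pushed apart, so each plot's nearest neighbors tend to come from its own cluster.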

Model evaluation: The dataset was split into 80% training and 20% testing sets, and a 5-fold cross-validation technique was applied to the training data to ensure model robustness. Model performance was assessed using three standard metrics: the coefficient of determination (R²), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
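The evaluation protocol described above can be sketched as follows, again on synthetic data. The pipeline places standardization and the SelectKBest step (with `k=25` and F-regression, as named in the Materials and Methods) inside cross-validation so that they are refit in every fold. Whether the authors combined the steps this way is not stated, and the synthetic data, the number of neighbors, and the random seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (NOT the study's data).
X, y = make_regression(
    n_samples=500, n_features=40, n_informative=10, noise=10.0, random_state=0
)

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling and feature selection live inside the pipeline so that each
# cross-validation fold refits them on its own training portion.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=25),
    KNeighborsRegressor(n_neighbors=5),
)

# 5-fold cross-validation on the training data only.
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

# Final fit on all training data, then a single evaluation on the test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))

print("Cross-validated R2 per fold:", np.round(cv_r2, 3))
print(f"Test R2 = {r2:.3f}, RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```

RMSE is never smaller than MAE on the same predictions, which gives a quick sanity check on reported values.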

Results and Discussion

The statistical summary of the dataset revealed wide variability in final gross yield, ranging from 14.58 to 147.13 t/ha. K-means clustering partitioned the farm plots into four distinct agro-technical profiles, as confirmed by the mean feature values of each cluster. For example, Cluster 1 contained high-yielding plots with low soil salinity and high fertilizer use, while Cluster 2 included low-yielding plots with the lowest average fertilizer input. This spatial stratification was a crucial step, as it enabled the KNN model to learn from a more homogeneous set of neighboring plots, thereby enhancing predictive accuracy.

The proposed KNN model performed strongly on the unseen test data, achieving an R² of 0.8706, an RMSE of 7.80 t/ha, and an MAE of 6.59 t/ha. The R² value indicates that the model explains approximately 87% of the variance in final gross yield, and the RMSE corresponds to about 6% of the yield range observed in the dataset (132.55 t/ha). A scatter plot of predicted versus actual yields further confirmed the model's accuracy, with data points closely aligned along the 1:1 line. Feature importance analysis revealed that the engineered features, particularly water use efficiency, were significant predictors of yield, highlighting the importance of integrating on-farm management indicators with remote sensing features.
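KNN has no built-in feature importances, and the text does not say which method was used for the feature importance analysis. One model-agnostic option is permutation importance, sketched below on synthetic data as an assumption rather than a description of the authors' procedure. The data, the train/validation split, and the KNN settings are all hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data in which feature 0 drives the target, feature 1 matters
# a little, and feature 2 is pure noise (NOT the study's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Hold out the last 60 samples to measure importance on unseen data.
X_fit, y_fit = X[:240], y[:240]
X_val, y_val = X[240:], y[240:]

knn = KNeighborsRegressor(n_neighbors=5).fit(X_fit, y_fit)

# Shuffle one feature at a time and record how much the R2 score drops.
result = permutation_importance(
    knn, X_val, y_val, n_repeats=20, random_state=0
)

# Feature indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature {idx}: mean drop in R2 = {result.importances_mean[idx]:.3f}")
```

A large drop in score after shuffling a feature indicates that the model relies on it. Strongly correlated features can share or mask each other's importance, so rankings of related agronomic variables should be read with care.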

Conclusion

This study developed and validated a robust and interpretable framework for pre-harvest sugarcane yield prediction. By integrating unsupervised K-means clustering with KNN regression and enhancing it with engineered agro-technical features, the proposed model achieved an R² of 0.8706 on unseen test data. The findings suggest that a relatively simple, instance-based algorithm, when coupled with effective feature engineering and spatial stratification, can perform competitively with more complex black-box models. The interpretability and computational efficiency of the KNN approach make it a practical and transparent tool for decision-making in real-world precision agriculture applications. Future work should focus on integrating dynamic meteorological data and exploring multi-task learning frameworks to predict multiple crop parameters simultaneously.

Author Contributions

Feryal Jaderi: Data collection and analysis, and initial writing

Nasim Monjezi: Study design, text revision, and expert review

Data Availability Statement

Not applicable

Acknowledgements

The authors would like to thank Shahid Chamran University of Ahvaz for providing funding for this research.

Ethical considerations

The authors avoided data fabrication, falsification, plagiarism, and misconduct.

Conflict of interest

The authors declare no conflict of interest.
