
Water Resources Management (2023) 37:3699–3714

https://doi.org/10.1007/s11269-023-03522-z

Daily Scale River Flow Forecasting Using Hybrid Gradient Boosting Model with Genetic Algorithm Optimization

Huseyin Cagan Kilinc1,* · Iman Ahmadianfar2 · Vahdettin Demir3 · Salim Heddam4 ·
Ahmed M. Al‑Areeq5 · Sani I. Abba5 · Mou Leong Tan6,7 · Bijay Halder8 ·
Haydar Abdulameer Marhoon9,10 · Zaher Mundher Yaseen11,5,*

* Corresponding authors: Huseyin Cagan Kilinc, Zaher Mundher Yaseen

Received: 19 January 2023 / Accepted: 11 April 2023 / Published online: 3 May 2023
© The Author(s), under exclusive licence to Springer Nature B.V. 2023

Abstract
Accurate and sustainable management of water resources is among the most important concerns
of basin and river engineering. In this study, a hybrid machine learning (ML) model was
developed using CatBoost and a Genetic Algorithm (GA) to improve river flow prediction. The
study was applied to the Sakarya Basin, which is located in semi-arid climatic conditions in
Turkey. The forecast performance of the models was evaluated in a one-day-ahead forecast
scenario using data from the Adatepe, Aktaş and Rüstümköy flow measurement stations (FMS).
Daily flow data from these stations between 2002 and 2012 were used, and the performance of
the proposed model was tested by comparison with CatBoost, Long Short-Term Memory (LSTM) and
the classical estimation method, Linear Regression (LR). The study also aimed to improve
predictive performance by combining genetic algorithms with the gradient boosting model
(GA-CatBoost). The developed hybrid model outperformed the benchmark models. The results
showed that the developed model can be successfully applied in river flow forecasting.

Keywords Gradient boosting · River flow · Deep learning · Hybrid model

1 Introduction

1.1 Research Background

The water cycle is the most fundamental element in understanding the distribution and movement
of water in a watershed, and it maintains the balance of nature in the basin. The essential
components of the hydrological cycle include precipitation, evaporation, transpiration, and
river flows (Yaseen et al. 2018; Mahmood and Jia 2022). Effective planning and sustainability of
water have evolved into one of the most critical issues in hydrology. Accurate and sensitive
forecasting of river flow plays an active role in managing natural disasters (Xie et al. 2019).
Accordingly, river regimes should be observed closely for accurate basin management (Imrie
et al. 2000). One of the key elements of managing sustainable water resources is accurate
stream flow forecasting (Kilinc 2022). As observation techniques improve and historical
hydrological data grow, purely data-based approaches are expected to be included in river flow
forecasting. From this point of view, data-driven models give better results when digital input
and output data are provided (Niu et al. 2021). It is well known that several ML approaches
have produced powerful models for river flow modeling (Khan and Maity 2020). Artificial neural
networks come to the fore in this field because they take into account the non-linear
relationships between parameters and produce more realistic results (Yukseltan et al. 2021).

1.2 Literature Review

With the improvement of ML models, hydrologists' interest in them has correspondingly
increased (Naganna et al. 2020; Munawar et al. 2021; Tur and Yontem 2021). Data mining models
help discover connections and patterns in huge datasets that are very difficult to find
manually (Zounemat-Kermani et al. 2021). Due to the complexity of non-linear hydrological time
series, conventional computational models have difficulty producing reliable predictions. As
computational programming becomes more complex and pervasive, ML is progressively gaining
prominence in hydrological time series (river currents, wind, evaporation, etc.). This
situation has brought to the fore Deep Learning (DL) models, a sub-type of ML that involves
learning from large data using neural network algorithms (Qi et al. 2023; Zheng et al. 2023).
In the literature, many deep learning models based on LSTM, GRU and metaheuristic algorithms
have been successfully applied to time series prediction (Kilinc and Yurtsever 2022). The
recurrent neural network (RNN) is an essential DL model that retains each past interaction in
its hidden state. The RNN is successful in modeling time series, although accessing
information from the distant past is difficult for it. The RNN is able to capture more complex
dependencies from past interactions in a sequential dataset (Sarioglu and Yaslan 2019). DL
models have several hyperparameters (batch size, number of hidden units, number of layers)
that affect learning performance regardless of the application area (Goliatt and Yaseen 2023).
Hybrid ML algorithms have been widely implemented, with successful applications to
hydrological processes (Kim et al. 2019). Metaheuristic algorithms seek to reach the optimum
solution in the solution space faster by using efficient search operations in a high-level
operating environment (Ardabili et al. 2020). In addition, gradient boosting decision tree
algorithms have recently been encountered frequently in the literature. These algorithms are
hybridized in different areas and their performance is examined.
A review of the literature shows that hybrid models for time series forecasting are increasing
rapidly. Karbasi et al. (2022) used a generalized regression neural network (GRNN) and XGBoost
to compare the performance of newly developed models for estimating reference
evapotranspiration. The results demonstrated that an efficient DL model had higher ability and
accuracy in forecasting. Li et al. (2022) built a predictive model for estimating water
quality. The data used in the study were evaluated on five models, namely classification tree,
random forest, CatBoost, XGBoost, and LightGBM. The work reported that LightGBM had promising
potential for predicting water quality. Nguyen et al. (2021) developed hybrid models by
combining two deep learning models with XGBoost to forecast hourly water levels. The study
verified that hybrid XGBoost models might be beneficial compared with many existing models for
hourly water level forecasting. Ibrahim et al. (2022) forecast reservoir inflows using
different ML algorithms including XGBoost. The results revealed that XGBoost outperformed all
other models. Fan et al. (2018) proposed two ML algorithms including XGBoost for accurate
forecasting of daily solar radiation data. The results revealed that the XGBoost model was
highly recommended for estimating daily solar radiation and precipitation data. He et al.
(2020) developed a new hybrid model based on gradient boosting regression. This hybrid model
was applied to data from several hydrological stations. The results showed that the hybrid
model is promising. Carvalho and de Assis de Souza Filho (2021) generated a hybrid model based
on boosting regression. The results indicated that boosting algorithms can be used for time
series forecasting. Xia et al. (2022) used LSTM and XGBoost to predict water quality
indicators. The results showed that LSTM performed better than XGBoost. Zhang et al. (2020)
used different meteorological variables to estimate evapotranspiration with the CatBoost
model. The results illustrated that CatBoost improved predictive performance. Considering the
performance of the models in the literature, algorithms such as CatBoost, XGBoost, and
LightGBM enhance existing models. Existing algorithms can be hybridized with various DL models
and their impact on flow models can be examined. The contribution of other algorithms to be
optimized to the LSTM DL model used in this study is also significant. Consequently, when
creating a hybrid model, it is necessary to determine the optimum model with the most
appropriate parameters.

1.3 Research Motivation and Objectives

In this study, river flow was forecast at three stations (i.e., Adatepe, Aktaş and Rüstümköy)
in the Sakarya Basin, Turkey. Considering the water stress in Turkey, Sakarya Province may
inevitably experience difficulties in agriculture, industry and drinking water management
because of water scarcity in the near future. The continuity of the computer-aided models
applied in the study was evaluated with the help of daily river flow data and the forecasting
performance of the models.
The success of the developed hybrid ML model in forecasting accuracy shows its applicability
and possible contribution to the analysis processes. Applying CatBoost, which has recently
become a very popular algorithm (Huang et al. 2019; Ivanov et al. 2022), within a hybrid model
for river flow forecasting is one of the elements that reveal the originality of the study. It
is worth mentioning that the GA used in this research produces a set of candidate solutions
instead of a single solution to the problem. The fitness of a solution in the GA increases its
chance of being selected. In GA optimization, not all possible solutions are created at once;
the optimum or a near-optimal solution is reached through a subset of the possible solutions.
The limited number of studies conducted with new and up-to-date models for river flow
simulation throughout the Sakarya Basin reveals the contribution of this study to the
literature. The forecasting process was carried out by hybridizing the CatBoost model with the
GA. The performance of the developed hybrid model was validated against the Linear Regression,
CatBoost and LSTM benchmark models.

2 Case Study and Datasets Description

Turkey is among the countries in the water stress category. In addition, considering society's
demand for water resources, water scarcity is likely to cause many problems. In order to
prevent possible water crises and related problems, it is essential to evaluate and manage
hydrological data on a basin basis (Ceribasi et al. 2022). The region covered in the study is
the Sakarya Basin, one of Turkey's 25 river basins. The basin is located in the northwest of
the country and is bordered by wide basins such as the Susurluk, Konya, Akarçay, Western Black
Sea, and Kızılırmak Basins. The Sakarya River, whose location is illustrated in Fig. 1, rises
3 km southeast of the Çifteler district center of Eskişehir and is fed by many small streams.
The river, one of the noteworthy rivers of Turkey, is 510 km long and 60–70 m wide, reaching
150 m in places. The total drainage area of the river is 55312 km². The Sakarya Basin has a
transitional climate because it lies where three different geographical regions intersect. The
average temperature in the region is 15.1 °C; the lowest monthly average is 5 °C in January and
the highest is 25.6 °C in July. The average precipitation is 349.8 mm (Solak et al. 2020).
The forecast performance of the models was analyzed with daily river flow data obtained from
three critical stations located on various branches of the Sakarya Basin. The locations of the
flow measurement stations (FMSs) are shown in Fig. 1. Fig. 1b shows FMSs with various
agricultural, industrial, and hydrological conditions in the Sakarya Basin. Among these
stations, Adatepe (E12A057), Aktaş (E12A024), and Rüstümköy (E12A022) were selected for the
study. With a precipitation area of 56224 km², the E12A057 station, located at the exit point
of the Sakarya Basin, represents the flow rate of almost the entire basin. The E12A022 station,
located in the Upper Sakarya region, one of the sub-basins of the Sakarya river basin, is
crucial because it lies on the main line of the river feeding the tributaries. E12A024, another
station located on the Sakarya River, was selected because it is in an area with intense
agricultural activities. The geographical coordinates of the stations are given in Table 1.

3 Machine Learning Models

3.1 LSTM Networks

LSTM is the most common and well-known model among DL models. The ability to learn from data,
the ability to generalize, and the freedom to work with an unlimited number of variables are
the most prominent factors in the popularity of these models. The model, shown in Fig. 2a and
designed by Hochreiter and Schmidhuber (1997), can be used to make predictions in non-linear
models, targeting problem-causing structures and long-term dependencies that depend on the
degree of slope (gradient) at the junction points.


Fig. 1  Sakarya River Basin map and Location of FMSs on Sakarya River Basin map

The model retains the cell structure found in ANNs. The main distinction between LSTM and the
ANN model is the memory structure it contains. In LSTM networks, which can retain information
in memory for a long time, the information flow is controlled through structures called cell
states. There are three different gates in the structure of LSTM networks (Ghimire et al. 2021).


Table 1  Characteristic features of FMSs in the study region

                    Coordinates
Station  River      East (° ′ ″)  North (° ′ ″)  Catchment area (km²)  Elevation (m)  Observation (year)
1257     Adatepe    30°36′07"     41°01′34"      56224.4               6              2002–2012
1224     Aktaş      31°20′13"     39°19′15"      4298                  837            2002–2012
1222     Rüstümköy  29°46′06"     40°15′23"      2021.6                198            2002–2012

Fig. 2  a Representation of a simple recurrent network unit on the left and LSTM block on the
right, b The flowchart of hybrid model (GA-CatBoost)

There are different gates for information control in the LSTM model, and each of these three
gates has a different task in the LSTM structure. The model consists of input, forget and
output gates. The network can interact with cells only through the gates, which determine the
values that ought to be stored, used, and discarded. Because the layer is trained by
backpropagation, the weights of the sigmoid gates are updated so that they learn to pass useful
information while giving less weight to less critical features. The calculations made in the
sections of the LSTM block are given in the equations below (Danandeh Mehr et al. 2022).
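$$i_t = \sigma(W_I h_{t-1} + W_I x_t + b_I) \quad \text{(input gate)} \tag{1}$$

$$f_t = \sigma(W_F h_{t-1} + W_F x_t + b_F) \quad \text{(forget gate)} \tag{2}$$

$$\tilde{C}_t = \tanh(W_C h_{t-1} + W_C x_t + b_C) \quad \text{(intermediate cell state)} \tag{3}$$

$$o_t = \sigma(W_O h_{t-1} + W_O x_t + b_O) \quad \text{(output gate)} \tag{4}$$

$$c_t = (i_t \ast \tilde{C}_t) + (f_t \ast c_{t-1}) \quad \text{(cell state)} \tag{5}$$

$$h_t = o_t \ast \tanh(c_t) \quad \text{(output vector)} \tag{6}$$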

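Here $x_t$ is the input vector, $i_t$ the input gate (1), $f_t$ the forget gate (2),
$\tilde{C}_t$ the intermediate cell state (3), $o_t$ the output gate (4), $c_t$ the cell state
(5), $h_t$ the output vector (6), σ the sigmoid activation function, and tanh the hyperbolic
tangent activation function.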
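The gate computations can be traced with a small forward step written in plain Python. This is only a sketch under stated assumptions: it uses the standard LSTM form in which the output vector is the element-wise product of the output gate and tanh of the cell state, it gives each gate separate weight matrices for the previous output and the current input, and all names (`lstm_step`, `affine`, `params`) are hypothetical rather than taken from the study's code.

```python
import math

def sigmoid(z):
    """Logistic sigmoid used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_h, W_x, b, h_prev, x):
    """Compute W_h @ h_prev + W_x @ x + b using plain Python lists."""
    return [
        bias
        + sum(w * h for w, h in zip(row_h, h_prev))
        + sum(w * v for w, v in zip(row_x, x))
        for row_h, row_x, bias in zip(W_h, W_x, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One forward step of an LSTM cell.

    params maps each of "i", "f", "c", "o" to a tuple (W_h, W_x, b).
    """
    i = [sigmoid(z) for z in affine(*params["i"], h_prev, x)]           # input gate
    f = [sigmoid(z) for z in affine(*params["f"], h_prev, x)]           # forget gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], h_prev, x)]   # intermediate cell state
    o = [sigmoid(z) for z in affine(*params["o"], h_prev, x)]           # output gate
    # New cell state: admitted candidate plus the retained part of the old state.
    c = [ik * gk + fk * ck for ik, gk, fk, ck in zip(i, c_tilde, f, c_prev)]
    # Output vector: output gate times the squashed cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Toy example: 1 input feature, 2 hidden units, all-zero weights and biases.
hidden, inputs = 2, 1
zero_gate = ([[0.0] * hidden for _ in range(hidden)],
             [[0.0] * inputs for _ in range(hidden)],
             [0.0] * hidden)
params = {name: zero_gate for name in ("i", "f", "c", "o")}
h, c = lstm_step(x=[0.7], h_prev=[0.0, 0.0], c_prev=[1.0, -1.0], params=params)
```

With all-zero weights every gate evaluates to 0.5 and the intermediate cell state to 0, so the step simply halves the previous cell state, which makes the role of the forget gate easy to see.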

3.2 CatBoost

CatBoost, an ML algorithm based on gradient boosting decision trees (GBDT), was developed by
Yandex. Unlike other algorithms based on decision trees, it can handle heterogeneous features,
noisy data, and complex dependencies. The advantages of using CatBoost include the need to
configure only a few hyperparameters while avoiding overfitting and obtaining more generalized
models. With these features, CatBoost has made improvements over other advanced boosting
methods (Basilio and Goliatt 2022). The CatBoost algorithm implements a new method for
processing categorical features, generating new numerical features based on the categorical
features and their combinations. CatBoost encodes categorical features with different values
depending on the training mode. For CatBoost to treat features in the dataset as categorical
rather than numerical, the cat_features argument must be passed. Unlike traditional decision
trees, CatBoost handles categorical features during training rather than during preprocessing.
In addition, although various methods are used to handle categorical features in gradient
boosting, these methods cause shifts in the estimates. Therefore, the CatBoost algorithm is
recommended to improve the predictions and solve the overfitting problem (Singh and Goyal
2020). According to Prokhorenkova et al. (2018), if a permutation
$\theta = [\sigma_1, \sigma_2, \ldots, \sigma_n]_n^T$ is sampled from the given dataset, the
categorical value $x_{\sigma_p,k}$ is substituted with the value given by Eq. (7), where P is
the prior value and β is the weight of the prior (Fernández-Carrillo et al. 2022).
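$$x_{\sigma_p,k} = \frac{\sum_{j=1}^{p-1} \left[ x_{\sigma_j,k} = x_{\sigma_p,k} \right] \cdot y_{\sigma_j} + \beta \cdot P}{\sum_{j=1}^{p-1} \left[ x_{\sigma_j,k} = x_{\sigma_p,k} \right] + \beta} \tag{7}$$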
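Eq. (7) can be read as a running average: for the sample at position p of the permutation, the target values of the earlier samples that share its category are averaged together with the prior P, weighted by β. The sketch below illustrates that reading in plain Python. The function name, the toy categories and the choice of P and β are assumptions made for illustration; this is not CatBoost's internal implementation.

```python
def ordered_target_statistics(categories, targets, permutation, prior, beta):
    """Encode one categorical feature with an ordered target statistic.

    Each sample is encoded using only the samples that come before it in
    `permutation`, so its own target never leaks into its encoding.
    `beta` must be positive, otherwise the first sample of a category
    would divide by zero.
    """
    encoded = [0.0] * len(categories)
    target_sums = {}   # running sum of targets per category
    counts = {}        # running number of samples per category
    for idx in permutation:
        cat = categories[idx]
        s = target_sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded[idx] = (s + beta * prior) / (n + beta)
        # Only now does this sample become part of the "past".
        target_sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded

# Hypothetical toy data: a season label and a flow value for four days.
categories = ["wet", "dry", "wet", "wet"]
targets = [10.0, 2.0, 14.0, 12.0]
encoded = ordered_target_statistics(categories, targets,
                                    permutation=[0, 1, 2, 3],
                                    prior=5.0, beta=1.0)
```

In this toy run the first "wet" and the first "dry" sample both receive the prior (5.0), the second "wet" sample receives (10 + 5) / (1 + 1) = 7.5, and the third receives (24 + 5) / (2 + 1).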


The fact that the CatBoost algorithm uses binary decision trees as the base predictor can also
be counted among its advantages (Dorogush et al. 2018). In this study, the Keras, TensorFlow,
CatBoost, NumPy, Pandas, and Matplotlib libraries were employed while building the CatBoost
model. The parameter values of the CatBoost model were as follows: the number of iterations was
set to 1000, the maximum tree depth to 3, and the ratio of datasets to subsets to 1. The loss
function was set to RMSE, od_type to Iter, and od_wait to 20. Thereby, the model is considered
overfitted and training stops once the specified number of iterations has passed since the
iteration with the optimum metric value.
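The stopping rule described by the od_type and od_wait settings can be illustrated with a few lines of plain Python. The sketch assumes that a validation error is available after every boosting iteration; the function name and the toy error sequence are hypothetical, and the code mimics the rule rather than reproducing CatBoost's internal detector.

```python
def iter_overfitting_stop(validation_errors, od_wait):
    """Return (best_iteration, stop_iteration) for an Iter-style detector.

    Training is treated as stopped at the first iteration that lies
    `od_wait` iterations past the iteration with the lowest validation
    error seen so far. If that never happens, the last iteration is
    returned as the stop iteration.
    """
    best_iter = 0
    best_err = float("inf")
    for it, err in enumerate(validation_errors):
        if err < best_err:
            best_err = err
            best_iter = it
        elif it - best_iter >= od_wait:
            return best_iter, it
    return best_iter, len(validation_errors) - 1

# Hypothetical validation RMSE per iteration.
errors = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4, 3.5, 3.6, 3.8, 3.9]
best_iteration, stop_iteration = iter_overfitting_stop(errors, od_wait=3)
```

For this sequence the lowest error occurs at iteration 5, and training stops three iterations later, at iteration 8. The study's own setting of 20 simply gives the model a longer window in which to improve.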

3.3 GA‑CatBoost Hybrid Model

In the CatBoost model, the parameter settings influence the performance of the network. In this
study, the genetic algorithm was utilized to optimize the crucial hyperparameters of the model,
such as depth, learning_rate, number of iterations, random_seed, and od_wait. First, the
classical CatBoost model was repeatedly trained with random parameters, and the parameters that
provided the best test results were used in the benchmark model. The GA returns its result list
as binary values; to hybridize the CatBoost model, the random_seed and od_wait parameters were
added. The CatBoost model was then retrained according to this list, and the results with the
best RMSE values were accepted as the result of the proposed model. To present a fair
comparison, parameters outside this set, such as the number of iterations and learning rate,
were fixed to the same value for all models. Fig. 2b shows the steps of adding the GA to the
CatBoost network as a flow chart. The performance of the hybrid GA-CatBoost model was compared
against the LSTM, CatBoost and LR models and its success was observed. According to the flow
chart, the dataset was first made suitable for training and the number of iterations was
determined. The GA was then included in the loop with the determined initial values. The
results from the GA were used as CatBoost parameters. The results were evaluated with the
CatBoost algorithm and RMSE values. By comparing all the results across the iterations, the
results of the model with the best RMSE value were recorded.
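The loop described in the flow chart (generate candidates, score each by RMSE, keep the best) can be sketched as a small genetic algorithm in plain Python. Several things here are assumptions and not taken from the study: the search bounds, the integer (rather than binary) encoding, the tournament selection, uniform crossover and resampling mutation operators, and above all `toy_rmse`, a stand-in objective that replaces "train CatBoost with these parameters and return the validation RMSE".

```python
import random

# Hypothetical search space; the paper does not report the bounds it used.
SEARCH_SPACE = {
    "depth": (1, 10),
    "od_wait": (5, 50),
    "random_seed": (0, 100),
}

def toy_rmse(params):
    """Stand-in for training a model and returning its validation RMSE.

    Purely illustrative: a bowl with its minimum at depth=3, od_wait=20.
    """
    return ((params["depth"] - 3) ** 2 + ((params["od_wait"] - 20) / 5.0) ** 2) ** 0.5

def random_individual(rng):
    return {name: rng.randint(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}

def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one of the two parents."""
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in SEARCH_SPACE}

def mutate(ind, rng, rate=0.2):
    """Resample each gene inside its bounds with probability `rate`."""
    out = dict(ind)
    for name, (lo, hi) in SEARCH_SPACE.items():
        if rng.random() < rate:
            out[name] = rng.randint(lo, hi)
    return out

def tournament(pop, fitness, rng, k=3):
    """Pick the lowest-RMSE individual among k randomly chosen contestants."""
    contestants = rng.sample(range(len(pop)), k)
    return pop[min(contestants, key=lambda i: fitness[i])]

def genetic_search(objective, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    best, best_rmse = None, float("inf")
    for _ in range(generations):
        fitness = [objective(ind) for ind in pop]
        for ind, rmse in zip(pop, fitness):
            if rmse < best_rmse:            # record the best candidate seen so far
                best, best_rmse = dict(ind), rmse
        children = [dict(best)]             # elitism: carry the best forward
        while len(children) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            children.append(mutate(crossover(p1, p2, rng), rng))
        pop = children
    return best, best_rmse

best_params, best_rmse = genetic_search(toy_rmse)
```

To use the same loop with a real model, `toy_rmse` would be replaced by a function that fits the model with the candidate parameters and returns the RMSE on held-out data; the rest of the loop is unchanged.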

3.4 Modeling Development Phase

Python was used to implement and compare the models in the study. Version 3.10, one of the
latest versions, and the optimizers integrated with this version were utilized. In the study,
Keras and Deep libraries were used for the stages covering the analysis processes (prediction
process and model training). Different loss functions and optimizers were used in the training
processes of the LSTM, GA, and CatBoost models employed to compare the forecasting
performance. For LSTM, 1000 epochs, a batch size of 16, the Adam optimizer, and the RMSE loss
function were utilized. Likewise, the same parameters were used for CatBoost.

4 Application Results and Discussion

4.1 Statistical Metrics Evaluations

In this study, a model was created based on the observed data, and its performance was
compared with that of the benchmark models. Statistical criteria were used as the analysis
method, and the results of these metrics are given in Table 2.
The LR model is a model for univariate time series analysis and forecasting. In the LR model,
the values in the time series are modeled on the past values and the forecast is made (Box and
Tiao 1975). LR models are applied to non-stationary series that are converted to stationary
series by differencing. Many time series contain linear as well as non-linear relationships.
The model has been employed in many time series forecasting studies in this respect (Kumar and
Singh 2022; Wang et al. 2022). Model results for the river flow stations are shown as scatter
plots in Fig. 3. In addition, the success of the models was assessed with the statistical
evaluation criteria detailed in Table 2. These criteria are the commonly used RMSE, MAPE, MSE,
standard deviation (SD), and R².
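For readers who want to reproduce this kind of evaluation, the sketch below computes four of the listed criteria from paired observed and predicted flows using their usual textbook definitions. These definitions are assumptions: the paper does not spell out its formulas, R² is taken here as the coefficient of determination, MAPE is expressed in percent, and SD is omitted because its exact definition in the study is not stated. The function name and the toy data are hypothetical.

```python
import math

def forecast_metrics(observed, predicted):
    """Return MSE, RMSE, MAPE (percent) and R2 for paired series.

    MAPE requires every observed value to be non-zero.
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mape = 100.0 * sum(abs(e / o) for e, o in zip(errors, observed)) / n
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Hypothetical daily flows in m3/s.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = forecast_metrics(observed, predicted)
```

Because RMSE is the square root of MSE under these definitions, the two columns of a results table can be cross-checked against each other.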

4.2 Effects of Models on Performance Analysis

The CatBoost, LSTM, and LR models were compared against each other in the study area. A
regression line was fitted to each scatter plot, which helped in analyzing the performance of
the models. To compare model performance, the distribution of the test data on these graphs
was investigated. In the graphs showing the distribution of test outcomes in Fig. 3a, the R²
values for the Adatepe station are 0.9134, 0.9471, 0.9511, and 0.6382 for LSTM, CatBoost,
GA-CatBoost, and LR, respectively. These values show that the daily river flow estimated by the
proposed model performed very well compared to the other three models. GA-CatBoost also
performed quite well on the other four evaluation criteria compared to the other models. When
the LSTM and LR models were compared at this station, the LSTM model fell behind the LR model
in the SD and MAPE criteria. However, it outperformed the LR model according to the RMSE and
MSE values. There is a large difference between the R² values of these two models, which
indicates that the correlation between the LR estimates and the observations is weak.
At Aktaş station, whose test results are presented in Fig. 3b, the R² values are 0.9579
according to the LSTM model, 0.9703 according to the CatBoost model, 0.9740 according to
Table 2  Statistical metrics results (indicated as m³/s)

Station    Model              RMSE     MAPE     MSE        S.D     R²
Adatepe    LSTM               32.7325  15.0959  1071.4220  0.2351  0.9134
           CatBoost           25.5767  13.3735  654.1695   0.2079  0.9471
           GA-CatBoost        24.6104  12.1993  605.6733   0.1806  0.9511
           Linear Regression  79.8814  14.0149  6381.0480  0.2167  0.6382
Aktaş      LSTM               0.4206   8.9042   0.1769     0.1586  0.9579
           CatBoost           0.3598   7.8600   0.1295     0.1358  0.9703
           GA-CatBoost        0.3370   7.3325   0.1136     0.1304  0.9740
           Linear Regression  0.9332   5.5964   0.8709     0.1124  0.8096
Rüstümköy  LSTM               8.8826   91.4490  78.9019    0.2120  0.7659
           CatBoost           6.8580   22.4252  47.0330    0.5358  0.8385
           GA-CatBoost        6.4210   23.0644  41.2301    0.4801  0.8577
           Linear Regression  12.2541  24.7982  150.1647   0.2330  0.8191

Fig. 3  Scatter plots presentation for a Adatepe, b Aktaş, and c Rüstümköy, over the testing phase

GA-CatBoost, and 0.8096 according to the LR model. On this criterion, GA-CatBoost achieved
slightly higher values than the other models. At the same station, the LSTM model is quite
successful compared to the LR model. In terms of RMSE and MSE, LSTM is more successful than LR,
while the LR model performs better on the MAPE and SD criteria.
The Adatepe flow measurement station, located west of Karasu district at the point where the
Sakarya River pours into the Black Sea, is the last station on the river before the sea and
lies in the region where industrial and domestic wastewater are observed most intensely. As
this situation leads to uncontrolled discharge, it also raises the flow under the influence of
heavy rainwater. When domestic wastewater and non-point source pollution are added, the
seriousness of the pollution in the river's course within the provincial borders becomes more
evident. In addition, the station's role as an accumulation and downstream point on the
Sakarya River creates a strong correlation between estimated and observed daily flow data.
These features make Adatepe one of the most critical flow measurement stations in the Sakarya
Basin. The Sakarya Basin is more developed than other parts of Turkey in terms of many economic
and cultural activities such as industry, trade, transportation, and tourism, as it is located
between the two largest cities of the country, Ankara and Istanbul.
At the Rüstümköy flow measurement station, located in a region of intense agricultural
activity close to Bursa province, the GA-CatBoost, CatBoost, and LR models gave closely
matching results, whereas the LSTM model did not. The R² values at this station are 0.8385 for
CatBoost, 0.8577 for GA-CatBoost, 0.7659 for LSTM, and 0.8191 for LR. At this station, the LR
model outperformed LSTM only in the MAPE and R² criteria.
When the performances of all four models are evaluated against each other, GA-CatBoost is more
promising at all three stations than the CatBoost, LSTM, and LR models. The CatBoost model is
most applicable to the Aktaş station, where its R² value (0.9703) is the highest among the
three stations. The high values at the Rüstümköy and Adatepe stations also support the success
of the GA-CatBoost model on river flow data. In the analysis of the five evaluation criteria,
performance was observed in the order GA-CatBoost > CatBoost > LSTM > LR.
Given the results obtained with CatBoost, an innovative algorithm built on ordered boosting and
the processing of categorical variables, the success of the model in predicting river flows is
noteworthy. Its success against LSTM, one of the popular DL models of recent times, is
particularly notable. CatBoost was developed to combat prediction bias in existing gradient
boosting algorithms. The losses experienced in reaching the target have been found to be
generated by a shift in the predictions, and the CatBoost algorithm has been observed to solve
this problem effectively. CatBoost utilizes the ordered boosting scheme to avoid prediction
shift. The model uses a modified version of the standard gradient boosting algorithm to avoid
losses while reaching the target. The difference between the predicted value and the observed
value is defined as the residual.
The CatBoost algorithm primarily attempts to avoid shifts in the residuals. CatBoost calculates
a target statistic for each category of a nominal variable. The fact that it is built on a
target statistic, rather than reaching the result with random variables and hyperparameters as
in the LSTM model, brings the CatBoost model to the forefront (Patrous 2018; Zeng et al. 2023).
The ordered target statistics approach is based on the assumption that the past influences the
future. In CatBoost, randomizing the order of the rows and computing the target statistic from
the rows preceding the current row has been successfully applied to river flow forecasting, as
can be seen from the results (Table 2 and Fig. 3) obtained from the three stations.
Collecting and preparing data before modeling is a step that forms the basis of modeling. If
data with higher accuracy and detail are available in the region where the flow data will be
examined, the model can be built at hourly or minute resolution instead of with daily flow
data. This will be useful for examining in detail the dry periods or the seasonal periods when
the region suffers from water shortage. The hybrid model created in the study is also important
for the management of sustainable water resources. In future modeling studies, adding
evaporation and temperature data to the hybrid model as input parameters is expected to improve
the model, increase the forecasting accuracy and support effective management of water
resources throughout the basin.

Fig. 4  Taylor diagram of Observed, LSTM, CatBoost, GA-CatBoost and Linear regression model
The Taylor diagram was used to graphically summarize how closely the models matched the
observations in river flow forecasting (Taylor 2001). The diagram expresses the agreement
between the models and the observations in terms of their correlation and standard deviations.
The results are shown in Fig. 4 separately for the three stations. The outcomes of the linear
regression, the proposed hybrid model, and the comparison models for the three stations are
displayed on the Taylor diagram. The Taylor diagrams for the three stations confirm the success
of the hybrid model. The order from best to worst for all three stations was as follows:
proposed model, CatBoost model, LSTM model, and linear regression. In addition to the high
success of DL models against linear regression, the success of CatBoost, which surpassed the
LSTM, whose success has been proven in many studies, and of the hybrid model, which made this
success even more stable, was quite impressive.
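The quantities that a Taylor diagram places on one plot can be computed directly. The sketch below assumes population (divide-by-n) standard deviations and uses hypothetical names and toy data; it returns the two standard deviations, the correlation coefficient R and the centered root-mean-square difference E', which Taylor (2001) relates through E'² = σ_f² + σ_r² − 2σ_fσ_rR.

```python
import math

def taylor_statistics(observed, modeled):
    """Return (sd_observed, sd_modeled, correlation, centered_rmsd)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_m = sum(modeled) / n
    dev_o = [o - mean_o for o in observed]   # anomalies of the reference series
    dev_m = [m - mean_m for m in modeled]    # anomalies of the model series
    sd_o = math.sqrt(sum(d * d for d in dev_o) / n)
    sd_m = math.sqrt(sum(d * d for d in dev_m) / n)
    corr = sum(a * b for a, b in zip(dev_o, dev_m)) / (n * sd_o * sd_m)
    # Centered RMS difference: RMS difference after removing both means.
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_m)) / n)
    return sd_o, sd_m, corr, crmsd

# Hypothetical daily flows in m3/s.
observed = [10.0, 20.0, 30.0, 40.0]
modeled = [12.0, 18.0, 33.0, 39.0]
sd_o, sd_m, corr, crmsd = taylor_statistics(observed, modeled)
```

A model plots close to the observed point when its standard deviation is close to that of the observations and its correlation is high, which by the relation above also means a small centered RMS difference.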

5 Conclusion

A hybridized ML model based on the combination of the CatBoost model with the GA optimization
algorithm was developed for modeling daily scale river flow in the Sakarya Basin, Turkey. The
performance of the established GA-CatBoost model was evaluated at the three modeled stations
(i.e., Adatepe, Aktaş, and Rüstümköy). Based on the statistical metrics, the GA-CatBoost model
attained its best modeling results at Aktaş station. Based on the convergence analysis, the
GA-CatBoost algorithm converged faster than the CatBoost algorithm during the learning process.
Rebuilding the developed hybrid model with different optimizers and analyzing it with different
input parameters is expected to yield promising results in future studies. Although the success
of the CatBoost algorithm is undeniable, this model also has some restrictions. It outperforms
other algorithms mainly when the input data are categorical; the ability to classify data is one
of the greatest benefits of this model. CatBoost can also perform considerably worse if its
parameters are not properly set. These limitations may warrant investigating the performance of
various boosting algorithms for river flow forecasting in future studies.
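To illustrate how a GA can be coupled with a gradient boosting model, the following minimal sketch runs a simple real-coded genetic algorithm over two parameters. Everything here is an assumption made for illustration: the function names, the GA operators (tournament selection, uniform crossover, Gaussian mutation, elitism), the parameter ranges, and above all the stand-in objective `validation_error`, which is a synthetic function and not a trained CatBoost model. It is not the configuration used in the study.

```python
import random


def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Minimize `fitness` over box-bounded real vectors with a simple GA."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def mutate(individual):
        # Gaussian perturbation per gene, clipped back into the bounds.
        mutated = []
        for x, (lo, hi) in zip(individual, bounds):
            if rng.random() < mutation_rate:
                x += rng.gauss(0.0, 0.1 * (hi - lo))
            mutated.append(min(max(x, lo), hi))
        return mutated

    def crossover(a, b):
        # Uniform crossover: each gene comes from either parent.
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    def tournament(scored):
        # Binary tournament: the better of two random individuals wins.
        first, second = rng.sample(scored, 2)
        return first[1] if first[0] <= second[0] else second[1]

    population = [random_individual() for _ in range(pop_size)]
    history = []  # best fitness found in each generation
    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0])
        history.append(scored[0][0])
        children = [scored[0][1]]  # elitism: keep the best unchanged
        while len(children) < pop_size:
            child = crossover(tournament(scored), tournament(scored))
            children.append(mutate(child))
        population = children

    best = min(population, key=fitness)
    return best, fitness(best), history


def validation_error(params):
    """Hypothetical stand-in for a validation RMSE (synthetic, not CatBoost)."""
    learning_rate, depth = params
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6.0) ** 2


# Assumed search ranges for a learning rate and a tree depth.
bounds = [(0.01, 0.5), (2.0, 10.0)]
best, err, history = genetic_search(validation_error, bounds)
print(best, err)
```

In a real application the stand-in objective would be replaced by a function that trains the boosting model with the candidate parameters and returns its error on a validation set; integer parameters such as tree depth would need to be rounded before use.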
Acknowledgements The authors acknowledge the data source, the “General Directorate of Electrical Works
Survey Administration”. In addition, this research was previously published as a preprint; readers can
refer to it (Kilinc et al. 2023).

Author Contribution Huseyin Cagan Kilinc: Conceptualization; Data curation; Formal analysis; Methodol-
ogy; Investigation; Visualization; Writing—original draft,—review & editing draft preparation; Resources;
Software. Iman Ahmadianfar: Data curation; Formal analysis; Methodology; Investigation; Visualization;
Writing—original draft,—review & editing draft preparation. Vahdettin Demir: Data curation; Formal anal-
ysis; Methodology; Investigation; Visualization; Writing—original draft,—review & editing draft prepara-
tion. Salim Heddam: Data curation; Formal analysis; Methodology; Investigation; Visualization; Writing—
original draft,—review & editing draft preparation. Ahmed M. Al-Areeq: Data curation; Formal analysis;
Methodology; Investigation; Visualization; Writing—original draft,—review & editing draft preparation.
Sani I. Abba: Data curation; Formal analysis; Methodology; Investigation; Visualization; Writing—original
draft,—review & editing draft preparation. Mou Leong Tan: Data curation; Formal analysis; Methodology;
Investigation; Visualization; Writing—original draft,—review & editing draft preparation. Bijay Halder:
Data curation; Formal analysis; Methodology; Investigation; Visualization; Writing—original draft,—
review & editing draft preparation. Haydar Abdulameer Marhoon: Data curation; Formal analysis; Meth-
odology; Investigation; Visualization; Writing—original draft,—review & editing draft preparation. Zaher
Mundher Yaseen: Supervision, Conceptualization; Formal analysis; Project administration; Investigation;
Writing—review & editing.

Availability of Data and Materials The data will be available upon reasonable request.

Declarations
Ethical Approval The authors undertake that this article has not been published in any other journal and that
no plagiarism has occurred.

Consent to Participate The authors agree to participate in the journal.

Consent to Publish The authors agree to publish in the journal.

Competing Interest The authors declare no conflict of interest.

References
Ardabili S, Mosavi A, Dehghani M, Várkonyi-Kóczy AR (2020) Deep learning and machine learning in
hydrological processes climate change and earth systems a systematic review. In: Engineering for Sus-
tainable Future: Selected papers of the 18th International Conference on Global Research and Educa-
tion Inter-Academia–2019 18. Springer, pp 52–62
Basilio SA, Goliatt L (2022) Gradient boosting hybridized with exponential natural evolution strategies for
estimating the strength of geopolymer self-compacting concrete. Knowledge-Based Eng Sci 3:1–16
Box GEP, Tiao GC (1975) Intervention analysis with applications to economic and environmental problems.
J Am Stat Assoc 70:70–79. https://​doi.​org/​10.​1080/​01621​459.​1975.​10480​264
Carvalho TMN, de Assis de Souza Filho F (2021) Variational mode decomposition hybridized with gradi-
ent boost regression for seasonal forecast of residential water demand. Water Resour Manag 35:3431–
3445. https://​doi.​org/​10.​1007/​s11269-​021-​02902-7
Ceribasi G, Ceyhunlu AI, Wałęga A, Młyński D (2022) Investigation of the effect of climate change on
energy produced by hydroelectric power plants (HEPPs) by trend analysis method: A case study for
dogancay I-II HEPPs. Energies 15:2474. https://​doi.​org/​10.​3390/​en150​72474
Danandeh Mehr A, Rikhtehgar Ghiasi A, Yaseen ZM et al (2022) A novel intelligent deep learning predic-
tive model for meteorological drought forecasting. J Ambient Intell Humaniz Comput 1–15
Dorogush AV, Ershov V, Gulin A (2018) CatBoost: Gradient boosting with categorical features support.
arXiv preprint arXiv:1810.11363
Fan J, Wang X, Wu L et al (2018) Comparison of Support Vector Machine and Extreme Gradient Boosting for
predicting daily global solar radiation using temperature and precipitation in humid subtropical climates: A
case study in China. Energy Convers Manag 164:102–111. https://​doi.​org/​10.​1016/j.​encon​man.​2018.​02.​087
Fernández-Carrillo VH, Quej-Chi VH, De los Santos-Posadas HM, Carrillo-Ávila E (2022) Do AI models
improve taper estimation? A comparative approach for teak. Forests 13:1465. https://​doi.​org/​10.​3390/​
f1309​1465
Ghimire S, Yaseen ZM, Farooque AA et al (2021) Streamflow prediction using an integrated methodology
based on convolutional neural network and long short-term memory networks. Sci Rep 11:1–26
Goliatt L, Yaseen ZM (2023) Development of a hybrid computational intelligent model for daily global
solar radiation prediction. Expert Syst Appl 212:118295
He X, Luo J, Li P, Zuo G, Xie J (2020) A hybrid model based on variational mode decomposition and gradi-
ent boosting regression tree for monthly runoff forecasting. Water Resour Manag 34:865–884
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780.
https://doi.org/10.1162/neco.1997.9.8.1735
Huang G, Wu L, Ma X et al (2019) Evaluation of CatBoost method for prediction of reference evapotranspi-
ration in humid regions. J Hydrol. https://​doi.​org/​10.​1016/j.​jhydr​ol.​2019.​04.​085
Ibrahim KSMH, Huang YF, Ahmed AN et al (2022) Forecasting multi-step-ahead reservoir monthly and
daily inflow using machine learning models based on different scenarios. Appl Intell. https://​doi.​
org/​10.​1007/​s10489-​022-​04029-7
Imrie CE, Durucan S, Korre A (2000) River flow prediction using artificial neural networks: generalisa-
tion beyond the calibration range. J Hydrol 233:138–153
Ivanov AM, Gorbarenko AV, Kireeva MB, Povalishnikova ES (2022) Identifying climate change impacts
on hydrological behavior on large-scale with machine learning algorithms. Geogr Environ Sustain
15:80–87
Karbasi M, Jamei M, Ali M et al (2022) Forecasting weekly reference evapotranspiration using Auto
Encoder Decoder Bidirectional LSTM model hybridized with a Boruta-CatBoost input optimizer.
Comput Electron Agric 198:107121
Khan MI, Maity R (2020) Hybrid deep learning approach for multi-step-ahead daily rainfall prediction
using GCM simulations. IEEE Access 8:52774–52784. https://​doi.​org/​10.​1109/​access.​2020.​29809​77
Kilinc HC, Ahmadianfar I, Demir V et al (2023) Daily scale streamflow forecasting based-hybrid gradient
boosting machine learning model. Researchsquare

Kilinc HC (2022) Daily streamflow forecasting based on the hybrid particle swarm optimization and long
short-term memory model in the Orontes basin. Water 14:490. https://​doi.​org/​10.​3390/​w1403​0490
Kilinc HC, Yurtsever A (2022) Short-term streamflow forecasting using hybrid deep learning model
based on grey wolf algorithm for hydrological time series. Sustainability 14:3352
Kim J, Han H, Johnson LE et al (2019) Hybrid machine learning framework for hydrological assess-
ment. J Hydrol 577:123913
Kumar P, Singh AK (2022) A comparison between MLR, MARS, SVR and RF techniques: hydrological
time-series modeling. J Hum Earth Futur 3:90–98
Li L, Qiao J, Yu G et al (2022) Interpretable tree-based ensemble model for predicting beach water qual-
ity. Water Res 211:118078. https://​doi.​org/​10.​1016/j.​watres.​2022.​118078
Mahmood R, Jia S (2022) A comprehensive approach to develop a hydrological model for the simu-
lation of all the important hydrological components: The case of the Three-River Headwater
Region, China. Water 14:2778. https://doi.org/10.3390/w14182778
Munawar HS, Hammad AWA, Waller ST (2021) A review on flood management technologies related to
image processing and machine learning. Autom Constr
Naganna SR, Beyaztas BH, Bokde N, Armanuos AM (2020) On the evaluation of the gradient tree
boosting model for groundwater level forecasting. Knowledge-Based Eng Sci 1:48–57
Nguyen DH, Le Hien X, Heo J-Y, Bae D-H (2021) Development of an extreme gradient boosting model
integrated with evolutionary algorithms for hourly water level prediction. IEEE Access 9:125853–
125867. https://​doi.​org/​10.​1109/​access.​2021.​31112​87
Niu D, Diao L, Zang Z et al (2021) A machine-learning approach combining wavelet packet denoising with
catboost for weather forecasting. Atmosphere (Basel) 12:1618. https://​doi.​org/​10.​3390/​atmos​12121​618
Patrous Z (2018) Evaluating XGBoost for user classification by using behavioral features extracted from
smartphone sensors. Msc. Thesis, KTH Royal Institute Of Technology, Stockholm, Sweden
Prokhorenkova L, Gusev G, Vorobev A et al (2018) CatBoost: unbiased boosting with categorical fea-
tures. Adv Neural Inf Process Syst 31
Qi C, Wu M, Liu H et al (2023) Machine learning exploration of the mobility and environmental assess-
ment of toxic elements in mining-associated solid wastes. J Clean Prod 136771
Sarioglu FC, Yaslan Y (2019) Item prediction with RNN using different types of user-item interactions.
Signal Process Commun Appl Conf
Singh SK, Goyal A (2020) Performance analysis of machine learning algorithms for cervical cancer
detection. Int J Healthc Inf Syst Informatics. https://​doi.​org/​10.​4018/​IJHISI.​20200​40101
Solak CN, Peszek Ł, Yilmaz E et al (2020) Use of diatoms in monitoring the Sakarya river basin, Turkey.
Water 12:703. https://doi.org/10.3390/w12030703
Taylor KE (2001) Summarizing multiple aspects of model performance in a single diagram. J Geophys
Res 106:7183–7192
Tur R, Yontem S (2021) A comparison of soft computing methods for the prediction of wave height
parameters. Knowledge-Based Eng Sci 2:31–46
Wang L, Guo Y, Fan M (2022) Improving annual streamflow prediction by extracting information from
high-frequency components of streamflow. Water Resour Manag 36:4535–4555. https://​doi.​org/​10.​
1007/​s11269-​022-​03262-6
Xia F, Jiang D, Kong L et al (2022) Prediction of dichloroethene concentration in the groundwater of a
contaminated site using XGBoost and LSTM. Int J Environ Res Public Health 19:9374. https://​doi.​
org/​10.​3390/​ijerp​h1915​9374
Xie T, Zhang G, Hou J et al (2019) Hybrid forecasting model for non-stationary daily runoff series: A case
study in the Han River Basin, China. J Hydrol
Yaseen ZM, Sulaiman SO, Deo RC, Chau K-W (2018) An enhanced extreme learning machine model for
river flow forecasting: state-of-the-art, practical applications in water resource engineering area and
future research direction. J Hydrol 569:387–408. https://​doi.​org/​10.​1016/j.​jhydr​ol.​2018.​11.​069
Yukseltan E, Yucekaya A, Bilge AH, Agca Aktunc E (2021) Forecasting models for daily natural gas con-
sumption considering periodic variations and demand segregation. Socioecon Plann Sci 74:100937.
https://​doi.​org/​10.​1016/j.​seps.​2020.​100937
Zeng H, Shao B, Dai H et al (2023) Prediction of fluctuation loads based on GARCH family-CatBoost-
CNNLSTM. Energy 263:126125
Zhang Y, Zhao Z, Zheng J (2020) CatBoost: A new approach for estimating daily reference crop evapotran-
spiration in arid and semi-arid regions of Northern China. J Hydrol 588:125087. https://​doi.​org/​10.​
1016/j.​jhydr​ol.​2020.​125087
Zheng Z, Ali M, Jamei M et al (2023) Design data decomposition-based reference evapotranspiration forecast-
ing model: A soft feature filter based deep learning driven approach. Eng Appl Artif Intell 121:105984

Zounemat-Kermani M, Batelaan O, Fadaee M, Hinkelmann R (2021) Ensemble machine learning paradigms
in hydrology: A review. J Hydrol 598. https://doi.org/10.1016/j.jhydrol.2021.126266

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and
applicable law.

Authors and Affiliations

Huseyin Cagan Kilinc1 · Iman Ahmadianfar2 · Vahdettin Demir3 · Salim Heddam4 ·
Ahmed M. Al‑Areeq5 · Sani I. Abba5 · Mou Leong Tan6,7 · Bijay Halder8 ·
Haydar Abdulameer Marhoon9,10 · Zaher Mundher Yaseen11,5
Iman Ahmadianfar
[email protected]
Vahdettin Demir
[email protected]
Salim Heddam
[email protected]
Ahmed M. Al‑Areeq
[email protected]
Sani I. Abba
[email protected]
Mou Leong Tan
[email protected]
Bijay Halder
[email protected]
Haydar Abdulameer Marhoon
[email protected]
1 Department of Civil Engineering, İstanbul Aydın University, Istanbul, Turkey
2 Department of Civil Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran
3 Department of Civil Engineering, KTO Karatay University, Konya 42020, Turkey
4 Faculty of Science, Agronomy Department, University 20 Août 1955 Skikda, Route El Hadaik, BP 26, Skikda, Algeria
5 Interdisciplinary Research Center for Membranes and Water Security, King Fahd University of Petroleum & Minerals (KFUPM), Dhahran, Saudi Arabia
6 Geography Section, School of Humanities, Universiti Sains Malaysia, 11800 Penang, Malaysia
7 School of Geography, Nanjing Normal University, Nanjing 210023, China
8 Department of Remote Sensing and GIS, Vidyasagar University, Midnapore 721102, India
9 Information and Communication Technology Research Group, Scientific Research Center, Al-Ayen University, Thi‑Qar, Iraq
10 College of Computer Sciences and Information Technology, University of Kerbala, Karbala, Iraq
11 Civil and Environmental Engineering Department, King Fahd University of Petroleum & Minerals, Dhahran 31261, Saudi Arabia
