RFplus {InterpolateR} | R Documentation |
Machine learning algorithm for fusing ground and satellite precipitation data.
Description
MS-GOP (RFplus) is a machine learning algorithm for merging satellite-based and ground precipitation data. It combines Random Forest for spatial prediction, residual modeling for bias correction, and quantile mapping for final adjustment, with the aim of producing accurate precipitation estimates across different temporal scales.
Usage
RFplus(
  BD_Obs,
  BD_Coord,
  Covariates,
  n_round = NULL,
  wet.day = FALSE,
  ntree = 2000,
  seed = 123,
  training = 1,
  stat_validation = NULL,
  Rain_threshold = NULL,
  method = c("RQUANT", "QUANT", "none"),
  ratio = 15,
  save_model = FALSE,
  name_save = NULL
)
Arguments
BD_Obs |
A data.table or data.frame of ground observations, with one row per date and one column per station plus a Date column.
The dataset should be structured as follows:
> BD_Obs
# A data.table or data.frame with n rows (dates) and m+1 columns (stations + Date)
   Date        ST001  ST002  ST003  ST004  ...
   <date>      <dbl>  <dbl>  <dbl>  <dbl>  ...
1  2015-01-01  0      0      0      0      ...
2  2015-01-02  0      0      0      0.2    ...
3  2015-01-03  0.1    0      0      0.1    ...
|
BD_Coord |
|
Covariates |
A list of covariates used as independent variables in the RFplus model. Each covariate should be a SpatRaster object (as created with terra::rast in the Examples).
|
n_round |
Numeric indicating the number of decimal places to round the corrected values. If |
wet.day |
Numeric value indicating the threshold for wet day correction. Values below this threshold will be set to zero.
|
ntree |
Numeric indicating the maximum number of trees to grow in the Random Forest algorithm. The default value is 2000. This should not be set too low, so that every input row gets predicted at least a few times. If this value is too low, the prediction may be biased. |
seed |
Integer for setting the random seed to ensure reproducibility of results (default: 123). |
training |
Numerical value between 0 and 1 indicating the proportion of data used for model training. The remaining data are used for validation. Note that if you enter, for example, 0.8 it means that 80 % of the data will be used for training and 20 % for validation. If you do not want to perform validation, set training = 1. (Default training = 1). |
stat_validation |
A character vector specifying the names of the stations to be used for validation. Fill in this option only when you want to choose the validation stations manually. If this parameter is NULL and training is different from 1, validation is performed using randomly selected stations. The vector must contain the names of the stations selected by the user for validation. For example, stat_validation = c("ST001", "ST002"). (Default stat_validation = NULL). |
Rain_threshold |
List of numerical vectors defining precipitation thresholds to classify precipitation into different categories according to its intensity. This parameter should be entered only when the validation is to include categorical metrics such as Critical Success Index (CSI), Probability of Detection (POD), False Alarm Rate (FAR), etc. Each list item should represent a category, with the category name as the list item name and a numeric vector specifying the lower and upper bounds of that category. Note: See the "Notes" section for additional details on how to define categories, use this parameter for validation, and example configurations. |
method |
A character string specifying the quantile mapping method used for distribution adjustment. Options are: "RQUANT", "QUANT", and "none" (see Usage).
|
ratio |
Integer. Maximum search radius (in kilometers) for the quantile mapping adjustment using the nearest station (default: 15 km). |
save_model |
Logical value indicating whether the interpolation file should be saved to disk. The default value is FALSE. |
name_save |
Character string indicating the name under which the interpolation raster file will be saved. By default the algorithm sets as output name: 'Model_RFplus'. |
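To make the training, wet.day, and n_round arguments more concrete, the following base R sketch splits a hypothetical set of station names into training and validation subsets, then applies a wet-day threshold and rounding to a few values. The station names, the values, and the variable names are assumptions made for illustration; this is not the package's internal code, and RFplus may select stations differently.

```r
# Hypothetical station identifiers (not taken from the package data)
stations <- c("ST001", "ST002", "ST003", "ST004", "ST005",
              "ST006", "ST007", "ST008", "ST009", "ST010")

training <- 0.8   # proportion of stations used for training
set.seed(123)     # fixed seed so the random split is reproducible

# Randomly pick the training stations; the rest are held out for validation
n_train <- round(training * length(stations))
train_stations <- sample(stations, n_train)
valid_stations <- setdiff(stations, train_stations)

# Hypothetical corrected precipitation values (mm)
values <- c(0.04, 0.26, 1.234, 0, 12.349)

wet_day <- 0.1    # values below this threshold are set to zero
n_round <- 1      # number of decimal places to keep

values[values < wet_day] <- 0
values <- round(values, n_round)
```

With training = 0.8 and ten stations, eight stations end up in the training subset and two are held out, which matches the 80 % / 20 % description above.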
Details
The RFplus method implements a three-step approach:
- Base Prediction: A Random Forest model is trained using satellite data and covariates.
- Residual Correction: A second Random Forest model is used to correct the residuals from the base prediction.
- Distribution Adjustment: Quantile mapping (QUANT or RQUANT) is applied to adjust the distribution of satellite data to match the observed data distribution.
The final result combines all three steps, correcting the biases while preserving the outliers, and improving the accuracy of satellite-derived data such as precipitation and temperature.
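The distribution adjustment step can be illustrated with a minimal empirical quantile mapping sketch in base R. It covers only the third step, uses synthetic data, and is not the package's QUANT or RQUANT implementation; the variable names and the choice of a 1 % probability grid are assumptions.

```r
set.seed(123)

# Synthetic data (assumption): ground observations and a biased satellite estimate
obs <- rgamma(500, shape = 2, scale = 3)
sat <- 0.6 * rgamma(500, shape = 2, scale = 3) + 1

# Empirical quantiles of both series on a common probability grid
probs <- seq(0, 1, by = 0.01)
q_sat <- quantile(sat, probs, names = FALSE)
q_obs <- quantile(obs, probs, names = FALSE)

# Transfer function: map a satellite value to the observed value at the
# same quantile. rule = 2 holds the end values constant outside the range.
qm <- function(x) {
  approx(q_sat, q_obs, xout = x, rule = 2, ties = "ordered")$y
}

sat_adj <- qm(sat)

# Compare the mean bias before and after the adjustment
bias_before <- mean(sat) - mean(obs)
bias_after  <- mean(sat_adj) - mean(obs)
```

Because the transfer function is monotone, the ranking of the satellite values is preserved while their distribution is pulled toward that of the observations.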
Value
A list containing two elements:
Ensamble: A SpatRaster object containing the bias-corrected layers for each time step. The number of layers corresponds to the number of dates for which the correction is applied. This represents the corrected satellite data adjusted for bias.
Validation: A list containing the statistical results obtained from the validation process. This list includes:
- gof: A data table with goodness-of-fit metrics such as Kling-Gupta Efficiency (KGE), Nash-Sutcliffe Efficiency (NSE), Percent Bias (PBIAS), Root Mean Square Error (RMSE), and Pearson Correlation Coefficient (CC). These metrics assess the overall performance of the bias correction process.
- categorical_metrics: A data frame containing categorical evaluation metrics such as Probability of Detection (POD), Success Ratio (SR), False Alarm Rate (FAR), Critical Success Index (CSI), and Hit Bias (HB). These metrics evaluate the classification performance of rainfall event predictions based on user-defined precipitation thresholds.
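For readers unfamiliar with the goodness-of-fit metrics listed under gof, the sketch below computes them in base R using their commonly cited definitions. The function name gof, the sample vectors, the 2009 form of KGE, and the sign convention for PBIAS are assumptions; the source does not state which variants the package uses.

```r
# Standard textbook definitions; the package's exact formulas may differ
gof <- function(sim, obs) {
  r     <- cor(sim, obs)            # Pearson correlation
  alpha <- sd(sim) / sd(obs)        # variability ratio
  beta  <- mean(sim) / mean(obs)    # bias ratio
  c(
    KGE   = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2),
    NSE   = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2),
    PBIAS = 100 * sum(sim - obs) / sum(obs),
    RMSE  = sqrt(mean((sim - obs)^2)),
    CC    = r
  )
}

# Hypothetical observed and estimated precipitation (mm)
obs <- c(0.0, 1.2, 3.4, 0.5, 7.8, 2.1)
sim <- c(0.1, 1.0, 3.9, 0.4, 6.9, 2.5)

metrics <- gof(sim, obs)
```

A perfect estimate gives KGE = 1, NSE = 1, PBIAS = 0, RMSE = 0, and CC = 1, which is a convenient sanity check for any implementation.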
Notes
The Rain_threshold parameter is used to calculate categorical metrics such as the Critical Success Index (CSI), Probability of Detection (POD), False Alarm Rate (FAR), Success Ratio (SR), Hit Bias (HB), Heidke Skill Score (HSS), Hanssen-Kuipers Discriminant (HK), and Equitable Threat Score (ETS, also known as the Gilbert Skill Score).
The parameter should be entered as a named list, where each item represents a category and the name of the item is the category name.
The elements of each category must be a numeric vector with two values: the lower and upper limits of the category.
For example:
Rain_threshold = list(
  no_rain = c(0, 1),
  light_rain = c(1, 5),
  moderate_rain = c(5, 20),
  heavy_rain = c(20, 40),
  violent_rain = c(40, Inf)
)
Precipitation values will be classified into these categories based on their intensity. Users can define as many categories as necessary, or just two (e.g., "rain" vs. "no rain"). It is important that these categories are entered according to the study region, as each study region may have its own categories.
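As a rough illustration of how such a list could be used, the base R sketch below assigns values to categories and computes POD, FAR, and CSI for one category. The helper names classify and categorical_scores, the sample data, and the treatment of each interval as lower-inclusive and upper-exclusive are assumptions, as is the assumption that the categories are contiguous and sorted. FAR is computed here as false alarms / (hits + false alarms); the source does not state the exact definition the package uses.

```r
Rain_threshold <- list(
  no_rain       = c(0, 1),
  light_rain    = c(1, 5),
  moderate_rain = c(5, 20),
  heavy_rain    = c(20, 40),
  violent_rain  = c(40, Inf)
)

# Assign each value to the category whose [lower, upper) interval contains it
classify <- function(x, thresholds) {
  lower <- vapply(thresholds, function(b) b[1], numeric(1))
  upper <- thresholds[[length(thresholds)]][2]
  as.character(cut(x, breaks = c(lower, upper),
                   labels = names(thresholds), right = FALSE))
}

# Contingency-table scores for a single category
categorical_scores <- function(obs_cat, est_cat, category) {
  hits         <- sum(obs_cat == category & est_cat == category)
  misses       <- sum(obs_cat == category & est_cat != category)
  false_alarms <- sum(obs_cat != category & est_cat == category)
  c(POD = hits / (hits + misses),
    FAR = false_alarms / (hits + false_alarms),
    CSI = hits / (hits + misses + false_alarms))
}

# Hypothetical observed and estimated precipitation (mm)
obs <- c(0, 0.5, 2, 7, 25, 0, 3, 50)
est <- c(0, 1.5, 2.5, 4, 22, 0.2, 0.8, 45)

obs_cat <- classify(obs, Rain_threshold)
est_cat <- classify(est, Rain_threshold)

scores <- categorical_scores(obs_cat, est_cat, "light_rain")
```

The same function can be applied to every category name in turn, for example with lapply(names(Rain_threshold), ...), to build a table of scores per category.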
Author(s)
Jonnathan Augusto Landi Bermeo, jonnathan.landi@outlook.com
Examples
# Load the data
data("BD_Obs", package = "InterpolateR")
data("BD_Coord", package = "InterpolateR")
# Load the covariates
Covariates <- list(
  MSWEP = terra::rast(system.file("extdata/MSWEP.nc", package = "InterpolateR")),
  CHIRPS = terra::rast(system.file("extdata/CHIRPS.nc", package = "InterpolateR")),
  DEM = terra::rast(system.file("extdata/DEM.nc", package = "InterpolateR"))
)
# Apply the RFplus bias correction model
model <- RFplus(BD_Obs, BD_Coord, Covariates, n_round = 1, wet.day = 0.1,
                ntree = 2000, seed = 123, training = 0.8,
                Rain_threshold = list(no_rain = c(0, 1), light_rain = c(1, 5)),
                method = "RQUANT", ratio = 10, save_model = FALSE, name_save = NULL)
# Visualize the results
# Precipitation results within the study area
modelo_rainfall = model$Ensamble
# Validation statistic results
# goodness-of-fit metrics
metrics_gof = model$Validation$gof
# categorical metrics
metrics_cat = model$Validation$categorical_metrics