Code
library(poissonreg)
library(vip)
library(sf)
library(geobr)
library(arrow)
library(curl)
library(glue)
library(fs)
library(ggiraph)
library(janitor)
library(patchwork)
library(tidyverse)
library(tidymodels)
This exploration exercise will use data from the paper Linking land-use and land-cover transitions to their ecological impact in the Amazon (Nunes et al. 2022). They collected data from different areas inside different regions in the Amazon. The data holds information about Land Use and Land Cover (LULC), biodiversity, soil characteristics and carbon stocks. The data availability provides a good opportunity to practice data manipulation and modelling techniques.
From the available data, I will try two exercises:
Predict the species richness of an area with information about: land use and land cover, carbon stocks and soil properties?
Classify the land use and land cover with information about: species richness, carbon stocks and soil properties?
I will try different models to perform the predictions and classifications, and at the same time, will try to present the data and results in a intuitive manner. In this exercise, I chose to use Poisson Regression (Coxe, West, and Aiken 2009; Hayat and Higgins 2014), Random Forest (Breiman 2001; Biau and Scornet 2016), and Decision Trees (Kingsford and Salzberg 2008; Ville 2013).
Here are the packages used in this exercise:
library(poissonreg)
library(vip)
library(sf)
library(geobr)
library(arrow)
library(curl)
library(glue)
library(fs)
library(ggiraph)
library(janitor)
library(patchwork)
library(tidyverse)
library(tidymodels)
I will also set some settings that will be used across the document.
# These are the colors to represent LULC classes.
<-
my_palette c(
"#004d25", "#11823b", "#99AA38", "#721121",
"#A5402D", "#F15156", "#FFC07F", "#804D54"
)
# These are the labels for the LULC classes
<-
my_labels c(
"Undisturbed Forest", "Logged Forest",
"Logged Burned Forest", "Old Secondary Forest",
"Young Secondary Forest", "Reforestation",
"Pasture", "Mechanized Agriculture"
)
Before conducting the exploratory analysis and the modelling, I will download the data, which is available at Zenodo, and perform some data manipulation in order to store all data in a single table.
I am going to download the data directly from the Zenodo repository, in which I simply provide the link to the data collection. After that I just need to extract the compressed files, move the files I’m interested to another folder, and delete the rest. By the end of the process, we can see that I have 25 files available for analysis. Some files are named with “STM” and “PGM”, which are both regions explored in the study, so many tables holding the similar information are divided by these two regions. We can observe files named with animals names, there are also tables named with “soil” and “carbon”, so these are the ones I am interested at. There is also a table containing environment information, which I’m also going to use.
<- "data/LULC_transitions-v1.1.zip"
file_name
# Download by providing the link to the data
curl_download(
url = glue(
"https://zenodo.org/record/6518744/files/",
"cassioalencarnunes/LULC_transitions-v1.1.zip?download=1"
),destfile = file_name
)
# Extract files from compressed archive
unzip(
zipfile = file_name,
exdir = "data/"
)
# List files inside data folder
<-
data_files dir_ls(
path = "data/cassioalencarnunes-LULC_transitions-c501bd5/Data/"
)
# Move data files to another folder
file_move(path = data_files, new_path = "data/")
# Delete other files
dir_delete(path = "data/cassioalencarnunes-LULC_transitions-c501bd5/")
# Delete compressed archive
file_delete(path = "data/LULC_transitions-v1.1.zip")
# Some information about the files of interest
dir_info("data/") %>%
select(path, size)
path <fs_path> | size <fs_bytes> | |
---|---|---|
data/Ants_PGM.txt | 123.96K | |
data/Ants_STM.txt | 147.23K | |
data/Bees_PGM.txt | 19.76K | |
data/Birds.txt | 352.92K | |
data/Carbon_all_PGM.txt | 9.13K | |
data/Carbon_all_STM.txt | 7.43K | |
data/Dung_beetles.txt | 107.26K | |
data/Lianas_PGM.txt | 89.05K | |
data/Lianas_STM.txt | 79.01K | |
data/Saplings_PGM.txt | 298.53K |
Now that I have all the data I need, lets start organizing them in a format that will be easier to explore and use to train and test the models. My objective here is to get all the tables I’m interested, and transform them into a single table. We can see that the tables are separated by themes, so I will manipulate them in bundles, going from one theme to another. I separated them into five categories: “Transects”, which holds information about the areas where data was collected; “Biodiversity, which are tables with data about species presence in the evaluated areas;”Environment”, with information about terrain, LULC, coordinates; “Soil”, that stores data about soil chemical characteristics; “Carbon”, which holds information about the biomass present in different places inside the areas.
Starting with the transects data, which holds data of the two regions, STM (municipalities of Santarém, Belterra, and Mojuí-dos-Campos) and PGM (municipality of Paragominas), it also have other information about the identification of the transects, and also respectively LULC class. The identification of each transect was performed by assigning the catchment where it is located (“catchment”), and the number of the transect inside the catchment (“transect”). The column “transect_code” is the union between both columns I have just mentioned, which creates a unique identification for each transect. This table will serve as a base to join all the other tables.
<-
tr_class read_delim(
file = "data/Transect_classification.csv",
col_select = c(2:6), # Remove unnamed id column
name_repair = ~
str_replace(
string = tolower(.), # I prefer everything in lowercase
pattern = "-", # These traces are problematic
replacement = "_"
), col_types = "_ccccc"
)
head(tr_class)
region <chr> | catchment <chr> | transect <chr> | transect_code <chr> | lu_uf <chr> |
---|---|---|---|---|
STM | 103 | 1 | 103_1 | AP |
STM | 103 | 2 | 103_2 | SFyoung |
STM | 103 | 3 | 103_3 | SFold |
STM | 103 | 4 | 103_4 | PA |
STM | 103 | 5 | 103_5 | SFyoung |
STM | 103 | 7 | 103_7 | AP |
There are a lot of tables containing biodiversity variables to be analysed. Since they are stored in many different tables, I will map the manipulations using the purrr package, in which we will apply a function to every biodiversity table. The function will open the tables, pivot them to a longer format (there are many columns indicating a certain specie), and than create a new column indicating what is the group that the species are part of. I also transformed the species number variable into species richness, where the count of individuals found in a single area does not matter, just the occurrence of it (as a binary variable). After that I just summed the number of species found, for each specie category, in each transect, to find the species richness.
By the end of the process I ended up with one table with four columns, containing the region and the transect where the data was collected, the specie category that was found in the respective transect, and the count of different species for each of these categories.
# Create list of files that contains tables with biodiversity data
<-
tables_list dir_ls(path = "data/") %>%
str_subset(
pattern = regex(
"ants|trees|bees|birds|dung_beetles|liana|saplings",
ignore_case = TRUE
)
)
# Merge tables in a single table
<-
biodiversity map_df(
.x = tables_list,
.f = function(t) {
<-
table read_delim(
file = t,
name_repair = ~ tolower(.),
col_select = !matches("aaaaa"), # Remove weird column
col_types = "cccc"
)
%>%
table pivot_longer( # Get the number of individuals for each specie
cols = starts_with("sp"),
names_to = "specie",
values_to = "number"
%>%
) mutate( # Create variable with the name of the species category
bio_specie = str_remove_all(
string = t,
pattern = "data/|_|STM|PGM|.txt"
)
)
}
)
# Transform the number of individuals into species richness
<- biodiversity %>%
biodiversity mutate(number = if_else(number > 0, 1, 0)) %>% # Specie was observed or not
group_by(region, transect_code, bio_specie) %>%
summarise(bio_richness = sum(number, na.rm = FALSE), .groups = "drop")
head(biodiversity)
region <chr> | transect_code <chr> | bio_specie <chr> | bio_richness <dbl> |
---|---|---|---|
PGM | 100_1 | Ants | 17 |
PGM | 100_1 | Bees | 13 |
PGM | 100_1 | Birds | 43 |
PGM | 100_10 | Ants | 24 |
PGM | 100_10 | Bees | 10 |
PGM | 100_10 | Birds | 44 |
There was a weird column in the dung beetle table, named “aaaaaa” (screaming in despair?), I removed it just as the author of the paper.
The environment data table contained a large number of columns. There was data about the LULC classes, spatial location, terrain characteristics, soil texture, and many other columns in which I could not find their meaning. Since the data collection lacks metadata to provide information about variables, I decided to use only the ones I knew what they were, and that interested me. The manipulation of this table was very simple, I just renamed columns, fixed the name of a LULC class, and rounded some numbers. The columns with coordinate information informs that they are in Universal Transverse Mercator (UTM), which is a projected coordinate system, however there is no additional information about it, but I am going to keep it since I believe I can figure it out.
By the end of the process I ended up with a table of 11 variables about the environment of the data collection area. However, I’m still not sure if I’m actually going to use these data in my analysis.
<-
envr read_delim(
file = "data/environment_all.txt",
col_select = c(1:7, 37:38, 49:50),
name_repair = ~
str_replace(
string = tolower(.), # I prefer everything in lowercase
pattern = "-", # These traces are problematic
replacement = "_"
), col_types = "cccc"
%>%
) mutate(
land_useclass =
str_replace( # Fix land use variable
string = land_useclass,
pattern = "Mechanisd_agriculture",
replacement = "Mechanized_agriculture"
)%>%
) rename(
transect_code = transectcode,
env_silt = silt_all, env_clay = clay_all,
env_elev = elev_mean, env_slope = slope_mean,
%>%
) mutate(
across(c(env_silt:env_slope), ~ round(.x, digits = 2))
)
head(envr)
region <chr> | catchment <chr> | transect <chr> | transect_code <chr> | utm_x <dbl> | |
---|---|---|---|---|---|
STM | 103 | 1 | 103_1 | 730352 | |
STM | 103 | 2 | 103_2 | 729806 | |
STM | 103 | 3 | 103_3 | 729758 | |
STM | 103 | 4 | 103_4 | 732165 | |
STM | 103 | 5 | 103_5 | 733822 | |
STM | 103 | 7 | 103_7 | 729736 |
The variables of the soil tables are mostly about chemical properties of the soil, like acidity and fertility. This data don’t require a lot of modifications, I will just rename some variables, transform nitrogen from decimals to percentages, and will round some other variables. Since there are two tables, one for each region, I will map the data manipulation for both tables and merge the results using the purrr::map_df()
function.
<-
soil map_df(
.x = dir_ls(path = "data/", regexp = "Soil"),
.f = function(t) {
<-
table read_delim(
file = t,
col_select = c(2:9, 11, 12), # I'm leaving sodium out
name_repair = ~ tolower(.),
col_types = "_cccc"
)
}%>%
) rename(
soil_ph = ph, soil_n = nperc, soil_p = p,
soil_k = k, soil_ca = camg, soil_al = al
%>%
) mutate(
soil_n = soil_n * 100, # to percentage
across(c(soil_ph, soil_p:soil_al), ~ round(.x, digits = 2))
)
head(soil)
region <chr> | catchment <chr> | transect <chr> | transect_code <chr> | soil_ph <dbl> | |
---|---|---|---|---|---|
PGM | 44 | 12 | 44_12 | 5.25 | |
PGM | 44 | 15 | 44_15 | 5.40 | |
PGM | 44 | 2 | 44_2 | 5.00 | |
PGM | 44 | 3 | 44_3 | 5.28 | |
PGM | 44 | 4 | 44_4 | 4.28 | |
PGM | 44 | 5 | 44_5 | 4.61 |
The last tables that I’m going to use are the ones containing data about carbon stocks. The carbon partition used in the published study followed the guidelines for separating carbon stocks by the Intergovernmental Panel on Climate Change (IPCC). The categories of soil are “above-ground carbon pool”, “dead wood pool”, “litter carbon pool”, and “soil carbon pool”. These are important variables to estimate an ecosystem biomass.
This table also did not require a lot of modification, just rename some variables, and round numbers. I will also map the manipulation functions for both files (one for each region), to transform it into one table.
<-
carbon map_df(
.x = dir_ls(path = "data/", regexp = "Carbon"),
.f = function(t) {
<-
table read_delim(
file = t,
name_repair = ~ tolower(.), # I prefer everything in lowercase
col_types = "c"
%>%
) mutate(region = str_extract(t, "PGM|STM"))
}%>%
) rename(
carbon_aboveground = aboveground_pool, carbon_litter = litter_pool,
carbon_deadwood = dead_pool, carbon_soil = soil_pool_new
%>%
) mutate(
across(c(carbon_aboveground:carbon_soil), ~ round(.x, digits = 2))
)
head(carbon)
transect_code <chr> | carbon_aboveground <dbl> | carbon_litter <dbl> | carbon_deadwood <dbl> | carbon_soil <dbl> | |
---|---|---|---|---|---|
100_1 | 0.00 | 5.58 | 4.19 | 52.73 | |
100_10 | 7.43 | 4.86 | 1.46 | 42.45 | |
100_2 | 6.86 | 4.60 | 8.47 | 98.40 | |
100_3 | 71.33 | 7.37 | 30.34 | 83.86 | |
100_4 | 0.00 | 8.17 | 21.89 | 68.16 | |
100_5 | 113.90 | 11.46 | 39.47 | 62.28 |
After organizing the tables, I can now merge them altogether into a unique table. I am going to make a series of join operations, merging the tables by the identification columns: “region”, “catchment”, “transect” and “transect_code.” After joining them I will also make some last modifications before exploring the data, I will change the LULC names more beautiful for the column “land_useclass”, and will transform “region” and “lu_uf” into ordered factors. I will also save the resulting table into a .parquet file, which is a compressed tabular format.
<- tr_class %>%
full_data inner_join(
y = biodiversity,
by = c("region", "transect_code")
%>%
) inner_join(
y = soil,
by = c("region", "catchment", "transect", "transect_code")
%>%
) inner_join(
y = carbon,
by = c("region", "transect_code")
%>%
) inner_join(
y = envr,
by = c("region", "catchment", "transect", "transect_code")
%>%
) mutate(
land_useclass = str_replace_all( # Make LULC names more beautiful
string = land_useclass,
pattern = "-|_",
replacement = " "
),region = fct(region, levels = c("STM", "PGM")),
lu_uf = fct(
lu_uf, levels = c(
"UF", "LF", "BF", "LBF", "SFold", "SFyoung", "REF", "PA", "MA",
"AP", "FRU", "SHA"
)
)
)
write_parquet(
x = full_data,
sink = "data/lulc_eco.parquet",
version = "latest"
)
file_delete(
path = dir_ls(
"data/",
glob = "*.parquet",
invert = TRUE
)
)
rm(list = setdiff(ls(), c("my_palette", "my_labels")))
Now we have a single table to start exploring! Lets begin with some descriptive analysis. My object here is to find the most general information of the variables I select for analysis, I will keep separating the analysis into categories of variables, as I have done above.
In total, the published study collected data from a total of 339 transects, 156 from STM and 183 from PGM. I want to know what are the LULC characteristics and distribution in these regions. For that, I will count the occurrence of each LULC, and plot their spatial distribution over the study areas.
Some LULC classes presented a very low frequency: Burned Forests (BF), Abandoned Plantations (AP), Fruiticulture (FRU) and Small Holder’s Agriculture (SHA), accumulated only 7, 4, 4 and 4 observations, respectively. Because of that, I’m not going to analyse the data for these for LULC classes in this project.
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c("region", "transect_code", "lu_uf")
%>%
) distinct(region, transect_code, lu_uf, .keep_all = TRUE) %>%
group_by(lu_uf) %>%
count(lu_uf) %>%
ungroup()
<- data %>%
lu_subset arrange(desc(n)) %>%
slice_max(n, n = 8) %>% # Filter LULC classes with more observations
pull(lu_uf)
%>%
data arrange(desc(n))
lu_uf <fct> | n <int> | |||
---|---|---|---|---|
PA | 72 | |||
LF | 68 | |||
LBF | 65 | |||
SFyoung | 33 | |||
MA | 26 | |||
SFold | 25 | |||
UF | 21 | |||
REF | 10 | |||
BF | 7 | |||
AP | 4 |
With the filtered data, I’m going to explore the spatial distribution of the transects and their LULC classes. But for that, I need to discover what is the coordinate system being used. I know that it is based in a Universal Transverse Mercator projection, the problem is that it is divided by zones. After some trials, I found that STM data is in region 21 and PGM in region 23. It is not possible to find more details about the coordinate system, as the ellipsoid used to represent the earth surface, so I used the most common for global data, the WGS84. Even if it may not be the correct one, it wont make a lot of difference in this case, since I’m not looking for extreme spatial accuracy for my map.
Now I can create my figure! I’m going to plot two maps (one for each region), along with bars representing the amount of observations for each LULC class. I am also going to add some interactive functionalities so you can hover with your mouse over the scales and points to find more information.
# Get the boundary of the municipalities
<-
muni read_municipality(
code_muni = "PA",
showProgress = FALSE
%>%
) filter(code_muni %in% c(1506807, 1501451, 1505502))
# Transform data into spatial format
<-
data map2_df(
.x = c("STM", "PGM"),
.y = c("21", "23"), # UTM regions
.f = function(r, z) {
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c(
"region", "transect_code", "lu_uf", "utm_x", "utm_y"
)%>%
) filter(
== r,
region %in% lu_subset
lu_uf %>%
) distinct(region, transect_code, lu_uf, .keep_all = TRUE) %>%
st_as_sf(
coords = c("utm_x", "utm_y"),
crs = glue(
"+proj=utm +zone={z} +south +datum=WGS84 +units=m +no_defs"
)%>%
) st_transform(st_crs(muni))
}
)
<-
maps map2(
.x = c("STM", "PGM"),
.y = list(c(1506807, 1501451), c(1505502)),
.f = function(r, m) {
ggplot() +
geom_sf(data = muni %>% filter(code_muni %in% m)) +
geom_sf_interactive(
data = data %>% filter(region == r),
aes(
color = lu_uf,
tooltip = lu_uf,
data_id = lu_uf,
`data-id` = lu_uf
),size = 1,
expand = FALSE,
extra_interactive_params = "data-id"
+
) scale_color_manual_interactive(
values = my_palette,
tooltip = function(breaks) { as.character(my_labels)},
extra_interactive_params = "data-id",
`data-id` = data %>%
arrange(data, lu_uf) %>%
distinct(lu_uf),
data_id = function(breaks) { as.character(breaks) }
+
) annotate(
"text",
-Inf, Inf,
label = glue("bold({r})"),
parse = TRUE,
hjust = 0, vjust = 1
+
) labs(x = NULL, y = NULL, title = NULL) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
guides(
color = guide_legend_interactive(override.aes = list(size = 4))
+
) theme_void(base_family = "sans") +
theme(
legend.title = element_blank(),
panel.grid.major = element_blank(),
legend.position = "bottom"
)
}
)
<-
bars map(
.x = c("STM", "PGM"),
.f = function(r) {
ggplot() +
geom_col_interactive(
data = data %>%
filter(region == r) %>%
group_by(lu_uf) %>%
count(),
aes(
data_id = lu_uf,
`data-id` = lu_uf,
x = n,
y = lu_uf,
tooltip = glue("Observations: {n}"),
fill = lu_uf
),show.legend = FALSE
+
) lims(x = c(0, max(pull(add_count(data, lu_uf), n)))) +
labs(x = NULL, y = NULL, title = NULL) +
scale_y_discrete(expand = c(0, 0)) +
scale_fill_manual_interactive(
values = my_palette,
extra_interactive_params = "data-id",
`data-id` = data %>%
arrange(data, lu_uf) %>%
distinct(lu_uf),
data_id = function(breaks) { as.character(breaks) }
+
) labs(y = "", x = "") +
theme_void(base_family = "sans") +
theme(
text = element_text(size = 15, family = "Open Sans"),
axis.text.y = element_blank(),
legend.position = ""
)
}
)
rm(muni)
girafe(
code =
print(
2]] + bars[[2]] +
maps[[plot_spacer() + plot_spacer() +
1]] + bars[[1]] +
maps[[plot_layout(
ncol = 2, nrow = 3,
guides = "collect",
heights = c(2, 0.2, 2),
width = c(3, 1)
&
) theme(legend.position = "bottom")
), options = list(
opts_tooltip(
css = "background-color:white;color:black;padding:5px;border-radius:2px;"
),opts_hover_inv(css = "opacity:0.5;"),
opts_hover(girafe_css("stroke:black;fill:black;")),
opts_hover_key(girafe_css("stroke:black;fill:black")),
opts_toolbar(saveaspng = FALSE),
fonts = list(sans = "Open Sans")
) )
We can observe that STM region presented a more balanced distribution of LULC classes being analysed. The most common LULC classes are Pasture (PA), Logged Primary Forests (LF) Logged and Burned Primary Forests (LBF). For both regions, Undisturbed Forests are concentrated to the West, and landscapes with intense human activities are concentrated to the East. So there is some kind of transition pattern, from undisturbed to disturbed landscapes.
And how about the species that live in these areas? We can imagine that biodiversity of undisturbed landscape are likely higher than of disturbed ones. I want to try to observe patterns of species richness distributed over the LULC classes. For that I’m going to create a matrix plot, showing the median of species richness in each LULC class (I’m sure that taking the median of species richness is not a good practice, but I’m doing it here just to explore some patterns).
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c(
"region", "transect_code", "lu_uf",
"bio_richness", "bio_specie"
)%>%
) filter(lu_uf %in% lu_subset) %>%
group_by(region, lu_uf, bio_specie) %>%
summarise(bio_richness = median(bio_richness), .groups = "keep") %>%
mutate(
id = cur_group_id(),
richness_label = glue(
"{lu_uf} \n",
"{bio_specie} \n",
"Number of species: {round(bio_richness)}"
)%>%
) drop_na()
<- ggplot(data) +
plot facet_wrap(vars(region)) +
geom_tile_interactive(
aes(
data_id = id,
x = bio_specie,
y = lu_uf,
fill = bio_richness,
tooltip = richness_label
)+
) scale_fill_gradient(low = "#2c2929", high = "#C0C0C0") +
labs(y = "", x = "") +
theme_bw(base_family = "sans") +
theme(
text = element_text(size = 15, family = "Open Sans"),
axis.text.y = element_text(
hjust = 1, vjust = 0.5, margin = margin(r = 20), angle = 45
),axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
legend.position = "",
panel.grid = element_blank()
)
girafe(
ggobj = plot,
options = list(
opts_tooltip(
css = "background-color:white;color:black;padding:5px;border-radius:2px;"
),opts_hover_inv(css = "opacity:0.5;"),
opts_hover(
girafe_css(
css = "stroke:black;cursor:pointer;r:5px;",
)
),opts_toolbar(saveaspng = FALSE),
fonts = list(sans = "Open Sans")
) )
We can observe that undisturbed landscapes presents higher species richness, the Mechanized Agriculture (MA) presents the lowest values of specie richness. Trees, Saplings and Birds are the most diverse categories of species. Both regions present very similar patterns, so it seems that species richness are not different between regions. Unfortunately, bees species were not recorded for the STM region.
Now lets take a look from another perspective. I’m going to plot an histogram of the overall specie richness of each transect, grouped by regions and by more general LULC classes: Mechanized Agriculture (MA), Pasture (PA), Primary Forests (PF) and Secondary Forests (SF).
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c(
"region", "transect_code", "land_useclass",
"bio_richness", "bio_specie"
)%>%
) group_by(region, land_useclass, transect_code) %>%
summarise(
bio_richness = sum(bio_richness, na.rm = TRUE), .groups = "drop_last"
%>%
) mutate(
id = cur_group_id(),
label = glue(
"Total Observations: {n()}",
"\n",
"Mean: {round(mean(bio_richness))}",
"\n",
"Median: {round(median(bio_richness))}",
"\n",
"Variance: {round(var(bio_richness))}"
)%>%
) ungroup() %>%
add_count(land_useclass) %>%
filter(
> 0,
bio_richness dense_rank(desc(n)) <= 4 # Filter the four LULC classes with most obs
%>%
) select(-n) %>%
drop_na()
<-
plot ggplot(data) +
facet_grid_interactive(
interactive_on = "both",
~ land_useclass,
region labeller = labeller(
land_useclass = labeller_interactive(
aes(
tooltip = glue("{land_useclass}"),
label = c("MA", "PA", "PF", "SF")
)
)
)+
) geom_histogram_interactive(
aes(
x = bio_richness,
tooltip = label,
data_id = id
),binwidth = 15
+
) labs(x = "Number of Species", y = "Count") +
theme_bw(base_family = "sans") +
theme(
text = element_text(size = 15, family = "Open Sans"),
axis.text.y = element_text(
hjust = 1, vjust = 0.5, margin = margin(r = 5)
),axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
strip.text.x = element_text_interactive()
)
girafe(
ggobj = plot,
options = list(
opts_tooltip(
css = "background-color:white;color:black;padding:5px;border-radius:2px;"
),opts_hover_inv(css = "opacity:0.5;"),
opts_hover(
girafe_css(
css = "stroke:black;cursor:pointer;r:5px;",
)
),opts_toolbar(saveaspng = FALSE),
fonts = list(sans = "Open Sans")
) )
The histograms confirms that difference of species richness between both regions are not apparent, but note that this doesn’t mean that they are the same! Maybe the species inside each region are different, maybe they hold different number of individuals population, I imagine that both these measurements are important too. We can also observe that the variance of specie richness are much higher for PF and SF than PA and MA, which may be caused because PF and SF present more diverse conditions, some are less disturbed, others are more disturbed.
So by now we already know that LULC classes and biodiversity are related in some manner. Lets explore if chemical properties of soils are also related to LULC classes. For that I’m simply going to plot the soil variables distributions grouped by general LULC classes, it is a very simple, but very informative type of visualization.
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c(
"region", "transect_code", "land_useclass", "soil_ph":"soil_al"
)%>%
) distinct(transect_code, .keep_all = TRUE) %>%
pivot_longer(
cols = soil_ph:soil_al,
names_to = "soil_props",
values_to = "value"
%>%
) drop_na() %>%
add_count(land_useclass) %>%
filter(
dense_rank(desc(n)) <= 4 # Filter the four LULC classes with most obs
%>%
) select(-n)
<-
plot ggplot(data) +
facet_wrap(
vars(soil_props),
scales = "free_x",
labeller = labeller(
soil_props = c(
soil_al = "Al", soil_ca = "Ca",
soil_k = "K", soil_n = "N",
soil_p = "P", soil_ph = "pH"
)
)+
) geom_violin(
aes(
y = land_useclass,
x = value
)+
) scale_x_continuous(breaks = pretty_breaks(4)) +
theme_bw(base_family = "sans") +
theme(
axis.title = element_blank(),
text = element_text(size = 15, family = "Open Sans"),
axis.text.y = element_text(hjust = 0, margin = margin(r = 30))
)
girafe(
ggobj = plot,
options = list(
opts_tooltip(
css = "background-color:white;color:black;padding:5px;border-radius:2px;"
),opts_hover_inv(css = "opacity:0.5;"),
opts_hover(
girafe_css(
css = "stroke:black;cursor:pointer;r:5px;",
)
),opts_toolbar(saveaspng = FALSE),
fonts = list(sans = "Open Sans")
) )
The most noticeable pattern in these distributions, at least in my opinion, is the differences between forests and pasture/agriculture. This is very obvious, the effect of human activities over the soil is caused not just by the land cover, but by the practices used in the land uses. When areas are going to be used for livestock and crops, Calcium (Ca) is applied in order to “eliminate” Aluminium (Al) (which is toxic for roots of most of the plants) and to elevate pH (helps plants to absorb nutrients from soil). It is easy to see these differences in the distributions. There are also differences in nutrients that are important to plants Nitrogen (N), Phosphorus (P) and Potassium (K). They are also in higher concentrations at Mechanized Agriculture (MA) and Pasture (PA), since these classes are normally under exogenous fertilization regimes. However, we can also observe that forests present more extreme values of nutrients concentration.
My question now is if soil chemical characteristics would help a prediction model, that aims to predict species richness values. I will explore this checking if they are correlated, which they are probably, since both are related to LULC classes. However, maybe the amount of variation of soil properties may help improving the predictions. Lets take a look.
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c(
"region", "transect_code", "lu_uf",
"bio_specie", "bio_richness",
"soil_ph":"soil_al"
)%>%
) filter(lu_uf %in% lu_subset) %>%
group_by(region, lu_uf, transect_code) %>%
summarise(
bio_richness = sum(bio_richness, na.rm = TRUE),
across(soil_ph:soil_al, ~ max(.x)),
.groups = "drop"
%>%
) pivot_longer(
:soil_al,
soil_phnames_to = "soil_props"
%>%
) drop_na()
<-
plot ggplot(data) +
facet_wrap(
vars(soil_props),
scales = "free_x",
labeller = labeller(
soil_props = c(
soil_al = "Al", soil_ca = "Ca",
soil_k = "K", soil_n = "N",
soil_p = "P", soil_ph = "pH"
)
)+
) geom_point(aes(x = value, y = bio_richness, color = lu_uf)) +
scale_color_manual(values = my_palette) +
scale_x_continuous(breaks = pretty_breaks(4)) +
labs(y = "Specie Richness", x = NULL) +
theme_bw(base_family = "sans") +
theme(
text = element_text(size = 15, family = "Open Sans"),
legend.title = element_blank()
)
plot
When analyzing the whole data, it is difficult to see correlation, maybe just some species categories might have correlation with soil chemical characteristics (such as lianas, saplings and trees). Also it is not surprising that these variables dispersion in the scatter plot are specially flat at Pastures (PA) and Mechanized Agriculture (MA), since these areas tend to be much more homogeneous than natural landscapes. Remember that the histograms of biodiversity showed lower variation of species richness for PA and MA. Since the relation between soil chemical properties and biodiversity is so unclear and messy, I’ll evaluate if I’m going to use them to train my prediction models.
For the carbon data, I’m also going to explore the distribution of carbon stocks divided by general LULC classes. First, I will see the distributions of all the carbon stocks together. It is possible to observe that Primary Forests have the highest amount of biomass, followed by secondary forests. This is expected specially due to the above-ground and dead wood biomass!
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c(
"region", "transect_code", "land_useclass",
"carbon_aboveground":"carbon_soil"
)%>%
) distinct(transect_code, .keep_all = TRUE) %>%
drop_na() %>%
pivot_longer(
cols = carbon_aboveground:carbon_soil,
names_to = "carbon",
values_to = "value"
%>%
) add_count(land_useclass) %>%
filter(
dense_rank(desc(n)) <= 4 # Filter the four LULC classes with most obs
%>%
) select(-n) %>%
group_by(region, land_useclass, transect_code) %>%
summarise(value = sum(value), .groups = "drop")
ggplot(data) +
geom_violin(
aes(
y = land_useclass,
x = value
)+
) scale_x_continuous(breaks = pretty_breaks(4)) +
theme_bw(base_family = "sans") +
theme(
axis.title = element_blank(),
text = element_text(size = 15, family = "Open Sans"),
axis.text.y = element_text(hjust = 0, margin = margin(r = 30))
)
So lets repeat the same figure, but analyzing the Litter and Soil carbon stocks separately. We can see that litter biomass is the lowest for Mechanized Agriculture (MA), which is also expected, crops like soybean, rice, cotton do not produce a lot of litter, and many systems apply soil tillage. But for soil carbon the story is different, it is difficult to see big differences between the LULC classes. This is not new information, estimating effects of LULC classes over soil carbon is a difficult task, but we are not going to discuss this matter in this document.
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c(
"region", "transect_code", "land_useclass",
"carbon_litter", "carbon_soil"
)%>%
) distinct(transect_code, .keep_all = TRUE) %>%
drop_na() %>%
pivot_longer(
cols = carbon_litter:carbon_soil,
names_to = "carbon",
values_to = "value"
%>%
) add_count(land_useclass) %>%
filter(
dense_rank(desc(n)) <= 4 # Filter the four LULC classes with most obs
%>%
) select(-n)
ggplot(data) +
facet_wrap(
vars(carbon),
scales = "free_x",
labeller = labeller(
carbon = c(carbon_litter = "Litter", carbon_soil = "Soil")
)+
) geom_violin(
aes(
y = land_useclass,
x = value
)+
) scale_x_continuous(breaks = pretty_breaks(4)) +
theme_bw(base_family = "sans") +
theme(
axis.title = element_blank(),
text = element_text(size = 15, family = "Open Sans"),
axis.text.y = element_text(hjust = 0, margin = margin(r = 30))
)
After exploring a data a little, I already have some knowledge about what I’m dealing with. With this information, I’m going to filter my data before starting to train my models. Regarding the variables, I’m going to drop some chemical soil characteristics (Ca, K, N, P), and for carbon stocks, I will drop Dead Wood and Soil carbon variables I will also only keep the rows that belongs to the 8 LULC classes with most observations, and I will remove Specie Richness values for Bees, which would cause a serious imbalance in the data.
<-
data read_parquet(
file = "data/lulc_eco.parquet",
col_select = c(
"region", "catchment", "transect", "land_useclass", "lu_uf",
"bio_richness", "bio_specie",
"soil_ph", "soil_al",
"carbon_aboveground", "carbon_litter"
)%>%
) filter(
%in% lu_subset,
lu_uf != "Bees"
bio_specie %>%
) drop_na()
write_parquet(
x = data,
sink = "data/filtered_lulc_eco.parquet",
version = "latest"
)
Now its time to start the modelling exercise. I’m going to train and test four models, two prediction models (Poisson Regression and Random Forest), and two classification models (Decision Tree and Random Forest).
I will start the exercise preparing the train and test data. Normally it is performed a random sampling separating 70% of the data for training, and 30% for testing (proportions vary). In this exercise I will take a different approach, I will use the STM region to train my models (n = 156), and the PGM region to test them (n = 183). I will choose this strategy to keep both splits as independent as they could be, using a “external” data set to test my models. I choose the STM region as train split because it is more balanced than the PGM region (regarding the LULC classes).
<-
data_pred read_parquet(file = "data/filtered_lulc_eco.parquet") %>%
drop_na() %>%
mutate(lu_uf = droplevels(lu_uf))
#set.seed(821715)
#data_split <- initial_split(data, prop = 2/4)
<- filter(data_pred, region == "STM")
train_data_pred <- filter(data_pred, region == "PGM")
test_data_pred
<- data_pred %>%
data_class pivot_wider(
names_from = bio_specie,
values_from = bio_richness
%>%
) drop_na()
<- filter(data_class, region == "STM")
train_data_class <- filter(data_class, region == "PGM") test_data_class
The first model will be the Poisson Regression, a Generalized Linear Model (GLM), which is appropriate to predict count data (count of number of different species in one area!). Before training the model, I will set some pre processing steps: remove near zero variance variables, and create dummy variables to represent nominal predictors.
<-
biod_rec recipe(bio_richness ~ ., data = train_data_pred) %>%
update_role(
region, catchment, transect, land_useclass,new_role = "ID"
%>%
) step_nzv(all_predictors()) %>%
step_dummy(all_nominal_predictors())
<-
biod_mod poisson_reg() %>%
set_engine("glm")
<-
biod_wflow workflow() %>%
add_model(biod_mod) %>%
add_recipe(biod_rec)
<-
biod_fit %>%
biod_wflow fit(data = train_data_pred)
<-
biod_aug augment(biod_fit, test_data_pred) %>%
clean_names()
biod_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: poisson_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_nzv()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Poisson Regression Model Specification (regression)
Computational engine: glm
The Poisson Regression, as expected, did not predicted any negative numbers, which could happen in case we tried to predict values using Linear Least Squares Regression. This model do not present any parameter to be optimized before fitting the final model. The MAE points out that the mean absolute error is around 10 species, which can be a good performance for some categories of species, or bad for others (some categories of species are more abundant than others).
<- biod_aug %>%
mod_met metrics(truth = bio_richness, estimate = pred) %>%
clean_names() %>%
mutate(estimate = round(estimate, digits = 2))
%>%
biod_aug ggplot() +
geom_point(
aes(
x = bio_richness,
y = pred,
color = lu_uf
)+
) scale_color_manual(values = my_palette) +
annotate(
geom = "text",
x = 0, y = c(80, 70, 60),
hjust = 0,
label = c(
glue("bold({toupper(mod_met$metric[1])})- {mod_met$estimate[1]}"),
glue("bold({toupper(mod_met$metric[3])})- {mod_met$estimate[3]}"),
glue("bold({toupper(mod_met$metric[2])})- {mod_met$estimate[2]}")
),parse = TRUE,
size = 5
+
) labs(x = "Observations", y = "Estimations") +
geom_abline(slope = 1, intercept = 0) +
coord_equal() +
theme_bw() +
theme(
legend.title = element_blank()
)
Now lets train a Random Forest model, and we will check if it will be able to perform better than the Poisson Regression in the task to predict Specie Richness. The Random Forest is known to be very robust to imbalances, it does not require creation of dummies for ordinal data, performing well with categorized data. Before performing the final fit to be tested, I will estimate three parameters of the model, in order to find a good combination for using in the final fit. I will perform this optimization using V-Fold Cross-Validation in the training set.
<-
biod_mod rand_forest(
mtry = tune(),
trees = 100,
min_n = tune()
%>%
) set_engine("ranger") %>%
set_mode("regression")
<-
biod_rec recipe(bio_richness ~ ., data = train_data_pred) %>%
update_role(
region, catchment, transect, land_useclass,new_role = "ID"
%>%
) step_nzv(all_predictors())
<-
biod_wflow workflow() %>%
add_model(biod_mod) %>%
add_recipe(biod_rec)
<- vfold_cv(train_data_pred, v = 5, repeats = 100)
biod_fold
<- biod_wflow %>%
biod_res tune_grid(
biod_fold,grid = 50,
control = control_grid(save_pred = TRUE),
metrics = metric_set(rmse)
)
<- biod_res %>%
biod_best select_best(metric = "rmse")
%>%
biod_res show_best() %>%
clean_names()
mtry <int> | min_n <int> | metric <chr> | estimator <chr> | mean <dbl> | n <int> | std_err <dbl> | |
---|---|---|---|---|---|---|---|
5 | 2 | rmse | standard | 8.017347 | 500 | 0.02706415 | |
5 | 4 | rmse | standard | 8.021977 | 500 | 0.02671044 | |
4 | 6 | rmse | standard | 8.024201 | 500 | 0.02659441 | |
5 | 10 | rmse | standard | 8.031246 | 500 | 0.02702575 | |
5 | 13 | rmse | standard | 8.041638 | 500 | 0.02742045 |
We found the parameters that resulted in the models with lower RMSE values, which punishes bigger error values. Now that we know the parameters values to use, lets update the model and perform the final fit.
<-
biod_mod rand_forest(
mtry = biod_best$mtry,
trees = 100,
min_n = biod_best$min_n
%>%
) set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
<- biod_wflow %>%
biod_wflow update_model(biod_mod)
<-
biod_fit %>%
biod_wflow fit(data = train_data_pred)
<-
biod_aug augment(biod_fit, test_data_pred) %>%
clean_names()
biod_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_nzv()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = biod_best$mtry
trees = 100
min_n = biod_best$min_n
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
The Random Forest performed better than the Poisson Regression in predicting Specie Richness. The MAE dropped by more than 2 species count. The dispersion of points is also less clustered than what we observed from the Poisson Regression.
<- biod_aug %>%
mod_met metrics(truth = bio_richness, estimate = pred) %>%
clean_names() %>%
mutate(estimate = round(estimate, digits = 2))
%>%
biod_aug ggplot() +
geom_point(
aes(
x = bio_richness,
y = pred,
color = lu_uf
)+
) scale_color_manual(values = my_palette) +
annotate(
geom = "text",
x = 0, y = c(90, 80, 70),
hjust = 0,
label = c(
glue("bold({toupper(mod_met$metric[1])})- {mod_met$estimate[1]}"),
glue("bold({toupper(mod_met$metric[3])})- {mod_met$estimate[3]}"),
glue("bold({toupper(mod_met$metric[2])})- {mod_met$estimate[2]}")
),parse = TRUE,
size = 5
+
) labs(x = "Observations", y = "Estimations") +
geom_abline(intercept = 0, slope = 1) +
coord_equal() +
theme_bw() +
theme(
legend.title = element_blank()
)
If we check the importance of each variable in predicting Species Richness, we will see that obviously the specie category was the most important one. After that, we have the above-ground biomass and LULC classes. As expected the soil chemical characteristics were the least important variables.
<- biod_fit %>%
var_importance extract_fit_parsnip() %>%
vi() %>%
clean_names()
%>%
var_importance ggplot() +
geom_col(
aes(x = importance, y = reorder(variable, importance)),
fill = "transparent",
color = "#000000"
+
) labs(x = "Importance", y = NULL) +
theme_bw()
The classification will aim to correctly identify the LULC classes using the biodiversity, soil and carbon data. I will start training a Decision Tree algorithm, and than another Random Forest (but this time for classification). For the Decision Tree, we will need to center and scale numerical predictors, as also create dummy variables and filter near zero variance variables. I will also optimize three parameters of my model specification, but this time I will use Bootstraps.
<-
dt_mod_tune decision_tree(
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
%>%
) set_engine("rpart") %>%
set_mode("classification")
<-
dt_rec recipe(lu_uf ~ ., data = train_data_class) %>%
update_role(
region, catchment, transect, land_useclass,new_role = "ID"
%>%
) step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
<-
dt_wflow workflow() %>%
add_model(dt_mod_tune) %>%
add_recipe(dt_rec)
<- bootstraps(train_data_class, times = 100)
dt_boots
<- dt_wflow %>%
dt_res tune_grid(
dt_boots,grid = 50,
control = control_grid(save_pred = TRUE),
metrics = metric_set(accuracy)
)
<- dt_res %>%
dt_best select_best(metric = "accuracy")
%>%
dt_res show_best() %>%
clean_names()
cost_complexity <dbl> | tree_depth <int> | min_n <int> | metric <chr> | estimator <chr> | |
---|---|---|---|---|---|
3.760749e-02 | 12 | 19 | accuracy | multiclass | |
1.082650e-02 | 3 | 23 | accuracy | multiclass | |
5.727236e-02 | 4 | 27 | accuracy | multiclass | |
3.625256e-07 | 4 | 20 | accuracy | multiclass | |
1.627550e-03 | 3 | 21 | accuracy | multiclass |
We selected the best model, which did not presented a very high accuracy for the validation set. Lets see if these results holds when testing the model with external data.
<-
dt_mod decision_tree(
cost_complexity = dt_best$cost_complexity,
tree_depth = dt_best$tree_depth,
min_n = dt_best$min_n
%>%
) set_engine("rpart") %>%
set_mode("classification")
<- dt_wflow %>%
dt_wflow update_model(dt_mod)
<-
dt_fit %>%
dt_wflow fit(data = train_data_class)
<-
dt_aug augment(dt_fit, test_data_class) %>%
clean_names()
dt_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_zv()
• step_normalize()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = dt_best$cost_complexity
tree_depth = dt_best$tree_depth
min_n = dt_best$min_n
Computational engine: rpart
The performance of the Decision Trees was not satisfactory, with a low value of global accuracy. However, It was expected that classifying such refined classification of forests states would be challenging. The interesting is that the model made a decent job at separating Primary Forests (UF, LF, LBF) from Secondary Forests (SFold and SFyoung) and Mechanized Agriculture (MA) and Pasture (PA).
<- dt_aug %>%
acc accuracy(truth = lu_uf, pred_class) %>%
clean_names() %>%
mutate(estimate = round(estimate, digits = 2))
%>%
dt_aug select(truth = lu_uf, pred = pred_class) %>%
mutate(score = 1) %>%
group_by(truth, pred) %>%
summarise(
score = sum(score),
.groups = "drop"
%>%
) right_join(
expand(data = dt_aug, truth = lu_uf, pred = pred_class)
%>%
) replace_na(list(score = 0)) %>%
mutate(cond_score = if_else(truth == pred, score, NA_real_)) %>%
ggplot(aes(x = truth, y = pred)) +
geom_tile(
aes(fill = cond_score),
na.rm = TRUE
+
) geom_text(aes(label = score)) +
scale_y_discrete(limits = rev) +
scale_fill_gradient(
high = "#6f6f6f", low = "#C0C0C0",
na.value = "transparent",
guide = NULL
+
) labs(
title = glue("Global Accuracy: {acc$estimate}"),
x = "Prediction",
y = "Truth"
+
) theme_bw() +
theme(panel.grid = element_blank())
I’m going to optimize and train the Random Forest the same way I did with the Decision Tree.
<-
rf_mod_tune rand_forest(
mtry = tune(),
trees = 100,
min_n = tune()
%>%
) set_engine("ranger") %>%
set_mode("classification")
<-
rf_rec recipe(lu_uf ~ ., data = train_data_class) %>%
update_role(
region, catchment, transect, land_useclass,new_role = "ID"
%>%
) step_zv(all_predictors())
<-
rf_wflow workflow() %>%
add_model(rf_mod_tune) %>%
add_recipe(rf_rec)
<- bootstraps(train_data_class, times = 100)
rf_boots
<- rf_wflow %>%
rf_res tune_grid(
rf_boots,grid = 50,
control = control_grid(save_pred = TRUE),
metrics = metric_set(accuracy)
)
<- rf_res %>%
rf_best select_best(metric = c("accuracy"))
%>%
rf_res show_best()
mtry <int> | min_n <int> | .metric <chr> | .estimator <chr> | mean <dbl> | |
---|---|---|---|---|---|
6 | 29 | accuracy | multiclass | 0.5907425 | |
6 | 17 | accuracy | multiclass | 0.5894465 | |
8 | 33 | accuracy | multiclass | 0.5878731 | |
9 | 27 | accuracy | multiclass | 0.5873548 | |
7 | 22 | accuracy | multiclass | 0.5870907 |
Accuracy metrics did not differ much from the Decision Tree model, now lets apply the final fit with the whole train data and see the final results for the Random Forest model.
<-
rf_mod rand_forest(
mtry = rf_best$mtry,
trees = 100,
min_n = rf_best$min_n
%>%
) set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
<- rf_wflow %>%
rf_wflow update_model(rf_mod)
<-
rf_fit %>%
rf_wflow fit(data = train_data_class)
<-
rf_aug augment(rf_fit, test_data_class) %>%
clean_names()
rf_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = rf_best$mtry
trees = 100
min_n = rf_best$min_n
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
The overall accuracy is very similar from the Decision Tree, the same happened for the confusion matrix. The class with lowest accuracy is reforestation, which is also the LULC class with least observations.
<- rf_aug %>%
acc accuracy(truth = lu_uf, pred_class) %>%
clean_names() %>%
mutate(estimate = round(estimate, digits = 2))
%>%
rf_aug select(truth = lu_uf, pred = pred_class) %>%
mutate(score = 1) %>%
group_by(truth, pred) %>%
summarise(
score = sum(score),
.groups = "drop"
%>%
) right_join(
expand(data = rf_aug, truth = lu_uf, pred = pred_class)
%>%
) replace_na(list(score = 0)) %>%
mutate(cond_score = if_else(truth == pred, score, NA_real_)) %>%
ggplot(aes(x = truth, y = pred)) +
geom_tile(
aes(fill = cond_score),
na.rm = TRUE
+
) geom_text(aes(label = score)) +
scale_y_discrete(limits = rev) +
scale_fill_gradient(
high = "#6f6f6f", low = "#C0C0C0",
na.value = "transparent",
guide = NULL
+
) labs(
title = glue("Global Accuracy: {acc$estimate}"),
x = "Prediction",
y = "Truth"
+
) theme_bw() +
theme(panel.grid = element_blank())
This modelling exercise showed that the prediction models showed a decent result, while the classification models did not showed satisfactory results. This document will be improved over time, and more techniques of modelling will be added, so maybe the results will change over time.