Exploring: Linking land-use and land-cover transitions to their ecological impact in the Amazon

Author

Hugo Tameirão Seixas

Abstract
In this project I will explore a data collection containing information about land use and land cover (LULC), biodiversity, soil characteristics and carbon stocks in the Amazon biome. The data was published along a study that analysed the effects of LULC transitions over ecosystem services. Here, I will use this data to perform descriptive analysis and train models of prediction and classification.

Introduction

This exploration exercise will use data from the paper Linking land-use and land-cover transitions to their ecological impact in the Amazon (). They collected data from different areas inside different regions in the Amazon. The data holds information about Land Use and Land Cover (LULC), biodiversity, soil characteristics and carbon stocks. The data availability provides a good opportunity to practice data manipulation and modelling techniques.

From the available data, I will try two exercises:

  • Predict the species richness of an area with information about: land use and land cover, carbon stocks and soil properties?

  • Classify the land use and land cover with information about: species richness, carbon stocks and soil properties?

I will try different models to perform the predictions and classifications, and at the same time, will try to present the data and results in a intuitive manner. In this exercise, I chose to use Poisson Regression (; ), Random Forest (; ), and Decision Trees (; ).

Here are the packages used in this exercise:

Code
library(poissonreg)
library(vip)
library(sf)
library(geobr)
library(arrow)
library(curl)
library(glue)
library(fs)
library(ggiraph)
library(janitor)
library(patchwork)
library(tidyverse)
library(tidymodels)

I will also set some settings that will be used across the document.

Code
# These are the colors to represent LULC classes.
my_palette <-
  c(
    "#004d25", "#11823b", "#99AA38", "#721121", 
    "#A5402D", "#F15156", "#FFC07F", "#804D54"
  )

# These are the labels for the LULC classes
my_labels <-
  c(
    "Undisturbed Forest", "Logged Forest", 
    "Logged Burned Forest", "Old Secondary Forest", 
    "Young Secondary Forest", "Reforestation", 
    "Pasture", "Mechanized Agriculture"
  )
Warning

This project is only an exercise to apply techniques of data manipulation, visualization, exploratory analysis and modelling. It is not meant to make inferences about real world process ecological processes, it does not have theoretical importance.

Download and organize data

Before conducting the exploratory analysis and the modelling, I will download the data, which is available at Zenodo, and perform some data manipulation in order to store all data in a single table.

Download

I am going to download the data directly from the Zenodo repository, in which I simply provide the link to the data collection. After that I just need to extract the compressed files, move the files I’m interested to another folder, and delete the rest. By the end of the process, we can see that I have 25 files available for analysis. Some files are named with “STM” and “PGM”, which are both regions explored in the study, so many tables holding the similar information are divided by these two regions. We can observe files named with animals names, there are also tables named with “soil” and “carbon”, so these are the ones I am interested at. There is also a table containing environment information, which I’m also going to use.

Code
file_name <- "data/LULC_transitions-v1.1.zip"

# Download by providing the link to the data
curl_download(
  url = glue(
    "https://zenodo.org/record/6518744/files/",
    "cassioalencarnunes/LULC_transitions-v1.1.zip?download=1"
  ),
  destfile = file_name
)

# Extract files from compressed archive
unzip(
  zipfile = file_name,
  exdir = "data/"
)

# List files inside data folder
data_files <- 
  dir_ls(
    path = "data/cassioalencarnunes-LULC_transitions-c501bd5/Data/"
  )

# Move data files to another folder
file_move(path = data_files, new_path = "data/")

# Delete other files
dir_delete(path = "data/cassioalencarnunes-LULC_transitions-c501bd5/")

# Delete compressed archive
file_delete(path = "data/LULC_transitions-v1.1.zip")

# Some information about the files of interest
dir_info("data/") %>%
  select(path, size)
ABCDEFGHIJ0123456789
path
<fs_path>
size
<fs_bytes>
data/Ants_PGM.txt123.96K
data/Ants_STM.txt147.23K
data/Bees_PGM.txt19.76K
data/Birds.txt352.92K
data/Carbon_all_PGM.txt9.13K
data/Carbon_all_STM.txt7.43K
data/Dung_beetles.txt107.26K
data/Lianas_PGM.txt89.05K
data/Lianas_STM.txt79.01K
data/Saplings_PGM.txt298.53K

Organize

Now that I have all the data I need, lets start organizing them in a format that will be easier to explore and use to train and test the models. My objective here is to get all the tables I’m interested, and transform them into a single table. We can see that the tables are separated by themes, so I will manipulate them in bundles, going from one theme to another. I separated them into five categories: “Transects”, which holds information about the areas where data was collected; “Biodiversity, which are tables with data about species presence in the evaluated areas;”Environment”, with information about terrain, LULC, coordinates; “Soil”, that stores data about soil chemical characteristics; “Carbon”, which holds information about the biomass present in different places inside the areas.

Starting with the transects data, which holds data of the two regions, STM (municipalities of Santarém, Belterra, and Mojuí-dos-Campos) and PGM (municipality of Paragominas), it also have other information about the identification of the transects, and also respectively LULC class. The identification of each transect was performed by assigning the catchment where it is located (“catchment”), and the number of the transect inside the catchment (“transect”). The column “transect_code” is the union between both columns I have just mentioned, which creates a unique identification for each transect. This table will serve as a base to join all the other tables.

Code
tr_class <- 
  read_delim(
    file = "data/Transect_classification.csv",
    col_select = c(2:6), # Remove unnamed id column
    name_repair = ~ 
      str_replace(
        string = tolower(.), # I prefer everything in lowercase
        pattern = "-", # These traces are problematic 
        replacement = "_"
      ), 
    col_types = "_ccccc"
  )

head(tr_class)
ABCDEFGHIJ0123456789
region
<chr>
catchment
<chr>
transect
<chr>
transect_code
<chr>
lu_uf
<chr>
STM1031103_1AP
STM1032103_2SFyoung
STM1033103_3SFold
STM1034103_4PA
STM1035103_5SFyoung
STM1037103_7AP

There are a lot of tables containing biodiversity variables to be analysed. Since they are stored in many different tables, I will map the manipulations using the purrr package, in which we will apply a function to every biodiversity table. The function will open the tables, pivot them to a longer format (there are many columns indicating a certain specie), and than create a new column indicating what is the group that the species are part of. I also transformed the species number variable into species richness, where the count of individuals found in a single area does not matter, just the occurrence of it (as a binary variable). After that I just summed the number of species found, for each specie category, in each transect, to find the species richness.

By the end of the process I ended up with one table with four columns, containing the region and the transect where the data was collected, the specie category that was found in the respective transect, and the count of different species for each of these categories.

Code
# Create list of files that contains tables with biodiversity data
tables_list <- 
  dir_ls(path = "data/") %>%
  str_subset(
    pattern = regex(
      "ants|trees|bees|birds|dung_beetles|liana|saplings", 
      ignore_case = TRUE
    )
  )

# Merge tables in a single table
biodiversity <- 
  map_df(
    .x = tables_list,
    .f = function(t) {
    
      table <- 
        read_delim(
          file = t,
          name_repair = ~ tolower(.),
          col_select = !matches("aaaaa"), # Remove weird column
          col_types = "cccc"
        )

      table %>%
        pivot_longer( # Get the number of individuals for each specie
          cols = starts_with("sp"),
          names_to = "specie",
          values_to = "number"
        ) %>%
        mutate( # Create variable with the name of the species category
          bio_specie = str_remove_all(
            string = t, 
            pattern = "data/|_|STM|PGM|.txt"
          )
        )
    
    }
  )

# Transform the number of individuals into species richness
biodiversity <- biodiversity %>%
  mutate(number = if_else(number > 0, 1, 0)) %>% # Specie was observed or not
  group_by(region, transect_code, bio_specie) %>%
  summarise(bio_richness = sum(number, na.rm = FALSE), .groups = "drop")

head(biodiversity)
ABCDEFGHIJ0123456789
region
<chr>
transect_code
<chr>
bio_specie
<chr>
bio_richness
<dbl>
PGM100_1Ants17
PGM100_1Bees13
PGM100_1Birds43
PGM100_10Ants24
PGM100_10Bees10
PGM100_10Birds44

There was a weird column in the dung beetle table, named “aaaaaa” (screaming in despair?), I removed it just as the author of the paper.

The environment data table contained a large number of columns. There was data about the LULC classes, spatial location, terrain characteristics, soil texture, and many other columns in which I could not find their meaning. Since the data collection lacks metadata to provide information about variables, I decided to use only the ones I knew what they were, and that interested me. The manipulation of this table was very simple, I just renamed columns, fixed the name of a LULC class, and rounded some numbers. The columns with coordinate information informs that they are in Universal Transverse Mercator (UTM), which is a projected coordinate system, however there is no additional information about it, but I am going to keep it since I believe I can figure it out.

By the end of the process I ended up with a table of 11 variables about the environment of the data collection area. However, I’m still not sure if I’m actually going to use these data in my analysis.

Code
envr <-
  read_delim(
    file = "data/environment_all.txt",
    col_select = c(1:7, 37:38, 49:50),
    name_repair = ~ 
      str_replace(
        string = tolower(.), # I prefer everything in lowercase
        pattern = "-", # These traces are problematic 
        replacement = "_"
      ), 
    col_types = "cccc"
  ) %>%
  mutate(
    land_useclass = 
      str_replace( # Fix land use variable 
        string = land_useclass, 
        pattern = "Mechanisd_agriculture",
        replacement = "Mechanized_agriculture"
      )
  ) %>%
  rename(
    transect_code = transectcode,
    env_silt = silt_all, env_clay = clay_all,
    env_elev = elev_mean, env_slope = slope_mean,
  ) %>%
  mutate(
    across(c(env_silt:env_slope), ~ round(.x, digits = 2))
  )

head(envr)
ABCDEFGHIJ0123456789
region
<chr>
catchment
<chr>
transect
<chr>
transect_code
<chr>
utm_x
<dbl>
STM1031103_1730352
STM1032103_2729806
STM1033103_3729758
STM1034103_4732165
STM1035103_5733822
STM1037103_7729736

The variables of the soil tables are mostly about chemical properties of the soil, like acidity and fertility. This data don’t require a lot of modifications, I will just rename some variables, transform nitrogen from decimals to percentages, and will round some other variables. Since there are two tables, one for each region, I will map the data manipulation for both tables and merge the results using the purrr::map_df() function.

Code
soil <- 
  map_df(
    .x = dir_ls(path = "data/", regexp = "Soil"),
    .f = function(t) {
      
      table <- 
        read_delim(
          file = t,
          col_select = c(2:9, 11, 12), # I'm leaving sodium out
          name_repair = ~ tolower(.),
          col_types = "_cccc"
        )
      
    }
  ) %>%
  rename(
    soil_ph = ph, soil_n = nperc, soil_p = p, 
    soil_k = k, soil_ca = camg, soil_al = al
  ) %>%
  mutate(
    soil_n = soil_n * 100, # to percentage
    across(c(soil_ph, soil_p:soil_al), ~ round(.x, digits = 2))
  ) 

head(soil)
ABCDEFGHIJ0123456789
region
<chr>
catchment
<chr>
transect
<chr>
transect_code
<chr>
soil_ph
<dbl>
PGM441244_125.25
PGM441544_155.40
PGM44244_25.00
PGM44344_35.28
PGM44444_44.28
PGM44544_54.61

The last tables that I’m going to use are the ones containing data about carbon stocks. The carbon partition used in the published study followed the guidelines for separating carbon stocks by the Intergovernmental Panel on Climate Change (IPCC). The categories of soil are “above-ground carbon pool”, “dead wood pool”, “litter carbon pool”, and “soil carbon pool”. These are important variables to estimate an ecosystem biomass.

This table also did not require a lot of modification, just rename some variables, and round numbers. I will also map the manipulation functions for both files (one for each region), to transform it into one table.

Code
carbon <- 
  map_df(
    .x = dir_ls(path = "data/", regexp = "Carbon"),
    .f = function(t) {
      
      table <- 
        read_delim(
          file = t,
          name_repair = ~ tolower(.), # I prefer everything in lowercase
          col_types = "c"
        ) %>%
        mutate(region = str_extract(t, "PGM|STM"))
      
    }
  ) %>%
  rename(
    carbon_aboveground = aboveground_pool, carbon_litter = litter_pool,
    carbon_deadwood = dead_pool, carbon_soil = soil_pool_new
  ) %>%
  mutate(
    across(c(carbon_aboveground:carbon_soil), ~ round(.x, digits = 2))
  )

head(carbon)
ABCDEFGHIJ0123456789
transect_code
<chr>
carbon_aboveground
<dbl>
carbon_litter
<dbl>
carbon_deadwood
<dbl>
carbon_soil
<dbl>
100_10.005.584.1952.73
100_107.434.861.4642.45
100_26.864.608.4798.40
100_371.337.3730.3483.86
100_40.008.1721.8968.16
100_5113.9011.4639.4762.28

Merge

After organizing the tables, I can now merge them altogether into a unique table. I am going to make a series of join operations, merging the tables by the identification columns: “region”, “catchment”, “transect” and “transect_code.” After joining them I will also make some last modifications before exploring the data, I will change the LULC names more beautiful for the column “land_useclass”, and will transform “region” and “lu_uf” into ordered factors. I will also save the resulting table into a .parquet file, which is a compressed tabular format.

Code
full_data <- tr_class %>%
  inner_join(
    y = biodiversity, 
    by = c("region", "transect_code")
  ) %>%
  inner_join(
    y = soil, 
    by = c("region", "catchment", "transect", "transect_code")
  ) %>%
  inner_join(
    y = carbon, 
    by = c("region", "transect_code")
  ) %>%
  inner_join(
    y = envr, 
    by = c("region", "catchment", "transect", "transect_code")
  ) %>%
  mutate(
    land_useclass = str_replace_all( # Make LULC names more beautiful
      string = land_useclass, 
      pattern = "-|_",
      replacement = " "
    ),
    region = fct(region, levels = c("STM", "PGM")),
    lu_uf = fct(
      lu_uf, 
      levels = c(
        "UF", "LF", "BF", "LBF", "SFold", "SFyoung", "REF", "PA", "MA", 
        "AP", "FRU", "SHA"
      )
    )
  )

write_parquet(
  x = full_data,
  sink = "data/lulc_eco.parquet",
  version = "latest"
)

file_delete(
  path = dir_ls(
    "data/",
    glob = "*.parquet",
    invert = TRUE
  )
)

rm(list = setdiff(ls(), c("my_palette", "my_labels")))

Descriptive Analysis

Now we have a single table to start exploring! Lets begin with some descriptive analysis. My object here is to find the most general information of the variables I select for analysis, I will keep separating the analysis into categories of variables, as I have done above.

In total, the published study collected data from a total of 339 transects, 156 from STM and 183 from PGM. I want to know what are the LULC characteristics and distribution in these regions. For that, I will count the occurrence of each LULC, and plot their spatial distribution over the study areas.

Some LULC classes presented a very low frequency: Burned Forests (BF), Abandoned Plantations (AP), Fruiticulture (FRU) and Small Holder’s Agriculture (SHA), accumulated only 7, 4, 4 and 4 observations, respectively. Because of that, I’m not going to analyse the data for these for LULC classes in this project.

Code
data <- 
  read_parquet(
    file = "data/lulc_eco.parquet",
    col_select = c("region", "transect_code", "lu_uf")
  ) %>%
  distinct(region, transect_code, lu_uf, .keep_all = TRUE) %>%
  group_by(lu_uf) %>%
  count(lu_uf) %>%
  ungroup()

lu_subset <- data %>%
  arrange(desc(n)) %>%
  slice_max(n, n = 8) %>% # Filter LULC classes with more observations
  pull(lu_uf)

data %>%
  arrange(desc(n))
ABCDEFGHIJ0123456789
lu_uf
<fct>
n
<int>
PA72
LF68
LBF65
SFyoung33
MA26
SFold25
UF21
REF10
BF7
AP4

With the filtered data, I’m going to explore the spatial distribution of the transects and their LULC classes. But for that, I need to discover what is the coordinate system being used. I know that it is based in a Universal Transverse Mercator projection, the problem is that it is divided by zones. After some trials, I found that STM data is in region 21 and PGM in region 23. It is not possible to find more details about the coordinate system, as the ellipsoid used to represent the earth surface, so I used the most common for global data, the WGS84. Even if it may not be the correct one, it wont make a lot of difference in this case, since I’m not looking for extreme spatial accuracy for my map.

Now I can create my figure! I’m going to plot two maps (one for each region), along with bars representing the amount of observations for each LULC class. I am also going to add some interactive functionalities so you can hover with your mouse over the scales and points to find more information.

Code
# Get the boundary of the municipalities
muni <- 
  read_municipality(
    code_muni = "PA", 
    showProgress = FALSE
  ) %>% 
  filter(code_muni %in% c(1506807, 1501451, 1505502))

# Transform data into spatial format
data <- 
  map2_df( 
    .x = c("STM", "PGM"),
    .y = c("21", "23"), # UTM regions
    .f = function(r, z) {
      
      data <-
        read_parquet(
          file = "data/lulc_eco.parquet",
          col_select = c(
            "region", "transect_code", "lu_uf", "utm_x", "utm_y"
          )
        ) %>%
        filter(
          region == r, 
          lu_uf %in% lu_subset
        ) %>%
        distinct(region, transect_code, lu_uf, .keep_all = TRUE) %>%
        st_as_sf(
          coords = c("utm_x", "utm_y"), 
          crs = glue(
            "+proj=utm +zone={z} +south +datum=WGS84 +units=m +no_defs"
          )
        ) %>%
        st_transform(st_crs(muni))
      
    }
  )

maps <- 
  map2(
    .x = c("STM", "PGM"),
    .y = list(c(1506807, 1501451), c(1505502)),
    .f = function(r, m) {
      
      ggplot() +
        geom_sf(data = muni %>% filter(code_muni %in% m)) +
        geom_sf_interactive(
          data = data %>% filter(region == r),
          aes(
            color = lu_uf,
            tooltip = lu_uf,
            data_id = lu_uf,
            `data-id` = lu_uf
          ),
          size = 1,
          expand = FALSE,
          extra_interactive_params = "data-id"
        ) +
        scale_color_manual_interactive(
          values = my_palette,
          tooltip = function(breaks) { as.character(my_labels)},
          extra_interactive_params = "data-id",
          `data-id` = data %>% 
            arrange(data, lu_uf) %>% 
            distinct(lu_uf),
          data_id = function(breaks) { as.character(breaks) }
        ) +
        annotate(
          "text", 
          -Inf, Inf, 
          label = glue("bold({r})"), 
          parse = TRUE,
          hjust = 0, vjust = 1
        ) +
        labs(x = NULL, y = NULL, title = NULL) +
        scale_x_continuous(expand = c(0, 0)) +
        scale_y_continuous(expand = c(0, 0)) +
        guides(
          color = guide_legend_interactive(override.aes = list(size = 4))
        ) +
        theme_void(base_family = "sans") +
        theme(
          legend.title = element_blank(),
          panel.grid.major = element_blank(),
          legend.position = "bottom"
        )
      
    }
  )

bars <- 
  map(
    .x = c("STM", "PGM"),
    .f = function(r) {
      
      ggplot() +
        geom_col_interactive(
          data = data %>% 
            filter(region == r) %>%
            group_by(lu_uf) %>%
            count(),
          aes(
            data_id = lu_uf,
            `data-id` = lu_uf,
            x = n,
            y = lu_uf,
            tooltip = glue("Observations: {n}"),
            fill = lu_uf
          ),
          show.legend = FALSE
        ) +
        lims(x = c(0, max(pull(add_count(data, lu_uf), n)))) +
        labs(x = NULL, y = NULL, title = NULL) +
        scale_y_discrete(expand = c(0, 0)) +
        scale_fill_manual_interactive(
          values = my_palette,
          extra_interactive_params = "data-id",
          `data-id` = data %>% 
            arrange(data, lu_uf) %>% 
            distinct(lu_uf),
          data_id = function(breaks) { as.character(breaks) }
        ) +
        labs(y = "", x = "") +
        theme_void(base_family = "sans") +
        theme(
          text = element_text(size = 15, family = "Open Sans"),
          axis.text.y = element_blank(),
          legend.position = ""
        )
      
    }
  )
  
rm(muni)

girafe(
  code = 
    print(
      maps[[2]] + bars[[2]] + 
        plot_spacer() + plot_spacer() + 
        maps[[1]] + bars[[1]] + 
        plot_layout(
          ncol = 2, nrow = 3,
          guides = "collect",
          heights = c(2, 0.2, 2),
          width = c(3, 1)
        ) & 
        theme(legend.position = "bottom")
      ), 
  options = list(
    opts_tooltip(
      css = "background-color:white;color:black;padding:5px;border-radius:2px;"
    ),
    opts_hover_inv(css = "opacity:0.5;"),
    opts_hover(girafe_css("stroke:black;fill:black;")),
    opts_hover_key(girafe_css("stroke:black;fill:black")),
    opts_toolbar(saveaspng = FALSE),
    fonts = list(sans = "Open Sans")
  )
)
P G M S T M UF LF LBF SFold SFyoung REF PA MA

We can observe that STM region presented a more balanced distribution of LULC classes being analysed. The most common LULC classes are Pasture (PA), Logged Primary Forests (LF) Logged and Burned Primary Forests (LBF). For both regions, Undisturbed Forests are concentrated to the West, and landscapes with intense human activities are concentrated to the East. So there is some kind of transition pattern, from undisturbed to disturbed landscapes.

And how about the species that live in these areas? We can imagine that biodiversity of undisturbed landscape are likely higher than of disturbed ones. I want to try to observe patterns of species richness distributed over the LULC classes. For that I’m going to create a matrix plot, showing the median of species richness in each LULC class (I’m sure that taking the median of species richness is not a good practice, but I’m doing it here just to explore some patterns).

Code
data <- 
  read_parquet(
    file = "data/lulc_eco.parquet",
    col_select = c(
      "region", "transect_code", "lu_uf",
      "bio_richness", "bio_specie"
    )
  ) %>%
  filter(lu_uf %in% lu_subset) %>%
  group_by(region, lu_uf, bio_specie) %>%
  summarise(bio_richness = median(bio_richness), .groups = "keep") %>%
  mutate(
    id = cur_group_id(),
    richness_label = glue(
      "{lu_uf} \n",
      "{bio_specie} \n",
      "Number of species: {round(bio_richness)}"
    )
  ) %>%
  drop_na()

plot <- ggplot(data) +
  facet_wrap(vars(region)) +
  geom_tile_interactive(
    aes(
      data_id = id,
      x = bio_specie,
      y = lu_uf,
      fill = bio_richness,
      tooltip = richness_label
    )
  ) +
  scale_fill_gradient(low = "#2c2929", high = "#C0C0C0") +
  labs(y = "", x = "") +
  theme_bw(base_family = "sans") +
  theme(
    text = element_text(size = 15, family = "Open Sans"),
    axis.text.y = element_text(
      hjust = 1, vjust = 0.5, margin = margin(r = 20), angle = 45
    ),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    legend.position = "",
    panel.grid = element_blank()
  )

girafe(
  ggobj = plot, 
  options = list(
    opts_tooltip(
      css = "background-color:white;color:black;padding:5px;border-radius:2px;"
    ),
    opts_hover_inv(css = "opacity:0.5;"),
    opts_hover(
      girafe_css(
        css = "stroke:black;cursor:pointer;r:5px;",
      )
    ),
    opts_toolbar(saveaspng = FALSE),
    fonts = list(sans = "Open Sans")
  )
)
STM PGM Ants Bees Birds Dungbeetles Lianas Saplings Trees Ants Bees Birds Dungbeetles Lianas Saplings Trees UF LF LBF SFold SFyoung REF PA MA

We can observe that undisturbed landscapes presents higher species richness, the Mechanized Agriculture (MA) presents the lowest values of specie richness. Trees, Saplings and Birds are the most diverse categories of species. Both regions present very similar patterns, so it seems that species richness are not different between regions. Unfortunately, bees species were not recorded for the STM region.

Now lets take a look from another perspective. I’m going to plot an histogram of the overall specie richness of each transect, grouped by regions and by more general LULC classes: Mechanized Agriculture (MA), Pasture (PA), Primary Forests (PF) and Secondary Forests (SF).

Code
data <- 
  read_parquet(
    file = "data/lulc_eco.parquet",
    col_select = c(
      "region", "transect_code", "land_useclass",
      "bio_richness", "bio_specie"
    )
  ) %>%
  group_by(region, land_useclass, transect_code) %>%
  summarise(
    bio_richness = sum(bio_richness, na.rm = TRUE), .groups = "drop_last"
  ) %>%
  mutate(
    id = cur_group_id(),
    label = glue(
      "Total Observations: {n()}",
      "\n",
      "Mean: {round(mean(bio_richness))}",
      "\n",
      "Median: {round(median(bio_richness))}",
      "\n",
      "Variance: {round(var(bio_richness))}"
    )
  ) %>%
  ungroup() %>%
  add_count(land_useclass) %>%
  filter(
    bio_richness > 0,
    dense_rank(desc(n)) <= 4 # Filter the four LULC classes with most obs
  ) %>%
  select(-n) %>%
  drop_na()

plot <- 
  ggplot(data) +
    facet_grid_interactive(
      interactive_on = "both",
      region ~ land_useclass,
      labeller = labeller(
        land_useclass = labeller_interactive(
          aes(
            tooltip = glue("{land_useclass}"),
            label = c("MA", "PA", "PF", "SF")
          )
        )
      )
    ) +
    geom_histogram_interactive(
      aes(
        x = bio_richness,
        tooltip = label,
        data_id = id
      ),
      binwidth = 15
    ) +
  labs(x = "Number of Species", y = "Count") +
    theme_bw(base_family = "sans") +
    theme(
      text = element_text(size = 15, family = "Open Sans"),
      axis.text.y = element_text(
        hjust = 1, vjust = 0.5, margin = margin(r = 5)
      ),
      axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
      strip.text.x = element_text_interactive()
    )

girafe(
  ggobj = plot, 
  options = list(
    opts_tooltip(
      css = "background-color:white;color:black;padding:5px;border-radius:2px;"
    ),
    opts_hover_inv(css = "opacity:0.5;"),
    opts_hover(
      girafe_css(
        css = "stroke:black;cursor:pointer;r:5px;",
      )
    ),
    opts_toolbar(saveaspng = FALSE),
    fonts = list(sans = "Open Sans")
  )
)
MA PA PF SF STM PGM 0 100 200 300 0 100 200 300 0 100 200 300 0 100 200 300 0 5 10 15 20 0 5 10 15 20 Number of Species Count

The histograms confirms that difference of species richness between both regions are not apparent, but note that this doesn’t mean that they are the same! Maybe the species inside each region are different, maybe they hold different number of individuals population, I imagine that both these measurements are important too. We can also observe that the variance of specie richness are much higher for PF and SF than PA and MA, which may be caused because PF and SF present more diverse conditions, some are less disturbed, others are more disturbed.

So by now we already know that LULC classes and biodiversity are related in some manner. Lets explore if chemical properties of soils are also related to LULC classes. For that I’m simply going to plot the soil variables distributions grouped by general LULC classes, it is a very simple, but very informative type of visualization.

Code
data <- 
  read_parquet(
    file = "data/lulc_eco.parquet",
    col_select = c(
      "region", "transect_code", "land_useclass", "soil_ph":"soil_al"
    )
  ) %>%
  distinct(transect_code, .keep_all = TRUE) %>%
  pivot_longer(
    cols = soil_ph:soil_al,
    names_to = "soil_props",
    values_to = "value"
  ) %>%
  drop_na() %>%
  add_count(land_useclass) %>%
  filter(
    dense_rank(desc(n)) <= 4 # Filter the four LULC classes with most obs
  ) %>%
  select(-n)

plot <- 
  ggplot(data) +
  facet_wrap(
    vars(soil_props), 
    scales = "free_x",
    labeller = labeller(
      soil_props = c(
        soil_al = "Al", soil_ca = "Ca", 
        soil_k = "K", soil_n = "N", 
        soil_p = "P", soil_ph = "pH"
      )
    )
  ) +
  geom_violin(
    aes(
      y = land_useclass, 
      x = value
    )
  ) + 
  scale_x_continuous(breaks = pretty_breaks(4)) +
  theme_bw(base_family = "sans") +
  theme(
    axis.title = element_blank(),
    text = element_text(size = 15, family = "Open Sans"),
    axis.text.y = element_text(hjust = 0, margin = margin(r = 30))
  )

girafe(
  ggobj = plot, 
  options = list(
    opts_tooltip(
      css = "background-color:white;color:black;padding:5px;border-radius:2px;"
    ),
    opts_hover_inv(css = "opacity:0.5;"),
    opts_hover(
      girafe_css(
        css = "stroke:black;cursor:pointer;r:5px;",
      )
    ),
    opts_toolbar(saveaspng = FALSE),
    fonts = list(sans = "Open Sans")
  )
)
N P pH Al Ca K 5 10 15 20 25 5 10 15 4 5 6 0 1 2 3 2 4 6 50 100 Mechanized agriculture Pasture Primary forest Secondary forest Mechanized agriculture Pasture Primary forest Secondary forest

The most noticeable pattern in these distributions, at least in my opinion, is the differences between forests and pasture/agriculture. This is very obvious, the effect of human activities over the soil is caused not just by the land cover, but by the practices used in the land uses. When areas are going to be used for livestock and crops, Calcium (Ca) is applied in order to “eliminate” Aluminium (Al) (which is toxic for roots of most of the plants) and to elevate pH (helps plants to absorb nutrients from soil). It is easy to see these differences in the distributions. There are also differences in nutrients that are important to plants Nitrogen (N), Phosphorus (P) and Potassium (K). They are also in higher concentrations at Mechanized Agriculture (MA) and Pasture (PA), since these classes are normally under exogenous fertilization regimes. However, we can also observe that forests present more extreme values of nutrients concentration.

My question now is if soil chemical characteristics would help a prediction model, that aims to predict species richness values. I will explore this checking if they are correlated, which they are probably, since both are related to LULC classes. However, maybe the amount of variation of soil properties may help improving the predictions. Lets take a look.

Code
data <- 
  read_parquet(
    file = "data/lulc_eco.parquet",
    col_select = c(
      "region", "transect_code", "lu_uf",
      "bio_specie", "bio_richness", 
      "soil_ph":"soil_al"
    )
  ) %>%
  filter(lu_uf %in% lu_subset) %>%
  group_by(region, lu_uf, transect_code) %>%
  summarise(
    bio_richness = sum(bio_richness, na.rm = TRUE), 
    across(soil_ph:soil_al, ~ max(.x)),
    .groups = "drop"
  ) %>%
  pivot_longer(
    soil_ph:soil_al,
    names_to = "soil_props"
  ) %>%
  drop_na()

plot <- 
  ggplot(data) +
  facet_wrap(
    vars(soil_props), 
    scales = "free_x",
    labeller = labeller(
      soil_props = c(
        soil_al = "Al", soil_ca = "Ca", 
        soil_k = "K", soil_n = "N", 
        soil_p = "P", soil_ph = "pH"
      )
    )
  ) +
  geom_point(aes(x = value, y = bio_richness, color = lu_uf)) +
  scale_color_manual(values = my_palette) +
  scale_x_continuous(breaks = pretty_breaks(4)) +
  labs(y = "Specie Richness", x = NULL) +
  theme_bw(base_family = "sans") +
  theme(
    text = element_text(size = 15, family = "Open Sans"),
    legend.title = element_blank()
  )
  

plot

When analyzing the whole data, it is difficult to see correlation, maybe just some species categories might have correlation with soil chemical characteristics (such as lianas, saplings and trees). Also it is not surprising that these variables dispersion in the scatter plot are specially flat at Pastures (PA) and Mechanized Agriculture (MA), since these areas tend to be much more homogeneous than natural landscapes. Remember that the histograms of biodiversity showed lower variation of species richness for PA and MA. Since the relation between soil chemical properties and biodiversity is so unclear and messy, I’ll evaluate if I’m going to use them to train my prediction models.

For the carbon data, I’m also going to explore the distribution of carbon stocks divided by general LULC classes. First, I will see the distributions of all the carbon stocks together. It is possible to observe that Primary Forests have the highest amount of biomass, followed by secondary forests. This is expected specially due to the above-ground and dead wood biomass!

Code
data <- 
  read_parquet(
    file = "data/lulc_eco.parquet",
    col_select = c(
      "region", "transect_code", "land_useclass", 
      "carbon_aboveground":"carbon_soil"
    )
  ) %>%
  distinct(transect_code, .keep_all = TRUE) %>%
  drop_na() %>%
  pivot_longer(
    cols = carbon_aboveground:carbon_soil,
    names_to = "carbon",
    values_to = "value"
  ) %>%
  add_count(land_useclass) %>%
  filter(
    dense_rank(desc(n)) <= 4 # Filter the four LULC classes with most obs
  ) %>%
  select(-n) %>%
  group_by(region, land_useclass, transect_code) %>%
  summarise(value = sum(value), .groups = "drop")

ggplot(data) +
  geom_violin(
    aes(
      y = land_useclass, 
      x = value
    )
  ) + 
  scale_x_continuous(breaks = pretty_breaks(4)) +
  theme_bw(base_family = "sans") +
  theme(
    axis.title = element_blank(),
    text = element_text(size = 15, family = "Open Sans"),
    axis.text.y = element_text(hjust = 0, margin = margin(r = 30))
  )

So lets repeat the same figure, but analyzing the Litter and Soil carbon stocks separately. We can see that litter biomass is the lowest for Mechanized Agriculture (MA), which is also expected, crops like soybean, rice, cotton do not produce a lot of litter, and many systems apply soil tillage. But for soil carbon the story is different, it is difficult to see big differences between the LULC classes. This is not new information, estimating effects of LULC classes over soil carbon is a difficult task, but we are not going to discuss this matter in this document.

Code
data <- 
  read_parquet(
    file = "data/lulc_eco.parquet",
    col_select = c(
      "region", "transect_code", "land_useclass", 
      "carbon_litter", "carbon_soil"
    )
  ) %>%
  distinct(transect_code, .keep_all = TRUE) %>%
  drop_na() %>%
  pivot_longer(
    cols = carbon_litter:carbon_soil,
    names_to = "carbon",
    values_to = "value"
  ) %>%
  add_count(land_useclass) %>%
  filter(
    dense_rank(desc(n)) <= 4 # Filter the four LULC classes with most obs
  ) %>%
  select(-n)

ggplot(data) +
  facet_wrap(
    vars(carbon), 
    scales = "free_x",
    labeller = labeller(
      carbon = c(carbon_litter = "Litter", carbon_soil = "Soil")
    )
  ) +
  geom_violin(
    aes(
      y = land_useclass, 
      x = value
    )
  ) + 
  scale_x_continuous(breaks = pretty_breaks(4)) +
  theme_bw(base_family = "sans") +
  theme(
    axis.title = element_blank(),
    text = element_text(size = 15, family = "Open Sans"),
    axis.text.y = element_text(hjust = 0, margin = margin(r = 30))
  )

Filter data

After exploring a data a little, I already have some knowledge about what I’m dealing with. With this information, I’m going to filter my data before starting to train my models. Regarding the variables, I’m going to drop some chemical soil characteristics (Ca, K, N, P), and for carbon stocks, I will drop Dead Wood and Soil carbon variables I will also only keep the rows that belongs to the 8 LULC classes with most observations, and I will remove Specie Richness values for Bees, which would cause a serious imbalance in the data.

Code
data <- 
  read_parquet(
    file = "data/lulc_eco.parquet",
    col_select = c(
      "region", "catchment", "transect", "land_useclass", "lu_uf",
      "bio_richness", "bio_specie",
      "soil_ph", "soil_al",
      "carbon_aboveground", "carbon_litter"
    )
  ) %>%
  filter(
    lu_uf %in% lu_subset,
    bio_specie != "Bees"
  ) %>%
  drop_na()

write_parquet(
  x = data,
  sink = "data/filtered_lulc_eco.parquet",
  version = "latest"
)

Modelling

Now its time to start the modelling exercise. I’m going to train and test four models, two prediction models (Poisson Regression and Random Forest), and two classification models (Decision Tree and Random Forest).

I will start the exercise preparing the train and test data. Normally it is performed a random sampling separating 70% of the data for training, and 30% for testing (proportions vary). In this exercise I will take a different approach, I will use the STM region to train my models (n = 156), and the PGM region to test them (n = 183). I will choose this strategy to keep both splits as independent as they could be, using a “external” data set to test my models. I choose the STM region as train split because it is more balanced than the PGM region (regarding the LULC classes).

Code
data_pred <- 
  read_parquet(file = "data/filtered_lulc_eco.parquet") %>%  
  drop_na() %>%
  mutate(lu_uf = droplevels(lu_uf))

#set.seed(821715)

#data_split <- initial_split(data, prop = 2/4)

train_data_pred <- filter(data_pred, region == "STM")
test_data_pred  <- filter(data_pred, region == "PGM")

data_class <- data_pred %>%
  pivot_wider(
    names_from = bio_specie,
    values_from = bio_richness
  ) %>%
  drop_na()

train_data_class <- filter(data_class, region == "STM")
test_data_class  <- filter(data_class, region == "PGM")

Poisson Regression

The first model will be the Poisson Regression, a Generalized Linear Model (GLM), which is appropriate to predict count data (count of number of different species in one area!). Before training the model, I will set some pre processing steps: remove near zero variance variables, and create dummy variables to represent nominal predictors.

Code
biod_rec <- 
  recipe(bio_richness ~ ., data = train_data_pred) %>%
  update_role(
    region, catchment, transect, land_useclass,
    new_role = "ID"
  ) %>%
  step_nzv(all_predictors()) %>%
  step_dummy(all_nominal_predictors())

biod_mod <-
  poisson_reg() %>%
  set_engine("glm")

biod_wflow <- 
  workflow() %>% 
  add_model(biod_mod) %>% 
  add_recipe(biod_rec)

biod_fit <- 
  biod_wflow %>%
  fit(data = train_data_pred)

biod_aug <- 
  augment(biod_fit, test_data_pred) %>%
  clean_names()

biod_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: poisson_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_nzv()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Poisson Regression Model Specification (regression)

Computational engine: glm 

The Poisson Regression, as expected, did not predicted any negative numbers, which could happen in case we tried to predict values using Linear Least Squares Regression. This model do not present any parameter to be optimized before fitting the final model. The MAE points out that the mean absolute error is around 10 species, which can be a good performance for some categories of species, or bad for others (some categories of species are more abundant than others).

Code
mod_met <- biod_aug %>% 
  metrics(truth = bio_richness, estimate = pred) %>%
  clean_names() %>%
  mutate(estimate = round(estimate, digits = 2))

biod_aug %>%
  ggplot() +
  geom_point(
    aes(
      x = bio_richness,
      y = pred,
      color = lu_uf
    )
  ) +
  scale_color_manual(values = my_palette) +
  annotate(
    geom = "text",
    x = 0, y = c(80, 70, 60),
    hjust = 0,
    label = c(
      glue("bold({toupper(mod_met$metric[1])})- {mod_met$estimate[1]}"),
      glue("bold({toupper(mod_met$metric[3])})- {mod_met$estimate[3]}"),
      glue("bold({toupper(mod_met$metric[2])})- {mod_met$estimate[2]}")
    ),
    parse = TRUE,
    size = 5
  ) +
  labs(x = "Observations", y = "Estimations") +
  geom_abline(slope = 1, intercept = 0) +
  coord_equal() +
  theme_bw() +
  theme(
    legend.title = element_blank()
  )

Random Forest

Now lets train a Random Forest model, and we will check if it will be able to perform better than the Poisson Regression in the task to predict Specie Richness. The Random Forest is known to be very robust to imbalances, it does not require creation of dummies for ordinal data, performing well with categorized data. Before performing the final fit to be tested, I will estimate three parameters of the model, in order to find a good combination for using in the final fit. I will perform this optimization using V-Fold Cross-Validation in the training set.

Code
biod_mod <-
  rand_forest(
    mtry = tune(),
    trees = 100,
    min_n = tune()
  ) %>%
  set_engine("ranger") %>%
  set_mode("regression")

biod_rec <- 
  recipe(bio_richness ~ ., data = train_data_pred) %>%
  update_role(
    region, catchment, transect, land_useclass,
    new_role = "ID"
  ) %>%
  step_nzv(all_predictors())

biod_wflow <- 
  workflow() %>% 
  add_model(biod_mod) %>% 
  add_recipe(biod_rec)

biod_fold <- vfold_cv(train_data_pred, v = 5, repeats = 100)

biod_res <- biod_wflow %>% 
  tune_grid(
    biod_fold,
    grid = 50,
    control = control_grid(save_pred = TRUE),
    metrics = metric_set(rmse)
  )

biod_best <- biod_res %>% 
  select_best(metric = "rmse")

biod_res %>%
  show_best() %>%
  clean_names()
ABCDEFGHIJ0123456789
mtry
<int>
min_n
<int>
metric
<chr>
estimator
<chr>
mean
<dbl>
n
<int>
std_err
<dbl>
52rmsestandard8.0173475000.02706415
54rmsestandard8.0219775000.02671044
46rmsestandard8.0242015000.02659441
510rmsestandard8.0312465000.02702575
513rmsestandard8.0416385000.02742045

We found the parameters that resulted in the models with lower RMSE values, which punishes bigger error values. Now that we know the parameters values to use, lets update the model and perform the final fit.

Code
biod_mod <-
  rand_forest(
    mtry = biod_best$mtry,
    trees = 100,
    min_n = biod_best$min_n
  ) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

biod_wflow <- biod_wflow %>%
  update_model(biod_mod)

biod_fit <- 
  biod_wflow %>%
  fit(data = train_data_pred)

biod_aug <- 
  augment(biod_fit, test_data_pred) %>%
  clean_names()

biod_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_nzv()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = biod_best$mtry
  trees = 100
  min_n = biod_best$min_n

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 

The Random Forest performed better than the Poisson Regression in predicting Specie Richness. The MAE dropped by more than 2 species count. The dispersion of points is also less clustered than what we observed from the Poisson Regression.

Code
mod_met <- biod_aug %>% 
  metrics(truth = bio_richness, estimate = pred) %>%
  clean_names() %>%
  mutate(estimate = round(estimate, digits = 2))

biod_aug %>%
  ggplot() +
  geom_point(
    aes(
      x = bio_richness,
      y = pred,
      color = lu_uf
    )
  ) +
  scale_color_manual(values = my_palette) +
  annotate(
    geom = "text",
    x = 0, y = c(90, 80, 70),
    hjust = 0,
    label = c(
      glue("bold({toupper(mod_met$metric[1])})- {mod_met$estimate[1]}"),
      glue("bold({toupper(mod_met$metric[3])})- {mod_met$estimate[3]}"),
      glue("bold({toupper(mod_met$metric[2])})- {mod_met$estimate[2]}")
    ),
    parse = TRUE,
    size = 5
  ) +
  labs(x = "Observations", y = "Estimations") +
  geom_abline(intercept = 0, slope = 1) +
  coord_equal() +
  theme_bw() +
  theme(
    legend.title = element_blank()
  )

If we check the importance of each variable in predicting Species Richness, we will see that obviously the specie category was the most important one. After that, we have the above-ground biomass and LULC classes. As expected the soil chemical characteristics were the least important variables.

Code
var_importance <- biod_fit %>%
  extract_fit_parsnip() %>%
  vi() %>%
  clean_names()

var_importance %>%
  ggplot() + 
  geom_col(
    aes(x = importance, y = reorder(variable, importance)),
    fill = "transparent",
    color = "#000000"
  ) +
  labs(x = "Importance", y = NULL) +
  theme_bw()

Decision Tree

The classification will aim to correctly identify the LULC classes using the biodiversity, soil and carbon data. I will start training a Decision Tree algorithm, and than another Random Forest (but this time for classification). For the Decision Tree, we will need to center and scale numerical predictors, as also create dummy variables and filter near zero variance variables. I will also optimize three parameters of my model specification, but this time I will use Bootstraps.

Code
dt_mod_tune <-
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %>% 
  set_engine("rpart") %>%
  set_mode("classification")

dt_rec <- 
  recipe(lu_uf ~ ., data = train_data_class) %>%
  update_role(
    region, catchment, transect, land_useclass,
    new_role = "ID"
  ) %>%
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) 

dt_wflow <- 
  workflow() %>% 
  add_model(dt_mod_tune) %>% 
  add_recipe(dt_rec)

dt_boots <- bootstraps(train_data_class, times = 100)

dt_res <- dt_wflow %>%
  tune_grid(
    dt_boots,
    grid = 50,
    control = control_grid(save_pred = TRUE),
    metrics = metric_set(accuracy)
  )

dt_best <- dt_res %>% 
  select_best(metric = "accuracy")

dt_res %>%
  show_best() %>%
  clean_names()
ABCDEFGHIJ0123456789
cost_complexity
<dbl>
tree_depth
<int>
min_n
<int>
metric
<chr>
estimator
<chr>
3.760749e-021219accuracymulticlass
1.082650e-02323accuracymulticlass
5.727236e-02427accuracymulticlass
3.625256e-07420accuracymulticlass
1.627550e-03321accuracymulticlass

We selected the best model, which did not presented a very high accuracy for the validation set. Lets see if these results holds when testing the model with external data.

Code
dt_mod <-
  decision_tree(
    cost_complexity = dt_best$cost_complexity,
    tree_depth = dt_best$tree_depth,
    min_n = dt_best$min_n
  ) %>%
  set_engine("rpart") %>%
  set_mode("classification")

dt_wflow <- dt_wflow %>%
  update_model(dt_mod)

dt_fit <- 
  dt_wflow %>%
  fit(data = train_data_class)

dt_aug <- 
  augment(dt_fit, test_data_class) %>%
  clean_names()

dt_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_zv()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = dt_best$cost_complexity
  tree_depth = dt_best$tree_depth
  min_n = dt_best$min_n

Computational engine: rpart 

The performance of the Decision Trees was not satisfactory, with a low value of global accuracy. However, It was expected that classifying such refined classification of forests states would be challenging. The interesting is that the model made a decent job at separating Primary Forests (UF, LF, LBF) from Secondary Forests (SFold and SFyoung) and Mechanized Agriculture (MA) and Pasture (PA).

Code
acc <- dt_aug %>% 
  accuracy(truth = lu_uf, pred_class) %>%
  clean_names() %>%
  mutate(estimate = round(estimate, digits = 2))

dt_aug %>%
  select(truth = lu_uf, pred = pred_class) %>% 
  mutate(score = 1) %>%
  group_by(truth, pred) %>% 
  summarise(
    score = sum(score), 
    .groups = "drop"
  ) %>%
  right_join(
    expand(data = dt_aug, truth = lu_uf, pred = pred_class)
  ) %>%
  replace_na(list(score = 0)) %>%
  mutate(cond_score = if_else(truth == pred, score, NA_real_)) %>%
  ggplot(aes(x = truth, y = pred)) +
  geom_tile(
    aes(fill = cond_score),
    na.rm = TRUE
  ) +
  geom_text(aes(label = score)) +
  scale_y_discrete(limits = rev) +
  scale_fill_gradient(
    high = "#6f6f6f", low = "#C0C0C0",
    na.value = "transparent",
    guide = NULL
  ) +
  labs(
    title = glue("Global Accuracy: {acc$estimate}"),
    x = "Prediction",
    y = "Truth"
  ) +
  theme_bw() +
  theme(panel.grid = element_blank())

Random forest

I’m going to optimize and train the Random Forest the same way I did with the Decision Tree.

Code
rf_mod_tune <-
  rand_forest(
    mtry = tune(),
    trees = 100,
    min_n = tune()
  ) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_rec <- 
  recipe(lu_uf ~ ., data = train_data_class) %>%
  update_role(
    region, catchment, transect, land_useclass,
    new_role = "ID"
  ) %>%
  step_zv(all_predictors())

rf_wflow <- 
  workflow() %>% 
  add_model(rf_mod_tune) %>% 
  add_recipe(rf_rec)

rf_boots <- bootstraps(train_data_class, times = 100)

rf_res <- rf_wflow %>%
  tune_grid(
    rf_boots,
    grid = 50,
    control = control_grid(save_pred = TRUE),
    metrics = metric_set(accuracy)
  )

rf_best <- rf_res %>% 
  select_best(metric = c("accuracy"))

rf_res %>%
  show_best()
ABCDEFGHIJ0123456789
mtry
<int>
min_n
<int>
.metric
<chr>
.estimator
<chr>
mean
<dbl>
629accuracymulticlass0.5907425
617accuracymulticlass0.5894465
833accuracymulticlass0.5878731
927accuracymulticlass0.5873548
722accuracymulticlass0.5870907

Accuracy metrics did not differ much from the Decision Tree model, now lets apply the final fit with the whole train data and see the final results for the Random Forest model.

Code
rf_mod <-
  rand_forest(
    mtry = rf_best$mtry,
    trees = 100,
    min_n = rf_best$min_n
  ) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_wflow <- rf_wflow %>%
  update_model(rf_mod)

rf_fit <- 
  rf_wflow %>%
  fit(data = train_data_class)

rf_aug <- 
  augment(rf_fit, test_data_class) %>%
  clean_names()

rf_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_zv()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = rf_best$mtry
  trees = 100
  min_n = rf_best$min_n

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 

The overall accuracy is very similar from the Decision Tree, the same happened for the confusion matrix. The class with lowest accuracy is reforestation, which is also the LULC class with least observations.

Code
acc <- rf_aug %>% 
  accuracy(truth = lu_uf, pred_class) %>%
  clean_names() %>%
  mutate(estimate = round(estimate, digits = 2))

rf_aug %>%
  select(truth = lu_uf, pred = pred_class) %>% 
  mutate(score = 1) %>%
  group_by(truth, pred) %>% 
  summarise(
    score = sum(score), 
    .groups = "drop"
  ) %>%
  right_join(
    expand(data = rf_aug, truth = lu_uf, pred = pred_class)
  ) %>%
  replace_na(list(score = 0)) %>%
  mutate(cond_score = if_else(truth == pred, score, NA_real_)) %>%
  ggplot(aes(x = truth, y = pred)) +
  geom_tile(
    aes(fill = cond_score),
    na.rm = TRUE
  ) +
  geom_text(aes(label = score)) +
  scale_y_discrete(limits = rev) +
  scale_fill_gradient(
    high = "#6f6f6f", low = "#C0C0C0",
    na.value = "transparent",
    guide = NULL
  ) +
  labs(
    title = glue("Global Accuracy: {acc$estimate}"),
    x = "Prediction",
    y = "Truth"
  ) +
  theme_bw() +
  theme(panel.grid = element_blank())

Final remarks

This modelling exercise showed that the prediction models showed a decent result, while the classification models did not showed satisfactory results. This document will be improved over time, and more techniques of modelling will be added, so maybe the results will change over time.

References

Biau, Gérard, and Erwan Scornet. 2016. “A Random Forest Guided Tour.” TEST 25 (2): 197–227. https://doi.org/10.1007/s11749-016-0481-7.
Breiman, Leo. 2001. Machine Learning 45 (1): 5–32. https://doi.org/10.1023/a:1010933404324.
Coxe, Stefany, Stephen G. West, and Leona S. Aiken. 2009. “The Analysis of Count Data: A Gentle Introduction to Poisson Regression and Its Alternatives.” Journal of Personality Assessment 91 (2): 121–36. https://doi.org/10.1080/00223890802634175.
Hayat, Matthew J., and Melinda Higgins. 2014. “Understanding Poisson Regression.” Journal of Nursing Education 53 (4): 207–15. https://doi.org/10.3928/01484834-20140325-04.
Kingsford, Carl, and Steven L Salzberg. 2008. “What Are Decision Trees?” Nature Biotechnology 26 (9): 1011–13. https://doi.org/10.1038/nbt0908-1011.
Nunes, Cássio Alencar, Erika Berenguer, Filipe França, Joice Ferreira, Alexander C. Lees, Julio Louzada, Emma J. Sayer, et al. 2022. “Linking Land-Use and Land-Cover Transitions to Their Ecological Impact in the Amazon.” Proceedings of the National Academy of Sciences 119 (27). https://doi.org/10.1073/pnas.2202310119.
Ville, Barry de. 2013. “Decision Trees.” Wiley Interdisciplinary Reviews: Computational Statistics 5 (6): 448–55. https://doi.org/10.1002/wics.1278.

Reuse

CC BY 4.0