diff --git a/papers/Gagnon_Kebe_Tahiri/main.tex b/papers/Gagnon_Kebe_Tahiri/main.tex
index dff48bd0ba..25ff8a2bf0 100644
--- a/papers/Gagnon_Kebe_Tahiri/main.tex
+++ b/papers/Gagnon_Kebe_Tahiri/main.tex
@@ -48,7 +48,7 @@ \subsection{Description of the data}
 \subsection{Data preprocessing}
 We used data from the article \citep{uhlir_adding_2021}, the IceAGE project, and related data from the BOLDSystem database, as described in \citep{uhlir_adding_2021}. Given the enormous variety of variables in these databases, we applied a selective reduction procedure. Variables with no variability (categorical data) were excluded from the study, for which all data were missing and were not linked to genetic sequences or spatial, environmental, and climatic variables. Out of the 495 available in the IceAGE dataset, we considered 62 specimens for which partial 16S rRNA mitochondrial gene sequences were available in the \citep{uhlir_adding_2021} article.
-Next, we calculated the variance ($S^2$) in RStudio Desktop 4.3.2 for each of the selected variables (numerical and categorical). This step aimed to eliminate variables with low variation, as they are unlikely to provide essential data for analysis. We set a variance threshold of ≤ 0.1 to exclude uninformative variables. The latter retains variables whose variability is reasonably sufficient for the analyses while rejecting those with little variation. Only water salinity was eliminated based on this criterion ($S^2_\text{Salinity} = 0.02146629 \text{practical salinity units}^2, \text{PSU}^2$). The formula (see Equation \ref{variance}) and code (\autoref{lst:variance}) used to calculate the variance of the final variables, available in the data file on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub}, are provided below:
+Next, we calculated the variance ($S^2$) in RStudio Desktop 4.3.2 for each of the selected variables (numerical and categorical). This step aimed to eliminate variables with low variation, as they are unlikely to provide essential data for analysis. We set a variance threshold of ≤ 0.1 to exclude uninformative variables. The latter retains variables whose variability is reasonably sufficient for the analyses while rejecting those with little variation. Only water salinity was eliminated based on this criterion ($S^2_\text{Salinity} = 0.02146629\,\text{practical salinity units}^2, \text{PSU}^2$). The formula (see Equation \ref{variance}) and code (\autoref{lst:variance}) used to calculate the variance of the final variables, available in the data file on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub}, are provided below:
 \begin{equation}\label{variance}
 S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}