diff --git a/DESCRIPTION b/DESCRIPTION index 62a4614..3b91fd5 100755 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: miic Title: Learning Causal or Non-Causal Graphical Models Using Information Theory -Version: 2.0.0 +Version: 2.0.1 Authors@R: c(person(given = "Franck", family = "Simon", @@ -44,23 +44,35 @@ Authors@R: family = "Isambert", role = "aut", email = "herve.isambert@curie.fr")) -Description: We report an information-theoretic method which learns a large - class of causal or non-causal graphical models from purely observational - data, while including the effects of unobserved latent variables, commonly - found in many datasets. Starting from a complete graph, the method - iteratively removes dispensable edges, by uncovering significant information - contributions from indirect paths, and assesses edge-specific confidences - from randomization of available data. The remaining edges are then oriented - based on the signature of causality in observational data. This approach can - be applied on a wide range of datasets and provide new biological insights - on regulatory networks from single cell expression data, genomic alterations - during tumor development and co-evolving residues in protein structures. - Since the version 2.0, MIIC can in addition process stationary time series - to unveil temporal causal graphs. +Description: MIIC (Multivariate Information-based Inductive Causation) is a + causal discovery method, based on information theory principles, which + learns a large class of causal or non-causal graphical models from purely + observational data, while including the effects of unobserved latent + variables. Starting from a complete graph, the method iteratively removes + dispensable edges, by uncovering significant information contributions from + indirect paths, and assesses edge-specific confidences from randomization + of available data. 
The remaining edges are then oriented based on the + signature of causality in observational data. The recent, more interpretable + MIIC extension (iMIIC) further distinguishes genuine causes from putative + and latent causal effects, while scaling to very large datasets (hundreds + of thousands of samples). Since version 2.0, MIIC also includes a + temporal mode (tMIIC) to learn temporal causal graphs from stationary time + series data. MIIC has been applied to a wide range of biological and + biomedical data, such as single-cell gene expression data, genomic + alterations in tumors, live-cell time-lapse imaging data (CausalXtract), + as well as medical records of patients. MIIC brings unique insights based + on causal interpretation and could be used in a broad range of other data + science domains (technology, climatology, economy, ...). For more information, you can refer to: - Simon et al. eLife, reviewed preprint <https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1>, - Cabeli et al. PLoS Comp. Bio. 2020 <doi:10.1371/journal.pcbi.1007866>, - Verny et al. PLoS Comp. Bio. 2017 <doi:10.1371/journal.pcbi.1005662>. + Simon et al., eLife 2024, <https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1>, + Ribeiro-Dantas et al., iScience 2024, <doi:10.1016/j.isci.2024.109736>, + Cabeli et al., NeurIPS 2021, <http://kinefold.curie.fr/isambertlab/PAPERS/cabeli_Why21-NeurIPS2021.pdf>, + Cabeli et al., PLoS Comput. Biol. 2020, <doi:10.1371/journal.pcbi.1007866>, + Li et al., NeurIPS 2019, <https://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets>, + Verny et al., PLoS Comput. Biol. 2017, <doi:10.1371/journal.pcbi.1005662>, + Affeldt et al., UAI 2015, <https://auai.org/uai2015/proceedings/papers/293.pdf>. + Changes from the previous 1.5.3 release on CRAN are available at + . 
License: GPL (>= 2) URL: https://github.com/miicTeam/miic_R_package BugReports: https://github.com/miicTeam/miic_R_package/issues @@ -79,4 +91,4 @@ LinkingTo: SystemRequirements: C++14 LazyData: true Encoding: UTF-8 -RoxygenNote: 7.3.1 +RoxygenNote: 7.3.2 diff --git a/NAMESPACE b/NAMESPACE index 35ae61e..d03eb85 100755 --- a/NAMESPACE +++ b/NAMESPACE @@ -7,11 +7,10 @@ export(computeThreePointInfo) export(discretizeMDL) export(discretizeMutual) export(estimateTemporalDynamic) +export(export) export(miic) -export(miic.export) -export(miic.write.network.cytoscape) -export(miic.write.style.cytoscape) -export(tmiic.export) +export(writeCytoscapeNetwork) +export(writeCytoscapeStyle) import(Rcpp) importFrom(stats,density) importFrom(stats,sd) diff --git a/NEWS.md b/NEWS.md index a505518..549cc55 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,114 +1,172 @@ -# Development version - -# v2.0.0 +# v2.0.1 ## Features -- tMIIC version for temporal causal discovery on stationary time series: - new mode of MIIC to reconstruct networks from temporal stationary datasets. - [Simon et al., eLife, reviewed preprint] - (https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1.abstract) +* Release to CRAN. + +## Fixes and improvements + +* Faster post-processing in R for datasets with a large number of variables. + +## Breaking changes + +Consolidating long-pending breaking changes: + +* Harmonization of exported function names using `camelCase`. + +* Harmonization of parameters and return values using `snake_case`. + +* Harmonization of abbreviations. + +All the documentation has been updated accordingly; if you encounter any issues +upgrading to this version, please consult the help of the relevant function +for more information about its interface. 
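As a hedged before/after sketch of the renamed exports (names taken from the NAMESPACE diff above; `m` stands for a hypothetical miic result object):

```r
# v1.5.3 (old API)                      v2.0.1 (new API)
# miic.export(m)                    ->  export(m)
# miic.write.network.cytoscape(m)   ->  writeCytoscapeNetwork(m)
# miic.write.style.cytoscape(m)     ->  writeCytoscapeStyle(m)
# tmiic.export(m)                   ->  export(m)  # assumption: temporal
#                                                  # objects share export()
```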
+ + For the core `miic()` function, the main breaking changes in the interface + (when upgrading from the 1.5.3 release on CRAN) are: + + in the parameters: + + * `cplx`: renaming of the complexity term `"mdl"` → `"bic"` + + * `ori_proba_ratio` → `ort_proba_ratio` + + in the miic object returned: + + * `all.edges.summary` → `summary` + * `Nxy_ai` → `n_xy_ai` + * `log_confidence` → `info_shifted` + * `infOrt` → `ort_inferred` + * `trueOrt` → `ort_ground_truth` + * `isOrtOk` → `is_inference_correct` + * `isCausal` → `is_causal` + * `proba` → `p_y2x`, `p_x2y` + * `consensus` → `ort_consensus` + + * `orientations.prob` → `triples` + * `NI3` → `ni3` + * `Error` → `conflict` + + Also compared to 1.5.3, an important change in the default behavior of + `miic()` is that it no longer propagates orientations + and now allows latent variable discovery during the orientation step. ## Known issues -- A (very) large number of contributors can lead to a memory fault. - Initial fix has been reverted due to side effects. +* Conditioning on a (very) large number of contributors can lead to a memory + fault. +# v2.0.0 + +## Features + +* tMIIC version for temporal causal discovery on stationary time series: + new mode of `miic()` to reconstruct networks from temporal stationary + datasets ([Simon et al., eLife 2024](https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1.abstract)). + The temporal mode of `miic()` is not activated by default and can be enabled by + setting the newly added parameter `mode` to `"TS"` (Temporal Stationary). + The temporal mode can be tuned through a set of new parameters: + `max_nodes`, `n_layers`, `delta_t`, `mov_avg` and `keep_max_data`. + # v1.8.1 ## Fixes and improvements -* The discretization of continuous variables has been improved when dealing with variables having a large number of identical values. -Fix for memory overflow on shared memory space. 
+* Fix for memory overflow on shared memory space. # v1.8.0 ## Features -- Addition of the 'is consequence' prior knowledge. Consequence variables are - excluded from the possible contributors, edges between consequences are - ignored and edges between a non consequence and a consequence are pre-oriented - toward the consequence. +* Addition of the 'is consequence' prior knowledge. Consequence variables are + excluded from the possible contributors, edges between consequences are + ignored and edges between a non-consequence and a consequence are pre-oriented + toward the consequence. + Information about consequence variables can be provided to `miic()` + in the `state_order`, by supplying an `is_consequence` column. # v1.7.0 ## Features -- iMIIC version introducing genuine vs putative causes, contextual variables - and multiple enhancements to deal with very large datasets. - [Ribeiro-Dantas et al., iScience 2024] - (https://arxiv.org/abs/2303.06423) +* iMIIC version introducing contextual variables, genuine vs putative causes + and multiple enhancements to deal with very large datasets ([Ribeiro-Dantas et al., iScience 2024](https://doi.org/10.1016/j.isci.2024.109736)). + Information on contextual variables can be provided to `miic()` + in the `state_order`, by supplying an `is_contextual` column, and + the distinction between genuine and putative causes can be tuned with the + newly added parameter `ort_consensus_ratio`. # v1.6.0 ## Features -- Enhancement of orientations using mutual information supremum principle for - finite datasets. - [Cabeli et al., Why21 at NeurIPS 2021] - (http://kinefold.curie.fr/isambertlab/PAPERS/cabeli_Why21-NeurIPS2021.pdf) +* Enhancement of orientations using mutual information supremum principle for + finite datasets ([Cabeli et al., Why21 at NeurIPS 2021](http://kinefold.curie.fr/isambertlab/PAPERS/cabeli_Why21-NeurIPS2021.pdf)). 
+ The use of enhanced orientations is controlled by the newly added parameter + `negative_info` of `miic()` and is activated by default. -- By default, MIIC does not propagate orientations anymore +* By default, `miic()` no longer propagates orientations and allows latent variables discovery during orientation step. # v1.5.3 ## Features -- Release to CRAN +* Release to CRAN. # v1.5.2 ## Fixes and improvements -- Further refactoring of the C++ code for the computation of information. +* Further refactoring of the C++ code for the computation of information. -- Fix minor bugs in the continuous computation. +* Fix minor bugs in the continuous computation. -- Fix incompatibility with older versions of GCC (std::align). +* Fix incompatibility with older versions of GCC (std::align). # v1.5.1 ## Fixes and improvements -- Fix various bugs in the computation of information in the presence of NA +* Fix various bugs in the computation of information in the presence of NA values in the dataset. -- An overhaul of the C++ code base, better memory management, computation time +* An overhaul of the C++ code base, with better memory management, computation time and code readability. # v1.5.0 ## Features -- Add a column `consensus` to the reconstructed graph's edges summary associated +* Add a column `consensus` to the reconstructed graph's edges summary associated with the option `consistent`, and a new parameter `consensus_threshold` accordingly. -- Add a parameter `ori_proba_ratio` to have more control on the orientation of +* Add a parameter `ori_proba_ratio` to have more control over the orientation of edges. ## Fixes and improvements -- Faster post processing in R. +* Faster post-processing in R. -- Rework plot functionality. +* Rework plot functionality. -- Fix a bug in the orientation part about the log score. +* Fix a bug in the log score computation in the orientation step. -- Refactor of the C++ code base (orientation). 
# v1.4.2 ## Fixes and improvements -- Various fixes of memory leaks and ambiguous function calls (at least for all +* Various fixes of memory leaks and ambiguous function calls (at least all that appear in CRAN check). -- Refactor of the C++ code base (confidence cut). +* Refactor of the C++ code base (confidence cut). ## Known issues -- Error when running the cosmicCancer example on CRAN's Solaris system. +* Error when running the cosmicCancer example on CRAN's Solaris system. ## Miscellaneous -- Move from BitBucket to GitHub, the repo is now public. +* Move from BitBucket to GitHub; the repo is now public. # v1.4.1 @@ -118,7 +176,7 @@ CRAN). # v1.4.0 ## Incompatible changes -- Standardize the API naming convention: `snake_case` for parameters and +* Standardize the API naming convention: `snake_case` for parameters and `camelCase` for functions. This should have led to a major version increment to v2.0.0 given the previous version on CRAN is v1.0.3. But v1.0.3 and earlier versions were not properly maintained and versioned under a version control @@ -126,22 +184,22 @@ CRAN). this version). ## Features -- The method now works with continuous variables (solely or mixed with discrete +* The method now works with continuous variables (solely or mixed with discrete variables), thanks to the discretization method as described in - [Cabeli et al., PLoS Comp. Bio. 2020](https://doi.org/10.1371/journal.pcbi.1007866). + [Cabeli et al., PLoS Comput. Biol. 2020](https://doi.org/10.1371/journal.pcbi.1007866). -- Add an option `consistent` to improve the reconstructed graph's +* Add an option `consistent` to improve the reconstructed graph's interpretability based on schemes as described in [Li et al., NeurIPS 2019](https://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets). ## Fixes and improvements -- Various fixes of memory leaks and typos. +* Various fixes of memory leaks and typos. 
-- Major refactoring of the old C++ code base (still WIP) to improve readability +* Major refactoring of the old C++ code base (still WIP) to improve readability and flexibility, and to enforce proper coding style and documentation. -- Enforce proper coding style for the R code base. +* Enforce proper coding style for the R code base. ## Known issues -- Still have some memory leaks and CRAN check errors and notes on certain +* Still have some memory leaks and CRAN check errors and notes on certain platforms. diff --git a/R/computeInformation.R b/R/computeInformation.R index 1fbc11f..e8bd209 100644 --- a/R/computeInformation.R +++ b/R/computeInformation.R @@ -1,44 +1,51 @@ +#******************************************************************************* +# Filename : computeInformation.R +# +# Description: Compute 2 and 3 point (conditional) mutual information +#******************************************************************************* + +#=============================================================================== +# FUNCTIONS +#=============================================================================== +# computeMutualInfo +#------------------------------------------------------------------------------- #' Compute (conditional) mutual information -#' @description For discrete variables, the computation is based on the -#' empirical frequency minus a complexity cost (computed as BIC or with the -#' Normalized Maximum Likelihood). When continuous variables are present, each -#' continuous variable is discretized where the partitioning is chosen by -#' maximizing the mutual information minus the complexity cost. The estimation -#' based on the optimally discretized distributions effectively approaches the -#' mutual information computed on the original continuous variables. 
+#' @description For discrete or categorical variables, the (conditional) +#' mutual information is computed using the empirical frequencies minus a +#' complexity cost (computed as BIC or with the Normalized Maximum Likelihood). +#' When continuous variables are present, each continuous variable is +#' discretized for each mutual information estimate so as to maximize the +#' mutual information minus the complexity cost (see Cabeli 2020). #' #' @details For a pair of continuous variables \eqn{X} and \eqn{Y}, the mutual #' information \eqn{I(X;Y)} will be computed iteratively. In each iteration, the -#' algorithm optimizes first the partitioning of \eqn{X} and then that of -#' \eqn{Y}, while maximizing +#' algorithm optimizes the partitioning of \eqn{X} and then of \eqn{Y}, +#' in order to maximize #' \deqn{Ik(X_{d};Y_{d}) = I(X_{d};Y_{d}) - cplx(X_{d};Y_{d})} -#' where \eqn{cplx(X_{d}; Y_{d})} is the complexity cost of the current -#' partitioning (see Affeldt 2016 and Cabeli 2020). Upon convergence, the -#' information terms \eqn{I(X_{d};Y_{d})} and \eqn{Ik(X_{d};Y_{d})}, as well as -#' the partitioning of \eqn{X_{d}} and \eqn{Y_{d}} in terms of cutpoints, are -#' returned. +#' where \eqn{cplx(X_{d}; Y_{d})} is the complexity cost of the corresponding +#' partitioning (see Cabeli 2020). +#' Upon convergence, the information terms \eqn{I(X_{d};Y_{d})} +#' and \eqn{Ik(X_{d};Y_{d})}, as well as the partitioning of \eqn{X_{d}} +#' and \eqn{Y_{d}} in terms of cutpoints, are returned. #' -#' For conditional mutual information with conditioning set \eqn{U}, the +#' For conditional mutual information with a conditioning set \eqn{U}, the #' computation is done based on #' \deqn{ #' Ik(X;Y|U) = 0.5*(Ik(X_{d};Y_{d},U_{d}) - Ik(X_{d};U_{d}) #' + Ik(Y_{d};X_{d},U_{d}) - Ik(Y_{d};U_{d})), #' } -#' where each of the four summands is estimated independently. +#' where each of the four summands is estimated separately. 
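A minimal usage sketch of the decomposition above (mirroring the package's own roxygen examples; assumes the miic package is installed, and the four summands are estimated internally by a single call):

```r
library(miic)
set.seed(0)
N <- 1000
# X <- Z -> Y: X and Y are dependent, but conditionally independent given Z
Z <- runif(N)
X <- Z * 2 + rnorm(N, sd = 0.2)
Y <- Z * 2 + rnorm(N, sd = 0.2)
res <- computeMutualInfo(X, Y, df_conditioning = data.frame(Z))
res$info   # raw estimate of I(X;Y|Z), expected to be close to 0 here
res$infok  # shifted estimate Ik(X;Y|Z) = I(X;Y|Z) - cplx, may be negative
```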
#' #' @references #' \itemize{ -#' \item Verny et al., \emph{PLoS Comp. Bio. 2017.} -#' https://doi.org/10.1371/journal.pcbi.1005662 -#' \item Cabeli et al., \emph{PLoS Comp. Bio. 2020.} -#' https://doi.org/10.1371/journal.pcbi.1007866 -#' \item Affeldt et al., \emph{Bioinformatics 2016} +#' \item Cabeli \emph{et al.}, PLoS Comput. Biol. 2020, \href{https://doi.org/10.1371/journal.pcbi.1007866}{Learning clinical networks from medical records based on information estimates in mixed-type data} +#' \item Affeldt \emph{et al.}, UAI 2015, \href{https://auai.org/uai2015/proceedings/papers/293.pdf}{Robust Reconstruction of Causal Graphical Models based on Conditional 2-point and 3-point Information} #' } #' -#' @param X [a vector] -#' A vector that contains the observational data of the first variable. -#' @param Y [a vector] -#' A vector that contains the observational data of the second variable. +#' @param x [a vector] +#' The \eqn{X} vector that contains the observational data of the first variable. +#' @param y [a vector] +#' The \eqn{Y} vector that contains the observational data of the second variable. #' @param df_conditioning [a data frame] #' The data frame of the observations of the conditioning variables. #' @param maxbins [an integer] @@ -48,25 +55,25 @@ #' @param cplx [a string] #' The complexity model: #' \itemize{ -#' \item["mdl"] Minimum description Length -#' \item["nml"] Normalized Maximum Likelihood, less costly compared to "mdl" in -#' the finite sample case and will allow for more bins. +#' \item["bic"] Bayesian Information Criterion +#' \item["nml"] Normalized Maximum Likelihood, a more accurate complexity cost +#' than BIC, especially for small sample sizes. #' } #' @param n_eff [an integer] -#' The number of effective samples. When there is significant autocorrelation in -#' the samples you may want to specify a number of effective samples that is -#' lower than the number of points in the distribution. +#' The effective number of samples. 
When there is significant autocorrelation +#' between successive samples, you may want to specify an effective number of +#' samples that is lower than the total number of samples. #' @param sample_weights [a vector of floats] #' Individual weights for each sample, used for the same reason as the effective -#' sample number but with individual precision. +#' number of samples, but with per-sample precision. #' @param is_continuous [a vector of booleans] #' Specify if each variable is to be treated as continuous (TRUE) or discrete #' (FALSE), must be of length `ncol(df_conditioning) + 2`, in the order #' \eqn{X, Y, U1, U2, ...}. If not specified, factors and character vectors are #' considered as discrete, and numerical vectors as continuous. #' @param plot [a boolean] -#' Specify whether the XY joint space with discretization scheme is to be -#' plotted (requires `ggplot2` and `gridExtra`). +#' Specify whether the resulting optimal XY discretization is to be plotted +#' (requires `ggplot2` and `gridExtra`). #' #' @return A list that contains : #' \itemize{ @@ -74,11 +81,12 @@ #' the cutpoints for the partitioning of \eqn{X}. #' \item cutpoints2: Only when \eqn{Y} is continuous, a vector containing #' the cutpoints for the partitioning of \eqn{Y}. -#' \item niterations: Only when at least one of the input variables is +#' \item n_iterations: Only when at least one of the input variables is #' continuous, the number of iterations it takes to reach the convergence of #' the estimated information. -#' \item iterationN: Only when at least one of the input variables is -#' continuous, the list of vectors of cutpoints of each iteration. +#' \item iteration1, iteration2, ...: Only when at least one of the input +#' variables is continuous, the list of vectors of cutpoints of each +#' iteration. #' \item info: The estimation of (conditional) mutual information without the #' complexity cost. 
#' \item infok: The estimation of (conditional) mutual information with the @@ -121,11 +129,11 @@ #' res <- computeMutualInfo(X, Y, df_conditioning = matrix(Z, ncol = 1), plot = TRUE) #' message("I(X;Y|Z) = ", res$info) #' } -#' -computeMutualInfo <- function(X, Y, +#------------------------------------------------------------------------------- +computeMutualInfo <- function(x, y, df_conditioning = NULL, maxbins = NULL, - cplx = c("nml", "mdl"), + cplx = c("nml", "bic"), n_eff = -1, sample_weights = NULL, is_continuous = NULL, @@ -142,7 +150,7 @@ computeMutualInfo <- function(X, Y, ) cplx <- match.arg(cplx) - input_data = data.frame(X, Y) + input_data = data.frame(x, y) if (!is.null(df_conditioning)) { input_data <- data.frame(input_data, df_conditioning) } @@ -151,7 +159,7 @@ computeMutualInfo <- function(X, Y, stop(paste( "Differing number of rows between `sample_weights` and input data:", length(sample_weights), - length(X) + length(x) )) } @@ -239,7 +247,7 @@ computeMutualInfo <- function(X, Y, # Parse cutpointsmatrix epsilon <- min(c(sd(X_num), sd(Y_num))) / 100 niterations <- nrow(rescpp$cutpointsmatrix) / maxbins - result$niterations <- niterations + result$n_iterations <- niterations for (i in 0:(niterations - 1)) { result[[paste0("iteration", i + 1)]] <- list() for (l in 1:2) { @@ -281,8 +289,8 @@ computeMutualInfo <- function(X, Y, } if (plot) { - nameDist1 <- deparse(substitute(X)) - nameDist2 <- deparse(substitute(Y)) + nameDist1 <- deparse(substitute(x)) + nameDist2 <- deparse(substitute(y)) if (base::requireNamespace("ggplot2", quietly = TRUE) && base::requireNamespace("gridExtra", quietly = TRUE)) { if (all(is_continuous[1:2])) { @@ -312,35 +320,35 @@ computeMutualInfo <- function(X, Y, return(result) } +#------------------------------------------------------------------------------- +# computeThreePointInfo +#------------------------------------------------------------------------------- #' Compute (conditional) three-point information -#' 
@description Three point information is defined based on mutual information. -#' For discrete variables, the computation is based on the -#' empirical frequency minus a complexity cost (computed as BIC or with the -#' Normalized Maximum Likelihood). When continuous variables are present, each -#' continuous variable is discretized where the partitioning is chosen by -#' maximizing the mutual information minus the complexity cost. +#' @description Three-point information is defined and computed as the +#' difference of mutual information and conditional mutual information, i.e. +#' \deqn{I(X;Y;Z|U) = I(X;Y|U) - I(X;Y|U,Z)} +#' For discrete or categorical variables, the three-point information is +#' computed with the empirical frequencies minus a complexity cost +#' (computed as BIC or with the Normalized Maximum Likelihood). #' #' @details For variables \eqn{X}, \eqn{Y}, \eqn{Z} and a set of conditioning #' variables \eqn{U}, the conditional three point information is defined as -#' \deqn{Ik(X;Y;Z|U) = Ik(X;Y|U) - Ik(X;Y|U,Z)}, where \eqn{Ik} is the -#' regularized conditional mutual information. +#' \deqn{Ik(X;Y;Z|U) = Ik(X;Y|U) - Ik(X;Y|U,Z)} +#' where \eqn{Ik} is the shifted or regularized conditional mutual information. #' See \code{\link{computeMutualInfo}} for the definition of \eqn{Ik}. #' #' @references #' \itemize{ -#' \item Verny et al., \emph{PLoS Comp. Bio. 2017.} -#' https://doi.org/10.1371/journal.pcbi.1005662 -#' \item Cabeli et al., \emph{PLoS Comp. Bio. 2020.} -#' https://doi.org/10.1371/journal.pcbi.1007866 -#' \item Affeldt et al., \emph{Bioinformatics 2016} +#' \item Cabeli \emph{et al.}, PLoS Comput. Biol. 
2020, \href{https://doi.org/10.1371/journal.pcbi.1007866}{Learning clinical networks from medical records based on information estimates in mixed-type data} +#' \item Affeldt \emph{et al.}, UAI 2015, \href{https://auai.org/uai2015/proceedings/papers/293.pdf}{Robust Reconstruction of Causal Graphical Models based on Conditional 2-point and 3-point Information} #' } #' -#' @param X [a vector] -#' A vector that contains the observational data of the first variable. -#' @param Y [a vector] -#' A vector that contains the observational data of the second variable. -#' @param Z [a vector] -#' A vector that contains the observational data of the third variable. +#' @param x [a vector] +#' The \eqn{X} vector that contains the observational data of the first variable. +#' @param y [a vector] +#' The \eqn{Y} vector that contains the observational data of the second variable. +#' @param z [a vector] +#' The \eqn{Z} vector that contains the observational data of the third variable. #' @param df_conditioning [a data frame] #' The data frame of the observations of the set of conditioning variables #' \eqn{U}. @@ -351,33 +359,33 @@ computeMutualInfo <- function(X, Y, #' @param cplx [a string] #' The complexity model: #' \itemize{ -#' \item["mdl"] Minimum description Length -#' \item["nml"] Normalized Maximum Likelihood, less costly compared to "mdl" in -#' the finite sample case and will allow for more bins. +#' \item["bic"] Bayesian Information Criterion +#' \item["nml"] Normalized Maximum Likelihood, a more accurate complexity cost +#' than BIC, especially for small sample sizes. #' } #' @param n_eff [an integer] -#' The number of effective samples. When there is significant autocorrelation in -#' the samples you may want to specify a number of effective samples that is -#' lower than the number of points in the distribution. +#' The effective number of samples. 
When there is significant autocorrelation +#' between successive samples, you may want to specify an effective number of +#' samples that is lower than the total number of samples. #' @param sample_weights [a vector of floats] #' Individual weights for each sample, used for the same reason as the effective -#' sample number but with individual precision. +#' number of samples, but with per-sample precision. #' @param is_continuous [a vector of booleans] #' Specify if each variable is to be treated as continuous (TRUE) or discrete -#' (FALSE), must be of length `ncol(df_conditioning) + 2`, in the order -#' \eqn{X, Y, U1, U2, ...}. If not specified, factors and character vectors are -#' considered as discrete, and numerical vectors as continuous. +#' (FALSE), must be of length `ncol(df_conditioning) + 3`, in the order +#' \eqn{X, Y, Z, U1, U2, ...}. If not specified, factors and character vectors +#' are considered as discrete, and numerical vectors as continuous. #' #' @return A list that contains : #' \itemize{ -#' \item I3: The estimation of (conditional) three-point information without the +#' \item i3: The estimation of (conditional) three-point information without the #' complexity cost. -#' \item I3k: The estimation of (conditional) three-point information with the -#' complexity cost (\eqn{I3k = I3 - cplx}). -#' \item I2: For reference, the estimation of (conditional) mutual information -#' \eqn{I(X;Y|U)} used in the estimation of \eqn{I3}. -#' \item I2k: For reference, the estimation of regularized (conditional) mutual -#' information \eqn{Ik(X;Y|U)} used in the estimation of \eqn{I3k}. +#' \item i3k: The estimation of (conditional) three-point information with the +#' complexity cost (\emph{i3k = i3 - cplx}). +#' \item i2: For reference, the estimation of (conditional) mutual information +#' \eqn{I(X;Y|U)} used in the estimation of \emph{i3}. 
+#' \item i2k: For reference, the estimation of regularized (conditional) mutual +#' information \eqn{Ik(X;Y|U)} used in the estimation of \emph{i3k}. #' } #' @export #' @useDynLib miic @@ -391,8 +399,8 @@ computeMutualInfo <- function(X, Y, #' X <- Z * 2 + rnorm(N, sd = 0.2) #' Y <- Z * 2 + rnorm(N, sd = 0.2) #' res <- computeThreePointInfo(X, Y, Z) -#' message("I(X;Y;Z) = ", res$I3) -#' message("Ik(X;Y;Z) = ", res$I3k) +#' message("I(X;Y;Z) = ", res$i3) +#' message("Ik(X;Y;Z) = ", res$i3k) #' #' \donttest{ #' # Independence, conditional dependence : X -> Z <- Y @@ -400,14 +408,14 @@ computeMutualInfo <- function(X, Y, #' Y <- runif(N) #' Z <- X + Y + rnorm(N, sd = 0.1) #' res <- computeThreePointInfo(X, Y, Z) -#' message("I(X;Y;Z) = ", res$I3) -#' message("Ik(X;Y;Z) = ", res$I3k) +#' message("I(X;Y;Z) = ", res$i3) +#' message("Ik(X;Y;Z) = ", res$i3k) #' } -#' -computeThreePointInfo <- function(X, Y, Z, +#------------------------------------------------------------------------------- +computeThreePointInfo <- function(x, y, z, df_conditioning = NULL, maxbins = NULL, - cplx = c("nml", "mdl"), + cplx = c("nml", "bic"), n_eff = -1, sample_weights = NULL, is_continuous = NULL) { @@ -423,7 +431,7 @@ computeThreePointInfo <- function(X, Y, Z, ) cplx <- match.arg(cplx) - input_data = data.frame(X, Y, Z) + input_data = data.frame(x, y, z) if (!is.null(df_conditioning)) { input_data <- data.frame(input_data, df_conditioning) } @@ -432,7 +440,7 @@ computeThreePointInfo <- function(X, Y, Z, stop(paste( "Differing number of rows between `sample_weights` and input data:", length(sample_weights), - length(X) + length(x) )) } diff --git a/R/data.R b/R/data.R index 5fcd16d..54e541b 100755 --- a/R/data.R +++ b/R/data.R @@ -61,47 +61,6 @@ NULL NULL - - - -#' Tetraploidization in vertebrate evolution -#' -#' 20,415 protein-coding genes in the human genome from Ensembl (v70) and information on the -#' retention of duplicates originating either from the two whole genome duplications 
at -#' the onset of vertebrates (‘ohnolog’) or from subsequent small scale duplications (‘SSD’) -#' as well as copy number variants (‘CNV’). -#' -#' @docType data -#' @name ohno -#' @usage data(ohno) -#' -#' @format A data.frame object. -#' -#' @keywords datasets -#' -#' @references Verny et al., PLoS Comp. Bio. 2017. -#' -#' @keywords data -NULL - - -#' Tetraploidization in vertebrate evolution -#' -#' 20,415 protein-coding genes in the human genome from Ensembl (v70) and information on the -#' retention of duplicates originating either from the two whole genome duplications at -#' the onset of vertebrates (‘ohnolog’) or from subsequent small scale duplications (‘SSD’) -#' as well as copy number variants (‘CNV’), category order. -#' -#' @docType data -#' @usage data(ohno_stateOrder) -#' @format A data.frame object. -#' @keywords datasets -#' @name ohno_stateOrder -#' @references Verny et al., PLoS Comp. Bio. 2017. -#' -#' @keywords data -NULL - #' Covid cases #' #' Demo dataset of chronological series to be used in temporal mode of miic. diff --git a/R/discretizeMutual.R b/R/discretizeMutual.R index e65d88f..6741506 100755 --- a/R/discretizeMutual.R +++ b/R/discretizeMutual.R @@ -1,63 +1,89 @@ -#' Iterative dynamic programming for (conditional) mutual information through optimized discretization. -#' @description This function chooses cutpoints in the input distributions by maximizing the mutual -#' information minus a complexity cost (computed as BIC or with the Normalized Maximum Likelihood). The -#' (conditional) mutual information computed on the optimized discretized distributions effectively approaches -#' the mutual information computed on the original continuous variables. 
+#******************************************************************************* +# Filename : discretizeMutual.R +# +# Description: Optimal discretization to compute (conditional) mutual +# information +#******************************************************************************* + +#=============================================================================== +# FUNCTIONS +#=============================================================================== +# discretizeMutual +#------------------------------------------------------------------------------- +#' Iterative dynamic programming for (conditional) mutual information through +#' optimized discretization. #' -#' @details For a pair of variables \eqn{X} and \eqn{Y}, the algorithm will in turn choose cutpoints on \eqn{X} -#' then on \eqn{Y}, maximizing \eqn{I(X_{d};Y_{d}) - cplx(X_{d};Y_{d})} where \eqn{cplx(X_{d};Y_{d})} is the -#' complexity cost of the considered discretizations of \eqn{X} and \eqn{Y} (see Affeldt 2016 and Cabeli 2020). -#' When the value \eqn{I(X_{d};Y_{d})} is stable between two iterations the discretization scheme of -#' \eqn{X_{d}} and \eqn{Y_{d}} is returned as well as \eqn{I(X_{d};Y_{d})} and \eqn{I(X_{d};Y_{d})-cplx(X_{d};Y_{d})}. +#' @description This function chooses cutpoints in the input distributions by +#' maximizing the mutual information minus a complexity cost +#' (computed as BIC or with the Normalized Maximum Likelihood). +#' The (conditional) mutual information computed on the optimized discretized +#' distributions effectively estimates the mutual information of the original +#' continuous variables. 
#' -#' With a set of conditioning variables \eqn{U}, the discretization scheme maximizes each term of the sum +#' @details For a pair of continuous variables \eqn{X} and \eqn{Y}, +#' the algorithm will iteratively choose cutpoints on \eqn{X} then on \eqn{Y}, +#' maximizing \eqn{I(X_{d};Y_{d}) - cplx(X_{d};Y_{d})} where +#' \eqn{cplx(X_{d};Y_{d})} is the complexity cost of the considered +#' discretizations of \eqn{X} and \eqn{Y} (see Cabeli 2020). +#' Upon convergence, the discretization scheme of \eqn{X_{d}} and \eqn{Y_{d}} +#' is returned as well as \eqn{I(X_{d};Y_{d})} +#' and \eqn{I(X_{d};Y_{d})-cplx(X_{d};Y_{d})}. +#' +#' With a set of conditioning variables \eqn{U}, the discretization scheme +#' maximizes each term of the sum #' \eqn{I(X;Y|U) \sim 0.5*(I(X_{d};Y_{d}, U_{d}) - I(X_{d};U_{d}) + I(Y_{d};X_{d}, U_{d}) - I(Y_{d};U_{d}))}. #' #' Discrete variables can be passed as factors and will be used "as is" to maximize each term. #' -#' #' @references #' \itemize{ -#' \item Verny et al., \emph{PLoS Comp. Bio. 2017.} https://doi.org/10.1371/journal.pcbi.1005662 -#' \item Cabeli et al., \emph{PLoS Comp. Bio. 2020.} https://doi.org/10.1371/journal.pcbi.1007866 -#' \item Affeldt et al., \emph{Bioinformatics 2016} +#' \item Cabeli \emph{et al.}, PLoS Comput. Biol. 2020, \href{https://doi.org/10.1371/journal.pcbi.1007866}{Learning clinical networks from medical records based on information estimates in mixed-type data} #' } #' -#' @param X [a vector] -#' A vector that contains the observational data of the first variable. -#' @param Y [a vector] -#' A vector that contains the observational data of the second variable. +#' @param x [a vector] +#' The \eqn{X} vector that contains the observational data of the first variable. +#' @param y [a vector] +#' The \eqn{Y} vector that contains the observational data of the second variable. #' @param matrix_u [a numeric matrix] #' The matrix with the observations of as many columns as conditioning variables. 
#' @param maxbins [an int]
#' The maximum number of bins desired in the discretization. A lower number makes the computation faster, a higher
#' number allows finer discretization (by default : 5 * cubic root of N).
#' @param cplx [a string]
-#' The complexity used in the dynamic programming. Either "mdl" for Minimum description Length or
-#' "nml" for Normalized Maximum Likelihood, which is less costly in the finite sample case and
-#' will allow more bins than mdl.
+#' The complexity used in the dynamic programming:
+#' \itemize{
+#' \item "bic": Bayesian Information Criterion
+#' \item "nml": Normalized Maximum Likelihood, a more accurate complexity cost
+#' than BIC, especially for small sample sizes.
+#' }
-#' @param n_eff [an int]
-#' The number of effective samples. When there is significant autocorrelation in the samples you may
-#' want to specify a number of effective samples that is lower than the number of points in the distribution.
+#' @param n_eff [an integer]
+#' The effective number of samples. When there is significant autocorrelation
+#' between successive samples, you may want to specify an effective number of
+#' samples that is lower than the total number of samples.
#' @param sample_weights [a vector of floats]
-#' Individual weights for each sample, used for the same reason as the effective sample number but with individual
-#' precision.
-#' @param is_discrete [a vector of booleans]
-#' Specify if each variable is to be treated as discrete (TRUE) or continuous (FALSE) in a
-#' logical vector of length ncol(matrix_u) + 2, in the order [X, Y, U1, U2...]. By default,
-#' factors and character vectors are treated as discrete, and numerical vectors as continuous.
+#' Individual weights for each sample, used for the same reason as the effective
+#' number of samples but with sample-specific precision.
+#' @param is_continuous [a vector of booleans] +#' Specify if each variable is to be treated as continuous (TRUE) +#' or discrete (FALSE) in a logical vector of length ncol(matrix_u) + 2, +#' in the order [X, Y, U1, U2...]. By default, factors and character vectors +#' are treated as discrete, and numerical vectors as continuous. #' @param plot [a boolean] -#' Specify if the XY joint space with discretization scheme is to be plotted or not (requires -#' ggplot2 and gridExtra). +#' Specify whether the resulting XY optimum discretization is to be plotted +#' (requires `ggplot2` and `gridExtra`). #' #' @return A list that contains : #' \itemize{ #' \item{two vectors containing the cutpoints for each variable : -#' \emph{cutpoints1} corresponds to \emph{X}, -#' \emph{cutpoints2} corresponds to \emph{Y}.} -#' \item{\emph{niterations} is the number of iterations performed before convergence of the (C)MI estimation.} -#' \item{\emph{iterationN}, lists contatining the cutpoint vectors for each iteration.} -#' \item{\emph{info} and \emph{infok}, the estimated (C)MI value and (C)MI minus the complexity cost.} +#' \emph{cutpoints1} corresponds to \emph{x}, +#' \emph{cutpoints2} corresponds to \emph{y}.} +#' \item{\emph{n_iterations} is the number of iterations performed before +#' convergence of the (C)MI estimation.} +#' \item{\emph{iteration1, iteration2, ...}, lists containing +#' the cutpoint vectors for each iteration.} +#' \item{\emph{info} and \emph{infok}, the estimated (C)MI value +#' and (C)MI minus the complexity cost.} #' \item{if \emph{plot} == TRUE, a plot object (requires ggplot2 and gridExtra).} #' } #' @export @@ -83,7 +109,7 @@ #' Y <- as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N) #' res <- miic::discretizeMutual(X, Y, cplx = "nml") #' message("I(X;Y) = ", res$info) -#' res <- miic::discretizeMutual(X, Y, matrix(Z, ncol = 1), is_discrete = c(FALSE, FALSE, TRUE)) +#' res <- miic::discretizeMutual(X, Y, matrix(Z, ncol = 1), is_continuous = c(TRUE, 
TRUE, FALSE)) #' message("I(X;Y|Z) = ", res$info) #' #' @@ -96,18 +122,18 @@ #' res <- discretizeMutual(X, Y, matrix_u = matrix(Z, ncol = 1), plot = TRUE) #' message("I(X;Y|Z) = ", res$info) #' } -#' -discretizeMutual <- function(X, - Y, +#------------------------------------------------------------------------------- +discretizeMutual <- function(x, + y, matrix_u = NULL, maxbins = NULL, cplx = "nml", n_eff = NULL, sample_weights = NULL, - is_discrete = NULL, + is_continuous = NULL, plot = TRUE) { - nameDist1 <- deparse(substitute(X)) - nameDist2 <- deparse(substitute(Y)) + nameDist1 <- deparse(substitute(x)) + nameDist2 <- deparse(substitute(y)) # Check the input arguments if (is.null(matrix_u)) { nbrU <- 0 @@ -115,10 +141,10 @@ discretizeMutual <- function(X, nbrU <- ncol(matrix_u) } - if (is.null(is_discrete)) { + if (is.null(is_continuous)) { is_discrete <- c( - (is.character(X) || is.factor(X)), - (is.character(Y) || is.factor(Y)) + (is.character(x) || is.factor(x)), + (is.character(y) || is.factor(y)) ) if (nbrU > 0) { for (z in 1:nbrU) { @@ -126,52 +152,54 @@ discretizeMutual <- function(X, is.factor(matrix_u[, z]))) } } + is_continuous <- (!is_discrete) } - + else + is_discrete <- (!is_continuous) if (all(is_discrete[1:2])) { - stop("Either X or Y must be continuous to be discretized.") + stop("Either x or y must be continuous to be discretized.") } - if (!(is.vector(X) || is.factor(X)) || - !(is.vector(Y) || is.factor(Y))) { + if (!(is.vector(x) || is.factor(x)) || + !(is.vector(y) || is.factor(y))) { stop( paste0( - "Please provide the two samples X and Y as numerical vectors ", + "Please provide the two samples x and y as numerical vectors ", "for continuous variables and factors or character vectors ", "for discrete variables." ) ) } - if (length(X) != length(Y)) { + if (length(x) != length(y)) { stop( paste0( "The two samples must have the same number of observation ", "(found ", - length(X), + length(x), " and ", - length(Y), + length(y), " )." 
) ) } if ((!is.null(sample_weights)) && - (length(sample_weights) != length(X))) { + (length(sample_weights) != length(x))) { stop( paste0( "The sample weight vector must be of the same length as the ", "number of observations (found ", length(sample_weights), " while there are ", - length(X), + length(x), " observations)." ) ) } if ((!is.null(matrix_u) && !is.matrix(matrix_u)) || - (!is.null(matrix_u) && nrow(matrix_u) != length(X))) { + (!is.null(matrix_u) && nrow(matrix_u) != length(x))) { stop( paste0( "matrix_u is not a matrix or its number of rows differs from", @@ -180,10 +208,10 @@ discretizeMutual <- function(X, ) } - if (!is.null(is_discrete) && (length(is_discrete) != (2 + nbrU))) { + if (!is.null(is_continuous) && (length(is_continuous) != (2 + nbrU))) { stop( paste0( - "The vector passed as is_discrete argument must be the same", + "The vector passed as is_continuous argument must be the same", " length as the number of variables, which is ncol(matrix_u) ", "+ 2." ) @@ -192,9 +220,9 @@ discretizeMutual <- function(X, # Remove rows for which any input vector is NA matrix_u_NA <- matrix() - NArows <- logical(length(X)) - NArows <- NArows | is.na(X) - NArows <- NArows | is.na(Y) + NArows <- logical(length(x)) + NArows <- NArows | is.na(x) + NArows <- NArows | is.na(y) if (!is.null(matrix_u)) { for (k in 1:ncol(matrix_u)) { NArows <- NArows | is.na(matrix_u[, k]) @@ -208,8 +236,8 @@ discretizeMutual <- function(X, " rows with NAs in at least one of the inputs. Running on ", length(which(!NArows)), " samples." 
)) - X <- X[!NArows] - Y <- Y[!NArows] + x <- x[!NArows] + y <- y[!NArows] } if (length(which(!NArows)) < 3) { stop(paste0( @@ -224,24 +252,24 @@ discretizeMutual <- function(X, } } - initbins <- min(30, round(length(X)**(1 / 3))) + initbins <- min(30, round(length(x)**(1 / 3))) - if (is.null(maxbins) || maxbins > length(X) || maxbins < initbins) { - maxbins <- min(length(X), 5 * initbins, 50) + if (is.null(maxbins) || maxbins > length(x) || maxbins < initbins) { + maxbins <- min(length(x), 5 * initbins, 50) } # Converting factors to discrete numerical variables - X_orig <- X - Y_orig <- Y + X_orig <- x + Y_orig <- y if (is_discrete[1]) { - X <- as.factor(X) - levels(X) <- 1:nlevels(X) - X <- as.numeric(X) + x <- as.factor(x) + levels(x) <- 1:nlevels(x) + x <- as.numeric(x) } if (is_discrete[2]) { - Y <- as.factor(Y) - levels(Y) <- 1:nlevels(Y) - Y <- as.numeric(Y) + y <- as.factor(y) + levels(y) <- 1:nlevels(y) + y <- as.numeric(y) } if (nbrU > 0) { for (l in 0:(nbrU - 1)) { @@ -252,10 +280,9 @@ discretizeMutual <- function(X, } } } - is_continuous <- !is_discrete # Pass complexity parameter as int - if (cplx == "mdl") { + if (cplx == "bic") { intcplx <- 0 } else if (cplx == "nml") { intcplx <- 1 @@ -263,7 +290,7 @@ discretizeMutual <- function(X, warning( paste0( "cplx parameter not understood, please specify either ", - "\'mdl\' or \'nml\'. Running with the default option ", + "\'bic\' or \'nml\'. Running with the default option ", "(nml)." 
) ) @@ -271,14 +298,14 @@ discretizeMutual <- function(X, } if (is.null(n_eff)) { - n_eff <- length(X) + n_eff <- length(x) } if (is.null(sample_weights)) { sample_weights <- numeric(0); } - input_data = data.frame(X,Y) + input_data = data.frame(x,y) if(!all(is.na(matrix_u_NA))) input_data = cbind(input_data, matrix_u_NA) n_samples <- nrow(input_data) n_nodes <- ncol(input_data) @@ -321,8 +348,8 @@ discretizeMutual <- function(X, niterations <- nrow(rescpp$cutpointsmatrix) / maxbins result <- list() - epsilon <- min(c(sd(X), sd(Y))) / 100 - result$niterations <- niterations + epsilon <- min(c(sd(x), sd(y))) / 100 + result$n_iterations <- niterations for (i in 0:(niterations - 1)) { result[[paste0("iteration", i + 1)]] <- list() for (l in 1:2) { @@ -331,10 +358,10 @@ discretizeMutual <- function(X, (1:maxbins)] clean_cutpoints <- clean_cutpoints[clean_cutpoints != -1] if (l == 1) { - data <- X + data <- x } else { if (l == 2) { - data <- Y + data <- y } else { data <- matrix_u[, l - 2] } @@ -377,7 +404,7 @@ discretizeMutual <- function(X, if (plot) { if (base::requireNamespace("ggplot2", quietly = TRUE) & base::requireNamespace("gridExtra", quietly = TRUE)) { if (!any(is_discrete[1:2])) { - result$plot <- jointplot_hist(X, Y, result, nameDist1, nameDist2) + result$plot <- jointplot_hist(x, y, result, nameDist1, nameDist2) } else if (!all(is_discrete[1:2])) { result$plot <- barplot_disc( X_orig, @@ -396,8 +423,9 @@ discretizeMutual <- function(X, result } +#------------------------------------------------------------------------------- # Plot functions - +#------------------------------------------------------------------------------- axisprint <- function(x) { sprintf("%6s", x) } diff --git a/R/miic.R b/R/miic.R index e7fd51a..ca4eb53 100755 --- a/R/miic.R +++ b/R/miic.R @@ -1,47 +1,64 @@ +#******************************************************************************* +# Filename : miic.R +# +# Description: main function of the miic the package (user front-end) 
+#*******************************************************************************
+
+#===============================================================================
+# FUNCTIONS
+#===============================================================================
+# miic
+#-------------------------------------------------------------------------------
#' MIIC, causal network learning algorithm including latent variables
#'
-#' @description MIIC (Multivariate Information based Inductive Causation) combines
+#' @description MIIC (Multivariate Information-based Inductive Causation) combines
#' constraint-based and information-theoretic approaches to disentangle direct
#' from indirect effects amongst correlated variables, including cause-effect
#' relationships and the effect of unobserved latent causes.
#'
-#' @details In standard mode, starting from a complete graph, the method iteratively removes
+#' @details Starting from a complete graph, the method iteratively removes
#' dispensable edges, by uncovering significant information contributions from
#' indirect paths, and assesses edge-specific confidences from randomization of
#' available data. The remaining edges are then oriented based on the signature
-#' of causality in observational data.
+#' of causality in observational data. MIIC distinguishes genuine causal edges
+#' (with both reliable arrow heads and tails) from putative causal edges (with
+#' one reliable arrow head only) and latent causal edges (with both reliable
+#' arrow heads) (see Ribeiro-Dantas 2024).
#'
#' In temporal mode, miic reorganizes the dataset using the \emph{n_layers} and
#' \emph{delta_t} parameters to transform the time steps into lagged samples.
#' As starting point, a lagged graph is created with only edges having at
#' least one node laying on the last time step.
#' Then, miic standard algorithm is applied to remove dispensable edges.
-#' The remaining edges are then oriented by using the temporality and the
+#' The remaining edges are then duplicated to ensure time invariance
+#' (stationary dynamic) and oriented using the temporality and the
#' signature of causality in observational data. The use of temporal mode
-#' is exposed in Simon \emph{et al.}, eLife reviewed preprint 2024.
-#'
-#' The method relies on an information theoretic based (conditional) independence
-#' test which is described in (Verny \emph{et al.}, PLoS Comp. Bio. 2017),
-#' (Cabeli \emph{et al.}, PLoS Comp. Bio. 2020). It deals with both categorical
-#' and continuous variables by performing optimal context-dependent discretization.
-#' As such, the input data frame may contain both numerical columns which will be
-#' treated as continuous, or character / factor columns which will be treated
+#' is presented in Simon 2024.
+#'
+#' The method relies on information theoretic principles which replace
+#' (conditional) independence tests as described in Affeldt 2015, Cabeli 2020,
+#' Cabeli 2021 and Ribeiro-Dantas 2024. It deals with both categorical and
+#' continuous variables by performing optimal context-dependent discretization.
+#' As such, the input data frame may contain numerical columns, which will
+#' be treated as continuous, or character / factor columns, which will be treated
#' as categorical. For further details on the optimal discretization method and
#' the conditional independence test, see the function discretizeMutual.
-#' The user may also choose to run miic with scheme presented in
-#' (Li \emph{et al.}, NeurIPS 2019) to improve the end result's interpretability
-#' by ensuring consistent separating set during the skeleton iterations.
+#' The user may also choose to run miic with the scheme presented in Li 2019
+#' and Ribeiro-Dantas 2024 to improve the end result's interpretability
+#' by ensuring consistent separating sets.
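A minimal usage sketch of the workflow described above, on a toy mixed-type data frame; column classes determine the treatment (numeric as continuous, factor/character as categorical). The `summary` field name is the one introduced in this release, and the `requireNamespace` guard keeps the sketch runnable even where miic is not installed.

```r
# Toy mixed-type dataset with a known generative structure:
# x -> z (continuous), z -> y (categorical).
set.seed(7)
n <- 300
x <- rnorm(n)
z <- 0.8 * x + rnorm(n)
y <- factor(ifelse(z + rnorm(n) > 0, "on", "off"))
df <- data.frame(x = x, z = z, y = y)

# Run miic only if the package is available (guard keeps this self-contained)
if (requireNamespace("miic", quietly = TRUE)) {
  res <- miic::miic(input_data = df)
  print(head(res$summary))
}
```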
#' #' @seealso \code{\link{discretizeMutual}} for optimal discretization and #' (conditional) independence test. #' #' @references #' \itemize{ -#' \item{Simon et al., \emph{eLife reviewed preprint} https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1.abstract } -#' \item{Ribeiro-Dantas et al., \emph{iScience 2024} https://arxiv.org/abs/2303.06423 } -#' \item{Cabeli et al., \emph{PLoS Comp. Bio. 2020.} https://doi.org/10.1371/journal.pcbi.1007866 } -#' \item{Li et al., \emph{NeurIPS 2019} http://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets.pdf } -#' \item{Verny et al., \emph{PLoS Comp. Bio. 2017.} https://doi.org/10.1371/journal.pcbi.1005662 } +#' \item Simon \emph{et al.}, eLife 2024, \href{https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1.abstract}{CausalXtract: a flexible pipeline to extract causal effects from live-cell time-lapse imaging data} +#' \item Ribeiro-Dantas \emph{et al.}, iScience 2024, \href{https://doi.org/10.1016/j.isci.2024.109736}{Learning interpretable causal networks from very large datasets, application to 400,000 medical records of breast cancer patients} +#' \item Cabeli \emph{et al.}, NeurIPS 2021, \href{https://why21.causalai.net/papers/WHY21_24.pdf}{Reliable causal discovery based on mutual information supremum principle for finite dataset} +#' \item Cabeli \emph{et al.}, PLoS Comput. Biol. 2020, \href{https://doi.org/10.1371/journal.pcbi.1007866}{Learning clinical networks from medical records based on information estimates in mixed-type data} +#' \item Li \emph{et al.}, NeurIPS 2019, \href{http://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets.pdf}{Constraint-based causal structure learning with consistent separating sets} +#' \item Verny \emph{et al.}, PLoS Comput. Biol. 
2017, \href{https://doi.org/10.1371/journal.pcbi.1005662}{Learning causal networks with latent variables from multivariate information in genomic data}
+#' \item Affeldt \emph{et al.}, UAI 2015, \href{https://auai.org/uai2015/proceedings/papers/293.pdf}{Robust Reconstruction of Causal Graphical Models based on Conditional 2-point and 3-point Information}
#' }
#'
#' @param input_data [a data frame, required]
@@ -106,9 +123,9 @@
#' Note that if a \emph{"delta_t"} column is present in the \emph{state_order},
#' its values will overwrite the function parameter.
#'
-#' \emph{"movavg"} (optional) contains an integer value that specifies the size of
-#' the moving average window to be applied to the variable.
-#' Note that if \emph{"movavg"} column is present in the \emph{state_order},
+#' \emph{"mov_avg"} (optional) contains an integer value that specifies the size
+#' of the moving average window to be applied to the variable.
+#' Note that if a \emph{"mov_avg"} column is present in the \emph{state_order},
#' its values will overwrite the function parameter.
#'
#' @param true_edges [a data frame, optional, NULL by default]
@@ -125,9 +142,11 @@
#' with \emph{n_layers} = 4 + \emph{delta_t} = 1 or
#' \emph{n_layers} = 2 + \emph{delta_t} = 3
#' but not for \emph{n_layers} = 2 + \emph{delta_t} = 2 as there is no matching
-#' edge in the time unfolded graph. Please note that the order is important:
-#' var1, var2, 3 is interpreted as var1_lag3 - var2_lag0. Please note also that,
-#' for contextual variables that are not lagged, the expected value in the
+#' edge in the time unfolded graph.\cr
+#' Please note that the order is important: in standard mode, "var1 var2" will
+#' be interpreted as var1 -> var2 and in temporal mode, "var1 var2 3" is
+#' interpreted as var1_lag3 -> var2_lag0. Please note also that, in temporal
+#' mode, for contextual variables that are not lagged, the expected value in the
#' third column for the time lag is NA.
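The column convention for \emph{true_edges} described above can be illustrated with a small hypothetical table (the column names here are purely illustrative; miic reads the columns by position):

```r
# Hypothetical ground-truth table: row order encodes direction
# ("var1 var2" means var1 -> var2); in temporal mode a third column gives
# the lag of the first variable, with NA for contextual (non-lagged) ones.
true_edges <- data.frame(
  from = c("var1", "context_var"),
  to   = c("var2", "var2"),
  lag  = c(3, NA)   # var1_lag3 -> var2_lag0; context_var is not lagged
)
true_edges
```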
#' #' @param black_box [a data frame, optional, NULL by default] @@ -157,17 +176,17 @@ #' your compiler is compatible with openmp if you wish to use multithreading. #' #' @param cplx [a string, optional, "nml" by default, possible values: -#' "nml", "mdl"] +#' "nml", "bic"] #' #' In practice, the finite size of the input dataset requires that #' the 2-point and 3-point information measures should be \emph{shifted} #' by a \emph{complexity} term. The finite size corrections can be based on -#' the Minimal Description Length (MDL) criterion. -#' However, the MDL complexity criterion tends to underestimate the +#' the Bayesian Information Criterion (BIC). +#' However, the BIC complexity term tends to underestimate the #' relevance of edges connecting variables with many different categories, #' leading to the removal of false negative edges. To avoid such biases #' with finite datasets, the (universal) Normalized Maximum Likelihood (NML) -#' criterion can be used (see Affeldt \emph{et al.}, UAI 2015). +#' criterion can be used (see Affeldt 2015). #' #' @param orientation [a boolean value, optional, TRUE by default] #' @@ -176,33 +195,38 @@ #' 3-point information of unshielded triples and, in temporal mode, using time. #' If set to FALSE, the orientation step is not performed. #' -#' @param ori_proba_ratio [a floating point between 0 and 1, optional, +#' @param ort_proba_ratio [a floating point between 0 and 1, optional, #' 1 by default] #' #' The threshold when deducing the type of an edge tip (head/tail) #' from the probability of orientation. #' For a given edge tip, denote by p the probability of it being a head, -#' the orientation is accepted if (1 - p) / p < \emph{ori_proba_ratio}. +#' the orientation is accepted if (1 - p) / p < \emph{ort_proba_ratio}. #' 0 means reject all orientations, 1 means accept all orientations. 
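The acceptance rule stated above is simple enough to sketch directly. `accept_head` is a hypothetical helper, not part of the miic API; it only restates the threshold test on a tip's head probability.

```r
# Sketch of the tip-acceptance rule (hypothetical helper, not miic API):
# given the probability p that an edge tip is a head, the head is accepted
# when (1 - p) / p < ort_proba_ratio.
accept_head <- function(p, ort_proba_ratio = 1) {
  (1 - p) / p < ort_proba_ratio
}

accept_head(0.90, ort_proba_ratio = 0.5)  # (1 - 0.9) / 0.9 ~ 0.11 -> TRUE
accept_head(0.55, ort_proba_ratio = 0.5)  # (1 - 0.55) / 0.55 ~ 0.82 -> FALSE
```

With the default ratio of 1, any p above 0.5 is accepted; lowering the ratio demands increasingly confident head probabilities.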
-#' @param ori_consensus_ratio [a floating point between 0 and 1, optional,
+#' @param ort_consensus_ratio [a floating point between 0 and 1, optional,
#' NULL by default]
-#'
-#' The threshold when deducing the type of an consensus edge tip (head/tail)
-#' from the average probability of orientation.
-#' For a given edge tip, denote by p the probability of it being a head,
-#' the orientation is accepted if (1 - p) / p < \emph{ori_consensus_ratio}.
-#' 0 means reject all orientations, 1 means accept all orientations.
-#' If not supplied, the \emph{ori_consensus_ratio} will be initialized with
-#' the \emph{ori_proba_ratio} value.
+#' Used to determine if orientations correspond to genuine causal edges
+#' and, when consistency is activated, to deduce the orientations in
+#' the consensus graph.\cr
+#' Oriented edges will be marked as genuine causal when:
+#' \eqn{ (1 - p_{head}) / p_{head} < } \emph{ort_consensus_ratio}
+#' and \eqn{ p_{tail} / (1 - p_{tail}) < } \emph{ort_consensus_ratio}.\cr
+#' When consistency is activated, \emph{ort_consensus_ratio} is used as
+#' a threshold when deducing the type of a consensus edge tip (head/tail)
+#' from the average probability of orientations over the cycle of graphs.
+#' For a given edge tip, if p denotes the average probability of it being a
+#' head, the orientation is accepted when (1 - p) / p < \emph{ort_consensus_ratio}.\cr
+#' If not supplied, the \emph{ort_consensus_ratio} will be initialized with
+#' the \emph{ort_proba_ratio} value.
#'
#' @param propagation [a boolean value, optional, FALSE by default]
#'
#' If set to FALSE, the skeleton is partially oriented with only the
#' v-structure orientations. Otherwise, the v-structure orientations are
#' propagated to downstream undirected edges in unshielded triples following
#' the propagation procedure, relying on probabilities (for more details,
-#' see Verny \emph{et al.}, PLoS Comp.
Bio. 2017). +#' see Verny 2017). #' #' @param latent [a string, optional, "orientation" by default, possible #' values: "orientation", "no", "yes"] @@ -221,13 +245,13 @@ #' expected to be independent. In case of correlated samples such as in #' Monte Carlo sampling approaches, the effective number of independent samples #' \emph{n_eff} can be estimated using the decay of the autocorrelation function -#' (Verny et al., PLoS Comp. Bio. 2017). This effective number \emph{n_eff} -#' of independent samples can be provided using this parameter. +#' (see Verny 2017). This effective number \emph{n_eff} of independent samples +#' can be provided using this parameter. #' #' @param n_shuffles [a positive integer, optional, 0 by default] #' #' The number of shufflings of the original dataset in order to evaluate -#' the edge specific confidence ratio of all inferred edges. +#' the edge specific confidence ratio of all retained edges. #' Default is 0: no confidence cut is applied. If the number of shufflings #' is set to an integer > 0, the confidence threshold must also be > 0 #' (e.g. \emph{n_shuffles} = 100 and \emph{conf_threshold} = 0.01). @@ -235,9 +259,8 @@ #' @param conf_threshold [a positive floating point, optional, 0 by default] #' #' The threshold used to filter the less probable edges following the skeleton -#' step. See Verny \emph{et al.}, PLoS Comp. Bio. 2017. -#' Default is 0: no confidence cut is applied. If the the confidence threshold -#' is set > 0, the number of shufflings must also be > 0 +#' step (see Verny 2017). Default is 0: no confidence cut is applied. If the +#' confidence threshold is set > 0, the number of shufflings must also be > 0 #' (e.g. \emph{n_shuffles} = 100 and \emph{conf_threshold} = 0.01). #' #' @param sample_weights [a numeric vector, optional, NULL by default] @@ -255,16 +278,17 @@ #' the joint distribution of \eqn{X} and \eqn{Y} on samples which are #' not missing on Z. 
#' This is a way to ensure that data are missing at random for the considered
-#' interaction and to avoid selection bias.
+#' interaction and to detect bias due to values not missing at random.
#'
#' @param consistent [a string, optional, "no" by default, possible values:
#' "no", "orientation", "skeleton"]
#'
#' If set to "orientation": iterate over skeleton and orientation steps to
-#' ensure consistency of the network.
-#' If set to "skeleton": iterate over skeleton step to get a consistent skeleton,
-#' then orient edges and discard inconsistent orientations to ensure consistency
-#' of the network (see Li \emph{et al.}, NeurIPS 2019 for details).
+#' ensure consistency of the separating sets and all disconnected pairs in the
+#' final network.
+#' If set to "skeleton": iterate over the skeleton step to get a consistent
+#' skeleton, then orient edges, possibly including inconsistent orientations
+#' (see Li 2019 for details).
#'
#' @param max_iteration [a positive integer, optional, 100 by default]
#'
@@ -293,9 +317,8 @@
#'
#' @param negative_info [a boolean value, optional, FALSE by default]
#'
-#' For test purpose only. If TRUE, negative shifted mutual information is
-#' allowed during the computation when mutual information is inferior to the
-#' complexity term.
+#' If TRUE, negative shifted mutual information is allowed during the
+#' computation when mutual information is smaller than the complexity term.
#' For small dataset with complicated structures, e.g. discrete variables with
#' many levels, allowing for negative shifted mutual information may help
#' identifying weak v-structures related to those discrete variables,
In practice, it's advised to keep this parameter as FALSE.
+#' and, as a consequence, the reliability of certain orientations is not guaranteed.
+#' By contrast, keeping this parameter as FALSE is more conservative and leads
+#' to more reliable orientations (see Cabeli 2021 and Ribeiro-Dantas 2024).
#'
#' @param mode [a string, optional, "S" by default, possible values are
-#' "S": Standard (IID samples) or "TS": Temporal Stationary"]
+#' "S": Standard (non-temporal data) or "TS": Temporal Stationary data]
#'
#' When temporal mode is activated, the time information must be provided
#' in the first column of \emph{input_data}. For more details about temporal
-#' stationary mode, see Simon \emph{et al.}, eLife reviewed preprint 2024.
+#' stationary mode, see Simon 2024.
#'
#' @param n_layers [an integer, optional, NULL by default, must be >= 2
#' if supplied]
@@ -335,7 +359,7 @@
#' If not supplied, the number of time steps between layers is estimated
#' from the dynamic of the dataset and the number of layers.
#'
-#' @param movavg [an integer, optional, NULL by default, must be >= 2
+#' @param mov_avg [an integer, optional, NULL by default, must be >= 2
#' if supplied]
#'
#' Used only in temporal mode. When supplied, a moving average operation is
@@ -352,7 +376,7 @@
#'
#' Used only in temporal mode and if the \emph{n_layers} or \emph{delta_t}
#' parameters are not supplied. \emph{max_nodes} is used as the maximum number
-#' of nodes in the final graph to compute \emph{n_layers} and/or \emph{delta_t}.
+#' of nodes in the final time-unfolded graph to compute \emph{n_layers}
+#' and/or \emph{delta_t}.
#' The default is 50 to produce quick runs and can be increased up to 200
#' or 300 on recent computers to produce more precise results.
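The lagged-sample reorganization that temporal mode performs with \emph{n_layers} and \emph{delta_t} can be illustrated with a small base-R sketch. `lag_unfold` is an illustrative helper, not the package implementation:

```r
# Illustrative sketch (NOT the miic implementation) of the temporal-mode
# reorganization: with n_layers layers separated by delta_t time steps,
# each variable v yields lagged columns v_lag0, v_lag{delta_t}, ...
lag_unfold <- function(ts_df, n_layers, delta_t) {
  max_lag <- (n_layers - 1) * delta_t
  n_t <- nrow(ts_df)
  out <- list()
  for (v in names(ts_df)) {
    for (k in 0:(n_layers - 1)) {
      lag <- k * delta_t
      # row i holds v at time t - lag, aligned on the last layer (lag 0)
      out[[paste0(v, "_lag", lag)]] <- ts_df[[v]][(max_lag - lag + 1):(n_t - lag)]
    }
  }
  as.data.frame(out)
}

ts <- data.frame(a = 1:10, b = 11:20)
lagged <- lag_unfold(ts, n_layers = 2, delta_t = 3)
names(lagged)  # "a_lag0" "a_lag3" "b_lag0" "b_lag3"
```

Each row of the unfolded data frame pairs a time step with its lagged predecessors, which is why \emph{max_nodes} bounds the size of the time-unfolded graph rather than the original variable count.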
#'
@@ -362,73 +386,149 @@
#'
#' @return A \emph{miic-like} object that contains:
#' \itemize{
-#' \item{\emph{all.edges.summary:} a data frame with information about the relationship between
-#' each pair of variables
+#' \item{\emph{summary:} a data frame with information about the
+#' relationship between relevant pairs of variables.
+#'
+#' As returning the information on all possible pairs of variables could lead
+#' to a huge data frame, by convention, the summary does not include pairs of
+#' variables sharing no information at all (\emph{I'(x,y) <= 0}).
+#' However, as an exception to this convention, when a ground truth is supplied
+#' (using the \emph{true_edges} parameter), the edges that are not retained
+#' by MIIC because the variables do not share information at all
+#' but are present in the true edges will be included in the summary
+#' to correctly report all the false negative edges.
+#'
+#' So, the summary contains these categories of edges:
+#' \itemize{
+#' \item{ edges retained}
+#' \item{ edges not retained after conditioning on some contributor(s)}
+#' \item{ edges not retained without conditioning but present in true edges}
+#' }
+#' while these edges are not considered relevant and are not included:
+#' \itemize{
+#' \item{ edges not retained without conditioning and not in true edges}
+#' }
+#'
+#' The information available in the summary is:
#' \itemize{
-#' \item{ \emph{x:} X node}
-#' \item{ \emph{y:} Y node}
+#' \item{ \emph{x:} X node name}
+#' \item{ \emph{y:} Y node name}
#' \item{ \emph{type:} contains 'N' if the edge has been removed or 'P' for
#' retained edges. If the true graph is supplied in the \emph{true_edges}
#' parameter, 'P' becomes 'TP' (True Positive) or 'FP' (False Positive),
#' while 'N' becomes 'TN' (True Negative) or 'FN' (False Negative).
-#' Note that, as the \emph{all.edges.summary} does not contain all the
-#' negative edges, edges not present are 'TN'.}
+#' Note that, as the \emph{summary} does not contain all the
+#' removed edges, edges not present have to be considered as 'N'
+#' and, if the true graph is supplied, as 'TN'.}
#' \item{ \emph{ai:} the contributing nodes found by the method which
-#' participate in the mutual information between \emph{x} and \emph{y},
+#' contribute to the mutual information between \emph{x} and \emph{y},
#' and possibly separate them.}
#' \item{ \emph{raw_contributions:} describes the share of total mutual
-#' information between \emph{x} and \emph{y} explained by each contributor.}
+#' information between \emph{x} and \emph{y} explained by each contributor,
+#' measured by I'(x;y;ai|\{aj\}) / I'(x;y),
+#' where \{aj\} is the separating set before adding ai.}
#' \item{ \emph{contributions:} describes the share of remaining mutual
#' information between \emph{x} and \emph{y} explained by each successive
-#' contributors.}
-#' \item{ \emph{info:} provides the pairwise mutual information times
-#' \emph{Nxyi} for the pair (\emph{x}, \emph{y}).}
-#' \item{ \emph{info_cond:} provides the conditional mutual information
-#' times \emph{Nxy_ai} for the pair (\emph{x}, \emph{y}) when conditioned
-#' on the collected nodes \emph{ai}. It is
-#' equal to the \emph{info} column when \emph{ai} is an empty set.}
-#' \item{ \emph{cplx:} gives the computed complexity between the (\emph{x},
-#' \emph{y}) variables taking into account the contributing nodes \emph{ai}.
-#' Edges that have have more conditional information \emph{info_cond}
-#' than \emph{cplx} are retained in the final graph.}
-#' \item{ \emph{Nxy_ai:} gives the number of complete samples on which the
-#' information and the complexity have been computed. If the input dataset
+#' contributor, measured by I'(x;y;ai|\{aj\}) / I'(x;y|\{aj\}),
+#' where \{aj\} is the separating set before adding ai.
} +#' \item{ \emph{info:} the mutual information \emph{I(x;y)} times \emph{n_xy}, +#' the number of samples without missing or NA values for both \emph{x} +#' and \emph{y}. } +#' \item{ \emph{n_xy:} gives the number of samples on which the information +#' without conditioning has been computed. If the input dataset #' has no missing value, the number of samples is the same for all pairs #' and corresponds to the total number of samples.} -#' \item{ \emph{info_shifted:} represents the \emph{info} - \emph{cplx} value. -#' It is a way to quantify the strength of the edge (\emph{x}, \emph{y}).} -#' \item{ \emph{infOrt:} the orientation of the edge (\emph{x}, \emph{y}). -#' It is the same value as in the adjacency matrix at row \emph{x} and -#' column \emph{y} : 1 for unoriented, 2 for an edge from X to Y, -#' -2 from Y to X and 6 for bidirectional.} -#' \item{ \emph{trueOrt:} the orientation of the edge (\emph{x}, \emph{y}) -#' present in the true edges are provided.} -#' \item{ \emph{isOrtOk:} information about the consistency of the inferred -#' graph’s orientations with a reference graph is given (if true edges -#' are provided). -#' 'Y': the orientation is consistent; 'N': the orientation is not consistent -#' with the PAG (Partial Ancestor Graph) derived from the given true graph.} +#' \item{ \emph{info_cond:} the conditional mutual information \emph{I(x;y|ai)} +#' times the number of samples without NA \emph{n_xy_ai} used in the +#' computation. +#' \emph{info_cond} is equal to \emph{info} when \emph{ai} is an empty set.} +#' \item{ \emph{cplx:} the complexity term for the pair (\emph{x}, \emph{y}) +#' taking into account the contributing nodes \emph{ai}.} +#' \item{ \emph{n_xy_ai:} the number of samples without NA in +#' \emph{x}, \emph{y} and all nodes in \emph{ai} on which the +#' information and the complexity terms are computed. 
+#' If the input dataset has no missing value, the number of samples is the +#' same for all pairs and corresponds to the total number of samples.} +#' \item{ \emph{info_shifted:} value equal to \emph{info_cond} - \emph{cplx}. +#' Used to decide whether the edge is retained (when positive), +#' or removed (when zero or possibly negative when the parameter +#' \emph{negative_info} is set to TRUE).} +#' \item{ \emph{ort_inferred:} the orientation of the edge (\emph{x}, \emph{y}). +#' 0: edge removed, 1: un-directed, 2: directed from X to Y, -2: directed +#' from Y to X, 6: bi-directed.\cr +#' When the \emph{consistent} option is turned on and there is more than +#' one graph in the consistent cycle, this is the inferred orientation +#' of the edge in the last graph in the cycle. } +#' \item{ \emph{ort_ground_truth:} the orientation of the edge (\emph{x}, +#' \emph{y}) in the ground truth graph when true edges are provided.} +#' \item{ \emph{is_inference_correct:} indicates if the inferred orientation +#' agrees with the provided ground truth. TRUE: agrees, FALSE: disagrees, +#' and NA when no ground truth is supplied.} +#' \item{ \emph{is_causal:} boolean value indicating the causal nature of the +#' arrow tips of an edge, based on the probabilities given in the columns +#' \emph{p_y2x} and \emph{p_x2y}. TRUE: when the edge is directed +#' and both the head and the tail are set with high confidence +#' (adjustable with the \emph{ort_consensus_ratio} parameter), +#' FALSE otherwise or NA if the edge is not retained. +#' More formally, an oriented edge is marked as genuine causal when +#' \eqn{ (1 - p_{head}) / p_{head} < } \emph{ort_consensus_ratio} +#' and \eqn{ p_{tail} / (1 - p_{tail}) < } \emph{ort_consensus_ratio}.\cr +#' A directed edge not marked as genuine causal indicates that only +#' the head is set with high confidence, while the tail remains uncertain.
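The genuine-causal criterion just stated can be sketched as a small standalone function; `is_genuine_causal`, its arguments, and the 0.1 threshold below are illustrative stand-ins (not miic's internal code) for the tip probabilities `p_x2y`/`p_y2x` and the `ort_consensus_ratio` parameter:

```r
# Illustrative sketch (not miic internals) of the genuine-causal test:
# an oriented edge is genuine causal when both arrow tips are set with
# high confidence, as measured by the odds ratios on the tip probabilities.
is_genuine_causal <- function(p_head, p_tail, ort_consensus_ratio = 0.1) {
  # head confidently a head: odds against the head are below the ratio
  head_ok <- (1 - p_head) / p_head < ort_consensus_ratio
  # tail confidently a tail: odds of a head on the tail side are below the ratio
  tail_ok <- p_tail / (1 - p_tail) < ort_consensus_ratio
  head_ok && tail_ok
}

is_genuine_causal(0.99, 0.01)  # TRUE: both tips confident
is_genuine_causal(0.99, 0.50)  # FALSE: tail uncertain, putative causal only
```

With `p_head` close to 1 and `p_tail` close to 0, both odds ratios shrink below the threshold; an uncertain tail (around 0.5) fails the second test, matching the putative-causal case described above.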
+#' This corresponds to a putative causal edge, which could either be +#' a genuine causal edge or a bi-directed edge from a latent confounder.\cr +#' Note that genuine causality is deducible only when latent variables +#' are allowed and propagation is not allowed.} +#' \item{ \emph{ort_consensus:} Not computed (NAs) when +#' consistency is not activated or, when consistency is on, +#' if there is only one graph returned (no cycle). +#' When computed, indicates the consensus orientation of the edge +#' determined from the consensus skeleton and the \emph{ort_consensus_ratio} +#' threshold on averaged orientation probabilities over the cycle of graphs. +#' Possible values are 0: not connected, 1: un-directed, -2 or 2: directed +#' and 6: bi-directed (latent variable).} +#' \item{ \emph{is_causal_consensus:} Not computed (NAs) when +#' consistency is not activated or, when consistency is on, +#' if there is only one graph returned (no cycle). +#' When computed, works in the same way as \emph{is_causal} +#' but on the consensus graph.} +#' \item{ \emph{edge_stats:} Not computed (NAs) when +#' consistency is not activated or, when consistency is on, +#' if there is only one graph returned (no cycle). +#' When computed, contains the frequencies of all \emph{ort_inferred} +#' values present in the cycle of graphs for the edge (\emph{x, y}), +#' in the format [percentage(orientation)], separated by ";". +#' e.g.
In a cycle of 4 graphs, if an edge is three times marked as 2 +#' (directed) and one time marked as 1 (un-directed), edge_stats will +#' contain "75\%(2);25\%(1)".} #' \item{ \emph{sign:} the sign of the partial correlation between variables #' \emph{x} and \emph{y}, conditioned on the contributing nodes \emph{ai}.} #' \item{ \emph{partial_correlation:} value of the partial correlation for the -#' edge (\emph{x}, \emph{y}) conditioned on the contributing nodes \emph{ai}.} -#' \item{ \emph{is_causal:} details about the nature of the arrow tip for a -#' directed edge. A directed edge in a causal graph does not necessarily imply -#' causation but it does imply that the cause-effect relationship is not the -#' other way around. An arrow-tip which is itself downstream of another -#' directed edge suggests stronger causal sense and is marked by a 'Y', -#' or 'N' otherwise.} -#' \item{ \emph{proba:} probabilities for the inferred orientation, derived -#' from the three-point mutual information (cf Affeldt & Isambert, UAI 2015 -#' proceedings) and noted as p(x->y);p(x<-y).} -#' \item{ \emph{confidence:} this column is computed when the confidence cut -#' is activated. It represents the ratio between the probability to reject -#' the edge (\emph{x}, \emph{y}) in the dataset versus the mean probability -#' to do the same in multiple (user defined) number of randomized datasets.} +#' edge (\emph{x, y}) conditioned on the contributing nodes \emph{ai}.} +#' \item{ \emph{p_y2x:} probability of the arrowhead from \emph{y} to \emph{x}, +#' of the inferred orientation, derived from the three-point mutual information +#' (see Verny 2017 and Ribeiro-Dantas 2024). NA if the edge is removed.} +#' \item{ \emph{p_x2y:} probability of the arrowhead from \emph{x} to \emph{y}, +#' of the inferred orientation, derived from the three-point mutual information +#' (see Verny 2017 and Ribeiro-Dantas 2024). 
NA if the edge is removed.} +#' \item{ \emph{confidence:} computed only when the confidence cut is +#' activated, NA otherwise. +#' When computed, it corresponds to a measure of the strength of the retained +#' edges: it is the ratio between the probability to reject the edge +#' \emph{exp(-info_shifted(x;y|ai))} in the original dataset and +#' the mean probability to do the same in \emph{n_shuffles} number +#' of randomized datasets. Edges with \emph{confidence} > \emph{conf_threshold} +#' will be filtered out from the graph. +#' (see parameters \emph{n_shuffles} and \emph{conf_threshold}) +#' } #' } #' } #' -#' \item{\emph{orientations.prob:} this data frame lists the orientation +#' \item{\emph{edges:} a data frame with the raw edges output coming from +#' the C++ core function. This data frame is used internally by MIIC to +#' produce the summary and contains all pairs of variables (\emph{x, y}). } +#' +#' \item{\emph{triples:} this data frame lists the orientation #' probabilities of the two edges of all unshielded triples of the #' reconstructed network with the structure: node1 -- mid-node -- node2: #' \itemize{ @@ -439,9 +539,9 @@ #' \item{ \emph{p3:} probability of the arrowhead mid-node <- node2} #' \item{ \emph{p4:} probability of the arrowhead mid-node -> node2} #' \item{ \emph{node2:} node at the end of the unshielded triplet} -#' \item{ \emph{NI3:} 3 point (conditional) mutual information * N} -#' \item{ \emph{Conflict:} indicates if there a conflict between the -#' computed probabilities and the NI3 value} +#' \item{ \emph{ni3:} 3 point (conditional) mutual information * N} +#' \item{ \emph{conflict:} indicates if there is a conflict between the +#' computed probabilities and the \emph{ni3} value} #' } #' } #' @@ -450,27 +550,64 @@ #' pairs of vertices are adjacent or not in the graph. The matrix can be read #' as a (row, column) set of couples where the row represents the source node #' and the column the target node. 
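As a usage sketch of this (row = source, column = target) convention, directed edges can be pulled straight from the matrix; the toy matrix and node names below are invented for illustration, using the digit codes listed just after:

```r
# Toy adjacency matrix following the (row = source, column = target)
# convention; node names and values are invented for illustration.
# 2 at (A,B) and -2 at (B,A) encode the directed edge A -> B;
# 1 at (A,C) and (C,A) encode the un-directed edge A -- C.
adj <- matrix(c( 0,  2, 1,
                -2,  0, 0,
                 1,  0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
idx <- which(adj == 2, arr.ind = TRUE)  # 2: directed row -> column
directed_edges <- data.frame(from = rownames(adj)[idx[, "row"]],
                             to   = colnames(adj)[idx[, "col"]])
directed_edges  # one directed edge: A -> B
```

The same pattern with `adj == 1` or `adj == 6` lists un-directed or bi-directed pairs (each un-directed edge appears twice, once per orientation of the pair).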
Since miic can reconstruct mixed networks -#' (including directed, undirected and bidirected edges), we will have a +#' (including directed, un-directed and bi-directed edges), we will have a #' different digit for each case: #' \itemize{ -#' \item{ 1: (\emph{x}, \emph{y}) edge is undirected} +#' \item{ 1: (\emph{x}, \emph{y}) edge is un-directed} #' \item{ 2: (\emph{x}, \emph{y}) edge is directed as \emph{x} -> \emph{y} } #' \item{ -2: (\emph{x}, \emph{y}) edge is directed as \emph{x} <- \emph{y} } -#' \item{ 6: (\emph{x}, \emph{y}) edge is bidirected} +#' \item{ 6: (\emph{x}, \emph{y}) edge is bi-directed} #' } #' } #' #' \item {\emph{proba_adj_matrix:} the probability adjacency matrix is -#' a square matrix used to represent the orientation probabilities associated +#' a square matrix used to represent the orientation probabilities associated #' to the edges tips. The value at ("row", "column") is the probability, #' for the edge between "row" and "column" nodes, of the edge tip on the "row" -#' side. A probability less than 0.5 is an indication of a possible tail -#' (cause) and a probability greater than 0.5 a possible head (effect). +#' side. A probability less than 0.5 is an indication of a possible tail +#' (cause) and a probability greater than 0.5 a possible head (effect). } +#' +#' \item {\emph{adj_matrices:} present only when consistency is activated. +#' The list of the adjacency matrices, one for each graph +#' which is part of the resulting cycle of graphs. +#' Each item is a square matrix with the same layout as \emph{adj_matrix}. } +#' +#' \item {\emph{proba_adj_matrices:} present only when consistency is activated. +#' The list of the probability adjacency matrices, one for each graph +#' which is part of the resulting cycle of graphs. Each item is a +#' square matrix with the same layout as \emph{proba_adj_matrix}. } +#' +#' \item {\emph{proba_adj_average:} present only when consistency is activated. 
+#' The average probability adjacency matrix is a square matrix used to +#' represent the orientation probabilities associated with the edge tips +#' of the consensus graph. Its layout is the same as \emph{proba_adj_matrix} +#' and it contains the averaged probability of edge tips over the resulting +#' cycle of graphs. } +#' +#' \item {\emph{is_consistent:} present only when consistency is activated. +#' TRUE if the returned graph is consistent, FALSE otherwise. } +#' +#' \item {\emph{time:} execution time of the different steps and total run-time +#' of the causal graph reconstruction by MIIC. } +#' +#' \item {\emph{interrupted:} TRUE if causal graph reconstruction has been +#' interrupted, FALSE otherwise. } +#' +#' \item {\emph{scores:} present only when true edges have been supplied. +#' Contains the scores of the returned graph with regard to the ground truth: +#' \itemize{ +#' \item{ \emph{tp}: number of edges marked as True Positive } +#' \item{ \emph{fp}: number of edges marked as False Positive } +#' \item{ \emph{fn}: number of edges marked as False Negative } +#' \item{ \emph{precision}: Precision } +#' \item{ \emph{recall}: Recall } +#' \item{ \emph{fscore}: F1-Score } +#' } #' } #' #' \item {\emph{params:} the list of parameters used for the network #' reconstruction. The parameters not supplied are initialized to their default -#' values. Otherwise, the parameters are checked and corrected if necessary.} +#' values. Otherwise, the parameters are checked and corrected if necessary. } #' #' \item {\emph{state_order:} the state order used for the network #' reconstruction.
If no state order is supplied, it is generated by using @@ -502,76 +639,58 @@ #' data(hematoData) #' #' # execute MIIC (reconstruct graph) -#' miic.res <- miic( +#' miic_obj <- miic( #' input_data = hematoData[1:1000,], latent = "yes", #' n_shuffles = 10, conf_threshold = 0.001 #' ) #' #' # plot graph #' if(require(igraph)) { -#' plot(miic.res, method="igraph") +#' plot(miic_obj, method="igraph") #' } #' #' \donttest{ #' # write graph to graphml format. Note that to correctly visualize #' # the network we created the miic style for Cytoscape (http://www.cytoscape.org/). #' -#' miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp")) +#' writeCytoscapeNetwork(miic_obj, file = file.path(tempdir(), "temp")) #' #' # EXAMPLE CANCER #' data(cosmicCancer) #' data(cosmicCancer_stateOrder) #' # execute MIIC (reconstruct graph) -#' miic.res <- miic( +#' miic_obj <- miic( #' input_data = cosmicCancer, state_order = cosmicCancer_stateOrder, latent = "yes", #' n_shuffles = 100, conf_threshold = 0.001 #' ) #' #' # plot graph #' if(require(igraph)) { -#' plot(miic.res) -#' } -#' -#' # write graph to graphml format. Note that to correctly visualize -#' # the network we created the miic style for Cytoscape (http://www.cytoscape.org/). -#' miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp")) -#' -#' # EXAMPLE OHNOLOGS -#' data(ohno) -#' data(ohno_stateOrder) -#' # execute MIIC (reconstruct graph) -#' miic.res <- miic( -#' input_data = ohno, latent = "yes", state_order = ohno_stateOrder, -#' n_shuffles = 100, conf_threshold = 0.001 -#' ) -#' -#' # plot graph -#' if(require(igraph)) { -#' plot(miic.res) +#' plot(miic_obj) #' } #' #' # write graph to graphml format. Note that to correctly visualize #' # the network we created the miic style for Cytoscape (http://www.cytoscape.org/). 
-#' miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp")) +#' writeCytoscapeNetwork(miic_obj, file = file.path(tempdir(), "temp")) #' #' # EXAMPLE COVID CASES (time series demo) #' data(covidCases) #' # execute MIIC (reconstruct graph in temporal mode) -#' tmiic.res <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, movavg = 14) +#' tmiic_obj <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, mov_avg = 14) #' #' # to plot the default graph (compact) #' if(require(igraph)) { -#' plot(tmiic.res) +#' plot(tmiic_obj) #' } #' -#' # to plot the raw temporal network Using igraph +#' # to plot the raw temporal network #' if(require(igraph)) { -#' plot(tmiic.res, display="raw") +#' plot(tmiic_obj, display="raw") #' } #' -#' # to plot the full temporal network Using igraph +#' # to plot the full temporal network #' if(require(igraph)) { -#' plot(tmiic.res, display="lagged") +#' plot(tmiic_obj, display="lagged") #' } #' #' } @@ -583,8 +702,8 @@ miic <- function(input_data, n_threads = 1, cplx = "nml", orientation = TRUE, - ori_proba_ratio = 1, - ori_consensus_ratio = NULL, + ort_proba_ratio = 1, + ort_consensus_ratio = NULL, propagation = FALSE, latent = "orientation", n_eff = -1, @@ -599,11 +718,12 @@ miic <- function(input_data, mode = "S", n_layers = NULL, delta_t = NULL, - movavg = NULL, + mov_avg = NULL, keep_max_data = FALSE, max_nodes = 50, verbose = FALSE) { + miic_start = Sys.time() if (verbose) miic_msg ("Start MIIC...") if ( is.null(mode) || ( ! 
(mode %in% MIIC_VALID_MODES) ) ) @@ -619,8 +739,8 @@ miic <- function(input_data, n_threads = n_threads, cplx = cplx, orientation = orientation, - ori_proba_ratio = ori_proba_ratio, - ori_consensus_ratio = ori_consensus_ratio, + ort_proba_ratio = ort_proba_ratio, + ort_consensus_ratio = ort_consensus_ratio, propagation = propagation, latent = latent, n_eff = n_eff, @@ -653,15 +773,15 @@ miic <- function(input_data, params = params, n_layers = n_layers, delta_t = delta_t, - movavg = movavg, + mov_avg = mov_avg, keep_max_data = keep_max_data, max_nodes = max_nodes) params = list_ret$params state_order = tmiic_check_state_order_part2 (list_ret$state_order) list_ts = tmiic_extract_trajectories (input_data) - list_ts = tmiic_movavg (list_ts, state_order$movavg, - keep_max_data=params$keep_max_data, - verbose_level=ifelse (params$verbose, 2, 1) ) + list_ts = tmiic_mov_avg (list_ts, state_order$mov_avg, + keep_max_data=params$keep_max_data, + verbose_level=ifelse (params$verbose, 2, 1) ) state_order = tmiic_estimate_dynamic (list_ts, state_order, max_nodes=params$max_nodes, verbose_level=ifelse (params$verbose, 2, 1) ) @@ -709,6 +829,7 @@ miic <- function(input_data, # if (verbose) miic_msg ("-> Start reconstruction...") + pre_end = Sys.time() res <- miic.reconstruct (input_data = input_data, n_threads = params$n_threads, cplx = params$cplx, @@ -717,7 +838,7 @@ miic <- function(input_data, black_box = black_box, n_shuffles = params$n_shuffles, orientation = params$orientation, - ori_proba_ratio = params$ori_proba_ratio, + ort_proba_ratio = params$ort_proba_ratio, propagation = params$propagation, conf_threshold = params$conf_threshold, verbose = params$verbose, @@ -739,16 +860,28 @@ miic <- function(input_data, # # Post-traitment # - res$all.edges.summary <- summarizeResults ( + post_start = Sys.time() + res$summary <- summarizeResults ( observations = input_data, results = res, true_edges = true_edges, state_order = state_order, consensus_threshold = 
params$consensus_threshold, -    ori_consensus_ratio = params$ori_consensus_ratio, +    ort_consensus_ratio = params$ort_consensus_ratio, latent = (params$latent != "no"), -    propagation = params$propagation, -    verbose = params$verbose) +    propagation = params$propagation) + + if (!is.null (true_edges)) + { + tp = sum (res$summary$type == "TP") + fp = sum (res$summary$type == "FP") + fn = sum (res$summary$type == "FN") + precision = tp / (tp + fp) + recall = tp / (tp + fn) + fscore = (2 * (precision * recall) ) / (precision + recall) + res$scores = c ("tp"=tp, "fp"=fp, "fn"=fn, "precision"=round (precision, 4), + "recall"=round (recall, 4), "fscore"=round (fscore, 4) ) + } res$params = params if (! (mode %in% MIIC_TEMPORAL_MODES) ) @@ -776,13 +909,21 @@ miic <- function(input_data, # The output of the reconstruction is the "raw" temporal graph, without # edges identical by stationarity. To have the "real" temporal graph, # we duplicate the edges using the stationary assumption and this "real" - # graph is stored the "all.edges.stationarity" data frame. + # graph is stored in the "stationarity" data frame.
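The scoring block added above reduces to the standard precision/recall/F1 formulas; a standalone check on toy counts (the counts are invented, not taken from a real reconstruction):

```r
# Toy check of the precision/recall/F1 computation used in the scores block;
# tp/fp/fn counts are invented for illustration.
tp <- 8; fp <- 2; fn <- 2
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
fscore    <- (2 * (precision * recall)) / (precision + recall)
round(c(precision = precision, recall = recall, fscore = fscore), 4)
# precision = recall = fscore = 0.8 for these counts
```

Note that when `tp + fp` or `tp + fn` is zero (no retained edges, or no true edges recovered), the ratios are `NaN`; the sketch above does not guard against that case.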
# edges_dup_stat = tmiic_repeat_edges_over_history (res) res$tmiic <- list (lagged_state_order = state_order, lagged_black_box = black_box, lagged_true_edges = true_edges, - all.edges.stationarity = edges_dup_stat) + stationarity = edges_dup_stat) } + + miic_end = Sys.time() + pre_duration = as.numeric (pre_end - miic_start, units="secs") + post_duration = as.numeric (miic_end - post_start, units="secs") + miic_duration = as.numeric (miic_end - miic_start, units="secs") + res$time = c ("pre"=pre_duration, res$time, + "post"=post_duration, "total"=miic_duration) + return(res) } diff --git a/R/miic.plot.R b/R/miic.plot.R index 8d94cfa..2c463cc 100644 --- a/R/miic.plot.R +++ b/R/miic.plot.R @@ -1,41 +1,102 @@ -#' Export miic result to different plotting methods +#------------------------------------------------------------------------------- +# export +#------------------------------------------------------------------------------- +#' Export miic result for plotting (with igraph) #' #' @description This function creates an object built from the result returned -#' by \code{\link{miic}} that is ready to be fed to different plotting methods. +#' by \code{\link{miic}} that is ready to be fed to the plotting method. #' -#' @details See the details of specific function for each method. -#' For igraph, see \code{\link{getIgraph}}. +#' @details The behavior depends on the method used for the export. #' -#' @param miic.res [a miic graph object] -#' The graph object returned by the miic execution. -#' @param method A string representing the plotting method. -#' Currently only "igraph" is supported. -#' @param pcor_palette The color palette used to represent the partial correlations -#' (the color of the edges). The palette must be able to handle 201 shades -#' to cover the correlation range from -100 to +100. The default palette is -#' grDevices::colorRampPalette(c("blue", "darkgrey", "red"). 
+#' For igraph, edge attributes are passed to the igraph graph +#' and can be accessed with e.g. \code{E(g)$partial_correlation}. +#' See \code{\link{miic}} for more details on edge parameters. +#' By default, edges are colored according to the partial correlation +#' between two nodes conditioned on the conditioning set +#' (negative is blue, null is gray and positive is red) +#' and their width is based on the conditional mutual information +#' minus the complexity cost. +#' +#' @param miic_obj [a miic object, required] +#' +#' The object returned by the \code{\link{miic}} execution. +#' +#' @param method [a string, optional, default value "igraph"] +#' +#' The plotting method, currently only "igraph" is supported. +#' +#' @param pcor_palette [a color palette, optional, default value +#' grDevices::colorRampPalette(c("blue", "darkgrey", "red"))] +#' +#' Used to represent the partial correlations (the color of the edges). +#' The palette must be able to handle 201 shades to cover the correlation range +#' from -100 to +100. +#' +#' @param display [a string, optional, default value "compact"] +#' +#' Used only when exporting an object returned by miic in temporal mode. +#' It allows different representations of the temporal graph. +#' Possible values are \emph{"raw"}, \emph{"lagged"}, \emph{"compact"}, +#' \emph{"combine"}, \emph{"unique"}, \emph{"drop"}: +#' \itemize{ +#' \item When \emph{display} = \emph{"raw"}, the export function will +#' use the tmiic graph object as is, leading to the return of a lagged +#' graph. +#' \item When \emph{display} = \emph{"lagged"}, the export function will +#' repeat the edges over history assuming stationarity and return a lagged +#' graph. +#' \item When \emph{display} = \emph{"compact"}, the default, nodes +#' and edges are converted into a flattened version to produce a compact +#' view of the temporal network whilst still presenting all the information +#' in the export.\cr +#' e.g.
X_lag1->Y_lag0, X_lag2<-Y_lag0 become respectively X->Y lag=1, +#' X<-Y lag=2. +#' \item When \emph{display} = \emph{"combine"}, prior to the export, +#' a pre-processing will be applied to keep only one edge +#' per pair of nodes. The info_shifted will be the highest one +#' of the summarized edges whilst the lag and orientation of the +#' summarized edge will be an aggregation.\cr +#' e.g. X_lag2->Y_lag0, X_lag0<-Y_lag1 will become X<->Y lag=1-2 with +#' the info_shifted of X_lag2->Y_lag0 if info_shifted of +#' X_lag2->Y_lag0 > X_lag0<-Y_lag1. +#' \item When \emph{display} = \emph{"unique"}, prior to the export, +#' a pre-processing will be applied to keep only the edges having the +#' highest info_shifted for a pair of nodes. +#' If several edges between the same nodes have the same +#' info_shifted, then the edge kept is the one with the minimum lag.\cr +#' e.g. X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of +#' X_lag1->Y_lag0 > X_lag0<-Y_lag2 become X->Y lag=1. +#' \item When \emph{display} = \emph{"drop"}, the same pre-processing +#' as \emph{"unique"} will be applied, then the lag information will be +#' dropped before the export. +#' } +#' +#' @param show_self_loops [a boolean, optional, TRUE by default] +#' +#' Used only when exporting an object returned by miic in temporal mode. +#' When TRUE, the lagged edges starting and ending on the same node +#' are included in the igraph object. +#' When FALSE, only edges having different nodes are present in the igraph +#' object. #' #' @export #' #' @return A graph object adapted to the method. #' -#' @seealso -#' \code{\link{getIgraph}} for details on the igraph exported object.
-#' #' @examples #' \donttest{ #' library(miic) #' data(hematoData) #' #' # execute MIIC (reconstruct graph) -#' miic.res <- miic( +#' miic_obj <- miic( #' input_data = hematoData, latent = "yes", #' n_shuffles = 10, conf_threshold = 0.001 #' ) #' #' # Using igraph #' if(require(igraph)) { -#' g = miic.export(miic.res, "igraph") +#' g = export(miic_obj, "igraph") #' plot(g) # Default visualisation, calls igraph::plot.igraph() #' #' # Specifying layout (see ?igraph::layout_) @@ -47,25 +108,53 @@ #' plot(g, vertex.shape="none", edge.color="gray85", vertex.label.color="gray10") #' } #' +#' # In temporal mode, execute MIIC +#' data(covidCases) +#' tmiic_obj <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, mov_avg = 14) +#' +#' # Plot by default the compact display of the temporal network using igraph +#' if(require(igraph)) { +#' g = export (tmiic_obj) +#' plot(g) +#' +#' # Plot the raw temporal network using igraph +#' g = export(tmiic_obj, display="raw") +#' plot(g) +#' +#' # Plot the complete temporal network using igraph (completed by stationarity) +#' g = export(tmiic_obj, display="lagged") +#' plot(g) +#' +#' # Specifying layout (see ?igraph::layout_) +#' l <- layout_on_grid(g, width = 5, height = 3, dim = 2) +#' plot(g, layout=l) +#' +#' # For compact temporal display, please be aware that the rendering of +#' # igraph::plot.igraph() is not optimal when the graph contains +#' # multiple edges between the same nodes. 
+#' # So, the recommended way to plot a compact graph is to use tmiic plotting: +#' plot(tmiic_obj) #' } #' - -miic.export <- function(miic.res, method = NULL, pcor_palette = NULL) { - if (is.null(miic.res$all.edges.summary)) { +#' } +#------------------------------------------------------------------------------- +export <- function (miic_obj, method="igraph", pcor_palette=NULL, + display="compact", show_self_loops=TRUE) + { + if ( is.null(miic_obj$summary) ) stop("The inferred network does not exist") - } - if (is.null(method)) { - stop("Plotting method is required") - } - if (method == "igraph") { - return(getIgraph(miic.res, pcor_palette = pcor_palette)) - } else { + if ( (!is.null(method)) && (method != "igraph") ) stop("Method not supported") - } -} + if ( is.null(miic_obj$tmiic) ) + return (getIgraph(miic_obj, pcor_palette=pcor_palette)) + else + return (tmiic_getIgraph (miic_obj, pcor_palette=pcor_palette, + display=display, show_self_loops=show_self_loops)) + } -#' Igraph plotting function for miic +#------------------------------------------------------------------------------- +#' Igraph export function for miic #' #' @description This functions returns an igraph object built from the result #' returned by \code{\link{miic}}. @@ -78,8 +167,8 @@ miic.export <- function(miic.res, method = NULL, pcor_palette = NULL) { #' (negative is blue, null is gray and positive is red) and their width is #' based on the conditional mutual information minus the complexity cost. #' -#' @param miic.res [a miic graph object] -#' The graph object returned by the miic execution. +#' @param miic_obj [a miic object] +#' The object returned by the \code{\link{miic}} execution. #' @param pcor_palette The color palette used to represent the partial correlations #' (the color of the edges). The palette must be able to handle 201 shades #' to cover the correlation range from -100 to +100.
The default palette is @@ -87,42 +176,44 @@ miic.export <- function(miic.res, method = NULL, pcor_palette = NULL) { #' #' @return An igraph graph object. #' +#' @noRd +#' #' @seealso #' \code{\link{miic}} for details on edge parameters in the returned object, #' \code{\link[igraph]{igraph.plotting}} for the detailed description of the #' plotting parameters and \code{\link[igraph]{layout}} for different layouts. -#' -#' - -getIgraph <- function(miic.res, pcor_palette = NULL) { - if (is.null(miic.res$all.edges.summary)) { +#------------------------------------------------------------------------------- +getIgraph <- function(miic_obj, pcor_palette = NULL) { + if (is.null(miic_obj$summary)) { stop("The inferred network does not exist.") } if (!base::requireNamespace("igraph", quietly = TRUE)) { stop("Package 'igraph' is required.") } - summary = miic.res$all.edges.summary[miic.res$all.edges.summary$type %in% c('P', 'TP', 'FP'), ] + summary = miic_obj$summary[miic_obj$summary$type %in% c('P', 'TP', 'FP'), ] if (nrow(summary) > 0) { # Re-order summary so that all edges go from "x" to "y" for(row in 1:nrow(summary)){ - if(summary[row, "infOrt"] == -2){ + if(summary[row, "ort_inferred"] == -2){ summary[row, c("x","y")] = summary[row, c("y","x")] - summary[row, "infOrt"] = 2 - if(!is.na(summary[row, "proba"])){ - summary[row, "proba"] = paste0(rev( - strsplit(summary[row, "proba"], ";")[[1]]), collapse=";") + summary[row, "ort_inferred"] = 2 + if ( (!is.na(summary[row, "p_y2x"])) + && (!is.na(summary[row, "p_x2y"])) ) { + temp <- summary[row, "p_y2x"] + summary[row, "p_y2x"] <- summary[row, "p_x2y"] + summary[row, "p_x2y"] <- temp } - if(!is.na(summary[row, "trueOrt"])){ - summary[row, "trueOrt"] = 2 + if(!is.na(summary[row, "ort_ground_truth"])){ + summary[row, "ort_ground_truth"] = 2 } } } } # Create igraph object from summary - ig_graph = igraph::graph_from_data_frame(summary, - vertices=colnames(miic.res$adj_matrix)) + ig_graph = igraph::graph_from_data_frame 
(summary, + vertices=colnames(miic_obj$adj_matrix) ) # Set nodes visuals igraph::V(ig_graph)$color <- "lightblue" @@ -136,9 +227,9 @@ getIgraph <- function(miic.res, pcor_palette = NULL) { # Set correct orientations igraph::E(ig_graph)$arrow.mode = rep(0, igraph::gsize(ig_graph)) - igraph::E(ig_graph)$arrow.mode[igraph::E(ig_graph)$infOrt == 2] = 2 - igraph::E(ig_graph)$arrow.mode[igraph::E(ig_graph)$infOrt == -2] = 1 - igraph::E(ig_graph)$arrow.mode[igraph::E(ig_graph)$infOrt == 6] = 3 + igraph::E(ig_graph)$arrow.mode[igraph::E(ig_graph)$ort_inferred == 2] = 2 + igraph::E(ig_graph)$arrow.mode[igraph::E(ig_graph)$ort_inferred == -2] = 1 + igraph::E(ig_graph)$arrow.mode[igraph::E(ig_graph)$ort_inferred == 6] = 3 # Set edges visuals min_width = 0.2 @@ -165,37 +256,46 @@ getIgraph <- function(miic.res, pcor_palette = NULL) { #' Basic plot function of a miic network inference result #' -#' @description This function calls \code{\link{miic.export}} to build a +#' @description This function calls \code{\link{export}} to build a #' plottable object from the result returned by \code{\link{miic}} and plot it. #' -#' @details See the documentation of \code{\link{miic.export}} for further +#' @details See the documentation of \code{\link{export}} for further #' details. #' -#' @param x [a miic graph object] -#' The graph object returned by \code{\link{miic}}. -#' @param method A string representing the plotting method. Default to "igraph". -#' Currently only "igraph" is supported. -#' @param pcor_palette Optional. The color palette used to represent the partial -#' correlations (the color of the edges). See \code{\link{miic.export}} for details. -#' @param \dots Additional plotting parameters. See the corresponding plot function -#' for the complete list. +#' @param x [a miic object, required] +#' +#' The object returned by \code{\link{miic}} execution. 
+#' +#' @param method [a string, optional, default value "igraph"] +#' +#' The plotting method, currently only "igraph" is supported. +#' +#' @param pcor_palette [a color palette, optional, default value +#' grDevices::colorRampPalette(c("blue", "darkgrey", "red"))] +#' +#' Used to represent the partial correlations (the color of the edges). +#' The palette must be able to handle 201 shades to cover the correlation range +#' from -100 to +100. +#' +#' @param \dots Additional plotting parameters. See the corresponding plot +#' function for the complete list. +#' #' For igraph, see \code{\link[igraph]{igraph.plotting}}. #' #' @export #' -#' @seealso \code{\link{miic.export}} for generic exports, -#' \code{\link{getIgraph}} for igraph export, +#' @seealso \code{\link{export}} for graphical exports, #' \code{\link[igraph]{igraph.plotting}} #' plot.miic = function(x, method = 'igraph', pcor_palette = NULL, ...) { if (method == 'igraph'){ if (base::requireNamespace("igraph", quietly = TRUE)) { - igraph_obj = miic.export (x, 'igraph', pcor_palette = pcor_palette) + igraph_obj = export (x, 'igraph', pcor_palette = pcor_palette) igraph::plot.igraph (igraph_obj, ...) } else { stop("Package 'igraph' is required.") } } else { - stop("Method not supported. See ?miic.export for supported methods.") + stop("Method not supported.
See ?export for supported methods.") } } diff --git a/R/miic.reconstruct.R b/R/miic.reconstruct.R index 419dc44..2f6e093 100755 --- a/R/miic.reconstruct.R +++ b/R/miic.reconstruct.R @@ -10,7 +10,7 @@ miic.reconstruct <- function(input_data = NULL, latent = "orientation", n_shuffles = 0, orientation = TRUE, - ori_proba_ratio = 1, + ort_proba_ratio = 1, propagation = FALSE, conf_threshold = 0, verbose = FALSE, @@ -51,6 +51,7 @@ miic.reconstruct <- function(input_data = NULL, input_double <- as.vector(input_double) var_names <- colnames(input_data) + n_vars <- length (var_names) arg_list <- list( "conf_threshold" = conf_threshold, @@ -71,7 +72,7 @@ miic.reconstruct <- function(input_data = NULL, "n_threads" = n_threads, "no_init_eta" = FALSE, "orientation" = orientation, - "ori_proba_ratio" = ori_proba_ratio, + "ort_proba_ratio" = ort_proba_ratio, "propagation" = propagation, "test_mar" = test_mar, "mode" = mode, @@ -106,47 +107,67 @@ miic.reconstruct <- function(input_data = NULL, return(list(interrupted = TRUE)) # R-formalize returned object - # table of edges infomation + # + # Table of edges information + # n_row <- length(res$edges) - 1 header <- unlist(res$edges[1]) df <- data.frame(matrix(unlist(res$edges[2:(n_row + 1)]), nrow = n_row, byrow = TRUE), stringsAsFactors = FALSE) colnames(df) <- header df[df == "NA"] <- NA - df$Ixy <- as.numeric(df$Ixy) - df$Ixy_ai <- as.numeric(df$Ixy_ai) + df$i_xy <- as.numeric(df$i_xy) + df$i_xy_ai <- as.numeric(df$i_xy_ai) df$cplx <- as.numeric(df$cplx) - df$Rxyz_ai <- as.numeric(df$Rxyz_ai) - df$Nxy_ai <- as.numeric(df$Nxy_ai) + df$r_xyz_ai <- as.numeric(df$r_xyz_ai) + df$n_xy_ai <- as.numeric(df$n_xy_ai) df$confidence <- as.numeric(df$confidence) res$edges <- df + # + # Reshape items returned as list into their correct shape, + # add column/row names + # + res$adj_matrix <- matrix (unlist (res$adj_matrix), + ncol=n_vars, nrow=n_vars, byrow=TRUE, + dimnames=list (var_names, var_names) ) - # adj_matrix - res$adj_matrix <- 
matrix(unlist(res$adj_matrix), nrow = length(input_data), - byrow = TRUE) - colnames(res$adj_matrix) <- var_names - rownames(res$adj_matrix) <- var_names - - # proba_adj_matrix - res$proba_adj_matrix <- matrix(unlist(res$proba_adj_matrix), - nrow = length(input_data), byrow = TRUE) - colnames(res$proba_adj_matrix) <- var_names - rownames(res$proba_adj_matrix) <- var_names - - # adj_matrices (when consistent parameter is turned on) - if (length(res$adj_matrices) > 0) - res$adj_matrices <- matrix(unlist(res$adj_matrices), - ncol = length(res$adj_matrices)) + res$proba_adj_matrix <- matrix (unlist(res$proba_adj_matrix), + ncol=n_vars, nrow=n_vars, byrow=TRUE, + dimnames=list (var_names, var_names) ) + # + # Same : reshape items returned when consistent parameter is turned on + # + if (length (res$adj_matrices) > 0) + { + tmp_reshape = list() + for (i in 1:length (res$adj_matrices) ) + tmp_reshape[[i]] = matrix (unlist (res$adj_matrices[[i]]), + ncol=n_vars, nrow=n_vars, byrow=TRUE, + dimnames=list (var_names, var_names) ) + res$adj_matrices = tmp_reshape + } - # proba_adj_matrices (when consistent parameter is turned on) - if (length(res$proba_adj_matrices) > 0) { - res$proba_adj_matrices <- matrix(unlist(res$proba_adj_matrices), - ncol = length(res$proba_adj_matrices)) - res$proba_adj_matrices[res$proba_adj_matrices == -1] <- NA - adj_average <- rowMeans(res$proba_adj_matrices, na.rm = TRUE) - res$proba_adj_average <- matrix(unlist(adj_average), - nrow = length(input_data), byrow = TRUE) - } + if (length (res$proba_adj_matrices) > 0) + { + # First reshape with n_vars * n_vars rows, n_cycles columns to compute mean + # + tmp_reshape <- matrix (unlist (res$proba_adj_matrices), + ncol=length(res$proba_adj_matrices)) + tmp_reshape[tmp_reshape == -1] <- NA + adj_average <- rowMeans (tmp_reshape, na.rm = TRUE) + res$proba_adj_average <- matrix (unlist (adj_average), + ncol=n_vars, nrow=n_vars, byrow=TRUE, + dimnames=list (var_names, var_names) ) + # + # Final reshape into 
a list of n_cycles items, each item is n_vars rows, + # n_vars columns matrix + # + res$proba_adj_matrices = list() + for (i in 1:ncol (tmp_reshape) ) + res$proba_adj_matrices[[i]] = matrix (unlist (tmp_reshape[,i]), + ncol=n_vars, nrow=n_vars, byrow=TRUE, + dimnames=list (var_names, var_names) ) + } # save time time <- strsplit(as.character(res$time), " ") @@ -154,17 +175,17 @@ miic.reconstruct <- function(input_data = NULL, res$time <- stats::setNames( as.numeric(time), - c("init", "iter", "cut", "ori", "total") + c("init", "iter", "cut", "ort", "cpp") ) # create the data frame of the structures after orientation - orientations_prob <- res$orientations.prob + orientations_prob <- res$triples - if (length(res$orientations.prob) > 0) { + if (length(res$triples) > 0) { a <- length(orientations_prob[[1]]) b <- length(unlist(orientations_prob)) - tmp <- unlist(res$orientations.prob)[1:a] - res1 <- unlist(res$orientations.prob)[(a + 1):b] + tmp <- unlist(res$triples)[1:a] + res1 <- unlist(res$triples)[(a + 1):b] orientations_prob <- data.frame(matrix( res1, nrow = length(orientations_prob) - 1, @@ -179,7 +200,7 @@ miic.reconstruct <- function(input_data = NULL, orientations_prob[, c(8:9)] <- sapply(orientations_prob[, c(8:9)], as.numeric) } # update the returned matrix - res$orientations.prob <- orientations_prob + res$triples <- orientations_prob res$interrupted <- FALSE diff --git a/R/miic.utils.R b/R/miic.utils.R index bbbf4e1..fd1dffb 100755 --- a/R/miic.utils.R +++ b/R/miic.utils.R @@ -1,3 +1,9 @@ +#******************************************************************************* +# Filename : miic.utils.R +# +# Description: various utilities functions and constants for miic +#******************************************************************************* + #=============================================================================== # CONSTANTS #=============================================================================== @@ -13,7 +19,7 @@ 
STATE_ORDER_STANDARD_VALID_COLUMS <- c ("var_names", "var_type", "levels_increasing_order", "is_contextual", "is_consequence", "group", "group_color") STATE_ORDER_TEMPORAL_VALID_COLUMNS = c (STATE_ORDER_STANDARD_VALID_COLUMS, - "n_layers", "delta_t", "movavg") + "n_layers", "delta_t", "mov_avg") #=============================================================================== # FUNCTIONS #=============================================================================== @@ -164,7 +170,7 @@ check_input_data <- function (input_data, mode) # additional possible columns in temporal mode are: # * n_layers: the number of layers in the time unfolded graph # * delta_t: the number of time steps between layers -# * movavg: if a moving average must applied on some variables +# * mov_avg: if a moving average must be applied on some variables # NB: is_consequence is not allowed in temporal mode # - mode: the MIIC mode # Return: the checked and eventually generated or completed state order dataframe @@ -261,8 +267,8 @@ check_state_order <- function (input_data, state_order, mode) state_order$n_layers[ is.na (state_order$n_layers) ] <- "NA" if ("delta_t" %in% colnames (state_order) ) state_order$delta_t[ is.na (state_order$delta_t) ] <- "NA" - if ("movavg" %in% colnames (state_order) ) - state_order$movavg[ is.na (state_order$movavg) ] <- "NA" + if ("mov_avg" %in% colnames (state_order) ) + state_order$mov_avg[ is.na (state_order$mov_avg) ] <- "NA" } # # Check variable in data not in the state_order @@ -662,15 +668,16 @@ check_state_order <- function (input_data, state_order, mode) #------------------------------------------------------------------------------- # check_other_df #------------------------------------------------------------------------------- -# input_data: a dataframe with variables as columns and rows as samples -# - df: the datafame to check, expected to be a 2 columns dataframe. All values -# in the dataframe are expected to be variables names. 
-# An invalid dataframe will be ignored, Invalid rows will be discarded -# - state_order: the dataframe returned by check_state_order -# - df_name: the datafame name (i.e. :"black box", "true edges") -# This value is used only to display messages +# input_data: a data frame with variables as columns and rows as samples +# - df: the data frame to check, expected to be a 2 columns data frame in +# standard mode and a 3 columns data frame in temporal mode. +# All values in the first 2 columns of the data frame are expected to be +# variable names and, in temporal mode, the 3rd column is expected to +# contain lags. +# An invalid data frame will be ignored; invalid rows will be discarded +# - state_order: the data frame returned by check_state_order +# - df_name: the data frame name (e.g. "black box", "true edges") # - mode: the MIIC mode -# return: the dataframe checked +# return: the checked data frame #------------------------------------------------------------------------------- check_other_df <- function (input_data, state_order, df, df_name, mode) { @@ -696,7 +703,7 @@ check_other_df <- function (input_data, state_order, df, df_name, mode) # if (mode %in% MIIC_TEMPORAL_MODES) { - input_data = input_data[,2:ncol(input_data)] + input_data = input_data[, 2:ncol(input_data), drop=F] n_cols <- 3 } else @@ -710,7 +717,7 @@ check_other_df <- function (input_data, state_order, df, df_name, mode) if (nrow(df) == 0) { miic_warning (df_name, "The provided dataframe is empty.") - return (df) + return (NULL) } data_var_names <- colnames (input_data) @@ -739,14 +746,14 @@ check_other_df <- function (input_data, state_order, df, df_name, mode) } } rows_ok <- unlist (lapply (1:nrow(df), FUN=function (x) { ! 
(x %in% rows_with_warning) } ) ) - df <- df [rows_ok, ] + df <- df [rows_ok, , drop=F] if (nrow(df) == 0) { miic_warning (df_name, "The provided dataframe is empty.") - return (df) + return (NULL) } # - # In temporal mode, check that the 3rd columns is integer >= 0 (lags) + # In temporal mode, check that the 3rd column is an integer >= 0 (lags) # if (mode %in% MIIC_TEMPORAL_MODES) { @@ -773,12 +780,12 @@ check_other_df <- function (input_data, state_order, df, df_name, mode) else miic_warning (df_name, "lag is incorrect for multiple rows (", msg_str, "), these rows will be ignored.") - df <- df [!wrong_lags, ] + df <- df [!wrong_lags, , drop=F] } if (nrow(df) == 0) { miic_warning (df_name, "The provided dataframe is empty.") - return (df) + return (NULL) } # # Check that contextual lag are NA @@ -788,7 +795,7 @@ check_other_df <- function (input_data, state_order, df, df_name, mode) dest_idx = which (state_order$var_names == x[[2]]) return ( (state_order[orig_idx, "is_contextual"] == 1) || (state_order[dest_idx, "is_contextual"] == 1) ) } ) ) - wrongs_ctx= ( contextuals & ( ! is.na (df[,3]) ) ) # NA: OK for now + wrongs_ctx = ( contextuals & ( ! 
is.na (df[,3]) ) ) if ( any (wrongs_ctx) ) { if (sum (wrongs_ctx) == 1) @@ -828,11 +835,11 @@ check_other_df <- function (input_data, state_order, df, df_name, mode) sum (wrongs_selfs), " wrong lines will be ignored.") } - df <- df [ (!wrongs_ctx) & (!wrongs_lagged) & (!wrongs_selfs), ] + df <- df [ (!wrongs_ctx) & (!wrongs_lagged) & (!wrongs_selfs), , drop=F] if (nrow(df) == 0) { miic_warning (df_name, "The provided dataframe is empty.") - return (df) + return (NULL) } } # @@ -845,10 +852,13 @@ check_other_df <- function (input_data, state_order, df, df_name, mode) df = unique (df) rownames(df) = NULL # - # Equal rows but with variable names swapped + # We remove rows that are identical up to a swap of the variable names, + # as edges in the black box are not oriented and, for true edges, + # the post-processing will not be able to process opposite edges # rows_kept = rep (T, nrow(df)) - for (i in 1:nrow(df)) { + for (i in 1:nrow(df)) + { if ( ! rows_kept[[i]] ) next if (mode %in% MIIC_TEMPORAL_MODES) @@ -867,20 +877,32 @@ check_other_df ... & (df[,2] == df[i,1]) & (rownames(df) != i) ) rows_kept = rows_kept & (!dup_inverse) - } - df = df[rows_kept,] + } + df = df[rows_kept, , drop=F] if ( n_rows_sav != nrow(df) ) { - if (n_rows_sav - nrow(df) == 1) - miic_warning (df_name, "1 row is duplicated. Only one instance", - " of the row will be used.") + if (df_name == "true edges") + { + miic_warning (df_name, "the implementation of true edges", + " is not designed to handle opposite edges.", + " Only one direction will be considered for the opposite edge(s).") + } else - miic_warning (df_name, n_rows_sav - nrow(df), " rows are duplicated.", - " Only one instance of these rows will be used.") + { + if (n_rows_sav - nrow(df) == 1) + miic_warning (df_name, "1 row is duplicated. 
Only one instance", + " of the row will be used.") + else + miic_warning (df_name, n_rows_sav - nrow(df), " rows are duplicated.", + " Only one instance of these rows will be used.") + } } if (nrow(df) == 0) + { miic_warning (df_name, "The provided dataframe is empty.") + return (NULL) + } return (df) } @@ -1025,37 +1047,37 @@ test_param_wrong_float <- function (value, min=NA, max=NA) # Returns: a list with all the parameters, eventually modified or initialized #------------------------------------------------------------------------------- check_parameters <- function (input_data, n_threads, cplx, - orientation, ori_proba_ratio, ori_consensus_ratio, propagation, latent, + orientation, ort_proba_ratio, ort_consensus_ratio, propagation, latent, n_eff, n_shuffles, conf_threshold, sample_weights, test_mar, consistent, max_iteration, consensus_threshold, mode, negative_info, verbose) { list_ret = list ("mode" = mode) list_ret$n_threads = check_param_int (n_threads, "n_threads", 1, min=1, max=NA) - list_ret$cplx = check_param_string (cplx, "complexity", c("nml", "mdl")) + list_ret$cplx = check_param_string (cplx, "complexity", c("nml", "bic")) list_ret$orientation = check_param_logical (orientation, "orientation", TRUE) - if ( test_param_wrong_float (ori_proba_ratio, min=0, max=1) ) + if ( test_param_wrong_float (ort_proba_ratio, min=0, max=1) ) { - miic_warning ("parameters", "supplied value ", ori_proba_ratio, - " for the orientation probabilty ratio parameter is invalid.", + miic_warning ("parameters", "supplied value ", ort_proba_ratio, + " for the orientation probability ratio parameter is invalid.", " It must be a floating point between 0 and 1.", " The default value (1) will be used.") - ori_proba_ratio = 1 + ort_proba_ratio = 1 } - list_ret$ori_proba_ratio = ori_proba_ratio + list_ret$ort_proba_ratio = ort_proba_ratio - if ( is.null (ori_consensus_ratio) ) - ori_consensus_ratio = list_ret$ori_proba_ratio - else if ( test_param_wrong_float (ori_consensus_ratio, 
min=0, max=1) ) { - miic_warning ("parameters", "supplied value ", ori_consensus_ratio, + miic_warning ("parameters", "supplied value ", ort_consensus_ratio, " for the orientation concensus ratio parameter is invalid.", " It must be a floating point between 0 and 1.", " The default value (same as orientation probabilty ratio: ", - ori_proba_ratio, ") will be used.") - ori_consensus_ratio = list_ret$ori_proba_ratio + ort_proba_ratio, ") will be used.") + ort_consensus_ratio = list_ret$ort_proba_ratio } - list_ret$ori_consensus_ratio = ori_consensus_ratio + list_ret$ort_consensus_ratio = ort_consensus_ratio list_ret$propagation = check_param_logical (propagation, "propagation", FALSE) list_ret$latent = check_param_string (latent, "latent", MIIC_VALID_LATENT) diff --git a/R/parseResults.R b/R/parseResults.R index 071236e..164842f 100644 --- a/R/parseResults.R +++ b/R/parseResults.R @@ -1,324 +1,382 @@ -summarizeResults <- function(observations = NULL, results = NULL, +#******************************************************************************* +# Filename : parseResults.R +# +# Description: produce a summary by post-processing the miic C++ output +#******************************************************************************* + +#------------------------------------------------------------------------------- +# summarizeResults +#------------------------------------------------------------------------------- +# Summarize the list of edges; the list will contain: +# - edges that exist in the miic reconstruction (oriented or not) +# - edges that were conditioned away with a non-empty separating set +# - if ground truth is known, edges present in ground truth but not in the +# 2 previous categories; these correspond to the true edges removed without +# conditioning +# The summary is sorted by info_shifted (log 
likelihood), in decreasing order. +# +# Inputs: +# - observations: a data frame, mandatory, the input data +# - results: the output of the miic C++ part, mandatory. +# - true_edges: a 2 columns data frame, optional, NULL by default. +# The ground truth, if known +# - state_order: the state order data frame used, optional, NULL by default. +# - consensus_threshold: a float, optional, 0.8 by default. Used when +# consistency is activated to construct the consensus graph skeleton. +# - ort_consensus_ratio: a float, optional, 0.1 by default. Used to determine +# if oriented edges are genuinely causal and, when consistency is activated, +# to determine the consensus graph orientations. +# - latent: a boolean, optional, TRUE by default. Indicates if latent +# variable discovery was activated during the network reconstruction. +# - propagation: a boolean, optional, FALSE by default. Indicates if orientation +# propagation was activated during the network reconstruction. +# Return: +# - a data frame: the summary data frame. +# * x: 1st node (in alphanumerical order) of the edge +# * y: 2nd node (in alphanumerical order) of the edge +# * type: the miic prediction: "P"(ositive) or "N"(egative) for the +# presence or absence, respectively, of an edge, without considering +# orientation. If ground truth is known, edges are classified as True or +# False Positives/Negatives ("TP", "FP", "TN", "FN"). +# * ai: list containing the conditioning nodes +# * raw_contributions: raw contributions of each ai to the conditional +# independence, measured by I'(x;y;ai|{aj}) / I'(x;y), +# where {aj} is the separating set before adding ai. +# * contributions: contributions of each ai to the reduction of conditional +# mutual information measured by I'(x;y;ai|{aj}) / I'(x;y|{aj}), +# where {aj} is the separating set before adding ai. 
+# * info: mutual information (corresponds to i_xy in the C++ output) +# * n_xy: the number of samples without missing values for the pair of +# variables +# * info_cond: conditional mutual information (corresponds to i_xy_ai in the +# C++ output) +# * cplx: the NML complexity (used for independence testing) +# * n_xy_ai: the number of samples without missing values for the pair of +# variables and the contributors +# * info_shifted: the difference between conditional MI and cplx +# * ort_inferred: the inferred edge orientation +# * ort_ground_truth: the true edge orientation, if known. +# NA if truth is unknown +# * is_inference_correct: indicates if the inferred edge is correctly oriented. +# NA if truth is unknown, TRUE or FALSE if truth is known +# * is_causal: indicates if the edge is genuinely causal. +# Note that genuine causality is deducible only when latent variables +# are allowed and propagation is not allowed +# * ort_consensus: if consistency is activated, indicates the consensus +# orientation of the edge, possible values are 0: not connected, +# 1: not oriented, -2 or 2: oriented or 6: bi-directional (latent variable). +# NA if consistency is not activated. +# * is_causal_consensus: if consistency is activated, indicates if the +# consensus orientation is genuinely causal. NA if consistency is not +# activated. +# * edge_stats: if consistency is activated, contains the frequency of each +# orientation present in the cycle of graphs. +# NA if consistency is not activated. +# * sign: sign of the partial correlation between x and y conditioned on "ai"s +# * partial_correlation: coefficient of the partial correlation between x and y +# conditioned on "ai"s +# * p_y2x: probability of the arrowhead from y to x, NA for removed edges. +# * p_x2y: probability of the arrowhead from x to y, NA for removed edges. 
+# * confidence: the ratio of info_shifted between randomized +# and normal samples +#------------------------------------------------------------------------------- +summarizeResults = function (observations, results, true_edges = NULL, state_order = NULL, - consensus_threshold = 0.8, - ori_consensus_ratio = 0.1, latent = TRUE, - propagation = FALSE, verbose = FALSE) { - # Reduced list of edges that will be summarized. There are 3 categories: - # - Edges that exist in the miic reconstruction (oriented or not) - # - Edges that were conditioned away with a non-empty separating set - # - If ground truth is known, any other positive edge - summarized_edges <- matrix(ncol = 2, nrow = 0) - adj_matrix <- results$adj_matrix - var_names <- colnames(adj_matrix) - - # List of edges found by miic - half_adj_matrix = adj_matrix - half_adj_matrix[ lower.tri(adj_matrix, diag = TRUE) ] <- 0 - predicted_edges <- which(half_adj_matrix != 0, arr.ind = TRUE, useNames = FALSE) - predicted_edges <- apply(predicted_edges, 2, function(x) { var_names[x] }) - # Add to summarized edges list - if(length(predicted_edges > 0)) summarized_edges <- predicted_edges - - edges <- results$edges - # List of negative edges with non null conditioning set - conditioned_edges <- as.matrix(edges[(!is.na(edges$ai.vect)) & - edges$category != 1, c("x", "y")]) - # List of negative edges with null conditioning set - indep_null_cond_set_edges <- as.matrix(edges[(is.na(edges$ai.vect)) & - edges$category != 1, c("x", "y")]) - # Add to summarized edges list - summarized_edges <- rbind(summarized_edges, conditioned_edges) - #indep_null_cond_set_edges) - - if (!is.null(true_edges)) { - # List of False Negative edges - false_negative_edges <- data.frame( - x = character(), y = character(), - stringsAsFactors = FALSE - ) - for (i in 1:nrow(true_edges)) { - true_edge <- as.character(unlist(true_edges[i, ])) - if (!any(apply(summarized_edges, 1, function(row, edge) { - all(row %in% edge) - }, true_edge))) { - # If this 
edge is not already in summarized edges - false_negative_edges[nrow(false_negative_edges) + 1, ] <- true_edge + consensus_threshold = 0.8, ort_consensus_ratio = 0.1, + latent = TRUE, propagation = FALSE) + { + # Keep only edges remaining and edges removed using conditioning + # + summary <- results$edges [ (results$edges$category == 1) + | (! is.na (results$edges$contributions) ), , drop=F] + # + # If true edges is supplied, check, and add if needed, the False Negative + # (the edges removed without conditioning are not present in the summary + # but need to be there to be marked as FN) + # + if ( ! is.null(true_edges) ) + { + # To match easily and quickly edges between data frames, define for each edge + # the nodes names ordered alphanumerically as rowname + # + rownames(summary) <- apply ( summary, 1, FUN=function (x) { + ifelse (x[["x"]] < x[["y"]], + paste0 (x[["x"]], "-", x[["y"]]), + paste0 (x[["y"]], "-", x[["x"]])) } ) + rownames(true_edges) <- apply ( true_edges, 1, FUN=function (x) { + ifelse (x[[1]] < x[[2]], + paste0 (x[[1]], "-", x[[2]]), + paste0 (x[[2]], "-", x[[1]])) } ) + # + # Detect missing edges present in ground truth but not in summary + # + missing_fn <- rownames(true_edges)[ + ! 
( rownames(true_edges) %in% rownames(summary) ) ] + if (length (missing_fn) > 0) + { + # Add missing edges coming from the ground truth + # + summary[ ( nrow (summary) + 1 ) : + ( nrow (summary) + length (missing_fn) ), ] = NA + rownames (summary) [ (nrow (summary) - length (missing_fn) + 1) : + nrow(summary) ] = missing_fn + summary[ missing_fn, c("x","y") ] = true_edges [ missing_fn, ] + # + # Pick values from 'edges' data frame for these missing FN edges + # To avoid iterations over a possibly huge 'edges' data frame + # apply a pre-filtering to reduce search + # (normally, there should not be a lot of missing FN) + # + pre_filter = unique (unlist (true_edges [ missing_fn, ]) ) + edges_4_fn = results$edges[ (results$edges$x %in% pre_filter) + | (results$edges$y %in% pre_filter), , drop=F] + cols_2_pick = colnames (results$edges) + cols_2_pick = cols_2_pick[ (cols_2_pick != "x") & (cols_2_pick != "y") ] + for (i in (nrow (summary) - length (missing_fn) + 1) : nrow(summary) ) + { + one_edge = edges_4_fn[ ( (edges_4_fn$x == summary[i, "x"]) + & (edges_4_fn$y == summary[i, "y"]) ) + | ( (edges_4_fn$x == summary[i, "y"]) + & (edges_4_fn$y == summary[i, "x"]) ), , drop=F] + summary[i, cols_2_pick] = one_edge[1, cols_2_pick] + } } } - # Add to summarized edges list - summarized_edges <- rbind(summarized_edges, false_negative_edges, - stringsAsFactors = FALSE - ) - } - - n <- nrow(summarized_edges) - summary <- data.frame( - x = character(n), y = character(n), type = character(n), ai = character(n), - raw_contributions = character(n), contributions = character(n), - info = numeric(n), info_cond = numeric(n), cplx = numeric(n), - Nxy_ai = numeric(n), info_shifted = numeric(n), infOrt = numeric(n), - trueOrt = numeric(n), isOrtOk = character(n), sign = character(n), - partial_correlation = numeric(n), is_causal = character(n), proba = character(n), - confidence = character(n), stringsAsFactors = FALSE - ) - if(n == 0) return(summary) - - #Initialize is_causal column as NA 
- summary$is_causal = NA - - # Edge ordering (A<-B or B->A) is given by lexicographical sort - summary[,c('x','y')] = t(apply(as.data.frame(summarized_edges)[,c(1,2)], 1, function(row){sort(row)})) - + # + # If there is no edge in the summary, directly return an empty data frame + # + if (nrow (summary) == 0) + return (data.frame (x = character(0), y = character(0), + type = character(0), ai = character(0), + raw_contributions = character(0), contributions = character(0), + info = numeric(0), n_xy = numeric(0), info_cond = numeric(0), + cplx = numeric(0), n_xy_ai = numeric(0), info_shifted = numeric(0), + ort_inferred = integer(0), ort_ground_truth = integer(0), + is_inference_correct = logical(0), is_causal = logical(0), + ort_consensus = integer(0), is_causal_consensus = logical(0), + edge_stats = character(0), sign = character(0), + partial_correlation = numeric(0), p_y2x = numeric(0), p_x2y = numeric(0), + confidence = numeric(0), stringsAsFactors = FALSE) ) + # + # Edge ordering (A<-B or B->A) is given by alphanumerical order + # + summary [, c("x","y")] <- t (apply (summary[,c("x","y")], 1, + function (row) { sort(row) } ) ) + # # Edge 'type' corresponds to the miic prediction : P(ositive) or N(egative) # for respectively presence or absence of an edge, without considering # orientation. If ground truth is known, edges are classified as True or # False Positives/Negatives (TP, FP, TN, FN). 
- if (is.null(true_edges)) { - summary$type <- apply(summary, 1, function(row, adj_matrix) { - ifelse(adj_matrix[row[1], row[2]] == 0, "N", "P") - }, adj_matrix) - } else { - type <- character(n) - for (i in 1:nrow(summary)) { - row <- summary[i, ] - row_is_true_edge <- any(apply( - true_edges, 1, - function(true_edge, edge) { - all(true_edge %in% edge) - }, - c(row$x, row$y) - )) - if (adj_matrix[row$x, row$y] != 0) { - # Inferred positive : either True Positive or False Positive - type[i] <- ifelse(row_is_true_edge, "TP", "FP") - } - else { - # Inferred negative : either True Negative or False Negative - type[i] <- ifelse(row_is_true_edge, "FN", "TN") - } + # + if ( is.null(true_edges) ) + summary$type <- ifelse (summary$category == 1, "P", "N") + else + { + summary$type[ (summary$category == 1) + & (rownames(summary) %in% rownames(true_edges)) ] <- "TP" + summary$type[ (summary$category == 1) + & (! (rownames(summary) %in% rownames(true_edges))) ] <- "FP" + summary$type[ (summary$category == 0) + & (! (rownames(summary) %in% rownames(true_edges))) ] <- "TN" + summary$type[ (summary$category == 0) + & (rownames(summary) %in% rownames(true_edges)) ] <- "FN" } - summary$type <- type - } - - # Ai is a list containing the conditioning nodes - summary$ai <- fill_summary_column(summary, edges, "x", "y", "ai.vect") - - # Raw contributions of each ai to the conditional independence, measured by - # I'(x;y;ai|{aj}) / I'(x;y), where {aj} is the separating set before adding ai. - summary$raw_contributions <- fill_summary_column( - summary, edges, "x", "y", "raw_contributions") - - # Contributions of each ai to the reduction of conditional mutual information - # measured by I'(x;y;ai|{aj}) / I'(x;y|{aj}), where {aj} is the separating set - # before adding ai. 
- summary$contributions <- fill_summary_column( - summary, edges, "x", "y", "contributions") - + # # info and info_cond contain the (conditional) mutual information values - summary$info <- fill_summary_column(summary, edges, "x", "y", "Ixy") - summary$info_cond <- fill_summary_column(summary, edges, "x", "y", "Ixy_ai") - - # cplx is the NML complexity (used for independence testing) - summary$cplx <- fill_summary_column(summary, edges, "x", "y", "cplx") - - # Nxy_ai is the number of samples without missing values used for this edge - summary$Nxy_ai <- fill_summary_column(summary, edges, "x", "y", "Nxy_ai") - + # + colnames(summary)[ which (colnames(summary) == "i_xy")] <- "info" + colnames(summary)[ which (colnames(summary) == "i_xy_ai")] <- "info_cond" + # # info_shifted is the difference between MI and cplx + # summary$info_shifted <- summary$info_cond - summary$cplx - - # confidence is the ratio of info_shifted between randomized and normal sample - summary$confidence <- fill_summary_column( - summary, edges, "x", "y", "confidence") - summary$confidence[summary$confidence == -1] <- NA - - # infOrt is the inferred edge orientation - summary$infOrt <- apply(summary, 1, function(row, adj_matrix) { - adj_matrix[row[1], row[2]] - }, adj_matrix) - - # trueOrt is the true edge orientation (if known) - if (is.null(true_edges)) { - summary$trueOrt <- rep(NA, n) - } else { - true_adj_matrix <- matrix(0, - ncol = dim(adj_matrix)[1], - nrow = dim(adj_matrix)[1] - ) - colnames(true_adj_matrix) <- var_names - rownames(true_adj_matrix) <- var_names - for (i in 1:nrow(true_edges)) { - true_edge <- as.character(unlist(true_edges[i, ])) + # + # confidence is the ratio of info_shifted between randomized + # and normal samples + # + summary$confidence [summary$confidence == -1] <- NA_real_ + # + # ort_inferred is the inferred edge orientation + # + summary$ort_inferred <- apply (summary, 1, function (row, adj_mat) { + adj_mat[row[1], row[2]] }, results$adj_matrix) + # + # 
ort_ground_truth is the true edge orientation (if known) + # + var_names <- colnames (results$adj_matrix) + n_vars <- length (var_names) + if ( is.null(true_edges) ) + summary$ort_ground_truth <- NA_integer_ + else + { + true_adj_matrix <- matrix (0, ncol=n_vars, nrow=n_vars, + dimnames=list(var_names,var_names) ) + for ( i in 1:nrow (true_edges) ) + { + true_edge <- as.character (unlist (true_edges[i, ]) ) true_adj_matrix[true_edge[1], true_edge[2]] <- 2 true_adj_matrix[true_edge[2], true_edge[1]] <- -2 + } + summary$ort_ground_truth <- apply (summary, 1, function (row, true_adj_matrix) { + true_adj_matrix[row[1], row[2]] }, true_adj_matrix) } - - summary$trueOrt <- apply(summary, 1, function(row, true_adj_matrix) { - true_adj_matrix[row[1], row[2]] - }, true_adj_matrix) - } - - # isOrtOk - summary$isOrtOk <- ifelse(summary$infOrt == summary$trueOrt, "Y", "N") - + # + # is_inference_correct indicates if the inferred edge is correctly oriented. + # NA if truth is unknown, TRUE or FALSE if truth is known + # + summary$is_inference_correct <- ifelse (summary$ort_inferred == summary$ort_ground_truth, + TRUE, FALSE) + # # Sign and coefficient of partial correlation between x and y conditioned # on "ai"s. - summary[, c("sign", "partial_correlation")] <- compute_partial_correlation( - summary, observations, state_order - ) - summary$partial_correlation <- as.numeric(summary$partial_correlation) - - # proba contains the orientation likelihoods as computed by miic (cf - # Affeldt & Isambert, UAI 2015 proceedings) : the probabilities of - # both orientations separated by a semi-colon. 
- orientation_probabilities <- results$orientations.prob - summary$proba <- sapply(1:nrow(summary), function(i) { - row <- summary[i, ] - id_x <- match(row$x, var_names) - id_y <- match(row$y, var_names) - proba_adj <- results$proba_adj_matrix - if (!is.null(results$adj_matrices) && ncol(results$adj_matrices) > 1) { - proba_adj <- results$proba_adj_average - } - return(paste(proba_adj[id_y, id_x], proba_adj[id_x, id_y], sep = ";")) - }) - + # + summary [, c("sign", "partial_correlation")] <- compute_partial_correlation ( + summary, observations, state_order) + summary$sign[summary$sign == "NA"] <- NA_character_ + summary$partial_correlation <- as.numeric (summary$partial_correlation) + # + # Probabilities of orientations + # + if (!is.null(results$adj_matrices) && length(results$adj_matrices) > 1) + tmp_proba_adj <- results$proba_adj_average + else + tmp_proba_adj <- results$proba_adj_matrix + summary$p_y2x <- unlist (lapply (1:nrow(summary), function(i) { + proba_of_edge <- tmp_proba_adj[ summary[i, "y"], summary[i, "x"] ] + ifelse (proba_of_edge == -1, NA_real_, proba_of_edge) + } ) ) + summary$p_x2y <- unlist (lapply (1:nrow(summary), function(i) { + proba_of_edge <- tmp_proba_adj[ summary[i, "x"], summary[i, "y"] ] + ifelse (proba_of_edge == -1, NA_real_, proba_of_edge) + } ) ) + # # Genuine causality is deducible only when latent variables are allowed and # propagation is not allowed - causality_deducible <- latent && !propagation - # If consistent parameter is turned on and the result graph is a union of - # more than one inconsistent graphs, get the possible orientations of each - # edge with the correponding frequencies and the consensus status. 
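The new `p_y2x`/`p_x2y` columns above replace the old semicolon-separated `proba` string; the `-1` sentinel (orientation probability not computed) is mapped to `NA`. A minimal standalone sketch of that mapping, using a hypothetical 2x2 probability matrix rather than real miic output:

```r
# Hypothetical orientation-probability matrix; -1 is the sentinel miic
# uses for "probability not computed"
proba_adj <- matrix(c(-1, 0.9, 0.2, -1), nrow = 2,
                    dimnames = list(c("X", "Y"), c("X", "Y")))
# p_x2y: probability that the edge tip at Y is a head (X *-> Y)
p_x2y <- proba_adj["X", "Y"]
if (p_x2y == -1) p_x2y <- NA_real_
# p_y2x: probability that the edge tip at X is a head (X <-* Y)
p_y2x <- proba_adj["Y", "X"]
if (p_y2x == -1) p_y2x <- NA_real_
# The diagonal entry carries the sentinel and becomes NA
p_self <- proba_adj["X", "X"]
if (p_self == -1) p_self <- NA_real_
```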
- if (!is.null(results$adj_matrices) && ncol(results$adj_matrices) > 1) { - # use split to turn summary data frame to list only to be able to use sapply - # with simplify = FALSE, otherwise apply() will force simplification, which - # is annoying and surprising when the return value of the function is the - # same for all rows. Theoretically it can never happen here, but this is R! - # Always be careful. - edge_stats_table <- sapply(split(summary, seq(nrow(summary))), - get_edge_stats_table, - var_names, - results$adj_matrices, - simplify = FALSE - ) - target <- which(names(summary) == "infOrt")[1] - summary <- cbind( - summary[, 1:target, drop = FALSE], - consensus = sapply(edge_stats_table, - get_consensus_status, - consensus_threshold - ), - edge_stats = sapply(edge_stats_table, get_edge_stats_str), - is_causal_consensus = NA, - summary[, (target + 1):length(summary), drop = FALSE] - ) + # + causality_deducible <- latent && (!propagation) + summary$is_causal = as.logical (NA) + summary$ort_consensus = NA_integer_ + summary$is_causal_consensus = as.logical (NA) + summary$edge_stats = NA_character_ + + if ( ( ! is.null (results$adj_matrices) ) + && (length (results$adj_matrices) > 1) ) + { + # If the consistent parameter is turned on and the result graph is a union of + # more than one inconsistent graph, get the possible orientations of each + # edge with the corresponding frequencies and the consensus status. 
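The per-edge orientation frequencies described in the comment above can be sketched with toy values; the orientation codes follow miic's convention (0 = no edge, 1 = undirected, 2/-2 = head at one tip), and the 0.8 threshold is purely illustrative:

```r
# Orientation code of edge X-Y in each of 4 hypothetical cycle graphs
n_cycles <- 4
list_adj <- c(2, 2, 1, 2)
# Relative frequency of each observed orientation, most frequent first
edge_stats <- table(list_adj) / n_cycles
edge_stats <- edge_stats[order(edge_stats, decreasing = TRUE), drop = FALSE]
# With an illustrative consensus threshold of 0.8, the top frequency
# (0.75 for code 2) stays below it, so no orientation reaches consensus
top_freq <- as.numeric(edge_stats[1])
```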
+ # + n_cycles = length (results$adj_matrices) + edge_stats_table <- lapply (1:nrow(summary), function(i) { + list_adj <- unlist (lapply (results$adj_matrices, + FUN=function (z) { z[ summary[i, "x"], summary[i, "y"] ] }) ) + t <- table(list_adj) / n_cycles + t <- t[order(t, decreasing = TRUE), drop = FALSE] + return (t) + }) + + summary$ort_consensus <- unlist (lapply (edge_stats_table, + get_consensus_status, + consensus_threshold) ) + summary$edge_stats <- unlist (lapply (edge_stats_table, + get_edge_stats_str) ) + # # Set consensus edge status according to the average probabilities - for (i in 1:nrow(summary)) { + # + for (i in 1:nrow(summary)) + { row <- summary[i, ] - if (causality_deducible) { + if (causality_deducible) + { # Set initial values if deducible - if (row$infOrt != 0) - summary[i, ]$is_causal <- "N" - if (row$consensus != 0) - summary[i, ]$is_causal_consensus <- "N" - } - if (row$consensus == 0) next + if (row$ort_inferred != 0) + summary[i, "is_causal"] <- FALSE + if (row$ort_consensus != 0) + summary[i, "is_causal_consensus"] <- FALSE + } + if (row$ort_consensus == 0) + next - id_x <- match(row$x, var_names) - id_y <- match(row$y, var_names) # probability of an edge tip being a head (<,>), * means head or tail (-) - proba_x2y <- results$proba_adj_average[id_x, id_y] # proba of x *-> y - proba_y2x <- results$proba_adj_average[id_y, id_x] # proba of x <-* y + proba_x2y <- results$proba_adj_average[row$x, row$y] # proba of x *-> y + proba_y2x <- results$proba_adj_average[row$y, row$x] # proba of x <-* y ratio_x2y <- (1 - proba_x2y) / proba_x2y ratio_y2x <- (1 - proba_y2x) / proba_y2x - if (ratio_x2y < ori_consensus_ratio && ratio_y2x < ori_consensus_ratio) { - summary[i, ]$consensus <- 6 - } else if (ratio_x2y < ori_consensus_ratio && - ratio_y2x >= ori_consensus_ratio) { - summary[i, ]$consensus <- 2 - if (1 / ratio_y2x < ori_consensus_ratio && causality_deducible) { - summary[i, ]$is_causal_consensus <- "Y" - if (row$infOrt == 2) { - summary[i, 
]$is_causal <- "Y" + if ( (ratio_x2y < ort_consensus_ratio) + && (ratio_y2x < ort_consensus_ratio) ) + summary[i, "ort_consensus"] <- 6 + else if ( (ratio_x2y < ort_consensus_ratio) + && (ratio_y2x >= ort_consensus_ratio) ) + { + summary[i, "ort_consensus"] <- 2 + if (1 / ratio_y2x < ort_consensus_ratio && causality_deducible) + { + summary[i, "is_causal_consensus"] <- TRUE + if (row$ort_inferred == 2) + summary[i, "is_causal"] <- TRUE } } - } else if (ratio_y2x < ori_consensus_ratio && - ratio_x2y >= ori_consensus_ratio) { - summary[i, ]$consensus <- -2 - if (1 / ratio_x2y < ori_consensus_ratio && causality_deducible) { - summary[i, ]$is_causal_consensus <- "Y" - if (row$infOrt == -2) { - summary[i, ]$is_causal <- "Y" + else if ( (ratio_y2x < ort_consensus_ratio) + && (ratio_x2y >= ort_consensus_ratio) ) + { + summary[i, "ort_consensus"] <- -2 + if (1 / ratio_x2y < ort_consensus_ratio && causality_deducible) + { + summary[i, "is_causal_consensus"] <- TRUE + if (row$ort_inferred == -2) + summary[i, "is_causal"] <- TRUE } } - } else { - summary[i, ]$consensus <- 1 + else + summary[i, "ort_consensus"] <- 1 } + summary$ort_consensus = as.integer (summary$ort_consensus) } - } else if (causality_deducible) { - # if (is.null(results$adj_matrices) || ncol(results$adj_matrices) <= 1) - # set is_causal by results$proba_adj_matrix - for (i in 1:nrow(summary)) { + else if (causality_deducible) + { + # Consistency not activated or only one graph in the cycle + # + for (i in 1:nrow(summary)) + { row <- summary[i, ] - if (row$infOrt == 0) { + if (row$ort_inferred == 0) next - } - summary[i, ]$is_causal <- "N" + summary[i, "is_causal"] <- FALSE - id_x <- match(row$x, var_names) - id_y <- match(row$y, var_names) # probability of an edge tip being a head (<,>), * means head or tail (-) - proba_x2y <- results$proba_adj_matrix[id_x, id_y] # proba of x *-> y - proba_y2x <- results$proba_adj_matrix[id_y, id_x] # proba of x <-* y + proba_x2y <- results$proba_adj_matrix[row$x, row$y] # 
proba of x *-> y + proba_y2x <- results$proba_adj_matrix[row$y, row$x] # proba of x <-* y ratio_x2y <- (1 - proba_x2y) / proba_x2y ratio_y2x <- (1 - proba_y2x) / proba_y2x - if (row$infOrt == 2 && - ratio_x2y < ori_consensus_ratio && - 1 / ratio_y2x < ori_consensus_ratio) { - summary[i, ]$is_causal <- "Y" - } - if (row$infOrt == -2 && - ratio_y2x < ori_consensus_ratio && - 1 / ratio_x2y < ori_consensus_ratio) { - summary[i, ]$is_causal <- "Y" + if ( (row$ort_inferred == 2) + && (ratio_x2y < ort_consensus_ratio) + && (1 / ratio_y2x < ort_consensus_ratio) ) + summary[i, "is_causal"] <- TRUE + if ( (row$ort_inferred == -2) + && (ratio_y2x < ort_consensus_ratio) + && (1 / ratio_x2y < ort_consensus_ratio) ) + summary[i, "is_causal"] <- TRUE } } - } - - # Sort summary by log confidence and return it - summary <- summary[order(summary$info_shifted, decreasing = TRUE), ] + # + # Sort summary by log likelihood, keep only selected columns and return + # + columns_kept = c ("x", "y", "type", "ai", "raw_contributions", "contributions", + "info", "n_xy", "info_cond", "cplx", "n_xy_ai", "info_shifted", + "ort_inferred", "ort_ground_truth", "is_inference_correct", "is_causal", + "ort_consensus", "is_causal_consensus", "edge_stats", + "sign", "partial_correlation", "p_y2x", "p_x2y", "confidence") + columns_kept = columns_kept[columns_kept %in% colnames(summary)] + summary <- summary[order(summary$info_shifted, decreasing = TRUE), + columns_kept, drop=FALSE] rownames(summary) <- c() - return(summary) -} - -matrix_from_3_columns <- function(df, rows, columns, values) { - x = df[[rows]] - y = df[[columns]] - nodes = unique(c(x,y)) - with(df, { - out <- matrix(0, nrow=length(nodes), ncol=length(nodes), - dimnames=list(nodes, nodes)) - out[cbind(x, y)] <- df[[values]] - out[cbind(y, x)] <- df[[values]] - out - }) -} - -fill_summary_column <- function(summary, matrix, rows, columns, values) { - # This function uses the information from matrix to return a vector of - values (from the column 
named `values`). - # It creates a 2D matrix indexed with the `rows` and `columns` values for - # faster lookup, and returns the values for all the combinations of rows and - # columns observed in the `summary` matrix. - - wide_matrix <- matrix_from_3_columns(matrix, rows, columns, values) - - apply(summary, 1, function(row, wide_matrix) { - wide_matrix[row[1], row[2]] - }, wide_matrix) -} + return (summary) + } +#------------------------------------------------------------------------------- +# compute_partial_correlation +#------------------------------------------------------------------------------- compute_partial_correlation <- function(summary, observations, state_order) { ppcor_results <- data.frame( sign = character(nrow(summary)), @@ -403,27 +461,20 @@ compute_partial_correlation <- function(summary, observations, state_order) { return(ppcor_results) } -get_edge_stats_table <- function(row, var_names, adj_matrices) { - # row[1]: variable name of x - # row[2]: variable name of y - n_var <- length(var_names) - # adj_matrices is of dimension (n_var * n_var, n_cycle), i.e., each - # column is a 1-d adjacency matrix - index_1d <- n_var * (match(row[1], var_names) - 1) + match(row[2], var_names) - n_cycle <- dim(adj_matrices)[2] - # edge stats table, count replaced by frequency (percentage) - t <- table(adj_matrices[index_1d,]) / n_cycle - t <- t[order(t, decreasing = TRUE), drop = FALSE] - return(t) -} - +#------------------------------------------------------------------------------- +# get_edge_stats_str +#------------------------------------------------------------------------------- get_edge_stats_str <- function(stats_table) { t <- sapply(stats_table, scales::percent_format()) # return a ";" separated string of format "percentage(orientation)" return(paste(t, "(", names(t), ")", sep = "", collapse = ";")) } +#------------------------------------------------------------------------------- +# get_consensus_status 
+#------------------------------------------------------------------------------- # 0: unconnected, 1: connected +#------------------------------------------------------------------------------- get_consensus_status <- function(stats_table, consensus_threshold) { if (length(stats_table) < 1) return(NA) diff --git a/R/tmiic.plot.R b/R/tmiic.plot.R index acb6654..8c4cd9b 100644 --- a/R/tmiic.plot.R +++ b/R/tmiic.plot.R @@ -6,125 +6,6 @@ # Author : Franck SIMON #******************************************************************************* -#------------------------------------------------------------------------------- -# tmiic.export -#------------------------------------------------------------------------------- -#' Export temporal miic (tmiic) result to different plotting methods -#' -#' @description This function creates an object built from the result returned -#' by \code{\link{miic}} executed in temporal mode that is ready to be fed to -#' different plotting methods. -#' -#' @param tmiic_res [a tmiic object] -#' The object returned by the \code{\link{miic}} execution in temporal mode. -#' -#' @param display [a string]. Optional, default value "compact". -#' Possible values are \emph{"raw"}, \emph{"lagged"}, \emph{"compact"}, -#' \emph{"combine"}, \emph{"unique"}, \emph{"drop"}: -#' \itemize{ -#' \item When \emph{display} = \emph{"raw"}, the export function will -#' use the tmiic graph object as it, leading to the return of a lagged -#' graph. -#' \item When \emph{display} = \emph{"lagged"}, the export function will -#' repeat the edges over history assuming stationarity and return a lagged -#' graph. -#' \item When \emph{display} = \emph{"compact"}, the default, nodes -#' and edges are converted into a flattened version to produce a compact -#' view of the temporal network whilst still presenting all the information -#' in the export.\cr -#' i.e.: X_lag1->Y_lag0, X_lag2<-Y_lag0 become respectively X->Y lag=1, -#' X<-Y lag=2. 
-#' \item When \emph{display} = \emph{"combine"}, prior to the export, -#' a pre-processing will be applied to kept only one edge -#' per couple of nodes. The info_shifted will be the highest one -#' of the summarized edges whilst the lag and orientation of the -#' summarized edge will be an aggregation.\cr -#' i.e.: X_lag2->Y_lag0, X_lag0<-Y_lag1 will become X<->Y lag=1-2 with -#' the info_shifted of X_lag2->Y_lag0 if info_shifted of -#' X_lag2->Y_lag0 > X_lag0<-Y_lag1. -#' \item When \emph{display} = \emph{"unique"}, prior to the export, -#' a pre-processing will be applied to kept only the edges having the -#' highest info_shifted for a couple of nodes. -#' If several edges between the sames nodes have the same -#' info_shifted, then the edge kept is the one with the minimum lag.\cr -#' i.e.: X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of -#' X_lag1->Y_lag0 > X_lag0<-Y_lag2 become X->Y lag=1. -#' \item When \emph{display} = \emph{"drop"}, prior to the export, -#' a pre-processing will be applied to kept only the edges having the -#' highest info_shifted for a couple of nodes. -#' If several edges between the sames nodes have the same -#' info_shifted, then the edge kept is the one with the minimum lag.\cr -#' i.e. : X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of -#' X_lag1->Y_lag0 > X_lag0<-Y_lag2 become X->Y. -#' The lag information is dropped during the preprocessing and -#' will not be exported. -#' } -#' -#' @param show_self_loops [a boolean] Optional, TRUE by default. -#' When TRUE, the edges like X_lag0-X_lag1 are exported. -#' When FALSE, only edges having different nodes are exported. -#' -#' @param method A string representing the plotting method. -#' Currently only "igraph" is supported. -#' -#' @param pcor_palette Optional. The color palette used to represent the partial -#' correlations (the color of the edges). See \code{\link{getIgraph}} for details. -#' -#' @return A graph object adapted to the method. 
-#' -#' @export -#' -#' @examples -#' \donttest{ -#' library(miic) -#' data(covidCases) -#' # execute MIIC (reconstruct graph in temporal mode) -#' tmiic_res <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, movavg = 14) -#' -#' # Plot default compact temporal network Using igraph -#' if(require(igraph)) { -#' g = tmiic.export(tmiic_res, method="igraph") -#' plot(g) # Default visualisation, calls igraph::plot.igraph() -#' -#' # Plot raw temporal network Using igraph -#' g = tmiic.export(tmiic_res, display="raw", method="igraph") -#' plot(g) # Default visualisation, calls igraph::plot.igraph() -#' -#' # Plot full temporal network Using igraph -#' g = tmiic.export(tmiic_res, display="lagged", method="igraph") -#' plot(g) # Default visualisation, calls igraph::plot.igraph() -#' -#' # Specifying layout (see ?igraph::layout_) -#' l <- layout_on_grid(g, width = 5, height = 3, dim = 2) -#' plot(g, layout=l) -#' -#' # Override some graphical parameters -#' plot(g, edge.arrow.size = 0.75) -#' plot(g, vertex.shape="none", edge.color="gray85", vertex.label.color="gray10") -#' -#' # For compact graphs, please be aware that the rendering of -#' # igraph::plot.igraph() is not optimal when the graph contains -#' # multiple edges between the same nodes. 
-#' # So the recommend way to plot a compact graph is to use tmiic plotting: -#' plot(tmiic_res) -#' } -#' -#' } -#------------------------------------------------------------------------------- -tmiic.export <- function (tmiic_res, display="compact", show_self_loops=TRUE, - method="igraph", pcor_palette=NULL) - { - if (is.null(tmiic_res$all.edges.summary)) - stop("Error: The inferred network does not exist") - if (is.null(method)) - stop("Error: Plotting method is required") - if (method != "igraph") - stop("Error: Method not supported") - return(tmiic_getIgraph(tmiic_res, display=display, - show_self_loops=show_self_loops, - pcor_palette=pcor_palette)) - } - #------------------------------------------------------------------------------- # tmiic_getIgraph #------------------------------------------------------------------------------- @@ -141,7 +22,7 @@ tmiic.export <- function (tmiic_res, display="compact", show_self_loops=TRUE, # conditional mutual information minus the complexity cost. # # params: -# - tmiic_res: a tmiic object, returned by the miic execution in temporal mode +# - tmiic_obj: a tmiic object, returned by the miic execution in temporal mode # # - display: string. Optional, default value "compact". # Possible values are "raw", "lagged", "compact", "combine", "unique", "drop": @@ -152,29 +33,23 @@ tmiic.export <- function (tmiic_res, display="compact", show_self_loops=TRUE, # * "compact", the default, nodes and edges are converted into a flattened # version to produce a compact view of the temporal network # whilst still presenting all the information. -# i.e.: X_lag1->Y_lag0, X_lag0<-Y_lag2 become respectively X->Y lag=1, +# e.g. X_lag1->Y_lag0, X_lag0<-Y_lag2 become respectively X->Y lag=1, # X<-Y lag=2. # * "combine", a pre-processing will be applied to kept only one edge # per couple of nodes. The info_shifted will be the highest one # of the summarized edges whilst the lag and orientation of the # summarized edge will be an aggregation. 
-# i.e.: X_lag2->Y_lag0, X_lag0<-Y_lag1 will become X<->Y lag=1,2 with +# e.g. X_lag2->Y_lag0, X_lag0<-Y_lag1 will become X<->Y lag=1,2 with # the info_shifted of X_lag2->Y_lag0 if info_shifted of # X_lag2->Y_lag0 > X_lag0<-Y_lag1. # * "unique", a pre-processing will be applied to kept only the edges # having the highest info_shifted for a couple of nodes. # If several edges between the sames nodes have the same # info_shifted, then the edge kept is the one with the minimum lag. -# i.e.: X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of +# e.g. X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of # X_lag1->Y_lag0 > X_lag0<-Y_lag2 become X->Y lag=1. -# * "drop"}, prior to the plotting, a pre-processing will be applied -# to kept only the edges having the highest info_shifted for a couple -# of nodes. -# If several edges between the sames nodes have the same -# info_shifted, then the edge kept is the one with the minimum lag. -# i.e. : X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of -# X_lag1->Y_lag0 > X_lag0<-Y_lag2 become X->Y. -# The lag information is dropped during the preprocessing. +# * "drop", the same pre-processing as "unique" will be applied. +# In addition, the lag information will be dropped. # # - show_self_loops: boolean, optional, TRUE by default. # When TRUE, the edges like X_lag0-X_lag1 are included in the iGraph object. @@ -186,16 +61,15 @@ # # returns: an igraph graph object. 
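The flattening described in this header (e.g. X_lag1->Y_lag0 becoming X->Y with lag=1) can be sketched independently of miic; the node names below are hypothetical and the real logic lives in `tmiic_flatten_network`:

```r
# Lagged edge endpoints as they appear in a raw temporal graph
edges <- data.frame(x = c("X_lag1", "X_lag2"),
                    y = c("Y_lag0", "Y_lag0"),
                    stringsAsFactors = FALSE)
# Extract the numeric lag carried by each node name
lag_of <- function(v) as.integer(sub("^.*_lag", "", v))
# The flattened edge keeps the lag difference as an attribute...
edges$lag <- abs(lag_of(edges$x) - lag_of(edges$y))
# ...and drops the "_lagN" suffix from the node names
edges$x <- sub("_lag[0-9]+$", "", edges$x)
edges$y <- sub("_lag[0-9]+$", "", edges$y)
# edges now describes X->Y lag=1 and X->Y lag=2
```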
#------------------------------------------------------------------------------- -tmiic_getIgraph <- function (tmiic_res, display="compact", +tmiic_getIgraph <- function (tmiic_obj, display="compact", show_self_loops=TRUE, pcor_palette=NULL) { if (display == "lagged") - tmiic_res$all.edges.summary = tmiic_res$tmiic$all.edges.stationarity + tmiic_obj$summary = tmiic_obj$tmiic$stationarity else if (display != "raw") - tmiic_res <- tmiic_flatten_network (tmiic_res, flatten_mode=display, + tmiic_obj <- tmiic_flatten_network (tmiic_obj, flatten_mode=display, keep_edges_on_same_node=show_self_loops) - - graph <- getIgraph (tmiic_res, pcor_palette=pcor_palette) + graph <- getIgraph (tmiic_obj, pcor_palette=pcor_palette) if (display %in% c("raw", "lagged") ) { @@ -206,8 +80,8 @@ tmiic_getIgraph <- function (tmiic_res, display="compact", else { igraph::E(graph)$curved = FALSE - if ( "lag" %in% colnames(tmiic_res$all.edges.summary) ) - igraph::E(graph)$label <- tmiic_res$all.edges.summary$lag + if ( "lag" %in% colnames(tmiic_obj$summary) ) + igraph::E(graph)$label <- tmiic_obj$summary$lag } return(graph) } @@ -224,11 +98,11 @@ tmiic_getIgraph <- function (tmiic_res, display="compact", # params: the tmiic object returned by the miic execution in temporal mode, # eventually flattened # -# @return tmiic_res [a tmiic object] The modified tmiic object +# @return [a tmiic object] The modified tmiic object #----------------------------------------------------------------------------- -tmiic_prepare_edges_for_plotting <- function (tmiic_res) +tmiic_prepare_edges_for_plotting <- function (tmiic_obj) { - df_edges <- tmiic_res$all.edges.summary[tmiic_res$all.edges.summary$type %in% c('P', 'TP', 'FP'), ] + df_edges <- tmiic_obj$summary[tmiic_obj$summary$type %in% c('P', 'TP', 'FP'), ] if (nrow(df_edges) <= 0) df_edges$xy = character(0) else @@ -249,20 +123,24 @@ tmiic_prepare_edges_for_plotting <- function (tmiic_res) # for(row in 1:nrow(df_edges)) { - if(df_edges[row, "infOrt"] == -2) 
+ if(df_edges[row, "ort_inferred"] == -2) { df_edges[row, c("x","y")] = df_edges[row, c("y","x")] - df_edges[row, "infOrt"] = 2 - if(!is.na(df_edges[row, "proba"])) - df_edges[row, "proba"] = paste0(rev( - strsplit(df_edges[row, "proba"], ";")[[1]]), collapse=";") - if(!is.na(df_edges[row, "trueOrt"])) - df_edges[row, "trueOrt"] = 2 + df_edges[row, "ort_inferred"] = 2 + if ( (!is.na(df_edges[row, "p_y2x"])) + && (!is.na(df_edges[row, "p_x2y"])) ) + { + temp <- df_edges[row, "p_y2x"] + df_edges[row, "p_y2x"] <- df_edges[row, "p_x2y"] + df_edges[row, "p_x2y"] <- temp + } + if(!is.na(df_edges[row, "ort_ground_truth"])) + df_edges[row, "ort_ground_truth"] = 2 } } } - tmiic_res$all.edges.summary <- df_edges - return (tmiic_res) + tmiic_obj$summary <- df_edges + return (tmiic_obj) } #------------------------------------------------------------------------------- @@ -273,15 +151,15 @@ tmiic_prepare_edges_for_plotting <- function (tmiic_res) # @description This function identifies the couple of nodes having mutiples # edges # -# @param [a tmiic graph object] +# @param [a tmiic object] # The graph object returned by the miic execution in temporal mode and # flattened (if the tmiic object is not flattened, the function does nothing) # # @return df_mult [a dataframe] The dataframe containing the multiple edges #------------------------------------------------------------------------------- -tmiic_get_multiple_edges_for_plotting <- function (tmiic_res) +tmiic_get_multiple_edges_for_plotting <- function (tmiic_obj) { - df_mult <- tmiic_res$all.edges.summary + df_mult <- tmiic_obj$summary if (nrow(df_mult) <= 0) df_mult$count <- numeric(0) else @@ -539,18 +417,18 @@ tmiic_compute_row_layout_greedy_recurs <- function (node_left, node_right, # The layout is completed with nodes without edges to produce the final # layer 0 layout. 
# -# param: tmiic_res, the object returned by the miic execution in temporal mode +# param: tmiic_obj, the object returned by the miic execution in temporal mode # # returns: a list, the position along an axis for each node #------------------------------------------------------------------------------- -tmiic_compute_row_layout_greedy <- function (tmiic_res) +tmiic_compute_row_layout_greedy <- function (tmiic_obj) { - list_nodes_not_lagged <- tmiic_res$state_order$var_names + list_nodes_not_lagged <- tmiic_obj$state_order$var_names # # Filter out self edges, count and summarize edges regardless their lags # - tmiic_flat <- tmiic_flatten_network (tmiic_res) - df_edges <- tmiic_flat$all.edges.summary + tmiic_flat <- tmiic_flatten_network (tmiic_obj) + df_edges <- tmiic_flat$summary df_edges <- df_edges[(df_edges$x != df_edges$y),] if (nrow (df_edges) == 0) df_edges$count <- integer() @@ -606,15 +484,15 @@ tmiic_compute_row_layout_greedy <- function (tmiic_res) # This function computes the layout so that the less layers # has a node, the more to the exteriors it will be placed. 
# -# param: tmiic_res, a tmiic object returned by the execution of miic +# param: tmiic_obj, a tmiic object returned by the execution of miic # in temporal mode ("raw" graph_type) # # returns: a list, the position along an axis for each node #------------------------------------------------------------------------------- -tmiic_compute_row_layout_layers <- function (tmiic_res) +tmiic_compute_row_layout_layers <- function (tmiic_obj) { - n_nodes_not_lagged <- nrow(tmiic_res$state_order) - list_n_layers_back <- tmiic_res$state_order$n_layers - 1 + n_nodes_not_lagged <- nrow(tmiic_obj$state_order) + list_n_layers_back <- tmiic_obj$state_order$n_layers - 1 n_layers_back_max <- max (list_n_layers_back) # # Precompute the rows on the grid, putting nodes with the less lags @@ -660,23 +538,23 @@ tmiic_compute_row_layout_layers <- function (tmiic_res) #------------------------------------------------------------------------------- # Internal function to precompute a layout suited for the display of raw and # lagged graphs -# This function computes the layout using Sugiyama algorihtm to +# This function computes the layout using Sugiyama algorithm to # minimize crossing edges # -# param: tmiic_res, a tmiic objectreturned by the execution of miic +# param: tmiic_obj, a tmiic object returned by the execution of miic # in temporal mode ("raw" graph_type) # # returns: a list, the position along an axis for each node #------------------------------------------------------------------------------- -tmiic_compute_row_layout_sugiyama <- function (tmiic_res) +tmiic_compute_row_layout_sugiyama <- function (tmiic_obj) { - list_nodes_not_lagged <- tmiic_res$state_order$var_names + list_nodes_not_lagged <- tmiic_obj$state_order$var_names n_nodes_not_lagged <- length(list_nodes_not_lagged) # # Filter out self edges, count and summarize edges regardless their lags # - tmiic_flat <- tmiic_flatten_network(tmiic_res) - df_edges <- tmiic_flat$all.edges.summary + tmiic_flat <- 
tmiic_flatten_network(tmiic_obj) + df_edges <- tmiic_flat$summary df_edges <- df_edges[(df_edges$x != df_edges$y),] if (nrow(df_edges) == 0) df_edges$count <- integer() @@ -712,7 +590,7 @@ tmiic_compute_row_layout_sugiyama <- function (tmiic_res) # graphs # # params: -# - tmiic_res, a tmiic object returned by the miic's execution in temporal mode. +# - tmiic_obj, a tmiic object returned by the miic's execution in temporal mode. # # - display: string. optional, default value "raw". # Possible values are "raw" and "lagged". @@ -726,7 +604,7 @@ tmiic_compute_row_layout_sugiyama <- function (tmiic_res) # * When positioning = "alphabetical": # The nodes are positioned alphabetically in ascending order # * When positioning = "layers": -# The nodes with the less lags wil be placed on the exteriors +# The nodes with the less lags will be placed on the exteriors # while the nodes having the most lags are in the center # * When positioning = "greedy": # A greedy algorithm will be used to placed the nodes in a way minimizing @@ -741,7 +619,7 @@ tmiic_compute_row_layout_sugiyama <- function (tmiic_res) # # returns: a matrix, the layout to use for drawing #------------------------------------------------------------------------------- -tmiic_compute_grid_layout <- function (tmiic_res, display="raw", +tmiic_compute_grid_layout <- function (tmiic_obj, display="raw", positioning="greedy", orientation="L") { if (! display %in% c("raw", "lagged") ) @@ -751,7 +629,7 @@ tmiic_compute_grid_layout <- function (tmiic_res, display="raw", if (! 
orientation %in% c("L", "P") ) stop ("Error: Invalid orientation parameter") - nodes_not_lagged <- tmiic_res$state_order$var_names + nodes_not_lagged <- tmiic_obj$state_order$var_names n_nodes_not_lagged <- length (nodes_not_lagged) # # Precompute the layer 0 layout @@ -768,18 +646,18 @@ tmiic_compute_grid_layout <- function (tmiic_res, display="raw", list_pos_of_nodes <- unlist (list_pos_of_nodes) } if (positioning == "layers") - list_pos_of_nodes <- tmiic_compute_row_layout_layers (tmiic_res) + list_pos_of_nodes <- tmiic_compute_row_layout_layers (tmiic_obj) if (positioning == "greedy") - list_pos_of_nodes <- tmiic_compute_row_layout_greedy (tmiic_res) + list_pos_of_nodes <- tmiic_compute_row_layout_greedy (tmiic_obj) if (positioning == "sugiyama") - list_pos_of_nodes <- tmiic_compute_row_layout_sugiyama (tmiic_res) + list_pos_of_nodes <- tmiic_compute_row_layout_sugiyama (tmiic_obj) if ( is.null (list_pos_of_nodes) ) stop ("Error: Layout can not be infered") # # As contextual nodes are placed in an extra column/row when display is "raw", # here we update the nodes positions to maintain a "nice" display # - is_contextual <- tmiic_res$state_order$is_contextual + is_contextual <- tmiic_obj$state_order$is_contextual if ( (display == "raw") & (sum(is_contextual) > 0) ) { list_pos_upd <- list_pos_of_nodes @@ -824,9 +702,9 @@ tmiic_compute_grid_layout <- function (tmiic_res, display="raw", # # Place contextual and lag0 nodes # - list_n_layers_back <- tmiic_res$state_order$n_layers - 1 + list_n_layers_back <- tmiic_obj$state_order$n_layers - 1 n_layers_back_max <- max (list_n_layers_back) - list_delta_t <- tmiic_res$state_order$delta_t + list_delta_t <- tmiic_obj$state_order$delta_t max_lags <- max (list_n_layers_back * list_delta_t) df_layout <- data.frame ( col=integer(), row=integer() ) @@ -875,17 +753,30 @@ tmiic_compute_grid_layout <- function (tmiic_res, display="raw", #------------------------------------------------------------------------------- #' Basic plot 
function of a temporal miic (tmiic) network inference result #' -#' @description This function calls \code{\link{tmiic.export}} to build a -#' plottable object from the result returned by \code{\link{miic}} in -#' temporal mode and plot it. +#' @description This function calls \code{\link{export}} to build a plottable +#' object from the result returned by \code{\link{miic}} in temporal mode +#' and plot it. #' -#' @details See the documentation of \code{\link{tmiic.export}} for further +#' @details See the documentation of \code{\link{export}} for further #' details. #' -#' @param x [a tmiic graph object] -#' The graph object returned by \code{\link{miic}} in temporal mode +#' @param x [a tmiic object, required] +#' +#' The object returned by \code{\link{miic}} in temporal mode. +#' +#' @param method [a string, optional, default value "igraph"] +#' +#' The plotting method, currently only "igraph" is supported. +#' +#' @param pcor_palette [a color palette, optional, default value +#' grDevices::colorRampPalette(c("blue", "darkgrey", "red"))] +#' +#' Used to represent the partial correlations (the color of the edges). +#' The palette must be able to handle 201 shades to cover the correlation range +#' from -100 to +100. +#' +#' @param display [a string, optional, default value "compact"] #' -#' @param display [a string]. Optional, default value "compact". #' Possible values are \emph{"raw"}, \emph{"lagged"}, \emph{"compact"}, #' \emph{"combine"}, \emph{"unique"}, \emph{"drop"}: #' \itemize{ @@ -901,50 +792,47 @@ tmiic_compute_grid_layout <- function (tmiic_res, display="raw", #' and edges are converted into a flattened version to produce a compact #' view of the temporal network whilst still presenting all the information #' in the plotting.\cr -#' i.e.: X_lag1->Y_lag0, X_lag2<-Y_lag0 become respectively X->Y lag=1, +#' e.g. X_lag1->Y_lag0, X_lag2<-Y_lag0 become respectively X->Y lag=1, #' X<-Y lag=2. 
#' \item When \emph{display} = \emph{"combine"}, prior to the plotting,
-#' a preprocessing will be applied to kept only one edge
-#' per couple of nodes. The info_shifted will be the highest one
+#' a pre-processing will be applied to keep only one edge
+#' per pair of nodes. The info_shifted will be the highest one
 #' of the summarized edges whilst the lag and orientation of the
 #' summarized edge will be an aggregation.\cr
-#' i.e.: X_lag1->Y_lag0, X_lag2<-Y_lag0 will become X<->Y lag=1,2 with
+#' e.g. X_lag1->Y_lag0, X_lag2<-Y_lag0 will become X<->Y lag=1,2 with
 #' the info_shifted of X_lag1->Y_lag0 if info_shifted of
 #' X_lag1->Y_lag0 > X_lag2<-Y_lag0.
 #' \item When \emph{display} = \emph{"unique"}, prior to the plotting,
-#' a preprocessing will be applied to kept only the edges having the
-#' highest info_shifted for a couple of nodes.
+#' a pre-processing will be applied to keep only the edges having the
+#' highest info_shifted for a pair of nodes.
 #' If several edges between the sames nodes have the same
 #' info_shifted, then the edge kept is the one with the minimum lag.\cr
-#' i.e.: X_lag1->Y_lag0, X_lag2<-Y_lag0 with info_shifted of
+#' e.g. X_lag1->Y_lag0, X_lag2<-Y_lag0 with info_shifted of
 #' X_lag1->Y_lag0 > X_lag2<-Y_lag0 become X->Y lag=1.
-#' \item When \emph{display} = \emph{"drop"}, prior to the plotting,
-#' a preprocessing will be applied to kept only the edges having the
-#' highest info_shifted for a couple of nodes.
-#' If several edges between the sames nodes have the same
-#' info_shifted, then the edge kept is the one with the minimum lag.\cr
-#' i.e. : X_lag1->Y_lag0, X_lag2<-Y_lag0 with info_shifted of
-#' X_lag1->Y_lag0 > X_lag2<-Y_lag0 become X->Y.
-#' The lag information is dropped during the preprocessing and
-#' will not be displayed on the final plotting.
+#' \item When \emph{display} = \emph{"drop"}, the same pre-processing +#' as \emph{"unique"} will be applied, then the lag information will be +#' dropped and will not be displayed on the final plotting. #' } #' -#' @param show_self_loops [a boolean] Optional, TRUE by default. -#' When TRUE, the edges like X_lag0-X_lag1 are included in the iGraph object. -#' When FALSE, only edges having different nodes are present in the iGraph +#' @param show_self_loops [a boolean, optional, TRUE by default] +#' +#' When TRUE, the lagged edges starting and ending on the same node +#' are included in the igraph object. +#' When FALSE, only edges having different nodes are present in the igraph #' object. #' -#' @param positioning_for_grid [a string] Optional, "greedy" by default. -#' Used only when the display is "raw" or "lagged and no layout is supplied. -#' Possible values are \emph{"none"}, \emph{"alphabetical"}, \emph{"layers"} +#' @param positioning_for_grid [a string, optional, "greedy" by default] +#' +#' Used only when the display is "raw" or "lagged" and no layout is supplied. 
+#' Possible values are \emph{"none"}, \emph{"alphabetical"}, \emph{"layers"},
 #' \emph{"greedy"} and \emph{"sugiyama"}
 #' \itemize{
 #' \item When \emph{positioning_for_grid} = \emph{"none"}
 #' The nodes are positioned as they appear in the miic result
 #' \item When \emph{positioning_for_grid} = \emph{"alphabetical"}
-#' The nodes are positioned alphabeticaly in ascending order
+#' The nodes are positioned alphabetically in ascending order
 #' \item When \emph{positioning_for_grid} = \emph{"layers"}
-#' The nodes with the less lags wil be placed on the exteriors
+#' The nodes with the fewest lags will be placed on the exteriors
 #' while the nodes having the most lags are in the center
 #' \item When \emph{positioning_for_grid} = \emph{"greedy"}
 #' A greedy algorithm will be used to placed the nodes in a way minimizing
@@ -954,25 +842,20 @@ tmiic_compute_grid_layout <- function (tmiic_res, display="raw",
 #' minimizing the crossing edges
 #' }
 #'
-#' @param orientation_for_grid [a string] Optional, "L" by default.
+#' @param orientation_for_grid [a string, optional, "L" by default]
+#'
 #' Used only when the display is "raw" or "lagged and no layout is supplied.
 #' Indicates the orientation of the draw, possible values are landscape: "L"
 #' or portrait: "P".
 #'
-#' @param method A string representing the plotting method. Default to "igraph".
-#' Currently only "igraph" is supported.
-#'
-#' @param pcor_palette Optional. The color palette used to represent the partial
-#' correlations (the color of the edges). See \code{\link{getIgraph}} for details.
-#'
-#' @param \dots Additional plotting parameters. See the corresponding plot function
-#' for the complete list.
+#' @param \dots Additional plotting parameters. See the corresponding plot
+#' function for the complete list.
 #'
 #' For igraph, see \code{\link[igraph]{igraph.plotting}}.
#' #' @export #' -#' @seealso \code{\link{tmiic.export}} for generic exports, +#' @seealso \code{\link{export}} for graphical exports, #' \code{\link[igraph]{igraph.plotting}} #' #' @examples @@ -982,31 +865,32 @@ tmiic_compute_grid_layout <- function (tmiic_res, display="raw", #' #' # EXAMPLE COVID CASES (time series demo) #' data(covidCases) #' # execute MIIC (reconstruct graph in temporal mode) -#' tmiic_res <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, movavg = 14) +#' tmiic_obj <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, mov_avg = 14) #' #' # to plot the default compact graph #' if(require(igraph)) { -#' plot(tmiic_res) +#' plot(tmiic_obj) #' } #' -#' # to plot the raw temporal network Using igraph +#' # to plot the raw temporal network #' if(require(igraph)) { -#' plot(tmiic_res, display="raw") +#' plot(tmiic_obj, display="raw") #' } #' -#' # to plot the full temporal network Using igraph +#' # to plot the full temporal network #' if(require(igraph)) { -#' plot(tmiic_res, display="lagged") +#' plot(tmiic_obj, display="lagged") #' } #' #' } #------------------------------------------------------------------------------- -plot.tmiic = function(x, display="compact", show_self_loops=TRUE, +plot.tmiic = function(x, method='igraph', pcor_palette=NULL, + display="compact", show_self_loops=TRUE, positioning_for_grid="greedy", orientation_for_grid="L", - method = 'igraph', pcor_palette=NULL, ...) + ...) { if (method != 'igraph') - stop("Error: Method not supported. See ?tmiic.export for supported methods.") + stop("Error: Method not supported. 
See ?export for supported methods.") if ( !base::requireNamespace("igraph", quietly = TRUE) ) stop("Error: Package 'igraph' is required.") if ( is.null (x$adj_matrix) ) @@ -1028,8 +912,8 @@ plot.tmiic = function(x, display="compact", show_self_loops=TRUE, # # Export the graph to a graphical object # - graph <- tmiic.export (x, display=display, method=method, - pcor_palette=pcor_palette) + graph <- export (x, method=method, pcor_palette=pcor_palette, + display=display, show_self_loops=show_self_loops) # # Look if we have cases with multiple edges between two nodes # or multiple self loops because we need to plot these cases iteratively. @@ -1057,7 +941,7 @@ plot.tmiic = function(x, display="compact", show_self_loops=TRUE, # If we have a least on case with multiple edges between the same nodes, # draw iteratively # - df_edges <- x$all.edges.summary + df_edges <- x$summary edges_colors_iter <- igraph::E(graph)$color edges_labels_iter <- igraph::E(graph)$label # diff --git a/R/tmiic.utils.R b/R/tmiic.utils.R index 1f0492c..77b73ce 100644 --- a/R/tmiic.utils.R +++ b/R/tmiic.utils.R @@ -10,10 +10,10 @@ # tmiic_check_state_order_part1 #------------------------------------------------------------------------------- # This function performs the first part checks of the state order columns -# specific to temporal mode: n_layers, delta_t and movavg. +# specific to temporal mode: n_layers, delta_t and mov_avg. # In most cases, these columns will not be present at this stage, # as these information will be likely provided as parameters -# (cf tmiic_check_parameters to see how the n_layers, delta_t and movavg +# (cf tmiic_check_parameters to see how the n_layers, delta_t and mov_avg # parameters are moved into the state_order). # Checks here are basic and cover NULL, integer type and minimal values only. 
# NAs are excluded from warnings (NA = row added because var name missing) @@ -86,11 +86,11 @@ tmiic_check_state_order_part1 <- function (state_order) state_order$delta_t = as.integer (state_order$delta_t) } # - # movavg check + # mov_avg check # - if ("movavg" %in% colnames (state_order) ) + if ("mov_avg" %in% colnames (state_order) ) { - wrongs = unlist (lapply (state_order$movavg, FUN=function(x) { + wrongs = unlist (lapply (state_order$mov_avg, FUN=function(x) { if (is.na (x)) # NA: OK (missing row added before) return (FALSE) else if ( is.na ( suppressWarnings (as.numeric(x)) ) ) # Not num: KO @@ -111,9 +111,9 @@ tmiic_check_state_order_part1 <- function (state_order) else miic_warning ("state order", "the moving average are incorrect for", " several variables (", msg_str, "), these values will be ignored.") - state_order$movavg[wrongs] = NA + state_order$mov_avg[wrongs] = NA } - state_order$movavg = as.integer (state_order$movavg) + state_order$mov_avg = as.integer (state_order$mov_avg) } return (state_order) } @@ -123,7 +123,7 @@ tmiic_check_state_order_part1 <- function (state_order) #------------------------------------------------------------------------------- # Checks on parameters for temporal mode # -# As the temporal parameters n_layers, delta_t, movavg need to take different +# As the temporal parameters n_layers, delta_t, mov_avg need to take different # values depending on the type (discrete/continuous) or contextual, # these parameters are moved in the state_order to have a value defined # for each variable (unless these information are already in the state_order, @@ -140,7 +140,7 @@ tmiic_check_state_order_part1 <- function (state_order) # - params: the list of parameters with temporal parameters added #------------------------------------------------------------------------------- tmiic_check_parameters <- function (state_order, params, - n_layers, delta_t, movavg, keep_max_data, max_nodes) + n_layers, delta_t, mov_avg, keep_max_data, 
max_nodes) { # Check number of layers parameter # @@ -230,46 +230,46 @@ tmiic_check_parameters <- function (state_order, params, } } # - # Check movavg + # Check mov_avg # - if ( ! is.null (movavg) ) + if ( ! is.null (mov_avg) ) { - if ( test_param_wrong_int (movavg, min=0, max=NA) - || (movavg == 1) ) + if ( test_param_wrong_int (mov_avg, min=0, max=NA) + || (mov_avg == 1) ) { - if ( "movavg" %in% colnames(state_order) ) - miic_warning ("parameters", "supplied value ", list_to_str (movavg), + if ( "mov_avg" %in% colnames(state_order) ) + miic_warning ("parameters", "supplied value ", list_to_str (mov_avg), " for the moving average parameter is invalid,", " if not NULL or 0, it must be an integer >= 2.", " This issue has no impact as the moving average is provided", " in the state_order.") else - miic_warning ("parameters", "supplied value ", list_to_str (movavg), + miic_warning ("parameters", "supplied value ", list_to_str (mov_avg), " for the moving average parameter is invalid,", " if not NULL or 0, it must be an integer >= 2.", " The moving average parameter will be ignored.") } - else # valid movavg + else # valid mov_avg { - if ( ! ("movavg" %in% colnames(state_order)) ) + if ( ! ("mov_avg" %in% colnames(state_order)) ) { - state_order$movavg = movavg - # No movavg on discrete or contextual vars - state_order$movavg[state_order$var_type == 0] = 0 - state_order$movavg[state_order$is_contextual == 1] = 0 + state_order$mov_avg = mov_avg + # No mov_avg on discrete or contextual vars + state_order$mov_avg[state_order$var_type == 0] = 0 + state_order$mov_avg[state_order$is_contextual == 1] = 0 } - else # movavg in state_order + else # mov_avg in state_order { - na_in_so = is.na (state_order$movavg) + na_in_so = is.na (state_order$mov_avg) if ( any (na_in_so) ) { miic_warning ("parameters", "the moving average is both supplied", " in the state_order and as a parameter. 
As some values are missing", " in the state_order, the parameter will be used to fill these", " missing values.") - state_order$movavg[ na_in_so - & (state_order$var_type == 1) - & (state_order$is_contextual == 0)] = movavg + state_order$mov_avg[ na_in_so + & (state_order$var_type == 1) + & (state_order$is_contextual == 0)] = mov_avg } else miic_warning ("parameters", "the moving average is both supplied", @@ -291,7 +291,7 @@ tmiic_check_parameters <- function (state_order, params, #------------------------------------------------------------------------------- # Second part of the check state order for temporal mode. # This function is designed to be called after the check_parameters_temporal -# function has moved (if needed) the n_layers, delta_t and movavg parameters +# function has moved (if needed) the n_layers, delta_t and mov_avg parameters # into the state_order. # This function will try to fill possible missing values and will check/fix # the temporal settings against the var_type and is_contextual information. @@ -498,30 +498,30 @@ tmiic_check_state_order_part2 <- function (state_order) "there must be one variable at least with a delta t > 0.") } # - # Check state order movavg + # Check state order mov_avg # - if ( ! ("movavg" %in% colnames(state_order)) ) + if ( ! 
("mov_avg" %in% colnames(state_order)) ) { - # Add movavg column with 0 for all vars + # Add mov_avg column with 0 for all vars # - state_order$movavg = 0 + state_order$mov_avg = 0 } else { # Replace NA vals by 0 # - na_in_so = is.na (state_order$movavg) + na_in_so = is.na (state_order$mov_avg) if ( any (na_in_so) ) { msg_str = list_to_str (state_order$var_names[na_in_so], n_max=10) miic_warning ("state order", "the missing moving average have been", " set to 0 for variables ", msg_str) - state_order$movavg[na_in_so] = 0 + state_order$mov_avg[na_in_so] = 0 } # # Check/fix invalid values: for discrete vars, no moving average # - wrongs = ( (state_order$movavg != 0) & (state_order$var_type == 0) ) + wrongs = ( (state_order$mov_avg != 0) & (state_order$var_type == 0) ) if ( any (wrongs) ) { msg_str = list_to_str (state_order$var_names[wrongs], n_max=10) @@ -531,12 +531,12 @@ tmiic_check_state_order_part2 <- function (state_order) else miic_warning ("temporal checks", "moving average operations cannot", " be applied on discrete variables (", msg_str, ").") - state_order$movavg[wrongs] = 0 + state_order$mov_avg[wrongs] = 0 } # # Check/fix invalid values: for contextual vars, no moving average # - wrongs = ( (state_order$movavg != 0) & (state_order$is_contextual == 1) ) + wrongs = ( (state_order$mov_avg != 0) & (state_order$is_contextual == 1) ) if ( any (wrongs) ) { msg_str = list_to_str (state_order$var_names[wrongs], n_max=10) @@ -546,14 +546,14 @@ tmiic_check_state_order_part2 <- function (state_order) else miic_warning ("temporal checks", "moving average operations can not", " be applied on contextualvariables (", msg_str, ").") - state_order$movavg[wrongs] = 0 + state_order$mov_avg[wrongs] = 0 } # # Warning if multiple values of moving average excluding discrete and contextual # - uniq_vals = unique (state_order$movavg[ (!is.na (state_order$movavg)) - & (state_order$var_type == 1) - & (state_order$is_contextual == 0) ]) + uniq_vals = unique (state_order$mov_avg[ 
(!is.na (state_order$mov_avg)) + & (state_order$var_type == 1) + & (state_order$is_contextual == 0) ]) if (length (uniq_vals) > 1) { msg_str = list_to_str (uniq_vals) @@ -726,7 +726,7 @@ tmiic_extract_trajectories <- function (input_data, check=T) FUN=function (x) { return (x[1,1] != 1) } ) ) ) if (length (wrong_starts) > 0) miic_warning ("check trajectories", length (wrong_starts), - " trajectorie(s) don't start with 1 as first time step value") + " trajectories don't start with 1 as first time step value") max_nb_ts = max (unlist (lapply (list_ts, FUN=nrow) ) ) if (max_nb_ts == 1) miic_error ("trajectories check", @@ -769,7 +769,7 @@ tmiic_group_trajectories = function (list_ts, drop_timestep=FALSE) } #------------------------------------------------------------------------------- -# tmiic_movavg_onecol +# tmiic_mov_avg_onecol #------------------------------------------------------------------------------- # Utility function to a apply a moving average over a list # params: @@ -778,7 +778,7 @@ tmiic_group_trajectories = function (list_ts, drop_timestep=FALSE) # This moving average is centered, so the first (w-1) %/% 2 and the last # (w-1) - low_shift items will be filled with NA_real_ #------------------------------------------------------------------------------- -tmiic_movavg_onecol = function (x, w) +tmiic_mov_avg_onecol = function (x, w) { low_shift = (w-1) %/% 2 high_shift = (w-1) - low_shift @@ -798,18 +798,18 @@ tmiic_movavg_onecol = function (x, w) } #------------------------------------------------------------------------------- -# tmiic_movavg +# tmiic_mov_avg #------------------------------------------------------------------------------- # Apply moving averages on data # - list_ts: a list of dataframe, each item representing a trajectory. # Each dataframe must contain the time step information in the 1st column # and the variables in the other columns. -# - movavg: the list of moving average to be applied, optional, NULL by defaut. 
-# The length of the movavg list is the number of columns of the dataframes - 1
+# - mov_avg: the list of moving average to be applied, optional, NULL by default.
+# The length of the mov_avg list is the number of columns of the dataframes - 1
 # (because the 1st column in dataframes is the time step).
-# When the movavg item value is >= 2, a moving average using this value as
+# When the mov_avg item value is >= 2, a moving average using this value as
 # window size is applied on the corresponding column:
-# movavg item 1 is applied data column 2, moavg item 2 to data column 3, ...
+# mov_avg item 1 is applied to data column 2, mov_avg item 2 to data column 3, ...
 # - keep_max_data: boolean flag, optional, FALSE by default
 # When FALSE, the rows containing NA introduced by the moving average(s)
 # are deleted, otherwise when TRUE, the rows are kept
@@ -818,34 +818,34 @@ tmiic_movavg_onecol = function (x, w)
 # Returns:
 # - list_ts: the list trajectories with moving averages applied
 #-------------------------------------------------------------------------------
-tmiic_movavg = function (list_ts, movavg=NULL, keep_max_data=F, verbose_level=0)
+tmiic_mov_avg = function (list_ts, mov_avg=NULL, keep_max_data=F, verbose_level=0)
 {
-  if ( is.null (movavg) || all (movavg < 2) )
+  if ( is.null (mov_avg) || all (mov_avg < 2) )
     return (list_ts)
   if (verbose_level >= 1)
     miic_msg ("Applying moving averages...")
-  # Apply movavg on each trajectory and variable of the dataset
+  # Apply mov_avg on each trajectory and variable of the dataset
   #
   n_vars = ncol(list_ts[[1]])-1
   var_names = colnames (list_ts[[1]])[-1]
   for (i in 1:length(list_ts) )
     for (j in 1:n_vars)
-      if (movavg[[j]] >= 2)
+      if (mov_avg[[j]] >= 2)
       {
-        # print (paste0 (j, " => movavg = ", movavg[[j]]))
-        list_ts[[i]][,j+1] = tmiic_movavg_onecol (list_ts[[i]][,j+1], movavg[[j]])
+        # print (paste0 (j, " => mov_avg = ", mov_avg[[j]]))
+        list_ts[[i]][,j+1] = tmiic_mov_avg_onecol (list_ts[[i]][,j+1], mov_avg[[j]])
         if (verbose_level == 2)
           miic_msg ("- ", var_names[[j]], ": moving average of window size ",
-                    movavg[[j]], " applied")
+                    mov_avg[[j]], " applied")
       }
   #
   # Remove starting and ending rows where NAs were introduced
   #
   if (!keep_max_data)
   {
-    movavg_max = max(movavg)
-    low_shift = (movavg_max-1) %/% 2
-    high_shift = (movavg_max-1) - low_shift
+    mov_avg_max = max(mov_avg)
+    low_shift = (mov_avg_max-1) %/% 2
+    high_shift = (mov_avg_max-1) - low_shift
     start_idx = 1
     if (low_shift > 0)
       start_idx = start_idx + low_shift
@@ -1178,7 +1178,7 @@ tmiic_estimate_dynamic <- function (list_ts, state_order, max_nodes=50,
 #' time steps between each layer that are needed to cover the dynamic of a
 #' temporal dataset when reconstructing a temporal causal graph.
 #' Using autocorrelation decay, the function computes the average relaxation
-#' time of the variables and, in regard of a maximum number of nodes, deduces
+#' time of the variables and, based on a maximum number of nodes, deduces
 #' the number of layers and number of time steps between each layer to be used.
 #'
 #' @param input_data [a data frame]
@@ -1210,20 +1210,20 @@ tmiic_estimate_dynamic <- function (list_ts, state_order, max_nodes=50,
 #' variable is to be considered as a contextual variable (1) or not (0).
 #' Contextual variables will be excluded from the temporal dynamic estimation.
 #'
-#' "movavg" (optional) contains an integer value that specifies the size of
+#' "mov_avg" (optional) contains an integer value that specifies the size of
 #' the moving average window to be applied to the variable.
-#' Note that if "movavg" column is present in the \emph{state_order},
+#' Note that if "mov_avg" column is present in the \emph{state_order},
 #' its values will overwrite the function parameter.
 #'
-#' @param movavg [an integer] Optional, NULL by default.\cr
+#' @param mov_avg [an integer] Optional, NULL by default.\cr
 #' When an integer>= 2 is supplied, a moving average operation is applied
 #' to all the non discrete and not contextual variables.
If no \emph{state_order}
 #' is provided, the discrete/continuous variables are deduced from the input
 #' data. If you want to apply a moving average only on specific columns,
-#' consider to use a \emph{movavg} column in the \emph{state_order} parameter.
+#' consider using a \emph{mov_avg} column in the \emph{state_order} parameter.
 #'
 #' @param max_nodes [a positive integer] The maximum number of nodes in the
-#' final temporal causal graph. The more nodes allowed in the temporal
+#' final time-unfolded causal graph. The more nodes allowed in the temporal
 #' causal discovery, the more precise will be the discovery but at the cost
 #' of longer execution time. The default is set to 50 for fast causal
 #' discovery. On recent computers, values up to 200 or 300 nodes are usually
@@ -1241,7 +1241,7 @@ tmiic_estimate_dynamic <- function (list_ts, state_order, max_nodes=50,
 #'
 #' @export
 #-------------------------------------------------------------------------------
-estimateTemporalDynamic <- function (input_data, state_order=NULL, movavg=NULL,
+estimateTemporalDynamic <- function (input_data, state_order=NULL, mov_avg=NULL,
                                      max_nodes=50, verbose_level=1)
 {
   input_data = check_input_data (input_data, "TS")
@@ -1253,13 +1253,13 @@ estimateTemporalDynamic <- function (input_data, state_order=NULL, movavg=NULL,
                                  params = list(),
                                  n_layers = NULL,
                                  delta_t = NULL,
-                                 movavg = movavg,
+                                 mov_avg = mov_avg,
                                  keep_max_data = F,
                                  max_nodes = max_nodes)
   state_order = tmiic_check_state_order_part2 (list_ret$state_order)
   list_ts = tmiic_extract_trajectories (input_data)
-  list_ts = tmiic_movavg (list_ts, state_order$movavg, verbose_level=verbose_level)
+  list_ts = tmiic_mov_avg (list_ts, state_order$mov_avg, verbose_level=verbose_level)
   state_order = tmiic_estimate_dynamic (list_ts, state_order,
                                         max_nodes=max_nodes, verbose_level=verbose_level)
diff --git a/R/tmiic.wrapper.R b/R/tmiic.wrapper.R
index 1ad9026..5ca0b67 100644
--- a/R/tmiic.wrapper.R
+++ b/R/tmiic.wrapper.R
@@ -213,7 +213,7
@@ tmiic_lag_input_data <- function (list_ts, state_order, keep_max_data=FALSE) # Utility function to precompute lags, layers and shifts of nodes in the # lagged network # -# params: tmiic_res [a tmiic object] The object returned by miic's +# params: tmiic_obj [a tmiic object] The object returned by miic's # execution in temporal mode. # # returns: a dataframe with lagged nodes as row name and 3 columns: @@ -221,13 +221,13 @@ tmiic_lag_input_data <- function (list_ts, state_order, keep_max_data=FALSE) # - corresp_nodes: the corresponding non lagged node # - shifts: the shift to apply to find the next lagged node #----------------------------------------------------------------------------- -tmiic_precompute_lags_layers_and_shifts <- function (tmiic_res) +tmiic_precompute_lags_layers_and_shifts <- function (tmiic_obj) { - list_nodes_not_lagged = tmiic_res$state_order$var_names - is_contextual = tmiic_res$state_order$is_contextual + list_nodes_not_lagged = tmiic_obj$state_order$var_names + is_contextual = tmiic_obj$state_order$is_contextual n_nodes_not_lagged = length (list_nodes_not_lagged) - list_n_layers_back <- tmiic_res$state_order$n_layers - 1 - list_delta_t <- tmiic_res$state_order$delta_t + list_n_layers_back <- tmiic_obj$state_order$n_layers - 1 + list_delta_t <- tmiic_obj$state_order$delta_t # # Identify lag and layer of each node # @@ -313,11 +313,11 @@ tmiic_combine_lag <- function (df) # for (idx in 1:nrow(df) ) { - if (df[idx,"infOrt"] == -2) + if (df[idx,"ort_inferred"] == -2) df[idx, c("x","y","lag")] <- c (df[idx,"y"], df[idx,"x"], -as.integer (df[idx,"lag"]) ) - if ( (df[idx,"infOrt"] == 6) & (as.integer (df[idx,"lag"]) != 0) ) + if ( (df[idx,"ort_inferred"] == 6) & (as.integer (df[idx,"lag"]) != 0) ) df[nrow(df)+1, c("x","y","lag")] <- c (df[idx,"y"], df[idx,"x"], -as.integer (df[idx,"lag"]) ) } @@ -392,36 +392,30 @@ tmiic_combine_orient <- function (df, col_name) #----------------------------------------------------------------------------- 
tmiic_combine_probas <- function (df, comb_orient) { - valid_probas <- grepl (';', df$proba, fixed=TRUE) - df <- df[valid_probas,] + df <- df[ (!is.na (df[, "p_y2x"]) ) + & (!is.na (df[, "p_x2y"]) ), , drop=F] if (nrow (df) <= 0) - return (NA) + return ( c(NA_real_, NA_real_) ) # # We set probas like if we have node X <= node Y # for ( idx in 1:nrow(df) ) if (df[idx,"x"] > df[idx,"y"]) - { - proba_split <- strsplit (df[idx, "proba"], ';' )[[1]] - df[idx, "proba"] <- paste (proba_split[[2]], proba_split[[1]], sep=";") - } - # - # Split proba column so we can do maths on it - # - probas_split <- strsplit (df$proba, ';' ) - df_probas <- do.call(rbind, probas_split) - df_probas <- data.frame (x=as.numeric ( as.character (df_probas[,1]) ), - y=as.numeric ( as.character (df_probas[,2]) ) ) + { + temp <- df[idx, "p_y2x"] + df[idx, "p_y2x"] <- df[idx, "p_x2y"] + df[idx, "p_x2y"] <- temp + } # # Depending on the pre-computed combined orientation, keep max/min/avg # if (comb_orient == 6) - return (paste (max(df_probas[,1]), max(df_probas[,2]), sep=";") ) + return (c (max(df$p_y2x), max(df$p_x2y) ) ) if (comb_orient == 2) - return (paste (min(df_probas[,1]), max(df_probas[,2]), sep=";") ) + return (c (min(df$p_y2x), max(df$p_x2y) ) ) if (comb_orient == -2) - return (paste (max(df_probas[,1]), min(df_probas[,2]), sep=";") ) - return (paste (mean(df_probas[,1]), mean(df_probas[,2]), sep=";")) + return (c (max(df$p_y2x), min(df$p_x2y) ) ) + return (c (mean(df$p_y2x), mean(df$p_x2y) ) ) } #----------------------------------------------------------------------------- @@ -436,7 +430,7 @@ tmiic_combine_probas <- function (df, comb_orient) # is reduced to non lagged nodes and filled with NA during the process # # params: -# - tmiic_res: a tmiic object, returned by tmiic +# - tmiic_obj: a tmiic object, returned by tmiic # # - flatten_mode: string, optional, default value "compact". 
# Possible values are "compact", "combine", "unique", "drop": @@ -476,30 +470,30 @@ tmiic_combine_probas <- function (df, comb_orient) # as input where the summary dataframe has been flattened and the adjacency # matrix reduced to the non lagged nodes #----------------------------------------------------------------------------- -tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", +tmiic_flatten_network <- function (tmiic_obj, flatten_mode="compact", keep_edges_on_same_node=TRUE) { # Reduce size of adj_matrix to non lagged nodes # (we don't care about content as it is not used for plotting) # - list_nodes <- tmiic_res$state_order$var_names - tmiic_res$adj_matrix <- matrix(NA, nrow=0, ncol=length (list_nodes)) - colnames(tmiic_res$adj_matrix) <- list_nodes + list_nodes <- tmiic_obj$state_order$var_names + tmiic_obj$adj_matrix <- matrix(NA, nrow=0, ncol=length (list_nodes)) + colnames(tmiic_obj$adj_matrix) <- list_nodes # # Keep only edges found by miic # - df_edges <- tmiic_res$all.edges.summary[tmiic_res$all.edges.summary$type %in% c('P', 'TP', 'FP'), ] + df_edges <- tmiic_obj$summary[tmiic_obj$summary$type %in% c('P', 'TP', 'FP'), ] if (nrow(df_edges) <= 0) { if (flatten_mode != "drop") df_edges$lag = numeric(0) - tmiic_res$all.edges.summary <- df_edges - return (tmiic_res) + tmiic_obj$summary <- df_edges + return (tmiic_obj) } # # Precompute lag and layer of each node # - df_precomputed <- tmiic_precompute_lags_layers_and_shifts (tmiic_res) + df_precomputed <- tmiic_precompute_lags_layers_and_shifts (tmiic_obj) # # First step, perform flatten_mode="compact": # from summary, remove lag info from nodes names and put it into a lag column @@ -521,15 +515,17 @@ tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", else { df_edges [edge_idx, c("x","y","lag")] <- c(node_y, node_x, -lag) - if (abs (one_edge$infOrt) == 2) - df_edges [edge_idx,"infOrt"] <- -one_edge$infOrt - if ( !is.na (one_edge$trueOrt ) ) - if (abs (one_edge$trueOrt ) 
== 2) - df_edges [edge_idx,"trueOrt"] <- -one_edge$trueOrt - if ( !is.na (one_edge$proba ) ) + if (abs (one_edge$ort_inferred) == 2) + df_edges [edge_idx,"ort_inferred"] <- -one_edge$ort_inferred + if ( !is.na (one_edge$ort_ground_truth ) ) + if (abs (one_edge$ort_ground_truth ) == 2) + df_edges [edge_idx,"ort_ground_truth"] <- -one_edge$ort_ground_truth + if ( (!is.na(one_edge$p_y2x)) + && (!is.na(one_edge$p_x2y)) ) { - df_edges [edge_idx, "proba"] = paste0 (rev ( - strsplit (df_edges [edge_idx, "proba"], ";")[[1]]), collapse=";") + temp <- one_edge$p_y2x + df_edges[edge_idx, "p_y2x"] <- one_edge$p_x2y + df_edges[edge_idx, "p_x2y"] <- temp } } } @@ -538,13 +534,13 @@ tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", # Exclude self loops if requested # if (!keep_edges_on_same_node) - df_edges <- df_edges[df_edges$x != df_edges$y, ] + df_edges <- df_edges[df_edges$x != df_edges$y, , drop=F] if (nrow(df_edges) <= 0) { if (flatten_mode == "drop") df_edges$lag <- NULL - tmiic_res$all.edges.summary <- df_edges - return (tmiic_res) + tmiic_obj$summary <- df_edges + return (tmiic_obj) } # # "compact" mode is done @@ -565,14 +561,14 @@ tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", # # Keep one edge per couple of nodes # - df_group <- df_edges[FALSE,] + df_group <- df_edges[FALSE, , drop=F] for ( xy_idx in 1:nrow(df_xy) ) { ref_x <- df_xy[xy_idx,"x"] ref_y <- df_xy[xy_idx,"y"] cond_same_edges = ( ( (df_edges[["x"]] == ref_x) & (df_edges[["y"]] == ref_y) ) | ( (df_edges[["x"]] == ref_y) & (df_edges[["y"]] == ref_x) ) ) - df_same <- df_edges[cond_same_edges,] + df_same <- df_edges[cond_same_edges, , drop=F] if (nrow (df_same) > 1) { @@ -581,10 +577,12 @@ tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", # Combine lag, orient and proba # df_same$new_lag <- tmiic_combine_lag (df_same) - comb_infOrt <- tmiic_combine_orient (df_same, "infOrt") - df_same$proba <- tmiic_combine_probas (df_same, comb_infOrt) - 
df_same$trueOrt <- tmiic_combine_orient (df_same, "trueOrt") - df_same$infOrt <- comb_infOrt + comb_ort_inferred <- tmiic_combine_orient (df_same, "ort_inferred") + tmp_ret <- tmiic_combine_probas (df_same, comb_ort_inferred) + df_same$p_y2x <- tmp_ret[[1]] + df_same$p_x2y <- tmp_ret[[2]] + df_same$ort_ground_truth <- tmiic_combine_orient (df_same, "ort_ground_truth") + df_same$ort_inferred <- comb_ort_inferred # # Orientations and probas have been computed for x <= y, # so force x <= y on all rows @@ -602,7 +600,7 @@ tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", } max_info <- max (df_same[["info_shifted"]]) - df_same <- df_same[ (df_same[["info_shifted"]] == max_info),] + df_same <- df_same[ (df_same[["info_shifted"]] == max_info), , drop=F] } if (nrow(df_same) > 1) { @@ -630,10 +628,10 @@ tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", { # For contextual variable, we clean the lag info # - is_contextual <- tmiic_res$state_order$is_contextual + is_contextual <- tmiic_obj$state_order$is_contextual if (!is.null(is_contextual)) { - list_nodes_not_lagged = tmiic_res$state_order$var_names + list_nodes_not_lagged = tmiic_obj$state_order$var_names for ( edge_idx in 1:nrow(df_edges) ) { one_edge <- df_edges[edge_idx,] @@ -647,8 +645,8 @@ tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", # # returns the tmiic structure where network summary has been flattened # - tmiic_res$all.edges.summary <- df_edges - return (tmiic_res) + tmiic_obj$summary <- df_edges + return (tmiic_obj) } #----------------------------------------------------------------------------- @@ -662,23 +660,23 @@ tmiic_flatten_network <- function (tmiic_res, flatten_mode="compact", # i.e: assuming that we used nlayers=4 and delta_t=1, the edge X_lag0-X_lag1 # will be copied as X_lag1-X_lag2 and X_lag2-X_lag3. 
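The stationarity-based duplication described in the comment above (n_layers=4, delta_t=1: edge X_lag0-X_lag1 copied as X_lag1-X_lag2 and X_lag2-X_lag3) can be sketched as follows. This is a hypothetical Python illustration, not the package's R code; the `repeat_edge_over_history` name is made up, the `X_lagN` naming follows the comment's convention, and uniform `n_layers`/`delta_t` across variables is assumed:

```python
def repeat_edge_over_history(x, y, lag, n_layers, delta_t):
    """Return the lagged node pairs implied by stationarity for an
    edge x_lag{lag} - y_lag0, including the original pair."""
    pairs = []
    max_lag = (n_layers - 1) * delta_t
    # Shift both endpoints back in time by delta_t until the history
    # window (max_lag) is exhausted.
    for shift in range(0, max_lag - lag + 1, delta_t):
        pairs.append((f"{x}_lag{lag + shift}", f"{y}_lag{shift}"))
    return pairs

print(repeat_edge_over_history("X", "X", 1, 4, 1))
# -> [('X_lag1', 'X_lag0'), ('X_lag2', 'X_lag1'), ('X_lag3', 'X_lag2')]
```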
# -# param: tmiic_res, the object returned by tmiic +# param: tmiic_obj, the object returned by tmiic # # returns: a dataframe with edges completed by stationarity #----------------------------------------------------------------------------- -tmiic_repeat_edges_over_history <- function (tmiic_res) +tmiic_repeat_edges_over_history <- function (tmiic_obj) { # Consider only edges found by miic type = "P", "TP", "FP" # - df_edges <- tmiic_res$all.edges.summary[tmiic_res$all.edges.summary$type %in% c('P', 'TP', 'FP'), ] + df_edges <- tmiic_obj$summary[tmiic_obj$summary$type %in% c('P', 'TP', 'FP'), ] if (nrow(df_edges) <= 0) return (df_edges) # # Precompute lag, layer and shift of each node # - df_precomp <- tmiic_precompute_lags_layers_and_shifts (tmiic_res) - list_n_layers_back <- tmiic_res$state_order$n_layers - 1 - list_nodes_not_lagged <- tmiic_res$state_order$var_names + df_precomp <- tmiic_precompute_lags_layers_and_shifts (tmiic_obj) + list_n_layers_back <- tmiic_obj$state_order$n_layers - 1 + list_nodes_not_lagged <- tmiic_obj$state_order$var_names # # Duplicate the edges over all layers of history # diff --git a/R/write.cytoscape.R b/R/write.cytoscape.R index 100e69f..876424a 100755 --- a/R/write.cytoscape.R +++ b/R/write.cytoscape.R @@ -11,29 +11,30 @@ fromStringToNumberArrowType <- function(val) { #' GraphML converting function for miic graph #' #' @description Convert miic graph to [GraphML format](http://graphml.graphdrawing.org/). -#' @param g The graph object returned by [miic][miic()]. +#' @param miic_obj A miic object. The object returned by the \code{\link{miic}} execution. #' @param file A string. Path to the output file containing file name without #' extension (.graphml will be appended). #' @param layout An optional data frame of 2 (or 3) columns containing the #' coordinate `x` and `y` for each node. The optional first column can contain #' node names. 
If node names is not given, the order of the input file will be #' assigned to the list of positions. +#' @return None #' @export #' @useDynLib miic #' @md -miic.write.network.cytoscape <- function(g, file, layout = NULL) { +writeCytoscapeNetwork <- function(miic_obj, file, layout = NULL) { ##################################### NETWORK IN GRAPHML if (missing(file)) { stop("The file path is necessary") } - if (is.null(g$all.edges.summary)) { + if (is.null(miic_obj$summary)) { stop("The result of the miic execution is required") } - summary <- g$all.edges.summary - adj_matrix <- g$adj_matrix + summary <- miic_obj$summary + adj_matrix <- miic_obj$adj_matrix if (is.null(layout)) { line <- "\n" @@ -143,7 +144,7 @@ miic.write.network.cytoscape <- function(g, file, layout = NULL) { } else { weigth <- (summary[index, "partial_correlation"]) } - if (summary[index, "infOrt"] == 1) { + if (summary[index, "ort_inferred"] == 1) { line <- paste( line, "\t\t", - summary[index, "Nxy_ui"], + summary[index, "n_xy_ai"], "\n", sep = "" ) @@ -495,7 +496,7 @@ miic.write.network.cytoscape <- function(g, file, layout = NULL) { for (index in indexes) { sourceArrowNum <- 0 targetArrowNum <- 0 - if (summary[index, "infOrt"] == 1) { + if (summary[index, "ort_inferred"] == 1) { line <- paste( line, "\t\t\n", sep = "" ) - } else if (summary[index, "infOrt"] == 2) { + } else if (summary[index, "ort_inferred"] == 2) { if (is.na(summary[index, "partial_correlation"])) { value <- "arrow" varchar <- intToUtf8(187) @@ -567,7 +568,7 @@ miic.write.network.cytoscape <- function(g, file, layout = NULL) { ) sourceArrowNum <- 0 targetArrowNum <- fromStringToNumberArrowType(value) - } else if (summary[index, "infOrt"] == -2) { + } else if (summary[index, "ort_inferred"] == -2) { if (is.na(summary[index, "partial_correlation"])) { value <- "arrow" varchar <- intToUtf8(187) @@ -621,7 +622,7 @@ miic.write.network.cytoscape <- function(g, file, layout = NULL) { ) sourceArrowNum <- 0 targetArrowNum <- 
fromStringToNumberArrowType(value) - } else if (summary[index, "infOrt"] == 6) { + } else if (summary[index, "ort_inferred"] == 6) { if (is.na(summary[index, "partial_correlation"])) { value <- "arrow" varchar <- intToUtf8(187) @@ -712,7 +713,7 @@ miic.write.network.cytoscape <- function(g, file, layout = NULL) { line <- paste( line, "\t\t\t\n", sep = "" ) diff --git a/R/write.style.R b/R/write.style.R index f826e8e..1a00890 100755 --- a/R/write.style.R +++ b/R/write.style.R @@ -1,11 +1,14 @@ #' Style writing function for the miic network -#' @description This function writes the miic style for a correct visualization using the cytoscape tool (http://www.cytoscape.org/). +#' @description This function writes the miic style for a correct +#' visualization using the cytoscape tool (http://www.cytoscape.org/). #' @details The style is written in the xml file format. -#' @param file [a string] The file path of the output file (containing the file name without extension). +#' @param file [a string] The file path of the output file (containing the +#' file name without extension). +#' @return None #' @export #' @useDynLib miic -miic.write.style.cytoscape <- function(file) { +writeCytoscapeStyle <- function(file) { if (missing(file)) { cat("The file path is necessary") } else { diff --git a/README.md b/README.md index 8f64461..5405d53 100644 --- a/README.md +++ b/README.md @@ -6,25 +6,67 @@ status](https://github.com/miicTeam/miic_R_package/workflows/R-CMD-check/badge.svg)](https://github.com/miicTeam/miic_R_package/actions) -This repository contains the source code for MIIC (**M**ultivariate **I**nformation based **I**nductive **C**ausation), a method based on constraint-based approaches that learns a large class of causal or non-causal graphical models from purely observational data while including the effects of unobserved latent variables. 
Starting from a complete graph, the method iteratively removes dispensable edges, by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of available data. The remaining edges are then oriented based on the signature of causality in observational data. This approach can be applied on a wide range of datasets and provide new biological insights on regulatory networks from single cell expression data, genomic alterations during tumor development and co-evolving residues in protein structures. Since the version 2.0, MIIC can in addition process -stationary time series to unveil temporal causal graphs. +This repository contains the source code for MIIC (**M**ultivariate +**I**nformation-based **I**nductive **C**ausation), a causal discovery method, +based on information theory principles, which learns a large class of causal +or non-causal graphical models from purely observational data, +while including the effects of unobserved latent variables. +Starting from a complete graph, the method iteratively removes dispensable +edges, by uncovering significant information contributions from indirect paths, +and assesses edge-specific confidences from randomization of available data. +The remaining edges are then oriented based on the signature of causality +in observational data. The recent, more interpretable MIIC extension (iMIIC) +further distinguishes genuine causes from putative and latent causal effects, +while scaling to very large datasets (hundreds of thousands of samples). +Since version 2.0, MIIC also includes a temporal mode (tMIIC) +to learn temporal causal graphs from stationary time series data. +MIIC has been applied to a wide range of biological and biomedical data, +such as single cell gene expression data, genomic alterations in tumors, +live-cell time-lapse imaging data (CausalXtract), +as well as medical records of patients. 
+MIIC brings unique insights based on causal interpretation and could be used +in a broad range of other data science domains (technology, climatology, +economy, ...). ## References -Simon F., Comes M. C., Tocci T., Dupuis L., Cabeli V., Lagrange N., Mencattini A., Parrini M. C., Martinelli E., Isambert H.; [CausalXtract: a flexible pipeline to extract causal effects from live-cell time-lapse imaging data; eLife, reviewed preprint](https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1.abstract). - -Ribeiro-Dantas M. D. C., Li H., Cabeli V., Dupuis L., Simon F., Hettal L., Hamy A. S., Isambert H.; [Learning interpretable causal networks from very large datasets, application to 400,000 medical records of breast cancer patients; iScience, 2024](https://arxiv.org/abs/2303.06423). - -Cabeli V., Li H., Ribeiro-Dantas M., Simon F., Isambert H.; [Reliable causal discovery based on mutual information supremum principle for finite dataset; Why21 at NeurIPS, 2021](https://why21.causalai.net/papers/WHY21_24.pdf). - -Cabeli V., Verny L., Sella N., Uguzzoni G., Verny M., Isambert H.; Learning clinical networks from medical records based on information estimates in mixed-type data; PLoS computational biology., 2020. [doi:10.1371/journal.pcbi.1007866](https://doi.org/10.1371/journal.pcbi.1007866) | [code](https://github.com/vcabeli/miic_PLoS) - -Li H., Cabeli V., Sella N., Isambert H.; [Constraint-based causal structure learning with consistent separating sets; In Advances in Neural Information Processing Systems 2019.](https://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets) | [code](https://github.com/honghaoli42/consistent_pcalg) - -Verny L., Sella N., Affeldt S., Singh PP., Isambert H.; Learning causal networks with latent variables from multivariate information in genomic data; PLoS Comput. Biol., 2017. [doi:10.1371/journal.pcbi.1005662](https://doi.org/10.1371/journal.pcbi.1005662) +Simon F., Comes M. 
C., Tocci T., Dupuis L., Cabeli V., Lagrange N., +Mencattini A., Parrini M. C., Martinelli E., Isambert H., +[CausalXtract: a flexible pipeline to extract causal effects from live-cell +time-lapse imaging data, eLife 2024](https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1.abstract). + +Ribeiro-Dantas M. D. C., Li H., Cabeli V., Dupuis L., Simon F., Hettal L., +Hamy A. S., Isambert H., +[Learning interpretable causal networks from very large datasets, application +to 400,000 medical records of breast cancer patients, iScience, 2024](https://doi.org/10.1016/j.isci.2024.109736). + +Cabeli V., Li H., Ribeiro-Dantas M., Simon F., Isambert H., +[Reliable causal discovery based on mutual information supremum principle for +finite dataset, Why21 at NeurIPS 2021](https://why21.causalai.net/papers/WHY21_24.pdf). + +Cabeli V., Verny L., Sella N., Uguzzoni G., Verny M., Isambert H., +[Learning clinical networks from medical records based on information estimates +in mixed-type data, PLoS Comput. Biol. 2020](https://doi.org/10.1371/journal.pcbi.1007866) +| [code](https://github.com/vcabeli/miic_PLoS) + +Li H., Cabeli V., Sella N., Isambert H., +[Constraint-based causal structure learning with consistent separating sets, +In Advances in Neural Information Processing Systems 2019.](https://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets) +| [code](https://github.com/honghaoli42/consistent_pcalg). + +Verny L., Sella N., Affeldt S., Singh PP., Isambert H., +[Learning causal networks with latent variables from multivariate information +in genomic data, PLoS Comput. Biol. 2017](https://doi.org/10.1371/journal.pcbi.1005662). + +Affeldt S., Isambert H., +[Robust Reconstruction of Causal Graphical Models based on Conditional 2-point +and 3-point Information, UAI 2015](https://auai.org/uai2015/proceedings/papers/293.pdf) +| [supp](https://auai.org/uai2015/proceedings/supp/293_supp.pdf). 
## Prerequisites + MIIC contains R and C++ sources. -- To compile from source, a compiler with support for c++14 language features is required. + +- To compile from source, a compiler with support for C++14 language features is required. - MIIC imports the following R packages: ppcor, scales, stats, Rcpp ## Installation @@ -41,29 +83,37 @@ remotes::install_github("miicTeam/miic_R_package") ## Quick start -MIIC allows you to create a graph object from a dataset of observations of both discrete and continuous variables, potentially with missing values and taking into account unobserved latent variables. -You can find this example along others by calling the documentation of the main function `?miic` from R. +MIIC allows you to create a graph object from a dataset of observations +of both discrete and continuous variables, potentially with missing values +and taking into account unobserved latent variables. +You can find this example along with others by calling the documentation +of the main function `?miic` from R. + ```R library(miic) # EXAMPLE HEMATOPOIESIS data(hematoData) # execute MIIC (reconstruct graph) -miic.res <- miic( +miic_obj <- miic( input_data = hematoData, latent = "yes", n_shuffles = 10, conf_threshold = 0.001 ) # plot graph with igraph if(require(igraph)) { - plot(miic.res, method="igraph") + plot(miic_obj, method="igraph") } ``` ## Documentation -You can find the documentation pages in the "man" folder, in the auto generated [PDF](https://cran.r-project.org/web/packages/miic/miic.pdf), or use R functions `help()` and `?`. +You can find the documentation pages in the "man" folder, in the auto-generated +[PDF](https://CRAN.R-project.org/package=miic/miic.pdf), +or use R functions `help()` and `?`. 
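This release also renames the main outputs throughout the package (`all.edges.summary` becomes `miic_obj$summary`, `infOrt` becomes `ort_inferred`). As a quick illustration of how downstream code adapts, here is a toy sketch with an invented data frame that only mimics the renamed columns; the orientation codes (1 undirected, 2 arrowhead at `y`, -2 arrowhead at `x`, 6 bidirectional) follow our reading of the values handled in `write.cytoscape.R` and are not an official reference:

```R
# Toy sketch only: a hand-made data frame mimicking the renamed summary
# columns of a miic object; all values are invented for illustration.
toy_summary <- data.frame(
  x            = c("g1", "g2", "g3"),
  y            = c("g2", "g3", "g1"),
  type         = c("P", "P", "N"),   # "P": edge retained by miic
  ort_inferred = c(2, 1, -2),        # assumed codes: 2 means x -> y
  stringsAsFactors = FALSE
)
# keep only retained edges oriented from x to y
oriented <- toy_summary[toy_summary$type == "P" & toy_summary$ort_inferred == 2, ]
oriented
```

With a real result object, the same filter reads `miic_obj$summary[miic_obj$summary$ort_inferred == 2, ]`.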
+ ## Authors + - Tiziana Tocci - Nikita Lagrange - Orianne Debeaupuis @@ -78,4 +128,5 @@ You can find the documentation pages in the "man" folder, in the auto generated - Hervé Isambert ## License + GPL-2 | GPL-3 diff --git a/data/datalist b/data/datalist index 8a94a1e..7ae2dac 100755 --- a/data/datalist +++ b/data/datalist @@ -1,4 +1,3 @@ hematoData cosmicCancer -ohno covidCases diff --git a/data/ohno.rda b/data/ohno.rda deleted file mode 100755 index 02e71cd..0000000 Binary files a/data/ohno.rda and /dev/null differ diff --git a/data/ohno_stateOrder.rda b/data/ohno_stateOrder.rda deleted file mode 100755 index 26b683c..0000000 Binary files a/data/ohno_stateOrder.rda and /dev/null differ diff --git a/man/computeMutualInfo.Rd b/man/computeMutualInfo.Rd index 0a81298..5e01e8c 100644 --- a/man/computeMutualInfo.Rd +++ b/man/computeMutualInfo.Rd @@ -5,11 +5,11 @@ \title{Compute (conditional) mutual information} \usage{ computeMutualInfo( - X, - Y, + x, + y, df_conditioning = NULL, maxbins = NULL, - cplx = c("nml", "mdl"), + cplx = c("nml", "bic"), n_eff = -1, sample_weights = NULL, is_continuous = NULL, @@ -17,11 +17,11 @@ computeMutualInfo( ) } \arguments{ -\item{X}{[a vector] -A vector that contains the observational data of the first variable.} +\item{x}{[a vector] +The \eqn{X} vector that contains the observational data of the first variable.} -\item{Y}{[a vector] -A vector that contains the observational data of the second variable.} +\item{y}{[a vector] +The \eqn{Y} vector that contains the observational data of the second variable.} \item{df_conditioning}{[a data frame] The data frame of the observations of the conditioning variables.} @@ -34,19 +34,19 @@ faster, a larger number allows finer discretization.} \item{cplx}{[a string] The complexity model: \itemize{ -\item["mdl"] Minimum description Length -\item["nml"] Normalized Maximum Likelihood, less costly compared to "mdl" in -the finite sample case and will allow for more bins. 
+\item["bic"] Bayesian Information Criterion +\item["nml"] Normalized Maximum Likelihood, more accurate complexity cost +compared to BIC, especially on small sample size. }} \item{n_eff}{[an integer] -The number of effective samples. When there is significant autocorrelation in -the samples you may want to specify a number of effective samples that is -lower than the number of points in the distribution.} +The effective number of samples. When there is significant autocorrelation +between successive samples, you may want to specify an effective number of +samples that is lower than the total number of samples.} \item{sample_weights}{[a vector of floats] Individual weights for each sample, used for the same reason as the effective -sample number but with individual precision.} +number of samples but with individual weights.} \item{is_continuous}{[a vector of booleans] Specify if each variable is to be treated as continuous (TRUE) or discrete @@ -55,8 +55,8 @@ Specify if each variable is to be treated as continuous (TRUE) or discrete considered as discrete, and numerical vectors as continuous.} \item{plot}{[a boolean] -Specify whether the XY joint space with discretization scheme is to be -plotted (requires `ggplot2` and `gridExtra`).} +Specify whether the resulting XY optimum discretization is to be plotted +(requires `ggplot2` and `gridExtra`).} } \value{ A list that contains : @@ -65,11 +65,12 @@ A list that contains : the cutpoints for the partitioning of \eqn{X}. \item cutpoints2: Only when \eqn{Y} is continuous, a vector containing the cutpoints for the partitioning of \eqn{Y}. -\item niterations: Only when at least one of the input variables is +\item n_iterations: Only when at least one of the input variables is continuous, the number of iterations it takes to reach the convergence of the estimated information. -\item iterationN: Only when at least one of the input variables is - continuous, the list of vectors of cutpoints of each iteration. 
+\item iteration1, iteration2, ... Only when at least one of the input + variables is continuous, the list of vectors of cutpoints of each + iteration. \item info: The estimation of (conditional) mutual information without the complexity cost. \item infok: The estimation of (conditional) mutual information with the @@ -78,33 +79,32 @@ complexity cost (\eqn{Ik = I - cplx}). } } \description{ -For discrete variables, the computation is based on the -empirical frequency minus a complexity cost (computed as BIC or with the -Normalized Maximum Likelihood). When continuous variables are present, each -continuous variable is discretized where the partitioning is chosen by -maximizing the mutual information minus the complexity cost. The estimation -based on the optimally discretized distributions effectively approaches the -mutual information computed on the original continuous variables. +For discrete or categorical variables, the (conditional) +mutual information is computed using the empirical frequencies minus a +complexity cost (computed as BIC or with the Normalized Maximum Likelihood). +When continuous variables are present, each continuous variable is +discretized for each mutual information estimate so as to maximize the +mutual information minus the complexity cost (see Cabeli 2020). } \details{ For a pair of continuous variables \eqn{X} and \eqn{Y}, the mutual information \eqn{I(X;Y)} will be computed iteratively. In each iteration, the -algorithm optimizes first the partitioning of \eqn{X} and then that of -\eqn{Y}, while maximizing +algorithm optimizes the partitioning of \eqn{X} and then of \eqn{Y}, +in order to maximize \deqn{Ik(X_{d};Y_{d}) = I(X_{d};Y_{d}) - cplx(X_{d};Y_{d})} -where \eqn{cplx(X_{d}; Y_{d})} is the complexity cost of the current -partitioning (see Affeldt 2016 and Cabeli 2020). 
Upon convergence, the -information terms \eqn{I(X_{d};Y_{d})} and \eqn{Ik(X_{d};Y_{d})}, as well as -the partitioning of \eqn{X_{d}} and \eqn{Y_{d}} in terms of cutpoints, are -returned. +where \eqn{cplx(X_{d}; Y_{d})} is the complexity cost of the corresponding +partitioning (see Cabeli 2020). +Upon convergence, the information terms \eqn{I(X_{d};Y_{d})} +and \eqn{Ik(X_{d};Y_{d})}, as well as the partitioning of \eqn{X_{d}} +and \eqn{Y_{d}} in terms of cutpoints, are returned. -For conditional mutual information with conditioning set \eqn{U}, the +For conditional mutual information with a conditioning set \eqn{U}, the computation is done based on \deqn{ Ik(X;Y|U) = 0.5*(Ik(X_{d};Y_{d},U_{d}) - Ik(X_{d};U_{d}) + Ik(Y_{d};X_{d},U_{d}) - Ik(Y_{d};U_{d})), } -where each of the four summands is estimated independently. +where each of the four summands is estimated separately. } \examples{ library(miic) @@ -138,14 +138,10 @@ message("I(X;Y) = ", res$info) res <- computeMutualInfo(X, Y, df_conditioning = matrix(Z, ncol = 1), plot = TRUE) message("I(X;Y|Z) = ", res$info) } - } \references{ \itemize{ -\item Verny et al., \emph{PLoS Comp. Bio. 2017.} - https://doi.org/10.1371/journal.pcbi.1005662 -\item Cabeli et al., \emph{PLoS Comp. Bio. 2020.} - https://doi.org/10.1371/journal.pcbi.1007866 -\item Affeldt et al., \emph{Bioinformatics 2016} +\item Cabeli \emph{et al.}, PLoS Comput. Biol. 
2020, \href{https://doi.org/10.1371/journal.pcbi.1007866}{Learning clinical networks from medical records based on information estimates in mixed-type data} +\item Affeldt \emph{et al.}, UAI 2015, \href{https://auai.org/uai2015/proceedings/papers/293.pdf}{Robust Reconstruction of Causal Graphical Models based on Conditional 2-point and 3-point Information} } } diff --git a/man/computeThreePointInfo.Rd b/man/computeThreePointInfo.Rd index 62fd0b6..fb67836 100644 --- a/man/computeThreePointInfo.Rd +++ b/man/computeThreePointInfo.Rd @@ -5,26 +5,26 @@ \title{Compute (conditional) three-point information} \usage{ computeThreePointInfo( - X, - Y, - Z, + x, + y, + z, df_conditioning = NULL, maxbins = NULL, - cplx = c("nml", "mdl"), + cplx = c("nml", "bic"), n_eff = -1, sample_weights = NULL, is_continuous = NULL ) } \arguments{ -\item{X}{[a vector] -A vector that contains the observational data of the first variable.} +\item{x}{[a vector] +The \eqn{X} vector that contains the observational data of the first variable.} -\item{Y}{[a vector] -A vector that contains the observational data of the second variable.} +\item{y}{[a vector] +The \eqn{Y} vector that contains the observational data of the second variable.} -\item{Z}{[a vector] -A vector that contains the observational data of the third variable.} +\item{z}{[a vector] +The \eqn{Z} vector that contains the observational data of the third variable.} \item{df_conditioning}{[a data frame] The data frame of the observations of the set of conditioning variables @@ -38,52 +38,52 @@ faster, a larger number allows finer discretization.} \item{cplx}{[a string] The complexity model: \itemize{ -\item["mdl"] Minimum description Length -\item["nml"] Normalized Maximum Likelihood, less costly compared to "mdl" in -the finite sample case and will allow for more bins. +\item["bic"] Bayesian Information Criterion +\item["nml"] Normalized Maximum Likelihood, more accurate complexity cost +compared to BIC, especially on small sample size. 
}} \item{n_eff}{[an integer] -The number of effective samples. When there is significant autocorrelation in -the samples you may want to specify a number of effective samples that is -lower than the number of points in the distribution.} +The effective number of samples. When there is significant autocorrelation +between successive samples, you may want to specify an effective number of +samples that is lower than the total number of samples.} \item{sample_weights}{[a vector of floats] Individual weights for each sample, used for the same reason as the effective -sample number but with individual precision.} +number of samples but with individual weights.} \item{is_continuous}{[a vector of booleans] Specify if each variable is to be treated as continuous (TRUE) or discrete -(FALSE), must be of length `ncol(df_conditioning) + 2`, in the order -\eqn{X, Y, U1, U2, ...}. If not specified, factors and character vectors are -considered as discrete, and numerical vectors as continuous.} +(FALSE), must be of length `ncol(df_conditioning) + 3`, in the order +\eqn{X, Y, Z, U1, U2, ...}. If not specified, factors and character vectors +are considered as discrete, and numerical vectors as continuous.} } \value{ A list that contains : \itemize{ -\item I3: The estimation of (conditional) three-point information without the +\item i3: The estimation of (conditional) three-point information without the complexity cost. -\item I3k: The estimation of (conditional) three-point information with the -complexity cost (\eqn{I3k = I3 - cplx}). -\item I2: For reference, the estimation of (conditional) mutual information -\eqn{I(X;Y|U)} used in the estimation of \eqn{I3}. -\item I2k: For reference, the estimation of regularized (conditional) mutual -information \eqn{Ik(X;Y|U)} used in the estimation of \eqn{I3k}. +\item i3k: The estimation of (conditional) three-point information with the +complexity cost (\emph{i3k = i3 - cplx}). 
+\item i2: For reference, the estimation of (conditional) mutual information +\eqn{I(X;Y|U)} used in the estimation of \emph{i3}. +\item i2k: For reference, the estimation of regularized (conditional) mutual +information \eqn{Ik(X;Y|U)} used in the estimation of \emph{i3k}. } } \description{ -Three point information is defined based on mutual information. -For discrete variables, the computation is based on the -empirical frequency minus a complexity cost (computed as BIC or with the -Normalized Maximum Likelihood). When continuous variables are present, each -continuous variable is discretized where the partitioning is chosen by -maximizing the mutual information minus the complexity cost. +Three-point information is defined and computed as the +difference of mutual information and conditional mutual information, e.g. +\deqn{I(X;Y;Z|U) = I(X;Y|U) - I(X;Y|U,Z)} +For discrete or categorical variables, the three-point information is +computed with the empirical frequencies minus a complexity cost +(computed as BIC or with the Normalized Maximum Likelihood). } \details{ For variables \eqn{X}, \eqn{Y}, \eqn{Z} and a set of conditioning variables \eqn{U}, the conditional three point information is defined as -\deqn{Ik(X;Y;Z|U) = Ik(X;Y|U) - Ik(X;Y|U,Z)}, where \eqn{Ik} is the -regularized conditional mutual information. +\deqn{Ik(X;Y;Z|U) = Ik(X;Y|U) - Ik(X;Y|U,Z)} +where \eqn{Ik} is the shifted or regularized conditional mutual information. See \code{\link{computeMutualInfo}} for the definition of \eqn{Ik}. 
} \examples{ @@ -94,8 +94,8 @@ Z <- runif(N) X <- Z * 2 + rnorm(N, sd = 0.2) Y <- Z * 2 + rnorm(N, sd = 0.2) res <- computeThreePointInfo(X, Y, Z) -message("I(X;Y;Z) = ", res$I3) -message("Ik(X;Y;Z) = ", res$I3k) +message("I(X;Y;Z) = ", res$i3) +message("Ik(X;Y;Z) = ", res$i3k) \donttest{ # Independence, conditional dependence : X -> Z <- Y @@ -103,17 +103,13 @@ X <- runif(N) Y <- runif(N) Z <- X + Y + rnorm(N, sd = 0.1) res <- computeThreePointInfo(X, Y, Z) -message("I(X;Y;Z) = ", res$I3) -message("Ik(X;Y;Z) = ", res$I3k) +message("I(X;Y;Z) = ", res$i3) +message("Ik(X;Y;Z) = ", res$i3k) } - } \references{ \itemize{ -\item Verny et al., \emph{PLoS Comp. Bio. 2017.} - https://doi.org/10.1371/journal.pcbi.1005662 -\item Cabeli et al., \emph{PLoS Comp. Bio. 2020.} - https://doi.org/10.1371/journal.pcbi.1007866 -\item Affeldt et al., \emph{Bioinformatics 2016} +\item Cabeli \emph{et al.}, PLoS Comput. Biol. 2020, \href{https://doi.org/10.1371/journal.pcbi.1007866}{Learning clinical networks from medical records based on information estimates in mixed-type data} +\item Affeldt \emph{et al.}, UAI 2015, \href{https://auai.org/uai2015/proceedings/papers/293.pdf}{Robust Reconstruction of Causal Graphical Models based on Conditional 2-point and 3-point Information} } } diff --git a/man/discretizeMutual.Rd b/man/discretizeMutual.Rd index 871f5e9..f13d247 100644 --- a/man/discretizeMutual.Rd +++ b/man/discretizeMutual.Rd @@ -2,26 +2,27 @@ % Please edit documentation in R/discretizeMutual.R \name{discretizeMutual} \alias{discretizeMutual} -\title{Iterative dynamic programming for (conditional) mutual information through optimized discretization.} +\title{Iterative dynamic programming for (conditional) mutual information through +optimized discretization.} \usage{ discretizeMutual( - X, - Y, + x, + y, matrix_u = NULL, maxbins = NULL, cplx = "nml", n_eff = NULL, sample_weights = NULL, - is_discrete = NULL, + is_continuous = NULL, plot = TRUE ) } \arguments{ -\item{X}{[a vector] 
-A vector that contains the observational data of the first variable.} +\item{x}{[a vector] +The \eqn{X} vector that contains the observational data of the first variable.} -\item{Y}{[a vector] -A vector that contains the observational data of the second variable.} +\item{y}{[a vector] +The \eqn{Y} vector that contains the observational data of the second variable.} \item{matrix_u}{[a numeric matrix] The matrix with the observations of as many columns as conditioning variables.} @@ -31,53 +32,67 @@ The maximum number of bins desired in the discretization. A lower number makes the computation faster, a larger number allows finer discretization (by default : 5 * cubic root of N).} \item{cplx}{[a string] -The complexity used in the dynamic programming. Either "mdl" for Minimum description Length or -"nml" for Normalized Maximum Likelihood, which is less costly in the finite sample case and -will allow more bins than mdl.} +The complexity used in the dynamic programming: +\itemize{ +\item["bic"] Bayesian Information Criterion +\item["nml"] Normalized Maximum Likelihood, more accurate complexity cost +compared to BIC, especially on small sample size. +}} -\item{n_eff}{[an int] -The number of effective samples. When there is significant autocorrelation in the samples you may -want to specify a number of effective samples that is lower than the number of points in the distribution.} +\item{n_eff}{[an integer] +The effective number of samples. 
When there is significant autocorrelation +between successive samples, you may want to specify an effective number of +samples that is lower than the total number of samples.} \item{sample_weights}{[a vector of floats] -Individual weights for each sample, used for the same reason as the effective sample number but with individual -precision.} +Individual weights for each sample, used for the same reason as the effective +number of samples but with individual weights.} -\item{is_discrete}{[a vector of booleans] -Specify if each variable is to be treated as discrete (TRUE) or continuous (FALSE) in a -logical vector of length ncol(matrix_u) + 2, in the order [X, Y, U1, U2...]. By default, -factors and character vectors are treated as discrete, and numerical vectors as continuous.} +\item{is_continuous}{[a vector of booleans] +Specify if each variable is to be treated as continuous (TRUE) +or discrete (FALSE) in a logical vector of length ncol(matrix_u) + 2, +in the order [X, Y, U1, U2...]. 
By default, factors and character vectors +are treated as discrete, and numerical vectors as continuous.} \item{plot}{[a boolean] -Specify if the XY joint space with discretization scheme is to be plotted or not (requires -ggplot2 and gridExtra).} +Specify whether the resulting XY optimum discretization is to be plotted +(requires `ggplot2` and `gridExtra`).} } \value{ A list that contains : \itemize{ \item{two vectors containing the cutpoints for each variable : -\emph{cutpoints1} corresponds to \emph{X}, -\emph{cutpoints2} corresponds to \emph{Y}.} -\item{\emph{niterations} is the number of iterations performed before convergence of the (C)MI estimation.} -\item{\emph{iterationN}, lists contatining the cutpoint vectors for each iteration.} -\item{\emph{info} and \emph{infok}, the estimated (C)MI value and (C)MI minus the complexity cost.} +\emph{cutpoints1} corresponds to \emph{x}, +\emph{cutpoints2} corresponds to \emph{y}.} +\item{\emph{n_iterations} is the number of iterations performed before +convergence of the (C)MI estimation.} +\item{\emph{iteration1, iteration2, ...}, lists containing +the cutpoint vectors for each iteration.} +\item{\emph{info} and \emph{infok}, the estimated (C)MI value +and (C)MI minus the complexity cost.} \item{if \emph{plot} == TRUE, a plot object (requires ggplot2 and gridExtra).} } } \description{ -This function chooses cutpoints in the input distributions by maximizing the mutual -information minus a complexity cost (computed as BIC or with the Normalized Maximum Likelihood). The -(conditional) mutual information computed on the optimized discretized distributions effectively approaches -the mutual information computed on the original continuous variables. +This function chooses cutpoints in the input distributions by +maximizing the mutual information minus a complexity cost +(computed as BIC or with the Normalized Maximum Likelihood). 
+The (conditional) mutual information computed on the optimized discretized +distributions effectively estimates the mutual information of the original +continuous variables. } \details{ -For a pair of variables \eqn{X} and \eqn{Y}, the algorithm will in turn choose cutpoints on \eqn{X} -then on \eqn{Y}, maximizing \eqn{I(X_{d};Y_{d}) - cplx(X_{d};Y_{d})} where \eqn{cplx(X_{d};Y_{d})} is the -complexity cost of the considered discretizations of \eqn{X} and \eqn{Y} (see Affeldt 2016 and Cabeli 2020). -When the value \eqn{I(X_{d};Y_{d})} is stable between two iterations the discretization scheme of -\eqn{X_{d}} and \eqn{Y_{d}} is returned as well as \eqn{I(X_{d};Y_{d})} and \eqn{I(X_{d};Y_{d})-cplx(X_{d};Y_{d})}. +For a pair of continuous variables \eqn{X} and \eqn{Y}, +the algorithm will iteratively choose cutpoints on \eqn{X} then on \eqn{Y}, +maximizing \eqn{I(X_{d};Y_{d}) - cplx(X_{d};Y_{d})} where +\eqn{cplx(X_{d};Y_{d})} is the complexity cost of the considered +discretizations of \eqn{X} and \eqn{Y} (see Cabeli 2020). +Upon convergence, the discretization scheme of \eqn{X_{d}} and \eqn{Y_{d}} +is returned as well as \eqn{I(X_{d};Y_{d})} +and \eqn{I(X_{d};Y_{d})-cplx(X_{d};Y_{d})}. -With a set of conditioning variables \eqn{U}, the discretization scheme maximizes each term of the sum +With a set of conditioning variables \eqn{U}, the discretization scheme +maximizes each term of the sum \eqn{I(X;Y|U) \sim 0.5*(I(X_{d};Y_{d}, U_{d}) - I(X_{d};U_{d}) + I(Y_{d};X_{d}, U_{d}) - I(Y_{d};U_{d}))}. Discrete variables can be passed as factors and will be used "as is" to maximize each term. 
@@ -101,7 +116,7 @@ X <- -as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N) Y <- as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N) res <- miic::discretizeMutual(X, Y, cplx = "nml") message("I(X;Y) = ", res$info) -res <- miic::discretizeMutual(X, Y, matrix(Z, ncol = 1), is_discrete = c(FALSE, FALSE, TRUE)) +res <- miic::discretizeMutual(X, Y, matrix(Z, ncol = 1), is_continuous = c(TRUE, TRUE, FALSE)) message("I(X;Y|Z) = ", res$info) @@ -114,12 +129,9 @@ message("I(X;Y) = ", res$info) res <- discretizeMutual(X, Y, matrix_u = matrix(Z, ncol = 1), plot = TRUE) message("I(X;Y|Z) = ", res$info) } - } \references{ \itemize{ -\item Verny et al., \emph{PLoS Comp. Bio. 2017.} https://doi.org/10.1371/journal.pcbi.1005662 -\item Cabeli et al., \emph{PLoS Comp. Bio. 2020.} https://doi.org/10.1371/journal.pcbi.1007866 -\item Affeldt et al., \emph{Bioinformatics 2016} +\item Cabeli \emph{et al.}, PLoS Comput. Biol. 2020, \href{https://doi.org/10.1371/journal.pcbi.1007866}{Learning clinical networks from medical records based on information estimates in mixed-type data} } } diff --git a/man/estimateTemporalDynamic.Rd b/man/estimateTemporalDynamic.Rd index e6d6b52..7719590 100644 --- a/man/estimateTemporalDynamic.Rd +++ b/man/estimateTemporalDynamic.Rd @@ -7,7 +7,7 @@ estimateTemporalDynamic( input_data, state_order = NULL, - movavg = NULL, + mov_avg = NULL, max_nodes = 50, verbose_level = 1 ) @@ -42,20 +42,20 @@ Discrete variables will be excluded from the temporal dynamic estimation. variable is to be considered as a contextual variable (1) or not (0). Contextual variables will be excluded from the temporal dynamic estimation. -"movavg" (optional) contains an integer value that specifies the size of +"mov_avg" (optional) contains an integer value that specifies the size of the moving average window to be applied to the variable. 
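For intuition, the kind of smoothing controlled by a "mov_avg" value can be sketched in plain R. This is purely illustrative: the helper name and the exact border handling are assumptions, not miic's internal implementation.

```r
# Illustrative sketch of a moving average of window size k, as controlled
# by the "mov_avg" setting (NOT miic's internal implementation).
mov_avg_sketch <- function(x, k) {
  # centered moving average; border positions are left as NA
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

x <- c(1, 2, 4, 8, 16)
mov_avg_sketch(x, 3)  # interior values are means of 3 consecutive points
```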
-Note that if "movavg" column is present in the \emph{state_order}, +Note that if a "mov_avg" column is present in the \emph{state_order}, its values will overwrite the function parameter.} -\item{movavg}{[an integer] Optional, NULL by default.\cr +\item{mov_avg}{[an integer] Optional, NULL by default.\cr When an integer>= 2 is supplied, a moving average operation is applied to all the non discrete and not contextual variables. If no \emph{state_order} is provided, the discrete/continuous variables are deduced from the input data. If you want to apply a moving average only on specific columns, -consider to use a \emph{movavg} column in the \emph{state_order} parameter.} +consider using a \emph{mov_avg} column in the \emph{state_order} parameter.} \item{max_nodes}{[a positive integer] The maximum number of nodes in the -final temporal causal graph. The more nodes allowed in the temporal +final time-unfolded causal graph. The more nodes allowed in the temporal causal discovery, the more precise will be the discovery but at the cost of longer execution time. The default is set to 50 for fast causal discovery. On recent computers, values up to 200 or 300 nodes are usually @@ -77,6 +77,6 @@ This function estimates the number of layers and number of time steps between each layer that are needed to cover the dynamic of a temporal dataset when reconstructing a temporal causal graph. Using autocorrelation decay, the function computes the average relaxation -time of the variables and, in regard of a maximum number of nodes, deduces +time of the variables and, based on a maximum number of nodes, deduces the number of layers and number of time steps between each layer to be used.
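As a rough illustration of the autocorrelation-decay idea (the 1/e cutoff and the helper below are assumptions for the sketch, not the package's actual estimator), a relaxation time can be read off the autocorrelation function as the first lag where it drops below 1/e:

```r
# Illustrative sketch: estimate a relaxation time as the first lag at which
# the sample autocorrelation falls below 1/e (an assumed rule, not miic's).
relaxation_time_sketch <- function(x, max_lag = 50) {
  rho <- stats::acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  which(rho < exp(-1))[1]
}

set.seed(1)
# AR(1) process with coefficient 0.8: autocorrelation decays roughly as 0.8^lag
x <- as.numeric(stats::arima.sim(list(ar = 0.8), n = 2000))
relaxation_time_sketch(x)  # close to log(exp(-1)) / log(0.8), i.e. about 4.5
```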
} diff --git a/man/export.Rd new file mode 100644 index 0000000..58ea75e --- /dev/null +++ b/man/export.Rd @@ -0,0 +1,151 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/miic.plot.R +\name{export} +\alias{export} +\title{Export miic result for plotting (with igraph)} +\usage{ +export( + miic_obj, + method = "igraph", + pcor_palette = NULL, + display = "compact", + show_self_loops = TRUE +) +} +\arguments{ +\item{miic_obj}{[a miic object, required] + +The object returned by the \code{\link{miic}} execution.} + +\item{method}{[a string, optional, default value "igraph"] + +The plotting method, currently only "igraph" is supported.} + +\item{pcor_palette}{[a color palette, optional, default value +grDevices::colorRampPalette(c("blue", "darkgrey", "red"))] + +Used to represent the partial correlations (the color of the edges). +The palette must be able to handle 201 shades to cover the correlation range +from -100 to +100.} + +\item{display}{[a string, optional, default value "compact"] + +Used only when exporting an object returned by miic in temporal mode. +It allows different representations of the temporal graph. +Possible values are \emph{"raw"}, \emph{"lagged"}, \emph{"compact"}, +\emph{"combine"}, \emph{"unique"}, \emph{"drop"}: +\itemize{ +\item When \emph{display} = \emph{"raw"}, the export function will + use the tmiic graph object as is, leading to the return of a lagged + graph. +\item When \emph{display} = \emph{"lagged"}, the export function will + repeat the edges over history assuming stationarity and return a lagged + graph. +\item When \emph{display} = \emph{"compact"}, the default, nodes + and edges are converted into a flattened version to produce a compact + view of the temporal network whilst still presenting all the information + in the export.\cr + e.g. X_lag1->Y_lag0, X_lag2<-Y_lag0 become respectively X->Y lag=1, + X<-Y lag=2.
+\item When \emph{display} = \emph{"combine"}, prior to the export, + a pre-processing will be applied to keep only one edge + per pair of nodes. The info_shifted will be the highest one + of the summarized edges whilst the lag and orientation of the + summarized edge will be an aggregation.\cr + e.g. X_lag2->Y_lag0, X_lag0<-Y_lag1 will become X<->Y lag=1-2 with + the info_shifted of X_lag2->Y_lag0 if info_shifted of + X_lag2->Y_lag0 > X_lag0<-Y_lag1. +\item When \emph{display} = \emph{"unique"}, prior to the export, + a pre-processing will be applied to keep only the edges having the + highest info_shifted for a pair of nodes. + If several edges between the same nodes have the same + info_shifted, then the edge kept is the one with the minimum lag.\cr + e.g. X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of + X_lag1->Y_lag0 > X_lag0<-Y_lag2 become X->Y lag=1. +\item When \emph{display} = \emph{"drop"}, the same pre-processing + as \emph{"unique"} will be applied, then the lag information will be + dropped before the export. +}} + +\item{show_self_loops}{[a boolean, optional, TRUE by default] + +Used only when exporting an object returned by miic in temporal mode. +When TRUE, the lagged edges starting and ending on the same node +are included in the igraph object. +When FALSE, only edges having different nodes are present in the igraph +object.} +} +\value{ +A graph object adapted to the method. +} +\description{ +This function creates an object built from the result returned +by \code{\link{miic}} that is ready to be fed to the plotting method. +} +\details{ +The behavior depends on the method used for the export. + +For igraph, edge attributes are passed to the igraph graph +and can be accessed with e.g. \code{E(g)$partial_correlation}. +See \code{\link{miic}} for more details on edge parameters.
+By default, edges are colored according to the partial correlation +between two nodes conditioned on the conditioning set +(negative is blue, null is gray and positive is red) +and their width is based on the conditional mutual information +minus the complexity cost. +} +\examples{ +\donttest{ +library(miic) +data(hematoData) + +# execute MIIC (reconstruct graph) +miic_obj <- miic( + input_data = hematoData, latent = "yes", + n_shuffles = 10, conf_threshold = 0.001 +) + +# Using igraph +if(require(igraph)) { +g = export(miic_obj, "igraph") +plot(g) # Default visualisation, calls igraph::plot.igraph() + +# Specifying layout (see ?igraph::layout_) +l <- layout_with_kk(g) +plot(g, layout=l) + +# Override some graphical parameters +plot(g, edge.curved = .2) +plot(g, vertex.shape="none", edge.color="gray85", vertex.label.color="gray10") +} + +# In temporal mode, execute MIIC +data(covidCases) +tmiic_obj <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, mov_avg = 14) + +# Plot by default the compact display of the temporal network using igraph +if(require(igraph)) { +g = export(tmiic_obj) +plot(g) + +# Plot the raw temporal network using igraph +g = export(tmiic_obj, display="raw") +plot(g) + +# Plot the complete temporal network using igraph (completed by stationarity) +g = export(tmiic_obj, display="lagged") +plot(g) + +# Specifying layout (see ?igraph::layout_) +l <- layout_on_grid(g, width = 5, height = 3, dim = 2) +plot(g, layout=l) + +# For compact temporal display, please be aware that the rendering of +# igraph::plot.igraph() is not optimal when the graph contains +# multiple edges between the same nodes.
+# So, the recommended way to plot a compact graph is to use tmiic plotting: +plot(tmiic_obj) +} + +} +} diff --git a/man/getIgraph.Rd deleted file mode 100644 index 01460ca..0000000 --- a/man/getIgraph.Rd +++ /dev/null @@ -1,37 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/miic.plot.R -\name{getIgraph} -\alias{getIgraph} -\title{Igraph plotting function for miic} -\usage{ -getIgraph(miic.res, pcor_palette = NULL) -} -\arguments{ -\item{miic.res}{[a miic graph object] -The graph object returned by the miic execution.} - -\item{pcor_palette}{The color palette used to represent the partial correlations -(the color of the edges). The palette must be able to handle 201 shades -to cover the correlation range from -100 to +100. The default palette is -grDevices::colorRampPalette(c("blue", "darkgrey", "red").} -} -\value{ -An igraph graph object. -} -\description{ -This functions returns an igraph object built from the result -returned by \code{\link{miic}}. -} -\details{ -Edges attributes are passed to the igraph graph and can be accessed with -e.g. \code{E(g)$partial_correlation}. See \code{\link{miic}} for more -details on edge parameters. By default, edges are colored according to the -partial correlation between two nodes conditioned on the conditioning set -(negative is blue, null is gray and positive is red) and their width is -based on the conditional mutual information minus the complexity cost. -} -\seealso{ -\code{\link{miic}} for details on edge parameters in the returned object, -\code{\link[igraph]{igraph.plotting}} for the detailed description of the -plotting parameters and \code{\link[igraph]{layout}} for different layouts.
-} diff --git a/man/miic.Rd b/man/miic.Rd index 633bff3..0c97053 100644 --- a/man/miic.Rd +++ b/man/miic.Rd @@ -12,8 +12,8 @@ miic( n_threads = 1, cplx = "nml", orientation = TRUE, - ori_proba_ratio = 1, - ori_consensus_ratio = NULL, + ort_proba_ratio = 1, + ort_consensus_ratio = NULL, propagation = FALSE, latent = "orientation", n_eff = -1, @@ -28,7 +28,7 @@ miic( mode = "S", n_layers = NULL, delta_t = NULL, - movavg = NULL, + mov_avg = NULL, keep_max_data = FALSE, max_nodes = 50, verbose = FALSE @@ -97,9 +97,9 @@ of time steps between each layer for the variable. Note that if a \emph{"delta_t"} column is present in the \emph{state_order}, its values will overwrite the function parameter. -\emph{"movavg"} (optional) contains an integer value that specifies the size of -the moving average window to be applied to the variable. -Note that if \emph{"movavg"} column is present in the \emph{state_order}, +\emph{"mov_avg"} (optional) contains an integer value that specifies the size +of the moving average window to be applied to the variable. +Note that if \emph{"mov_avg"} column is present in the \emph{state_order}, its values will overwrite the function parameter.} \item{true_edges}{[a data frame, optional, NULL by default] @@ -116,9 +116,11 @@ must be valid in the time unfolded graph. e.g. a row var1, var2, 3 is valid with \emph{n_layers} = 4 + \emph{delta_t} = 1 or \emph{n_layers} = 2 + \emph{delta_t} = 3 but not for \emph{n_layers} = 2 + \emph{delta_t} = 2 as there is no matching -edge in the time unfolded graph. Please note that the order is important: -var1, var2, 3 is interpreted as var1_lag3 - var2_lag0. Please note also that, -for contextual variables that are not lagged, the expected value in the +edge in the time unfolded graph.\cr +Please note that the order is important: in standard mode, "var1 var2" will +be interpreted as var1 -> var2 and in temporal mode, "var1 var2 3" is +interpreted as var1_lag3 -> var2_lag0. 
Please note also that, in temporal +mode, for contextual variables that are not lagged, the expected value in the third column for the time lag is NA.} \item{black_box}{[a data frame, optional, NULL by default] @@ -148,17 +150,17 @@ When set greater than 1, n_threads parallel threads will be used for computation your compiler is compatible with openmp if you wish to use multithreading.} \item{cplx}{[a string, optional, "nml" by default, possible values: -"nml", "mdl"] +"nml", "bic"] In practice, the finite size of the input dataset requires that the 2-point and 3-point information measures should be \emph{shifted} by a \emph{complexity} term. The finite size corrections can be based on -the Minimal Description Length (MDL) criterion. -However, the MDL complexity criterion tends to underestimate the +the Bayesian Information Criterion (BIC). +However, the BIC complexity term tends to underestimate the relevance of edges connecting variables with many different categories, leading to the removal of false negative edges. To avoid such biases with finite datasets, the (universal) Normalized Maximum Likelihood (NML) -criterion can be used (see Affeldt \emph{et al.}, UAI 2015).} +criterion can be used (see Affeldt 2015).} \item{orientation}{[a boolean value, optional, TRUE by default] @@ -167,33 +169,38 @@ edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples and, in temporal mode, using time. If set to FALSE, the orientation step is not performed.} -\item{ori_proba_ratio}{[a floating point between 0 and 1, optional, +\item{ort_proba_ratio}{[a floating point between 0 and 1, optional, 1 by default] The threshold when deducing the type of an edge tip (head/tail) from the probability of orientation. For a given edge tip, denote by p the probability of it being a head, -the orientation is accepted if (1 - p) / p < \emph{ori_proba_ratio}. +the orientation is accepted if (1 - p) / p < \emph{ort_proba_ratio}. 
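The acceptance rule above can be made concrete with a small sketch (the helper name is hypothetical and not part of the miic API):

```r
# Sketch of the edge-tip rule described above: with p the probability that
# the tip is a head, the head is accepted when (1 - p) / p < ratio.
accept_head <- function(p, ratio = 1) {
  (1 - p) / p < ratio
}

accept_head(0.9)               # (1 - 0.9) / 0.9 is about 0.11 < 1: accepted
accept_head(0.9, ratio = 0.1)  # about 0.11 >= 0.1: rejected by a stricter threshold
```

Lowering the ratio therefore only keeps orientations whose head probability is close to 1.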
0 means reject all orientations, 1 means accept all orientations.} -\item{ori_consensus_ratio}{[a floating point between 0 and 1, optional, +\item{ort_consensus_ratio}{[a floating point between 0 and 1, optional, NULL by default] - -The threshold when deducing the type of an consensus edge tip (head/tail) -from the average probability of orientation. -For a given edge tip, denote by p the probability of it being a head, -the orientation is accepted if (1 - p) / p < \emph{ori_consensus_ratio}. -0 means reject all orientations, 1 means accept all orientations. -If not supplied, the \emph{ori_consensus_ratio} will be initialized with -the \emph{ori_proba_ratio} value.} +Used to determine if orientations correspond to genuine causal edges +and, when consistency is activated, to deduce the orientations in +the consensus graph.\cr +Oriented edges will be marked as genuine causal when: + \eqn{ (1 - p_{head}) / p_{head} < } \emph{ort_consensus_ratio} +and \eqn{ p_{tail} / (1 - p_{tail}) < } \emph{ort_consensus_ratio}.\cr +When consistency is activated, \emph{ort_consensus_ratio} is used as a +threshold when deducing the type of a consensus edge tip (head/tail) +from the average probability of orientations over the cycle of graphs. +For a given edge tip, denote by p the average probability of it being a head, +the orientation is accepted if (1 - p) / p < \emph{ort_consensus_ratio}.\cr +If not supplied, the \emph{ort_consensus_ratio} will be initialized with +the \emph{ort_proba_ratio} value.} \item{propagation}{[a boolean value, optional, FALSE by default] If set to FALSE, the skeleton is partially oriented with only the v-structure orientations. Otherwise, the v-structure orientations are -propagated to downstream undirected edges in unshielded triples following +propagated to downstream un-directed edges in unshielded triples following the propagation procedure, relying on probabilities (for more details, -see Verny \emph{et al.}, PLoS Comp. Bio.
2017).} +see Verny 2017).} \item{latent}{[a string, optional, "orientation" by default, possible values: "orientation", "no", "yes"] @@ -212,13 +219,13 @@ In standard mode, the n samples given in the \emph{input_data} data frame are expected to be independent. In case of correlated samples such as in Monte Carlo sampling approaches, the effective number of independent samples \emph{n_eff} can be estimated using the decay of the autocorrelation function -(Verny et al., PLoS Comp. Bio. 2017). This effective number \emph{n_eff} -of independent samples can be provided using this parameter.} +(see Verny 2017). This effective number \emph{n_eff} of independent samples +can be provided using this parameter.} \item{n_shuffles}{[a positive integer, optional, 0 by default] The number of shufflings of the original dataset in order to evaluate -the edge specific confidence ratio of all inferred edges. +the edge specific confidence ratio of all retained edges. Default is 0: no confidence cut is applied. If the number of shufflings is set to an integer > 0, the confidence threshold must also be > 0 (e.g. \emph{n_shuffles} = 100 and \emph{conf_threshold} = 0.01).} @@ -226,9 +233,8 @@ is set to an integer > 0, the confidence threshold must also be > 0 \item{conf_threshold}{[a positive floating point, optional, 0 by default] The threshold used to filter the less probable edges following the skeleton -step. See Verny \emph{et al.}, PLoS Comp. Bio. 2017. -Default is 0: no confidence cut is applied. If the the confidence threshold -is set > 0, the number of shufflings must also be > 0 +step (see Verny 2017). Default is 0: no confidence cut is applied. If the +confidence threshold is set > 0, the number of shufflings must also be > 0 (e.g. 
\emph{n_shuffles} = 100 and \emph{conf_threshold} = 0.01).} \item{sample_weights}{[a numeric vector, optional, NULL by default] @@ -246,16 +252,17 @@ between the full distribution and the non-missing distribution the joint distribution of \eqn{X} and \eqn{Y} on samples which are not missing on Z. This is a way to ensure that data are missing at random for the considered -interaction and to avoid selection bias.} +interaction and to detect biases due to values not missing at random.} \item{consistent}{[a string, optional, "no" by default, possible values: "no", "orientation", "skeleton"] If set to "orientation": iterate over skeleton and orientation steps to -ensure consistency of the network. -If set to "skeleton": iterate over skeleton step to get a consistent skeleton, -then orient edges and discard inconsistent orientations to ensure consistency -of the network (see Li \emph{et al.}, NeurIPS 2019 for details).} +ensure consistency of the separating sets and all disconnected pairs in the +final network. +If set to "skeleton": iterate over skeleton step to get a consistent +skeleton, then orient edges, possibly including inconsistent orientations +(see Li 2019 for details).} \item{max_iteration}{[a positive integer, optional, 100 by default] @@ -284,9 +291,8 @@ probability of orientation.)} \item{negative_info}{[a boolean value, optional, FALSE by default] -For test purpose only. If TRUE, negative shifted mutual information is -allowed during the computation when mutual information is inferior to the -complexity term. +If TRUE, negative shifted mutual information is allowed during the +computation when mutual information is smaller than the complexity term. For small dataset with complicated structures, e.g.
discrete variables with many levels, allowing for negative shifted mutual information may help identifying weak v-structures related to those discrete variables, @@ -295,15 +301,16 @@ the difference between two negative shifted mutual information terms (expected to be negative due to the small sample size). However, under this setting, a v-structure (X -> Z <- Y) in the final graph does not necessarily imply that X is dependent on Y conditioning on Z, -As a consequence, the interpretability of the final graph -is hindered. In practice, it's advised to keep this parameter as FALSE.} +and, as a consequence, the reliability of certain orientations is not guaranteed. +By contrast, keeping this parameter as FALSE is more conservative and leads +to more reliable orientations (see Cabeli 2021 and Ribeiro-Dantas 2024).} \item{mode}{[a string, optional, "S" by default, possible values are -"S": Standard (IID samples) or "TS": Temporal Stationary"] +"S": Standard (non-temporal data) or "TS": Temporal Stationary data] When temporal mode is activated, the time information must be provided in the first column of \emph{input_data}. For more details about temporal -stationary mode, see Simon \emph{et al.}, eLife reviewed preprint 2024.} +stationary mode, see Simon 2024.} \item{n_layers}{[an integer, optional, NULL by default, must be >= 2 if supplied] @@ -326,7 +333,7 @@ for the first sample, the next sample will use 2, 9, 16 and so on. If not supplied, the number of time steps between layers is estimated from the dynamic of the dataset and the number of layers. -\item{movavg}{[an integer, optional, NULL by default, must be >= 2 +\item{mov_avg}{[an integer, optional, NULL by default, must be >= 2 if supplied] Used only in temporal mode. When supplied, a moving average operation is @@ -343,7 +350,7 @@ whilst they will be dropped if FALSE.} Used only in temporal mode and if the \emph{n_layers} or \emph{delta_t} parameters are not supplied.
\emph{max_nodes} is used as the maximum number -of nodes in the final graph to compute \emph{n_layers} and/or \emph{delta_t}. +of nodes in the final time-unfolded graph to compute \emph{n_layers} and/or \emph{delta_t}. The default is 50 to produce quick runs and can be increased up to 200 or 300 on recent computers to produce more precise results.} @@ -354,73 +361,149 @@ If TRUE, debugging output is printed.} \value{ A \emph{miic-like} object that contains: \itemize{ - \item{\emph{all.edges.summary:} a data frame with information about the relationship between - each pair of variables + \item{\emph{summary:} a data frame with information about the + relationship between relevant pairs of variables. + + As returning the information on all possible pairs of variables could lead + to a huge data frame, by convention, the summary does not include pairs of + variables not sharing information at all (\emph{I'(x,y) <= 0}). + However, as an exception to this convention, when a ground truth is supplied + (using the \emph{true_edges} parameter), the edges that are not retained + by MIIC because the variables do not share information at all + but are present in the true edges will be included in the summary + to report correctly all the false negative edges. + + So, the summary contains these categories of edges: + \itemize{ + \item{ edges retained} + \item{ edges not retained after conditioning on some contributor(s)} + \item{ edges not retained without conditioning but present in true edges} + } + while the following edges are not considered relevant and are not included: + \itemize{ + \item{ edges not retained without conditioning and not in true edges} + } + + The information available in the summary is: \itemize{ - \item{ \emph{x:} X node} - \item{ \emph{y:} Y node} + \item{ \emph{x:} X node name} + \item{ \emph{y:} Y node name} \item{ \emph{type:} contains 'N' if the edge has been removed or 'P' for retained edges.
If the true graph is supplied in the \emph{true_edges} parameter, 'P' becomes 'TP' (True Positive) or 'FP' (False Positive), while 'N' becomes 'TN' (True Negative) or 'FN' (False Negative). - Note that, as the \emph{all.edges.summary} does not contain all the - negative edges, edges not present are 'TN'.} + Note that, as the \emph{summary} does not contain all the + removed edges, edges not present have to be considered as 'N' + and, if the true graph is supplied, as 'TN'.} \item{ \emph{ai:} the contributing nodes found by the method which - participate in the mutual information between \emph{x} and \emph{y}, + contribute to the mutual information between \emph{x} and \emph{y}, and possibly separate them.} \item{ \emph{raw_contributions:} describes the share of total mutual - information between \emph{x} and \emph{y} explained by each contributor.} + information between \emph{x} and \emph{y} explained by each contributor, + measured by I'(x;y;ai|\{aj\}) / I'(x;y), + where \{aj\} is the separating set before adding ai.} \item{ \emph{contributions:} describes the share of remaining mutual information between \emph{x} and \emph{y} explained by each successive - contributors.} - \item{ \emph{info:} provides the pairwise mutual information times - \emph{Nxyi} for the pair (\emph{x}, \emph{y}).} - \item{ \emph{info_cond:} provides the conditional mutual information - times \emph{Nxy_ai} for the pair (\emph{x}, \emph{y}) when conditioned - on the collected nodes \emph{ai}. It is - equal to the \emph{info} column when \emph{ai} is an empty set.} - \item{ \emph{cplx:} gives the computed complexity between the (\emph{x}, - \emph{y}) variables taking into account the contributing nodes \emph{ai}. - Edges that have have more conditional information \emph{info_cond} - than \emph{cplx} are retained in the final graph.} - \item{ \emph{Nxy_ai:} gives the number of complete samples on which the - information and the complexity have been computed. 
If the input dataset + contributors, measured by I'(x;y;ai|\{aj\}) / I'(x;y|\{aj\}), + where \{aj\} is the separating set before adding ai. } + \item{ \emph{info:} the mutual information \emph{I(x;y)} times \emph{n_xy}, + the number of samples without missing or NA values for both \emph{x} + and \emph{y}. } + \item{ \emph{n_xy:} gives the number of samples on which the information + without conditioning has been computed. If the input dataset has no missing value, the number of samples is the same for all pairs and corresponds to the total number of samples.} - \item{ \emph{info_shifted:} represents the \emph{info} - \emph{cplx} value. - It is a way to quantify the strength of the edge (\emph{x}, \emph{y}).} - \item{ \emph{infOrt:} the orientation of the edge (\emph{x}, \emph{y}). - It is the same value as in the adjacency matrix at row \emph{x} and - column \emph{y} : 1 for unoriented, 2 for an edge from X to Y, - -2 from Y to X and 6 for bidirectional.} - \item{ \emph{trueOrt:} the orientation of the edge (\emph{x}, \emph{y}) - present in the true edges are provided.} - \item{ \emph{isOrtOk:} information about the consistency of the inferred - graph’s orientations with a reference graph is given (if true edges - are provided). - 'Y': the orientation is consistent; 'N': the orientation is not consistent - with the PAG (Partial Ancestor Graph) derived from the given true graph.} + \item{ \emph{info_cond:} the conditional mutual information \emph{I(x;y|ai)} + times the number of samples without NA \emph{n_xy_ai} used in the + computation. + \emph{info_cond} is equal to \emph{info} when \emph{ai} is an empty set.} + \item{ \emph{cplx:} the complexity term for the pair (\emph{x}, \emph{y}) + taking into account the contributing nodes \emph{ai}.} + \item{ \emph{n_xy_ai:} the number of samples without NA in + \emph{x}, \emph{y} and all nodes in \emph{ai} on which the + information and the complexity terms are computed. 
+ If the input dataset has no missing value, the number of samples is the + same for all pairs and corresponds to the total number of samples.} + \item{ \emph{info_shifted:} value equal to \emph{info_cond} - \emph{cplx}. + Used to decide whether the edge is retained (when positive), + or removed (when zero or possibly negative when the parameter + \emph{negative_info} is set to TRUE).} + \item{ \emph{ort_inferred:} the orientation of the edge (\emph{x}, \emph{y}). + 0: edge removed, 1: un-directed, 2: directed from X to Y, -2: directed + from Y to X, 6: bi-directed.\cr + When the \emph{consistent} option is turned on and there is more than + one graph in the consistent cycle, this is the inferred orientation + of the edge in the last graph in the cycle. } + \item{ \emph{ort_ground_truth:} the orientation of the edge (\emph{x}, + \emph{y}) in the ground truth graph when true edges are provided.} + \item{ \emph{is_inference_correct:} indicates if the inferred orientation + agrees with the provided ground truth. TRUE: agrees, FALSE: disagrees and + set to NA when no ground truth is supplied.} + \item{ \emph{is_causal:} boolean value indicating the causal nature of the + arrow tips of an edge, based on the probabilities given in the columns + \emph{p_y2x} and \emph{p_x2y}. TRUE: when the edge is directed + and both the head and the tail are set with high confidence + (adjustable with the \emph{ort_consensus_ratio} parameter), + FALSE otherwise or NA if the edge is not retained. + More formally, an oriented edge is marked as genuine causal when + \eqn{ (1 - p_{head}) / p_{head} < } \emph{ort_consensus_ratio} + and \eqn{ p_{tail} / (1 - p_{tail}) < } \emph{ort_consensus_ratio}.\cr + A directed edge not marked as genuine causal indicates that only + the head is set with high confidence, while the tail remains uncertain.
+ This corresponds to a putative causal edge, which could either be + a genuine causal edge or a bi-directed edge from a latent confounder.\cr +Note that the genuine causality is deducible only when latent variables + are allowed and propagation is not allowed.} + \item{ \emph{ort_consensus:} Not computed (NAs) when + consistency is not activated or, when consistency is on, + if there is only one graph returned (no cycle). + When computed, indicates the consensus orientation of the edge + determined from the consensus skeleton and the \emph{ort_consensus_ratio} + threshold on averaged orientation probabilities over the cycle of graphs. + Possible values are 0: not connected, 1: un-directed, -2 or 2: directed + and 6: bi-directed (latent variable).} + \item{ \emph{is_causal_consensus:} Not computed (NAs) when + consistency is not activated or, when consistency is on, + if there is only one graph returned (no cycle). + When computed, works in the same way as \emph{is_causal} + but on the consensus graph.} + \item{ \emph{edge_stats:} Not computed (NAs) when + consistency is not activated or, when consistency is on, + if there is only one graph returned (no cycle). + When computed, contains the frequencies of all \emph{ort_inferred} + values present in the cycle of graphs for the edge (\emph{x, y}), + in the format [percentage(orientation)], separated by ";". + e.g. In a cycle of 4 graphs, if an edge is three times marked as 2 + (directed) and one time marked as 1 (un-directed), edge_stats will + contain "75\%(2);25\%(1)".} \item{ \emph{sign:} the sign of the partial correlation between variables \emph{x} and \emph{y}, conditioned on the contributing nodes \emph{ai}.} \item{ \emph{partial_correlation:} value of the partial correlation for the - edge (\emph{x}, \emph{y}) conditioned on the contributing nodes \emph{ai}.} - \item{ \emph{is_causal:} details about the nature of the arrow tip for a - directed edge.
A directed edge in a causal graph does not necessarily imply - causation but it does imply that the cause-effect relationship is not the - other way around. An arrow-tip which is itself downstream of another - directed edge suggests stronger causal sense and is marked by a 'Y', - or 'N' otherwise.} - \item{ \emph{proba:} probabilities for the inferred orientation, derived - from the three-point mutual information (cf Affeldt & Isambert, UAI 2015 - proceedings) and noted as p(x->y);p(x<-y).} - \item{ \emph{confidence:} this column is computed when the confidence cut - is activated. It represents the ratio between the probability to reject - the edge (\emph{x}, \emph{y}) in the dataset versus the mean probability - to do the same in multiple (user defined) number of randomized datasets.} + edge (\emph{x, y}) conditioned on the contributing nodes \emph{ai}.} + \item{ \emph{p_y2x:} probability of the arrowhead from \emph{y} to \emph{x}, + of the inferred orientation, derived from the three-point mutual information + (see Verny 2017 and Ribeiro-Dantas 2024). NA if the edge is removed.} + \item{ \emph{p_x2y:} probability of the arrowhead from \emph{x} to \emph{y}, + of the inferred orientation, derived from the three-point mutual information + (see Verny 2017 and Ribeiro-Dantas 2024). NA if the edge is removed.} + \item{ \emph{confidence:} computed only when the confidence cut is + activated, NA otherwise. + When computed, it corresponds to a measure of the strength of the retained + edges: it is the ratio between the probability to reject the edge + \emph{exp(-info_shifted(x;y|ai))} in the original dataset and + the mean probability to do the same in \emph{n_shuffles} number + of randomized datasets. Edges with \emph{confidence} > \emph{conf_threshold} + will be filtered out from the graph. + (see parameters \emph{n_shuffles} and \emph{conf_threshold}) } } + } + + \item{\emph{edges:} a data frame with the raw edges output coming from + the C++ core function. 
This data frame is used internally by MIIC to + produce the summary and contains all pairs of variables (\emph{x, y}). } - \item{\emph{orientations.prob:} this data frame lists the orientation + \item{\emph{triples:} this data frame lists the orientation probabilities of the two edges of all unshielded triples of the reconstructed network with the structure: node1 -- mid-node -- node2: \itemize{ @@ -431,9 +514,9 @@ A \emph{miic-like} object that contains: \item{ \emph{p3:} probability of the arrowhead mid-node <- node2} \item{ \emph{p4:} probability of the arrowhead mid-node -> node2} \item{ \emph{node2:} node at the end of the unshielded triplet} - \item{ \emph{NI3:} 3 point (conditional) mutual information * N} - \item{ \emph{Conflict:} indicates if there a conflict between the - computed probabilities and the NI3 value} + \item{ \emph{ni3:} 3 point (conditional) mutual information * N} + \item{ \emph{conflict:} indicates if there is a conflict between the + computed probabilities and the \emph{ni3} value} } } @@ -442,27 +525,64 @@ A \emph{miic-like} object that contains: pairs of vertices are adjacent or not in the graph. The matrix can be read as a (row, column) set of couples where the row represents the source node and the column the target node. 
Since miic can reconstruct mixed networks - (including directed, undirected and bidirected edges), we will have a + (including directed, un-directed and bi-directed edges), we will have a different digit for each case: \itemize{ - \item{ 1: (\emph{x}, \emph{y}) edge is undirected} + \item{ 1: (\emph{x}, \emph{y}) edge is un-directed} \item{ 2: (\emph{x}, \emph{y}) edge is directed as \emph{x} -> \emph{y} } \item{ -2: (\emph{x}, \emph{y}) edge is directed as \emph{x} <- \emph{y} } - \item{ 6: (\emph{x}, \emph{y}) edge is bidirected} + \item{ 6: (\emph{x}, \emph{y}) edge is bi-directed} } } \item {\emph{proba_adj_matrix:} the probability adjacency matrix is - a square matrix used to represent the orientation probabilities associated + a square matrix used to represent the orientation probabilities associated to the edges tips. The value at ("row", "column") is the probability, for the edge between "row" and "column" nodes, of the edge tip on the "row" - side. A probability less than 0.5 is an indication of a possible tail - (cause) and a probability greater than 0.5 a possible head (effect). + side. A probability less than 0.5 is an indication of a possible tail + (cause) and a probability greater than 0.5 a possible head (effect). } + + \item {\emph{adj_matrices:} present only when consistency is activated. + The list of the adjacency matrices, one for each graph + which is part of the resulting cycle of graphs. + Each item is a square matrix with the same layout as \emph{adj_matrix}. } + + \item {\emph{proba_adj_matrices:} present only when consistency is activated. + The list of the probability adjacency matrices, one for each graph + which is part of the resulting cycle of graphs. Each item is a + square matrix with the same layout as \emph{proba_adj_matrix}. } + + \item {\emph{proba_adj_average:} present only when consistency is activated. 
+ The average probability adjacency matrix is a square matrix used to + represent the orientation probabilities associated to the edges tips + of the consensus graph. Its layout is the same as \emph{proba_adj_matrix} + and it contains the averaged probability of edges tips over the resulting + cycle of graphs. } + + \item {\emph{is_consistent:} present only when consistency is activated. + TRUE if the returned graph is consistent, FALSE otherwise. } + + \item {\emph{time:} execution time of the different steps and total run-time + of the causal graph reconstruction by MIIC. } + + \item {\emph{interrupted:} TRUE if causal graph reconstruction has been + interrupted, FALSE otherwise. } + + \item {\emph{scores:} present only when true edges have been supplied. + Contains the scores of the returned graph with regard to the ground truth: + \itemize{ + \item{ \emph{tp}: number of edges marked as True Positive } + \item{ \emph{fp}: number of edges marked as False Positive } + \item{ \emph{fn}: number of edges marked as False Negative } + \item{ \emph{precision}: Precision } + \item{ \emph{recall}: Recall } + \item{ \emph{fscore}: F1-Score } + } } \item {\emph{params:} the list of parameters used for the network reconstruction. The parameters not supplied are initialized to their default - values. Otherwise, the parameters are checked and corrected if necessary.} + values. Otherwise, the parameters are checked and corrected if necessary. } \item {\emph{state_order:} the state order used for the network reconstruction. If no state order is supplied, it is generated by using @@ -484,38 +604,42 @@ evaluation.} } } \description{ -MIIC (Multivariate Information based Inductive Causation) combines +MIIC (Multivariate Information-based Inductive Causation) combines constraint-based and information-theoretic approaches to disentangle direct from indirect effects amongst correlated variables, including cause-effect relationships and the effect of unobserved latent causes.
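The genuine vs. putative distinction made by the \emph{is_causal} column above can be illustrated with a short sketch of the quoted formula. This is a language-agnostic illustration only; the function name and the example threshold value are assumptions, not taken from the miic package.

```python
# Hedged sketch of the genuine-causal test on an oriented edge x -> y:
# both the arrow head and the arrow tail must be set with high confidence.
# The default threshold here is an assumption, not miic's actual default.
def is_genuine_causal(p_head, p_tail, ort_consensus_ratio=0.1):
    head_confident = (1 - p_head) / p_head < ort_consensus_ratio
    tail_confident = p_tail / (1 - p_tail) < ort_consensus_ratio
    return head_confident and tail_confident

print(is_genuine_causal(p_head=0.99, p_tail=0.01))  # True: genuine causal
print(is_genuine_causal(p_head=0.99, p_tail=0.50))  # False: putative only
```

An edge failing only the tail test corresponds to the putative case described above: a confident head but an uncertain tail.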
} \details{ -In standard mode, starting from a complete graph, the method iteratively removes +Starting from a complete graph, the method iteratively removes dispensable edges, by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of available data. The remaining edges are then oriented based on the signature -of causality in observational data. +of causality in observational data. MIIC distinguishes genuine causal edges +(with both reliable arrow heads and tails) from putative causal edges (with +one reliable arrow head only) and latent causal edges (with both reliable +arrow heads) (see Ribeiro-Dantas 2024). In temporal mode, miic reorganizes the dataset using the \emph{n_layers} and \emph{delta_t} parameters to transform the time steps into lagged samples. As starting point, a lagged graph is created with only edges having at least one node laying on the last time step. Then, miic standard algorithm is applied to remove dispensable edges. -The remaining edges are then oriented by using the temporality and the +The remaining edges are then duplicated to ensure time invariance +(stationary dynamics) and oriented using the temporality and the signature of causality in observational data. The use of temporal mode -is exposed in Simon \emph{et al.}, eLife reviewed preprint 2024. - -The method relies on an information theoretic based (conditional) independence -test which is described in (Verny \emph{et al.}, PLoS Comp. Bio. 2017), -(Cabeli \emph{et al.}, PLoS Comp. Bio. 2020). It deals with both categorical -and continuous variables by performing optimal context-dependent discretization. -As such, the input data frame may contain both numerical columns which will be -treated as continuous, or character / factor columns which will be treated +is presented in Simon 2024.
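The reorganization of time steps into lagged samples described above can be sketched as follows. This is a conceptual illustration with a hypothetical helper, not the package's implementation, shown here for a univariate series.

```python
# Conceptual sketch (not miic's actual code): unrolling a time series into
# lagged samples with n_layers layers spaced delta_t time steps apart.
def build_lagged_samples(series, n_layers=3, delta_t=1):
    """Each row holds [x_t, x_(t-delta_t), ..., x_(t-(n_layers-1)*delta_t)]."""
    span = (n_layers - 1) * delta_t
    return [[series[t - k * delta_t] for k in range(n_layers)]
            for t in range(span, len(series))]

# A 6-step series with 3 layers and delta_t = 1 yields 4 lagged samples:
print(build_lagged_samples([0, 1, 2, 3, 4, 5]))
# [[2, 1, 0], [3, 2, 1], [4, 3, 2], [5, 4, 3]]
```

Each lagged sample then plays the role of one observation in the standard algorithm, with edges required to touch the most recent layer.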
+ +The method relies on information theoretic principles which replace +(conditional) independence tests as described in Affeldt 2015, Cabeli 2020, +Cabeli 2021 and Ribeiro-Dantas 2024. It deals with both categorical and +continuous variables by performing optimal context-dependent discretization. +As such, the input data frame may contain both numerical columns which will +be treated as continuous, and character / factor columns which will be treated as categorical. For further details on the optimal discretization method and the conditional independence test, see the function discretizeMutual. -The user may also choose to run miic with scheme presented in -(Li \emph{et al.}, NeurIPS 2019) to improve the end result's interpretability -by ensuring consistent separating set during the skeleton iterations. +The user may also choose to run miic with the scheme presented in Li 2019 +and Ribeiro-Dantas 2024 to improve the end result's interpretability +by ensuring consistent separating sets. } \examples{ library(miic) @@ -524,76 +648,58 @@ library(miic) data(hematoData) # execute MIIC (reconstruct graph) -miic.res <- miic( +miic_obj <- miic( input_data = hematoData[1:1000,], latent = "yes", n_shuffles = 10, conf_threshold = 0.001 ) # plot graph if(require(igraph)) { - plot(miic.res, method="igraph") + plot(miic_obj, method="igraph") } \donttest{ # write graph to graphml format. Note that to correctly visualize # the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
-miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp")) +writeCytoscapeNetwork(miic_obj, file = file.path(tempdir(), "temp")) # EXAMPLE CANCER data(cosmicCancer) data(cosmicCancer_stateOrder) # execute MIIC (reconstruct graph) -miic.res <- miic( +miic_obj <- miic( input_data = cosmicCancer, state_order = cosmicCancer_stateOrder, latent = "yes", n_shuffles = 100, conf_threshold = 0.001 ) # plot graph if(require(igraph)) { - plot(miic.res) -} - -# write graph to graphml format. Note that to correctly visualize -# the network we created the miic style for Cytoscape (http://www.cytoscape.org/). -miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp")) - -# EXAMPLE OHNOLOGS -data(ohno) -data(ohno_stateOrder) -# execute MIIC (reconstruct graph) -miic.res <- miic( - input_data = ohno, latent = "yes", state_order = ohno_stateOrder, - n_shuffles = 100, conf_threshold = 0.001 -) - -# plot graph -if(require(igraph)) { - plot(miic.res) + plot(miic_obj) } # write graph to graphml format. Note that to correctly visualize # the network we created the miic style for Cytoscape (http://www.cytoscape.org/). 
-miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp")) +writeCytoscapeNetwork(miic_obj, file = file.path(tempdir(), "temp")) # EXAMPLE COVID CASES (time series demo) data(covidCases) # execute MIIC (reconstruct graph in temporal mode) -tmiic.res <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, movavg = 14) +tmiic_obj <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, mov_avg = 14) # to plot the default graph (compact) if(require(igraph)) { - plot(tmiic.res) + plot(tmiic_obj) } -# to plot the raw temporal network Using igraph +# to plot the raw temporal network if(require(igraph)) { - plot(tmiic.res, display="raw") + plot(tmiic_obj, display="raw") } -# to plot the full temporal network Using igraph +# to plot the full temporal network if(require(igraph)) { - plot(tmiic.res, display="lagged") + plot(tmiic_obj, display="lagged") } } @@ -601,11 +707,13 @@ if(require(igraph)) { } \references{ \itemize{ -\item{Simon et al., \emph{eLife reviewed preprint} https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1.abstract } -\item{Ribeiro-Dantas et al., \emph{iScience 2024} https://arxiv.org/abs/2303.06423 } -\item{Cabeli et al., \emph{PLoS Comp. Bio. 2020.} https://doi.org/10.1371/journal.pcbi.1007866 } -\item{Li et al., \emph{NeurIPS 2019} http://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets.pdf } -\item{Verny et al., \emph{PLoS Comp. Bio. 
2017.} https://doi.org/10.1371/journal.pcbi.1005662 } +\item Simon \emph{et al.}, eLife 2024, \href{https://www.biorxiv.org/content/10.1101/2024.02.06.579177v1.abstract}{CausalXtract: a flexible pipeline to extract causal effects from live-cell time-lapse imaging data} +\item Ribeiro-Dantas \emph{et al.}, iScience 2024, \href{https://doi.org/10.1016/j.isci.2024.109736}{Learning interpretable causal networks from very large datasets, application to 400,000 medical records of breast cancer patients} +\item Cabeli \emph{et al.}, NeurIPS 2021, \href{https://why21.causalai.net/papers/WHY21_24.pdf}{Reliable causal discovery based on mutual information supremum principle for finite dataset} +\item Cabeli \emph{et al.}, PLoS Comput. Biol. 2020, \href{https://doi.org/10.1371/journal.pcbi.1007866}{Learning clinical networks from medical records based on information estimates in mixed-type data} +\item Li \emph{et al.}, NeurIPS 2019, \href{http://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets.pdf}{Constraint-based causal structure learning with consistent separating sets} +\item Verny \emph{et al.}, PLoS Comput. Biol. 
2017, \href{https://doi.org/10.1371/journal.pcbi.1005662}{Learning causal networks with latent variables from multivariate information in genomic data} +\item Affeldt \emph{et al.}, UAI 2015, \href{https://auai.org/uai2015/proceedings/papers/293.pdf}{Robust Reconstruction of Causal Graphical Models based on Conditional 2-point and 3-point Information} } } \seealso{ diff --git a/man/miic.export.Rd b/man/miic.export.Rd deleted file mode 100644 index 8f1fc0a..0000000 --- a/man/miic.export.Rd +++ /dev/null @@ -1,62 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/miic.plot.R -\name{miic.export} -\alias{miic.export} -\title{Export miic result to different plotting methods} -\usage{ -miic.export(miic.res, method = NULL, pcor_palette = NULL) -} -\arguments{ -\item{miic.res}{[a miic graph object] -The graph object returned by the miic execution.} - -\item{method}{A string representing the plotting method. -Currently only "igraph" is supported.} - -\item{pcor_palette}{The color palette used to represent the partial correlations -(the color of the edges). The palette must be able to handle 201 shades -to cover the correlation range from -100 to +100. The default palette is -grDevices::colorRampPalette(c("blue", "darkgrey", "red").} -} -\value{ -A graph object adapted to the method. -} -\description{ -This function creates an object built from the result returned -by \code{\link{miic}} that is ready to be fed to different plotting methods. -} -\details{ -See the details of specific function for each method. -For igraph, see \code{\link{getIgraph}}. 
-} -\examples{ -\donttest{ -library(miic) -data(hematoData) - -# execute MIIC (reconstruct graph) -miic.res <- miic( - input_data = hematoData, latent = "yes", - n_shuffles = 10, conf_threshold = 0.001 -) - -# Using igraph -if(require(igraph)) { -g = miic.export(miic.res, "igraph") -plot(g) # Default visualisation, calls igraph::plot.igraph() - -# Specifying layout (see ?igraph::layout_) -l <-layout_with_kk(g) -plot(g, layout=l) - -# Override some graphical parameters -plot(g, edge.curved = .2) -plot(g, vertex.shape="none", edge.color="gray85", vertex.label.color="gray10") -} - -} - -} -\seealso{ -\code{\link{getIgraph}} for details on the igraph exported object. -} diff --git a/man/ohno.Rd b/man/ohno.Rd deleted file mode 100644 index c49e8d3..0000000 --- a/man/ohno.Rd +++ /dev/null @@ -1,23 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/data.R -\docType{data} -\name{ohno} -\alias{ohno} -\title{Tetraploidization in vertebrate evolution} -\format{ -A data.frame object. -} -\usage{ -data(ohno) -} -\description{ -20,415 protein-coding genes in the human genome from Ensembl (v70) and information on the -retention of duplicates originating either from the two whole genome duplications at -the onset of vertebrates (‘ohnolog’) or from subsequent small scale duplications (‘SSD’) -as well as copy number variants (‘CNV’). -} -\references{ -Verny et al., PLoS Comp. Bio. 2017. -} -\keyword{data} -\keyword{datasets} diff --git a/man/ohno_stateOrder.Rd b/man/ohno_stateOrder.Rd deleted file mode 100644 index 5cbdebb..0000000 --- a/man/ohno_stateOrder.Rd +++ /dev/null @@ -1,23 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/data.R -\docType{data} -\name{ohno_stateOrder} -\alias{ohno_stateOrder} -\title{Tetraploidization in vertebrate evolution} -\format{ -A data.frame object. 
-} -\usage{ -data(ohno_stateOrder) -} -\description{ -20,415 protein-coding genes in the human genome from Ensembl (v70) and information on the -retention of duplicates originating either from the two whole genome duplications at -the onset of vertebrates (‘ohnolog’) or from subsequent small scale duplications (‘SSD’) -as well as copy number variants (‘CNV’), category order. -} -\references{ -Verny et al., PLoS Comp. Bio. 2017. -} -\keyword{data} -\keyword{datasets} diff --git a/man/plot.miic.Rd b/man/plot.miic.Rd index 80db754..4367c0f 100644 --- a/man/plot.miic.Rd +++ b/man/plot.miic.Rd @@ -7,29 +7,35 @@ \method{plot}{miic}(x, method = "igraph", pcor_palette = NULL, ...) } \arguments{ -\item{x}{[a miic graph object] -The graph object returned by \code{\link{miic}}.} +\item{x}{[a miic object, required] + +The object returned by the \code{\link{miic}} execution.} -\item{method}{A string representing the plotting method. Default to "igraph". -Currently only "igraph" is supported.} +\item{method}{[a string, optional, default value "igraph"] + +The plotting method, currently only "igraph" is supported.} -\item{pcor_palette}{Optional. The color palette used to represent the partial -correlations (the color of the edges). See \code{\link{miic.export}} for details.} +\item{pcor_palette}{[a color palette, optional, default value +grDevices::colorRampPalette(c("blue", "darkgrey", "red"))] + +Used to represent the partial correlations (the color of the edges). +The palette must be able to handle 201 shades to cover the correlation range +from -100 to +100.} + +\item{\dots}{Additional plotting parameters. See the corresponding plot +function for the complete list. -\item{\dots}{Additional plotting parameters. See the corresponding plot function -for the complete list.
For igraph, see \code{\link[igraph]{igraph.plotting}}.} } \description{ -This function calls \code{\link{miic.export}} to build a +This function calls \code{\link{export}} to build a plottable object from the result returned by \code{\link{miic}} and plot it. } \details{ -See the documentation of \code{\link{miic.export}} for further +See the documentation of \code{\link{export}} for further details. } \seealso{ -\code{\link{miic.export}} for generic exports, -\code{\link{getIgraph}} for igraph export, +\code{\link{export}} for graphical exports, \code{\link[igraph]{igraph.plotting}} } diff --git a/man/plot.tmiic.Rd b/man/plot.tmiic.Rd index 3943a87..9442caa 100644 --- a/man/plot.tmiic.Rd +++ b/man/plot.tmiic.Rd @@ -6,20 +6,33 @@ \usage{ \method{plot}{tmiic}( x, + method = "igraph", + pcor_palette = NULL, display = "compact", show_self_loops = TRUE, positioning_for_grid = "greedy", orientation_for_grid = "L", - method = "igraph", - pcor_palette = NULL, ... ) } \arguments{ -\item{x}{[a tmiic graph object] -The graph object returned by \code{\link{miic}} in temporal mode} +\item{x}{[a tmiic object, required] + +The object returned by \code{\link{miic}} in temporal mode.} + +\item{method}{[a string, optional, default value "igraph"] + +The plotting method, currently only "igraph" is supported.} + +\item{pcor_palette}{[a color palette, optional, default value +grDevices::colorRampPalette(c("blue", "darkgrey", "red"))] + +Used to represent the partial correlations (the color of the edges). +The palette must be able to handle 201 shades to cover the correlation range +from -100 to +100.} + +\item{display}{[a string, optional, default value "compact"] -\item{display}{[a string]. Optional, default value "compact".
Possible values are \emph{"raw"}, \emph{"lagged"}, \emph{"compact"}, \emph{"combine"}, \emph{"unique"}, \emph{"drop"}: \itemize{ @@ -35,50 +48,47 @@ Possible values are \emph{"raw"}, \emph{"lagged"}, \emph{"compact"}, and edges are converted into a flattened version to produce a compact view of the temporal network whilst still presenting all the information in the plotting.\cr - i.e.: X_lag1->Y_lag0, X_lag2<-Y_lag0 become respectively X->Y lag=1, + e.g. X_lag1->Y_lag0, X_lag2<-Y_lag0 become respectively X->Y lag=1, X<-Y lag=2. \item When \emph{display} = \emph{"combine"}, prior to the plotting, - a preprocessing will be applied to kept only one edge - per couple of nodes. The info_shifted will be the highest one + a pre-processing will be applied to keep only one edge + per pair of nodes. The info_shifted will be the highest one of the summarized edges whilst the lag and orientation of the summarized edge will be an aggregation.\cr - i.e.: X_lag1->Y_lag0, X_lag2<-Y_lag0 will become X<->Y lag=1,2 with + e.g. X_lag1->Y_lag0, X_lag2<-Y_lag0 will become X<->Y lag=1,2 with the info_shifted of X_lag1->Y_lag0 if info_shifted of X_lag1->Y_lag0 > X_lag2<-Y_lag0. \item When \emph{display} = \emph{"unique"}, prior to the plotting, - a preprocessing will be applied to kept only the edges having the - highest info_shifted for a couple of nodes. + a pre-processing will be applied to keep only the edges having the + highest info_shifted for a pair of nodes. If several edges between the sames nodes have the same info_shifted, then the edge kept is the one with the minimum lag.\cr - i.e.: X_lag1->Y_lag0, X_lag2<-Y_lag0 with info_shifted of + e.g. X_lag1->Y_lag0, X_lag2<-Y_lag0 with info_shifted of X_lag1->Y_lag0 > X_lag2<-Y_lag0 become X->Y lag=1. -\item When \emph{display} = \emph{"drop"}, prior to the plotting, - a preprocessing will be applied to kept only the edges having the - highest info_shifted for a couple of nodes.
- If several edges between the sames nodes have the same - info_shifted, then the edge kept is the one with the minimum lag.\cr - i.e. : X_lag1->Y_lag0, X_lag2<-Y_lag0 with info_shifted of - X_lag1->Y_lag0 > X_lag2<-Y_lag0 become X->Y. - The lag information is dropped during the preprocessing and - will not be displayed on the final plotting. +\item When \emph{display} = \emph{"drop"}, the same pre-processing + as \emph{"unique"} will be applied, then the lag information will be + dropped and will not be displayed on the final plotting. }} -\item{show_self_loops}{[a boolean] Optional, TRUE by default. -When TRUE, the edges like X_lag0-X_lag1 are included in the iGraph object. -When FALSE, only edges having different nodes are present in the iGraph +\item{show_self_loops}{[a boolean, optional, TRUE by default] + +When TRUE, the lagged edges starting and ending on the same node +are included in the igraph object. +When FALSE, only edges having different nodes are present in the igraph object.} -\item{positioning_for_grid}{[a string] Optional, "greedy" by default. -Used only when the display is "raw" or "lagged and no layout is supplied. -Possible values are \emph{"none"}, \emph{"alphabetical"}, \emph{"layers"} +\item{positioning_for_grid}{[a string, optional, "greedy" by default] + +Used only when the display is "raw" or "lagged" and no layout is supplied. 
+Possible values are \emph{"none"}, \emph{"alphabetical"}, \emph{"layers"}, \emph{"greedy"} and \emph{"sugiyama"} \itemize{ \item When \emph{positioning_for_grid} = \emph{"none"} The nodes are positioned as they appear in the miic result \item When \emph{positioning_for_grid} = \emph{"alphabetical"} - The nodes are positioned alphabeticaly in ascending order + The nodes are positioned alphabetically in ascending order \item When \emph{positioning_for_grid} = \emph{"layers"} - The nodes with the less lags wil be placed on the exteriors + The nodes with the fewest lags will be placed on the exteriors while the nodes having the most lags are in the center \item When \emph{positioning_for_grid} = \emph{"greedy"} A greedy algorithm will be used to placed the nodes in a way minimizing
+This function calls \code{\link{export}} to build a plottable +object from the result returned by \code{\link{miic}} in temporal mode +and plot it. } \details{ -See the documentation of \code{\link{tmiic.export}} for further +See the documentation of \code{\link{export}} for further details. } \examples{ @@ -120,26 +125,26 @@ library(miic) #' # EXAMPLE COVID CASES (time series demo) data(covidCases) # execute MIIC (reconstruct graph in temporal mode) -tmiic_res <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, movavg = 14) +tmiic_obj <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, mov_avg = 14) # to plot the default compact graph if(require(igraph)) { - plot(tmiic_res) + plot(tmiic_obj) } -# to plot the raw temporal network Using igraph +# to plot the raw temporal network if(require(igraph)) { - plot(tmiic_res, display="raw") + plot(tmiic_obj, display="raw") } -# to plot the full temporal network Using igraph +# to plot the full temporal network if(require(igraph)) { - plot(tmiic_res, display="lagged") + plot(tmiic_obj, display="lagged") } } } \seealso{ -\code{\link{tmiic.export}} for generic exports, +\code{\link{export}} for graphical exports, \code{\link[igraph]{igraph.plotting}} } diff --git a/man/tmiic.export.Rd b/man/tmiic.export.Rd deleted file mode 100644 index 5872415..0000000 --- a/man/tmiic.export.Rd +++ /dev/null @@ -1,115 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/tmiic.plot.R -\name{tmiic.export} -\alias{tmiic.export} -\title{Export temporal miic (tmiic) result to different plotting methods} -\usage{ -tmiic.export( - tmiic_res, - display = "compact", - show_self_loops = TRUE, - method = "igraph", - pcor_palette = NULL -) -} -\arguments{ -\item{tmiic_res}{[a tmiic object] -The object returned by the \code{\link{miic}} execution in temporal mode.} - -\item{display}{[a string]. Optional, default value "compact". 
-Possible values are \emph{"raw"}, \emph{"lagged"}, \emph{"compact"}, -\emph{"combine"}, \emph{"unique"}, \emph{"drop"}: -\itemize{ -\item When \emph{display} = \emph{"raw"}, the export function will - use the tmiic graph object as it, leading to the return of a lagged - graph. -\item When \emph{display} = \emph{"lagged"}, the export function will - repeat the edges over history assuming stationarity and return a lagged - graph. -\item When \emph{display} = \emph{"compact"}, the default, nodes - and edges are converted into a flattened version to produce a compact - view of the temporal network whilst still presenting all the information - in the export.\cr - i.e.: X_lag1->Y_lag0, X_lag2<-Y_lag0 become respectively X->Y lag=1, - X<-Y lag=2. -\item When \emph{display} = \emph{"combine"}, prior to the export, - a pre-processing will be applied to kept only one edge - per couple of nodes. The info_shifted will be the highest one - of the summarized edges whilst the lag and orientation of the - summarized edge will be an aggregation.\cr - i.e.: X_lag2->Y_lag0, X_lag0<-Y_lag1 will become X<->Y lag=1-2 with - the info_shifted of X_lag2->Y_lag0 if info_shifted of - X_lag2->Y_lag0 > X_lag0<-Y_lag1. -\item When \emph{display} = \emph{"unique"}, prior to the export, - a pre-processing will be applied to kept only the edges having the - highest info_shifted for a couple of nodes. - If several edges between the sames nodes have the same - info_shifted, then the edge kept is the one with the minimum lag.\cr - i.e.: X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of - X_lag1->Y_lag0 > X_lag0<-Y_lag2 become X->Y lag=1. -\item When \emph{display} = \emph{"drop"}, prior to the export, - a pre-processing will be applied to kept only the edges having the - highest info_shifted for a couple of nodes. - If several edges between the sames nodes have the same - info_shifted, then the edge kept is the one with the minimum lag.\cr - i.e. 
: X_lag1->Y_lag0, X_lag0<-Y_lag2 with info_shifted of - X_lag1->Y_lag0 > X_lag0<-Y_lag2 become X->Y. - The lag information is dropped during the preprocessing and - will not be exported. -}} - -\item{show_self_loops}{[a boolean] Optional, TRUE by default. -When TRUE, the edges like X_lag0-X_lag1 are exported. -When FALSE, only edges having different nodes are exported.} - -\item{method}{A string representing the plotting method. -Currently only "igraph" is supported.} - -\item{pcor_palette}{Optional. The color palette used to represent the partial -correlations (the color of the edges). See \code{\link{getIgraph}} for details.} -} -\value{ -A graph object adapted to the method. -} -\description{ -This function creates an object built from the result returned -by \code{\link{miic}} executed in temporal mode that is ready to be fed to -different plotting methods. -} -\examples{ -\donttest{ -library(miic) -data(covidCases) -# execute MIIC (reconstruct graph in temporal mode) -tmiic_res <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, movavg = 14) - -# Plot default compact temporal network Using igraph -if(require(igraph)) { -g = tmiic.export(tmiic_res, method="igraph") -plot(g) # Default visualisation, calls igraph::plot.igraph() - -# Plot raw temporal network Using igraph -g = tmiic.export(tmiic_res, display="raw", method="igraph") -plot(g) # Default visualisation, calls igraph::plot.igraph() - -# Plot full temporal network Using igraph -g = tmiic.export(tmiic_res, display="lagged", method="igraph") -plot(g) # Default visualisation, calls igraph::plot.igraph() - -# Specifying layout (see ?igraph::layout_) -l <- layout_on_grid(g, width = 5, height = 3, dim = 2) -plot(g, layout=l) - -# Override some graphical parameters -plot(g, edge.arrow.size = 0.75) -plot(g, vertex.shape="none", edge.color="gray85", vertex.label.color="gray10") - -# For compact graphs, please be aware that the rendering of -# igraph::plot.igraph() is not optimal when the 
graph contains
-# multiple edges between the same nodes.
-# So the recommended way to plot a compact graph is to use tmiic plotting:
-plot(tmiic_res)
-}
-
-}
-}
diff --git a/man/miic.write.network.cytoscape.Rd b/man/writeCytoscapeNetwork.Rd
similarity index 75%
rename from man/miic.write.network.cytoscape.Rd
rename to man/writeCytoscapeNetwork.Rd
index c001fc2..fbcc519 100644
--- a/man/miic.write.network.cytoscape.Rd
+++ b/man/writeCytoscapeNetwork.Rd
@@ -1,13 +1,13 @@
 % Generated by roxygen2: do not edit by hand
 % Please edit documentation in R/write.cytoscape.R
-\name{miic.write.network.cytoscape}
-\alias{miic.write.network.cytoscape}
+\name{writeCytoscapeNetwork}
+\alias{writeCytoscapeNetwork}
 \title{GraphML converting function for miic graph}
 \usage{
-miic.write.network.cytoscape(g, file, layout = NULL)
+writeCytoscapeNetwork(miic_obj, file, layout = NULL)
 }
 \arguments{
-\item{g}{The graph object returned by \link[=miic]{miic}.}
+\item{miic_obj}{A miic object. The object returned by the \code{\link{miic}} execution.}

 \item{file}{A string. Path to the output file containing file name without
 extension (.graphml will be appended).}

@@ -17,6 +17,9 @@ coordinate \code{x} and \code{y} for each node. The optional first column can co
 node names. If node names are not given, the order of the input file will be
 assigned to the list of positions.}
 }
+\value{
+None
+}
 \description{
 Convert miic graph to \href{http://graphml.graphdrawing.org/}{GraphML format}.
} diff --git a/man/miic.write.style.cytoscape.Rd b/man/writeCytoscapeStyle.Rd similarity index 53% rename from man/miic.write.style.cytoscape.Rd rename to man/writeCytoscapeStyle.Rd index 8e74867..5927284 100644 --- a/man/miic.write.style.cytoscape.Rd +++ b/man/writeCytoscapeStyle.Rd @@ -1,16 +1,21 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/write.style.R -\name{miic.write.style.cytoscape} -\alias{miic.write.style.cytoscape} +\name{writeCytoscapeStyle} +\alias{writeCytoscapeStyle} \title{Style writing function for the miic network} \usage{ -miic.write.style.cytoscape(file) +writeCytoscapeStyle(file) } \arguments{ -\item{file}{[a string] The file path of the output file (containing the file name without extension).} +\item{file}{[a string] The file path of the output file (containing the +file name without extension).} +} +\value{ +None } \description{ -This function writes the miic style for a correct visualization using the cytoscape tool (http://www.cytoscape.org/). +This function writes the miic style for a correct +visualization using the cytoscape tool (http://www.cytoscape.org/). } \details{ The style is written in the xml file format. diff --git a/src/computation_continuous.cpp b/src/computation_continuous.cpp index 010fc9a..47f2f40 100755 --- a/src/computation_continuous.cpp +++ b/src/computation_continuous.cpp @@ -267,7 +267,7 @@ void reconstructCutCoarse (const TempVector& cuts_idx, // This speeds up significantly the dynamic programming at the cost of finding // approximated solutions. // - cache: shared pointer, cache area storing previous calculations -// - cplx: int, is the choice of complexity : 0 for simple MDL (product of +// - cplx: int, is the choice of complexity : 0 for simple BIC (product of // all observed values) and 1 for NML with stochastic complexity and a // combinatorial term with the previous number of levels. 
// @@ -374,12 +374,12 @@ void optimizeCutPoints (const TempGrid2d::ConstRow& data_ranked_target, TempVector cum_counts_target (nb_factors_target); TempVector cum_counts_joint (nb_factors_joint); // - // Stochastic complexity used for the simple MDL cost. + // Stochastic complexity used for the simple BIC cost. // Its value is 0.5 * number_of_other_levels where number of other levels is // the product of number of unique observed levels of all variables except // the variable being optimized. // - double k_mdl = 0.5 * (nb_factors_joint_for_cplx - 1) * nb_factors_target; + double k_bic = 0.5 * (nb_factors_joint_for_cplx - 1) * nb_factors_target; for (int cum_bin_idx = 0; cum_bin_idx < n_cuts_max; ++cum_bin_idx) { // Compute info from the first bin to the "cum_bin_idx" bin @@ -421,7 +421,7 @@ void optimizeCutPoints (const TempGrid2d::ConstRow& data_ranked_target, H_cum_bin_target -= cache->getH(weighted_count); if (cplx == 0 && cum_counts_target[level] > 0) - Hk_cum_bin_target -= k_mdl * cache->getLog(n_samples); + Hk_cum_bin_target -= k_bic * cache->getLog(n_samples); else if (cplx == 1) Hk_cum_bin_target -= cache->getLogC(weighted_count, nb_factors_joint_for_cplx); } @@ -498,7 +498,7 @@ void optimizeCutPoints (const TempGrid2d::ConstRow& data_ranked_target, H_cut_to_cum_target -= cache->getH(weighted_count); if (cplx == 0 && cut_counts_target[level] > 0) - Hk_cut_to_cum_target -= k_mdl * cache->getLog(n_samples); + Hk_cut_to_cum_target -= k_bic * cache->getLog(n_samples); else if (cplx == 1) Hk_cut_to_cum_target -= cache->getLogC(weighted_count, nb_factors_joint_for_cplx); } @@ -545,8 +545,8 @@ void optimizeCutPoints (const TempGrid2d::ConstRow& data_ranked_target, //------------------------------------------------------------------------------ // Initialize Ik(x,y) with equal bin discretization // Repeat -// optimize on x Ik(x,y): Hx - Hxy - kmdl -// optimize on y Ik(x,y): Hy - Hxy - kmdl +// optimize on x Ik(x,y): Hx - Hxy - k_bic +// optimize on y Ik(x,y): Hy - 
Hxy - k_bic // Until convergence //------------------------------------------------------------------------------ // Inputs: diff --git a/src/computation_discrete.cpp b/src/computation_discrete.cpp index 614364d..c5ec297 100644 --- a/src/computation_discrete.cpp +++ b/src/computation_discrete.cpp @@ -9,7 +9,7 @@ #include "mutual_information.h" #include "structure.h" -constexpr int MDL = 0; +constexpr int BIC = 0; namespace miic { namespace computation { @@ -85,7 +85,7 @@ InfoBlock computeCondMutualInfoDiscrete(const TempGrid2d& data, // Conclude on current count if (Nuy > 0) { Huy -= Nuy * log(Nuy); - if (cplx != MDL) { + if (cplx != BIC) { logC_uy_x += cache->getLogC(lround(Nuy), rx); } Nuy = 0; @@ -99,7 +99,7 @@ InfoBlock computeCondMutualInfoDiscrete(const TempGrid2d& data, for (auto& Nxu : Nux_list) { if (Nxu > 0) { Hux -= Nxu * log(Nxu); - if (cplx != MDL) { + if (cplx != BIC) { logC_ux_y += cache->getLogC(lround(Nxu), ry); } Nxu = 0; // reset counter @@ -107,7 +107,7 @@ InfoBlock computeCondMutualInfoDiscrete(const TempGrid2d& data, } if (Nu > 0) { Hu -= Nu * log(Nu); - if (cplx != MDL) { + if (cplx != BIC) { auto Nu_long = lround(Nu); logC_u_x += cache->getLogC(Nu_long, rx); logC_u_y += cache->getLogC(Nu_long, ry); @@ -116,7 +116,7 @@ InfoBlock computeCondMutualInfoDiscrete(const TempGrid2d& data, } } - if (cplx == MDL) { + if (cplx == BIC) { double logN = log(N_total); logC_ux_y = 0.5 * (ry - 1) * (rx * ru - 1) * logN; logC_uy_x = 0.5 * (rx - 1) * (ry * ru - 1) * logN; @@ -214,7 +214,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d& data, // Conclude on current count if (Nuy > 0) { Huy -= Nuy * log(Nuy); - if (cplx != MDL) { + if (cplx != BIC) { auto Nuy_long = lround(Nuy); logC_uy_x += cache->getLogC(Nuy_long, rx); logC_uy_z += cache->getLogC(Nuy_long, rz); @@ -222,7 +222,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d& data, for (auto& Nzuy : Nzuy_list) { if (Nzuy > 0) { Hzuy -= Nzuy * log(Nzuy); - if (cplx 
!= MDL) { + if (cplx != BIC) { logC_zuy_x += cache->getLogC(lround(Nzuy), rx); } Nzuy = 0; @@ -238,7 +238,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d& data, if (Nu == 0) continue; Hu -= Nu * log(Nu); - if (cplx != MDL) { + if (cplx != BIC) { auto Nu_long = lround(Nu); logC_u_x += cache->getLogC(Nu_long, rx); logC_u_y += cache->getLogC(Nu_long, ry); @@ -249,7 +249,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d& data, for (auto& Nzu : Nzu_list) { if (Nzu > 0) { Hzu -= Nzu * log(Nzu); - if (cplx != MDL) { + if (cplx != BIC) { auto Nzu_long = lround(Nzu); logC_zu_x += cache->getLogC(Nzu_long, rx); logC_zu_y += cache->getLogC(Nzu_long, ry); @@ -263,7 +263,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d& data, if (Nux == 0) continue; Hux -= Nux * log(Nux); - if (cplx != MDL) { + if (cplx != BIC) { auto Nux_long = lround(Nux); logC_ux_y += cache->getLogC(Nux_long, ry); logC_ux_z += cache->getLogC(Nux_long, rz); @@ -275,7 +275,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d& data, if (Nzux == 0) continue; Hzux -= Nzux * log(Nzux); - if (cplx != MDL) { + if (cplx != BIC) { logC_zux_y += cache->getLogC(lround(Nzux), ry); } Nzux_list(j, l) = 0; @@ -284,7 +284,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d& data, } // check maximum mutual infos - cplx terms - if (cplx == MDL) { + if (cplx == BIC) { double logN = log(N_total); logC_ux_y = 0.5 * (ry - 1) * (rx * ru - 1) * logN; logC_uy_x = 0.5 * (rx - 1) * (ry * ru - 1) * logN; diff --git a/src/environment.h b/src/environment.h index f969631..2fd6eee 100644 --- a/src/environment.h +++ b/src/environment.h @@ -46,7 +46,7 @@ struct Environment { vector nodes; Grid2d edges; bool orientation = false; - double ori_proba_ratio = 1; + double ort_proba_ratio = 1; bool propagation = false; // Level of consistency required for the graph // 0: no consistency requirement @@ -65,7 +65,7 @@ struct Environment { bool 
latent_orientation = false;
  // Whether or not to do the MAR (Missing at random) test using KL-divergence
  bool test_mar = false;
-  // Complexity mode. 0: mdl 1: nml
+  // Complexity mode. 0: bic (formerly mdl) 1: nml
  int cplx = 1;
  // List of ids of edge whose status is not yet determined
  vector unsettled_list;
diff --git a/src/get_information.cpp b/src/get_information.cpp
index d151487..d36a484 100644
--- a/src/get_information.cpp
+++ b/src/get_information.cpp
@@ -82,9 +82,9 @@ Info3PointBlock getInfo3Point(
       environment.data_double, X, Y, ui_list, environment.levels,
       environment.is_continuous, n_samples_non_na_z, levels_red,
       sample_is_not_NA, environment.noise_vec);
-  double cplxMdl = environment.cache.cterm->getLog(n_samples_non_na_z);
+  double cplxBic = environment.cache.cterm->getLog(n_samples_non_na_z);

-  if ((kldiv - cplxMdl) > 0) {
+  if ((kldiv - cplxBic) > 0) {
    // The sample is not representative of the population, hence for 3-point
    // information, we cannot draw conclusions on the unshielded triple (X, Z,
    // Y), return 0; For contributing score, Z is not a good candidate.
diff --git a/src/mdl_pair_discretize.cpp b/src/mdl_pair_discretize.cpp index 3cfd36a..4cbf854 100755 --- a/src/mdl_pair_discretize.cpp +++ b/src/mdl_pair_discretize.cpp @@ -110,10 +110,10 @@ List miicRGetInfo3Point(List input_data, List arg_list) { auto res = getInfo3Point(environment, 0, 1, 2, ui_list); List result = List::create( - _["I3"] = res.Ixyz_ui, - _["I3k"] = res.Ixyz_ui - res.kxyz_ui, - _["I2"] = res.Ixy_ui, - _["I2k"] = res.Ixy_ui - res.kxy_ui); + _["i3"] = res.Ixyz_ui, + _["i3k"] = res.Ixyz_ui - res.kxyz_ui, + _["i2"] = res.Ixy_ui, + _["i2k"] = res.Ixy_ui - res.kxy_ui); return result; } diff --git a/src/mutual_information.h b/src/mutual_information.h index ed1eebd..aa33159 100644 --- a/src/mutual_information.h +++ b/src/mutual_information.h @@ -33,7 +33,7 @@ using std::lround; constexpr double kPrecision = 1.e-10; // rux: number of levels of each (joint) variable [x, u, ux] -// cplx 0: MDL, 1: NML +// cplx 0: BIC, 1: NML // flag (for cplx == 1 only) 0: mutual info, 1: conditional mutual info // When flag == 1 && cplx == 1, x and u are not symmetrical, x represents single // variable, whereas u represents joint variable (see def of cond mutual info) diff --git a/src/orientation.cpp b/src/orientation.cpp index 44475c7..07927cd 100644 --- a/src/orientation.cpp +++ b/src/orientation.cpp @@ -29,8 +29,8 @@ namespace { constexpr double kEpsI3 = 1.0e-10; -bool acceptProba(double proba, double ori_proba_ratio) { - return (1 - proba) / proba < ori_proba_ratio; +bool acceptProba(double proba, double ort_proba_ratio) { + return (1 - proba) / proba < ort_proba_ratio; } } // anonymous namespace @@ -39,10 +39,10 @@ bool acceptProba(double proba, double ori_proba_ratio) { // x2y: probability that there is an arrow from node x to y void updateAdj(Environment& env, int x, int y, double y2x, double x2y) { env.edges(x, y).proba_head = x2y; - if (acceptProba(x2y, env.ori_proba_ratio)) + if (acceptProba(x2y, env.ort_proba_ratio)) env.edges(x, y).status = 2; env.edges(y, 
x).proba_head = y2x; - if (acceptProba(y2x, env.ori_proba_ratio)) + if (acceptProba(y2x, env.ort_proba_ratio)) env.edges(y, x).status = 2; } @@ -314,7 +314,7 @@ vector> orientationProbability(Environment& environment) { // Write output vector> orientations{{"source1", "p1", "p2", "target", "p3", - "p4", "source2", "NI3", "Conflict"}}; + "p4", "source2", "ni3", "conflict"}}; for (size_t i = 0; i < triples.size(); i++) { const auto& triple = triples[i]; const auto& probas = probas_list[i]; diff --git a/src/r_cpp_interface.cpp b/src/r_cpp_interface.cpp index 59d3e87..304aef5 100644 --- a/src/r_cpp_interface.cpp +++ b/src/r_cpp_interface.cpp @@ -49,8 +49,8 @@ void setEnvironmentFromR(const Rcpp::List& input_data, if (arg_list.containsElementNamed("orientation")) environment.orientation = as(arg_list["orientation"]); - if (arg_list.containsElementNamed("ori_proba_ratio")) - environment.ori_proba_ratio = as(arg_list["ori_proba_ratio"]); + if (arg_list.containsElementNamed("ort_proba_ratio")) + environment.ort_proba_ratio = as(arg_list["ort_proba_ratio"]); if (arg_list.containsElementNamed("propagation")) environment.propagation = as(arg_list["propagation"]); @@ -79,7 +79,7 @@ void setEnvironmentFromR(const Rcpp::List& input_data, environment.test_mar = as(arg_list["test_mar"]); if (arg_list.containsElementNamed("cplx")) { - if (as(arg_list["cplx"]).compare("mdl") == 0) + if (as(arg_list["cplx"]).compare("bic") == 0) environment.cplx = 0; } @@ -116,6 +116,7 @@ void setEnvironmentFromR(const Rcpp::List& input_data, environment.edges(i, j).status = 0; environment.edges(i, j).status_init = 0; environment.edges(i, j).status_prev = 0; + environment.edges(i, j).proba_head = -1; } } diff --git a/src/reconstruct.cpp b/src/reconstruct.cpp index fd1f9a7..b013b04 100644 --- a/src/reconstruct.cpp +++ b/src/reconstruct.cpp @@ -139,7 +139,7 @@ List reconstruct(List input_data, List arg_list) { _["proba_adj_matrix"] = getProbaAdjMatrix(environment.edges), _["edges"] = 
getEdgesInfoTable(environment.edges, environment.nodes), - _["orientations.prob"] = orientations, + _["triples"] = orientations, _["time"] = vector{time.init, time.iter, time.cut, time.ori, time.getTotal()}, _["interrupted"] = false); diff --git a/src/skeleton.cpp b/src/skeleton.cpp index 0968b6b..b274a6f 100644 --- a/src/skeleton.cpp +++ b/src/skeleton.cpp @@ -88,7 +88,10 @@ bool initializeSkeleton(Environment& environment) { edges(i, j).shared_info = std::make_shared(); edges(j, i).shared_info = edges(i, j).shared_info; - if (edges(i, j).status) initializeEdge(environment, i, j); + if (edges(i, j).status) + initializeEdge(environment, i, j); + else + edges(i, j).shared_info->connected = 0; } } } // omp parallel diff --git a/src/tmiic.cpp b/src/tmiic.cpp index 538ad2b..adfff10 100644 --- a/src/tmiic.cpp +++ b/src/tmiic.cpp @@ -1,9 +1,9 @@ //***************************************************************************** // Filename : tmiic.cpp Creation date: 07 may 2020 // -// Author : Franck SIMON -// // Description: Store functions for temporal mode of miic (tmiic) +// +// Author : Franck SIMON //***************************************************************************** //============================================================================= diff --git a/src/utilities.cpp b/src/utilities.cpp index decd607..66014de 100755 --- a/src/utilities.cpp +++ b/src/utilities.cpp @@ -266,9 +266,9 @@ vector> getEdgesInfoTable( std::sort(edge_list.begin(), edge_list.end()); vector> table; - table.emplace_back(std::initializer_list{"x", "y", "z.name", - "ai.vect", "raw_contributions", "contributions", "zi.vect", "Ixy", - "Ixy_ai", "cplx", "Rxyz_ai", "category", "Nxy_ai", "confidence"}); + table.emplace_back(std::initializer_list{"x", "y", "z_name", + "ai", "raw_contributions", "contributions", "zi", "i_xy", "i_xy_ai", + "cplx", "r_xyz_ai", "category", "n_xy", "n_xy_ai", "confidence"}); for (const auto& edge : edge_list) { auto info = edge.getEdge().shared_info; double 
confidence = -1; @@ -289,6 +289,7 @@ vector> getEdgesInfoTable( to_string(info->kxy_ui), to_string(info->Rxyz_ui), to_string(info->connected), + to_string(info->Nxy), to_string(info->Nxy_ui), to_string(confidence) });