Software | Diogo Ferrari

Python packages

causalinf: Causal Inference Collection in Python

UNDER DEVELOPMENT


causalinf is a Python module to facilitate causal inference by offering various submodules focused on different methodologies and identification strategies, including Difference-in-Differences (DiD), Regression Discontinuity Design (RDD), Instrumental Variables (IV), Mediation Analysis (MA), Matching Methods (MM), Selection on Observables (SoO), and Structural Causal Models (SCM). Each method encompasses essential functionalities such as (1) evaluating causal identification assumptions, (2) estimating causal effects, (3) generating summary result tables and plots, (4) performing sensitivity analyses, and (5) producing model diagnostics.
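To make one of the identification strategies listed above concrete, here is a minimal sketch of the canonical two-group, two-period Difference-in-Differences estimate in plain Python. It does not use causalinf and makes no claim about its API: the function name `did_2x2` and the toy data are assumptions for illustration only, and the estimate has a causal interpretation only under the parallel-trends assumption.

```python
from statistics import mean


def did_2x2(rows):
    """Canonical 2x2 DiD from (treated, post, outcome) records.

    Hypothetical helper for illustration; not part of causalinf.
    """
    cells = {}
    for treated, post, outcome in rows:
        cells.setdefault((treated, post), []).append(outcome)
    m = {cell: mean(values) for cell, values in cells.items()}
    # Change over time in the treated group minus change in the control group.
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change


# Toy data (made up): (treated, post, outcome)
rows = [
    (True, False, 10.0), (True, False, 12.0),
    (True, True, 18.0), (True, True, 20.0),
    (False, False, 8.0), (False, False, 10.0),
    (False, True, 11.0), (False, True, 13.0),
]
did_estimate = did_2x2(rows)
```

In this toy data the treated group changes by 8 and the control group by 3, so the sketch attributes the remaining difference to the treatment.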

TidyPolars4sci: Combining Polars and Tidyverse for Python


TidyPolars4sci provides functions that match R’s Tidyverse functions as closely as possible for manipulating data frames and conducting data analysis in Python, using the fast Polars library as the backend.

Fast: Uses Polars as the backend for data manipulation, so it inherits many of the advantages of Polars: speed, parallel execution, GPU support, etc.

Tidy: Keeps the data in tidy (rectangular table) format, with no multi-indexes.

Syntax: While Polars is fast, its syntax is not the most intuitive. The package provides frontend methods that match R’s Tidyverse functions, making it easier for users familiar with that ecosystem to transition to this library.

Extended functionalities: Polars is extended with many functionalities that facilitate data manipulation and analysis.

Research: The package is designed to facilitate data analysis and reporting in academic research, making it easy to produce tables in the formats common in academic publications. Output formats include LaTeX, Excel, and text-processing formats.

R packages

Hierarchical Dirichlet Process Generalized Linear Models (hdpGLM)


	The package implements the hierarchical Dirichlet process Generalized Linear Models proposed in Ferrari (2020), "Modeling Context-Dependent Latent Effect Heterogeneity", which extend the non-parametric Bayesian models proposed in Mukhopadhyay and Gelfand (1997), Hannah (2011), and Heckman and Vytlacil (2007) to deal with context-dependent cases. The package can be used to estimate latent heterogeneity in the marginal effect of GLM linear coefficients, to cluster data points based on that latent heterogeneity, and to investigate the occurrence of Simpson’s Paradox due to latent or omitted features. It can also be used with hierarchical data to estimate the effect of upper-level features (e.g., levels of inequality, regional economic decline, institutional features) on the latent heterogeneity in the effect of lower-level covariates (e.g., income, gender, party identification) on outcome variables (e.g., policy preferences, support for populism, vote intention). Such heterogeneity can be caused by omitted interactions in the model specification.
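As a small illustration of the Simpson’s Paradox mentioned above, the following sketch compares a pooled least-squares slope with the slopes inside two groups. It is not hdpGLM’s method: here the grouping is observed and fixed by hand, whereas the package is described as inferring the latent clusters. The helper `ols_slope` and the toy numbers are assumptions made up for this example.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Toy data (made up): within each group y falls as x rises,
# but group "b" sits higher on both x and y.
groups = {
    "a": ([1.0, 2.0, 3.0], [6.0, 5.0, 4.0]),
    "b": ([6.0, 7.0, 8.0], [11.0, 10.0, 9.0]),
}

within_slopes = {g: ols_slope(x, y) for g, (x, y) in groups.items()}

pooled_x = [xi for x, _ in groups.values() for xi in x]
pooled_y = [yi for _, y in groups.values() for yi in y]
pooled_slope = ols_slope(pooled_x, pooled_y)
```

Ignoring the group membership flips the sign of the estimated effect: the pooled slope is positive even though the slope is negative within each group.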

Identification analysis and structural causal model estimation in R (idar)


	The package implements identification analysis and structural causal model estimation in R. The software is particularly useful when the analysis relies on selection on observables for causal inference, making it easy to check if a causal effect is identifiable for any given assumption about the DGP encoded in a DAG. It provides an easy-to-use parametric estimation procedure using a linear structural equations model if a causal effect is identifiable. More specifically, it provides an end-to-end estimation of structural causal models (SCM), which includes specification of the data generating process (DGP) using directed acyclic graphs (DAGs), identification analysis, selection of adjustment variables (selection on observables), estimation of causal effects, and computation of numeric and graphical summaries of the estimation results.
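To illustrate what checking identification against a DAG can look like, here is a minimal Python sketch of the backdoor criterion, one standard sufficient condition for identification by covariate adjustment. It is not idar’s API or algorithm (the package itself is written in R); all function names are hypothetical, and the path enumeration is only practical for small graphs.

```python
def descendants(edges, node):
    """All nodes reachable from `node` along directed edges (parent, child)."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def simple_paths(edges, start, end):
    """All simple paths between start and end, ignoring edge direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    paths = []

    def walk(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                walk(path + [nxt])

    walk([start])
    return paths


def is_blocked(edges, path, z):
    """True if the adjustment set z blocks the path (d-separation rules)."""
    edge_set = set(edges)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edge_set and (nxt, node) in edge_set
        if collider:
            # A collider blocks unless it or one of its descendants is in z.
            if node not in z and not (descendants(edges, node) & z):
                return True
        elif node in z:
            # A chain or fork node blocks when it is adjusted for.
            return True
    return False


def satisfies_backdoor(edges, x, y, z):
    """Check the backdoor criterion for adjustment set z relative to (x, y)."""
    z = set(z)
    if z & descendants(edges, x):
        return False
    edge_set = set(edges)
    for path in simple_paths(edges, x, y):
        starts_with_arrow_into_x = (path[1], path[0]) in edge_set
        if starts_with_arrow_into_x and not is_blocked(edges, path, z):
            return False
    return True


# Confounding: Z -> X, Z -> Y, X -> Y
confounded = [("Z", "X"), ("Z", "Y"), ("X", "Y")]
# M-bias: X <- A -> M <- B -> Y, X -> Y
m_bias = [("A", "X"), ("A", "M"), ("B", "M"), ("B", "Y"), ("X", "Y")]
```

In the confounded graph, adjusting for `Z` satisfies the criterion while the empty set does not. In the M-bias graph the opposite holds for `M`: the empty set is valid, and adjusting for the collider `M` opens the backdoor path.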

Cluster Estimated Standard Errors in R (ceser)


	(with John Jackson) The package implements the Cluster Estimated Standard Errors (CESE) method proposed by Jackson (2019) to compute clustered standard errors of linear coefficients in regression models with grouped data. CESE produces more conservative confidence intervals, outperforms the classical clustered robust standard error (CRSE) method in various ways, and avoids the downward bias of CRSE, which underestimates the clustered standard errors. (See Jackson, J. (2019), Corrected standard errors with clustered data, Political Analysis.)
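For context on the baseline that CESE is compared against, the sketch below computes the classical cluster-robust (CRSE) standard error of the slope in a simple regression, using the uncorrected "CR0" sandwich formula. It does not implement CESE and is unrelated to the ceser API; the function name and the toy data are assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt


def slope_with_cluster_se(x, y, cluster):
    """OLS slope of y on x with classical and cluster-robust (CR0) SEs.

    Hypothetical helper for illustration; this is CRSE, not CESE.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    x_dev = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in x_dev)
    slope = sum(d * (yi - y_bar) for d, yi in zip(x_dev, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical variance: homoskedastic, independent errors.
    classical_var = sum(e * e for e in resid) / (n - 2) / sxx

    # CR0 sandwich: sum the scores x_dev * resid within each cluster,
    # then square the cluster totals.
    scores = defaultdict(float)
    for d, e, g in zip(x_dev, resid, cluster):
        scores[g] += d * e
    cluster_var = sum(s * s for s in scores.values()) / sxx ** 2

    return slope, sqrt(classical_var), sqrt(cluster_var)


# Toy grouped data (made up).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 2.5, 2.0, 4.5, 4.0, 6.5]
cluster = ["a", "a", "b", "b", "c", "c"]
slope, se_classical, se_cluster = slope_with_cluster_se(x, y, cluster)
```

No small-sample correction is applied here. With few clusters this plain formula tends to understate uncertainty, which is the kind of downward bias the paragraph above refers to.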

Occupation and Class Scheme Classification in R (occupar)


	The package occupar (Occupation Classification in R) provides: (1) a handful of functions to convert between different versions of the International Standard Classification of Occupations (ISCO): ISCO-68, ISCO-88, and ISCO-08; (2) a set of functions to compute class schemes (EGP, ISEI, ESeC, etc.) based on ISCO. The package benefited from Harry Ganzeboom’s work on ISCO and class schemes.
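ISCO codes are hierarchical: the first digit of a four-digit code gives its major group. The sketch below uses that structure to label ISCO-08 codes in Python. It is not occupar’s API and does not convert between ISCO versions, which requires the official correspondence tables; the function name is hypothetical.

```python
# ISCO-08 major groups, keyed by the first digit of the four-digit code.
ISCO08_MAJOR_GROUPS = {
    "1": "Managers",
    "2": "Professionals",
    "3": "Technicians and associate professionals",
    "4": "Clerical support workers",
    "5": "Service and sales workers",
    "6": "Skilled agricultural, forestry and fishery workers",
    "7": "Craft and related trades workers",
    "8": "Plant and machine operators, and assemblers",
    "9": "Elementary occupations",
    "0": "Armed forces occupations",
}


def isco08_major_group(code):
    """Return (digit, label) of the major group for a 4-digit ISCO-08 code.

    Hypothetical helper for illustration; not part of occupar.
    Integer codes are left-padded with zeros, so 110 is read as "0110".
    """
    text = str(code).zfill(4)
    if len(text) != 4 or not text.isdigit():
        raise ValueError(f"not a four-digit ISCO-08 code: {code!r}")
    return text[0], ISCO08_MAJOR_GROUPS[text[0]]


professional = isco08_major_group(2310)
armed_forces = isco08_major_group(110)
```

Class schemes built on ISCO typically need the full four-digit code plus further information such as employment status, so this first-digit lookup is only the coarsest level of the classification.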

Election Forensics Package (eforensics)


	(with Walter Mebane) The package can be used to estimate the probability of fraud in elections using finite mixture models. (Supported by NSF grant SES 1523355.)

Exploratory Data Analysis in R (edar)


	A package that facilitates exploratory data analysis and visualization of model results, aligned with the tidyverse and its pipe-based coding philosophy.