Data Science & Bioinformatics Workflow

This repository provides a three-tier architecture for data science and bioinformatics work, balancing system stability with the agility that research requires.

1. Architecture Overview

Research workflows are managed through three layers of abstraction:

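| Layer           | Tool            | Purpose                                                      | Persistence           |
|-----------------|-----------------|--------------------------------------------------------------|-----------------------|
| Global Baseline | Nix Modules     | Your everyday “Gold Standard” tools (Tidyverse, PyTorch).    | Permanent             |
| Project Sandbox | Flake Templates | Reproducible environments for specific papers/experiments.   | Project-local         |
| Rapid Explorer  | Mamba / Pixi    | Quick, ad-hoc installation of “edge” academic tools.         | Persistent (~/.local) |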

2. The Global Environment

To enable the data science stack, set data-science.enable = true within each language module in your host configuration:

modules.dev.python = {
  enable = true;
  data-science.enable = true;
};
modules.dev.r = {
  enable = true;
  data-science.enable = true;
};

Python & R Suites

Defined within modules/dev/python.nix and modules/dev/r.nix respectively, these suites provide:

Base (always included when the language is enabled):

  • Python: ipython, ipykernel, pip (+ black, pylint, poetry, isort as user packages).
  • R: tidyverse, tidymodels, devtools, shiny, knitr, languageserver, IRkernel, radian.

Data Science extras (data-science.enable = true):

  • Python: pandas, polars, numpy, scipy, scikit-learn, pytorch, transformers, langchain, opencv4, matplotlib, seaborn, plotly, jupyterlab, biopython, scikit-bio, and more.
  • R: rmarkdown, readxl, arrow, caret, randomForest, xgboost, glmnet, plotly, leaflet, phyloseq, openxlsx, and more.

IDE & Server Integration

  • Positron: The IDE is automatically wrapped to find R and Python environments. When either language module is enabled, Positron’s wrapper sets R_HOME/R_LIBS_SITE/R_ENVIRON_USER (for R) and PYTHONHOME/PATH (for Python). You do not need to manually configure interpreter paths.
  • JupyterHub: Kernels for both Python 3 and R are pre-registered. Each kernel uses the language module’s package environment when available, or a minimal fallback (ipykernel / IRkernel) when the language module is disabled.
  • Shell: Tools are available globally in your terminal via python or radian.

3. Project Workflow (Reproducible Research)

For specific research projects, use the Data Science Template. This allows you to lock exact versions of libraries for a specific paper or analysis.

Quick Start

  1. Initialize Project:

    mkdir my-research && cd my-research
    nix flake init -t .#data-science

  2. Activate Environment:

    direnv allow

    The shell will display a welcome banner with software versions.

  3. Analysis:

    • Launch JupyterLab: jupyter lab
    • Launch R Console: radian

Customizing Packages

Edit the flake.nix in your project folder. Add packages to the python-env or r-env blocks. direnv will automatically reload the environment when you save the file.
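
As a rough sketch of what such an edit might look like: the example below assumes the template builds python-env with python3.withPackages and r-env with rWrapper.override, both standard nixpkgs helpers. The actual template layout may differ, and the nixpkgs input, the system value, and the package choices are illustrative assumptions rather than the template's real contents.

```nix
{
  description = "Hypothetical project flake (a sketch, not the actual template)";

  # Assumption: the project pins nixpkgs directly.
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux"; # assumption: adjust for your machine
      pkgs = nixpkgs.legacyPackages.${system};

      # Python interpreter bundled with the listed packages.
      python-env = pkgs.python3.withPackages (ps: with ps; [
        pandas
        scikit-learn
        # add further Python packages here
      ]);

      # R wrapper that exposes the listed packages to R.
      r-env = pkgs.rWrapper.override {
        packages = with pkgs.rPackages; [
          tidyverse
          # add further R packages here
        ];
      };
    in {
      devShells.${system}.default = pkgs.mkShell {
        packages = [ python-env r-env ];
      };
    };
}
```

With a layout like this, adding a library means adding one attribute name to the relevant list and saving the file.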


4. Manual Workflow (Agility)

When you need a tool that is not in Nixpkgs or changes too frequently for the Nix module system, use the provided FHS-wrapped package managers.

Mamba & Pixi

Both are configured with a robust FHS (Filesystem Hierarchy Standard) environment that includes:

  • Compression: zlib, bzip2, xz (essential for FASTQ/BAM files).
  • Plotting: libGL, fontconfig, freetype.
  • Runtimes: glibc, libuuid, expat.
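
For orientation, an FHS environment with these libraries could be expressed with the nixpkgs buildFHSEnv helper roughly as follows. This is a minimal sketch under assumptions, not the repository's actual definition: the use of buildFHSEnv, the micromamba package, and the bash entry point are all illustrative choices.

```nix
# Sketch only: the repository's real wrapper may be defined differently.
{ pkgs ? import <nixpkgs> { } }:

pkgs.buildFHSEnv {
  name = "mamba-shell";

  # Packages made visible under conventional FHS paths (/usr/lib, /usr/bin, ...).
  targetPkgs = p: with p; [
    micromamba                 # assumption: the package manager being wrapped
    zlib bzip2 xz              # compression
    libGL fontconfig freetype  # plotting
    glibc libuuid expat        # runtimes
  ];

  # Command started when the environment is entered.
  runScript = "bash";
}
```

Pre-built conda binaries usually expect shared libraries at fixed FHS locations, which is why they are run inside such a wrapper rather than directly on NixOS.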

Usage:

mamba install scanpy  # Install a package into the active environment
mamba-shell           # Enter the FHS shell where binaries can find libraries

5. Bioinformatics Configuration

Channel Management

A global .condarc is managed via modules/dev/conda.nix. It enforces:

  • Channels: conda-forge and bioconda.
  • Priority: strict (prevents conflicts between standard and bio tools).
  • nodefaults: Excludes the Anaconda default channels, so packages resolve only from the channels listed above.
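
The settings above might translate into a generated file along these lines. This is a hedged sketch: it assumes nodefaults is expressed as a channel entry and that conda-forge is listed before bioconda, and it only builds the file with pkgs.writeText. How modules/dev/conda.nix actually generates and installs its .condarc is not shown here.

```nix
# Sketch only: produces a condarc file in the Nix store; installation is left out.
{ pkgs ? import <nixpkgs> { } }:

pkgs.writeText "condarc" ''
  channels:
    - conda-forge
    - bioconda
    - nodefaults
  channel_priority: strict
''
```

Under strict priority, a package found in a higher-listed channel is never taken from a lower one, so the order of the channel list matters.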

Persistence

All environment data (~/.local/share/envs) is automatically persisted via the modules/persist.nix module, ensuring your environments survive system updates.


6. Tips & Best Practices

  • Stability: Use the Global Baseline for most day-to-day work.
  • Reproducibility: Use the Project Template for any code that will be published.
  • Graphics: If a plotting package fails to find a shared library (for example libGL), run it inside mamba-shell or pixi-shell.
  • Update: Run hey pull to update the global suites, or nix flake update inside your project folder to update local project dependencies.