Data Science & Bioinformatics Workflow
This repository provides a 3-tiered architecture for Data Science and Bioinformatics, balancing system stability with the rapid agility needed for research.
1. Architecture Overview
Research workflows are managed through three layers of abstraction:
| Layer | Tool | Purpose | Persistence |
|---|---|---|---|
| Global Baseline | Nix Modules | Your everyday “Gold Standard” tools (Tidyverse, PyTorch). | Permanent |
| Project Sandbox | Flake Templates | Reproducible environments for specific papers/experiments. | Project-local |
| Rapid Explorer | Mamba / Pixi | Quick, ad-hoc installation of “edge” academic tools. | Persistent (~/.local) |
2. The Global Environment
To enable the data science stack, set data-science.enable = true within each language module in your host configuration:
modules.dev.python = {
  enable = true;
  data-science.enable = true;
};
modules.dev.r = {
  enable = true;
  data-science.enable = true;
};
Python & R Suites
Defined within modules/dev/python.nix and modules/dev/r.nix respectively, these suites provide:
Base (always included when the language is enabled):
- Python: ipython, ipykernel, pip (+ black, pylint, poetry, isort as user packages).
- R: tidyverse, tidymodels, devtools, shiny, knitr, languageserver, IRkernel, radian.
Data Science extras (data-science.enable = true):
- Python: pandas, polars, numpy, scipy, scikit-learn, pytorch, transformers, langchain, opencv4, matplotlib, seaborn, plotly, jupyterlab, biopython, scikit-bio, and more.
- R: rmarkdown, readxl, arrow, caret, randomForest, xgboost, glmnet, plotly, leaflet, phyloseq, openxlsx, and more.
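To confirm that the Python extras are visible to the interpreter on your PATH, a small check along these lines may help. It is a sketch, not part of the repository: check_py_modules and the PYTHON override are hypothetical names. Note that several packages import under a different module name than the package name (pytorch as torch, scikit-learn as sklearn, opencv4 as cv2, biopython as Bio, scikit-bio as skbio).

```shell
#!/bin/sh
# Hypothetical helper: report which Python modules the interpreter can locate.
# PYTHON is an assumed override; it defaults to the `python` found on PATH.
# find_spec only shows that a module can be located, not that it imports cleanly.
check_py_modules() {
  for module in "$@"; do
    if "${PYTHON:-python}" -c 'import importlib.util, sys; sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$module" 2>/dev/null; then
      printf '%s: ok\n' "$module"
    else
      printf '%s: missing\n' "$module"
    fi
  done
}

# Import names for a few packages from the list above.
check_py_modules pandas numpy torch sklearn cv2 Bio skbio
```

Each line of output names a module and reports ok or missing, so a disabled data-science.enable option shows up as a run of missing entries.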
IDE & Server Integration
- Positron: The IDE is automatically wrapped to find R and Python environments. When either language module is enabled, Positron’s wrapper sets R_HOME/R_LIBS_SITE/R_ENVIRON_USER (for R) and PYTHONHOME/PATH (for Python). You do not need to manually configure interpreter paths.
- JupyterHub: Kernels for both Python 3 and R are pre-registered. Each kernel uses the language module’s package environment when available, or a minimal fallback (ipykernel / IRkernel) when the language module is disabled.
- Shell: Tools are available globally in your terminal via python or radian.
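A quick way to verify the shell integration is to ask where each entry point resolves. The following is a minimal sketch; report_tools is a hypothetical helper name, and only the two commands mentioned above are checked.

```shell
#!/bin/sh
# Hypothetical helper: report where each named command resolves on PATH.
# Prints "<name>: <location>" or "<name>: not found", one line per command.
report_tools() {
  for tool in "$@"; do
    if location=$(command -v "$tool" 2>/dev/null); then
      printf '%s: %s\n' "$tool" "$location"
    else
      printf '%s: not found\n' "$tool"
    fi
  done
}

# The two entry points named above.
report_tools python radian
```

On a Nix-managed system the reported locations would typically point into a profile or the Nix store; a "not found" line suggests the corresponding language module is not enabled.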
3. Project Workflow (Reproducible Research)
For specific research projects, use the Data Science Template. This allows you to lock exact versions of libraries for a specific paper or analysis.
Quick Start
- Initialize Project:
  mkdir my-research && cd my-research
  nix flake init -t .#data-science
- Activate Environment:
  direnv allow
  The shell will display a welcome banner with software versions.
- Analysis:
  - Launch JupyterLab: jupyter lab
  - Launch R Console: radian
Customizing Packages
Edit the flake.nix in your project folder. Add packages to the python-env or r-env blocks. direnv will automatically reload the environment when you save the file.
4. Manual Workflow (Agility)
When you need a tool that is not in Nixpkgs or changes too frequently for the Nix module system, use the provided FHS-wrapped package managers.
Mamba & Pixi
Both are configured with a robust FHS (Filesystem Hierarchy Standard) environment that includes:
- Compression: zlib, bzip2, xz (essential for FASTQ/BAM files).
- Plotting: libGL, fontconfig, freetype.
- Runtimes: glibc, libuuid, expat.
Usage:
mamba install scanpy # Instant installation
mamba-shell # Enter the FHS shell where binaries can find libraries
5. Bioinformatics Configuration
Channel Management
A global .condarc is managed via modules/dev/conda.nix. It enforces:
- Channels: conda-forge and bioconda.
- Priority: strict (prevents conflicts between standard and bio tools).
- nodefaults: Ensures environments are clean and reproducible.
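The channel policy can be checked against whatever file the module generates. The sketch below rests on several assumptions: check_condarc is a hypothetical helper, the file is assumed to live at ~/.condarc, priority is assumed to use conda's channel_priority key, and the nodefaults rule is read as "the defaults channel is not listed". It greps for the expected entries rather than parsing YAML.

```shell
#!/bin/sh
# Hypothetical helper: check a .condarc for the channel policy described above.
# Prints "ok" when conda-forge and bioconda are listed, channel_priority is
# strict, and the `defaults` channel is absent; otherwise prints the first
# failed check and returns 1.
check_condarc() {
  file=$1
  [ -r "$file" ] || { echo "cannot read $file"; return 1; }
  grep -Eq '^[[:space:]]*-[[:space:]]*conda-forge[[:space:]]*$' "$file" \
    || { echo "conda-forge missing"; return 1; }
  grep -Eq '^[[:space:]]*-[[:space:]]*bioconda[[:space:]]*$' "$file" \
    || { echo "bioconda missing"; return 1; }
  grep -Eq '^channel_priority:[[:space:]]*strict[[:space:]]*$' "$file" \
    || { echo "channel_priority is not strict"; return 1; }
  if grep -Eq '^[[:space:]]*-[[:space:]]*defaults[[:space:]]*$' "$file"; then
    echo "defaults channel present"
    return 1
  fi
  echo "ok"
}

# Assumed location; adjust if the module writes the file elsewhere.
check_condarc "$HOME/.condarc" || true
```

Because the checks run in order, the message identifies the first rule that is not met, which is usually enough to spot a hand-edited or stale file.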
Persistence
All environment data (~/.local/share/envs) is automatically persisted via the modules/persist.nix module, ensuring your environments survive system updates.
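To see what is being persisted, for instance before and after a system update, the environments directory can be listed. This is a sketch under an assumption the text does not confirm: that each environment is a subdirectory of ~/.local/share/envs. The name list_envs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each subdirectory in the environments
# directory. Assumes one subdirectory per environment.
list_envs() {
  envs_dir=${1:-"$HOME/.local/share/envs"}
  if [ ! -d "$envs_dir" ]; then
    echo "no environments directory at $envs_dir"
    return 0
  fi
  for dir in "$envs_dir"/*/; do
    # The glob stays literal when nothing matches, so re-check each entry.
    if [ -d "$dir" ]; then
      basename "$dir"
    fi
  done
}

# Default location taken from the text above.
list_envs
```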
6. Tips & Best Practices
- Stability: Use the Global Suite for 90% of your work.
- Reproducibility: Use the Project Template for any code that will be published.
- Graphics: If a plotting library cannot find a shared library it needs (for example libGL or fontconfig), run it inside mamba-shell or pixi-shell.
- Update: Run hey pull to update the global suites, or nix flake update inside your project folder to update local project dependencies.