Software

The Computational Biosciences Group develops innovative software and methods for FAIR Data Management and for Data Analysis and Machine Learning on experimental data and mechanistic models, toward addressing our nation’s energy, environment, and health needs.

FAIR Data Management

Data Modeling

Hierarchical Data Modeling Framework (HDMF) is a Python package for working with hierarchical data and creating extensible data standards [Source] [Contact] [Cite]. Additional tools in the HDMF ecosystem include the following:

HDMF Common Schema is a collection of reusable data schema for creating scientific data standards. [Source] [Contact] [Cite]
HDMF ML Schema is a format schema for common machine learning workflows and outputs. [Source] [Contact]
HDMF-Zarr implements a Zarr backend for HDMF. [Source] [Contact] [Cite]
HDMF DocUtils is a library for generating documentation from HDMF data schema. [Source] [Contact]

Linked Open Data Modeling Language (LinkML) is a flexible data modeling language and software framework for working with and validating data in a variety of formats (JSON, RDF, TSV) and for compiling LinkML schemas to other frameworks [Source] [Contact]. Additional tools in the LinkML ecosystem include the following:

linkml-model defines the metamodel schema and specification for the LinkML modeling language. [Source]
linkml-runtime is a Python library providing runtime support for LinkML data models. [Source]
schema-automator is a toolkit that assists with the generation and enhancement of LinkML schemas. [Source]
schemasheets is a framework for managing schema using spreadsheets and compiling them to LinkML. [Source]
linkml-project-cookiecutter is a Cookiecutter template for projects using LinkML. [Source]

Ontology Tools

Ontology Development Kit (ODK) is a Dockerized suite of tools for setting up an ontology development workflow with GitHub. [Source] [Contact] [Cite]

Bayesian OWL Ontology Merging (Boomer) uses a combined logical and probabilistic approach to translate mappings into logical axioms that can be used to merge ontologies. [Source] [Contact] [Cite]

BioMake is a GNU-Make-like utility for managing builds and complex workflows using declarative specifications. [Source] [Contact] [Cite]

Ontobio is a Python API for working with ontologies and associations. [Source] [Contact]

ontoRunNER is a toolkit for named entity recognition using ontologies. [Source] [Contact]

obographviz is a package for translating OBO ontology graphs into Dot/Graphviz visualizations. [Source] [Contact]

Sample Annotator is a Python and Flask API for inferring missing or incorrect sample metadata and performing annotation of samples from semi-structured or untidy data. [Source] [Contact]

Domain Data Standards and Portals

Biolink Model provides a high-level data model of biological entities (genes, diseases, phenotypes, pathways, individuals, substances, etc.), their properties and relationships, and enumerates ways in which they can be associated. [Source] [Contact]

KGX is a toolkit and file format for working with and exchanging data in Knowledge Graphs (KGs) that conform to or are aligned with the Biolink Model. [Source] [Contact]
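
As a rough illustration of exchanging a knowledge graph as node and edge tables, the sketch below writes two tab-separated tables using only the Python standard library. It does not use the KGX toolkit itself. The minimal column set (id, category, name for nodes; subject, predicate, object for edges), the EXAMPLE: identifiers, and the helper names to_tsv and dangling_edges are assumptions made for this example, not taken from the KGX documentation.

```python
import csv
import io

# Hypothetical records; identifiers and names are placeholders, not real entities.
nodes = [
    {"id": "EXAMPLE:gene1", "category": "biolink:Gene", "name": "example gene"},
    {"id": "EXAMPLE:disease1", "category": "biolink:Disease", "name": "example disease"},
]
edges = [
    {"subject": "EXAMPLE:gene1", "predicate": "biolink:related_to", "object": "EXAMPLE:disease1"},
]


def to_tsv(rows, columns):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def dangling_edges(nodes, edges):
    """Return edges whose subject or object is not declared in the node table."""
    known = {node["id"] for node in nodes}
    return [e for e in edges if e["subject"] not in known or e["object"] not in known]


nodes_tsv = to_tsv(nodes, ["id", "category", "name"])
edges_tsv = to_tsv(edges, ["subject", "predicate", "object"])
print(nodes_tsv)
print(edges_tsv)
print("dangling edges:", dangling_edges(nodes, edges))
```

Keeping nodes and edges in separate tables makes it cheap to check that every edge refers to a declared node before the graph is shared.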

Neurodata Without Borders (NWB) is an R&D100 award-winning, leading data standard for neurophysiology supported by the NIH BRAIN Initiative [Source] [Contact] [Cite]. NWB provides neuroscientists with a common standard to share, archive, use, and build common analysis tools for neurophysiology data. NWB is supported by many neurophysiology tools, and a growing number of neurophysiology datasets are available in NWB via the DANDI data archive. NWB also includes a broad range of core software for working with NWB data, including the following:

PyNWB is the reference Python API for working with NWB files. [Source]
MatNWB is the reference MATLAB API for working with NWB files. [Source]
NWBInspector is a tool for inspecting NWB files for compliance with best practices. [Source]
NWBWidgets is a library of widgets for visualizing NWB data in Jupyter notebooks. [Source]

Neurodata Extension Catalog is a searchable online catalog for managing extensions to the NWB data standard [Source][Contact]. The NDX Catalog provides the following additional tools:

NDX Template is a Cookiecutter template for creating Neurodata Extensions (NDX) for NWB. [Source]
Staged Extensions is a repository for submitting Neurodata Extensions (NDX) to the NDX Catalog. [Source]

OpenMSI is an R&D100 award-winning, advanced application for web-based visualization, analysis, and management of mass spectrometry imaging data. [Cite]

Data Analysis and Machine Learning

BASTet is a novel framework for shareable and reproducible data analysis that supports standardized data and analysis interfaces, integrated data storage, data provenance, workflow management, and a broad set of integrated tools. [Source][Contact][Cite]

ClearMap is a toolbox for the analysis and registration of volumetric images of organs and organisms obtained via tissue clearing, immunolabeling and light sheet microscopy (iDISCO) [Source] [Contact] [Cite]. ClearMap's toolbox includes the following components:

Wobbly-Stitcher for non-rigid stitching of terabyte-scale (TB) data sets.
TubeMap for extracting vasculature and other tubular networks from terabyte-scale data.
CellMap for extracting neuronal activity markers and cell shapes.

Dynamic Components Analysis (DCA) is an unsupervised dimensionality reduction algorithm that finds low-dimensional subspaces with high dynamical complexity (Predictive Information). [Source] [Contact] [Cite]
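
For readers unfamiliar with Predictive Information, the objective can be written out as follows. This is a sketch of the usual formulation, assuming a stationary time series X; the window length T, the entropy H_X(T) of a length-T window, the covariance \Sigma_T of such a window, and the projection matrix V are notation introduced here for illustration.

```latex
% Predictive information: mutual information between past and future windows
% of length T. Stationarity gives H(X_past) = H(X_future) = H_X(T).
I_{\mathrm{pred}}(X)
  = H(X_{\mathrm{past}}) + H(X_{\mathrm{future}}) - H(X_{\mathrm{past}}, X_{\mathrm{future}})
  = 2\,H_X(T) - H_X(2T)

% Gaussian case: entropies reduce to log-determinants of window covariances,
% and the additive constants cancel.
I_{\mathrm{pred}}(X)
  = \log\lvert \Sigma_T(X) \rvert - \tfrac{1}{2}\,\log\lvert \Sigma_{2T}(X) \rvert

% The low-dimensional subspace is the projection that maximizes this quantity.
V^{*} = \arg\max_{V \,:\, V^{\top} V = I} \; I_{\mathrm{pred}}\!\left(V^{\top} X\right)
```

Under the Gaussian assumption the objective depends only on cross-covariances of the projected series, which is what makes it practical to optimize over V.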

Supervised Dynamic Components Analysis (sDCA) is an extension of the unsupervised dimensionality reduction algorithm DCA that finds low-dimensional subspaces between a source and a target time series (e.g., interacting brain regions, brain-to-behavior) with the highest Predictive Information. [Source] [Contact]

Compressed Predictive Information Coding (CPIC) is a generalization of DCA to non-Gaussian, non-linear mappings. It compresses both the past and the future of the time series based on a Predictive Information bottleneck. It uses Bayesian variational inference and deep learning. [Source] [Contact]

Feedback Controllability Components Analysis (FCCA) is an unsupervised dimensionality reduction algorithm that finds low-dimensional subspaces that are most feedback controllable. [Source] [Contact]

Orthogonal Stochastic Linear Mixing Model (OSLMM) is an unsupervised learning algorithm for time series data that imposes an orthogonality constraint on the latent mixing terms. In practice, this results in more interpretable latent spaces. [Source] [Contact]

pyUoI is a Python package implementing several statistical machine learning algorithms in the Union of Intersections framework, which infers models with accurate feature selection (low false positives and low false negatives) and estimation (low bias and low variance). [Source] [Contact] [Cite]
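
The name Union of Intersections hints at how the selection trade-off is handled: intersecting the features selected across resamples guards against false positives, and taking a union over the candidate supports chosen at estimation time guards against false negatives. The toy sketch below shows only that set logic. It is not the pyUoI API; the function names and the hard-coded supports are hypothetical, and a real implementation would obtain them from sparse regression fits and held-out prediction error.

```python
from functools import reduce


def intersect_supports(supports_per_bootstrap):
    """Selection step: keep only features chosen in every bootstrap resample."""
    return reduce(set.intersection, (set(s) for s in supports_per_bootstrap))


def union_supports(chosen_supports):
    """Estimation step (support only): features kept by any chosen candidate."""
    return set().union(*chosen_supports)


# Hypothetical feature indices selected on three bootstrap resamples,
# for two regularization strengths (the dictionary keys).
selected = {
    0.1: [{0, 1, 2, 5}, {0, 1, 2, 7}, {0, 1, 2, 3}],
    1.0: [{0, 1}, {0, 1, 4}, {0, 1}],
}

# One candidate support per regularization strength.
candidates = {lam: intersect_supports(s) for lam, s in selected.items()}

# Assume (for illustration) that held-out error picked these candidates
# on three estimation resamples.
best_per_bootstrap = [candidates[0.1], candidates[1.0], candidates[0.1]]
final_support = union_supports(best_per_bootstrap)

print(candidates)
print(final_support)
```

Features 3, 4, 5, and 7 each appear in only one resample and are dropped by the intersection, while feature 2 survives in the final support because at least one chosen candidate contains it.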

Embiggen is a performant graph ML package, with support for various algorithms including node2vec, GloVe, TransE, SimplE, and others. [Source] [Contact] [Cite]

EnsmallenGraph is a performant graph library written in Rust with Python bindings. It is essentially a much faster alternative to NetworkX. [Source] [Contact] [Cite]

SpikeDecoder uses kernel density estimates on hippocampal spike marks and LFP oscillatory features to decode spatial location of rats. [Source] [Contact]

MotifDetector aims to identify motifs and their role in communication between interacting sub-processes (e.g., two interacting animals or two interacting brain regions), for example via MCMC inference of infinite hidden Markov models. [Source] [Contact]

WTFgenes ("What's The Function of these genes?") is a gene set enrichment analysis tool based on Bayesian Term Enrichment Analysis (TEA). [Source] [Contact] [Cite]