Data Science & Technology Software
Collaborative Web Applications
- ALSHub - ALSHub is a website used for managing users of the ALS facility and their proposals, experiment safety details, and beamtime.
- AmeriFlux web application - Allows users to upload and download data. Future features include personalized sets of tower sites, personalized dashboard-style reporting, and data visualization.
- CLEER Model - The Cloud Energy and Emissions Research (CLEER) Model is a comprehensive user-friendly open-access model for assessing the net energy and emissions implications of cloud services in different regions and at different levels of market adoption.
- eProject Builder - A secure web-based data entry and tracking system for energy savings performance contract (ESPC) projects
- Materials Project - Web application for materials design; by computing properties of all known materials, the Materials Project aims to remove the guesswork from materials design in a variety of applications.
- OpenMSI - Web application for management, storage, visualization, and statistical analysis of Mass Spectrometry Imaging (MSI) data.
- TECA - A high-performance package for extreme event detection and climate data analytics.
- SENSEI - A framework enabling write once run anywhere in situ data analytics.
- Berkeley Storage Manager (BeStMan) - LBNL implementation of the Storage Resource Manager (SRM) based on the standard interface.
- FastBit - Implementation of FastBit indexing/searching algorithm.
- FastQuery (Restricted Access) - A parallel indexing system for scientific data based on FastBit
- H5hut, H5Part - H5hut (formerly H5Part) is a very simple data storage schema and provides an API that simplifies reading and writing data in the HDF5 file format. H5Part is built on top of HDF5 (Hierarchical Data Format).
- PyNWB - Neurodata Without Borders: Neurophysiology (NWB:N) is more than just a file format; it defines an ecosystem of tools, methods, and standards for storing, sharing, and analyzing complex neurophysiology data. PyNWB is a Python package for working with NWB:N files. It provides a high-level API for efficiently working with neurodata stored in the NWB:N format. Beyond neurophysiology, PyNWB provides a general set of tools for hierarchical organization of data for the creation of complex data standards.
- SPADE - A JEE application that takes files from wherever they are produced, i.e. at an experiment, and delivers them into a data warehouse from which they can be retrieved for analysis and also archived.
- SRM-Lite - A simple command-line tool with pluggable support for file transfer protocols, including scp, hpn-scp, and sftp.
- SDS framework (Restricted Access) - An automatic data management system for exascale computing.
- ArrayUDF - A MapReduce-style system for analyzing scientific data represented as tensors.
- DataElevator - Software for moving data within a hierarchical storage system, e.g. one that includes a burst buffer.
- Henson - Software that uses cooperative multitasking to enable in situ data analysis.
- FRIEDA - A data management framework to understand various trade-offs of scientific data management in elastic transient environments.
- BDC - Berkeley Data Cloud comprises a web-based front end and a backend that serves data using a combination of SQL and FastBit indexing. Data can also be retrieved, and analysis results published using an API. The Multi-Informatics for Nuclear Operations Scenarios (MINOS), a collaboration with the Applied Physics Lab, is using BDC. Learn more about currently supported projects.
- Dac-Man - Dac-Man is a Deduce tool designed to efficiently track, compare and manage changes and associated provenance in large scientific datasets.
- MaDaTS - Managing Data on Tiered Storage for Scientific Workflows provides an API and a command-line tool that allows users to manage their workflows and data on tiered storage.
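FastBit and FastQuery, listed above, are built around bitmap indexing. As a rough illustration of that general technique only, the sketch below builds an uncompressed, equality-encoded bitmap index in plain Python. The `BitmapIndex` class and its method names are hypothetical and are not FastBit's API, and a production index would also compress its bitmaps, which this sketch does not attempt.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy equality-encoded bitmap index over one column (uncompressed)."""

    def __init__(self, values):
        self.n_rows = 0
        # One bitmap per distinct value; a Python int serves as the bit vector.
        self.bitmaps = defaultdict(int)
        for row, value in enumerate(values):
            self.bitmaps[value] |= 1 << row  # set the bit for this row
            self.n_rows = row + 1

    def equals(self, value):
        """Bitmap of rows whose value equals `value`."""
        return self.bitmaps.get(value, 0)

    def in_range(self, low, high):
        """Bitmap of rows with low <= value <= high (OR of matching bitmaps)."""
        result = 0
        for value, bitmap in self.bitmaps.items():
            if low <= value <= high:
                result |= bitmap
        return result

    def rows(self, bitmap):
        """Decode a bitmap back into a sorted list of row numbers."""
        return [row for row in range(self.n_rows) if (bitmap >> row) & 1]


def demo():
    # Two columns of the same hypothetical six-row table.
    temperature = BitmapIndex([10, 25, 30, 25, 10, 40])
    pressure = BitmapIndex([1, 2, 2, 3, 1, 2])
    # A multi-column query becomes a bitwise AND of per-column bitmaps.
    hits = temperature.in_range(20, 35) & pressure.equals(2)
    return temperature.rows(hits)
```

Calling `demo()` answers the query "temperature between 20 and 35 and pressure equal to 2" without scanning the rows themselves; only the per-value bitmaps are combined.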
Networking, Monitoring, and Security
- B.I.4NERSC - Analytical methods for unveiling information buried in data files from monitoring software at NERSC
- BulkDataMover - A scalable data transfer management tool for GridFTP transfer protocol
- DataMover-Lite - End-user data downloading tool for ESGF climate data
- ESG2Net100 - Library enabling minimal memory copy from disk I/O to network I/O
- Hive Mind - Lightweight, decentralized, intrusion detection based on mobile agents and swarm intelligence
- LBNL Physics-Based Intrusion Detection Bro Modules - A set of signatures for use with the Zeek (née Bro) Network Security Monitor that analyze communication with a physical system and compare the effects of that communication with a physical simulation of the device.
- LBNL DDoS Detection on Science Networks - Monitors network logs to detect denial-of-service attacks on "research and education" networks, disambiguating such attacks from the sustained, high-volume network flows characteristic of large science projects, which are referred to as "elephant flows."
- LBNL Stream-Processing Architecture for Real-time Cyber-physical Security (SPARCS) - Extracts data from distribution-level phasor measurement units (PMUs) and power quality meters, together with SCADA data captured over the network; enables physically distributed, hierarchical processing of that data; stores the data in one or more databases; and provides both software APIs and a graphical, web-based front end for inspecting the data.
- Analytics for Stream-Processing Architecture for Real-time Cyber-physical Security (Analytic-SPARCS) - A set of analytics that monitors both power measurements collected by distribution grid phasor measurement units (µPMUs) and SCADA communication in order to detect cyber attacks against equipment located in distribution grid substations.
- Identifying Computational Operations Based on Power Measurements - This software is an approach for leveraging sensitive power measurements to "fingerprint" or infer computation taking place on computing systems, including high-performance computing systems, by examining patterns in power use.
- LBNL Disruption Tolerant Key Management Monitoring for Stream-Processing Architecture for Real-time Cyber-physical Security (DTKM-SPARCS) - A set of signatures that monitor the Disruption-Tolerant Key Management protocol developed by PNNL as part of the DOE CEDS program.
- Research Network Transfer Performance Predictor (netperf-predict) - This software contains two sets of analysis routines for predicting the percentage of retransmitted packets on network flows. One directory contains code that applies random forest regression to predict the number of retransmitted packets on each flow, operating on timeseries data from the tstat tool, which outputs flow-like data. The second directory also applies random forest regression and adds a “smoothing” routine that increases accuracy in some situations.
- StorNet - Storage and network bandwidth coordination system.
Visualization, Data Analysis Algorithms, and Applications
- BASTet - BASTet is a novel framework for shareable and reproducible data analysis that supports standardized data and analysis interfaces, integrated data storage, data provenance, workflow management, and a broad set of integrated tools. BASTet has been motivated by the critical need to enable MSI researchers to share, reuse, reproduce, validate, interpret, and apply common and new analysis methods.
- BrainFormat - The LBNL BrainFormat library specifies a general data format standardization framework and implements a novel file format for management and storage of neuroscience data
- Brain Modulyzer - Brain Modulyzer is an interactive visual exploration tool for functional magnetic resonance imaging (fMRI) brain scans, aimed at analyzing the correlation between different brain regions when resting or when performing mental tasks. Integrated methods from graph theory and analysis, such as community detection and derived graph measures, make it possible to explore the modular and hierarchical organization of functional brain networks.
- DAGR - DAGR is a scalable framework for implementing analysis pipelines using parallel design patterns. DAGR abstracts the pipeline concept into a state machine composed of connected algorithmic units. Each algorithmic unit is written to do a single task, resulting in highly modularized, reusable code. DAGR provides infrastructure for control, communication, and parallelism; you provide the kernels to implement your analyses. Written in modern C++ and designed to leverage MPI+threading for parallelism, DAGR can leverage the latest HPC hardware, including many-core architectures and GPUs. The framework supports a number of parallel design patterns, including distributed data, map-reduce, and task-based parallelism.
- Dionysus - Library for computation of persistent homology
- DIY - DIY is a block-parallel library for implementing scalable algorithms that can execute both in-core and out-of-core. The same program can be executed with one or more threads per MPI process, seamlessly combining distributed-memory message passing with shared-memory thread parallelism. The abstraction enabling these capabilities is block parallelism; blocks and their message queues are mapped onto processing elements (MPI processes or threads) and are migrated between memory and storage by the DIY runtime. Complex communication patterns, including neighbor exchange, merge reduction, swap reduction, and all-to-all exchange, are possible in- and out-of-core in DIY.
- ECoG ClusterFlow - ECoG ClusterFlow is an interactive visual analysis tool for the exploration of high-resolution electrocorticography (ECoG) data. ECoG ClusterFlow detects and visualizes dynamic high-level structures, such as communities, using the time-varying spatial connectivity network derived from high-resolution ECoG data. ECoG ClusterFlow makes it possible 1) to compare the spatiotemporal evolution patterns for continuous and discontinuous time-frames, 2) to aggregate data to compare and contrast temporal information at varying levels of granularity, and 3) to investigate the evolution of spatial patterns without occluding the spatial context information.
- Exa.TrkX - Exa.TrkX is a graph-based pattern recognition pipeline for noisy experimental data. It aims to measure millions of particle trajectories per second from petabytes of raw data produced by the next generation of high-energy physics experiments.
- F3D - F3D is a Fiji plugin designed for high-resolution 3D images and written in OpenCL.
- FibriPy - A Python Software Environment for Fibrilar Structures Analysis from 3D Images (FibriPy).
- MSM-CAM - CAMERA Materials Segmentation and Metrics
- QuantCT - Quantitative analysis of micro-tomography images
- PMRF-IS - Parallel Markov Random Fields for Image Segmentation
- PointCloudXplore - PointCloudXplore is the first visualization system specifically developed for the analysis of 3D gene expression data. PointCloudXplore is available for Linux, Mac, and Windows.
- semViewer - (Restricted Access) The semViewer software was developed as part of an LBNL LDRD project during the period 1999-2001. It is used to perform distance and angular measurements of perceived 3D objects present in pairs of images obtained from scanning electron microscopy.
- SENSEI Generic Data Interface - The SENSEI generic data interface provides a framework for science code teams and analysis algorithm developers to write code once and use it anywhere within the four major in situ analysis frameworks (ADIOS, GLEAN, ParaView/Catalyst, and VisIt/Libsim). Furthermore, since ParaView/Catalyst and VisIt/Libsim are both treated as analysis routines under SENSEI, these visualizations can be run in situ, or in transit using ADIOS or GLEAN, transparently.
- tess2 - Library to compute Delaunay and Voronoi tessellations in HPC environments
- Toolkit for Extreme Climate Analysis (TECA) - TECA is a collection of climate analysis algorithms geared toward extreme event detection and tracking, implemented in a scalable parallel framework. The core is written in modern C++ and uses MPI+threads for parallelism. The framework supports a number of parallel design patterns, including distributed data parallelism and map-reduce. Python bindings make the high-performance C++ code easy to use. TECA has been run on up to 750k cores.
- Visapult - Visapult is a pipelined-parallel volume rendering application capable of rendering extremely large volume data on a wide range of common platforms. It was featured in a paper in the SC 2000 Technical Program.
- VisIt - A distributed, parallel visualization and graphical analysis tool for data defined on two- and three-dimensional (2D and 3D) meshes.
- Xi-CAM - A versatile interface for visualization and data analysis providing workflows for local and remote computing, data management, and seamless integration of plugins. Xi-CAM is a continuing development project in an early beta stage. If interested in collaborative development or to receive development beta releases, please contact Ron Pandolfi (firstname.lastname@example.org) and Alex Hexemer (email@example.com).
- WarpIV - WarpIV is a Python application that enables efficient, parallel visualization and analysis of simulation data while it is being generated by the Warp simulation framework. WarpIV integrates state-of-the-art in situ visualization and analysis using VisIt with Warp, supports management and control of complex in situ visualization and analysis workflows, and implements integrated analytics to facilitate query and feature-based data analytics and efficient large-scale data analysis.
- See also OpenMSI under Collaborative Web Applications, above.
- Tigres - Tigres provides a programming library to compose and execute large-scale data-intensive scientific workflows from desktops to supercomputers. Tigres addresses the challenge of enabling a collaborative analysis of DOE Science data through a new concept of reusable “templates” that enable scientists to easily compose, run and manage collaborative computational tasks. These templates define common computation patterns used in analyzing a data set.
- WoAS - Workflow-Aware Scheduling adds workflow-aware scheduling capabilities to Slurm.
- ScSF - Scheduling Simulation Framework manages worker instances, deploys experiment setups, runs simulations, and harvests results. Data analysis and plotting functions are also present in the controller.
- Jupyter-Kale - A tool that enables Jupyter Notebooks to seamlessly interface with HPC workflows, leveraging distributed computational resources for iterative human-in-the-loop scientific exploration.
- IDAES - The Institute for the Design of Advanced Energy Systems (IDAES) project is developing next-generation computational tools for Process Systems Engineering (PSE) of advanced energy systems to enable their rapid design and optimization. These tools are designed to work together in a modular framework.
- CCSI2 Computational Toolset - The CCSI2 Computational Toolset is a comprehensive, integrated suite of validated science-based computational models. This toolset from the Carbon Capture Simulation for Industry Impact (CCSI2) project aims to increase confidence in equipment and process designs, thereby reducing the risk associated with incorporating multiple innovative technologies into new carbon capture solutions.
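The Tigres entry above describes reusable "templates" that capture common computation patterns. The following is a minimal sketch of what such patterns can look like, written in plain Python with the standard library. The helper names `sequence`, `parallel`, and `split_merge` are hypothetical illustrations and are not taken from the Tigres API.

```python
from concurrent.futures import ThreadPoolExecutor


def sequence(tasks, value):
    """Run tasks one after another, feeding each output into the next task."""
    for task in tasks:
        value = task(value)
    return value


def parallel(task, inputs, max_workers=4):
    """Apply one task to many independent inputs; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, inputs))


def split_merge(split, task, merge, value):
    """Split a value into parts, process the parts in parallel, merge results."""
    parts = split(value)
    return merge(parallel(task, parts))


def chunks_of_two(items):
    """Hypothetical splitter: break a list into chunks of at most two items."""
    return [items[i:i + 2] for i in range(0, len(items), 2)]


def demo():
    # Sum each chunk in parallel, then sum the partial results.
    total = split_merge(chunks_of_two, sum, sum, [1, 2, 3, 4, 5])
    # Post-process the merged result with a two-step sequence.
    return sequence([lambda x: x * 2, lambda x: x + 1], total)
```

The point of the pattern is that the workflow author supplies only the task functions, while each template decides how they are ordered and executed; swapping the thread pool for a batch scheduler would not change the calling code.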