DOLMA

DOLMA (Data to feed OLMo's Appetite) is a framework for efficiently managing the large-scale datasets used to train and fine-tune language models.

Features

Supports dataset cleaning and filtering for better model training
Implements deduplication and compression techniques
Optimized for large-scale NLP dataset processing
Provides tools for ethical and responsible dataset curation
Works with popular transformer-based LLM architectures
Open-source and adaptable for different AI research needs
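
The cleaning, filtering, and deduplication features listed above can be illustrated with a minimal sketch. This is not DOLMA's actual API: the function names normalize and clean_and_dedupe and the min_words threshold are hypothetical, chosen only to show the general idea of whitespace cleanup, a length filter, and exact-match deduplication by hash, using the Python standard library.

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_and_dedupe(docs: Iterable[str], min_words: int = 3) -> Iterator[str]:
    """Yield cleaned documents, skipping short ones and exact duplicates.

    Hypothetical helper for illustration only; not part of DOLMA.
    Duplicates are detected case-insensitively on the normalized text.
    """
    seen = set()
    for doc in docs:
        text = normalize(doc)
        # Filter: drop documents with fewer than min_words words.
        if len(text.split()) < min_words:
            continue
        # Deduplicate: hash the lowercased text and skip repeats.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


if __name__ == "__main__":
    sample = [
        "The quick  brown fox.",
        "the quick brown fox.",
        "too short",
        "A different, longer document.",
    ]
    for kept in clean_and_dedupe(sample):
        print(kept)
```

Storing only a hash per document keeps memory use low on large inputs. Near-duplicate detection (for example MinHash or Bloom filters) would need a different approach than this exact-match sketch.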

License

Apache License V2.0

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Natural Language Processing (NLP) Tool

Registered

2025-01-24

DOLMA

Data and tools for generating and inspecting OLMo pre-training data
