Thousands of Machine Learning Datasets
#

Introduction:
#

Without data there is no machine learning, no AI, no deep learning. Thanks to heavy automation and IoT devices all around us, there is no dearth of data. The first issue is that, due to privacy and security concerns, data is not available to everyone. The second issue is cleaning this data. The third issue is getting complete data that can solve a given business problem. To get complete data, you need to collect it from multiple sources and identify the key that connects records/samples across those sources. This is an expensive and time-consuming step of a data science project. If you want to learn data science, or want to solve an existing problem using new methods, you need a benchmarking framework that can display the model metrics (recall, precision, accuracy, etc.) of each new approach (algorithm) against a given dataset. So datasets play a critical role in benchmarking algorithm performance.
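
To make the benchmarking idea concrete, here is a minimal sketch in plain Python. The toy dataset, the two "models", and the helper names `binary_metrics` and `benchmark` are all made up for illustration; a real project would more likely use an established library for the metrics.

```python
from typing import Callable, Dict, List, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a model never predicts positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def benchmark(
    models: Dict[str, Callable[[List[float]], int]],
    X: List[List[float]],
    y: List[int],
) -> Dict[str, Dict[str, float]]:
    """Score every candidate approach against the same dataset."""
    return {
        name: binary_metrics(y, [predict(row) for row in X])
        for name, predict in models.items()
    }


# Hypothetical toy dataset: one feature per sample, binary label.
X = [[0.1], [0.4], [0.6], [0.9], [0.3], [0.8]]
y = [0, 0, 1, 1, 1, 0]

# Two hypothetical "approaches" to compare.
models = {
    "threshold_0.5": lambda row: int(row[0] >= 0.5),
    "always_positive": lambda row: 1,
}

results = benchmark(models, X, y)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because every approach is scored on the same fixed dataset, the numbers are directly comparable, which is the reason shared datasets matter for benchmarking.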

Thus, machine learning relies heavily on high-quality datasets for training and evaluation. These datasets serve as the foundation for developing robust and accurate models across various AI domains, including classical machine learning, computer vision, natural language processing (NLP), audio processing, and time series analysis. Access to diverse and comprehensive datasets is crucial for researchers and practitioners to tackle real-world problems and advance the field of machine learning.

In this article, I am publishing a curated collection of datasets sourced from over 150 data sources. These datasets and data sources have been carefully selected to cover a wide range of domains, ensuring their relevance to different machine learning applications. Whether you are working on an NLP/text project, a CV/image project, an audio project, time series forecasting, or classical machine learning, you’ll find valuable datasets to support your research and development efforts. Let’s dive into the world of machine learning datasets and discover the wealth of resources available to fuel your projects. If you dig into each link, you will find hundreds, if not thousands, of datasets under many of the links shared. I hope you benefit from this work.

Note: I collected these links from my Chrome bookmarks. At the time of writing this article, I validated each link. If you find a link that does not work, points to the wrong place, or is described incorrectly, please help me improve this article. You can write to me at hari.prasad @ vedavit-ps .com.

Note: To find image datasets on this page, search for “image”; for speech datasets, search for “speech”.
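
If you copy the list into a script, the same keyword search can be sketched in a few lines of Python. The function name `search_entries` is hypothetical, and the four entries are just a small sample taken from the list below.

```python
from typing import List


def search_entries(entries: List[str], keyword: str) -> List[str]:
    """Return entries containing the keyword, ignoring upper/lower case."""
    needle = keyword.casefold()
    return [entry for entry in entries if needle in entry.casefold()]


# A small sample of entries from this page.
entries = [
    "Flickr 8k, Images with Caption",
    "Libri Speech Dataset",
    "Stanford Speech Dataset",
    "Enron Email Dataset",
]

image_hits = search_entries(entries, "image")
speech_hits = search_entries(entries, "speech")
print(image_hits)
print(speech_hits)
```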

List of Datasets and Data Sources
#

Approx 100 Datasets by DasarpAI on github
100+ Interesting Data Sets for Statistics : 100+ Interesting Data Sets for Statistics
100+ Mammography Image Databases : Mammography Image Databases – 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)
15 Amazon datasets on data.world : Amazon data on data.world - 8 datasets available
20 Free Big Data Sources : 20 Free Big Data Sources
332 Sport Datasets on data.world : sports data on data.world : 338 datasets available
40 Open Source Audio Datasets
4000+ Groningen Natural Image Database : Groningen Natural Image Database – 4000+ 1536×1024 (16 bit) calibrated outdoor images (Formats: homebrew)
450+ UCI datasets
538 Datasets
57 products datasets on data.world
622 UCI Archive Dataset : UCI Archive-Machine Learning Repository: Data Sets
9 Voice Datasets from cmwire
A Collective list of Free API for Datasets : A collective list of free APIs for use in software and web development.
A list of useful sources : A blog post that includes many dataset databases
Academic Torrents- Large Research dataset : Academic Torrents: distributed network for sharing large research datasets
Air Freight Dataset - Computer Vision : Air Freight – The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160×120 pixels). (Formats: PNG)
Airline Safety : contains information on accidents from each airline.
Allen Institutes Dataset : Datasets – Allen Institute for AI
Amazon Datasets : Amazon Web Services Public Data Sets
Amsterdam Library of Object Images - ALOI : Amsterdam Library of Object Images – ALOI is a color image collection of one-thousand small objects, recorded for scientific purposes. In order to capture the sensory variation in object recordings, we systematically varied viewing angle, illumination angle, and illumination color for each object, and additionally captured wide-baseline stereo images. We recorded over a hundred images of each object, yielding a total of 110,250 images for the collection. (Formats: png)
Annotated face, hand, cardiac & meat images : Annotated face, hand, cardiac & meat images – Most images & annotations are supplemented by various ASM/AAM analyses using the AAM-API. (Formats: bmp,asf)
Apigee : Apigee: explore dozens of popular APIs
apilist.fun : API List: A public list of free APIs for programmers
AT&T Laboratories Cambridge face database - Images : AT&T Laboratories Cambridge face database
AVHRR Pathfinder : National Centers for Environmental Information
Awesome Deep Learning Database : Densely Sampled View Spheres – Densely sampled view spheres – upper half of the view sphere of two toy objects with 2500 images each. (Formats: tiff)
Awesome Public Datasets : Awesome Public Datasets: Well-organized and frequently updated
Aylien Datasets
Aylien News Data API
B2SHARE
BBC Datasets : Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Class Labels: 5 (business, entertainment, politics, sport, tech)
Berkeley Segmentation Dataset 500 : Berkeley Segmentation Dataset 500
Biometric Systems Lab : Biometric Systems Lab – University of Bologna
Bloomberg + Reuters Finance News Dataset
Breast Histopathology Images Dataset : This dataset contains 277,524 image patches of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. There are 198,738 IDC-negative and 78,786 IDC-positive patches.
California Water Resources : California’s water resource data.
Caltech Image Database : Caltech Image Database – about 20 images – mostly top-down views of small objects and toys. (Formats: GIF)
CAVIAR video sequences of mall and public space behavior : CAVIAR video sequences of mall and public space behavior - 90K video frames in 90 sequences of various human activities, with XML ground truth of detection and behavior classification (Formats: MPEG2 & JPEG)
CCITT Fax standard images : CCITT Fax standard images – 8 images (Formats: gif)
Census of India
Chatbot Intents Dataset : The dataset for a chatbot is a JSON file that has disparate tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user can ask, and the chatbot will respond according to that pattern. The dataset is perfect for understanding how chatbot data works.
CIFAR-10 and CIFAR-100 : CIFAR-10 and CIFAR-100
Cityscapes Dataset : It contains high-quality pixel-level annotations of video sequences recorded in street scenes from 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.
CMU CIL’s Stereo Data (Image) : CMU CIL’s Stereo Data with Ground Truth – 3 sets of 11 images, including color tiff images with spectroradiometry (Formats: gif, tiff)
CMU PIE Database : CMU PIE Database - A database of 41,368 face images of 68 people captured under 13 poses, 43 illumination conditions, and with 4 different expressions.
CMU VASC Image Database : CMU VASC Image Database – Images, sequences, stereo pairs (thousands of images) (Formats: Sun Rasterimage)
CodaLab : Hundreds of interesting datasets.
College Scorecard Data
Color Detection Dataset : The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values.
Columbia-Utrecht Reflectance and Texture Database : Columbia-Utrecht Reflectance and Texture Database – Texture and reflectance measurements for over 60 samples of 3D texture, observed with over 200 different combinations of viewing and illumination directions. (Formats: bmp)
Computational Colour Constancy Data : Computational Colour Constancy Data - A dataset oriented towards computational color constancy, but useful for computer vision in general. It includes synthetic data, camera sensor data, and over 700 images. (Formats: tiff)
Computational Vision Lab : Computational Vision Lab
Content-based image retrieval database : Content-based image retrieval database - 11 sets of color images for testing algorithms for content-based retrieval. Most sets have a description file with names of objects in each image. (Formats: jpg)
Covid-19 Google
COVID-19 Open Research Dataset Challenge (CORD-19) : The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date.
Credit Card Fraud Detection : Identify fraudulent credit card transactions.
Cricket Data
Crowdanalytics
Crowdflower Dataset : CrowdFlower: interesting datasets created or enhanced by their contributors
Crunchbase : Crunchbase: Discover innovative companies and the people behind them
CVD Foundation Open Images : Open Images dataset – Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
Data Basin : Science-based mapping and analytics platform.
Data for Cool DS projects
Data world
Data.Gov : The US government portal to open data.
Data.lacity.org
DataCamp
DataInnovation Dataset Blog : Center for Data Innovation: blog posts about interesting, recently-released data sets.
Dataverse.org : Dataverse Project: searchable archive of research data
DC Open Data Catalog : DC Open Data Catalog / OpenDataDC
Deep Fashion - Images : Large-scale Fashion (DeepFashion) Database – Contains over 800,000 diverse fashion images. Each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding box and clothing landmarks
Devanagari Handwritten Character Dataset - Images
Donors Choose : Donors Choose: data related to their projects
Enron Email Dataset : It has more than 500K emails from over 150 users. The size of the data is around 432 MB. Most of the users are senior management of Enron.
Face and Gesture images and image sequences : Face and Gesture images and image sequences – Several image datasets of faces and gestures that are ground truth annotated for benchmarking http://www.fg-net.org/
FG-NET Facial Aging Database : FG-NET Facial Aging Database – Database contains 1002 face images showing subjects at different ages. (Formats: jpg)
Finance Datasets on Kaggle
Find Datasets: CMU Libraries : Discover high-quality datasets thanks to the collection of Huajin Wang, CMU.
Finding Datasets from inside-r.org
Flickr 30k, Images with Caption
Flickr 8k, Images with Caption
Flickr Data 100 Million Yahoo dataset, Images : Flickr Data 100 Million Yahoo dataset
FT Markets Data
FVC2000 Fingerprint Databases, Images : FVC2000 Fingerprint Databases - FVC2000 is the First International Competition for Fingerprint Verification Algorithms. Four fingerprint databases constitute the FVC2000 benchmark (3520 fingerprints in all).
Gapminder Data
German Fingerspelling Database : German Fingerspelling Database – The database contains 35 gestures and consists of 1400 image sequences that contain gestures of 20 different persons recorded under non-uniform daylight lighting conditions. http://www-i6.informatik.rwth-aachen.de/~dreuw/database.html
Getting Stock Data
GHTorrent
GitHub Activity : contains all public activity on over 2.8 million public GitHub repositories.
Github philipperemy/financial-news-dataset
Github-DataMeet : Datameet is a community of Data Science enthusiasts.
github.com/TheUpShot : The Upshot: data related to their articles
Global Terrorism Database (GTD)
Google Dataset Search (beta)
Google House Numbers from street view
Google Scholar
Google Trends Data Portal : Google trends data can be used to examine and analyze the data visually. We can find out what’s trending and what people are searching for.
grouplens.org : Sample movie (with ratings), book and wiki datasets
GTSRB (German traffic sign recognition benchmark) Dataset : Build a model using a deep learning framework that classifies traffic signs and also recognizes the bounding box of signs. The traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate actions.
Hate crime news : regularly-updated data about hate crimes reported in Google News.
Hate Speech Dataset in Devanagari from Kaggle
Historical Weather : data from 9000 NOAA weather stations from 1929 to 2016.
HowStat : HowSTAT! The Cricket Statisticians – Home Page
Huggingface datasets
Humanitarian Data Exchange : Humanitarian Data Exchange
IEEE DataPort : Data Competitions : IEEE DataPort
IEN Image Library : IEN Image Library – 1000+ images, mostly outdoor sequences (Formats: raw, ppm)
Image Analysis Laboratory : Image Analysis Laboratory – Images obtained from a variety of imaging modalities — raw CFA images, range images and a host of “medical images”. (Formats: homebrew)
Image QA : Image QA
ImageNet : ImageNet
IMDb Top 250 Movies : Ratings and Reviews for New Movies and TV Shows – IMDb
IMDB-Wiki dataset : The IMDB-Wiki dataset is one of the largest open-source datasets of face images with gender and age labels. The images are collected from IMDb and Wikipedia. It has more than 500,000 labeled images.
IMF Data : The International Monetary Fund publishes data on international finances, foreign exchange reserves, commodity prices, and investments.
IMF-Exchange Rate : IMF-Exchange Rate Archives by Month
India, Surat City
Indian Govt
Indian Liver Patient Dataset
INRIA
Institute of Computer Graphics and Vision : Institute of Computer Graphics and Vision
Inter-university Consortium for Political & Social Research : Inter-university Consortium for Political and Social Research
Kaggle Datasets : Kaggle provides datasets with their challenges, but each competition has its own rules as to whether the data can be used outside of the scope of the competition.
kdnuggets : kdnuggets- Datasets for Data Mining and Data Science
Kinetics Dataset : There are three different Kinetics datasets: Kinetics 400, Kinetics 600, and Kinetics 700. This is a large-scale dataset that contains URL links to around 650,000 high-quality video clips. It can be used to build a human action recognition model that detects what action a person is performing.
Libri Speech Dataset : This dataset contains a large amount of read English speech derived from the LibriVox project. It has 1000 hours of read English speech in various accents. It is intended for speech recognition, where the objective is to automatically identify what is being said in the audio.
Liver Tumor Segmentation Challenge Dataset
London Data Store : Lots of datasets on London, UK.
Mall Customers Dataset : The Mall Customers dataset holds details about people visiting a mall. It includes customer id, age, gender, annual income, and spending score. It can be used to gain insights from the data and divide customers into groups based on their behavior.
Manufacturing Process Failures : a collection of variables that were measured during the manufacturing process. The goal is to predict faults with manufacturing.
Mashape - Explore APIs : Mashape: explore hundreds of APIs
Microsoft COCO : Microsoft COCO
Microsoft Datasets
Microsoft Research Open Data
Million Song Dataset : Million Song Dataset
MIT Vision Texture : MIT Vision Texture – Image archive (100+ images) (Formats: ppm)
MNIST Handwritten digits : MNIST Handwritten digits
Multiple Choice Questions : a data set of multiple-choice questions and the corresponding correct answers. The goal is to predict the answer to any given question.
National Climatic Data Center — NOAA
NAYN.CO Turkish News with categories
NLM HyperDoc Visible Human Project : NLM HyperDoc Visible Human Project - Color, CAT and MRI image samples - over 30 images (Formats: jpeg)
NYC Open Data socrata : NYC Open Data
OASIS 1 : OASIS-1 (Open Access Series of Imaging Studies)
OASIS Brain - Imaging Studies : Cross-Sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults
Open Data Philly : Connecting people with data for Philadelphia
Open Energy Data Initiative : Over 800 data sets covering energy issues.
Open Government Data Platform India
Parkinson Dataset : Parkinson dataset contains biomedical measurements, 195 records of people with 23 different attributes. This data is used to differentiate healthy people and people with Parkinson’s disease.
Photometric 3D Surface Texture Database : Photometric 3D Surface Texture Database - This is the first 3D texture database which provides both full real surface rotations and registered photometric stereo data (30 textures, 1680 images). (Formats: TIFF)
Pittsburgh Science of Learning : Pittsburgh Science of Learning Center’s DataShop
Political advertisements on Facebook : a free collection of data about Facebook ads that is updated daily.
ProPublica Data Store : ProPublica Data Store
Public Git Archive
Python API for Datasets : Python APIs: Python wrappers for many APIs
Quandl : Quandl: over 10 million financial, economic, and social datasets
R Datasets : Rdatasets: collection of 700+ datasets originally distributed with R packages
RapidAPI.com : 25 Free Public APIs for Developers & Free Alternatives List
rdatamining.com : RDataMining.com
Recommender Systems and Personalization Datasets : This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, etc that are used in building a recommender system.
Reddit Dataset from 2500 subreddits : Reddit Top 2.5 Million: all-time top 1,000 posts from each of the top 2,500 subreddits
Reddit Dataset Jeopardy Question : 200,000+ Jeopardy questions
Reddit Dataset : Datasets subreddit: ask for help finding a specific data set, or post your own
Research.yahoo.com
Reuters Finance News Dataset Title Only
Satellite Photograph Order : a set of satellite photos of Earth — the goal is to predict which photos were taken earlier than others.
Sebastian Raschka : Sebastian Raschka: datasets categorized by format and topic
Smart Cities Data Govt of India
Stanford Edu Dataset : Stanford Large Network Dataset Collection: graph data
Stanford Speech Dataset
Suicide Rates 1985-2016 : Suicide Rates Overview 1985 to 2016 (Kaggle)
Sunlight Foundation Govt Data : Sunlight Foundation: government-focused data
Tamilnadu : 37K Resources, 4,134 Catalog, 101 Departments
TED-LIUM corpus release 3
Temporal concept localization within video - YouTube-8M, Link2 : The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations. In addition to annotating videos, we would like to temporally localize the entities in the videos, i.e., find out when the entities occur.
The MIT-CSAIL Database of Objects and Scenes : The MIT-CSAIL Database of Objects and Scenes - Database for testing multiclass object detection and scene recognition algorithms. Over 72,000 images with 2873 annotated frames. More than 50 annotated object classes. (Formats: jpg)
Tiny Images 80 Million tiny images : Tiny Images 80 Million tiny images
Traffic Image Sequences and ‘Marbled Block’ Sequence : Traffic Image Sequences and ‘Marbled Block’ Sequence - thousands of frames of digitized traffic image sequences as well as the ‘Marbled Block’ sequence (grayscale images) (Formats: GIF)
Trending YouTube Video Statistics : Possible uses include sentiment analysis in a variety of forms, categorising YouTube videos based on their comments and statistics, training ML algorithms like RNNs to generate their own YouTube comments, analyzing what factors affect how popular a YouTube video will be, and statistical analysis over time.
U Oulu wood and knots database : U Oulu wood and knots database - Includes classifications - 1000+ color images (Formats: ppm)
UC Irvine Machine Learning Repository : UC Irvine Machine Learning Repository
UCI Machine Learning Datasets : Data for machine learning — lots of labeled data and description of the problem types.
UCI-Liver Disorder Datasets : UCI Machine Learning Repository: Liver Disorders Data Set
UCI : UC Irvine Machine Learning Repository: datasets specifically designed for machine learning
UFO - Geolocation and Time Dataset : UFO reports: geolocated and time-standardized UFO reports for close to a century
UK Govt
University of Oulu Physics-based Face Database : University of Oulu Physics-based Face Database - contains color images of faces under different illuminants and camera calibration conditions as well as skin spectral reflectance measurements of each person.
University of Oulu Texture Database : University of Oulu Texture Database - Database of 320 surface textures, each captured under three illuminants, six spatial resolutions and nine rotation angles. A set of test suites is also provided so that texture segmentation, classification, and retrieval algorithms can be tested in a standard manner. (Formats: bmp, ras, xv)
UP Govt Economics : Directorate of Economics and Statistics UP Govt.
UP Smart Cities
US Census Bureau : US Census Bureau
US Gov 256K datasets : The Home of the U.S. Government’s Open Data
US Govt : data.gov (see also: Project Open Data Dashboard)
US Students Universities
US Weather History : historical weather data for the US.
USA Names : contains all Social Security name applications in the US, from 1879 to 2015.
USF Range Image Data with Segmentation : USF Range Image Data with Segmentation Ground Truth - 80 image sets (Formats: Sun rasterimage)
Vanderbilt edu dataset websites
Vanderbilt edu datasets
Voting machine age : data on the age of voting machines that were used in the 2016 election.
VQA : Visual Question Answering
Wikipedia Dataset : Wikipedia:Database download – Wikipedia
Wiry Object Recognition Database : Wiry Object Recognition Database - Thousands of images of a cart, ladder, stool, bicycle, chairs, and cluttered scenes with ground truth labelings of edges and regions.
World Bank Open Data : World Bank Open Data
World Bank Open Data : Datasets covering population demographics and a vast number of economic and development indicators.
Worldbank Datasets
Yale Face Database - 165 images : Yale Face Database - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.
Yale Face Database B - 5760 : Yale Face Database B - 5760 single light source images of 10 subjects each seen under 576 viewing conditions (9 poses x 64 illumination conditions). (Formats: PGM)
Yelp.com Datasets Challenge : Yelp Dataset Challenge: Yelp reviews, business attributes, users, and more from 10 cities
YouTube-8M Dataset : YouTube-8M Dataset - YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.
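
As an illustration of how one of these datasets might be used, the Chatbot Intents entry above describes a JSON file of tags, each with a list of user patterns. The sketch below assumes a layout with the field names `intents`, `tag`, `patterns`, and `responses`; these names and the sample content are assumptions for illustration, so check the actual file before relying on them. The word-overlap matching is a deliberately naive stand-in for a trained model.

```python
import json
from typing import Dict, List, Optional

# Hypothetical sample in the assumed layout; the real file may differ.
SAMPLE = """
{
  "intents": [
    {"tag": "greetings",
     "patterns": ["hello", "hi there"],
     "responses": ["Hello! How can I help?"]},
    {"tag": "goodbye",
     "patterns": ["bye", "see you later"],
     "responses": ["Goodbye!"]},
    {"tag": "pharmacy_search",
     "patterns": ["find a pharmacy", "pharmacy near me"],
     "responses": ["Searching for pharmacies..."]}
  ]
}
"""


def match_tag(intents: List[Dict], message: str) -> Optional[str]:
    """Return the tag whose pattern shares the most words with the message."""
    words = set(message.lower().split())
    best_tag, best_overlap = None, 0
    for intent in intents:
        for pattern in intent["patterns"]:
            overlap = len(words & set(pattern.lower().split()))
            if overlap > best_overlap:
                best_tag, best_overlap = intent["tag"], overlap
    return best_tag  # None when no pattern shares a word with the message


intents = json.loads(SAMPLE)["intents"]
print(match_tag(intents, "is there a pharmacy near me"))
print(match_tag(intents, "bye"))
```

A real chatbot would typically replace `match_tag` with a text classifier trained on the patterns, but the data flow (message in, tag out, response chosen by tag) stays the same.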

Conclusion:
#

Machine learning datasets play a pivotal role in the development and advancement of various machine learning applications. In this article, we have explored an extensive collection of datasets obtained from more than 150 data sources, encompassing classical machine learning, computer vision, NLP/NLU, audio processing, and time series analysis.

By leveraging these diverse datasets, researchers and practitioners can build more robust and accurate machine learning models. These datasets provide the necessary ingredients for training, testing, and validating models across different domains, enabling the development of intelligent systems that can understand, interpret, and make predictions from complex data.

As the field of machine learning continues to evolve, the availability of high-quality datasets remains crucial. Whether you are embarking on a new project or seeking to enhance your existing models, exploring and utilizing these curated datasets will empower you to push the boundaries of what is possible in machine learning.

Remember, the power of machine learning lies not only in the algorithms and techniques but also in the data that fuels them. Embrace the vast array of datasets at your disposal and embark on exciting journeys of discovery and innovation in the world of machine learning.

Dr. Hari Thapliyaal

Writes on data science & AI, project management, and Advaita Vedanta—and builds training and consulting work around those threads.

Education: Doctorate in AI/NLP (SSBM, Geneva); masters study across computer science, business, data science, and economics.
Career: 30+ years in management and technology leadership; 16+ years across the software product lifecycle; a decade in PM training, coaching, and consulting; hands-on Data Science/AI product solution delivery, course design, and mentoring in GenAI, ML, Deep Learning, NLP and Analytics.
Verticals: Solutions and delivery across logistics, BFSI, investment banking, NGOs, staffing, and industrial engineering.
Strengths: Clarifying messy stakeholder problems and turning them into practical outcomes.

Away from work: long meditation and quiet time in nature.
