Thousands of Machine Learning Datasets
Introduction:
Without data there is no machine learning, no AI, and no deep learning. Thanks to heavy automation and IoT devices all around us, there is no dearth of data. The first issue is that, because of privacy and security concerns, data is not available to everyone. The second issue is cleaning this data. The third issue is getting complete data that can solve a given business problem. To get complete data, you need to gather it from multiple sources and identify the key that connects records (samples) across those sources. This is an expensive and time-consuming step of a data science project. If you want to learn data science, or to solve an existing problem using new methods, you need a benchmarking framework that reports the model metrics (recall, precision, accuracy, etc.) of each new approach (algorithm) against a given dataset. Datasets therefore play a critical role in benchmarking algorithm performance.
Thus, machine learning relies heavily on high-quality datasets for training and evaluation. These datasets serve as the foundation for developing robust and accurate models across various AI domains, including classical machine learning, computer vision, natural language processing (NLP), audio processing, and time series analysis. Access to diverse and comprehensive datasets is crucial for researchers and practitioners to tackle real-world problems and advance the field of machine learning.
In this article, I am publishing a curated collection of datasets sourced from over 150 data sources. These datasets and data sources have been carefully selected to cover a wide range of domains, ensuring their relevance to different machine learning applications. Whether you are working on an NLP/text project, a CV/image project, an audio project, time series forecasting, or classical machine learning, you’ll find valuable datasets to support your research and development efforts. Let’s dive into the world of machine learning datasets and discover the wealth of resources available to fuel your projects. If you dig into each link, you will find hundreds, if not thousands, of datasets under many of the links shared. I hope you benefit from this work.
Note: I collected these links from my Chrome bookmarks. At the time of writing this article, I validated each link. If you find that any link does not work, points to the wrong place, or is wrongly described, please help me improve this article. You can write to me at hari.prasad @ vedavit-ps .com.
Note: To find image datasets on this page, search for “image”; for speech datasets, search for “speech”.
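To make the benchmarking idea concrete, here is a minimal sketch in plain Python that scores several approaches against the same labeled dataset. The function names (`confusion_counts`, `benchmark`), the model names, and the toy labels are all hypothetical and chosen only for illustration; a real project would more likely use an established metrics library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def benchmark(y_true, predictions_by_model, positive=1):
    """Return precision, recall and accuracy for each model's predictions."""
    results = {}
    for name, y_pred in predictions_by_model.items():
        tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
        results[name] = {
            # Guard against division by zero when a model never predicts
            # the positive class or the dataset has no positive samples.
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            "accuracy": (tp + tn) / len(y_true) if y_true else 0.0,
        }
    return results


# Toy ground-truth labels and predictions (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 1, 1, 0],
    "model_b": [1, 1, 1, 1, 0, 1, 1, 0],
}

for model_name, metrics in benchmark(y_true, predictions).items():
    print(model_name, metrics)
```

Because every approach is scored against the same labels, the numbers are directly comparable, which is the point of keeping a fixed benchmark dataset.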
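If you save this list as plain text, the same keyword search can be scripted. The sketch below assumes the entries have been copied into a Python list; the `find_entries` helper is hypothetical and does a simple case-insensitive substring match.

```python
def find_entries(entries, keyword):
    """Return the entries that mention the keyword, ignoring case."""
    needle = keyword.lower()
    return [entry for entry in entries if needle in entry.lower()]


# A handful of entries copied from the list for illustration.
entries = [
    "CIFAR-10 and CIFAR-100",
    "Caltech Image Database",
    "Stanford Speech Dataset",
    "Libri Speech Dataset",
    "World Bank Open Data",
]

print(find_entries(entries, "image"))
print(find_entries(entries, "speech"))
```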
List of Datasets and Data Sources
- Approx 100 Datasets by DasarpAI on github
- 100+ Interesting Data Sets for Statistics : 100+ Interesting Data Sets for Statistics
- 100+ Mammography Image Databases : Mammography Image Databases – 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)
- 15 amazon datasets on data.world : amazon data on data.world - 8 datasets available
- 20 Free Big Data Sources : 20 Free Big Data Sources
- 332 Sport Datasets on data.world : sports data on data.world : 338 datasets available
- 40 Open Source Audio Datasets
- 4000+ Groningen Natural Image Database : Groningen Natural Image Database – 4000+ 1536×1024 (16 bit) calibrated outdoor images (Formats: homebrew)
- 450+ UCI datasets
- 538 Datasets
- 57 products datasets on data.world
- 622 UCI Archive Dataset : UCI Archive-Machine Learning Repository: Data Sets
- 9 Voice Datasets from cmwire
- A Collective list of Free API for Datasets : A collective list of free APIs for use in software and web development.
- A list of useful sources : A blog post that includes many data set databases
- Academic Torrents- Large Research dataset : Academic Torrents: distributed network for sharing large research datasets
- Air Freight Dataset - Computer Vision : Air Freight – The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160×120 pixels). (Formats: PNG)
- Airline Safety : contains information on accidents from each airline.
- Allen Institutes Dataset : Datasets – Allen Institute for AI
- Amazon Datasets : Amazon Web Services Public Data Sets
- Amsterdam Library of Object Images - ALOI : Amsterdam Library of Object Images – ALOI is a color image collection of one-thousand small objects, recorded for scientific purposes. In order to capture the sensory variation in object recordings, we systematically varied viewing angle, illumination angle, and illumination color for each object, and additionally captured wide-baseline stereo images. We recorded over a hundred images of each object, yielding a total of 110,250 images for the collection. (Formats: png)
- Annotated face, hand, cardiac & meat images : Annotated face, hand, cardiac & meat images – Most images & annotations are supplemented by various ASM/AAM analyses using the AAM-API. (Formats: bmp,asf)
- Apigee : Apigee: explore dozens of popular APIs
- apilist.fun : API List: A public list of free APIs for programmers
- AT&T Laboratories Cambridge face database - Images : AT&T Laboratories Cambridge face database
- AVHRR Pathfinder : National Centers for Environmental Information
- Awesome Deep Learning Database : Densely Sampled View Spheres – Densely sampled view spheres – upper half of the view sphere of two toy objects with 2500 images each. (Formats: tiff)
- Awesome Public Datasets : Awesome Public Datasets: Well-organized and frequently updated
- Aylien Datasets
- Aylien News Data API
- B2SHARE
- BBC Datasets : Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Class Labels: 5 (business, entertainment, politics, sport, tech)
- Berkeley Segmentation Dataset 500 : Berkeley Segmentation Dataset 500
- Biometric Systems Lab : Biometric Systems Lab – University of Bologna
- Bloomberg + Reuters Finance News Dataset
- Breast Histopathology Images Dataset : This dataset contains 277,524 images of size 50×50 extracted from 162 mount slide images of breast cancer specimens scanned at 40x. There are 198,738 negative tests and 78,786 positive tests with IDC.
- California Water Resources : California’s water resource data.
- Caltech Image Database : Caltech Image Database – about 20 images – mostly top-down views of small objects and toys. (Formats: GIF)
- CAVIAR video sequences of mall and public space behavior : CAVIAR video sequences of mall and public space behavior - 90K video frames in 90 sequences of various human activities, with XML ground truth of detection and behavior classification (Formats: MPEG2 & JPEG)
- CCITT Fax standard images : CCITT Fax standard images – 8 images (Formats: gif)
- Census of India
- Chatbot Intents Dataset : The dataset for a chatbot is a JSON file that has disparate tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user can ask, and the chatbot will respond according to that pattern. The dataset is perfect for understanding how chatbot data works.
- CIFAR-10 and CIFAR-100 : CIFAR-10 and CIFAR-100
- Cityscapes Dataset : It contains high-quality pixel-level annotations of video sequences recorded in the streets of 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.
- CMU CIL’s Stereo Data (Image) : CMU CIL’s Stereo Data with Ground Truth – 3 sets of 11 images, including color tiff images with spectroradiometry (Formats: gif, tiff)
- CMU PIE Database : CMU PIE Database - A database of 41,368 face images of 68 people captured under 13 poses, 43 illumination conditions, and with 4 different expressions.
- CMU VASC Image Database : CMU VASC Image Database – Images, sequences, stereo pairs (thousands of images) (Formats: Sun Rasterimage)
- CodaLab : Hundreds of interesting datasets.
- College Scorecard Data
- Color Detection Dataset : The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values of the color.
- Columbia-Utrecht Reflectance and Texture Database : Columbia-Utrecht Reflectance and Texture Database – Texture and reflectance measurements for over 60 samples of 3D texture, observed with over 200 different combinations of viewing and illumination directions. (Formats: bmp)
- Computational Colour Constancy Data : Computational Colour Constancy Data - A dataset oriented towards computational color constancy, but useful for computer vision in general. It includes synthetic data, camera sensor data, and over 700 images. (Formats: tiff)
- Computational Vision Lab : Computational Vision Lab
- Content-based image retrieval database : Content-based image retrieval database - 11 sets of color images for testing algorithms for content-based retrieval. Most sets have a description file with names of objects in each image. (Formats: jpg)
- Covid-19 Google
- COVID-19 Open Research Dataset Challenge (CORD-19) : The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date.
- Credit Card Fraud Detection : Identify fraudulent credit card transactions.
- Cricket Data
- Crowdanalytics
- Crowdflower Dataset : CrowdFlower: interesting datasets created or enhanced by their contributors
- Crunchbase : Crunchbase: Discover innovative companies and the people behind them
- CVD Foundation Open Images : Open Images dataset – Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
- Data Basin : Science-based mapping and analytics platform.
- Data for Cool DS projects
- Data world
- Data.Gov : The US government portal to open data.
- Data.lacity.org
- DataCamp
- DataInnovation Dataset Blog : Center for Data Innovation: blog posts about interesting, recently-released data sets.
- Dataverse.org : Dataverse Project: searchable archive of research data
- DC Open Data Catalog : DC Open Data Catalog / OpenDataDC
- Deep Fashion - Images : Large-scale Fashion (DeepFashion) Database – Contains over 800,000 diverse fashion images. Each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding box and clothing landmarks
- Devanagari Handwritten Character Dataset - Images
- Donor Choose : Donors Choose: data related to their projects
- Enron Email Dataset : It has more than 500K emails from over 150 users. The size of the data is around 432 MB. Most of the 150 users were senior management at Enron.
- Face and Gesture images and image sequences : Face and Gesture images and image sequences – Several image datasets of faces and gestures that are ground truth annotated for benchmarking http://www.fg-net.org/
- FG-NET Facial Aging Database : FG-NET Facial Aging Database – Database contains 1002 face images showing subjects at different ages. (Formats: jpg)
- Finance Datasets on Kaggle
- Find Datasets: CMU Libraries : Discover high-quality datasets thanks to the collection of Huajin Wang, CMU.
- Finding Datasets from inside-r.org
- Flickr 30k, Images with Caption
- Flickr 8k, Images with Caption
- Flickr Data 100 Million Yahoo dataset, Images : Flickr Data 100 Million Yahoo dataset
- FT Markets Data
- FVC2000 Fingerprint Databases, Images : FVC2000 Fingerprint Databases - FVC2000 is the First International Competition for Fingerprint Verification Algorithms. Four fingerprint databases constitute the FVC2000 benchmark (3520 fingerprints in all).
- Gapminder Data
- German Fingerspelling Database : German Fingerspelling Database – The database contains 35 gestures and consists of 1400 image sequences that contain gestures of 20 different persons recorded under non-uniform daylight lighting conditions. http://www-i6.informatik.rwth-aachen.de/~dreuw/database.html
- Getting Stock Data
- GHTorrent
- Github Activity : contains all public activity on over 2.8 million public Github repositories.
- Github philipperemy/financial-news-dataset
- Github-DataMeet : Datameet is a community of Data Science enthusiasts.
- github.com/TheUpShot : The Upshot: data related to their articles
- Global Terrorism Database (GTD)
- Google Dataset Search (beta)
- Google House Numbers from street view
- Google Scholar
- Google Trends Data Portal : Google trends data can be used to examine and analyze the data visually. We can find out what’s trending and what people are searching for.
- grouplens.org : Sample movie (with ratings), book and wiki datasets
- GTSRB (German traffic sign recognition benchmark) Dataset : Build a model using a deep learning framework that classifies traffic signs and also recognizes the bounding box of signs. The traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate actions.
- Hate crime news : regularly-updated data about hate crimes reported in Google News.
- Hate Speech Dataset in Devanagari from Kaggle
- Historical Weather : data from 9000 NOAA weather stations from 1929 to 2016.
- HowStat : HowSTAT! The Cricket Statisticians – Home Page
- Huggingface datasets
- Humanitarian Data Exchange : Humanitarian Data Exchange
- IEEE DataPort : Data Competitions : IEEE DataPort
- IEN Image Library : IEN Image Library – 1000+ images, mostly outdoor sequences (Formats: raw, ppm)
- Image Analysis Laboratory : Image Analysis Laboratory – Images obtained from a variety of imaging modalities — raw CFA images, range images and a host of “medical images”. (Formats: homebrew)
- Image QA : Image QA
- ImageNet : ImageNet
- IMDb Top 250 Movies : Ratings and Reviews for New Movies and TV Shows – IMDb
- IMDB-Wiki dataset : The IMDB-Wiki dataset is one of the largest open-source datasets of face images labeled with gender and age. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images.
- IMF Data : The International Monetary Fund publishes data on international finances, foreign exchange reserves, commodity prices, and investments.
- IMF-Exchange Rate : IMF-Exchange Rate Archives by Month
- India, Surat City
- Indian Govt
- Indian Liver Patient Dataset
- INRIA
- Institute of Computer Graphics and Vision : Institute of Computer Graphics and Vision
- Inter University Consortium for Politics & Social : Inter-university Consortium for Political and Social Research
- Kaggle Datasets : Kaggle provides datasets with their challenges, but each competition has its own rules as to whether the data can be used outside of the scope of the competition.
- kdnuggets : kdnuggets- Datasets for Data Mining and Data Science
- Kinetics Dataset : There are three different Kinetics datasets: Kinetics 400, Kinetics 600, and Kinetics 700. These are large-scale datasets containing URL links to up to around 650,000 high-quality video clips. They can be used to build a human action recognition model that detects what action a person is performing.
- Libri Speech Dataset : This dataset contains a large collection of English speech recordings derived from the LibriVox project: about 1000 hours of read English speech in various accents. It is intended for speech recognition, where the objective is to automatically identify what is being said in the audio.
- Liver Tumor Segmentation Challenge Dataset
- London Data Store : Lots of datasets on London, UK.
- Mall Customers Dataset : The Mall Customers dataset holds details about people visiting a mall: customer id, age, gender, annual income, and spending score. It can be used to gain insights from the data and to divide customers into groups based on their behavior.
- Mammography Image Databases : Mammography Image Databases - 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)
- Manufacturing Process Failures : a collection of variables that were measured during the manufacturing process. The goal is to predict faults with manufacturing.
- Mashape - Explore APIs : Mashape: explore hundreds of APIs
- Microsoft COCO : Microsoft COCO
- Microsoft Datasets
- Microsoft Research Open Data
- Million Song Dataset : Million Song Dataset
- MIT Vision Texture : MIT Vision Texture – Image archive (100+ images) (Formats: ppm)
- MNIST Handwritten digits : MNIST Handwritten digits
- Multiple Choice Questions : a data set of multiple-choice questions and the corresponding correct answers. The goal is to predict the answer to any given question.
- National Climatic Data Center — NOAA
- NAYN.CO Turkish News with categories
- NLM HyperDoc Visible Human Project : NLM HyperDoc Visible Human Project - Color, CAT and MRI image samples - over 30 images (Formats: jpeg)
- NYC Open Data socrata : NYC Open Data
- OASIS 1 : OASIS-1 (Open Access Series of Imaging Studies)
- OASIS Brain - Imaging Studies : Cross-Sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults
- Open Data Philly : Connecting people with data for Philadelphia
- Open Energy Data Initiative : Over 800 data sets covering energy issues.
- Open Government Data Platform India
- Open Images is a dataset of ~9 million URLs : Open Images dataset - Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
- Parkinson Dataset : Parkinson dataset contains biomedical measurements, 195 records of people with 23 different attributes. This data is used to differentiate healthy people and people with Parkinson’s disease.
- Photometric 3D Surface Texture Database : Photometric 3D Surface Texture Database - This is the first 3D texture database which provides both full real surface rotations and registered photometric stereo data (30 textures, 1680 images). (Formats: TIFF)
- Pittsburgh Science of Learning : Pittsburgh Science of Learning Center’s DataShop
- Political advertisements on Facebook : a free collection of data about Facebook ads that is updated daily.
- ProPublica Data Store : ProPublica Data Store
- Public Git Archive
- Python API for Datasets : Python APIs: Python wrappers for many APIs
- Quandl : Quandl: over 10 million financial, economic, and social datasets
- R Datasets : Rdatasets: collection of 700+ datasets originally distributed with R packages
- RapidAPI.com : 25 Free Public APIs for Developers & Free Alternatives List
- rdatamining.com : RDataMining.com
- Recommender Systems and Personalization Datasets : This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, etc that are used in building a recommender system.
- Reddit Dataset from 2500 subreddits : Reddit Top 2.5 Million: all-time top 1,000 posts from each of the top 2,500 subreddits
- Reddit Dataset Jeopardy Question : 200,000+ Jeopardy questions
- Reddit Dataset : Datasets subreddit: ask for help finding a specific data set, or post your own
- Research.yahoo.com
- Reuters Finance News Dataset Title Only
- Satellite Photograph Order : a set of satellite photos of Earth — the goal is to predict which photos were taken earlier than others.
- Sebastian Raschka : Sebastian Raschka: datasets categorized by format and topic
- Smartcities Data Govt of India
- Stanford Edu Dataset : Stanford Large Network Dataset Collection: graph data
- Stanford Speech Dataset
- Suicide Rates 1985-2013 : Suicide Rates Overview 1985 to 2016 : Kaggle
- Sunlight Foundation Govt Data : Sunlight Foundation: government-focused data
- Tamilnadu : 37K Resources, 4,134 Catalog, 101 Departments
- TED-LIUM corpus release 3
- Temporal concept localization within video - YouTube-8M, Link2 : The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations. In addition to annotating videos, we would like to temporally localize the entities in the videos, i.e., find out when the entities occur.
- The Air Freight data set is a ray-traced image sequence : Air Freight - The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160x120 pixels). (Formats: PNG)
- The MIT-CSAIL Database of Objects and Scenes : The MIT-CSAIL Database of Objects and Scenes - Database for testing multiclass object detection and scene recognition algorithms. Over 72,000 images with 2873 annotated frames. More than 50 annotated object classes. (Formats: jpg)
- Tiny Images 80 Million tiny images : Tiny Images 80 Million tiny images.
- Traffic Image Sequences and ‘Marbled Block’ Sequence : Traffic Image Sequences and ‘Marbled Block’ Sequence - thousands of frames of digitized traffic image sequences as well as the ‘Marbled Block’ sequence (grayscale images) (Formats: GIF)
- Trending YouTube Video Statistics : Sentiment analysis in a variety of forms, Categorising YouTube videos based on their comments and statistics, Training ML algorithms like RNNs to generate their own YouTube comments, Analyzing what factors affect how popular a YouTube video will be, Statistical analysis over time.
- U Oulu wood and knots database : U Oulu wood and knots database - Includes classifications - 1000+ color images (Formats: ppm)
- UC Irvine Machine Learning Repository : UC Irvine Machine Learning Repository
- UCI Machine Learning Datasets : Data for machine learning — lots of labeled data and description of the problem types.
- UCI-Liver Disorder Datasets : UCI Machine Learning Repository: Liver Disorders Data Set
- UCI : UC Irvine Machine Learning Repository: datasets specifically designed for machine learning
- UFO - Geolocation and Time Dataset : UFO reports: geolocated and time-standardized UFO reports for close to a century
- UK Govt
- University of Oulu Physics-based Face Database : University of Oulu Physics-based Face Database - contains color images of faces under different illuminants and camera calibration conditions as well as skin spectral reflectance measurements of each person.
- University of Oulu Texture Database : University of Oulu Texture Database - Database of 320 surface textures, each captured under three illuminants, six spatial resolutions and nine rotation angles. A set of test suites is also provided so that texture segmentation, classification, and retrieval algorithms can be tested in a standard manner. (Formats: bmp, ras, xv)
- UP Govt Economics : Directorate of Economics and Statistics UP Govt.
- UP Smart Cities
- US Census Bureau : US Census Bureau
- US Gov 256K datasets : The Home of the U.S. Government’s Open Data
- US Govt : data.gov (see also: Project Open Data Dashboard)
- US Students Universities
- US Weather History : historical weather data for the US.
- USA Names : contains all Social Security name applications in the US, from 1879 to 2015.
- USF Range Image Data with Segmentation : USF Range Image Data with Segmentation Ground Truth - 80 image sets (Formats: Sun rasterimage)
- Vanderbilt edu dataset websites
- Vanderbilt edu datasets
- Voting machine age : data on the age of voting machines that were used in the 2016 election.
- VQA : Visual Question Answering
- Wikipedia Dataset : Wikipedia:Database download – Wikipedia
- Wiry Object Recognition Database : Wiry Object Recognition Database - Thousands of images of a cart, ladder, stool, bicycle, chairs, and cluttered scenes with ground truth labelings of edges and regions.
- World Bank Open Data : World Bank Open Data
- World Bank Open Data : Datasets covering population demographics and a vast number of economic and development indicators.
- Worldbank Datasets
- Yale Face Database - 165 images : Yale Face Database - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.
- Yale Face Database B - 5760 : Yale Face Database B - 5760 single light source images of 10 subjects each seen under 576 viewing conditions (9 poses x 64 illumination conditions). (Formats: PGM)
- Yelp.com Datasets Challenge : Yelp Dataset Challenge: Yelp reviews, business attributes, users, and more from 10 cities
- YouTube-8M Dataset : YouTube-8M Dataset - YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.
Conclusion:
Machine learning datasets play a pivotal role in the development and advancement of various machine learning applications. In this article, we have explored an extensive collection of datasets obtained from more than 150 data sources, encompassing classical machine learning, computer vision, NLP/NLU, audio processing, and time series analysis.
By leveraging these diverse datasets, researchers and practitioners can build more robust and accurate machine learning models. These datasets provide the necessary ingredients for training, testing, and validating models across different domains, enabling the development of intelligent systems that can understand, interpret, and make predictions from complex data.
As the field of machine learning continues to evolve, the availability of high-quality datasets remains crucial. Whether you are embarking on a new project or seeking to enhance your existing models, exploring and utilizing these curated datasets will empower you to push the boundaries of what is possible in machine learning.
Remember, the power of machine learning lies not only in the algorithms and techniques but also in the data that fuels them. Embrace the vast array of datasets at your disposal and embark on exciting journeys of discovery and innovation in the world of machine learning.