
Datasets

150+ Machine Learning Datasets

Introduction:

Without data there is no machine learning, no AI, and no deep learning. With heavy automation and IoT devices all around us, there is no dearth of data. The first issue is that, because of privacy and security concerns, data is not available to everyone. The second issue is cleaning this data. The third issue is getting complete data that can solve a given business problem. To get complete data, you need to collect it from multiple sources and identify the key that connects the records (samples) from those sources. This is an expensive and time-consuming step of a data science project. If you want to learn data science, or want to solve an existing problem using new methods, you need a benchmarking framework that can display the model metrics (recall, precision, accuracy, etc.) of each new approach (algorithm) against a given dataset. So datasets play a critical role in benchmarking algorithm performance.
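To make the benchmarking idea concrete, here is a minimal sketch in plain Python. The toy dataset, the two "models" (always_positive and threshold_rule), and the helper names are all hypothetical and exist only for illustration; a real benchmark would use one of the datasets listed below and real algorithms.

```python
# Minimal benchmarking sketch: score several approaches on one labeled dataset.
# Everything here (data, model names, helpers) is an illustrative assumption.

def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Toy dataset of (feature, label) pairs.
DATASET = [(0.1, 0), (0.4, 0), (0.35, 1), (0.8, 1),
           (0.7, 1), (0.2, 0), (0.9, 1), (0.55, 0)]

def always_positive(x):
    """Baseline that predicts the positive class for every sample."""
    return 1

def threshold_rule(x):
    """Predict positive when the feature is at least 0.5."""
    return 1 if x >= 0.5 else 0

def benchmark(models, dataset):
    """Evaluate every model on the same dataset and collect its metrics."""
    y_true = [label for _, label in dataset]
    results = {}
    for name, model in models.items():
        y_pred = [model(x) for x, _ in dataset]
        results[name] = binary_metrics(y_true, y_pred)
    return results

MODELS = {"always_positive": always_positive, "threshold_rule": threshold_rule}
for name, metrics in benchmark(MODELS, DATASET).items():
    print(name, metrics)
```

Because every approach is scored on the same dataset with the same metric code, the numbers are directly comparable, which is the point of keeping a fixed benchmark dataset.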

Thus, machine learning relies heavily on high-quality datasets for training and evaluation. These datasets serve as the foundation for developing robust and accurate models across various AI domains, including classical machine learning, computer vision, natural language processing (NLP), audio processing, and time series analysis. Access to diverse and comprehensive datasets is crucial for researchers and practitioners to tackle real-world problems and advance the field of machine learning.

In this article, I am publishing a curated collection of datasets sourced from over 150 data sources. These datasets and data sources have been carefully selected to cover a wide range of domains, ensuring their relevance to different machine learning applications. Whether you are working on an NLP/text project, a CV/image project, an audio project, time series forecasting, or classical machine learning, you’ll find valuable datasets to support your research and development efforts. Let’s dive into the world of machine learning datasets and discover the wealth of resources available to fuel your projects. If you dig into each link, you will find hundreds, if not thousands, of datasets under many of them. I hope you benefit from this work.

Note: I collected these links from my Chrome bookmarks and validated each one at the time of writing. If you find a link that does not work, points to the wrong place, or is described incorrectly, please help me improve this article. You can write to me at hari.prasad @ vedavit-ps .com.

Note: To find image datasets on this page, search for “image”; for speech datasets, search for “speech”.

List of Datasets and Data Sources

Sno. URL Description  
1. 100+ Interesting Data Sets for Statistics 100+ Interesting Data Sets for Statistics  
2. 100+ Mammography Image Databases Mammography Image Databases – 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)  
3. 15 amazon datasets on data.world amazon data on data.world 8 datasets available
4. 20 Free Big Data Sources 20 Free Big Data Sources  
5. 332 Sport Datasets on data.world sports data on data.world 338 datasets available
6. 4000+ Groningen Natural Image Database Groningen Natural Image Database – 4000+ 1536×1024 (16 bit) calibrated outdoor images (Formats: homebrew)  
7. 450+ UCI datasets    
8. 538 Datasets FiveThirtyEight: data and code related to their articles  
9. 538 Datasets Summary GitHub FiveThirtyEight: data and code related to their articles  
10. 57 products datasets on data.world    
11. 622 UCI Archive Dataset UCI Archive-Machine Learning Repository: Data Sets  
12. A Collective list of Free API for Datasets A collective list of free APIs for use in software and web development.  
13. Academic Torrents- Large Research dataset Academic Torrents: distributed network for sharing large research datasets  
14. The Air Freight data set is a ray-traced image sequence Air Freight - The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160x120 pixels). (Formats: PNG)  
15. Air Freight Dataset - Computer Vision Air Freight – The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160×120 pixels). (Formats: PNG)  
16. Allen Institutes Dataset Datasets – Allen Institute for AI  
17. Amazon Datasets Amazon Web Services Public Data Sets  
18. Amsterdam Library of Object Images - ALOI Amsterdam Library of Object Images – ALOI is a color image collection of one-thousand small objects, recorded for scientific purposes. In order to capture the sensory variation in object recordings, we systematically varied viewing angle, illumination angle, and illumination color for each object, and additionally captured wide-baseline stereo images. We recorded over a hundred images of each object, yielding a total of 110,250 images for the collection. (Formats: png)  
19. Annotated face, hand, cardiac & meat images Annotated face, hand, cardiac & meat images – Most images & annotations are supplemented by various ASM/AAM analyses using the AAM-API. (Formats: bmp,asf)  
20. Apigee Apigee: explore dozens of popular APIs  
21. apilist.fun API List: A public list of free APIs for programmers  
22. AT&T Laboratories Cambridge face database - Images AT&T Laboratories Cambridge face database  
23. AVHRR Pathfinder National Centre for Environment Information  
24. Awesome Deep Learning Database Densely Sampled View Spheres – Densely sampled view spheres – upper half of the view sphere of two toy objects with 2500 images each. (Formats: tiff)  
25. Awesome Public Dataset    
26. Awesome Public Datasets Awesome Public Datasets: Well-organized and frequently updated  
27. B2SHARE    
28. Berkeley Segmentation Dataset 500 Berkeley Segmentation Dataset 500  
29. Biometric Systems Lab Biometric Systems Lab – University of Bologna  
30. Caltech Image Database Caltech Image Database – about 20 images – mostly top-down views of small objects and toys. (Formats: GIF)  
31. CAVIAR video sequences of mall and public space behavior CAVIAR video sequences of mall and public space behavior - 90K video frames in 90 sequences of various human activities, with XML ground truth of detection and behavior classification (Formats: MPEG2 & JPEG)  
32. CCITT Fax standard images CCITT Fax standard images – 8 images (Formats: gif)  
33. Census of India    
34. Global Terrorism Database (GTD)    
35. CIFAR-10 and CIFAR-100 CIFAR-10 and CIFAR-100  
36. CMU CIL’s Stereo Data (Image) CMU CIL’s Stereo Data with Ground Truth – 3 sets of 11 images, including color tiff images with spectroradiometry (Formats: gif, tiff)  
37. CMU PIE Database CMU PIE Database - A database of 41,368 face images of 68 people captured under 13 poses, 43 illumination conditions, and with 4 different expressions.  
38. CMU VASC Image Database CMU VASC Image Database – Images, sequences, stereo pairs (thousands of images) (Formats: Sun Rasterimage)  
39. College Scorecard Data    
40. Columbia-Utrecht Reflectance and Texture Database Columbia-Utrecht Reflectance and Texture Database – Texture and reflectance measurements for over 60 samples of 3D texture, observed with over 200 different combinations of viewing and illumination directions. (Formats: bmp)  
41. Computational Colour Constancy Data Computational Colour Constancy Data - A dataset oriented towards computational color constancy, but useful for computer vision in general. It includes synthetic data, camera sensor data, and over 700 images. (Formats: tiff)  
42. Computational Vision Lab Computational Vision Lab  
43. Content-based image retrieval database Content-based image retrieval database - 11 sets of color images for testing algorithms for content-based retrieval. Most sets have a description file with names of objects in each image. (Formats: jpg)  
44. Cricket Data    
45. Crowdanalytics    
46. Crowdflower Dataset CrowdFlower: interesting datasets created or enhanced by their contributors  
47. Crunchbase Crunchbase: Discover innovative companies and the people behind them  
48. CVD Foundation Open Images Open Images dataset – Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.  
49. Data world    
50. Data.lacity.org DataLA  
51. DataCamp    
52. DataInnovation Dataset Blog Center for Data Innovation: blog posts about interesting, recently-released data sets.  
53. Dataverse.org Dataverse Project: searchable archive of research data  
54. DC Open Data Catalog DC Open Data Catalog / OpenDataDC  
55. Deep Fashion - Images Large-scale Fashion (DeepFashion) Database – Contains over 800,000 diverse fashion images. Each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding box and clothing landmarks  
56. Devanagari Handwritten Character Dataset - Images    
57. Donor Choose Donors Choose: data related to their projects  
58. Face and Gesture images and image sequences Face and Gesture images and image sequences – Several image datasets of faces and gestures that are ground truth annotated for benchmarking http://www.fg-net.org/  
59. FG-NET Facial Aging Database FG-NET Facial Aging Database – Database contains 1002 face images showing subjects at different ages. (Formats: jpg)  
60. Finding Datasets from inside-r.org inside-R  
61. Flickr 30k, Images with Caption Flickr 30k  
62. Flickr 8k, Images with Caption Flickr 8k  
63. Flickr Data 100 Million Yahoo dataset, Images Flickr Data 100 Million Yahoo dataset  
64. FVC2000 Fingerprint Databases, Images FVC2000 Fingerprint Databases - FVC2000 is the First International Competition for Fingerprint Verification Algorithms. Four fingerprint databases constitute the FVC2000 benchmark (3520 fingerprints in all).  
65. Gapminder Data    
66. German Fingerspelling Database German Fingerspelling Database – The database contains 35 gestures and consists of 1400 image sequences that contain gestures of 20 different persons recorded under non-uniform daylight lighting conditions. http://www-i6.informatik.rwth-aachen.de/~dreuw/database.html  
67. Getting Stock Data    
68. Github-DataMeet Datameet is a community of Data Science enthusiasts.  
69. Google House Numbers from street view    
70. Google Scholar    
71. HowStat HowSTAT! The Cricket Statisticians – Home Page  
72. github.com/TheUpShot The Upshot: data related to their articles  
73. Huggingface datasets    
74. Humanitarian Data Exchange Humanitarian Data Exchange  
75. IEEE DataPort Data Competitions IEEE DataPort
76. IEN Image Library IEN Image Library – 1000+ images, mostly outdoor sequences (Formats: raw, ppm)  
77. Image Analysis Laboratory Image Analysis Laboratory – Images obtained from a variety of imaging modalities — raw CFA images, range images and a host of “medical images”. (Formats: homebrew)  
78. Image QA Image QA  
79. ImageNet ImageNet  
80. IMDb Top 250 Movies Ratings and Reviews for New Movies and TV Shows – IMDb  
81. IMF-Exchange Rate IMF-Exchange Rate Archives by Month  
82. Indian Govt    
83. Indian Liver Patient Dataset    
84. INRIA INRIA  
85. Institute of Computer Graphics and Vision Institute of Computer Graphics and Vision  
86. Inter University Consortium for Politics & Social Inter-university Consortium for Political and Social Research  
87. Kaggle Datasets Kaggle provides datasets with their challenges, but each competition has its own rules as to whether the data can be used outside of the scope of the competition.  
88. kdnuggets kdnuggets- Datasets for Data Mining and Data Science  
89. Mammography Image Databases Mammography Image Databases - 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)  
90. Mashape - Explore APIs Mashape: explore hundreds of APIs  
91. Microsoft COCO Microsoft COCO  
92. Million Song Dataset Million Song Dataset  
93. MIT Vision Texture MIT Vision Texture – Image archive (100+ images) (Formats: ppm)  
94. MNIST Handwritten digits MNIST Handwritten digits  
95. NLM HyperDoc Visible Human Project NLM HyperDoc Visible Human Project - Color, CAT and MRI image samples - over 30 images (Formats: jpeg)  
96. NYC Open Data socrata NYC Open Data  
97. OASIS 1 OASIS-1 (Open Access Series of Imaging Studies)  
98. OASIS Brain - Imaging Studies Cross-Sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults  
99. Open Images is a dataset of ~9 million URLs Open Images dataset - Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.  
100. Photometric 3D Surface Texture Database Photometric 3D Surface Texture Database - This is the first 3D texture database which provides both full real surface rotations and registered photometric stereo data (30 textures, 1680 images). (Formats: TIFF)  
101. Pittsburgh Science of Learning Pittsburgh Science of Learning Center’s DataShop  
102. ProPublica Data Store ProPublica Data Store  
103. Python API for Datasets Python APIs: Python wrappers for many APIs  
104. Quandl Quandl: over 10 million financial, economic, and social datasets  
105. R Datasets Rdatasets: collection of 700+ datasets originally distributed with R packages  
106. RapidAPI.com 25 Free Public APIs for Developers & Free Alternatives List  
107. rdatamining.com RDataMining.com  
108. Reddit Dataset Datasets subreddit: ask for help finding a specific data set, or post your own  
109. Reddit Dataset from 2500 subreddits Reddit Top 2.5 Million: all-time top 1,000 posts from each of the top 2,500 subreddits  
110. Reddit Dataset Jeopardy Question 200,000+ Jeopardy questions  
111. research.yahoo.com    
112. Sebastian Raschka Sebastian Raschka: datasets categorized by format and topic  
113. Smartcities Data Govt of India    
114. Stanford Edu Dataset Stanford Large Network Dataset Collection: graph data  
115. Suicide Rates 1985-2016 Suicide Rates Overview 1985 to 2016 Kaggle
116. Sunlight Foundation Govt Data Sunlight Foundation: government-focused data  
117. India, Surat City    
118. UP Govt Economics Directorate of Economics and Statistics UP Govt.  
119. UP Smart Cities    
120. Tamilnadu 37K Resources, 4,134 Catalog, 101 Departments  
121. The MIT-CSAIL Database of Objects and Scenes The MIT-CSAIL Database of Objects and Scenes - Database for testing multiclass object detection and scene recognition algorithms. Over 72,000 images with 2873 annotated frames. More than 50 annotated object classes. (Formats: jpg)  
122. Tiny Images 80 Million tiny images Tiny Images 80 Million tiny images  
123. Traffic Image Sequences and ‘Marbled Block’ Sequence Traffic Image Sequences and ‘Marbled Block’ Sequence - thousands of frames of digitized traffic image sequences as well as the ‘Marbled Block’ sequence (grayscale images) (Formats: GIF)  
124. U Oulu wood and knots database U Oulu wood and knots database - Includes classifications - 1000+ color images (Formats: ppm)  
125. UC Irvine Machine Learning Repository UC Irvine Machine Learning Repository  
126. UCI UC Irvine Machine Learning Repository: datasets specifically designed for machine learning  
127. UCI Archive 620+ datasets    
128. UCI-Liver Disorder Datasets UCI Machine Learning Repository: Liver Disorders Data Set  
129. UFO - Geolocation and Time Dataset UFO reports: geolocated and time-standardized UFO reports for close to a century  
130. UK Govt data.gov.uk  
131. University of Oulu Physics-based Face Database University of Oulu Physics-based Face Database - contains color images of faces under different illuminants and camera calibration conditions as well as skin spectral reflectance measurements of each person.  
132. University of Oulu Texture Database University of Oulu Texture Database - Database of 320 surface textures, each captured under three illuminants, six spatial resolutions and nine rotation angles. A set of test suites is also provided so that texture segmentation, classification, and retrieval algorithms can be tested in a standard manner. (Formats: bmp, ras, xv)  
133. US Census Bureau US Census Bureau  
134. US Gov 256K datasets The Home of the U.S. Government’s Open Data  
135. US Govt data.gov (see also: Project Open Data Dashboard)  
136. US Students Universities    
137. USF Range Image Data with Segmentation USF Range Image Data with Segmentation Ground Truth - 80 image sets (Formats: Sun rasterimage)  
138. Vanderbilt edu dataset websites    
139. Vanderbilt edu datasets    
140. VQA Visual Question Answering  
141. Wikipedia Dataset Wikipedia:Database download – Wikipedia  
142. Wiry Object Recognition Database Wiry Object Recognition Database - Thousands of images of a cart, ladder, stool, bicycle, chairs, and cluttered scenes with ground truth labelings of edges and regions.  
143. World Bank Open Data World Bank Open Data  
144. Yale Face Database - 165 images Yale Face Database - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.  
145. Yale Face Database B - 5760 Yale Face Database B - 5760 single light source images of 10 subjects each seen under 576 viewing conditions (9 poses x 64 illumination conditions). (Formats: PGM)  
146. Yelp.com Datasets Challenge Yelp Dataset Challenge: Yelp reviews, business attributes, users, and more from 10 cities  
147. YouTube-8M Dataset YouTube-8M Dataset - YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.  
148. Aylien News Data API    
149. Aylien Datasets    
150. Finance Datasets on Kaggle    
151. Github philipperemy/financial-news-dataset    
152. FT Markets Data    
153. IMF Datasets    
154. Worldbank Datasets    
155. Reuters Finance News Dataset Title Only    
156. Bloomberg + Reuters Finance News Dataset    
157. Enron Email Dataset It has more than 500K emails from over 150 users. The size of the data is around 432 MB. Most of the 150 users were senior management at Enron.  
158. Chatbot Intents Dataset The dataset for a chatbot is a JSON file that has disparate tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user can ask, and the chatbot will respond according to that pattern. The dataset is perfect for understanding how chatbot data works.  
159. Parkinson Dataset Parkinson dataset contains biomedical measurements, 195 records of people with 23 different attributes. This data is used to differentiate healthy people and people with Parkinson’s disease.  
160. Mall Customers Dataset The Mall Customers dataset holds details about people visiting a mall. It includes age, customer ID, gender, annual income, and spending score. It can be used to gain insights from the data and divide customers into groups based on their behavior.  
161. Google Trends Data Portal Google trends data can be used to examine and analyze the data visually. We can find out what’s trending and what people are searching for.  
162. Recommender Systems and Personalization Datasets This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, etc that are used in building a recommender system.  
163. GTSRB (German traffic sign recognition benchmark) Dataset Build a model using a deep learning framework that classifies traffic signs and also recognizes the bounding box of signs. The traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate actions.  
164. Cityscapes Dataset It contains high-quality pixel-level annotations of video sequences recorded in street scenes from 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.  
165. Kinetics Dataset There are three different Kinetics datasets: Kinetics 400, Kinetics 600, and Kinetics 700. This is a large-scale dataset that contains URL links to up to around 650,000 high-quality video clips. It can be used to build a human action recognition model that detects the action a person is performing.  
166. IMDB-Wiki dataset The IMDB-Wiki dataset is one of the largest open-source datasets of face images labeled with gender and age. The images are collected from IMDb and Wikipedia. It has more than 500,000 labeled images.  
167. Color Detection Dataset The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values of the color.  
168. LibriSpeech Dataset This dataset contains a large number of English speech recordings derived from the LibriVox project. It has about 1000 hours of read English speech in various accents. The objective of speech recognition is to automatically identify what is being said in the audio.  
169. Breast Histopathology Images Dataset This dataset contains 277,524 images of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. There are 198,738 IDC-negative and 78,786 IDC-positive images.  
170. youtube-8M analytics    
171. Temporal concept localization within video - YouTube-8M, Link2 The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations. In addition to annotating videos, we would like to temporally localize the entities in the videos, i.e., find out when the entities occur.  
172. CodaLab Hundreds of interesting datasets.  
173. Hate Speech Dataset in Devanagari from Kaggle    
174. Stanford Speech Dataset    
175. TED-LIUM corpus release 3    
176. 40 Open Source Audio Datasets    
177. Microsoft Datasets    
178. 9 Voice Datasets from CMSWire    
179. BBC Datasets Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Class Labels: 5 (business, entertainment, politics, sport, tech)  
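
To illustrate entry 158 (Chatbot Intents Dataset), here is a small sketch of how such a JSON file might be read in Python. The exact layout of the real file is not shown in this article, so the "intents", "tag", "patterns", and "responses" keys, the sample text, and the helper names below are assumptions used only for illustration.

```python
import json

# Hypothetical sample in the spirit of the intents file: each tag has a list
# of patterns a user might type and the responses the chatbot can give.
SAMPLE = """
{"intents": [
  {"tag": "greetings",
   "patterns": ["Hi", "Hello there"],
   "responses": ["Hello! How can I help?"]},
  {"tag": "goodbye",
   "patterns": ["Bye", "See you later"],
   "responses": ["Goodbye!"]},
  {"tag": "pharmacy_search",
   "patterns": ["Find a pharmacy near me"],
   "responses": ["Searching for nearby pharmacies."]}
]}
"""

def build_pattern_index(intents_doc):
    """Map each lower-cased pattern to the intent it belongs to."""
    index = {}
    for intent in intents_doc["intents"]:
        for pattern in intent["patterns"]:
            index[pattern.lower()] = intent
    return index

def reply(index, user_text, fallback="Sorry, I did not understand that."):
    """Return (tag, response) for an exact pattern match, else (None, fallback)."""
    intent = index.get(user_text.strip().lower())
    if intent is None:
        return None, fallback
    return intent["tag"], intent["responses"][0]

index = build_pattern_index(json.loads(SAMPLE))
print(reply(index, "hello there"))
print(reply(index, "what is the weather?"))
```

Exact matching is used here only to keep the sketch short; a real chatbot would normally train a text classifier on the patterns so that unseen phrasings map to the right tag.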

Conclusion:

Machine learning datasets play a pivotal role in the development and advancement of various machine learning applications. In this article, we have explored an extensive collection of datasets obtained from more than 150 data sources, encompassing classical machine learning, computer vision, NLP/NLU, audio processing, and time series analysis.

By leveraging these diverse datasets, researchers and practitioners can build more robust and accurate machine learning models. These datasets provide the necessary ingredients for training, testing, and validating models across different domains, enabling the development of intelligent systems that can understand, interpret, and make predictions from complex data.

As the field of machine learning continues to evolve, the availability of high-quality datasets remains crucial. Whether you are embarking on a new project or seeking to enhance your existing models, exploring and utilizing these curated datasets will empower you to push the boundaries of what is possible in machine learning.

Remember, the power of machine learning lies not only in the algorithms and techniques but also in the data that fuels them. Embrace the vast array of datasets at your disposal and embark on exciting journeys of discovery and innovation in the world of machine learning.