
Datasets

Thousands of Machine Learning Datasets

Introduction:

Without data there is no machine learning, no AI, and no deep learning. With heavy automation and IoT devices all around us, there is no dearth of data, but three issues remain. First, because of privacy and security concerns, data is not available to everyone. Second, the data has to be cleaned. Third, you need complete data that can solve a given business problem, and getting it means pulling data from multiple sources and identifying the keys that connect records (samples) across those sources. This is an expensive and time-consuming step of any data science project. If you want to learn data science, or solve an existing problem with new methods, you need a benchmarking framework that reports the model metrics (recall, precision, accuracy, etc.) of each new approach (algorithm) against a given dataset. Datasets therefore play a critical role in benchmarking algorithm performance.
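To make the benchmarking idea concrete, here is a minimal sketch in plain Python. It scores two stand-in "algorithms" (simple thresholds) against the same tiny labeled dataset and reports accuracy, precision, and recall for each. The data, the model names, and the helper functions `classification_metrics` and `benchmark` are all made up for illustration; in practice you would plug in a real dataset from the list below and real models.

```python
from typing import Callable, Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def benchmark(
    models: Dict[str, Callable[[float], int]],
    features: Sequence[float],
    labels: Sequence[int],
) -> Dict[str, Dict[str, float]]:
    """Score every model on the same dataset so the results are comparable."""
    return {
        name: classification_metrics(labels, [model(x) for x in features])
        for name, model in models.items()
    }


# Toy dataset (illustrative values only, not taken from any listed source).
features = [0.1, 0.4, 0.35, 0.8, 0.9, 0.65]
labels = [0, 0, 1, 1, 1, 0]

# Two hypothetical "algorithms": thresholds on a single feature.
models = {
    "threshold_0.5": lambda x: int(x >= 0.5),
    "threshold_0.3": lambda x: int(x >= 0.3),
}

results = benchmark(models, features, labels)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because every model is scored on the same data with the same metric code, the numbers can be compared directly. On this toy data the lower threshold gains recall at the cost of precision.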

Thus, machine learning relies heavily on high-quality datasets for training and evaluation. These datasets serve as the foundation for developing robust and accurate models across various AI domains, including classical machine learning, computer vision, natural language processing (NLP), audio processing, and time series analysis. Access to diverse and comprehensive datasets is crucial for researchers and practitioners to tackle real-world problems and advance the field of machine learning.

In this article, I am publishing a curated collection of datasets sourced from over 150 data sources. These datasets and data sources have been carefully selected to cover a wide range of domains, ensuring their relevance to different machine learning applications. Whether you are working on an NLP/text project, a CV/image project, an audio project, time series forecasting, or classical machine learning, you’ll find valuable datasets to support your research and development efforts. Let’s dive into the world of machine learning datasets and discover the wealth of resources available to fuel your projects. If you dig into each link, you will find hundreds, if not thousands, of datasets under many of the links shared. I hope you will benefit from this work.

Note: I gathered these links from my Chrome bookmarks. At the time of writing this article, I validated each link. If you find a link that does not work, points to the wrong place, or is described incorrectly, please help me improve this article. You can write to me at hari.prasad @ vedavit-ps .com.

Note: To find image datasets on this page, search for “image”; for speech datasets, search for “speech”.

List of Datasets and Data Sources

  1. Approx 100 Datasets by DasarpAI on github
  2. 100+ Interesting Data Sets for Statistics : 100+ Interesting Data Sets for Statistics
  3. 100+ Mammography Image Databases : Mammography Image Databases – 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)
  4. 15 Amazon datasets on data.world : Amazon data on data.world - 8 datasets available
  5. 20 Free Big Data Sources : 20 Free Big Data Sources
  6. 332 Sport Datasets on data.world : sports data on data.world - 338 datasets available
  7. 40 Open Source Audio Datasets
  8. 4000+ Groningen Natural Image Database : Groningen Natural Image Database – 4000+ 1536×1024 (16 bit) calibrated outdoor images (Formats: homebrew)
  9. 450+ UCI datasets
  10. 538 Datasets
  11. 57 products datasets on data.world
  12. 622 UCI Archive Dataset : UCI Archive-Machine Learning Repository: Data Sets
  13. 9 Voice Datasets from cmwire
  14. A Collective list of Free API for Datasets : A collective list of free APIs for use in software and web development.
  15. A list of useful sources : A blog post that includes many dataset databases
  16. Academic Torrents- Large Research dataset : Academic Torrents: distributed network for sharing large research datasets
  17. Air Freight Dataset - Computer Vision : Air Freight – The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160×120 pixels). (Formats: PNG)
  18. Airline Safety : contains information on accidents from each airline.
  19. Allen Institutes Dataset : Datasets – Allen Institute for AI
  20. Amazon Datasets : Amazon Web Services Public Data Sets
  21. Amsterdam Library of Object Images - ALOI : Amsterdam Library of Object Images – ALOI is a color image collection of one-thousand small objects, recorded for scientific purposes. In order to capture the sensory variation in object recordings, we systematically varied viewing angle, illumination angle, and illumination color for each object, and additionally captured wide-baseline stereo images. We recorded over a hundred images of each object, yielding a total of 110,250 images for the collection. (Formats: png)
  22. Annotated face, hand, cardiac & meat images : Annotated face, hand, cardiac & meat images – Most images & annotations are supplemented by various ASM/AAM analyses using the AAM-API. (Formats: bmp,asf)
  23. Apigee : Apigee: explore dozens of popular APIs
  24. apilist.fun : API List: A public list of free APIs for programmers
  25. AT&T Laboratories Cambridge face database - Images : AT&T Laboratories Cambridge face database
  26. AVHRR Pathfinder : National Centers for Environmental Information
  27. Awesome Deep Learning Database : Densely Sampled View Spheres – upper half of the view sphere of two toy objects with 2500 images each. (Formats: tiff)
  28. Awesome Public Datasets : Awesome Public Datasets: Well-organized and frequently updated
  29. Aylien Datasets
  30. Aylien News Data API
  31. B2SHARE
  32. BBC Datasets : Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Class Labels: 5 (business, entertainment, politics, sport, tech)
  33. Berkeley Segmentation Dataset 500 : Berkeley Segmentation Dataset 500
  34. Biometric Systems Lab : Biometric Systems Lab – University of Bologna
  35. Bloomberg + Reuters Finance News Dataset
  36. Breast Histopathology Images Dataset : This dataset contains 277,524 image patches of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. Of these, 198,738 patches are IDC negative and 78,786 are IDC positive.
  37. California Water Resources : California’s water resource data.
  38. Caltech Image Database : Caltech Image Database – about 20 images – mostly top-down views of small objects and toys. (Formats: GIF)
  39. CAVIAR video sequences of mall and public space behavior : CAVIAR video sequences of mall and public space behavior - 90K video frames in 90 sequences of various human activities, with XML ground truth of detection and behavior classification (Formats: MPEG2 & JPEG)
  40. CCITT Fax standard images : CCITT Fax standard images – 8 images (Formats: gif)
  41. Census of India
  42. Chatbot Intents Dataset : The dataset for a chatbot is a JSON file that has disparate tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user can ask, and the chatbot will respond according to that pattern. The dataset is perfect for understanding how chatbot data works.
  43. CIFAR-10 and CIFAR-100 : CIFAR-10 and CIFAR-100
  44. Cityscapes Dataset : It contains high-quality pixel-level annotations of video sequences recorded in street scenes from 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.
  45. CMU CIL’s Stereo Data (Image) : CMU CIL’s Stereo Data with Ground Truth – 3 sets of 11 images, including color tiff images with spectroradiometry (Formats: gif, tiff)
  46. CMU PIE Database : CMU PIE Database - A database of 41,368 face images of 68 people captured under 13 poses, 43 illumination conditions, and with 4 different expressions.
  47. CMU VASC Image Database : CMU VASC Image Database – Images, sequences, stereo pairs (thousands of images) (Formats: Sun Rasterimage)
  48. CodaLab : Hundreds of interesting datasets.
  49. College Scorecard Data
  50. Color Detection Dataset : The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values of the color.
  51. Columbia-Utrecht Reflectance and Texture Database : Columbia-Utrecht Reflectance and Texture Database – Texture and reflectance measurements for over 60 samples of 3D texture, observed with over 200 different combinations of viewing and illumination directions. (Formats: bmp)
  52. Computational Colour Constancy Data : Computational Colour Constancy Data - A dataset oriented towards computational color constancy, but useful for computer vision in general. It includes synthetic data, camera sensor data, and over 700 images. (Formats: tiff)
  53. Computational Vision Lab : Computational Vision Lab
  54. Content-based image retrieval database : Content-based image retrieval database - 11 sets of color images for testing algorithms for content-based retrieval. Most sets have a description file with names of objects in each image. (Formats: jpg)
  55. Covid-19 Google
  56. COVID-19 Open Research Dataset Challenge (CORD-19) : The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date.
  57. Credit Card Fraud Detection : Identify fraudulent credit card transactions.
  58. Cricket Data
  59. Crowdanalytics
  60. Crowdflower Dataset : CrowdFlower: interesting datasets created or enhanced by their contributors
  61. Crunchbase : Crunchbase: Discover innovative companies and the people behind them
  62. CVD Foundation Open Images : Open Images dataset – Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
  63. Data Basin : Science-based mapping and analytics platform.
  64. Data for Cool DS projects
  65. Data world
  66. Data.Gov : The US government portal to open data.
  67. Data.lacity.org
  68. DataCamp
  69. DataInnovation Dataset Blog : Center for Data Innovation: blog posts about interesting, recently-released data sets.
  70. Dataverse.org : Dataverse Project: searchable archive of research data
  71. DC Open Data Catalog : DC Open Data Catalog / OpenDataDC
  72. Deep Fashion - Images : Large-scale Fashion (DeepFashion) Database – Contains over 800,000 diverse fashion images. Each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding box and clothing landmarks
  73. Devanagari Handwritten Character Dataset - Images
  74. Donor Choose : Donors Choose: data related to their projects
  75. Enron Email Dataset : It has more than 500K emails from over 150 users. The data is around 432 MB in size. Most of the 150 users are senior management of Enron.
  76. Face and Gesture images and image sequences : Face and Gesture images and image sequences – Several image datasets of faces and gestures that are ground truth annotated for benchmarking http://www.fg-net.org/
  77. FG-NET Facial Aging Database : FG-NET Facial Aging Database – Database contains 1002 face images showing subjects at different ages. (Formats: jpg)
  78. Finance Datasets on Kaggle
  79. Find Datasets: CMU Libraries : Discover high-quality datasets thanks to the collection of Huajin Wang, CMU.
  80. Finding Datasets from inside-r.org
  81. Flickr 30k, Images with Caption
  82. Flickr 8k, Images with Caption
  83. Flickr Data 100 Million Yahoo dataset, Images : Flickr Data 100 Million Yahoo dataset
  84. FT Markets Data
  85. FVC2000 Fingerprint Databases, Images : FVC2000 Fingerprint Databases - FVC2000 is the First International Competition for Fingerprint Verification Algorithms. Four fingerprint databases constitute the FVC2000 benchmark (3520 fingerprints in all).
  86. Gapminder Data
  87. German Fingerspelling Database : German Fingerspelling Database – The database contains 35 gestures and consists of 1400 image sequences that contain gestures of 20 different persons recorded under non-uniform daylight lighting conditions. http://www-i6.informatik.rwth-aachen.de/~dreuw/database.html
  88. Getting Stock Data
  89. GHTorrent
  90. Github Activity : contains all public activity on over 2.8 million public Github repositories.
  91. Github philipperemy/financial-news-dataset
  92. Github-DataMeet : Datameet is a community of Data Science enthusiasts.
  93. github.com/TheUpShot : The Upshot: data related to their articles
  94. Global Terrorism Database (GTD)
  95. Google Dataset Search (beta)
  96. Google House Numbers from street view
  97. Google Scholar
  98. Google Trends Data Portal : Google trends data can be used to examine and analyze the data visually. We can find out what’s trending and what people are searching for.
  99. grouplens.org : Sample movie (with ratings), book and wiki datasets
  100. GTSRB (German traffic sign recognition benchmark) Dataset : Build a model using a deep learning framework that classifies traffic signs and also recognizes the bounding box of signs. The traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate actions.
  101. Hate crime news : regularly-updated data about hate crimes reported in Google News.
  102. Hate Speech Dataset in Devnagari from Kaggle :
  103. Historical Weather : data from 9000 NOAA weather stations from 1929 to 2016.
  104. HowStat : HowSTAT! The Cricket Statisticians – Home Page
  105. Huggingface datasets
  106. Humanitarian Data Exchange : Humanitarian Data Exchange
  107. IEEE DataPort : Data Competitions - IEEE DataPort
  108. IEN Image Library : IEN Image Library – 1000+ images, mostly outdoor sequences (Formats: raw, ppm)
  109. Image Analysis Laboratory : Image Analysis Laboratory – Images obtained from a variety of imaging modalities — raw CFA images, range images and a host of “medical images”. (Formats: homebrew)
  110. Image QA : Image QA
  111. ImageNet : ImageNet
  112. IMDb Top 250 Movies : Ratings and Reviews for New Movies and TV Shows – IMDb
  113. IMDB-Wiki dataset : The IMDB-Wiki dataset is one of the largest open-source datasets of face images with gender and age labels. The images are collected from IMDb and Wikipedia. It has more than 500,000 labeled images.
  114. IMF Data : The International Monetary Fund publishes data on international finances, foreign exchange reserves, commodity prices, and investments.
  115. IMF-Exchange Rate : IMF-Exchange Rate Archives by Month
  116. India, Surat City
  117. Indian Govt
  118. Indian Liver Patient Dataset
  119. INRIA
  120. Institute of Computer Graphics and Vision : Institute of Computer Graphics and Vision
  121. Inter-university Consortium for Political and Social Research
  122. Kaggle Datasets : Kaggle provides datasets with their challenges, but each competition has its own rules as to whether the data can be used outside of the scope of the competition.
  123. kdnuggets : kdnuggets- Datasets for Data Mining and Data Science
  124. Kinetics Dataset : There are three Kinetics datasets: Kinetics 400, Kinetics 600, and Kinetics 700. This large-scale collection contains URL links to around 650,000 high-quality video clips. It can be used to build a human action recognition model that detects the action a person is performing.
  125. LibriSpeech Dataset : This dataset contains a large number of English speech recordings derived from the LibriVox project. It has 1000 hours of read English speech in various accents. The objective of speech recognition is to automatically identify what is being said in the audio.
  126. Liver Tumor Segmentation Challenge Dataset
  127. London Data Store : Lots of datasets on London, UK.
  128. Mall Customers Dataset : The Mall Customers dataset holds details about people visiting a mall: customer ID, age, gender, annual income, and spending score. It can be used to gain insights from the data and divide customers into groups based on their behavior.
  129. Mammography Image Databases : Mammography Image Databases - 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)
  130. Manufacturing Process Failures : a collection of variables that were measured during the manufacturing process. The goal is to predict faults with manufacturing.
  131. Mashape - Explore APIs : Mashape: explore hundreds of APIs
  132. Microsoft COCO : Microsoft COCO
  133. Microsoft Datasets
  134. Microsoft Research Open Data
  135. Million Song Dataset : Million Song Dataset
  136. MIT Vision Texture : MIT Vision Texture – Image archive (100+ images) (Formats: ppm)
  137. MNIST Handwritten digits : MNIST Handwritten digits
  138. Multiple Choice Questions : a data set of multiple-choice questions and the corresponding correct answers. The goal is to predict the answer to any given question.
  139. National Climatic Data Center — NOAA
  140. NAYN.CO Turkish News with categories
  141. NLM HyperDoc Visible Human Project : NLM HyperDoc Visible Human Project - Color, CAT and MRI image samples - over 30 images (Formats: jpeg)
  142. NYC Open Data socrata : NYC Open Data
  143. OASIS 1 : OASIS-1 (Open Access Series of Imaging Studies)
  144. OASIS Brain - Imaging Studies : Cross-Sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults
  145. Open Data Philly : Connecting people with data for Philadelphia
  146. Open Energy Data Initiative : Over 800 data sets covering energy issues.
  147. Open Government Data Platform India
  148. Open Images is a dataset of ~9 million URLs : Open Images dataset - Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
  149. Parkinson Dataset : Parkinson dataset contains biomedical measurements, 195 records of people with 23 different attributes. This data is used to differentiate healthy people and people with Parkinson’s disease.
  150. Photometric 3D Surface Texture Database : Photometric 3D Surface Texture Database - This is the first 3D texture database which provides both full real surface rotations and registered photometric stereo data (30 textures, 1680 images). (Formats: TIFF)
  151. Pittsburgh Science of Learning : Pittsburgh Science of Learning Center’s DataShop
  152. Political advertisements on Facebook : a free collection of data about Facebook ads that is updated daily.
  153. ProPublica Data Store : ProPublica Data Store
  154. Public Git Archive
  155. Python API for Datasets : Python APIs: Python wrappers for many APIs
  156. Quandl : over 10 million financial, economic, and social datasets
  157. R Datasets : Rdatasets: collection of 700+ datasets originally distributed with R packages
  158. RapidAPI.com : 25 Free Public APIs for Developers & Free Alternatives List
  159. rdatamining.com : RDataMining.com
  160. Recommender Systems and Personalization Datasets : This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, etc that are used in building a recommender system.
  161. Reddit Dataset from 2500 subreddits : Reddit Top 2.5 Million: all-time top 1,000 posts from each of the top 2,500 subreddits
  162. Reddit Dataset Jeopardy Question : 200,000+ Jeopardy questions
  163. Reddit Dataset : Datasets subreddit: ask for help finding a specific data set, or post your own
  164. Research.yahoo.com
  165. Reuter Finance News Dataset Title Only :
  166. Satellite Photograph Order : a set of satellite photos of Earth — the goal is to predict which photos were taken earlier than others.
  167. Sebastian Raschka : Sebastian Raschka: datasets categorized by format and topic
  168. Smartcities Data Govt of India :
  169. Stanford Edu Dataset : Stanford Large Network Dataset Collection: graph data
  170. Stanford Speech Dataset
  171. Suicide Rates 1985-2013 : Suicide Rates Overview 1985 to 2016** : Kaggle
  172. Sunlight Foundation Govt Data : Sunlight Foundation: government-focused data
  173. Tamilnadu : 37K Resources, 4,134 Catalog, 101 Departments
  174. TED-LIUM corpus release 3
  175. Temporal concept localization within video - YouTube-8M, Link2 : The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations. In addition to annotating videos, we would like to temporally localize the entities in the videos, i.e., find out when the entities occur.
  176. The Air Freight data set is a ray-traced image sequence : Air Freight - The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160x120 pixels). (Formats: PNG)
  177. The MIT-CSAIL Database of Objects and Scenes : The MIT-CSAIL Database of Objects and Scenes - Database for testing multiclass object detection and scene recognition algorithms. Over 72,000 images with 2873 annotated frames. More than 50 annotated object classes. (Formats: jpg)
  178. Tiny Images 80 Million tiny images : Tiny Images 80 Million tiny images
  179. Traffic Image Sequences and ‘Marbled Block’ Sequence : Traffic Image Sequences and ‘Marbled Block’ Sequence - thousands of frames of digitized traffic image sequences as well as the ‘Marbled Block’ sequence (grayscale images) (Formats: GIF)
  180. Trending YouTube Video Statistics : Sentiment analysis in a variety of forms, Categorising YouTube videos based on their comments and statistics, Training ML algorithms like RNNs to generate their own YouTube comments, Analyzing what factors affect how popular a YouTube video will be, Statistical analysis over time.
  181. U Oulu wood and knots database : U Oulu wood and knots database - Includes classifications - 1000+ color images (Formats: ppm)
  182. UC Irvine Machine Learning Repository : UC Irvine Machine Learning Repository
  183. UCI Machine Learning Datasets : Data for machine learning — lots of labeled data and description of the problem types.
  184. UCI-Liver Disorder Datasets : UCI Machine Learning Repository: Liver Disorders Data Set
  185. UCI : UC Irvine Machine Learning Repository: datasets specifically designed for machine learning
  186. UFO - Geolocation and Time Dataset : UFO reports: geolocated and time-standardized UFO reports for close to a century
  187. UK Govt
  188. University of Oulu Physics-based Face Database : University of Oulu Physics-based Face Database - contains color images of faces under different illuminants and camera calibration conditions as well as skin spectral reflectance measurements of each person.
  189. University of Oulu Texture Database : University of Oulu Texture Database - Database of 320 surface textures, each captured under three illuminants, six spatial resolutions and nine rotation angles. A set of test suites is also provided so that texture segmentation, classification, and retrieval algorithms can be tested in a standard manner. (Formats: bmp, ras, xv)
  190. UP Govt Economics : Directorate of Economics and Statistics UP Govt.
  191. UP Smart Cities
  192. US Census Bureau : US Census Bureau
  193. US Gov 256K datasets : The Home of the U.S. Government’s Open Data
  194. US Govt : data.gov (see also: Project Open Data Dashboard)
  195. US Students Universities
  196. US Weather History : historical weather data for the US.
  197. USA Names : contains all Social Security name applications in the US, from 1879 to 2015.
  198. USF Range Image Data with Segmentation : USF Range Image Data with Segmentation Ground Truth - 80 image sets (Formats: Sun rasterimage)
  199. Vanderbilt edu dataset websites :
  200. Vanderbilt edu datasets :
  201. Voting machine age : data on the age of voting machines that were used in the 2016 election.
  202. VQA : Visual Question Answering
  203. Wikipedia Dataset : Wikipedia:Database download – Wikipedia
  204. Wiry Object Recognition Database : Wiry Object Recognition Database - Thousands of images of a cart, ladder, stool, bicycle, chairs, and cluttered scenes with ground truth labelings of edges and regions.
  205. World Bank Open Data : World Bank Open Data
  206. World Bank Open Data : Datasets covering population demographics and a vast number of economic and development indicators.
  207. Worldbank Datasets
  208. Yale Face Database - 165 images : Yale Face Database - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.
  209. Yale Face Database B - 5760 : Yale Face Database B - 5760 single light source images of 10 subjects each seen under 576 viewing conditions (9 poses x 64 illumination conditions). (Formats: PGM)
  210. Yelp.com Datasets Challenge : Yelp Dataset Challenge: Yelp reviews, business attributes, users, and more from 10 cities
  211. YouTube-8M Dataset : YouTube-8M Dataset - YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.
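As one example of putting a dataset from this list to work, the Color Detection Dataset (item 50) is described as a CSV of color names with their RGB values. The sketch below shows one common way to use such a file: look up the named color closest to an arbitrary RGB value by squared Euclidean distance. The column names (`color_name`, `R`, `G`, `B`) and the five sample rows are assumptions made for illustration; the real file's layout may differ, so adjust the field names after inspecting it.

```python
import csv
import io
from typing import List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical stand-in for the real CSV; column names are assumed, not confirmed.
SAMPLE_CSV = """color_name,R,G,B
Black,0,0,0
White,255,255,255
Red,255,0,0
Green,0,255,0
Blue,0,0,255
"""


def load_colors(csv_text: str) -> List[Tuple[str, RGB]]:
    """Parse CSV text into a list of (name, (r, g, b)) entries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["color_name"], (int(row["R"]), int(row["G"]), int(row["B"])))
        for row in reader
    ]


def nearest_color(rgb: RGB, colors: List[Tuple[str, RGB]]) -> str:
    """Return the name of the color with the smallest squared RGB distance."""
    def squared_distance(entry: Tuple[str, RGB]) -> int:
        _, (r, g, b) = entry
        return (r - rgb[0]) ** 2 + (g - rgb[1]) ** 2 + (b - rgb[2]) ** 2

    return min(colors, key=squared_distance)[0]


colors = load_colors(SAMPLE_CSV)
print(nearest_color((250, 10, 20), colors))   # closest to pure red
print(nearest_color((10, 10, 240), colors))   # closest to pure blue
```

To run this against the downloaded file, read its contents with `open(path, newline="")` and pass the text to `load_colors`. Squared distance in RGB space is a simple choice, not a perceptually accurate one.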

Conclusion:

Machine learning datasets play a pivotal role in the development and advancement of various machine learning applications. In this article, we have explored an extensive collection of datasets obtained from more than 150 data sources, encompassing classical machine learning, computer vision, NLP/NLU, audio processing, and time series analysis.

By leveraging these diverse datasets, researchers and practitioners can build more robust and accurate machine learning models. These datasets provide the necessary ingredients for training, testing, and validating models across different domains, enabling the development of intelligent systems that can understand, interpret, and make predictions from complex data.

As the field of machine learning continues to evolve, the availability of high-quality datasets remains crucial. Whether you are embarking on a new project or seeking to enhance your existing models, exploring and utilizing these curated datasets will empower you to push the boundaries of what is possible in machine learning.

Remember, the power of machine learning lies not only in the algorithms and techniques but also in the data that fuels them. Embrace the vast array of datasets at your disposal and embark on exciting journeys of discovery and innovation in the world of machine learning.
