
My Journey from Master's to PhD in Data Science and AI


I worked in software development from 1993 to 2009, several of those years in senior leadership roles spanning delivery management, project management, CMMI, ISO, ISMS, PMO, and so on. In 2010 I moved into project management training and consulting. By 2018, I was considering a return to technology, but this time with a completely new stack: I decided to move into Data Science and AI. I knew about as much AI as any typical senior software manager does, which made me highly confident that I could pick it up quickly. On top of that, plenty of content is available on the internet, along with many YouTube channels and courses, so I thought it would be a cakewalk. I started learning the technology in my own way, in whatever free time I could commit. Within five to six months of intensive study, I realized that this approach was giving me knowledge, but not the confidence to solve real problems. The more I learned, the more I felt I did not know how much more I needed to learn or how far I still had to go.

After a lot of contemplation, I decided I had to make a long-term commitment of both time and money to learning the new technology. Otherwise, I would know a lot of jargon but never gain the confidence to solve problems with that knowledge, nor be able to appreciate the future scope of AI and the avenues it opens. So I enrolled in the M.S. in Data Science, a joint program offered by IIIT Bangalore and LJMU (Liverpool John Moores University). Including pre-course preparation, the course itself, and post-course evaluation, it took around two years. I learned so much that it changed my perspective on AI, Machine Learning, Deep Learning, Robotics, IoT, Big Data, and their impact on business.

I am extremely enthusiastic about utilizing AI technology to address issues related to Indian languages. That is precisely why I chose to base my MS thesis on the topic of “Sarcasm Detection in Hinglish Language (SDHL).” During this period, however, the NLP landscape was changing rapidly due to the emergence of Transformer technologies and Large Language Models (LLMs). Despite having numerous intriguing projects in mind revolving around Indian culture and languages, I faced constraints of time, funding, and the maturity of the technology. Consequently, I became fascinated by the question-answering task in NLP and decided to focus on applying it to Indian historical books. I embarked on this journey by selecting “The Mahabharat” as the book of interest. My Ph.D. thesis was “AI-Powered Historical Book Question Answering”.

I confess to the following about this journey, and I suggest that anyone who wants to embark on it be ready for at least this much, if not more. If your experience turns out better than what I describe below, consider yourself lucky.

  • It was a completely lonely journey; nobody will walk it with you.
  • You need to learn and master a programming language for data scraping, data cleaning, model training, model evaluation, and data analysis. I stayed focused on Python.
  • Learn NLP from scratch: from the basic nuances of any language, such as parts of speech, lemmatization, and grammar, to text embeddings, neural networks, and the mathematics behind them.
  • Transformer technologies
  • Large Language Models
  • Usage of different databases: SQL, NoSQL, vector, and graph databases.
  • How to select relevant research papers, read technical papers, take notes while filtering, use others' work, and cite it properly.
  • Finding gaps and improvement opportunities in papers.
  • Finding which trick or technique from a paper can be used or refined for your own work.
  • Learning LaTeX
  • Learning Research Management
  • Learning NLP model building and fine-tuning.
  • Learning model evaluation, especially for NLP and question-answering work.
  • Cloud platforms from Google (Vertex AI), Amazon (AWS SageMaker), and Microsoft (Azure Machine Learning)

I learned so many things on this journey that it is not possible to sum them all up in one blog post. Along the way I wrote dozens of articles, and in the interest of brevity this post links to them rather than repeating them. Whether you are a researcher, an established AI professional, or a project manager, you will find something of value here. My research topic was “AI-Powered Historical Books Question Answering”; I am not discussing the topic itself in this article, as it deserves multiple articles of its own in the future.

DBA vs PhD: Differences and Similarities

Learning what qualifies as research, and how to conduct serious research, is important for any career. In the early years, we apply whatever we learned from college, institutions, colleagues, and organizational training. If you keep learning seriously, a time comes when you want to solve problems with new methods: methods that are more efficient, produce better quality, are more economical, safer and more secure, environmentally friendly, and so on. But when we finally get to lead those initiatives, we often feel stuck. It is easy to start with an idea; taking it to a conclusion demands a different rigor.

That rigor comes from academic training. With my experience in software development, process improvement, project management training, and consulting, I can say in all seriousness that it does not come without going through a serious research project. Research demands a different kind of mindset: the sponsor spends money to learn how a problem can be solved, and, just as importantly, how it cannot be solved. Both must be demonstrated with empirical data.

If you are serious about acquiring this mindset, a Ph.D. is the way; it gives you that rigor. A Ph.D. is wonderful and rigorous, but the issue is that after one or two decades in industry, when you have the maturity to solve problems yourself, you do not want to work under a university professor to learn the methodology, and certainly not at the cost of lost salary and repeated rejections. Too often it turns into an ego fight between an established professor and an established professional. A Ph.D. makes more sense when you are continuing straight from your master's, or within the first five years of your career. After that, the gains are smaller and the losses larger, which is why so few people with solid corporate backgrounds choose this path.

There is an alternative path for experienced working professionals: the DBA, or Doctorate in Business Administration. Do not confuse this with a typical administration degree. In business you build either domain expertise or technology expertise, and if you want to take either to the next level through innovation and research, a DBA makes complete sense. The good thing is that your guides come from industry rather than the university, so they understand you better than academic professors do. If you are lucky, you may find a guide with deep experience in the very domain you want to pursue, and if you are luckier still, you may find a research partner working on the same topic, whether that is your guide or a fellow student.

My Learnings

I would like to summarise my learnings from this journey in two parts:

  1. Learnings related to conducting research in the AI/NLP domain
  2. Learnings related to AI/NLP technologies

How to Conduct AI Research?

Depending upon your domain and the type of technology you are researching, there can be different ways of conducting research. Some popular research types are:

  • Basic Research (Pure Research)
  • Applied Research (most NLP-related research falls into this group): it provides solutions and is often conducted with a clear end goal in mind, such as developing new technologies or improving existing processes.
  • Quantitative Research
  • Qualitative Research
  • Descriptive Research: Without giving any conclusion, just document the observations of different experiments.
  • Causal Research
  • Longitudinal Research: Longitudinal research involves collecting data from the same subjects over an extended period.
  • Action Research
  • Case Study Research

The main flow of conducting research is as follows:

  • Define the problem clearly: Don't pick a topic just for the degree. Pick a topic you care about, for which data is available or obtainable, and for which you are willing to put in extra effort and go the extra mile. Otherwise, over this long journey your interest will fade and, like many other students, you will drop the idea.
  • Ask yourself: what do you know about the existing state of the problem and the existing solutions?
  • Define what contribution your work will make. If you don't solve the problem, what will happen? If the answers don't excite you, don't go ahead; simply drop the idea.
  • Articulate your research questions, which you must answer by the end of the research.
  • Articulate your hypothesis, assumptions, and limitations clearly.
  • Learn how to conduct a literature survey. I have written a separate article on this.
  • In self-funded data science research, avoid the temptation to use huge datasets or large LLMs. You may not get the resources, or you may end up paying a lot out of pocket.
  • Visualise the raw data, clean it, and visualise it again; this will help you understand whether the data is good enough for training (see the sketch after this list).
  • Master a programming language; if you don't know one, you have to learn, otherwise experimentation will be painful. No matter how many coding tools, coding assistants, great IDEs, or libraries are available, there is no replacement for your own ability to write and debug code.
  • Document every setup and the outcome of every setup. If you don't like some of the results, don't throw them away; document everything. Remember, research is not only about what works, it is also about what doesn't work.
  • Analyze your research data and ask yourself whether your research questions can be answered. What happened to your null hypothesis?
  • Write your conclusions, the contributions made, possible future improvements, and anything you couldn't do and why.
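
To make the "visualise, clean, visualise again" step concrete, here is a minimal sketch using pandas and matplotlib. The file name and column names are hypothetical placeholders, not my actual dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw scraped data (file and column names are illustrative).
df = pd.read_csv("scraped_sections.csv")
df.info()  # quick look at column types and missing values

# Visualize the raw data first: distribution of section lengths.
df["text_len"] = df["section_text"].fillna("").str.len()
df["text_len"].hist(bins=50)
plt.title("Section length before cleaning")
plt.show()

# Clean: drop missing and duplicate sections, strip stray whitespace.
clean = (
    df.dropna(subset=["section_text"])
      .drop_duplicates(subset=["section_text"])
      .assign(section_text=lambda d: d["section_text"].str.strip())
)

# Visualize again to judge whether the data is good enough for training.
clean["section_text"].str.len().hist(bins=50)
plt.title("Section length after cleaning")
plt.show()
```

Even a crude length histogram like this exposes empty pages, truncated scrapes, and duplicated sections before you spend any GPU time on them.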

Research Project Management

I come from a long project management consulting background, so I understand that any project is about change management: a project is a unique endeavor that produces a unique output, whether a product, a service, or a result. The success of any project depends upon understanding enterprise environmental factors, organizational process assets, the project lifecycle, project resources, and the ten aspects of any project, namely scope, schedule, cost, quality, resources, communication, risk, procurement, stakeholders, and integration. In this sense a research project is no different from any other project: it also has resources, a Gantt chart, quality, scope, requirements, risks, and other important things to take care of.

But we need to understand that research projects are unique: they try to understand a business problem in detail, or to understand the possible solutions around it. During a research project we do not solve the problem fully; we lay out a clearly defined path by which the problem can be solved. So the output of a research project is a prototype, a model, a high-level or detailed approach, methods to solve the problem, or a report that explains the problem better. You are focused on creating a solution, but more than that you are interested in knowing all the possible ways to solve the problem and the pros and cons of each, which often take the form of metrics that influence cost, time, quality, efficiency, and so on.

It is possible that in a research project you need to establish new metrics to evaluate the performance of the final outcome. In my case, I was working on descriptive question answering. How do you evaluate whether the questions and answers generated by the system are correct? What does “correct” even mean? How correct are they? If your problem is unique, you may have to deviate from the standard metrics.

Research is creative work; tracking it and knowing how much has been done is not as straightforward as in a typical construction, maintenance, or software development project. Every project type involves some degree of creativity, and the more creativity there is, the more uncertainty there is about whether a task is really complete. One day you mark a task as complete, and a week later you get a new idea for solving that problem and start working on the same task again. Sometimes you realize the dataset was incorrect, or a different and better dataset becomes available and you want to use it for the solution; now things fail and you enter another loop.

Sometimes months go by in reading and thinking with no breakthrough, and then one day you suddenly realize the solution was right before your eyes. For these reasons, you must document your project scope, research questions, assumptions, limitations, risks, and related work only after thorough thinking.

Finally, research budgets vary with organization size, priorities, strategy, market position, and industry. Whether individual or organizational, budgets for research are generally tight. So research projects run on scarce resources: you may be alone, or have barely one or two people with you. On top of this, the cycle time between research and its productization may be very short, which puts a lot of pressure on the research team to turn the work into a product quickly. Every organization handles this challenge differently.

Literature Survey

  • My experience with literature survey and review is captured in a separate article: How to Conduct a Literature Survey.
  • Window-shopping the work done by other researchers in your area of interest.
  • How to identify important work?
  • What to read in the identified work?
  • Selecting work for a second reading, and what to read in that second reading.
  • Why use others' work?
  • How to use others' work? Different styles of quoting it.
  • How to avoid citation errors?
  • Different kinds of papers.
    • Conference Paper
    • ArXiv Paper
    • Book
    • Thesis
    • Journal Article
    • Magazine Article
    • Report
    • Bill/Act
    • Patent
    • Working Paper
    • Encyclopedia Article
    • Paper Published in Journals
  • Understanding the relationship between publishers, journals, domains, volumes, and articles.
  • List of popular AI/NLP publishers and journals

Publishing (Paper/Book/Article)

  • Publication vehicles: When you want to publish research or creative work, you can use any of the following.
    • Thesis (PhD/Longer) and Dissertation (MS/Shorter)
    • Magazine Article
    • Journal Article
    • Conference proceedings
    • Newsletter
    • Online Journals
    • Book
  • Publishers: There are many publishers for AI/NLP-related work, including several eminent ones.
  • Journals: Each publisher may publish many journals, books, and magazines. When we want to publish some work, we need to choose a specific journal. Every journal has a defined audience and hence predefined content types; you cannot publish your work in just any journal of your choice. The work should be aligned with the journal's purpose and audience. Popular journals and venues for publishing AI/NLP work are as follows.
    • Journal of Artificial Intelligence Research (JAIR)
    • Artificial Intelligence (AI) Journal
    • Computational Linguistics (CL) Journal
    • Natural Language Engineering Journal
    • ACM Transactions on Speech and Language Processing (TSLP)
    • IEEE Transactions on Neural Networks and Learning Systems
    • Journal of Machine Learning Research (JMLR)
    • Journal of Artificial Intelligence and Research in NLP (AIRNLP)
    • NeurIPS Proceedings (Conference on Neural Information Processing Systems)
    • EMNLP (Empirical Methods in Natural Language Processing)
  • The popularity and impact factor (IF) of journals are important parameters to consider before publishing. But they keep changing over time.
  • CRediT (Contributor Roles Taxonomy) was created to define the different roles of authors; it also helps with funding compliance. The main roles include Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, and Project Administration.
  • Scopus is an abstract and citation database that provides access to a vast collection of academic and scientific research literature. It is one of the most comprehensive and widely used bibliographic databases in the world. Scopus is particularly popular in the academic and research communities and is frequently used for literature review, citation analysis, and research evaluation. It helps publishers in selecting and assigning reviewers for any work.

Thesis Documentation

  • Different components of the thesis and their order.
    • Title
    • Dedication: List of the people who inspired you to do this work or who will benefit from this.
    • Acknowledgement: List the names of people and their contributions in your research journey.
    • Abstract: Summarise your work in terms of the key problem, main solution, and results.
    • Abbreviations: You can create the glossary of abbreviations manually. If you want to list the page numbers where each term is used, you need to mark the terms in your content accordingly; the page numbers are then generated automatically.
    • Table of Contents. This is generated automatically from the main work.
    • List of Tables. This is generated automatically from the main work.
    • List of Figures. This is generated automatically from the main work.
    • Thesis Chapters: These are the main body of the work itself.
    • Appendix: Anything that disturbs the flow of a chapter because the content is very large, too data-heavy, or not directly relevant can be moved into the appendix. Keep in mind that this should be your own work, not reference material copied from others.
    • Bibliography: Use the appropriate commands and this is generated automatically from the main work.
    • Index: Use the appropriate commands and this is generated automatically from the main work.
  • Page numbering formats for the main content and the front matter (preliminary pages).
  • There are many formats in which to typeset a thesis; I chose the report format.
  • Bibliography: There are many bibliography styles, such as Harvard, APA, IEEE, AMA, APSA, ACS, etc. Which style you use is decided by the university (where you submit your work), the conference (where you present it), or the publisher (where you want to publish it). The bibliography style determines the following.
    • How items are listed in the bibliography (sorting order; author first, middle, and last names; year of publication; article name; and the order of these elements in the entry).
    • How citations appear within the text (as a superscript, a number, or the first author's last name).
    • How entries are ordered in the bibliography (in the order they appear in the text, by the first author's last name, or something else).
    • Whether [] or () is used around the name or number of the corresponding bibliography entry when it appears in the main text.

Configuration Management & Tools in Data Science/NLP Research Projects

  • Report writing: LuaLaTeX, Overleaf. There are dozens of images, many appendices, and many chapters, so you need to organize them in an appropriate folder structure. Versioning is taken care of by Overleaf, but you can label versions based on the work completed.
  • Coding: Visual Studio Code / Python / Jupyter Notebook
  • Code versioning: GitHub
  • Workflow Design/ Presentation: Google Docs
  • Notes during Exploration: Notepad++
  • Citation Management: Mendeley
  • Document Storage: Google Drive
  • Research Exploration: ChatGPT
  • Exploring research work: Mendeley, Google Scholar, ResearchGate

LaTeX Learnings

LaTeX is a tool for thesis writing. Initially it was painful, but I mastered it fairly quickly. It is not possible to do the tool justice and list every feature I learned and explored here, so I wrote a separate article on it: LaTeX Capabilities.

NLP Terminologies

I created this resource and keep updating it: Comprehensive Glossary of LLM, Deep Learning, NLP, and CV Terminology captures my understanding of some important terms.

NLP Tasks

There are hundreds of NLP tasks that can be performed using various NLP technologies. I wrote an article grouping and listing them: NLP Tasks.

NLP Evaluation

I was struggling to evaluate the performance of the different sub-systems of my models. My work was purely on NLP, but I came across many metrics worth documenting in one place, so I wrote an article on ML model evaluation. I used many of those metrics in my research project: BLEU, ROUGE, cosine similarity, NLP precision, recall, F1 score, R@n, P@n, MRR, and MAP for evaluating different subsystems.
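
As a small illustration of the retrieval-style metrics in that list, here is a self-contained sketch of P@n and MRR over ranked results; the passage ids and relevance sets are made up for the example.

```python
from typing import List, Set

def precision_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n retrieved passages that are relevant."""
    top_n = ranked[:n]
    return sum(1 for doc in top_n if doc in relevant) / n

def mean_reciprocal_rank(all_ranked: List[List[str]], all_relevant: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant passage per query (0 if none found)."""
    reciprocal_ranks = []
    for ranked, relevant in zip(all_ranked, all_relevant):
        score = 0.0
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                score = 1.0 / i
                break
        reciprocal_ranks.append(score)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Hypothetical example: two questions, each with a ranked list of passage ids.
ranked_lists = [["p4", "p1", "p7"], ["p2", "p9", "p5"]]
relevant_sets = [{"p1"}, {"p5", "p6"}]
print(precision_at_n(ranked_lists[0], relevant_sets[0], n=3))   # 0.333...
print(mean_reciprocal_rank(ranked_lists, relevant_sets))        # (1/2 + 1/3) / 2
```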

Another related article is Distances in Machine Learning

Transformers Models for Question-Answering

  • Models for the extractive question-answering task: BERT, DistilBERT, ALBERT, RoBERTa (Meta), XLNet, GPT-3 (see the sketch after this list).
  • Models for the generative question-answering task: T5 (Google), GPT-3, DialoGPT, CTRL, and encoder-decoder models.
  • LLMs for the question-answering task: GPT-3, GPT-Neo, ChatGPT and GPT-4 (OpenAI), BLOOM (BigScience/Hugging Face), LLaMA (Meta), Jurassic-1 (AI21 Labs), Chinchilla (DeepMind), LaMDA and PaLM (Google), XLNet (CMU and Google), and Megatron-Turing NLG (Microsoft and NVIDIA).
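
As a sketch of what the extractive setting looks like in practice, here is a minimal example with the Hugging Face transformers question-answering pipeline. The checkpoint shown is one public SQuAD-tuned DistilBERT model, any extractive QA model can be swapped in, and the context text is only an illustration.

```python
from transformers import pipeline

# Extractive QA: the model selects an answer span from the given context.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = ("The Mahabharata is an ancient Indian epic. "
           "It narrates the war between the Kauravas and the Pandavas.")
result = qa(question="Between whom was the war fought?", context=context)
print(result["answer"], result["score"])
```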

Large Language Models

A large language model (LLM) is a type of artificial intelligence (AI) model designed to understand and generate human language; many also have other capabilities, such as understanding and generating images, video, sound, and voice. They are trained on vast amounts of text data to learn the structure, grammar, and semantics of language and to comprehend the different parts of an input. Large language models are typically based on deep learning architectures, historically recurrent neural networks (RNNs) and today transformer architectures in encoder, decoder, or encoder-decoder form. They are called large because they are trained on massive datasets of text from the internet; this extensive training data helps the model capture a wide range of linguistic patterns and knowledge. In the training process the model learns the values of billions of parameters.

Large language models are typically pre-trained on a diverse corpus of text. During pre-training, the model learns to predict the next word in a sentence, or a missing/masked word in between, which helps it develop a strong understanding of language. After pre-training, they are fine-tuned for many specific tasks, and in this process we create specialized models that handle text generation, question answering, language translation, sentiment analysis, text summarization, chatbot applications, code generation, text completion, knowledge-base querying, and so on. LLMs can understand the semantic meaning of a word, sentence, or paragraph: even if two words are spelled completely differently, or two sentences use completely different grammar and words, an LLM can tell how close or unrelated they are.
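
The masked-word objective mentioned above is easy to see in action. Here is a small sketch using the Hugging Face fill-mask pipeline with a public BERT checkpoint; the example sentence is illustrative.

```python
from transformers import pipeline

# BERT was pre-trained with a masked-language-modelling objective,
# so it can propose likely fillers for a [MASK] token.
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("The Pandavas went into [MASK] for thirteen years."):
    print(pred["token_str"], round(pred["score"], 3))
```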

There is no established definition of an LLM, but any model with a billion or more parameters is usually counted as one, although this is not a strict criterion. Sometimes a model trained on trillions of tokens but compressed to around 500 million parameters is also called an LLM.

Text Embedding Technologies

  • Word embeddings: GloVe, Word2Vec, TF-IDF
  • Sentence embeddings: Doc2Vec, SentenceBERT, InferSent, USE (Universal Sentence Encoder), SentenceTransformer (see the sketch after this list)
  • Embedding with FastText
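
As a sketch of sentence embeddings in use, here is a minimal similarity check with the sentence-transformers library; "all-MiniLM-L6-v2" is one small public checkpoint, and the sentences are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

s1 = "Arjuna was a great archer."
s2 = "Few warriors could match Arjuna with the bow."
s3 = "The stock market fell sharply today."

# Encode all three sentences into dense vectors.
emb = model.encode([s1, s2, s3], convert_to_tensor=True)

print(util.cos_sim(emb[0], emb[1]))  # high: same meaning, different words
print(util.cos_sim(emb[0], emb[2]))  # low: unrelated meaning
```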

Vector and Graph Databases

Vector databases are specialized storage systems developed for the efficient management of dense vectors. They differ from standard relational databases, such as PostgreSQL, which were built to store tabular data in rows and columns. They’re also distinct from newer NoSQL databases like MongoDB that store data as JSON. Vector databases are designed to store and retrieve just one type of data: vector embeddings. Vector embeddings are the distilled representations of the training data produced as an output from the training stage of the machine learning process. They serve as the filter through which fresh data is processed during inference.

  • Pinecone: A managed, cloud-native vector database with a straightforward API and no infrastructure requirements. I wrote this article on Pinecone
  • Milvus: An open-source vector database designed to facilitate embedding similarity search and AI applications.
  • Chroma: An open-source embedding database that excels at building large language model applications and audio-based use cases
  • Weaviate: An open-source knowledge graph that allows users to store and search for data objects based on their semantic meaning.
  • FAISS: An open-source library from Meta for efficient similarity search and clustering of dense vectors; it is not a full database service, but it is widely used as the indexing engine behind one (see the sketch after this list).
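
To show the basic store-and-retrieve pattern these systems share, here is a hedged FAISS sketch. The random vectors stand in for real passage embeddings, and the dimension 384 is only an example (it matches small sentence-embedding models).

```python
import numpy as np
import faiss

d = 384  # embedding dimension (illustrative)
passage_vecs = np.random.rand(1000, d).astype("float32")  # stand-in for real embeddings
query_vec = np.random.rand(1, d).astype("float32")

index = faiss.IndexFlatL2(d)        # exact L2 search, no training step needed
index.add(passage_vecs)             # store the passage vectors
distances, ids = index.search(query_vec, 5)  # retrieve the 5 nearest passages
print(ids[0], distances[0])
```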

Complexity Around Question-Answering Generation Task

  • I wrote a separate article on Types of Questions. It discusses the different types of questions and why people ask them.
  • Solving question-answering tasks with AI technology is challenging.
  • Do we want to generate only questions, questions with corresponding answers, or questions with corresponding answers and the reference text?
  • What kind of questions do we want to create: descriptive, boolean, or multiple-choice (MCQ)?

Complexity Around a Domain Like History

  • Grammar of the age: for example, written English of the 18th century followed different grammar than today's English.
  • Spelling of words (names of people, places, festivals, plants) from the original old text. For example, translations of the Mahabharat from Sanskrit will have inconsistent spellings for such words; this is a transcription issue.
  • The context of the era is important. For example, if you mix Ramcharit Manas text with pre-independence history text, many things look out of context, and combining them can create problems.
  • The old text may be biased towards some gender, caste, religion, geography, profession, etc.
  • Format of the old book: old Sanskrit works in particular are in the form of sutras. You need to be careful about whether you are using a translation of the sutra or a translated commentary on it.
  • If you are using a translated historical work, you need to be careful about how widely that translation is accepted by readers.
  • If you are using a translated historical work, you also need to check whether the translation was created from the original text or from another translation. For example, if we pick up an English translation of the Mahabharat and then want a Hindi translation, was the Hindi created from the English work or from the original Sanskrit?
  • Do we have word embeddings and sentence embeddings for the text in hand?

Working with Limited Resources

Generally, research projects are constrained by budget, and this is even more true of academic research such as a DBA or Ph.D. I was working on an AI/NLP project and wanted to explore the capabilities of transformers and LLMs, which meant two kinds of resources were needed: API services from ChatGPT, ChatPDF, and other providers, and GPU machines for model training and inference. Both are expensive, and if you are not careful about resource planning, exploring free resources, and running experiments efficiently, you may burn through thousands of dollars. Which specific API, with how much data, producing what output? How much GPU do we need, what options are available in the market, and at what price? We need to explore these options carefully so that the work can be completed with no money (using free resources) or the least money possible. Unfortunately, answering these questions is not easy: initially the questions themselves are not clearly framed, and the internet is full of options that must be chosen carefully.

I wrote this article along the way: Compressing Large Language Model.

Question Answering System for Big Systems

When ChatGPT is available, why spend time and money creating a question-answering system? ChatGPT can create questions from text that is available in the public domain, and it can answer questions for which such text exists. But even within the public domain, can it answer questions from old Sanskrit or Tamil texts? To know that, we would need to understand what corpus was used to build ChatGPT. Being a proprietary commercial tool, it has little of value published about it: what text was used, how many parameters the model has, how much hardware of what kind was used for training, and so on. So we do not know which answers ChatGPT can give, or which questions it can generate. Secondly, private data such as health records, income, email, WhatsApp messages, corporate data, and government data is not available in the public domain. If we want to build a QA system around such data, we need to move away from ChatGPT and build our own systems.

What technology can be used, in terms of hardware, software, network, and security, for model training and inference? This depends upon the volume and kind of data. Data can be tabular (CSV files or spreadsheets), RDBMS data, images, health images (X-ray, MRI, etc.), text (books, articles, news, email, etc.), video (GIFs, short clips, live streams, movies, surveillance recordings), or audio (conversation, music, singing, speech, etc.). In my case, I was dealing with text data: an English translation of the Mahabharat.

We need to understand how to perform question-answering work when you have thousands of books, emails, corporate circulars, and procedure documents, and these documents have information in different formats.

If we explore LLMs like LLaMA, BLOOM, PaLM, GPT-3, etc., how do we fine-tune them with limited resources? I explored PEFT and LoRA with LLaMA and BLOOM. We also need to keep in mind that even if training becomes easier, it does not mean inference can be done on low-grade hardware.
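
For readers curious what the PEFT + LoRA pattern looks like, here is a hedged sketch. The small BLOOM checkpoint and the target module name are illustrative choices rather than a record of my exact setup; the right target modules depend on the model architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# A small public checkpoint as a stand-in for larger BLOOM/LLaMA models.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # attention projection used by BLOOM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only a small fraction is trainable
```

The point of the pattern is that only the small adapter matrices are updated during fine-tuning, which is what makes training feasible on limited hardware; inference still has to load the full base model.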

Application and Scope of Question Answering System (QAS)

Wisdom comes when we contemplate, process data, and ask questions. The deeper, more challenging, and more unique the questions, the more unshakeable the wisdom. Asking questions after reading a book is one thing; what questions can we ask without reading it? And once a question is answered, how do we know whether it has been answered correctly? This is a completely different challenge, harder still for a computer, and much harder when the answer must be generated from several parts of the text rather than extracted from one or two. Whatever the business domain, QA is a ubiquitous, domain-agnostic language task. A QAS can take many forms: chatbot, FAQ, interview, search, exam evaluation, interrogation, exploration, and so on. With a few examples, we can see the applications of a QAS.

  • When, at the end of a university or school course, we want to evaluate the learner.
  • When a doctor wants to examine a patient.
  • When an auditor wants the answer to a question.
  • When a journalist wants to ask a politician a question.
  • When a salesperson wants to know why sales were lower in the last quarter.
  • When HR wants to know whether a policy has been violated.
  • When the CEO wants to understand how much each factor contributes to the cost of a product.
  • When a reader or listener wants to validate their understanding of a text.

Learning Python

  • Decorator functions (see the sketch after this list)
  • Python naming conventions
  • Building utility libraries
  • Keeping application configuration variables and constants away from the main work.
  • List comprehensions with lambda functions.
  • Enabling and using GPUs/TPUs in Kaggle and Colab with PyTorch and TensorFlow.
  • Google Colab forms
  • Using the Streamlit library for the frontend
  • NLP libraries: spaCy, NLTK, Gensim
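
As an example of the decorator pattern mentioned in the list, here is a small, self-contained timing decorator of the kind that is handy around data-cleaning and evaluation functions; the wrapped function is just an illustration.

```python
import functools
import time

def timed(func):
    """Print how long the wrapped function took to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.2f}s")
        return result
    return wrapper

@timed
def clean_corpus(texts):
    # Stand-in for a real cleaning step.
    return [t.strip().lower() for t in texts]

clean_corpus(["  The Mahabharata  ", "  AN EPIC  "])
```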

Web Scraping

  • While exploring different web scraping techniques and libraries, I wrote this article: Python API for Data Collection.
  • Apart from this, I scraped 2,100 sections of the Mahabharat books from 2,100 URLs for my DBA (a minimal sketch of the pattern follows this list).
  • During my MS program, I used several other web scraping tools to scrape data from Twitter and other social media accounts.
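
The scraping itself followed a fairly standard requests + BeautifulSoup pattern. Here is a hedged sketch; the URL and output file are placeholders, not the actual source I scraped.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/mahabharata/section-1"   # placeholder URL
response = requests.get(url, timeout=30)
response.raise_for_status()

# Parse the page and collect the paragraph text.
soup = BeautifulSoup(response.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

with open("section-1.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(paragraphs))
```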

Prompt Engineering

I wrote a detailed article, Introduction to Prompt Engineering (PE), on what PE is and what its possibilities are. I used ChatGPT whenever I needed a partner and guide for questions, doubts, or clarifications about the following (a sample prompt template follows the list).

  • Asking Python programming questions
  • Debugging Python code
  • Asking for research guidance
  • Asking for options and evaluation of options
  • Asking for summarization
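
Here is the kind of reusable prompt template I mean for the debugging use case; the wording and the embedded code snippet are illustrative, and since the template is plain Python it works with any chat-style interface.

```python
# A simple, reusable prompt template for asking an LLM to debug Python code.
DEBUG_PROMPT = """You are a Python debugging assistant.
Here is my code:
{code}

Here is the error I get:
{error}

Explain the cause in one paragraph, then show the corrected code."""

prompt = DEBUG_PROMPT.format(
    code="df = pd.read_csv('data.csv')\nprint(df.colums)",
    error="AttributeError: 'DataFrame' object has no attribute 'colums'",
)
print(prompt)
```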

Model Finetuning & Training

  • Model tuning with Vertex AI
  • Using Hugging Face, TensorFlow, and PyTorch models and starter code for inference and fine-tuning.
  • Using Hugging Face zero-shot models for question-answering work (see the sketch after this list).
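
As a hedged illustration of the zero-shot idea, here is the Hugging Face zero-shot-classification pipeline (NLI-based) used to label a question's type without any fine-tuning. The labels and question are illustrative, and this is the classification flavour of zero-shot rather than my full QA setup.

```python
from transformers import pipeline

# Zero-shot classification: no task-specific training, labels supplied at runtime.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

question = "Why did the Pandavas go into exile?"
labels = ["descriptive", "boolean", "multiple-choice"]
print(classifier(question, candidate_labels=labels))
```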

Model Deployment

Model Repositories

Other Learnings

  • LangChain, Chain of Thought, Tree of Thought
  • Ethics Related Issues in AI Products

Some useful resources