Model Garden of Vertex AI
Unlocking the Power of Google’s Vertex AI: Exploring the World of Pre-Built Models for AI Tasks
Introduction:
Artificial intelligence (AI) has transformed numerous industries, from healthcare and finance to e-commerce, logistics, education, and entertainment, but the complexity of developing machine learning models remains a challenge. As demand for AI-powered solutions rises, data scientists look for efficient ways to reuse pre-trained models or build custom models for specific tasks. Google’s Vertex AI addresses this need with an extensive selection of pre-built models for a wide range of AI tasks.
Vertex AI also brings large language models (LLMs) and prompt engineering to the platform. Data scientists can use state-of-the-art LLMs to accelerate their ML development, and prompt engineering lets them guide a model toward precise and accurate results. From computer vision and natural language processing to speech processing and structured tabular data analysis, Vertex AI’s repertoire includes over 100 models. This article catalogs those models in four groups: foundation models, fine-tunable models, task-specific solutions, and task-specific LLM prompts.
Foundation models:
Pre-trained multi-task models that can be further tuned or customized for specific tasks.
Sno. | Name | Details | Task Name | Vision/ Language | Input DataType | Model Name |
---|---|---|---|---|---|---|
1 | PaLM 2 for Text | Fine-tuned to follow natural language instructions and is suitable for a variety of language tasks, such as: classification, extraction, summarization and content generation. | Text Gen. | Language | Text | text-bison@001 |
2 | PaLM 2 for Chat | Fine-tuned to conduct natural conversation. Use this model to build and customize your own chatbot application. | Text Gen. | Language | Text | chat-bison@001 |
3 | Embeddings for text | Text embedding is an important NLP technique that converts textual data into numerical vectors that can be processed by machine learning algorithms, especially large models. These vector representations are designed to capture the semantic meaning and context of the words they represent. | Embedding | Language | Text | textembedding-gecko@001 |
4 | Codey for Code Completion | Generates code based on code prompts. Good for code suggestions and minimizing bugs in code. | Code Gen. | Language | Text | code-gecko@001 |
5 | Codey for Code Generation | Generates code based on natural language input. Good for writing functions, classes, unit tests, and more. | Code Gen. | Language | Text | code-bison@001 |
6 | Codey for Code Chat | Get code-related assistance through natural conversation. Good for questions about an API, syntax in a supported language, and more. | Code Gen. | Language | Text | codechat-bison@001 |
7 | BERT | Neural network-based technique for natural language processing. Use it to train your own question answering system and more. | Text Gen. | Language | Text | google/bert-base-001 |
8 | InstructPix2Pix | Given an input image and a text prompt that tells the model what to do, the instruct-pix2pix model follows the prompt to edit the image by generating a new one. | Image Gen. | Vision, Language | Text+Image | timbrooks/instruct-pix2pix |
9 | ControlNet | Control image generation with text prompt and control image. | Image Gen. | Vision, Language | Text+Image | lllyasviel/ControlNet |
10 | BLIP2 | BLIP2 is for the image captioning and visual-question-answering tasks. | Text Gen. | Vision, Language | Image | Salesforce/blip2-opt-2.7b |
11 | Stable Diffusion 1.4 (Keras) | KerasCV implementation of stability.ai’s text-to-image model, Stable Diffusion 1.4. | Image Gen. | Vision, Language | Text | keras/stable-diffusion-v1-4 |
12 | Embeddings for Image | Generates vectors based on images, which can be used for downstream tasks like image classification, image search, and so on. | Embedding | Vision | Image | imageembedding-001 |
13 | Label detector (PaLI zero-shot) | Label Detector Zero-shot classifies images based on labels, represented as a list of text prompt strings, which are provided by the user, and calculates the confidence score of each label’s presence in the image. | Classification | Vision | Image | imagezeroshot-001 |
14 | Stable Diffusion v1-5 | Latent text-to-image diffusion model capable of generating photo-realistic images given a text input. | Image Gen. | Vision | Text | runwayml/stable-diffusion-v1-5 |
15 | Stable Diffusion Inpainting | Stable Diffusion Inpainting is a latent diffusion model capable of inpainting images given any text input and a mask image. | Image Gen. | Vision | Text | runwayml/stable-diffusion-inpainting |
16 | BLIP image captioning | A Vision-Language Pre-training (VLP) framework for image captioning. | Text Gen. | Vision | Image | Salesforce/blip-image-captioning-base |
17 | BLIP VQA | A Vision-Language Pre-training (VLP) framework for visual question answering (VQA). | Text Gen. | Vision | Image | Salesforce/blip-vqa-base |
18 | CLIP | Neural network capable of classifying images without prior training on the classes. | Classification | Vision | Image | openai/clip-vit-base-patch32 |
19 | OWL-ViT | Zero-shot, text-conditioned object detection model that can query an image with one or multiple text queries. | Detection | Vision | Text+Image | google/owlvit-base-patch32 |
20 | ViT GPT2 | Image captioning model. | Text Gen. | Vision | Image | nlpconnect/vit-gpt2-image-captioning |
21 | ViLT VQA | Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2. | Text Gen. | Vision | Image | dandelin/vilt-b32-finetuned-vqa |
22 | LayoutLM for VQA | Fine-tuned for document understanding and information extraction tasks like form and receipt understanding. | Info. Extraction | Vision | Scan Doc | impira/layoutlm-document-qa |
23 | T5-FLAN | T5 (Text-To-Text Transfer Transformer) model with the T5-FLAN checkpoint. | Text Gen. | Language | Text | google/t5-flan-001 |
24 | Sec-PaLM2 | The sec-palm model is a foundational model that has been pretrained on a variety of security-specific tasks. The model has broad security understanding across a number of topics, such as threat intelligence, security operations, and malware analysis. It is ideal for analyzing, summarizing, and aggregating information across multiple security data sources, as well as generating rules and search queries from natural language input. | Info. Extraction | Language | Text | google/sec-palm-000 |
25 | Chirp | Chirp is a version of a Universal Speech Model that has over 2B parameters and can transcribe in over 100 languages in a single model. | Speech Gen. | Speech | | chirp-rnnt1 |
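
The embedding rows above describe vectors that capture semantic meaning. A minimal sketch of how such vectors are typically compared, using cosine similarity, might look like the following. The four-dimensional vectors are invented for illustration and are not real output from textembedding-gecko@001 or imageembedding-001; the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query_vector, candidates):
    """Return the (text, score) pair whose vector is closest to the query."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in candidates.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Assumption: toy vectors standing in for real embeddings.
# Pretend this is the embedding of "How do I reset my password?"
query_vector = [0.9, 0.1, 0.0, 0.2]
candidates = {
    "I forgot my login credentials.": [0.8, 0.2, 0.1, 0.3],
    "What is the weather today?": [0.0, 0.9, 0.4, 0.1],
}

best_text, best_score = most_similar(query_vector, candidates)
print(best_text, round(best_score, 3))
```

In practice the vectors would come from an embedding model and have hundreds of dimensions, but the comparison step stays the same.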
Fine-tunable models:
Models that data scientists can further fine-tune through a custom notebook or pipeline.
Sno. | Name | Details | Task Name | Vision/ Language | Input DataType | Model Name |
---|---|---|---|---|---|---|
1 | Stable Diffusion Inpainting | Stable Diffusion Inpainting is a latent diffusion model capable of inpainting images given any text input and a mask image. | Image Gen. | Vision, Language | Text | runwayml/stable-diffusion-inpainting |
2 | ControlNet | Control image generation with text prompt and control image. | Image Gen. | Vision | Text+Image | lllyasviel/ControlNet |
3 | tfhub/EfficientNetV2 | EfficientNetV2 is a family of image classification models that achieve better parameter efficiency and faster training speed than prior models. | Classification | Vision | Image | tensorflow-hub/efficientnetv2 |
4 | tfvision/vit | The Vision Transformer (ViT) is a transformer-based architecture for image classification. | Classification | Vision | Image | tfvision/vit-s16 |
5 | tfvision/SpineNet | SpineNet is an image object detection model generated using Neural Architecture Search. | Detection | Vision | Image | tfvision/spinenet49 |
6 | tfvision/YOLO | YOLO is a one-stage object detection algorithm that can achieve real-time performance on a single GPU. | Detection | Vision | Image | tfvision/scaled-yolo |
7 | DeepLabv3+ (with checkpoint) | Semantic segmentation is the task of assigning a label to each pixel in an image, where each label corresponds to a specific class of object or scene element. | Segmentation | Vision | Image | deeplabv3plus-cityscapes-20230315 |
8 | ResNet (with checkpoint) | Image classification model as described in the paper “Deep Residual Learning for Image Recognition”. | Classification | Vision | Image | resnet50 |
9 | ResNet-RS (with checkpoint) | Image classification model as described in the paper “Revisiting ResNets: Improved Training and Scaling Strategies”. | Classification | Vision | Image | ResNet-RS-50 |
10 | Faster R-CNN (Detectron2) | Faster R-CNN is a deep convolutional network used for image object detection. | Detection | Vision | Image | detectron2/faster-r-cnn |
11 | MobileNet (TIMM) | Small but powerful models optimized for mobile and embedded vision applications. | Classification | Vision | Image | timm/mobilenetv2_100 |
12 | EfficientNet (TIMM) | A family of convolutional neural networks (CNNs) designed to be both accurate and efficient. | Classification | Vision | Image | timm/efficientnetv2_rw_s |
13 | DeiT | A convolution-free transformer for image classification. | Classification | Vision | Image | timm/deit_base_patch16_224 |
14 | BEiT | A self-supervised learning framework for image representation learning inspired by BERT. | Classification | Vision | Image | timm/beit_base_patch16_224 |
15 | ViT (TIMM) | Transformer-like architecture for image classification. | Classification | Vision | Image | timm/vit_base_patch16_224 |
16 | RetinaNet (Detectron2) | RetinaNet is a one-stage object detection model that utilizes a feature pyramid network (FPN) on top of a ResNet and adds a focal loss function to address class imbalance during training. | Detection | Vision | Image | detectron2/retinanet |
17 | Mask R-CNN (Detectron2) | Mask R-CNN is an instance segmentation model which extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. | Detection | Vision | Image | detectron2/mask-r-cnn |
18 | ResNet (TIMM) | A type of artificial neural network that is made up of residual blocks with skip connections. | Classification | Vision | Image | timm/resnet50 |
19 | ResNeSt (TIMM) | An extension of the ResNet architecture that uses a new attention mechanism called split-attention. | Classification | Vision | Image | timm/resnest50d |
20 | ConvNeXt (TIMM) | A pure convolutional model that modernizes the ResNet architecture with design choices borrowed from vision transformers such as Swin Transformer. | Classification | Vision | Image | timm/convnext_base |
21 | CspNet (TIMM) | A deep neural network backbone that uses cross stage partial connections to reduce the number of parameters and computation cost without sacrificing accuracy. | Classification | Vision | Image | timm/cspdarknet53 |
22 | Inception (TIMM) | Inception network is a deep neural network with an architectural design that consists of repeating components referred to as Inception modules. | Classification | Vision | Image | timm/inception_v4 |
Task-specific solutions:
Most of these pre-built models are ready to use off the shelf, and many can be customized using your own data.
Sno. | Name | Details | Task Name | Vision/ Language | Input DataType | Model Name |
---|---|---|---|---|---|---|
1 | Entity analysis | Inspect text to identify and label persons, organizations, locations, events, products and more. | Classification | Language | Text | google/language_v1-analyze_entities |
2 | Content classification | Use Google’s state-of-the-art language technology to analyze text content and return its content categories. The latest version of Content Classification supports over 1,000 categories. | Classification | Language | Text | google/language_v1-classify_text_v1 |
3 | Sentiment analysis | Sentiment analysis attempts to determine the overall attitude (positive or negative) expressed within the text. Sentiment is represented by numerical score and magnitude values. | Classification | Language | Text | google/language_v1-analyze_sentiment |
4 | Entity sentiment analysis | Entity Sentiment Analysis inspects the given text for known entities (proper nouns and common nouns), returns information about those entities, and identifies the prevailing emotional opinion of the entity within the text, especially to determine a writer’s attitude toward the entity as positive, negative, or neutral. | Classification | Language | Text | google/language_v1-analyze_entity_sentiment |
5 | Syntax analysis | Syntactic analysis extracts linguistic information, breaking up the given text into a series of sentences and tokens (generally, word boundaries), providing further analysis on those tokens. | Extraction | Language | Text | google/language_v1-analyze_syntax |
6 | Text Moderation | Text moderation analyzes a document and returns a list of harmful and sensitive categories that apply to the text found in the document. | Classification | Language | Text | google/language_v1-moderate_text |
7 | Text Translation | Use Google’s proven pre-trained text model to get text translations for 100+ languages. | Translation | Language | Text | Text Translation |
8 | Occupancy analytics | Detect people and vehicles in a video or image, plus zone detection, dwell time, and more. | Detection | Vision | Image, Video | google/occupancy-analytics-001 |
9 | Person/vehicle detector | Detects and counts people and vehicles in video. | Detection | Vision | Video | People/vehicle detector |
10 | Object detector | Identify and locate objects in video. | Detection | Vision | Video | Object detector |
11 | PPE detector | Identify people and personal protective equipment (PPE). | Detection | Vision | Image | PPE detector |
12 | Person blur | Mask or blur a person’s appearance in video. | Detection | Vision | Video | People blur |
13 | Product recognizer | Identify products at the GTIN or UPC level. | Recognition | Vision | Image | Product recognizer |
14 | Tag recognizer | Extract text in product and price tags. | Recognition | Vision | Scan Doc | Tag recognizer |
15 | Content moderation (Vision) | Content Moderator (Vision) detects objectionable or unwanted content across predefined content labels (e.g., adult, violence, spoof) or custom labels provided by the user. | Classification | Vision | Scan Doc | Content Moderation |
16 | Face detector (Vision API) | Face detector is a prebuilt Vision API model that detects multiple faces in media (images, video) and provides bounding polygons for the face and other facial “landmarks” along with their corresponding confidence values. | Detection | Vision | Image, Video | Face Detector |
17 | Watermark detector | Watermark detector is a prebuilt model that detects watermarks in the input image. | Detection | Vision | Scan Doc | imagewatermarkdetector-001 |
18 | Text detector (Vision API) | Text detector detects and extracts text from images. It uses optical character recognition (OCR) for an image to recognize text and convert it to machine coded text. | Detection | Vision, Language | Scan Doc | Text Detector |
19 | AutoML E2E | Tabular Workflow for End-to-End AutoML is the complete AutoML pipeline for classification and regression tasks. | Classification | | Tabular | AutoML E2E |
20 | Document AI OCR processor | Document OCR can identify and extract text from documents in over 200 printed languages and 50 handwritten languages. | Extraction | | Document | pretrained-ocr-v1.2-2022-11-10 |
21 | Form Parser | Document AI Form Parser applies advanced machine learning technologies to extract key-value pairs, checkboxes, and tables from documents in over 200 languages. | Extraction | | Document | pretrained-form-parser-v1.0-2020-09-23 |
22 | TabNet | TabNet is a general model which performs well on a wide range of classification and regression tasks. | Classification | | Tabular | TabNet |
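
The sentiment analysis row above notes that sentiment is represented by a numerical score and a magnitude. The sketch below shows one way those two numbers might be turned into a label. It assumes the Natural Language API convention that score ranges from -1.0 to 1.0 and magnitude is non-negative; the function name `describe_sentiment` and both thresholds are hypothetical choices for illustration and would need tuning for real data.

```python
def describe_sentiment(score, magnitude, score_threshold=0.25, magnitude_threshold=1.0):
    """Map a (score, magnitude) pair to a coarse sentiment label.

    score:     overall polarity, expected in [-1.0, 1.0]
    magnitude: overall strength of emotion, expected to be >= 0
    The thresholds are illustrative assumptions, not official values.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be between -1.0 and 1.0")
    if magnitude < 0:
        raise ValueError("magnitude must be non-negative")

    if score >= score_threshold:
        return "positive"
    if score <= -score_threshold:
        return "negative"
    # A near-zero score with a high magnitude suggests positive and
    # negative statements cancelling out, rather than a lack of emotion.
    return "mixed" if magnitude >= magnitude_threshold else "neutral"


# Made-up example values, not real API responses.
for score, magnitude in [(0.8, 3.0), (-0.6, 4.0), (0.1, 0.0), (0.0, 4.0)]:
    print(score, magnitude, describe_sentiment(score, magnitude))
```

Keeping the thresholds as parameters makes it easy to adjust them per use case, since longer documents tend to accumulate larger magnitudes.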
Task-specific LLM Prompts:
Customize language model outputs to meet specific needs. Prompts help to refine or enrich the outputs of the large language model selected.
Sno. | Name | Details | Task Name | Vision/ Language | Input DataType | Model Name |
---|---|---|---|---|---|---|
1 | Object classification | Classify an object using a small number of examples (few-shot prompting). | Classification | Vision | Structured | LLM Prompt |
2 | Kindergarten Science Teacher | Your name is Miles. You are an astronomer who is knowledgeable about the solar system. Respond in short sentences. Shape your response as if talking to a 10-year-old. | Text Gen. | Language | Freeform | LLM Prompt |
3 | Online Return Customer Service | A customer service chatbot that provides basic customer support and makes decisions on simple tasks | Text Gen. | Language | Freeform | LLM Prompt |
4 | Gluten Free Advisor | A chatbot that provides gluten free cooking recipes and diet plans. | Text Gen. | Language | Freeform | LLM Prompt |
5 | Company Information Guide | An informative chatbot that has a simple company background and allows customers to ask questions about the company’s products. | Text Gen. | Language | Freeform | LLM Prompt |
6 | Fictional Captain from the 1700s | Chat with a fictional character from the 1700s without any modern knowledge. | Text Gen. | Language | Freeform | LLM Prompt |
7 | Support rep chat summarization | You are a customer support manager and would like to quickly see what your team’s support calls are about. | Summarization | Language | Freeform | LLM Prompt |
8 | Summarize news article | News takes too much time to read. You want a quicker way to get the summary. Let Vertex help you. | Summarization | Language | Freeform | LLM Prompt |
9 | Chat agent summarization | You are a customer service center manager and you need to quickly see what your agents are talking about. | Summarization | Language | Freeform | LLM Prompt |
10 | Chat agent follow up | You are a customer service center manager. Sometimes your agents forget to note down follow ups. You want to automate follow up lists. | Info. Extraction | Language | Freeform | LLM Prompt |
11 | Transcript summarization | Summarize a block of text. | Summarization | Language | Structured | LLM Prompt |
12 | Dialog summarization | Summarize a conversation. | Summarization | Language | Structured | LLM Prompt |
13 | Hashtag tokenization | Create and tokenize hashtags based on the provided text. | Text Gen. | Language | Structured | LLM Prompt |
14 | Title generation | Generate a title based on the provided text. | Text Gen. | Language | Structured | LLM Prompt |
15 | Sentiment analysis about a person | You would like to see how reporters write about certain people. You have articles and would like to see if a certain person is written about positively or negatively. | Classification | Language | Freeform | LLM Prompt |
16 | Customer request classification, few-shot | Based on your customer’s answer, you want to automate routing of your customer to the proper service queue. Use few-shot learning. | Classification | Language | Structured | LLM Prompt |
17 | Text classification few-shot | You are an intern at a library and your job is to classify hundreds of articles every day. You’d rather automate this and do something else. | Classification | Language | Structured | LLM Prompt |
18 | Article classification | You are an intern at a library and your job is to classify hundreds of articles every day. You’d rather automate this and do something else. | Classification | Language | Freeform | LLM Prompt |
19 | Classification headline | Few shot classification on a given topic. | Classification | Language | Structured | LLM Prompt |
20 | Sentiment analysis | Explain the sentiment expressed in a body of text. | Classification | Language | Structured | LLM Prompt |
21 | Pixel Technical Specifications, one-shot | Generate technical specification from text of a Pixel phone into JSON, one-shot. | Info. Extraction | Language | Structured | LLM Prompt |
22 | Wifi troubleshooting | Given a description of the different status lights on the Google WiFi router, suggest the troubleshooting step. | Text Gen. | Language | Freeform | LLM Prompt |
23 | Contract analysis | You are a partner of a law firm. Your associates are bored of reading contracts to find specific provisions when they can work on more intellectually challenging tasks. | Info. Extraction | Language | Freeform | LLM Prompt |
24 | Extractive Question Answering | Answer questions from given background texts. | Text Gen. | Language | Structured | LLM Prompt |
25 | Marketing generation Pixel | You work in Google’s device marketing team and you need to create a marketing pitch for the new Pixel 7 Pro. You have writer’s block and need help. | Text Gen. | Language | Freeform | LLM Prompt |
26 | Ad copy generation | You are a marketer and want to create different versions of the same ad to target different audiences. You would like some suggestions. | Text Gen. | Language | Freeform | LLM Prompt |
27 | Essay outline | Generate an outline for an essay on a particular topic. | Text Gen. | Language | Freeform | LLM Prompt |
28 | Correct grammar | Correct grammar in the text. | Text Gen. | Language | Freeform | LLM Prompt |
29 | Ad copy from description | Write an ad copy for something based on a description. | Text Gen. | Language | Freeform | LLM Prompt |
30 | Write emails and letters | Write an email or letter based on the specified content. | Text Gen. | Language | Freeform | LLM Prompt |
31 | Reading comprehension test | Your child is preparing for the SAT verbal exam and needs more practice in reading comprehension. | Summarization | Language | Freeform | LLM Prompt |
32 | Generate memes | Generate memes based on a certain topic. | Text Gen. | Language | Freeform | LLM Prompt |
33 | Interview questions | Generate a list of interview questions targeting a specific position. | Text Gen. | Language | Freeform | LLM Prompt |
34 | Naming | Generate ideas for names of a specified entity. | Text Gen. | Language | Freeform | LLM Prompt |
35 | General tips and advice | Get tips and advice on general topics. | Text Gen. | Language | Freeform | LLM Prompt |
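
Several prompts in the table above rely on few-shot prompting, where a handful of labeled examples precede the input to be classified. A minimal sketch of how such a prompt string might be assembled is shown here. The function name `build_few_shot_prompt`, the `Text:`/`Category:` labels, and the sample requests are all hypothetical; the resulting string is what would be sent to a text model, and no model call is made in this sketch.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Text", output_label="Category"):
    """Assemble a few-shot classification prompt as a single string.

    instruction: one-line task description placed at the top
    examples:    list of (input_text, expected_label) pairs
    query:       the new input the model should label
    """
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines.append(f"{input_label}: {text}")
        lines.append(f"{output_label}: {label}")
        lines.append("")  # blank line between examples
    # End with the query and an empty output slot for the model to fill in.
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)


# Invented sample data for a customer-request routing task.
examples = [
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I open settings.", "technical"),
]
prompt = build_few_shot_prompt(
    "Classify each customer request as billing, technical, or returns.",
    examples,
    "I want to send back the shoes I ordered.",
)
print(prompt)
```

Ending the prompt with an empty `Category:` slot nudges the model to answer with just a label, which keeps the output easy to parse.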
Conclusion:
The realm of AI has witnessed remarkable advancements, thanks to platforms like Google’s Vertex AI. By providing a vast array of pre-built models spanning computer vision, natural language processing, speech processing, and ML tasks on structured tabular data, Vertex AI has simplified the development of AI solutions for a multitude of tasks. The platform’s comprehensive selection of models empowers data scientists to efficiently tackle image classification, object detection, sentiment analysis, speech recognition, and much more. Whether it’s creating voice assistants, automating customer support, analyzing visual data, or making data-driven predictions, Vertex AI’s models offer the versatility and performance required to succeed in today’s AI-driven landscape. As AI continues to transform industries, Google’s Vertex AI stands as a powerful tool that unlocks the potential of AI, enabling innovation and driving real-world impact across diverse domains.
By harnessing the power of Vertex AI and its pre-built models, businesses and developers can pave the way for intelligent applications that enhance efficiency, accuracy, and user experiences. With a commitment to ongoing research and development, Google’s Vertex AI is poised to continuously expand its model offerings, ensuring that users have access to cutting-edge AI capabilities and enabling them to push the boundaries of what is possible in the world of artificial intelligence.