One year of living alone triggered an insurmountable loneliness that pushed me to take up some unconventional hobbies for a 26-year-old. Bird-watching, stereotyped as the retirees’ favourite pastime, became a fascinating and therapeutic activity for me.
To start off my journey as a bird-watcher, I bought a small catalogue of local birds and a not-so-advanced pair of binoculars. Although I found it engaging to sift through the pages of my small catalogue upon observing the grace and beauty of my target ornithuran with my detective lenses on, as a software engineer, I found it embarrassing not to build a smart system to make bird identification much faster and easier. There you have it. I identified a problem I created for myself, and like an obsessive techie who recently turned into a borderline ornithophile, I set out on a journey to figure out the best solution for it.
One way to implement such a system is to capture pictures of these flying adventurers and feed them into a model based on the Convolutional Neural Network (CNN) architecture. If you are familiar with Machine Learning and Computer Vision, the typical life cycle consists of collecting data, data preprocessing, model selection, model training, and model evaluation. In this case, we need to capture images of birds, annotate them with labels, train a model on a subset of the collected images called the training set, and finally make predictions on unseen images. As for model selection, we can explore various deep CNN-based models such as MobileNetV2, ResNet50, and ResNet152 (these are just a few examples).
A Convolutional Neural Network is widely used in computer vision applications to extract and detect features from visual data. The CNN architecture consists of a series of layers, i.e., input layers, convolutional layers, activation layers, and pooling layers. The input layer receives the raw image data, and the convolutional layer extracts local features using filters/kernels. The convolutional kernel performs a convolution operation on the input matrix to generate feature maps that contain the important features of the input. The subsequent hidden layers add non-linearity, prevent overfitting, reduce dimensionality, and prepare the data for the final prediction. Finally, the output layer produces the final predictions using a softmax or sigmoid activation function.
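To make that layer stack concrete, here is a minimal PyTorch sketch of such a classifier; the layer sizes and the 500-class output head are illustrative assumptions, not a model used later in this post.

import torch
import torch.nn as nn

# A tiny CNN: convolution -> activation -> pooling blocks, followed by a classifier head.
class TinyBirdCNN(nn.Module):
    def __init__(self, num_classes: int = 500):  # e.g., ~500 local species
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # extract local features with 3x3 kernels
            nn.ReLU(),                                     # add non-linearity
            nn.MaxPool2d(2),                               # reduce spatial dimensions
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.2),                               # helps prevent overfitting
            nn.Linear(32 * 56 * 56, num_classes),          # logits; softmax is applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = TinyBirdCNN()(torch.randn(1, 3, 224, 224))  # one 224x224 RGB image -> 500 class scores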
Since this post isn’t intended to focus on CNNs, I highly recommend this article for a better understanding of the topic.
The obvious challenge with this implementation is the quality and quantity of the dataset required to make accurate predictions. There are over 500 species of birds in the region I come from. Ideally, you need at least 100–500 images per species if you’re building a CNN from scratch. This issue can be mitigated by using the pre-trained models I mentioned earlier, which can bring the number of images per species down to the range of 20–100. But if we need to add more and more species of birds, the computational complexity will keep growing. If your machine isn’t capable of carrying out such computationally intensive tasks, we may have to say goodbye to our little journey.
But pause for a minute and rethink this implementation. Aren’t we adding more complexity to a relaxing pastime here? One of the many perks of bird-watching is how inexpensive the activity is. Decent-quality binoculars to enhance your vision and a reasonably sharp brain for identification are all that you need. To capture a quality image to feed as input, we’d need an expensive camera with long lenses. Moreover, these flying trouble-makers may take advantage of your poor reflexes to keep you from getting a decent shot. So let’s think differently. Instead of making predictions from an image, what if we give a detailed description of the target bird and retrieve the image for confirmation? The idea sounds interesting, but the implementation looks a bit complicated with a CNN. Picture this: the user has seen a bird on a pleasant Sunday morning, and s/he is able to verbalize what s/he saw. With this description, our system should identify the correct match. But this isn’t the only description the user can produce. There are multiple ways to describe the same species with different vocabulary, phrasing, semantics, and syntax. Hence, it’s nearly impossible to train the model on images together with every possible description that could be tagged to them. But don’t get disillusioned yet. With the advent of the transformer architecture, we can tell the model what we saw, and it will identify the image we’re looking for. OpenAI’s CLIP is here to our rescue.
Read more about the transformer architecture here.
CLIP (Contrastive Language-Image Pre-training) is a multimodal vision and language model that focuses on learning visual concepts from natural language supervision. In simple terms, CLIP allows us to “tell” the model what we saw, and it will output the visual representation of the label (in this case, a photo of the bird we’re looking for). The unique architecture of CLIP makes it easier to achieve zero-shot performance on a wide variety of benchmarks and real-world datasets. Thanks to its zero-shot image classification capability, we can use any prompt that clearly describes the bird we saw, without explicitly training on all the possible prompts that our model may encounter during inference. Let’s briefly touch on the architecture of CLIP.
CLIP consists of a text encoder for embedding texts and an image encoder for embedding images. The text encoder is based on the transformer architecture introduced in the 2017 paper, Attention Is All You Need by Vaswani et al. For image encoding, CLIP relies on the Vision Transformer (ViT), where images are decomposed into a sequence of patches. CLIP was trained on over 400 million image-text pairs collected from publicly available sources on the internet. The process involves projecting both the text and the visual representation of images onto the same embedding space, where the similarity scores of the respective embedded vector pairs are calculated. We can use the cosine similarity metric to calculate this score. The objective of this architecture is to maximize the cosine similarity of the correct pairs and minimize that of the incorrect pairs.
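As a rough illustration of that training objective, here is a minimal sketch of the symmetric contrastive loss over a batch of paired embeddings; the function name and temperature value are my own choices, not CLIP’s exact training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize both modalities so the dot product equals cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # N x N matrix of cosine similarities between every image and every text in the batch.
    logits = image_embeds @ text_embeds.T / temperature

    # Matching pairs sit on the diagonal: image i belongs with text i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy pulls correct pairs together and pushes incorrect pairs apart.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2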
CLIP’s contrastive training objective unburdens us from the need to explicitly label the images for training. In our case, during inference, the input description will be passed through the pretrained text encoder, which produces its text embedding. The model calculates the cosine similarity score between our input text embedding and the stored image embeddings, and the image with the highest similarity score will be retrieved.
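Before we get to the app itself, here is roughly what zero-shot usage looks like with the Hugging Face transformers library; the prompts and the image path below are placeholders, not data from this project.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any free-form prompts work; no retraining is needed to add or change candidates.
prompts = ["a photo of a house sparrow", "a photo of a common kingfisher", "a photo of an Indian roller"]
image = Image.open("unknown_bird.jpg")  # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)  # one probability per prompt

print(dict(zip(prompts, probs.squeeze(0).tolist())))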
Now that we have a better idea of the approach, let’s explore the implementation of our app. The entire process is broken down into five steps.
We have many publicly available online databases of bird observations, their images, and descriptions. One of the active communities that gets frequently updated with bird sightings is eBird, which provides researchers, scientists, and nature enthusiasts with real-time data on bird distribution and observations. They have a readily available API to fetch bird data filtered by your location. I downloaded the images of over 300 local bird species and their descriptions.
import requests

url = f"https://api.ebird.org/v2/data/obs/{region_code}/recent"

payload = {}
headers = {
    'X-eBirdApiToken': api_key
}

response = requests.request("GET", url, headers=headers, data=payload)
response_json = response.json()
After retrieving the data with the eBird API and organizing it into a structure that is easy to process, we have to perform some preprocessing before feeding it into the model. Since CLIP’s text encoder has a limit of 77 tokens, we can either cut off the textual descriptions exceeding the limit or summarize them to fit within the token limit using any text summarization model.
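As a sketch of the second option, we could count tokens with the CLIP tokenizer and fall back to a generic Hugging Face summarization pipeline when a description runs long; the BART checkpoint and the summary length limits are example choices, not necessarily the ones used in this project.

from transformers import CLIPTokenizerFast, pipeline

tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def fit_to_clip(description: str, max_tokens: int = 77) -> str:
    # CLIP's text encoder only sees 77 tokens, so anything longer silently loses information.
    if len(tokenizer(description)["input_ids"]) <= max_tokens:
        return description
    # Summarize instead of hard-truncating, so the tail of the description isn't simply dropped.
    summary = summarizer(description, max_length=60, min_length=20, do_sample=False)
    return summary[0]["summary_text"]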
Next, we will load the pretrained CLIP vision-language model from Hugging Face. As mentioned before, this pretrained neural network consists of an image encoder and a text encoder.
import torch
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    "openai/clip-vit-base-patch32",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
).to(device)
We also need to load the processor to resize and normalize images and tokenize text.
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
With the tokenized text and the processed image, we have to generate their numerical representations in the same embedding space (the image path and the description variable in the snippet below stand in for the data prepared earlier).
import os
from PIL import Image

img_path = os.path.join(bird_path, 'img_path.png')
image = Image.open(img_path).convert("RGB")

# bird_description holds the (possibly summarized) textual description prepared earlier
inputs = processor(text=bird_description, images=image, return_tensors="pt",
                   padding=True, truncation=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)
    text_embedding = outputs.text_embeds.cpu()
    image_embedding = outputs.image_embeds.cpu()
Next, we have to pair the image embedding and the corresponding text embedding together for each species and save them as a torch file for inference (the variable names for the species name and the stored image path below are placeholders for values gathered during data collection).
bird_embeddings_ls = []

bird_embeddings_ls.append({
    "name": bird_name,  # species name (placeholder variable)
    "text_embedding": text_embedding,
    "image_embedding": image_embedding,
    "s3_image_path": s3_image_path,  # S3 key of the stored image, used at inference (placeholder variable)
})

torch.save(bird_embeddings_ls, "bird_embeddings.pt")
At prediction time, the user’s input text is passed through the same process as in the training phase to generate its numerical representation. Then, we loop through all the visual embeddings we saved during training and calculate the cosine similarity score between the input text embedding and each of the image embeddings. We can compare all the similarity scores and pick the top k best scores to get the best matches. We can store the images themselves on any cloud storage; for this project, I used AWS S3, which is a highly available, durable, and secure storage service. (The snippet below assumes the user’s description is available as a string.)
from scipy.spatial.distance import cosine

# user_description is the free-text description entered by the user (placeholder variable)
text_inputs = processor(text=user_description, return_tensors="pt", padding=True, truncation=True).to(device)

with torch.no_grad():
    user_text_embedding = model.get_text_features(**text_inputs).cpu().float().squeeze(0).numpy()

similarity_results = []
for bird in bird_embeddings_ls:
    image_embedding = bird["image_embedding"].float().squeeze(0).numpy()  # cast to float32; NumPy has no bfloat16
    similarity = 1 - cosine(user_text_embedding, image_embedding)  # cosine similarity = 1 - cosine distance
    similarity_results.append((similarity, bird["name"], bird["s3_image_path"]))

similarity_results.sort(reverse=True)
top_matches = similarity_results[:top_k]
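To display the matched images, one option is to serve each stored object through a presigned S3 URL; here is a minimal boto3 sketch, with the bucket name as a placeholder.

import boto3

s3_client = boto3.client("s3")

def presigned_image_url(s3_image_path: str, bucket: str = "my-bird-images") -> str:
    # Generate a temporary, shareable link to the stored image without making the bucket public.
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": s3_image_path},
        ExpiresIn=3600,  # link stays valid for one hour
    )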
The application is deployed on Hugging Face Spaces as a Streamlit app. Please visit the URL here.
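For reference, the UI around this logic can stay very small. Below is a minimal Streamlit sketch along these lines; it is not the exact code of the deployed app, and find_top_matches stands in for the retrieval loop shown earlier.

import streamlit as st

st.title("Which bird did I see?")
description = st.text_area("Describe the bird you spotted")

if st.button("Identify") and description:
    # find_top_matches is a stand-in for the embedding + cosine-similarity retrieval above.
    for score, name, image_url in find_top_matches(description, top_k=3):
        st.subheader(f"{name} (similarity: {score:.2f})")
        st.image(image_url)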
Now that we’ve created a simple application that can act as a digital companion for our bird-watching escapades, the next step is to look at its areas of improvement. Let’s go through them.
- It’s worth exploring different prompts and comparing the accuracy of predictions. We can use LLMs to automate the generation of suitable prompts for training and inference, for much more consistent results.
- As of now, we’ve only trained with a single image for each species for the sake of simplicity, since the app is in its nascent stage. We can train the model with multiple images and descriptions for each species, covering different sexes, breeding types, and various developmental stages, since their appearances may vary drastically from one another.
In a world plagued by the existential dread of AI taking over human experiences and contributions, let’s take advantage of this rapidly growing field to help us improve our productivity, navigate our weird hobbies, and ultimately harness the power of AI to make our lives easier and less chaotic.