Image-to-text using BLIP-2 gives an incorrect answer

I am using a code snippet slightly modified from the BLIP-2 site:

The first prompt, "Question: How many cats are there? Answer:", gives the correct answer: Two.

However, the second prompt, "Question: How many dogs are there? Answer:", gives an incorrect answer: Two. It should be zero or none.

Is this because the trained model is not 100% accurate, so some incorrect answers are to be expected? Or am I doing something incorrectly?

Here is the complete code:

from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and the model in half precision
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
model.to(device)

# Download the test image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "Question: How many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    device, torch.float16
)

outputs = model.generate(**inputs)

# Decode the generated token IDs back into text
text = processor.tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text)

This gives the correct answer: ['Question: How many cats are there? Answer: Two\n']
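The decoded string echoes the prompt before the answer, which makes it awkward to compare answers across prompts. A minimal sketch of stripping the prompt off is shown here; `extract_answer` is a hypothetical helper name, not part of `transformers`, and it assumes the decoded text starts with the prompt as in the output above.

```python
def extract_answer(decoded: str, prompt: str) -> str:
    """Return only the generated answer, without the echoed prompt."""
    text = decoded.strip()
    prompt = prompt.strip()
    # Drop the echoed prompt if the decoded text begins with it
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


decoded = "Question: How many cats are there? Answer: Two\n"
prompt = "Question: How many cats are there? Answer:"
print(extract_answer(decoded, prompt))  # Two
```

If the decoded text does not begin with the prompt, the helper returns the whole string with surrounding whitespace removed.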

However, when I change the prompt to

prompt2 = "Question: How many dogs are there? Answer: "

inputs2 = processor(images=image, text=prompt2, return_tensors="pt").to(
    device, torch.float16
)

outputs2 = model.generate(**inputs2)

text2 = processor.tokenizer.batch_decode(outputs2, skip_special_tokens=True)
print(text2)

the output is: ['Question: How many dogs are there? Answer: Two\n']

Am I doing something incorrectly?

There’s no problem with the code; it seems to be a known issue with the model / architecture. You might want to try using some fine-tuned version.

Thanks!!

I tried the examples you pointed to. The question about the number of dogs still gave Two. However, following the examples further, I got the following results:

55.3% that image 0 is 'a photo of a cat'
44.7% that image 0 is 'a photo of a dog'

Perhaps this explains why the model cannot distinguish between cats, dogs or anything else?
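To put those two percentages in perspective: the thread does not say how they were computed, but assuming they come from a softmax over one raw score per caption (a common way to produce such figures), a 55.3% / 44.7% split corresponds to a very small gap between the scores. The sketch below uses hypothetical raw scores chosen only to reproduce that split.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the cat and dog captions (a gap of about 0.21)
cat_p, dog_p = softmax([0.213, 0.0])
print(f"{cat_p:.1%} cat, {dog_p:.1%} dog")  # 55.3% cat, 44.7% dog
```

Under that assumption, the model scores the two captions almost identically, which is consistent with it not separating cats from dogs on this image.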

Yeah. For example, CLIP can classify dogs and cats perfectly, but BLIP seems utterly unsuitable for classification.
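For comparison, here is a rough sketch of zero-shot classification with CLIP through `transformers`. The checkpoint name `openai/clip-vit-base-patch32` is an assumption (the thread does not say which CLIP model was meant), and `format_scores` and `classify_with_clip` are hypothetical helper names. The heavy imports sit inside the function so that the formatting helper can be used without them.

```python
from typing import List


def format_scores(labels: List[str], probs: List[float]) -> List[str]:
    """Render one line per label in the style "55.3% that image 0 is '...'"."""
    return [f"{p:.1%} that image 0 is '{label}'" for label, p in zip(labels, probs)]


def classify_with_clip(image_url: str, labels: List[str]) -> List[float]:
    """Zero-shot classification: softmax over CLIP image-text similarity logits."""
    # Imported lazily: these need torch, transformers, Pillow and requests installed
    import requests
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint
    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    image = Image.open(requests.get(image_url, stream=True).raw)
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_images, num_labels); take the single image
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    return probs.tolist()
```

Calling `classify_with_clip` with the image URL from the question and the labels "a photo of a cat" and "a photo of a dog", then passing the result to `format_scores`, prints lines in the same format as the figures quoted above.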

Thanks for the clear explanation!!