Open-Source Vision AI – SURPRISING Results! (Phi3 Vision vs LLaMA 3 Vision vs GPT4o)
AI technology is transforming our world today. Among the models at the forefront, Phi3 Vision, LLaMA 3 Vision, and GPT4o Vision are attracting attention as state-of-the-art AI models.
Phi3 Vision offers remarkable accuracy and efficiency in image processing and can handle complex datasets, which is expected to drive innovative progress in fields such as medical diagnosis and autonomous driving.
LLaMA 3 Vision is an AI model specialized in natural language processing, highly regarded for its ability to analyze vast amounts of text data and understand their meaning, promising innovative results in fields such as machine translation and document classification.
GPT4o Vision is a general-purpose AI model applicable to a wide range of uses. Its flexibility and learning capacity enable problem-solving and creative applications across many industries.
Pinecone, meanwhile, is a platform specialized for vector database (Vector DB) needs, and could be called an essential tool for AI developers and researchers.
As described above, AI technology stands as a key driver of innovation in our lives and businesses. With further evolution expected, it is important that we make the most of its potential.
The Phi models are incredibly good models, unfortunately not very useful in practice because of how heavily censored they are. In the meme example, for instance, you ran into the issue with Phi where it refused to criticise or insult anyone. If any answer looks like it contains "personal details" or has a negative slant against anybody, it will just refuse to answer or offend anyone and instead give that "everyone is working hard in their own way" type of non-answer.
It's incredibly disappointing, because the Phi models are some of the best models out there otherwise. But you can't trust them to actually do what you say with arbitrary content.
I imagine if you had tried the OCR example with a meme critical of someone or something, it would likely even have refused to tell you what the text in the image was; that's how heavily censored the models are in my testing.
#llamaisnotopensource
but idefics2-8b is.
You did nothing wrong at all. LLaVA has always been this bad, and always creatively verbose too. As an aside, it would have been nice to see Claude 3 vision in this.
Maybe also include examples with languages and scripts other than English.
Just use unaligned models and win. EASY.
I've actually had pretty good luck with Llama 3 Dolphin. I tried using the LLaVA variant and came up with roughly the same results.
GPT4o seems to have the best understanding of 3D physical space, including direction, coordinates, mass, speed, collision, risk avoidance, obstacles, etc.
Crazy AI❤🎉🎉❤🎉❤❤🎉❤🎉
For future vision tests you should ask the vision model to describe a proper NSFW scene or picture.
I want to know how censored it is and how it acts when it gets presented with such an image.
For example, will it refuse or describe it? And if it refuses, will it try to moralise or shame you, like some models do if you do anything they find restricted?
We definitely need some other way to test prompts for the visual models. I would use a long, explanatory system message that directs the local LLM to scan the picture, convert it to text, and then use that text to "read" the user prompt alongside the graphical embedding, something like the sketch below.
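A minimal sketch of that two-stage idea, assuming a local Ollama install with the llava (vision) and llama3 (text) models pulled; the model names, system message, and file name are illustrative choices of mine, not anything from the video:

```python
# Stage 1: a vision model transcribes the image to plain text.
# Stage 2: a text-only model answers the user prompt using that transcription.
import ollama

SYSTEM_MSG = (
    "You are an image transcriber. Scan the attached picture and convert "
    "everything you see (text, objects, layout) into a plain-text description."
)

def image_to_text(image_path: str) -> str:
    resp = ollama.chat(
        model="llava",  # assumed local vision model
        messages=[
            {"role": "system", "content": SYSTEM_MSG},
            {"role": "user", "content": "Describe this image.", "images": [image_path]},
        ],
    )
    return resp["message"]["content"]

def answer_with_context(image_path: str, user_prompt: str) -> str:
    description = image_to_text(image_path)
    resp = ollama.chat(
        model="llama3",  # assumed local text model
        messages=[{
            "role": "user",
            "content": f"Image description:\n{description}\n\nQuestion: {user_prompt}",
        }],
    )
    return resp["message"]["content"]

print(answer_with_context("meme.png", "What is the joke in this meme?"))
```

The upside is that stage 2 can use any text-only model, since it only ever sees the transcription.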
Hi Matthew
I think a great test, which none of the vision models are yet great at, is to convert a bitmap graph to data.
E.g. a stacked bar graph, 3 or 4 series, 2 or more categories (see the sketch after this comment for one way to score such a test).
It would be a life-changing productivity hack!!
Great channel.
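One way the graph-to-data test above could be scored automatically. This is a rough sketch; ask_vision_model is a hypothetical stand-in for whichever concrete vision API call you prefer (such as the ones sketched elsewhere in this thread), and the prompt wording is my own:

```python
# Score the graph-to-data test: ask a vision model for CSV and diff the reply
# against known ground truth. `ask_vision_model(image_path, prompt) -> str`
# is a hypothetical helper standing in for any concrete vision API call.
import io

import pandas as pd

PROMPT = (
    "This image is a stacked bar graph with multiple series and categories. "
    "Output the underlying data as CSV with a header row and nothing else."
)

def score_graph_extraction(image_path: str, ground_truth_csv: str, ask_vision_model) -> float:
    reply = ask_vision_model(image_path, PROMPT)
    got = pd.read_csv(io.StringIO(reply))
    want = pd.read_csv(ground_truth_csv)
    if got.shape != want.shape:  # wrong number of series or categories
        return 0.0
    # Fraction of cells the model reproduced exactly.
    return float((got.values == want.values).mean())
```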
For the GPT-4o testing, I think there may be some personalization settings affecting the results. The responses seem too succinct to me.
The “who is this” feature is probably only available to the developer of the LLM and the government…
GPT won the captcha one IMO. It understood the core of the question. No human would give you the "captcha" letters unless they were trolling you.
clearly misidentifying a human lizard 😀
You should test the Google vision model also
Awesome video! I was wondering how Phi-3-Vision fares compared to other vision-capable LLMs. I watched your video while I was working on my own Phi-3-Vision tests using Web UI screenshots (my hope is that it could be used for automated Web UI testing). However, Phi-3 turned out to be horrible at Web UI testing (you can see the video from my tests in my YouTube channel, if you are interested). It's nice to see that it fares much better with normal photos! Thanks for making this video – it saved me some time on testing it myself 🙂
Ask questions about charts: line charts, bar charts, etc. Ask questions whose answers are not immediately obvious from the data but can be inferred. Ask it to convert from one chart type to another.
Gates went to Epstein 🙁
Another prompt idea: show an image of a pizza and ask what ingredients it has.
Thanks for the great video! It would be great to show the vision models a screenshot of the game Pong or the game Snake and ask them to write the Python code to recreate the game as shown in the image.
llava struggles with OCR
With that converted_table.csv, it's only using Code Interpreter to do the final conversion from a table to csv. If you just ask for the table I think it will just output the data and not use Code Interpreter.
how can we install this?
Phi 3 Vision, because it's free and faster. Okay, maybe it doesn't understand everything, but it's still amazing for free!
Well, we explicitly know GPT4o was trained on the whole of Reddit, so it's no surprise it passed the digging meme.
Your QR code for the URL looks really strange. Maybe provide an image without transparency?
Why didn’t you add Gemini 1.0 Pro Vision? 😅
Very very useful Matt, thanks!
Please test their ability to analyze screenshots of web pages and determine mouse coordinates to click for potential web automation applications. For example, show a website and ask, "What coordinate should I click to log in?"
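For anyone who wants to try that, here is a rough sketch using the OpenAI Python SDK with gpt-4o; the screenshot file name and prompt wording are assumptions of mine, and current models are often unreliable at exact pixel coordinates:

```python
# Ask a vision model which pixel coordinate to click in a UI screenshot.
import base64

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def click_coordinate(screenshot_path: str, target: str) -> str:
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"What (x, y) pixel coordinate should I click to {target}? "
                         "Answer with just the coordinate."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(click_coordinate("login_page.png", "log in"))
```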
Do you think it was maybe intentionally blocking Gates' ID for security reasons?
Solar is cheaper. These morons 🤦♂️ when we're drowning in debt, let's spend MORE MONEY.
Isn’t the Apple model open source? I think it could be a very good and small model.
With OpenAI's model performing worse, the open-source community is doing what we've predicted all along: yet again proving that millions of people cooperating outpace sectors that work through competitive secrecy.
I think the captcha pic is way too easy ^^
I am a data annotator. One suggestion would be to use a "Where's Waldo" image and have the bot not only find Waldo, but also describe to the user how to find him. I would be curious to know how they navigate an image.
Nice to see Matt not insert his far left progressive ideology into a video. I'm still unsubbed though. He'll have to work harder to win me back.
Can you try uploading a graph picture, such as a CPI YoY line chart, and asking the models how they understand it? At what average pace is it moving? In which date intervals is it growing or shrinking?
Bill Gates is the former CEO of Microsoft? Since when? I wouldn't have known if I hadn't clicked on this video 😂
Maybe try to use them for reading a webpage and scraping the right data from it
Interesting but I will not be impressed until it can identify faces. Maybe one of the Chinese models?
Hi Matthew, I'm intrigued that they could not identify Bill Gates. Please try with other tech giants – Jeff Bezos, Mark Zuckerberg, Sam Altman. Then try with non-tech celebrities, such as politicians, musicians, actors, etc. It would be interesting to see the results.
GPT-4 initially struggled but eventually transformed into a Terminator ! 😂
I have tried dropping a simple software architecture diagram on them and asking them to extract the entities, hierarchy and connections into a json file, which usually works quite well.
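For reference, a sketch of the kind of call I mean, using the OpenAI Python SDK with gpt-4o; the exact JSON schema in the prompt is an illustrative choice:

```python
# Extract entities, hierarchy and connections from an architecture diagram
# into JSON, using gpt-4o's JSON mode for a parseable reply.
import base64
import json

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract the entities, hierarchy and connections from this architecture "
    "diagram. Reply with JSON only, shaped like: "
    '{"entities": [...], "hierarchy": {...}, "connections": [["from", "to"], ...]}'
)

def diagram_to_json(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)

print(diagram_to_json("architecture.png"))
```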