Open-Source Vision AI – SURPRISING Results! (Phi3 Vision vs LLaMA 3 Vision vs GPT4o)
AI technology is transforming our world today. Among the models at the forefront, Phi3 Vision, LLaMA 3 Vision, and GPT4o Vision are attracting attention as state-of-the-art AI models.
Phi3 Vision offers remarkable accuracy and efficiency in image processing and can handle complex datasets, which is expected to drive innovative progress in fields such as medical diagnosis and autonomous driving.
LLaMA 3 Vision is an AI model specialized in natural language processing, highly regarded for its ability to analyze vast amounts of text data and understand their meaning, promising innovative results in fields such as machine translation and document classification.
GPT4o Vision is a general-purpose AI model applicable to a wide range of uses. Its flexibility and learning capacity enable problem-solving and creative applications across many industries.
Pinecone, meanwhile, is a platform specialized for vector database (Vector DB) needs, and could be called an essential tool for AI developers and researchers.
As described above, AI technology stands as a key driver of innovation in our lives and businesses. With further evolution expected, it is important that we make the most of its potential.
The Phi models are incredibly good models, unfortunately not very useful in practice because of how heavily censored they are. In the meme example, for instance, you ran into the issue with Phi where it refused to criticise or insult anyone. If any answer looks like it contains "personal details" or has a negative slant against anybody, it will just refuse to answer or offend anyone and instead give that "everyone is working hard in their own way" type of non-answer.
It's incredibly disappointing, because the Phi models are some of the best models out there otherwise. But you can't trust them to actually do what you say with arbitrary content.
I imagine if you had tried the OCR example with a meme critical of someone or something, it would likely even have refused to tell you what the text in the image was; that's how heavily censored the models are in my testing.
#llamaisnotopensource
but idefics2-8b is.
You did nothing wrong at all. LLaVA has always been this bad, and always creatively verbose too. As an aside, it would have been nice to see Claude 3 vision in this.
Maybe also include examples with languages and scripts other than English.
Just use unaligned models and win. EASY.
I've actually had pretty good luck with Llama 3 Dolphin. I tried using the LLaVA variant and came up with roughly the same results.
GPT4o seems to have the best understanding of 3D physical space, including direction, coordinates, mass, speed, collision, risk avoidance, obstacles, etc.
Crazy AI❤🎉🎉❤🎉❤❤🎉❤🎉
For future vision tests you should ask the vision model to describe a proper NSFW scene or picture.
I want to know how censored it is and how it acts when it gets presented with such an image.
For example, will it refuse or describe it? And if it refuses, will it try to moralise or shame you, like some models do if you do anything they find restricted?
We definitely need some other way to test prompts for the visual models. I would use a long, explanatory system message that directs the local LLM to scan the picture, convert it to text, and then use that text to "read" the user prompt alongside the graphical embedding, something like the sketch below.
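A minimal sketch of that two-stage idea, assuming a local Ollama install with the llava (vision) and llama3 (text) models pulled; the model names, system message, and file name are illustrative choices of mine, not anything from the video:

```python
# Stage 1: a vision model transcribes the image to plain text.
# Stage 2: a text-only model answers the user prompt using that transcription.
import ollama

SYSTEM_MSG = (
    "You are an image transcriber. Scan the attached picture and convert "
    "everything you see (text, objects, layout) into a plain-text description."
)

def image_to_text(image_path: str) -> str:
    resp = ollama.chat(
        model="llava",  # assumed local vision model
        messages=[
            {"role": "system", "content": SYSTEM_MSG},
            {"role": "user", "content": "Describe this image.", "images": [image_path]},
        ],
    )
    return resp["message"]["content"]

def answer_with_context(image_path: str, user_prompt: str) -> str:
    description = image_to_text(image_path)
    resp = ollama.chat(
        model="llama3",  # assumed local text model
        messages=[{
            "role": "user",
            "content": f"Image description:\n{description}\n\nQuestion: {user_prompt}",
        }],
    )
    return resp["message"]["content"]

print(answer_with_context("meme.png", "What is the joke in this meme?"))
```

The upside is that stage 2 can use any text-only model, since it only ever sees the transcription.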
Hi Matthew
I think a great test, which none of the vision models are yet great at, is to convert a bitmap graph to data.
E.g. a stacked bar graph, 3 or 4 series, 2 or more categories (see the sketch after this comment for one way to score such a test).
It would be a life-changing productivity hack!!
Great channel.
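One way the graph-to-data test above could be scored automatically. This is a rough sketch; ask_vision_model is a hypothetical stand-in for whichever concrete vision API call you prefer (such as the ones sketched elsewhere in this thread), and the prompt wording is my own:

```python
# Score the graph-to-data test: ask a vision model for CSV and diff the reply
# against known ground truth. `ask_vision_model(image_path, prompt) -> str`
# is a hypothetical helper standing in for any concrete vision API call.
import io

import pandas as pd

PROMPT = (
    "This image is a stacked bar graph with multiple series and categories. "
    "Output the underlying data as CSV with a header row and nothing else."
)

def score_graph_extraction(image_path: str, ground_truth_csv: str, ask_vision_model) -> float:
    reply = ask_vision_model(image_path, PROMPT)
    got = pd.read_csv(io.StringIO(reply))
    want = pd.read_csv(ground_truth_csv)
    if got.shape != want.shape:  # wrong number of series or categories
        return 0.0
    # Fraction of cells the model reproduced exactly.
    return float((got.values == want.values).mean())
```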
For the GPT-4o testing, I think there may be some personalization settings affecting the results. The responses seem too succinct to me.
The “who is this” feature is probably only available to the developer of the LLM and the government…
GPT won the captcha one IMO. It understood the core of the question. No human would give you the "captcha" letters unless they were trolling you.
clearly misidentifying a human lizard 😀
You should test the Google vision model also
Awesome video! I was wondering how Phi-3-Vision fares compared to other vision-capable LLMs. I watched your video while I was working on my own Phi-3-Vision tests using Web UI screenshots (my hope is that it could be used for automated Web UI testing). However, Phi-3 turned out to be horrible at Web UI testing (you can see the video from my tests in my YouTube channel, if you are interested). It's nice to see that it fares much better with normal photos! Thanks for making this video – it saved me some time on testing it myself 🙂
Ask questions about charts: line charts, bar charts, etc. Ask questions whose answers are not immediately obvious from the data but can be inferred. Ask it to convert from one chart type to another.
Gates went to Epstein 🙁
Another prompt idea: show an image of a pizza and ask what ingredients it has.
Thanks for the great video! It would be great to show the vision models a screenshot of the game Pong or the game Snake and ask them to write the Python code to recreate the game as shown in the image.
llava struggles with OCR
With that converted_table.csv, it's only using Code Interpreter to do the final conversion from a table to csv. If you just ask for the table I think it will just output the data and not use Code Interpreter.
how can we install this?
Phi 3 Vision, because it's free and faster. Okay, maybe it doesn't understand everything, but it's still amazing for free!
Well, we explicitly know GPT4o was trained on the whole of Reddit, so it's no surprise it passed the digging meme.
Your QR code for the URL looks really strange. Maybe provide an image without transparency?
Why didn’t you add Gemini 1.0 Pro Vision? 😅
Very very useful Matt, thanks!
Please test their ability to analyze screenshots of web pages and determine mouse coordinates to click for potential web automation applications. For example, show a website and ask, "What coordinate should I click to log in?"
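For anyone who wants to try that, here is a rough sketch using the OpenAI Python SDK with gpt-4o; the screenshot file name and prompt wording are assumptions of mine, and current models are often unreliable at exact pixel coordinates:

```python
# Ask a vision model which pixel coordinate to click in a UI screenshot.
import base64

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def click_coordinate(screenshot_path: str, target: str) -> str:
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"What (x, y) pixel coordinate should I click to {target}? "
                         "Answer with just the coordinate."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(click_coordinate("login_page.png", "log in"))
```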
Do you think it was maybe intentionally blocking Gates' ID for security reasons?
Solar is cheaper. These morons 🤦♂️ when we're drowning in debt, let's spend MORE MONEY.
Isn’t the Apple model open source? I think it could be a very good and small model.
With OpenAI's model performing worse, the open-source community is doing what we've predicted all along: yet again proving that millions of people cooperating outpace sectors that work through competitive secrecy.
I think the captcha pic is way too easy ^^
I am a data annotator. One suggestion would be to use a "Where's Waldo" image and have the bot not only find Waldo, but also describe to the user how to find him. I would be curious to know how they navigate an image.
Nice to see Matt not insert his far left progressive ideology into a video. I'm still unsubbed though. He'll have to work harder to win me back.
Can you try uploading a graph picture, such as a CPI YoY line chart, and asking the models how they understand it? At what average pace is it moving? In which date intervals is it growing or shrinking?
Bill Gates is the former CEO of Microsoft? Since when? I wouldn't have known if I hadn't clicked on this video 😂
Maybe try to use them for reading a webpage and scraping the right data from it
Interesting but I will not be impressed until it can identify faces. Maybe one of the Chinese models?
Hi Matthew, I'm intrigued that they could not identify Bill Gates. Please try with other tech giants – Jeff Bezos, Mark Zuckerberg, Sam Altman. Then try with non-tech celebrities, such as politicians, musicians, actors, etc. It would be interesting to see the results.
GPT-4 initially struggled but eventually transformed into a Terminator ! 😂
I have tried dropping a simple software architecture diagram on them and asking them to extract the entities, hierarchy and connections into a json file, which usually works quite well.
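For reference, a sketch of the kind of call I mean, using the OpenAI Python SDK with gpt-4o; the exact JSON schema in the prompt is an illustrative choice:

```python
# Extract entities, hierarchy and connections from an architecture diagram
# into JSON, using gpt-4o's JSON mode for a parseable reply.
import base64
import json

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract the entities, hierarchy and connections from this architecture "
    "diagram. Reply with JSON only, shaped like: "
    '{"entities": [...], "hierarchy": {...}, "connections": [["from", "to"], ...]}'
)

def diagram_to_json(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)

print(diagram_to_json("architecture.png"))
```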