GPT-4 Vision in Practice: Building Smart Image Analysis Systems

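The Power of Multimodal AI

Image analysis used to mean training a dedicated CV model. With GPT-4o, you can simply hand it an image and ask questions.

GPT-4o (the "o" stands for omni) is OpenAI's flagship multimodal model. It can handle text, images, and even audio.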

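Key Capabilities

In my testing, GPT-4o performs well on these tasks:

Object recognition: identifying objects, people, and scenes in an image
Text extraction (OCR): reading text from images, with multilingual support
Diagram interpretation: understanding flowcharts, architecture diagrams, and data charts
Image comparison: finding the differences between two images
Code screenshot analysis: understanding code directly from a screenshot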

Basic Usage

API Call

from openai import OpenAI
import base64

client = OpenAI()

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode('utf-8')

# Encode the image as base64
image_base64 = encode_image("screenshot.png")

# Send the request
response = client.chat.completions.create(
    model="gpt-4o",  # Note: not gpt-4-vision-preview
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_base64}"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
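
The call above hard-codes `image/png` in the data URL, which is wrong once you start sending JPEG or WebP files. Here is a minimal sketch of a helper that infers the MIME type from the file extension. The name `image_to_data_url` is my own, not part of the OpenAI SDK, and it assumes the file extension is trustworthy.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path: str) -> str:
    """Return a base64 data URL whose MIME type matches the file extension."""
    # Hypothetical helper: guess the MIME type from the extension only.
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed directly as the `url` value of an `image_url` content part.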

Using a URL

You can also pass an image URL directly:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
)
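
The image comparison capability mentioned earlier relies on the fact that a single user message can carry more than one `image_url` part. As a rough sketch, the helper below builds such a message. `build_multi_image_message` is a hypothetical name, and the URLs in the usage note are placeholders.

```python
def build_multi_image_message(prompt: str, image_urls: list[str]) -> dict:
    """Build one user message containing a text prompt and several images."""
    # The text part goes first, followed by one image_url part per image.
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}
```

You would then pass `messages=[build_multi_image_message("What differs between these two images?", ["https://example.com/a.jpg", "https://example.com/b.jpg"])]` to `client.chat.completions.create`. Each URL can be an `https://` link or a base64 data URL.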

Practical Applications

1. Document OCR

Turn images of receipts and forms into structured data:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": """Analyze this receipt, extract the following information, and return it as JSON:
                    - Merchant name
                    - Date
                    - Line items (name, quantity, amount)
                    - Total amount"""
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{receipt_image}"}
                }
            ]
        }
    ]
)
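
The reply comes back as plain text, and in my experience the model sometimes wraps the JSON in a Markdown code fence. Here is a small sketch of a parser that tolerates both cases. `parse_json_reply` is a name I made up for illustration, and it assumes the reply contains exactly one JSON object.

```python
import json
import re

# Build the fence marker programmatically to keep this snippet easy to embed.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(reply_text: str) -> dict:
    """Parse JSON from a model reply, with or without a Markdown code fence."""
    text = reply_text.strip()
    match = FENCED_BLOCK.search(text)
    if match:
        # Keep only what sits between the opening and closing fence.
        text = match.group(1).strip()
    return json.loads(text)
```

Typical usage would be `data = parse_json_reply(response.choices[0].message.content)`. The Chat Completions API also accepts `response_format={"type": "json_object"}`, which should make the fence handling unnecessary, but a tolerant parser is a cheap safety net either way.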

2. UI Screenshot Analysis

Take a UI screenshot from a designer and generate the code automatically:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Based on this UI design screenshot, generate the corresponding React + Tailwind CSS code"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{ui_screenshot}"}
                }
            ]
        }
    ]
)

3. Error Screenshot Diagnosis

Hand it a screenshot of an error and let it analyze the problem for you:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This is a screenshot of an error from my program. Please analyze the cause and suggest a fix"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{error_screenshot}"}
                }
            ]
        }
    ]
)
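
All three examples share the same request shape and differ only in the prompt and the image. One way to cut the repetition is a thin wrapper like the sketch below. `analyze_image` is a hypothetical helper of mine, and taking the client as a parameter is a design choice for testability, not something the SDK requires.

```python
def analyze_image(client, prompt: str, image_url: str,
                  model: str = "gpt-4o", max_tokens: int = 500) -> str:
    """Send one prompt plus one image and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # image_url may be an https:// link or a base64 data URL.
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```

With this in place, the error diagnosis case becomes `analyze_image(client, "Analyze the cause of this error and suggest a fix", f"data:image/png;base64,{error_screenshot}")`.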

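Things to Watch Out For

Token Consumption

Images consume tokens, and usually more than a typical text prompt. A large image can consume 1,000+ tokens.

Recommendations:

Compress images before uploading
Crop to just the part you need
Set the detail parameter to "low" to reduce resolution (saves tokens)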

{
    "type": "image_url",
    "image_url": {
        "url": f"data:image/png;base64,{image_base64}",
        "detail": "low"  # or "high", "auto"
    }
}
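
To see where the "1,000+ tokens" figure comes from, here is a rough estimator based on the tiling rule OpenAI has documented for GPT-4o: a flat 85 tokens in low detail, and in high detail 85 base tokens plus 170 per 512x512 tile after resizing. Treat the constants as assumptions that may change or differ between models. The function name is mine, and I assume that images are only ever scaled down, never up.

```python
import math

LOW_DETAIL_TOKENS = 85   # assumed flat cost for detail="low"
BASE_TOKENS = 85         # assumed fixed overhead for detail="high"
TOKENS_PER_TILE = 170    # assumed cost per 512x512 tile


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image from its pixel dimensions."""
    if detail == "low":
        return LOW_DETAIL_TOKENS
    # Step 1: shrink the image to fit inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink again so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512 px tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles
```

Under these assumptions a 1024x1024 image costs 765 tokens in high detail and a 2048x4096 image costs 1105, while either one costs 85 in low detail. That gap is why cropping and `detail: "low"` are worth the effort.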

Limitations

Cannot process video (still images only)
May misread very small text
Some sensitive content is filtered

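Comparison with Traditional CV

Aspect	GPT-4o	Traditional CV model
Development speed	Fast; a direct API call	Slow; requires training
Flexibility	High; natural-language instructions	Low; fixed task
Cost	Billed per token	One-time training cost
Accuracy	Good in general scenarios	More accurate in specific scenarios
Latency	Relatively high	Can be optimized to very low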

My recommendation: use GPT-4o for general tasks, and a purpose-trained model for specific, high-frequency tasks.
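
The "high-frequency" part of that advice is really a break-even question: per-token billing wins at low volume, and a one-time training cost wins once the request count is large enough. Here is a back-of-the-envelope sketch. Every input is a number you would have to supply yourself, and the function name is hypothetical.

```python
import math
from typing import Optional


def break_even_requests(training_cost: float,
                        api_cost_per_request: float,
                        custom_cost_per_request: float) -> Optional[int]:
    """Return the request count at which a custom model becomes cheaper.

    Returns None when the custom model never pays off, that is, when its
    per-request cost is not lower than the API's.
    """
    saving_per_request = api_cost_per_request - custom_cost_per_request
    if saving_per_request <= 0:
        return None
    return math.ceil(training_cost / saving_per_request)
```

For instance, with a made-up training cost of 1000, an API cost of 0.5 per request, and a self-hosted cost of 0.25 per request, the custom model pays for itself after 4000 requests. The sketch ignores engineering time and maintenance, which in practice push the break-even point further out.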

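Summary

GPT-4o makes image analysis remarkably simple. Features that used to take weeks to develop can now be done in a few lines of code.

Good fits:

Rapid prototyping
Analysis requirements that change often
Cases where you do not want to train a dedicated model

Poor fits:

Extremely low latency requirements
Cost-sensitive, high-frequency calls
Offline operation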
