Compare commits
14 Commits
| Author | SHA1 | Date |
|---|---|---|
| | c5e1d0c88c | |
| | eb8e7a7daf | |
| | b6dc7af9bf | |
| | 7a43eb391b | |
| | d497e92626 | |
| | a6b4f99a83 | |
| | a95f9045e5 | |
| | c7e14ca396 | |
| | 6bc4e1d3b4 | |
| | e9e1f01728 | |
| | db4735e54e | |
| | 91a623751d | |
| | 33e3d378cc | |
| | 778a51ad45 | |
134
README.md
@@ -1,102 +1,102 @@
-# video-product-snapshot
+# video-product-snapshot: video product image search

-Detect ecommerce products in video frames using Claude Vision, extract the best product snapshot, and optionally search for matching products via image-search API.
+Extract the best product frame from a video, then run an image search on 1688 to find the same product.

-## How it works
+## How it works (工作原理)

-1. Extracts frames from the video at a configurable interval using `ffmpeg`
-2. Sends each frame to a vision model to detect whether a product is visible and rate confidence
-3. Picks the highest-confidence frame as the best snapshot
-4. Optionally calls an image-search API with the snapshot to find matching products
+1. `ffmpeg` samples frames every 0.5 s (at most 60 frames)
+2. Visual-quality pre-filter (brightness/variance checks discard blurry frames)
+3. Container/rack product detection → automatically pick a frame in which the product is empty
+4. The vision model compares and ranks multiple frames to select the best product frame
+5. Crop the product region → upload → 1688 image search
+6. Post-filter (the vision model judges whether each result is the same product) → rerank
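
The brightness/variance pre-filter in step 2 could look roughly like the sketch below. It is only an illustration: `grayStats`, `passesQualityFilter`, and the three thresholds are hypothetical names and values, not taken from the repository, and plain intensity variance is a crude stand-in for a real blur measure.

```typescript
// Hypothetical sketch: names and thresholds are assumptions, not the repository's code.
interface FrameStats {
  mean: number;     // average brightness, 0-255
  variance: number; // spread of pixel intensities around the mean
}

/** Compute mean and variance over 8-bit grayscale pixel values. */
function grayStats(pixels: Uint8Array): FrameStats {
  const n = pixels.length;
  if (n === 0) return { mean: 0, variance: 0 };
  let sum = 0;
  for (let i = 0; i < n; i++) sum += pixels[i];
  const mean = sum / n;
  let squared = 0;
  for (let i = 0; i < n; i++) {
    const d = pixels[i] - mean;
    squared += d * d;
  }
  return { mean, variance: squared / n };
}

// Assumed thresholds; they would need tuning against real footage.
const MIN_MEAN = 30;       // reject frames that are almost black
const MAX_MEAN = 225;      // reject frames that are blown out
const MIN_VARIANCE = 100;  // reject flat, low-contrast frames

/** Keep a frame only if it is neither too dark, too bright, nor too flat. */
function passesQualityFilter(pixels: Uint8Array): boolean {
  const { mean, variance } = grayStats(pixels);
  return mean >= MIN_MEAN && mean <= MAX_MEAN && variance >= MIN_VARIANCE;
}
```

A cheap filter of this kind runs before any vision-model call, so rejected frames cost no API requests.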

-## Install
+## Install (安装)

 ```bash
+./install.sh     # install auth-rt and dependencies
 bun install
-bun run build   # outputs dist/run.js
+bun run build    # outputs to dist/run.js
 ```

-## Usage
+## Usage (使用方法)

 ```bash
 bun dist/run.js <command> [options]
 ```

-### Commands
+### Commands (命令)

-| Command | Description |
-|---------|-------------|
-| `detect <video>` | Extract frames and detect product snapshots |
-| `search <image>` | Search products by image via API |
-| `detect-and-search <video>` | Full pipeline: detect best snapshot then search |
-| `session` | Print current auth session token |
+| Command | Description |
+|------|------|
+| `detect-best-and-search <video>` | **Recommended.** Best frame → image search → rerank |
+| `detect-best <video>` | Extract only the best product frame; no image search |
+| `detect-and-search <video>` | Image search after two-stage filtering (slower) |
+| `detect <video>` | Sample frames and run product detection on each frame |
+| `search <image>` | Find matching products from an existing image |
+| `rerank` | Cross-filter image-search results by keyword |
+| `session` | Get the current auth session token |

-### Options (`detect` / `detect-and-search`)
+### Options (`detect-best` / `detect-best-and-search`)

-| Flag | Default | Description |
-|------|---------|-------------|
-| `--interval=<sec>` | `1` | Seconds between sampled frames |
-| `--max-frames=<n>` | `60` | Max frames to analyze |
-| `--output-dir=<dir>` | next to video | Directory to save extracted frames |
-| `--min-confidence=<0-1>` | `0.7` | Minimum confidence to include a frame |
-| `--dry-run` | — | Parse args and print config without running |
+| Flag | Default | Description |
+|------|--------|------|
+| `--interval=<sec>` | `0.5` | Frame sampling interval in seconds |
+| `--max-frames=<n>` | `60` | Maximum number of frames to analyze |
+| `--output-dir=<dir>` | next to video | Directory where snapshots are saved |
+| `--session-id=<id>` | auto-generated | Langfuse session ID |
+| `--dry-run` | — | Parse arguments without executing |
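
As a rough illustration of how `--interval` and `--max-frames` could be read with the defaults listed above, here is a minimal sketch. The `getFlag` shown is a hypothetical re-implementation (the repository's own body is not part of this diff), and `parseSamplingOptions` plus its validation are assumptions added for the example.

```typescript
// Hypothetical re-implementation; the repository's getFlag body is not shown in this diff.
function getFlag(args: string[], flag: string): string | undefined {
  const prefix = `${flag}=`;
  const hit = args.find((a) => a.startsWith(prefix));
  return hit === undefined ? undefined : hit.slice(prefix.length);
}

interface SamplingOptions {
  intervalSeconds: number;
  maxFrames: number;
}

/** Parse sampling flags, falling back to the documented defaults (0.5 s, 60 frames). */
function parseSamplingOptions(args: string[]): SamplingOptions {
  const intervalSeconds = Number.parseFloat(getFlag(args, '--interval') ?? '0.5');
  const maxFrames = Number.parseInt(getFlag(args, '--max-frames') ?? '60', 10);
  // Validation is an assumption of this sketch: reject NaN, zero, and negative values early.
  if (!Number.isFinite(intervalSeconds) || intervalSeconds <= 0) {
    throw new Error('--interval must be a positive number of seconds');
  }
  if (!Number.isInteger(maxFrames) || maxFrames <= 0) {
    throw new Error('--max-frames must be a positive integer');
  }
  return { intervalSeconds, maxFrames };
}
```

Failing fast on a malformed value avoids handing `NaN` to the frame extractor.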

-### Examples
+## Output (输出)

-```bash
-# Detect products, sample every 3 seconds
-bun dist/run.js detect ./demo.mp4 --interval=3

-# Full pipeline with higher confidence threshold
-bun dist/run.js detect-and-search ./demo.mp4 --interval=5 --min-confidence=0.85

-# Search using an existing snapshot image
-bun dist/run.js search ./snapshot.jpg
-```

-## Output

-All commands return JSON to stdout.
+All commands print JSON to stdout, including a `sessionId` field used for Langfuse tracing.

 ```json
 {
+  "sessionId": "skill-20260426-184345-lb06",
+  "status": "success",
+  "command": "detect-best-and-search",
   "bestSnapshot": {
-    "frameIndex": 4,
-    "timestampSeconds": 9,
-    "imagePath": "/path/to/frame_0004.jpg",
-    "confidence": 0.92,
-    "description": "White sneaker with blue logo, left side view",
-    "boundingHint": "centered"
+    "frameIndex": 7,
+    "timestampSeconds": 3,
+    "imagePath": "/path/to/frame_0007.jpg",
+    "croppedImagePath": "/path/to/frame_0007_cropped.jpg",
+    "description": "黑色金属床底鞋架 可折叠移动"
   },
-  "productFrames": [...],
-  "searchBody": { ... }
+  "rerank": {
+    "keyword": "床底鞋架",
+    "results": [
+      { "num_iid": 123, "title": "...", "price": "44.00", "sales": 87, "detail_url": "..." }
+    ]
+  }
 }
 ```
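
A consumer of this JSON might type and read it along the following lines. This is a sketch under assumptions: the interfaces list only the fields visible in the sample above, `summarize` is a hypothetical helper, and the cropped-image preference mirrors the `croppedImagePath || imagePath` expression used in `src/index.ts`.

```typescript
// Hypothetical consumer-side types; only fields visible in the sample output are listed.
interface BestSnapshot {
  frameIndex: number;
  timestampSeconds: number;
  imagePath: string;
  croppedImagePath?: string;
  description: string;
}

interface RerankItem {
  num_iid: number;
  title: string;
  price: string;
  sales?: number;
  detail_url: string;
}

interface SkillOutput {
  sessionId: string;
  status: 'success' | 'failed';
  command: string;
  bestSnapshot?: BestSnapshot;
  rerank?: { keyword?: string; results: RerankItem[] };
  error?: string;
}

/** Parse CLI stdout and pull out the search image and the result titles. */
function summarize(raw: string): { image: string | undefined; titles: string[] } {
  const out = JSON.parse(raw) as SkillOutput;
  if (out.status !== 'success') throw new Error(out.error ?? 'command failed');
  const snap = out.bestSnapshot;
  // Prefer the cropped image when present, as the pipeline does before searching.
  const image = snap ? (snap.croppedImagePath || snap.imagePath) : undefined;
  const titles = (out.rerank?.results ?? []).map((r) => r.title);
  return { image, titles };
}
```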

-- `productFrames` — all detected frames sorted by confidence (highest first)
-- `bestSnapshot` — the top-ranked frame
-- `searchBody` — image-search API response (only for `search` / `detect-and-search`)

-## Environment variables

-The only required configuration is `CLIENT_KEY` in `~/.openclaw/.env`:
+## Auth architecture (鉴权架构)

 ```
-CLIENT_KEY=sk_xxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxx
+~/.openclaw/.env
+  CLIENT_KEY ──→ auth-rt ──→ business system
+                   ├── /session → access_token
+                   └── /client-config → provider.api_key
+                                        provider.base_url
+                                        provider.model
 ```

-All credentials and endpoints are fetched automatically from the client config via `auth-rt`. No per-skill env vars needed.
+Only `CLIENT_KEY` needs to be configured; LLM credentials and endpoints are issued by the business system.
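
Reading `CLIENT_KEY` out of the contents of `~/.openclaw/.env` could be sketched as below. `parseDotenv` and `requireClientKey` are hypothetical helpers written for this example; the repository's own `loadDotenv` is referenced in the diff, but its body is not shown, so this is not a copy of it.

```typescript
// Hypothetical parser; not the repository's loadDotenv implementation.
function parseDotenv(text: string): Record<string, string> {
  const vars: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip blanks and comments
    const eq = line.indexOf('=');
    if (eq <= 0) continue; // no key before '=': ignore the line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const first = value.charAt(0);
    if (value.length >= 2 && (first === '"' || first === "'") && value.charAt(value.length - 1) === first) {
      value = value.slice(1, -1);
    }
    vars[key] = value;
  }
  return vars;
}

/** Return CLIENT_KEY from the .env text, or fail with a clear message. */
function requireClientKey(envText: string): string {
  const key = parseDotenv(envText)['CLIENT_KEY'];
  if (!key) throw new Error('CLIENT_KEY is missing from ~/.openclaw/.env');
  return key;
}
```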

-### Optional overrides
+## Environment variables (环境变量)

-| Variable | Description |
-|----------|-------------|
-| `VISION_MODEL` | Override model name (default: `aliyun-cp-multimodal`) |
-| `AUTH_RT_BIN` | Override path to the `auth-rt` binary |
-| `TELEMETRY_ENDPOINT` | POST execution results to a telemetry endpoint |
+| Variable | Description |
+|------|------|
+| `CLIENT_KEY` | **Required.** Set in `~/.openclaw/.env` |
+| `VISION_MODEL` | Override the model name (the default comes from the client config) |
+| `SKILL_SESSION_ID` | Langfuse session ID (auto-generated, format `skill-YYYYMMDD-HHMMSS-xxxx`) |
+| `AUTH_RT_BIN` | Override the path to the `auth-rt` binary |
+| `TELEMETRY_ENDPOINT` | Telemetry reporting endpoint |
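
The `skill-YYYYMMDD-HHMMSS-xxxx` format can be produced and checked as in the sketch below. It follows the generator added to `auth-cli.ts` in this diff, with two deviations that are assumptions of the example: the clock and random source are injectable for testing, and the suffix is padded so it is always four characters long. `makeSessionId`, `resolveSessionId`, and `SESSION_ID_PATTERN` are hypothetical names.

```typescript
// Sketch modelled on the generator in this diff; function and constant names are hypothetical.
function makeSessionId(now: Date = new Date(), rand: () => number = Math.random): string {
  const pad = (n: number) => String(n).padStart(2, '0');
  const datePart = `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}`;
  const timePart = `${pad(now.getHours())}${pad(now.getMinutes())}${pad(now.getSeconds())}`;
  // toString(36) can yield fewer than four digits after "0.", so pad the suffix.
  const suffix = rand().toString(36).slice(2, 6).padEnd(4, '0');
  return `skill-${datePart}-${timePart}-${suffix}`;
}

const SESSION_ID_PATTERN = /^skill-\d{8}-\d{6}-[0-9a-z]{4}$/;

/** SKILL_SESSION_ID from the environment wins; otherwise generate a fresh ID. */
function resolveSessionId(env: Record<string, string | undefined>): string {
  return env['SKILL_SESSION_ID'] || makeSessionId();
}
```

Passing an explicit ID, for example through `--session-id=<id>`, is what lets several CLI runs be grouped under one session.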

-## Prerequisites
+## Prerequisites (前置依赖)

-- [Bun](https://bun.sh) runtime
-- `ffmpeg` and `ffprobe` in PATH
-- `auth-rt` CLI in PATH (required for `search` / `detect-and-search`)
+- [Bun](https://bun.sh) runtime
+- `ffmpeg` / `ffprobe` available on the system PATH (frame extraction)
+- `auth-rt` CLI (auth and API calls; installed automatically by `install.sh`)
130
SKILL.md
@@ -1,94 +1,94 @@
 ---
 name: video-product-snapshot
-description: "Detect ecommerce products in video frames using Claude Vision, extract the best product snapshot, and optionally search via image-search API. Use when the user provides a video and wants to find/identify products shown in it."
+description: "Extract product snapshot from video and search 1688 by image. / Extract the best product frame from a video and find the same product on 1688 by image search. Use when the user provides a video and wants to find a product."
 ---

-# Video Product Snapshot
+# Video Product Snapshot: video product image search

-Extract ecommerce product snapshots from video using Claude Vision, then optionally search for matching products via image-search API.
+Capture the sharpest product frame from a video (for container-type products an empty frame is selected automatically), upload the image, and run a 1688 image search to find the same product.

-## Run
+## Run (运行)

 ```bash
 bun dist/run.js <command> [args] [--dry-run]
 ```

-## Commands
+## Command list (命令列表)

-| Command | Description |
-|---------|-------------|
-| `detect <video-path> [options]` | Extract frames, detect product snapshots |
-| `search <image-path>` | Search products by image via API |
-| `detect-and-search <video-path> [options]` | Detect best snapshot then run image search |
-| `session` | Get auth session token |
+| Command | When to use |
+|------|---------|
+| `detect-best-and-search <video>` | **Recommended.** Extract the best product frame → image search → rerank, then return the results. |
+| `detect-best <video>` | Extract only the best product frame; no image search. |
+| `detect-and-search <video>` | Image search after two-stage filtering (slower than detect-best). |
+| `search <image-path>` | A product image already exists; search by image directly. |
+| `rerank` | Cross-filter image-search results by keyword. |
+| `session` | Get the current auth session token. |
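
-## Options for `detect` / `detect-and-search`
+## Main command: `detect-best-and-search`

+Pipeline:
+1. ffmpeg extracts frames every 0.5 s (at most 60 frames)
+2. The vision model checks whether the product is a container/rack type
+3. Container type: pick the best frame only from the first 40% of frames (the empty stage)
+4. Non-container type: pick the sharpest frame from all frames
+5. Crop the product region
+6. Upload the cropped image → 1688 image search
+7. rerank: cross-filter the image-search results against keyword-search results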
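
Steps 3 and 4 of the pipeline above amount to restricting the candidate set before choosing a frame. The sketch below shows one way to express that; `candidateFrames`, `pickBest`, and the numeric `sharpness` score are hypothetical, since in the repository the final choice is made by a vision model rather than a numeric score.

```typescript
// Hypothetical sketch of steps 3-4; the real selection is delegated to a vision model.
interface ScoredFrame {
  index: number;
  sharpness: number; // assumed stand-in for whatever quality signal is available
}

/** For container products keep only the leading fraction of frames (the empty stage). */
function candidateFrames<T>(frames: T[], isContainer: boolean, emptyFraction = 0.4): T[] {
  if (!isContainer) return frames;
  // Math.round avoids off-by-one results from floating-point products such as 15 * 0.4.
  const count = Math.max(1, Math.round(frames.length * emptyFraction));
  return frames.slice(0, count);
}

/** Pick the sharpest frame among the candidates, or undefined if there are none. */
function pickBest(frames: ScoredFrame[], isContainer: boolean): ScoredFrame | undefined {
  let best: ScoredFrame | undefined;
  for (const frame of candidateFrames(frames, isContainer)) {
    if (best === undefined || frame.sharpness > best.sharpness) best = frame;
  }
  return best;
}
```

Restricting container products to the early frames keeps the search image focused on the rack or box itself rather than on the items later placed in it.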

+## Options for `detect-best` / `detect-best-and-search`

 | Flag | Default | Description |
 |------|---------|-------------|
-| `--interval=<sec>` | `1` | Seconds between sampled frames |
-| `--max-frames=<n>` | `60` | Max frames to analyze |
-| `--output-dir=<dir>` | next to video | Directory to save snapshot images |
-| `--min-confidence=<0-1>` | `0.7` | Minimum detection confidence threshold |
+| `--interval=<sec>` | `0.5` | Frame sampling interval (seconds) |
+| `--max-frames=<n>` | `60` | Maximum number of frames to analyze |
+| `--output-dir=<dir>` | next to video | Directory where snapshots are saved |
|
||||
|
||||
-## Examples
+## Output format (输出格式)
|
||||
|
||||
```bash
|
||||
# Detect product frames in a video
|
||||
bun dist/run.js detect ./product-demo.mp4
|
||||
### `detect-best-and-search`
|
||||
|
||||
# Sample every 5 seconds, higher confidence threshold
|
||||
bun dist/run.js detect ./product-demo.mp4 --interval=5 --min-confidence=0.85
|
||||
|
||||
# Search for products using an existing image
|
||||
bun dist/run.js search ./snapshot.jpg
|
||||
|
||||
# Full pipeline: detect best product frame then search
|
||||
bun dist/run.js detect-and-search ./product-demo.mp4 --interval=3 --max-frames=20
|
||||
```
|
||||
|
||||
## Output
|
||||
|
||||
Returns JSON with:
|
||||
- `productFrames[]`: all detected product frames sorted by confidence (highest first)
|
||||
- `bestSnapshot`: the highest-confidence product frame
|
||||
- `searchBody`: image search API response (for `detect-and-search` and `search`)
|
||||
|
||||
Each `ProductFrame` contains:
|
||||
```json
|
||||
{
|
||||
"frameIndex": 4,
|
||||
"timestampSeconds": 9,
|
||||
"imagePath": "/path/to/snapshot/frame_0004.jpg",
|
||||
"confidence": 0.92,
|
||||
"description": "White sneaker with blue logo, left side view",
|
||||
"boundingHint": "centered"
|
||||
"bestSnapshot": {
|
||||
"frameIndex": 7,
|
||||
"timestampSeconds": 3,
|
||||
"imagePath": "/path/to/frame_0007.jpg",
|
||||
"croppedImagePath": "/path/to/frame_0007_cropped.jpg",
|
||||
"description": "黑色金属床底鞋架 可折叠移动"
|
||||
},
|
||||
"rerank": {
|
||||
"keyword": "床底鞋架",
|
||||
"results": [
|
||||
{ "num_iid": 123, "title": "...", "price": "44.00", "sales": 87, "detail_url": "..." }
|
||||
]
|
||||
}
|
||||
}
|
||||
```

-## Prerequisites
+## Result display format (结果展示格式)

-- `ffmpeg` and `ffprobe` in PATH
-- `VISION_API_KEY` — API key for the vision endpoint
-- `VISION_API_BASE` — (optional) OpenAI-compatible base URL; omit to use OpenAI default
-- `VISION_MODEL` — (optional) model name, default `gpt-4o-mini`
-- `auth-rt` in PATH (for `search` / `detect-and-search` API calls)
+Format `rerank.results` (preferred) or `searchBody.data.items.item` as a markdown table, **at most 5 rows**:

-### Example provider configs
+| # | Product name | Price | Sales | Link |
+|---|----------|------|------|------|
+| 1 | {title} | ¥{promotion_price \|\| price} | {sales ?? —} units | [View]({detail_url}) |
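
The table template above, together with this file's display rules (use the promotional price when present, show `—` when sales are missing or zero, at most 5 rows), can be sketched as a small formatter. `formatResultsTable` and `ResultItem` are hypothetical names; the field names follow the template and the sample output.

```typescript
// Hypothetical formatter for the display rules; not part of the repository.
interface ResultItem {
  title: string;
  price: string;
  promotion_price?: string;
  sales?: number;
  detail_url: string;
}

/** Render up to maxRows search results as a markdown table. */
function formatResultsTable(items: ResultItem[], maxRows = 5): string {
  const header = ['| # | Product name | Price | Sales | Link |', '|---|---|---|---|---|'];
  const rows = items.slice(0, maxRows).map((item, i) => {
    const price = item.promotion_price || item.price;           // promotional price wins
    const sales = item.sales ? `${item.sales} units` : '—';     // missing or zero shows a dash
    const title = item.title.replace(/\|/g, '\\|');             // keep pipes from breaking cells
    return `| ${i + 1} | ${title} | ¥${price} | ${sales} | [View](${item.detail_url}) |`;
  });
  return header.concat(rows).join('\n');
}
```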
|
||||
|
||||
```bash
|
||||
# OpenAI (default)
|
||||
VISION_API_KEY=sk-...
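+- Use `promotion_price` when it is present; otherwise use the original price
+- Show `—` when `sales` is missing or zero
+- Always present results as a markdown table, never as a list

+## Execution rules (执行规则)

+1. **Do not retry.** If a command fails, report the error directly.
+2. **Trust the tool output.** The CLI handles session management and error formatting internally.
+3. **Run slow commands in a sub-agent** (detect-best-and-search / detect-and-search); running them directly will time out.

+### sub-agent execution template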
|
||||
|
||||
# Any OpenAI-compatible endpoint (local Ollama, Together, Groq, etc.)
|
||||
VISION_API_KEY=...
|
||||
VISION_API_BASE=http://localhost:11434/v1
|
||||
VISION_MODEL=llava:13b
|
||||
```
|
||||
|
||||
## Rules — MUST follow
|
||||
|
||||
1. **Execute only, do not reason about internals.** Run the CLI and return the output.
|
||||
2. **No fallback strategies.** Report errors as-is; do NOT try alternative approaches.
|
||||
3. **No retry loops.** If detection or search fails, report the failure.
|
||||
4. **Trust the tool's output.** The CLI handles session management and error formatting internally.
|
||||
sessions_spawn(
|
||||
task: "cd /path/to/skill && Run this command and return the raw JSON output:\n\nbun dist/run.js detect-best-and-search <video-path>\n\nCopy the entire JSON output as your reply.",
|
||||
label: "video-product-snapshot",
|
||||
runTimeoutSeconds: 300,
|
||||
)
|
||||
```
|
||||
|
|
|
|||
|
|
@@ -22,31 +22,43 @@ function loadDotenv(path: string): void {
|
|||
}
|
||||
|
||||
function printUsage(): void {
|
||||
console.error(`Usage:
|
||||
console.error(`用法:
|
||||
bun scripts/run.ts [--api-base=<url>] <command> [args...] [--dry-run]
|
||||
|
||||
Commands:
|
||||
命令:
|
||||
session
|
||||
Get auth session token
|
||||
获取认证 session token
|
||||
|
||||
detect <video-path> [options]
|
||||
Extract frames and detect ecommerce product snapshots
|
||||
Options:
|
||||
--interval=<seconds> Frame sampling interval (default: 1)
|
||||
--max-frames=<n> Max frames to analyze (default: 60)
|
||||
--output-dir=<dir> Where to save snapshots (default: next to video)
|
||||
--min-confidence=<0-1> Minimum detection confidence (default: 0.7)
|
||||
从视频抽帧并检测商品画面
|
||||
选项:
|
||||
--interval=<秒> 抽帧间隔(默认: 1)
|
||||
--max-frames=<数量> 最多分析帧数(默认: 60)
|
||||
--output-dir=<目录> 截图保存目录(默认: 视频所在目录)
|
||||
--min-confidence=<0-1> 最低检测置信度(默认: 0.7)
|
||||
|
||||
search <image-path>
|
||||
Search for products using an image via the ecom image-search API
|
||||
用图片搜索商品(调用 ecom image-search API)
|
||||
|
||||
detect-and-search <video-path> [options]
|
||||
Detect best product snapshot from video then run image search + rerank
|
||||
检测最佳商品画面 → 图片搜索 → 关键词重排序
|
||||
|
||||
detect-best <video-path> [options]
|
||||
从视频抽帧并选择最佳商品画面(更快更稳定)
|
||||
|
||||
detect-best-and-search <video-path> [options]
|
||||
最佳画面 → 图片搜索 → 关键词重排序
|
||||
|
||||
detect-video <video-path>
|
||||
识别商品描述和搜索关键词(当前实现:从视频抽帧选最佳帧)
|
||||
|
||||
detect-video-and-search <video-path>
|
||||
识别商品 → 图片搜索 → 1688 关键词重排序(当前实现:从视频抽帧选最佳帧)
|
||||
|
||||
rerank --image-results=<json> [--description=<text>] [--keyword=<text>] [--top=<n>]
|
||||
Filter image search results using keyword intersection
|
||||
通过关键词交并集过滤搜索结果
|
||||
|
||||
Config: ~/.openclaw/.env (CLIENT_KEY), skill .env (VISION_API_KEY)
|
||||
配置文件: ~/.openclaw/.env (CLIENT_KEY), skill 目录 .env (VISION_API_KEY)
|
||||
`);
|
||||
}
|
||||
|
||||
@@ -69,6 +81,8 @@ async function main(): Promise<void> {
       dryRun = true;
     } else if (arg.startsWith('--api-base=')) {
       process.env.API_BASE = arg.slice('--api-base='.length).trim();
+    } else if (arg.startsWith('--session-id=')) {
+      process.env.SKILL_SESSION_ID = arg.slice('--session-id='.length).trim();
     } else if (arg === '-h' || arg === '--help') {
       printUsage(); process.exit(0);
     } else {
@@ -79,6 +93,7 @@ async function main(): Promise<void> {
   if (positionals.length < 1) { printUsage(); process.exit(1); }

   const command = positionals[0] as Command;
+  const sessionId = process.env.SKILL_SESSION_ID!; // set by auth-cli.ts at module load
   const startMs = Date.now();
   let result: Awaited<ReturnType<typeof run>>;

@@ -86,13 +101,14 @@ async function main(): Promise<void> {
     result = await run(command, positionals.slice(1), dryRun);
   } catch (err) {
     const error = err instanceof Error ? err.message : String(err);
-    console.log(JSON.stringify({ status: 'failed', command, dryRun, error }, null, 2));
-    if (!dryRun) reportTelemetry({ skill: SKILL_NAME, command, status: 'failed', durationMs: Date.now() - startMs, error });
+    console.log(JSON.stringify({ status: 'failed', command, dryRun, sessionId, error }, null, 2));
+    if (!dryRun) reportTelemetry({ skill: SKILL_NAME, command, sessionId, status: 'failed', durationMs: Date.now() - startMs, error });
     process.exit(1);
   }

-  console.log(JSON.stringify(result, null, 2));
-  if (!dryRun) reportTelemetry({ skill: SKILL_NAME, command, status: result.status, durationMs: Date.now() - startMs, error: (result as any).error });
+  const output = { ...result, sessionId } as Record<string, unknown>;
+  console.log(JSON.stringify(output, null, 2));
+  if (!dryRun) reportTelemetry({ skill: SKILL_NAME, command, sessionId, status: result.status, durationMs: Date.now() - startMs, error: (result as any).error });
 }

 main().catch((err) => {
@@ -20,6 +20,18 @@ import * as path from 'path';
 import * as os from 'os';

 const home = process.env.HOME || os.homedir();

+// ── session ID (Langfuse tracing) ──
+// Priority: SKILL_SESSION_ID env > auto-generate
+const SESSION_ID = process.env.SKILL_SESSION_ID || (() => {
+  const ts = new Date();
+  const pad = (n: number) => String(n).padStart(2, '0');
+  const tsPart = `${ts.getFullYear()}${pad(ts.getMonth()+1)}${pad(ts.getDate())}-${pad(ts.getHours())}${pad(ts.getMinutes())}${pad(ts.getSeconds())}`;
+  const rand = Math.random().toString(36).slice(2, 6);
+  return `skill-${tsPart}-${rand}`;
+})();
+process.env.SKILL_SESSION_ID = SESSION_ID;

 const AUTH_RT_BIN = process.env.AUTH_RT_BIN
   || (() => {
     // Check if auth-rt is in PATH
276
src/index.ts
@@ -1,10 +1,10 @@
|
|||
import * as fs from 'fs';
|
||||
import * as path from 'path';
|
||||
import type { Command, DetectOptions, DetectResult, SearchResult, OutputResult, SearchItem } from './types.ts';
|
||||
import type { Command, DetectOptions, DetectResult, SearchResult, OutputResult, SearchItem, DetectVideoResult, DetectVideoAndSearchResult } from './types.ts';
|
||||
import { createSkillClient } from './auth-cli.ts';
|
||||
import { extractFrames } from './frame-extractor.ts';
|
||||
import { detectProductFrames } from './product-detector.ts';
|
||||
import { imageToBase64 } from './frame-extractor.ts';
|
||||
import { detectProductFrames, detectBestFrame } from './product-detector.ts';
|
||||
import { postFilterByImage } from './post-filter.ts';
|
||||
import { generateText } from 'ai';
|
||||
import { createOpenAI } from '@ai-sdk/openai';
|
||||
|
||||
|
|
@@ -12,6 +12,7 @@ export interface VisionConfig {
|
|||
apiKey: string;
|
||||
baseURL?: string;
|
||||
model: string;
|
||||
sessionId?: string;
|
||||
}
|
||||
|
||||
async function loadVisionConfig(client: ReturnType<typeof createSkillClient>): Promise<VisionConfig> {
|
||||
|
|
@@ -22,6 +23,7 @@ async function loadVisionConfig(client: ReturnType<typeof createSkillClient>): P
|
|||
apiKey,
|
||||
baseURL: cfg.metadata?.provider?.base_url,
|
||||
model: process.env.VISION_MODEL ?? cfg.metadata?.provider?.model ?? 'aliyun-cp-multimodal',
|
||||
sessionId: process.env.SKILL_SESSION_ID || `skill_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`,
|
||||
};
|
||||
}
|
||||
|
||||
|
|
@@ -39,6 +41,14 @@ export async function run(
|
|||
return runSearch(args, dryRun);
|
||||
case 'detect-and-search':
|
||||
return runDetectAndSearch(args, dryRun);
|
||||
case 'detect-best':
|
||||
return runDetectBest(args, dryRun);
|
||||
case 'detect-best-and-search':
|
||||
return runDetectBestAndSearch(args, dryRun);
|
||||
case 'detect-video':
|
||||
return runDetectVideo(args, dryRun);
|
||||
case 'detect-video-and-search':
|
||||
return runDetectVideoAndSearch(args, dryRun);
|
||||
case 'rerank':
|
||||
return runRerank(args, dryRun);
|
||||
default:
|
||||
|
|
@@ -125,6 +135,192 @@ async function runSearch(args: string[], dryRun: boolean): Promise<SearchResult>
|
|||
return { status: 'success', command: 'search', dryRun, imagePath, searchHttpStatus, searchBody: body };
|
||||
}
|
||||
|
||||
async function runDetectBest(args: string[], dryRun: boolean): Promise<DetectResult> {
|
||||
const videoPath = args[0];
|
||||
if (!videoPath) return { status: 'failed', command: 'detect-best', dryRun, error: 'detect-best requires <video-path>' };
|
||||
if (!fs.existsSync(videoPath)) return { status: 'failed', command: 'detect-best', dryRun, error: `video not found: ${videoPath}` };
|
||||
|
||||
const outputDir = getFlag(args, '--output-dir') || path.join(
|
||||
path.dirname(videoPath),
|
||||
`snapshots_${path.basename(videoPath, path.extname(videoPath))}_${Date.now()}`,
|
||||
);
|
||||
const intervalSeconds = parseFloat(getFlag(args, '--interval') || '0.5');
|
||||
const maxFrames = parseInt(getFlag(args, '--max-frames') || '60', 10);
|
||||
|
||||
if (dryRun) {
|
||||
return { status: 'success', command: 'detect-best', dryRun, videoPath, totalFramesExtracted: 0, productFrames: [], bestSnapshot: undefined };
|
||||
}
|
||||
|
||||
const client = createSkillClient();
|
||||
const visionConfig = await loadVisionConfig(client);
|
||||
|
||||
const frames = extractFrames(videoPath, outputDir, intervalSeconds, maxFrames);
|
||||
if (frames.length === 0) {
|
||||
return { status: 'failed', command: 'detect-best', dryRun, videoPath, error: 'no frames extracted from video' };
|
||||
}
|
||||
|
||||
const best = await detectBestFrame(frames, visionConfig, 20);
|
||||
|
||||
return {
|
||||
status: 'success',
|
||||
command: 'detect-best',
|
||||
dryRun,
|
||||
videoPath,
|
||||
totalFramesExtracted: frames.length,
|
||||
productFrames: best ? [best] : [],
|
||||
bestSnapshot: best ?? undefined,
|
||||
};
|
||||
}
|
||||
|
||||
async function runDetectBestAndSearch(args: string[], dryRun: boolean): Promise<OutputResult> {
|
||||
const detectResult = await runDetectBest(args, dryRun) as DetectResult;
|
||||
if (detectResult.status === 'failed') return detectResult;
|
||||
|
||||
if (!detectResult.bestSnapshot) {
|
||||
if (dryRun) return { ...detectResult, command: 'detect-best-and-search' };
|
||||
return { ...detectResult, status: 'failed', error: 'no frame could be extracted from video' };
|
||||
}
|
||||
|
||||
const best = detectResult.bestSnapshot;
|
||||
const imageForSearch = best.croppedImagePath || best.imagePath;
|
||||
const searchResult = await runSearch([imageForSearch], dryRun) as SearchResult;
|
||||
|
||||
// Post-filter: drop results whose pic_url isn't the same product type as our snapshot
|
||||
let postFilter: any = undefined;
|
||||
if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
|
||||
const items: SearchItem[] = (searchResult.searchBody as any)?.data?.items?.item ?? [];
|
||||
if (items.length > 0) {
|
||||
try {
|
||||
const client = createSkillClient();
|
||||
const visionConfig = await loadVisionConfig(client);
|
||||
const result = await postFilterByImage(imageForSearch, items, visionConfig, { description: best.description });
|
||||
(searchResult.searchBody as any).data.items.item = result.kept;
|
||||
postFilter = {
|
||||
totalChecked: result.totalChecked,
|
||||
keptCount: result.kept.length,
|
||||
rejectedCount: result.rejected.length,
|
||||
failed: result.failed,
|
||||
};
|
||||
} catch (e: any) {
|
||||
postFilter = { error: e.message };
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let rerankResult: any = undefined;
|
||||
// If post-filter produced focused results, sort them directly by sales — they're already the best matches.
|
||||
// Otherwise fall back to the keyword-intersection rerank.
|
||||
if (!dryRun && postFilter && !postFilter.error && postFilter.keptCount > 0) {
|
||||
const items: SearchItem[] = (searchResult.searchBody as any)?.data?.items?.item ?? [];
|
||||
const sorted = [...items].sort((a, b) => (b.sales ?? 0) - (a.sales ?? 0)).slice(0, 5);
|
||||
rerankResult = {
|
||||
source: 'post-filter',
|
||||
results: sorted,
|
||||
count: sorted.length,
|
||||
};
|
||||
} else if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
|
||||
const tmpFile = path.join(path.dirname(imageForSearch), `search_body_${Date.now()}.json`);
|
||||
try {
|
||||
fs.writeFileSync(tmpFile, JSON.stringify(searchResult.searchBody));
|
||||
rerankResult = await runRerank([
|
||||
`--image-results=${tmpFile}`,
|
||||
`--description=${best.description}`,
|
||||
'--top=5',
|
||||
], dryRun);
|
||||
} catch (e: any) {
|
||||
rerankResult = { error: e.message };
|
||||
} finally {
|
||||
try { fs.unlinkSync(tmpFile); } catch {}
|
||||
}
|
||||
}
|
||||
|
||||
return {
|
||||
...detectResult,
|
||||
command: 'detect-best-and-search',
|
||||
searchHttpStatus: searchResult.searchHttpStatus,
|
||||
searchBody: searchResult.searchBody,
|
||||
searchError: searchResult.error,
|
||||
postFilter,
|
||||
rerank: rerankResult,
|
||||
} as any;
|
||||
}
|
||||
|
||||
async function runDetectVideo(args: string[], dryRun: boolean): Promise<DetectVideoResult> {
|
||||
const videoPath = args[0];
|
||||
if (!videoPath) return { status: 'failed', command: 'detect-video', dryRun, error: 'detect-video requires <video-path>' };
|
||||
if (!fs.existsSync(videoPath)) return { status: 'failed', command: 'detect-video', dryRun, error: `video not found: ${videoPath}` };
|
||||
|
||||
const detectResult = await runDetectBest(args, dryRun) as DetectResult;
|
||||
if (detectResult.status === 'failed') {
|
||||
return { status: 'failed', command: 'detect-video', dryRun, videoPath, error: detectResult.error || 'failed to detect best frame' };
|
||||
}
|
||||
const description = detectResult.bestSnapshot?.description?.trim();
|
||||
const snapshotImagePath = detectResult.bestSnapshot?.croppedImagePath || detectResult.bestSnapshot?.imagePath;
|
||||
if (!description) {
|
||||
return { status: 'failed', command: 'detect-video', dryRun, videoPath, error: 'no product description detected from video' };
|
||||
}
|
||||
|
||||
if (dryRun) {
|
||||
return { status: 'success', command: 'detect-video', dryRun, videoPath, videoUrl: null, description, keyword: '<dry-run-keyword>', snapshotImagePath };
|
||||
}
|
||||
|
||||
const client = createSkillClient();
|
||||
const visionConfig = await loadVisionConfig(client);
|
||||
const keyword = await generateChineseKeyword(description, visionConfig);
|
||||
|
||||
return { status: 'success', command: 'detect-video', dryRun, videoPath, videoUrl: null, description, keyword, snapshotImagePath };
|
||||
}
|
||||
|
||||
async function runDetectVideoAndSearch(args: string[], dryRun: boolean): Promise<DetectVideoAndSearchResult> {
|
||||
const videoPath = args[0];
|
||||
if (!videoPath) return { status: 'failed', command: 'detect-video-and-search', dryRun, error: 'detect-video-and-search requires <video-path>' };
|
||||
if (!fs.existsSync(videoPath)) return { status: 'failed', command: 'detect-video-and-search', dryRun, error: `video not found: ${videoPath}` };
|
||||
|
||||
if (dryRun) {
|
||||
return { status: 'success', command: 'detect-video-and-search', dryRun, videoPath, videoUrl: null, description: '<dry-run>', keyword: '<dry-run>', searchResults: [] };
|
||||
}
|
||||
|
||||
// Reuse existing pipeline: best snapshot → image search → keyword rerank
|
||||
const detectAndSearch = await runDetectBestAndSearch(args, dryRun) as any;
|
||||
if (detectAndSearch.status === 'failed') {
|
||||
return { status: 'failed', command: 'detect-video-and-search', dryRun, videoPath, error: detectAndSearch.error || 'detect-best-and-search failed' };
|
||||
}
|
||||
|
||||
const description = String(detectAndSearch.bestSnapshot?.description || '').trim();
|
||||
const rerank = detectAndSearch.rerank;
|
||||
const keyword = String(rerank?.keyword || '').trim();
|
||||
const searchResults = (rerank?.results || []) as SearchItem[];
|
||||
|
||||
// Fallback: if rerank didn't produce anything, do keyword search directly.
|
||||
if (!searchResults.length) {
|
||||
const client = createSkillClient();
|
||||
const visionConfig = await loadVisionConfig(client);
|
||||
const fallbackKeyword = keyword || (description ? await generateChineseKeyword(description, visionConfig) : '');
|
||||
const items = fallbackKeyword ? await keywordSearch(client, fallbackKeyword, 1) : [];
|
||||
return {
|
||||
status: 'success',
|
||||
command: 'detect-video-and-search',
|
||||
dryRun,
|
||||
videoPath,
|
||||
videoUrl: null,
|
||||
description,
|
||||
keyword: fallbackKeyword,
|
||||
searchResults: items,
|
||||
};
|
||||
}
|
||||
|
||||
return {
|
||||
status: 'success',
|
||||
command: 'detect-video-and-search',
|
||||
dryRun,
|
||||
videoPath,
|
||||
videoUrl: null,
|
||||
description,
|
||||
keyword,
|
||||
searchResults,
|
||||
};
|
||||
}
|
||||
|
||||
async function runDetectAndSearch(args: string[], dryRun: boolean): Promise<OutputResult> {
|
||||
const detectResult = await runDetect(args, dryRun) as DetectResult;
|
||||
if (detectResult.status === 'failed') return detectResult;
|
||||
|
|
@@ -137,15 +333,47 @@ async function runDetectAndSearch(args: string[], dryRun: boolean): Promise<Outp
|
|||
const imageForSearch = best.croppedImagePath || best.imagePath;
|
||||
const searchResult = await runSearch([imageForSearch], dryRun) as SearchResult;
|
||||
|
||||
let rerankResult: any = undefined;
|
||||
// Post-filter: drop results whose pic_url isn't the same product type as our snapshot
|
||||
let postFilter: any = undefined;
|
||||
if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
|
||||
const items: SearchItem[] = (searchResult.searchBody as any)?.data?.items?.item ?? [];
|
||||
if (items.length > 0) {
|
||||
try {
|
||||
const client = createSkillClient();
|
||||
const visionConfig = await loadVisionConfig(client);
|
||||
const result = await postFilterByImage(imageForSearch, items, visionConfig, { description: best.description });
|
||||
(searchResult.searchBody as any).data.items.item = result.kept;
|
||||
postFilter = {
|
||||
totalChecked: result.totalChecked,
|
||||
keptCount: result.kept.length,
|
||||
rejectedCount: result.rejected.length,
|
||||
failed: result.failed,
|
||||
};
|
||||
} catch (e: any) {
|
||||
postFilter = { error: e.message };
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let rerankResult: any = undefined;
|
||||
// If post-filter produced focused results, sort them directly by sales — they're already the best matches.
|
||||
// Otherwise fall back to the keyword-intersection rerank.
|
||||
if (!dryRun && postFilter && !postFilter.error && postFilter.keptCount > 0) {
|
||||
const items: SearchItem[] = (searchResult.searchBody as any)?.data?.items?.item ?? [];
|
||||
const sorted = [...items].sort((a, b) => (b.sales ?? 0) - (a.sales ?? 0)).slice(0, 5);
|
||||
rerankResult = {
|
||||
source: 'post-filter',
|
||||
results: sorted,
|
||||
count: sorted.length,
|
||||
};
|
||||
} else if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
|
||||
const tmpFile = path.join(path.dirname(imageForSearch), `search_body_${Date.now()}.json`);
|
||||
try {
|
||||
fs.writeFileSync(tmpFile, JSON.stringify(searchResult.searchBody));
|
||||
rerankResult = await runRerank([
|
||||
`--image-results=${tmpFile}`,
|
||||
`--description=${best.description}`,
|
||||
'--top=10',
|
||||
'--top=5',
|
||||
], dryRun);
|
||||
} catch (e: any) {
|
||||
rerankResult = { error: e.message };
|
||||
|
|
@@ -160,6 +388,7 @@ async function runDetectAndSearch(args: string[], dryRun: boolean): Promise<Outp
|
|||
searchHttpStatus: searchResult.searchHttpStatus,
|
||||
searchBody: searchResult.searchBody,
|
||||
searchError: searchResult.error,
|
||||
postFilter,
|
||||
rerank: rerankResult,
|
||||
} as any;
|
||||
}
|
||||
|
|
@ -188,7 +417,25 @@ function getFlag(args: string[], flag: string): string | undefined {
|
|||
}
|
||||
|
||||
function createVisionModel(config: VisionConfig) {
|
||||
const openai = createOpenAI({ apiKey: config.apiKey, baseURL: config.baseURL });
|
||||
const sessionId = config.sessionId || '';
|
||||
const originFetch = globalThis.fetch;
|
||||
// Inject metadata.session_id into request body so LiteLLM → Langfuse creates sessions
|
||||
const wrapped = async (input: RequestInfo | URL, init?: RequestInit) => {
|
||||
if (init?.body && typeof init.body === 'string') {
|
||||
try {
|
||||
const body = JSON.parse(init.body);
|
||||
if (!body.metadata) body.metadata = {};
|
||||
if (!body.metadata.session_id) body.metadata.session_id = sessionId;
|
||||
body.metadata.tags = ['skill:video-product-snapshot'];
|
||||
init = { ...init, body: JSON.stringify(body) };
|
||||
} catch {}
|
||||
}
|
||||
return originFetch(input, init);
|
||||
};
|
||||
const openai = createOpenAI({
|
||||
apiKey: config.apiKey, baseURL: config.baseURL,
|
||||
fetch: wrapped as typeof globalThis.fetch,
|
||||
});
|
||||
return openai(config.model);
|
||||
}
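The wrapped `fetch` above rewrites the JSON request body before forwarding it. The body transformation on its own can be sketched as a pure function; `injectSessionMetadata` is a hypothetical name, and the sketch assumes the upstream gateway reads `metadata.session_id` and `metadata.tags` as the comment in the diff suggests:

```typescript
// Hypothetical helper: returns the request body with session metadata merged in.
// Bodies that are not JSON objects are returned unchanged, mirroring the
// empty catch in the wrapper.
function injectSessionMetadata(rawBody: string, sessionId: string): string {
  try {
    const body = JSON.parse(rawBody);
    if (!body.metadata) body.metadata = {};
    // An existing session_id wins; only fill it in when absent.
    if (!body.metadata.session_id) body.metadata.session_id = sessionId;
    body.metadata.tags = ['skill:video-product-snapshot'];
    return JSON.stringify(body);
  } catch {
    return rawBody;
  }
}

const tagged = JSON.parse(injectSessionMetadata('{"model":"m"}', 'sess-1'));
console.log(tagged.metadata.session_id); // "sess-1"
```

Keeping the transformation separate from the network call makes it easy to check that a caller-supplied `session_id` is never overwritten.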
|
||||
|
||||
|
|
@ -198,10 +445,12 @@ async function generateChineseKeyword(description: string, visionConfig: VisionC
|
|||
model,
|
||||
prompt: `You are generating a 1688.com (Chinese B2B wholesale) product search keyword.
|
||||
Rules:
|
||||
- Output ONLY 2-4 Chinese words — the product category + 1-2 key material/feature words
|
||||
- Output ONLY 2-4 Chinese words — the product OBJECT TYPE + 1-2 key material/feature words
|
||||
- CRITICAL: If the product is a container, organizer, rack, shelf, bag, box, or holder, the keyword MUST name THAT object — NOT the items it holds.
|
||||
Examples: shoe rack → "金属鞋架", cable organizer → "理线器", storage shelf → "收纳架", toolbox → "工具箱"
|
||||
- Use common Chinese commerce terms, NOT a literal translation
|
||||
- No English, no punctuation, no explanation
|
||||
- Short broad terms work better than long specific phrases (e.g. "金属鞋架" not "黑色Z型金属网格鞋架")
|
||||
- Short broad terms work better than long specific phrases
|
||||
|
||||
Product description: ${description}
|
||||
|
||||
|
|
@ -241,9 +490,10 @@ function extractKeywordsFromTitles(items: SearchItem[], topN = 5): string {
|
|||
|
||||
async function runRerank(args: string[], dryRun: boolean): Promise<OutputResult> {
|
||||
// --image-results=<path> --keyword=<text> --description=<text> --top=<n>
|
||||
const imageResultsArg = getFlag(args, '--image-results') || args[0];
|
||||
const keywordArg = getFlag(args, '--keyword') || args[1];
|
||||
const topN = parseInt(getFlag(args, '--top') || '10', 10);
|
||||
const positionals = args.filter((a) => !a.startsWith('--'));
|
||||
const imageResultsArg = getFlag(args, '--image-results') || positionals[0];
|
||||
const keywordArg = getFlag(args, '--keyword') || positionals[1];
|
||||
const topN = parseInt(getFlag(args, '--top') || '5', 10);
|
||||
|
||||
const description = getFlag(args, '--description') || '';
|
||||
|
||||
|
|
@ -318,7 +568,3 @@ async function runRerank(args: string[], dryRun: boolean): Promise<OutputResult>
|
|||
results: sorted,
|
||||
} as any;
|
||||
}
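The change to `runRerank` filters out `--` flags before falling back to positional arguments, so a flag can no longer be mistaken for the results path. A rough sketch of the idea follows; `readFlag` is a hypothetical stand-in for the project's `getFlag` (whose implementation is not shown here) and is assumed to handle only the `--name=value` form:

```typescript
// Hypothetical stand-in for getFlag: reads "--name=value" style flags only.
function readFlag(args: string[], flag: string): string | undefined {
  const prefix = `${flag}=`;
  const hit = args.find((a) => a.startsWith(prefix));
  return hit === undefined ? undefined : hit.slice(prefix.length);
}

function parseRerankArgs(args: string[]) {
  // Positionals are whatever does not look like a flag.
  const positionals = args.filter((a) => !a.startsWith('--'));
  return {
    imageResults: readFlag(args, '--image-results') || positionals[0],
    keyword: readFlag(args, '--keyword') || positionals[1],
    topN: parseInt(readFlag(args, '--top') || '5', 10),
  };
}

const parsed = parseRerankArgs(['--top=3', 'results.json', '鞋架']);
console.log(parsed); // { imageResults: 'results.json', keyword: '鞋架', topN: 3 }
```

With a plain `args[0]` fallback, the same input would have treated `--top=3` as the results path.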
|
||||
|
||||
function parseJsonSafe(text: string): unknown {
|
||||
try { return JSON.parse(text); } catch { return text; }
|
||||
}
|
||||
|
|
|
|||
|
|
@ -0,0 +1,123 @@
|
|||
import { generateText } from 'ai';
|
||||
import { createOpenAI } from '@ai-sdk/openai';
|
||||
import type { SearchItem } from './types.ts';
|
||||
import type { VisionConfig } from './index.ts';
|
||||
import { imageToBase64 } from './frame-extractor.ts';
|
||||
|
||||
export interface PostFilterResult {
|
||||
kept: SearchItem[];
|
||||
rejected: SearchItem[];
|
||||
totalChecked: number;
|
||||
failed: boolean;
|
||||
}
|
||||
|
||||
const FILTER_PROMPT = (count: number, description?: string) => {
|
||||
const productLine = description
|
||||
? `查询商品是:${description}`
|
||||
: '第1张图是查询商品。';
|
||||
return `${productLine}
|
||||
后面的 ${count} 张图是搜索结果。
|
||||
|
||||
任务:判断每张候选图是否与查询商品是**完全相同的具体产品类型**。
|
||||
- 必须是同一个具体产品(例如:查询是"鞋架",候选必须也是鞋架;不是其他类型的架子如纸巾架、首饰架、收纳盒)
|
||||
- 颜色、材质、款式、尺寸不同但同一具体类型 → 算同类
|
||||
- 用途不同就不算同类(例如:查询是鞋架 vs 候选是纸巾架 → 不算;查询是鞋架 vs 候选是床下收纳箱 → 不算,除非明确是鞋类收纳)
|
||||
- 关键判断:候选商品的主要用途是否与查询商品一致
|
||||
|
||||
按候选图顺序输出每一张的判断,每行一个,格式严格遵守:
|
||||
1: YES
|
||||
2: NO
|
||||
3: YES
|
||||
...
|
||||
|
||||
只输出 ${count} 行结果,不要解释,不要前后空行。`;
|
||||
};
|
||||
|
||||
function createModel(config: VisionConfig) {
|
||||
const sessionId = config.sessionId || '';
|
||||
const originFetch = globalThis.fetch;
|
||||
const wrapped = async (input: RequestInfo | URL, init?: RequestInit) => {
|
||||
if (init?.body && typeof init.body === 'string') {
|
||||
try {
|
||||
const body = JSON.parse(init.body);
|
||||
if (!body.metadata) body.metadata = {};
|
||||
if (!body.metadata.session_id) body.metadata.session_id = sessionId;
|
||||
body.metadata.tags = ['skill:video-product-snapshot'];
|
||||
init = { ...init, body: JSON.stringify(body) };
|
||||
} catch {}
|
||||
}
|
||||
return originFetch(input, init);
|
||||
};
|
||||
const provider = createOpenAI({
|
||||
apiKey: config.apiKey, baseURL: config.baseURL,
|
||||
fetch: wrapped as typeof globalThis.fetch,
|
||||
});
|
||||
return provider(config.model);
|
||||
}
|
||||
|
||||
async function classifyBatch(
|
||||
model: ReturnType<ReturnType<typeof createOpenAI>>,
|
||||
queryImageDataUrl: string,
|
||||
batch: SearchItem[],
|
||||
description?: string,
|
||||
): Promise<boolean[]> {
|
||||
const content: any[] = [{ type: 'image', image: queryImageDataUrl }];
|
||||
for (const item of batch) {
|
||||
content.push({ type: 'image', image: item.pic_url });
|
||||
}
|
||||
content.push({ type: 'text', text: FILTER_PROMPT(batch.length, description) });
|
||||
|
||||
const { text } = await generateText({
|
||||
model,
|
||||
messages: [{ role: 'user', content }],
|
||||
maxTokens: 200,
|
||||
});
|
||||
|
||||
const flags = batch.map(() => false);
|
||||
for (const line of text.split('\n')) {
|
||||
const m = line.match(/^\s*(\d+)\s*[::]\s*(YES|NO|是|否)/i);
|
||||
if (!m) continue;
|
||||
const idx = parseInt(m[1], 10) - 1;
|
||||
const yes = /YES|是/i.test(m[2]);
|
||||
if (idx >= 0 && idx < flags.length) flags[idx] = yes;
|
||||
}
|
||||
return flags;
|
||||
}
|
||||
|
||||
export async function postFilterByImage(
|
||||
queryImagePath: string,
|
||||
items: SearchItem[],
|
||||
visionConfig: VisionConfig,
|
||||
options: { description?: string; batchSize?: number } = {},
|
||||
): Promise<PostFilterResult> {
|
||||
if (items.length === 0) {
|
||||
return { kept: [], rejected: [], totalChecked: 0, failed: false };
|
||||
}
|
||||
|
||||
const batchSize = options.batchSize ?? 10;
|
||||
const description = options.description;
|
||||
|
||||
const model = createModel(visionConfig);
|
||||
const queryDataUrl = `data:image/jpeg;base64,${imageToBase64(queryImagePath)}`;
|
||||
|
||||
const kept: SearchItem[] = [];
|
||||
const rejected: SearchItem[] = [];
|
||||
let anyFailed = false;
|
||||
|
||||
for (let i = 0; i < items.length; i += batchSize) {
|
||||
const batch = items.slice(i, i + batchSize);
|
||||
try {
|
||||
const flags = await classifyBatch(model, queryDataUrl, batch, description);
|
||||
batch.forEach((item, idx) => {
|
||||
if (flags[idx]) kept.push(item);
|
||||
else rejected.push(item);
|
||||
});
|
||||
} catch {
|
||||
// On batch failure, keep items (don't lose them) but flag the run as partial
|
||||
anyFailed = true;
|
||||
kept.push(...batch);
|
||||
}
|
||||
}
|
||||
|
||||
return { kept, rejected, totalChecked: items.length, failed: anyFailed };
|
||||
}
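The batch classifier relies on the model answering one `N: YES` or `N: NO` line per candidate and defaults every candidate to rejected. A minimal sketch of that parsing step in isolation, with the hypothetical name `parseVerdicts`; accepting a full-width colon alongside the ASCII one is an assumption about model output in Chinese:

```typescript
// Parse "N: YES/NO" lines into one boolean per candidate.
// Candidates without a parsable line stay false (rejected).
function parseVerdicts(text: string, count: number): boolean[] {
  const flags: boolean[] = new Array(count).fill(false);
  for (const line of text.split('\n')) {
    const m = line.match(/^\s*(\d+)\s*[::]\s*(YES|NO|是|否)/i);
    if (!m) continue;
    const idx = parseInt(m[1], 10) - 1; // model output is 1-based
    if (idx >= 0 && idx < count) flags[idx] = /YES|是/i.test(m[2]);
  }
  return flags;
}

const verdicts = parseVerdicts('1: YES\n2:NO\n3: 是\n7: YES\nnoise', 4);
console.log(verdicts); // [true, false, true, false]
```

Out-of-range indices and unparsable lines are ignored, so a truncated or chatty reply drops candidates instead of keeping unverified ones. A thrown error is handled differently: the batch is kept and the run is flagged as partial.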
|
||||
|
|
@ -1,4 +1,4 @@
|
|||
import { generateObject } from 'ai';
|
||||
import { generateObject, generateText } from 'ai';
|
||||
import { createOpenAI } from '@ai-sdk/openai';
|
||||
import { z } from 'zod';
|
||||
import type { ExtractedFrame } from './frame-extractor.ts';
|
||||
|
|
@ -28,24 +28,74 @@ Discard (keep=false) if: only hands/texture/contents visible, motion blur, black
|
|||
|
||||
reason options: product_visible | content_only | hands_only | blur | transition | background_only`;
|
||||
|
||||
const RANKING_PROMPT = (count: number) => `You are selecting the single best product image from ${count} video frames for ecommerce image search.
|
||||
const CONTAINER_CHECK_PROMPT = `Is the main product in this image a CONTAINER, RACK, or HOLDER (something designed to store/hold other items)?
|
||||
Examples YES: shoe rack, shelf, storage box, organizer, basket, drawer, wardrobe, trolley, bin, tray, cabinet.
|
||||
Examples NO: shoes, clothing, electronics, food, toys, cosmetics, tools.
|
||||
Reply with only one word: YES or NO.`;
|
||||
|
||||
The frames are numbered 0 to ${count - 1} in the order shown.
|
||||
const RANKING_PROMPT_CONTAINER = (count: number) => `You are selecting ONE frame from ${count} video frames to use as the query image for an ecommerce reverse-image search.
|
||||
|
||||
Pick the ONE frame where the HERO PRODUCT is:
|
||||
1. Cleanest — fewest distractions, no hands blocking it, no clutter in foreground
|
||||
2. Most complete — full product silhouette visible, no edges cropped
|
||||
3. Most isolated — product stands out from background clearly
|
||||
4. Empty/minimal load preferred — a product without contents (e.g. an empty rack) beats one stuffed with items if both show the full structure equally
|
||||
The hero product is a CONTAINER / RACK / HOLDER / ORGANIZER.
|
||||
|
||||
CRITICAL CONSTRAINT — read this first:
|
||||
Image search engines identify objects by visual appearance. If the container holds items (shoes, clothes, etc.), the search engine will match those ITEMS, not the container — returning completely wrong products.
|
||||
|
||||
YOUR ONLY JOB: find the frame where the container structure itself is most visible with the FEWEST or NO items inside.
|
||||
|
||||
ABSOLUTE PRIORITY ORDER (do not deviate):
|
||||
1. Frame with container completely EMPTY — highest priority regardless of angle or assembly state
|
||||
2. Frame with container partially assembled or partially visible but EMPTY — still better than any loaded frame
|
||||
3. Frame with fewest items inside (1-2 items, mostly empty)
|
||||
4. Frame with moderate load — only if no emptier option exists
|
||||
5. Frame fully loaded — last resort only if no other frames exist
|
||||
|
||||
A frame showing the rack mid-assembly with zero items is ALWAYS better than a perfectly-lit fully-assembled rack filled with shoes.
|
||||
|
||||
Frames are numbered 0 to ${count - 1} in order shown. You MUST pick ONE.
|
||||
|
||||
Return:
|
||||
- bestFrameIndex: 0-based index of chosen frame
|
||||
- description: concise search query under 12 words (product type + material + color + key feature)
|
||||
- reasoning: one sentence explaining why this frame was chosen
|
||||
- boundingBox: tight bounding box of the HERO PRODUCT ONLY in the chosen frame as [x1, y1, x2, y2] normalized 0.0–1.0 (top-left origin). Exclude hands, background, and unrelated objects. The product is assumed to be near the center.`;
|
||||
- bestFrameIndex: 0-based index of the emptiest container frame
|
||||
- description: concise Chinese search query ≤12 words (container type + material + color + key feature)
|
||||
- reasoning: describe how many items are visible inside the chosen frame and why it's the emptiest option
|
||||
- boundingBox: tight box of the PRODUCT STRUCTURE ONLY as [x1, y1, x2, y2] normalized 0.0–1.0. Exclude any items stored inside.`;
|
||||
|
||||
const RANKING_PROMPT_GENERAL = (count: number) => `You are selecting the single best product frame from ${count} video frames for ecommerce search.
|
||||
|
||||
Frames are numbered 0 to ${count - 1} in order shown.
|
||||
|
||||
IMPORTANT: You MUST pick ONE frame — even if product visibility is imperfect or no frame looks ideal. Always make your best guess.
|
||||
|
||||
Pick the frame where the MAIN SELLING PRODUCT is:
|
||||
1. Most recognizable — clearest view of the item being sold
|
||||
2. Most complete — full product silhouette visible, not cropped at edges
|
||||
3. Cleanest — minimal obstruction (hands, clutter, motion blur, labels)
|
||||
4. Best lit and in focus
|
||||
|
||||
Return:
|
||||
- bestFrameIndex: 0-based index
|
||||
- description: concise search query under 12 words (product type + material + color + key features), in Chinese
|
||||
- reasoning: one sentence explaining choice
|
||||
- boundingBox: tight box of the PRODUCT ONLY as [x1, y1, x2, y2] normalized 0.0–1.0, top-left origin. Exclude hands, background, and unrelated objects. The product is near the center of the frame.`;
|
||||
|
||||
function createVisionModel(config: VisionConfig) {
|
||||
const provider = createOpenAI({ apiKey: config.apiKey, baseURL: config.baseURL });
|
||||
const sessionId = config.sessionId || '';
|
||||
const originFetch = globalThis.fetch;
|
||||
const wrapped = async (input: RequestInfo | URL, init?: RequestInit) => {
|
||||
if (init?.body && typeof init.body === 'string') {
|
||||
try {
|
||||
const body = JSON.parse(init.body);
|
||||
if (!body.metadata) body.metadata = {};
|
||||
if (!body.metadata.session_id) body.metadata.session_id = sessionId;
|
||||
body.metadata.tags = ['skill:video-product-snapshot'];
|
||||
init = { ...init, body: JSON.stringify(body) };
|
||||
} catch {}
|
||||
}
|
||||
return originFetch(input, init);
|
||||
};
|
||||
const provider = createOpenAI({
|
||||
apiKey: config.apiKey, baseURL: config.baseURL,
|
||||
fetch: wrapped as typeof globalThis.fetch,
|
||||
});
|
||||
return provider(config.model);
|
||||
}
|
||||
|
||||
|
|
@ -70,15 +120,52 @@ async function filterFrame(
|
|||
return object.keep;
|
||||
}
|
||||
|
||||
|
||||
async function isContainerProduct(
|
||||
firstFrame: ExtractedFrame,
|
||||
model: ReturnType<ReturnType<typeof createOpenAI>>,
|
||||
): Promise<boolean> {
|
||||
try {
|
||||
const { text } = await generateText({
|
||||
model,
|
||||
messages: [{
|
||||
role: 'user',
|
||||
content: [
|
||||
{ type: 'image', image: `data:image/jpeg;base64,${imageToBase64(firstFrame.imagePath)}` },
|
||||
{ type: 'text', text: CONTAINER_CHECK_PROMPT },
|
||||
],
|
||||
}],
|
||||
maxTokens: 5,
|
||||
});
|
||||
return text.trim().toUpperCase().startsWith('Y');
|
||||
} catch {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
function takeEarliestFrames(candidates: ExtractedFrame[], fraction: number = 0.4): ExtractedFrame[] {
|
||||
// Heuristic: ecommerce videos often show the container empty (unboxing/assembly) early and loaded later.
|
||||
// Keeping the earliest `fraction` of frames (40% by default) biases ranking toward empty states.
|
||||
const sorted = [...candidates].sort((a, b) => a.frameIndex - b.frameIndex);
|
||||
const cutoff = Math.max(1, Math.ceil(sorted.length * fraction));
|
||||
return sorted.slice(0, cutoff);
|
||||
}
|
||||
|
||||
async function rankCandidates(
|
||||
candidates: ExtractedFrame[],
|
||||
model: ReturnType<ReturnType<typeof createOpenAI>>,
|
||||
isContainer: boolean,
|
||||
): Promise<{ bestFrame: ExtractedFrame; description: string; reasoning: string; boundingBox: [number, number, number, number] }> {
|
||||
const imageContent = candidates.map((f) => ({
|
||||
type: 'image' as const,
|
||||
image: `data:image/jpeg;base64,${imageToBase64(f.imagePath)}`,
|
||||
}));
|
||||
|
||||
const prompt = isContainer
|
||||
? RANKING_PROMPT_CONTAINER(candidates.length)
|
||||
: RANKING_PROMPT_GENERAL(candidates.length);
|
||||
|
||||
const { object } = await generateObject({
|
||||
model,
|
||||
schema: RankingSchema,
|
||||
|
|
@ -87,7 +174,7 @@ async function rankCandidates(
|
|||
role: 'user',
|
||||
content: [
|
||||
...imageContent,
|
||||
{ type: 'text', text: RANKING_PROMPT(candidates.length) },
|
||||
{ type: 'text', text: prompt },
|
||||
],
|
||||
}],
|
||||
});
|
||||
|
|
@ -114,7 +201,17 @@ export async function cropProduct(
|
|||
|
||||
let [x1, y1, x2, y2] = boundingBox;
|
||||
|
||||
// add padding
|
||||
// Normalize coords: ensure x1<x2 and y1<y2
|
||||
if (x1 > x2) [x1, x2] = [x2, x1];
|
||||
if (y1 > y2) [y1, y2] = [y2, y1];
|
||||
|
||||
// Clamp to [0, 1]
|
||||
x1 = Math.max(0, Math.min(1, x1));
|
||||
y1 = Math.max(0, Math.min(1, y1));
|
||||
x2 = Math.max(0, Math.min(1, x2));
|
||||
y2 = Math.max(0, Math.min(1, y2));
|
||||
|
||||
// Add padding
|
||||
const pw = (x2 - x1) * paddingFactor;
|
||||
const ph = (y2 - y1) * paddingFactor;
|
||||
x1 = Math.max(0, x1 - pw);
|
||||
|
|
@ -122,6 +219,11 @@ export async function cropProduct(
|
|||
x2 = Math.min(1, x2 + pw);
|
||||
y2 = Math.min(1, y2 + ph);
|
||||
|
||||
// Validate minimum area
|
||||
if (x2 - x1 < 0.005 || y2 - y1 < 0.005) {
|
||||
throw new Error('bounding box too small after normalization');
|
||||
}
|
||||
|
||||
const left = Math.round(x1 * W);
|
||||
const top = Math.round(y1 * H);
|
||||
const width = Math.round((x2 - x1) * W);
|
||||
|
|
@ -135,40 +237,203 @@ export async function cropProduct(
|
|||
return outputPath;
|
||||
}
|
||||
|
||||
async function withConcurrency<T>(
|
||||
tasks: (() => Promise<T>)[],
|
||||
limit: number,
|
||||
): Promise<T[]> {
|
||||
const results: T[] = new Array(tasks.length);
|
||||
let next = 0;
|
||||
async function worker() {
|
||||
while (next < tasks.length) {
|
||||
const i = next++;
|
||||
results[i] = await tasks[i]();
|
||||
}
|
||||
}
|
||||
await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
|
||||
return results;
|
||||
}
|
||||
|
||||
// ── Frame quality pre-filtering ──────────────────────────────────────
|
||||
|
||||
interface FrameQuality {
|
||||
valid: boolean;
|
||||
meanBrightness: number;
|
||||
variance: number;
|
||||
}
|
||||
|
||||
async function assessFrameQuality(imagePath: string): Promise<FrameQuality> {
|
||||
const sharp = (await import('sharp')).default;
|
||||
const { data, info } = await sharp(imagePath)
|
||||
.grayscale()
|
||||
.raw()
|
||||
.toBuffer({ resolveWithObject: true });
|
||||
|
||||
const pixels = new Uint8Array(data);
|
||||
let sum = 0;
|
||||
let sumSq = 0;
|
||||
for (let i = 0; i < pixels.length; i++) {
|
||||
sum += pixels[i];
|
||||
sumSq += pixels[i] * pixels[i];
|
||||
}
|
||||
const mean = sum / pixels.length;
|
||||
const variance = sumSq / pixels.length - mean * mean;
|
||||
|
||||
// Skip near-black, near-white, or very low variance (blurry/blank/transition)
|
||||
const valid = mean > 15 && mean < 240 && variance > 50;
|
||||
return { valid, meanBrightness: mean, variance };
|
||||
}
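The quality check above reduces each frame to a mean brightness and an intensity variance, computed in one pass as E[x²] − (E[x])². A self-contained sketch of the same arithmetic on a raw grayscale buffer, without `sharp`; `frameStats` is a hypothetical name and the thresholds are copied from the diff:

```typescript
interface FrameStats {
  valid: boolean;
  mean: number;
  variance: number;
}

// Single-pass mean and variance over 8-bit grayscale pixels.
function frameStats(pixels: Uint8Array): FrameStats {
  let sum = 0;
  let sumSq = 0;
  for (let i = 0; i < pixels.length; i++) {
    sum += pixels[i];
    sumSq += pixels[i] * pixels[i];
  }
  const mean = sum / pixels.length;
  const variance = sumSq / pixels.length - mean * mean;
  // Reject near-black, near-white, and nearly uniform frames.
  return { valid: mean > 15 && mean < 240 && variance > 50, mean, variance };
}

const black = frameStats(new Uint8Array(16).fill(2));
const flatGray = frameStats(new Uint8Array(16).fill(128));
const checker = frameStats(new Uint8Array(16).map((_, i) => (i % 2 ? 200 : 50)));
console.log(black.valid, flatGray.valid, checker.valid); // false false true
```

Global variance is only a coarse proxy: it catches blank and transition frames well, but a blurred frame with strong contrast can still pass.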
|
||||
|
||||
async function filterQualityFrames(frames: ExtractedFrame[]): Promise<ExtractedFrame[]> {
|
||||
const results = await Promise.all(
|
||||
frames.map(async (frame) => {
|
||||
try {
|
||||
const q = await assessFrameQuality(frame.imagePath);
|
||||
return { frame, valid: q.valid };
|
||||
} catch {
|
||||
return { frame, valid: true };
|
||||
}
|
||||
}),
|
||||
);
|
||||
const valid = results.filter(r => r.valid).map(r => r.frame);
|
||||
return valid.length > 0 ? valid : frames;
|
||||
}
|
||||
|
||||
function isValidBoundingBox(bbox: [number, number, number, number]): boolean {
|
||||
const [x1, y1, x2, y2] = bbox;
|
||||
return (
|
||||
x1 >= 0 && x1 <= 1 &&
|
||||
y1 >= 0 && y1 <= 1 &&
|
||||
x2 >= 0 && x2 <= 1 &&
|
||||
y2 >= 0 && y2 <= 1 &&
|
||||
x1 < x2 &&
|
||||
y1 < y2 &&
|
||||
(x2 - x1) * (y2 - y1) > 0.005
|
||||
);
|
||||
}
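The crop path normalizes the model's bounding box (reorder corners, clamp to [0, 1], pad) before converting it to pixels. A sketch of those steps as pure functions; `normalizeBox` and `toPixelRect` are hypothetical names, and the default `paddingFactor` of 0.05 is an assumption because the real default is not visible in this diff:

```typescript
type Box = [number, number, number, number];

const clamp01 = (v: number): number => Math.max(0, Math.min(1, v));

// Reorder corners, clamp to [0, 1], then pad each side by a fraction of the box size.
function normalizeBox([x1, y1, x2, y2]: Box, paddingFactor = 0.05): Box {
  if (x1 > x2) [x1, x2] = [x2, x1];
  if (y1 > y2) [y1, y2] = [y2, y1];
  x1 = clamp01(x1);
  y1 = clamp01(y1);
  x2 = clamp01(x2);
  y2 = clamp01(y2);
  const pw = (x2 - x1) * paddingFactor;
  const ph = (y2 - y1) * paddingFactor;
  return [Math.max(0, x1 - pw), Math.max(0, y1 - ph), Math.min(1, x2 + pw), Math.min(1, y2 + ph)];
}

// Convert a normalized box to a pixel rectangle for an image of size W x H.
function toPixelRect([x1, y1, x2, y2]: Box, W: number, H: number) {
  return {
    left: Math.round(x1 * W),
    top: Math.round(y1 * H),
    width: Math.round((x2 - x1) * W),
    height: Math.round((y2 - y1) * H),
  };
}

// Swapped x corners and an out-of-range y2, as a model might return.
const box = normalizeBox([0.8, 0.2, 0.4, 1.3], 0.1);
const rect = toPixelRect(box, 1000, 500);
console.log(rect);
```

Clamping before padding keeps the padding proportional to the visible part of the box, not to the out-of-range coordinates.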
|
||||
|
||||
// Skips the Pass 1 vision filter: ranks the quality-filtered frames and returns the best one (null only when no frames are given).
|
||||
// Evenly samples down to maxCandidates when there are too many frames.
|
||||
export async function detectBestFrame(
|
||||
frames: ExtractedFrame[],
|
||||
visionConfig: VisionConfig,
|
||||
maxCandidates: number = 20,
|
||||
): Promise<ProductFrame | null> {
|
||||
if (frames.length === 0) return null;
|
||||
|
||||
// 1. Filter out obviously bad frames (black, white, blurry)
|
||||
let candidates = await filterQualityFrames(frames);
|
||||
|
||||
// 2. Sample if too many
|
||||
if (candidates.length > maxCandidates) {
|
||||
const step = candidates.length / maxCandidates;
|
||||
candidates = Array.from({ length: maxCandidates }, (_, i) => candidates[Math.floor(i * step)]);
|
||||
}
|
||||
|
||||
const model = createVisionModel(visionConfig);
|
||||
|
||||
// 3. Check if product is a container/rack type (use first candidate frame)
|
||||
const container = await isContainerProduct(candidates[0], model);
|
||||
|
||||
// 4. For containers: restrict ranking to earliest frames (empty/unboxing phase)
|
||||
if (container) {
|
||||
const early = takeEarliestFrames(candidates);
|
||||
if (early.length > 0) candidates = early;
|
||||
}
|
||||
|
||||
// 5. Try Vision ranking with error isolation
|
||||
try {
|
||||
const { bestFrame, description, reasoning, boundingBox } = await rankCandidates(candidates, model, container);
|
||||
|
||||
if (isValidBoundingBox(boundingBox)) {
|
||||
let croppedPath: string | undefined = bestFrame.imagePath.replace(/\.jpg$/, '_cropped.jpg');
|
||||
try {
|
||||
await cropProduct(bestFrame.imagePath, boundingBox, croppedPath);
|
||||
} catch {
|
||||
// cropping is optional: on failure, drop the cropped path and keep the original frame
|
||||
croppedPath = undefined;
|
||||
}
|
||||
return {
|
||||
frameIndex: bestFrame.frameIndex,
|
||||
timestampSeconds: bestFrame.timestampSeconds,
|
||||
imagePath: bestFrame.imagePath,
|
||||
...(croppedPath ? { croppedImagePath: croppedPath } : {}),
|
||||
confidence: 0.95,
|
||||
description,
|
||||
boundingHint: reasoning,
|
||||
};
|
||||
}
|
||||
} catch {
|
||||
// Vision ranking failed — fall through to fallback
|
||||
}
|
||||
|
||||
// 6. Fallback: rank by frame quality (variance) and return the highest-variance frame
|
||||
const withQuality = await Promise.all(
|
||||
candidates.map(async (f) => {
|
||||
try {
|
||||
const q = await assessFrameQuality(f.imagePath);
|
||||
return { frame: f, score: q.variance };
|
||||
} catch {
|
||||
return { frame: f, score: 0 };
|
||||
}
|
||||
}),
|
||||
);
|
||||
withQuality.sort((a, b) => b.score - a.score);
|
||||
const best = withQuality[0].frame;
|
||||
|
||||
return {
|
||||
frameIndex: best.frameIndex,
|
||||
timestampSeconds: best.timestampSeconds,
|
||||
imagePath: best.imagePath,
|
||||
confidence: 0.5,
|
||||
description: 'product frame (auto-selected)',
|
||||
boundingHint: 'picked by frame quality analysis (Vision ranking failed)',
|
||||
};
|
||||
}
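`detectBestFrame` narrows the candidate set twice: an even down-sample to `maxCandidates`, then (for containers) a cut to the earliest fraction of frames. A small sketch of both selections on synthetic frames; `sampleEvenly` and `takeEarliest` are hypothetical names that mirror the logic in the diff:

```typescript
// Pick `max` items spread evenly across the list (always includes the first item).
function sampleEvenly<T>(items: T[], max: number): T[] {
  if (items.length <= max) return items;
  const step = items.length / max;
  return Array.from({ length: max }, (_, i) => items[Math.floor(i * step)]);
}

// Keep the earliest `fraction` of frames by frameIndex, at least one.
function takeEarliest<T extends { frameIndex: number }>(items: T[], fraction = 0.4): T[] {
  const sorted = [...items].sort((a, b) => a.frameIndex - b.frameIndex);
  return sorted.slice(0, Math.max(1, Math.ceil(sorted.length * fraction)));
}

const frames = Array.from({ length: 10 }, (_, i) => ({ frameIndex: i }));
const sampled = sampleEvenly(frames, 4).map((f) => f.frameIndex);
const early = takeEarliest(frames).map((f) => f.frameIndex);
console.log(sampled); // [0, 2, 5, 7]
console.log(early);   // [0, 1, 2, 3]
```

Because `Math.floor(i * step)` never reaches the last index when down-sampling, the final frame of a video is never a candidate once sampling kicks in, which is worth keeping in mind for videos that end on the cleanest product shot.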
|
||||
|
||||
export async function detectProductFrames(
|
||||
frames: ExtractedFrame[],
|
||||
minConfidence: number,
|
||||
concurrency: number = 5,
|
||||
concurrency: number = 10,
|
||||
visionConfig: VisionConfig,
|
||||
): Promise<ProductFrame[]> {
|
||||
const model = createVisionModel(visionConfig);
|
||||
|
||||
// Pass 1: parallel filter — discard junk frames
|
||||
const keepFlags: boolean[] = [];
|
||||
for (let i = 0; i < frames.length; i += concurrency) {
|
||||
const chunk = frames.slice(i, i + concurrency);
|
||||
const flags = await Promise.all(
|
||||
chunk.map((f) => filterFrame(f, model).catch(() => false))
|
||||
);
|
||||
keepFlags.push(...flags);
|
||||
}
|
||||
// Pass 1: all frames in parallel, bounded by concurrency
|
||||
const keepFlags = await withConcurrency(
|
||||
frames.map((f) => () => filterFrame(f, model).catch(() => false)),
|
||||
concurrency,
|
||||
);
|
||||
|
||||
const candidates = frames.filter((_, i) => keepFlags[i]);
|
||||
if (candidates.length === 0) return [];
|
||||
|
||||
// Pass 2: single comparative call — model sees all candidates at once
|
||||
const { bestFrame, description, reasoning, boundingBox } = await rankCandidates(candidates, model);
|
||||
const container = await isContainerProduct(candidates[0], model);
|
||||
let bestSnapshot: ProductFrame | undefined;
|
||||
try {
|
||||
const { bestFrame, description, reasoning, boundingBox } = await rankCandidates(candidates, model, container);
|
||||
|
||||
const croppedPath = bestFrame.imagePath.replace(/\.jpg$/, '_cropped.jpg');
|
||||
await cropProduct(bestFrame.imagePath, boundingBox, croppedPath);
|
||||
if (isValidBoundingBox(boundingBox)) {
|
||||
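let croppedPath: string | undefined = bestFrame.imagePath.replace(/\.jpg$/, '_cropped.jpg');
|
||||
try {
|
||||
await cropProduct(bestFrame.imagePath, boundingBox, croppedPath);
|
||||
} catch {
|
||||
// cropping is optional: on failure, drop the cropped path
|
||||
croppedPath = undefined;
|
||||
}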
|
||||
bestSnapshot = {
|
||||
frameIndex: bestFrame.frameIndex,
|
||||
timestampSeconds: bestFrame.timestampSeconds,
|
||||
imagePath: bestFrame.imagePath,
|
||||
...(croppedPath ? { croppedImagePath: croppedPath } : {}),
|
||||
confidence: 0.95,
|
||||
description,
|
||||
boundingHint: reasoning,
|
||||
};
|
||||
}
|
||||
} catch {
|
||||
// ranking failed
|
||||
}
|
||||
|
||||
return [{
|
||||
frameIndex: bestFrame.frameIndex,
|
||||
timestampSeconds: bestFrame.timestampSeconds,
|
||||
imagePath: bestFrame.imagePath,
|
||||
croppedImagePath: croppedPath,
|
||||
confidence: 0.95,
|
||||
description,
|
||||
boundingHint: reasoning,
|
||||
}];
|
||||
if (!bestSnapshot) {
|
||||
return [];
|
||||
}
|
||||
|
||||
return [bestSnapshot];
|
||||
}
|
||||
|
|
|
|||
37
src/types.ts
|
|
@ -1,4 +1,13 @@
|
|||
export type Command = 'detect' | 'search' | 'detect-and-search' | 'rerank' | 'session';
|
||||
export type Command =
|
||||
| 'detect'
|
||||
| 'search'
|
||||
| 'detect-and-search'
|
||||
| 'detect-best'
|
||||
| 'detect-best-and-search'
|
||||
| 'detect-video'
|
||||
| 'detect-video-and-search'
|
||||
| 'rerank'
|
||||
| 'session';
|
||||
|
||||
export interface SearchItem {
|
||||
num_iid: number;
|
||||
|
|
@ -11,6 +20,30 @@ export interface SearchItem {
|
|||
detail_url: string;
|
||||
}
|
||||
|
||||
export interface DetectVideoResult {
|
||||
status: 'success' | 'failed';
|
||||
command: 'detect-video';
|
||||
dryRun: boolean;
|
||||
videoPath?: string;
|
||||
videoUrl?: string | null;
|
||||
description?: string;
|
||||
keyword?: string;
|
||||
snapshotImagePath?: string;
|
||||
error?: string;
|
||||
}
|
||||
|
||||
export interface DetectVideoAndSearchResult {
|
||||
status: 'success' | 'failed';
|
||||
command: 'detect-video-and-search';
|
||||
dryRun: boolean;
|
||||
videoPath?: string;
|
||||
videoUrl?: string | null;
|
||||
description?: string;
|
||||
keyword?: string;
|
||||
searchResults?: SearchItem[];
|
||||
error?: string;
|
||||
}
|
||||
|
||||
export interface DetectOptions {
|
||||
videoPath: string;
|
||||
intervalSeconds: number;
|
||||
|
|
@ -51,4 +84,4 @@ export interface SearchResult {
|
|||
error?: string;
|
||||
}
|
||||
|
||||
export type OutputResult = DetectResult | SearchResult;
|
||||
export type OutputResult = DetectResult | SearchResult | DetectVideoResult | DetectVideoAndSearchResult;
|
||||
|
|
|
|||