Compare commits


No commits in common. "main" and "v0.0.1" have entirely different histories.
main ... v0.0.1

9 changed files with 242 additions and 1040 deletions


@ -1,18 +1,40 @@
# =============================================================================
# video-product-snapshot environment variables
# =============================================================================
#
# Only CLIENT_KEY needs to be configured in ~/.openclaw/.env:
# CLIENT_KEY=sk_xxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxx
#
# Everything else (vision API key, endpoint URLs, etc.) is fetched automatically via `auth-rt client-config`.
# Copy to .env and fill in real values: cp .env.example .env
# =============================================================================
# Override the Vision model (default comes from client config; falls back to aliyun-cp-multimodal)
# VISION_MODEL=aliyun-cp-multimodal
# -----------------------------------------------------------------------------
# Vision API configuration (used for product-frame detection)
# Works with any OpenAI-compatible endpoint: OpenAI / Groq / Together / local Ollama, etc.
# -----------------------------------------------------------------------------
# Override the auth-rt binary path
# AUTH_RT_BIN=/custom/path/to/auth-rt
# API key (required)
VISION_API_KEY=your-api-key-here
# Telemetry reporting (optional)
# TELEMETRY_ENDPOINT=https://api-gw-test.yuanwei-lnc.com/ecom/tasks/telemetry
# API base URL (optional; leave empty to use the official OpenAI endpoint)
# VISION_API_BASE=https://api.groq.com/openai/v1
# VISION_API_BASE=http://localhost:11434/v1
# Model name (optional, default: gpt-4o-mini)
# VISION_MODEL=gpt-4o-mini
# VISION_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
# VISION_MODEL=llava:13b
# -----------------------------------------------------------------------------
# 1688 image-search configuration (via the local woo-data-scrawler service, port 3202)
# All Onebound calls go through the local proxy; no API key is needed here.
# -----------------------------------------------------------------------------
# Image upload endpoint (uploads a local image to public storage, returns an accessible URL)
ONEBOUND_UPLOAD_ENDPOINT=http://localhost:3202/api/v1/tasks/upload-image
# Search-by-image endpoint
ONEBOUND_SEARCH_ENDPOINT=http://localhost:3202/api/v1/tasks/search-by-image
# Keyword-search endpoint (used for rerank cross-filtering)
ONEBOUND_KEYWORD_SEARCH_ENDPOINT=http://localhost:3202/api/v1/tasks/keyword-search
# -----------------------------------------------------------------------------
# Auth: handled automatically by auth-rt; configured in ~/.openclaw/.env.
# Only CLIENT_KEY=sk_xxx needs to be set there.
# -----------------------------------------------------------------------------

README.md

@ -1,102 +0,0 @@
# video-product-snapshot — Video Product Image Search
Extract the best product frame from a video, then search 1688 by image to find matching listings.
## How It Works
1. `ffmpeg` extracts frames at 0.5s intervals (up to 60 frames)
2. Visual-quality pre-filter (brightness/variance checks drop blurry frames)
3. Container/rack product detection → automatically pick an unloaded (empty) frame
4. The vision model compares and ranks frames to pick the best product frame
5. Crop the product region → upload → 1688 image search
6. Post-filter (vision model checks whether each result is the same product) → rerank
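The sampling math in step 1 can be sketched as follows; `sampleTimestamps` is a hypothetical helper for illustration, not a function from this repo:

```typescript
// Sketch of step 1's sampling: one frame every `interval` seconds,
// capped at `maxFrames` regardless of video length.
function sampleTimestamps(durationSec: number, interval = 0.5, maxFrames = 60): number[] {
  const timestamps: number[] = [];
  for (let t = 0; t < durationSec && timestamps.length < maxFrames; t += interval) {
    timestamps.push(Number(t.toFixed(3)));
  }
  return timestamps;
}
```

A 10-second clip yields 20 candidate frames at the default interval; anything longer than 30 seconds hits the 60-frame cap.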
## Installation
```bash
./install.sh # installs auth-rt + dependencies
bun install
bun run build # outputs dist/run.js
```
## Usage
```bash
bun dist/run.js <command> [options]
```
### Commands
| Command | Description |
|------|------|
| `detect-best-and-search <video>` | **Recommended.** Best frame → image search → rerank |
| `detect-best <video>` | Extract the best product frame only, no search |
| `detect-and-search <video>` | Image search after two-stage filtering (slower) |
| `detect <video>` | Extract frames and detect products frame by frame |
| `search <image>` | Search for matches with an existing image |
| `rerank` | Cross-filter image-search results against keyword search |
| `session` | Get the current auth session token |
### Options (`detect-best` / `detect-best-and-search`)
| Flag | Default | Description |
|------|--------|------|
| `--interval=<sec>` | `0.5` | Frame sampling interval |
| `--max-frames=<n>` | `60` | Max frames to analyze |
| `--output-dir=<dir>` | next to video | Directory to save snapshots |
| `--session-id=<id>` | auto-generated | Langfuse session ID |
| `--dry-run` | — | Parse arguments without executing |
## Output
All commands print JSON to stdout, including a `sessionId` field for Langfuse tracing.
```json
{
"sessionId": "skill-20260426-184345-lb06",
"status": "success",
"command": "detect-best-and-search",
"bestSnapshot": {
"frameIndex": 7,
"timestampSeconds": 3,
"imagePath": "/path/to/frame_0007.jpg",
"croppedImagePath": "/path/to/frame_0007_cropped.jpg",
"description": "黑色金属床底鞋架 可折叠移动"
},
"rerank": {
"keyword": "床底鞋架",
"results": [
{ "num_iid": 123, "title": "...", "price": "44.00", "sales": 87, "detail_url": "..." }
]
}
}
```
## Auth Architecture
```
~/.openclaw/.env
CLIENT_KEY ──→ auth-rt ──→ backend
              ├── /session → access_token
              └── /client-config → provider.api_key
                                   provider.base_url
                                   provider.model
```
Only `CLIENT_KEY` needs to be configured; LLM credentials and endpoints are provisioned by the backend.
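The override order implied by this flow can be sketched as a small resolver; `resolveVisionModel` and `clientConfigModel` are hypothetical names standing in for the env var and the `provider.model` value delivered by `auth-rt client-config`:

```typescript
// Sketch of the precedence: VISION_MODEL env var wins, then the model
// pushed via client config, then the hardcoded fallback.
function resolveVisionModel(env: { VISION_MODEL?: string }, clientConfigModel?: string): string {
  return env.VISION_MODEL ?? clientConfigModel ?? "aliyun-cp-multimodal";
}
```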
## Environment Variables
| Variable | Description |
|------|------|
| `CLIENT_KEY` | **Required.** Set in `~/.openclaw/.env` |
| `VISION_MODEL` | Override the model name (default comes from client config) |
| `SKILL_SESSION_ID` | Langfuse session ID (auto-generated, format `skill-YYYYMMDD-HHMMSS-xxxx`) |
| `AUTH_RT_BIN` | Override the `auth-rt` binary path |
| `TELEMETRY_ENDPOINT` | Telemetry reporting endpoint |
## Prerequisites
- [Bun](https://bun.sh) runtime
- `ffmpeg` / `ffprobe` in the system PATH (frame extraction)
- `auth-rt` CLI (auth/API calls; installed automatically by `install.sh`)

SKILL.md

@ -1,94 +1,94 @@
---
name: video-product-snapshot
description: "Extract product snapshot from video and search 1688 by image. / Extract the best product frame from the video and search 1688 by image for matching items; use when the user provides a video and wants to find the product."
description: "Detect ecommerce products in video frames using Claude Vision, extract the best product snapshot, and optionally search via image-search API. Use when the user provides a video and wants to find/identify products shown in it."
---
# Video Product Snapshot — Video Product Image Search
# Video Product Snapshot
Capture the clearest product frame from a video (container-type products automatically use an unloaded frame), then upload it and search 1688 by image for matching products.
Extract ecommerce product snapshots from video using Claude Vision, then optionally search for matching products via image-search API.
## Run
```bash
bun dist/run.js <command> [args] [--dry-run]
```
## Commands
| Command | When to use |
|------|---------|
| `detect-best-and-search <video>` | **Recommended.** Extract best product frame → image search → rerank and return results. |
| `detect-best <video>` | Extract the best product frame only, no search. |
| `detect-and-search <video>` | Image search after two-stage filtering (slower than detect-best). |
| `search <image-path>` | Search directly with an existing product image. |
| `rerank` | Cross-filter image-search results with a keyword. |
| `session` | Get the current auth session token. |
| Command | Description |
|---------|-------------|
| `detect <video-path> [options]` | Extract frames, detect product snapshots |
| `search <image-path>` | Search products by image via API |
| `detect-and-search <video-path> [options]` | Detect best snapshot then run image search |
| `session` | Get auth session token |
## Main command: `detect-best-and-search`
Pipeline:
1. ffmpeg extracts frames at 0.5s intervals (up to 60 frames)
2. The vision model checks whether the product is a container/rack type
3. Container type: pick the best frame from the first 40% of frames (the unloaded stage) only
4. Non-container: pick the clearest frame across all frames
5. Crop the product region
6. Upload the cropped image → 1688 image search
7. Rerank: cross-filter image-search results against keyword-search results
## Options for `detect-best` / `detect-best-and-search`
## Options for `detect` / `detect-and-search`
| Flag | Default | Description |
|------|---------|-------------|
| `--interval=<sec>` | `0.5` | Frame sampling interval (seconds) |
| `--max-frames=<n>` | `60` | Max frames to analyze |
| `--output-dir=<dir>` | next to video | Directory to save snapshots |
| `--interval=<sec>` | `1` | Seconds between sampled frames |
| `--max-frames=<n>` | `60` | Max frames to analyze |
| `--output-dir=<dir>` | next to video | Directory to save snapshot images |
| `--min-confidence=<0-1>` | `0.7` | Minimum detection confidence threshold |
## Output format
## Examples
### `detect-best-and-search`
```bash
# Detect product frames in a video
bun dist/run.js detect ./product-demo.mp4
# Sample every 5 seconds, higher confidence threshold
bun dist/run.js detect ./product-demo.mp4 --interval=5 --min-confidence=0.85
# Search for products using an existing image
bun dist/run.js search ./snapshot.jpg
# Full pipeline: detect best product frame then search
bun dist/run.js detect-and-search ./product-demo.mp4 --interval=3 --max-frames=20
```
## Output
Returns JSON with:
- `productFrames[]`: all detected product frames sorted by confidence (highest first)
- `bestSnapshot`: the highest-confidence product frame
- `searchBody`: image search API response (for `detect-and-search` and `search`)
Each `ProductFrame` contains:
```json
{
"bestSnapshot": {
"frameIndex": 7,
"timestampSeconds": 3,
"imagePath": "/path/to/frame_0007.jpg",
"croppedImagePath": "/path/to/frame_0007_cropped.jpg",
"description": "黑色金属床底鞋架 可折叠移动"
},
"rerank": {
"keyword": "床底鞋架",
"results": [
{ "num_iid": 123, "title": "...", "price": "44.00", "sales": 87, "detail_url": "..." }
]
}
"frameIndex": 4,
"timestampSeconds": 9,
"imagePath": "/path/to/snapshot/frame_0004.jpg",
"confidence": 0.92,
"description": "White sneaker with blue logo, left side view",
"boundingHint": "centered"
}
```
## Result display format
## Prerequisites
Format `rerank.results` (preferred) or `searchBody.data.items.item` as a markdown table, **at most 5 rows**:
- `ffmpeg` and `ffprobe` in PATH
- `VISION_API_KEY` — API key for the vision endpoint
- `VISION_API_BASE` — (optional) OpenAI-compatible base URL; omit to use OpenAI default
- `VISION_MODEL` — (optional) model name, default `gpt-4o-mini`
- `auth-rt` in PATH (for `search` / `detect-and-search` API calls)
| # | Title | Price | Sales | Link |
|---|----------|------|------|------|
| 1 | {title} | ¥{promotion_price \|\| price} | {sales ?? —} pcs | [View]({detail_url}) |
### Example provider configs
- Use `promotion_price` when present, otherwise the regular price
- Show `—` when `sales` is missing or zero
- Always present results as a markdown table, never a bullet list
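The price and sales fallbacks above can be sketched as a row formatter; `formatRow` and the trimmed `Item` type are hypothetical illustrations, not code from this repo:

```typescript
interface Item {
  title: string;
  price: string;
  promotion_price?: string;
  sales?: number;
  detail_url: string;
}

// One markdown table row per the display rules: the promotional price wins
// over the list price, and missing/zero sales render as "—".
function formatRow(i: number, item: Item): string {
  const price = item.promotion_price || item.price;
  const sales = item.sales ? `${item.sales} pcs` : "—";
  return `| ${i} | ${item.title} | ¥${price} | ${sales} | [View](${item.detail_url}) |`;
}
```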
## Execution rules
1. **Do not retry.** If a command fails, report the error immediately.
2. **Trust the tool's output.** The CLI already handles session management and error formatting internally.
3. **Run slow commands in a sub-agent** (detect-best-and-search / detect-and-search); running them directly will time out.
### Sub-agent execution template
```bash
# OpenAI (default)
VISION_API_KEY=sk-...
# Any OpenAI-compatible endpoint (local Ollama, Together, Groq, etc.)
VISION_API_KEY=...
VISION_API_BASE=http://localhost:11434/v1
VISION_MODEL=llava:13b
```
```
sessions_spawn(
  task: "cd /path/to/skill && Run this command and return the raw JSON output:\n\nbun dist/run.js detect-best-and-search <video-path>\n\nCopy the entire JSON output as your reply.",
  label: "video-product-snapshot",
  runTimeoutSeconds: 300,
)
```
## Rules — MUST follow
1. **Execute only, do not reason about internals.** Run the CLI and return the output.
2. **No fallback strategies.** Report errors as-is; do NOT try alternative approaches.
3. **No retry loops.** If detection or search fails, report the failure.
4. **Trust the tool's output.** The CLI handles session management and error formatting internally.


@ -2,7 +2,6 @@
import { resolve } from 'path';
import type { Command } from '../src/types.ts';
import { run } from '../src/index.ts';
const SKILL_NAME = 'video-product-snapshot';
// Load .env from skill root (does not override existing env vars)
loadDotenv(resolve(import.meta.dir, '../.env'));
@ -22,56 +21,39 @@ function loadDotenv(path: string): void {
}
function printUsage(): void {
console.error(`Usage:
bun scripts/run.ts [--api-base=<url>] <command> [args...] [--dry-run]
Commands:
session
Get auth session token
detect <video-path> [options]
Options:
--interval=<seconds> default: 1
--max-frames=<n> default: 60
--output-dir=<dir> default: video's directory
--min-confidence=<0-1> default: 0.7
Extract frames and detect ecommerce product snapshots
Options:
--interval=<seconds> Frame sampling interval (default: 3)
--max-frames=<n> Max frames to analyze (default: 30)
--output-dir=<dir> Where to save snapshots (default: next to video)
--min-confidence=<0-1> Minimum detection confidence (default: 0.7)
--concurrency=<n> Parallel Vision API calls per chunk (default: 5)
search <image-path>
Search for products using an image via the ecom image-search API
detect-and-search <video-path> [options]
Detect best product snapshot from video then run image search
detect-best <video-path> [options]
Examples:
bun scripts/run.ts detect ./demo.mp4
bun scripts/run.ts detect ./demo.mp4 --interval=5 --max-frames=20
bun scripts/run.ts search ./snapshot.jpg
bun scripts/run.ts detect-and-search ./demo.mp4 --min-confidence=0.8
detect-best-and-search <video-path> [options]
detect-video <video-path>
detect-video-and-search <video-path>
Detect the product from video and search 1688
rerank --image-results=<json> [--description=<text>] [--keyword=<text>] [--top=<n>]
Config: ~/.openclaw/.env (CLIENT_KEY), skill .env (VISION_API_KEY)
Config: ANTHROPIC_API_KEY env var required for detection.
auth-rt in PATH required for search commands.
`);
}
function reportTelemetry(payload: object): void {
const endpoint = process.env.TELEMETRY_ENDPOINT;
if (!endpoint) return;
fetch(endpoint, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(payload),
}).catch(() => {});
}
async function main(): Promise<void> {
const positionals: string[] = [];
let dryRun = false;
@ -81,8 +63,6 @@ async function main(): Promise<void> {
dryRun = true;
} else if (arg.startsWith('--api-base=')) {
process.env.API_BASE = arg.slice('--api-base='.length).trim();
} else if (arg.startsWith('--session-id=')) {
process.env.SKILL_SESSION_ID = arg.slice('--session-id='.length).trim();
} else if (arg === '-h' || arg === '--help') {
printUsage(); process.exit(0);
} else {
@ -92,23 +72,8 @@ async function main(): Promise<void> {
if (positionals.length < 1) { printUsage(); process.exit(1); }
const command = positionals[0] as Command;
const sessionId = process.env.SKILL_SESSION_ID!; // set by auth-cli.ts at module load
const startMs = Date.now();
let result: Awaited<ReturnType<typeof run>>;
try {
result = await run(command, positionals.slice(1), dryRun);
} catch (err) {
const error = err instanceof Error ? err.message : String(err);
console.log(JSON.stringify({ status: 'failed', command, dryRun, sessionId, error }, null, 2));
if (!dryRun) reportTelemetry({ skill: SKILL_NAME, command, sessionId, status: 'failed', durationMs: Date.now() - startMs, error });
process.exit(1);
}
const output = { ...result, sessionId } as Record<string, unknown>;
console.log(JSON.stringify(output, null, 2));
if (!dryRun) reportTelemetry({ skill: SKILL_NAME, command, sessionId, status: result.status, durationMs: Date.now() - startMs, error: (result as any).error });
const result = await run(positionals[0] as Command, positionals.slice(1), dryRun);
console.log(JSON.stringify(result, null, 2));
}
main().catch((err) => {


@ -20,18 +20,6 @@ import * as path from 'path';
import * as os from 'os';
const home = process.env.HOME || os.homedir();
// ── session ID (Langfuse tracing) ──
// Priority: SKILL_SESSION_ID env > auto-generate
const SESSION_ID = process.env.SKILL_SESSION_ID || (() => {
const ts = new Date();
const pad = (n: number) => String(n).padStart(2, '0');
const tsPart = `${ts.getFullYear()}${pad(ts.getMonth()+1)}${pad(ts.getDate())}-${pad(ts.getHours())}${pad(ts.getMinutes())}${pad(ts.getSeconds())}`;
const rand = Math.random().toString(36).slice(2, 6);
return `skill-${tsPart}-${rand}`;
})();
process.env.SKILL_SESSION_ID = SESSION_ID;
const AUTH_RT_BIN = process.env.AUTH_RT_BIN
|| (() => {
// Check if auth-rt is in PATH
@ -55,20 +43,6 @@ export interface SessionResponse {
hookToken?: string;
}
export interface ClientConfig {
clientId: string;
name: string;
status: string;
metadata: {
provider?: {
api_key?: string;
base_url?: string;
model?: string;
};
[key: string]: unknown;
};
}
export interface SkillClientOptions {
apiBase?: string;
dryRun?: boolean;
@ -105,13 +79,6 @@ export class SkillClient {
return JSON.parse(runCli('session'));
}
async clientConfig(): Promise<ClientConfig> {
if (this.dryRun) {
return { clientId: '<dry-run>', name: '<dry-run>', status: 'active', metadata: {} };
}
return JSON.parse(runCli('client-config'));
}
async get(urlPath: string): Promise<ApiResponse> {
return this.request('GET', urlPath);
}


@ -1,32 +1,13 @@
import * as fs from 'fs';
import * as path from 'path';
import type { Command, DetectOptions, DetectResult, SearchResult, OutputResult, SearchItem, DetectVideoResult, DetectVideoAndSearchResult } from './types.ts';
import type { Command, DetectOptions, DetectResult, SearchResult, OutputResult, SearchItem } from './types.ts';
import { createSkillClient } from './auth-cli.ts';
import { extractFrames } from './frame-extractor.ts';
import { detectProductFrames, detectBestFrame } from './product-detector.ts';
import { postFilterByImage } from './post-filter.ts';
import { detectProductFrames } from './product-detector.ts';
import { imageToBase64 } from './frame-extractor.ts';
import { generateText } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';
export interface VisionConfig {
apiKey: string;
baseURL?: string;
model: string;
sessionId?: string;
}
async function loadVisionConfig(client: ReturnType<typeof createSkillClient>): Promise<VisionConfig> {
const cfg = await client.clientConfig();
const apiKey = cfg.metadata?.provider?.api_key;
if (!apiKey) throw new Error('Vision API key not found in client config (metadata.provider.api_key)');
return {
apiKey,
baseURL: cfg.metadata?.provider?.base_url,
model: process.env.VISION_MODEL ?? cfg.metadata?.provider?.model ?? 'aliyun-cp-multimodal',
sessionId: process.env.SKILL_SESSION_ID || `skill_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`,
};
}
export async function run(
command: Command,
args: string[],
@ -41,14 +22,6 @@ export async function run(
return runSearch(args, dryRun);
case 'detect-and-search':
return runDetectAndSearch(args, dryRun);
case 'detect-best':
return runDetectBest(args, dryRun);
case 'detect-best-and-search':
return runDetectBestAndSearch(args, dryRun);
case 'detect-video':
return runDetectVideo(args, dryRun);
case 'detect-video-and-search':
return runDetectVideoAndSearch(args, dryRun);
case 'rerank':
return runRerank(args, dryRun);
default:
@ -77,11 +50,8 @@ async function runDetect(args: string[], dryRun: boolean): Promise<DetectResult>
};
}
const client = createSkillClient();
const visionConfig = await loadVisionConfig(client);
const frames = extractFrames(videoPath, opts.outputDir, opts.intervalSeconds, opts.maxFrames);
const productFrames = await detectProductFrames(frames, opts.minConfidence, opts.concurrency, visionConfig);
const productFrames = await detectProductFrames(frames, opts.minConfidence, opts.concurrency);
return {
status: 'success',
@ -94,16 +64,28 @@ async function runDetect(args: string[], dryRun: boolean): Promise<DetectResult>
};
}
async function uploadImage(client: ReturnType<typeof createSkillClient>, imagePath: string): Promise<string> {
async function uploadImage(imagePath: string): Promise<string> {
const searchEndpoint = process.env.ONEBOUND_SEARCH_ENDPOINT;
if (!searchEndpoint) throw new Error('ONEBOUND_SEARCH_ENDPOINT not set');
const uploadEndpoint = process.env.ONEBOUND_UPLOAD_ENDPOINT;
if (!uploadEndpoint) throw new Error('ONEBOUND_UPLOAD_ENDPOINT not set');
const imageBuffer = fs.readFileSync(imagePath);
const filename = `video-snapshot-${Date.now()}.jpg`;
const res = await client.post('/ecom/tasks/upload-image', {
data: imageBuffer.toString('base64'),
filename,
contentType: 'image/jpeg',
const response = await fetch(uploadEndpoint, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
data: imageBuffer.toString('base64'),
filename,
contentType: 'image/jpeg',
}),
});
if (res.status >= 400) throw new Error(`Upload failed: HTTP ${res.status}`);
const json = JSON.parse(res.body) as { url?: string };
if (!response.ok) throw new Error(`Upload failed: HTTP ${response.status}`);
const json = await response.json() as { url?: string };
if (!json.url) throw new Error('Upload response missing url');
return json.url;
}
@ -113,214 +95,35 @@ async function runSearch(args: string[], dryRun: boolean): Promise<SearchResult>
if (!imagePath) return { status: 'failed', command: 'search', dryRun, error: 'search requires <image-path>' };
if (!fs.existsSync(imagePath)) return { status: 'failed', command: 'search', dryRun, error: `image not found: ${imagePath}` };
const searchEndpoint = process.env.ONEBOUND_SEARCH_ENDPOINT;
if (!searchEndpoint) return { status: 'failed', command: 'search', dryRun, error: 'ONEBOUND_SEARCH_ENDPOINT not set' };
if (dryRun) {
return { status: 'success', command: 'search', dryRun, imagePath, searchHttpStatus: 0, searchBody: null };
}
const client = createSkillClient();
// If given a local file, upload it first to get a public URL
let imgid = imagePath;
if (!imagePath.startsWith('http')) {
imgid = await uploadImage(client, imagePath);
imgid = await uploadImage(imagePath);
}
const res = await client.post('/ecom/tasks/search-by-image', { imgid, page: 1 });
const searchHttpStatus = res.status;
const body = JSON.parse(res.body);
const response = await fetch(searchEndpoint, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ imgid, page: 1 }),
});
if (res.status >= 400) {
const searchHttpStatus = response.status;
const body = await response.json();
if (!response.ok) {
return { status: 'failed', command: 'search', dryRun, imagePath, searchHttpStatus, error: JSON.stringify(body) };
}
return { status: 'success', command: 'search', dryRun, imagePath, searchHttpStatus, searchBody: body };
}
async function runDetectBest(args: string[], dryRun: boolean): Promise<DetectResult> {
const videoPath = args[0];
if (!videoPath) return { status: 'failed', command: 'detect-best', dryRun, error: 'detect-best requires <video-path>' };
if (!fs.existsSync(videoPath)) return { status: 'failed', command: 'detect-best', dryRun, error: `video not found: ${videoPath}` };
const outputDir = getFlag(args, '--output-dir') || path.join(
path.dirname(videoPath),
`snapshots_${path.basename(videoPath, path.extname(videoPath))}_${Date.now()}`,
);
const intervalSeconds = parseFloat(getFlag(args, '--interval') || '0.5');
const maxFrames = parseInt(getFlag(args, '--max-frames') || '60', 10);
if (dryRun) {
return { status: 'success', command: 'detect-best', dryRun, videoPath, totalFramesExtracted: 0, productFrames: [], bestSnapshot: undefined };
}
const client = createSkillClient();
const visionConfig = await loadVisionConfig(client);
const frames = extractFrames(videoPath, outputDir, intervalSeconds, maxFrames);
if (frames.length === 0) {
return { status: 'failed', command: 'detect-best', dryRun, videoPath, error: 'no frames extracted from video' };
}
const best = await detectBestFrame(frames, visionConfig, 20);
return {
status: 'success',
command: 'detect-best',
dryRun,
videoPath,
totalFramesExtracted: frames.length,
productFrames: best ? [best] : [],
bestSnapshot: best ?? undefined,
};
}
async function runDetectBestAndSearch(args: string[], dryRun: boolean): Promise<OutputResult> {
const detectResult = await runDetectBest(args, dryRun) as DetectResult;
if (detectResult.status === 'failed') return detectResult;
if (!detectResult.bestSnapshot) {
if (dryRun) return { ...detectResult, command: 'detect-best-and-search' };
return { ...detectResult, status: 'failed', error: 'no frame could be extracted from video' };
}
const best = detectResult.bestSnapshot;
const imageForSearch = best.croppedImagePath || best.imagePath;
const searchResult = await runSearch([imageForSearch], dryRun) as SearchResult;
// Post-filter: drop results whose pic_url isn't the same product type as our snapshot
let postFilter: any = undefined;
if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
const items: SearchItem[] = (searchResult.searchBody as any)?.data?.items?.item ?? [];
if (items.length > 0) {
try {
const client = createSkillClient();
const visionConfig = await loadVisionConfig(client);
const result = await postFilterByImage(imageForSearch, items, visionConfig, { description: best.description });
(searchResult.searchBody as any).data.items.item = result.kept;
postFilter = {
totalChecked: result.totalChecked,
keptCount: result.kept.length,
rejectedCount: result.rejected.length,
failed: result.failed,
};
} catch (e: any) {
postFilter = { error: e.message };
}
}
}
let rerankResult: any = undefined;
// If post-filter produced focused results, sort them directly by sales — they're already the best matches.
// Otherwise fall back to the keyword-intersection rerank.
if (!dryRun && postFilter && !postFilter.error && postFilter.keptCount > 0) {
const items: SearchItem[] = (searchResult.searchBody as any)?.data?.items?.item ?? [];
const sorted = [...items].sort((a, b) => (b.sales ?? 0) - (a.sales ?? 0)).slice(0, 5);
rerankResult = {
source: 'post-filter',
results: sorted,
count: sorted.length,
};
} else if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
const tmpFile = path.join(path.dirname(imageForSearch), `search_body_${Date.now()}.json`);
try {
fs.writeFileSync(tmpFile, JSON.stringify(searchResult.searchBody));
rerankResult = await runRerank([
`--image-results=${tmpFile}`,
`--description=${best.description}`,
'--top=5',
], dryRun);
} catch (e: any) {
rerankResult = { error: e.message };
} finally {
try { fs.unlinkSync(tmpFile); } catch {}
}
}
return {
...detectResult,
command: 'detect-best-and-search',
searchHttpStatus: searchResult.searchHttpStatus,
searchBody: searchResult.searchBody,
searchError: searchResult.error,
postFilter,
rerank: rerankResult,
} as any;
}
async function runDetectVideo(args: string[], dryRun: boolean): Promise<DetectVideoResult> {
const videoPath = args[0];
if (!videoPath) return { status: 'failed', command: 'detect-video', dryRun, error: 'detect-video requires <video-path>' };
if (!fs.existsSync(videoPath)) return { status: 'failed', command: 'detect-video', dryRun, error: `video not found: ${videoPath}` };
const detectResult = await runDetectBest(args, dryRun) as DetectResult;
if (detectResult.status === 'failed') {
return { status: 'failed', command: 'detect-video', dryRun, videoPath, error: detectResult.error || 'failed to detect best frame' };
}
const description = detectResult.bestSnapshot?.description?.trim();
const snapshotImagePath = detectResult.bestSnapshot?.croppedImagePath || detectResult.bestSnapshot?.imagePath;
if (!description) {
return { status: 'failed', command: 'detect-video', dryRun, videoPath, error: 'no product description detected from video' };
}
if (dryRun) {
return { status: 'success', command: 'detect-video', dryRun, videoPath, videoUrl: null, description, keyword: '<dry-run-keyword>', snapshotImagePath };
}
const client = createSkillClient();
const visionConfig = await loadVisionConfig(client);
const keyword = await generateChineseKeyword(description, visionConfig);
return { status: 'success', command: 'detect-video', dryRun, videoPath, videoUrl: null, description, keyword, snapshotImagePath };
}
async function runDetectVideoAndSearch(args: string[], dryRun: boolean): Promise<DetectVideoAndSearchResult> {
const videoPath = args[0];
if (!videoPath) return { status: 'failed', command: 'detect-video-and-search', dryRun, error: 'detect-video-and-search requires <video-path>' };
if (!fs.existsSync(videoPath)) return { status: 'failed', command: 'detect-video-and-search', dryRun, error: `video not found: ${videoPath}` };
if (dryRun) {
return { status: 'success', command: 'detect-video-and-search', dryRun, videoPath, videoUrl: null, description: '<dry-run>', keyword: '<dry-run>', searchResults: [] };
}
// Reuse existing pipeline: best snapshot → image search → keyword rerank
const detectAndSearch = await runDetectBestAndSearch(args, dryRun) as any;
if (detectAndSearch.status === 'failed') {
return { status: 'failed', command: 'detect-video-and-search', dryRun, videoPath, error: detectAndSearch.error || 'detect-best-and-search failed' };
}
const description = String(detectAndSearch.bestSnapshot?.description || '').trim();
const rerank = detectAndSearch.rerank;
const keyword = String(rerank?.keyword || '').trim();
const searchResults = (rerank?.results || []) as SearchItem[];
// Fallback: if rerank didn't produce anything, do keyword search directly.
if (!searchResults.length) {
const client = createSkillClient();
const visionConfig = await loadVisionConfig(client);
const fallbackKeyword = keyword || (description ? await generateChineseKeyword(description, visionConfig) : '');
const items = fallbackKeyword ? await keywordSearch(client, fallbackKeyword, 1) : [];
return {
status: 'success',
command: 'detect-video-and-search',
dryRun,
videoPath,
videoUrl: null,
description,
keyword: fallbackKeyword,
searchResults: items,
};
}
return {
status: 'success',
command: 'detect-video-and-search',
dryRun,
videoPath,
videoUrl: null,
description,
keyword,
searchResults,
};
}
async function runDetectAndSearch(args: string[], dryRun: boolean): Promise<OutputResult> {
const detectResult = await runDetect(args, dryRun) as DetectResult;
if (detectResult.status === 'failed') return detectResult;
@ -330,51 +133,23 @@ async function runDetectAndSearch(args: string[], dryRun: boolean): Promise<Outp
}
const best = detectResult.bestSnapshot;
// Use cropped image if available, otherwise full frame
const imageForSearch = best.croppedImagePath || best.imagePath;
const searchResult = await runSearch([imageForSearch], dryRun) as SearchResult;
// Post-filter: drop results whose pic_url isn't the same product type as our snapshot
let postFilter: any = undefined;
if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
const items: SearchItem[] = (searchResult.searchBody as any)?.data?.items?.item ?? [];
if (items.length > 0) {
try {
const client = createSkillClient();
const visionConfig = await loadVisionConfig(client);
const result = await postFilterByImage(imageForSearch, items, visionConfig, { description: best.description });
(searchResult.searchBody as any).data.items.item = result.kept;
postFilter = {
totalChecked: result.totalChecked,
keptCount: result.kept.length,
rejectedCount: result.rejected.length,
failed: result.failed,
};
} catch (e: any) {
postFilter = { error: e.message };
}
}
}
// Auto-rerank using product description to generate Chinese keyword
let rerankResult: any = undefined;
// If post-filter produced focused results, sort them directly by sales — they're already the best matches.
// Otherwise fall back to the keyword-intersection rerank.
if (!dryRun && postFilter && !postFilter.error && postFilter.keptCount > 0) {
const items: SearchItem[] = (searchResult.searchBody as any)?.data?.items?.item ?? [];
const sorted = [...items].sort((a, b) => (b.sales ?? 0) - (a.sales ?? 0)).slice(0, 5);
rerankResult = {
source: 'post-filter',
results: sorted,
count: sorted.length,
};
} else if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
if (!dryRun && searchResult.status === 'success' && searchResult.searchBody) {
// Save search body to temp file for rerank
const tmpFile = path.join(path.dirname(imageForSearch), `search_body_${Date.now()}.json`);
try {
fs.writeFileSync(tmpFile, JSON.stringify(searchResult.searchBody));
rerankResult = await runRerank([
const rerankArgs = [
`--image-results=${tmpFile}`,
`--description=${best.description}`,
'--top=5',
], dryRun);
'--top=10',
];
rerankResult = await runRerank(rerankArgs, dryRun);
} catch (e: any) {
rerankResult = { error: e.message };
} finally {
@ -388,7 +163,6 @@ async function runDetectAndSearch(args: string[], dryRun: boolean): Promise<Outp
searchHttpStatus: searchResult.searchHttpStatus,
searchBody: searchResult.searchBody,
searchError: searchResult.error,
postFilter,
rerank: rerankResult,
} as any;
}
@ -416,41 +190,25 @@ function getFlag(args: string[], flag: string): string | undefined {
return undefined;
}
function createVisionModel(config: VisionConfig) {
const sessionId = config.sessionId || '';
const originFetch = globalThis.fetch;
// Inject metadata.session_id into request body so LiteLLM → Langfuse creates sessions
const wrapped = async (input: RequestInfo | URL, init?: RequestInit) => {
if (init?.body && typeof init.body === 'string') {
try {
const body = JSON.parse(init.body);
if (!body.metadata) body.metadata = {};
if (!body.metadata.session_id) body.metadata.session_id = sessionId;
body.metadata.tags = ['skill:video-product-snapshot'];
init = { ...init, body: JSON.stringify(body) };
} catch {}
}
return originFetch(input, init);
};
const openai = createOpenAI({
apiKey: config.apiKey, baseURL: config.baseURL,
fetch: wrapped as typeof globalThis.fetch,
});
return openai(config.model);
function createVisionModel() {
const apiKey = process.env.VISION_API_KEY;
if (!apiKey) throw new Error('VISION_API_KEY not set');
const baseURL = process.env.VISION_API_BASE || undefined;
const modelName = process.env.VISION_MODEL || 'gpt-4o-mini';
const openai = createOpenAI({ apiKey, baseURL });
return openai(modelName);
}
async function generateChineseKeyword(description: string, visionConfig: VisionConfig): Promise<string> {
const model = createVisionModel(visionConfig);
async function generateChineseKeyword(description: string): Promise<string> {
const model = createVisionModel();
const { text } = await generateText({
model,
prompt: `You are generating a 1688.com (Chinese B2B wholesale) product search keyword.
Rules:
- Output ONLY 2-4 Chinese words: the product OBJECT TYPE + 1-2 key material/feature words
- CRITICAL: If the product is a container, organizer, rack, shelf, bag, box, or holder, the keyword MUST name THAT object, NOT the items it holds.
Examples: shoe rack → "金属鞋架", cable organizer → "理线器", storage shelf → "收纳架", toolbox → "工具箱"
- Output ONLY 2-4 Chinese words: the product category + 1-2 key material/feature words
- Use common Chinese commerce terms, NOT a literal translation
- No English, no punctuation, no explanation
- Short broad terms work better than long specific phrases
- Short broad terms work better than long specific phrases (e.g. "金属鞋架" not "黑色Z型金属网格鞋架")
Product description: ${description}
@@ -459,9 +217,16 @@ Output only the search query:`,
return text.trim().replace(/[^\u4e00-\u9fff\u3400-\u4dbf]/g, '').trim();
}
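The regex post-processing above can be isolated into a small sketch. `stripNonCJK` is an illustrative name (not part of the module); it shows how everything outside the CJK Unified Ideographs ranges (U+4E00–U+9FFF and Extension A, U+3400–U+4DBF) is stripped, so stray English words, punctuation, or whitespace cannot leak into the 1688 query:

```typescript
// Same trim/replace chain as generateChineseKeyword's return statement.
// `stripNonCJK` is a hypothetical helper name for illustration only.
function stripNonCJK(text: string): string {
  return text.trim().replace(/[^\u4e00-\u9fff\u3400-\u4dbf]/g, '').trim();
}

console.log(stripNonCJK('关键词: 金属鞋架 (metal shoe rack)')); // → 关键词金属鞋架
```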
async function keywordSearch(client: ReturnType<typeof createSkillClient>, keyword: string, page = 1): Promise<SearchItem[]> {
const res = await client.post('/ecom/tasks/keyword-search', { keyword, page });
const json = JSON.parse(res.body) as any;
async function keywordSearch(keyword: string, page = 1): Promise<SearchItem[]> {
const endpoint = process.env.ONEBOUND_KEYWORD_SEARCH_ENDPOINT;
if (!endpoint) throw new Error('ONEBOUND_KEYWORD_SEARCH_ENDPOINT not set');
const res = await fetch(endpoint, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ keyword, page }),
});
const json = await res.json() as any;
return (json?.data?.items?.item ?? []) as SearchItem[];
}
@@ -490,10 +255,9 @@ function extractKeywordsFromTitles(items: SearchItem[], topN = 5): string {
async function runRerank(args: string[], dryRun: boolean): Promise<OutputResult> {
// --image-results=<path> --keyword=<text> --top=<n>
const positionals = args.filter((a) => !a.startsWith('--'));
const imageResultsArg = getFlag(args, '--image-results') || positionals[0];
const keywordArg = getFlag(args, '--keyword') || positionals[1];
const topN = parseInt(getFlag(args, '--top') || '5', 10);
const imageResultsArg = getFlag(args, '--image-results') || args[0];
const keywordArg = getFlag(args, '--keyword') || args[1];
const topN = parseInt(getFlag(args, '--top') || '10', 10);
const description = getFlag(args, '--description') || '';
@@ -501,9 +265,7 @@ async function runRerank(args: string[], dryRun: boolean): Promise<OutputResult>
if (dryRun) return { status: 'success', command: 'rerank', dryRun } as any;
const client = createSkillClient();
const visionConfig = await loadVisionConfig(client);
// Load image search results
let imageItems: SearchItem[];
try {
const raw = fs.existsSync(imageResultsArg)
@@ -527,7 +289,7 @@ async function runRerank(args: string[], dryRun: boolean): Promise<OutputResult>
// Prefer product description for accurate translation; fall back to image titles
const sourceText = description || keyword || extractKeywordsFromTitles(imageItems);
try {
autoGeneratedKeyword = await generateChineseKeyword(sourceText, visionConfig);
autoGeneratedKeyword = await generateChineseKeyword(sourceText);
} catch {
autoGeneratedKeyword = extractKeywordsFromTitles(imageItems);
}
@@ -537,7 +299,7 @@ async function runRerank(args: string[], dryRun: boolean): Promise<OutputResult>
// Keyword search on 1688
let keywordItems: SearchItem[] = [];
try {
keywordItems = await keywordSearch(client, keyword);
keywordItems = await keywordSearch(keyword);
} catch (e: any) {
return { status: 'failed', command: 'rerank', dryRun, error: `keyword search failed: ${e.message}` };
}
@@ -568,3 +330,7 @@ async function runRerank(args: string[], dryRun: boolean): Promise<OutputResult>
results: sorted,
} as any;
}
function parseJsonSafe(text: string): unknown {
try { return JSON.parse(text); } catch { return text; }
}
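A quick demonstration of the fallback behavior of `parseJsonSafe` (self-contained copy for illustration): on malformed input it hands back the raw string instead of throwing, so callers can log whatever the upstream service actually sent (e.g. an HTML error page).

```typescript
// Copy of parseJsonSafe above: JSON when possible, raw string otherwise.
function parseJsonSafe(text: string): unknown {
  try { return JSON.parse(text); } catch { return text; }
}

console.log(parseJsonSafe('{"ok":true}'));      // → { ok: true }
console.log(parseJsonSafe('<html>502</html>')); // → <html>502</html> (returned unchanged)
```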


@@ -1,123 +0,0 @@
import { generateText } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';
import type { SearchItem } from './types.ts';
import type { VisionConfig } from './index.ts';
import { imageToBase64 } from './frame-extractor.ts';
export interface PostFilterResult {
kept: SearchItem[];
rejected: SearchItem[];
totalChecked: number;
failed: boolean;
}
const FILTER_PROMPT = (count: number, description?: string) => {
const productLine = description
? `查询商品是:${description}`
: '第1张图是查询商品。';
return `${productLine}
${count}
****
- "鞋架"
-
- vs vs
-
1: YES
2: NO
3: YES
...
${count} `;
};
function createModel(config: VisionConfig) {
const sessionId = config.sessionId || '';
const originFetch = globalThis.fetch;
const wrapped = async (input: RequestInfo | URL, init?: RequestInit) => {
if (init?.body && typeof init.body === 'string') {
try {
const body = JSON.parse(init.body);
if (!body.metadata) body.metadata = {};
if (!body.metadata.session_id) body.metadata.session_id = sessionId;
body.metadata.tags = ['skill:video-product-snapshot'];
init = { ...init, body: JSON.stringify(body) };
} catch {}
}
return originFetch(input, init);
};
const provider = createOpenAI({
apiKey: config.apiKey, baseURL: config.baseURL,
fetch: wrapped as typeof globalThis.fetch,
});
return provider(config.model);
}
async function classifyBatch(
model: ReturnType<ReturnType<typeof createOpenAI>>,
queryImageDataUrl: string,
batch: SearchItem[],
description?: string,
): Promise<boolean[]> {
const content: any[] = [{ type: 'image', image: queryImageDataUrl }];
for (const item of batch) {
content.push({ type: 'image', image: item.pic_url });
}
content.push({ type: 'text', text: FILTER_PROMPT(batch.length, description) });
const { text } = await generateText({
model,
messages: [{ role: 'user', content }],
maxTokens: 200,
});
const flags = batch.map(() => false);
for (const line of text.split('\n')) {
const m = line.match(/^\s*(\d+)\s*[:]\s*(YES|NO|是|否)/i);
if (!m) continue;
const idx = parseInt(m[1], 10) - 1;
const yes = /YES|是/i.test(m[2]);
if (idx >= 0 && idx < flags.length) flags[idx] = yes;
}
return flags;
}
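The verdict-parsing loop in `classifyBatch` can be sketched on its own: the model answers one "N: YES/NO" (or 是/否) line per candidate image, and any line that fails to parse leaves the corresponding flag at its default `false`, i.e. the item is rejected. `parseVerdicts` is an illustrative name, not part of the original module.

```typescript
// Parse "N: YES/NO/是/否" verdict lines into an ordered boolean array.
function parseVerdicts(text: string, count: number): boolean[] {
  const flags = new Array<boolean>(count).fill(false);
  for (const line of text.split('\n')) {
    const m = line.match(/^\s*(\d+)\s*[:]\s*(YES|NO|是|否)/i);
    if (!m) continue;
    const idx = parseInt(m[1], 10) - 1; // the model numbers images from 1
    if (idx >= 0 && idx < count) flags[idx] = /YES|是/i.test(m[2]);
  }
  return flags;
}

console.log(parseVerdicts('1: YES\n2: NO\ngarbage line\n3: 是', 3)); // → [ true, false, true ]
```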
export async function postFilterByImage(
queryImagePath: string,
items: SearchItem[],
visionConfig: VisionConfig,
options: { description?: string; batchSize?: number } = {},
): Promise<PostFilterResult> {
if (items.length === 0) {
return { kept: [], rejected: [], totalChecked: 0, failed: false };
}
const batchSize = options.batchSize ?? 10;
const description = options.description;
const model = createModel(visionConfig);
const queryDataUrl = `data:image/jpeg;base64,${imageToBase64(queryImagePath)}`;
const kept: SearchItem[] = [];
const rejected: SearchItem[] = [];
let anyFailed = false;
for (let i = 0; i < items.length; i += batchSize) {
const batch = items.slice(i, i + batchSize);
try {
const flags = await classifyBatch(model, queryDataUrl, batch, description);
batch.forEach((item, idx) => {
if (flags[idx]) kept.push(item);
else rejected.push(item);
});
} catch {
// On batch failure, keep items (don't lose them) but flag the run as partial
anyFailed = true;
kept.push(...batch);
}
}
return { kept, rejected, totalChecked: items.length, failed: anyFailed };
}


@@ -1,10 +1,9 @@
import { generateObject, generateText } from 'ai';
import { generateObject } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';
import { z } from 'zod';
import type { ExtractedFrame } from './frame-extractor.ts';
import type { ProductFrame } from './types.ts';
import { imageToBase64 } from './frame-extractor.ts';
import type { VisionConfig } from './index.ts';
// Pass 1: quick filter — discard frames that clearly have no product
const FilterSchema = z.object({
@@ -28,75 +27,32 @@ Discard (keep=false) if: only hands/texture/contents visible, motion blur, black
reason options: product_visible | content_only | hands_only | blur | transition | background_only`;
const CONTAINER_CHECK_PROMPT = `Is the main product in this image a CONTAINER, RACK, or HOLDER (something designed to store/hold other items)?
Examples YES: shoe rack, shelf, storage box, organizer, basket, drawer, wardrobe, trolley, bin, tray, cabinet.
Examples NO: shoes, clothing, electronics, food, toys, cosmetics, tools.
Reply with only one word: YES or NO.`;
const RANKING_PROMPT = (count: number) => `You are selecting the single best product image from ${count} video frames for ecommerce image search.
const RANKING_PROMPT_CONTAINER = (count: number) => `You are selecting ONE frame from ${count} video frames to use as the query image for an ecommerce reverse-image search.
The frames are numbered 0 to ${count - 1} in the order shown.
The hero product is a CONTAINER / RACK / HOLDER / ORGANIZER.
CRITICAL CONSTRAINT - read this first:
Image search engines identify objects by visual appearance. If the container holds items (shoes, clothes, etc.), the search engine will match those ITEMS, not the container, returning completely wrong products.
YOUR ONLY JOB: find the frame where the container structure itself is most visible with the FEWEST or NO items inside.
ABSOLUTE PRIORITY ORDER (do not deviate):
1. Frame with container completely EMPTY: highest priority regardless of angle or assembly state
2. Frame with container partially assembled or partially visible but EMPTY: still better than any loaded frame
3. Frame with fewest items inside (1-2 items, mostly empty)
4. Frame with moderate load: only if no emptier option exists
5. Frame fully loaded: last resort, only if no other frames exist
A frame showing the rack mid-assembly with zero items is ALWAYS better than a perfectly-lit fully-assembled rack filled with shoes.
Frames are numbered 0 to ${count - 1} in order shown. You MUST pick ONE.
Pick the ONE frame where the HERO PRODUCT is:
1. Cleanest: fewest distractions, no hands blocking it, no clutter in foreground
2. Most complete: full product silhouette visible, no edges cropped
3. Most isolated: product stands out from background clearly
4. Empty/minimal load preferred: a product without contents (e.g. an empty rack) beats one stuffed with items if both show the full structure equally
Return:
- bestFrameIndex: 0-based index of the emptiest container frame
- description: concise Chinese search query under 12 words (container type + material + color + key feature)
- reasoning: describe how many items are visible inside the chosen frame and why it's the emptiest option
- boundingBox: tight box of the PRODUCT STRUCTURE ONLY as [x1, y1, x2, y2] normalized 0.0–1.0. Exclude any items stored inside.`;
- bestFrameIndex: 0-based index of chosen frame
- description: concise search query under 12 words (product type + material + color + key feature)
- reasoning: one sentence explaining why this frame was chosen
- boundingBox: tight bounding box of the HERO PRODUCT ONLY in the chosen frame as [x1, y1, x2, y2] normalized 0.0–1.0 (top-left origin). Exclude hands, background, and unrelated objects. The product is assumed to be near the center.`;
const RANKING_PROMPT_GENERAL = (count: number) => `You are selecting the single best product frame from ${count} video frames for ecommerce search.
function createVisionModel() {
const apiKey = process.env.VISION_API_KEY;
if (!apiKey) throw new Error('VISION_API_KEY not set');
Frames are numbered 0 to ${count - 1} in order shown.
IMPORTANT: You MUST pick ONE frame even if product visibility is imperfect or no frame looks ideal. Always make your best guess.
Pick the frame where the MAIN SELLING PRODUCT is:
1. Most recognizable: clearest view of the item being sold
2. Most complete: full product silhouette visible, not cropped at edges
3. Cleanest: minimal obstruction (hands, clutter, motion blur, labels)
4. Best lit and in focus
Return:
- bestFrameIndex: 0-based index
- description: concise search query under 12 words (product type + material + color + key features), in Chinese
- reasoning: one sentence explaining choice
- boundingBox: tight box of the PRODUCT ONLY as [x1, y1, x2, y2] normalized 0.0–1.0, top-left origin. Exclude hands, background, and unrelated objects. The product is near the center of the frame.`;
function createVisionModel(config: VisionConfig) {
const sessionId = config.sessionId || '';
const originFetch = globalThis.fetch;
const wrapped = async (input: RequestInfo | URL, init?: RequestInit) => {
if (init?.body && typeof init.body === 'string') {
try {
const body = JSON.parse(init.body);
if (!body.metadata) body.metadata = {};
if (!body.metadata.session_id) body.metadata.session_id = sessionId;
body.metadata.tags = ['skill:video-product-snapshot'];
init = { ...init, body: JSON.stringify(body) };
} catch {}
}
return originFetch(input, init);
};
const provider = createOpenAI({
apiKey: config.apiKey, baseURL: config.baseURL,
fetch: wrapped as typeof globalThis.fetch,
apiKey,
baseURL: process.env.VISION_API_BASE,
});
return provider(config.model);
return provider(process.env.VISION_MODEL ?? 'gpt-4o-mini');
}
async function filterFrame(
@@ -120,52 +76,15 @@ async function filterFrame(
return object.keep;
}
async function isContainerProduct(
firstFrame: ExtractedFrame,
model: ReturnType<ReturnType<typeof createOpenAI>>,
): Promise<boolean> {
try {
const { text } = await generateText({
model,
messages: [{
role: 'user',
content: [
{ type: 'image', image: `data:image/jpeg;base64,${imageToBase64(firstFrame.imagePath)}` },
{ type: 'text', text: CONTAINER_CHECK_PROMPT },
],
}],
maxTokens: 5,
});
return text.trim().toUpperCase().startsWith('Y');
} catch {
return false;
}
}
function takeEarliestFrames(candidates: ExtractedFrame[], fraction: number = 0.4): ExtractedFrame[] {
// Ecommerce videos show the container empty/unboxing early, then full.
// Taking the first 40% of frames reliably captures empty states.
const sorted = [...candidates].sort((a, b) => a.frameIndex - b.frameIndex);
const cutoff = Math.max(1, Math.ceil(sorted.length * fraction));
return sorted.slice(0, cutoff);
}
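The "earliest frames" heuristic above can be exercised in isolation (self-contained copy with a minimal frame type, purely for illustration): frames are sorted by `frameIndex`, then the first `ceil(n * fraction)` are kept, with a floor of one frame.

```typescript
// Minimal stand-in for ExtractedFrame; only frameIndex matters here.
interface Frame { frameIndex: number }

// Keep the earliest `fraction` of frames (default 40%), at least one.
function takeEarliestFrames(candidates: Frame[], fraction = 0.4): Frame[] {
  const sorted = [...candidates].sort((a, b) => a.frameIndex - b.frameIndex);
  const cutoff = Math.max(1, Math.ceil(sorted.length * fraction));
  return sorted.slice(0, cutoff);
}

const frames = [5, 1, 4, 2, 3].map((frameIndex) => ({ frameIndex }));
console.log(takeEarliestFrames(frames).map((f) => f.frameIndex)); // → [ 1, 2 ]
```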
async function rankCandidates(
candidates: ExtractedFrame[],
model: ReturnType<ReturnType<typeof createOpenAI>>,
isContainer: boolean,
): Promise<{ bestFrame: ExtractedFrame; description: string; reasoning: string; boundingBox: [number, number, number, number] }> {
const imageContent = candidates.map((f) => ({
type: 'image' as const,
image: `data:image/jpeg;base64,${imageToBase64(f.imagePath)}`,
}));
const prompt = isContainer
? RANKING_PROMPT_CONTAINER(candidates.length)
: RANKING_PROMPT_GENERAL(candidates.length);
const { object } = await generateObject({
model,
schema: RankingSchema,
@@ -174,7 +93,7 @@ async function rankCandidates(
role: 'user',
content: [
...imageContent,
{ type: 'text', text: prompt },
{ type: 'text', text: RANKING_PROMPT(candidates.length) },
],
}],
});
@@ -201,17 +120,7 @@ export async function cropProduct(
let [x1, y1, x2, y2] = boundingBox;
// Normalize coords: ensure x1<x2 and y1<y2
if (x1 > x2) [x1, x2] = [x2, x1];
if (y1 > y2) [y1, y2] = [y2, y1];
// Clamp to [0, 1]
x1 = Math.max(0, Math.min(1, x1));
y1 = Math.max(0, Math.min(1, y1));
x2 = Math.max(0, Math.min(1, x2));
y2 = Math.max(0, Math.min(1, y2));
// Add padding
// add padding
const pw = (x2 - x1) * paddingFactor;
const ph = (y2 - y1) * paddingFactor;
x1 = Math.max(0, x1 - pw);
@@ -219,11 +128,6 @@ export async function cropProduct(
x2 = Math.min(1, x2 + pw);
y2 = Math.min(1, y2 + ph);
// Validate minimum area
if (x2 - x1 < 0.005 || y2 - y1 < 0.005) {
throw new Error('bounding box too small after normalization');
}
const left = Math.round(x1 * W);
const top = Math.round(y1 * H);
const width = Math.round((x2 - x1) * W);
@@ -237,203 +141,39 @@ export async function cropProduct(
return outputPath;
}
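The bounding-box arithmetic inside `cropProduct` can be traced with a standalone sketch: normalized `[x1, y1, x2, y2]` coordinates are re-ordered, clamped to `[0, 1]`, expanded by a relative padding, checked against the minimum-area guard, then mapped to integer pixel values. `toPixelRect` and its return shape are assumptions made for this example; the real function goes on to crop with sharp.

```typescript
// Sketch of cropProduct's coordinate math (no image I/O).
function toPixelRect(
  bbox: [number, number, number, number],
  W: number, H: number, paddingFactor = 0.05,
): { left: number; top: number; width: number; height: number } {
  let [x1, y1, x2, y2] = bbox;
  if (x1 > x2) [x1, x2] = [x2, x1]; // ensure x1 < x2
  if (y1 > y2) [y1, y2] = [y2, y1]; // ensure y1 < y2
  const clamp = (v: number) => Math.max(0, Math.min(1, v));
  [x1, y1, x2, y2] = [clamp(x1), clamp(y1), clamp(x2), clamp(y2)];
  const pw = (x2 - x1) * paddingFactor; // relative padding
  const ph = (y2 - y1) * paddingFactor;
  x1 = Math.max(0, x1 - pw); y1 = Math.max(0, y1 - ph);
  x2 = Math.min(1, x2 + pw); y2 = Math.min(1, y2 + ph);
  if (x2 - x1 < 0.005 || y2 - y1 < 0.005) {
    throw new Error('bounding box too small after normalization');
  }
  return {
    left: Math.round(x1 * W),
    top: Math.round(y1 * H),
    width: Math.round((x2 - x1) * W),
    height: Math.round((y2 - y1) * H),
  };
}

// A 0.2–0.8 box on a 1000×1000 frame with 5% padding:
console.log(toPixelRect([0.2, 0.2, 0.8, 0.8], 1000, 1000));
// → { left: 170, top: 170, width: 660, height: 660 }
```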
async function withConcurrency<T>(
tasks: (() => Promise<T>)[],
limit: number,
): Promise<T[]> {
const results: T[] = new Array(tasks.length);
let next = 0;
async function worker() {
while (next < tasks.length) {
const i = next++;
results[i] = await tasks[i]();
}
}
await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
return results;
}
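A usage sketch for the concurrency pool above (self-contained copy for illustration): tasks are lazy thunks, at most `limit` of them run at once, and results land at their original indices regardless of completion order. The delays here are hypothetical.

```typescript
// Bounded-concurrency runner: workers pull the next unstarted task index.
async function withConcurrency<T>(tasks: (() => Promise<T>)[], limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // synchronous claim — safe in single-threaded JS
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

const delays = [30, 10, 20];
withConcurrency(
  delays.map((ms, i) => () => new Promise<number>((res) => setTimeout(() => res(i), ms))),
  2,
).then((out) => console.log(out)); // → [ 0, 1, 2 ] (input order preserved)
```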
// ── Frame quality pre-filtering ──────────────────────────────────────
interface FrameQuality {
valid: boolean;
meanBrightness: number;
variance: number;
}
async function assessFrameQuality(imagePath: string): Promise<FrameQuality> {
const sharp = (await import('sharp')).default;
const { data, info } = await sharp(imagePath)
.grayscale()
.raw()
.toBuffer({ resolveWithObject: true });
const pixels = new Uint8Array(data);
let sum = 0;
let sumSq = 0;
for (let i = 0; i < pixels.length; i++) {
sum += pixels[i];
sumSq += pixels[i] * pixels[i];
}
const mean = sum / pixels.length;
const variance = sumSq / pixels.length - mean * mean;
// Skip near-black, near-white, or very low variance (blurry/blank/transition)
const valid = mean > 15 && mean < 240 && variance > 50;
return { valid, meanBrightness: mean, variance };
}
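The quality gate above reduces each frame to two statistics over its grayscale pixels: mean brightness and variance (computed as E[X²] − E[X]²), with the thresholds mean ∈ (15, 240) and variance > 50. This sketch applies the same formulas to a toy pixel buffer; `quickQuality` is an illustrative stand-in, not the real sharp-based implementation.

```typescript
// Same mean/variance computation as assessFrameQuality, minus the image decode.
function quickQuality(pixels: Uint8Array): { mean: number; variance: number; valid: boolean } {
  let sum = 0, sumSq = 0;
  for (let i = 0; i < pixels.length; i++) {
    sum += pixels[i];
    sumSq += pixels[i] * pixels[i];
  }
  const mean = sum / pixels.length;
  const variance = sumSq / pixels.length - mean * mean; // E[X^2] - E[X]^2
  return { mean, variance, valid: mean > 15 && mean < 240 && variance > 50 };
}

console.log(quickQuality(new Uint8Array([0, 0, 0, 0])).valid);       // → false (near-black)
console.log(quickQuality(new Uint8Array([40, 200, 90, 150])).valid); // → true (textured)
```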
async function filterQualityFrames(frames: ExtractedFrame[]): Promise<ExtractedFrame[]> {
const results = await Promise.all(
frames.map(async (frame) => {
try {
const q = await assessFrameQuality(frame.imagePath);
return { frame, valid: q.valid };
} catch {
return { frame, valid: true };
}
}),
);
const valid = results.filter(r => r.valid).map(r => r.frame);
return valid.length > 0 ? valid : frames;
}
function isValidBoundingBox(bbox: [number, number, number, number]): boolean {
const [x1, y1, x2, y2] = bbox;
return (
x1 >= 0 && x1 <= 1 &&
y1 >= 0 && y1 <= 1 &&
x2 >= 0 && x2 <= 1 &&
y2 >= 0 && y2 <= 1 &&
x1 < x2 &&
y1 < y2 &&
(x2 - x1) * (y2 - y1) > 0.005
);
}
// Skips Pass 1 filter entirely — ranks all frames and always returns the best one.
// Evenly samples down to maxCandidates when there are too many frames.
export async function detectBestFrame(
frames: ExtractedFrame[],
visionConfig: VisionConfig,
maxCandidates: number = 20,
): Promise<ProductFrame | null> {
if (frames.length === 0) return null;
// 1. Filter out obviously bad frames (black, white, blurry)
let candidates = await filterQualityFrames(frames);
// 2. Sample if too many
if (candidates.length > maxCandidates) {
const step = candidates.length / maxCandidates;
candidates = Array.from({ length: maxCandidates }, (_, i) => candidates[Math.floor(i * step)]);
}
const model = createVisionModel(visionConfig);
// 3. Check if product is a container/rack type (use first candidate frame)
const container = await isContainerProduct(candidates[0], model);
// 4. For containers: restrict ranking to earliest frames (empty/unboxing phase)
if (container) {
const early = takeEarliestFrames(candidates);
if (early.length > 0) candidates = early;
}
// 5. Try Vision ranking with error isolation
try {
const { bestFrame, description, reasoning, boundingBox } = await rankCandidates(candidates, model, container);
if (isValidBoundingBox(boundingBox)) {
const croppedPath = bestFrame.imagePath.replace(/\.jpg$/, '_cropped.jpg');
try {
await cropProduct(bestFrame.imagePath, boundingBox, croppedPath);
} catch {
// cropping is optional — keep original frame
}
return {
frameIndex: bestFrame.frameIndex,
timestampSeconds: bestFrame.timestampSeconds,
imagePath: bestFrame.imagePath,
...(croppedPath ? { croppedImagePath: croppedPath } : {}),
confidence: 0.95,
description,
boundingHint: reasoning,
};
}
} catch {
// Vision ranking failed — fall through to fallback
}
// 4. Fallback: rank by frame quality (variance) and return the sharpest
const withQuality = await Promise.all(
candidates.map(async (f) => {
try {
const q = await assessFrameQuality(f.imagePath);
return { frame: f, score: q.variance };
} catch {
return { frame: f, score: 0 };
}
}),
);
withQuality.sort((a, b) => b.score - a.score);
const best = withQuality[0].frame;
return {
frameIndex: best.frameIndex,
timestampSeconds: best.timestampSeconds,
imagePath: best.imagePath,
confidence: 0.5,
description: 'product frame (auto-selected)',
boundingHint: 'picked by frame quality analysis (Vision ranking failed)',
};
}
export async function detectProductFrames(
frames: ExtractedFrame[],
minConfidence: number,
concurrency: number = 10,
visionConfig: VisionConfig,
concurrency: number = 5,
): Promise<ProductFrame[]> {
const model = createVisionModel(visionConfig);
const model = createVisionModel();
// Pass 1: all frames in parallel, bounded by concurrency
const keepFlags = await withConcurrency(
frames.map((f) => () => filterFrame(f, model).catch(() => false)),
concurrency,
);
// Pass 1: parallel filter — discard junk frames
const keepFlags: boolean[] = [];
for (let i = 0; i < frames.length; i += concurrency) {
const chunk = frames.slice(i, i + concurrency);
const flags = await Promise.all(
chunk.map((f) => filterFrame(f, model).catch(() => false))
);
keepFlags.push(...flags);
}
const candidates = frames.filter((_, i) => keepFlags[i]);
if (candidates.length === 0) return [];
// Pass 2: single comparative call — model sees all candidates at once
const container = await isContainerProduct(candidates[0], model);
let bestSnapshot: ProductFrame | undefined;
try {
const { bestFrame, description, reasoning, boundingBox } = await rankCandidates(candidates, model, container);
const { bestFrame, description, reasoning, boundingBox } = await rankCandidates(candidates, model);
if (isValidBoundingBox(boundingBox)) {
const croppedPath = bestFrame.imagePath.replace(/\.jpg$/, '_cropped.jpg');
try {
await cropProduct(bestFrame.imagePath, boundingBox, croppedPath);
} catch {}
bestSnapshot = {
frameIndex: bestFrame.frameIndex,
timestampSeconds: bestFrame.timestampSeconds,
imagePath: bestFrame.imagePath,
...(croppedPath ? { croppedImagePath: croppedPath } : {}),
confidence: 0.95,
description,
boundingHint: reasoning,
};
}
} catch {
// ranking failed
}
const croppedPath = bestFrame.imagePath.replace(/\.jpg$/, '_cropped.jpg');
await cropProduct(bestFrame.imagePath, boundingBox, croppedPath);
if (!bestSnapshot) {
return [];
}
return [bestSnapshot];
return [{
frameIndex: bestFrame.frameIndex,
timestampSeconds: bestFrame.timestampSeconds,
imagePath: bestFrame.imagePath,
croppedImagePath: croppedPath,
confidence: 0.95,
description,
boundingHint: reasoning,
}];
}


@@ -1,13 +1,4 @@
export type Command =
| 'detect'
| 'search'
| 'detect-and-search'
| 'detect-best'
| 'detect-best-and-search'
| 'detect-video'
| 'detect-video-and-search'
| 'rerank'
| 'session';
export type Command = 'detect' | 'search' | 'detect-and-search' | 'rerank' | 'session';
export interface SearchItem {
num_iid: number;
@@ -20,30 +11,6 @@ export interface SearchItem {
detail_url: string;
}
export interface DetectVideoResult {
status: 'success' | 'failed';
command: 'detect-video';
dryRun: boolean;
videoPath?: string;
videoUrl?: string | null;
description?: string;
keyword?: string;
snapshotImagePath?: string;
error?: string;
}
export interface DetectVideoAndSearchResult {
status: 'success' | 'failed';
command: 'detect-video-and-search';
dryRun: boolean;
videoPath?: string;
videoUrl?: string | null;
description?: string;
keyword?: string;
searchResults?: SearchItem[];
error?: string;
}
export interface DetectOptions {
videoPath: string;
intervalSeconds: number;
@@ -84,4 +51,4 @@ export interface SearchResult {
error?: string;
}
export type OutputResult = DetectResult | SearchResult | DetectVideoResult | DetectVideoAndSearchResult;
export type OutputResult = DetectResult | SearchResult;