feat: 1688 logistics scraper — extract weight/size from product pages
register-skill-release / register (push) Failing after 24s
Scrapes 1688 product pages via the Chrome browser to extract logistics data (weight, dimensions, volume) from attributes, variants, and detail images. Zero npm dependencies: uses raw CDP over WebSocket.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
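As a rough illustration of what "raw CDP over WebSocket" involves, the sketch below picks a page tab from the target list that Chrome serves at `http://127.0.0.1:<port>/json` and builds the JSON frame for one command. The names `CdpTarget`, `pickPageTarget`, and `buildCommand` are hypothetical and do not appear in the commit; this is a simplified sketch, not the commit's `CdpSession` class.

```typescript
// One entry of the target list returned by Chrome's /json endpoint.
// Only the fields used here are declared.
interface CdpTarget {
  type: string;
  url: string;
  webSocketDebuggerUrl?: string;
}

// Hypothetical helper: choose the first tab that can be debugged.
function pickPageTarget(targets: CdpTarget[]): CdpTarget {
  const page = targets.find(t => t.type === 'page' && !!t.webSocketDebuggerUrl);
  if (!page) throw new Error('No Chrome page tab found. Open a tab first.');
  return page;
}

// Hypothetical helper: serialize one CDP command.
// The id lets the caller match the response to the request.
function buildCommand(id: number, method: string, params: Record<string, unknown> = {}): string {
  return JSON.stringify({ id, method, params });
}
```

The string from `buildCommand` is what would be passed to `WebSocket.send`; Chrome answers with a message carrying the same `id`.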
parent 8be28aab5a
commit 99ce9d96d1
71
README.md
|
|
@@ -1,65 +1,32 @@
|
|||
# template-skill
|
||||
# 1688-logistics-scraper
|
||||
|
||||
Base template for new skills.
|
||||
Extracts logistics data (weight, dimensions, volume) from 1688 product pages.
|
||||
|
||||
## Authentication: auth-cli.ts
|
||||
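Connects to an already-running Chrome browser over the Chrome DevTools Protocol (CDP), automatically extracts logistics data from product attributes and SKU variants, and downloads the detail images for further analysis.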
|
||||
|
||||
Each skill ships its own `src/auth-cli.ts`, a thin wrapper that calls the `auth-rt` binary through a subprocess.
|
||||
## Prerequisites
|
||||
|
||||
**No npm dependency is used**: when auth-runtime is updated, only the binary needs to be recompiled, and no skill has to change.
|
||||
|
||||
### How it works
|
||||
|
||||
```
|
||||
skill/src/index.ts
|
||||
→ import { createSkillClient } from './auth-cli.ts'
|
||||
→ auth-cli.ts calls the auth-rt binary via spawnSync
|
||||
→ auth-rt handles token/session/request
|
||||
```
|
||||
|
||||
### Usage
|
||||
|
||||
```typescript
|
||||
import { createSkillClient } from './auth-cli.ts';
|
||||
|
||||
const client = createSkillClient({
|
||||
apiBase: process.env.API_BASE, // optional
|
||||
dryRun: false, // optional; dry-run mode returns mock data
|
||||
});
|
||||
|
||||
// API call
|
||||
const res = await client.post('/ecom/your/endpoint', { param: 'value' });
|
||||
// res = { status: 200, body: '...' }
|
||||
|
||||
// Get the session
|
||||
const session = await client.session();
|
||||
// session = { accessToken: '...', expiresIn: 900 }
|
||||
```
|
||||
|
||||
### Prerequisites
|
||||
|
||||
The `auth-rt` binary must be installed on every machine that runs a skill:
|
||||
Start Chrome with remote debugging enabled:
|
||||
|
||||
```bash
|
||||
git clone http://192.168.0.108:3030/agent-skills/auth-runtime.git ~/clawd/skills/auth-runtime
|
||||
cd ~/clawd/skills/auth-runtime && ./install.sh
|
||||
# Installs to ~/.openclaw/bin/auth-rt
|
||||
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
|
||||
```
|
||||
|
||||
Make sure `~/.openclaw/bin` is in PATH, or set the path through the `AUTH_RT_BIN` environment variable.
|
||||
## Installation
|
||||
|
||||
### auth-runtime update procedure
|
||||
|
||||
After the auth-runtime code changes:
|
||||
```bash
|
||||
cd ~/clawd/skills/auth-runtime && git pull && ./install.sh
|
||||
bash install.sh
|
||||
```
|
||||
Recompiling is enough; **no skill code needs to change**.
|
||||
|
||||
### Checklist for a new skill
|
||||
## Usage
|
||||
|
||||
1. Create the repository from this template
2. Confirm that `src/auth-cli.ts` is included (inherited directly from the template)
3. In `src/index.ts`, `import { createSkillClient } from './auth-cli.ts'`
4. Do **not** add the `@clawd/auth-runtime` dependency to `package.json`
5. Make sure `install.sh` includes the auth-rt binary check
|
||||
```bash
|
||||
bun scripts/run.ts scrape 'https://detail.1688.com/offer/852504650877.html'
|
||||
```
|
||||
|
||||
## Data sources
|
||||
|
||||
1. Product attribute table (商品属性 / 商品参数)
2. SKU/variant specs
3. Logistics section
4. Product detail images (downloaded to `/tmp/1688-logistics/<offer-id>/`)
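To make the download location concrete, here is a minimal sketch of how a saved image path could be derived from the product URL. It follows the naming visible in this commit (`/tmp/1688-logistics/<offer-id>/img_001.jpg`), but `offerIdFromUrl` and `imagePath` are hypothetical names used only for illustration.

```typescript
// Pull the numeric offer id out of a 1688 product URL; fall back to 'unknown'.
function offerIdFromUrl(url: string): string {
  return url.match(/offer\/(\d+)/)?.[1] ?? 'unknown';
}

// Build the local path for the image at zero-based position `index`.
// The extension is taken from the image URL and defaults to jpg.
function imagePath(pageUrl: string, index: number, imageUrl: string): string {
  const ext = imageUrl.match(/\.(jpg|jpeg|png|webp|gif)/i)?.[1] ?? 'jpg';
  const name = `img_${String(index + 1).padStart(3, '0')}.${ext}`;
  return `/tmp/1688-logistics/${offerIdFromUrl(pageUrl)}/${name}`;
}
```

A URL without an `offer/<digits>` segment lands in the `unknown` directory, so images from different unrecognized pages would overwrite each other there.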
|
||||
|
|
|
|||
73
SKILL.md
|
|
@@ -1,26 +1,75 @@
|
|||
---
|
||||
name: my-skill
|
||||
description: "TODO: describe what this skill does and when to use it."
|
||||
name: 1688-logistics-scraper
|
||||
description: "Extract product weight/size/logistics data from 1688 product pages via Chrome browser, output structured JSON. Use when the user provides a 1688 product URL and needs logistics specs."
|
||||
---
|
||||
|
||||
# my-skill
|
||||
# 1688 Logistics Scraper
|
||||
|
||||
TODO: one-line description.
|
||||
|
||||
> Auth is handled automatically via `auth-cli.ts` → `auth-runtime` CLI.
|
||||
Extract product weight, size, and logistics data from 1688 product pages.
|
||||
|
||||
## Run
|
||||
|
||||
```bash
|
||||
bun scripts/run.ts <command> [args] [--dry-run]
|
||||
bun scripts/run.ts scrape <url> [--dry-run]
|
||||
```
|
||||
|
||||
## Commands
|
||||
### Examples
|
||||
|
||||
| Command | Description |
|
||||
|---------|-------------|
|
||||
| `run <arg>` | TODO: describe |
|
||||
```bash
|
||||
bun scripts/run.ts scrape 'https://detail.1688.com/offer/852504650877.html'
|
||||
bun scripts/run.ts scrape 'https://detail.1688.com/offer/852504650877.html' --dry-run
|
||||
```
|
||||
|
||||
## What It Does
|
||||
|
||||
1. Opens the 1688 product URL in the browser
|
||||
2. Extracts weight/size data from wherever it appears on the page — product attributes, variant specs, logistics section
|
||||
3. Downloads detail images (商品详情图片) for analysis — weight/size is often only in images
|
||||
4. Outputs structured JSON
|
||||
|
||||
## Where To Look For Data
|
||||
|
||||
Weight/size data on 1688 pages hides in multiple places. Check all before giving up:
|
||||
|
||||
1. **Product attributes** (商品属性 / 商品参数) — key-value table, most reliable
|
||||
2. **Variant/SKU specs** — per-variant weight or size
|
||||
3. **Logistics section** — shipping weight, volume, freight info
|
||||
4. **Detail images** — downloaded to `/tmp/1688-logistics/<offer-id>/`, read them to find weight/size text baked into images
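One way to route the scraped key-value attributes to the right output field is a small keyword table. The sketch below is a simplification under stated assumptions: `KEY_TABLE` and `classifyKey` are hypothetical names, the keyword lists are a subset of those in the commit, and the first match wins, whereas the commit checks dimensions and volume independently of weight.

```typescript
type Field = 'grossWeight' | 'netWeight' | 'packageWeight' | 'weight' | 'dimensions' | 'volume';

// More specific keys come first: '包装重量' also contains '重量',
// so it has to be tested before the generic weight entry.
const KEY_TABLE: Array<[Field, string[]]> = [
  ['grossWeight', ['毛重']],
  ['netWeight', ['净重']],
  ['packageWeight', ['包装重量']],
  ['weight', ['重量', 'weight']],
  ['dimensions', ['尺寸', '长宽高', 'size']],
  ['volume', ['体积', 'volume']],
];

// Return the output field an attribute key belongs to, or null if unrelated.
function classifyKey(key: string): Field | null {
  const lower = key.toLowerCase();
  for (const [field, needles] of KEY_TABLE) {
    if (needles.some(n => lower.includes(n.toLowerCase()))) return field;
  }
  return null;
}
```

Substring matching keeps the table short, at the cost of occasional false positives on keys that merely contain a keyword.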
|
||||
|
||||
## Output
|
||||
|
||||
Returns JSON: `{ "status": "success" | "failed", "data": ... }`
|
||||
```json
|
||||
{
|
||||
"status": "success",
|
||||
"url": "https://detail.1688.com/offer/...",
|
||||
"product": {
|
||||
"title": "产品标题",
|
||||
"logistics": {
|
||||
"weight": { "value": 0.5, "unit": "kg", "source": "attributes" },
|
||||
"dimensions": { "length": 30, "width": 20, "height": 10, "unit": "cm", "source": "attributes" },
|
||||
"grossWeight": null,
|
||||
"netWeight": null,
|
||||
"packageWeight": null,
|
||||
"volume": null,
|
||||
"shippingMethod": null,
|
||||
"shippingCost": null,
|
||||
"origin": null
|
||||
},
|
||||
"variants": [
|
||||
{ "name": "颜色: 红色", "weight": null, "dimensions": null }
|
||||
]
|
||||
},
|
||||
"detailImages": ["/tmp/1688-logistics/852504650877/img_001.jpg"],
|
||||
"rawAttributes": { "重量": "0.5kg", "尺寸": "30*20*10cm" }
|
||||
}
|
||||
```
|
||||
|
||||
`null` = not found in text. Check `detailImages` — the data may be in the images.
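A consumer of this JSON might first list which fields came back `null` before deciding whether the images are worth reading. A minimal sketch, assuming hypothetical helpers `missingFields` and `needsImageReview` that are not part of the commit:

```typescript
// Names of the logistics fields that were not found in any text source.
function missingFields(logistics: Record<string, unknown>): string[] {
  return Object.entries(logistics)
    .filter(([, value]) => value === null)
    .map(([key]) => key);
}

// Images only need a look when something is missing and images were saved.
function needsImageReview(logistics: Record<string, unknown>, detailImages: string[]): boolean {
  return missingFields(logistics).length > 0 && detailImages.length > 0;
}
```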
|
||||
|
||||
## Rules
|
||||
|
||||
1. If the browser is not running, report the error. Do not try to launch it.
|
||||
2. Check all data sources before reporting `null`.
|
||||
3. Normalize units: 克→kg, 毫米→cm. Keep raw values in `rawAttributes`.
|
||||
4. No retries. If it fails, report as-is.
|
||||
5. Trust page content. Do not guess values.
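Rule 3 can be sketched as two small converters. The factors mirror the ones used by `parseWeight` and `parseDimensions` in this commit (including 斤 as 0.5 kg); the function names `toKg` and `toCm` are hypothetical. Unknown units raise an error instead of being guessed, in line with rule 5.

```typescript
// Convert a weight to kilograms.
function toKg(value: number, unit: string): number {
  switch (unit.toLowerCase()) {
    case 'kg': case '千克': case '公斤': return value;
    case 'g': case '克': return value / 1000;
    case '斤': return value * 0.5;
    default: throw new Error(`unknown weight unit: ${unit}`);
  }
}

// Convert a length to centimetres.
function toCm(value: number, unit: string): number {
  switch (unit.toLowerCase()) {
    case 'cm': case '厘米': return value;
    case 'mm': case '毫米': return value / 10;
    case 'm': case '米': return value * 100;
    default: throw new Error(`unknown length unit: ${unit}`);
  }
}
```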
|
||||
|
|
|
|||
27
install.sh
|
|
@@ -2,25 +2,8 @@
|
|||
set -euo pipefail
|
||||
cd "$(dirname "$0")"
|
||||
|
||||
# Auto-install auth-rt if not found
|
||||
if ! command -v auth-rt &>/dev/null && [ ! -x "$HOME/.local/bin/auth-rt" ]; then
|
||||
echo "auth-rt not found, installing..."
|
||||
_FORGEJO="http://192.168.0.108:3030"
|
||||
_OS="$(uname -s | tr '[:upper:]' '[:lower:]')"
|
||||
_ARCH="$(uname -m)"; case "$_ARCH" in x86_64) _ARCH="amd64";; aarch64) _ARCH="arm64";; esac
|
||||
_URL="$_FORGEJO/agent-skills/auth-runtime/releases/download/latest/auth-rt-${_OS}-${_ARCH}"
|
||||
mkdir -p "$HOME/.local/bin"
|
||||
if curl -fsSL "$_URL" -o "$HOME/.local/bin/auth-rt" 2>/dev/null; then
|
||||
chmod +x "$HOME/.local/bin/auth-rt"
|
||||
echo "auth-rt installed (downloaded)"
|
||||
else
|
||||
echo "Download failed, building from source..."
|
||||
_SRC="$HOME/.local/share/auth-runtime"
|
||||
if [ -d "$_SRC/.git" ]; then git -C "$_SRC" pull --ff-only
|
||||
else git clone --depth 1 "$_FORGEJO/agent-skills/auth-runtime.git" "$_SRC"
|
||||
fi
|
||||
bash "$_SRC/install.sh"
|
||||
fi
|
||||
fi
|
||||
|
||||
npm install
|
||||
bun install
|
||||
echo "1688-logistics-scraper installed."
|
||||
echo ""
|
||||
echo "Prerequisites: Chrome must be running with remote debugging:"
|
||||
echo " /Applications/Google\\ Chrome.app/Contents/MacOS/Google\\ Chrome --remote-debugging-port=9222"
|
||||
|
|
|
|||
|
|
@@ -1,5 +1,5 @@
|
|||
{
|
||||
"name": "my-skill",
|
||||
"name": "1688-logistics-scraper",
|
||||
"version": "0.1.0",
|
||||
"type": "module",
|
||||
"scripts": {
|
||||
|
|
|
|||
|
|
@@ -4,24 +4,31 @@ import { run } from '../src/index.ts';
|
|||
|
||||
function printUsage(): void {
|
||||
console.error(`Usage:
|
||||
bun scripts/run.ts [--api-base=<url>] <command> [args...] [--dry-run]
|
||||
bun scripts/run.ts [--port=<cdp-port>] <command> [args...] [--dry-run]
|
||||
|
||||
Commands:
|
||||
run <arg>
|
||||
scrape <1688-url> Scrape logistics data (weight/size) from product page
|
||||
|
||||
Config: ~/.openclaw/.env (API_BASE)
|
||||
Examples:
|
||||
bun scripts/run.ts scrape 'https://detail.1688.com/offer/852504650877.html'
|
||||
bun scripts/run.ts scrape 'https://detail.1688.com/offer/852504650877.html' --dry-run
|
||||
bun scripts/run.ts --port=9223 scrape 'https://detail.1688.com/offer/852504650877.html'
|
||||
|
||||
Prerequisites:
|
||||
Chrome must be running with --remote-debugging-port=9222
|
||||
`);
|
||||
}
|
||||
|
||||
async function main(): Promise<void> {
|
||||
const positionals: string[] = [];
|
||||
let dryRun = false;
|
||||
let port = 9222;
|
||||
|
||||
for (const arg of process.argv.slice(2)) {
|
||||
if (arg === '--dry-run') {
|
||||
dryRun = true;
|
||||
} else if (arg.startsWith('--api-base=')) {
|
||||
process.env.API_BASE = arg.slice('--api-base='.length).trim();
|
||||
} else if (arg.startsWith('--port=')) {
|
||||
port = parseInt(arg.slice('--port='.length), 10);
|
||||
} else if (arg === '-h' || arg === '--help') {
|
||||
printUsage(); process.exit(0);
|
||||
} else {
|
||||
|
|
@@ -31,11 +38,14 @@ async function main(): Promise<void> {
|
|||
|
||||
if (positionals.length < 1) { printUsage(); process.exit(1); }
|
||||
|
||||
const result = await run(positionals[0] as Command, positionals.slice(1), dryRun);
|
||||
const result = await run(positionals[0] as Command, positionals.slice(1), dryRun, port);
|
||||
console.log(JSON.stringify(result, null, 2));
|
||||
}
|
||||
|
||||
main().catch((err) => {
|
||||
console.error(JSON.stringify({ status: 'failed', error: err instanceof Error ? err.message : String(err) }, null, 2));
|
||||
console.error(JSON.stringify({
|
||||
status: 'failed',
|
||||
error: err instanceof Error ? err.message : String(err),
|
||||
}, null, 2));
|
||||
process.exit(1);
|
||||
});
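The `--port` flag in this hunk is parsed with `parseInt` and passed on unchecked, so a value such as `--port=abc` would reach the CDP connection as `NaN`. A stricter parser might look like the sketch below; `parsePort` is a hypothetical helper and is not part of the commit.

```typescript
// Parse a `--port=<n>` argument; return the fallback for any other argument.
// Rejects non-numeric values and ports outside 1..65535.
function parsePort(arg: string, fallback = 9222): number {
  const prefix = '--port=';
  if (!arg.startsWith(prefix)) return fallback;
  const raw = arg.slice(prefix.length);
  const port = Number(raw);
  if (!/^\d+$/.test(raw) || port < 1 || port > 65535) {
    throw new Error(`invalid --port value: ${raw}`);
  }
  return port;
}
```

Failing early here would produce a clearer error than a connection failure later.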
|
||||
|
|
|
|||
119
src/auth-cli.ts
|
|
@@ -1,119 +0,0 @@
|
|||
/**
|
||||
* Thin CLI wrapper for auth-runtime.
|
||||
*
|
||||
* Copy this file into your skill's src/ directory. It calls the
|
||||
* `auth-rt` binary (a standalone Go executable), so the skill has
|
||||
* zero npm/runtime dependency on auth-runtime.
|
||||
*
|
||||
* Prerequisites:
|
||||
* `auth-rt` must be in PATH or at ~/.local/bin/auth-rt
|
||||
* (install.sh handles this automatically)
|
||||
*
|
||||
* Usage:
|
||||
* import { createSkillClient } from './auth-cli.ts';
|
||||
* const client = createSkillClient();
|
||||
* const res = await client.post('/ecom/tasks/scrape', { url: '...' });
|
||||
*/
|
||||
|
||||
import { spawnSync } from 'child_process';
|
||||
import * as path from 'path';
|
||||
import * as os from 'os';
|
||||
|
||||
const home = process.env.HOME || os.homedir();
|
||||
const AUTH_RT_BIN = process.env.AUTH_RT_BIN
|
||||
|| (() => {
|
||||
// Check if auth-rt is in PATH
|
||||
const which = spawnSync('which', ['auth-rt'], { encoding: 'utf-8' });
|
||||
if (which.status === 0 && which.stdout.trim()) {
|
||||
return which.stdout.trim();
|
||||
}
|
||||
return path.join(home, '.local', 'bin', 'auth-rt');
|
||||
})();
|
||||
|
||||
export interface ApiResponse {
|
||||
status: number;
|
||||
body: string;
|
||||
}
|
||||
|
||||
export interface SessionResponse {
|
||||
accessToken: string;
|
||||
expiresIn: number;
|
||||
ownerSessionToken?: string;
|
||||
hookUrl?: string;
|
||||
hookToken?: string;
|
||||
}
|
||||
|
||||
export interface SkillClientOptions {
|
||||
apiBase?: string;
|
||||
dryRun?: boolean;
|
||||
}
|
||||
|
||||
function runCli(...args: string[]): string {
|
||||
const result = spawnSync(AUTH_RT_BIN, args, {
|
||||
encoding: 'utf-8',
|
||||
timeout: 60_000,
|
||||
});
|
||||
|
||||
if (result.error) {
|
||||
throw new Error(`auth-rt spawn failed: ${result.error.message}`);
|
||||
}
|
||||
if (result.status !== 0) {
|
||||
throw new Error(`auth-rt failed (exit ${result.status}): ${(result.stderr || '').trim()}`);
|
||||
}
|
||||
return (result.stdout || '').trim();
|
||||
}
|
||||
|
||||
export class SkillClient {
|
||||
private readonly apiBase?: string;
|
||||
private readonly dryRun: boolean;
|
||||
|
||||
constructor(options: SkillClientOptions = {}) {
|
||||
this.apiBase = options.apiBase;
|
||||
this.dryRun = options.dryRun ?? false;
|
||||
}
|
||||
|
||||
async session(): Promise<SessionResponse> {
|
||||
if (this.dryRun) {
|
||||
return { accessToken: '<dry-run-token>', expiresIn: 900 };
|
||||
}
|
||||
return JSON.parse(runCli('session'));
|
||||
}
|
||||
|
||||
async get(urlPath: string): Promise<ApiResponse> {
|
||||
return this.request('GET', urlPath);
|
||||
}
|
||||
|
||||
async post(urlPath: string, body?: unknown): Promise<ApiResponse> {
|
||||
return this.request('POST', urlPath, body);
|
||||
}
|
||||
|
||||
async put(urlPath: string, body?: unknown): Promise<ApiResponse> {
|
||||
return this.request('PUT', urlPath, body);
|
||||
}
|
||||
|
||||
async patch(urlPath: string, body?: unknown): Promise<ApiResponse> {
|
||||
return this.request('PATCH', urlPath, body);
|
||||
}
|
||||
|
||||
async delete(urlPath: string, body?: unknown): Promise<ApiResponse> {
|
||||
return this.request('DELETE', urlPath, body);
|
||||
}
|
||||
|
||||
private async request(method: string, urlPath: string, body?: unknown): Promise<ApiResponse> {
|
||||
if (this.dryRun) {
|
||||
return { status: 200, body: JSON.stringify({ dryRun: true, method, path: urlPath }) };
|
||||
}
|
||||
const args = ['request', method, urlPath];
|
||||
if (body != null) {
|
||||
args.push('--body', JSON.stringify(body));
|
||||
}
|
||||
if (this.apiBase) {
|
||||
args.push('--api-base', this.apiBase);
|
||||
}
|
||||
return JSON.parse(runCli(...args));
|
||||
}
|
||||
}
|
||||
|
||||
export function createSkillClient(options?: SkillClientOptions): SkillClient {
|
||||
return new SkillClient(options);
|
||||
}
|
||||
351
src/index.ts
|
|
@@ -1,34 +1,355 @@
|
|||
import { createSkillClient, type ApiResponse } from './auth-cli.ts';
|
||||
import * as fs from 'fs';
|
||||
import * as path from 'path';
|
||||
|
||||
export type Command = 'run'; // TODO: add your commands
|
||||
export type Command = 'scrape';
|
||||
|
||||
export interface RunResult {
|
||||
export interface LogisticsValue {
|
||||
value: number | null;
|
||||
unit: string | null;
|
||||
source: string;
|
||||
}
|
||||
|
||||
export interface Dimensions {
|
||||
length: number | null;
|
||||
width: number | null;
|
||||
height: number | null;
|
||||
unit: string | null;
|
||||
source: string;
|
||||
}
|
||||
|
||||
export interface LogisticsData {
|
||||
weight: LogisticsValue | null;
|
||||
dimensions: Dimensions | null;
|
||||
grossWeight: LogisticsValue | null;
|
||||
netWeight: LogisticsValue | null;
|
||||
packageWeight: LogisticsValue | null;
|
||||
volume: LogisticsValue | null;
|
||||
shippingMethod: string | null;
|
||||
shippingCost: string | null;
|
||||
origin: string | null;
|
||||
}
|
||||
|
||||
export interface VariantInfo {
|
||||
name: string;
|
||||
weight: LogisticsValue | null;
|
||||
dimensions: Dimensions | null;
|
||||
}
|
||||
|
||||
export interface ScrapeResult {
|
||||
status: 'success' | 'failed';
|
||||
url: string;
|
||||
command: Command;
|
||||
dryRun: boolean;
|
||||
data?: unknown;
|
||||
product?: {
|
||||
title: string;
|
||||
logistics: LogisticsData;
|
||||
variants: VariantInfo[];
|
||||
};
|
||||
detailImages?: string[];
|
||||
rawAttributes?: Record<string, string>;
|
||||
error?: string;
|
||||
}
|
||||
|
||||
// --- CDP helpers (raw WebSocket, no npm deps) ---
|
||||
|
||||
interface CdpResult {
|
||||
id: number;
|
||||
result?: any;
|
||||
error?: { message: string };
|
||||
}
|
||||
|
||||
class CdpSession {
|
||||
private ws!: WebSocket;
|
||||
private msgId = 0;
|
||||
private pending = new Map<number, { resolve: (v: any) => void; reject: (e: Error) => void }>();
|
||||
|
||||
static async connect(port: number): Promise<CdpSession> {
|
||||
const resp = await fetch(`http://127.0.0.1:${port}/json`);
|
||||
const targets = (await resp.json()) as Array<{ webSocketDebuggerUrl: string; type: string }>;
|
||||
const page = targets.find(t => t.type === 'page');
|
||||
if (!page) throw new Error('No Chrome page tab found. Open a tab first.');
|
||||
const session = new CdpSession();
|
||||
await session.open(page.webSocketDebuggerUrl);
|
||||
return session;
|
||||
}
|
||||
|
||||
private open(wsUrl: string): Promise<void> {
|
||||
return new Promise((resolve, reject) => {
|
||||
this.ws = new WebSocket(wsUrl);
|
||||
this.ws.onopen = () => resolve();
|
||||
this.ws.onerror = (e: any) => reject(new Error(`WebSocket error: ${e.message || e}`));
|
||||
this.ws.onmessage = (ev: MessageEvent) => {
|
||||
const msg: CdpResult = JSON.parse(typeof ev.data === 'string' ? ev.data : ev.data.toString());
|
||||
if (msg.id != null && this.pending.has(msg.id)) {
|
||||
const p = this.pending.get(msg.id)!;
|
||||
this.pending.delete(msg.id);
|
||||
if (msg.error) p.reject(new Error(msg.error.message));
|
||||
else p.resolve(msg.result);
|
||||
}
|
||||
};
|
||||
});
|
||||
}
|
||||
|
||||
send(method: string, params: Record<string, any> = {}): Promise<any> {
|
||||
const id = ++this.msgId;
|
||||
return new Promise((resolve, reject) => {
|
||||
this.pending.set(id, { resolve, reject });
|
||||
this.ws.send(JSON.stringify({ id, method, params }));
|
||||
});
|
||||
}
|
||||
|
||||
async evaluate(expression: string): Promise<any> {
|
||||
const res = await this.send('Runtime.evaluate', { expression, returnByValue: true });
|
||||
return res?.result?.value;
|
||||
}
|
||||
|
||||
close() {
|
||||
try { this.ws.close(); } catch {}
|
||||
}
|
||||
}
|
||||
|
||||
// --- Parsers ---
|
||||
|
||||
const WEIGHT_KEYS = ['重量', '毛重', '净重', '单件重量', '包装重量', '产品重量', '单品重量', 'weight'];
|
||||
const DIMENSION_KEYS = ['尺寸', '规格', '长宽高', '外箱尺寸', '包装尺寸', '产品尺寸', '大小', 'size', 'dimensions'];
|
||||
const VOLUME_KEYS = ['体积', '容积', 'volume'];
|
||||
|
||||
function extractOfferId(url: string): string {
|
||||
return url.match(/offer\/(\d+)/)?.[1] || 'unknown';
|
||||
}
|
||||
|
||||
function parseWeight(raw: string): LogisticsValue | null {
|
||||
const m = raw.match(/([\d.]+)\s*(kg|g|克|千克|公斤|斤)/i);
|
||||
if (!m) return null;
|
||||
let value = parseFloat(m[1]);
|
||||
let unit = m[2].toLowerCase();
|
||||
if (unit === 'g' || unit === '克') { value /= 1000; unit = 'kg'; }
|
||||
if (unit === '千克' || unit === '公斤') unit = 'kg';
|
||||
if (unit === '斤') { value *= 0.5; unit = 'kg'; }
|
||||
return { value, unit, source: '' };
|
||||
}
|
||||
|
||||
function parseDimensions(raw: string): Dimensions | null {
|
||||
const m = raw.match(/([\d.]+)\s*[*xX×]\s*([\d.]+)\s*[*xX×]\s*([\d.]+)\s*(cm|mm|毫米|厘米|m|米)?/i);
|
||||
if (!m) return null;
|
||||
let [l, w, h] = [parseFloat(m[1]), parseFloat(m[2]), parseFloat(m[3])];
|
||||
let unit = (m[4] || 'cm').toLowerCase();
|
||||
if (unit === 'mm' || unit === '毫米') { l /= 10; w /= 10; h /= 10; unit = 'cm'; }
|
||||
if (unit === '厘米') unit = 'cm';
|
||||
if (unit === 'm' || unit === '米') { l *= 100; w *= 100; h *= 100; unit = 'cm'; }
|
||||
return { length: l, width: w, height: h, unit, source: '' };
|
||||
}
|
||||
|
||||
function parseVolume(raw: string): LogisticsValue | null {
|
||||
const m = raw.match(/([\d.]+)\s*(m³|cm³|L|ml|升|毫升|立方米|立方厘米)/i);
|
||||
if (!m) return null;
|
||||
return { value: parseFloat(m[1]), unit: m[2], source: '' };
|
||||
}
|
||||
|
||||
function matchKey(text: string, keys: string[]): boolean {
|
||||
const lower = text.toLowerCase();
|
||||
return keys.some(k => lower.includes(k.toLowerCase()));
|
||||
}
|
||||
|
||||
// --- Page extraction ---
|
||||
|
||||
const JS_EXTRACT_ATTRS = `
|
||||
(function() {
|
||||
const attrs = {};
|
||||
const sels = [
|
||||
'.detail-attributes-list .attributes-item',
|
||||
'.obj-leading .obj-content li',
|
||||
'#mod-detail-attributes .attribute-item',
|
||||
'.detail-info table tr',
|
||||
'[class*="attribute"] li',
|
||||
'[class*="param"] li',
|
||||
'.offer-attr-list .offer-attr-item',
|
||||
];
|
||||
for (const sel of sels) {
|
||||
document.querySelectorAll(sel).forEach(el => {
|
||||
const parts = el.textContent.trim().split(/[::]/);
|
||||
if (parts.length >= 2) attrs[parts[0].trim()] = parts.slice(1).join(':').trim();
|
||||
});
|
||||
}
|
||||
document.querySelectorAll('table tr, .detail-attributes-list tr').forEach(tr => {
|
||||
const cells = tr.querySelectorAll('td, th');
|
||||
if (cells.length >= 2) attrs[cells[0].textContent.trim()] = cells[1].textContent.trim();
|
||||
});
|
||||
return JSON.stringify(attrs);
|
||||
})()`;
|
||||
|
||||
const JS_EXTRACT_VARIANTS = `
|
||||
(function() {
|
||||
const variants = [];
|
||||
const sels = [
|
||||
'.sku-item-wrapper .sku-item',
|
||||
'[class*="sku"] [class*="item"]',
|
||||
'.obj-sku .obj-content li',
|
||||
'.unit-detail-spec-operator .spec-item',
|
||||
];
|
||||
for (const sel of sels) {
|
||||
document.querySelectorAll(sel).forEach(el => {
|
||||
const name = el.textContent.trim().replace(/\\s+/g, ' ');
|
||||
if (name && name.length < 200) variants.push({ name, text: el.textContent });
|
||||
});
|
||||
}
|
||||
return JSON.stringify(variants);
|
||||
})()`;
|
||||
|
||||
const JS_EXTRACT_TITLE = `
|
||||
(function() {
|
||||
for (const sel of ['.title-text','.detail-title-text','h1[class*="title"]','.mod-detail-title h1','.d-title']) {
|
||||
const el = document.querySelector(sel);
|
||||
if (el && el.textContent.trim()) return el.textContent.trim();
|
||||
}
|
||||
return document.title || '';
|
||||
})()`;
|
||||
|
||||
const JS_EXTRACT_IMAGES = `
|
||||
(function() {
|
||||
const imgs = [], seen = new Set();
|
||||
const sels = [
|
||||
'#desc-lazyload-container img',
|
||||
'.detail-desc-decorate-richtext img',
|
||||
'[class*="detail-desc"] img',
|
||||
'.mod-detail-description img',
|
||||
'.offer-attr-item img',
|
||||
'.desc-img-loaded img',
|
||||
];
|
||||
for (const sel of sels) {
|
||||
document.querySelectorAll(sel).forEach(img => {
|
||||
const src = img.src || img.dataset.src || img.dataset.lazySrc || '';
|
||||
if (src && !seen.has(src) && (src.startsWith('http') || src.startsWith('//'))) {
|
||||
seen.add(src);
|
||||
imgs.push(src.startsWith('//') ? 'https:' + src : src);
|
||||
}
|
||||
});
|
||||
}
|
||||
return JSON.stringify(imgs);
|
||||
})()`;
|
||||
|
||||
async function downloadImages(urls: string[], outputDir: string): Promise<string[]> {
|
||||
fs.mkdirSync(outputDir, { recursive: true });
|
||||
const saved: string[] = [];
|
||||
for (let i = 0; i < urls.length; i++) {
|
||||
try {
|
||||
const resp = await fetch(urls[i]);
|
||||
if (!resp.ok) continue;
|
||||
const buf = Buffer.from(await resp.arrayBuffer());
|
||||
const ext = urls[i].match(/\.(jpg|jpeg|png|webp|gif)/i)?.[1] || 'jpg';
|
||||
const p = path.join(outputDir, `img_${String(i + 1).padStart(3, '0')}.${ext}`);
|
||||
fs.writeFileSync(p, buf);
|
||||
saved.push(p);
|
||||
} catch {}
|
||||
}
|
||||
return saved;
|
||||
}
|
||||
|
||||
// --- Main ---
|
||||
|
||||
export async function run(
|
||||
command: Command,
|
||||
args: string[],
|
||||
dryRun: boolean,
|
||||
): Promise<RunResult> {
|
||||
const client = createSkillClient({
|
||||
apiBase: process.env.API_BASE,
|
||||
dryRun,
|
||||
cdpPort: number = 9222,
|
||||
): Promise<ScrapeResult> {
|
||||
if (command !== 'scrape') {
|
||||
return { status: 'failed', url: '', command, dryRun, error: `unknown command: ${command}` };
|
||||
}
|
||||
|
||||
const url = args[0];
|
||||
if (!url) {
|
||||
return { status: 'failed', url: '', command, dryRun, error: 'scrape requires <url>' };
|
||||
}
|
||||
|
||||
if (dryRun) {
|
||||
return {
|
||||
status: 'success', url, command, dryRun,
|
||||
product: {
|
||||
title: '<dry-run>',
|
||||
logistics: {
|
||||
weight: null, dimensions: null, grossWeight: null, netWeight: null,
|
||||
packageWeight: null, volume: null, shippingMethod: null, shippingCost: null, origin: null,
|
||||
},
|
||||
variants: [],
|
||||
},
|
||||
detailImages: [],
|
||||
rawAttributes: {},
|
||||
};
|
||||
}
|
||||
|
||||
let cdp: CdpSession | null = null;
|
||||
try {
|
||||
cdp = await CdpSession.connect(cdpPort);
|
||||
|
||||
await cdp.send('Page.enable');
|
||||
await cdp.send('Runtime.enable');
|
||||
await cdp.send('Page.navigate', { url });
|
||||
|
||||
// Wait for load
|
||||
await new Promise(r => setTimeout(r, 5000));
|
||||
|
||||
const title: string = await cdp.evaluate(JS_EXTRACT_TITLE) || '';
|
||||
const rawAttributes: Record<string, string> = JSON.parse(await cdp.evaluate(JS_EXTRACT_ATTRS) || '{}');
|
||||
const rawVariants: Array<{ name: string; text: string }> = JSON.parse(await cdp.evaluate(JS_EXTRACT_VARIANTS) || '[]');
|
||||
const imgUrls: string[] = JSON.parse(await cdp.evaluate(JS_EXTRACT_IMAGES) || '[]');
|
||||
|
||||
const variants: VariantInfo[] = rawVariants.map(v => {
|
||||
const weight = parseWeight(v.text);
|
||||
const dimensions = parseDimensions(v.text);
|
||||
if (weight) weight.source = 'variant';
|
||||
if (dimensions) dimensions.source = 'variant';
|
||||
return { name: v.name, weight, dimensions };
|
||||
});
|
||||
|
||||
if (command === 'run') {
|
||||
const response: ApiResponse = await client.post('/your/endpoint', { param: args[0] });
|
||||
const logistics: LogisticsData = {
|
||||
weight: null, dimensions: null, grossWeight: null, netWeight: null,
|
||||
packageWeight: null, volume: null, shippingMethod: null, shippingCost: null, origin: null,
|
||||
};
|
||||
|
||||
if (response.status < 200 || response.status >= 300) {
|
||||
return { status: 'failed', command, dryRun, error: `HTTP ${response.status}: ${response.body}` };
|
||||
for (const [key, val] of Object.entries(rawAttributes)) {
|
||||
if (matchKey(key, ['毛重'])) {
|
||||
logistics.grossWeight = parseWeight(val);
|
||||
if (logistics.grossWeight) logistics.grossWeight.source = 'attributes';
|
||||
} else if (matchKey(key, ['净重'])) {
|
||||
logistics.netWeight = parseWeight(val);
|
||||
if (logistics.netWeight) logistics.netWeight.source = 'attributes';
|
||||
} else if (matchKey(key, ['包装重量'])) {
|
||||
logistics.packageWeight = parseWeight(val);
|
||||
if (logistics.packageWeight) logistics.packageWeight.source = 'attributes';
|
||||
} else if (matchKey(key, WEIGHT_KEYS)) {
|
||||
logistics.weight = parseWeight(val);
|
||||
if (logistics.weight) logistics.weight.source = 'attributes';
|
||||
}
|
||||
if (matchKey(key, DIMENSION_KEYS)) {
|
||||
logistics.dimensions = parseDimensions(val);
|
||||
if (logistics.dimensions) logistics.dimensions.source = 'attributes';
|
||||
}
|
||||
if (matchKey(key, VOLUME_KEYS)) {
|
||||
logistics.volume = parseVolume(val);
|
||||
if (logistics.volume) logistics.volume.source = 'attributes';
|
||||
}
|
||||
if (matchKey(key, ['产地', '发货地', '所在地'])) {
|
||||
logistics.origin = val;
|
||||
}
|
||||
}
|
||||
|
||||
return { status: 'success', command, dryRun, data: JSON.parse(response.body) };
|
||||
}
|
||||
const offerId = extractOfferId(url);
|
||||
const imgDir = path.join('/tmp', '1688-logistics', offerId);
|
||||
const detailImages = await downloadImages(imgUrls, imgDir);
|
||||
|
||||
return { status: 'failed', command, dryRun, error: `unknown command: ${command}` };
|
||||
return {
|
||||
status: 'success', url, command, dryRun,
|
||||
product: { title, logistics, variants },
|
||||
detailImages,
|
||||
rawAttributes,
|
||||
};
|
||||
} catch (error) {
|
||||
return {
|
||||
status: 'failed', url, command, dryRun,
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
};
|
||||
} finally {
|
||||
cdp?.close();
|
||||
}
|
||||
}
|
||||