Doc Chunker

Convert audio, video, PDFs, images, and Office documents into text chunks for your vector stores.

Doc Chunker is an optional DeepFellow component that turns diverse knowledge sources into clean, searchable text. It converts documents, presentations, spreadsheets, images, audio, and video into text and splits that text into chunks ready for embedding.

The connection between Doc Chunker and vector stores is still under development.

Right now DeepFellow Server extracts and chunks text on its own, which works best for plain text and Markdown files.

What You Can Convert

Doc Chunker handles a wide range of formats.

Documents and Markup: pdf, docx, pptx, xlsx, html, md. OCR extracts text from scanned PDFs and images embedded in documents.
Legacy Office formats: doc, ppt, xls, odt, odp, ods, rtf. Doc Chunker converts these with LibreOffice before parsing.
Images: png, jpg, jpeg, tiff, bmp. OCR reads text from the image. Optional image description adds a natural-language summary of the picture.
Audio: mp3, wav, m4a, ogg, flac, opus, aac, wma. Doc Chunker transcribes speech to text through a speech-to-text endpoint.
Video: mp4, mkv, mov, avi, webm. Doc Chunker extracts the audio track and transcribes it.
Other formats: Doc Chunker falls back to Apache Tika for any remaining file type it recognizes, such as epub or csv.

Audio and video transcription requires a speech-to-text endpoint. Image descriptions require a vision-language model. Both are optional and you configure them during installation. See Configuration Options.
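
As a pre-flight check, a client can map a file extension to the conversion path listed above. The following is a minimal sketch, not part of Doc Chunker: the helper names `conversion_path` and `needs_stt_endpoint` are hypothetical, and the extension groups are copied from the list above.

```python
from pathlib import Path

# Extension groups copied from the format list above.
FORMATS = {
    "document": {"pdf", "docx", "pptx", "xlsx", "html", "md"},
    "legacy_office": {"doc", "ppt", "xls", "odt", "odp", "ods", "rtf"},
    "image": {"png", "jpg", "jpeg", "tiff", "bmp"},
    "audio": {"mp3", "wav", "m4a", "ogg", "flac", "opus", "aac", "wma"},
    "video": {"mp4", "mkv", "mov", "avi", "webm"},
}


def conversion_path(filename: str) -> str:
    """Return the format group for a file, or "tika_fallback" if none matches.

    "tika_fallback" only means the extension is not in the lists above;
    it does not guarantee that Apache Tika recognizes the file.
    """
    extension = Path(filename).suffix.lower().lstrip(".")
    for group, extensions in FORMATS.items():
        if extension in extensions:
            return group
    return "tika_fallback"


def needs_stt_endpoint(filename: str) -> bool:
    """Audio and video files are transcribed, so they need a speech-to-text endpoint."""
    return conversion_path(filename) in {"audio", "video"}


print(conversion_path("Talk.MP4"), needs_stt_endpoint("Talk.MP4"))
```

A check like this lets a client warn the user before uploading audio or video to an installation that has no speech-to-text endpoint configured.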

Install Doc Chunker

Install Doc Chunker from the Infra Web Panel as a model of the custom service. The general procedure matches the one in the Custom Models guide.

In the services view, locate the custom service and open its model list.

Find the doc_chunker model and open its configuration form.

Set the required fields: endpoint prefix, hardware, and picture description mode. The default prefix is doc_chunker, which exposes Doc Chunker at /custom/doc_chunker.

Configure the optional fields described in Configuration Options. The defaults handle documents and images, but they do not enable image descriptions or audio transcription.

Click "Install". DeepFellow selects the CPU or GPU image automatically based on your hardware.

When the installation finishes, the model appears with the green "Installed" label.

The GPU version of Doc Chunker needs about 15 GB of disk space for its image, but it runs much faster than the CPU version.

Configuration Options

The Doc Chunker installation form exposes the following fields.

Processing

prefix: Endpoint prefix. Default doc_chunker.
max_concurrent: Maximum number of files Doc Chunker converts in parallel. Default 2.
document_timeout: Maximum time in seconds for a single conversion. Default 600.
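
The sketch below shows one way a client could line up with these defaults: it builds the endpoint URL from the prefix and keeps the number of parallel uploads at max_concurrent. The names `endpoint_url` and `convert_all` are hypothetical, and the `upload` callable is supplied by the caller. How the server treats requests beyond max_concurrent is not described here, so limiting parallelism on the client side is an assumption, not a requirement.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

MAX_CONCURRENT = 2      # default max_concurrent
DOCUMENT_TIMEOUT = 600  # default document_timeout, in seconds


def endpoint_url(host: str, prefix: str = "doc_chunker", operation: str = "chunks") -> str:
    """Build the endpoint URL for a given prefix, e.g. /custom/doc_chunker/chunks."""
    return f"https://{host}/custom/{prefix}/{operation}"


def convert_all(
    paths: Iterable[str],
    upload: Callable[[str], object],
    max_concurrent: int = MAX_CONCURRENT,
) -> dict:
    """Run upload(path) for every path, at most max_concurrent at a time."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(upload, paths))
    return dict(zip(paths, results))


print(endpoint_url("deepfellow-server-host"))
```

In practice, `upload` would post one file to the URL above. A client-side request timeout slightly above DOCUMENT_TIMEOUT is one reasonable choice, so the client does not give up before the server does.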

Image Descriptions

To add natural-language descriptions of images and pictures, set picture_description_mode to one of the following values.

disabled: No image descriptions. This is the default.
preset: Use a Docling preset vision-language model. Set picture_description_preset to smolvlm, granite_vision, qwen, or pixtral. DeepFellow downloads the model from Hugging Face on the first request.
local: Use any Hugging Face vision-language model. Set picture_description_repo to the repository ID, for example Qwen/Qwen2.5-VL-7B-Instruct.
api: Use an external endpoint compatible with /v1/chat/completions. Set picture_description_api_url, picture_description_api_model, and, if required, picture_description_api_key.

Set picture_description_prompt to control how the model describes each image.
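
Each mode needs a different set of companion fields. The sketch below checks a set of values for consistency before you enter them in the form. It is a hypothetical client-side helper (`check_picture_description` is not a Doc Chunker function), and the rules are taken only from the descriptions above; the installation form may validate differently.

```python
# Preset names and per-mode fields, as listed above.
PRESETS = {"smolvlm", "granite_vision", "qwen", "pixtral"}

REQUIRED_FIELDS = {
    "disabled": [],
    "preset": ["picture_description_preset"],
    "local": ["picture_description_repo"],
    # picture_description_api_key is only needed if the endpoint requires it.
    "api": ["picture_description_api_url", "picture_description_api_model"],
}


def check_picture_description(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the values look consistent."""
    mode = config.get("picture_description_mode", "disabled")
    if mode not in REQUIRED_FIELDS:
        return [f"unknown picture_description_mode: {mode!r}"]
    problems = [
        f"missing {name}" for name in REQUIRED_FIELDS[mode] if not config.get(name)
    ]
    preset = config.get("picture_description_preset")
    if mode == "preset" and preset and preset not in PRESETS:
        problems.append(f"unknown preset: {preset!r}")
    return problems


print(check_picture_description({
    "picture_description_mode": "local",
    "picture_description_repo": "Qwen/Qwen2.5-VL-7B-Instruct",
}))
```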

Audio Transcription

To transcribe audio and video, point Doc Chunker at a speech-to-text endpoint.

audio_stt_api_url: A speech-to-text endpoint compatible with /v1/audio/transcriptions, for example your DeepFellow Speech to Text endpoint.
audio_stt_api_model: The transcription model, for example Systran/faster-whisper-base.
audio_stt_api_key: The API key for the endpoint, if required.
audio_silence_threshold: The minimum duration of silence in seconds used to split audio into segments.

Set hf_token to a Hugging Face token when a preset or local image description model requires you to accept its terms of service before download.

Call Doc Chunker Directly

For advanced use cases, call Doc Chunker through the /custom/doc_chunker endpoint without a vector store. Send the file as multipart/form-data in the file field. The /custom/doc_chunker/chunks endpoint returns a ZIP archive that contains chunks.json with the extracted chunks, document.json with the parsed document, and an assets folder with extracted images.

curl -X 'POST' \
  'https://deepfellow-server-host/custom/doc_chunker/chunks' \
  -H 'Authorization: Bearer DEEPFELLOW-PROJECT-API-KEY' \
  -F 'file=@report.pdf;type=application/pdf' \
  -o chunks.zip

import io
import json
import zipfile

import requests

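# Upload the file as multipart/form-data in the "file" field.
with open("report.pdf", "rb") as f:
    response = requests.post(
        "https://deepfellow-server-host/custom/doc_chunker/chunks",
        files={"file": ("report.pdf", f, "application/pdf")},
        headers={"Authorization": "Bearer DEEPFELLOW-PROJECT-API-KEY"},
    )
response.raise_for_status()  # stop on HTTP errors instead of unzipping an error body

# The response body is a ZIP archive; read chunks.json directly from memory.
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    chunks = json.loads(archive.read("chunks.json"))

print(chunks)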

import * as fs from 'fs';

const buffer = await fs.promises.readFile('report.pdf');
const formData = new FormData();
formData.append('file', new Blob([buffer], { type: 'application/pdf' }), 'report.pdf');

const response = await fetch('https://deepfellow-server-host/custom/doc_chunker/chunks', {
    method: 'POST',
    headers: {
        Authorization: 'Bearer DEEPFELLOW-PROJECT-API-KEY'
    },
    body: formData
});

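// Stop on HTTP errors instead of saving an error body as a ZIP file.
if (!response.ok) {
    throw new Error(`Doc Chunker request failed with status ${response.status}`);
}

// The response body is a ZIP archive; save it to disk.
const archive = Buffer.from(await response.arrayBuffer());
await fs.promises.writeFile('chunks.zip', archive);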

The chunks.json file inside the archive contains the chunks and a total count.

{
    "chunks": [
        {
            "text": "Quarterly revenue grew by 12 percent compared to the previous period.",
            "meta": {
                "headings": ["Financial Summary"],
                "pages": [3]
            }
        }
    ],
    "count": 1
}
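
Once chunks.json is loaded, each chunk can be turned into a one-line summary with its heading path and page numbers. This is a minimal sketch that assumes every chunk carries the `text` field and the `meta.headings` and `meta.pages` keys shown in the sample above; real responses may include other keys. The function name `describe_chunks` is hypothetical.

```python
def describe_chunks(payload: dict) -> list[str]:
    """Format each chunk as "[headings] pages N: text"."""
    lines = []
    for chunk in payload["chunks"]:
        meta = chunk.get("meta", {})
        headings = " > ".join(meta.get("headings", [])) or "(no heading)"
        pages = ", ".join(str(page) for page in meta.get("pages", [])) or "n/a"
        lines.append(f"[{headings}] pages {pages}: {chunk['text']}")
    return lines


# Same structure as the sample chunks.json above.
sample = {
    "chunks": [
        {
            "text": "Quarterly revenue grew by 12 percent compared to the previous period.",
            "meta": {"headings": ["Financial Summary"], "pages": [3]},
        }
    ],
    "count": 1,
}

for line in describe_chunks(sample):
    print(line)
```

Keeping the headings and pages next to the text is useful when you store chunks yourself, because it lets search results point back to the place in the source document.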

To extract plain text instead of chunks, call /custom/doc_chunker/text. Set the image_processing_mode query parameter to ignore, base64, or description to control how the endpoint handles images.
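
A call to the text endpoint could look like the sketch below. It rests on several assumptions: the endpoint accepts the file in the same multipart `file` field as the chunks endpoint, the host and API key are the placeholders used in the examples above, and the response body is returned raw because its format is not described here. The function names `text_endpoint_url` and `extract_text` are hypothetical.

```python
from urllib.parse import urlencode

# Placeholder host, as in the examples above.
BASE_URL = "https://deepfellow-server-host/custom/doc_chunker"
IMAGE_MODES = ("ignore", "base64", "description")


def text_endpoint_url(image_processing_mode: str, base_url: str = BASE_URL) -> str:
    """Build the /text URL with the image_processing_mode query parameter."""
    if image_processing_mode not in IMAGE_MODES:
        raise ValueError(f"image_processing_mode must be one of {IMAGE_MODES}")
    query = urlencode({"image_processing_mode": image_processing_mode})
    return f"{base_url}/text?{query}"


def extract_text(path: str, api_key: str, image_processing_mode: str = "ignore") -> str:
    """Upload a file and return the raw response body (format is an assumption)."""
    import requests  # third-party; imported here so the URL helper works without it

    with open(path, "rb") as f:
        response = requests.post(
            text_endpoint_url(image_processing_mode),
            files={"file": f},  # assumption: same "file" field as /chunks
            headers={"Authorization": f"Bearer {api_key}"},
        )
    response.raise_for_status()
    return response.text


print(text_endpoint_url("description"))
```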
