Doc Chunker
Convert audio, video, PDFs, images, and Office documents into text chunks for your vector stores.
Doc Chunker is an optional DeepFellow component that turns diverse knowledge sources into clean, searchable text. It converts documents, presentations, spreadsheets, images, audio, and video into text and splits that text into chunks ready for embedding.
The connection between Doc Chunker and vector stores is still under development.
Right now DeepFellow Server extracts and chunks text on its own, which works best for plain text and Markdown files.
What You Can Convert
Doc Chunker handles a wide range of formats.
- Documents and Markup: pdf, docx, pptx, xlsx, html, md. OCR extracts text from scanned PDFs and images embedded in documents.
- Legacy Office formats: doc, ppt, xls, odt, odp, ods, rtf. Doc Chunker converts these with LibreOffice before parsing.
- Images: png, jpg, jpeg, tiff, bmp. OCR reads text from the image. Optional image description adds a natural-language summary of the picture.
- Audio: mp3, wav, m4a, ogg, flac, opus, aac, wma. Doc Chunker transcribes speech to text through a speech-to-text endpoint.
- Video: mp4, mkv, mov, avi, webm. Doc Chunker extracts the audio track and transcribes it.
- Other formats: Doc Chunker falls back to Apache Tika for any remaining file type it recognizes, such as epub or csv.
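To check in advance which group a file falls into, the format list above can be written down as a small lookup. This is a minimal sketch: the route names and both helper functions are illustrative and are not part of Doc Chunker.

```python
from pathlib import Path

# Extension groups copied from the format list above. The route names are
# illustrative labels, not identifiers used by Doc Chunker itself.
ROUTES = {
    "document": {"pdf", "docx", "pptx", "xlsx", "html", "md"},
    "legacy_office": {"doc", "ppt", "xls", "odt", "odp", "ods", "rtf"},
    "image": {"png", "jpg", "jpeg", "tiff", "bmp"},
    "audio": {"mp3", "wav", "m4a", "ogg", "flac", "opus", "aac", "wma"},
    "video": {"mp4", "mkv", "mov", "avi", "webm"},
}


def expected_route(filename: str) -> str:
    """Return the format group for a file name, or "tika_fallback".

    "tika_fallback" only means the extension is not in the lists above;
    Apache Tika may or may not recognize the file.
    """
    ext = Path(filename).suffix.lower().lstrip(".")
    for route, extensions in ROUTES.items():
        if ext in extensions:
            return route
    return "tika_fallback"


def needs_stt(filename: str) -> bool:
    """Audio and video files need a configured speech-to-text endpoint."""
    return expected_route(filename) in {"audio", "video"}
```

A pre-upload check such as `needs_stt("meeting.mp4")` can flag files that will not convert until a speech-to-text endpoint is configured.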
Audio and video transcription requires a speech-to-text endpoint. Image descriptions require a vision-language model. Both are optional and you configure them during installation. See Configuration Options.
Install Doc Chunker
Install Doc Chunker from the Infra Web Panel as a model of the custom service.
The general procedure matches the one in the Custom Models guide.
In the services view, locate the custom service and open its model list.
Find the doc_chunker model and open its configuration form.
Set the required fields: endpoint prefix, hardware, and picture description mode.
The default prefix is doc_chunker, which exposes Doc Chunker at /custom/doc_chunker.
Configure the optional fields described in Configuration Options. The defaults handle documents and images, with image descriptions and audio transcription turned off.
Click "Install". DeepFellow selects the CPU or GPU image automatically based on your hardware.
After a while, the model will appear with the green "Installed" label.
The GPU version of Doc Chunker needs about 15 GB of disk space for its image but runs much faster.
Configuration Options
The Doc Chunker installation form exposes the following fields.
Processing
- prefix: Endpoint prefix. Default doc_chunker.
- max_concurrent: Maximum number of files Doc Chunker converts in parallel. Default 2.
- document_timeout: Maximum time in seconds for a single conversion. Default 600.
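On the client side, it can help to mirror these limits so that you do not send more files at once than Doc Chunker converts in parallel. The sketch below is one way to do that under stated assumptions: `convert_all`, the `convert` callable, and the 30-second margin are hypothetical, and this page does not say whether Doc Chunker queues or rejects requests beyond max_concurrent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# Defaults taken from the installation form described above.
MAX_CONCURRENT = 2
DOCUMENT_TIMEOUT = 600  # seconds

# Hypothetical margin so the client waits slightly longer than the server.
CLIENT_TIMEOUT = DOCUMENT_TIMEOUT + 30


def convert_all(
    paths: Iterable[str],
    convert: Callable[[str, float], object],
    max_workers: int = MAX_CONCURRENT,
) -> dict:
    """Run convert(path, timeout) for each path, at most max_workers at a time."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = pool.map(lambda p: convert(p, CLIENT_TIMEOUT), paths)
        return dict(zip(paths, results))
```

Here `convert` stands for whatever function uploads one file, for example a wrapper around an HTTP POST that passes the timeout through to the HTTP client.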
Image Descriptions
To add natural-language descriptions of images and pictures, set picture_description_mode to one of
the following values.
- disabled: No image descriptions. This is the default.
- preset: Use a Docling preset vision-language model. Set picture_description_preset to smolvlm, granite_vision, qwen, or pixtral. DeepFellow downloads the model from Hugging Face on the first request.
- local: Use any Hugging Face vision-language model. Set picture_description_repo to the repository ID, for example Qwen/Qwen2.5-VL-7B-Instruct.
- api: Use an external endpoint compatible with /v1/chat/completions. Set picture_description_api_url, picture_description_api_model, and, if required, picture_description_api_key.
Set picture_description_prompt to control how the model describes each image.
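Because each mode depends on different fields, a quick consistency check before you submit the form can catch a missing value. The following is a minimal sketch that encodes only the rules listed above; `check_picture_settings` is a hypothetical helper, not a DeepFellow API, and the installation form may apply its own validation.

```python
# Fields each mode needs, as listed above. picture_description_api_key is
# optional, so it is not checked here.
REQUIRED_FIELDS = {
    "disabled": [],
    "preset": ["picture_description_preset"],
    "local": ["picture_description_repo"],
    "api": ["picture_description_api_url", "picture_description_api_model"],
}
PRESETS = {"smolvlm", "granite_vision", "qwen", "pixtral"}


def check_picture_settings(settings: dict) -> list:
    """Return a list of problems; an empty list means the settings look consistent."""
    mode = settings.get("picture_description_mode", "disabled")
    if mode not in REQUIRED_FIELDS:
        return [f"unknown picture_description_mode: {mode!r}"]
    problems = [
        f"{field} is required when mode is {mode!r}"
        for field in REQUIRED_FIELDS[mode]
        if not settings.get(field)
    ]
    preset = settings.get("picture_description_preset")
    if mode == "preset" and preset and preset not in PRESETS:
        problems.append(f"unknown preset: {preset!r}")
    return problems
```

For example, `check_picture_settings({"picture_description_mode": "local"})` reports that picture_description_repo is missing.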
Audio Transcription
To transcribe audio and video, point Doc Chunker at a speech-to-text endpoint.
- audio_stt_api_url: A speech-to-text endpoint compatible with /v1/audio/transcriptions, for example your DeepFellow Speech to Text endpoint.
- audio_stt_api_model: The transcription model, for example Systran/faster-whisper-base.
- audio_stt_api_key: The API key for the endpoint, if required.
- audio_silence_threshold: The minimum duration of silence in seconds used to split audio into segments.
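To build intuition for audio_silence_threshold, the sketch below shows how a minimum silence duration can select split points. It is an illustration only: the `split_points` function and the choice to cut at the midpoint of each pause are assumptions, and Doc Chunker's actual segmentation may work differently.

```python
def split_points(silences: list, threshold: float) -> list:
    """Return the midpoint of every silence lasting at least `threshold` seconds.

    `silences` holds (start, end) times in seconds for each detected pause.
    Cutting at the midpoint is an assumption made for this illustration.
    """
    return [
        (start + end) / 2
        for start, end in silences
        if end - start >= threshold
    ]


# Three pauses lasting 0.25 s, 1 s, and 2 s.
pauses = [(1.0, 1.25), (4.0, 5.0), (9.0, 11.0)]
print(split_points(pauses, threshold=1.0))   # the short pause is ignored
print(split_points(pauses, threshold=0.25))  # every pause becomes a split
```

In this model, a higher threshold produces fewer, longer segments, and a lower threshold produces more, shorter ones.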
Set hf_token to a Hugging Face token when a preset or local model requires you to accept its terms of
service before download.
Call Doc Chunker Directly
For advanced use cases, call Doc Chunker through the /custom/doc_chunker endpoint without a vector
store. Send the file as multipart/form-data in the file field. The /custom/doc_chunker/chunks
endpoint returns a ZIP archive that contains chunks.json with the extracted chunks, document.json
with the parsed document, and an assets folder with extracted images.
curl -X 'POST' \
  'https://deepfellow-server-host/custom/doc_chunker/chunks' \
  -H 'Authorization: Bearer DEEPFELLOW-PROJECT-API-KEY' \
  -F 'file=@report.pdf;type=application/pdf' \
  -o chunks.zip

import io
import json
import zipfile
import requests
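
# Upload the file as multipart/form-data in the "file" field.
with open("report.pdf", "rb") as f:
    response = requests.post(
        "https://deepfellow-server-host/custom/doc_chunker/chunks",
        files={"file": ("report.pdf", f, "application/pdf")},
        headers={"Authorization": "Bearer DEEPFELLOW-PROJECT-API-KEY"},
    )
# Stop here on an HTTP error instead of trying to unzip an error body.
response.raise_for_status()

# The response body is a ZIP archive; read chunks.json straight from memory.
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    chunks = json.loads(archive.read("chunks.json"))
print(chunks)

import * as fs from 'fs';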
const buffer = await fs.promises.readFile('report.pdf');
const formData = new FormData();
formData.append('file', new Blob([buffer], { type: 'application/pdf' }), 'report.pdf');

const response = await fetch('https://deepfellow-server-host/custom/doc_chunker/chunks', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer DEEPFELLOW-PROJECT-API-KEY'
  },
  body: formData
});

const archive = Buffer.from(await response.arrayBuffer());
await fs.promises.writeFile('chunks.zip', archive);

The chunks.json file inside the archive contains the chunks and a total count.
{
  "chunks": [
    {
      "text": "Quarterly revenue grew by 12 percent compared to the previous period.",
      "meta": {
        "headings": ["Financial Summary"],
        "pages": [3]
      }
    }
  ],
  "count": 1
}

To extract plain text instead of chunks, call /custom/doc_chunker/text. Set the
image_processing_mode query parameter to ignore, base64, or description to control how the
endpoint handles images.
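The query parameter can be validated and attached to the URL before you send the request. This is a minimal sketch: `text_endpoint_url` is a hypothetical helper, and the host name is the same placeholder used in the examples above.

```python
from urllib.parse import urlencode

# Values listed above for the image_processing_mode query parameter.
IMAGE_PROCESSING_MODES = ("ignore", "base64", "description")


def text_endpoint_url(
    base_url: str,
    image_processing_mode: str,
    prefix: str = "doc_chunker",
) -> str:
    """Build the text endpoint URL with the image_processing_mode parameter."""
    if image_processing_mode not in IMAGE_PROCESSING_MODES:
        raise ValueError(
            f"image_processing_mode must be one of {IMAGE_PROCESSING_MODES}"
        )
    query = urlencode({"image_processing_mode": image_processing_mode})
    return f"{base_url.rstrip('/')}/custom/{prefix}/text?{query}"


print(text_endpoint_url("https://deepfellow-server-host", "description"))
```

Assuming the text endpoint accepts the same multipart upload and Authorization header as the chunks endpoint, you can pass the resulting URL to the HTTP client in place of the chunks URL. This page does not describe the response format, so inspect the response before parsing it.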