Documents don't live in one place, in one format, updated on one schedule. RenderDraw's ingestion pipeline handles PDFs from the fileserver, Word docs from SharePoint, HTML pages from an internal wiki, and everything in between — processing each format into clean, searchable, versioned chunks without manual intervention.
The ingestion pipeline supports the major document formats organizations rely on. If your format isn't listed, contact support: the pipeline is extensible.
Data files route to the Structured Data ingestion path — rows and columns are preserved, not flattened into prose. See the Structured Data guide.
From raw file to retrievable knowledge in four stages. Understanding this pipeline helps you configure each stage for your content type.
The raw file is parsed to extract clean text. Each format has a dedicated extractor:
The extracted text is split into retrieval units (chunks). Chunking strategy is configured per-KB:
Chunk overlap is a configurable percentage of the chunk size. That portion, taken from the end of each chunk, is repeated at the start of the next chunk so that relevant context is not cut at a boundary.
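The overlap behaviour can be sketched in a few lines. This is a minimal illustration, not RenderDraw's chunker: it assumes fixed-size chunks over a pre-tokenized list, and the function name and parameters are hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, overlap_pct):
    """Split tokens into fixed-size chunks; each chunk after the first
    starts with the last `overlap` tokens of the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in [0, 100)")
    overlap = chunk_size * overlap_pct // 100
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


chunks = chunk_with_overlap(list(range(10)), chunk_size=4, overlap_pct=50)
```

With a chunk size of 4 and 50% overlap, every chunk shares its first two tokens with the tail of the one before it.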
Each chunk is converted to a dense vector representation (embedding) using the KB's configured embedding model. The embedding captures the semantic meaning of the chunk in a 1,536-dimensional (or higher) vector space — enabling similarity search that finds related content by meaning rather than keyword overlap.
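To make "similarity by meaning" concrete, here is a small sketch of cosine similarity ranking. It uses toy 3-dimensional vectors instead of real 1,536-dimensional embeddings, and the function names and chunk IDs are illustrative assumptions; the similarity metric RenderDraw actually uses is not stated in this guide.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the IDs of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors standing in for real embeddings.
chunk_vecs = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.9, 0.1, 0.0]}
nearest = top_k([1.0, 0.0, 0.0], chunk_vecs, k=2)
```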
Embedding is the most computationally intensive step and runs asynchronously after upload. For most document sizes, embedding completes within 30–120 seconds of upload. Large batches (thousands of documents) process in the background and generate a completion notification when ready.
Embeddings are stored persistently and do not need to be recomputed unless the document content changes or the KB is migrated to a new embedding model. Model migrations happen automatically during platform updates and are performed in a rolling fashion — the KB remains fully queryable throughout the migration.
Embeddings are written to the vector index (pgvector). RenderDraw uses an IVFFlat approximate nearest-neighbor index, configured for sub-100ms retrieval, for KBs of up to 500,000 chunks, and an HNSW index for KBs exceeding that threshold.
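The index choice above can be pictured as a simple threshold rule that emits pgvector DDL. This is a sketch under assumptions: the table name, column name, cosine operator class, and `lists` value are illustrative, not RenderDraw's actual schema or tuning.

```python
HNSW_THRESHOLD_CHUNKS = 500_000  # threshold stated in the text


def index_ddl(table, column, chunk_count):
    """Return a pgvector CREATE INDEX statement for a KB of the given size.

    Identifiers are interpolated directly for illustration only; real code
    should never build SQL from untrusted identifiers this way.
    """
    if chunk_count > HNSW_THRESHOLD_CHUNKS:
        return f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);"
    # lists = 100 is a placeholder; the right value depends on table size.
    return (f"CREATE INDEX ON {table} USING ivfflat "
            f"({column} vector_cosine_ops) WITH (lists = 100);")


small_kb = index_ddl("kb_chunks", "embedding", 120_000)
large_kb = index_ddl("kb_chunks", "embedding", 750_000)
```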
Alongside the vector index, a full-text index is built for hybrid search — queries that benefit from both semantic similarity and exact term matching (e.g., part numbers, standards codes, named specifications) can use hybrid mode to combine both signal types in the ranking.
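One common way to combine a semantic ranking with a full-text ranking is reciprocal rank fusion (RRF). The guide does not say which fusion method RenderDraw uses, so treat this as an illustration of the idea rather than the platform's ranking; the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores 1 / (k + rank) per list it appears in, so chunks
    ranked highly by both signals rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic_hits = ["c1", "c2", "c3"]   # ordered by vector similarity
keyword_hits = ["c3", "c1", "c4"]    # ordered by exact term match
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Here `c1` and `c3` appear in both lists and therefore outrank `c2` and `c4`, which each matched only one signal.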
Document metadata (file name, upload date, document version, source library, custom tags) is stored in a relational index alongside the vector index. This enables metadata-filtered retrieval: "find the most relevant chunk from documents tagged year:2025 only."
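Metadata-filtered retrieval amounts to narrowing the candidate set by tag before ranking by relevance. A minimal sketch, assuming each chunk record already carries its document's tags and a precomputed relevance score (both field names are hypothetical):

```python
def filtered_top_chunks(chunks, required_tag, k=1):
    """Keep only chunks whose document carries required_tag, then rank by score."""
    eligible = [c for c in chunks if required_tag in c["tags"]]
    return sorted(eligible, key=lambda c: c["score"], reverse=True)[:k]


chunks = [
    {"id": "a", "tags": ["year:2024"], "score": 0.95},
    {"id": "b", "tags": ["year:2025"], "score": 0.80},
    {"id": "c", "tags": ["year:2025"], "score": 0.60},
]
best_2025 = filtered_top_chunks(chunks, "year:2025")
```

Chunk `a` has the highest score overall but is excluded because it lacks the required tag.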
Batch upload is the fastest way to get an initial corpus into a knowledgebase. From the KB management panel, drag and drop up to 500 files at once, or use the folder upload option to upload an entire directory tree, preserving the subfolder structure as document metadata tags.
During batch upload, RenderDraw validates each file — checking format support, file size limits, and OCR eligibility for PDFs — and queues valid files for processing. Invalid files are flagged with error details and skipped. A progress panel shows each file's processing stage: queued, parsing, chunking, embedding, indexing, complete.
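The validation step and the stage progression can be modelled roughly as follows. The supported extensions and size limit passed in below are placeholders, since this guide does not state the real values, and the function names are hypothetical.

```python
import os

# Processing stages in the order shown in the progress panel.
STAGES = ("queued", "parsing", "chunking", "embedding", "indexing", "complete")


def validate_file(name, size_bytes, supported_exts, max_bytes):
    """Return (ok, error_message) for one file in a batch."""
    ext = os.path.splitext(name)[1].lower()
    if ext not in supported_exts:
        return False, f"unsupported format: {ext or '(none)'}"
    if size_bytes > max_bytes:
        return False, f"file exceeds size limit of {max_bytes} bytes"
    return True, None


def next_stage(stage):
    """Advance a file to the following stage; 'complete' is terminal."""
    i = STAGES.index(stage)
    return STAGES[min(i + 1, len(STAGES) - 1)]


# Placeholder limits for illustration only.
ok, error = validate_file("spec.PDF", 2_000_000, {".pdf", ".docx"}, 50_000_000)
```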
For very large initial loads (thousands of documents, multi-GB archives), use the bulk import package: a ZIP archive with a manifest JSON file that specifies metadata tags, chunk strategy overrides, and version identifiers per document. Submit the package via the API or the platform's import panel. Bulk imports process in dedicated background workers without affecting the performance of existing KBs.
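A bulk import package could be assembled along these lines. The manifest file name and its field names (`path`, `tags`, `chunk_strategy`, `version`) are assumptions made for illustration; check the actual manifest schema before building real packages.

```python
import io
import json
import zipfile


def build_import_package(documents):
    """Build an in-memory ZIP holding the documents plus a manifest.

    documents: list of dicts with 'path' and 'content' (bytes), and
    optional 'tags', 'chunk_strategy', and 'version'.
    """
    manifest = {
        "documents": [
            {
                "path": d["path"],
                "tags": d.get("tags", []),
                "chunk_strategy": d.get("chunk_strategy"),
                "version": d.get("version"),
            }
            for d in documents
        ]
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
        for d in documents:
            zf.writestr(d["path"], d["content"])
    return buf.getvalue()


package = build_import_package([
    {"path": "specs/conveyor_rev_c.pdf", "content": b"%PDF-1.7",
     "tags": ["product-line:conveyors"], "version": "2025-11-15"},
])
```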
API ingestion is the preferred mode for automated document pipelines. The POST /api/knowledgebases/{kb_id}/documents endpoint accepts a document file (multipart/form-data) or a URL (for web pages and live-accessible files) alongside metadata fields.
POST /api/knowledgebases/kb_specs_2025/documents
Content-Type: multipart/form-data

file: [binary]
metadata: {
  "title": "Conveyor Spec Sheet Rev C",
  "tags": ["product-line:conveyors", "revision:C"],
  "version": "2025-11-15",
  "replace_previous": true
}
The replace_previous: true flag supersedes the previous version of a document with the same title in the KB. The previous version is retained in version history but is no longer included in default retrieval. Set replace_previous: false to add the new version alongside the previous one (e.g. for parallel spec versions that are both valid).
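From a script, the upload request might be built as shown here using only the Python standard library. The base URL is a placeholder, sending `metadata` as a JSON-encoded form field is an assumption about how the endpoint expects it, and authentication is omitted because this guide does not describe it.

```python
import json
import uuid
import urllib.request


def build_upload_request(base_url, kb_id, filename, file_bytes, metadata):
    """Build (but do not send) a multipart/form-data document upload."""
    boundary = uuid.uuid4().hex
    metadata_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="metadata"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(metadata)}\r\n"
    ).encode("utf-8")
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8") + file_bytes + b"\r\n"
    closing = f"--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/knowledgebases/{kb_id}/documents",
        data=metadata_part + file_part + closing,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


req = build_upload_request(
    "https://renderdraw.example",  # placeholder host
    "kb_specs_2025",
    "conveyor_spec_rev_c.pdf",
    b"%PDF-1.7",
    {
        "title": "Conveyor Spec Sheet Rev C",
        "tags": ["product-line:conveyors", "revision:C"],
        "version": "2025-11-15",
        "replace_previous": True,
    },
)
```

Passing `req` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can run without network access.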
Webhook callbacks notify your system when processing is complete: POST {your_callback_url} with the document ID, processing status, chunk count, and any parsing warnings. Use this to trigger downstream actions — for example, notifying workflow authors that a KB has been updated.
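On the receiving side, a callback handler could look something like this. The payload field names (`document_id`, `status`, `chunk_count`, `warnings`) and the `"complete"` status value are assumptions based on the description above, not a documented schema.

```python
import json


def handle_callback(raw_body):
    """Turn a processing callback body into a human-readable summary."""
    payload = json.loads(raw_body)
    doc_id = payload["document_id"]
    status = payload["status"]
    warnings = payload.get("warnings", [])
    if status != "complete":
        return f"Document {doc_id} ended with status '{status}'"
    message = f"Document {doc_id} indexed as {payload['chunk_count']} chunks"
    if warnings:
        message += f" with {len(warnings)} parsing warning(s)"
    return message


summary = handle_callback(json.dumps({
    "document_id": "doc_1",
    "status": "complete",
    "chunk_count": 42,
    "warnings": ["page 3: low-contrast table"],
}))
```

The returned summary is the kind of message that could be forwarded to workflow authors when a KB is updated.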
For knowledgebases connected to live sources — SharePoint, Google Drive, Confluence, Notion — scheduled re-sync keeps the KB current without manual effort. RenderDraw polls the connected source on the configured schedule, identifies new and modified files, and re-ingests only the changed documents (incremental sync, not full re-index).
Re-sync schedules are configurable per-KB: hourly for fast-moving sources like active project document libraries, daily for standard document management libraries, or weekly/monthly for stable reference archives like compliance standards libraries. A manual Sync Now trigger is always available for immediate re-sync outside the scheduled window.
When a document is updated in the source system, RenderDraw detects the change via the source's file modification timestamp or change notification API (SharePoint webhook, Google Drive push notification). The previous version of the document is preserved in the KB's version history, and the new version becomes the active retrieval target.
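Timestamp-based incremental sync reduces to comparing two listings. The sketch assumes both the source and the KB can be summarized as a mapping from file path to last-modified time; the function name and return shape are hypothetical.

```python
def plan_incremental_sync(source_files, indexed_files):
    """Decide what to re-ingest.

    Both arguments map file path -> last-modified timestamp (seconds).
    Files missing from the source are flagged, not removed.
    """
    new = sorted(p for p in source_files if p not in indexed_files)
    modified = sorted(
        p for p, ts in source_files.items()
        if p in indexed_files and ts > indexed_files[p]
    )
    deleted = sorted(p for p in indexed_files if p not in source_files)
    return {"ingest": new + modified, "flag_source_deleted": deleted}


plan = plan_incremental_sync(
    source_files={"a.pdf": 100, "b.docx": 250, "c.html": 50},
    indexed_files={"a.pdf": 100, "b.docx": 200, "old.pdf": 10},
)
```

Only `c.html` (new) and `b.docx` (modified) are queued; the unchanged `a.pdf` is skipped.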
Documents deleted from the source are not automatically removed from the KB. This is intentional — deleted documents may still be referenced in past workflow outputs and are needed for audit purposes. Deleted source documents are flagged as "source deleted" in the KB document list. You can manually archive them or set an auto-archive policy (e.g. archive source-deleted documents after 90 days).
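An auto-archive policy like the 90-day example is a date comparison over the flagged documents. In this sketch the `source_deleted_at` field is a hypothetical name for the time a document was flagged as "source deleted".

```python
from datetime import datetime, timedelta, timezone


def due_for_archive(documents, now, retention_days=90):
    """Return IDs of source-deleted documents past the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [
        d["id"] for d in documents
        if d.get("source_deleted_at") is not None
        and d["source_deleted_at"] <= cutoff
    ]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
documents = [
    {"id": "d1", "source_deleted_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": "d2", "source_deleted_at": datetime(2025, 5, 1, tzinfo=timezone.utc)},
    {"id": "d3", "source_deleted_at": None},  # still present in the source
]
to_archive = due_for_archive(documents, now)
```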
Scanned PDFs are ubiquitous in construction and manufacturing — vendor submittals, as-built drawings, historical spec sheets, legacy RFP responses, and pre-digital project records are often image-only PDFs with no text layer. RenderDraw's OCR pipeline is tuned specifically for this content type.
The OCR engine handles: mixed-orientation pages (portrait and landscape in the same document), multi-column layouts, technical tables with fine gridlines, superscript and subscript notation common in engineering specs, handwritten annotations in margins, and stamps and watermarks that obscure underlying text.
For technical drawings (P&IDs, floor plans, wiring diagrams), OCR extracts text from callouts, labels, title blocks, and notes — but does not interpret the geometric content of the drawing itself. For AI-driven interpretation of drawing geometry, use the CAD Asset KB type with a native CAD file rather than a scanned PDF derivative.
OCR processing time averages 8–15 seconds per scanned page. At that rate, a 200-page submittal package processes in roughly 27–50 minutes. High-priority OCR queues are available for time-sensitive uploads; contact support to enable priority OCR for your tenant.
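For planning purposes, the per-page rate converts to a batch estimate by multiplication. This small helper is illustrative only; it uses the 8 and 15 second figures quoted above as defaults and ignores queueing time.

```python
def ocr_minutes(pages, sec_per_page_low=8, sec_per_page_high=15):
    """Return (low, high) estimated OCR minutes for a scanned document."""
    return pages * sec_per_page_low / 60, pages * sec_per_page_high / 60


low, high = ocr_minutes(200)  # about 26.7 and 50.0 minutes
```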
After OCR processing, each document receives a quality score (0–100). Documents below 60 are flagged for manual review — their text extraction may be unreliable. Common causes: very low scan resolution (<150 DPI), heavy visual noise (fax artifacts), or non-standard fonts.
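If quality scores are consumed by a script, the review rule is a simple threshold check. How the scores are exposed is not covered in this guide, so the function below is only a hypothetical sketch of the rule itself.

```python
REVIEW_THRESHOLD = 60  # documents scoring below this are flagged


def needs_manual_review(score):
    """Return True when an OCR quality score (0-100) falls below the threshold."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    return score < REVIEW_THRESHOLD


flagged = [s for s in (95, 60, 59, 12) if needs_manual_review(s)]
```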