Document Ingestion for AI Knowledgebases: PDFs, Word, SharePoint

Supported Formats

The formats below cover the major document types most organizations use. If your format isn't listed, contact support; the ingestion pipeline is extensible.

Prose Documents

PDF — text-layer and scanned (OCR)
Word — .docx and legacy .doc
PowerPoint — .pptx and .ppt (slide text + notes)
HTML — web pages, wikis, inline portals
Markdown — .md files, documentation sites
Plain text — .txt, .log files
XML — with configurable schema mapping
Email — .eml files and attachment extraction

Data Files

Excel — .xlsx, .xls (multiple sheets)
CSV and TSV — flat tables
JSON — structured objects and arrays
XML data exports — ERP/CRM exports
Database dumps — via live datasource connector

Data files route to the Structured Data ingestion path — rows and columns are preserved, not flattened into prose. See the Structured Data guide.

Live Sources

SharePoint — document libraries + OneDrive folders
Google Drive — folders and shared drives
Confluence — spaces and pages
Notion — workspaces and databases
Box — folders and collections
Dropbox Business — team folders
Email archives — Exchange / Google Workspace

How Documents Are Processed

From raw file to retrievable knowledge in four stages. Understanding this pipeline helps you configure each stage for your content type.

1. Parsing and Extraction

The raw file is parsed to extract clean text. Each format has a dedicated extractor:

PDFs with text layers are extracted with structure preservation — heading hierarchy, paragraph boundaries, and table detection are retained as metadata.
Scanned PDFs run through OCR. RenderDraw uses a construction-domain-tuned OCR model that handles technical drawings, specification tables, and engineering notation with higher accuracy than general-purpose OCR.
Word and PowerPoint files preserve heading styles, bullet structure, and embedded table content. Slide notes in PowerPoint are extracted and attributed to their slide.
HTML pages strip navigation, footers, and boilerplate. The main content block is identified using a semantic extraction heuristic tuned for technical documentation sites.

2. Chunking

The extracted text is split into retrieval units (chunks). Chunking strategy is configured per-KB:

Paragraph splitting — the default. Respects sentence and paragraph boundaries. Produces natural, self-contained chunks that read coherently when retrieved.
Section splitting — uses heading structure (H1/H2/H3) to create section-level chunks. Ideal for long manuals where section context matters for retrieval.
Fixed-size splitting — divides at exact token counts with a configurable stride. Deterministic and useful for very uniform documents where natural boundaries don't exist.
Semantic splitting — identifies thematic shifts in the text and splits at topic boundaries. More expensive to compute but produces the highest-quality chunks for mixed-topic documents.

To keep relevant context from being cut at a boundary, the end of each chunk is repeated at the start of the next one. The size of this overlap is a configurable percentage of the chunk size.
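
Fixed-size splitting with overlap can be illustrated in a few lines. This is a minimal sketch, not RenderDraw's implementation: the function name, the whitespace "tokenizer", and the parameter names are assumptions made for the example.

```python
def fixed_size_chunks(tokens, chunk_size, overlap_pct):
    """Split a token list into fixed-size chunks that share an overlap.

    overlap_pct is the share of chunk_size (0-99) repeated at the start
    of each subsequent chunk. Hypothetical helper, for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in the range 0-99")
    overlap = int(chunk_size * overlap_pct // 100)
    stride = chunk_size - overlap  # at least 1, because overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
tokens = "a b c d e f g h i j".split()
for chunk in fixed_size_chunks(tokens, chunk_size=4, overlap_pct=25):
    print(chunk)
```

With a chunk size of 4 and 25% overlap, each chunk begins with the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.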

3. Embedding

Each chunk is converted to a dense vector representation (embedding) using the KB's configured embedding model. The embedding captures the semantic meaning of the chunk in a 1,536-dimensional (or higher) vector space — enabling similarity search that finds related content by meaning rather than keyword overlap.
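
Similarity search over embeddings usually comes down to comparing vectors by angle. The sketch below uses cosine similarity on toy 3-dimensional vectors to show the idea; the distance metric RenderDraw actually uses is not stated here, so treat cosine similarity, the function names, and the sample vectors as assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the ids of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors; real embeddings have 1,536 or more dimensions.
chunk_vecs = {"a": [0.9, 0.1, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.5, 0.5, 0.0]}
print(top_k([1.0, 0.0, 0.0], chunk_vecs, k=2))  # ['a', 'c']
```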

Embedding is the most computationally intensive step and runs asynchronously after upload. For most document sizes, embedding completes within 30–120 seconds of upload. Large batches (thousands of documents) process in the background and generate a completion notification when ready.

Embeddings are stored persistently and do not need to be recomputed unless the document content changes or the KB is migrated to a new embedding model. Model migrations happen automatically during platform updates and are performed in a rolling fashion — the KB remains fully queryable throughout the migration.

4. Indexing

Embeddings are written to the vector index (pgvector). RenderDraw uses an IVFFlat approximate nearest-neighbor index, configured for sub-100 ms retrieval, for KBs of up to 500,000 chunks, and switches to an HNSW index for KBs that exceed that threshold.

Alongside the vector index, a full-text index is built for hybrid search — queries that benefit from both semantic similarity and exact term matching (e.g., part numbers, standards codes, named specifications) can use hybrid mode to combine both signal types in the ranking.
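
One common way to combine a semantic ranking with a full-text ranking is reciprocal rank fusion (RRF). How RenderDraw's hybrid mode merges the two signals is not specified here, so the sketch below is a stand-in technique, and the function name and chunk ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list.

    Each id scores 1 / (k + rank) per list it appears in; k=60 is the
    constant commonly used with RRF. Ties are broken by id for stability.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic = ["c1", "c2", "c3"]   # ordered by vector similarity
keyword = ["c3", "c1", "c4"]    # ordered by full-text match (e.g. a part number)
print(reciprocal_rank_fusion([semantic, keyword]))  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that rank well in both lists (c1, c3) rise above chunks that appear in only one, which is the behavior hybrid search is after for queries containing exact identifiers.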

Document metadata (file name, upload date, document version, source library, custom tags) is stored in a relational index alongside the vector index. This enables metadata-filtered retrieval: "find the most relevant chunk from documents tagged year:2025 only."
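
Metadata-filtered retrieval can be thought of as "filter first, then rank". The following in-memory sketch shows the idea; the record layout, the field names, and the precomputed similarity scores are assumptions for illustration, not the platform's storage schema.

```python
def filtered_top_k(chunks, similarity, required_tag, k=1):
    """Keep only chunks carrying required_tag, then rank by similarity."""
    eligible = [c for c in chunks if required_tag in c["tags"]]
    eligible.sort(key=lambda c: similarity[c["chunk_id"]], reverse=True)
    return [c["chunk_id"] for c in eligible[:k]]


chunks = [
    {"chunk_id": "spec-2024-03", "tags": ["year:2024"]},
    {"chunk_id": "spec-2025-01", "tags": ["year:2025"]},
    {"chunk_id": "spec-2025-07", "tags": ["year:2025", "revision:C"]},
]
# Hypothetical similarity scores for one query.
similarity = {"spec-2024-03": 0.91, "spec-2025-01": 0.62, "spec-2025-07": 0.84}

print(filtered_top_k(chunks, similarity, "year:2025"))  # ['spec-2025-07']
```

The 2024 chunk has the highest raw similarity but is excluded by the tag filter, so the best 2025 chunk is returned instead.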

Batch Upload

Batch upload is the fastest way to get an initial corpus into a knowledgebase. From the KB management panel, drag and drop up to 500 files at once, or use the folder upload option to upload an entire directory tree; the subfolder structure is preserved as document metadata tags.

During batch upload, RenderDraw validates each file — checking format support, file size limits, and OCR eligibility for PDFs — and queues valid files for processing. Invalid files are flagged with error details and skipped. A progress panel shows each file's processing stage: queued, parsing, chunking, embedding, indexing, complete.

For very large initial loads (thousands of documents, multi-GB archives), use the bulk import package: a ZIP archive with a manifest JSON file that specifies metadata tags, chunk strategy overrides, and version identifiers per document. Submit the package via the API or the platform's import panel. Bulk imports process in dedicated background workers without affecting the performance of existing KBs.
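
A bulk import package could be assembled along the following lines. This is a sketch under stated assumptions: the manifest file name (manifest.json) and its keys (documents, path, tags, chunk_strategy, version) are hypothetical, so check the actual manifest schema before relying on them.

```python
import io
import json
import zipfile


def build_import_package(documents):
    """Return the bytes of a ZIP holding the files plus a manifest.

    documents is a list of dicts with 'name' and 'content' (bytes) and
    optional 'tags', 'chunk_strategy', and 'version' entries.
    All manifest keys are assumed for illustration.
    """
    manifest = {"documents": []}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for doc in documents:
            archive.writestr(doc["name"], doc["content"])
            manifest["documents"].append({
                "path": doc["name"],
                "tags": doc.get("tags", []),
                "chunk_strategy": doc.get("chunk_strategy"),  # None = KB default
                "version": doc.get("version"),
            })
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


package = build_import_package([
    {"name": "specs/conveyor.pdf", "content": b"%PDF-placeholder",
     "tags": ["product-line:conveyors"], "chunk_strategy": "section",
     "version": "2025-11-15"},
])
print(len(package), "bytes")
```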

File Size Limits

Single file upload: 250 MB per file
Batch upload: 2 GB total per batch
Bulk import package: 10 GB per package (no per-file limit within package)
Live source documents: no size limit (streamed directly)
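
A client-side pre-check against these limits can save a failed upload. The sketch below assumes binary units (1 MB = 1024 × 1024 bytes), assumes the 250 MB per-file limit also applies to files inside a batch, and uses the 500-file batch cap mentioned under Batch Upload; the function name is hypothetical.

```python
MB = 1024 * 1024                      # assumption: binary megabytes
MAX_FILE_BYTES = 250 * MB             # single file limit
MAX_BATCH_BYTES = 2 * 1024 * MB       # 2 GB total per batch
MAX_BATCH_FILES = 500                 # files per drag-and-drop batch


def check_batch(file_sizes):
    """Return a list of problems for a batch given {file name: size in bytes}."""
    problems = []
    if len(file_sizes) > MAX_BATCH_FILES:
        problems.append(f"batch has {len(file_sizes)} files; limit is {MAX_BATCH_FILES}")
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: exceeds 250 MB per-file limit")
    if sum(file_sizes.values()) > MAX_BATCH_BYTES:
        problems.append("batch exceeds 2 GB total")
    return problems


print(check_batch({"submittal.pdf": 120 * MB, "as-built.pdf": 251 * MB}))
# ['as-built.pdf: exceeds 250 MB per-file limit']
```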

API Ingestion

API ingestion is the preferred mode for automated document pipelines. The POST /api/knowledgebases/{kb_id}/documents endpoint accepts a document file (multipart/form-data) or a URL (for web pages and live-accessible files) alongside metadata fields.

POST /api/knowledgebases/kb_specs_2025/documents
Content-Type: multipart/form-data

file: [binary]
metadata: {
  "title": "Conveyor Spec Sheet Rev C",
  "tags": ["product-line:conveyors", "revision:C"],
  "version": "2025-11-15",
  "replace_previous": true
}
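
The request above could be built in Python with only the standard library, as sketched below. The base URL is a placeholder, the form field names (file, metadata) mirror the example, and authentication is omitted because it is not described here; all of these are assumptions to adapt to your tenant.

```python
import json
import urllib.request
import uuid


def build_ingest_request(base_url, kb_id, file_name, file_bytes, metadata):
    """Build (but do not send) a multipart/form-data ingestion request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        b"\r\n",
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="metadata"\r\n',
        b"Content-Type: application/json\r\n\r\n",
        json.dumps(metadata).encode(),
        b"\r\n",
        f"--{boundary}--\r\n".encode(),  # closing boundary
    ])
    return urllib.request.Request(
        url=f"{base_url}/api/knowledgebases/{kb_id}/documents",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


request = build_ingest_request(
    "https://example.invalid",  # placeholder host
    "kb_specs_2025",
    "conveyor-spec-rev-c.pdf",
    b"%PDF-placeholder",
    {"title": "Conveyor Spec Sheet Rev C",
     "tags": ["product-line:conveyors", "revision:C"],
     "version": "2025-11-15",
     "replace_previous": True},
)
print(request.get_method(), request.full_url)
```

Sending the request would be a call to urllib.request.urlopen(request) once the real host and credentials are in place.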

The replace_previous: true flag supersedes the previous version of a document with the same title in the KB. The previous version is retained in version history but is no longer included in default retrieval. Set replace_previous: false to add the new version alongside the previous one (e.g. for parallel spec versions that are both valid).
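
The effect of the flag on retrieval can be modeled with a small in-memory sketch. The class and method names are hypothetical and only mirror the behavior described above: superseded versions stay in history but drop out of default retrieval.

```python
from dataclasses import dataclass


@dataclass
class DocumentVersion:
    title: str
    version: str
    active: bool = True  # active versions are included in default retrieval


class VersionedKB:
    """Toy model of version handling; not the platform's data model."""

    def __init__(self):
        self.history = []

    def add(self, title, version, replace_previous):
        if replace_previous:
            for existing in self.history:
                if existing.title == title:
                    existing.active = False  # kept in history, not retrieved
        self.history.append(DocumentVersion(title, version))

    def retrievable_versions(self, title):
        return [v.version for v in self.history if v.title == title and v.active]


kb = VersionedKB()
kb.add("Conveyor Spec Sheet", "2025-06-01", replace_previous=False)
kb.add("Conveyor Spec Sheet", "2025-11-15", replace_previous=True)
print(kb.retrievable_versions("Conveyor Spec Sheet"))  # ['2025-11-15']
print(len(kb.history))                                 # 2
```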

Webhook callbacks notify your system when processing is complete: POST {your_callback_url} with the document ID, processing status, chunk count, and any parsing warnings. Use this to trigger downstream actions — for example, notifying workflow authors that a KB has been updated.
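
On the receiving side, a callback handler might branch on the reported status and warnings. The payload field names (document_id, status, chunk_count, warnings) and the status value "complete" are assumptions based on the description above, not a documented schema.

```python
import json


def handle_callback(raw_body):
    """Decide what to do with a processing callback; returns an action name."""
    payload = json.loads(raw_body)
    status = payload.get("status")
    warnings = payload.get("warnings", [])
    if status != "complete":
        return "investigate_failure"
    if warnings:
        return "review_warnings"   # processed, but the parser flagged something
    return "notify_authors"        # clean run: tell workflow authors the KB changed


body = json.dumps({
    "document_id": "doc_123",      # hypothetical id
    "status": "complete",
    "chunk_count": 42,
    "warnings": [],
})
print(handle_callback(body))  # notify_authors
```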

Scheduled Re-Sync for Live Sources

For knowledgebases connected to live sources such as SharePoint, Google Drive, Confluence, or Notion, scheduled re-sync keeps the KB current without manual effort. RenderDraw polls the connected source on the configured schedule, identifies new and modified files, and re-ingests only the changed documents (an incremental sync, not a full re-index).

Re-sync schedules are configurable per-KB: hourly for fast-moving sources like active project document libraries, daily for standard document management libraries, or weekly/monthly for stable reference archives like compliance standards libraries. A manual Sync Now trigger is always available for immediate re-sync outside the scheduled window.

When a document is updated in the source system, RenderDraw detects the change via the source's file modification timestamp or change notification API (SharePoint webhook, Google Drive push notification). The previous version of the document is preserved in the KB's version history, and the new version becomes the active retrieval target.

Documents deleted from the source are not automatically removed from the KB. This is intentional — deleted documents may still be referenced in past workflow outputs and are needed for audit purposes. Deleted source documents are flagged as "source deleted" in the KB document list. You can manually archive them or set an auto-archive policy (e.g. archive source-deleted documents after 90 days).
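
The incremental sync behavior described in this section can be sketched as a comparison of modification timestamps. This is an illustrative model only: the function name and the path-to-timestamp dictionaries are assumptions, and the real connectors may rely on change notifications rather than a full listing.

```python
def plan_sync(source_files, kb_files):
    """Compare source and KB listings ({path: ISO modification timestamp}).

    Returns (to_ingest, source_deleted): new or modified paths to re-ingest,
    and paths that exist only in the KB, which are flagged rather than removed.
    """
    to_ingest = sorted(
        path for path, modified in source_files.items()
        if path not in kb_files or modified > kb_files[path]
    )
    source_deleted = sorted(path for path in kb_files if path not in source_files)
    return to_ingest, source_deleted


source = {
    "specs/conveyor.pdf": "2025-11-15T09:00:00Z",   # modified since last sync
    "specs/rollers.pdf": "2025-03-01T12:00:00Z",    # unchanged
    "specs/belts.pdf": "2025-11-20T08:30:00Z",      # new
}
kb = {
    "specs/conveyor.pdf": "2025-06-01T10:00:00Z",
    "specs/rollers.pdf": "2025-03-01T12:00:00Z",
    "specs/legacy.pdf": "2019-01-10T00:00:00Z",     # deleted at the source
}
print(plan_sync(source, kb))
# (['specs/belts.pdf', 'specs/conveyor.pdf'], ['specs/legacy.pdf'])
```

ISO 8601 timestamps in the same timezone and format compare correctly as strings, which keeps the sketch free of date parsing.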

Re-Sync Schedule Recommendations

Active project SharePoint libraries: Hourly
Engineering spec document libraries: Daily
Pricing and BOM exports: Daily (or on ERP export trigger)
Compliance standards libraries: Weekly
Archived proposal libraries: Monthly
Historical project archives: Manual only

Change Detection Methods

SharePoint: Change notification webhook (real-time) or polling
Google Drive: Push notification API (near real-time)
Confluence: Audit log polling (5-minute minimum interval)
Box / Dropbox: Webhook (real-time) or polling
Manual uploads: Immediate on upload

OCR for Scanned Construction Documents

Scanned PDFs are ubiquitous in construction and manufacturing — vendor submittals, as-built drawings, historical spec sheets, legacy RFP responses, and pre-digital project records are often image-only PDFs with no text layer. RenderDraw's OCR pipeline is tuned specifically for this content type.

The OCR engine handles: mixed-orientation pages (portrait and landscape in the same document), multi-column layouts, technical tables with fine gridlines, superscript and subscript notation common in engineering specs, handwritten annotations in margins, and stamps and watermarks that obscure underlying text.

For technical drawings (P&IDs, floor plans, wiring diagrams), OCR extracts text from callouts, labels, title blocks, and notes — but does not interpret the geometric content of the drawing itself. For AI-driven interpretation of drawing geometry, use the CAD Asset KB type with a native CAD file rather than a scanned PDF derivative.

OCR processing time averages 8–15 seconds per scanned page. At that rate, a 200-page submittal package takes roughly 27–50 minutes to process. High-priority OCR queues are available for time-sensitive uploads; contact support to enable priority OCR for your tenant.
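
The per-page figure turns into a package-level estimate by simple multiplication. The sketch below assumes pages are processed one after another, which is not stated here; any parallelism or queueing would change the result, so treat it as a rough planning aid.

```python
def ocr_minutes(pages, sec_per_page_low=8, sec_per_page_high=15):
    """Return a (low, high) estimate in minutes for sequential OCR."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return pages * sec_per_page_low / 60, pages * sec_per_page_high / 60


low, high = ocr_minutes(200)
print(f"{low:.1f}-{high:.1f} minutes")  # 26.7-50.0 minutes
```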

OCR Quality Indicators

After OCR processing, each document receives a quality score (0–100). Documents scoring below 60 are flagged for manual review because their text extraction may be unreliable. Common causes are very low scan resolution (<150 DPI), heavy visual noise (such as fax artifacts), and non-standard fonts.