This page collects new features and directions that we would like to propose for ManifoldCF.
From Legacy Crawler to AI-Ready Data Ingestion Hub.
Assessing and upgrading the current connectors
We probably need to create an assessment table to decide which connectors we should keep, remove, update, or rewrite from scratch.
Output Connectors for vector databases (RAG)
It would be worth exploring whether we can add connectors dedicated to Retrieval-Augmented Generation (RAG) that help ingest data into vector databases.
We could implement one or more RAG Output Connectors using LangChain4j, for example one for each embedding store (Elasticsearch, Cassandra, Neo4j, and so on).
Security
We have to keep fixing critical vulnerabilities, at a minimum by upgrading the affected dependencies.
Resiliency
Should we invest time in making the architecture more reliable?
Core Performance & Modernization (The Java 21 Leap)
The transition to OpenJDK 21 is the foundation for a more scalable and responsive architecture.
Virtual Threads Integration (Project Loom):
Replace legacy thread pooling with Virtual Threads to handle thousands of concurrent repository connections with minimal memory overhead.
Implement Structured Concurrency (still a preview API in Java 21) to improve the reliability of complex crawling jobs and prevent resource leakage.
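As a rough illustration of the virtual-thread idea above, here is a minimal sketch that runs one virtual thread per document fetch. It is not ManifoldCF code: `VirtualThreadFetchSketch`, `fetchDocument`, and `fetchAll` are hypothetical names, and the repository call is simulated with a sleep. Because Structured Concurrency is still a preview API in Java 21, the sketch relies only on the finalized `Executors.newVirtualThreadPerTaskExecutor()`.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names exist in ManifoldCF.
class VirtualThreadFetchSketch {

    // Stand-in for a blocking repository call (assumption: real connectors block on I/O).
    static String fetchDocument(int id) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(50)); // simulates network latency
        return "doc-" + id;
    }

    // Starts one virtual thread per document and returns the results in submission order.
    static List<String> fetchAll(int count) throws InterruptedException, ExecutionException {
        List<Future<String>> futures = new ArrayList<>();
        // Leaving the try block calls close(), which waits for all submitted tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < count; i++) {
                final int id = i;
                futures.add(executor.submit(() -> fetchDocument(id)));
            }
        }
        List<String> results = new ArrayList<>();
        for (Future<String> future : futures) {
            results.add(future.get()); // already completed at this point
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = fetchAll(1_000);
        System.out.println("Fetched " + docs.size() + " documents");
    }
}
```

Each task blocks in ordinary sequential style, and the JVM parks the virtual thread instead of holding an OS thread, so no pool sizing is needed in this sketch.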
REST API v2 & OpenAPI Specification:
Complete the transition to a fully documented RESTful API.
Enable "Configuration-as-Code" to support modern DevOps workflows and external orchestration.
Observability with OpenTelemetry:
Integrate native tracing to monitor document processing latency across connectors, leveraging Java 21's improved profiling capabilities.
AI & Vector Ecosystem Integration (RAG-Readiness)
Positioning ManifoldCF as the primary "ingestion engine" for Retrieval-Augmented Generation (RAG) and LLM applications.
Universal Embedding Transformation Connector:
Develop a new transformation module using LangChain4j or Apache OpenNLP (ONNX support).
Enable in-flight embedding generation (converting text to vectors) directly within the MCF pipeline, with no per-request API cost, by using local open-source models (e.g., BGE-M3, Nomic).
Native Vector Store Output Connectors:
Launch official output connectors for leading open-source Vector Databases: Solr Dense Vector, Milvus, Qdrant, and Weaviate.
Develop a specialized pgvector connector for users leveraging PostgreSQL as a unified metadata and vector store.
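To make the pgvector idea more concrete, the sketch below shows how an output connector might format an embedding and which SQL it could issue. The table and column names (`documents`, `uri`, `content`, `embedding`) and the class `PgVectorSketch` are assumptions for illustration. The `[v1,v2,...]` text form and the `<=>` cosine-distance operator come from pgvector itself. No JDBC driver is used here; the statements are plain strings meant to be bound through a `PreparedStatement`.

```java
import java.util.StringJoiner;

// Hypothetical illustration only; the schema and class name are assumptions.
class PgVectorSketch {

    // Upsert keyed on the document URI (assumes a unique constraint on "uri").
    static final String UPSERT_SQL =
        "INSERT INTO documents (uri, content, embedding) VALUES (?, ?, ?::vector) "
        + "ON CONFLICT (uri) DO UPDATE SET content = EXCLUDED.content, "
        + "embedding = EXCLUDED.embedding";

    // Nearest-neighbour query using pgvector's cosine-distance operator.
    static final String SEARCH_SQL =
        "SELECT uri FROM documents ORDER BY embedding <=> ?::vector LIMIT ?";

    // Formats an embedding in pgvector's text form, e.g. [0.5,1.0,-2.0].
    static String toVectorLiteral(float[] embedding) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : embedding) {
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toVectorLiteral(new float[] {0.5f, 1.0f, -2.0f}));
        System.out.println(UPSERT_SQL);
        System.out.println(SEARCH_SQL);
    }
}
```

Keeping metadata and vectors in the same row is what would let a single PostgreSQL instance serve as the unified store mentioned above.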
Advanced Metadata & ACL Mapping for AI:
Ensure that security permissions (ACLs) are seamlessly passed to vector stores as "payload" filters to maintain document security in AI search interfaces.
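One possible shape for this mapping is sketched below. It assumes a simplified model in which each document carries a set of allow tokens and a set of deny tokens, a deny match always wins, and at least one allow match is required. The names `AclPayloadSketch`, `AclPayload`, and `isVisible` are hypothetical. A real connector would translate the same rule into the filter syntax of the target vector store instead of evaluating it in Java.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical illustration only; a simplified allow/deny token model is assumed.
class AclPayloadSketch {

    // Payload stored next to each vector so that search results can be filtered.
    record AclPayload(String uri, Set<String> allowTokens, Set<String> denyTokens) {}

    // A document is visible when no deny token matches and at least one allow token matches.
    static boolean isVisible(AclPayload payload, Set<String> userTokens) {
        boolean denied = !Collections.disjoint(payload.denyTokens(), userTokens);
        boolean allowed = !Collections.disjoint(payload.allowTokens(), userTokens);
        return allowed && !denied;
    }

    public static void main(String[] args) {
        AclPayload payload = new AclPayload(
            "file://share/report.pdf",
            Set.of("group:engineering"),
            Set.of("user:contractor"));
        System.out.println(isVisible(payload, Set.of("group:engineering")));
        System.out.println(isVisible(payload, Set.of("group:engineering", "user:contractor")));
    }
}
```

Evaluating the filter inside the vector store, not after retrieval, would keep restricted documents out of the candidate set that reaches the LLM.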
Cloud-Native & Ecosystem Synergy
Expanding the reach of ManifoldCF through deeper integration with the Apache ecosystem and containerized environments.
Apache Airflow & NiFi Integration:
Release an official Apache Airflow Provider to trigger and monitor MCF jobs within global data pipelines.
Optimize the data hand-off between MCF (Source) and Apache NiFi (Processor) for high-speed streaming.
Kubernetes Operator:
Develop a native K8s Operator to simplify deployment, auto-scaling of Agents, and management of the database backend.
Next-Gen Administrative UI:
Refresh the management console using a modern frontend framework (React/Vue) to provide real-time throughput dashboards and intuitive job monitoring.
Summary of Strategic Goals
| Goal | Description | Value Proposition |
| --- | --- | --- |
| Scale | Java 21 Virtual Threads | Higher throughput with lower hardware costs. |
| Intelligence | Local Embedding Connectors | Enable RAG pipelines without third-party API costs. |
| Connectivity | Vector Store Connectors | Integration with the modern AI/LLM stack. |
| Usability | REST API & Airflow | Easier adoption for DevOps and Data Engineers. |