Multimodal

HOT

Projects and tools that deal with multimodal data, combining modalities like text, images, and speech.

Active projects 100
New this week +178
Total star growth +591
Cross-source 5
Total stars 592.1k
Total forks 78.4k

Top Projects (100)


fikrikarim/parlor

On-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine. Powered by Gemma 4 E2B and Kokoro.

Trend 12
Breakout +87.1%
apple-silicon gemma kokoro litert-lm local-llm mlx multimodal on-device-ai python real-time speech-recognition text-to-speech voice-assistant
1.1k 107 +187/wk
GitHub

x-zheng16/Awesome-Embodied-AI-Safety

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses | 400+ Papers | Perception, Cognition, Planning, Interaction, Agentic System

Trend 4
🔥 Heating Up +13.5%
adversarial-attacks ai-safety autonomous-driving backdoor-attacks embodied-agents embodied-ai jailbreak large-language-models multimodal robotics survey
59 0 +4/wk
GitHub

lcqysl/GEMS

GEMS: Agent-Native Multimodal Generation with Memory and Skills

Trend 4
🔥 Heating Up +12.5%
agent generation multimodal reasoning
90 4 +2/wk
GitHub

zai-org/GLM-skills

Official skills for the GLM family of models.

Trend 4
glm multimodal ocr skills vision
272 19 +7/wk
GitHub

qingchencloud/clawpanel

🦞 OpenClaw visual management panel with a built-in AI assistant (tool calling + image recognition + multimodal + i18n(11)), one-click install.

Trend 3
admin-panel ai-agent ai-assistant ai-chat ai-tools chatgpt cross-platform deepseek desktop-app image-recognition llm management-panel multimodal openclaw openclaw-panel rust tauri tauri-v2 tool-calling vite
2.2k 279 +28/wk
GitHub

OpenMOSS/MOSS-TTS

MOSS‑TTS Family is an open‑source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high‑fidelity, high‑expressiveness, and complex real‑world scenarios, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.

Trend 3
audio audio-tokenizer llm multimodal text-to-speech voice-cloning
1.1k 103 +16/wk
GitHub

vllm-project/vllm-omni

A framework for efficient model inference with omni-modality models

Trend 3
audio-generation diffusion image-generation inference model-serving multimodal pytorch transformer video-generation
4.2k 719 +50/wk
GitHub

datawhalechina/all-in-rag

🔍 LLM Application Development in Practice, Part 1: A Full-Stack Guide to RAG. Read online: https://datawhalechina.github.io/all-in-rag/

Trend 3
ai deepseek embedding kimi-k2 langchain llama-index llm milvus multimodal neo4j python rag
5.9k 2.9k +52/wk
GitHub

yuanzhao-CVLAB/UniMMAD

[CVPR 2026] Official Implementation of UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression

Trend 3
anomaly-detection mixture-of-experts multimodal
206 21 +1/wk
GitHub

Anyesh/wardrowbe

Put your wardrobe in rows. Self-hosted AI-powered wardrobe management app.

Trend 3
ai outfit-ai outfit-pairing outfits style-ai wardrobe wardrobe-app wardrobe-management
169 23 +0/wk
GitHub

ParisNeo/lollms

An all-in-one AI solution compatible with any known AI service on the planet

Trend 3
ai llm multimodal
63 17 +0/wk
GitHub

AccumulateMore/CV

✔ (Completed) Extremely comprehensive deep learning notes: [Tudui: PyTorch] [Mu Li: Dive into Deep Learning] [Andrew Ng: Deep Learning] [Dafei: LLM Agents]

Trend 3
agent agents book chinese computer-vision cv deep-learning jupyter-notebook llm llms machine-learning natural-language-processing nlp notebook python rag
19.5k 2.2k +76/wk
GitHub PyPI 2-source

modelscope/ms-swift

Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...) (AAAI 2025).

Trend 3
deepseek-r1 embedding grpo internvl liger llama llama4 llm lora megatron moe multimodal open-r1 peft qwen3 qwen3-5 qwen3-omni qwen3-vl reranker sft
13.6k 1.3k +16/wk
GitHub

Blaizzy/mlx-audio

A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.

Trend 3
apple-silicon audio-processing mlx multimodal speech-recognition speech-synthesis speech-to-text text-to-speech transformers
6.6k 541 +8/wk
GitHub

UbiquitousLearning/mllm

Fast Multimodal LLM on Mobile Devices

Trend 3
ai llama llm mobile multimodal
1.5k 187 +0/wk
GitHub

llm-jp/awesome-japanese-llm

Overview of Japanese LLMs

Trend 3
foundation-models generative-ai generative-model generative-models japanese japanese-language japanese-language-model japanese-llm language-model language-models large-language-model large-language-models llm llm-japanese llms multimodal vision-and-language vision-language vision-language-model
1.4k 43 +1/wk
GitHub

InternRobotics/PointLLM

[ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large Language Models to Understand Point Clouds

Trend 3
3d chatbot foundation-models gpt-4 large-language-models llama multimodal objaverse point-cloud pointllm representation-learning vision-and-language
999 57 +1/wk
GitHub

OpenMOSS/MOVA

MOVA: Towards Scalable and Synchronized Video–Audio Generation

Trend 3
diffusion-models multimodal sglang video-audio-generation
887 62 +4/wk
GitHub

PaddlePaddle/PaddleMIX

Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks, including end-to-end large-scale multi-modal pretrain models and diffusion model toolbox. Equipped with high performance and flexibility.

Trend 3
aigc clip controlnet deepseek-vl dit eva-clip got-ocr20 image-to-text internvl2 llava minicpm-v multimodal ppdiffusers qwen2-vl sd-xl sora stable-diffusion stablevideodiffusion text-to-image text-to-video
721 225 +1/wk
GitHub

EvolvingLMMs-Lab/NEO

NEO Series: Native Vision-Language Models from First Principles

Trend 3
agi encoder-free-vlm large-language-models mllm multimodal multimodal-large-language-models native-multimodal-model vlm
699 25 +1/wk
GitHub

enoche/MMRec

A Toolbox for MultiModal Recommendation. Integrating 10+ Models...

Trend 3
multi-modal-retrieval multimedia-recommendation multimodal recommender-system
649 97 +2/wk
GitHub

TIGER-AI-Lab/VLM2Vec

This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]

Trend 3
benchmark contrastive-learning embedding image-retrieval mmeb multimodal rag representation-learning video-retrieval visual-document-retrieval vlm
622 59 +0/wk
GitHub

shenhao-stu/ohmycaptcha

⚡ Self-hostable YesCaptcha-compatible captcha solver built with FastAPI, Playwright, and OpenAI-compatible multimodal models.

Trend 3
captcha fastapi multimodal openai-compatible playwright recaptcha self-hosted vision-models yescaptcha-compatible
619 210 +1/wk
GitHub

ictnlp/LLaVA-Mini

LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.

Trend 3
efficient gpt4o gpt4v large-language-models large-multimodal-models llama llava multimodal multimodal-large-language-models video vision vision-language-model visual-instruction-tuning
569 32 +1/wk
GitHub

suzuran0y/CCTV-Smartphone-AI-Monitoring

Local monitoring + AI vision: LAN-based, smartphone-powered AI monitoring framework with structured event output for data acquisition and analysis.

Trend 3
ai-monitoring computer-vision device-repurposing event-driven image-recognition-tool ip-camera ml-ops monitoring-system multimodal structured-output video-streaming
547 38 +1/wk
GitHub

RobotecAI/rai

RAI is a vendor-agnostic agentic framework for Physical AI robotics, utilizing ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction and more.

Trend 3
ai ai-agents-framework embodied-agent embodied-agents embodied-ai embodied-artificial-intelligence generative-ai llm multi-agent-systems multimodal o3de physical-ai robotec robotics ros2 vlm
487 65 +1/wk
GitHub

qingchencloud/clawapp

📱 ClawApp: mobile chat client for the OpenClaw AI Agent | Streaming chat · image sending and receiving · tool calling · PWA + APK

Trend 3
ai-agents ai-assistant android capacitor chat-client chinese dark-mode h5 i18n markdown mobile-chat multimodal openclaw pwa self-hosted streaming tool-calling voice-input websocket
375 45 +1/wk
GitHub

antflydb/antfly

Trend 3
ai-agents autoscaling elasticsearch information-retrieval ml multimodal rag semantic-search
324 20 +0/wk
GitHub

clawdotnet/openclaw.net

Self-hosted OpenClaw gateway + agent runtime in .NET (NativeAOT-friendly)

Trend 3
agent-runtime ai-agent automation csharp discord-bot dotnet llm mcp memory microsoft-agent-framework multimodal nativeaot openai-compatible realtime self-evolving self-hosted text-to-speech tool-calling tool-execution
196 31 +1/wk
GitHub

marqo-ai/marqo-FashionCLIP

State-of-the-art CLIP/SigLIP embedding models finetuned for the fashion domain. +57% increase in evaluation metrics vs FashionCLIP 2.0.

Trend 3
clip embeddings fashion-classifier fashionclip informationretrieval multimodal recomendations search transformers vectorsearch vision-transformer
127 14 +0/wk
GitHub

isLinXu/paper-list

Auto-updating paper list

Trend 3
action-recognition anomaly-detection audio-processing classification depth-estimation graph-neural-networks image-generation llm multimodal object-detection object-tracking optical-flow pose-estimation reinforcement-learning scene-understanding semantic-segmentation transfer-learning
118 10 +0/wk
GitHub

SmooSenseAI/smoosense

Interactively browse multimodal tabular data

Trend 3
analytics exploratory-data-analysis exploratory-data-visualizations multimodal visualization
108 13 +0/wk
GitHub

oidlabs-com/Lexoid

Multimodal document parser for high quality data understanding and extraction

Trend 3
genai html-to-markdown html-to-pdf large-language-models llms multimodal ocr ocr-python parser-library pdf-document pdf-parser pdf-to-json pdf-to-latex
98 12 +0/wk
GitHub

bytedance/UI-TARS-desktop

The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra

Trend 3
agent agent-tars browser-use computer-use cowork gui-agent gui-operator mcp mcp-server multimodal tars ui-tars vision vlm
29.3k 2.9k +19/wk
GitHub

haotian-liu/LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Trend 3
chatbot chatgpt foundation-models gpt-4 instruction-tuning llama llama-2 llama2 llava multi-modality multimodal vision-language-model visual-language-learning
24.7k 2.8k +7/wk
GitHub PyPI 2-source

microsoft/unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Trend 3
beit beit-3 bitnet deepnet document-ai foundation-models kosmos kosmos-1 layoutlm layoutxlm llm minilm mllm multimodal nlp pre-trained-model textdiffuser trocr unilm xlm-e
22.1k 2.7k +1/wk
GitHub HuggingFace 2-source

jina-ai/serve

☁️ Build multimodal AI applications with cloud-native stack

Trend 3
cloud-native cncf deep-learning docker fastapi framework generative-ai grpc jaeger kubernetes llmops machine-learning microservice mlops multimodal neural-search opentelemetry orchestration pipeline prometheus
21.9k 2.2k -1/wk
GitHub PyPI 2-source

screenpipe/screenpipe

Run agents that work for you based on what you do. AI finally knows what you are doing

Trend 3
agents agi ai computer-vision llm machine-learning ml multimodal vision
18.1k 1.6k +9/wk
GitHub HuggingFace 2-source

pytorch/vision

Datasets, Transforms and Models specific to Computer Vision

Trend 3
computer-vision machine-learning
17.6k 7.2k +0/wk
GitHub

bharathgs/Awesome-pytorch-list

A comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc.

Trend 3
awesome awesome-list computer-vision cv data-science deep-learning facebook machine-learning natural-language-processing neural-network nlp nlp-library papers probabilistic-programming python pytorch pytorch-model pytorch-tutorials tutorials utility-library
16.5k 2.8k +1/wk
GitHub

datawhalechina/leedl-tutorial

Hung-yi Lee Deep Learning Tutorial (recommended by Prof. Hung-yi Lee 👍, the "Apple Book" 🍎). PDF download: https://github.com/datawhalechina/leedl-tutorial/releases

Trend 3
bert chatgpt cnn deep-learning diffusion gan leedl-tutorial machine-learning network-compression pruning reinforcement-learning rnn self-attention transfer-learning transformer tutorial
16.5k 3.1k -3/wk
GitHub

lukas-blecher/LaTeX-OCR

pix2tex: Using a ViT to convert images of equations into LaTeX code.

Trend 3
dataset deep-learning im2latex im2markup im2text image-processing image2text latex latex-ocr machine-learning math-ocr ocr python pytorch transformer vision-transformer vit
16.3k 1.3k +4/wk
GitHub

Unstructured-IO/unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding.

Trend 3
data-pipelines deep-learning document-image-analysis document-image-processing document-parser document-parsing docx donut information-retrieval langchain llm machine-learning ml natural-language-processing nlp ocr pdf pdf-to-json pdf-to-text preprocessing
14.4k 1.2k +7/wk
GitHub

davisking/dlib

A toolkit for making real world machine learning and data analysis applications in C++

Trend 3
c-plus-plus computer-vision deep-learning dlib machine-learning machine-learning-library python
14.4k 3.5k +0/wk
GitHub

virgili0/Virgilio

Your new Mentor for Data Science E-Learning.

Trend 3
business-intelligence computer-vision data-science datascience guide guidelines hacktoberfest learning learning-python machine-learning machine-vision nlp path python scikit-learn statistics study studypath tensorflow virgilio
14.3k 2.5k +1/wk
GitHub

jacobgil/pytorch-grad-cam

Advanced AI Explainability for computer vision. Support for CNNs, Vision Transformers, Classification, Object detection, Segmentation, Image similarity and more.

Trend 3
class-activation-maps computer-vision deep-learning explainable-ai explainable-ml grad-cam image-classification interpretability interpretable-ai interpretable-deep-learning machine-learning object-detection pytorch score-cam vision-transformers visualizations xai
12.7k 1.7k +0/wk
GitHub

diff-usion/Awesome-Diffusion-Models

A collection of resources and papers on Diffusion Models

Trend 3
artificial-intelligence diffusion-models generative-model machine-learning score-based score-matching
12.3k 1.0k +1/wk
GitHub

nerfstudio-project/nerfstudio

A collaboration-friendly studio for NeRFs

Trend 3
3d 3d-graphics 3d-reconstruction computer-vision deep-learning gaussian-splatting machine-learning nerf photogrammetry pytorch
11.4k 1.6k +2/wk
GitHub

kornia/kornia

🐍 Geometric Computer Vision Library for Spatial AI

Trend 3
artificial-intelligence computer-vision deep-learning hacktoberfest image-processing machine-learning neural-network python pytorch robotics spatial-ai
11.2k 1.2k +1/wk
GitHub

voxel51/fiftyone

Refine high-quality datasets and visual AI models

Trend 3
active-learning artificial-intelligence computer-vision data-centric-ai data-cleaning data-curation data-quality data-science deep-learning developer-tools image-classification machine-learning object-detection python unstructured-data vector-search visualization
10.6k 736 +5/wk
GitHub

rerun-io/rerun

An open source SDK for logging, storing, querying, and visualizing multimodal and multi-rate data

Trend 3
computer-vision cpp multimodal python robotics rust visualization
10.5k 706 +7/wk
GitHub

esimov/caire

Content aware image resize library

Trend 3
computer-vision content-aware-resize content-aware-scaling edge-detection face-detection golang image-processing image-resize machine-learning seam-carving
10.5k 386 -1/wk
GitHub

yzhao062/pyod

A Python Library for Outlier and Anomaly Detection on Tabular, Text, and Image Data

Trend 3
anomaly anomaly-detection autoencoder data-mining data-science deep-learning foundation-models fraud-detection image-anomaly-detection machine-learning multimodal neural-networks nlp-anomaly-detection novelty-detection out-of-distribution-detection outlier-detection outlier-ensembles outliers unsupervised-learning
9.8k 1.5k +3/wk
GitHub

apache/seatunnel

SeaTunnel is a multimodal, high-performance, distributed, massive data integration tool.

Trend 3
apache batch cdc change-data-capture data-ingestion data-integration elt embeddings high-performance llm multimodal offline real-time streaming
9.2k 2.2k -1/wk
GitHub

X-PLUG/MobileAgent

Mobile-Agent: The Powerful GUI Agent Family

Trend 3
agent android app automation copilot gui mllm mobile mobile-agents multimodal multimodal-agent multimodal-large-language-models
8.4k 850 +4/wk
GitHub

om-ai-lab/VLM-R1

Solve Visual Understanding with Reinforced VLMs

Trend 3
deepseek-r1 grpo llm multimodal multimodal-r1 qwen r1-zero reinforcement-learning vlm vlm-r1
5.9k 378 +2/wk
GitHub

genkit-ai/genkit

Open-source framework for building AI-powered apps in JavaScript, Go, and Python, built and used in production by Google

Trend 3
agents ai embedders genkit llm multimodal rag vector-database
5.8k 706 +1/wk
GitHub

OpenBMB/UltraRAG

A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines

Trend 3
deepseek demo easy embedding flask gpt huggingface-transformers llm mcp multimodal openai qwen rag sentence-transformers ui vllm vlm
5.5k 410 +2/wk
GitHub

Eventual-Inc/Daft

High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale

Trend 3
ai-engineering ai-pipeline arrow artificial-intelligence big-data data-engineering distributed distributed-computing distributed-systems embeddings etl huggingface iceberg machine-learning multimodal parquet python ray rust
5.4k 439 +4/wk
GitHub

InternLM/xtuner

A Next-Generation Training Engine Built for Ultra-Large MoE Models

Trend 3
agent deepseek-v3 gpt-oss intern-s1 internvl kimi-k2 llm multimodal qwen3-moe qwen3-vl reinforcement-learning
5.1k 413 +0/wk
GitHub

PKU-Alignment/align-anything

Align Anything: Training All-modality Model with Feedback

Trend 3
chameleon dpo large-language-models multimodal rlhf vision-language-model
4.6k 507 +0/wk
GitHub

luban-agi/Awesome-AIGC-Tutorials

Curated tutorials and resources for Large Language Models, AI Painting, and more.

Trend 3
ai aigc awesome chatgpt courses-resource deep-learning llm midjourney multimodal nlp prompt-engineering stable-diffusion tutorials
4.5k 300 +2/wk
GitHub

rom1504/img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

Trend 3
big-data dataset deep-learning download-images image image-dataset multimodal
4.4k 375 +0/wk
GitHub

EvolvingLMMs-Lab/lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Trend 3
agi audio-evaluation benchmark evaluation large-language-models llm-evaluation multimodal multimodal-evaluation video-understanding vision-language-model vlm
4.0k 557 +0/wk
GitHub

NExT-GPT/NExT-GPT

Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model

Trend 3
chatgpt foundation-models gpt-4 instruction-tuning large-language-models llm mllm multi-modal-chatgpt multimodal visual-language-learning
3.6k 361 +0/wk
GitHub

atfortes/Awesome-LLM-Reasoning

From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓

Trend 3
awesome chain-of-thought chatgpt cot deepseek deepseek-r1 gpt gpt-4o in-context-learning language-models mllm multimodal openai-o1 papers prompt prompt-engineering reasoning strawberry
3.6k 202 +1/wk
GitHub

morphik-org/morphik-core

The most accurate document search and store for building AI apps

Trend 3
artificial-intelligence cache-augmented-generation colpali database litellm multimodal rag rules-based-ingestion
3.6k 297 +2/wk
GitHub

embeddings-benchmark/mteb

MTEB: Massive Text Embedding Benchmark

Trend 3
benchmark bitext-mining clustering information-retrieval low-resource-nlp mteb multilingual-nlp multimodal neural-search reranking retrieval sbert semantic-search sentence-transformers sts text-classification text-embedding
3.2k 586 +0/wk
GitHub

vortex-data/vortex

An extensible, state-of-the-art columnar file format. Formerly at @spiraldb, now an Incubation Stage project at LF AI & Data, part of the Linux Foundation.

Trend 3
array arrow compression file multimodal python rust
2.8k 144 +2/wk
GitHub

rom1504/clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them

Trend 3
ai clip deep-learning knn multimodal semantic-search
2.7k 239 +1/wk
GitHub

roboflow/maestro

Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL

Trend 3
captioning fine-tuning florence-2 multimodal objectdetection paligemma phi-3-vision qwen2-vl transformers vision-and-language vqa
2.7k 222 +0/wk
GitHub

OFA-Sys/OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Trend 3
chinese image-captioning multimodal pretrained-models pretraining prompt prompt-tuning referring-expression-comprehension text-to-image-synthesis vision-language visual-question-answering
2.6k 250 +1/wk
GitHub

InternLM/HuixiangDou

HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance

Trend 3
application assistant assistant-chat-bots chatbot dsl group-chat image-retrieval lark llm multimodal pipeline rag robot wechat
2.5k 185 +1/wk
GitHub

X-PLUG/mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Trend 3
chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding
2.4k 149 +1/wk
GitHub

OpenGVLab/InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

Trend 3
action-recognition benchmark contrastive-learning foundation-models instruction-tuning masked-autoencoder multimodal open-set-recognition self-supervised spatio-temporal-action-localization temporal-action-localization video-clip video-data video-dataset video-question-answering video-retrieval video-understanding vision-transformer zero-shot-classification zero-shot-retrieval
2.2k 144 +1/wk
GitHub

genieincodebottle/generative-ai

Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.

Trend 3
agentic-ai agentic-framework claude gemini genai genai-usecase generative-ai interview-questions langchain langgraph large-language-model llm-agent llm-evaluation mcp model-context-protocol multimodal n8n n8n-workflow openai-api retrieval-augmented-generation
2.2k 539 +2/wk
GitHub

google-gemini/genai-processors

GenAI Processors is a lightweight Python library that enables efficient, parallel content processing.

Trend 3
agent ai asyncio gemini genai generative-ai language-model multimodal python realtime
2.1k 214 +2/wk
GitHub

kyegomez/BitNet

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in pytorch

Trend 3
artificial-intelligence deep-neural-networks deeplearning gpt4 machine-learning multimodal multimodal-deep-learning
1.9k 172 +2/wk
GitHub

showlab/Show-o

[ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.

Trend 3
diffusion-models large-language-models multimodal
1.9k 90 +0/wk
GitHub

2U1/Qwen-VL-Series-Finetune

An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.

Trend 3
multimodal qwen2-5-vl qwen2-vl qwen3-5 qwen3-vl vision-language vision-language-model vlm
1.8k 207 +0/wk
GitHub

potamides/DeTikZify

Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ.

Trend 3
draw graph huggingface inverse-graphics latex llama llm multimodal sketch tikz transformers vectorization visualization
1.8k 91 +0/wk
GitHub

dailenson/SDT

This repository is the official implementation of Disentangling Writer and Character Styles for Handwriting Generation (CVPR 2023)

Trend 3
computer-vision contrastive-learning deep-learning generative-models gmm handwriting-generation multimodal pytorch-implementation transformer
1.4k 111 +0/wk
GitHub

valentinfrlch/ha-llmvision

Visual intelligence for your home.

Trend 3
ai cctv-detection hacs-integration home-assistant llm multimodal notifications smart-home vision
1.3k 115 +1/wk
GitHub

gokayfem/awesome-vlm-architectures

Famous Vision Language Models and Their Architectures

Trend 3
awesome awesome-list blip clip cogvlm image-encoder internlm kosmos llava multimodal qwen-vl text-encoder vision-language-model vlm
1.2k 56 +0/wk
GitHub

ArrowLuo/CLIP4Clip

An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

Trend 3
activitynet clip didemo lsmdc msrvtt msvd multimodal multimodal-learning multimodality ranking retrieval retrieval-model search video-clip-retrieval video-text-retrieval
1.0k 135 +1/wk
GitHub

yaotingwangofficial/Awesome-MCoT

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Trend 3
chain-of-thought cot deepseek-r1 instruction-tuning large-vision-language-model mcts mllm-reasoning multimodal multimodal-chain-of-thought multimodal-large-language-models openai-o1 reasoning slow-thinking survey system-2
976 32 +1/wk
GitHub

allenai/papermage

Library supporting NLP and CV research on scientific papers

Trend 3
computer-vision machine-learning multimodal natural-language-processing pdf-processing python scientific-papers
793 64 +0/wk
GitHub

EvolvingLMMs-Lab/lmms-engine

A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.

Trend 3
agi large-language-models multimodal unified-multimodal-models video-generation
756 35 +1/wk
GitHub

Denis2054/RAG-Driven-Generative-AI

This repository provides programs to build Retrieval Augmented Generation (RAG) code for Generative AI with LlamaIndex, Deep Lake, and Pinecone leveraging the power of OpenAI and Hugging Face models for generation and evaluation.

Trend 3
advanced-rag chroma chromadb embedding-models fine-tuning gpt-4o-mini gpt4-omni grok huggingface indexing-querying llama llama-index multimodal openai-api pinecone rag scaling vision-transformer xai-grok
596 202 +1/wk
GitHub

monatis/clip.cpp

CLIP inference in plain C/C++ with no extra dependencies

Trend 3
c clip cpp ggml image-search multimodal
557 53 +1/wk
GitHub

Tencent-Hunyuan/Hunyuan3D-Omni

Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

Trend 3
3d 3d-aigc 3d-generation hunyuan3d image-to-3d multimodal shape
553 48 +0/wk
GitHub

NetManAIOps/ChatTS

[VLDB '25] ChatTS: Understanding, Chat, Reasoning about Time Series with TS-MLLM

Trend 3
llm multimodal timeseries timeseries-analysis
445 45 +1/wk
GitHub

tangbotony/GraTAG

GraTAG: Production AI Search via Graph-Based Query Decomposition and Triplet-Aligned Generation with Rich Multimodal Representations

Trend 1
New Signal
ai-search-engine multimodal query-decomposition rag reinforcement-learning retrieval-augmented-generation triplet-extraction
113 7 +31/wk
GitHub

deepseek-ai/Janus

Janus-Series: Unified Multimodal Understanding and Generation Models

Trend 0
any-to-any foundation-models llm multimodal unified-model vision-language-pretraining
17.7k 2.2k -1/wk
GitHub

NVlabs/instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

Trend 0
3d-reconstruction computer-graphics computer-vision cuda function-approximation machine-learning nerf neural-network real-time real-time-rendering realtime signed-distance-functions
17.4k 2.1k +1/wk
GitHub

kmario23/deep-learning-drizzle

Drench yourself in Deep Learning, Reinforcement Learning, Machine Learning, Computer Vision, and NLP by learning from these exciting lectures!!

Trend 0
artificial-intelligence-algorithms artificial-neural-networks bayesian-statistics computer-vision deep-learning deep-neural-networks deep-reinforcement-learning explainable-ai geometric-deep-learning graph-neural-networks machine-learning medical-imaging natural-language-processing optimization pattern-recognition probabilistic-graphical-models probability reinforcement-learning speech-recognition visual-recognition
12.8k 3.0k +0/wk
GitHub

zalandoresearch/fashion-mnist

A MNIST-like fashion product database. Benchmark 👇

Trend 0
benchmark computer-vision convolutional-neural-networks dataset deep-learning fashion fashion-mnist gan machine-learning mnist zalando
12.7k 3.1k -2/wk
GitHub

extreme-assistant/CVPR2024-Paper-Code-Interpretation

CVPR 2024/2023/2022/2021/2020/2019/2018/2017: a collection of papers, code, interpretations, and livestreams, compiled by the Jishi (极市) team

Trend 0
computer-vision cvpr2019 cvpr2020 cvpr2021 cvpr2022 deep-learning image-classification image-segmentation machine-learning object-detection papers
12.5k 2.2k +0/wk
GitHub

ludwig-ai/ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models

Trend 0
computer-vision data-centric data-science deep deep-learning deeplearning fine-tuning learning llama llama2 llm llm-training machine-learning machinelearning mistral ml natural-language natural-language-processing neural-network pytorch
11.7k 1.2k -1/wk
GitHub

RunanywhereAI/runanywhere-sdks

Production ready toolkit to run AI locally

Trend 0
android apple-intelligence cpp diffusion-models edge flutter inference ios kotlin llamacpp llm multimodal ollama on-device-ai react-native swift vlm voice-ai web websdk
10.3k 347 +0/wk
GitHub

Source Breakdown

GitHub: 592.1k stars, 78.4k forks, 100 repos
PyPI: 3 packages
HuggingFace: 2 linked repos
