Papers - Image
FaceChain-SuDe: Building Derived Class to Inherit Category Attributes
for One-shot Subject-Driven Generation
Paper
• 2403.06775
• Published
• 4
An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale
Paper
• 2010.11929
• Published
• 15
Data Incubation -- Synthesizing Missing Data for Handwriting Recognition
Paper
• 2110.07040
• Published
• 2
A Mixture of Expert Approach for Low-Cost Customization of Deep Neural
Networks
Paper
• 1811.00056
• Published
• 2
Data Generation for Post-OCR correction of Cyrillic handwriting
Paper
• 2311.15896
• Published
• 4
Character Queries: A Transformer-based Approach to On-Line Handwritten
Character Segmentation
Paper
• 2309.03072
• Published
• 2
Densely Connected Convolutional Networks
Paper
• 1608.06993
• Published
• 3
BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage
Models
Paper
• 2003.11142
• Published
• 2
U-Net: Convolutional Networks for Biomedical Image Segmentation
Paper
• 1505.04597
• Published
• 17
Image Segmentation using U-Net Architecture for Powder X-ray Diffraction
Images
Paper
• 2310.16186
• Published
• 2
RTSeg: Real-time Semantic Segmentation Comparative Study
Paper
• 1803.02758
• Published
• 2
Generalizability vs. Robustness: Adversarial Examples for Medical
Imaging
Paper
• 1804.00504
• Published
• 2
Hierarchical multi-class segmentation of glioma images using networks
with multi-level activation function
Paper
• 1810.09488
• Published
• 2
IVD-Net: Intervertebral disc localization and segmentation in MRI with a
multi-modal UNet
Paper
• 1811.08305
• Published
• 2
A multi-path 2.5 dimensional convolutional neural network system for
segmenting stroke lesions in brain MRI images
Paper
• 1905.10835
• Published
• 3
Enforcing temporal consistency in Deep Learning segmentation of brain MR
images
Paper
• 1906.07160
• Published
• 3
Bias Loss for Mobile Neural Networks
Paper
• 2107.11170
• Published
• 2
Skip-Connected Neural Networks with Layout Graphs for Floor Plan
Auto-Generation
Paper
• 2309.13881
• Published
• 2
Inter-Scale Dependency Modeling for Skin Lesion Segmentation with
Transformer-based Networks
Paper
• 2310.13727
• Published
• 2
Latent Diffusion Model for Medical Image Standardization and Enhancement
Paper
• 2310.05237
• Published
• 2
3D Medical Image Segmentation based on multi-scale MPU-Net
Paper
• 2307.05799
• Published
• 2
Self-Supervised U-Net for Segmenting Flat and Sessile Polyps
Paper
• 2110.08776
• Published
• 2
Enforcing Morphological Information in Fully Convolutional Networks to
Improve Cell Instance Segmentation in Fluorescence Microscopy Images
Paper
• 2106.05843
• Published
• 2
Saliency-Guided Deep Learning Network for Automatic Tumor Bed Volume
Delineation in Post-operative Breast Irradiation
Paper
• 2105.02771
• Published
• 2
Qutrit-inspired Fully Self-supervised Shallow Quantum Learning Network
for Brain Tumor Segmentation
Paper
• 2009.06767
• Published
• 2
The Effects of Image Pre- and Post-Processing, Wavelet Decomposition,
and Local Binary Patterns on U-Nets for Skin Lesion Segmentation
Paper
• 1805.05239
• Published
• 2
A joint 3D UNet-Graph Neural Network-based method for Airway
Segmentation from chest CTs
Paper
• 1908.08588
• Published
• 2
Joint Liver and Hepatic Lesion Segmentation in MRI using a Hybrid CNN
with Transformer Layers
Paper
• 2201.10981
• Published
• 2
CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows
Paper
• 2107.00652
• Published
• 2
2nd Place Solution to Google Landmark Recognition Competition 2021
Paper
• 2110.02638
• Published
• 2
BOAT: Bilateral Local Attention Vision Transformer
Paper
• 2201.13027
• Published
• 2
Long-tailed Recognition by Routing Diverse Distribution-Aware Experts
Paper
• 2010.01809
• Published
• 2
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Paper
• 2103.14030
• Published
• 5
A Novel Transformer Based Semantic Segmentation Scheme for
Fine-Resolution Remote Sensing Images
Paper
• 2104.12137
• Published
• 2
Self-Supervised Learning with Swin Transformers
Paper
• 2105.04553
• Published
• 3
Bootstrap your own latent: A new approach to self-supervised Learning
Paper
• 2006.07733
• Published
• 2
Evaluating Transformer-based Semantic Segmentation Networks for
Pathological Image Segmentation
Paper
• 2108.11993
• Published
• 2
Using Multi-scale SwinTransformer-HTC with Data augmentation in CoNIC
Challenge
Paper
• 2202.13588
• Published
• 2
From Modern CNNs to Vision Transformers: Assessing the Performance,
Robustness, and Classification Strategies of Deep Learning Models in
Histopathology
Paper
• 2204.05044
• Published
• 2
Emerging Properties in Self-Supervised Vision Transformers
Paper
• 2104.14294
• Published
• 4
GasHis-Transformer: A Multi-scale Visual Transformer Approach for
Gastric Histopathological Image Detection
Paper
• 2104.14528
• Published
• 2
CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
Paper
• 2401.12208
• Published
• 22
Paper
• 2309.16671
• Published
• 21
Vision Transformers Need Registers
Paper
• 2309.16588
• Published
• 86
DAS: A Deformable Attention to Capture Salient Information in CNNs
Paper
• 2311.12091
• Published
• 2
TANKER: Distributed Architecture for Named Entity Recognition and
Disambiguation
Paper
• 1708.09230
• Published
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
• 2403.09611
• Published
• 129
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Paper
• 2403.09622
• Published
• 17
VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision
Understanding
Paper
• 2403.09530
• Published
• 10
LocalMamba: Visual State Space Model with Windowed Selective Scan
Paper
• 2403.09338
• Published
• 8
GiT: Towards Generalist Vision Transformer through Universal Language
Interface
Paper
• 2403.09394
• Published
• 26
Vision Transformer with Quadrangle Attention
Paper
• 2303.15105
• Published
• 2
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based
Semantic Control
Paper
• 2403.09055
• Published
• 26
Language Grounded QFormer for Efficient Vision Language Understanding
Paper
• 2311.07449
• Published
• 2
GLIDE: Towards Photorealistic Image Generation and Editing with
Text-Guided Diffusion Models
Paper
• 2112.10741
• Published
• 4
Synthetic Shifts to Initial Seed Vector Exposes the Brittle Nature of
Latent-Based Diffusion Models
Paper
• 2312.11473
• Published
• 3
Lightweight Image Inpainting by Stripe Window Transformer with Joint
Attention to CNN
Paper
• 2301.00553
• Published
• 3
Semi-Supervised Semantic Segmentation using Redesigned Self-Training for
White Blood Cells
Paper
• 2401.07278
• Published
• 2
Flamingo: a Visual Language Model for Few-Shot Learning
Paper
• 2204.14198
• Published
• 16
VideoAgent: Long-form Video Understanding with Large Language Model as
Agent
Paper
• 2403.10517
• Published
• 37
LightIt: Illumination Modeling and Control for Diffusion Models
Paper
• 2403.10615
• Published
• 18
Generic 3D Diffusion Adapter Using Controlled Multi-View Editing
Paper
• 2403.12032
• Published
• 15
MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
Paper
• 2403.11207
• Published
• 15
AnimateDiff-Lightning: Cross-Model Diffusion Distillation
Paper
• 2403.12706
• Published
• 18
FouriScale: A Frequency Perspective on Training-Free High-Resolution
Image Synthesis
Paper
• 2403.12963
• Published
• 8
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
• 2403.11703
• Published
• 17
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
Paper
• 2403.12906
• Published
• 7
Towards 3D Molecule-Text Interpretation in Language Models
Paper
• 2401.13923
• Published
• 9
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal
Large Language Models
Paper
• 2403.13447
• Published
• 19
MyVLM: Personalizing VLMs for User-Specific Queries
Paper
• 2403.14599
• Published
• 17
S2LIC: Learned Image Compression with the SwinV2 Block, Adaptive
Channel-wise and Global-inter Attention Context
Paper
• 2403.14471
• Published
• 2
DepthFM: Fast Monocular Depth Estimation with Flow Matching
Paper
• 2403.13788
• Published
• 18
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
Paper
• 2403.16627
• Published
• 22
FlashFace: Human Image Personalization with High-fidelity Identity
Preservation
Paper
• 2403.17008
• Published
• 22
Prompt me a Dataset: An investigation of text-image prompting for
historical image dataset creation using foundation models
Paper
• 2309.01674
• Published
• 2
Paper
• 2304.02643
• Published
• 5
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for
Vision-Language Few-Shot Prompting
Paper
• 2210.07179
• Published
• 3
DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric
Diffusion
Paper
• 2403.17237
• Published
• 11
One-step Diffusion with Distribution Matching Distillation
Paper
• 2311.18828
• Published
• 3
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
Paper
• 1801.03924
• Published
• 2
ViTAR: Vision Transformer with Any Resolution
Paper
• 2403.18361
• Published
• 55
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper
• 2404.01197
• Published
• 31
Condition-Aware Neural Network for Controlled Image Generation
Paper
• 2404.01143
• Published
• 13
Measuring Style Similarity in Diffusion Models
Paper
• 2404.01292
• Published
• 17
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image
Generation
Paper
• 2404.02733
• Published
• 22
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion
Models
Paper
• 2404.02747
• Published
• 13
Event Camera Demosaicing via Swin Transformer and Pixel-focus Loss
Paper
• 2404.02731
• Published
• 1
PointInfinity: Resolution-Invariant Point Diffusion Models
Paper
• 2404.03566
• Published
• 16
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept
Matching
Paper
• 2404.03653
• Published
• 35
Learning Transferable Visual Models From Natural Language Supervision
Paper
• 2103.00020
• Published
• 19
Prompt-to-Prompt Image Editing with Cross Attention Control
Paper
• 2208.01626
• Published
• 3
DeViDe: Faceted medical knowledge for improved medical vision-language
pre-training
Paper
• 2404.03618
• Published
• 2
OmniFusion Technical Report
Paper
• 2404.06212
• Published
• 77
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion
Paper
• 2310.03502
• Published
• 79
Toward a Better Understanding of Fourier Neural Operators: Analysis and
Improvement from a Spectral Perspective
Paper
• 2404.07200
• Published
• 2
Paper
• 2404.07821
• Published
• 13
ConsistencyDet: Robust Object Detector with Denoising Paradigm of
Consistency Model
Paper
• 2404.07773
• Published
• 1
ODA: Observation-Driven Agent for integrating LLMs and Knowledge Graphs
Paper
• 2404.07677
• Published
• 1
Ferret-v2: An Improved Baseline for Referring and Grounding with Large
Language Models
Paper
• 2404.07973
• Published
• 32
Text Role Classification in Scientific Charts Using Multimodal
Transformers
Paper
• 2402.14579
• Published
• 1
Using Explainable AI and Transfer Learning to understand and predict the
maintenance of Atlantic blocking with limited observational data
Paper
• 2404.08613
• Published
• 1
HSIDMamba: Exploring Bidirectional State-Space Models for Hyperspectral
Denoising
Paper
• 2404.09697
• Published
• 1
Deformable MRI Sequence Registration for AI-based Prostate Cancer
Diagnosis
Paper
• 2404.09666
• Published
• 1
Comprehensive Survey of Model Compression and Speed up for Vision
Transformers
Paper
• 2404.10407
• Published
• 1
Explainable Lung Disease Classification from Chest X-Ray Images
Utilizing Deep Learning and XAI
Paper
• 2404.11428
• Published
• 1
MoA: Mixture-of-Attention for Subject-Context Disentanglement in
Personalized Image Generation
Paper
• 2404.11565
• Published
• 15
EdgeFusion: On-Device Text-to-Image Generation
Paper
• 2404.11925
• Published
• 23
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
Paper
• 2401.18059
• Published
• 48
GLIGEN: Open-Set Grounded Text-to-Image Generation
Paper
• 2301.07093
• Published
• 4
Grounded Language-Image Pre-training
Paper
• 2112.03857
• Published
• 3
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper
• 2404.12803
• Published
• 30
Groma: Localized Visual Tokenization for Grounding Multimodal Large
Language Models
Paper
• 2404.13013
• Published
• 31
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image
Synthesis
Paper
• 2404.13686
• Published
• 29
Scene Coordinate Reconstruction: Posing of Image Collections via
Incremental Learning of a Relocalizer
Paper
• 2404.14351
• Published
• 6
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Paper
• 2404.14239
• Published
• 9
All you need is a good init
Paper
• 1511.06422
• Published
• 1
Efficient Transformer Encoders for Mask2Former-style models
Paper
• 2404.15244
• Published
• 1
Deep Residual Learning for Image Recognition
Paper
• 1512.03385
• Published
• 12
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster
Pre-training on Web-scale Image-Text Data
Paper
• 2404.15653
• Published
• 29
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video
Dense Captioning
Paper
• 2404.16994
• Published
• 37
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring
Unconstrained Photo Collections
Paper
• 2404.16845
• Published
• 7
Stylus: Automatic Adapter Selection for Diffusion Models
Paper
• 2404.18928
• Published
• 15
DOCCI: Descriptions of Connected and Contrasting Images
Paper
• 2404.19753
• Published
• 13
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Paper
• 2404.18212
• Published
• 30
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video
Generation
Paper
• 2405.01434
• Published
• 56
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
Paper
• 2405.16759
• Published
• 8
Neural Autoregressive Distribution Estimation
Paper
• 1605.02226
• Published
• 1
Autoregressive Model Beats Diffusion: Llama for Scalable Image
Generation
Paper
• 2406.06525
• Published
• 71
Diffusion Models Beat GANs on Image Synthesis
Paper
• 2105.05233
• Published
• 2
Zero-shot Image Editing with Reference Imitation
Paper
• 2406.07547
• Published
• 33
VideoFACT: Detecting Video Forgeries Using Attention, Scene Context, and
Forensic Traces
Paper
• 2211.15775
• Published
• 1
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Paper
• 2406.06911
• Published
• 12
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
in Videos
Paper
• 2406.08407
• Published
• 28
DataComp: In search of the next generation of multimodal datasets
Paper
• 2304.14108
• Published
• 2
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal
LLMs
Paper
• 2406.18521
• Published
• 30
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
• 2406.19389
• Published
• 54
Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept
Space
Paper
• 2406.19370
• Published
• 1
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper
• 2406.17720
• Published
• 8
We-Math: Does Your Large Multimodal Model Achieve Human-like
Mathematical Reasoning?
Paper
• 2407.01284
• Published
• 81
No Training, No Problem: Rethinking Classifier-Free Guidance for
Diffusion Models
Paper
• 2407.02687
• Published
• 24
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
• 2407.02392
• Published
• 23
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output
Paper
• 2407.03320
• Published
• 94
DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents
Paper
• 2407.03300
• Published
• 14
Florence-2: Advancing a Unified Representation for a Variety of Vision
Tasks
Paper
• 2311.06242
• Published
• 95
Unveiling Encoder-Free Vision-Language Models
Paper
• 2406.11832
• Published
• 54
Vision language models are blind
Paper
• 2407.06581
• Published
• 84
MAVIS: Mathematical Visual Instruction Tuning
Paper
• 2407.08739
• Published
• 32
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
• 2407.07053
• Published
• 47
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
Paper
• 2312.04461
• Published
• 62
Paper
• 2405.15932
• Published
• 1
SAM 2: Segment Anything in Images and Videos
Paper
• 2408.00714
• Published
• 120
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Paper
• 2403.03206
• Published
• 71
LLaVA-OneVision: Easy Visual Task Transfer
Paper
• 2408.03326
• Published
• 61
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal
Large Language Models
Paper
• 2408.04840
• Published
• 33
Paper
• 2408.07009
• Published
• 62
VideoGameBunny: Towards vision assistants for video games
Paper
• 2407.15295
• Published
• 23
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
• 2408.08872
• Published
• 101
Equivariant Transformer Networks
Paper
• 1901.11399
• Published
• 1
Law of Vision Representation in MLLMs
Paper
• 2408.16357
• Published
• 95
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse
Autoencoders
Paper
• 2410.22366
• Published
• 84
OmniGen: Unified Image Generation
Paper
• 2409.11340
• Published
• 115
Randomized Autoregressive Visual Generation
Paper
• 2411.00776
• Published
• 18
Analyzing The Language of Visual Tokens
Paper
• 2411.05001
• Published
• 24
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Paper
• 2411.02327
• Published
• 11
Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy
Curvature of Attention
Paper
• 2408.00760
• Published
• 7
Generalized Out-of-Distribution Detection and Beyond in Vision Language
Model Era: A Survey
Paper
• 2407.21794
• Published
• 6
Building and better understanding vision-language models: insights and
future directions
Paper
• 2408.12637
• Published
• 133
MagicQuill: An Intelligent Interactive Image Editing System
Paper
• 2411.09703
• Published
• 80
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed
Dual-Branch Diffusion
Paper
• 2403.06976
• Published
• 2
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
• 2411.10440
• Published
• 129
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published
• 47
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
Paper
• 2303.08797
• Published
• 1
DETRs Beat YOLOs on Real-time Object Detection
Paper
• 2304.08069
• Published
• 15
RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time
Detection Transformer
Paper
• 2407.17140
• Published
• 2
HAT: Hybrid Attention Transformer for Image Restoration
Paper
• 2309.05239
• Published
• 1
No More Adam: Learning Rate Scaling at Initialization is All You Need
Paper
• 2412.11768
• Published
• 43
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
• 2411.07975
• Published
• 31
Cosmos World Foundation Model Platform for Physical AI
Paper
• 2501.03575
• Published
• 82