Collections including paper arxiv:2402.14261

- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
  Paper • 2405.07990 • Published • 20
- Large Language Models as Planning Domain Generators
  Paper • 2405.06650 • Published • 13
- AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
  Paper • 2404.12753 • Published • 43
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  Paper • 2404.07972 • Published • 51

- Evaluating Very Long-Term Conversational Memory of LLM Agents
  Paper • 2402.17753 • Published • 19
- StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
  Paper • 2402.16671 • Published • 27
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
  Paper • 2402.16837 • Published • 28
- Divide-or-Conquer? Which Part Should You Distill Your LLM?
  Paper • 2402.15000 • Published • 23

- TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
  Paper • 2402.13249 • Published • 15
- The FinBen: An Holistic Financial Benchmark for Large Language Models
  Paper • 2402.12659 • Published • 23
- Instruction-tuned Language Models are Better Knowledge Learners
  Paper • 2402.12847 • Published • 26
- Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
  Paper • 2402.13064 • Published • 50

- Rethinking FID: Towards a Better Evaluation Metric for Image Generation
  Paper • 2401.09603 • Published • 17
- LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
  Paper • 2402.10524 • Published • 23
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10
- RewardBench: Evaluating Reward Models for Language Modeling
  Paper • 2403.13787 • Published • 22

- nuprl/MultiPL-E
  Viewer • Updated • 12.7k rows • 64.8k downloads • 60 likes
- openai/openai_humaneval
  Viewer • Updated • 164 rows • 126k downloads • 360 likes
- Big Code Models Leaderboard
  📈 1.49k • Explore and compare code generation models on a leaderboard
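
The two dataset entries above can be loaded directly with the Hugging Face `datasets` library. A minimal sketch, assuming `datasets` is installed; the config and split names follow the public dataset cards, and the printed fields are illustrative:

```python
from datasets import load_dataset

# HumanEval: 164 hand-written Python problems, published as a "test" split.
humaneval = load_dataset("openai/openai_humaneval", split="test")
print(humaneval[0]["task_id"], humaneval[0]["prompt"][:80])

# MultiPL-E translates HumanEval/MBPP prompts into many languages; each
# (benchmark, language) pair is its own config, e.g. "humaneval-rs" for Rust.
multipl_e = load_dataset("nuprl/MultiPL-E", "humaneval-rs", split="test")
print(multipl_e[0]["name"], multipl_e[0]["prompt"][:80])
```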

- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10

- Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 26
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 9
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
  Paper • 2401.03065 • Published • 11
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10

- ReGAL: Refactoring Programs to Discover Generalizable Abstractions
  Paper • 2401.16467 • Published • 10
- StarCoder 2 and The Stack v2: The Next Generation
  Paper • 2402.19173 • Published • 152
- OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
  Paper • 2402.14658 • Published • 83
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 151
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 31
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 22
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69