Anthropic to Google: Who’s winning against AI hallucinations?

Galileo, a leading developer of generative AI for enterprise applications, has released its latest Hallucination Index.

The evaluation framework – which focuses on Retrieval Augmented Generation (RAG) – assessed 22 prominent Gen AI LLMs from major players including OpenAI, Anthropic, Google, and Meta. This year’s index expanded significantly, adding 11 new models to reflect the rapid growth in both open- and closed-source LLMs over the past eight months.

Vikram Chatterji, CEO and Co-founder of Galileo, said: “In today’s rapidly evolving AI landscape, developers and enterprises face a critical challenge: how to harness the power of generative AI while balancing cost, accuracy, and reliability. Current benchmarks are often based on academic use-cases, rather than real-world applications.”

The index employed Galileo’s proprietary evaluation metric, context adherence, to check for output inaccuracies across various input lengths, ranging from 1,000 to 100,000 tokens. This approach aims to help enterprises make informed decisions about balancing price and performance in their AI implementations.
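
Galileo’s context adherence metric is proprietary and the article does not describe how it is computed. Purely as an illustration of the general idea (checking whether a response stays within the retrieved context), the sketch below uses a crude token-overlap proxy. The function names `tokenize` and `toy_adherence` and the scoring rule are assumptions made for this example and are not Galileo’s method.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_adherence(context, response):
    """Hypothetical proxy, NOT Galileo's metric.

    Returns the fraction of response sentences whose tokens all
    appear somewhere in the retrieved context.
    """
    context_vocab = set(tokenize(context))
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 0.0
    supported = sum(1 for s in sentences if set(tokenize(s)) <= context_vocab)
    return supported / len(sentences)


context = "The index assessed 22 models. Claude 3.5 Sonnet scored best overall."
response = "Claude 3.5 Sonnet scored best overall. Gemini won every category."
score = toy_adherence(context, response)  # one of two sentences is supported
```

A real evaluation would need to handle paraphrase and reasoning over the context, which simple token overlap cannot capture; the sketch only shows the shape of a per-response score that could be averaged across inputs of different lengths.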

Key findings from the index include:

Anthropic’s Claude 3.5 Sonnet emerged as the best overall performing model, consistently scoring near-perfect across short, medium, and long context scenarios.

Google’s Gemini 1.5 Flash ranked as the best performing model in terms of cost-effectiveness, delivering strong performance across all tasks.

Alibaba’s Qwen2-72B-Instruct stood out as the top open-source model, particularly excelling in short and medium context scenarios.

The index also highlighted several trends in the LLM landscape:

Open-source models are rapidly closing the gap with their closed-source counterparts, offering improved hallucination performance at lower costs.

Current RAG LLMs demonstrate significant improvements in handling extended context lengths without sacrificing quality or accuracy.

Smaller models sometimes outperform larger ones, suggesting that efficient design can be more crucial than scale.

The emergence of strong performers from outside the US, such as Mistral’s Mistral-Large and Alibaba’s Qwen2-72B-Instruct, indicates growing global competition in LLM development.

While closed-source models like Claude 3.5 Sonnet and Gemini 1.5 Flash maintain their lead due to proprietary training data, the index reveals that the landscape is evolving rapidly. Google’s results were notably split: its open-source Gemma-7b model performed poorly, while its closed-source Gemini 1.5 Flash consistently ranked near the top.

As the AI industry continues to grapple with hallucinations as a major hurdle to production-ready Gen AI products, Galileo’s Hallucination Index provides valuable insights for enterprises looking to adopt the right model for their specific needs and budget constraints.

The post Anthropic to Google: Who’s winning against AI hallucinations? appeared first on AI News.


