AIエージェントのテストを欺く方法
AIエージェントの評価には、結果だけでは不十分。shipreadyは、エージェントがどのように到達したかを判定するツールで、実際の問題解決過程を評価することで、信頼できるエージェントを選ぶことが可能になる。
From X
Elon 本人のリアルタイム発言を、随時お届け。
USAID money killed millions
Not a good use of your taxpayer money
RT @cursor_ai: Three announcements from our keynote at Compile, including how we're training a new model with SpaceX. https://t.co/oV7IaEuZ…
They didn’t just import voters, they imported judges too
By their logic, absolutely!
Correct
Exactly
Exactly
RT @MarshaBlackburn: A Biden-appointed activist judge, Sparkle Sooknanan, just blocked a commonsense tool to keep non-citizens from voting.…
RT @SpaceX: Falcon 9 launches the Starfall Demo mission from Florida https://t.co/Pkh9HiKj1e
RT @cb_doge: USAID was a criminal organization that funded bioweapons, censorship & global coups with your tax dollars. It was never about…
RT @bennyjohnson: Let me get this straight. In America, in 2026, a federal judge just ruled that the government is NOT ALLOWED to check wh…
RT @SpaceX: Today’s mission includes a demo of a new vehicle that will enable affordable, routine access to the microgravity environment fo…
RT @SERobinsonJr: STARLINK: The National Superannuation Fund Limited (Nasfund) of Papua New Guinea launched Mobile Service Booths Powered b…
RT @C_3C_3: Ending USAID saved lives. Fact.
RT @KaiserBenKaiser: Selbst aus mathematisch-statistischer Perspektive ist das erschreckend: Syrer zum Beispiel sind bei Vergewaltigungen e…
By their logic, yes
True
RT @skscartoon: No, Elon Musk and DOGE did not cause "millions of deaths of children." Anthony Fauci, however, very much did. https://t.co…
RT @MikeBenzCyber: USAID probably killed more people by creating Covid
SpaceX · OpenAI · Anthropic
Falcon 9 launches the Starfall Demo mission from Florida https://t.co/Pkh9HiKj1e
Deployment of Starfall confirmed
Today’s mission includes a demo of a new vehicle that will enable affordable, routine access to the microgravity environment for scientific research and in-space manufacturing. After demonstrating controlled flight, the spacecraft will splash down in the Pacific Ocean https://t.co/NLwhigtSWC
Watch Falcon 9 launch the Starfall Demo mission to orbit from Florida https://t.co/cxgrchwMco
We’re expanding OpenAI Daybreak to help democratize patching vulnerable software at machine speed: - Codex Security plugin: find, validate, and fix vulnerabilities right inside Codex - The full version of GPT-5.5-Cyber model: a great model for trusted defenders - Cyber Partner Program: powering products built on top of our best cyber capabilities for leading security companies to secure the world's software - Patch the Planet: working with maintainers to secure critical open source projects https://t.co/hyIi6gQmkm
Falcon 9 launches 24 @Starlink satellites from California https://t.co/UmMBikZgK5
Deployment of 24 @Starlink satellites confirmed
Watch Falcon 9 launch 24 @Starlink satellites to orbit from California https://t.co/7doAmnCBxa
Falcon 9 launches NROL-179 to orbit from pad 4E in California https://t.co/GLYoPzrNAk
Falcon 9’s first stage lands on LZ-4 https://t.co/Ydf012BViS
Liftoff! https://t.co/G7xxdrAV1t
Watch Falcon 9 launch the @NRO_gov’s NROL-179 mission from pad 4E in California https://t.co/MyjE8pAve9
As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training—and maintain it under pressure. That’s the idea behind our new research on training models to be broadly and persistently beneficial. https://t.co/6Yw45s1RRq
Falcon 9 is vertical at pad 4E in California ahead of tomorrow’s launch of the @NRO_gov’s NROL-179 mission → https://t.co/ikCA04Rzvz https://t.co/QdwFj4NRRW
GPT-5.5 Instant is now on par with our frontier Thinking models for health-related questions. Every week, more than 230 million people turn to ChatGPT with health and wellness questions, and GPT-5.5 Instant is better at recognizing when urgent care may be needed, asking for relevant context, explaining uncertainty, and making complex information easier to understand. Because GPT-5.5 Instant is available to all free users in ChatGPT, these improvements can help more people. Physician-led evaluation was critical to making these major intelligence gains.
New Frontier Red Team blog: Phase 2 of Project Fetch, where we test how well Claude can program a robodog. Opus 4.7, on its own, was ~20x faster than last year's best human team aided by Opus 4.1. (The robodog, alas, still failed to fetch a beach ball.) https://t.co/CgbBtRf85e
Together with researchers at Boston Children’s Hospital and Harvard, we published a study in NEJM AI showing how o3 Deep Research helped clinicians revisit previously unsolved rare pediatric disease cases, and find answers for families who had waited years. https://t.co/HVVDlEkuYR
Introducing LifeSciBench, a benchmark for measuring and improving how well AI supports real-world life science research. Developed with 173 scientists from biotechnology and pharmaceutical research, LifeSciBench includes 750 expert-authored tasks across seven biological research workflows. https://t.co/JTk0wXHFrT
GPT-5.4 helped drive a medicinal chemistry project from literature review to a validated experimental result. Paired with https://t.co/gcDaph8b2B’s Maria AI and specialized lab, the model proposed an unexpected way to improve a widely used reaction in drug discovery. https://t.co/KmyBlHLX8y
Splashdown of Dragon confirmed, completing SpaceX’s 34th Commercial Resupply Services mission to the @Space_Station!
Dragon’s four main parachutes have deployed
Drogues deployed
Falcon 9 launches from pad 40 in Florida https://t.co/DswQicc5TM
Deployment of all three BlueBird satellites confirmed
Falcon 9’s first stage has landed on the A Shortfall of Gravitas droneship https://t.co/LqecUt0AwY
Watch Falcon 9 launch the @AST_SpaceMobile BlueBird 8-10 mission to orbit https://t.co/wvLHBYHxID
Dragon is on track to reenter Earth’s atmosphere and splash down off the southern coast of California near Oceanside at ~5:10 a.m. PT on Wednesday, June 17 https://t.co/9ZNVoHPf2y
We’re sharing new research on a method for anticipating how models may behave in real-world use before release: simulating deployment with recent, de-identified user requests and studying candidate model responses. https://t.co/7RJzBfNniQ
Our latest economic research introduces a framework for tracking Claude Code as it scales. Who is using Claude Code, and what are they using it for? How is the value of tasks changing? And how much does domain expertise shape whether a session succeeds? https://t.co/IjjwQvrESo
Let’s talk about evals. We’re always looking for better ways to measure and forecast model progress, especially as benchmarks get saturated or gamed. @tejalpatwardhan, who leads our frontier evals team, spoke to @andrewmayne about why evals matter and what models need to be judged on next.
RT @OpenAIDevs: More of Codex is rolling out across Europe this week. We’re bringing Computer use, the Codex Chrome extension, personalize…
Separation confirmed! Dragon is performing four departure burns to move away from the @Space_Station. Splashdown in ~20 hours off the coast of California https://t.co/F85xpP6vCu
RT @Space_Station: .@NASA and @SpaceX now target Dragon's undocking time for 12:25pm ET today.
All cargo is loaded, the hatch is closed, and Dragon is ready for an on-time departure from the @Space_Station at 12:05 p.m. ET
SpaceX has exercised the option to acquire @cursor_ai in an all-stock transaction with the goal of building the world’s most useful AI models. For the past few months, SpaceXAI has been jointly training a model with Cursor, which will be released in Cursor and Grok Build soon. We look forward to working closely with the Cursor team to advance our frontier AI capabilities
After 30 days docked to the @Space_Station, the Dragon spacecraft supporting SpaceX’s 34th Commercial Resupply Services mission for @NASA will undock from the orbiting lab on Tuesday, June 16 → https://t.co/2AmF6GjvNP https://t.co/eIWmVNRuuu
Falcon 9 launches 24 @Starlink satellites from California https://t.co/YbjRLQBEZo
Deployment of 24 @Starlink satellites confirmed
Watch Falcon 9 launch 24 @Starlink satellites to orbit from California https://t.co/meDwb05qOE
RT @Nasdaq: Imagine what's next. @SpaceX $SPCX https://t.co/q0VC2XClYn
The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: https://t.co/bwn0sximKZ
RT @Nasdaq: From Texas to Times Square, from the markets to Mars. Congratulations to @SpaceX on its first day of trading on Nasdaq. $SPCX h…
RT @jpmorgan: J.P. Morgan + SpaceX= Largest IPO Congratulations to the @spaceX team on this milestone, we were proud to serve as a lead bo…
RT @Nasdaq: To the moon was never just a metaphor. @elonmusk @SpaceX $SPCX https://t.co/jOlcauvTFw
We heard you wanted to use Codex rate limit resets on your own time. Starting today, we’re rolling out the ability to save rate limit resets to use later. We’re starting Go, Plus, Pro, and Business users with one free reset: https://t.co/gucyTi04wc
We’re launching Claude Corps, a national fellowship program matching people early in their careers with US nonprofits. We'll teach 1,000 people to use Claude, and pay them to use AI to advance their hosts’ missions. https://t.co/QI6JmlAdSr
AI is advancing at a pace our policymaking institutions were never built for—and the gap between the two is becoming the central challenge of the technology. In his latest essay, our CEO Dario Amodei lays out how to close it. We're launching three new initiatives to support the efforts he outlines.
RT @merettm: The north stars we're working towards at OpenAI all center around the mission: ensure AGI benefits all of humanity. AI should…
An issue caused some user accounts to be incorrectly suspended. We’re restoring access and working through related subscription and credit issues. https://t.co/Vyqnn17RzG
What happened when one of our models found a counterexample to an 80-year-old Erdős conjecture? Researchers @alexwei_, @HongxunWu, and @wjmzbmr1 shared the story on the OpenAI Podcast with @AndrewMayne and explained how mathematicians and models can work together to make new discoveries.
Latest News
Elon Musk と関連企業、そして AI 業界全体の最新動向を、毎日まとめてお届け。
AIエージェントの評価には、結果だけでは不十分。shipreadyは、エージェントがどのように到達したかを判定するツールで、実際の問題解決過程を評価することで、信頼できるエージェントを選ぶことが可能になる。
本文では、AIエージェントがWebサイトをどのように理解するのかを示す。AIエージェントはRaw HTMLを読み取り、5つの要素で判断し、理解の可否を決める。
オーストラリアの北部熱帯雨林で、危険な種類のアリを捕らえる特殊な蜘蛛が発見された。新種のクモは、カタパルトのようなシルクの罠を作り、15倍もの加速度で獲物を引っかける力がある。研究者たちは、この罠を『球石投げ』と呼び、危険なアリを捕らえるために進化したと考えている。
著者は7ヶ月間、20,000行のPythonコードで個人財務シミュレーターWARPSimLabを開発し、ChatGPTを使用して設計・構築した。AIはprototyping、clean code、refactoring、複雑なアルゴリズム設計などで貢献し、デバッグもスムーズに行えた。著者はAIの有用性を体験し、将来のソフトウェア開発に大きな期待を寄せている。
Clarioxは、AIを使用したインタビューシミュレーターで、面接、交渉、リーダーコンバージョン、ピッチングの練習ができます。ユーザーは、実際の面接と同様に、AIとの会話を通じてフィードバックを受け、改善点を特定することができます。また、コーチングや進捗状況の追跡など、自己成長をサポートする機能も備えています。
Stack Overflowは、15回目の年次開発者調査を実施し、49,000以上の回答を得た。同調査では、AIエージェントツールやLLMs、コミュニティプラットフォームなど314の技術に焦点を当てている。結果はCSVデータで提供される
Markdyは、オープンソースのアニメーション DSL エンジンです。Markdownを使用してシーンを記述し、AIが生成したコードを実行することで、動画やWebページにアニメーションを追加できます。
チームはモデルを問題視することが多いが、実際にはモデルの周囲、すなわち
人類を多惑星種にするため、火星移住を実現する。
世界を持続可能なエネルギーで満たし、未来を守る。
AIが人類の良きパートナーとなる未来を創る。
人間の可能性を拡張し、新たな未来を切り拓く。
革新的なインフラと交通システムで、より良い都市を創る。