Preprint / Version 1

Balancing Speed and Precision: A Comparative Analysis of Lightweight LLMs on SAT Reading and Writing Tasks

Authors

  • Zixuan Shang, Tesla STEM High School
  • Jesus Valdiviezo

DOI:

https://doi.org/10.58445/rars.2753

Keywords:

Large Language Models (LLMs), SAT Reading and Writing

Abstract

This study evaluated six cost-efficient large language models (LLMs) on SAT Reading and Writing multiple-choice questions: ChatGPT 4.1 mini, Gemini 2.0 Flash, Qwen3 235B-A22B, Llama 3.3 70B Instruct, Claude 3.5 Haiku, and DeepSeek V3. Using a structured pipeline built with LangChain, we assessed 90 questions spanning three difficulty levels (easy, medium, hard) and four skill subdivisions (Craft and Structure, Expression of Ideas, Information and Ideas, Standard English Conventions); Command of Evidence questions that include a graphical representation of data were excluded. Key findings reveal ChatGPT 4.1 mini and DeepSeek V3 as the top performers (91.1% accuracy), closely followed by Gemini 2.0 Flash (88.9%), with Qwen3 235B-A22B lagging far behind (32.2%). Accuracy declined as question difficulty increased (e.g., ChatGPT 4.1 mini dropped from 96.7% on easy questions to 83.3% on hard ones), and all models struggled most with Standard English Conventions (16.7–72.2% accuracy), particularly grammar tasks such as Boundaries (44.4% average accuracy). Gemini 2.0 Flash delivered the best speed-accuracy balance (88.9% accuracy in 94.51 seconds), while DeepSeek V3 matched ChatGPT 4.1 mini's accuracy (91.1%) at a somewhat higher latency (221.84 s vs. 184.5 s). Models demonstrated moderate-to-high consistency (variability = 1.00–1.39). These results suggest that smaller LLMs are viable for automated SAT-style assessment of comprehension (Information and Ideas: 96.3% accuracy) and analysis (Craft and Structure: 100% for the top models) but require substantial improvements in grammatical precision and complex reasoning. Educators and developers should leverage ChatGPT 4.1 mini or DeepSeek V3 for high-accuracy feedback and reserve Gemini 2.0 Flash for rapid-response applications, exercising caution on grammar-focused tasks.
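For readers curious about the kind of evaluation loop described above, the sketch below shows how an SAT-style multiple-choice question could be posed and scored with LangChain. It is a minimal illustration, not the paper's actual pipeline: the prompt wording, the sample question, the model identifier ("gpt-4.1-mini"), and the answer-extraction rule are all assumptions; the real implementation lives in the GitHub repository cited in the references.

    # Minimal sketch of an SAT multiple-choice evaluation loop using LangChain.
    # Prompts, model names, and scoring logic are assumptions, not the paper's code.
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate

    # Hypothetical question record: stem, lettered choices, and the keyed answer.
    questions = [
        {
            "stem": "Which choice completes the text with the most logical transition?",
            "choices": {"A": "However,", "B": "Therefore,", "C": "Similarly,", "D": "For example,"},
            "answer": "B",
        },
    ]

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are answering an SAT Reading and Writing question. "
                   "Reply with only the letter of the correct choice."),
        ("human", "{stem}\n\nA) {a}\nB) {b}\nC) {c}\nD) {d}"),
    ])

    llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)  # assumed model identifier
    chain = prompt | llm

    correct = 0
    for q in questions:
        reply = chain.invoke({
            "stem": q["stem"],
            "a": q["choices"]["A"],
            "b": q["choices"]["B"],
            "c": q["choices"]["C"],
            "d": q["choices"]["D"],
        })
        # Take the first A-D letter in the response as the model's answer.
        predicted = next((ch for ch in reply.content.upper() if ch in "ABCD"), None)
        correct += int(predicted == q["answer"])

    print(f"Accuracy: {correct / len(questions):.1%}")

Swapping in the other models reported in the abstract would only require changing the chat-model class and identifier (e.g., a Gemini or Claude integration), since the prompt-and-score loop itself is model-agnostic.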

References

College Board. “SAT Suite Question Bank | ...” satsuitequestionbank.collegeboard.org/.

Capelouto, J.D. “Here’s How GPT-4 Scored on the GRE, LSAT, AP English, and Other Exams.” Semafor, 15 Mar. 2023, www.semafor.com/article/03/15/2023/how-gpt-4-performed-in-academic-exams. Accessed 9 Feb. 2025.

Liang, Wenfeng. “DeepSeek.” Deepseek.com, Dec. 2023, deepseek.com.

OpenAI. “ChatGPT.” ChatGPT, OpenAI, 30 Nov. 2022, chatgpt.com/.

Shang, Zixuan. “GitHub - 1082098-LWSD/SAT-Question-Evaluation.” GitHub, 2024, github.com/1082098-LWSD/SAT-question-evaluation.git.

Suresh, Vikram, and Saannidhya Rawat. “GPT Takes the SAT: Tracing Changes in Test Difficulty and Students’ Math Performance.” arXiv.org, 2023, arxiv.org/html/2409.10750v1. Accessed 3 May 2025.

Posted

2025-07-18