Balancing Speed and Precision: A Comparative Analysis of Lightweight LLMs on SAT Reading and Writing Tasks
DOI: https://doi.org/10.58445/rars.2753

Keywords: Large Language Models (LLMs), SAT Reading and Writing

Abstract
This study evaluated six cost-efficient large language models (LLMs) on SAT Reading and Writing multiple-choice questions: ChatGPT 4.1 mini, Gemini 2.0 Flash, Qwen3 235B-A22B, Llama 3.3 70B Instruct, Claude 3.5 Haiku, and DeepSeek V3. Using a structured pipeline built with LangChain, we assessed 90 questions spanning three difficulty levels (easy, medium, hard) and four skill domains (Craft and Structure, Expression of Ideas, Information and Ideas, Standard English Conventions); Command of Evidence questions that include a graphical representation of data were excluded. ChatGPT 4.1 mini and DeepSeek V3 were the top performers (91.1% accuracy), closely followed by Gemini 2.0 Flash (88.9%), while Qwen3 235B-A22B lagged far behind (32.2%). Accuracy declined with question difficulty (ChatGPT 4.1 mini, for example, dropped from 96.7% on easy questions to 83.3% on hard ones), and all models struggled most with Standard English Conventions (16.7–72.2% accuracy), particularly grammar skills such as Boundaries (44.4% average accuracy). Gemini 2.0 Flash offered the best speed-accuracy balance (88.9% accuracy in 94.51 seconds), while DeepSeek V3 matched ChatGPT 4.1 mini's accuracy (91.1%) at higher latency (221.84 s vs. 184.5 s). Models demonstrated moderate-to-high consistency (variability = 1.00–1.39). These results suggest that smaller LLMs are viable for automated SAT-style assessment of comprehension (Information and Ideas: 96.3% accuracy) and analysis (Craft and Structure: 100% for the top models) but still need substantial improvement in grammatical precision and complex reasoning. Educators and developers should favor ChatGPT 4.1 mini or DeepSeek V3 for high-accuracy feedback and reserve Gemini 2.0 Flash for rapid-response applications, applying caution on grammar-focused tasks.
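The evaluation pipeline the abstract describes is simple to reproduce in outline. The sketch below shows the general shape, assuming the langchain-openai bindings; the model string, the question-dictionary field names, and the evaluate helper are illustrative assumptions, not the authors' actual code. Each question is formatted into a prompt, sent through a LangChain chain, and the returned letter is scored against the answer key while total latency is accumulated.

# Minimal sketch of a LangChain multiple-choice evaluation loop.
# Assumptions: langchain-openai is installed, and questions arrive as
# dicts with "passage", "question", "choices", and "answer" keys
# (hypothetical field names, not the authors' schema).
import time

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are answering SAT Reading and Writing questions. "
     "Reply with the letter of the best choice (A, B, C, or D) only."),
    ("human", "{passage}\n\nQuestion: {question}\n\nChoices:\n{choices}"),
])

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)  # one of the six models
chain = prompt | llm | StrOutputParser()

def evaluate(questions):
    """Return (accuracy, elapsed_seconds) over a list of question dicts."""
    correct = 0
    start = time.perf_counter()
    for q in questions:
        reply = chain.invoke({
            "passage": q["passage"],
            "question": q["question"],
            "choices": q["choices"],
        }).strip()
        # Score the first character of the reply against the answer key.
        if reply and reply[0].upper() == q["answer"]:
            correct += 1
    return correct / len(questions), time.perf_counter() - start

Swapping in a different provider (e.g., a Gemini or Claude chat class from the corresponding LangChain integration package) changes only the llm line, which is what makes a cross-model comparison of this kind cheap to run.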
License
Copyright (c) 2025 Zixuan Shang, Jesus Valdiviezo

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.