Preprint / Version 1

Generating Classical Music With Transformers

Authors

  • Ethan Feng, Polygence
  • Mariel Werner

DOI:

https://doi.org/10.58445/rars.3642

Keywords:

transformer models, music, AI

Abstract

Models based on the Transformer architecture (Vaswani et al., 2017) have been a staple of autoregressive generation for nearly a decade as of this paper's publication. While such models have proven proficient in areas such as text generation with GPTs (Generative Pre-trained Transformers) and natural language processing (NLP), their potential remains largely unproven in other autoregressive domains, especially music generation. To explore the transformer architecture in these more novel areas, we propose a small, 57M-parameter model built on a standard Transformer architecture and trained on classical piano music in MIDI format. Our training dataset primarily contains music from the Classical and Romantic periods, with a smaller amount of music from more modern periods. To help the model capture nuances of classical music that are absent from conventional text applications, we introduce a novel musical loss function that works in tandem with a standard loss function, specifically cross-entropy, to encourage coherent and melodious generation. Through our model, we aim to show that ~50M-parameter models trained on smaller, meticulously curated datasets can compete in performance with larger music generation models trained on larger datasets.
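The abstract does not specify how the musical loss is combined with cross-entropy. As an illustration only, a combined objective of this general shape might be sketched as follows; the `interval_penalty` term and its weighting are purely hypothetical and are not taken from the paper:

```python
import math

def cross_entropy(logits, target):
    """Standard cross-entropy of one softmax distribution vs. a target token index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def interval_penalty(prev_pitch, pitch, max_leap=12):
    """Toy 'musical' term (hypothetical): penalize melodic leaps beyond an octave.

    MIDI pitches differ by 1 per semitone, so a leap of 12 is one octave.
    Returns 0 for leaps within max_leap, growing linearly beyond it.
    """
    leap = abs(pitch - prev_pitch)
    return max(0.0, leap - max_leap) / max_leap

def combined_loss(logits, target_pitch, prev_pitch, weight=0.1):
    """Total loss = cross-entropy + weight * musical penalty on the target pitch."""
    return cross_entropy(logits, target_pitch) + weight * interval_penalty(prev_pitch, target_pitch)
```

In an actual training loop this would be computed per timestep over batched logits in a framework such as PyTorch; the sketch above only illustrates how an auxiliary musical term can be added to the standard objective without replacing it.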



References

Ba, Jimmy Lei, et al. “Layer Normalization.” ArXiv:1607.06450 [Cs, Stat], 21 July 2016, arxiv.org/abs/1607.06450.

Hendrycks, Dan, and Kevin Gimpel. “Gaussian Error Linear Units (GELUs).” ArXiv:1606.08415 [Cs], 8 July 2020, arxiv.org/abs/1606.08415.

Holtzman, Ari, et al. “The Curious Case of Neural Text Degeneration.” ArXiv:1904.09751 [Cs], 14 Feb. 2020, arxiv.org/abs/1904.09751.

Hsiao, Wen-Yi, et al. “Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs.” ArXiv:2101.02402 [Cs, Eess], 7 Jan. 2021, arxiv.org/abs/2101.02402.

Huang, Cheng-Zhi Anna, et al. “Music Transformer.” ArXiv:1809.04281 [Cs, Eess, Stat], 12 Dec. 2018, arxiv.org/abs/1809.04281.

Lehmkuhl, Jonathan, et al. “Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics.” ArXiv.org, 2025, arxiv.org/abs/2511.07268. Accessed 22 Jan. 2026.

Lin, Xufeng, et al. “On the Detection-To-Track Association for Online Multi-Object Tracking.” Pattern Recognition Letters, vol. 146, June 2021, pp. 200–207, arxiv.org/abs/2107.00500, https://doi.org/10.1016/j.patrec.2021.03.022. Accessed 22 Jan. 2026.

Oore, Sageev, et al. “This Time with Feeling: Learning Expressive Musical Performance.” ArXiv.org, 10 Aug. 2018, arxiv.org/abs/1808.03715.

Radford, Alec, et al. “Language Models Are Unsupervised Multitask Learners.” OpenAI, 2019.

“Somascape : MIDI Ways - Guide to the MIDI 1.0 Technical Specification.” Somascape.org, 2022, www.somascape.org/midi/tech/spec.html.

Su, Jianlin, et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 20 Apr. 2021, https://doi.org/10.48550/arxiv.2104.09864.

Vaswani, Ashish, et al. “Attention Is All You Need.” ArXiv:1706.03762 [Cs], 12 June 2017, arxiv.org/abs/1706.03762.

Posted

2026-02-26