SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models

The Knowledge Engineering Group (KEG), Tsinghua University; Zhipu AI; California Institute of Technology

Abstract

Large Language Models (LLMs) have shown promise in assisting scientific discovery. However, such applications are currently limited by LLMs' deficiencies in understanding intricate scientific concepts, deriving symbolic equations, and performing advanced numerical calculations. To bridge these gaps, we introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.

Central to our approach is a novel self-reflective instruction annotation framework that addresses the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a self-reflective critic-and-revise process. Applying this framework, we curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
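A minimal sketch of how such a critic-and-revise loop can be wired up is shown below. Here `call_llm` is a placeholder for any chat-completion API, and the prompts are illustrative rather than the exact prompts used to build SciInstruct.

```python
# Minimal sketch of a self-reflective critic-and-revise annotation loop.
# `call_llm` is a placeholder for any chat-completion API; the prompts are
# illustrative, not the exact prompts used to construct SciInstruct.
from typing import Callable

def annotate(question: str, call_llm: Callable[[str], str], max_revisions: int = 2) -> str:
    """Generate a step-by-step solution, then critique and revise it."""
    solution = call_llm(
        f"Solve the following problem step by step.\n\nProblem: {question}"
    )
    for _ in range(max_revisions):
        critique = call_llm(
            "Check the reasoning below for conceptual, formula, or arithmetic errors. "
            "Reply 'CORRECT' if it is sound, otherwise list the errors.\n\n"
            f"Problem: {question}\nSolution: {solution}"
        )
        if critique.strip().upper().startswith("CORRECT"):
            break  # the critic found no errors; keep the current solution
        solution = call_llm(
            "Revise the solution so that it fixes the errors listed in the critique.\n\n"
            f"Problem: {question}\nSolution: {solution}\nCritique: {critique}"
        )
    return solution
```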

We analyze the curated SciInstruct from multiple perspectives (e.g., domain, scale, source, question type, and answer length). To verify the effectiveness of SciInstruct, we fine-tuned different language models with it, namely ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B, enhancing their scientific and mathematical reasoning capabilities without sacrificing the language understanding capabilities of the base models.
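As a rough illustration of this fine-tuning step, the sketch below runs standard supervised instruction tuning on question/solution pairs with Hugging Face Transformers. The tiny base model and in-memory examples are placeholders, not the actual SciGLM training configuration.

```python
# Supervised fine-tuning sketch on instruction/response pairs.
# "gpt2" and the inline example are placeholders; the paper fine-tunes
# ChatGLM3 (6B/32B), Llama3-8B-Instruct, and Mistral-7B on SciInstruct.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

examples = [  # placeholder rows with the same structure as SciInstruct items
    {"question": "Compute the derivative of x^2.", "solution": "d/dx x^2 = 2x."},
]

def collate(batch):
    texts = [f"Question: {b['question']}\nAnswer: {b['solution']}" for b in batch]
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512,
                    return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()              # causal-LM loss targets
    enc["labels"][enc["attention_mask"] == 0] = -100      # ignore padding positions
    return enc

loader = DataLoader(examples, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```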


Figure 1: Average accuracy of different LLMs on the CEval-Sci, SciEval, SciBench, MATH, and SAT-Math benchmarks.

Examples Generated with SciGLM

Examples from SciBench that SciGLM (32B) solves accurately. They show that after instruction tuning, SciGLM learns to first analyze the knowledge required for each problem and then solve it step by step with the correct formulas and calculations.

Our Dataset: SciInstruct

Figure 2: The pipeline for constructing SciInstruct. The far left shows the mix of source training datasets. The annotation stage supplements unlabelled questions with chain-of-thought reasoning via reflective generation, and the filter stage trains an instruction-quality classifier that keeps only high-quality reasoning traces as instructions.
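The filter stage can be pictured as scoring each candidate reasoning trace with a trained quality classifier and keeping only high-scoring ones. The sketch below uses a generic sequence-classification model as the scorer; the checkpoint name and threshold are placeholders for illustration, not the classifier actually trained for SciInstruct.

```python
# Sketch of the filtering step: score (question, reasoning) pairs with a
# quality classifier and keep only high-confidence traces as instructions.
# The checkpoint and threshold are placeholders for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "distilbert-base-uncased"   # placeholder for a trained quality classifier
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
classifier = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
classifier.eval()

def keep_high_quality(pairs, threshold=0.9):
    """Return only the (question, reasoning) pairs rated as high quality."""
    kept = []
    for question, reasoning in pairs:
        enc = tokenizer(question, reasoning, truncation=True, max_length=512,
                        return_tensors="pt")
        with torch.no_grad():
            logits = classifier(**enc).logits
        p_good = torch.softmax(logits, dim=-1)[0, 1].item()  # class 1 = "high quality"
        if p_good >= threshold:
            kept.append((question, reasoning))
    return kept
```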

Figure 3: Domain and question type proportions of SciInstruct.

Examples in SciInstruct

Overall Scientific Results

Overall Mathematical Results

Reference

If you find our work helpful, please cite our paper:

@article{zhang2024sciglm,
  title={{SciGLM}: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning},
  author={Zhang, Dan and Hu, Ziniu and Zhoubian, Sining and Du, Zhengxiao and Yang, Kaiyu and Wang, Zihan and Yue, Yisong and Dong, Yuxiao and Tang, Jie},
  journal={arXiv preprint arXiv:2401.07950},
  year={2024}
}