Motivated by the poor performance of large language models (LLMs) on highly compositional reasoning questions, we conducted a quantitative evaluation on two datasets: geographic location and kinship relations.
The results show that LLMs are deficient in both deductive and inductive reasoning, and they suggest two possible remedies: supplying the model with explicit logic rules and using specifically designed prompting.