Paper: MetaSQL: A Generate-and-rank Framework for Natural Language to SQL Translation
Venue: ICDE 2024
Authors: Yuankai Fan, Zhenying He, Tonghui Ren, Can Huang, Yinan Jing, Kai Zhang, and X. Sean Wang
Abstract: The Natural Language Interface to Databases (NLIDB) empowers non-technical users with database access through intuitive natural language (NL) interactions. Advanced approaches, utilizing neural sequence-to-sequence models or large-scale language models, typically employ auto-regressive decoding to generate unique SQL queries sequentially. While these translation models have greatly improved the overall translation accuracy, surpassing 70% on NLIDB benchmarks, the use of auto-regressive decoding to generate single SQL queries may result in sub-optimal outputs, potentially leading to erroneous translations. We propose MetaSQL, a unified generate-then-rank framework that can be flexibly incorporated with existing NLIDBs to consistently improve their translation accuracy. MetaSQL introduces query metadata to control the generation of better SQL query candidates and uses learning-to-rank algorithms to retrieve globally optimized queries. Specifically, MetaSQL first breaks down the meaning of the given NL query into a set of possible query metadata, representing the basic concepts of the semantics. These metadata are then used as language constraints to steer the underlying translation model toward generating a set of candidate SQL queries. Finally, MetaSQL ranks the candidates to identify the best matching one for the given NL query.
Drawing inspiration from controllable text generation techniques in NLP, MetaSQL incorporates control signals, either explicitly or implicitly, into the standard auto-regressive decoding process, thereby facilitating more targeted SQL generation. To tackle the problem of insufficient output diversity, MetaSQL introduces query metadata as an explicit control signal to manipulate the behavior of translation models for better SQL query candidate generation. Additionally, to overcome the lack of global context, MetaSQL reframes the NL2SQL problem as a post-processing ranking procedure (as an implicit control signal), leveraging the entire global context rather than partial information involved in sequence generation.
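
The three stages described above (propose metadata, generate one candidate per metadata condition, rank the candidates against the whole question) can be sketched as a small pipeline. This is a minimal illustration only, not the authors' implementation: the function names (generate_then_rank, toy_propose_metadata, toy_translate, toy_score), the metadata labels, the example table singer, and the token-overlap scorer are all assumptions made for this sketch. In the framework itself, the translator would be an existing NLIDB model and the scorer a learned ranking model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    metadata: str  # control signal the candidate was generated under
    sql: str       # SQL text produced by the translation model


def generate_then_rank(
    question: str,
    propose_metadata: Callable[[str], Sequence[str]],
    translate: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> List[Candidate]:
    """Return candidates sorted from best to worst match for the question."""
    # Stage 1: decompose the question into plausible metadata conditions.
    conditions = propose_metadata(question)

    # Stage 2: generate one candidate per condition, skipping duplicate SQL.
    seen, candidates = set(), []
    for condition in conditions:
        sql = translate(question, condition)
        if sql not in seen:
            seen.add(sql)
            candidates.append(Candidate(condition, sql))

    # Stage 3: rank complete candidates against the full question.
    return sorted(candidates, key=lambda c: score(question, c.sql), reverse=True)


# --- Hypothetical stand-ins, used only to make the sketch runnable ---

def toy_propose_metadata(question: str) -> List[str]:
    # Assumed labels; a real system would predict these from the question.
    return ["project", "where", "group"]


def toy_translate(question: str, metadata: str) -> str:
    # A lookup table standing in for a metadata-conditioned translation model.
    templates = {
        "project": "SELECT name FROM singer",
        "where": "SELECT name FROM singer WHERE age > 30",
        "group": "SELECT country, COUNT(*) FROM singer GROUP BY country",
    }
    return templates[metadata]


def toy_score(question: str, sql: str) -> float:
    # Count shared alphanumeric tokens; a learned ranker would replace this.
    question_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    sql_tokens = set(re.findall(r"[a-z0-9]+", sql.lower()))
    return float(len(question_tokens & sql_tokens))


QUESTION = "List the name of each singer with age over 30"
ranked = generate_then_rank(QUESTION, toy_propose_metadata, toy_translate, toy_score)
for candidate in ranked:
    print(candidate.metadata, "|", candidate.sql)
```

The point of the design is that stage 3 scores complete candidates against the complete question, so the ranker sees the global context that token-by-token decoding lacks, while stage 2 supplies the output diversity by varying the control signal.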