Recently we had M.C. Srivas, CTO and co-founder of MapR Technologies, as a speaker at our Munich Hadoop User Group. He gave a nice talk about the Apache Drill project, which develops a tool providing fast, interactive SQL on Hadoop and other data sources. We took the opportunity to ask Srivas a thing or two about Drill and his view on it.
What is your personal role within the Drill project?
M.C. Srivas: I work with the Drill team to understand performance, figure out some of the architectural issues, and basically play around with it. But the project is mainly run by Jacques Nadeau, and he is a really great guy to work with.
What is the philosophy behind Apache Drill?
M.C. Srivas: Two main things: First, Drill is designed to be completely extensible, and second, Drill is designed to do things that the underlying data storage may not be capable of doing, yet exploit the power of the storage when it is indeed capable.
How does Drill compare to Hive, Impala & Shark?
M.C. Srivas: Drill implements full ANSI SQL 2003, with some really cool extensions. Hive, Impala and Shark are all implementations of the Hive query language, which is different from ANSI SQL.
What makes Apache Drill special?
M.C. Srivas: Apache Drill is the first time anyone has tried to handle semi-structured data in a meaningful manner within the SQL language. It is also the first time that SQL can handle self-described data without requiring a meta-data manager. So an analyst can query data without requiring a schema definition. The raw data can be directly processed without ETL. The toughest challenge was to detect the schema automatically, and to compensate when the schema changes during the query itself.
How are the different data sources reflected in the query language?
M.C. Srivas: The source of the data is included directly in the FROM clause instead of using connectors. Other interesting innovations are tokenizing the directory structure in the data file pathnames, so those tokens can be used in the query.
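To make the idea of tokenizing directory structure more concrete, here is a minimal Python sketch. It is not Drill code: the function names, the sample paths, and the token names `dir0`, `dir1` are assumptions chosen for illustration. It only shows how the directory levels in a file path could be turned into named values that a query can filter on.

```python
from pathlib import PurePosixPath


def directory_tokens(root, file_path):
    """Return the directory levels between root and the file as named tokens."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    # Every path component except the file name itself becomes one token.
    # The names dir0, dir1, ... are an assumption made for this sketch.
    return {f"dir{i}": part for i, part in enumerate(relative.parts[:-1])}


def select_files(root, file_paths, **wanted):
    """Keep only the files whose directory tokens match the requested values."""
    selected = []
    for path in file_paths:
        tokens = directory_tokens(root, path)
        if all(tokens.get(name) == value for name, value in wanted.items()):
            selected.append(path)
    return selected


# Hypothetical layout: one directory level per year, one per month.
files = [
    "/data/logs/2014/01/events.json",
    "/data/logs/2014/02/events.json",
    "/data/logs/2013/12/events.json",
]
print(directory_tokens("/data/logs", files[0]))
print(select_files("/data/logs", files, dir0="2014"))
```

Filtering on a token before opening any file is what makes this useful: a query restricted to one year never has to read the other directories.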
How does Drill handle nested data?
M.C. Srivas: Drill introduces a FLATTEN clause to promote nested data to the top level, where it can be queried. Drill also borrowed an idea from Google’s Dremel and BigQuery to query inside the nested data, by implementing the WITHIN RECORD clause in the FROM clause.
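The effect of promoting nested data to the top level can be sketched in a few lines of Python. This is an illustration of the concept only, not Drill's implementation; the `flatten` function and the sample `orders` data are hypothetical.

```python
def flatten(records, field):
    """Yield one output row per element of the list stored under `field`."""
    for record in records:
        # A record with a missing or empty list produces no rows in this sketch.
        for element in record.get(field, []):
            row = dict(record)    # copy the top-level columns
            row[field] = element  # replace the list with a single element
            yield row


# Hypothetical nested input: each order carries a list of items.
orders = [
    {"id": 1, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]},
    {"id": 2, "items": [{"sku": "c", "qty": 5}]},
]
rows = list(flatten(orders, "items"))
for row in rows:
    print(row)
```

After flattening, each item sits in its own row next to the order's `id`, so ordinary filters and aggregates can be applied to it.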
You mentioned changing schemas while the query is running. How does Drill manage these?
M.C. Srivas: Drill does all its work in 256K boundaries. If a schema change is detected within the last 256K of data, it will first emit whatever it has computed so far, and then reconfigure its operators to the new schema and continue execution.
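The emit-then-reconfigure behavior described here can be approximated with a small Python sketch. It makes several simplifying assumptions that are not from the interview: batches are bounded by record count rather than by 256K of data, a schema is just the set of field names with their Python type names, and "reconfiguring" is reduced to switching the tracked schema. All names are hypothetical.

```python
def schema_of(record):
    """Describe a record as a sorted tuple of (field name, type name) pairs."""
    return tuple(sorted((name, type(value).__name__) for name, value in record.items()))


def batches_by_schema(records, batch_size=4):
    """Yield (schema, batch) pairs; a schema change flushes the current batch early."""
    current_schema, batch = None, []
    for record in records:
        schema = schema_of(record)
        if batch and (schema != current_schema or len(batch) == batch_size):
            yield current_schema, batch  # emit what has been collected so far
            batch = []
        current_schema = schema          # stand-in for reconfiguring the operators
        batch.append(record)
    if batch:
        yield current_schema, batch


# Hypothetical stream whose schema changes twice while it is being read.
stream = [
    {"id": 1, "price": 10},
    {"id": 2, "price": 12},
    {"id": 3, "price": 9.5},               # price switches from int to float
    {"id": 4, "price": 11.0, "tag": "x"},  # a new field appears
]
result = list(batches_by_schema(stream))
for schema, batch in result:
    print(schema, len(batch))
```

Every batch that reaches a downstream consumer has a single, consistent schema, which is the property that lets operators be set up once per batch instead of once per record.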
How is MapR involved in the Drill project?
M.C. Srivas: MapR kicked off the Drill project about 18 months ago, and now has almost 20 engineers working full-time on it. But the project itself is much larger than MapR, and several companies and individuals are involved in it. I suspect that, including the folks at MapR, there are about 35-40 people actively working on Drill.
Does Drill take advantage of any of MapR’s special features?
M.C. Srivas: No, because it’s not possible to do so. MapR’s special features are all administrative improvements and do not modify the API.
Finally: What’s your impression of the Munich Hadoop User Group?
M.C. Srivas: I think there’s always a lot of interest in Hadoop and Big Data in general in Munich. There are many companies in Munich that are doing Hadoop projects. I am very grateful to comSysto for sponsoring and organizing the HUG regularly. comSysto is a great company and the people and the management are really terrific to work with.
If you would like to find out more about Drill, you can have a look at the project's website. M.C. Srivas' slides from his talk at the user group are available here. If you would like to know how comSysto can help your company take full advantage of its data, please feel free to contact us!