Author: Weiqing Yang
Current state: Accepted
Released: Samza 1.0
Samza SQL users have long been asking for a one stop shell, in which SQL statements and SQL files can be executed interactively. The requirement can be summarized as following:
An interactive command shell
Execute SQL queries and non-queries, show tables, functions, schema, etc.
Execute SQL files
Set environment variables interactively and/or via configuration file to configure behavior of shell
Command editing, history, highlight and auto-completion
Display SQL query in a streaming friendly way
List status of submitted non-query SQL statements and be able to stop them
Apart from Samza SQL, support other streaming SQLs and data sources by letting user implement their own plugins
We need to implement a Samza SQL shell.
Terminal: Choice of Programming Language and Libraries
Any shell is essentially a REPL (Repeat-Evaluate-Print Loop), yet any non-trivial one is much more than that. Features like highlighting, history, command line editing, auto completion and rich data presentation require the capability of full control of the terminal. Terminal programming is a messy area, however, with the complexity coming from the long history and the vast variety of terminal types and operating systems. C/C++ is ideal for terminal IO programming, but for easier interoperation with data sources and other systems, Java is chosen.
Using Java to directly control the terminal is a lot of work. For example, to use the alternate screen and restore the previous screen we need to write code like this:
System.out.print(“\e[?1049h"); // tput smcup
System.out.print("\e[?1049l"); // tput rmcup
Which is neither pretty nor platform independent. Library Lanterna offers a Java style abstraction and solves the platform dependency problem. However, Lanterna is a very low level API; there’s still too much work. For example, for features like highlighting and auto completion, we need to use the non-canonical mode of the terminal, in which in order to process the simple user input of backspace we need to move the cursor, print a space, and move the cursor again.
The final library chosen is higher level one, Jline 3.8.2. Though knowledge of terminal programming is still necessary, Jline saves a lot of work. The mysterious code above now becomes:
terminal.puts(InfoCmp.Capability.enter_ca_mode); // tput smcup
terminal.puts(InfoCmp.Capability.exit_ca_mode); // tput rmcup
Use “HELP” to show all commands and their brief descriptions, or “help <command>” to get more detail about a specific command.
Use "SHOW TABLES" to Displays all tables. Use "SHOW FUNCTIONS" to display all functions. Use "DESCRIBE <table name>" to display the schema of a table. The result is displayed as a formatted table and the column width is automatically adjusted, where lines are automatically wrapped within the columns to make most efficient usage of the screen room. Lines are colored with alternative colors for better reading.
Use "SELECT" command to executes a select SQL statement. The shell uses a logging view to display the result for now (table view will be supported in the future and users will be allowed to choose among various views by the SET command). User can press SPACE to pause the view from scrolling or resume displaying. When the view is paused, data accumulates in the buffer.
Use "EXECUTE" command to executes a SQL file. All the SELECT statements in the file will be omitted as executing streaming SELECT statements in a batch doesn’t make sense.
Use "INSERT INTO" to execute an INSERT INTO statement. User gets an execution ID if this non-query statement is submitted successfully, and they can use the ID to control the execution later.
Use "LIST" command to list all the non-query executions that have been executed. It displays the ID, status and original statement of the execution. You can also use “LS [id1, id2...]” to list specific executions.
Use "STOP" command to stop the specified execution(s).
Use "RM" command to remove stopped executions(s) that from showing in the future.
SET: Use “SET” command to display the name and value of all environment variables that impacts the behavior of the shell and the "SqlExecutor". Use “SET <<name>=<value>>” to set a variable. These environment variables can also be set in the configuration file, namely "conf/shell-defaults.conf".
Use "CLEAR" command to clear the screen.
Use "EXIT" or "QUIT" to quit the shell.
Samza SQL shells run jobs in Samza standalone mode. Samza executor can run multiple non-query jobs as backend jobs simultaneously, and run a single query job as a front-end job. Job results are stored in a data buffer. Samza SQL shell provides commands for users to manage their jobs easily.
The configuration file is located at "conf/shell-defaults.conf". Most of the variables set here can also be set in the shell via command "SET". Some of the important configurations are listed below.
Specifies which SQL executor to use. By default org.apache.samza.sql.client.impl.SamzaExecutor is used, a.k.a Samza SQL.
Some of the samza code, though not supposed to, actually prints something to the stdout and stderr when a SQL is being executed. Those messages shall not be seen by the end user, and more importantly, they mess up the terminal display. Imagine you are painting a table on the screen and refreshing it every second, and part of it are replaced by some INFO or WARNING information!
When the shell starts up it takes control of the terminal completely and turns off all output to Java System.out and System.err. To prevent the shell from doing so, set "shell.debug" to true.
Samza SQL shell is designed to support other SQLs like Spark SQL. The org.apache.samza.sql.client.interfaces package enables user to implement the SQL executor of their own by implementing the SqlExecutor interface. Other interfaces and classes in the package serves for this purpose.
Data Feed Model
Many SQL frameworks today use callback functions to feed data to the client. Our model does NOT allow an executor to print directly to the screen. An executor shall cache the data and wait for the shell to retrieve the data upon necessary (like when the shell decides to refresh the screen).
We believe exceptions are exceptional. When implementing a SqlExecutor, report recoverable errors (like table does not exist in a select statement) by return values, not exceptions. Report EXCEPTIONAL and unrecoverable conditions like unexpected corrupted or inconsistent data structures by thrown an ExecutionException. The shell will report the exception to the user and try to clean up and exit.
Command Line Editor
The shell supports primitive auto completion and highlighting at the moment. SQL Keywords are highlighted and pressing tab key triggers auto completion. It can NOT handle keywords with spaces (like INSERT INTO) and double quotes though. This will be improved as a future work.
Multiple line editing is supported, by using ‘\’ character or double quote character at the end of a line.
Press up and down keys to navigate among history commands. One thing to note is if you have typed in some characters at the command line, pressing up and down keys will navigate among only those history commands that start with those characters.
As any common shell, CTRL C gives up the current line input, CTRL D ends the session, CTRL A moves cursor to the start of the line and CTRL E moves the cursor to the end of the line.
Displaying the query result
Assume an user tries to execute SQL statement that selects profiles of those who have visited his LinkedIn page in the past 5 minutes. How do we display the result?
When the data volume is huge and comes very fast, a table view is probably the best. No matter how much the data volume is, the screen always displays one page out of maybe 5,000 pages. The page may contain 50 lines, each of which displays some brief information of a record. User can navigate through pages and select the row they are interested in and see the detail of the record.
When the amount of data is much less, however, the table view is not straightforward. A logging view, which continuously displays the newly coming data and scrolls the screen up is more convenient for the user. Of course, the user can still pause and resume the screen so they can examine a specific record that they are interested in.
The shell supports the logging view at the moment. The table view will be supported as a future work.
SamzaExecutor is the default implementation of SqlExecutor. SamzaExecutor returns job execution results into a data buffer, where the shell retrieves results and then show those results in the terminal with the given format specified by users.
List tables. Currently Shell can only talk to Kafka system, but in SAMZA-1902, we will use a general way to connect to different systems.
Get table schema. SamzaExecutor uses AvroSqlSchemaConverter to convert Avro schema to Samza SQL schema. Currently Shell works only for systems that has Avro schemas.
List functions. Currently the Shell only shows some UDFs supported by Samza internally. We may need to require UDFs to provide a function of getting their "SamzaSqlUdfDisplayInfo", then we can get the UDF information from SamzaSqlApplicationConfig.udfResolver or SamzaSqlApplicationConfig.udfMetadata (please refer to SAMZA-1957).
Execute non-query jobs. SamzaExecutor can run multiple non-query jobs simultaneously, and users can manage (ls / stop / rm) each job easily.
Execute query jobs. SamzaExecutor runs a single query job as a front-end job. It stores query results in a data buffer, where Shell retrieves results.
Compile and Debug
./gradlew clean releaseSqlShellTarGz
tar -xvzf samza-sql-shell/build/distributions/*.tgz -C samza-sql-shell/build/distributions/
Run using samza-sql-shell.sh:
- Users can configure conf/samza-sql-shell-log4j.xml
- Users can perform “SET shell.debug=true” in shell or configure “shell.debug” in conf/shell-defaults.conf
- Users can check log information in logs/samza-sql-shell.log