Querying Databricks with Spark SQL: Leverage SQL to query and analyze Big Data for insights (English Edition)
Ebook · 1,065 pages · 21 hours


About this ebook

Databricks stands out as a widely embraced platform dedicated to the creation of data lakes. Within its framework, it extends support to a specialized version of Structured Query Language (SQL) known as Spark SQL. If you are interested in learning more about how to use Spark SQL to analyze data in a data lake, then this book is for you.

The book covers everything from basic queries to complex data-processing tasks. It begins with an introduction to SQL and Spark. It then covers the basics of SQL, including data types, operators, and clauses. The next few chapters focus on filtering, aggregation, and calculation. Subsequent chapters cover dates and times, formatting output, and using logic in your queries, as well as joining tables, subqueries, derived tables, and common table expressions. Later chapters discuss correlated subqueries, joining and filtering datasets, using SQL in calculations, segmenting and classifying data, rolling analysis, and analyzing data over time. The book concludes with a chapter on advanced data presentation.

By the end of the book, you will be able to use Spark SQL to perform complex data analysis tasks on data lakes.
Language: English
Publisher: BPB Online LLP
Release date: Oct 5, 2023
ISBN: 9789355518026

    Book preview

    Querying Databricks with Spark SQL - Adam Aspin

    CHAPTER 1

    Writing Basic SQL Queries

    Introduction

    Welcome to Databricks and the new world of data analysis that you are about to experience. As you are standing on the threshold of this voyage into the realm of serverless analytics, you could be feeling a little apprehensive. Well, do not worry; your journey will be as simple and comprehensible as possible. This chapter will start you on your adventure, first by outlining the software that you will be using and then by explaining what Databricks is. Then, it will show you how to take an initial look at the data itself. As you progress, you will learn how to be more selective about the data that you analyze.

    It may seem obvious, but you will need some data in an accessible dataset before you can start your analysis. Throughout this book, you will be developing your analytical skills with the aid of a sample dataset named Prestige Cars. This dataset contains a small amount of data concerning sales of vehicles by a fictitious British car reseller. If you want to try the examples in this chapter, you will have to download the sample data from the BPB website and install it into a version of Databricks. So, it follows that now could be a good time to set up the sample dataset as described in Appendix A unless you have already done so. Of course, you can install the sample data only if Databricks is already available. So, if you are not in an enterprise environment where Databricks is already accessible, you will need to set up an account for the Databricks community edition before anything else.

    Structure

    In this chapter, you will learn about the following topics:

    Databricks

    The Databricks Web interface

    Getting started with Databricks notebooks

    Cells

    Spark SQL

    Databricks databases (schemas) and tables

    Activating a database

    Displaying the data in a table

    Help writing SQL

    Limiting the number of records displayed

    Displaying data from a specific field

    Finding the columns in a table

    Displaying data from a specific set of fields

    Modifying the field name in the output

    Removing duplicates from query output

    Sorting data

    Sorting data in reverse order

    Applying multiple sort criteria

    Running SQL queries

    Displaying the available tables

    Finding all the views in a database

    Resizing notebook cells

    Overriding the 1,000-row output limit

    Assuming that you have all the prerequisites in place, it is time to move on to the core focus of this chapter and start querying Databricks data.

    Objectives

    In this chapter, you will learn how to:

    Query Databricks databases using Databricks notebooks

    List the contents of tables

    Select only certain fields in tables to display

    Display only a few records from a table

    Give columns new names in your query output

    Sort your output

    Note: If you know a little about SQL and if you have a basic knowledge of databases, then feel free to skip past the first few sections of this chapter (or even the first few chapters) until you find the parts that are new to you. If this is not the case, it is preferable to start from the beginning, as the early sections provide all the information that you are likely to need to get the most out of your SQL learning experience.

    Databricks

    Databricks is a cloud data platform that delivers serverless analytics. This means that all the data and processing take place in the cloud (currently, Amazon Web Services, Microsoft Azure, or Google Cloud). All the analytics and data querying that you will do is carried out in a Web browser. Unlike traditional databases, Databricks does not load data into a database but stores data in files in a cloud-based data lake—hence the term serverless analytics.

    Another area where Databricks breaks with the traditional database paradigm is by separating data storage from processing. This approach allows for much greater scalability and the ability to handle huge amounts of data, making Databricks a leader in the area of Big Data.

    The technical approaches that underpin Databricks are extremely complex, but fortunately, the Databricks interface hides virtually all the complexity. All you have to do is load data (or access data that is already loaded) and start querying.

    The Databricks Web interface

    To start querying data in Databricks, you first need to log on. If you have a login to an enterprise version of Databricks, then you will have been given a corporate login to use. If not (and assuming that you have created a community account), then all you need to do is to enter the following URL in your Web browser:

    https://community.cloud.databricks.com/login.html

    This will take you to the Databricks startup page, as shown in Figure 1.1:

    Figure 1.1: The Databricks startup page

    Hover over the left sidebar to expand it. This is the main menu where you can find all the core Databricks options. You can see this in Figure 1.2:

    Figure 1.2: The Databricks main menu

    Note: This book is only focused on one aspect of Databricks: querying data with Spark SQL. Consequently, there are huge swathes of Databricks that we will not be looking at. If you need to delve into other aspects of the platform, then the Databricks documentation is a good place to start.

    Getting started with Databricks notebooks

    Interaction with data in Databricks takes place in Databricks notebooks. You can consider these to be a kind of programming window where you will write your code to analyze and return data. The first thing to do is to create a new notebook. To do this, follow these steps:

    Hover over the plus symbol in the left pane and select Notebook in the popup pane. You can see this in Figure 1.3:

    Figure 1.3: The Create pane

    In the dialog that appears, enter a name for your notebook, ensure that the default language is set to SQL, and choose the cluster that you will be using to compute results. You can see this in Figure 1.4:

    Figure 1.4: The Create Notebook dialog

    Click Create, and a new notebook will be created. You can see this in Figure 1.5:

    Figure 1.5: A Databricks Notebook

    It is as simple as that. You can now start typing in SQL code to query your data.

    Notebooks are fundamental to the user experience and are extremely simple to use. Their simplicity belies their power, however. As the idea is to start using Spark SQL to query data, it is probably best not to overwhelm you with interface considerations for the moment. Nonetheless, to begin, you need to be aware of the following points:

    A notebook must be attached to a compute cluster. If you are working in a corporate environment, then your IT department will indicate which cluster to use. If you are using the Databricks Community edition, then you will need to set up a cluster and connect this cluster to your notebooks.

    Notebooks are saved automatically. You do not have to manually save the code that you enter in a notebook.

    Notebooks can contain code written in SQL, Python, Scala, or R. The notebook needs to know which language you will be using to process code. In our case, this will be SQL, which is why you selected SQL as the default language. You can change the default language at any time by clicking on the language popup at the top of the notebook.

    A notebook name can contain virtually any combination of letters, numbers, spaces, and even special characters.

    Notebooks are stored in workspaces, which are a kind of container for your (or your colleagues’) interactions with Databricks.

    Cells

    All the SQL that you write to query data in Databricks will be written in a cell inside a notebook. You can create as many cells as you like and can reorder them inside the notebook. Cells can be copied and deleted, and each cell can run code in any of the languages that Databricks uses.

    This allows notebooks to become complex programs, where the code contained in sequences of cells can be run from end to end. For the moment, it is easiest (at least when starting out with Databricks notebooks) to use a single cell for your SQL and replace the cell contents with the code that you are writing for each code sample in the book.

    Note: In a corporate environment (but not, unfortunately, when using the community edition of Databricks), you will also see an option for SQL Warehouse in the menu bar on the left. This also lets you enter SQL commands and offers a powerful SQL editor. The results will, in any case, be the same as those when using notebooks to write SQL queries.

    Spark SQL

    Databricks lets you use several languages to manipulate data. This book (as you can guess from the title) focuses on only one of these languages: Structured Query Language, or SQL. The version of SQL that Databricks uses is known as Spark SQL. Databricks itself is built on top of the Spark open-source technology stack. Spark SQL is a full-fledged and robust variant of SQL that you can use to extract and shape data stored in Databricks data lakes. Spark SQL will generally be referred to simply as SQL in this book.

    Databricks databases (schemas) and tables

    As you have decided to learn and use Spark SQL queries to analyze data stored in Databricks, you first need to know a few of the basic concepts that underpin SQL and databases. This is how Databricks classifies the data that you will be querying in Spark SQL:

    At the lowest layer are the files that contain data. As far as the sample data you will be using is concerned, these are simple comma-separated lists.

    These lists are attributed to a metadata layer (a conceptual overlay) that gives each file a recognizable structure. This structure is called a table. All the rows in a table consist of the same number of columns. So, a table is really nothing more than a well-structured list—rather like the ones that you have probably encountered in Excel.

    Tables are grouped (logically, at least) into databases or, as Databricks prefers to call them, schemas.

    To summarize, then, a Databricks database (or schema) is a collection of lists (tables) containing columns (fields) of data in a set of rows (records). These elements can then be accessed independently or joined together to deliver the analysis you are looking for.

    Conceptually, a database looks something like the structure shown in Figure 1.6:

    Figure 1.6: Conceptualizing data in Databricks

    It is worth grasping from the outset that Databricks does not copy the source files into tables like a traditional database does. The data remains in the source files. What Databricks does is add an internal overlay to the data in the files so that the data can be accessed as if it were in a database, which means that you can use SQL (the classic database query language) to query the data.

    Note: Please note that all the exercises in this book use Databricks’s default database. The sample data is visible in the default database in the Databricks Community edition.

    Activating a database

    Before running any queries, you need to ensure that your notebook is using the correct database. To switch to (or to keep using) the default database where the sample data is stored, simply enter the following command in a notebook cell:

    USE DATABASE default;

    Then press Ctrl + Enter to run the command. Databricks will simply display OK beneath the cell to indicate that the database is active.
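    If you ever need to double-check which database is currently active, Spark SQL has a built-in current_database() function that you can query directly, as in this short sketch:

    SELECT current_database();

    Running this in a cell should return default if the command above worked.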

    Displaying the data in a table

    For your first query, let us suppose that you have loaded the sample data as described in Appendix B and that you want to see the details of vehicles sold by Prestige Cars Ltd:

    In a blank notebook cell, type in the following short piece of SQL:

    SELECT  *

    FROM    allsales;

    Press Ctrl + Enter to run the code and show the results. The results will appear under the code, as shown in Figure 1.7:

    Figure 1.7: Displaying all the data in a table with SELECT *

    How does it work

    In just four words, you have said to Databricks, "Show me the complete contents of the AllSales table, including all the rows and all the columns of data." All you needed to know was which table you wanted to look at. SQL did the rest.

    This code snippet is incredibly simple—four words in all—but that is enough to show you how SQL works in practice. If you issue the right command, then you will get back the data you want to see. The point of the command is to let you see everything that is stored in a table of data.

    Of course, you will need to know which table contains the data you want to display when you are dealing with your own data. If you are working in an enterprise environment, this may involve talking to the people in your organization who developed or maintained the databases. Alternatively, there may be documentation that you can read to find the information you require.

    If there is no one you can ask and no documentation available, then you can still acquaint yourself with the data by running a SELECT * FROM query against each data table that you can find in Databricks. However, before doing this, it is best to continue a little further in this book and learn how to limit the number of records output by a query. Later in this chapter, you will learn how to display all the tables in a database.

    So, what exactly have you done here? Let us take a closer look at what you have written. Figure 1.8 breaks down the SQL statement into its constituent parts:

    Figure 1.8: The anatomy of a simple SQL statement

    These constituent parts are as follows:

    Keywords: SQL is built using a set of keywords that you combine to produce commands. These commands (or snippets or phrases, if you want to call them that) are in English and follow a fairly rigorous syntax. Learning to understand and apply the grammar of SQL is what this book is all about.

    Clauses: These are a set of short phrases composed of keywords and data elements that make up a SQL statement.

    This was a simple SQL command, but the following are nonetheless a few key points that you will need to remember:

    As mentioned previously, Databricks can contain (or host, if you prefer) dozens of databases. This means you always have to tell Databricks which database you want to interrogate. When using the community edition, it is simpler to use the default database. In a corporate environment, you should probably not use the default database.

    If you do not indicate the correct database to use, Databricks will either return no data at all or, worse, return the wrong data from another database.

    Ctrl + Enter runs (or executes, if you prefer) the SQL code that you have typed or copied into a notebook cell. There are other ways to run a query that you will learn a little later in this chapter.

    You can click inside the table of data returned by a query to scroll left and right or up and down through the data.

    It can happen (even when you are an experienced data analyst) that executing a query returns nothing more than an irritating error message like the following:

    Error in SQL statement: AnalysisException: Table or view not found: allsales; line 1 pos 14;.

    This probably means you have not specified correctly which database you want to query or that you have simply mistyped the table name.

    In this book, the data tables that you need to use will be explained as you meet them. Here you have seen the first of the small set of tables that make up the sample data. As its name suggests, the AllSales table contains the details of cars that have been sold by Prestige Cars Ltd.

    By default, only 1,000 rows are output from a query. This can be overridden, as you will learn a little later in this chapter.

    Under the output data, you can also see how long it took to process the data, when it was last refreshed, the number of rows returned by the query, and how long since the query was run.

    Help writing SQL

    While Databricks notebooks cannot write SQL for you, they can help you enter the names of tables, fields, and SQL keywords. As an example, clear the contents of the current cell and then do the following:

    Start typing SELECT.

    After three characters, press the Tab key.

    You will see that the notebook has found only one matching element, whether a keyword, table, or field name (here, the SELECT keyword), and has completed the word for you. To continue the example, clear the contents of the current cell and then enter:

    SELECT * FROM al

    Now press the Tab key. You will see a popup menu with any possible elements beginning with al that you could use, as shown in Figure 1.9:

    Figure 1.9: The choice of available elements using autoprompt

    Click the required item in the list, and it will be added to the SQL code.

    Limiting the number of records displayed

    Databricks tables can contain millions—or even billions—of records. Each time you run a query, all the records you see are sent from the server to the querying application. It follows that displaying all the rows in a huge table can place an unnecessary strain on the server that stores the data and on the network that connects the server to your workstation. So, in the real world, it is a good idea to display only a small number of records when examining a table. This is particularly true when looking at a table for the first time.

    Fortunately, SQL has a way of limiting the number of rows returned by a query. This is nothing more than a simple extension of the command that you used in the previous section. Here you can see how to tweak a piece of SQL code to display only a few records in the query output.

    Delete any SQL that might be in the query window (or open a new query window), and enter the following code snippet:

    SELECT  *

    FROM    allsales

    LIMIT    5; 

    Run the SQL code by pressing Ctrl + Enter.

    How does it work

    Executing this command will display all the columns in the table—but only the first five records. This will look like the output shown in Figure 1.7, limited to the first five rows.

    You may well be wondering which records you are seeing when you ask SQL to display only five rows (or indeed any other number of records). The answer is that you might not even see the same set of records when you run this command at different times. Databricks does not necessarily return the first (or last) five records that were added to a table. Neither does it display the last five to be updated, viewed, or printed.

    This apparently minor question reveals an important fact about Databricks data. Records are stored in no particular order (or, at least, simply in the order they were loaded). There is no immediate way of telling, in a table, whether the data is stored in any sequence, nor does it matter to SQL. Remember this when querying tables, and never trust the data to be returned in any specific order unless you have asked for this to happen. How to do this is explained later in this chapter.
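    If you do need a predictable subset of rows, the answer is to combine LIMIT with the ORDER BY clause that is introduced later in this chapter. As a foretaste, this short sketch returns the five cheapest sales:

    SELECT  *
    FROM    allsales
    ORDER BY SalePrice
    LIMIT   5;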

    The following are a few points that are worth mentioning at this juncture:

    There are several ways to run the SQL code in a query window. Pressing Ctrl + Enter is only one method. Another is to select Run > Run Cell from the notebook menu. Yet another alternative is to click the small triangle at the top right of the notebook cell and click on Run Cell.

    You can enter the keywords and the table and column names in either uppercase or lowercase or even a mixture of the two. In this book, however, we will always enter keywords in uppercase to help them stand out in the code snippet. Hopefully, this will make them easier to learn.

    You do not have to write queries in a purely linear fashion, starting with the SELECT clause and continuing to the end of the query. You can start with any part of the query and build it up as you want.

    Displaying data from a specific field

    A Databricks table can contain hundreds of columns. Most of the time, you will want to display data from only a few of the available columns in a table. The next piece of SQL shows how you choose a single column to display instead of all the available columns in a table. Specifically, it shows how to list only the customer names as follows:

    SELECT CustomerName

    FROM  allsales;

    Executing this piece of code (using any of the techniques that were pointed out at the end of the previous section) will show you something similar to the output shown in Figure 1.10:

    Figure 1.10: Choosing a column from a table

    How does it work

    By replacing the star (or asterisk if you prefer) in your SQL with a specific column name, you have told Databricks that it is this column—and this column only—that you want to display. This example also makes the point that SQL is an extremely modular and extensible language, and you can easily extend or modify a simple command to make it more specific.

    Of course, this approach relies on your knowing the exact name of a column and also typing it exactly as it appears in the table. Fortunately, Databricks has ways of displaying column names and using them in your SQL, as you will discover in the next section.

    Even simple SQL commands have their subtleties. When listing data, you need to remember the following:

    You can write short SQL commands like this one on a single line if you prefer. Alternatively, you could write the following instead of placing each clause on a separate line:

    SELECT CustomerName FROM allsales;

    This is entirely a question of personal choice. As we are presuming that you are new to SQL, we will keep the core SQL statements on separate lines in this book to accentuate the underlying logic of the language. In your queries, you can write the SQL any way you want, as long as it works. All that Databricks wants is that the grammar of the command is technically accurate and that keywords are separated by spaces or returns.

    Finding the columns in a table

    You may well be wondering how on Earth you can be expected to remember all the columns in each table so that you can type them into your SQL queries. Fortunately, Databricks notebooks can help you here by displaying all the columns in a table with a single command.

    Displaying the columns in a table is really easy. All you have to do (in an empty cell in a notebook) is type the following command:

    DESCRIBE allsales;

    You should see something like the output shown in Figure 1.11 (you may see fewer rows for the moment):

    Figure 1.11: Displaying the columns in a table

    When you display the list of columns in a table, you can also see some more technical information concerning the type of data that the column contains. We will explain what datatypes are in later chapters.
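    For reference, the DESCRIBE output contains three columns: col_name, data_type, and comment. The exact fields and types depend on the table, but the output is shaped something like the following (the values shown here are purely illustrative):

    col_name        data_type    comment
    CountryName     string       null
    CustomerName    string       null
    SalePrice       double       null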

    SQL writing style

    You can format the SQL that you write using multiple styles. You may prefer to write short SQL statements on a single line. Alternatively, you may find it clearer if you place the core keywords such as SELECT, FROM, and ORDER BY on separate lines. The choice is yours.

    All you have to remember is that keywords must be separated by a space, a tab character, or a line break. Otherwise, the presentation of your SQL is entirely up to you.
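    For example, the following two statements are equally valid and return identical results; the difference is purely one of presentation:

    SELECT CustomerName FROM allsales ORDER BY CustomerName;

    SELECT    CustomerName
    FROM      allsales
    ORDER BY  CustomerName;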

    Displaying data from a specific set of fields

    SQL does not limit you to displaying all the fields—or only one field—from a table. You can choose not only the fields from a table that you want to display but also the order in which they will appear in the output from the query. The following piece of code shows you how to select two fields (CountryName and CustomerName) from the AllSales table:

    SELECT CountryName, CustomerName

    FROM  allsales;

    Executing this piece of code will show you something similar to the output in Figure 1.12:

    Figure 1.12: Displaying multiple fields from a table

    How does it work

    Here again, you have extended the base SQL that you used at the start of the chapter. Specifically, you have developed the SELECT statement to include the field names containing the data that you want to view. All you had to do was to separate each field name with a comma and place the field names in the order that you want to see their data in the output from left to right.

    There is only one major trick to remember when listing a specific set of fields:

    Remember not to add a comma after the final field in a list of fields in the SELECT clause.

    Columns or fields

    Whereas spreadsheets prefer the term columns, databases tend to use the word fields. So, despite the fact that we have used the term column up until now, we will switch to using the term field from now on. Both words, however, describe the same thing and can generally be used interchangeably.

    Modifying the field name in the output using aliases

    Many datasets have cryptic—or frankly incomprehensible—field names. While, as an analyst or data guru, you might get used to this, it is not always a good idea to present information to users in a way that makes the data harder to understand than is necessary. So, SQL allows you to output the data under a different field header to enhance readability. In this example, you will display the CountryISO3 field under another name. Start by taking a look at the following code snippet and then at the output it returns in Figure 1.13:

    SELECT CountryName, CountryISO3 AS IsoCode FROM allsales;

    Figure 1.13: Changing a field name using an alias

    How does it work

    In the query output, the original field name is replaced by the name you have chosen—IsoCode in this example. This technique is called aliasing; you are giving the field another name in the query. Applying an alias has no effect at all on the underlying data and does not change the underlying field name. What it does do is apply a different (and ideally more comprehensible) name in the query output.

    Aliases have their own particular set of rules that must be adhered to if they are to work correctly. Essentially you need to remember the following points:

    You may have noticed that all the table and field names that have been used so far contain neither spaces nor special (that is, non-alphanumeric) characters. This is because SQL requires you to specify the names of what it calls objects—that is, fields and tables, among other things—in a specific way. However, we do not want to make things appear over-complicated here, so as a starting point, let us just say you are better off avoiding all non-alphanumeric characters when creating an alias for a field.

    If you want to add a space or special characters to an alias (suppose in the example used in this section, you want to see ISO Code as the field heading), then you must place the alias inside backquotes. That is, you can write the following to add a space to the alias IsoCode that you want to use instead of the real field name:

    SELECT CountryName, CountryISO3 AS `Iso Code` FROM allsales;

    In practice, many data people advise that you avoid spaces and nonstandard characters if you can, as this can get painful when writing complex queries. To encourage this, we have stuck to aliases without spaces or non-alphanumeric characters in this book.

    An alias cannot be more than 122 characters long.

    Aliases can contain underscores.

    You can also add an alias without the AS keyword using the following code:

    SELECT CountryName, CountryISO3 `Iso Code` FROM allsales;

    However, we prefer not to use this technique and so will not be taking this approach when aliasing columns. Finally, a few points about backquotes:

    Backquotes are also known as backticks.

    Backquotes are not the same as single quotes. There are cases in SQL when they may be used interchangeably, but aliases with spaces or special characters must use backquotes, not single or double quotes.

    You cannot use double quotes or single quotes in aliases.
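    To draw these rules together, here is a short, illustrative query (the alias names are arbitrary) that uses an underscore in one alias and backquotes for an alias containing a space:

    SELECT CountryName AS Country_Name, CountryISO3 AS `ISO Code` FROM allsales;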

    Removing duplicates from query output

    As befits a company that sells its products across the globe, the CEO of Prestige Cars needs to know where its presence is felt the most. It follows that she wants you to produce a list of the countries where the company’s customers can be found. A quick look at the AllSales table reveals a CountryName field that you can use to list countries.

    Here is the SQL that will produce the list you require:

    SELECT DISTINCT  CountryName

    FROM            allsales;

    This query returns the results shown in Figure 1.14:

    Figure 1.14: Returning distinct elements from the data

    How does it work

    This query applies the DISTINCT keyword to remove any duplicates from the list of countries. This keyword is incredibly useful because it removes all duplicates from the query output before you even see the query results.
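    Note that DISTINCT applies to the combination of all the fields in the SELECT clause, not just the first one. For instance, the following sketch returns each unique pairing of country and make rather than unique countries alone:

    SELECT DISTINCT CountryName, MakeName FROM allsales;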

    Undo and redo

    A Databricks notebook is like many other desktop applications and Web interfaces in that you can undo changes and redo them if you want to.

    Ctrl + Z will undo one or more modifications.

    Ctrl + Y will redo modifications that you have undone.

    Sorting data

    Now that you can select the fields that contain the data you want to see in the sequence that you want to see them, you probably also want to sort the data. For example, you might want to sort car sales by increasing sale price. To do this, just run the following snippet and take a look at the numbers in the SalePrice field in Figure 1.15. They are now sorted from lowest to highest.

    SELECT    MakeName, ModelName, SalePrice

    FROM      allsales

    ORDER BY  SalePrice;

    Figure 1.15: Sorting data in ascending order

    How does it work

    To sort the data returned by a query, just add the ORDER BY keyword (it is considered to be a single keyword even if it is really two words) after the FROM clause of the SQL command. Then you add the field that you are sorting the data on. This creates an ORDER BY clause.

    Equally important is that writing SQL is all about making simple extensions to existing code. So, you do not have to produce instant reams of code that work the first time. You can start with a small snippet of code, test it, and then extend it until it does exactly what you want it to do.

    The following are several key points to remember here:

    The ORDER BY keyword can be used on text (in which case it sorts in alphabetical order), on numbers (where it sorts from lowest to highest), or on dates (in which case it places the dates in order from earliest to latest).

    If you want, you can add the ASC keyword after the sort field name to force an ORDER BY statement to sort the data in ascending order. However, Databricks sorts data in ascending order out of the box. So, the result would be the same even if you wrote the following:

    SELECT * FROM allsales ORDER BY SalePrice ASC;

    If you test this, you will see the same result that the SQL snippet at the start of this section produced without the ASC keyword. Techies refer to this as the default sort order.

    If you find that reiterating field names in the ORDER BY clause is somewhat laborious, then you can always apply a shortcut. Instead of using a field name, you can use a number to represent it. So, you could write the SQL query at the start of this section as follows:

    SELECT    MakeName, ModelName, SalePrice

    FROM      allsales

    ORDER BY  3;

    The number that you use in the ORDER BY clause stands for the position of the field in the SELECT clause of the SQL statement. So, in this example, 1 represents MakeName (the first field in the SELECT clause), 2 represents ModelName, and 3 represents SalePrice.

    Note: We will use keywords to describe core Databricks terms even when the keyword consists of several individual words.

    Sorting data in reverse order

    As you saw in the previous example, data can be sorted from lowest to highest really quickly and easily. You can also sort data from highest to lowest (or Z to A). Changing the sort order is as simple as replacing the ASC (short for ascending) keyword that you just met with the DESC (short for descending) keyword. You can see this in the output shown in Figure 1.16, which is the result of the following code snippet that sorts the sales in descending order of SalePrice:

    SELECT    MakeName, ModelName, SalePrice

    FROM      allsales

    ORDER BY  SalePrice DESC;

    Figure 1.16: Sorting data in descending order

    How does it work

    Switching the sort order is as easy as adding the DESC keyword to the end of the ORDER BY clause of a SQL command (or replacing the ASC keyword with DESC if you had added it). Using DESC forces SQL to sort the results from highest to lowest (if they are numbers), Z to A (if they are text), or latest to earliest (if they are dates).

    Applying multiple sort criteria

    Larger data sets can require that you sort data according to multiple criteria. Suppose, for instance, that you want to list all the cars sold first by country, then by make (per country), and finally by model (if there are several models of the same make sold in a specific country).

    This is easy to do in SQL because it is a simple extension of the techniques that you have seen in the previous two sections. Take a look at the following code snippet:

    SELECT    CountryName, MakeName, ModelName

    FROM      allsales

    ORDER BY  CountryName, MakeName, ModelName;

    Executing this query will return output similar to that shown in Figure 1.17:

    Figure 1.17: Sorting sales using multiple fields

    How does it work

    Sometimes, you will be faced with a table where some fields contain the same data elements repeated several times. A telephone directory is like this. You may have many pages of people named Smith, although nearly all may have different first names. Even if the last name and the first name are the same, they may have different middle initials. In these cases, you need to sort the data on successive fields so that (to continue the telephone directory analogy) you sort first by last name, then by first name, and finally by middle initial.

    Entering several fields after the ORDER BY keyword tells SQL to sort the data progressively on the fields you entered. In this example, it means the following:

    First, by the country.

    Then, by the make (if there is more than one record with the same country).

    Finally, by the model (if there is more than one record with the same country and make).

    All you have to do is enter the field names separated by a comma in the ORDER BY clause.

    Applying a multiple-sort order has its own specific set of core requirements. These include the following:

    While there may be technical limits to the number of fields you can sort on, in practice, you do not need to worry and can extend the field list that you use in the ORDER BY statement to include as many fields as you want.

    In this example, you used the same fields in the SELECT statement that you used in the ORDER BY statement. This is not compulsory in SQL because there is no obligation in a basic SQL query to display the same fields that you use for ordering the data. However, when you are testing your SQL skills (or ensuring that the data looks like you think it should), it can be a good idea to use the same groups of fields in both clauses. This way, you can see whether the output is what you expect.

    Sorting query results requires a lot of computing horsepower when you are dealing with large tables running on small clusters. We recommend that you sort data only if it is really necessary.

    As was the case for SELECT clauses, you should not add a comma after the final field name in the ORDER BY clause.
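    It is also worth knowing that you can set the sort direction independently for each field in the ORDER BY clause. The following illustrative snippet sorts countries alphabetically and then, within each country, shows the most expensive sales first:

    SELECT    CountryName, MakeName, SalePrice
    FROM      allsales
    ORDER BY  CountryName ASC, SalePrice DESC;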

    Running SQL queries

    So far in this chapter, you have always run queries in the time-honored way by pressing Ctrl + Enter or selecting Run > Run Cell from the notebook menu. Databricks does, however, offer you several ways to run SQL queries. These methods are explained in Table 1.1:

    Table 1.1: Code execution scope

    Displaying the available tables

    All the data in Databricks is stored in files but accessed in SQL using tables. A Databricks database can consist of dozens—or even hundreds—of tables of data that have been carefully designed and created by database professionals. The first thing that you will have to do when faced with any database that is new to you is to take a look at the tables it contains. To do this, enter the following Spark SQL in a blank cell:

    SHOW TABLES;

    You should see something like the list of tables shown in Figure 1.18:

    Figure 1.18: Displaying the tables contained in a database

    As this sample database is small, it contains only a handful of tables. As you progress through this book, you will learn what data each table contains.
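    If a database contains a large number of tables, you can also filter the list using a pattern. For instance, this illustrative command lists only the tables whose names begin with all:

    SHOW TABLES LIKE 'all*';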

    Finding all the views in a database

    One of the aims of this book is to introduce you progressively to the
