SQL: A Comprehensive Guide to Database Concepts for Aspiring Data Engineers

Gaurav Kumar
Apr 5, 2024

This is part of the Data Engineering Roadmap.

1. Foundational Knowledge — SQL.

Introduction:

In the vast landscape of data engineering, a fundamental understanding of databases is indispensable. SQL (Structured Query Language) stands as the cornerstone, bridging the gap between raw data and actionable insights. In this article, we embark on a journey to unravel the intricacies of SQL, exploring its nuances, applications, and significance in the realm of data engineering.

The SQL Saga:

1. The Language of Databases:

SQL, often pronounced as “sequel,” is a domain-specific language used for managing and manipulating relational databases. It provides a standardized way to interact with databases, enabling users to query, insert, update, and delete data. The simplicity and versatility of SQL make it a powerful tool for data engineers and analysts alike.

2. Relational Databases:

The foundation of SQL lies in relational databases, where data is organized into tables with predefined relationships. Each table consists of rows and columns, and SQL queries enable the retrieval of specific information from these tables. Understanding the anatomy of relational databases is pivotal for anyone entering the data engineering arena.
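To make the idea of tables, rows, columns, and relationships concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table and column names mirror the examples used later in this article; the sample rows and the exact column types are assumptions made purely for illustration.

```python
import sqlite3

# In-memory database: nothing is written to disk.
# Table/column names follow the article; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (
    DepartmentID   INTEGER PRIMARY KEY,
    DepartmentName TEXT NOT NULL
);
CREATE TABLE Employees (
    EmployeeID   INTEGER PRIMARY KEY,
    Name         TEXT NOT NULL,
    -- The relationship: each employee points at one department.
    DepartmentID INTEGER REFERENCES Departments(DepartmentID),
    Salary       INTEGER
);
INSERT INTO Departments VALUES (1, 'Data Engineering'), (2, 'Finance');
INSERT INTO Employees VALUES
    (101, 'John Doe', 1, 85000),
    (102, 'Jane Roe', 2, 70000);
""")

# Column names come from the cursor description; each row is a tuple.
cursor = conn.execute("SELECT * FROM Employees ORDER BY EmployeeID")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

Each tuple in rows is one row of the table, and each position in the tuple corresponds to one column.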

3. Basic SQL Commands:

SELECT * FROM Employees;

The SELECT statement is the bread and butter of SQL. It retrieves data from one or more tables based on specified criteria. The asterisk (*) represents all columns, providing a holistic view of the dataset.

WHERE Clause: Filtering the Noise

Adding a WHERE clause allows for precise data retrieval by specifying conditions. For instance:

SELECT * FROM Employees WHERE DepartmentName = 'Data Engineering';

This query fetches all employees working in the Data Engineering department.
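The two queries above can be tried out with a small sketch like the one below, which runs them through Python's sqlite3 module. The three sample employees are hypothetical, and the table is reduced to the columns these queries need. The filter value is passed as a parameter (the ? placeholder) rather than pasted into the SQL string, which is the usual habit when the value comes from user input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, DepartmentName TEXT)"
)
# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (101, "John Doe", "Data Engineering"),
        (102, "Jane Roe", "Finance"),
        (103, "Sam Lee", "Data Engineering"),
    ],
)

# SELECT * returns every column of every row.
all_rows = conn.execute("SELECT * FROM Employees").fetchall()

# WHERE keeps only the rows that satisfy the condition.
filtered = conn.execute(
    "SELECT Name FROM Employees WHERE DepartmentName = ? ORDER BY EmployeeID",
    ("Data Engineering",),
).fetchall()

print(len(all_rows))
print(filtered)
```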

4. Data Modification Commands:

Beyond querying, SQL empowers users to modify data in databases.

INSERT INTO Employees (EmployeeID, Name, DepartmentID, Salary, Position, DepartmentName)
VALUES (105, 'Gaurav Kumar', 4, 20000000, 1, 'Quant Researcher');

The INSERT INTO statement adds new records to a table. Here, a new employee named Gaurav Kumar is introduced into the system.

UPDATE Employees
SET Salary = 90000 WHERE EmployeeID = 101;

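The UPDATE statement modifies existing records. In this example, the salary of the employee with EmployeeID 101 (say, John Doe) is updated to $90,000. Without the WHERE clause, every row in the table would be updated.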

DELETE FROM Employees WHERE EmployeeID = 101;

The DELETE statement removes records based on specified conditions. In this case, the employee with EmployeeID 101 is no longer part of the employee roster. As with UPDATE, omitting the WHERE clause would delete every row.
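The following sketch walks through INSERT, UPDATE, and DELETE in sequence using Python's sqlite3 module, checking how many rows each statement touched. The table is cut down to three columns and the two employees are made-up sample data; only the statements themselves come from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, Salary INTEGER)"
)

# INSERT: add two hypothetical employees.
conn.execute("INSERT INTO Employees VALUES (101, 'John Doe', 85000)")
conn.execute("INSERT INTO Employees VALUES (102, 'Jane Roe', 70000)")

# UPDATE: rowcount reports how many rows matched the WHERE clause.
updated = conn.execute(
    "UPDATE Employees SET Salary = 90000 WHERE EmployeeID = 101"
).rowcount
salary_after_update = conn.execute(
    "SELECT Salary FROM Employees WHERE EmployeeID = 101"
).fetchone()[0]

# DELETE: remove the same employee, then see who is left.
deleted = conn.execute("DELETE FROM Employees WHERE EmployeeID = 101").rowcount
remaining = conn.execute("SELECT EmployeeID FROM Employees").fetchall()

conn.commit()  # make the changes permanent (a no-op for data already committed)
print(updated, salary_after_update, deleted, remaining)
```

Checking rowcount after an UPDATE or DELETE is a cheap way to confirm that the WHERE clause matched exactly the rows you expected.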

5. Joins: Unleashing the Power of Relationships

In the complex landscape of data, information often resides in multiple tables. SQL’s JOIN operations facilitate the merging of data from different tables, offering a comprehensive view.

SELECT Employees.EmployeeID, Employees.Name, Departments.DepartmentName
FROM Employees
INNER JOIN Departments ON Employees.DepartmentID = Departments.DepartmentID;

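This query retrieves employee IDs and names along with their respective department names by joining the Employees and Departments tables on DepartmentID. Because it is an INNER JOIN, only employees whose DepartmentID matches a row in Departments appear in the result.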
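Here is a small sketch of the join above, run through Python's sqlite3 module on hypothetical sample data. To show what INNER JOIN leaves out, one employee has no department, and the same query is repeated with LEFT JOIN, a variant not covered in this article that keeps unmatched rows from the left table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (DepartmentID INTEGER PRIMARY KEY, DepartmentName TEXT);
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, DepartmentID INTEGER);
-- Invented sample rows; employee 103 has no department.
INSERT INTO Departments VALUES (1, 'Data Engineering'), (2, 'Finance');
INSERT INTO Employees VALUES
    (101, 'John Doe', 1),
    (102, 'Jane Roe', 2),
    (103, 'Sam Lee', NULL);
""")

query = """
SELECT Employees.EmployeeID, Employees.Name, Departments.DepartmentName
FROM Employees
{join} JOIN Departments ON Employees.DepartmentID = Departments.DepartmentID
ORDER BY Employees.EmployeeID
"""

# INNER JOIN: only employees with a matching department.
inner = conn.execute(query.format(join="INNER")).fetchall()
# LEFT JOIN: every employee, with NULL (None) where no department matches.
left = conn.execute(query.format(join="LEFT")).fetchall()

print(inner)
print(left)
```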

6. Aggregate Functions: Beyond Individual Records

SQL isn’t just about fetching individual records; it’s also about deriving meaningful insights from data at scale. Aggregate functions play a crucial role in summarizing data and extracting valuable metrics.

  • SUM(): This function calculates the sum of values in a column. For example:
SELECT SUM(Salary) AS TotalSalary FROM Employees;

Here, TotalSalary represents the sum of salaries across all employees.

  • AVG(): AVG() computes the average value of a numeric column. For instance:
SELECT AVG(Salary) AS AverageSalary FROM Employees;

The AverageSalary gives the average salary across all employees.

  • COUNT(): COUNT(*) tallies the number of rows in a result set, while COUNT(column) counts only the non-null values in that column. Example:
SELECT COUNT(EmployeeID) AS TotalEmployees FROM Employees;

This yields the total number of employees in the database.

  • MAX() and MIN(): These functions return the maximum and minimum values, respectively, from a column. For instance:
SELECT MIN(Salary) AS MinSalary FROM Employees;
SELECT MAX(Salary) AS MaxSalary FROM Employees;

MinSalary and MaxSalary provide the lowest and highest salaries among employees.

  • GROUP BY: When combined with aggregate functions, GROUP BY allows data to be grouped based on specific criteria. For example:
SELECT DepartmentID, AVG(Salary) AS AverageSalary
FROM Employees
GROUP BY DepartmentID;

This query calculates the average salary for each department.
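The aggregate functions above can be checked against a tiny dataset, as in this sketch using Python's sqlite3 module. The four sample rows are assumptions for illustration, and one salary is deliberately NULL to show that aggregates such as AVG() and COUNT(Salary) skip null values while COUNT(*) does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, DepartmentID INTEGER, Salary INTEGER)"
)
# Hypothetical rows; employee 104 has an unknown (NULL) salary.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(101, 1, 80000), (102, 1, 100000), (103, 2, 60000), (104, 2, None)],
)

# Whole-table aggregates in one pass.
summary = conn.execute("""
    SELECT SUM(Salary), AVG(Salary), COUNT(*), COUNT(Salary), MIN(Salary), MAX(Salary)
    FROM Employees
""").fetchone()

# GROUP BY produces one output row per department.
per_department = conn.execute("""
    SELECT DepartmentID, AVG(Salary) AS AverageSalary
    FROM Employees
    GROUP BY DepartmentID
    ORDER BY DepartmentID
""").fetchall()

print(summary)
print(per_department)
```

Note that the average is taken over the three known salaries, not over all four rows.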

7. Subqueries: Unraveling Nested Data Mysteries

Subqueries, also known as nested queries, enable the execution of queries within queries. They play a vital role in complex data analysis and manipulation.

  • Scalar Subqueries: These return a single value and can be used anywhere an expression is allowed. For example:
SELECT Name, Salary,
(SELECT AVG(Salary) FROM Employees) AS AverageSalary
FROM Employees;

Here, AverageSalary represents the average salary across all employees.
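As a quick sketch, the scalar subquery above can be run with Python's sqlite3 module on three made-up employees. The second query is an extra illustration, not taken from the article: it uses the same subquery inside a WHERE clause to find employees who earn more than the average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, Salary INTEGER)"
)
# Hypothetical sample rows; the average salary is 80000.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(101, "John Doe", 60000), (102, "Jane Roe", 80000), (103, "Sam Lee", 100000)],
)

# Scalar subquery in the SELECT list: the same single value on every row.
rows = conn.execute("""
    SELECT Name, Salary,
           (SELECT AVG(Salary) FROM Employees) AS AverageSalary
    FROM Employees
    ORDER BY EmployeeID
""").fetchall()

# Scalar subquery in a WHERE clause: compare each row against the average.
above_average = conn.execute("""
    SELECT Name FROM Employees
    WHERE Salary > (SELECT AVG(Salary) FROM Employees)
""").fetchall()

print(rows)
print(above_average)
```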

Common Table Expression (CTE):

Common Table Expressions (CTEs) are a powerful feature in SQL that allow for the creation of temporary result sets, which can be referenced within a query. They provide a way to break down complex queries into more manageable, modular parts, enhancing readability and maintainability. This section covers what CTEs are, how they work, and examples of their usage.

What is a Common Table Expression (CTE)?

A Common Table Expression, as the name suggests, is a temporary named result set that can be referenced within the scope of a SELECT, INSERT, UPDATE, or DELETE statement. CTEs are defined using the WITH keyword followed by a name and a query that produces the result set.

Syntax of a CTE:

WITH cte_name (column1, column2, ...) AS (
-- CTE query
SELECT column1, column2, ...
FROM table_name
WHERE conditions
)
-- Main query using the CTE
SELECT *
FROM cte_name;

Benefits of Using CTEs:

  1. Improved Readability: CTEs allow you to break down complex queries into smaller, more understandable parts. Each CTE can represent a logical step in the overall query, making it easier to comprehend and maintain.
  2. Code Reusability: Once defined, a CTE can be referenced multiple times within the same query, avoiding the need to repeat complex subqueries.
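  3. Possible Performance Gains: Some database engines can optimize a CTE well or compute it once and reuse the result, but this varies by engine and version, and a CTE can occasionally be slower than the equivalent subquery. Treat CTEs mainly as a readability tool and check the execution plan when performance matters.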
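The reusability point can be illustrated with a short sketch in which one CTE is referenced twice in the same query: once as the main source of rows and once inside a subquery. It runs on Python's sqlite3 module; the CTE name dept_avg and the sample rows are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, DepartmentID INTEGER, Salary INTEGER)"
)
# Invented rows: department averages are 90000, 60000 and 72000.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (101, 1, 80000), (102, 1, 100000),
        (103, 2, 60000),
        (104, 3, 70000), (105, 3, 74000),
    ],
)

# dept_avg is defined once and used twice: in FROM and in the subquery.
above_typical = conn.execute("""
    WITH dept_avg AS (
        SELECT DepartmentID, AVG(Salary) AS AverageSalary
        FROM Employees
        GROUP BY DepartmentID
    )
    SELECT DepartmentID, AverageSalary
    FROM dept_avg
    WHERE AverageSalary > (SELECT AVG(AverageSalary) FROM dept_avg)
    ORDER BY DepartmentID
""").fetchall()

print(above_typical)
```

Without the CTE, the GROUP BY query would have to be written out twice as two separate subqueries.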

Example Usage:

Let’s consider a scenario where we have an Employees table containing employee information such as EmployeeID, Name, DepartmentID, and Salary, and a Departments table containing DepartmentID and DepartmentName. We want to find the average salary for each department, shown alongside the department name.

WITH department_avg_salary AS (
SELECT DepartmentID, AVG(Salary) AS AverageSalary
FROM Employees
GROUP BY DepartmentID
)
SELECT d.DepartmentID, d.DepartmentName, s.AverageSalary
FROM Departments d
JOIN department_avg_salary s ON d.DepartmentID = s.DepartmentID;

In this example, we first calculate the average salary for each department using a CTE named department_avg_salary. Then, we join this CTE with the Departments table to retrieve the department name along with its average salary.
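To see the result this query produces, here is a minimal sketch that runs it through Python's sqlite3 module. The sample departments and salaries are assumptions for illustration, and an ORDER BY has been added to the article's query so that the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (DepartmentID INTEGER PRIMARY KEY, DepartmentName TEXT);
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, DepartmentID INTEGER, Salary INTEGER);
-- Hypothetical sample data.
INSERT INTO Departments VALUES (1, 'Data Engineering'), (2, 'Finance');
INSERT INTO Employees VALUES (101, 1, 80000), (102, 1, 100000), (103, 2, 60000);
""")

result = conn.execute("""
    WITH department_avg_salary AS (
        -- Step 1: one row per department with its average salary.
        SELECT DepartmentID, AVG(Salary) AS AverageSalary
        FROM Employees
        GROUP BY DepartmentID
    )
    -- Step 2: attach the department name to each average.
    SELECT d.DepartmentID, d.DepartmentName, s.AverageSalary
    FROM Departments d
    JOIN department_avg_salary s ON d.DepartmentID = s.DepartmentID
    ORDER BY d.DepartmentID
""").fetchall()

print(result)
```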

Examples of RDBMS (Relational Database Management Systems) Software:

  1. MySQL: MySQL is one of the most widely used open-source relational database management systems. It is known for its reliability, ease of use, and strong community support. MySQL is commonly used in web applications and is compatible with various operating systems.
  2. PostgreSQL: PostgreSQL is another open-source RDBMS known for its advanced features, extensibility, and standards compliance. It offers support for complex queries, indexing, and data integrity constraints. PostgreSQL is favored for its reliability and scalability, making it suitable for enterprise-level applications.
  3. Oracle Database: Oracle Database is a commercial RDBMS developed by Oracle Corporation. It is known for its robustness, scalability, and comprehensive feature set. Oracle Database is widely used in large enterprises for mission-critical applications, data warehousing, and business intelligence.
  4. Microsoft SQL Server: Microsoft SQL Server is a commercial RDBMS developed by Microsoft. It is popular for its integration with other Microsoft products, ease of use, and extensive set of features including business intelligence, data analysis, and reporting services. SQL Server is commonly used in Windows-based environments.
  5. SQLite: SQLite is a lightweight, embedded RDBMS that requires minimal configuration and administration. It is self-contained, serverless, and supports ACID transactions. SQLite is often used in applications that require a local database with low resource requirements, such as mobile apps and desktop software.
  6. IBM Db2: IBM Db2 is a family of data management products, including both relational and non-relational database solutions. It offers features such as data encryption, high availability, and integration with other IBM products. Db2 is commonly used in enterprise environments for transaction processing and analytics.
  7. MariaDB: MariaDB is an open-source RDBMS forked from MySQL, designed to be compatible with MySQL while offering additional features and performance improvements. It is known for its speed, scalability, and strong community support. MariaDB is often used as a drop-in replacement for MySQL in various applications.
  8. Amazon Aurora: Amazon Aurora is a cloud-native relational database service provided by Amazon Web Services (AWS). It is compatible with MySQL and PostgreSQL, offering high performance, scalability, and availability. Aurora is designed for cloud environments and is favored by organizations leveraging AWS infrastructure.

These are just a few examples of the many RDBMS software options available, each with its own strengths, features, and suitability for different use cases. When choosing an RDBMS, it’s essential to consider factors such as scalability, reliability, performance, licensing costs, and compatibility with existing systems and applications.

Conclusion:

SQL serves as the backbone of modern data management systems, empowering organizations to extract insights, make informed decisions, and drive business growth. By mastering SQL and its diverse functionalities, aspiring data engineers can navigate the complex landscape of data with confidence and precision. As technology continues to evolve, SQL remains a timeless skill, indispensable for anyone embarking on a career in data engineering. Happy Learning!!!
