Welcome to the third and the last part of our guide to starting Big Data projects on AWS.
In the last post, we took a closer look at Athena, which is a perfect tool for ad-hoc data analysis. However, if you need to continually monitor & analyze your data, Athena can get pricey.
Another available solution is EMR. The trouble with it? It’s quite a sophisticated tool. Any other options?
What is Redshift?
Amazon Redshift is a fast, fully managed, massively parallel, large-scale data warehouse. It allows you to run complex queries against petabytes of structured data using SQL-based clients and business intelligence (BI) tools using standard ODBC and JDBC connections.
Here are some Redshift basics:
- A Redshift cluster consists of two types of nodes. The leader node manages clients’ connections, receives queries, and coordinates parallel execution with compute nodes, while the compute nodes store data and perform queries and computations.
- There is always a leader node in the cluster. It is automatically provisioned, and you’re not charged for it unless you use a single cluster node. In such a case, a single node performs both, the leader and the compute node functions.
- The cluster occupies a single Availability Zone (AZ), which results from the demand for high-speed connectivity between nodes. If an AZ fails, you have to wait until it’s backed up, and your cluster data will be preserved. You can also restore any existing snapshot to a new AZ in the same region.
- Redshift uses a columnar data format, which means that it stores data tables as columns rather than rows. This approach significantly speeds up queries as data is aggregated, compressed and queried in several columns instead of thousands of rows. On the downside, columnar databases are inappropriate for Online Transaction Processing (OLTP). Each column is stored as a single file, so inserting data with the INSERT statement implies writing to all columns.
- Columnar databases such as Redshift are not fit for storing small amounts of data or large binary objects.
Using Redshift is simple. First, you need to launch a cluster or create a specific VPC and subnet group if so required. A security group for a cluster should also be established at this point.
To start the cluster, go to the management console and choose Database, Redshift, and then click [Launch Cluster]. To tune up default parameters for the database, create a new parameter group for the cluster, and then click the [Launch cluster] button.
Now, you have to specify a cluster identifier, database name (optional), database port, username and password.
Once done, you’ll need to choose the node type, cluster type and the number of compute nodes.
For the node type, you can choose between Dense Compute and Dense Storage nodes. The first type is optimized for computing and the second one – for storage. You pay a single price for the computing power and the storage you use.
You may use previous generation instances if you wish, but the current generation offers better performance for the same price.
In the next screen, configure additional parameters and click [Continue] to review your configuration.
Your cluster will launch soon.
Loading cluster data – Amazon S3
Once the cluster is up and running, you can start loading data into it.
The data source can be EMR, DynamoDB, remote hosts and Amazon S3. In this post, we will focus on the last one. You can also use it to load data to Redshift from other AWS services, which cannot do it directly, such as Kinesis.
To load a table from S3, upload data to your S3 bucket, and then use the COPY command. You can specify files to load by using object prefix or the manifest file.
- The syntax to load data by using object prefix is as follows:
copy <table_name> from ‘s3://<bucket_name>/<object_prefix>’ authorization;
- If you want to specify files to be loaded with the manifest file, just add the manifest keyword at the very end:
copy <table_name> from ‘s3://<bucket_name>/<object_prefix>’ authorization manifest;
Taking care of authorization
For authorization, you can use roles or IAM access credentials. Roles are more secure, as you don’t have to store credentials on the host. The following examples show syntax with both, role and IAM credentials:
The manifest file is a simple JSON-formatted text file that explicitly lists files to be loaded.
Speeding up the loading process
To accelerate the loading process, you can split your data into parts for parallel load.
To enable Redshift to identify all parts of a single file, you need to use the same prefix. For example, if you want to split the file ‘example.csv’ into four parts, you should name them as follows:
Place all these files into a single S3 folder, so that you can specify the folder name and load all the files with a single COPY command.
Creating a table
It’s worth mentioning that Amazon doesn’t provide an interface to communicate with Redshift. You’ll need to use an external SQL client compatible with PostgreSQL JDBC or ODBC drivers.
SQLWorkbenchJ is an example of such a tool. It’s free and recommended by Amazon and can be downloaded from http://www.sql-workbench.net/. The JDBC driver for Redshift can be found here: http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html
Remember our last article about Athena? Good. We will use the same data set, with the country population. Let’s create a table for that data.
CREATE TABLE population
country CHAR(30) NOT NULL,
year1980 NUMERIC(12, 5) NOT NULL,
year1981 NUMERIC(12, 5) NOT NULL,
year1982 NUMERIC(12, 5) NOT NULL,
year1983 NUMERIC(12, 5) NOT NULL,
year1984 NUMERIC(12, 5) NOT NULL,
year1985 NUMERIC(12, 5) NOT NULL,
year1986 NUMERIC(12, 5) NOT NULL,
year1987 NUMERIC(12, 5) NOT NULL,
year1988 NUMERIC(12, 5) NOT NULL,
year1989 NUMERIC(12, 5) NOT NULL,
year1990 NUMERIC(12, 5) NOT NULL,
year1991 NUMERIC(12, 5) NOT NULL,
year1992 NUMERIC(12, 5) NOT NULL,
year1993 NUMERIC(12, 5) NOT NULL,
year1994 NUMERIC(12, 5) NOT NULL,
year1995 NUMERIC(12, 5) NOT NULL,
year1996 NUMERIC(12, 5) NOT NULL,
year1997 NUMERIC(12, 5) NOT NULL,
year1998 NUMERIC(12, 5) NOT NULL,
year1999 NUMERIC(12, 5) NOT NULL,
year2000 NUMERIC(12, 5) NOT NULL,
year2001 NUMERIC(12, 5) NOT NULL,
year2002 NUMERIC(12, 5) NOT NULL,
year2003 NUMERIC(12, 5) NOT NULL,
year2004 NUMERIC(12, 5) NOT NULL,
year2005 NUMERIC(12, 5) NOT NULL,
year2006 NUMERIC(12, 5) NOT NULL,
year2007 NUMERIC(12, 5) NOT NULL,
year2008 NUMERIC(12, 5) NOT NULL,
year2009 NUMERIC(12, 5) NOT NULL,
year2010 NUMERIC(12, 5) NOT NULL);
Copying data to Redshift
It’s about time to COPY our data from S3 to Redshift. We will use the previously described command with an addition of a delimiter, in our case a comma. The original table had to be transformed a little before loading (i.e. we replaced NAs with 0s), just for the example’s purposes.
And here is our table:
It is a relatively small table, so using the SELECT * statement shouldn’t create an issue here. In real life, however, avoid selecting * from large tables.
Here we are. We’ve prepared data, uploaded it to S3 and loaded it into Redshift. The first step is accomplished, the project has started. Now it’s your turn to use SQL statements for data analysis.
Starting something new…
And that’s it for now. You’ve reached the end of our short article series. We hope it provided you with the basics of Big Data in AWS and now you’re ready to start your very first Big Data on AWS project.
There are many services in AWS to help you with Big Data. Depending on your needs and experience, you can use more sophisticated and demanding services such as Amazon EMR, perform ad-hoc analysis with Athena, or leverage the amazing data warehouse like Redshift.
It’s a perfect place to start your journey, so choose your services and start playing with Big Data. Good luck!