
Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 2


In this second post of a multi-part series, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns. You can check out the first post of this series for guidance on planning, running, and validating a large-scale data warehouse migration from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT).

Choose your optimal Amazon Redshift cluster

Amazon Redshift has two types of clusters: provisioned and serverless. For a provisioned cluster, you need to set it up with the required compute resources. Amazon Redshift Serverless can run high-performance analytics in the cloud at any scale. For more information, refer to Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure.

An Amazon Redshift cluster consists of nodes. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query run plans. The leader node then coordinates the parallel run of these plans with the compute nodes and aggregates the intermediate results from those nodes. It then returns the results to the client applications.

When determining your type of cluster, consider the following:

  • Estimate the size of the input data compressed, vCPU, and performance. As of this writing, we recommend the Amazon Redshift RA3 instance with managed storage, which scales compute and storage independently for fast query performance.
  • Amazon Redshift provides an automated "Help me choose" cluster recommendation based on the size of your data.
  • A main advantage of a cloud Amazon Redshift data warehouse is that you're not stuck with hardware and commodities like old guard data warehouses. For faster innovation, you have the option to try different cluster options and choose the optimized one in terms of performance and cost.
  • At the time of development or pilot, you can usually start with a smaller number of nodes. As you move to production, you can adjust the number of nodes based on your usage pattern. When right-sizing your clusters, we recommend choosing the reserved instance type to cut down the cost even further. The public-facing utility Simple Replay can help you determine performance against different cluster types and sizes by replaying the customer workload. For provisioned clusters, if you're planning to use the recommended RA3 instance, you can compare different node types to determine the right instance type.
  • Based on your workload pattern, Amazon Redshift supports resize, pause and stop, and concurrency scaling of the cluster. Amazon Redshift workload management (WLM) enables effective and flexible management of memory and query concurrency.

Create data extraction tasks with AWS SCT

With AWS SCT extraction agents, you can migrate your source tables in parallel. These extraction agents authenticate using a valid user on the data source, allowing you to control the resources available for that user during the extraction. AWS SCT agents process the data locally and upload it to Amazon Simple Storage Service (Amazon S3) through the network (via AWS Direct Connect). We recommend having consistent network bandwidth between the Greenplum machine where the AWS SCT agent is installed and your AWS Region.

If you have tables around 20 million rows or 1 TB in size, you can use the virtual partitioning feature in AWS SCT to extract data from those tables. This creates several sub-tasks and parallelizes the data extraction process for the table. Therefore, we recommend creating two groups of tasks for each schema that you migrate: one for small tables and one for large tables using virtual partitions.

For more information, refer to Creating, running, and monitoring an AWS SCT data extraction task.

Data architecture

To simplify and modernize your data architecture, consider the following:

  • Establish accountability and authority to enforce enterprise data standards and policies.
  • Formalize the data and analytics operating model between enterprise and business units and functions.
  • Simplify the data technology ecosystem through rationalization and modernization of data assets and tools or technology.
  • Develop organizational constructs that facilitate more robust integration of the business and delivery teams, and build data-oriented products and solutions to address the business problems and opportunities throughout the lifecycle.
  • Back up the data periodically so that if something goes wrong, you have the ability to replay.
  • During planning, design, execution, and throughout implementation and maintenance, ensure data quality management is in place to achieve the desired outcome.
  • Simple is the key to an easy, fast, intuitive, and low-cost solution. Simple scales much better than complex. Simple makes it possible to think big (Invent and Simplify is another Amazon leadership principle). Simplify the legacy process by migrating only the required data used in tables and schemas. For example, if you're performing truncate and load for incremental data, identify a watermark and only process incremental data (see the sketch after this list).
  • You may have use cases that require record-level inserts, updates, and deletes for privacy regulations and simplified pipelines; simplified file management and near-real-time data access; or simplified change data capture (CDC) data pipeline development. We recommend using purposeful tools based on your use case. AWS offers the options to use Apache HUDI with Amazon EMR and AWS Glue.
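The following is a minimal sketch of a watermark-based incremental load in Amazon Redshift. The analytics.orders target table, the staging table, and the order_id and last_updated_ts columns are assumptions for illustration; adapt the pattern to your own schema and loading mechanism.

begin;

create temp table stage_orders (like analytics.orders);

-- Populate stage_orders with only the rows newer than the current high-water mark,
-- for example with an insert ... select filtered on last_updated_ts greater than
-- (select max(last_updated_ts) from analytics.orders), or a COPY of a pre-filtered extract.

-- Merge the increment: delete matching keys, then insert the new versions.
delete from analytics.orders
using stage_orders
where analytics.orders.order_id = stage_orders.order_id;

insert into analytics.orders
select * from stage_orders;

commit;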

Migrate stored procedures

In this section, we share best practices for stored procedure migration from Greenplum to Amazon Redshift. Data processing pipelines with complex business logic often use stored procedures to perform the data transformation. We advise using big data processing services like AWS Glue or Amazon EMR to modernize your extract, transform, and load (ETL) jobs. For more information, check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift. For a time-sensitive migration to cloud-native data warehouses like Amazon Redshift, redesigning and developing the entire pipeline in a cloud-native ETL tool might be time-consuming. Therefore, migrating the stored procedures from Greenplum to Amazon Redshift stored procedures can be the right choice.

For a successful migration, make sure to follow Amazon Redshift stored procedure best practices (a short sketch illustrating several of them follows this list):

  • Specify the schema name while creating a stored procedure. This helps facilitate schema-level security and you can enforce grants or revoke access control.
  • To prevent naming conflicts, we recommend naming procedures using the prefix sp_. Amazon Redshift reserves the sp_ prefix exclusively for stored procedures. By prefixing your procedure names with sp_, you ensure that your procedure name won't conflict with any existing or future Amazon Redshift procedure names.
  • Qualify your database objects with the schema name in the stored procedure.
  • Follow the minimal required access rule and revoke unwanted access. For a similar implementation, make sure the stored procedure run permission is not open to ALL.
  • The SECURITY attribute controls a procedure's privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default. For more information, refer to Security and privileges for stored procedures.
  • Managing transactions when it comes to stored procedures is important. For more information, refer to Managing transactions.
  • TRUNCATE issues a commit implicitly inside a stored procedure. It interferes with the transaction block by committing the current transaction and creating a new one. Exercise caution while using TRUNCATE to ensure it never breaks the atomicity of the transaction. This also applies to COMMIT and ROLLBACK.
  • Adhere to cursor constraints and understand the performance considerations of using cursors. You can use set-based SQL logic and temporary tables while processing large datasets.
  • Avoid hardcoding in stored procedures. Use dynamic SQL to construct SQL queries dynamically at runtime. Ensure appropriate logging and error handling of the dynamic SQL.
  • For exception handling, you can write RAISE statements as part of the stored procedure code. For example, you can raise an exception with a custom message or insert a record into a logging table. For unhandled exceptions like WHEN OTHERS, use built-in functions like SQLERRM or SQLSTATE to pass them on to the calling application or program. As of this writing, Amazon Redshift limits calling a stored procedure from the exception block.
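The following is a minimal sketch that illustrates several of these practices together; the etl schema, the table names, and the procedure name are assumptions for illustration only.

create or replace procedure etl.sp_load_daily_sales(p_sale_date date)
as $$
begin
    -- Qualify objects with the schema name and avoid hardcoded literals.
    delete from etl.daily_sales where sale_date = p_sale_date;

    insert into etl.daily_sales (sale_date, item, cost)
    select sale_date, item, cost
    from etl.stage_sales
    where sale_date = p_sale_date;

    -- Note: a TRUNCATE here would implicitly commit the current transaction.
exception
    when others then
        raise exception 'sp_load_daily_sales failed: %', sqlerrm;
end;
$$ language plpgsql
security definer;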

Sequences

You can use IDENTITY columns, system timestamps, or epoch time as an option to ensure uniqueness. The IDENTITY column or a timestamp-based solution might produce sparse values, so if you need a continuous number sequence, you need to use dedicated number tables. You can also use the RANK() or ROW_NUMBER() window function over the entire set. Alternatively, get the high-water mark from the existing ID column of the table and increment the values while inserting records.
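The following is a minimal sketch of the IDENTITY and high-water mark options; the sales.orders, sales.orders_hist, and sales.orders_stage tables and their columns are assumptions.

-- Option 1: let Amazon Redshift generate surrogate keys with an IDENTITY column
-- (generated values can be sparse and aren't guaranteed to be consecutive).
create table sales.orders (
    order_key bigint identity(1,1),
    order_ref varchar(30),
    order_ts timestamp
);

-- Option 2: continue from the current high-water mark using row_number().
insert into sales.orders_hist (order_key, order_ref, order_ts)
select (select coalesce(max(order_key), 0) from sales.orders_hist)
    + row_number() over (order by order_ref),
    order_ref,
    order_ts
from sales.orders_stage;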

Character data type length

Greenplum char and varchar data type lengths are specified in terms of characters, including multi-byte ones. Amazon Redshift character types are defined in terms of bytes. For table columns using multi-byte character sets in Greenplum, the converted table column in Amazon Redshift should allocate enough storage for the actual byte size of the source data.

An easy workaround is to set the Amazon Redshift character column length to four times the corresponding Greenplum column length.

A best practice is to use the smallest possible column size. Amazon Redshift doesn't allocate storage space according to the length of the attribute; it allocates storage according to the real length of the stored string. However, at runtime, while processing queries, Amazon Redshift allocates memory according to the length of the attribute. Therefore, not setting a default size four times greater helps from a performance perspective.

An efficient solution is to analyze production datasets and determine the maximum byte size of the Greenplum character columns. Add a 20% buffer to support future incremental growth of the table.

To arrive at the actual byte size of an existing column, run the Greenplum data structure character utility from the AWS Samples GitHub repo.
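Alternatively, a quick manual check in Greenplum can approximate the same result. The following sketch assumes a hypothetical retail.product_dim table and uses octet_length() (bytes) versus char_length() (characters):

select
    max(octet_length(product_description)) as max_bytes,
    max(char_length(product_description)) as max_chars
from
    retail.product_dim;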

Numeric precision and scale

The Amazon Redshift numeric data type can store up to a maximum precision of 38, whereas in a Greenplum database, you can define a numeric column without any defined length.

Analyze your production datasets and determine numeric overflow candidates using the Greenplum data structure numeric utility from the AWS Samples GitHub repo. For numeric data, you have options to tackle this based on your use case. For numbers with a decimal part, you have the option to round the data based on the data type without any data loss in the whole number part. For future reference, you can keep a copy of the column in VARCHAR or store it in an S3 data lake. If you see an extremely small percentage of outlier overflow data, clean up the source data for a quality data migration.
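As a supplementary manual check, the following sketch lists Greenplum numeric columns declared without precision (the schema name is an assumption); these are candidates to review against Amazon Redshift's maximum precision of 38:

select
    table_schema,
    table_name,
    column_name
from
    information_schema.columns
where
    data_type = 'numeric'
    and numeric_precision is null
    and table_schema = 'retail';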

SQL queries and functions

While converting SQL scripts or stored procedures to Amazon Redshift, you may encounter unsupported functions, database objects, or code blocks for which you have to rewrite the query, create user-defined functions (UDFs), or redesign. The sections that follow show examples of alternate query statements and ways to achieve specific aggregations that might be required during a code rewrite. You can create a custom scalar UDF using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available to any user with sufficient privileges to run it. You run a custom scalar UDF in much the same way as you run existing Amazon Redshift functions to match functionality of legacy databases.
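For example, the following is a minimal SQL scalar UDF sketch; the function name and logic are illustrative assumptions standing in for a legacy helper function, and a Python UDF can be created the same way for richer logic.

create or replace function f_safe_divide (float, float)
returns float
stable
as $$
    select case when $2 = 0 or $2 is null then null else $1 / $2 end
$$ language sql;

-- Example usage: select f_safe_divide(cost, 0) returns null instead of a divide-by-zero error.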

AGE

The Greenplum function AGE() returns an interval subtracting from the current date. You can accomplish the same using a subset of the MONTHS_BETWEEN(), ADD_MONTHS(), DATEDIFF(), and TRUNC() functions based on your use case.

The following example Amazon Redshift query calculates the gap between the dates 2001-04-10 and 1957-06-13 in terms of years, months, and days. You can apply this to any date column in a table.

select
    trunc(trunc(months_between('2001-04-10'::date, '1957-06-13'::date)) / 12) years,
    mod(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4, 12) months,
    '2001-04-10'::date - add_months('1957-06-13'::date,
    trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4) days;

COUNT

If you have a use case to get a distinct aggregation in the COUNT() window function, you can accomplish the same using a combination of the DENSE_RANK() and MAX() window functions.

The following example Amazon Redshift query calculates the distinct item count for a given date of sale:

select
    sale_date,
    item,
    cost,
    max(densernk) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as distinct_itemcount
from
    (
    select
        *, dense_rank() over (partition by sale_date order by item) as densernk
    from
        testaggr)
order by
    sale_date,
    item,
    cost;

ORDER BY

Amazon Redshift aggregate window functions with an ORDER BY clause require a mandatory frame.

The following example Amazon Redshift query creates a cumulative sum of cost by sale date and orders the results by item within the partition:

select
    *,
    sum(cost) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as total_cost_by_date
from
    testaggr
order by
    sale_date,
    item,
    cost;

STRING_AGG

In Greenplum, STRING_AGG() is an aggregate function used to concatenate a list of strings. In Amazon Redshift, use the LISTAGG() function.

The following example Amazon Redshift query returns a semicolon-separated list of email addresses for each department:

select
    dept,
    listagg(email_address, ';') within group (order by dept) as email_list
from
    employee_contact
group by
    dept
order by
    dept;

ARRAY_AGG

In Greenplum, ARRAY_AGG() is an aggregate function that takes a set of values as input and returns an array. In Amazon Redshift, use a combination of the LISTAGG() and SPLIT_TO_ARRAY() functions. The SPLIT_TO_ARRAY() function returns a SUPER datatype.

The following example Amazon Redshift query returns an array of email addresses for each department:

select
    dept,
    SPLIT_TO_ARRAY(email_list, ';') email_array
from
    (
    select
        dept,
        listagg(email_address, ';') within group (order by dept) as email_list
    from
        employee_contact
    group by
        dept
    order by
        dept);

To retrieve array elements from a SUPER expression, you can use the SUBARRAY() function:

select
    SUBARRAY(email_array, 0, 1) first_element,
    SUBARRAY(email_array, 1, 1) second_element,
    SUBARRAY(email_array, 0) all_element
from
    testarray
where
    dept = 'HR';

UNNEST

In Greenplum, you can use the UNNEST function to split an array and convert the array elements into a set of rows. In Amazon Redshift, you can use PartiQL syntax to iterate over SUPER arrays. For more information, refer to Querying semistructured data.

create temp table unnesttest as
select
    json_parse('{"scalar_array": [0,1,2,3,4,5.5,6,7.9,8,9]}') as data;

select
    element
from
    unnesttest as un,
    un.data.scalar_array as element at index;

WHERE

You can't use a window function in the WHERE clause of a query in Amazon Redshift. Instead, construct the query using the WITH clause and then refer to the calculated column in the WHERE clause.

The following example Amazon Redshift query returns the sale date, item, and cost from a table for the sale dates where the total sale is greater than 100:

with aggrcost as (
select
    sale_date,
    item,
    cost,
    sum(cost) over (partition by sale_date) as total_sale
from
    testaggr )
select
    *
from
    aggrcost
where
    total_sale > 100;

Refer to the following table for additional Greenplum date/time functions along with their Amazon Redshift equivalents to accelerate your code migration.

| # | Description | Greenplum | Amazon Redshift |
| 1 | The now() function returns the start time of the current transaction | now() | sysdate |
| 2 | clock_timestamp() returns the start timestamp of the current statement within a transaction block | clock_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 3 | transaction_timestamp() returns the start timestamp of the current transaction | transaction_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 4 | Interval – adds x years and y months to date_time_column and returns a timestamp type | date_time_column + interval 'x years y months' | add_months(date_time_column, x*12 + y) |
| 5 | Get the total number of seconds between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 * 60 + date_part('hours', end_ts - start_ts) * 60 * 60 + date_part('minutes', end_ts - start_ts) * 60 + date_part('seconds', end_ts - start_ts) | datediff('seconds', start_ts, end_ts) |
| 6 | Get the total number of minutes between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 + date_part('hours', end_ts - start_ts) * 60 + date_part('minutes', end_ts - start_ts) | datediff('minutes', start_ts, end_ts) |
| 7 | Extract a date part literal from the difference of two timestamp fields | date_part('hour', end_ts - start_ts) | extract(hour from (date_time_column_2 - date_time_column_1)) |
| 8 | Return the ISO day of the week | date_part('isodow', date_time_column) | TO_CHAR(date_time_column, 'ID') |
| 9 | Return the ISO year from a date/time field | extract(isoyear from date_time_column) | TO_CHAR(date_time_column, 'IYYY') |
| 10 | Convert epoch seconds to the equivalent datetime | to_timestamp(epoch_seconds) | TIMESTAMP 'epoch' + number_of_seconds * interval '1 second' |
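The following sketch applies a few of the Amazon Redshift equivalents from the table to literal values so you can verify them quickly; substitute your own timestamp columns and epoch values.

select
    datediff('seconds', '2022-01-01 00:00:00'::timestamp, '2022-01-01 01:30:45'::timestamp) as total_seconds,
    to_char('2022-01-03'::date, 'ID') as iso_day_of_week,
    timestamp 'epoch' + 1660608000 * interval '1 second' as epoch_to_timestamp;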

Amazon Redshift utility for troubleshooting or running diagnostics for the cluster

The Amazon Redshift Utilities GitHub repo contains a set of utilities to accelerate troubleshooting or analysis on Amazon Redshift. These utilities consist of queries, views, and scripts. They are not deployed by default onto Amazon Redshift clusters. The best practice is to deploy the needed views into the admin schema.

Conclusion

In this post, we covered prescriptive guidance around data types, functions, and stored procedures to accelerate the migration process from Greenplum to Amazon Redshift. Although this post describes modernizing and moving to a cloud data warehouse, you should augment this transformation process towards a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purposeful data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and others based on your use case.


About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations focusing on big data, data lakes, and AI/ML. Suresh holds an MBA degree from Duke University's Fuqua School of Business and an MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at AWS Professional Services. He collaborates with customers and partners to architect and implement modern data architectures using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.
