2021/02/25 - AWS Glue DataBrew - 4 updated api methods
Changes: This SDK release adds two new dataset features: 1) support for specifying the file format for a dataset, and 2) support for specifying whether the first row of a CSV or Excel file contains a header.
{'Format': 'CSV | JSON | PARQUET | EXCEL', 'FormatOptions': {'Csv': {'HeaderRow': 'boolean'}, 'Excel': {'HeaderRow': 'boolean'}}}
Creates a new DataBrew dataset.
See also: AWS API Documentation
Request Syntax
client.create_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    },
    Tags={
        'string': 'string'
    }
)
Name (string) -- [REQUIRED]
The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.
Format (string) --
Specifies the file format of a dataset created from an S3 file or folder.
FormatOptions (dict) --
Options that define the structure of CSV, Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded newline characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
Specifies one or more named sheets in the Excel file, which will be included in the dataset.
(string) --
SheetIndexes (list) --
Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Input (dict) -- [REQUIRED]
Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) -- [REQUIRED]
The name of a database in the Data Catalog.
TableName (string) -- [REQUIRED]
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
Tags (dict) --
Metadata tags to apply to this dataset.
(string) --
(string) --
Return type: dict
Response Syntax
{ 'Name': 'string' }
Response Structure
(dict) --
Name (string) --
The name of the dataset that you created.
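As an illustration of the two new dataset features, the following sketch assembles `create_dataset` arguments for a CSV file that has no header row. The helper name `build_csv_dataset_request` and the dataset name, bucket, and key values are hypothetical; only the parameter names come from the request syntax above. The helper returns a plain dict, so it can be inspected without AWS credentials.

```python
def build_csv_dataset_request(name, bucket, key, header_row, delimiter=","):
    """Assemble keyword arguments for create_dataset (hypothetical helper)."""
    return {
        "Name": name,
        "Format": "CSV",
        "FormatOptions": {
            "Csv": {"Delimiter": delimiter, "HeaderRow": header_row},
        },
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


request = build_csv_dataset_request(
    name="sales-2021",        # placeholder dataset name
    bucket="example-bucket",  # placeholder bucket
    key="raw/sales.csv",      # placeholder object key
    header_row=False,         # first row is data, so column names are auto-generated
)

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   client = boto3.client("databrew")
#   response = client.create_dataset(**request)
#   print(response["Name"])
```

For an Excel file, the same shape would presumably apply with `Format` set to `'EXCEL'` and the options placed under the `'Excel'` key, where `HeaderRow` sits alongside `SheetNames` or `SheetIndexes`.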
{'Format': 'CSV | JSON | PARQUET | EXCEL', 'FormatOptions': {'Csv': {'HeaderRow': 'boolean'}, 'Excel': {'HeaderRow': 'boolean'}}}
Returns the definition of a specific DataBrew dataset.
See also: AWS API Documentation
Request Syntax
client.describe_dataset( Name='string' )
Name (string) -- [REQUIRED]
The name of the dataset to be described.
Return type: dict
Response Syntax
{
    'CreatedBy': 'string',
    'CreateDate': datetime(2015, 1, 1),
    'Name': 'string',
    'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL',
    'FormatOptions': {
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    'Input': {
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    },
    'LastModifiedDate': datetime(2015, 1, 1),
    'LastModifiedBy': 'string',
    'Source': 'S3'|'DATA-CATALOG',
    'Tags': {
        'string': 'string'
    },
    'ResourceArn': 'string'
}
Response Structure
(dict) --
CreatedBy (string) --
The identifier (user name) of the user who created the dataset.
CreateDate (datetime) --
The date and time that the dataset was created.
Name (string) --
The name of the dataset.
Format (string) --
Specifies the file format of a dataset created from an S3 file or folder.
FormatOptions (dict) --
Options that define the structure of CSV, Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded newline characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
Specifies one or more named sheets in the Excel file, which will be included in the dataset.
(string) --
SheetIndexes (list) --
Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Input (dict) --
Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) --
The name of a database in the Data Catalog.
TableName (string) --
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
LastModifiedDate (datetime) --
The date and time that the dataset was last modified.
LastModifiedBy (string) --
The identifier (user name) of the user who last modified the dataset.
Source (string) --
The location of the data for this dataset, Amazon S3 or the AWS Glue Data Catalog.
Tags (dict) --
Metadata tags associated with this dataset.
(string) --
(string) --
ResourceArn (string) --
The Amazon Resource Name (ARN) of the dataset.
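The new `Format` and `FormatOptions` fields can be read back from a `describe_dataset` response. The sketch below is a minimal, hypothetical helper (`header_row_setting` is not part of the SDK) that returns the `HeaderRow` flag when one is present. It treats every field as optional, because the source does not say which fields the service always returns.

```python
def header_row_setting(description):
    """Return the HeaderRow flag from a describe_dataset response, or None if absent."""
    file_format = description.get("Format")
    options = description.get("FormatOptions", {})
    if file_format == "CSV":
        return options.get("Csv", {}).get("HeaderRow")
    if file_format == "EXCEL":
        return options.get("Excel", {}).get("HeaderRow")
    return None  # HeaderRow is only defined under the Csv and Excel options


# Hand-written response fragment, for illustration only.
sample = {
    "Name": "sales-2021",
    "Format": "EXCEL",
    "FormatOptions": {"Excel": {"SheetNames": ["Sheet1"], "HeaderRow": True}},
}
print(header_row_setting(sample))  # prints True
```

In real use, `sample` would be replaced by the result of `client.describe_dataset(Name=...)` on a boto3 `databrew` client.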
{'Datasets': {'Format': 'CSV | JSON | PARQUET | EXCEL', 'FormatOptions': {'Csv': {'HeaderRow': 'boolean'}, 'Excel': {'HeaderRow': 'boolean'}}}}
Lists all of the DataBrew datasets.
See also: AWS API Documentation
Request Syntax
client.list_datasets( MaxResults=123, NextToken='string' )
MaxResults (integer) --
The maximum number of results to return in this request.
NextToken (string) --
The token returned by a previous call to retrieve the next set of results.
Return type: dict
Response Syntax
{
    'Datasets': [
        {
            'AccountId': 'string',
            'CreatedBy': 'string',
            'CreateDate': datetime(2015, 1, 1),
            'Name': 'string',
            'Format': 'CSV'|'JSON'|'PARQUET'|'EXCEL',
            'FormatOptions': {
                'Json': {
                    'MultiLine': True|False
                },
                'Excel': {
                    'SheetNames': [
                        'string',
                    ],
                    'SheetIndexes': [
                        123,
                    ],
                    'HeaderRow': True|False
                },
                'Csv': {
                    'Delimiter': 'string',
                    'HeaderRow': True|False
                }
            },
            'Input': {
                'S3InputDefinition': {
                    'Bucket': 'string',
                    'Key': 'string'
                },
                'DataCatalogInputDefinition': {
                    'CatalogId': 'string',
                    'DatabaseName': 'string',
                    'TableName': 'string',
                    'TempDirectory': {
                        'Bucket': 'string',
                        'Key': 'string'
                    }
                }
            },
            'LastModifiedDate': datetime(2015, 1, 1),
            'LastModifiedBy': 'string',
            'Source': 'S3'|'DATA-CATALOG',
            'Tags': {
                'string': 'string'
            },
            'ResourceArn': 'string'
        },
    ],
    'NextToken': 'string'
}
Response Structure
(dict) --
Datasets (list) --
A list of datasets that are defined.
(dict) --
Represents a dataset that can be processed by DataBrew.
AccountId (string) --
The ID of the AWS account that owns the dataset.
CreatedBy (string) --
The Amazon Resource Name (ARN) of the user who created the dataset.
CreateDate (datetime) --
The date and time that the dataset was created.
Name (string) --
The unique name of the dataset.
Format (string) --
Specifies the file format of a dataset created from an S3 file or folder.
FormatOptions (dict) --
Options that define how DataBrew interprets the data in the dataset.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded newline characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
Specifies one or more named sheets in the Excel file, which will be included in the dataset.
(string) --
SheetIndexes (list) --
Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Input (dict) --
Information on how DataBrew can find the dataset, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) --
The name of a database in the Data Catalog.
TableName (string) --
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) --
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
LastModifiedDate (datetime) --
The last modification date and time of the dataset.
LastModifiedBy (string) --
The Amazon Resource Name (ARN) of the user who last modified the dataset.
Source (string) --
The location of the data for the dataset, either Amazon S3 or the AWS Glue Data Catalog.
Tags (dict) --
Metadata tags that have been applied to the dataset.
(string) --
(string) --
ResourceArn (string) --
The unique Amazon Resource Name (ARN) for the dataset.
NextToken (string) --
A token that you can use in a subsequent call to retrieve the next set of results.
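Because `list_datasets` returns results in pages, a caller has to pass `NextToken` back until the response no longer includes one. The following sketch shows one way to do that. The generator name `iter_datasets` and the default page size are illustrative choices, and `client` is assumed to be a boto3 `databrew` client or any object with a compatible `list_datasets` method.

```python
def iter_datasets(client, page_size=50):
    """Yield every dataset entry, following NextToken until it is absent."""
    kwargs = {"MaxResults": page_size}
    while True:
        page = client.list_datasets(**kwargs)
        yield from page.get("Datasets", [])
        token = page.get("NextToken")
        if not token:
            break  # no further pages
        kwargs["NextToken"] = token


def names_by_format(client):
    """Group dataset names by their Format value (None when Format is missing)."""
    grouped = {}
    for dataset in iter_datasets(client):
        grouped.setdefault(dataset.get("Format"), []).append(dataset["Name"])
    return grouped


# With boto3 installed and credentials configured:
#   import boto3
#   print(names_by_format(boto3.client("databrew")))
```

Grouping by `Format` is just one use of the new field; the same loop could instead filter on `FormatOptions` to find CSV or Excel datasets without a header row.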
{'Format': 'CSV | JSON | PARQUET | EXCEL', 'FormatOptions': {'Csv': {'HeaderRow': 'boolean'}, 'Excel': {'HeaderRow': 'boolean'}}}
Modifies the definition of an existing DataBrew dataset.
See also: AWS API Documentation
Request Syntax
client.update_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string'
            }
        }
    }
)
Name (string) -- [REQUIRED]
The name of the dataset to be updated.
Format (string) --
Specifies the file format of a dataset created from an S3 file or folder.
FormatOptions (dict) --
Options that define the structure of CSV, Excel, or JSON input.
Json (dict) --
Options that define how JSON input is to be interpreted by DataBrew.
MultiLine (boolean) --
A value that specifies whether JSON input contains embedded newline characters.
Excel (dict) --
Options that define how Excel input is to be interpreted by DataBrew.
SheetNames (list) --
Specifies one or more named sheets in the Excel file, which will be included in the dataset.
(string) --
SheetIndexes (list) --
Specifies one or more sheet numbers in the Excel file, which will be included in the dataset.
(integer) --
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Csv (dict) --
Options that define how CSV input is to be interpreted by DataBrew.
Delimiter (string) --
A single character that specifies the delimiter used in the CSV file.
HeaderRow (boolean) --
A variable that specifies whether the first row in the file will be parsed as the header. If false, column names will be auto-generated.
Input (dict) -- [REQUIRED]
Information on how DataBrew can find data, in either the AWS Glue Data Catalog or Amazon S3.
S3InputDefinition (dict) --
The Amazon S3 location where the data is stored.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
DataCatalogInputDefinition (dict) --
The AWS Glue Data Catalog parameters for the data.
CatalogId (string) --
The unique identifier of the AWS account that holds the Data Catalog that stores the data.
DatabaseName (string) -- [REQUIRED]
The name of a database in the Data Catalog.
TableName (string) -- [REQUIRED]
The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.
TempDirectory (dict) --
An Amazon S3 location that AWS Glue Data Catalog can use as a temporary directory.
Bucket (string) -- [REQUIRED]
The S3 bucket name.
Key (string) --
The unique name of the object in the bucket.
Return type: dict
Response Syntax
{ 'Name': 'string' }
Response Structure
(dict) --
Name (string) --
The name of the dataset that you updated.
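Since `Input` is required on `update_dataset`, changing only the header setting still means resending the input definition. The sketch below builds such a request from an existing dataset description. The helper name `build_header_row_update` and the sample values are hypothetical. The source does not state whether omitted optional fields are preserved or reset, so this sketch resends the existing CSV options as well rather than relying on either behavior.

```python
def build_header_row_update(description, header_row):
    """Build update_dataset arguments that change only HeaderRow for a CSV dataset."""
    if description.get("Format") != "CSV":
        raise ValueError("this sketch only handles CSV datasets")
    # Copy the existing CSV options so the caller's dict is not modified.
    csv_options = dict(description.get("FormatOptions", {}).get("Csv", {}))
    csv_options["HeaderRow"] = header_row
    return {
        "Name": description["Name"],
        "Format": "CSV",
        "FormatOptions": {"Csv": csv_options},
        "Input": description["Input"],  # Input is required on update_dataset
    }


# Hand-written description fragment, for illustration only.
current = {
    "Name": "sales-2021",
    "Format": "CSV",
    "FormatOptions": {"Csv": {"Delimiter": ";", "HeaderRow": False}},
    "Input": {"S3InputDefinition": {"Bucket": "example-bucket", "Key": "raw/sales.csv"}},
}
update_request = build_header_row_update(current, header_row=True)

# With a boto3 client: client.update_dataset(**update_request)
```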