# Custom Table Data Import Function
# I. Overview
In some cases, the data you need may not be representable as users or events, such as mapping tables or external data. To use such data, you need to import it as custom tables into the TA system through the data_transfer command and associate it with the event table and the user info table.
The following two import data sources are currently supported:
- mysql: remote mysql database
- txtfile: local file
# II. Instructions for Use
# 2.1 Command description
The command for data import is as follows:
ta-tool data_transfer -conf <config files> [--date xxx]
# 2.2 Command parameter description
# 2.2.1 -conf
Specifies the configuration file path of the table to import. Each table has its own configuration file, and multiple tables can be imported at the same time. Wildcards are supported, for example /data/config/*.json or ./config/*.json
# 2.2.2 --date
--date is an optional parameter that specifies the data date in the format YYYY-MM-DD; time macros are replaced relative to this reference time. If the parameter is not passed, the current date is used by default. For the specific use of time macros, please refer to section 2.4 Time macro usage.
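For example, the following invocation (the configuration path and date are only illustrative) imports every table whose configuration file matches the wildcard, using 2018-07-01 as the reference date for time macro replacement:
ta-tool data_transfer -conf /data/config/*.json --date 2018-07-01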
# 2.3 Configuration file description
# 2.3.1 The sample configuration file for a single table is as follows:
{
    "parallel_num": 2,
    "source": {
        "type": "txtfile",
        "parameter": {
            "path": ["/data/home/ta/importer_test/data/*"],
            "encoding": "UTF-8",
            "column": ["*"],
            "fieldDelimiter": "\t"
        }
    },
    "target": {
        "appid": "test-appid",
        "table": "test_table",
        "table_desc": "import test table",
        "partition_value": "@[{yyyyMMdd}-{1day}]",
        "column": [
            {
                "name": "col1",
                "type": "timestamp",
                "comment": "timestamp"
            },
            {
                "name": "col2",
                "type": "varchar"
            }
        ]
    }
}
# 2.3.2 Description of outer layer parameters
- parallel_num
  - Description: number of concurrent import threads, which controls the import rate
  - Type: int
  - Required: Yes
  - Default value: none
- source
  - Description: configuration of the specific parameters of the import data source
  - Type: jsonObject
  - Required: Yes
  - Default value: none
- target
  - Description: configuration of the specific parameters of the target table
  - Type: jsonObject
  - Required: Yes
  - Default value: none
# 2.3.3 Specification of source parameters
- type
  - Description: type of the import data source. The import tool currently supports three data sources: txtfile, mysql and ftp. More data sources will be supported later
  - Type: string
  - Required: Yes
  - Default value: none
- parameter
  - Description: specific configuration for the different data sources, as described in III. Configuration of Import Data Sources
  - Type: jsonObject
  - Required: Yes
  - Default value: none
# 2.3.4 Specification of target parameters
- appid
  - Description: appid of the project that the imported table belongs to, which can be found in the TA system backend
  - Type: string
  - Required: Yes
  - Default value: none
- table
  - Description: name of the imported table in the TA system. Note: the table name must be globally unique; it is recommended to add distinguishable prefixes or suffixes for different projects
  - Type: string
  - Required: Yes
  - Default value: none
- table_desc
  - Description: comment of the imported table. It is recommended to configure this parameter when importing to make the table easier to look up later and to clarify its meaning
  - Type: string
  - Required: No
  - Default value: none
- partition_value
  - Description: value of the imported partition. Custom tables imported into the TA system carry the partition field $pt by default, so the partition value must be specified when importing. Generally it can be set to the data date of the import, for example 20180701; time macro replacement is also supported, for example @[{yyyyMMdd}-{1day}]. Section 2.4 introduces the specific usage
  - Type: string
  - Required: Yes
  - Default value: none
- column
  - Description: field definitions of the table imported into the TA system, consisting of the 3 properties name, type and comment, where name and type are required. The sample is as follows:
[
    {
        "name": "col1",
        "type": "timestamp",
        "comment": "timestamp"
    },
    {
        "name": "col2",
        "type": "varchar"
    }
]
When the source is mysql and the whole table is imported (that is, the source column field is ["*"]), the column parameter cannot be passed in the target, and the import tool will follow the table structure in mysql. In all other cases, this field must be passed
  - Type: jsonArray
  - Required: No
  - Default value: schema definition of the mysql source table
# 2.4 Time macro usage
You can replace time parameters with time macros inside the configuration file. ta-tool will use the imported start time as a benchmark to calculate the offset of time based on the parameters of the time macro, and replace the time macro in the configuration file. The available time macro formats are@[{yyyyMMdd}]
, @[{yyyyMMdd}-{nday}]
, @[{yyyyMMdd}+{nday}]
, @[{yyyMMdd} + {nday} ], etc.
yyyyMMdd
can be replaced with any date format that can be parsed by JavadateFormat
, for example:yyyy-MM-dd HH:mm:ss.SSS
andyyyyMMddHH000000
- N can be any integer, representing the offset value of time
- Day represents the offset unit of time, which can be selected as follows:
day
,hour
,minute
,week
andmonth
- Example: Suppose the current time is 2018-07-01 15:13:23.234
  - @[{yyyyMMdd}] is replaced with 20180701
  - @[{yyyy-MM-dd}-{1day}] is replaced with 2018-06-30
  - @[{yyyyMMddHH}+{2hour}] is replaced with 2018070117
  - @[{yyyyMMddHHmm00}-{10minute}] is replaced with 20180701150300
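If the import is started with an explicit --date (the configuration path below is hypothetical), the macros are replaced relative to that date instead of the current time:
ta-tool data_transfer -conf ./config/test_table.json --date 2018-07-01
With this invocation, @[{yyyyMMdd}-{1day}] in the configuration file is replaced with 20180630.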
# III. Configuration of Import Data Sources
This section introduces the parameter configuration of the different data sources. Three import data sources are currently supported: txtfile, mysql and ftp. You need to adjust the parameters of the source according to the data source you use.
# 3.1 mysql data source
This data source connects to the remote mysql database through a JDBC connector, generates a SELECT query statement based on the information configured by the user, sends it to the remote mysql database, and imports the result returned by the query into the TA system table.
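As an illustration, a source configured with column ["db_id", "log_time"], table ["test_table"] and the where condition shown in section 3.1.2 would, after time macro replacement with a reference date of 2018-07-01, send a query roughly like the following (the exact SQL generated by ta-tool may differ):
select db_id,log_time from test_table where log_time>='2018-06-30 00:00:00' and log_time<'2018-07-01 00:00:00'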
# 3.1.1 Sample configuration
- A sample configuration for importing an entire mysql table into the TA system:
{
    "parallel_num": 2,
    "source": {
        "type": "mysql",
        "parameter": {
            "username": "test",
            "password": "test",
            "column": ["*"],
            "connection": [
                {
                    "table": ["test_table"],
                    "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
                }
            ]
        }
    },
    "target": {
        "appid": "test-appid",
        "table": "test_table_abc",
        "table_desc": "mysql test table",
        "partition_value": "@[{yyyy-MM-dd}-{1day}]"
    }
}
- A sample configuration for importing into the TA system with custom SQL:
{
    "parallel_num": 1,
    "source": {
        "type": "mysql",
        "parameter": {
            "username": "test",
            "password": "test",
            "connection": [
                {
                    "querySql": [
                        "select db_id,log_time from test_table where log_time>='@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time<'@[{yyyy-MM-dd 00:00:00}]'"
                    ],
                    "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
                }
            ]
        }
    },
    "target": {
        "appid": "test-appid",
        "table": "test_table_abc",
        "table_desc": "mysql test table",
        "partition_value": "@[{yyyy-MM-dd}-{1day}]",
        "column": [
            {
                "name": "db_id",
                "type": "bigint",
                "comment": "db serial number"
            },
            {
                "name": "log_time",
                "type": "timestamp",
                "comment": "time stamp"
            }
        ]
    }
}
# 3.1.2 Parameter description
- jdbcUrl
  - Description: JDBC connection information of the remote database, described as a JSON array. Note that jdbcUrl must be included in the connection configuration unit. In general, the JSON array only needs to contain one JDBC connection.
  - Type: jsonArray
  - Required: Yes
  - Default value: none
- username
  - Description: user name of the data source
  - Type: string
  - Required: Yes
  - Default value: none
- password
  - Description: password for the user name specified by the data source
  - Type: string
  - Required: Yes
  - Default value: none
- table
  - Description: tables to be synchronized, described as a JSON array, so multiple tables can be extracted at the same time. When configuring multiple tables, the user must ensure that they share the same schema; MySQL Reader does not check whether the tables are the same logical table. Note that table must be included in the connection configuration unit.
  - Type: jsonArray
  - Required: Yes
  - Default value: none
- column
  - Description: set of column names in the configured table that need to be synchronized, described as a JSON array. You can use * to indicate that all columns are used by default, for example ["*"].
  - Type: jsonArray
  - Required: Yes
  - Default value: none
- where
  - Description: filter condition. The import tool splices the SQL according to the specified column, table and where condition, and extracts data according to this SQL. In practice, the data of the previous day is often synchronized, and the where condition can be specified as log_time>='@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time<'@[{yyyy-MM-dd 00:00:00}]'. Note: the where condition cannot be specified as limit 10, because limit is not a legal where clause in SQL. The where condition can be used to synchronize business increments effectively. If the where statement is not filled in, the import tool synchronizes the full data.
  - Type: string
  - Required: No
  - Default value: none
- querySql
  - Description: in some business scenarios, the where configuration item is not sufficient to describe the filter conditions, and this parameter lets the user define the filtering SQL directly. When it is configured, the import tool ignores the table, column and where configuration items and uses the content of this item to filter the data, for example to synchronize the result of a multi-table join: select a,b from table_a join table_b on table_a.id = table_b.id. When querySql is configured, it takes priority over the table, column and where options.
  - Type: string
  - Required: No
  - Default value: none
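For reference, a source block that uses table, column and where (rather than querySql) to synchronize the previous day's increment might look like the following; the connection details are reused from the samples above, and the column names are illustrative:
"source": {
    "type": "mysql",
    "parameter": {
        "username": "test",
        "password": "test",
        "column": ["db_id", "log_time"],
        "where": "log_time>='@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time<'@[{yyyy-MM-dd 00:00:00}]'",
        "connection": [
            {
                "table": ["test_table"],
                "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
            }
        ]
    }
}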
# 3.2 txtfile data source
The txtfile data source reads files on the local server and imports them into a TA system table. The current restrictions and characteristics of txtfile are as follows:
- Only TXT files are supported, and the content of the TXT file must be a two-dimensional table
- CSV-like files with custom field delimiters are supported
- Multiple data types (represented as strings), column pruning and column constants are supported
- Recursive directory reading and file name filtering are supported
- Text compression is supported; the available compression formats are zip, gzip and bzip2
# 3.2.1 Sample configuration
{
    "parallel_num": 5,
    "source": {
        "type": "txtfile",
        "parameter": {
            "path": ["/home/ftp/data/testData/*"],
            "column": [
                {
                    "index": 0,
                    "type": "long"
                },
                {
                    "index": 1,
                    "type": "string"
                }
            ],
            "encoding": "UTF-8",
            "fieldDelimiter": "\t"
        }
    },
    "target": {
        "appid": "test-appid",
        "table": "test_table_abc",
        "table_desc": "txtfile test table",
        "partition_value": "@[{yyyy-MM-dd}-{1day}]",
        "column": [
            {
                "name": "db_id",
                "type": "bigint",
                "comment": "db serial number"
            },
            {
                "name": "log_time",
                "type": "timestamp",
                "comment": "time stamp"
            }
        ]
    }
}
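With this configuration, a file under /home/ftp/data/testData/ would be expected to contain two fields per line separated by the configured tab delimiter, for example (the values are purely illustrative):
1001	2018-06-30 13:05:11
1002	2018-06-30 13:07:42
The first field maps to index 0 (long) and the second to index 1 (string) in the source column definition.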
# 3.2.2 Parameter description
- path
  - Description: path information of the local file system. Note that multiple paths can be filled in here. When a wildcard is specified, the import tool tries to traverse all matching files. For example, specifying /data/* means reading all files in the /data directory. Currently only the * symbol is supported as a file wildcard. It is particularly important to note that the import tool treats all text files synchronized in one job as a single data table, so the user must ensure that all files fit the same schema. The files must be readable in a CSV-like format.
  - Type: string
  - Required: Yes
  - Default value: none
- column
  - Description: list of fields to read. type specifies the type of the source data, index specifies the column position in the text (starting from 0), and value defines the current column as a constant: no data is read from the source file, and the column is generated automatically according to value.
By default, all data can be read as the string type, configured as follows:
"column": ["*"]
The user can also specify the column field information explicitly, configured as follows:
[
    {
        "type": "long",
        "index": 0
    },
    {
        "type": "string",
        "value": "2018-07-01 00:00:00"
    }
]
For user-specified column information, type must be filled in, and exactly one of index or value must be chosen.
The value range of type is: long, double, string and boolean
  - Type: jsonArray
  - Required: Yes
  - Default value: none
- fieldDelimiter
  - Description: field delimiter used when reading
  - Type: string
  - Required: Yes
  - Default value: ,
- compress
  - Description: text compression type. Leaving it empty means no compression by default. Supported compression types are zip, gzip and bzip2.
  - Type: string
  - Required: No
  - Default value: no compression
- encoding
  - Description: encoding of the files to read
  - Type: string
  - Required: No
  - Default value: utf-8
- skipHeader
  - Description: a CSV-like file may have a header line as the title, which needs to be skipped. It is not skipped by default.
  - Type: boolean
  - Required: No
  - Default value: false
- nullFormat
  - Description: there is no standard string to define null (null pointer) in text files. ta-tool provides nullFormat to define which strings can be interpreted as null. For example, if the user configures nullFormat:"\N", then a source value of "\N" is treated as a null field by ta-tool.
  - Type: string
  - Required: No
  - Default value: \N
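Putting the optional parameters together, a txtfile source block that reads gzip-compressed, tab-delimited files and skips a header line might look like the following; the path is reused from the sample in section 2.3.1, and the compress and skipHeader values are illustrative:
"source": {
    "type": "txtfile",
    "parameter": {
        "path": ["/data/home/ta/importer_test/data/*"],
        "encoding": "UTF-8",
        "column": ["*"],
        "fieldDelimiter": "\t",
        "compress": "gzip",
        "skipHeader": true
    }
}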