
Using Java with the InterSystems JDBC Driver
The Simple Data Transfer Utility

The Simple Data Transfer Utility is a single-class Java command-line utility for transferring large volumes of data from a source to a target datasource.
The target datasource is always a Java Database Connectivity (JDBC) datasource. The utility is agnostic about the underlying database; the main requirement is JDBC compliance. However, it is optimized for IRIS as a target.
The source datasource can be either a JDBC datasource or a comma-separated values (CSV) file.
The utility works with both standard and sharded namespaces in IRIS, but it is significantly more efficient when the target table is sharded, because it can then parallelize the load.
The utility assumes that no other database processes are running while this kind of load takes place. Usually even journaling is disabled, and possibly some concurrency controls as well. The loader might function successfully even if operations such as INSERT or DELETE run in parallel, but that usage conflicts with its primary use case.
This chapter contains the following sections:
CSV Driver Options
There are two options for transferring data from CSV files into IRIS or another JDBC datasource:
Using the Simple Data Transfer Utility
The utility is a Java program implemented in the single class com.intersys.datatransfer.SimpleMover. To run it, Java 8 must be installed. Additionally, the following InterSystems JAR files are required:
If the source or target is another JDBC data source, you must also add the JAR containing the corresponding driver to the classpath. The following JDBC data sources have been tested:
The class com.intersys.datatransfer.SimpleMover requires at least one argument: a property file containing connection parameters for the source and target data sources and other options. Parameters provided as command line arguments override those provided in the property file, which makes it easy to test different options without modifying the properties file. In theory, all parameters could be specified on the command line, but that is not recommended.
The following is an example of a command line to execute the utility:
  java -cp \
      ../../iris/latest/built/common/release/java/1.8/isc-utils/isc-utils-3.0.0.jar:../../iris/latest/built/common/release/java/1.8/isc-jdbc/isc-jdbc-3.0.0.jar \
      com.intersys.datatransfer.SimpleMover p=astro2s.properties
Here, astro2s.properties is a properties file that defines all transfer options. Note that on Windows, classpath entries are separated by a semicolon (;) rather than a colon (:).
When additional JDBC drivers are needed, add them to the classpath. This example shows the classpath when a PostgreSQL driver is needed:
  java -cp \
      ../../iris/latest/built/common/release/java/1.8/isc-utils/isc-utils-3.0.0.jar:../../iris/latest/built/common/release/java/1.8/isc-jdbc/isc-jdbc-3.0.0.jar:/Volumes/ppdisk/mpro/opt/postgres/lib/postgresql-42.1.4.jar \
      com.intersys.datatransfer.SimpleMover p=astro2s.properties
This example uses a CSV JDBC driver:
  java -cp \
      ../../iris/latest/built/common/release/java/1.8/isc-utils/isc-utils-3.0.0.jar:../../iris/latest/built/common/release/java/1.8/isc-jdbc/isc-jdbc-3.0.0.jar:/usr/local/javalib/csvjdbc/csvjdbc-1.0-34.jar \
      com.intersys.datatransfer.SimpleMover p=astro2s.properties
Data Transfer Options
Options can be specified either as command line arguments or in a property file. Options on the command line override those in the property file.
Property File
The following options apply to the transfer generally. A number of additional options that apply to the source and target datasources are described in a later section.
Command Line Options
One important command line option is properties (also p or props), which supplies the name of the properties file. If you omit this option, you must pass all required options on the command line.
Specify command line options either in the form key=value or key value. A special form, -p:<propertyName>=<propertyValue>, can be used to specify or override any property from the property file.
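For instance, the following invocation takes most options from a properties file but overrides two of them on the command line. The file name, jar paths, and property values shown here are illustrative only:

```shell
# All options come from transfer.properties except the two overrides below.
# Jar paths, file name, and values are hypothetical.
java -cp isc-utils-3.0.0.jar:isc-jdbc-3.0.0.jar \
    com.intersys.datatransfer.SimpleMover \
    p=transfer.properties \
    jobs=8 \
    -p:target.table=ASTRO_TEST
```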
Datasources
Users must specify two datasources: source and target. Both use similar properties, prefixed with either "source." or "target.".
Target Properties
target.url
    A valid URL for the datasource to load into.
target.host
    Only for IRIS: instead of target.url, you can specify host, port, and namespace.
target.port
    Only for IRIS: instead of target.url, you can specify host, port, and namespace.
target.namespace
    Only for IRIS: instead of target.url, you can specify host, port, and namespace.
target.username
    Username for the target datasource.
target.password
    Password for the target datasource.
target.create
    Only for the target datasource. Creates a table identical to the source table in the target database. Possible values: sharded, not sharded, and do not create. If this property is not specified, the table is not created and must already exist in the target data source.
target.driver.*
    A property that is passed directly to the JDBC driver. For example, for target.driver.x.y.z=25, a property named x.y.z with a value of 25 is passed to the target driver.
When true, a background thread runs a select count(*) query on the target table to monitor the rate of insertion. When false, the rate of insertion is estimated from the amount of data sent to the server. Default: true.
target.table
    Name of the table to load data into.
Source Properties
source.url
    A valid URL for the source datasource.
source.host
    Only for IRIS: instead of source.url, you can specify host, port, and namespace.
source.port
    Only for IRIS: instead of source.url, you can specify host, port, and namespace.
source.namespace
    Only for IRIS: instead of source.url, you can specify host, port, and namespace.
source.username
    Username for the source datasource.
source.password
    Password for the source datasource.
exclude
    List of columns from the source table to ignore and not write to the target table.
max
    When using splitOn, use this property to limit the records to load by specifying a maximum value of that column to load from the source.
min
    When using splitOn, use this property to limit the records to load by specifying a minimum value of that column to load from the source.
source.type
    Use csv for the built-in CSV parser or jdbc for any other JDBC source. Default: jdbc.
source.table
    Name of the table to load data from, when loading from a JDBC source. Either source.table or source.query is required when source.type="jdbc".
splitOn
    When loading from a table, use this property to select a column to use for splitting the data into chunks (jobs).
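To illustrate what splitOn does, here is a minimal sketch (not the utility's actual code) of how a split column could partition a table into ranged queries that parallel jobs run independently. The class and method names are invented for this example, and the real utility derives the range bounds from the source or the min/max properties:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOnSketch {
    // Build one ranged query per job over [min, max] of the split column.
    static List<String> chunkQueries(String table, String column,
                                     long min, long max, int jobs) {
        List<String> queries = new ArrayList<>();
        long span = (max - min + 1 + jobs - 1) / jobs;  // ceiling division
        for (long lo = min; lo <= max; lo += span) {
            long hi = Math.min(lo + span - 1, max);
            queries.add("SELECT * FROM " + table
                        + " WHERE " + column + " BETWEEN " + lo + " AND " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical: split IDs 1..100 into 4 parallel chunks.
        for (String q : chunkQueries("acidminer.ATreeNode", "ID", 1, 100, 4))
            System.out.println(q);
    }
}
```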
Source Properties when Loading from a JDBC Source
source.count
    Optionally specifies the number of records to load. This number is used to calculate the percentage completed and to estimate the time remaining, and it avoids running a count(*) query, which is useful when that query would take a long time. Possible value: an integer.
source.driver.*
    A property that is passed directly to the JDBC driver. For example, for source.driver.x.y.z=25, a property named x.y.z with a value of 25 is passed to the source driver.
source.query
    Specifies an SQL query to load data. Useful if you require more advanced processing than loading a whole table with source.table. Optionally, use a range condition with two ?s to split the data into batches.
source.query.$1
    Specifies a loop spec in ObjectScript format that provides the first query parameter value for splitting the source data into batches. For example, a value of "1:10:100" issues the query for the first batch with a value of 1 passed in as the first query parameter, incremented by 10 for each batch (11, 21, 31, ...) until it reaches 100. Possible value: a loop spec in ObjectScript format.
source.query.$2 (and further arguments)
    The second and further arguments can be specified either in the same way as the first argument or by referencing a prior argument. For example, $1 + 999 means that on each iteration the second argument is the value of the first incremented by 999.
source.query.count
    Optionally specifies an alternate query for deriving the number of records to load, which is used for the progress indicator. When not specified, this defaults to SELECT COUNT(*) FROM (source.query). Possible value: an SQL query for the count.
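The loop-spec mechanics can be sketched as follows. This is an illustrative model of how a "start:step:end" spec for the first parameter, plus a "$1 + offset" rule for the second, expands into per-batch parameter pairs; it is not the utility's own code, and the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;

public class LoopSpecSketch {
    // Expand "start:step:end" for $1, with $2 = $1 + offset, into one
    // {param1, param2} pair per batch.
    static List<long[]> batches(long start, long step, long end, long offset) {
        List<long[]> out = new ArrayList<>();
        for (long p1 = start; p1 <= end; p1 += step)
            out.add(new long[] { p1, p1 + offset });
        return out;
    }

    public static void main(String[] args) {
        // The text's example: "1:10:100" with $2 = $1 + 999.
        for (long[] b : batches(1, 10, 100, 999))
            System.out.println("ID BETWEEN " + b[0] + " AND " + b[1]);
    }
}
```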
Source Properties when Loading from a CSV Source
source.csv.pool
    Restricts line reading to a preallocated pool. If the input file is huge, then because reading is usually faster than writing, the program might read the entire file into memory, causing an OutOfMemoryError. To avoid this, use this property to specify the pool size. Best practice is to provide as large a pool as your memory allows. In any case, the size of this pool cannot be less than the batch size multiplied by the number of threads; if the specified size is smaller, it is automatically increased. When nothing is specified (the default), the pool size is unbounded.
Separator to use for reading column values. Defaults to ".".
source.header
    A list of column names to load, in the order they appear in the source file, provided as a comma-separated list of strings. This list must match the column names in the target table (case-sensitive). You can omit some non-required columns in the target table by leaving them blank. The special value "#" means that the first line in the CSV file is the header line, that is, the list of column names. If any other value is used, the first line in the CSV file is expected to contain valid data. Possible value: a comma-separated list of strings.
source.driver.headerline
    The same as source.header. Supported for compatibility with the relique CSV JDBC driver.
source.jobs
    Number of parallel jobs to use for reading the CSV file. It can differ from the number of insert jobs. Possible value: an integer.
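The pool behavior described for source.csv.pool can be modeled with a bounded queue: a fast reader blocks when the pool is full instead of buffering the whole file. This is an assumed sketch of the mechanism, not the utility's implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedPoolSketch {
    // A reader thread fills the bounded pool while the caller drains it,
    // so at most poolSize lines are in memory at once.
    static long transfer(int totalLines, int poolSize) throws InterruptedException {
        BlockingQueue<String> pool = new ArrayBlockingQueue<>(poolSize);
        Thread reader = new Thread(() -> {
            for (int i = 0; i < totalLines; i++) {
                try {
                    pool.put("line-" + i);   // blocks when the pool is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        reader.start();
        long consumed = 0;
        while (consumed < totalLines) {      // insert workers would drain here
            pool.take();
            consumed++;
        }
        reader.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        int batch = 1000, threads = 4;
        // Pool of at least batch * threads lines, per the sizing rule above.
        System.out.println("consumed " + transfer(10_000, batch * threads) + " lines");
    }
}
```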
XEP Datasource
A target IRIS database can be used in XEP mode. This mode is valid only for non-sharded databases and can provide a significant performance boost in some cases. To specify that a target should work in XEP mode, use the following property:
target.mode
    Mode to use for inserting into the target table. When "xep" is specified, InterSystems XEP functionality is used for loading data, which can provide significant performance improvements over regular JDBC. XEP mode is not available when loading into sharded tables. This property is required to enable XEP mode. Possible value: xep.
Property File Examples
PostgreSQL Example
The following property file is used to transfer a whole table from PostgreSQL into IRIS, creating a new sharded table in the IRIS namespace:
jobs=10
threads=10
splitOn=ID
exclude=ID
min = 1
max = *1.1
refresh=5s
log.file=logs/fromPostgres.log

source.url = jdbc:postgresql://localhost:5432/postgres
source.username=postgres
source.password=sys
source.table=acidminer.ATreeNode

target.host=localhost
target.port=56777
target.namespace=SHMASTER
target.username=_SYSTEM
target.password=SYS
target.table=acidminer.ATreeNode

target.create = sharded
InterSystems IRIS Example
The following property file is used to transfer a result set obtained by a custom query from one IRIS database into another, creating a new non-sharded table in the latter.
exclude=ID
refresh=5s
log.file=logs/test103.log
source.host=localhost
source.port=56775
source.namespace=ACIDMINER
source.username=_SYSTEM
source.password=SYS
source.table=com_intersys_acidminer_model.ATreeNode
source.query = 
    select * from $table 
    where filteredOut = 0 
    and clade->NodeType = 'duplication' 
    and ID between ? and ?
source.query.$1 = 1000000:1000000:10000000
source.query.$2 = $1 + 999999
source.query.count = 
    select count(*) from $table 
    where filteredOut = 0 
    and clade->NodeType = 'duplication' 
    and ID between 1000000 and 10999999
target.host=localhost
target.port=56777
target.namespace=TEST
target.username=_SYSTEM
target.password=SYS
target.table=test_acidminer.ATreeNode2
target.create = not sharded
CSV Using CSV JDBC Driver Example
exclude=dummy
refresh=5s
log.file=astro.log
source.type=csv
source.url=jdbc:relique:csv:/Volumes/ppdisk/data/astro/files
source.username=
source.password=
source.tables=.*
source.driver.headerline=
    id1,ra,dec,errra,errdec,pmra,pmdec,errpmra,errpmdec,radvel,errradvel,htm,healpixring,healpixnest,
    epoch,axe_a,axe_b,theta,shape,magu,errmagu,magb,errmagb,magv,errmagv,magr,errmagr,magi,errmagi,
    magj,errmagj,magh,errmagh,magk,errmagk,magSg,errmagSg,magSr,errmagSr,magSi,errmagSi,vartype,
    period,logteff,errlogteff,logg,errlogg,logmet,errlogmet,alphamet,erralphamet,spectrumid,dummy
source.driver.suppressHeaders=true
source.driver.trimValues=true
source.count=16000000
target.host=localhost
target.port=56777
target.namespace=SHMASTER
target.username=_SYSTEM
target.password=SYS
target.table=ASTRO
# target.create = not sharded
CSV Using Built-in CSV Reader Example
refresh=5s
log.file=logs/astro2s.log
jobs = 20
source.type=csv
# source.url=file:/Volumes/ppdisk/data/astro/files/f_example_1_0.txt.1.csv
source.url=file:/Volumes/ppdisk/data/astro/f_example_1_0.txt.csv
source.driver.headerline=
    id1,ra,dec,errra,errdec,pmra,pmdec,errpmra,errpmdec,radvel,errradvel,htm,healpixring,healpixnest,
    epoch,axe_a,axe_b,theta,shape,magu,errmagu,magb,errmagb,magv,errmagv,magr,errmagr,magi,errmagi,
    magj,errmagj,magh,errmagh,magk,errmagk,magSg,errmagSg,magSr,errmagSr,magSi,errmagSi,vartype,
    period,logteff,errlogteff,logg,errlogg,logmet,errlogmet,alphamet,erralphamet,spectrumid,dummy
source.jobs = 5
# source.csv.pool = 0
source.csv.pool = 200000
target.host=localhost
target.port=56777
target.namespace=SHMASTER
target.username=_SYSTEM
target.password=SYS
target.table=ASTRO

Using XEP Mode for the Target Connection
exclude=dummy
refresh=5s
log.file=logs/astro2x.log
jobs = 4
source.type=csv
# source.url=file:/Volumes/ppdisk/data/astro/files/f_example_1_0.txt.1.csv
source.url=file:/Volumes/ppdisk/data/astro/f_example_1_0.txt.csv
source.driver.headerline=
    id,ra,dec,errra,errdec,pmra,pmdec,errpmra,errpmdec,radvel,errradvel,htm,healpixring,healpixnest,
    epoch,axe_a,axe_b,theta,shape,magu,errmagu,magb,errmagb,magv,errmagv,magr,errmagr,magi,errmagi,
    magj,errmagj,magh,errmagh,magk,errmagk,magSg,errmagSg,magSr,errmagSr,magSi,errmagSi,vartype,
    period,logteff,errlogteff,logg,errlogg,logmet,errlogmet,alphamet,erralphamet,spectrumid,dummy
source.jobs = 1
# source.csv.pool = 0
source.csv.pool = 200000
target.host=localhost
target.port=56777
target.mode=xep
target.namespace=ASTRO
target.username=_SYSTEM
target.password=SYS
target.table=ASTRO2


Copyright © 1997-2019 InterSystems Corporation, Cambridge, MA
Content Date/Time: 2019-07-19 06:02:41