Query big mssql Data Source [Batch]

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Query big mssql Data Source [Batch]

miki haiat
HI ,
I want to query some sql table that contains  ~80m rows.

There is  a few ways to do that  and i wonder what is the best way to do that .

  1. Using JDBCINPUTFORMAT  -> convert to dataset and output it without doing any logic in the dataset, passing the full query in the JDBCINPUTFORMAT set query parameters.
  2.  Using JDBCINPUTFORMAT    select all the data from table then desirelaze it ->convert to dataset and preforming logic.

Or something else that is much efficient ?

Thanks,

Miki

Reply | Threaded
Open this post in threaded view
|

Re: Query big mssql Data Source [Batch]

Flavio Pompermaier
You can pass a ParametersProvider to the jdbc input format in order to parallelize the fetch.
Of course you don't have to kill the mysql server  with too many request in parallel so you'll probably put a limit to the parallelism of the input format.


On Tue, 4 Dec 2018, 17:31 miki haiat <[hidden email] wrote:
HI ,
I want to query some sql table that contains  ~80m rows.

There is  a few ways to do that  and i wonder what is the best way to do that .

  1. Using JDBCINPUTFORMAT  -> convert to dataset and output it without doing any logic in the dataset, passing the full query in the JDBCINPUTFORMAT set query parameters.
  2.  Using JDBCINPUTFORMAT    select all the data from table then desirelaze it ->convert to dataset and preforming logic.

Or something else that is much efficient ?

Thanks,

Miki

Reply | Threaded
Open this post in threaded view
|

Re: Query big mssql Data Source [Batch]

miki haiat
Im using  jdts driver to query mssql .
I used the  ParametersProvider  as you suggested but  for some reason the job wont     run  parallel .

flink_in.JPG

Also the sink , a simple print out wont parallel 

flink_out.JPG




On Tue, Dec 4, 2018 at 10:05 PM Flavio Pompermaier <[hidden email]> wrote:
You can pass a ParametersProvider to the jdbc input format in order to parallelize the fetch.
Of course you don't have to kill the mysql server  with too many request in parallel so you'll probably put a limit to the parallelism of the input format.


On Tue, 4 Dec 2018, 17:31 miki haiat <[hidden email] wrote:
HI ,
I want to query some sql table that contains  ~80m rows.

There is  a few ways to do that  and i wonder what is the best way to do that .

  1. Using JDBCINPUTFORMAT  -> convert to dataset and output it without doing any logic in the dataset, passing the full query in the JDBCINPUTFORMAT set query parameters.
  2.  Using JDBCINPUTFORMAT    select all the data from table then desirelaze it ->convert to dataset and preforming logic.

Or something else that is much efficient ?

Thanks,

Miki

Reply | Threaded
Open this post in threaded view
|

Re: Query big mssql Data Source [Batch]

Flavio Pompermaier
Reply | Threaded
Open this post in threaded view
|

Re: Query big mssql Data Source [Batch]

miki haiat
hi Flavio ,

This is the query that im trying to coordinate 
.setQuery("SELECT a, b, c, \n" +
"FROM dbx.dbo.x as tls\n"+
"WHERE tls.a BETWEEN ? and ?"
And this is the way im trying to parameterized

ParameterValuesProvider pramProvider = new NumericBetweenParametersProvider(10000, 3,300);
  
I also tried this way 
 Serializable[][] queryParameters = new String[1][2];
queryParameters[0] = new String[]{"3","300"};

On Wed, Dec 5, 2018 at 6:44 PM Flavio Pompermaier <[hidden email]> wrote:
Reply | Threaded
Open this post in threaded view
|

Re: Query big mssql Data Source [Batch]

Flavio Pompermaier
the constructor of NumericBetweenParametersProvider takes 3 params: long fetchSize, long minVal, long maxVal.
If you want parallelism you should use a  1 < fetchSize  < maxVal.
In your case, if you do new NumericBetweenParametersProvider(50, 3, 300) you will produce 6 parallel tasks: 
  1. SELECT .... BETWEEN 3 and 50
  2. SELECT .... BETWEEN 51 and 100
  3. SELECT .... BETWEEN 101 and 150
  4. SELECT .... BETWEEN 151 and 200
  5. SELECT .... BETWEEN 201 and 250
  6. SELECT .... BETWEEN 251 and 300

On Thu, Dec 6, 2018 at 10:32 AM miki haiat <[hidden email]> wrote:
hi Flavio ,

This is the query that im trying to coordinate 
.setQuery("SELECT a, b, c, \n" +
"FROM dbx.dbo.x as tls\n"+
"WHERE tls.a BETWEEN ? and ?"
And this is the way im trying to parameterized

ParameterValuesProvider pramProvider = new NumericBetweenParametersProvider(10000, 3,300);
  
I also tried this way 
 Serializable[][] queryParameters = new String[1][2];
queryParameters[0] = new String[]{"3","300"};

On Wed, Dec 5, 2018 at 6:44 PM Flavio Pompermaier <[hidden email]> wrote:

Reply | Threaded
Open this post in threaded view
|

Re: Query big mssql Data Source [Batch]

miki haiat
Hi Flavio ,
That working fine for and im able to pull ~17m rows in 20 seconds.

Im a bit confuse regarding the state backhand ,  
I could find a way to configure it so im guessing the data is in the memory ...

thanks,
Miki
 


On Thu, Dec 6, 2018 at 12:06 PM Flavio Pompermaier <[hidden email]> wrote:
the constructor of NumericBetweenParametersProvider takes 3 params: long fetchSize, long minVal, long maxVal.
If you want parallelism you should use a  1 < fetchSize  < maxVal.
In your case, if you do new NumericBetweenParametersProvider(50, 3, 300) you will produce 6 parallel tasks: 
  1. SELECT .... BETWEEN 3 and 50
  2. SELECT .... BETWEEN 51 and 100
  3. SELECT .... BETWEEN 101 and 150
  4. SELECT .... BETWEEN 151 and 200
  5. SELECT .... BETWEEN 201 and 250
  6. SELECT .... BETWEEN 251 and 300

On Thu, Dec 6, 2018 at 10:32 AM miki haiat <[hidden email]> wrote:
hi Flavio ,

This is the query that im trying to coordinate 
.setQuery("SELECT a, b, c, \n" +
"FROM dbx.dbo.x as tls\n"+
"WHERE tls.a BETWEEN ? and ?"
And this is the way im trying to parameterized

ParameterValuesProvider pramProvider = new NumericBetweenParametersProvider(10000, 3,300);
  
I also tried this way 
 Serializable[][] queryParameters = new String[1][2];
queryParameters[0] = new String[]{"3","300"};

On Wed, Dec 5, 2018 at 6:44 PM Flavio Pompermaier <[hidden email]> wrote:

Reply | Threaded
Open this post in threaded view
|

Re: Query big mssql Data Source [Batch]

Flavio Pompermaier
That inputformat is a batch one, so there's no state backend. You need to output the fetched data somewhere AFAIK
On Thu, Dec 6, 2018 at 3:49 PM miki haiat <[hidden email]> wrote:
Hi Flavio ,
That working fine for and im able to pull ~17m rows in 20 seconds.

Im a bit confuse regarding the state backhand ,  
I could find a way to configure it so im guessing the data is in the memory ...

thanks,
Miki
 


On Thu, Dec 6, 2018 at 12:06 PM Flavio Pompermaier <[hidden email]> wrote:
the constructor of NumericBetweenParametersProvider takes 3 params: long fetchSize, long minVal, long maxVal.
If you want parallelism you should use a  1 < fetchSize  < maxVal.
In your case, if you do new NumericBetweenParametersProvider(50, 3, 300) you will produce 6 parallel tasks: 
  1. SELECT .... BETWEEN 3 and 50
  2. SELECT .... BETWEEN 51 and 100
  3. SELECT .... BETWEEN 101 and 150
  4. SELECT .... BETWEEN 151 and 200
  5. SELECT .... BETWEEN 201 and 250
  6. SELECT .... BETWEEN 251 and 300

On Thu, Dec 6, 2018 at 10:32 AM miki haiat <[hidden email]> wrote:
hi Flavio ,

This is the query that im trying to coordinate 
.setQuery("SELECT a, b, c, \n" +
"FROM dbx.dbo.x as tls\n"+
"WHERE tls.a BETWEEN ? and ?"
And this is the way im trying to parameterized

ParameterValuesProvider pramProvider = new NumericBetweenParametersProvider(10000, 3,300);
  
I also tried this way 
 Serializable[][] queryParameters = new String[1][2];
queryParameters[0] = new String[]{"3","300"};

On Wed, Dec 5, 2018 at 6:44 PM Flavio Pompermaier <[hidden email]> wrote: