part files written to HDFS with .pending extension

part files written to HDFS with .pending extension

Krishnanand Khambadkone
Hi, I have written a small program that uses a Twitter input stream and an HDFS output sink. When the files are written to HDFS, each part file in the directory has a .pending extension. I am able to cat the files and see the tweet text. Is it normal for the part files to have a .pending extension?

-rw-r--r--   3 user supergroup      46399 2017-09-01 16:35 /flinktwitter/2017-09-01--1635/_part-0-95.pending
-rw-r--r--   3 user supergroup      54861 2017-09-01 16:35 /flinktwitter/2017-09-01--1635/_part-0-96.pending
-rw-r--r--   3 user supergroup      41878 2017-09-01 16:35 /flinktwitter/2017-09-01--1635/_part-0-97.pending
-rw-r--r--   3 user supergroup      42813 2017-09-01 16:35 /flinktwitter/2017-09-01--1635/_part-0-98.pending
-rw-r--r--   3 user supergroup      42887 2017-09-01 16:35 /flinktwitter/2017-09-01--1635/_part-0-99.pending



Re: part files written to HDFS with .pending extension

Krishnanand Khambadkone
BTW, I am using a BucketingSink and a DateTimeBucketer. Do I need to set any other property to move the files out of the .pending state?

BucketingSink<String> sink = new BucketingSink<String>("hdfs://localhost:8020/flinktwitter/");
sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HHmm"));


Re: part files written to HDFS with .pending extension

Urs Schoenenberger
Hi,

You need to enable checkpointing for your job. Flink uses the ".pending"
extension to mark part files that have been completely written but are
not yet included in a checkpoint.
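
The behaviour described above can be pictured with a small toy model. This is not Flink code: the class PendingFileModel and its methods are hypothetical, and the model assumes that the final file name is simply the pending name without its "_" prefix and ".pending" suffix. It only illustrates why files stay pending when no checkpoint ever completes.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model (not Flink code) of the rename-on-checkpoint behaviour.
 * All names in this class are hypothetical.
 */
class PendingFileModel {
    // Part files that are completely written but not yet covered by a checkpoint.
    private final List<String> pending = new ArrayList<>();
    // Part files that a completed checkpoint has covered.
    private final List<String> finished = new ArrayList<>();

    /** Called when a part file has been completely written. */
    void closePartFile(String partName) {
        pending.add("_" + partName + ".pending");
    }

    /** Called when a checkpoint completes: every pending file gets its final name. */
    void notifyCheckpointComplete() {
        for (String name : pending) {
            // Strip the leading "_" and the trailing ".pending" (assumed naming scheme).
            finished.add(name.substring(1, name.length() - ".pending".length()));
        }
        pending.clear();
    }

    /** Returns finished files first, then files that are still pending. */
    List<String> listFiles() {
        List<String> all = new ArrayList<>(finished);
        all.addAll(pending);
        return all;
    }

    public static void main(String[] args) {
        PendingFileModel bucket = new PendingFileModel();
        bucket.closePartFile("part-0-95");
        bucket.closePartFile("part-0-96");
        // Without checkpointing, nothing ever calls notifyCheckpointComplete(),
        // so the files keep their pending names.
        System.out.println(bucket.listFiles()); // [_part-0-95.pending, _part-0-96.pending]
        bucket.notifyCheckpointComplete();
        System.out.println(bucket.listFiles()); // [part-0-95, part-0-96]
    }
}
```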

Once you enable checkpointing, the .pending extensions will be removed
whenever a checkpoint completes.
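
For the job in question, enabling checkpointing might look roughly like the sketch below. The sink setup is taken from the snippet earlier in the thread; the class name TwitterToHdfsJob, the 60-second interval, the job name, and the socket source (a stand-in for the Twitter stream, which is not shown in the thread) are assumptions made for illustration. It needs the Flink streaming and filesystem connector dependencies on the classpath.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class TwitterToHdfsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (the interval is an assumed value).
        // Pending part files are finalized when a checkpoint completes.
        env.enableCheckpointing(60_000L);

        // Stand-in source (assumption); the original job reads from a Twitter stream.
        DataStream<String> tweets = env.socketTextStream("localhost", 9999);

        // Sink configuration as posted earlier in the thread.
        BucketingSink<String> sink = new BucketingSink<String>("hdfs://localhost:8020/flinktwitter/");
        sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HHmm"));
        tweets.addSink(sink);

        env.execute("Twitter to HDFS");
    }
}
```

With the interval above, a part file may keep its .pending name for up to about a minute after it is closed, until the next checkpoint completes.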

Regards,
Urs

--
Urs Schönenberger - [hidden email]

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082

Re: Re: part files written to HDFS with .pending extension

Krishnanand Khambadkone
Yes, I enabled checkpointing and now the files no longer have the .pending extension.

Thank you Urs.
