Introducing the pathological package for manipulating paths, files and directories

28th April, 2014 richierocks

I was recently hunting for a function that will strip the extension from a file – changing foo.png to foo, and so forth. I was knitting a report, and wanted to replace the file extension of the input with the extension of the output file. (knitr handles this automatically in most cases, but I had some custom logic in there that meant I had to work things out manually.)

Finding file extensions is such a common task that I figured that someone must have written a function to solve the problem already. A quick search using findFn("file extension") from the sos package revealed a few thousand hits. There’s a lot of noise in there, but I found a few promising candidates.

There’s removeExt in the limma package (you can find it on Bioconductor), strip_extension in Kmisc, remove_file_extension, which has identical copies in both spatial.tools and gdalUtils, and extension in the raster package.

To save you the time and effort, I’ve tried them all, and unfortunately they all suck.

At a bare minimum, a file extension stripper needs to be vectorized; deal with different file extensions within that vector; deal with multiple levels of extension (for things like “tar.gz” files); and cope with filenames that have dots other than the extension, with missing values, and with directories. OK, that’s quite a few things, but I’m picky.
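To make those requirements concrete, here is a minimal sketch using file_path_sans_ext from base R’s tools package. That function is not one of the candidates reviewed above; it is used here only as an illustration. It is vectorized and passes missing values through, but it only ever removes the last extension, so it does not meet the “tar.gz” requirement.

```r
library(tools)

# A small test vector: single extension, double extension, no extension,
# a dot inside the filename, and a missing value.
paths <- c("foo.png", "bar.tar.gz", "baz", "quux. quuux.tbz2", NA_character_)

# file_path_sans_ext() strips only the final ".ext" from each element.
stripped <- file_path_sans_ext(paths)
print(stripped)
```

Note that "bar.tar.gz" comes back as "bar.tar" rather than "bar".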

Since all the existing options failed, I’ve made my own function. In fact, I went overboard and created a package of path manipulation utilities, the pathological package. It isn’t on CRAN yet, but you can install it via:

library(devtools)
install_github("richierocks/pathological")

It’s been a while since I’ve used MATLAB, but I have fond recollections of its fileparts function that splits a path up into the directory, filename and extension.

The pathological equivalent is decompose_path, which returns a data.frame with three columns. (It originally returned a character matrix, which is what the output below shows.)

library(pathological)
x <- c(
  "somedir/foo.tgz",         # single extension
  "another dir\\bar.tar.gz", # double extension
  "baz",                     # no extension
  "quux. quuux.tbz2",        # single ext, dots in filename
  R.home(),                  # a dir
  "~",                       # another dir
  "~/quuuux.tar.xz",         # a file in a dir
  "",                        # empty 
  ".",                       # current dir
  "..",                      # parent dir
  NA_character_              # missing
)
(decomposed <- decompose_path(x))
##                          dirname                      filename      extension
## somedir/foo.tgz         "d:/workspace/somedir"       "foo"         "tgz"    
## another dir\\bar.tar.gz "d:/workspace/another dir"   "bar"         "tar.gz" 
## baz                     "d:/workspace"               "baz"         ""       
## quux. quuux.tbz2        "d:/workspace"               "quux. quuux" "tbz2"   
## C:/PROGRA~1/R/R-31~1.0  "C:/Program Files/R/R-3.1.0" ""            ""       
## ~                       "C:/Users/richie/Documents"  ""            ""       
## ~/quuuux.tar.xz         "C:/Users/richie/Documents"  "quuuux"      "tar.xz" 
##                         ""                           ""            ""       
## .                       "d:/workspace"               ""            ""       
## ..                      "d:/"                        ""            ""       
## <NA>                    NA                           NA            NA       
## attr(,"class")
## [1] "decomposed_path" "matrix"
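
For readers who want to see the general idea without installing anything, the following is a rough base-R sketch of this kind of three-way split. It is not the package’s implementation: split_path_sketch is a hypothetical name, the regular expression is an assumption about what counts as an extension (a trailing run of dot-separated alphanumeric chunks), and unlike decompose_path it neither makes the directory absolute nor checks whether a path is a directory.

```r
# Hypothetical helper, for illustration only (not part of pathological).
split_path_sketch <- function(x) {
  # Group 1: the shortest possible name; group 2: a trailing run of
  # ".alnum" chunks, e.g. ".tar.gz". Needs perl = TRUE for the lazy match.
  pattern <- "^(.+?)((?:\\.[[:alnum:]]+)*)$"
  base <- basename(x)
  data.frame(
    dirname   = dirname(x),
    filename  = sub(pattern, "\\1", base, perl = TRUE),
    # Drop the leading dot from the captured extension.
    extension = sub("^\\.", "", sub(pattern, "\\2", base, perl = TRUE)),
    stringsAsFactors = FALSE
  )
}

res <- split_path_sketch(c(
  "somedir/foo.tgz",         # single extension
  "another dir/bar.tar.gz",  # double extension
  "baz",                     # no extension
  "quux. quuux.tbz2",        # single ext, dots in filename
  NA_character_              # missing
))
print(res)
```

Because the result is a data.frame, individual parts can be pulled out with $, for example res$extension.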

There are some shortcut functions to get at different parts of the filename:

get_extension(x)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz 
##                   "tgz"                "tar.gz"                      "" 
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~ 
##                  "tbz2"                      ""                      "" 
##         ~/quuuux.tar.xz                                               . 
##                "tar.xz"                      ""                      "" 
##                      ..                    <NA> 
##                      ""                      NA 
                     
strip_extension(x)
##  [1] "d:/workspace/somedir/foo"         "d:/workspace/another dir/bar"    
##  [3] "d:/workspace/baz"                 "d:/workspace/quux. quuux"        
##  [5] "C:/Program Files/R/R-3.1.0"       "C:/Users/richie/Documents"       
##  [7] "C:/Users/richie/Documents/quuuux" "/"                               
##  [9] "d:/workspace"                     "d:/"                             
## [11] NA 

strip_extension(x, include_dir = FALSE)
##         somedir/foo.tgz another dir\\bar.tar.gz                     baz 
##                   "foo"                   "bar"                   "baz" 
##        quux. quuux.tbz2  C:/PROGRA~1/R/R-31~1.0                       ~ 
##           "quux. quuux"                      ""                      "" 
##         ~/quuuux.tar.xz                                               . 
##                "quuuux"                      ""                      "" 
##                      ..                    <NA> 
##                      ""                      NA

You can also get your original file location (in a standardised form) using

recompose_path(decomposed)
##  [1] "d:/workspace/somedir/foo.tgz"           
##  [2] "d:/workspace/another dir/bar.tar.gz"    
##  [3] "d:/workspace/baz"                       
##  [4] "d:/workspace/quux. quuux.tbz2"          
##  [5] "C:/Program Files/R/R-3.1.0"             
##  [6] "C:/Users/richie/Documents"              
##  [7] "C:/Users/richie/Documents/quuuux.tar.xz"
##  [8] "/"                                      
##  [9] "d:/workspace"                           
## [10] "d:/"                                    
## [11] NA

The package also contains a few other path utilities. The standardisation I mentioned comes from standardise_path (standardize_path is also available for Americans), and there’s a dir_copy function for copying directories.
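
As a point of comparison, here is a sketch of how similar tasks look with base R alone. This is an approximation under stated assumptions, not a description of how standardise_path or dir_copy behave: normalizePath and file.copy are base functions, and the src_dir and dst_dir names are made up for the example.

```r
# Roughly what "standardising" means in base R: expand "~", resolve "."
# and "..", and use forward slashes on every platform.
std <- normalizePath(c("~", ".", ".."), winslash = "/", mustWork = FALSE)
print(std)

# Copying a directory tree with base R only, using throwaway directories.
src <- file.path(tempdir(), "src_dir")   # hypothetical source directory
dst <- file.path(tempdir(), "dst_dir")   # hypothetical destination
dir.create(src, showWarnings = FALSE)
dir.create(dst, showWarnings = FALSE)
writeLines("hello", file.path(src, "a.txt"))

# With recursive = TRUE, src_dir itself is copied *into* dst_dir.
copied <- file.copy(src, dst, recursive = TRUE)
print(list.files(dst, recursive = TRUE))
```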

It’s brand new, so after I’ve complained about other people’s code, I’m sure karma will ensure that you’ll find a bug or two, but I hope you find it useful.

Tags: directories, files, packages, pathological, paths, r

Comments (8)

Bill

29th April, 2014 at 11:34 am

Thanks, your function looks useful for a lot of times when I need to strip the extensions. One potential feature request (having looked at the code but not used it quite yet): Could it have an option to either select all extensions (like your .tar.gz example) or just the last extension (to just grab the .gz)? More generally, could a “max extensions” option be given so that it splits by dots and then chooses up to max extensions positions as the extension?

And a potential corner case for a lot of Linux/Unix users could be a file that begins with a “.” which is hidden in *nix.
- richierocks
  
  29th April, 2014 at 12:07 pm
  
  Kmisc::strip_extension has a lvl argument which lets you specify how many dots-worth of extension to strip. (Be warned that this function fails weirdly if you pass it a vector.) I thought about replicating the feature, but decompose_path is already surprisingly complicated and I didn’t want to add to this for such a rare use case. (Can you think of a real-world example where you’d really want to just grab the gz?)
  
  As for filenames starting with a dot, that’s taken care of. Try, for example, decompose_path(c(".x.tgz", ".x.tar.gz")).
  - Bill
    
    29th April, 2014 at 12:33 pm
    
    Often, I create files that are named data-type.date.csv (more concretely: “”), and so in that case, I’d like to just snip the .csv and keep the date with the filename.
    - richierocks
      
      29th April, 2014 at 13:20 pm
      
      Since you are using hyphens in your date, decompose_path is already smart enough to realise that that part isn’t a file extension, so it correctly identifies the file extension as being csv. :)
      
      You could fool it with a name like “”, but I suggest the workaround is just to not have perverse file naming like that.
Anonymous

30th April, 2014 at 8:30 am

I think Karma already got you in your Karma sentence.
30th April, 2014 at 13:29 pm

Why not make the result a data frame so you can extract pieces with $?
richierocks

30th April, 2014 at 17:11 pm

I chose a matrix because it was the simpler object but yes, you’re right, a data.frame would be easier to work with. Fixed in github.
Barry

12th June, 2014 at 12:28 pm

When’s this going on CRAN so we can stick it in our dependses and that multitude of file extension strippers you found can deprecate themselves?

4D Pie Charts

Richie Cotton
