dataPreparation


Data preparation accounts for about 80% of the work during a data science project. Let's bring that number down. dataPreparation lets you do most of the painful data preparation for a data science project with a minimal amount of code.

This package is

  • fast (uses data.table and exponential search)
  • RAM efficient (performs operations by reference and column-wise to avoid copying data)
  • stable (most exceptions are handled)
  • verbose (logs a lot)
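
To illustrate what "by reference" means in practice, here is a minimal sketch using plain data.table, not dataPreparation itself. The toy table and its column names are made up for the example.

```r
library(data.table)

# Toy table (hypothetical data, for illustration only)
dt <- data.table(id = 1:3, price = c(10.4, 20.6, 30))
addr_before <- address(dt)

# `:=` adds a column in place: no copy of the table is made
dt[, price_rounded := round(price)]

# set() also updates by reference; here it replaces one column
set(dt, j = "id", value = as.character(dt[["id"]]))

# The table is still the same object in memory
same_object <- identical(addr_before, address(dt))
print(same_object)
```

Because nothing is copied, memory use stays close to the size of the data itself. The flip side is that the input object is modified, so use `data.table::copy()` first when the original must be preserved.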

Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are as follows:

  • Read: load the data set (this package does not cover this step; for CSV files we recommend data.table::fread)
  • Correct: most of the time there are some mistakes after reading (wrong formats, etc.) that have to be corrected
  • Transform: create new features from date, categorical, character... columns so that the information is usable by an ML algorithm (i.e. numeric or categorical)
  • Filter: get rid of useless information in order to speed up computation
  • Pre-model transformation: manipulations specific to the chosen model (handling NA, discretization, one-hot encoding, scaling...)
  • Shape: put your data set in a shape usable by an ML algorithm

Here are the functions available in this package to tackle those issues:

| Correct                  | Transform              | Filter                    | Pre model manipulation | Shape              |
|--------------------------|------------------------|---------------------------|------------------------|--------------------|
| unFactor                 | generateDateDiffs      | fastFilterVariables       | fastHandleNa           | shapeSet           |
| findAndTransformDates    | generateFactorFromDate | whichAreConstant          | fastDiscretization     | sameShape          |
| findAndTransformNumerics | aggregateByKey         | whichAreInDouble          | fastScale              | setAsNumericMatrix |
| setColAsCharacter        | generateFromFactor     | whichAreBijection         | one_hot_encoder        |                    |
| setColAsNumeric          | generateFromCharacter  | remove_sd_outlier         |                        |                    |
| setColAsDate             | fastRound              | remove_rare_categorical   |                        |                    |
| setColAsFactor           | target_encode          | remove_percentile_outlier |                        |                    |

All of those functions are integrated in the full pipeline function prepareSet.

For more details on how it works, check out our tutorial.
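
The steps can also be run one at a time instead of through prepareSet. The sketch below chains a few of the functions listed above on the bundled messy_adult data set. It assumes each function takes the data set as its first argument and accepts a `verbose` argument; these are assumptions about the signatures, and names and arguments may differ between package versions, so check the documentation for your installed version.

```r
library(dataPreparation)

data(messy_adult)

# Work on a copy: the package modifies data by reference,
# so this keeps messy_adult untouched for comparison.
adult <- data.table::copy(messy_adult)

# Correct: fix column types that were misread
adult <- unFactor(adult, verbose = FALSE)
adult <- findAndTransformNumerics(adult, verbose = FALSE)
adult <- findAndTransformDates(adult, verbose = FALSE)

# Filter: drop columns that carry no usable information
adult <- fastFilterVariables(adult, verbose = FALSE)

# Compare dimensions before and after
print(dim(messy_adult))
print(dim(adult))
```

Calling the steps individually is useful when only part of the pipeline is needed, or when you want to inspect the data between steps.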

Getting started: 30 seconds to dataPreparation

Installation

Install the package from CRAN:

install.packages("dataPreparation")

To get the latest features, install the package from GitHub:

library(devtools)
install_github("ELToulemonde/dataPreparation")

Test it

Load a toy data set

library(dataPreparation)
data(messy_adult)
head(messy_adult)

Run the full pipeline function

clean_adult <- prepareSet(messy_adult)
head(clean_adult)

That's it. For all functions, you can check out the documentation and/or the tutorial vignette.

How to Contribute

dataPreparation has been developed and used by many active community members. Your help is very valuable to make it better for everyone.

  • Check out the call for contributions to see what can be improved, or open an issue if you want something.
  • Contribute new useful features.
  • Contribute to the tests to make the package more reliable.
  • Contribute to the documentation to make it clearer for everyone.
  • Contribute to the examples to share your experience with other users.
  • Open an issue if you run into problems during development.

For more details, please refer to CONTRIBUTING.
