ELToulemonde / dataPreparation

dataPreparation

Data preparation accounts for about 80% of the work during a data science project. Let's take that number down. dataPreparation will allow you to do most of the painful data preparation for a data science project with a minimum amount of code.

This package is

fast (uses data.table and exponential search)
RAM efficient (performs operations by reference and column-wise to avoid copying data)
stable (most exceptions are handled)
verbose (logs a lot)
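
To make "by reference and column-wise" concrete, here is a minimal sketch using plain data.table calls. It is not taken from the package's source; the table and its column names are invented for illustration.

```r
library(data.table)

# Hypothetical toy table: 'price' was read as text and needs converting
dt <- data.table(id = 1:5, price = c("1.5", "2.0", "3.25", "4.0", "5.75"))
addr_before <- address(dt)

# `:=` replaces the column inside the existing table; the table is not copied
dt[, price := as.numeric(price)]

# set() does the same kind of in-place update with less overhead,
# which is convenient when looping over columns
for (col in names(dt)) {
  if (is.double(dt[[col]])) {
    set(dt, j = col, value = dt[[col]] * 2)
  }
}

addr_after <- address(dt)

# By contrast, copy() allocates a new table at a different address
dt_copy <- copy(dt)
```

After running this, `addr_before` and `addr_after` should be identical, while `address(dt_copy)` differs: the column updates happened inside the original object.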

Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are the following:

Read: load the data set (this package does not cover this step; for CSV files we recommend data.table::fread)
Correct: most of the time there are some mistakes after reading (wrong formats, for example) that have to be corrected
Transform: create new features from date, categorical, or character columns in order to have information usable by an ML algorithm (i.e. numeric or categorical)
Filter: get rid of useless information in order to speed up computation
Pre model transformation: manipulations specific to the chosen model (handling NA, discretization, one hot encoding, scaling...)
Shape: put your data set in a shape usable by an ML algorithm
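
The steps above can be sketched by hand on a tiny table. This sketch deliberately uses only data.table and base R, not the package's own functions, so it shows the idea of each step rather than how dataPreparation implements it. The inline CSV and its column names are made up.

```r
library(data.table)

# Read: fread accepts inline text; everything is read as character here
# to mimic a messy import
dt <- fread(text = "id,signup,amount,country,constant
1,2017-01-05,10.5,FR,a
2,2017-02-11,3.0,US,a
3,2017-03-20,7.25,FR,a", colClasses = "character")

# Correct: give each column its proper type
dt[, id := as.integer(id)]
dt[, signup := as.Date(signup)]
dt[, amount := as.numeric(amount)]

# Transform: derive a numeric feature from the date, a factor from the text
dt[, signup_month := as.integer(format(signup, "%m"))]
dt[, country := as.factor(country)]
dt[, signup := NULL]

# Filter: drop columns that hold a single value
constant_cols <- names(dt)[vapply(dt, function(x) length(unique(x)) == 1L, logical(1))]
if (length(constant_cols) > 0) dt[, (constant_cols) := NULL]

# Pre model transformation: center and scale a numeric column
dt[, amount_scaled := (amount - mean(amount)) / sd(amount)]

# Shape: numeric matrix (factors become their integer codes)
X <- data.matrix(dt)
```

In a real project each step needs to discover the affected columns automatically, which is the part the package functions listed below are meant to take care of.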

Here are the functions available in this package to tackle those issues:

| Correct | Transform | Filter | Pre model manipulation | Shape |
|---|---|---|---|---|
| unFactor | generateDateDiffs | fastFilterVariables | fastHandleNa | shapeSet |
| findAndTransformDates | generateFactorFromDate | whichAreConstant | fastDiscretization | sameShape |
| findAndTransformNumerics | aggregateByKey | whichAreInDouble | fastScale | setAsNumericMatrix |
| setColAsCharacter | generateFromFactor | whichAreBijection | | one_hot_encoder |
| setColAsNumeric | generateFromCharacter | remove_sd_outlier | | |
| setColAsDate | fastRound | remove_rare_categorical | | |
| setColAsFactor | target_encode | remove_percentile_outlier | | |

All of those functions are integrated in the full pipeline function prepareSet.

For more details on how it works, check our tutorial.

Getting started: 30 seconds to dataPreparation

Installation

Install the package from CRAN:

install.packages("dataPreparation")

To get the latest features, install the package from GitHub:

# devtools itself must be installed first: install.packages("devtools")
library(devtools)
install_github("ELToulemonde/dataPreparation")

Test it

Load a toy data set

library(dataPreparation)
data(messy_adult)
head(messy_adult)

Run the full pipeline function

clean_adult <- prepareSet(messy_adult)
head(clean_adult)

That's it. For all functions, you can check out the documentation and/or the tutorial vignette.

How to Contribute

dataPreparation has been developed and used by many active community members. Your help is very valuable to make it better for everyone.

Check out the call for contributions to see what can be improved, or open an issue if you want something.
Contribute new useful features.
Contribute to the tests to make it more reliable.
Contribute to the documentation to make it clearer for everyone.
Contribute to the examples to share your experience with other users.
Open an issue if you run into problems during development.

For more details, please refer to CONTRIBUTING.
