harshildarji / DataScienceLab

Data Science Lab - SS - 2019

Dataset: https://archive.org/download/archiveteam-twitter-stream-2018-04/archiveteam-twitter-stream-2018-04.tar

filter.py
This script goes through all the JSON files in the dataset folder and stores a tweet only if it meets the following criteria:
- extended_tweet is NOT null
- lang is en (English)
- Tweet contains word(s) defined in the keyWords list
It does not store every detail of a tweet, only the features we need for our purpose:
- Twitter User Description
- Tweet
All this information is stored in CSV format (saved as all_data.csv).
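The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the actual filter.py: it assumes one JSON object per line, the usual Twitter field names (`extended_tweet.full_text`, `lang`, `user.description`), case-insensitive substring matching, and a placeholder keyword list. The function names `extract_row` and `filter_lines` are hypothetical.

```python
import csv
import io
import json

# Placeholder keywords (assumption); the real keyWords list is defined in filter.py.
KEY_WORDS = ["migration", "refugee"]


def extract_row(tweet, key_words=KEY_WORDS):
    """Return (user description, tweet text) if the tweet passes all filters, else None."""
    extended = tweet.get("extended_tweet")
    if not extended:                      # extended_tweet must not be null
        return None
    if tweet.get("lang") != "en":         # English tweets only
        return None
    text = extended.get("full_text", "")
    lowered = text.lower()
    # Assumed matching rule: keep the tweet if any keyword occurs as a substring.
    if not any(word.lower() in lowered for word in key_words):
        return None
    description = (tweet.get("user") or {}).get("description") or ""
    return description, text


def filter_lines(lines, writer, key_words=KEY_WORDS):
    """Parse one JSON object per line, write matching rows, and return how many were kept."""
    kept = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip malformed lines
        if not isinstance(tweet, dict):
            continue
        row = extract_row(tweet, key_words)
        if row is not None:
            writer.writerow(row)
            kept += 1
    return kept
```

To use it on a file, open the JSON file for reading, open all_data.csv with `newline=""`, and pass the file object and a `csv.writer` to `filter_lines`.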
label.py
Since all the selected tweets need to be annotated manually, this script provides a simple command-line interface to help with that.
It presents the user with one tweet at a time (from all_data.csv, line by line), and the user enters 1 or 0, where:
- 1: Tweet is migration relevant
- 0: Tweet is NOT migration relevant
Once the user hits Enter, the label is stored in train_label.csv.
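A labelling loop of this kind could look roughly like the sketch below. It is not the actual label.py: the column layout (tweet text in the last column), the output row format (original columns plus the label), and the names `ask_label` and `label_file` are all assumptions made for illustration.

```python
import csv


def ask_label(tweet, ask=input):
    """Show one tweet and keep prompting until the user enters 1 or 0."""
    while True:
        answer = ask(f"\n{tweet}\nMigration relevant? [1/0]: ").strip()
        if answer in ("0", "1"):
            return int(answer)
        print("Please enter 1 or 0.")


def label_file(in_path="all_data.csv", out_path="train_label.csv", ask=input):
    """Walk through in_path row by row and append each row plus its label to out_path."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "a", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue
            label = ask_label(row[-1], ask)   # assumed: tweet text is the last column
            writer.writerow(row + [label])
            dst.flush()                       # keep progress if the session is interrupted


if __name__ == "__main__":
    label_file()
```

Appending and flushing after every row means an interrupted session keeps the labels entered so far, although this sketch does not resume from where it stopped.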
annotation.ipynb
This notebook trains classifiers on the labelled data and evaluates them.
Pipeline (for now):
- Import the data and remove rows with null values in any column
- Balance the dataset using SMOTE
- Prepare the TF-IDF and Doc2Vec feature extraction techniques
- Pass the data and labels to both techniques, then train classifiers on the resulting feature vectors
- Perform classification on a separate validation set
- Print and plot the results
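To illustrate the balancing step, here is a simplified pure-Python sketch of the idea behind SMOTE: new minority-class samples are created by interpolating between an existing minority sample and one of its nearest minority neighbours. It is not the notebook's code and not a library implementation. It assumes the tweets have already been turned into numeric feature vectors, and the function name `smote_oversample` and its parameters are hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, rng=None):
    """Create n_new synthetic points from a list of minority-class feature vectors."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority neighbours of the chosen base sample.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy example: 5 majority samples vs. 3 minority samples, so 2 synthetic points are needed.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced_minority = minority + smote_oversample(minority, n_new=2)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies rather than being exact duplicates.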
