Data Science Lab - SS - 2019

Dataset: https://archive.org/download/archiveteam-twitter-stream-2018-04/archiveteam-twitter-stream-2018-04.tar

  1. filter.py
    This script goes through all the JSON files in the dataset folder and stores a tweet only if it matches all of the following criteria:
    - extended_tweet is NOT null
    - lang is en (English)
    - The tweet contains one or more words defined in the keyWords list
    It does not store every detail of a tweet, only the features required for our purpose:
    - Twitter User Description
    - Tweet
    This information is stored in CSV format (saved as all_data.csv).
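
    The filtering rule above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of filter.py: it assumes one JSON object per input line, that the full tweet text sits in `extended_tweet.full_text` and the user description in `user.description` (as in Twitter streaming payloads), and that keywords are matched as case-insensitive substrings. `KEY_WORDS`, `keep_tweet`, `to_row` and `filter_lines` are hypothetical names.

```python
import csv
import io
import json

# Hypothetical keyword list; the real keyWords list is defined in filter.py.
KEY_WORDS = ["migration", "refugee"]


def keep_tweet(tweet, key_words=KEY_WORDS):
    """Return True if the tweet passes all three criteria."""
    extended = tweet.get("extended_tweet")
    if not extended:  # extended_tweet missing or null
        return False
    if tweet.get("lang") != "en":  # English tweets only
        return False
    text = extended.get("full_text", "").lower()
    # Assumption: a keyword counts as present if it occurs as a substring.
    return any(word.lower() in text for word in key_words)


def to_row(tweet):
    """Keep only the two stored features: user description and tweet text."""
    user = tweet.get("user") or {}
    return [user.get("description") or "", tweet["extended_tweet"]["full_text"]]


def filter_lines(lines, out_file):
    """Write matching tweets as CSV rows and return how many were kept."""
    writer = csv.writer(out_file)
    kept = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if keep_tweet(tweet):
            writer.writerow(to_row(tweet))
            kept += 1
    return kept
```

    To produce all_data.csv, `filter_lines` would be called once per dataset file with an output file opened via `open("all_data.csv", "a", newline="")`.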

  2. label.py
    Since all the selected tweets need to be annotated manually, this script provides a simple command line interface to help with that.
    It presents the user with one tweet at a time (from all_data.csv, line by line), and the user enters 1 or 0, where:
    - 1: Tweet is migration relevant
    - 0: Tweet is NOT migration relevant
    Once the user hits enter, the label is stored in train_label.csv.
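
    A labelling loop of this kind might look like the sketch below. It is an illustration under assumptions rather than the code in label.py: each input row is taken to be a (description, tweet) pair, only the label itself is written to the output, and invalid input is re-prompted. The function name `label_tweets` and the `ask` parameter are hypothetical; `ask` defaults to the built-in `input` and is replaced by scripted answers in the short demo at the end.

```python
import csv
import io


def label_tweets(rows, out_file, ask=input):
    """Show each tweet, read a 1/0 label, and write it to out_file."""
    writer = csv.writer(out_file)
    count = 0
    for description, tweet in rows:
        print("User description:", description)
        print("Tweet:", tweet)
        answer = ask("Migration relevant? [1/0]: ").strip()
        # Assumption: anything other than 1 or 0 is rejected and asked again.
        while answer not in ("0", "1"):
            answer = ask("Please enter 1 or 0: ").strip()
        writer.writerow([answer])
        count += 1
    return count


# Non-interactive demo: scripted answers stand in for keyboard input.
answers = iter(["1", "x", "0"])
out = io.StringIO()
n = label_tweets(
    [("bio a", "tweet a"), ("bio b", "tweet b")],
    out,
    ask=lambda prompt: next(answers),
)
```

    In real use, `rows` would come from `csv.reader` over all_data.csv and `out_file` would be train_label.csv opened in append mode, so that a labelling session can be interrupted and resumed.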

  3. annotation.ipynb
    This notebook trains classifiers on the labelled data and evaluates them.
    Pipeline (for now):
    - Import the data and remove rows with a null value in any column
    - Balance the dataset using SMOTE
    - Prepare the TF-IDF and Doc2Vec feature extraction techniques
    - Pass the data and labels to both techniques, then train classifiers on the resulting feature vectors
    - Perform classification on a separate validation set
    - Print and plot the results