Topic Evolution in Life Sciences Research

September 10, 2017

In this post, I'll walk through my thought process during my capstone project for the Data Science Immersive program at General Assembly (GA). My GitHub repo contains the write-up, the presentation slides, the code itself, some data visualizations, and the references. You are welcome to check it out if you want more detail.

Save Your Work as You Go

August 26, 2017

When you run code that may take a while, for example web scraping or extracting text from many XML files, you may want to write the output to disk as you go, for a couple of reasons. One is to limit memory use, since you are not storing everything in memory; the other is to avoid losing all your work if there is a power outage, an internet outage, or your code throws an error.
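As a minimal sketch of the idea (the function and file names here are my own placeholders, not from the original post), you can open the output file in append mode and flush after each item, so a crash loses at most the item currently being processed:

```python
import csv

def save_as_you_go(items, process, out_path="results.csv"):
    """Process items one at a time, appending each result to disk
    immediately, so a crash loses at most the current item."""
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for item in items:
            writer.writerow(process(item))
            f.flush()  # push the row out to the OS right away
```

Because nothing is accumulated in a list, memory stays flat no matter how many items you process.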

Some Plotting Functions You May Find Handy

August 19, 2017

In this post, I'll share a few plotting functions I use regularly.
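The post's actual functions aren't reproduced in this excerpt; as an illustration of the kind of helper meant here (name and defaults are mine), a one-call labeled histogram using matplotlib might look like:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display (e.g., on a server)
import matplotlib.pyplot as plt

def hist_with_title(values, title, bins=30, fname=None):
    """Draw a labeled histogram in one call; optionally save to a file."""
    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_title(title)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    if fname:
        fig.savefig(fname)
    return ax
```

Wrapping the boilerplate (labels, titles, saving) in a function like this keeps notebooks short when you plot the same way dozens of times.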

Schedule Repetitive Tasks Using Crontab

August 12, 2017

As your data science tasks become more sophisticated, you may find yourself having to run scripts repeatedly at fixed intervals. For example, some projects require web scraping, or, as I described in a previous blog post, checking GitHub repos for updates. Cron is perfect for such purposes. Below, I'll describe how to use crontab on UNIX systems (Linux or Mac) to schedule a Python script to run periodically at fixed times.
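The core of the setup is a single line in your user's cron table; the paths below are hypothetical examples, not from the original post:

```shell
# Open your user's cron table in an editor:
crontab -e

# Then add an entry like this one, which runs scrape.py every day at 6:30 AM.
# Field order: minute hour day-of-month month day-of-week command
30 6 * * * /usr/bin/python /Users/me/scripts/scrape.py >> /Users/me/logs/scrape.log 2>&1
```

The `>> … 2>&1` part appends both normal output and errors to a log file, which is the easiest way to see why a scheduled run failed.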

Programmatically Check for Updates in GitHub Repos

August 05, 2017

During the Data Science Immersive (DSI) program at General Assembly (Washington, DC), we usually have a GitHub Enterprise repo for each lecture, which we are asked to fork and clone to our own computers. Periodically, the instructors post updates after we clone a repo, such as solution code for labs or updated code-alongs after lectures. Checking for these updates can be tedious, especially when we don't know exactly when a given repo will be updated (and it doesn't help my OCD). So I wrote a script to automate that. I tested and debugged the script pretty extensively during the DSI program. Unfortunately, by the time I had worked out all the kinks, the repos weren't being updated much anymore. But I hope this will help future cohorts. (The script is shown at the end.)
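The author's full script appears at the end of the original post; as a rough sketch of the general idea (function names are mine), one way to detect updates is to fetch and then parse the `[behind N]` marker that `git status -sb` prints in its branch header:

```python
import re
import subprocess

def commits_behind(branch_header):
    """Parse the '[behind N]' marker from the first line of `git status -sb`."""
    m = re.search(r"behind (\d+)", branch_header)
    return int(m.group(1)) if m else 0

def check_repo(repo_path):
    """Fetch, then report how many upstream commits this clone is missing."""
    subprocess.run(["git", "-C", repo_path, "fetch", "--quiet"], check=True)
    out = subprocess.run(
        ["git", "-C", repo_path, "status", "-sb"],
        capture_output=True, text=True, check=True,
    ).stdout
    return commits_behind(out.splitlines()[0])
```

Looping `check_repo` over a directory of clones and printing any repo with a nonzero count gives you a one-command update report.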

A Step-by-Step Guide to Set Up Python Scripts for System-Wide Use

July 29, 2017

As you develop more and more Python code for data science tasks, you may find yourself using the same functions over and over (e.g., a customized EDA function). Instead of copying and pasting code snippets into your current Jupyter notebook, there are slicker ways to call Python functions from anywhere on your computer without having to specify the path, which I'll describe in this post.
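One common approach (there are others the post may cover) is to keep shared modules in a dedicated directory and add that directory to `PYTHONPATH`; the directory and module names below are hypothetical:

```shell
# Hypothetical layout: your reusable functions live in ~/pylib/my_tools.py
mkdir -p ~/pylib

# Make that directory visible to every Python interpreter you launch from
# a shell (append to ~/.bash_profile on Mac, ~/.bashrc on Linux):
echo 'export PYTHONPATH="$HOME/pylib:$PYTHONPATH"' >> ~/.bash_profile
source ~/.bash_profile

# Now `import my_tools` works from any directory, with no path needed.
```

Python prepends the directories in `PYTHONPATH` to `sys.path`, which is why the import then resolves from anywhere.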

Exploratory Data Analysis (IV)

July 22, 2017

One basic yet often effective approach in exploratory data analysis that I have not mentioned in my EDA series (I, II, III) is visually inspecting the raw data:

  • check the column names (and understand what they mean)
  • check the values: are they numbers, characters, etc.? Do they make sense?
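With pandas, that inspection takes only a few calls; the tiny DataFrame below is made up purely for illustration:

```python
import pandas as pd

# A made-up dataset standing in for your raw data
df = pd.DataFrame({"age": [34, 29, 41], "city": ["DC", "NYC", "LA"]})

print(df.columns.tolist())  # check the column names
print(df.head())            # eyeball the first few raw values
print(df.dtypes)            # numbers (int64) vs. characters (object)
```

If a column you expected to be numeric shows up with `object` dtype, that is often the first sign of stray characters or inconsistent formatting in the raw file.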

Exploratory Data Analysis (III)

July 15, 2017

In the first post of this series, I talked about how to use the describe method in pandas to get a sense of the distributions of the data. While describe gives a pretty comprehensive summary of numeric data, the same cannot be said for categorical data. For that purpose, the value_counts method of the Series class is quite handy. I've written a wrapper function for it for easier control:
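The original wrapper isn't reproduced in this excerpt; a minimal sketch of such a wrapper (the name, defaults, and percentage column are my own choices) might look like:

```python
import pandas as pd

def value_counts_summary(s, top=10, dropna=False):
    """Counts and percentages for the `top` most frequent values of a Series,
    including missing values by default."""
    counts = s.value_counts(dropna=dropna)
    pct = (counts / len(s) * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": pct}).head(top)

s = pd.Series(["a", "b", "a", None, "a"])
print(value_counts_summary(s))
```

Keeping `dropna=False` as the default is deliberate: missing categories are often exactly what you want EDA to surface.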

Exploratory Data Analysis (II)

July 08, 2017

In my last post, I discussed the importance of exploratory data analysis (EDA) and the use of the describe method in pandas. Today I'll discuss another aspect of EDA: checking for missing and duplicated data.
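As a quick illustration with a throwaway DataFrame (the data here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 1], "b": ["x", "y", "x"]})

print(df.isnull().sum())              # missing values per column
print(df.duplicated().sum())          # count of fully duplicated rows
print(df[df.duplicated(keep=False)])  # show every copy of duplicated rows
```

`keep=False` in the last line marks all copies of a duplicated row rather than all-but-the-first, which makes it easier to inspect the duplicates side by side.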

Exploratory Data Analysis (I)

July 01, 2017

Exploratory data analysis (EDA) is a crucial part of data science, during which we look at the size of the dataset, the variables within it, their distributions, and the relationships between them. We also want to identify missing data and outliers. The EDA step tells us where a project should go next:
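Later posts in this series lean on pandas' describe method; as a quick illustration on made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"height": [150, 160, 170, 180, 190]})
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
```

One call gives the size, center, spread, and range of every numeric column, which already answers several of the questions above.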