In this blog post, I will walk through my thought process during my capstone project for the Data Science Immersive program at General Assembly (GA). In my GitHub repo, I have provided the write-up, the presentation slides, the code, some data visualizations, and the references. You are welcome to check it out if you want more detail.
When you run code that may take a while, for example, web scraping or extracting text from many XML files, you may want to write the output to disk as you go, for a couple of reasons. One is to release memory, since you are not storing everything in memory; another is to avoid losing all your work if there is a power outage, an internet outage, or your code throws an error.
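Here is a minimal sketch of this pattern (the function name and the CSV format are just illustrative assumptions, not tied to any particular project): each result is appended and flushed to disk as soon as it is produced, so an interruption loses at most the row in progress.

```python
import csv

def append_rows(path, rows):
    """Append rows to a CSV file as they are produced, flushing after
    each one so that a crash or outage loses at most the row in progress."""
    with open(path, "a", newline="") as f:  # "a" appends, never overwrites
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
            f.flush()  # push the buffered row to disk right away
```

In a scraping loop, you would call this (or just the inner `writerow`/`flush` pair) once per page fetched, rather than collecting everything in a list and writing at the end.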
As your data science tasks become more and more sophisticated, you may find yourself having to run scripts repeatedly at fixed intervals.
For example, some projects require webscraping, or, as I described in a previous blog post, checking GitHub repos for updates.
Cron is perfect for such purposes.
I’ll describe below how to use crontab on UNIX-like systems (Linux or macOS) to schedule a Python script to run periodically at fixed times.
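For example, a crontab entry like the following (the script and log paths here are placeholders, not from any real setup) runs a Python script every morning at 7:30. You add it by editing your crontab with `crontab -e`:

```shell
# minute hour day-of-month month day-of-week  command
30 7 * * * /usr/bin/python3 /home/me/scripts/check_repos.py >> /home/me/cron.log 2>&1
```

The `>> ... 2>&1` part appends both standard output and errors to a log file, which is handy for debugging jobs that run while you are away.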
During the Data Science Immersive (DSI) program at General Assembly (Washington, DC), we usually have a GitHub Enterprise repo for each lecture, which we are asked to fork and clone to our own computers. Periodically, the instructors post updates after we clone a repo, such as solution code for labs or updated code-alongs after lectures. Checking for these updates can be tedious, especially when we don’t know exactly when a certain repo will be updated (and it doesn’t help my OCD). So I wrote a script to automate that. I tested and debugged the script pretty extensively during the DSI program. Unfortunately, by the time I had worked out all the kinks, the repos weren’t being updated much anymore. But I hope this will help future cohorts. (Script is shown at the end.)
As you develop more and more Python code for data science tasks, you may find yourself using the same code (functions) over and over again (e.g., a customized eda function). Instead of copying and pasting code snippets into your current Jupyter notebook, there are slicker ways to call Python functions from anywhere on your computer without having to specify the path, which I’ll describe in this post.
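One simple version of this idea (the directory and module names below are hypothetical, just for illustration) is to make the folder holding your reusable functions importable by adding it to Python's module search path; a more permanent option is to export that folder in the `PYTHONPATH` environment variable in your shell profile.

```python
import sys

# Assumption: your reusable helpers live in /Users/me/my_tools/helpers.py.
# Appending that directory to sys.path makes it importable from any
# notebook or script, no matter what the current working directory is.
sys.path.append("/Users/me/my_tools")

# from helpers import eda   # now works without specifying the full path
```

The `sys.path.append` line only lasts for the current session, which is why the `PYTHONPATH` route is nicer for functions you use every day.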
In the first blog post of this series, I talked about how to use the describe method in pandas to get a sense of the distributions of the data. While describe gives a pretty comprehensive summary of numeric data, the same cannot be said about categorical data.
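A quick toy example (the data here is made up) shows the difference: for a numeric column, describe reports the count, mean, standard deviation, and quartiles, while for a categorical (object) column it only reports the count, the number of unique values, and the most frequent value.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 35, 41],
    "color": ["red", "blue", "red", "red"],
})

print(df["age"].describe())    # count, mean, std, min, quartiles, max
print(df["color"].describe())  # only count, unique, top, freq
```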
For this purpose, the value_counts method of the Series class is quite handy. I’ve written a wrapper function for it for easier control:
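A sketch of what such a wrapper might look like (the function name and parameters here are my own illustrative choices, not necessarily the exact code from the repo): it counts missing values by default, and optionally shows proportions or only the most frequent categories.

```python
def value_counts_plus(series, dropna=False, normalize=False, top_n=None):
    """Wrapper around Series.value_counts for easier control:
    include NaN by default, optionally report proportions instead of
    counts, and optionally keep only the top_n most frequent values."""
    counts = series.value_counts(dropna=dropna, normalize=normalize)
    if top_n is not None:
        counts = counts.head(top_n)  # value_counts sorts by frequency
    return counts
```

For example, `value_counts_plus(df["color"], normalize=True, top_n=3)` would show the three most common categories as proportions, with missing values counted as their own category.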
Exploratory data analysis (EDA) is a crucial part of data science, during which we look at the size of the dataset and the variables within, the distributions of the variables, and the relationships between these variables. We also want to identify missing data and outliers. The EDA step informs us on where a project should go next: