A while ago, I wrote about some free resources you can use to learn data science on your own. It was mainly written as a “getting started” guide for folks who wanted to apply to The Data Incubator’s free Data Science Fellowship, but it’s a useful place to start regardless of where you want to apply to be a data scientist. I’ll break up my answer into two parts:
- Free resources broken down by topic: While you are coming at this with expertise in machine learning, there are a number of other useful aspects of data science to learn. This part of the answer is of more general interest.
- Free data sources with which you can gain hands-on experience: One of the linchpins of our data science fellowship is building a capstone project that you use to showcase your newfound data science knowledge.
#1: New Topics to Learn [original post]
Here are five important skills to develop and some resources to help you develop them. While we don’t expect our applicants to possess all of these skills, most applicants already have a strong background in many of them.
- Scraping: There’s a lot of data out there, so you’ll need to learn how to get access to it. Whether it’s JSON, HTML or some homebrew format, you should be able to handle it with ease. Modern scripting languages like Python are ideal for this. In Python, look at packages like urllib2, requests, simplejson, re, and Beautiful Soup to make handling web requests and data formats easier. More advanced topics include error handling (retrying) and parallelization (multiprocessing).
- SQL: Once you have a large amount of structured data, you will want to store and process it. SQL is the original query language and its syntax is so prevalent that there are SQL query interfaces for everything from sqldf for R data frames to Hive for MapReduce. Normally, you would have to go through a painful install process to play with SQL. Fortunately, there’s a nice online interactive tutorial available where you can submit your queries and learn interactively. Additionally, Mode Analytics has a great tutorial geared towards data scientists, although it is not interactive. When you’re ready to use SQL locally, SQLite offers a simple-to-install version of SQL.
- Data frames: SQL is great for handling large amounts of data but unfortunately it lacks machine learning and visualization. So the workflow is often to use SQL or MapReduce to get data to a manageable size and then process it using libraries like R’s data frames or Python’s pandas. For pandas, Wes McKinney, who created the library, has a great video tutorial on YouTube. Watch it here and follow along by checking out the GitHub code.
- Machine-Learning: A lot of data science can be done with select, join, and groupby (or equivalently, map and reduce) but sometimes you need to do some non-trivial machine-learning. Before you jump into fancier algorithms, try out simpler algorithms like Naive Bayes and regularized linear regression. In Python, these are implemented in scikit-learn. In R, they are implemented in the glm and gbm libraries. You should make sure you understand the basics really well before trying out fancier algorithms.
- Visualization: Data science is about communicating your findings, and data visualization is an incredibly valuable part of that. Python offers Matlab-like plotting via matplotlib, which is functional, even if it’s aesthetically lacking. R offers ggplot, which is prettier. Of course, if you’re really serious about dynamic visualizations, try d3.
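To make the scraping and SQL skills above a little more concrete, here is a minimal Python sketch that uses only the standard library. It retries a flaky fetch, parses the JSON it gets back, loads the records into an in-memory SQLite database, and summarizes them with a GROUP BY. Everything specific here is an assumption made for illustration: the `reviews` table, the `venue` and `stars` fields, and the `make_flaky_fetch` helper are hypothetical, and the "network" is simulated so the example runs offline. In a real scraper, `fetch` would wrap a call to something like `urllib.request.urlopen` or `requests.get`.

```python
import json
import sqlite3
import time


def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() up to `retries` times, re-raising the last error."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError:
            # urllib.error.URLError is a subclass of OSError, so real
            # network failures from urllib would be caught here too.
            if attempt == retries:
                raise
            time.sleep(delay * attempt)  # wait a little longer each time


def make_flaky_fetch(payload, failures=1):
    """Hypothetical stand-in for a web request: fails a few times, then succeeds."""
    state = {"remaining": failures}

    def fetch():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise OSError("simulated network error")
        return payload

    return fetch


def load_and_summarize(raw_json):
    """Load JSON records into SQLite and return (venue, count, average stars)."""
    records = json.loads(raw_json)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reviews (venue TEXT, stars INTEGER)")
    conn.executemany(
        "INSERT INTO reviews (venue, stars) VALUES (:venue, :stars)", records
    )
    rows = conn.execute(
        "SELECT venue, COUNT(*), AVG(stars) "
        "FROM reviews GROUP BY venue ORDER BY venue"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = json.dumps([
        {"venue": "cafe", "stars": 4},
        {"venue": "cafe", "stars": 5},
        {"venue": "diner", "stars": 3},
    ])
    raw = fetch_with_retry(make_flaky_fetch(sample), retries=3, delay=0)
    for row in load_and_summarize(raw):
        print(row)
```

The same GROUP BY would translate almost directly into a pandas `groupby` once the data is small enough to fit in memory, which is the workflow described in the data frames bullet.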
These are some of the foundational skills that will be invaluable to your career as a data scientist. While they only cover a subset of what we talk about at The Data Incubator (there’s a lot more to cover in stats, machine-learning, and MapReduce), this is a great start. For a more detailed list of topics, you might want to check out this great infographic:
#2: Cool Data Sources: [original post]
At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are a few cool public data sources you can use for your next project:
- Publicly Traded Market Data: Quandl is an amazing source of finance data. Google Finance and Yahoo Finance are additional good sources of data. Corporate filings with the SEC are available on Edgar.
- Housing Price Data: You can use the Trulia API or the Zillow API. In the UK, you can find price paid in house sales and historical mean house price by region (use this tool to translate between postcode and lat/long).
- Lending data: You can find student loan defaults by university and the complete collection of peer-to-peer loans from Lending Club and Prosper, the two largest platforms in the space.
- Home mortgage data: There is data made available by the Home Mortgage Disclosure Act and there’s a lot of data from the Federal Housing Finance Agency available here.
- Review Content: You can get reviews of restaurants and physical venues from Foursquare and Yelp (see geodata). Amazon has a large repository of Product Reviews. Beer reviews from Beer Advocate can be found here. Rotten Tomatoes Movie Reviews are available from Kaggle.
- Web Content: Looking for web content? Wikipedia provides dumps of their articles. Common Crawl has a large corpus of the internet available. ArXiv makes all of its data available via Bulk Download from AWS S3. Want to know which URLs are malicious? There’s a dataset for that. Music data is available from the Million Song Dataset. You can analyze the Q&A patterns on sites like Stack Exchange (including Stack Overflow).
- Media Data: There are open annotated articles from the New York Times, the Reuters Dataset, and the GDELT project (a consolidation of many different news sources). Google Books has published NGrams for books going back past 1800.
- Communications Data: There’s access to public messages of the Apache Software Foundation and communications amongst former execs at Enron.
- Municipal Data: Crime Data is available for City of Chicago and Washington DC. Restaurant Inspection Data is available for Chicago and New York City.
- Transportation Data: NYC Taxi Trips in 2013 are available courtesy of the Freedom of Information Act. There’s bikesharing data from NYC, Washington DC, and SF. There’s also Flight Delay Data from the FAA.
- Census Data: Japanese Census Data. US Census data from 2010, 2000, 1990. From census data, the government has also derived time use data. EU Census Data. Check out popular male / female baby names going back to the 19th century from the Social Security Administration.
- World Bank: They have a lot of data available on their website.
- Election Data: Political contribution data for the last few US elections can be downloaded from the FEC here and here. Polling data is available from Real Clear Politics.
- Food, Drugs, and Devices Data: The USDA provides location-based information about the food environment in their Food Atlas. The FDA also provides a number of high value public datasets.
Data With a Cause:
- Environmental Data: Data on household energy usage is available as well as NASA Climate Data.
- Medical and Biological Data: You can get anything from anonymous medical records, to remote sensor readings for individuals, to data on the Genomes of 1000 individuals.
- Geo Data: Try looking at these Yelp Datasets for venues near major universities and one for major cities in the Southwest. The Foursquare API is another good source. Open Street Map has open data on venues as well.
- Twitter Data: You can get access to Twitter Data used for sentiment analysis, network Twitter Data, and social Twitter data, on top of their API.
- Games Data: Datasets for games, including a large dataset of Poker hands, a dataset of online Dominion games, and datasets of Chess games, are available. Gaming Unplugged Since 2000 also has a large database of games, prices, artists, etc.
- Web Usage Data: Web usage data is a common dataset that companies look at to understand engagement. Available datasets include anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.
Metasources: these are great sources for other web pages.
- Stanford Network Data: http://snap.stanford.edu/index.html
- Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
- UCI maintains archives of data for machine learning.
- US Census Data.
- Amazon is hosting Public Datasets on S3.
- Kaggle hosts machine-learning challenges and many of their datasets are publicly available.
- The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
- Yahoo maintains a lot of data on its web properties, which can be obtained by writing to them.
- BigML is a blog that maintains a list of public datasets for the machine learning community.
- GroupLens Research has collected and made available rating data sets from the MovieLens website.
- Finally, if there’s a website with data you are interested in, crawl for it!
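If you do end up crawling, the core step is pulling links out of each page you fetch. Here is a rough sketch of that one step using only Python’s standard library. The `LinkCollector` class and `same_site_links` function are names I made up for illustration, and the sketch deliberately leaves out fetching, rate limiting, and robots.txt handling, all of which a real crawler should take care of.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like "/data" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def same_site_links(html, base_url):
    """Return unique links on the same host as base_url, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    unique = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in unique:
            unique.append(link)
    return unique


if __name__ == "__main__":
    page = (
        '<a href="/data">Data</a>'
        '<a href="http://other.example/x">Elsewhere</a>'
        '<a href="page2.html">Next</a>'
    )
    for url in same_site_links(page, "http://example.com/index.html"):
        print(url)
```

Restricting the crawl to the starting host is just one possible design choice; it keeps the crawl from wandering off across the whole web.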