What Are Good Ways To Get Started With Data Science For A Complete Novice?
A while ago, I wrote about some free resources you can use to learn data science on your own. This was mainly written as a useful “getting started” guide for folks who wanted to apply to our fellowship, but it’s a useful place to start regardless of where you want to apply to be a data scientist. I’ll break up my answer into two parts:
- Free resources broken down by topic: While you are coming at this with expertise in machine learning, there are a number of other useful aspects of data science to learn. This part of the response is of more general interest.
- Free data sources with which you can gain hands-on experience: One of the linchpins of our fellowship is building a capstone project that you use to showcase your newfound data science knowledge.
#1: New Topics to Learn 
Here are five important skills to develop and some resources on how to help you develop them. While we don’t expect our applicants to possess all of these skills, most applicants already have a strong background in many of them.
- Scraping: There’s a lot of data out there so you’ll need to learn how to get access to it. Whether it’s JSON, HTML or some homebrew format, you should be able to handle them with ease. Modern scripting languages like Python are ideal for this. In Python, look at packages like , , , , and to make handling web requests and data formats easier. More advanced topics include error handling ( ) and parallelization ( ).
- SQL: Once you have a large amount of structured data, you will want to store and process it. SQL is the original query language and its syntax is so prevalent that there are SQL query interfaces for everything from for R data frames to for MapReduce. Normally, you would have to go through a painful install process to play with SQL. Fortunately, there’s a nice available where you can submit your queries and learn interactively. Additionally, Mode Analytics has a geared towards data scientists, although it is not interactive. When you’re ready to use SQL locally, offers a simple-to-install version of SQL.
- Data frames: SQL is great for handling large amounts of data but unfortunately it lacks machine learning and visualization. So the workflow is often to use SQL or MapReduce to get data to a manageable size and then process it using a library like R’s or Python’s . For pandas, Wes McKinney, who created it, has a great video tutorial on YouTube. Watch it and follow along by checking out the .
- Machine-Learning: A lot of data science can be done with select, join, and groupby (or equivalently, map and reduce) but sometimes you need to do some non-trivial machine-learning. Before you jump into fancier algorithms, try out simpler algorithms like and . In Python, these are implemented in . In R, they are implemented in the and libraries. You should make sure you understand the basics really well before trying out fancier algorithms.
- Visualization: Data science is about communicating your findings, and data visualization is an incredibly valuable part of that. Python offers Matlab-like plotting via , which is functional, even if it’s aesthetically lacking. R offers , which is prettier. Of course, if you’re really serious about dynamic visualizations, try .
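To make the scraping and SQL bullets above a little more concrete, here is a minimal sketch using only Python's standard library (`json`, `sqlite3`, `collections`). The JSON payload, the `purchases` table, and the function names are all made up for illustration; in practice the payload would come from a web request rather than a string literal. It parses some JSON, loads it into an in-memory SQLite database, and runs a group-by query, then repeats the same aggregation in plain Python to show the map-and-reduce view of a group-by.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical payload standing in for the body of a JSON API response.
RAW_JSON = """
[
  {"customer": "ann", "city": "NYC", "amount": 12.5},
  {"customer": "bob", "city": "SF",  "amount": 7.0},
  {"customer": "ann", "city": "NYC", "amount": 3.5},
  {"customer": "cy",  "city": "SF",  "amount": 1.0}
]
"""


def load_purchases(conn, records):
    """Create an illustrative table and insert one row per record."""
    conn.execute(
        "CREATE TABLE purchases (customer TEXT, city TEXT, amount REAL)"
    )
    # Named placeholders (:customer, ...) are filled from each dict.
    conn.executemany(
        "INSERT INTO purchases VALUES (:customer, :city, :amount)", records
    )
    conn.commit()


def total_by_city_sql(conn):
    """Aggregate with SQL: one (city, total) row per city."""
    query = (
        "SELECT city, SUM(amount) FROM purchases "
        "GROUP BY city ORDER BY city"
    )
    return conn.execute(query).fetchall()


def total_by_city_python(records):
    """The same aggregation as a map (key by city) and reduce (sum)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["city"]] += record["amount"]
    return sorted(totals.items())


records = json.loads(RAW_JSON)
conn = sqlite3.connect(":memory:")  # throwaway database, nothing to install
load_purchases(conn, records)

sql_totals = total_by_city_sql(conn)
python_totals = total_by_city_python(records)
print(sql_totals)
print(python_totals)
```

SQLite ships with Python, so this runs without any install step. For larger data or real analysis you would likely hand the query result to a data frame library instead of working with raw tuples.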
These are some of the foundational skills that will be invaluable to your career as a data scientist. While they only cover a subset of what we talk about (there’s a lot more to cover in stats, machine learning, and MapReduce), this is a great start. For a more detailed list of topics, you might want to check out this great infographic:
#2: Cool Data Sources
We run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are a few cool public data sources you can use for your next project:
- Publicly Traded Market Data: is an amazing source of finance data. and are additional good sources of data. Corporate filings with the SEC are available on .
- Housing Price Data: You can use the or the . In the UK, you can find and historical (use to translate between postcode and lat/long).
- Lending data: You can find and the complete collection of peer-to-peer loans from and , the two largest platforms in the space.
- Home mortgage data: There is data made available by the and there’s a lot of data from the .
- Review Content: You can get reviews of restaurants and physical venues from Foursquare and Yelp (see geodata). Amazon has a large repository of . Beer reviews from Beer Advocate can be found . Rotten Tomatoes reviews are available from Kaggle.
- Web Content: Looking for web content? Wikipedia provides . Common Crawl has a . ArXiv makes all their data available via . Want to know which URLs are malicious? There’s a for that. Music data is available from the . You can analyze the Q&A patterns on sites like .
- Media Data: There are open annotated articles from the , , and (a consolidation of many different news sources). Google Books has for books going back past 1800.
- Communications Data: There’s access to public messages of the and communications .
- Municipal Data: Crime Data is available for and . Restaurant Inspection Data is available for and .
- Transportation Data: . There’s bikesharing data from , , and . There’s also .
- Census Data: . US Census data from , , . From census data, the government has also derived . . Check out from the Social Security Administration.
- World Bank: They have a lot of data available .
- Election Data: Political contribution data for the last few US elections can be downloaded from the FEC and . Polling data is available from .
- Food, Drugs, and Devices Data: The USDA provides location-based information about the food environment in their . The FDA also provides a number of high value .
Data With a Cause:
- Environmental Data: Data on as well as .
- Medical and Biological Data: You can get anything from , to remote sensor reading , to data on the Genomes of .
- Geo Data: Try looking at these Yelp Datasets for and one for major cities in the . The is another good source. Open Street Map has open as well.
- Twitter Data: You can get access to used for sentiment analysis, , and , on top of their .
- Games Data: Datasets for games, including a large dataset of , dataset of , and datasets of are available. also has a large of games, prices, artists, etc.
- Web Usage Data: Web usage data is a common dataset that companies look at to understand engagement. Available datasets include , (also anonymized), and .
Metasources: these are great sources for other web pages.
- Stanford Network Data:
- Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is .
- UCI maintains .
- Amazon is hosting .
- Kaggle hosts machine-learning challenges and many of their datasets are .
- The cities of , , , and maintain public data warehouses.
- Yahoo maintains a lot of data on its web properties .
- is a blog that maintains a list of public datasets for the machine learning community.
- has collected and made available rating data sets from the MovieLens website.
- Finally, if there’s a website with data you are interested in, crawl for it!
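As a starting point for that last suggestion, here is a minimal sketch of the core of a crawler using only Python's standard library. The class and function names (`LinkCollector`, `extract_links`, `fetch`) and the example URLs are hypothetical, chosen just for this illustration. It pulls the links out of a page's HTML and resolves them against the page's URL, which is the step a crawler repeats to decide what to visit next. The `fetch` helper is defined but not called here, so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, timeout=10):
    """Download a page as text (not called in this sketch)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# A made-up page standing in for what fetch() would return.
SAMPLE_HTML = """
<html><body>
  <a href="/data/2014.csv">2014 data</a>
  <a href="https://example.org/about">About</a>
  <a name="top">Anchor without a link</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML, "https://example.com/index.html")
print(links)
```

A real crawl would also need to track which URLs it has already visited, handle request errors, pause between requests, and respect the site's robots.txt and terms of use.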