How to Improve Your Data Science Skills While Working from Home

This article will serve as a guide to improving your Data Science skills while working from home. You can use it to build real-life projects, beef up your portfolio, and prepare yourself for what's next.

The coronavirus outbreak is taking over headlines. Due to the spread of COVID-19, remote work is suddenly an overnight requirement for many. You might be working from home as you are reading this article.

With millions working from home for many weeks now, we should seize this opportunity to improve our skills in the domain we are focusing on.

Here is my strategy for learning Data Science while working from home, built around a few personal real-life projects.

"So what should we do?"

"Where should we start learning?"

Grab your coffee as I explain the process of how you can learn data science sitting at home. This blog is for everyone, from beginners to professionals.

Photo by Nick Morrison on Unsplash

Prerequisites

To start this journey, you will need to cover the prerequisites. No matter which specific field you are in, you will need to learn the following prerequisites for data science.

Logic/Algorithms:

It’s important to know why we need a particular prerequisite before learning it. Algorithms are basically a set of instructions given to a computer to make it do a specific task.

Machine learning is built from various complex algorithms. So you need to understand how algorithms and logic work on a basic level before jumping into complex algorithms needed for machine learning.

If you are able to write the logic for any given puzzle with the proper steps, it will be easy for you to understand how these algorithms work and you can write one for yourself.
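
As one illustration of writing out the logic for a puzzle step by step, here is a minimal sketch of binary search, the classic "guess the number" strategy. The function name and the sample list are made up for this example.

```python
def binary_search(sorted_values, target):
    """Return the index of target in sorted_values, or -1 if it is absent."""
    low, high = 0, len(sorted_values) - 1
    while low <= high:
        # Step 1: look at the middle of the remaining range.
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            return mid
        # Step 2: discard the half that cannot contain the target.
        if sorted_values[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    # Step 3: the range is empty, so the target is not in the list.
    return -1


numbers = [2, 5, 8, 12, 16, 23, 38]
print(binary_search(numbers, 23))
print(binary_search(numbers, 7))
```

Each pass halves the search range, which is why the approach only works on data that is already sorted.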

Resources: Some awesome free resources to learn data structures and algorithms in depth.

Statistics:

Statistics is a collection of tools that you can use to get answers to important questions about data.

Machine learning and statistics are two tightly related fields of study. So much so that statisticians refer to machine learning as “applied statistics” or “statistical learning”.

The following topics should be covered by aspiring data scientists before they start machine learning.

Measures of Central Tendency: mean, median, mode, etc.
Measures of Variability: variance, standard deviation, z-score, etc.
Probability: probability density function, conditional probability, etc.
Accuracy: true positive, false positive, sensitivity, etc.
Hypothesis Testing and Statistical Significance: p-value, null hypothesis, etc.
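
To make a few of these terms concrete, here is a small sketch using Python's built-in `statistics` module. The scores and the confusion-matrix counts are invented for illustration, and the variance shown is the population version (dividing by n).

```python
import statistics

# Hypothetical sample: eight quiz scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of variability (population versions, i.e. dividing by n).
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)


def z_score(value):
    """How many standard deviations a value lies from the mean."""
    return (value - mean) / std_dev


# Classification metrics from a hypothetical confusion matrix.
true_pos, false_pos, false_neg, true_neg = 40, 10, 20, 30
sensitivity = true_pos / (true_pos + false_neg)  # also called recall
specificity = true_neg / (true_neg + false_pos)
accuracy = (true_pos + true_neg) / (true_pos + false_pos + false_neg + true_neg)

print(mean, median, mode, variance, std_dev, z_score(9))
print(round(sensitivity, 3), specificity, accuracy)
```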

Resources: Learn college level statistics in this free 8 hour course.

Business:

This depends on which domain you want to focus on. It basically involves understanding the particular domain and getting domain expertise before you get into a data science project. This is important as it helps in defining our problem accurately.

Resources: Data science for business

Brush up on your basics

This sounds pretty easy but we tend to forget some important basic concepts. It gets difficult to learn more complex concepts and the latest technologies in a specific domain without having a solid foundation in the basics.

Here are a few concepts you can start revising:

Python programming language

Python is widely used in data science. Check out this collection of great Python tutorials and these helpful code samples to get started.

You can also check out this Python 3 cheatsheet, which will help you learn the new syntax introduced in Python 3. It'll also help you brush up on basic syntax.

And if you want a great free course, check out this Python for Everybody course from Dr. Chuck.

General data science skills

Want to take a great course on data science concepts? Here's a bunch of data science courses that you can take online, ranked according to thousands of data points.

Resources: Data science for beginners - free 6 hour course, What languages should you learn for data science?

Data Collection

Now it is time for us to explore all the ways you can collect your data. You never know where your data might be hiding. Following are a few ways you can collect your data.

Web scraping

Web scraping helps you gather structured data from the web, select some of that data, and keep what you selected for whatever use you require.

You can start learning BeautifulSoup4 which helps you scrape websites and make your own datasets.
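
To show the basic idea of turning HTML into a dataset, here is a rough sketch that uses only Python's standard-library `html.parser` rather than BeautifulSoup4, so it runs without installing anything. BeautifulSoup4 offers a friendlier interface for the same job. The HTML snippet and the `TableScraper` class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this would be downloaded from a
# site whose terms of use allow scraping.
SAMPLE_HTML = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Coffee</td><td>3.50</td></tr>
  <tr><td>Tea</td><td>2.75</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._current_row is not None:
            self.rows.append(self._current_row)
            self._current_row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._current_row is not None:
            self._current_row.append(data.strip())


scraper = TableScraper()
scraper.feed(SAMPLE_HTML)

# Use the header row as keys to build one record per data row.
header, *records = scraper.rows
dataset = [dict(zip(header, record)) for record in records]
print(dataset)
```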

Advanced tip: You can use Selenium to automate browsers and get data from interactive web pages, such as apps built on Firebase. It is useful for automating web applications and boring web-based administration tasks.

Resources: Web Scraping 101 in Python

Cloud servers

If your data is stored on a cloud storage service such as Amazon S3, you will need to get familiar with how to retrieve it from there. The following links will help you get started with Amazon S3.

Resources : Getting started with Amazon S3, How to deploy your site or app to AWS S3 with CloudFront

APIs

Many websites, such as Facebook and Twitter, provide data through APIs. So it is important to learn how APIs are used and to have a good idea of how they are implemented.
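
Consuming a JSON API usually comes down to three steps: build a URL with query parameters, fetch it, and pull the fields you need out of the response. The sketch below uses only the standard library. The endpoint `api.example.com`, its parameters, and the payload shape are all hypothetical, so the network call is defined but not executed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.com/v1/posts"  # hypothetical endpoint


def build_url(base_url, **params):
    """Append URL-encoded query parameters to the base URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Download a URL and decode its JSON body (not called in this sketch)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload):
    """Pull the title of every item out of a decoded response."""
    return [item["title"] for item in payload["data"]]


# A made-up response body standing in for what fetch_json would return.
sample_payload = json.loads(
    '{"data": [{"id": 1, "title": "Hello"}, {"id": 2, "title": "World"}]}'
)

print(build_url(BASE_URL, user="alice", limit=2))
print(extract_titles(sample_payload))
```

Real APIs typically also require an API key or OAuth token and enforce rate limits, so check the provider's documentation before collecting data at scale.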

Resources : What is an API? In English, please, How to build a JSON API with Python, and Getting started with Python API.

Data Preprocessing

This topic includes everything from data cleaning to feature engineering. It takes a lot of time and effort. So we need to dedicate a lot of time to actually learn it.

Data cleaning involves different techniques depending on the problem and data type. The data needs to be cleaned of irrelevant data, syntax errors, data inconsistencies and missing data. The following guide will get you started with data cleaning.
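
As a rough sketch of what those cleaning steps can look like in code, the example below normalizes inconsistent text, drops duplicates, treats malformed values as missing, and fills the gaps with the median. The records and the choice of median imputation are assumptions made for illustration, and real projects often use pandas for this.

```python
import statistics

# Hypothetical raw records with typical problems.
raw_records = [
    {"name": " Alice ", "city": "pune", "age": "29"},
    {"name": "Bob", "city": "Pune", "age": ""},            # missing age
    {"name": "alice", "city": "PUNE", "age": "29"},        # duplicate of row 1
    {"name": "Chen", "city": "Mumbai", "age": "thirty"},   # syntax error
    {"name": "Dee", "city": "Mumbai", "age": "41"},
]


def parse_age(text):
    """Return the age as an int, or None when it is missing or malformed."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def clean(records):
    cleaned, seen = [], set()
    for record in records:
        # Fix inconsistencies: stray whitespace and mixed capitalization.
        name = record["name"].strip().title()
        city = record["city"].strip().title()
        # Drop duplicates, identified here by (name, city).
        if (name, city) in seen:
            continue
        seen.add((name, city))
        cleaned.append({"name": name, "city": city, "age": parse_age(record["age"])})

    # Fill missing ages with the median of the known ones.
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    fill_value = statistics.median(known_ages)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill_value
    return cleaned


cleaned_records = clean(raw_records)
for row in cleaned_records:
    print(row)
```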

Resources : Ultimate guide to data cleaning

Data Preprocessing is an important step in which the data gets transformed, or encoded, so that the machine can easily parse it. It requires time as well as effort to preprocess different types of data which include numerical, textual and image data.
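
Two of the most common transformations are scaling numeric columns and encoding categorical ones. Here is a minimal sketch of min-max scaling and one-hot encoding in plain Python. The function names and sample values are made up, and libraries such as scikit-learn provide production-ready versions.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest is 0.0 and the largest is 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot_encode(labels):
    """Turn each label into a 0/1 vector with one position per category."""
    categories = sorted(set(labels))
    encoded = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, encoded


ages = [20, 30, 40]
cities = ["Pune", "Mumbai", "Pune"]

scaled_ages = min_max_scale(ages)
categories, encoded_cities = one_hot_encode(cities)

print(scaled_ages)
print(categories, encoded_cities)
```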

Resources : Data Preprocessing: Concepts, All you need to know about text preprocessing for NLP and Machine Learning, Preprocessing for deep learning.

Machine Learning

Finally we reach our favourite part of data science: Machine Learning.

My suggestion here would be to first brush up your basic algorithms.

Classification: Logistic Regression, Random Forest, SVM, Naive Bayes, Decision Trees
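
To get a feel for how a classifier learns, here is a small from-scratch sketch of logistic regression with one feature, trained by gradient descent. The study-hours data, learning rate, and epoch count are arbitrary choices for illustration. In practice you would reach for a library such as scikit-learn.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=1000):
    """Fit weight and bias by gradient descent on the average log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(weight * x + bias) - y  # predicted minus actual
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


# Hypothetical data: hours studied and whether the student passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

# Centering the feature around its mean helps gradient descent converge.
mean_hours = sum(hours) / len(hours)
centered = [h - mean_hours for h in hours]
weight, bias = train_logistic_regression(centered, passed)


def predict(hours_studied):
    probability = sigmoid(weight * (hours_studied - mean_hours) + bias)
    return 1 if probability >= 0.5 else 0


predictions = [predict(h) for h in hours]
print(predictions)
```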

Resources : Types of classification algorithms in Machine Learning, Classification Algorithms in Machine Learning

Regression: Linear Regression, Random Forest, Polynomial Regression
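
Simple linear regression has a closed-form solution that is worth working through once by hand. The sketch below fits a line by ordinary least squares: the slope is the covariance of x and y divided by the variance of x. The data points are invented so that they lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept
print(slope, intercept, prediction)
```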

Resources : Introduction to Linear Regression, Use Linear Regression models to predict quadratic, root, and polynomial functions, 7 Regression Techniques you should know, Selecting the best Machine Learning algorithm for your regression problem

Clustering: K-Means Clustering, DBSCAN, Agglomerative Hierarchical Clustering
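
K-means alternates between two steps until nothing changes: assign each point to its nearest centroid, then move each centroid to the mean of its points. Here is a minimal sketch of that loop. The points are made up, and the starting centroids are passed in explicitly to keep the example deterministic, whereas real implementations usually pick them randomly or with k-means++.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Cluster points around the given starting centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Hypothetical 2D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(labels)
```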

Resources : Clustering algorithms

Gradient Boosting: XGBoost, CatBoost (AdaBoost is a related boosting method worth learning alongside them)
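
The core idea behind gradient boosting fits in a few lines: start from a constant prediction, then repeatedly fit a small model to the current residuals and add a shrunken copy of it to the ensemble. The sketch below does this for squared error with one-split decision stumps. It is a bare-bones illustration rather than how XGBoost or CatBoost are implemented, and the data, number of rounds, and learning rate are arbitrary.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = gradient_boost(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

Comparing the two printed errors shows how much the boosted ensemble improves on simply predicting the mean.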

Resources : Gradient boosting from scratch, Understanding Gradient Boosting Machines

I urge you all to understand the math behind these algorithms so you have a clear idea of how they actually work. You can refer to this blog post, where I implemented XGBoost from scratch: Implementing XGBoost from scratch

Now you can move on to Neural Networks and start your Deep Learning journey.

Resources: Deep Learning for Developers, Introduction to Deep Learning with Tensorflow, How to develop neural networks with Tensorflow, Learn how deep neural networks work

You can then dive deeper into how LSTM, Siamese Networks, CapsNet and BERT work.

Hackathons:

Now we need to implement these algorithms on a competitive level. You can start looking for online Data Science Hackathons. Here is the list of websites where I try to compete with other data scientists.

Analytics Vidhya: https://datahack.analyticsvidhya.com/contest/all/

Kaggle: https://www.kaggle.com/competitions

Hackerearth: https://www.hackerearth.com/challenges/

MachineHack: https://www.machinehack.com/

TechGig: https://www.techgig.com/challenge

Dare2compete: https://dare2compete.com/e/competitions/latest

Crowdanalytix: https://www.crowdanalytix.com/community

To see what a winning solution looks like, here is a link to my winning solution for an online hackathon on Analytics Vidhya: https://github.com/Sid11/AnalyticsVidhya_DataSupremacy

Projects:

We often see people working on dummy data who never get a feel for what actual data looks like. In my opinion, working on real-life data gives you a very clear idea of what data in real life looks like. Cleaning real-life data can take about 70% of your project's time.

Here are some good free open data sources anyone can use:

Open Government Data (India): https://data.gov.in/

Datasets tagged "real", contributed by thousands of users and organizations across the world (data.world): https://data.world/datasets/real

19 public datasets for Data Science projects: https://www.springboard.com/blog/free-public-data-sets-data-science-project/

Business Intelligence

After you get the results from your project, it is now time to make business decisions from those results. Business Intelligence is a suite of software and services that helps transform data into actionable intelligence and knowledge.

This can be done by creating a dashboard from the output of our model. Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry. It helps simplify raw data into an easily understandable format. Data analysis is fast with Tableau, and the visualizations it creates take the form of dashboards and worksheets.

Resources : Getting started with Tableau, Tableau for Data Science course

It is now time for you to start using your time at home to improve your skill set. If you have started this journey and need my advice or details about any part mentioned above, feel free to comment or mail me at jsiddhesh96[at]gmail[dot]com.

Original article: https://www.freecodecamp.org/news/improve-your-data-science-skills-while-working-from-home/