# Data Science

### **LIFE CYCLE**

[**Microsoft on Team DS Lifecycle**](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/overview) **- "**The Team Data Science Process (TDSP) is an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently. TDSP helps improve team collaboration and learning by suggesting how team roles work best together. TDSP includes best practices and structures from Microsoft and other industry leaders to help toward successful implementation of data science initiatives. The goal is to help companies fully realize the benefits of their analytics program.

This article provides an overview of TDSP and its main components. We provide a generic description of the process here that can be implemented with different kinds of tools. A more detailed description of the project tasks and roles involved in the lifecycle of the process is provided in additional linked topics. Guidance on how to implement the TDSP using a specific set of Microsoft tools and infrastructure that we use to implement the TDSP in our teams is also provided."

![by The DS lifecycle, Microsoft Documentation](https://lh5.googleusercontent.com/6uVYD4xbDkj2HG_rfP7fWQUn5eERj0nl_m-kKPpuyYX4q6R0g95WAduUFmIrSWVOd0P6dptgZG-1gkqWX-PvX4Png_ocJwI8VVxnj5WaZHCyetwvCLMwaKnp6g5b4goekVy9RuWV)

![by The DS lifecycle, Microsoft Documentation](/files/-Ml8ZLcCPvUpLvf_eHJn)

[**Google’s famous MLOps**](https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#mlops_level_0_manual_process)

![ML systems is more than ML code. Google.](https://lh3.googleusercontent.com/OHYbZ0jFBY6YtJvLHC0Rz10L341va62S9yOD8bALHAWHvnBRJ3TsxjZC0eEkUhGyjvLlkDITenjVqFJ-PZTl3Ab_Kt2qYbaTzRdUFzLxY-_O7zcV9IZ3jYS1I7URKKU6KiZCsmsk)

![ML systems is more than ML code. Google.](https://lh6.googleusercontent.com/ZEEeOvDgg_B7N6mP6XO19_o5Q4SpOAec4reiSg3R6TLJChRS19Nry9IfjwerveX8lhMNr5UwCZV9o-RrX-QzASyrZkiutTWUagH-r9LC5t_oVOpSHzn3D0fd1kubjwg0RjE9ZxYk)

[**Fast ai project checklist**](https://www.fast.ai/2020/01/07/data-questionnaire/?fbclid=IwAR2M_kdqKGSQ9uOFfdTncA6415K31V_flN203T1vHwNJOYg83XY2a9c-Jgg)

**"When I used to do consulting, I’d always seek to understand an organization’s context for developing data projects, based on these considerations:**

* **Strategy: What is the organization trying to do (objective) and what can it change to do it better (levers)?**
* **Data: Is the organization capturing necessary data and making it available?**
* **Analytics: What kinds of insights would be useful to the organization?**
* **Implementation: What organizational capabilities does it have?**
* **Maintenance: What systems are in place to track changes in the operational environment?**
* **Constraints: What constraints need to be considered in each of the above areas?"**

### **WORKFLOWS**

1. [**Kaggle**](https://towardsdatascience.com/my-secret-sauce-to-be-in-top-2-of-a-kaggle-competition-57cff0677d3c?fbclid=IwAR3Iei5OmwswIMbbqcz2dNr5rLsWS-iuuaAuOjmhCELTTEBTPmSM85mTw7U)

### **PLATFORMS**

1. [**Uber, Google, Netflix, Airbnb, etc.**](https://databaseline.tech/a-tour-of-end-to-end-ml-platforms/)

### **STACK**

1. [**Medium on canonical stack**](https://towardsdatascience.com/rise-of-the-canonical-stack-in-machine-learning-724e7d2faa75)

### **Being a DS / Researcher**

1. [**A day in a life**](https://towardsdatascience.com/12-things-i-learned-during-my-first-year-as-a-machine-learning-engineer-2991573a9195)
2. [**Advice for a DS**](https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db)**, business KPIs are not research KPIs, etc.**
3. [**Review of deep learning papers and co-authorship**](https://neurovenge.antonomase.fr/)
4. **Full stack DS** [**Uri Weiss**](https://linkedin.com/in/uriweiss)

   ![](https://lh6.googleusercontent.com/TUBCkjRcavVYjzKkg8aqqsU8Z8Eeogznm9uRIO5mS_2Hl7lr0MbZGZYy9UFsN0eJ1eAi0by6_R0CHEqK2IY_HIVpItxneKpgEsuREH8FFfC5nLKaqQ7Q_aTFhPJ1bQEP936Ysn0c)

   **by** [**Uri Weiss**](https://linkedin.com/in/uriweiss)**. wrong credits?** [**please contact me**](mailto:ori@oricohen.com)**.**
5. [**ML practices for a DS**](https://se-ml.github.io/)

### **Team Building / Group Cohesion**

1. [**DS vs DA vs MLE**](https://medium.com/@meightpc_14421/data-scientist-vs-data-analysis-vs-ml-engineer-which-job-is-most-suited-for-you-def7b12b3256) **- the most intensive diagram post ever. This is the mother lode of figure references.**

**References:**

[**1**](https://medium.com/@rdavila01/a-team-development-roadmap-ce5247127037)**,** [**2**](https://medium.com/swlh/team-development-stages-51df5606c0a2)**,** [**3**](https://medium.com/unexpected-leadership/forming-storming-norming-and-performing-5d06d021a969)**,** [**4**](https://medium.com/@RiterApp/8-models-of-team-effectiveness-3a3b84efb3ae)**,** [**5**](https://medium.com/@warren2lynch/traditional-to-scrum-team-forming-storming-norming-and-performing-3fd5fd1f5ea9)**,** [**6**](https://medium.com/@pallawi.ds/new-employee-best-practices-to-perform-with-the-team-tuckmans-stages-of-group-development-c656ca295bee)**,** [**7**](https://medium.com/agilegreat/tuckman-model-for-building-great-teams-7b3203d7a9e3)**,** [**8**](https://medium.com/simply-agile/agile-leader-pattern-2-for-building-awesome-teams-stabilize-teams-32785b70868c)**,** [**9**](https://medium.com/hackernoon/team-building-mental-models-1f431ae29361)**, 10**

[**Why data science teams need generalists, not specialists**](https://hbr.org/2019/03/why-data-science-teams-need-generalists-not-specialists)

1. **(good advice)** [**Building a DS function (team)**](https://medium.com/ww-tech-blog/from-0-to-60-models-in-two-years-building-out-an-impactful-data-science-function-9ef86abb9605)

### Culture

1. [Netflix](https://jobs.netflix.com/culture) culture
2. [Reed Hastings on Netflix's keeper test](https://hrtechx.com/2020/11/20/netflixs-keeper-test-is-the-secret-to-a-successful-workforce/) - "Netflix's keeper test is the secret to a successful workforce"
   1. [response 1](https://www.highlights.lornerubis.com/page/83/)

### **Agile for data-science-research**

1. [**How to manage a data science research team using agile methodology, not scrum and not kanban**](https://towardsdatascience.com/data-science-agile-cycles-my-method-for-managing-data-science-projects-in-the-hi-tech-industry-b289e8a72818)
2. [**Workflow for data science research projects**](https://towardsdatascience.com/data-science-project-flow-for-startups-282a93d4508d)
3. [**Tips for data science research management**](https://towardsdatascience.com/my-best-tips-for-agile-data-science-research-b40365cc979d)
4. [**IMO a really bad implementation of agile for data-science-projects**](https://www.locallyoptimistic.com/post/agile-analytics-p1/)

### **SOTA AND CURRENT TRENDS SUMMARIES**

1. [**ICLR 2019**](https://huyenchip.com/2019/05/12/top-8-trends-from-iclr-2019.html?fbclid=IwAR28Ez8Hs-XMSxcQb2NHfLQvZ5m4C8b4NIZPue00u6MZzrlI90Oqx8TExuU)
2. [**Medium**](https://medium.com/huggingface/the-best-and-most-current-of-modern-natural-language-processing-5055f409a1d1?fbclid=IwAR22vuGFXHil1Nz4vJr4uhueiKPRMz2T-BSwPXl8kg5iQZ54ppHe5ffecqI)
3. [**State of AI, a yearly report**](https://www.stateof.ai/)

### **Building Data/DS teams**

1. [**(great) The data team, a short story by Erik Bernhardsson**](https://erikbern.com/2021/07/07/the-data-team-a-short-story.html)
2. [Guilds / Gangs / Squads](https://aviranm.medium.com/the-evolution-of-a-guild-a6c7d1927610) by Aviran Mordo

3. [Squads, Tribes, Guilds, don't be like Spotify](https://uxdesign.cc/squads-tribes-guild-to-be-like-spotify-or-not-13ecf690fd36)
4. [Discover the Spotify Model](https://www.atlassian.com/agile/agile-at-scale/spotify)

### **YOUTUBE COURSES**

* [**DEEPNET.TV YOUTUBE (excellent)**](https://www.youtube.com/channel/UC9OeZkIwhzfv-_Cb7fCikLQ)
* [**Mitchell ML Lectures (too long)**](http://www.cs.cmu.edu/~ninamf/courses/601sp15/lectures.shtml)
* [**Quoc Le (Google) wrote DNN tutorials and a 3-hour video (not intuitive)**](http://cs.stanford.edu/~quocle/)
* [**KDnuggets: NumPy, pandas, scikit-learn tutorials.**](http://www.kdnuggets.com/2015/11/seven-steps-machine-learning-python.html)
* [**Deep learning online book (too wordy)**](http://neuralnetworksanddeeplearning.com/)
* [**Genetic algorithms - search the hyperparameter grid better than brute force, obviously**](https://medium.com/@harvitronix/lets-evolve-a-neural-network-with-a-genetic-algorithm-code-included-8809bece164)
* [**CNN tutorial**](http://mccormickml.com/2015/01/10/understanding-the-deeplearntoolbox-cnn-example/)
* [**Introduction to programming in scikit**](http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-intro.ipynb)
* [**SVM in scikit python**](https://github.com/jakevdp/sklearn_pycon2015/blob/master/notebooks/03.1-Classification-SVMs.ipynb)
* [**Sklearn scipy PCA tutorial**](https://github.com/jakevdp/sklearn_pycon2015/blob/master/notebooks/04.1-Dimensionality-PCA.ipynb)
* [**RNN** ](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* [**Matrix Multiplication**](http://www.mathwarehouse.com/algebra/matrix/multiply-matrix.php) **- linear algebra**
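
The genetic-algorithm item above is about evolving hyperparameters instead of enumerating every combination. Below is a minimal sketch of that idea, not the linked article's code: the search space `SPACE`, the scoring function `toy_score` (a stand-in for "train a model and return a validation score"), and all parameter defaults are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical search space; names and values are illustrative only.
SPACE = {
    "layers": [1, 2, 3, 4],
    "units": [32, 64, 128, 256],
    "lr": [0.1, 0.01, 0.001],
}

def toy_score(params):
    """Stand-in for 'train a model, return validation score' (higher is better)."""
    return -(abs(params["layers"] - 3)
             + abs(params["units"] - 128) / 64
             + abs(params["lr"] - 0.01) * 10)

def random_individual(rng):
    # One random value per hyperparameter.
    return {name: rng.choice(values) for name, values in SPACE.items()}

def crossover(a, b, rng):
    # Each hyperparameter is inherited from one of the two parents.
    return {name: rng.choice([a[name], b[name]]) for name in SPACE}

def mutate(individual, rng, rate=0.2):
    # Resample each hyperparameter with probability `rate`.
    return {
        name: rng.choice(SPACE[name]) if rng.random() < rate else value
        for name, value in individual.items()
    }

def evolve(score, generations=10, pop_size=8, keep=3, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best score seen at the start of each generation
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        parents = population[:keep]  # elitism: the best survive unchanged
        children = []
        while len(children) < pop_size - keep:
            mom, dad = rng.sample(parents, 2)
            children.append(mutate(crossover(mom, dad, rng), rng))
        population = parents + children
    return max(population, key=score), history

best, history = evolve(toy_score)
print(best, history[-1])
```

Because the top `keep` individuals are carried over unchanged, the best score never decreases from one generation to the next. In a real setting `score` would train and validate a model, so caching its results would matter far more than it does here.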

### **Deep learning Course**

1. [**Kadenze - deep learning with TensorFlow**](https://www.kadenze.com/courses/creative-applications-of-deep-learning-with-tensorflow-iv/sessions/introduction-to-tensorflow) **- Histograms of (image distribution - mean distribution) / std dev look quite good.**
2. [**deep learning with keras**](https://github.com/fchollet/deep-learning-with-python-notebooks)
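
The Kadenze note above refers to standardizing images as (image - mean) / std and then looking at the histogram of the result. A minimal sketch of that computation follows, using only the standard library and a randomly generated fake dataset. The helper names `standardize` and `histogram` are hypothetical, and in practice this would usually be done with NumPy arrays.

```python
import random
import statistics

def standardize(images):
    """images: list of equal-length flat pixel lists.
    Returns (pixel - mean) / std, with mean and std computed per pixel position."""
    columns = list(zip(*images))  # one tuple of values per pixel position
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / (s + 1e-8) for x, m, s in zip(img, means, stds)]
        for img in images
    ]

def histogram(values, bins=10):
    """Count values in `bins` equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

rng = random.Random(0)
# Fake dataset: 50 "images" of 16 pixels each, values in [0, 255].
images = [[rng.randint(0, 255) for _ in range(16)] for _ in range(50)]
normed = standardize(images)
flat = [v for img in normed for v in img]
print(histogram(flat))
```

After standardization the values are centered on zero with roughly unit spread, which is why the histogram becomes easier to compare across datasets.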

### **Machine Learning Courses**

1. [**Recommended: Udacity includes ML and DL** ](https://classroom.udacity.com/courses/ud188/lessons/b4ca7aaa-b346-43b1-ae7d-20d27b2eab65/concepts/4b7026be-06e3-49de-a362-ce109172659e)
2. [**Week1: Introduction Lesson 4: Supervised, unsupervised.**](https://www.coursera.org/learn/machine-learning/lecture/1VkCb/supervised-learning)
3. [**Lesson 6: model regression, cost function**](https://www.coursera.org/learn/machine-learning/lecture/db3jS/model-representation)
4. [**Lesson 71: optimization objective, large margin classification**](https://www.coursera.org/learn/machine-learning/lecture/sHfVT/optimization-objective)
5. [**PCA at coursera #1**](https://www.coursera.org/learn/machine-learning/lecture/GBFTt/principal-component-analysis-problem-formulation)
6. [**PCA at coursera**](https://www.coursera.org/learn/machine-learning/lecture/ZYIPa/principal-component-analysis-algorithm) **#2**
7. [**PCA #3**](https://www.coursera.org/learn/machine-learning/lecture/S1bq1/choosing-the-number-of-principal-components)
8. [**SVM at coursera #1 - simplified**](https://www.coursera.org/learn/predictive-analytics/lecture/2Qh1o/support-vector-machine-example)

### NLP Courses

1. [spaCy 101](https://spacy.io/usage/spacy-101)
2. [gensim](https://www.machinelearningplus.com/nlp/gensim-tutorial/), [2](https://radimrehurek.com/gensim/auto_examples/), [gensim notebooks](https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks)
3. [nltk](https://realpython.com/nltk-nlp-python/), [2](https://www.tutorialspoint.com/natural_language_toolkit/index.htm)
4. [Yandex](https://github.com/yandexdataschool/nlp_course)
5. [Lena Voita](https://lena-voita.github.io/nlp_course.html)

### **Predictive Analytics Course**

[**Syllabus**](https://www.coursera.org/learn/predictive-analytics)

[**Week 2: Lesson 29: supervised learning** ](https://www.coursera.org/learn/predictive-analytics/lecture/qzrx8/statistics-vs-machine-learning)

[**Lesson 36: From rules to trees**](https://www.coursera.org/learn/predictive-analytics/lecture/qTN05/from-rules-to-trees)

[**Lesson 43: overfitting, then validation, then accuracy**](https://www.coursera.org/learn/predictive-analytics/lecture/cnLwv/overfitting)

[**Lesson 46: bootstrap, bagging, boosting, random forests.**](https://www.coursera.org/learn/predictive-analytics/lecture/ZUJqG/bootstrap-revisited)

[**Lesson 52: Nearest neighbor (NN)**](https://www.coursera.org/learn/predictive-analytics/lecture/6uyga/nearest-neighbor)

[**Lesson 55: Gradient Descent**](https://www.coursera.org/learn/predictive-analytics/lecture/68oAE/optimization-by-gradient-descent)

[**Lesson 59: Logistic regression, SVM, Regularization, Lasso, Ridge regression**](https://www.coursera.org/learn/predictive-analytics/lecture/FecmG/intuition-for-logistic-regression)

[**Lesson 64: gradient descent, stochastic, parallel, batch.**](https://www.coursera.org/learn/predictive-analytics/lecture/eCynR/stochastic-and-batched-gradient-descent)

[**Unsupervised: Lesson X K-means, DBSCAN**](https://www.coursera.org/learn/predictive-analytics/lecture/WWiiy/introduction-to-unsupervised-learning)

### **BOOKS & NOTEBOOKS**

1. [**Machine learning design patterns**](https://www.oreilly.com/library/view/machine-learning-design/9781098115777/)**,** [**git**](https://github.com/GoogleCloudPlatform/ml-design-patterns) **notebooks!,** [**medium**](https://lakshmanok.medium.com/machine-learning-design-patterns-58e6ecb013d7)
   1. [**DP1 - transform**](https://medium.com/swlh/ml-design-pattern-1-transform-9e82ccbc3209) **Moving an ML model to production is much easier if you keep inputs, features, and transforms separate**
   2. [**DP2 - checkpoints**](https://towardsdatascience.com/ml-design-pattern-2-checkpoints-e6ca25a4c5fe) **Saving the intermediate weights of your model during training provides resilience, generalization, and tunability**
   3. [**DP3 - virtual epochs**](https://medium.com/google-cloud/ml-design-pattern-3-virtual-epochs-f842296de730) **Base machine learning model training and evaluation on total number of examples, not on epochs or steps**
   4. [**DP4 - keyed predictions**](https://towardsdatascience.com/ml-design-pattern-4-keyed-predictions-a8de67d9c0f4) **Export your model so that it passes through client keys**
   5. [**DP5 - repeatable sampling**](https://towardsdatascience.com/ml-design-pattern-5-repeatable-sampling-c0ccb2889f39) **Use the hash of a well-distributed column to split your data into training, validation, and testing**
2. [**Gensim notebooks**](https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks) **- from w2v, doc2vec to nmf, lda, pca, sklearn api, cosine, topic modeling, tsne, etc.**
3. [**Deep learning with Python**](https://www.manning.com/books/deep-learning-with-python) **- François Chollet, deep learning & vision** [**git notebooks!**](https://github.com/fchollet/deep-learning-with-python-notebooks)**,** [**official notebooks**](https://github.com/PacktPublishing/Deep-Learning-with-Keras)**.**
4. **Yandex school,** [**nlp notebooks**](https://github.com/yandexdataschool/nlp_course)
5. [**Machine learning engineering book**](http://www.mlebook.com/wiki/doku.php) **(i.e., data science)**
6. [**Interpretable Machine Learning book**](https://christophm.github.io/interpretable-ml-book/)
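
The repeatable-sampling pattern (DP5) in the list above can be sketched in a few lines: hash a well-distributed key column and use the hash bucket to pick the split. This is an illustrative sketch, not the book's implementation. The function name `assign_split`, the 80/10/10 proportions, and the `row-<i>` keys are assumptions.

```python
import hashlib

def assign_split(key, train_pct=80, valid_pct=10):
    """Deterministically map a key (e.g. a date or user id) to a split.

    The same key always lands in the same split, across runs and machines,
    because the assignment depends only on the hash of the key.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer bucket in [0, 99]
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + valid_pct:
        return "valid"
    return "test"

# Hypothetical keys; in practice this would be a column of the dataset.
keys = [f"row-{i}" for i in range(10_000)]
splits = [assign_split(k) for k in keys]
train_share = splits.count("train") / len(splits)
print(round(train_share, 2))
```

Rows that share a key value always end up in the same split, so choosing a key such as a date or user id keeps correlated rows from leaking between training and test data.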

### **COST**

1. [**GPT2/3**](https://medium.com/modern-nlp/estimating-gpt3-api-cost-50282f869ab8)

### **Patents**

1. [**Method Patent Exceptionalism**](https://ilr.law.uiowa.edu/print/volume-102-issue-3/method-patent-exceptionalism)

### **General Advice**

**(really good)** [**Practical advice for analysis of large, complex data sets**](https://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html) **- distributions, outliers, examples, slices, metric significance, consistency over time, validation, description, evaluation, robustness in measurement, reproducibility, etc.**

