Data Science

LIFE CYCLE

Microsoft on Team DS Lifecycle - "The Team Data Science Process (TDSP) is an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently. TDSP helps improve team collaboration and learning by suggesting how team roles work best together. TDSP includes best practices and structures from Microsoft and other industry leaders to help toward successful implementation of data science initiatives. The goal is to help companies fully realize the benefits of their analytics program.

This article provides an overview of TDSP and its main components. We provide a generic description of the process here that can be implemented with different kinds of tools. A more detailed description of the project tasks and roles involved in the lifecycle of the process is provided in additional linked topics. Guidance on how to implement the TDSP using a specific set of Microsoft tools and infrastructure that we use to implement the TDSP in our teams is also provided."

Google’s famous MLOps

An ML system is more than ML code (Google).

fast.ai project checklist

"When I used to do consulting, I’d always seek to understand an organization’s context for developing data projects, based on these considerations:

  • Strategy: What is the organization trying to do (objective) and what can it change to do it better (levers)?

  • Data: Is the organization capturing necessary data and making it available?

  • Analytics: What kinds of insights would be useful to the organization?

  • Implementation: What organizational capabilities does it have?

  • Maintenance: What systems are in place to track changes in the operational environment?

  • Constraints: What constraints need to be considered in each of the above areas?"

WORKFLOWS

PLATFORMS

STACK

Being a DS / Researcher

  1. Advice for a DS: business KPIs are not research KPIs, etc.

Team Building / Group Cohesion

  1. DS vs DA vs MLE - the most intensive diagram post ever. This is the motherlode of figure references.

References:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Why data science needs generalists, not specialists

Culture

  1. Reed Hastings on Netflix's keeper test - "netflixs-keeper-test-is-the-secret-to-a-successful-workforce"

Agile for data-science-research

Building Data/DS teams

Squads, Tribes, Guilds: don't be like Spotify

YOUTUBE COURSES

Deep learning Course

  1. Kadenze - deep learning with TensorFlow - histograms of (image distribution - mean distribution) / std dev look quite good.
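The normalization mentioned in the note above can be sketched as follows. This is a minimal illustration, not the course's own code: it assumes a flat list of pixel intensities and a single global mean and standard deviation (the course may normalize per pixel across a dataset instead), and the names `standardize` and `histogram` are hypothetical helpers.

```python
import math
import random
import statistics
from collections import Counter


def standardize(pixels):
    """Return (pixel - mean) / std dev for a flat list of intensities."""
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)  # population std dev over all pixels
    if std == 0:
        raise ValueError("constant image: standard deviation is zero")
    return [(p - mean) / std for p in pixels]


def histogram(values, bin_width=0.5):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter(math.floor(v / bin_width) * bin_width for v in values)


# Synthetic stand-in for real image data (assumption: 8-bit intensities).
rng = random.Random(0)
pixels = [rng.randint(0, 255) for _ in range(1000)]

normalized = standardize(pixels)
hist = histogram(normalized)

# Crude text histogram: one '#' per 10 pixels in the bin.
for edge in sorted(hist):
    print(f"{edge:+.1f} {'#' * (hist[edge] // 10)}")
```

After standardization the values have mean 0 and standard deviation 1, so the histogram is centered on zero regardless of the original intensity range.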

Machine Learning Courses

NLP Courses

Predictive Analytics Course

Syllabus

Week 2: Lesson 29: supervised learning

Lesson 36: From rules to trees

Lesson 43: overfitting, then validation, then accuracy

Lesson 46: bootstrap, bagging, boosting, random forests

Lesson 52: NN

Lesson 55: Gradient Descent

Lesson 59: Logistic regression, SVM, Regularization, Lasso, Ridge regression

Lesson 64: gradient descent, stochastic, parallel, batch

Unsupervised: Lesson X K-means, DBSCAN

BOOKS & NOTEBOOKS

  1. Machine learning design patterns, git notebooks!, medium

    1. DP1 - transform: Moving an ML model to production is much easier if you keep inputs, features, and transforms separate.

    2. DP2 - checkpoints: Saving the intermediate weights of your model during training provides resilience, generalization, and tunability.

    3. DP3 - virtual epochs: Base machine learning model training and evaluation on total number of examples, not on epochs or steps.

    4. DP4 - keyed predictions: Export your model so that it passes through client keys.

    5. DP5 - repeatable sampling: Use the hash of a well-distributed column to split your data into training, validation, and testing.

  2. Gensim notebooks - from w2v, doc2vec to NMF, LDA, PCA, sklearn API, cosine, topic modeling, t-SNE, etc.
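The repeatable sampling pattern (DP5 above) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the book's implementation: MD5 from Python's `hashlib` stands in for whatever hash function your data platform provides, and the `flight_date` column, the 80/10/10 proportions, and the `assign_split` helper are all hypothetical.

```python
import hashlib


def assign_split(key: str, train_pct: int = 80, valid_pct: int = 10) -> str:
    """Map a column value to a split using a stable hash of that value."""
    # A cryptographic hash is stable across runs and machines, unlike
    # Python's built-in hash(), which is salted per process for strings.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + valid_pct:
        return "valid"
    return "test"


# Hypothetical rows keyed by a date column (synthetic data).
rows = [{"flight_date": f"2024-01-{d:02d}", "delay": d % 7} for d in range(1, 31)]
splits = {row["flight_date"]: assign_split(row["flight_date"]) for row in rows}

for date, split in sorted(splits.items())[:5]:
    print(date, split)
```

Because the split depends only on the column value, rerunning the job reproduces the same assignment, and all rows sharing a key land in the same split. The exact proportions are only approximate on small datasets.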

COST

Patents

General Advice

(really good) Practical advice for analysis of large, complex data sets - distributions, outliers, examples, slices, metric significance, consistency over time, validation, description, evaluation, robustness in measurement, reproducibility, etc.
