Data Science Tools
good for removal
Coroutines
Async io
Clean code:
pyenv virtualenv
Jupyter notebooks as a module
Enter your project directory
$ python -m venv projectname
$ source projectname/bin/activate
(projectname) $ pip install ipykernel
(projectname) $ ipython kernel install --user --name=projectname
Run jupyter notebook. (Not entirely sure how this works out when you have multiple notebook processes; can we just reuse the same server?)
Connect to the new server at port 8889
As far as I can tell, reshape effectively flattens the tree and divides it again into a new tree, but the total number of elements must stay the same. For example, 2*4*6 = 4*2*3*2 = 48.
code:
import numpy

rng = numpy.random.RandomState(234)
a = rng.randn(2, 3, 10)  # 2 * 3 * 10 = 60 values
print(a.shape)
print(a)
# -1 lets numpy infer the last axis: 60 / (3 * 5) = 4
b = numpy.reshape(a, (3, 5, -1))
print(b.shape)  # (3, 5, 4)
print(b)
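The "flatten, then divide again" reading of reshape can be checked directly. The sketch below is only an illustration of that idea: it assumes the default C (row-major) ordering, and the names `b_manual` and `mismatch_rejected` are made up for this example.

```python
import numpy as np

rng = np.random.RandomState(234)
a = rng.randn(2, 3, 10)              # 2 * 3 * 10 = 60 elements

# Flatten to 1-D, then cut the same 60 values into the new shape by hand.
b_manual = a.ravel().reshape(3, 5, 4)

# Let NumPy infer the last axis with -1: 60 / (3 * 5) = 4.
b = np.reshape(a, (3, 5, -1))

same_values = np.array_equal(b, b_manual)
print(b.shape, same_values)

# A target shape whose product is not 60 is rejected with a ValueError.
try:
    np.reshape(a, (7, 9))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

If both checks print `True`, reshape only regrouped the same flat sequence of values and did not move any of them.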
def mask_with_values(df):
    # Compare on the underlying array (.values) to build the boolean mask
    mask = df['A'].values == 'foo'
    return df[mask]
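A small usage sketch for the masking function above. The DataFrame contents are toy values invented for this example; only the column name `A` and the value `'foo'` come from the snippet itself.

```python
import pandas as pd

def mask_with_values(df):
    # Build a boolean mask from the underlying array, then filter rows with it.
    mask = df['A'].values == 'foo'
    return df[mask]

# Toy data, invented for illustration only.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'baz'], 'B': [1, 2, 3, 4]})

result = mask_with_values(df)
print(result)
```

The filtered frame keeps the original index labels of the matching rows.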
Using python (map)
Using numpy
using a function (not as pretty)
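import pandas as pd
df = pd.DataFrame()
df['t'] = [x for x in range(10)]
df['t-1'] = df['t'].shift(1)   # previous value; NaN in the first row
df['t+1'] = df['t'].shift(-1)  # next value; NaN in the last row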
"The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks."
SCI-KIT LEARN
complementary to the above
- a sensible tutorial with instructions on how to use everything.
by Alfredo Motta
by Christine Egan
Important
- Put a one-liner before the code and query the variables inside a function.
- A shape of (2,4,6) is like a tree: 2 top-level branches, each with 4 children, and each of those children has 6 leaves.
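The tree picture can be sketched by walking down one branch at a time. This is a minimal illustration with made-up variable names (`top`, `children`, `leaves`), not taken from the linked material.

```python
import numpy as np

# 48 consecutive integers arranged with shape (2, 4, 6).
a = np.arange(2 * 4 * 6).reshape(2, 4, 6)

top = len(a)              # number of top-level branches
children = len(a[0])      # children under the first branch
leaves = len(a[0][0])     # leaves under the first child

print(top, children, leaves, a.size)
```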
*** A tutorial for
How to add extensions to jupyter:
to finding the minima
finding it in a 1d numpy array
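It is not clear from these notes which technique the linked material uses, so the following is only one possible sketch, in plain NumPy: `np.argmin` for the global minimum, and a neighbour comparison for strict local minima. The array values are invented for illustration.

```python
import numpy as np

# Toy 1-D signal, invented for this example.
x = np.array([3.0, 1.0, 2.0, 0.5, 4.0, 2.5, 3.5])

# Index of the single smallest value.
global_idx = int(np.argmin(x))

# Interior points that are strictly smaller than both neighbours.
is_local_min = (x[1:-1] < x[:-2]) & (x[1:-1] < x[2:])
local_idx = np.flatnonzero(is_local_min) + 1   # +1 because the slice starts at index 1

print(global_idx, local_idx)
```

The endpoints are never reported as local minima here, since each has only one neighbour.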
- Explaining why vectors work faster: a comparison between list, map, and vectorize, in which vectorize wins. The idea is to use vectorize with a function that may involve if conditions on a vector, and to run it as fast as possible.
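A minimal sketch of the approaches being compared, applied to a function with an if condition. The function `clip_negative` and the input values are hypothetical. No timings are measured here. Note that the NumPy documentation describes `np.vectorize` as a convenience wrapper that is essentially a loop, so an array-level expression such as `np.where` is included for comparison.

```python
import numpy as np

def clip_negative(v):
    # Hypothetical scalar function with an if condition.
    if v < 0:
        return 0.0
    return v * 2.0

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

as_list = [clip_negative(v) for v in x]          # list comprehension
as_map = list(map(clip_negative, x))             # built-in map
as_vectorize = np.vectorize(clip_negative)(x)    # np.vectorize wrapper
as_where = np.where(x < 0, 0.0, x * 2.0)         # array-level condition

print(as_vectorize)
print(as_where)
```

All four produce the same values. Which one is fastest depends on array size and on the function, so it is worth benchmarking on your own data.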
about using pandas: loading, loading from zip, seeing the table's features, accessing rows & columns, boolean operations, calculating on a whole row/column with a simple function (and even on two columns), and dealing with time/date parsing.
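The operations listed above can be sketched on a tiny in-memory table. The column names and values are invented for this example, and the CSV is read from a string, not from a file.

```python
import io
import pandas as pd

# Toy CSV, invented for illustration. For a zipped file on disk,
# pd.read_csv also accepts compression='zip'.
csv_text = """date,site,temp_c,humidity
2021-01-01,A,14.5,70
2021-01-02,A,15.0,65
2021-01-01,B,21.0,30
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(df.dtypes)                  # the table's features and their types
first_row = df.iloc[0]            # access a row by position
temps = df['temp_c']              # access a column by name

# Boolean operations: combine two conditions with &.
warm_and_dry = df[(df['temp_c'] > 14.8) & (df['humidity'] < 68)]

# Calculation on a whole column with a simple function.
df['temp_f'] = df['temp_c'].apply(lambda c: c * 9 / 5 + 32)

# Calculation that uses two columns, row by row.
df['score'] = df.apply(lambda r: r['temp_c'] - r['humidity'] / 10, axis=1)

# Date handling through the .dt accessor.
df['weekday'] = df['date'].dt.day_name()
print(df)
```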
- pivot, melt, stack, unstack
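A short sketch of how the four reshaping operations relate to each other, on a toy table invented for this example.

```python
import pandas as pd

# Toy long-format table, invented for illustration.
long_df = pd.DataFrame({
    'day': ['mon', 'mon', 'tue', 'tue'],
    'metric': ['x', 'y', 'x', 'y'],
    'value': [1, 2, 3, 4],
})

# pivot: long -> wide (one row per day, one column per metric).
wide = long_df.pivot(index='day', columns='metric', values='value')

# melt: wide -> long again.
back_to_long = wide.reset_index().melt(
    id_vars='day', var_name='metric', value_name='value'
)

# stack: move the column labels into an inner index level (gives a Series).
stacked = wide.stack()

# unstack: move the inner index level back out to columns.
unstacked = stacked.unstack()

print(wide)
print(stacked)
```

Roughly, pivot and melt work on ordinary columns, while stack and unstack work on index levels.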
(benchmarked):
- by name, by index, by python methods.
in pandas,
based on a (boolean or not) column and calculation:
Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).
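A minimal sketch of both directions on a five-row column. The column names `t-1` and `t+1` are labels chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({'t': range(5)})

# Pushed forward: each row sees the previous value, NaN appears at the front.
df['t-1'] = df['t'].shift(1)

# Pulled back: each row sees the next value, NaN appears at the end.
df['t+1'] = df['t'].shift(-1)

print(df)
```

Because NaN is a float, the shifted columns come out as floating point even though `t` holds integers.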
- A Practical Introduction - Yotam Perkal - PyCon Israel 2018
In this talk, I will present the problem and give a practical overview (accompanied by Jupyter Notebook code examples) of three libraries that aim to address it:
- Voluptuous: uses Schema definitions in order to validate data
- Engarde: a lightweight way to explicitly state your assumptions about the data and check that they're actually true
- TDDA: Test Driven Data Analysis
By the end of this talk, you will understand the importance of data validation and get a sense of how to integrate data validation principles as part of the ML pipeline.
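The general idea of stating assumptions about the data and checking them can be sketched in plain pandas. This does not use the API of Voluptuous, Engarde, or TDDA. The helper `check_assumptions`, the column names, and the value ranges are all hypothetical.

```python
import pandas as pd

def check_assumptions(df):
    """Hypothetical helper: raise ValueError if the frame violates the stated assumptions."""
    problems = []
    if df['age'].isna().any():
        problems.append('age has missing values')
    if not df['age'].dropna().between(0, 120).all():
        problems.append('age outside [0, 120]')
    if not pd.api.types.is_numeric_dtype(df['income']):
        problems.append('income is not numeric')
    if problems:
        raise ValueError('; '.join(problems))
    return df

# Toy frames, invented for illustration.
good = pd.DataFrame({'age': [25, 40], 'income': [1000.0, 2500.0]})
bad = pd.DataFrame({'age': [25, -3], 'income': [1000.0, 2500.0]})

validated = check_assumptions(good)    # passes and returns the frame unchanged

try:
    check_assumptions(bad)
    bad_rejected = False
    message = ''
except ValueError as err:
    bad_rejected = True
    message = str(err)

print(bad_rejected, message)
```

Running a check like this at the start of a pipeline stage makes a violated assumption fail loudly before it reaches the model.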
, use apply.
- "Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. Output is a fully self-contained HTML application."
(good)
Pipeline t,
- Multi gpu, multi node-gpu alternative for SKLEARN algorithms
about using SVM, KNN, naive Bayes, and logistic regression in sklearn in Python, i.e., "fitting a model onto the data"
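A minimal sketch of "fitting a model onto the data" with the four model families mentioned above. The dataset is synthetic (generated with `make_classification`) and the hyperparameters are arbitrary choices for this example, not recommendations from the linked tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic data, generated only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    'svm': SVC(),
    'knn': KNeighborsClassifier(n_neighbors=5),
    'naive_bayes': GaussianNB(),
    'log_regression': LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                    # fit the model onto the data
    scores[name] = model.score(X_test, y_test)     # mean accuracy on held-out data
    print(name, round(scores[name], 3))
```

Every estimator exposes the same `fit` / `score` interface, which is what makes swapping models this easy.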
Also insanely fast.
, using pipelines. thank you sk-lego.
Images by
on all fast.ai courses, 14 posts
- is an open-source, machine learning library in Python that helps you from data preparation to model deployment. It is easy to use and you can do almost every data science project task with just one line of code.
Failed to initialize NVML: Driver/library version mismatch
(great)