===============================================
Running a single instance continually in Linux
===============================================
#!/bin/bash
/bin/pidof -x /home/pi/scripts/pi_duino_2.py > /dev/null
if [[ $? -ne 0 ]]; then
    echo "Restarting data capture"
    export PYTHONUNBUFFERED=1
    /home/pi/scripts/pi_duino_2.py
fi

=========================================================
Freeing up memory in Python using the garbage collector
=========================================================
import gc
del variable
gc.collect()

==================================================
Check if a GPU is available within a Python kernel
==================================================
import GPUtil
print(GPUtil.getAvailable())

import torch
use_cuda = torch.cuda.is_available()
print(use_cuda)

========
Anaconda
========
conda info --envs                           List the environments created in Anaconda.
source activate environment1                Activate an environment running a certain Python version in Linux.
activate environment1                       Activate an environment running a certain Python version in Windows.
source deactivate                           Deactivate the currently active environment in Linux.
deactivate                                  Deactivate the currently active environment in Windows.
conda create -n name python=3.5 anaconda    Create a new Anaconda environment running Python 3.5.
conda create -n name python=2.7 anaconda    Create a new Anaconda environment running Python 2.7.
conda remove --name environment1 --all      Eliminate an existing environment.

======
Spyder
======
clear              Clears the IPython console.
#%%                Cell splitter.

Keyboard tricks:
[CTRL + 1]         Toggle comment/uncomment.
[F5]               Runs the current script as a Python script. Does not handle magic commands.
[CTRL + F5]        Line-by-line debugging of the current script as a Python script. Does not handle magic commands.
[F9]               Feeds the current script line-by-line to the IPython terminal. Handles magic commands.
[CTRL + Enter]     Runs the current cell.
[Shift + Enter]    Runs the current cell and advances to the next cell.
command[TAB]       Autocomplete.
[Shift + Tab]      Shows the arguments of a function if pressed within the parentheses.

===================================
Jupyter Notebook + Voila/IpyWidgets
===================================
Keyboard tricks:
[Shift + Enter]    Run current cell.
[TAB]              Increase indent.
[Shift + TAB]      Decrease indent.
[00]               Restart kernel.
[Shift + o]        Toggle output scrolling.
command[TAB]       Autocomplete.
[CTRL + /]         Toggle comment/uncomment.

IPython magic commands:
?                       Add after a command to display help.
??                      Add after a command to display source code.
!                       Sends the command after the exclamation point to the operating-system console to be run.
%who                    List all variables in the namespace.
%cd                     Change working directory.
%ls                     List files and directories in the working directory.
%run                    Execute a Python script or another Jupyter notebook.
%load                   Insert the code of another script.
%store                  Pass variables between notebooks.
%matplotlib inline      Matplotlib plots are placed within the notebook page.
%matplotlib notebook    Matplotlib plots are placed within the notebook page as a dynamic widget.
%precision 2            Set floating-point precision for printing.
%prun                   Shows how much time the program spent in each function.
%%timeit -n             Run the current Jupyter Notebook cell n times (default: n=1000) and report the average execution time.

Markdown cell directives:
## Title           Creates a heading in a Markdown (i.e., regular text) cell.
[Enter + Enter]    Restarts numbering in numbered lists created in Markdown cells.
(Hyperlink)        Creates a hyperlink in Markdown cells.
%%latex            Apply LaTeX parsing to the current Markdown (i.e., regular text) cell.
![title]           Insert an image in a Markdown cell.
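A minimal sketch of a Markdown cell combining the directives above (the URL and image path are placeholders):

Example (Markdown cell contents):
## Results
See the [project page](https://example.com) for details.
![title](figures/plot.png)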
Widgets:
from ipywidgets    Imports the ipywidgets library for widgets in Jupyter Notebook.
interact()         Wrapper for changing arguments using widgets when calling functions.
Example:
%pylab inline  # Make sure to use "inline" instead of "notebook" when using widgets.
import cv2
import sys
from ipywidgets import interact
rcParams['figure.figsize'] = (13.0, 11.0)

def update(d, sigmaColor, sigmaSpace):
    # 'baboon' is an image array assumed to be loaded beforehand.
    baboon_bilateral = cv2.bilateralFilter(baboon, d, sigmaColor, sigmaSpace)
    subplot(1, 2, 1)
    imshow(baboon)
    title('Original Image')
    subplot(1, 2, 2)
    imshow(baboon_bilateral)
    title('Filtered Image')

interact(update, d=(-1, 180, 1), sigmaColor=(0, 200, 1), sigmaSpace=(0.0, 200, 1))

====================
Python / IntelPython
====================
if __name__ == '__main__':    Conventional guard at the bottom of a Python script for good programming practice; its block only runs when the file is executed directly, not when it is imported.
Example:
if __name__ == '__main__':
    print(particularfunction("parameter1", "parameter2"))
#!/usr/bin/env python    This line at the top of a Python script allows Linux to run it as '$ ./script.py' instead of '$ python script.py' (the file must be marked executable).
import        Loads a library.
from          Loads a sublibrary/class/function. Must be used with the 'import' directive (i.e., 'from X import Y').
as            Applies an alias to a library/sublibrary/class/function. Must be used with the 'import' directive (i.e., 'import X as Y', or 'from X import Y as Z').
[TAB]         Indentation is significant in Python: it delimits the bodies of special directives (functions, conditionals, loops).
#             Hash is the comment symbol. Anything after it is parsed as a comment.
()            Parentheses clarify the order of mathematical and logical operations and allow a long one-line statement to be split over multiple lines. They are also necessary for the definition of tuple variables.
[]            Brackets perform the indexing and slicing operations for strings/lists/tuples/dictionaries/arrays/Series/DataFrames. Non-negative integers index from beginning to end, while negative integers index from end to beginning. They are also necessary for the definition of list variables.
[[]]          Useful for specifying a matrix in Numpy.
{}            Curly brackets contain the elements of a set or a dictionary. It represents an empty dictionary if assigned to a variable.
del           Delete an object or part of an object (e.g., a slice from a list) from memory.
help()        Displays help for a function.
dir()         Lists the attributes and methods of an object or module.
x = 5         Creates an integer variable.
x = 5.0       Creates a float variable.
x = 3+4j      Creates a complex variable.
x = True      Creates a boolean variable.
None          Special data type defined in Python for absent values. Equivalent to NULL in other languages.
+             Adds numerical values.
-             Subtracts numerical values.
*             Multiplies numerical values.
/             Divides numerical values. In Python 2 the result is a float only if the numerator or denominator is a float; in Python 3 it is always a float (use // for floor division).
%             Modulo operation over numerical values.
**            Power operation over numerical values.
abs()         Computes the absolute value.
max()         Computes the maximum value.
min()         Computes the minimum value.
pow()         Power operation over numerical values.
<             Less than. If used with sets, tests if a set is a proper subset of another.
>             Greater than. If used with sets, tests if a set is a proper superset of another.
<=            Less than or equal. If used with sets, tests if a set is a subset of another.
>=            Greater than or equal. If used with sets, tests if a set is a superset of another.
==            Equal.
!=            Not equal.
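A quick sketch of the numeric operators above (Python 3 semantics):

Example:
print(7 / 2)      # 3.5  (true division)
print(7 // 2)     # 3    (floor division)
print(7 % 2)      # 1    (modulo)
print(2 ** 10)    # 1024 (power)
print(abs(-3), max(1, 5), pow(2, 3))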
and           Logical AND comparison. Zero is interpreted as false; any other integer as true.
or            Logical OR comparison. Zero is interpreted as false; any other integer as true.
not           Logical opposite.
is            Python object identity comparison. Example: 'type(variable) is int'
is not        Negated object identity comparison. Example: 'type(variable) is not int'
|             If used with integers, bitwise OR.
^             If used with integers, bitwise XOR.
&             If used with integers, bitwise AND.
<<            If used with integers, bit shift left.
>>            If used with integers, bit shift right.
~             If used with integers, inverts the bits.
in            Tests for list, tuple, set, dictionary, or string membership. When used with strings it searches for substrings.
not in        Opposite of the 'in' keyword.
isdisjoint    Tests if the two sets being compared have no common elements.
x = (1,'a',2,'b')        Creates a tuple (i.e., a non-modifiable array of different types of values).
xlist = [1,'a',2,'b']    Creates a list (i.e., a modifiable array of different types of values).
string = 'asd'           Creates a string variable. Single quotes, double quotes, or triple double-quotes can be used. Triple double-quotes allow multi-line strings.
dict = {'a':2,'b':5}     Creates a dictionary variable.
setv = {1,2,3,4,5}       Creates a set variable.
setv = set([1,2,3,4])    Creates a set variable.
x[index]      Returns the value in the string/list/tuple/dictionary 'x' at the corresponding 'index' position. Python starts indexing at 0!
x[a:b:c]      Returns a slice (subset) of the values in the string/list/tuple 'x': the values in positions a to (b-1). Omit a or b to slice from the beginning or to the end, respectively. Specify c for the step size. If a negative value is used for c, the slice is taken in reverse order.
x[-5] and x[-5:]    Return the value at, and the slice from, the fifth-from-last position of the string/list/tuple 'x', respectively.
+             If used with lists, tuples, and strings, concatenates them.
*             If used with lists, tuples, and strings, repeats them a number of times.
-             If used with sets, computes the set difference.
|             If used with sets, computes the set union.
&             If used with sets, computes the set intersection.
^             If used with sets, computes the symmetric difference.
max()         If used with lists in Python 2, any string value contained in the list is considered higher than any numerical value (Python 3 raises a TypeError when comparing mixed types).
int()         Converts a variable to an integer (truncating floats).
float()       Converts a variable to a float.
long()        Converts a variable to a long integer (Python 2 only).
str()         Converts numeric quantities to strings to be printed or stored.
complex()     Creates a complex number.
list()        Converts a variable to a list.
set()         Converts a list into a set (preserves unique values only).
print()       Displays strings on the screen regardless of whether Python is used in interactive mode. Use the format() method to display multiple variables. Example: print('Number = {} & {}'.format(var1, var2))
input()       Takes input from the user (evaluated as an expression in Python 2; returned as a string in Python 3).
raw_input()   Takes an alphanumeric input from the user (Python 2 only).
type()        Displays the type of a Python variable/object.
id()          Returns the Python identifier of an object (i.e., its pointer to a memory position).
len()         Returns the number of elements of a string/list/tuple/Series/DataFrame.
string.index()        Finds the position of the first occurrence of a character in a string (raises ValueError if not found).
string.count()        Counts the occurrences of a character in a string.
string.find()         Finds the position of the first occurrence of a substring in a string (returns -1 if not found).
string.lower()        Converts a string to lowercase.
string.upper()        Converts a string to uppercase.
string.startswith()   Tests if a string starts with a certain substring.
string.endswith()     Tests if a string ends with a certain substring.
string.strip()        Removes whitespace (or given characters) from the beginning and end of a string.
string.replace()      Replaces characters in a string.
string.split()        Splits a string into a list of substrings based on a separator character.
string.join()         Places the string in between the elements of an iterable of strings.
string.format()       Convenient string formatting with {} placeholders (the older %-style uses %s for strings, %d for integers, %f for floats).
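A short sketch exercising several of the string methods above:

Example:
s = '  Hello, World  '
print(s.strip())                    # 'Hello, World'
print(s.strip().lower())            # 'hello, world'
print('a,b,c'.split(','))           # ['a', 'b', 'c']
print('-'.join(['a', 'b', 'c']))    # 'a-b-c'
print('Mean = {:.2f}'.format(3.14159))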
xlist.pop()       Removes and returns an element from a list based on its index (the last element by default).
xlist.remove()    Removes an element from a list based on its value.
xlist.insert()    Inserts a new value into a list at a given position.
xlist.extend()    Merges lists.
xlist.append()    Adds values (or embedded lists) to the end of a list.
xlist.index()     Shows the index of the first occurrence of a value in the list.
xlist.count()     Counts the occurrences of a value in the list.
xlist.sort()      Sorts the values of a list in place. Use the argument 'reverse=True' for reverse order.
range()           Creates a range of values (a list in Python 2, a lazy range object in Python 3).
append()          Adds new values at the end of a list.
sort()            Sorts the values of a list in place.
sorted()          Returns a sorted copy of a list without replacing the original.
zip()             Returns an iterable of tuples computed from a one-to-one mapping between two lists or numpy arrays.
setv.add()        Adds an element to a set.
setv.remove()     Removes an element from a set based on its value.
setv.pop()        Removes and returns an arbitrary element of a set.
setv.clear()      Removes all elements of a set.
setv.intersection()    Computes the intersection of sets.
setv.difference()      Computes the difference of sets.
setv.union()           Computes the union of two sets.
dict.keys()       Returns the dictionary keys.
dict.values()     Returns the values associated with the dictionary keys.
dict.items()      Returns the keys and values of a dictionary as tuples.
dict.get()        Returns the value that corresponds to a key in a dictionary. Returns None if the key is not found.

if condition :         Creates a conditional statement. The elif and else directives are optional. The indentation is indispensable.
    ----
elif condition :
    ----
else :
    ----

while condition :      Creates a loop that runs while the condition is met. The break directive is optional. The indentation is indispensable.
    ----
    --- break ---

for variable_name in list/tuple/numpyarray:    Creates a loop assigning one value of the list/tuple, or one row of the numpy
    ----                                       array, to variable_name on each iteration. The indentation is indispensable.

[operation_over_variable for variable1 in set_variable1 for variable2 in set_variable2 if cond]
                  One-line creation of smart lists (a.k.a. list comprehensions). The expression must be between brackets.
                  The operation_over_variable can simply return a variable. The conditional directive and nested for
                  directives are optional.

def function_name(x,y,z=None):    Creates a function. The return directive is optional. The indentation is indispensable.
    ---                           All of the optional input parameters (the ones with default values) must be specified
    return                        after the other parameters. Once defined, the function can be called as 'function_name()',
                                  or assigned to a variable 'var_func = function_name' and invoked as 'var_func()'.

lambda var1,var2 : operation_over_var1_and_var2(x,y)    Creates a one-line anonymous (i.e., unnamed) function. The (x,y)
                  suffix invokes the function immediately, providing the input parameters.
lambda var1: True if operation_over_var1(var1) else False
                  Creates a one-line anonymous (i.e., unnamed) function that integrates a conditional statement.
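A minimal sketch of named and anonymous functions as described above:

Example:
def scale(x, factor=2):        # 'factor' is an optional parameter with a default value
    return x * factor

add = lambda a, b: a + b       # anonymous function assigned to a variable
is_even = lambda n: True if n % 2 == 0 else False
print(scale(5), add(2, 3), is_even(4))                # 10 5 True
squares = [n ** 2 for n in range(5) if n % 2 == 0]    # list comprehension
print(squares)                                        # [0, 4, 16]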
class ClassName:                   Creates a class (i.e., an object template) with its embedded attributes (i.e.,
    class_global_var1              variables) and methods (i.e., functions). The use of the self parameter in the
    class_global_var2              methods is mandatory. The use of the constructor __init__ is optional in Python.
    def method_name(self,x,y):
        ---
    def method_name2(self,v,z):
        ---

try:                       Allows proper handling of exceptions (errors) in a program.
    ---
except NameError:          Checks for a 'NameError' raised during execution.
    ---                    There can be as many excepts as necessary.
except:                    Checks for any raised error other than the ones explicitly specified.
    ---
else:                      Executes this block if the code in try does not raise an exception.
    ---
finally:                   Executes this block regardless of whether an exception occurred or not.
    ---
raise TypeError("")        Raises an error in a script; it can crash the program or be handled by a try-except instruction.

map()             Applies a function over one or many iterable variables using a one-to-one approach. This function can also
                  be used to apply special Python, Numpy, and Scipy functions (element-wise!) over the contents of each
                  element of a Series. It is orders of magnitude faster than using for loops! Consider using
                  pandas.DataFrame.applymap for shorter code when applied over DataFrames, although it is slightly slower.
                  If applying special functions over a whole DataFrame or Series, we recommend the pandas.DataFrame.apply()
                  method, the pandas.DataFrame.pivot_table() method, or converting the DataFrame or Series to Numpy arrays
                  and using numpy.where() or numpy.select().
with              Recommended method for manipulating (opening and writing to) file objects; closes the file automatically.
file = open()     Creates a file object.
file.seek()       Moves to a position within the file.
file.readlines()  Reads all lines in the file into a list.
file.close()      Tears down the file object.

=======================
os (Built-in Library)
=======================
import os            Imports the built-in operating-system and file-operations library.
os.system()          Runs an OS terminal (e.g., BASH) command from Python. Example: os.system('sudo apt-get update')
os.path              Submodule for manipulating filesystem paths.
os.path.exists()     Validates the existence of a path in the operating system.
os.path.isfile()     Validates the existence of a file in the operating system.

=============================
subprocess (Built-in Library)
=============================
import subprocess    Imports the built-in subprocess library.
call()               Runs a shell command. Displays the result on the console in Linux, but not in Windows. Does not support
                     BASH pipes. Returns the exit code of the executable. Example: subprocess.call(["tshark", "-r", i])
check_output()       Runs a shell command. Returns the output as a string (Python 2.7) or a bytes object (Python 3.5) if the
                     exit code is zero. Use 'stderr=subprocess.STDOUT' to include errors in the output.
PIPE                 Object that allows BASH-style pipes between subprocesses.
Popen()              Creates a subprocess object to run a shell command. Supports BASH-style pipes when several Popen()
                     objects exist. Use the arguments 'stdin' and/or 'stdout=subprocess.PIPE' to set up the pipe.
Popen().returncode        Exit code of the subprocess.
Popen().stdin             Contains the stdin of the subprocess.
Popen().stdout            Contains the stdout of the subprocess.
Popen().stdout.close()    Allows the subprocess to receive a SIGPIPE if the downstream subprocess exits.
Popen().communicate()     Returns the stdout and stderr at the end of the piped BASH command.
Example:
p1 = subprocess.Popen(["tshark", "-r", i, sys.argv[1]], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["awk", "{print $7}"], stdin=p1.stdout, stdout=subprocess.PIPE)
p3 = subprocess.Popen(["sort", "-n"], stdin=p2.stdout, stdout=subprocess.PIPE)
p4 = subprocess.Popen(["uniq"], stdin=p3.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits.
p2.stdout.close()  # Allow p2 to receive a SIGPIPE if p3 exits.
p3.stdout.close()  # Allow p3 to receive a SIGPIPE if p4 exits.
output, err = p4.communicate()
print(output)

Example:
cmd = ['awk', 'length($0) > 5']
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE)
out, err = p.communicate('foo\nfoofoo\n')  # In Python 3, pass bytes (b'foo\nfoofoo\n') or add universal_newlines=True.

=======
pexpect
=======
import pexpect      Imports the interactive shell (e.g., BASH) execution library. Some methods are not available in Windows.

=======================
sys (Built-in Library)
=======================
import sys          Imports the built-in system-specific parameters and functions.
argv                List object that contains the arguments passed by the user when running a Python script.

======================
re (Built-in Library)
======================
import re           Imports the built-in regular-expressions library.
search()            Scans a string for the first match of a pattern; returns a match object, or None if no match is found.

========
requests
========
import requests     Imports the HTML/Web API calls library.

=======
uvicorn
=======
import uvicorn      Imports the ASGI web-server library, commonly used to serve FastAPI applications.

=======
FastAPI
=======
import fastapi      Imports the RESTful API web-server library.

=====
Flask
=====
import flask        Imports the Flask web-framework library.

======
Django
======
import django       Imports the Django web-framework library.

=======================
csv (Built-in Library)
=======================
import csv          Imports the built-in CSV manipulation library.
DictReader()        Reads the values of a CSV file and stores them as dictionaries, using the column names as keys and the records as values.

====
json
====
import json         Imports the JSON manipulation library.
load()              Reads JSON file contents and imports them as Python objects.
Example:
with open('data.json', 'r') as f:
    readfile = json.load(f)
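A minimal sketch of reading a CSV with DictReader (the filename and column name are placeholders):

Example:
import csv
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['ColumnName'])    # access a field by its (hypothetical) column header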
==========
sqlalchemy
==========
import sqlalchemy as sa    Imports the SQL database API library.

Example 1, ingesting a CSV with SQLite:
# (Modified from https://plotly.com/python/v3/big-data-analytics-with-pandas-and-sqlite/)
# Making database connections from Python with Pandas and sqlalchemy:
from sqlalchemy import create_engine  # database connection
import datetime as dt

disk_engine = create_engine('sqlite:///311_8M.db')  # Initializes a database with filename 311_8M.db in the current directory

start = dt.datetime.now()
chunksize = 20000
j = 0
index_start = 1

for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})  # Remove spaces from column names

    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # Convert to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])

    df.index += index_start

    # Remove the uninteresting columns
    columns = ['Agency', 'CreatedDate', 'ClosedDate', 'ComplaintType',
               'Descriptor', 'TimeToCompletion', 'City']
    for c in df.columns:
        if c not in columns:
            df = df.drop(c, axis=1)

    j += 1
    print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j * chunksize))

    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1

Example 2, reading a table with Microsoft SQL Server:
# Making database connections from Python with Pandas and sqlalchemy.
# Also requires 'import urllib.parse' and 'import pandas as pd'.
# (The +';Trusted_Connection=yes;' is only necessary when using
# Windows (domain-based) authentication for the database.)
# DB settings
server = 'WINREPSRV102'
database = 'dbWar2'
username = '8888'
password = '*****'
params = urllib.parse.quote_plus('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+password+';Trusted_Connection=yes;')
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
try:
    engine.connect()
    print(f'Successfully connected to {database}')
except:
    print('Failed to connect to database. Check your db settings.')
    engine.dispose()

kpi_df = pd.read_sql_table("SalesItemLTEConfig", schema="PKM", con=engine)
display(kpi_df)
engine.dispose()

===========================
time (Built-in Library)
===========================
import time as tm    Imports the built-in time library.
time()               Returns the current timestamp in seconds since the Epoch (January 1st, 1970).
sleep()              Waits for a given number of seconds. 'sleep(0.8)' proved useful in a loop running on a RPi to prevent 100% CPU usage.

===========================
datetime (Built-in Library)
===========================
import datetime as dt             Imports the built-in date and time library.
dtx = datetime.fromtimestamp()    Creates a datetime object from a timestamp.
dtx.year              Year attribute of the datetime object.
dtx.month             Month attribute of the datetime object.
dtx.day               Day attribute of the datetime object.
dtx.hour              Hour attribute of the datetime object.
dtx.minute            Minute attribute of the datetime object.
dtx.second            Second attribute of the datetime object.
dtd = timedelta()     Creates a timedelta object: a duration expressing the difference between two dates.
dtt = date.today()    Creates a date object containing the current local date.
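A small sketch combining the datetime pieces above:

Example:
import datetime as dt
now = dt.datetime.now()
print(now.year, now.month, now.day, now.hour)
delta = dt.timedelta(days=7)
print(now - delta)                   # one week ago
print(dt.date.today())               # current local date
print(dt.datetime.fromtimestamp(0))  # the Epoch in local time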
=======
logging
=======
import logging    Imports the built-in logging API library.

=======
IPython
=======
display()         Displays a Pandas DataFrame as an HTML table. Requires: 'from IPython.display import display'

=====
Numpy
=====
import numpy as np    Imports the Numpy library.
np.nan          Not-a-Number (NaN). Special data type (a float by convention) used by Numpy to denote an absent numerical
                value. It is considered different from the Python None data type.
np.isnan()      Tests for the presence of a NaN. Necessary because the answer to '== np.nan' is always False, even when
                testing 'np.nan == np.nan'.
a               Variable used in this quick-reference document to denote a Numpy array.
a.dtype         Data type of the array.
a.astype()      Changes the data type of a numpy array.
a.shape         Shape (dimension sizes) of an array.
a.ndim          Returns the number of dimensions of an array.
a.reshape()     Reshapes an array. Returns the same data with a new shape.
a.resize()      Resizes an array.
a.ravel()       Flattens a numpy array.
a.T             Transpose of an array.
a.copy()        Copies an array. Recommended when we want to assign a slice of an array to another variable and modify the new variable.
a[]             Returns a slice (subset) of values of the numpy array. Python starts indexing at 0! Assigning a slice of an
                array to another variable assigns the positions in memory!!!! Use the copy() method if this is not intended.
                Colon notation (i.e., [start:stop] or [start:stop:stepsize]) is used in one-dimensional slicing.
                Comma notation (i.e., [row(s),column(s)] or [row(s)][column(s)]) is used in two-dimensional slicing.
                Double-bracket notation (i.e., [[1, 5, 6]]) is used for selecting a subset of rows; it always creates a copy of the data.
a[a > 30]       Returns a slice (subset) of values of the numpy array that satisfy the condition. Boolean indexing always creates a copy of the data.
np.where()      Applies different operations to array contents depending on conditions.
np.select()     Equivalent to np.where() but allows more extensive conditions and is much faster than pd.apply().
a.sum()         Sums all elements of the array.
a.cumsum()      Cumulative sum of the elements of an array.
a.max()         Finds the maximum value of an array.
a.argmax()      Finds the position of the maximum value of an array.
a.argmin()      Finds the position of the minimum value of an array.
a.min()         Finds the minimum value of an array.
a.mean()        Computes the sample mean of an array.
a.std()         Computes the sample standard deviation of an array. Computed over N by default.
a.sort()        Performs an in-place sort of an array.
a.argsort()     Returns the indexes that would sort the values of an array.
a.any()         Tests whether any array element evaluates to True (logical OR over the array).
a.all()         Tests whether all array elements evaluate to True (logical AND over the array).
array()         Creates a numpy array from a Python list.
matrix()        Converts a table of strings to a 2D array.
arange()        Creates a range of values as a numpy array with ordinal equidistant separation.
linspace()      Creates a range of values as a numpy array with linear equidistant separation.
ones()          Returns an array of 1's.
zeros()         Returns an array of 0's.
empty()         Returns an n-dimensional array of uninitialized (garbage) values.
full()          Creates an array and fills it with a certain value of a certain data type (e.g., with False values).
eye()           Returns identity matrices.
diag()          Constructs a diagonal matrix.
repeat()        Returns an array in which each element of a list is repeated a number of times consecutively.
vstack()        Stacks numpy arrays vertically.
hstack()        Stacks numpy arrays horizontally.
histogram()     Computes a histogram over the values of a numpy array.
len()           Number of rows of an array.
enumerate()     Returns the row and row index of a numpy array.
where()         Used for conditional slicing of an array.
sort()          Returns a sorted copy of an array.
unique()        Returns the sorted unique values of an array.
in1d()          Tests for membership of values within an array.
intersect1d()   Returns the sorted common values of two arrays.
union1d()       Returns the sorted union of the values of two arrays.
setdiff1d()     Returns the values in the first array that are not in the second.
setxor1d()      Returns the values that are in exactly one of the two arrays (symmetric difference).
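A quick sketch of array creation, slicing, and conditional selection (basic slices are views, while boolean indexing copies):

Example:
import numpy as np
a = np.arange(12).reshape(3, 4)
print(a[1, 2])                   # single element: row 1, column 2
print(a[:, 1])                   # second column (a view, not a copy)
b = a[a > 5]                     # boolean indexing returns a copy
c = np.where(a % 2 == 0, a, -1)  # keep even entries, replace odd ones with -1
print(b, c, a.sum(), a.mean())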
flatnonzero()     Returns an array containing the indexes of non-zero elements in an array, or of True values in a boolean array/mask.
nonzero()         Returns a tuple containing the indexes of non-zero elements in an array, or of True values in a boolean array/mask.
count_nonzero()   Counts the number of non-zero elements in an array, or of True values in a boolean array/mask.
logical_and()     Element-wise logical AND.
any()             Tests whether any array element evaluates to True (logical OR over the array).
all()             Tests whether all array elements evaluate to True (logical AND over the array).
vectorize()       Turns a scalar function into one that accepts and returns vectors.
+                 One-to-one numpy array addition.
-                 One-to-one numpy array subtraction.
*                 One-to-one numpy array multiplication.
/                 One-to-one numpy array division.
**                Element-by-element numpy array power.
&                 If used with boolean masks, logical AND comparison.
|                 If used with boolean masks, logical OR comparison.
~                 If used with boolean masks, logical NOT.
dot()             Computes the inner product of two arrays.
convolve()        Linear convolution of two sequences.
random.randint()      Creates a numpy array of random integers.
random.binomial()     Generates a set of values with a univariate binomial distribution.
random.normal()       Generates a set of values with a univariate normal distribution (zero mean and unit standard deviation by default).
random.chisquare()    Generates a set of values with a univariate chi-squared distribution (only depends on the degrees of freedom).
random.RandomState()  Sets or returns the current state of the random number generator.

====
Cupy
====
import cupy       Imports the CUDA numeric Python library for taking advantage of GPUs.

=====
Scipy
=====
import scipy as sp
spars             Variable used in this quick-reference document to denote a Scipy sparse array.
spars.toarray()   Converts a Scipy sparse matrix to a regular numpy array.
spars.todense()   Converts a Scipy sparse matrix to a regular numpy matrix.
fftpack.fft()     Computes the Fast Fourier Transform (DFT) of an array.
fftpack.ifft()    Computes the Inverse Fast Fourier Transform (IDFT) of an array.
stats.kurtosis()  Calculates the sample kurtosis of a distribution (i.e., the shape of the tails of the distribution). A
                  positive value indicates tails more peaked than a normal distribution; a negative value indicates flatter tails.
stats.skew()      Computes the sample skewness of a distribution (i.e., how much the distribution is 'concentrated on the left').
stats.ttest_ind() Computes a statistical t-test on the data (i.e., computes the similarity between the means of the data).
                  The returned p-value is useful when compared with the alpha value chosen for the hypothesis test. The
                  alpha is a tool for the verification of a null hypothesis and should be carefully selected according to
                  the costs of the resulting hypothesis test.
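A short sketch of the scipy.stats functions above (random data, so the exact numbers will vary):

Example:
import numpy as np
from scipy import stats
rng = np.random.RandomState(0)
x = rng.normal(0.0, 1.0, 1000)
y = rng.normal(0.5, 1.0, 1000)
print(stats.skew(x), stats.kurtosis(x))   # shape statistics of the sample
t, p = stats.ttest_ind(x, y)              # compare the two sample means
print('t = {:.2f}, p = {:.4f}'.format(t, p))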
=======
Seaborn
=======
import seaborn as sns    Imports the Seaborn library, a simplification of Matplotlib for data-analysis visualization.

==============
Plotly Express
==============
import plotly.express as px    Imports the Plotly Express library.
scatter_mapbox        Excellent scatter plot over a geographic map.
density_heatmap       Excellent 2D histogram.

==========
Matplotlib
==========
import matplotlib.pyplot as plt    Imports the Matplotlib library.
matshow()             Plots a heatmap (colored matrix or table).
plot()                Plots data, or plots a function against a range.
bar()                 Plots a bar graph.
hist()                Plots a histogram.
hlines()              Plots horizontal lines.
vlines()              Plots vertical lines.
imshow()              Displays images. Use 'cmap' to choose the colormap and 'interpolation' to choose the interpolation method.
xlabel()              Labels the x-axis.
ylabel()              Labels the y-axis.
title()               Titles the graph.
legend([],loc=' ')    Creates a legend. Set loc='best' for automatic placement.
xticks([],[])         Sets tick values for the x-axis. First array for numeric values, second for alphanumeric labels.
yticks([],[])         Sets tick values for the y-axis. First array for values, second for labels.
colorbar()            Inserts a colorbar in the plot.
figure()              Creates a new figure window.
subplot(r,c,n)        Creates multiple plots: r rows of plots, c columns of plots, n index of the currently selected plot.
savefig('foo.png')    Saves the plot.
ax = gca()            Selects the current axes.
ax.spines[].set_color()       Changes the axis color; 'none' removes it.
ax.spines[].set_position()    Changes the axis position. Can change the coordinate space.
style.use()           Uses a certain style for the plots.
FuncAnimation(fig, anifunct, interval=5000)    Runs the anifunct function every 'interval' milliseconds and draws into the
                      fig figure. Useful for live plots. Requires 'import matplotlib.animation as animation'. An example of
                      anifunct is a function that reads a text file and plots its contents. anifunct must accept the frame
                      parameter in its signature even if it does not use it directly.
Example:
def anifunct(i):
    data = open('example.csv', 'r').read()
    lines = data.split('\n')
    xseries = []
    yseries = []
    for line in lines:
        if len(line) > 1:
            x, y = line.split(',')
            xseries.append(x)
            yseries.append(y)
    ax.clear()                 # Clears the existing plot (ax = fig.gca()).
    ax.plot(xseries, yseries)  # Draws the updated plot.

======
Pandas
======
Pandas general commands:
import pandas as pd    Imports the Pandas library.
set_option("display.max_rows",16)       Sets the Pandas preview to 16 rows at maximum. If 'None' is passed instead of a number, all rows are displayed.
reset_option('display.max_rows')        Resets the Pandas row preview to the default value.
set_option("display.max_columns",16)    Sets the Pandas preview to 16 columns at maximum. If 'None' is passed instead of a number, all columns are displayed.
reset_option('display.max_columns')     Resets the Pandas column preview to the default value.
read_csv()      Reads a CSV file into a DataFrame.
read_excel()    Reads an Excel file into a DataFrame.
Example:
read_excel('filename',
           #skiprows=,
           #skip_footer=,    # Skip last rows.
           #na_values=       # Replace string values by NaNs.
           )
read_html()     Scrapes data from an HTML web page.
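A minimal sketch of loading a CSV into a DataFrame (the filename and column names are placeholders):

Example:
import pandas as pd
df = pd.read_csv('data.csv',
                 parse_dates=['Date'],    # hypothetical date column to parse
                 na_values=['N/A'])       # strings to interpret as NaN
print(df.head())
print(df.dtypes)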
Series commands:
Series()        Creates a Series. If it contains mixed types of values it defaults to the object data type.
s               Variable used in this quick-reference document to denote a Series.
s.astype()      Interprets a Series as storing a certain type of value.
                For nominal data (categorical, unordered) we use the parameters:
                    'category', categories=['Low', 'Medium', 'High'], ordered=False
                Nominal data can also be represented by binary vectors (called 'feature extraction' in machine-learning parlance).
                For ordinal data (categorical, ordered) we use the parameters:
                    'category', categories=['Low', 'Medium', 'High'], ordered=True
                Ordinal data can also be represented by integer values. The ordering allows us to use logical operators over the categories.
s.index         Returns the index values and data type of a Series.
s[]             Series slicing. The use of the loc and iloc methods is preferable in order to avoid confusion.
s.loc['index']  Selects a subset of data by label-based indexing (i.e., index value). It adds new entries if the values are non-existent.
s.iloc[1:2]     Selects a subset of data by integer (i.e., numeric) indexing.
s.count()       Counts the number of non-NaN records.
s.unique()      Summarizes the unique values of a Series.
s.std()         Computes the standard deviation of a Series. Computed over N-1 by default.
s.str.contains()    Searches strings for patterns.
s.str.split()   Splits strings into multiple strings based on a certain character. Pass 'expand=True' to split into a
                DataFrame. Pass the parameter 'n' to specify the number of splits.
s.str.get()     Gets the different substrings of a split string.
s.str.rsplit()  Reverse split. Same function but starts from the end of the word.
s.str.join()    Concatenates a list/array of strings contained within each entry of a Series. Useful for bag-of-words.
s.plot()        Line plot of the Series.
s.plot.bar()    Bar plot of the Series.
cut()           Assigns the numeric data of a Series, based on its range of values, into equally spaced bins. This is a
                means of discretization: reducing the possible numerical values into more 'categorical' values.
s.dt.round()    Rounds the timestamp values of a Series to the closest specified point in time.

DataFrame commands:
df              Variable used in this quick-reference document to denote a DataFrame.
df.values       Returns the data formatted as a numpy array.
df.dtypes       Data types of the columns of the DataFrame.
df.columns      Names of the columns of the DataFrame.
df.shape        Size of the DataFrame.
df.nbytes       Size in bytes of the DataFrame.
df.info()       Summary of the DataFrame information.
df.describe()   Summary of statistical data for the numerical columns of the DataFrame.
df.index        Returns the DataFrame index.
df.set_index()  Uses one column name, or a Python list of column names, as the DataFrame index or multi-index, respectively.
df.reset_index()    Takes the current DataFrame index and creates a column for it, while a new index is generated using a sequence of numbers.
df.append()     Appends data to the DataFrame.
df.sample()     Returns a random sample of the DataFrame.
df.style.background_gradient(cmap='Blues')    Creates a color heatmap over the DataFrame.
df.T            Transpose of a DataFrame.
df.copy()       Creates a copy of a DataFrame.
df.where()      Creates a copy of a DataFrame slice created using conditional slicing (i.e., uses a boolean mask as input).
                Afterwards, df.dropna() can be used to discard the values that do not satisfy the boolean mask, if any.
df.isin()       Returns a boolean DataFrame showing whether each element in the 'df' DataFrame is contained in given values.
df.rename()     Renames columns of the DataFrame using a dictionary that maps the old column names to the new column names.
                Pass the 'inplace=True' parameter to change the DataFrame directly.
df["Column"]    Selects a column by name, creating a Series from that column of the DataFrame. Modifying the values of the
                Series affects the original DataFrame.
df[['Column1','Column2']]    Creates a DataFrame as a subset of an existing DataFrame. Can be used to specify the column order of an existing DataFrame.
df['Date/Time':'Date/Time']  Selects rows by the date/time value of the index. Creates a DataFrame.
(df["Column"] >= value)      Creates a boolean mask evaluating the condition. Boolean masks must be within parentheses to guarantee the order of evaluation!
df.isnull()     Creates a boolean mask evaluating the presence of NaNs in the entries of the DataFrame.
df.notnull()    Creates a boolean mask evaluating the absence of NaNs in the entries of the DataFrame.
df.loc[[15],'Column']    Selects a subset of data by label-based indexing (i.e., index values and column names). It adds new
                entries if the values are non-existent. Use Python tuples with the index levels for selecting data from a
                multi-indexed DataFrame. As a general principle, be careful when using too many back-to-back brackets,
                because it is tricky to predict whether a view or a copy is returned!
df.iloc[1:2,2:3]    Selects a subset of data by integer (i.e., numeric) indexing. The same view-versus-copy caution applies.
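A small sketch of label-based versus integer-based selection and boolean masks:

Example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
print(df.loc['y', 'B'])     # label-based: 5
print(df.iloc[1:3, 0])      # integer-based: positions 1-2 of the first column
mask = (df['A'] >= 2)       # boolean mask, wrapped in parentheses
print(df[mask])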
df.xs()         Useful for filtering data from a multi-indexed DataFrame (returns a view for inspection only).
df["Column"].astype()    Changes the data type of a DataFrame column.
del df["Column"]         Deletes the given column. Performs the deletion on the original DataFrame.
df.drop()       Deletes a given row or column based on its label (i.e., index or column name). Pass axis=1 for columns. Returns a copy of the affected DataFrame by default.
df.dropna()     Drops rows where data is missing.
df.fillna()     Fills entries that contain missing values.
df.interpolate()    Fills entries that contain missing values with interpolated values.
df.resample()   Aggregates the values of the DataFrame within a certain time-frequency period, applying some operation.
df.asfreq()     Changes the time-frequency period of the index of the DataFrame; some filling method might be necessary.
df.mean()       Calculates the mean of the values of each column in the DataFrame.
df.std()        Calculates the sample standard deviation of the values of each column in the DataFrame (over N-1 by default).
df.corr()       Estimates the correlation matrix from the DataFrame. Ignores non-numerical columns.
df.sort_values()    Sorts the values of a DataFrame.
group, frames = df.groupby()    Creates a GroupBy object from a single DataFrame: a collection of DataFrames that result
                from splitting the original DataFrame, where the split depends on the values of a certain column of the
                original DataFrame. This is useful for implementing the 'Split' stage of the 'Split->Apply->Combine' or
                'Split->Apply' workflows. Iterative methods can take care of the 'Apply' stage, or the 'agg' method can be
                used (see the sketch below). If we use a function as an argument, instead of a column name, the data is
                split based on the results of the function; the function takes the index of the original DataFrame as its
                input data.
df.groupby().agg()    Performs an aggregation operation over the data of a GroupBy object. It takes as input a dictionary
                of DataFrame column names and operations; the operations are performed on the specified columns. If groupby
                is used with the 'level' argument, then the input of agg is a dictionary of output column names
                (aggregation type) and operations; in that case all the operations are performed over all the columns. If
                the operations are non-standard, the apply method is preferred. The supported operations:
                aggfuncs = ['count', 'sum', 'sem', 'skew', 'mean', 'min', 'max', 'std', 'quantile', 'nunique', 'mad',
                            'size', pd.Series.mode, 'var', 'unique']
df.apply()      Applies a function over a certain axis of the DataFrame. Widely used with lambda functions. If performing numerical operations, Numpy functions are recommended.
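A minimal Split->Apply->Combine sketch with groupby and agg:

Example:
import pandas as pd
df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'], 'Score': [10, 20, 30, 50]})
print(df.groupby('Team').agg({'Score': ['mean', 'max']}))
for name, frame in df.groupby('Team'):    # iterate over the split DataFrames
    print(name, len(frame))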
df.pivot()      Creates a pivot table that aggregates only non-numerical values.
df.pivot_table()    Creates a pivot table using data from the DataFrame. If aggfunc is specified (default: mean), the
                options 'margins_name', 'dropna' and 'fill_value' do not work. Make sure that NaNs in categorical variables
                have been replaced before using this command, otherwise they will be ignored.
---
Example:
print("Alert: Make sure that NaNs in categorical variables have been replaced, otherwise they will be ignored in the Pivot Table!!!\n")
PT = df.pivot_table(
    index=[],       # Use specified original DataFrame columns as pivot rows.
    columns=[],     # Use specified original DataFrame columns as pivot columns.
    values=[],      # Use specified original DataFrame column values.
    aggfunc=[np.mean, np.max],    # Option 1: Aggregation functions applied to all values.
    # {'Column': np.size},        # Option 2a: Aggregation functions applied to specific values.
    # {'Column': lambda x: ''.join(str(v) for v in x)},    # Option 2b: Custom aggregation functions applied to specific
                                                           # values; in this case, concatenation of string values.
    margins=True,             # Grand totals and subtotals.
    #margins_name="Total",    # Does not work if aggfunc is specified.
    #dropna=False,            # Does not work if aggfunc is specified. Useful for identifying missing information, but can make visualization harder.
    #fill_value='UNKNOWN',    # If dropna is disabled, the NaNs of the values (only!) can be replaced with a certain value.
    )
print(str(PT.shape[0]) + " records found, including GrandTotal if applicable.")
# Visualization filters: they do not recalculate the pivot-table values, only affect the display of such values.
# Filter based on the values of a column:
#col_cond_1 =
#col_cond_2 =
# Filter based on the values of the index (configured from left to right, but applied from right to left):
#ind_cond_1 =
#ind_cond_2 =
(PT
 #[
 #col_cond_1
 #&
 #col_cond_2
 #]
 #.xs(ind_cond_2[0], level=ind_cond_2[1], axis=0, drop_level=False)
 #.xs(ind_cond_1[0], level=ind_cond_1[1], axis=0, drop_level=False)
 #.reset_index().sort_values()
 #.sort_index(level=)
 )
--- End of example.

df.hist()       Plots a histogram using the DataFrame data.
df.plot()       Handler for many types of plots.
df.plot.line()  Line plot of the DataFrame. Use the option 'subplots=True' to plot the data of each Series in a different
                chart. Use the option 'kind="kde"' to plot a probability density (pdf) estimate of the Series that form the DataFrame.
df.plot.area()  Area plot of the DataFrame.
df.plot.bar()   Vertical bar plot of the DataFrame.
df.plot.barh()  Horizontal bar plot of the DataFrame.
df.plot.pie()   Pie plot of the DataFrame.
df.plot.density()    Density estimate plot of the DataFrame.
df.plot.kde()   Density estimate plot of the DataFrame.
df.plot.box()   Box plot of the DataFrame.
df.boxplot()    Statistical information plot, one box per Series (i.e., column). Red line: median. Blue box: 25%-75%
                percentile of the data. Dashed line: 1.5 times the blue box. Crosses: outliers.
df.plot.hexbin()    2D hexbin plot. Grid size and color can be adjusted.
df.plot.scatter()   2D scatter plot. Bubble colors and sizes can be adjusted. The number of data points should not be
                excessively big; otherwise, subsampling can be used.
Example:
# Subsampling.
idx = np.random.permutation(len(rankpi_df))[0:len(rankpi_df)//1000]
rankpi_df_subsampled = rankpi_df[['[TDD]DL PRB Utilization Rate',
                                  '[LTE]Average PHY DL Throughput(Mbps)',
                                  '[TDD]Mean RRC-Connected User Number',
                                  'Resource_Oversuscription']].loc[idx]
rankpi_df_subsampled.plot.scatter(
    x='[TDD]DL PRB Utilization Rate',
    y='[LTE]Average PHY DL Throughput(Mbps)')
qcut()          Divides data into bins using quantiles (equal-size partitions of the data).
Example:
qcut(df['support'], 4, retbins=True, duplicates='drop', labels=None)
concat(list_of_dataframes)    Concatenates the DataFrames in a list of DataFrames if all such DataFrames have the same columns.
Example:
"""Check for a directory"""
targetdir = "/home/siyer/Documents/Test_data/"
suffix = '*.csv'
if os.path.exists(targetdir):
    print("directory found")
    os.chdir(targetdir)
    filelist = [i for i in glob.glob(suffix)]
    print(filelist)
else:
    print("cant find directory")

appdata_list = []    # Create an empty list. pd.concat takes a list of dataframes as an argument.
for i in filelist:
    appdata = pd.read_csv(i, index_col=0)    # Add a dictionary of expected datatypes.
    appdata['filename'] = os.path.basename(i)
    appdata_list.append(appdata)

big_appdata = pd.concat(appdata_list, ignore_index=True)    # ignore_index True and False give the same result here.

merge()         Combines two DataFrames using their index or column values. Pass 'how=outer' for the set union,
                'how=inner' for the set intersection, 'how=left' for complementing the first DataFrame with information
                from the second, and 'how=right' for complementing the second DataFrame with information from the first.
                Pass 'left_index=True' and/or 'right_index=True' to specify the use of indexes for making the relations.
                Pass 'left_on="Column"' or 'right_on="Column"' to specify the use of columns for making the relations
                (instead of the DataFrame indices). When both DataFrames being merged contain columns with the same name,
                Pandas appends _x and _y to the columns coming from the first and second DataFrame, respectively.
scatter_matrix(df)    Visual matrix of 2D scatter plots for comparing multiple variables.
                Requires: 'from pandas.plotting import scatter_matrix' (pandas.tools.plotting in old versions).
df.to_csv()     Writes the DataFrame contents into a CSV file.

Timestamp/Period commands:
Timestamp()     Converts a string that denotes a point in time to a Timestamp object. If the resulting values are used as
                index values for a Series or DataFrame, they constitute a DatetimeIndex object.
to_datetime()   Converts non-standard strings that denote points in time to Timestamp objects. Pass dayfirst=True to parse
                European dates instead of American ones.
Example:
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
date_range()    Computes a timeline of equally spaced Timestamps.
Period()        Converts a string that denotes a time period (e.g., a month or day) to a Period object. If the resulting
                values are used as index values for a Series or DataFrame, they constitute a PeriodIndex object.
Timedelta()     Computes a distance in time.
+               If used with Timestamps or Timedeltas, adds them properly.
-               If used with Timestamps or Timedeltas, subtracts them properly.
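A quick sketch of Timestamp arithmetic (the date strings are arbitrary):

Example:
import pandas as pd
ts = pd.Timestamp('2023-05-01 12:00')
print(ts + pd.Timedelta('1 day'))                   # 2023-05-02 12:00:00
print(pd.to_datetime('01/05/2023', dayfirst=True))  # parsed as May 1st, not January 5th
print(pd.date_range('2023-05-01', periods=3, freq='D'))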
================
Pandas Profiling
================
import pandas_profiling    Imports the Pandas Profiling functions. Allows quick data exploration using Pandas.

=========================
Tensorflow Extended (TFX)
=========================
https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/data_validation/tfdv_basic.ipynb#scrollTo=mPt5BHTwy_0F

Data exploration, schema generation, and data validation:
import tensorflow_data_validation as tfdv       Imports the Tensorflow Extended Data Validation utility. Runs over Apache Beam.
stats = generate_statistics_from_dataframe()    Computes dataset statistics.
visualize_statistics()                          Visualizes the generated statistics. Can be used to compare datasets visually.
Example:
visualize_statistics(
    lhs_statistics=eval_stats,
    rhs_statistics=train_stats,
    lhs_name='EVAL_DATASET',
    rhs_name='TRAIN_DATASET'
)
schema = tfdv.infer_schema(statistics=train_stats)    Infers a schema from the computed statistics.
display_schema(schema)                                Displays the inferred schema.
validate_statistics(statistics=eval_stats, schema=schema, environment)    Detects anomalies in a dataset using a reference schema.
display_anomalies(anomalies)                          Displays the detected anomalies.

Additional examples:
# You can relax the minimum fraction of values that must come from the domain of a particular feature
get_feature(schema, 'feature_column_name').distribution_constraints.min_domain_mass =
get_feature(schema, 'readmitted').not_in_environment.append('SERVING')

# You can add a new value to the domain of a particular feature
get_domain(schema, 'feature_column_name').value.append('string')

# Restrict the range of the `age` feature
set_domain(schema, 'age', schema_pb2.IntDomain(name='age', min=17, max=90))

# Define particular options for generate_statistics_from_dataframe()
options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True, feature_allowlist=approved_cols)

# Calculate skew for the diabetesMed feature
diabetes_med = tfdv.get_feature(schema, 'diabetesMed')
diabetes_med.skew_comparator.infinity_norm.threshold = 0.03  # domain knowledge helps to determine this threshold

# Calculate drift for the payer_code feature
payer_code = tfdv.get_feature(schema, 'payer_code')
payer_code.drift_comparator.infinity_norm.threshold = 0.03  # domain knowledge helps to determine this threshold

# Calculate anomalies
skew_drift_anomalies = tfdv.validate_statistics(train_stats, schema,
                                                previous_statistics=eval_stats,
                                                serving_statistics=serving_stats)

# Compute statistics with a pipeline component
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples']
)

# Declare the InteractiveContext and use a local sqlite file as the metadata store.
# You can ignore the warning about the missing metadata config file.
context = InteractiveContext(pipeline_root=_pipeline_root)
example_gen = tfx.components.CsvExampleGen(input_base=_data_root)

# Run the component
context.run(statistics_gen)

# Show the output statistics
context.show(statistics_gen.outputs['statistics'])

====
Dask
====
import dask     Imports the Pandas/Numpy distributed-computing (parallelization) library. Performs DataFrame operations
                faster than Pandas due to multi-core capabilities, and array operations faster than Numpy due to
                multi-computer capabilities.
dataframe.read_csv()       Fast reading of multiple CSV files into a Dask DataFrame.
dataframe.from_pandas()    Turns a Pandas DataFrame into a Dask DataFrame for speeding up operations.

============
scikit-learn
============
Numerical variables commands:
-----------------------------
preprocessing.StandardScaler()    Scales the numerical values into a zero-mean, unit-variance random variable.

Categorical variables commands:
-------------------------------
Both DictVectorizer and OneHotEncoder turn categorical features into binary features. Because scikit-learn estimators
expect continuous input, it is advantageous to represent or characterize categorical features as binary features, since
that way they will not be interpreted as being ordered (i.e., operations like <, >, <=, and >= have no effect). In the
case of DictVectorizer the input is a dictionary of feature-value pairs, while in the OneHotEncoder case the input is a
set of integers that represents the different categories of a feature. The LabelEncoder function can be used to turn the
values of a feature into integer values, as sketched after this list.
lev = LabelEncoder()       Used for transforming categorical text features into integer values.
vecDV = DictVectorizer()   Class instantiation (i.e., object). Used for transforming categorical text features formatted as
                a dictionary of feature-value pairs into binary features (binary vectors).
                Requires: 'from sklearn.feature_extraction import DictVectorizer'
vecDV.fit_transform()      Creates the mapping, transforms the data, and stores it as a numpy.ndarray (for reasonably sized
                data) or a scipy.sparse matrix (for big data).
vecDV.fit_transform().toarray()    Similar to the previous, but forces the storage of the data as a numpy.ndarray in memory.
vecDV.transform()          Transforms the data and stores it as a numpy.ndarray (for reasonably sized data) or a scipy.sparse matrix (for big data).
vecDV.get_feature_names()  Shows the mapping of feature-value pairs to the binary features (i.e., features).
vecOHE = OneHotEncoder()   Class instantiation (i.e., object). Used for transforming categorical text features formatted as
                a set of integers into binary features (binary vectors).
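A minimal sketch chaining LabelEncoder and OneHotEncoder (the 'sparse_output' argument is named 'sparse' in scikit-learn versions before 1.2):

Example:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
codes = le.fit_transform(['Low', 'High', 'Medium', 'Low'])
print(codes)                                     # one integer code per category
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform(codes.reshape(-1, 1)))   # one binary column per category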
Text processing commands:
-------------------------
ldobject = load_files()    Loads files stored in a folder structure where the names of the folders represent the labels of
                the files. Requires: 'from sklearn.datasets import load_files'
bagoword = CountVectorizer()    Applies the bag-of-words strategy to an iterable of strings (counts the words used in a
                document or message exchange, creating a vocabulary from a text corpus).
                Requires: 'from sklearn.feature_extraction.text import CountVectorizer'
                Use the 'vocabulary' parameter to define a dictionary of tokens to be used.
                Use the 'min_df' parameter to set the minimum number of text samples in which a token needs to appear to be included in the vocabulary.
                Use the 'max_df' parameter to exclude tokens from the vocabulary that are too frequent in the text samples to be informative.
                Use the 'stop_words' parameter to exclude tokens from a predefined set.
                Use the 'token_pattern='(?u)\\b\\w+\\b'' parameter to parse one-character words.
                Use the 'ngram_range' parameter to allow the analysis of groups of words instead of word by word. This way, some of the local ordering information is preserved.
                Use the 'analyzer='char_wb'' parameter to tokenize by characters within word boundaries. This way, misspellings of words have a smaller effect on the classifier.
bagoword.fit()             Trains the CountVectorizer object.
bagoword.vocabulary_       Displays the vocabulary of a trained CountVectorizer object.
bagoword.get_feature_names()    Displays the names of the columns of the trained object.
bagoword.transform()       Returns a Scipy sparse matrix. Uses the trained CountVectorizer object to map strings of data to word counts.
tfidf = TfidfTransformer()    Applies the TF-IDF strategy over an iterable of word counts (weighs the word counts by the
                inverse of their frequency in a text corpus and/or normalizes them).
                Use the 'norm' parameter to specify the type of normalization.
                Use the 'use_idf' parameter to specify whether the inverse document frequency should be used in the calculation.
tfidf = TfidfVectorizer()     Applies both the bag-of-words and TF-IDF strategies to an iterable of strings (counts the
                words used in a document, weighs them by the inverse of their frequency in a text corpus, and normalizes them).
tfidf.idf_      Displays the inverse document frequency values stored by the object.
ENGLISH_STOP_WORDS    List of common words in the English language.
                Requires: 'from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS'
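A short sketch of the bag-of-words plus TF-IDF pipeline described above (get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0+):

Example:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
corpus = ['the cat sat', 'the dog sat', 'the cat ran']
cv = CountVectorizer()
counts = cv.fit_transform(corpus)    # sparse word-count matrix
print(cv.get_feature_names_out())
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray())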
Multilabel commands:
--------------------
labl = MultiLabelBinarizer()    Class instantiation (i.e., object). Used for transforming a set of labels into binary vectors.
                Requires: 'from sklearn.preprocessing import MultiLabelBinarizer'
labl.fit()      Stores the set of labels present in the data.
labl.fit_transform()    Transforms the data and stores it as a numpy.ndarray (for reasonably sized data) or a scipy.sparse matrix (for big data).
labl.classes_   Shows the mapping of labels to the binary features.

Automatic feature-selection commands:
-------------------------------------
sel = SelectPercentile()    Keeps a percentage of the most important features of the dataset based on supervised ANOVA.
                Supply 'score_func' (f_classif or f_regression) and 'percentile'.
                Requires: 'from sklearn.feature_selection import SelectPercentile'
sel.fit()       Fits the ANOVA model over the data.
sel.transform() Returns the dataset with the most important features.
sel.get_support()    Returns the boolean mask with the inclusion decisions.
select = SelectFromModel()    Keeps a number of the most important features of the dataset based on the feature
                importances of a fitted estimator. Supply 'estimator' and 'threshold'.
                Requires: 'from sklearn.feature_selection import SelectFromModel'
select.fit()    Fits the estimator over the data.
select.transform()     Returns the dataset with the most important features.
select.get_support()   Returns the boolean mask with the inclusion decisions.

Linear regression commands:
---------------------------
lrobj = LinearRegression()    Class instantiation. Used for training a least-squares regression model.
                Requires: 'from sklearn.linear_model import LinearRegression'
                Specify 'fit_intercept' if the intercept weight needs to be computed.
lrobj.fit()     Trains the linear regression model.
lrobj.coef_     Returns the weights (parameters) of the linear model.

Feedforward neural network (multi-layer perceptron) commands:
-------------------------------------------------------------
mlpobj = MLPClassifier()    Class instantiation (i.e., object). Used for training a multi-class feed-forward neural net.
                Requires: 'from sklearn.neural_network import MLPClassifier'
                Specify 'solver' to indicate the optimization algorithm for backpropagation.
                Specify 'activation' to indicate the activation function of the neurons.
                Specify 'hidden_layer_sizes' to indicate how many hidden layers and neurons per hidden layer.
                Specify 'random_state' for reproducibility.
mlpobj.fit()    Trains a neural network based on data and labels.
mlpobj.score()  Returns the mean accuracy on the given test data and labels.
mlpobj.coefs_   Returns the weights of the neural net.

Classification tree commands:
-----------------------------
treeobj = tree.DecisionTreeClassifier()    Class instantiation (i.e., object). Used for training a multi-class
                classification tree. Requires: 'from sklearn import tree'
                If the input matrix X is very sparse, it is recommended to convert it to a sparse csc_matrix before calling
                fit and to a sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for
                a sparse-matrix input compared to a dense matrix when features have zero values in most of the samples.
cltree = treeobj.fit()    Trains a classification tree based on data and labels.
cltree.predict()          Runs the classification tree on test data.
cltree.predict_proba()    Runs the classification tree on test data, outputting the probability of each class (i.e., the
                          fraction of training samples of the same class in a leaf).
cltree.score()            Returns the mean accuracy on the given test data and labels.
tree.export_graphviz()    Creates a (non-human-readable) classification-tree visualization. A tool like pydotplus is
                          needed to create an image file. Requires: 'from sklearn.tree import export_graphviz'
pydot.graph_from_dot_data()    Creates an image file based on the output of graphviz.
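A compact sketch training and scoring a decision tree on a toy dataset:

Example:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
cltree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
print(cltree.predict(X_te[:5]))    # predicted classes
print(cltree.score(X_te, y_te))    # mean accuracy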
Random Forests commands:
------------------------
forestobj = RandomForestClassifier() Class instantiation (i.e., object). Used for training a multi-class random forest classifier. Requires: 'from sklearn.ensemble import RandomForestClassifier'. If the input matrix X is very sparse, it is recommended to convert it to a sparse csc_matrix before calling fit and to a sparse csr_matrix before calling predict; training can be orders of magnitude faster on sparse input when features are zero in most of the samples. Specify 'n_estimators' to control the number of decision trees trained. Specify 'max_samples' to control the number of samples used for each decision tree in the random forest. Specify 'max_features' to control the number of features considered at each decision tree split. Specify 'max_depth' to control the number of levels of the decision trees in the random forest. Specify 'n_jobs' to use CPU multicore capabilities.
clforest = forestobj.fit() Trains a random forest classifier on data and labels.
clforest.predict() Runs a random forest classifier on test data.
clforest.score() Returns the mean accuracy on the given test data and labels.
clforest.estimators_ Displays the decision trees that form the random forest.
clforest.feature_importances_ Displays feature importance measures.

Gradient Boosting Machines:
---------------------------
gbmobj = GradientBoostingClassifier() Class instantiation (i.e., object). Used for training a multi-class Gradient Boosting Machine for classification. Requires: 'from sklearn.ensemble import GradientBoostingClassifier'. Specify 'n_estimators' to control the number of decision trees trained. Specify 'max_depth' to control the number of levels of the decision trees in the ensemble. Specify 'learning_rate' to set the GBM learning rate (a.k.a. shrinkage).
clgbm = gbmobj.fit() Trains a GBM classifier on data and labels.
clgbm.predict() Runs a GBM classifier on test data.
clgbm.score() Returns the mean accuracy on the given test data and labels.
clgbm.estimators_ Displays the decision trees that form the GBM.
clgbm.feature_importances_ Displays feature importance measures.

Cross-Validation commands:
--------------------------
train_test_split() Splits the original dataset into a training and a testing dataset. Requires: 'from sklearn.model_selection import train_test_split' (in older scikit-learn versions: 'from sklearn.cross_validation import train_test_split')
cross_val_score() Cross-validates a classifier. Requires an (x,) shape instead of (x,1) for the label array. Requires: 'from sklearn.model_selection import cross_val_score'

Generalization performance commands:
------------------------------------
confusion_matrix() Computes a confusion matrix for the results of a binary or multiclass classifier. Multilabel is not currently supported. Requires: 'from sklearn.metrics import confusion_matrix, classification_report'
f1_score() Computes the F1 score for the results of binary, multiclass, and multilabel classifiers. This metric accounts for imbalanced sets, especially when using the parameter 'average='macro''. Requires: 'from sklearn.metrics import f1_score'
classification_report() Computes the precision, recall, and F1 score per class for binary, multiclass, and multilabel classifiers. Requires: 'from sklearn.metrics import confusion_matrix, classification_report'. See the sketch after this list.
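A minimal sketch combining the splitting, cross-validation, and performance commands above; the random forest and the iris dataset are illustrative assumptions:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)                        # toy dataset (assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clforest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(cross_val_score(clforest, X_train, y_train, cv=5)) # 5-fold CV accuracies
y_pred = clforest.predict(X_test)
print(confusion_matrix(y_test, y_pred))                  # rows: true, columns: predicted
print(classification_report(y_test, y_pred))             # precision/recall/F1 per class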
Persistent (Serialized) Data (binary/pickled files) commands:
-------------------------------------------------------------
joblib.dump() Exports a Python object into a Python pickled file. Requires: 'from sklearn.externals import joblib' (in newer scikit-learn versions, simply 'import joblib')
joblib.load() Loads a Python pickled file. Requires: 'from sklearn.externals import joblib' (in newer scikit-learn versions, simply 'import joblib')

=======
mlxtend
=======
import mlxtend Imports the Machine Learning Extensions library, which complements scikit-learn.

Association rules:
------------------
preprocessing.TransactionEncoder() One-Hot Encoding assuming a list of lists as input.
frequent_patterns.apriori() Computes the frequent itemsets of an association rules model using the Apriori algorithm. Example: apriori(df, min_support=0.2, use_colnames=True, verbose=1)
frequent_patterns.association_rules() Function with multiple metrics to better interpret trained models. Example: association_rules(df, metric="confidence", min_threshold=0.6)

===========
statsmodels
===========
import statsmodels.api as sm Imports the Statistical Models and Time Series Analysis library.
sm.RecursiveLS() Recursive Least Squares.

=====
river
=====
Online (Incremental) Machine Learning library.

===========
Yellowbrick
===========
Provides insight to humans by creating visualizations of ML algorithms' performance and feature analyses.

====
ELI5
====
Feature analysis library.

====
TPOT
====
Automated Machine Learning library. Automatically tries multiple ML models on the data.

=====
Keras
=====
import keras Imports the Keras Deep Learning framework. Keras is an API for implementing Neural Networks in TensorFlow. Requires tensorflow.
datasets.mnist Handwritten-digits MNIST dataset.
nn = models.Sequential() Generates a Neural Net model object. This model assumes a linear stack of layers. Pass a list of layers as arguments. Use 'layers.InputLayer()' to specify input layer parameters. Use 'layers.Flatten()' to flatten a layer. Use 'layers.Dense()' to specify a dense (perceptron) layer and its parameters. Use 'layers.Conv2D()' to specify a 2D convolutional layer and its parameters. Use 'layers.MaxPool2D()' to specify a 2D pooling layer and its parameters. See the sketch at the end of this section.
nn.summary() Provides a summary of the proposed model object.
nn.compile() Configures the learning process. Requires an optimizer, a cost function, and a performance metric.
nn.fit() Trains the model. If the model has been pretrained, it continues from that point.
nn.fit_generator() Trains the model on data yielded batch by batch by a Python generator (deprecated in newer Keras versions in favor of fit()).
nn.predict_classes() Runs a classifier model on test data (removed in newer Keras versions; use predict() plus argmax instead).
nn.predict() Runs a classifier on test data, outputting the probability of each class.
nn.evaluate() Returns the loss and metric values (e.g., accuracy) on the given test data and labels.
nn.history Contains the evolution of the model training.
callbacks.ReduceLROnPlateau() Reduces the learning rate when a monitored metric has stopped improving.
callbacks.EarlyStopping() Stops training when a monitored metric has stopped improving.
mbn = MobileNet() Creates an object for using the pre-trained MobileNet. Requires: 'from keras.applications.mobilenet import MobileNet'. Use 'weights' to load weights pre-trained on well-known databases (e.g., weights='imagenet'). If used for transfer learning, this model can serve as a layer in a Keras Sequential model.
mbn.layers[0].trainable Allows setting whether the layers of MobileNet will be trainable or not.
layers.Input() Instantiates an input tensor for a model.
layers.Dense() Layer of fully connected neurons.
layers.Flatten() Flattens the input into a one-dimensional vector.
layers.Dropout() Applies dropout (regularization) to a neural net layer.
layers.BatchNormalization() Normalizes the activations of the previous layer for each batch.
optimizers.Adam() Adam optimizer for nn.compile().
optimizers.SGD() Stochastic Gradient Descent optimizer for nn.compile().
optimizers.RMSprop() RMSprop optimizer for nn.compile().
utils.to_categorical() Applies One-Hot Encoding (encoding to binary mutually orthogonal vectors) to the labels.
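A minimal sketch of a Sequential model on MNIST using the commands above; the layer sizes, dropout rate, and epoch count are illustrative assumptions:
from keras import models, layers, utils
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0        # scale pixels to [0, 1]
y_train = utils.to_categorical(y_train, 10)              # one-hot encode the labels
y_test = utils.to_categorical(y_test, 10)
nn = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),                # 28x28 image -> 784 vector
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),                                 # regularization
    layers.Dense(10, activation='softmax'),              # one output per digit class
])
nn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
nn.summary()
nn.fit(x_train, y_train, epochs=2, validation_split=0.1)
print(nn.evaluate(x_test, y_test))                       # [loss, accuracy]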
======
pyzbar
======
import pyzbar Imports the QR-code and barcode reading library.

============
scikit-image
============
import skimage Imports the scikit-image library for image processing.
io.imread() Reads an image file and returns a numpy array.
io.imread_collection() Returns a list or collection of images.
img_as_float() Converts the image data format to a float representation of the pixels.
img_as_ubyte() Converts the image to an unsigned byte representation of the pixels.
color.rgb2gray() Converts an RGB image to a grayscale image.

Image Filtering:
----------------
filters.threshold_otsu() Computes the Otsu threshold for separating objects from the background in grayscale images. Once computed, the threshold is compared with the grayscale image (e.g., grayscale_img > thresh).

Image Segmentation:
-------------------
segmentation.clear_border() Clears the borders of a binary image after applying the thresholding.
measure.label() Segments the image into regions or labels. Returns a matrix that indicates the label to which every pixel belongs.
reg = measure.regionprops() Returns a list with one 'region' object per region of the segmented image.
reg[i].bbox Attribute holding the bounding box of a region object. Useful for cropping regions from a segmented image using slicing. See the sketch after this list.
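A minimal sketch of the threshold-clear-label-crop pipeline above; the sample 'coins' image is an assumption chosen so the sketch is self-contained:
from skimage import data, filters, segmentation, measure
img = data.coins()                        # sample grayscale image (assumption)
thresh = filters.threshold_otsu(img)      # global Otsu threshold
binary = img > thresh                     # binary mask: objects vs. background
cleared = segmentation.clear_border(binary)
labels = measure.label(cleared)           # one integer label per connected region
for reg in measure.regionprops(labels):
    minr, minc, maxr, maxc = reg.bbox     # bounding box of the region
    crop = img[minr:maxr, minc:maxc]      # crop the region via slicing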
=====
spacy
=====
import spacy Imports the Natural Language Processing library.

====
NLTK
====
import nltk Imports the Natural Language Processing library.

=========
Langchain
=========
Large Language Model development framework. Allows importing, integrating, and customizing LLMs with agents/toolkits.

===
AWS
===
import boto3 Imports the Amazon Web Services client library.

============
Google Cloud
============
Speech:
-------
from google.cloud import speech Imports Google's Speech-to-Text API library.

===
bob
===
import bob Imports the Bob library, which includes speaker recognition modules.

=====
coach
=====
import coach Imports Intel's reinforcement learning library.

======
Cython
======
Compiles Python-like code to C extensions for speed.

=====
Numba
=====
JIT compiler that accelerates numerical Python code.

========
networkx
========
import networkx Imports the NetworkX library for graph-based data manipulation.

======
igraph
======
import igraph Imports the iGraph library for graph-based data manipulation.

=========
yahoo_fin
=========
from yahoo_fin import stock_info Imports the Yahoo Finance stocks library.
stock_info.get_data('aapl', start_date='01/01/1999') Downloads historical price data for the 'aapl' ticker starting on the given date.

=======
padasip
=======
import padasip Imports the Digital Signal Processing (DSP) library for adaptive systems and adaptive filters.

=====
Voila
=====
import voila Imports the Voila library for serving interactive Jupyter Notebooks with widgets via HTML.

======
Gradio
======
Imports the Hugging Face Gradio API for deploying ML pipelines and models.

=========
Streamlit
=========
import streamlit as st Imports the Streamlit API to easily create low-code websites with interactive widgets.
pyplot() Plots a Matplotlib figure in Streamlit. Plots are handled as rendered images (i.e., non-interactive).
Example 1, using subplots:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(2, 2))
ax1.hist(rankpi_df.loc[rankpi_df_DLPRB_mask, ['Hour']], bins=24)
ax2.hist(rankpi_df.loc[rankpi_df_DLPRB_mask, ['Hour']], bins=24)
st.pyplot(fig)
Example 2, using a subplot of a multi-index table:
fig_apnprotocoluse, ax = plt.subplots()
apn_service_proto_df.head(10).drop('All', axis=1).drop('All', axis=0).plot.bar(ax=ax, stacked=True)
plt.xlabel("Application")
st.pyplot(fig_apnprotocoluse)
altair_chart() Plots an interactive Altair chart in Streamlit.
Example of a histogram in Altair:
import altair as alt
c = alt.Chart(rankpi_df.loc[rankpi_df_DLPRB_mask, ['Hour']]).mark_bar().encode(
    alt.X("Hour", bin=False),
    y='count()'
)
r = alt.Chart(rankpi_df.loc[rankpi_df_DLPRB_mask, ['Hour']]).mark_rule(color='red').encode(
    x='mean(Hour):Q',
    size=alt.value(5)
)
st.altair_chart((c + r), use_container_width=True)

====
Dash
====
Optimized web framework for web dashboards.
appdata_dict = [dict(r.items()) for _, r in appdata.iterrows()] Converts each row of a pandas DataFrame into a dictionary, producing the list-of-records format that Dash data tables expect. (Older pandas used 'r.iteritems()', which was renamed 'items()'.) See the sketch below.
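A minimal sketch of the row-to-records conversion above; the DataFrame contents are made-up toy data for illustration:
import pandas as pd
appdata = pd.DataFrame({'app': ['mail', 'web'], 'bytes': [10, 20]})  # toy data (assumption)
# One dictionary per row; r.items() iterates over the (column, value) pairs of the row.
appdata_dict = [dict(r.items()) for _, r in appdata.iterrows()]
# Equivalent one-liner, usually preferred:
appdata_dict = appdata.to_dict('records')
print(appdata_dict)  # [{'app': 'mail', 'bytes': 10}, {'app': 'web', 'bytes': 20}]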