Introduction to Data Science in Python: Study Notes

These are the author's study notes from the Coursera course Introduction to Data Science in Python, provided for reference only.


1. 50 Years of Data Science

    (1) Data Exploration and Preparation

    (2) Data Representation and Transformation

    (3) Computing with Data

    (4) Data Modeling

    (5) Data Visualization and Presentation

    (6) Science about Data Science


2. Functions

def add_numbers(x, y, z=None, flag=False):
    if flag:
        print('Flag is true!')
    if z is None:
        return x + y
    else:
        return x + y + z

print(add_numbers(1, 2, flag=True))


Assign the function add_numbers to a variable a, then call it through a:

a = add_numbers

a(1, 2, flag=True)


3. Checking data types

type('This is a string')

-> str

type(None)

-> NoneType


4. Tuple

Tuples are an immutable data structure (cannot be altered).

x = (1, 'a', 2, 'b')

type(x)

->tuple


5. List

Lists are a mutable data structure.

x = [1, 'a', 2, 'b']

type(x)

->list


6. Append

Use append to add an object to the end of a list.

x.append(3.3)

print(x)

->[1, 'a', 2, 'b', 3.3]


7. Loop through each item in the list

for item in x:
    print(item)

->

1
a
2
b
3.3


8. Using the indexing operator to loop through each item in the list

i = 0

while i != len(x):
    print(x[i])
    i = i + 1

->

1
a
2
b
3.3


9. Basic list operations

(1) Use + to concatenate lists

[1, 2] + [3, 4]

-> [1, 2, 3, 4]

(2) Use * to repeat lists

[1]*3

->[1, 1, 1]

(3) Use the in operator to check if something is inside a list

1 in [1, 2, 3]

->True


10. Basic string operations

(1) Use bracket notation to slice a string.

x = 'This is a string'

print(x[0])

->T

print(x[0:1])

->T

print(x[0:2])

->Th

print(x[-1])  # the last element

->g

print(x[-4:-2])  # start at the 4th element from the end and stop before the 2nd element from the end

->ri

x[:3]  # a slice from the beginning of the string, stopping before index 3

->Thi

x[3:]  # a slice starting at index 3 (the 4th element) and going all the way to the end

-> s is a string

(2) Concatenation, repetition, and membership also work on strings

firstname = 'Christopher'

lastname = 'Brooks'

print(firstname + ' ' + lastname)

->Christopher Brooks

print(firstname*3)

->ChristopherChristopherChristopher

print('Chris' in firstname)

->True

(3) split returns a list of all the words in a string, or a list split on a specific character.

firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0]  # [0] selects the first element of the list

lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1]  # [-1] selects the last element of the list

print(firstname)

->Christopher

print(lastname)

->Brooks

(4) Make sure you convert objects to strings before concatenating.

'Chris' + 2

->TypeError

'Chris' + str(2)

->Chris2


11. Dictionary

(1) Dictionaries associate keys with values

x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}

x['Christopher Brooks']

->brooksch@umich.edu

x['Kevyn Collins-Thompson'] = None

x['Kevyn Collins-Thompson']

->(no output, because the value is None)

(2) Iterate over all of the keys:

for name in x:
    print(x[name])

->

brooksch@umich.edu
billg@microsoft.com
None

(3) Iterate over all of the values:

for email in x.values():
    print(email)

->

brooksch@umich.edu
billg@microsoft.com
None

(4) Iterate over all of the items (key-value pairs) in the dictionary:

for name, email in x.items():
    print(name)
    print(email)

->

Christopher Brooks
brooksch@umich.edu
Bill Gates
billg@microsoft.com
Kevyn Collins-Thompson
None

(5) Unpack a sequence into different variables:

x = ('Christopher', 'Brooks', 'brooksch@umich.edu')

fname, lname, email = x

fname

->Christopher

lname

->Brooks

(6) Make sure the number of values you are unpacking matches the number of variables being assigned.

x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')

fname, lname, email = x

->ValueError: too many values to unpack (expected 3)
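When a sequence may hold more values than there are names, one option is Python's starred assignment target, which collects the leftover values into a list. This is a side sketch rather than part of the course notes; the names `record`, `rest`, and `message` are only illustrative.

```python
# Extended unpacking: a starred name absorbs whatever is left over.
record = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')

fname, lname, *rest = record
print(fname)  # Christopher
print(lname)  # Brooks
print(rest)   # ['brooksch@umich.edu', 'Ann Arbor']

# A mismatched plain unpacking raises ValueError, which can be caught.
try:
    fname, lname, email = record
except ValueError as exc:
    message = str(exc)
    print('Unpacking failed:', message)
```

The starred name always receives a list, even when it ends up empty.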


12. More on Strings

(1) Simple examples

print('Chris' + 2)

->TypeError

print('Chris' + str(2))

->Chris2

(2) Python has a built-in method, str.format, for convenient string formatting.

sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris' }

sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'

print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))

->Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
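The positional `{}` placeholders above can get hard to read when there are many of them. As a hedged alternative sketch, the same record can be formatted with named placeholders or with an f-string (Python 3.6+); the names `template`, `by_name`, and `with_total` are introduced here for illustration only.

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}

# Named placeholders are filled from the dictionary's keys.
template = '{person} bought {num_items} item(s) at a price of {price} each'
by_name = template.format(**sales_record)
print(by_name)

# An f-string evaluates expressions inline; ':.2f' fixes two decimals.
total = sales_record['num_items'] * sales_record['price']
with_total = f"{sales_record['person']} paid a total of {total:.2f}"
print(with_total)
```

The `:.2f` format spec is useful for money-like values, where floating-point arithmetic may otherwise print long fractional tails.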


13. Reading and Writing CSV files

(1) Import csv and read the file

import csv

# IPython magic (notebook only): display floats with 2 decimal places
%precision 2

with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))  # convert csvfile to a list of dictionaries

mpg[:3]

->

[OrderedDict([('', '1'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'auto(l5)'),
              ('drv', 'f'),
              ('cty', '18'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '2'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'manual(m5)'),
              ('drv', 'f'),
              ('cty', '21'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '3'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '2'),
              ('year', '2008'),
              ('cyl', '4'),
              ('trans', 'manual(m6)'),
              ('drv', 'f'),
              ('cty', '20'),
              ('hwy', '31'),
              ('fl', 'p'),
              ('class', 'compact')])]

(2) Check the length of the list

len(mpg)

->234

(3) keys gives us the column names of our csv

mpg[0].keys()

->odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

(4) Find the average hwy fuel economy across all cars. All values in the dictionaries are strings, so we need to convert them to float.

sum(float(d['hwy']) for d in mpg) / len(mpg)

->23.44

(5) Use set to return the unique values for the number of cylinders the cars in our dataset have.

cylinders = set(d['cyl'] for d in mpg)

cylinders

->{'4', '5', '6', '8'}

(6) Group the cars by number of cylinders and find the average cty mpg for each group.

CtyMpgByCyl = []

for c in cylinders:  # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg:  # iterate over all dictionaries
        if d['cyl'] == c:  # if the cylinder level matches,
            summpg += float(d['cty'])  # add the cty mpg
            cyltypecount += 1  # and increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount))  # append the tuple (cylinders, average mpg)

CtyMpgByCyl.sort(key=lambda x: x[0])  # sort by cylinder count

CtyMpgByCyl

->[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]
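The nested loop above rescans the whole list once per cylinder level. A possible single-pass variant, sketched here with `collections.defaultdict`, groups the values first and averages afterwards. The `rows` list is hypothetical sample data shaped like `csv.DictReader` output, not values from mpg.csv, and the names `groups` and `avg_by_cyl` are illustrative.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like csv.DictReader output (all strings).
rows = [
    {'cyl': '4', 'cty': '18'},
    {'cyl': '4', 'cty': '22'},
    {'cyl': '6', 'cty': '16'},
    {'cyl': '8', 'cty': '12'},
    {'cyl': '6', 'cty': '18'},
]

# Collect the city mpg values of each cylinder group in one pass.
groups = defaultdict(list)
for row in rows:
    groups[row['cyl']].append(float(row['cty']))

# Average each group and sort by cylinder count.
avg_by_cyl = sorted((cyl, sum(v) / len(v)) for cyl, v in groups.items())
print(avg_by_cyl)  # [('4', 20.0), ('6', 17.0), ('8', 12.0)]
```

Note that the cylinder counts are still strings, so the sort is alphabetical; that happens to match numeric order here because every key is a single digit.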

(7) Use set to return the unique values for the class types in our dataset

vehicleclass = set(d['class'] for d in mpg)

vehicleclass

->{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

(8) Find the average hwy mpg for each class of vehicle in our dataset.

HwyMpgByClass = []

for t in vehicleclass:  # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg:  # iterate over all dictionaries
        if d['class'] == t:  # if the vehicle class matches,
            summpg += float(d['hwy'])  # add the hwy mpg
            vclasscount += 1  # and increment the count
    HwyMpgByClass.append((t, summpg / vclasscount))  # append the tuple (class, average mpg)

HwyMpgByClass.sort(key=lambda x: x[1])  # sort by average mpg, ascending

HwyMpgByClass

->

[('pickup', 16.88),
 ('suv', 18.13),
 ('minivan', 22.36),
 ('2seater', 24.80),
 ('midsize', 27.29),
 ('subcompact', 28.14),
 ('compact', 28.30)]


14. Dates and Times

(1) Import the datetime and time modules

import datetime as dt

import time as tm

(2) time returns the current time in seconds since the Epoch (January 1, 1970)

tm.time()

->1583932727.90

(3) Convert the timestamp to datetime

dtnow = dt.datetime.fromtimestamp(tm.time())

dtnow

->

datetime.datetime(2020, 3, 11, 13, 18, 56, 990293)

(4) Handy datetime attributes: get year, month, day, etc. from a datetime

dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second

->(2020, 3, 11, 13, 18, 56)

(5) Timedelta is a duration expressing the difference between two dates.

delta = dt.timedelta(days = 100)

delta

->datetime.timedelta(100)

(6) date.today returns the current local date

today = dt.date.today()

today

->datetime.date(2020, 3, 11)

(7) the date 100 days ago

today - delta

->datetime.date(2019, 12, 2)

(8) compare dates

today > today - delta

-> True
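The date arithmetic above can be checked with fixed dates, so the result does not depend on when the code runs. The sketch below also shows `strftime` and `strptime` for converting between dates and text, which the notes do not cover; the variable names are illustrative.

```python
import datetime as dt

# Fixed dates so the result does not depend on when the code runs.
start = dt.date(2019, 12, 2)
end = dt.date(2020, 3, 11)

# Subtracting two dates yields a timedelta.
gap = end - start
print(gap.days)  # 100

# strftime formats a date as text; strptime parses text back.
text = end.strftime('%Y-%m-%d')
parsed = dt.datetime.strptime(text, '%Y-%m-%d').date()
print(text, parsed == end)  # 2020-03-11 True
```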


15. Objects and map()

(1) an example of a class in python:

class Person:
    department = 'School of Information'  # a class variable

    def set_name(self, new_name):  # a method
        self.name = new_name

    def set_location(self, new_location):
        self.location = new_location


person = Person()

person.set_name('Christopher Brooks')

person.set_location('Ann Arbor, MI, USA')

print('{} lives in {} and works in the department {}'.format(person.name, person.location, person.department))

(2) mapping the min function between two lists

store1 = [10.00, 11.00, 12.34, 2.34]

store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)

cheapest

-><map at 0x7f74034a8860>

(3) iterate through the map object to see the values

for item in cheapest:
    print(item)

->

9.0
11.0
12.34
2.01
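The map object printed as `<map at 0x...>` because map is lazy: it computes values only when iterated, and it can be iterated only once. A small sketch of that behaviour, reusing the two store lists (the names `first_pass` and `second_pass` are illustrative):

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)

# list() consumes the iterator and stores the results.
first_pass = list(cheapest)
print(first_pass)   # [9.0, 11.0, 12.34, 2.01]

# The map object is now exhausted, so a second pass is empty.
second_pass = list(cheapest)
print(second_pass)  # []
```

If the values are needed more than once, keep the list rather than the map object.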


16. Lambda and List Comprehensions

(1) an example of lambda that takes in three parameters and adds the first two

my_function = lambda a, b, c: a+b

my_function(1, 2, 3)

->3

(2) iterate from 0 to 999 and return the even numbers.

my_list = []

for number in range(0, 1000):
    if number % 2 == 0:
        my_list.append(number)

my_list

->[0, 2, 4,...]

(3) Now the same thing but with list comprehension

my_list = [number for number in range(0, 1000) if number % 2 == 0]

my_list

->[0, 2, 4,...]
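Lambdas and comprehensions often solve the same problem. As a rough comparison, the sketch below builds the even numbers with `filter` plus a lambda and with a list comprehension, then uses a lambda as a sort key in the same way the CSV section did. The names `evens_filter`, `evens_comp`, `pairs`, and `by_second` are illustrative.

```python
# filter keeps the items for which the lambda returns True.
evens_filter = list(filter(lambda n: n % 2 == 0, range(0, 1000)))

# The equivalent list comprehension.
evens_comp = [n for n in range(0, 1000) if n % 2 == 0]

print(evens_filter == evens_comp)  # True
print(len(evens_comp))             # 500

# A lambda is also handy as a sort key: order pairs by second element.
pairs = [('a', 3), ('b', 1), ('c', 2)]
by_second = sorted(pairs, key=lambda pair: pair[1])
print(by_second)  # [('b', 1), ('c', 2), ('a', 3)]
```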


17. Numpy

(1) import package

import numpy as np


18. Creating arrays (from tuples or lists)

(1) create a list and convert it to a numpy array

mylist = [1, 2, 3]

x = np.array(mylist)

x

->array([1, 2, 3])

(2) just pass in a list directly

y = np.array([4, 5, 6])

y

->array([4, 5, 6])

(3) pass in a list of lists to create a multidimensional array

m = np.array([[7, 8, 9], [10, 11, 12]])

m

->

array([[ 7,  8,  9],
       [10, 11, 12]])

(4) use the shape attribute to find the dimensions of the array (rows, columns)

m.shape

->(2, 3)

(5) arange returns evenly spaced values within a given interval

n = np.arange(0, 30, 2)

n

->array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

(6) reshape returns an array with the same data with a new shape

n = n.reshape(3, 5)

n

->

array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])

(7) linspace returns evenly spaced numbers over a specified interval

o = np.linspace(0, 4, 9)

o

->array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

(8) resize changes the shape and size of the array in place

o.resize(3, 3)

o

->

array([[ 0. ,  0.5,  1. ],
       [ 1.5,  2. ,  2.5],
       [ 3. ,  3.5,  4. ]])

(9) ones returns a new array of given shape and type, filled with ones

np.ones((3, 2))

->

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

(10) zeros returns a new array of given shape and type, filled with zeros

np.zeros((2, 3))

->

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

(11) eye returns a 2D array with ones on the diagonal and zeros elsewhere

np.eye(3)

->

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

(12) diag extracts a diagonal or constructs a diagonal array

np.diag(y)

->

array([[4, 0, 0],
       [0, 5, 0],
       [0, 0, 6]])

(13) create an array using a repeating list

np.array([1, 2, 3]*3)

->array([1, 2, 3, 1, 2, 3, 1, 2, 3])

(14) repeat elements of an array using repeat

np.repeat([1, 2, 3], 3)

->array([1, 1, 1, 2, 2, 2, 3, 3, 3])

(15) combine arrays

p = np.ones([2, 3], int)

p

->

array([[1, 1, 1],
       [1, 1, 1]])

(16) use vstack to stack arrays in sequence vertically (row wise).

np.vstack([p, 2*p])

->

array([[1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2]])

(17) use hstack to stack arrays in sequence horizontally (column wise).

np.hstack([p, 2*p])

->

array([[1, 1, 1, 2, 2, 2],
       [1, 1, 1, 2, 2, 2]])
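For 2D arrays, vstack and hstack can also be expressed with `np.concatenate` and an explicit axis, which may make the direction of stacking easier to remember. This is a small sketch assuming NumPy is installed; `rows_joined` and `cols_joined` are illustrative names.

```python
import numpy as np

p = np.ones([2, 3], int)

# axis=0 joins along rows, matching vstack for 2-D inputs.
rows_joined = np.concatenate([p, 2 * p], axis=0)
# axis=1 joins along columns, matching hstack for 2-D inputs.
cols_joined = np.concatenate([p, 2 * p], axis=1)

print(rows_joined.shape)  # (4, 3)
print(cols_joined.shape)  # (2, 6)
print(np.array_equal(rows_joined, np.vstack([p, 2 * p])))  # True
print(np.array_equal(cols_joined, np.hstack([p, 2 * p])))  # True
```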


19. Operations

(1) element wise + - * /

print(x+y)

print(x-y)

->

[5 7 9]

[-3 -3 -3]

print(x*y)

print(x/y)

->

[ 4 10 18]

[ 0.25  0.4   0.5 ]

print(x**2)

->[1 4 9]

(2) Dot Product

x.dot(y) # x1y1+x2y2+x3y3

->32

(3) build a 2D array from y and y squared

z = np.array([y, y**2])

print(z)

print(len(z))  # number of rows of the array

->

[[ 4  5  6]
 [16 25 36]]

2

(4) transpose array

z

->

[[ 4  5  6]
 [16 25 36]]

z.T

->

array([[ 4, 16],
       [ 5, 25],
       [ 6, 36]])

(5) use .dtype to see the data type of the elements in the array

z.dtype

->dtype('int64')

(6) use .astype to cast to a specific type

z = z.astype('f')

z.dtype

->dtype('float32')

(7) math functions

a = np.array([-4, -2, 1, 3, 5])

a.sum()

->3

a.max()

->5

a.min()

->-4

a.mean()

->0.6

a.std()

->3.2619012860600183

a.argmax()

->4

a.argmin()

->0

(8) indexing / slicing

s = np.arange(13)**2

s

->array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144])

(9) use bracket notation to get the value at a specific index

s[0], s[4], s[-1]

->(0, 16, 144)

(10) use : to indicate a range: array[start:stop]

s[1:5]

->array([ 1, 4, 9, 16])

(11) use negatives to count from the back

s[-4:]

->array([ 81, 100, 121, 144])

(12) A second : can be used to indicate the step size: array[start:stop:stepsize]

Here we start at the 5th element from the end and count backwards by 2 until the beginning of the array is reached.

s[-5::-2]

->array([64, 36, 16, 4, 0])

(13) look at the multidimensional array

r = np.arange(36)

r.resize((6,6))

r

->

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

(14) use bracket notation to index a single element: array[row, column]

r[2, 2]

->14

(15) use : to select a range of rows or columns

r[3, 3:6]

->array([21, 22, 23])

(16) select all the rows up to (but not including) row 2, and all the columns up to (but not including) the last column.

r[:2, :-1]

->

array([[ 0,  1,  2,  3,  4],
       [ 6,  7,  8,  9, 10]])

(17) a slice of last row, only every other element

r[-1, ::2]

->array([30, 32, 34])

(18) perform conditional indexing.

r[r > 30]

->array([31, 32, 33, 34, 35])

(19) assigning all values in the array that are greater than 30 to the value of 30

r[r > 30] = 30

r

->

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])

(20) be careful when copying and modifying arrays: r2 is a slice of r, not a copy

r2 = r[:3, :3]

r2

->

array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])

(21) set this slice's values to zero ([:] selects the entire array)

r2[:] = 0

r2

->

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

(22) r has also been changed

r

->

array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])

(23) to avoid this, use .copy()

r_copy = r.copy()

r_copy

->

array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])

(24) now when r_copy is modified, r will not be changed

r_copy[:] = 10

print(r_copy, '\n')

print(r)

->

[[10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]]

[[ 0  0  0  3  4  5]
 [ 0  0  0  9 10 11]
 [ 0  0  0 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 30 30 30 30 30]]
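Whether two arrays share the same underlying data can be checked directly with `np.shares_memory`, which may be quicker than modifying one array and inspecting the other. A minimal sketch, assuming NumPy is installed; the names `view` and `independent` are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]                # basic slicing returns a view
independent = r[:3, :3].copy()  # .copy() allocates independent data

print(np.shares_memory(r, view))         # True
print(np.shares_memory(r, independent))  # False

view[:] = 0          # writes through to r
independent[:] = 10  # leaves r untouched
print(r[1, 1], r[1, 3])  # 0 9
```

Here `r[1, 1]` was 7 before the assignment through `view`, while `r[1, 3]` lies outside the slice and keeps its value.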

(25) create a new 4x3 array of random integers from 0 to 9

test = np.random.randint(0, 10, (4, 3))

test

->

array([[1, 8, 2],
       [6, 1, 5],
       [7, 8, 0],
       [7, 6, 2]])

(26) iterate by row

for row in test:
    print(row)

->

[1 8 2]

[6 1 5]

[7 8 0]

[7 6 2]

(27) iterate by index

for i in range(len(test)):
    print(test[i])

->

[1 8 2]

[6 1 5]

[7 8 0]

[7 6 2]

(28) iterate by row and index

for i, row in enumerate(test):
    print('row', i, 'is', row)

->

row 0 is [1 8 2]

row 1 is [6 1 5]

row 2 is [7 8 0]

row 3 is [7 6 2]

(29) use zip to iterate over multiple iterables

test2 = test**2

test2

->

array([[ 1, 64,  4],
       [36,  1, 25],
       [49, 64,  0],
       [49, 36,  4]])


for i, j in zip(test, test2):
    print(i, '+', j, '=', i+j)

->

[1 8 2] + [ 1 64  4] = [ 2 72  6]

[6 1 5] + [36  1 25] = [42  2 30]

[7 8 0] + [49 64  0] = [56 72  0]

[7 6 2] + [49 36  4] = [56 42  6]
