Course GitHub repository

Week 1 content
1. Data collection
  - Raw files (.csv,.xlsx)
  - Databases (MySQL)
  - APIs
2. Data formats
  - Flat files (.csv,.txt)
  - XML
  - JSON
3. Making data tidy
4. Distributing data
5. Scripting for data cleaning

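1. Raw data and processed data

The difference between raw data and processed data

2. Components of tidy data

!!Important: a data set handed over for analysis should consist of these four parts:

The raw data
The tidy data
A code book (describing each variable and its values)
The detailed steps that take the raw data to the processed data (mainly scripts)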

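Raw data

Identifying raw data

Tidy data

!!Important: the four characteristics of tidy data:

Each variable is in its own column
Each different observation of a variable is in a different row
Each "kind" of variable is recorded in its own table
When there are multiple tables, they include a key variable that links them together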
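
The four characteristics above can be sketched in base R. The table names, column names, and values below are made up for illustration; they are not from the course material. Each table holds one kind of thing, each row is one observation, and `patient_id` plays the role of the key variable.

```r
# Hypothetical example data: one table per "kind" of thing.
patients <- data.frame(
  patient_id = c(1, 2, 3),
  sex        = c("female", "male", "female")
)

# One row per observation, one column per variable.
visits <- data.frame(
  patient_id = c(1, 1, 2, 3),
  visit      = c(1, 2, 1, 1),
  weight_kg  = c(61.2, 60.8, 83.5, 70.1)
)

# patient_id is the key variable that links the two tables.
combined <- merge(visits, patients, by = "patient_id")
print(combined)
```

Keeping the two tables separate avoids repeating `sex` on every visit row; `merge()` rebuilds the wide view only when it is needed.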

Tips for tidy data

Include a single row of variable names at the top of each file
Make variable names as human readable as possible
Save each table in its own file
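
As a small sketch of these tips, the snippet below writes one table to one file with a single header row. The data frame, its column names, and the file name are assumptions chosen for illustration.

```r
# Hypothetical table with readable variable names (weight_kg rather than "wkg").
measurements <- data.frame(
  patient_id = c(1, 2, 3),
  weight_kg  = c(61.2, 83.5, 70.1)
)

# One table per file; row.names = FALSE keeps the top line to variable names only.
out_path <- file.path(tempdir(), "measurements.csv")
write.csv(measurements, out_path, row.names = FALSE)

# Reading the file back should recover the same columns.
round_trip <- read.csv(out_path)
print(names(round_trip))
```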

Code book

Details about the code book

How to code variables
When you put variables into a spreadsheet there are several main categories you will run into depending on their data type:

Continuous

Ordinal

Categorical

Missing

Censored

Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example would be something like weight measured in kg.

Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered. This could be, for example, survey responses where the choices are: poor, fair, good.

Categorical data are data where there are multiple categories, but they aren't ordered. One example would be sex: male or female. This coding is attractive because it is self-documenting.

Missing data are data that are unobserved and you don't know the mechanism. You should code missing values as NA.

Censored data are data where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit or a patient being lost to follow-up. They should also be coded as NA when you don't have the data. But you should also add a new column to your tidy data called "VariableNameCensored", which should have values of TRUE if censored and FALSE if not. In the code book you should explain why those values are missing. It is absolutely critical to report to the analyst if there is a reason you know about that some of the data are missing. You should also not impute, make up, or throw away missing observations.

In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy data, it should be "male" or "female". The ordinal values in the data set should be "poor", "fair", and "good", not 1, 2, 3. This will avoid potential mixups about which direction effects go and will help identify coding errors.
Always encode every piece of information about your observations using text. For example, if you are storing data in Excel and use a form of colored text or cell background formatting to indicate information about an observation ("red variable entries were observed in experiment 1.") then this information will not be exported (and will be lost!) when the data is exported as raw text. Every piece of data should be encoded as actual text that can be exported.
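
A minimal sketch of these coding rules in base R follows. The data frame and its values are invented for illustration; only the conventions (text labels, NA for missing, a "VariableNameCensored" companion column) come from the text above.

```r
# Hypothetical survey data illustrating each variable type.
survey <- data.frame(
  weight_kg = c(70.5, NA, 82.1, NA),            # continuous; NA marks unavailable values
  rating    = c("poor", "good", "fair", "good"),
  sex       = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# Ordinal: keep the text labels, and record their order explicitly.
survey$rating <- factor(survey$rating,
                        levels = c("poor", "fair", "good"),
                        ordered = TRUE)

# Categorical: text labels with no ordering.
survey$sex <- factor(survey$sex)

# Censored: still NA, plus a companion column following the
# "VariableNameCensored" convention. Here row 2 is assumed to be plainly
# missing and row 4 is assumed to be censored (e.g. below a detection limit).
survey$weight_kgCensored <- c(FALSE, FALSE, FALSE, TRUE)

print(survey)
```

Because `rating` is an ordered factor, comparisons such as `survey$rating < "good"` follow the stated order while the stored values stay self-documenting.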

Code book examples

codebook example 1

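Generating a code book automatically with an R package

Documenting the detailed process from raw data to tidy data

Detailed steps

3. Downloading data

if (!file.exists("dirname")) {  # check whether the directory exists
  dir.create("dirname")         # create the directory
}

Download the data
download.file("url", destfile = "dirname/filename")  # destfile is required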

download.file(url, destfile, method, quiet = FALSE, mode = "w",
cacheOK = TRUE,
extra = getOption("download.file.extra"),
headers = NULL, ...)

Possible values for method include "curl" and "wget"
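
The two steps above (create the directory, then download) can be combined into one helper. This is only a sketch: `fetch_data`, its arguments, and the default directory name are hypothetical names introduced here, not part of the course code.

```r
# Hypothetical helper: download a file into a data directory, skipping the
# download when the file is already there.
fetch_data <- function(url, dir = "data", filename = basename(url)) {
  if (!file.exists(dir)) {
    dir.create(dir, recursive = TRUE)   # create the directory if it is missing
  }
  dest <- file.path(dir, filename)
  if (!file.exists(dest)) {
    # mode = "wb" writes in binary mode, which also suits .xlsx files
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }
  dest                                   # return the local path
}
```

A call would look like `path <- fetch_data(some_url)`, where `some_url` is a placeholder for the real address. Noting the download date next to the file, for example with `date()`, makes the step easier to reproduce later.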

4. Importing local files (brief)

read.csv()

read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = FALSE,
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...)
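
The signatures above differ mainly in their defaults. The sketch below, using a small made-up file, shows that `read.csv()` behaves like `read.table()` with `header = TRUE` and `sep = ","`, and that `read.csv2()` expects semicolons and decimal commas.

```r
# Write a small hypothetical CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,weight_kg,rating",
             "1,70.5,good",
             "2,NA,fair",
             "3,82.1,poor"), csv_path)

# read.csv() sets header = TRUE and sep = "," for us.
via_csv <- read.csv(csv_path)

# The same result with read.table() needs those arguments spelled out.
via_table <- read.table(csv_path, header = TRUE, sep = ",")

# read.csv2() is meant for semicolon-separated files with decimal commas.
via_csv2 <- read.csv2(text = "id;weight_kg\n1;70,5\n2;82,1")

str(via_csv)
str(via_csv2)
```

The string "NA" in the file is read as a missing value because `na.strings = "NA"` is the default.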

5. Reading Excel, XML, and JSON

Excel

read.xlsx()

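XML

Use the "XML" package

library(XML)
xmldoc <- xmlTreeParse(url, useInternalNodes = TRUE)  # download and parse the XML file
xmlRoot(xmldoc)  # get the root node

XML reference # includes many excellent slide decks from Berkeley
XPath programming

XPath

Usage example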
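
A minimal XPath sketch with the XML package, assuming the package is installed. The XML string and its element names are made up for illustration, and the document is parsed from text rather than downloaded so the example runs offline.

```r
# Hypothetical XML document, kept inline so no download is needed.
xml_text <- paste0(
  "<menu>",
  "<food><name>Waffles</name><price>5.95</price></food>",
  "<food><name>Toast</name><price>4.50</price></food>",
  "</menu>"
)

if (requireNamespace("XML", quietly = TRUE)) {
  # asText = TRUE tells the parser that the input is XML content, not a file name.
  doc <- XML::xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)

  root <- XML::xmlRoot(doc)        # top-level node
  root_name <- XML::xmlName(root)  # name of that node

  # XPath: every <name> and <price> element that sits under a <food> element.
  food_names <- XML::xpathSApply(doc, "//food/name", XML::xmlValue)
  prices <- as.numeric(XML::xpathSApply(doc, "//food/price", XML::xmlValue))

  print(data.frame(name = food_names, price = prices))
}
```

`xmlValue` returns text, so numeric fields need an explicit `as.numeric()`.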

HTML is handled much like XML

JSON
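
One common way to read JSON in R is the jsonlite package. The sketch below assumes jsonlite is installed, and the JSON text and field names are invented for illustration; the original notes do not say which package they used.

```r
# Hypothetical JSON text: an array of objects.
json_text <- '[{"name": "repo-a", "stars": 12}, {"name": "repo-b", "stars": 3}]'

if (requireNamespace("jsonlite", quietly = TRUE)) {
  # An array of objects with the same fields is converted to a data frame.
  repos <- jsonlite::fromJSON(json_text)
  repo_names <- repos$name

  # Data frames can be converted back to JSON text.
  back_to_json <- jsonlite::toJSON(repos, pretty = TRUE)

  print(repos)
  cat(back_to_json, "\n")
}
```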

6. data.table

See the package documentation directly
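
For orientation before reading the package documentation, here is a very small sketch of the `DT[i, j, by]` form, assuming the data.table package is installed. The table contents and column names are made up.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  # Hypothetical table with a grouping column and a numeric column.
  dt <- data.table(group = c("a", "b", "a", "b"),
                   value = c(1, 2, 3, 4))

  # i: subset rows where group is "a".
  dt_a <- dt[group == "a"]

  # j and by: compute the mean of value within each group.
  means <- dt[, .(mean_value = mean(value)), by = group]

  print(dt_a)
  print(means)
}
```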

Getting and Cleaning Data: Week 1