Project 2: Detecting Duplicate Quora Questions

Quora releases its first dataset -- Question Pairs

Deep text-pair classification with Quora's 2017 question dataset

720+ new NLP models, 300+ supported languages, translation, summarization, question answering, and more with T5 and Marian models! - John Snow Labs NLU 1.1.0

NLU 1.1.0 Release Notes

We are incredibly excited to release NLU 1.1.0! This release integrates the 720+ new models from the latest Spark-NLP 2.7.0+ releases. You can now achieve state-of-the-art results with Sequence2Sequence transformers on problems like text summarization, question answering, and translation between 192+ languages, and extract named entities in various right-to-left written languages like Arabic, Persian, and Urdu, as well as languages that require segmentation like Korean, Japanese, and Chinese, all in one line of code! These new features are made possible by the integration of Google's T5 and Microsoft's Marian transformer models.
NLU 1.1.0 has over 720+ new pretrained models and pipelines while extending the support of multi-lingual models to 192+ languages such as Chinese, Japanese, Korean, Arabic, Persian, Urdu, and Hebrew.
In addition to this, NLU 1.1.0 comes with 9 new notebooks showcasing training classifiers for various review and sentiment datasets and 7 notebooks for the new features and models.

NLU 1.1.0 New Features

Translation

Translation example: you can translate between more than 192 language pairs with the Marian models. You need to specify the language your data is in as start_language and the language you want to translate to as target_language. The language references must be ISO language codes:

```python
nlu.load('<start_language>.translate_to.<target_language>')
```

```python
# Translate English to French
nlu.load('en.translate_to.fr').predict("Hello from John Snow Labs")
# Output: Bonjour des laboratoires de neige de John!

# Translate English to Inuktitut
nlu.load('en.translate_to.lu').predict("Hello from John Snow Labs")
# Output: kalunganyembo ka mashika makamankate

# Translate English to Hungarian
nlu.load('en.translate_to.hu').predict("Hello from John Snow Labs")
# Output: Helló John hó laborjából.

# Translate English to German
nlu.load('en.translate_to.de').predict("Hello from John Snow Labs!")
# Output: Hallo aus John Schnee Labors
```

```python
translate_pipe = nlu.load('en.translate_to.de')
df = translate_pipe.predict('Billy likes to go to the mall every sunday')
df
```

| sentence | translation |
|---|---|
| Billy likes to go to the mall every sunday | Billy geht gerne jeden Sonntag ins Einkaufszentrum |
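Since every translation model reference follows the same `<start>.translate_to.<target>` pattern, the reference string can be built programmatically. A minimal sketch, assuming only the reference format shown above (the helper name is ours, not part of NLU):

```python
def translate_ref(start_language: str, target_language: str) -> str:
    """Build an NLU Marian translation model reference from two ISO language codes."""
    return f"{start_language}.translate_to.{target_language}"

# The result can be passed to nlu.load(...)
print(translate_ref('en', 'de'))  # → en.translate_to.de
```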

T5

Example of every T5 task

Overview of every task available with T5

The T5 model is trained on various datasets for 18 different tasks, which fall into 8 categories.
  1. Text summarization
  2. Question answering
  3. Translation
  4. Sentiment analysis
  5. Natural Language inference
  6. Coreference resolution
  7. Sentence Completion
  8. Word sense disambiguation

Every T5 Task with explanation:

| Task name | Explanation |
|---|---|
| 1. CoLA | Classify whether a sentence is grammatically correct. |
| 2. RTE | Classify whether a statement can be deduced from a sentence. |
| 3. MNLI | Classify whether a hypothesis and premise entail each other, contradict each other, or neither (3 classes). |
| 4. MRPC | Classify whether a pair of sentences is a re-phrasing of each other (semantically equivalent). |
| 5. QNLI | Classify whether the answer to a question can be deduced from an answer candidate. |
| 6. QQP | Classify whether a pair of questions is a re-phrasing of each other (semantically equivalent). |
| 7. SST2 | Classify the sentiment of a sentence as positive or negative. |
| 8. STSB | Score the semantic similarity of a sentence pair on a scale from 0 to 5 (21 classes). |
| 9. CB | Classify whether a premise and a hypothesis contradict each other or not (binary). |
| 10. COPA | Classify, for a question, a premise, and 2 choices, which choice is correct (binary). |
| 11. MultiRC | Classify, for a question, a paragraph of text, and an answer candidate, whether the answer is correct (binary). |
| 12. WiC | Classify, for a pair of sentences and an ambiguous word, whether the word has the same meaning in both sentences. |
| 13. WSC/DPR | Predict what an ambiguous pronoun in a sentence refers to. |
| 14. Summarization | Summarize text into a shorter representation. |
| 15. SQuAD | Answer a question for a given context. |
| 16. WMT1 | Translate English to German. |
| 17. WMT2 | Translate English to French. |
| 18. WMT3 | Translate English to Romanian. |
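All of these tasks are selected the same way: the task name is prepended to the input text before it reaches T5 (this is what the `.setTask(...)` calls later in these notes do). A minimal sketch of that text-prefix convention, with a hypothetical helper that is ours and not part of NLU:

```python
def build_t5_input(task_prefix: str, text: str) -> str:
    """Prepend a T5 task prefix (e.g. 'cola sentence:', 'summarize:') to raw text."""
    return f"{task_prefix} {text.strip()}"

print(build_t5_input('cola sentence:', 'Anna and Mike like to dance'))
# → cola sentence: Anna and Mike like to dance
```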

Open book and Closed book question answering with Google's T5

T5 Open and Closed Book question answering tutorial
With the latest NLU release and Google's T5, you can answer general knowledge questions given no context, and in addition answer questions on text databases. These questions can be asked in natural human language and answered in just one line with NLU!

What is an open book question?

You can think of an open book question as similar to an exam where you are allowed to bring in text documents or cheat sheets that help you answer the questions, a bit like bringing a history book to a history exam.
In T5's terms, this means the model is given a question and an additional piece of textual information, the so-called context.
This enables the T5 model to answer questions on textual datasets like medical records, news articles, wiki databases, stories, movie scripts, product descriptions, legal documents, and many more.
You can answer an open book question in one line of code, leveraging the latest NLU release and Google's T5. All it takes is:

```python
nlu.load('answer_question').predict("""
Where did Jebe die?
context: Genghis Khan recalled Subutai back to Mongolia soon afterwards, and Jebe died on the road back to Samarkand""")
# Output: Samarkand
```
Example for answering medical questions based on medical context:

```python
question = '''
What does increased oxygen concentrations in the patient’s lungs displace?
context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O2 as soon as possible is part of the treatment.
'''

# Predict on text data with T5
nlu.load('answer_question').predict(question)
# Output: carbon monoxide
```
Take a look at this example on a recent news article snippet:

```python
question1 = 'Who is Jack ma?'
question2 = 'Who is founder of Alibaba Group?'
question3 = 'When did Jack Ma re-appear?'
question4 = 'How did Alibaba stocks react?'
question5 = 'Whom did Jack Ma meet?'
question6 = 'Whom did Jack Ma hide from?'

# from https://www.bbc.com/news/business-55728338
news_article_snippet = """ context:
Alibaba Group founder Jack Ma has made his first appearance since Chinese regulators cracked down on his business empire.
His absence had fuelled speculation over his whereabouts amid increasing official scrutiny of his businesses.
The billionaire met 100 rural teachers in China via a video meeting on Wednesday, according to local government media.
Alibaba shares surged 5% on Hong Kong's stock exchange on the news. """

# Join each question with the context; this works with Pandas DataFrames as well!
questions = [
    question1 + news_article_snippet,
    question2 + news_article_snippet,
    question3 + news_article_snippet,
    question4 + news_article_snippet,
    question5 + news_article_snippet,
    question6 + news_article_snippet,
]
nlu.load('answer_question').predict(questions)
```

This will output a Pandas DataFrame similar to this:

| Answer | Question |
|---|---|
| Alibaba Group founder | Who is Jack ma? |
| Jack Ma | Who is founder of Alibaba Group? |
| Wednesday | When did Jack Ma re-appear? |
| surged 5% | How did Alibaba stocks react? |
| 100 rural teachers | Whom did Jack Ma meet? |
| Chinese regulators | Whom did Jack Ma hide from? |
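Because the `context:` tag is plain text, building these open-book inputs is ordinary string concatenation. A small hypothetical helper (ours, not part of NLU) makes the batch construction above reusable:

```python
def with_context(question: str, context: str) -> str:
    """Join a question with its context in the T5 open-book input format."""
    return f"{question} context: {context}"

questions = ['Who is Jack ma?', 'When did Jack Ma re-appear?']
context = "Alibaba Group founder Jack Ma has made his first appearance ..."
inputs = [with_context(q, context) for q in questions]
# Each element of `inputs` can be passed to nlu.load('answer_question').predict(...)
```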

What is a closed book question?

A closed book question is the exact opposite of an open book question. In an exam scenario, you are only allowed to use what you have memorized, and nothing else. In T5's terms, this means that T5 can only use its stored weights to answer a question and is given no additional context. T5 was pre-trained on the C4 dataset, which contains petabytes of web crawl data collected over the last 8 years, including Wikipedia in every language.
This gives T5 the broad knowledge of the internet, stored in its weights, to answer various closed book questions.
You can answer a closed book question in one line of code, leveraging the latest NLU release and Google's T5. Since no context is given, you pass just the question string to NLU. All it takes is:
```python
nlu.load('en.t5').predict('Who is president of Nigeria?')
# Output: Muhammadu Buhari

nlu.load('en.t5').predict('What is the most spoken language in India?')
# Output: Hindi

nlu.load('en.t5').predict('What is the capital of Germany?')
# Output: Berlin
```

Text Summarization with T5

Summarization example
Summarizes a paragraph into a shorter version with the same semantic meaning, based on this paper.

```python
# Set the 'summarize' task on T5
pipe = nlu.load('summarize')

# Define data; additional tags can be added between sentences
data = [
    ''' The belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth . ''',
    ''' Calculus, originally called infinitesimal calculus or "the calculus of infinitesimals", is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations. It has two major branches, differential calculus and integral calculus; the former concerns instantaneous rates of change, and the slopes of curves, while integral calculus concerns accumulation of quantities, and areas under or between curves. These two branches are related to each other by the fundamental theorem of calculus, and they make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.[1] Infinitesimal calculus was developed independently in the late 17th century by Isaac Newton and Gottfried Wilhelm Leibniz.[2][3] Today, calculus has widespread uses in science, engineering, and economics.[4] In mathematics education, calculus denotes courses of elementary mathematical analysis, which are mainly devoted to the study of functions and limits. The word calculus (plural calculi) is a Latin word, meaning originally "small pebble" (this meaning is kept in medicine – see Calculus (medicine)). Because such pebbles were used for calculation, the meaning of the word has evolved and today usually means a method of computation. It is therefore used for naming specific methods of calculation and related theories, such as propositional calculus, Ricci calculus, calculus of variations, lambda calculus, and process calculus.'''
]

# Predict on text data with T5
pipe.predict(data)
```
| Predicted summary | Text |
|---|---|
| manchester united face newcastle in the premier league on wednesday . louis van gaal's side currently sit two points clear of liverpool in fourth . the belgian duo took to the dance floor on monday night with some friends . | the belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth . |

Binary Sentence similarity/ Paraphrasing

Binary sentence similarity example: classify whether one sentence is a re-phrasing of, or similar to, another sentence. This is a sub-task of GLUE, based on MRPC (Binary Paraphrasing / sentence similarity classification).

```python
t5 = nlu.load('en.t5.base')

# Set the task on T5
t5['t5'].setTask('mrpc ')

# Define data; add the sentence1:/sentence2: tags between sentences
data = [
    ''' sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " ''',
    ''' sentence1: I like to eat peanutbutter for breakfast sentence2: I like to play football. '''
]

# Predict on text data with T5
t5.predict(data)
```

| Sentence1 | Sentence2 | prediction |
|---|---|---|
| We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . | Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " . | equivalent |
| I like to eat peanutbutter for breakfast | I like to play football | not_equivalent |

How to configure the T5 task for MRPC and pre-process text

Call `.setTask('mrpc sentence1:')` and prefix the second sentence with `sentence2:`.

Example pre-processed input for T5 MRPC - Binary Paraphrasing/ sentence similarity

mrpc sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11",
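The same `sentence1:`/`sentence2:` convention is used for both MRPC and STSB, so the pair formatting can be captured in one small helper. This is a hypothetical sketch of the convention, not an NLU API:

```python
def to_sentence_pair(s1: str, s2: str) -> str:
    """Format two sentences in the T5 sentence-pair convention (MRPC, STSB)."""
    return f"sentence1: {s1} sentence2: {s2}"

print(to_sentence_pair('I like to eat peanutbutter for breakfast',
                       'I like to play football.'))
# → sentence1: I like to eat peanutbutter for breakfast sentence2: I like to play football.
```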

Regressive Sentence similarity/ Paraphrasing

Measures how similar two sentences are on a scale from 0 to 5, with 21 classes representing a regressive label. This is a sub-task of GLUE, based on STSB (Regressive semantic sentence similarity).
```python
t5 = nlu.load('en.t5.base')

# Set the task on T5
t5['t5'].setTask('stsb ')

# Define data; add the sentence1:/sentence2: tags between sentences
data = [
    ''' sentence1: What attributes would have made you highly desirable in ancient Rome? sentence2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?' ''',
    ''' sentence1: What was it like in Ancient rome? sentence2: What was Ancient rome like? ''',
    ''' sentence1: What was live like as a King in Ancient Rome?? sentence2: What was Ancient rome like? '''
]

# Predict on text data with T5
t5.predict(data)
```
| sentence1 | sentence2 | prediction |
|---|---|---|
| What attributes would have made you highly desirable in ancient Rome? | How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER? | 0 |
| What was it like in Ancient rome? | What was Ancient rome like? | 5.0 |
| What was live like as a King in Ancient Rome?? | What is it like to live in Rome? | 3.2 |

How to configure the T5 task for STSB and pre-process text

Call `.setTask('stsb sentence1:')` and prefix the second sentence with `sentence2:`.

Example pre-processed input for T5 STSB - Regressive semantic sentence similarity

stsb sentence1: What attributes would have made you highly desirable in ancient Rome? sentence2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?',

Grammar Checking

Grammar checking with T5 example: judges whether a sentence is grammatically acceptable. Based on CoLA (Binary Grammatical Sentence acceptability classification).

```python
pipe = nlu.load('grammar_correctness')

# Set the task on T5
pipe['t5'].setTask('cola sentence: ')

# Define data
data = ['Anna and Mike is going skiing and they is liked is', 'Anna and Mike like to dance']

# Predict on text data with T5
pipe.predict(data)
```

| sentence | prediction |
|---|---|
| Anna and Mike is going skiing and they is liked is | unacceptable |
| Anna and Mike like to dance | acceptable |

Document Normalization

Document Normalizer example: the DocumentNormalizer extracts content from HTML or XML documents, applying either data cleansing using an arbitrary number of custom regular expressions, or data extraction following the different parameters.

```python
pipe = nlu.load('norm_document')
data = ' Example This is an example of a simple HTML page with one paragraph. '
df = pipe.predict(data, output_level='document')
df
```

| text | normalized_text |
|---|---|
| Example This is an example of a simple HTML page with one paragraph. | Example This is an example of a simple HTML page with one paragraph. |
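For intuition, the cleansing step is conceptually similar to applying a tag-stripping regular expression. A rough standalone sketch in plain Python, not the DocumentNormalizer API itself:

```python
import re

def strip_html_tags(html: str) -> str:
    """Remove HTML/XML tags and collapse whitespace -- a toy stand-in for
    the regex-based cleansing the DocumentNormalizer performs."""
    text = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_html_tags('<html><body><p>Example paragraph.</p></body></html>'))
# → Example paragraph.
```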

Word Segmenter

Word Segmenter example: the WordSegmenter segments languages without rule-based tokenization, such as Chinese, Japanese, or Korean.

```python
pipe = nlu.load('ja.segment_words')

# Japanese for 'Donald Trump and Angela Merkel dont share many opinions'
ja_data = ['ドナルド・トランプとアンゲラ・メルケルは多くの意見を共有していません']
df = pipe.predict(ja_data, output_level='token')
df
```

| token |
|---|
| ドナルド |
| トランプ |
| アンゲラ |
| メルケル |
| 多く |
| 意見 |
| 共有 |
| ませ |
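The reason a dedicated segmenter is needed: Japanese (like Chinese and Korean) writes no spaces between words, so naive whitespace tokenization returns the whole sentence as a single token. A quick plain-Python illustration:

```python
en = 'Donald Trump and Angela Merkel dont share many opinions'
ja = 'ドナルド・トランプとアンゲラ・メルケルは多くの意見を共有していません'

print(len(en.split()))  # → 9: whitespace gives usable English tokens
print(len(ja.split()))  # → 1: no spaces, so no word boundaries to split on
```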

Named Entity Extraction (NER) in Various Languages

NLU now supports NER for over 60 languages, including Korean, Japanese, Chinese, and many more!

```python
# Extract named Chinese entities
pipe = nlu.load('zh.ner')

# Chinese for 'Donald Trump and Angela Merkel dont share many opinions'
zh_data = ['唐纳德特朗普和安吉拉·默克尔没有太多意见']
df = pipe.predict(zh_data, output_level='document')
df
# Output: [唐纳德, 安吉拉]

# Now translate [唐纳德, 安吉拉] back to English with NLU!
translate_pipe = nlu.load('zh.translate_to.en')
en_entities = translate_pipe.predict(['唐纳德', '安吉拉'])
```

Output:

| Translation | Chinese |
|---|---|
| Donald | 唐纳德 |
| Angela | 安吉拉 |

New NLU Notebooks

NLU 1.1.0 New Notebooks for new features

NLU 1.1.0 New Classifier Training Tutorials

Binary Classifier training Jupyter tutorials

Multi Class text Classifier training Jupyter tutorials

NLU 1.1.0 New Medium Tutorials

Installation

```bash
# PyPI
!pip install nlu pyspark==2.4.7

# Install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu
```

Additional NLU resources


720+ new NLP models, 300+ supported languages, translation, summarization, question answering, and more with T5 and Marian models! - John Snow Labs NLU 1.1.0

720+ new NLP models, 300+ supported languages, translation, summarization, question answering and more with T5 and Marian models! - John Snow Labs NLU 1.1.0

NLU 1.1.0 Release Notes

We are incredibly excited to release NLU 1.1.0! This release integrates the 720+ new models from the latest Spark-NLP 2.7.0 + releases You can now achieve state-of-the-art results with Sequence2Sequence transformers on problems like text summarization, question answering, translation between 192+ languages, and extract Named Entity in various Right to Left written languages like Arabic, Persian, Urdu, and languages that require segmentation like Koreas, Japanese, Chinese, and many more in 1 line of code! These new features are possible because of the integration of the Google's T5 models and Microsoft's Marian models transformers.
NLU 1.1.0 has over 720+ new pretrained models and pipelines while extending the support of multi-lingual models to 192+ languages such as Chinese, Japanese, Korean, Arabic, Persian, Urdu, and Hebrew.
In addition to this, NLU 1.1.0 comes with 9 new notebooks showcasing training classifiers for various review and sentiment datasets and 7 notebooks for the new features and models.

NLU 1.1.0 New Features

Translation

Translation example You can translate between more than 192 Languages pairs with the Marian Models You need to specify the language your data is in as start_language and the language you want to translate to as target_language. The language references must be ISO language codes
nlu.load('.translate.')
Translate English to French : ``` nlu.load('en.translate_to.fr').predict("Hello from John Snow Labs")
Output: Bonjour des laboratoires de neige de John!
**Translate English to Inukitut :** nlu.load('en.translate_to.lu').predict("Hello from John Snow Labs")
Output: kalunganyembo ka mashika makamankate **Translate English to Hungarian :** nlu.load('en.translate_to.hu').predict("Hello from John Snow Labs") Output: Helló John hó laborjából. **Translate English to German :** nlu.load('en.translate_to.de').predict("Hello from John Snow Labs!") Output: Hallo aus John Schnee Labors ```
python translate_pipe = nlu.load('en.translate_to.de') df = translate_pipe.predict('Billy likes to go to the mall every sunday') df
sentence translation
Billy likes to go to the mall every sunday Billy geht gerne jeden Sonntag ins Einkaufszentrum

T5

Example of every T5 task

Overview of every task available with T5

The T5 model is trained on various datasets for 17 different tasks which fall into 8 categories.
  1. Text summarization
  2. Question answering
  3. Translation
  4. Sentiment analysis
  5. Natural Language inference
  6. Coreference resolution
  7. Sentence Completion
  8. Word sense disambiguation

Every T5 Task with explanation:

Task Name Explanation
1.CoLA Classify if a sentence is gramaticaly correct
2.RTE Classify whether if a statement can be deducted from a sentence
3.MNLI Classify for a hypothesis and premise whether they contradict or contradict each other or neither of both (3 class).
4.MRPC Classify whether a pair of sentences is a re-phrasing of each other (semantically equivalent)
5.QNLI Classify whether the answer to a question can be deducted from an answer candidate.
6.QQP Classify whether a pair of questions is a re-phrasing of each other (semantically equivalent)
7.SST2 Classify the sentiment of a sentence as positive or negative
8.STSB Classify the sentiment of a sentence on a scale from 1 to 5 (21 Sentiment classes)
9.CB Classify for a premise and a hypothesis whether they contradict each other or not (binary).
10.COPA Classify for a question, premise, and 2 choices which choice the correct choice is (binary).
11.MultiRc Classify for a question, a paragraph of text, and an answer candidate, if the answer is correct (binary),
12.WiC Classify for a pair of sentences and a disambigous word if the word has the same meaning in both sentences.
13.WSC/DPR Predict for an ambiguous pronoun in a sentence what it is referring to.
14.Summarization Summarize text into a shorter representation.
15.SQuAD Answer a question for a given context.
16.WMT1. Translate English to German
17.WMT2. Translate English to French
18.WMT3. Translate English to Romanian

Open book and Closed book question answering with Google's T5

T5 Open and Closed Book question answering tutorial
With the latest NLU release and Google's T5 you can answer general knowledge based questions given no context and in addition answer questions on text databases. These questions can be asked in natural human language and answerd in just 1 line with NLU!.

What is a open book question?

You can imagine an open book question similar to an examen where you are allowed to bring in text documents or cheat sheets that help you answer questions in an examen. Kinda like bringing a history book to an history examen.
In T5's terms, this means the model is given a question and an additional piece of textual information or so called context.
This enables the T5 model to answer questions on textual datasets like medical records,newsarticles , wiki-databases , stories and movie scripts , product descriptions, 'legal documents' and many more.
You can answer open book question in 1 line of code, leveraging the latest NLU release and Google's T5. All it takes is :
```python nlu.load('answer_question').predict(""" Where did Jebe die? context: Ghenkis Khan recalled Subtai back to Mongolia soon afterwards, and Jebe died on the road back to Samarkand""")
Output: Samarkand ```
Example for answering medical questions based on medical context ``` python question =''' What does increased oxygen concentrations in the patient’s lungs displace? context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment. '''

Predict on text data with T5

nlu.load('answer_question').predict(question)
Output: carbon monoxide ```
Take a look at this example on a recent news article snippet : ```python question1 = 'Who is Jack ma?' question2 = 'Who is founder of Alibaba Group?' question3 = 'When did Jack Ma re-appear?' question4 = 'How did Alibaba stocks react?' question5 = 'Whom did Jack Ma meet?' question6 = 'Who did Jack Ma hide from?'

from https://www.bbc.com/news/business-55728338

news_article_snippet = """ context: Alibaba Group founder Jack Ma has made his first appearance since Chinese regulators cracked down on his business empire. His absence had fuelled speculation over his whereabouts amid increasing official scrutiny of his businesses. The billionaire met 100 rural teachers in China via a video meeting on Wednesday, according to local government media. Alibaba shares surged 5% on Hong Kong's stock exchange on the news. """

join question with context, works with Pandas DF aswell!

questions = [ question1+ news_article_snippet, question2+ news_article_snippet, question3+ news_article_snippet, question4+ news_article_snippet, question5+ news_article_snippet, question6+ news_article_snippet,] nlu.load('answer_question').predict(questions) ``` This will output a Pandas Dataframe similar to this :
Answer Question
Alibaba Group founder Who is Jack ma?
Jack Ma Who is founder of Alibaba Group?
Wednesday When did Jack Ma re-appear?
surged 5% How did Alibaba stocks react?
100 rural teachers Whom did Jack Ma meet?
Chinese regulators Who did Jack Ma hide from?

What is a closed book question?

A closed book question is the exact opposite of a open book question. In an examen scenario, you are only allowed to use what you have memorized in your brain and nothing else. In T5's terms this means that T5 can only use it's stored weights to answer a question and is given no aditional context. T5 was pre-trained on the C4 dataset which contains petabytes of web crawling data collected over the last 8 years, including Wikipedia in every language.
This gives T5 the broad knowledge of the internet stored in it's weights to answer various closed book questions
You can answer closed book question in 1 line of code, leveraging the latest NLU release and Google's T5. You need to pass one string to NLU, which starts which a question and is followed by a context: tag and then the actual context contents. All it takes is :
```python nlu.load('en.t5').predict('Who is president of Nigeria?')
Muhammadu Buhari ```
```python nlu.load('en.t5').predict('What is the most spoken language in India?')
Hindi ```
```python nlu.load('en.t5').predict('What is the capital of Germany?')
Berlin ```

Text Summarization with T5

Summarization example
Summarizes a paragraph into a shorter version with the same semantic meaning, based on this paper
```python

Set the task on T5

pipe = nlu.load('summarize')

define Data, add additional tags between sentences

data = [ ''' The belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth . ''', ''' Calculus, originally called infinitesimal calculus or "the calculus of infinitesimals", is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations. It has two major branches, differential calculus and integral calculus; the former concerns instantaneous rates of change, and the slopes of curves, while integral calculus concerns accumulation of quantities, and areas under or between curves. These two branches are related to each other by the fundamental theorem of calculus, and they make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.[1] Infinitesimal calculus was developed independently in the late 17th century by Isaac Newton and Gottfried Wilhelm Leibniz.[2][3] Today, calculus has widespread uses in science, engineering, and economics.[4] In mathematics education, calculus denotes courses of elementary mathematical analysis, which are mainly devoted to the study of functions and limits. The word calculus (plural calculi) is a Latin word, meaning originally "small pebble" (this meaning is kept in medicine – see Calculus (medicine)). Because such pebbles were used for calculation, the meaning of the word has evolved and today usually means a method of computation. It is therefore used for naming specific methods of calculation and related theories, such as propositional calculus, Ricci calculus, calculus of variations, lambda calculus, and process calculus.''' ]

Predict on text data with T5

pipe.predict(data) ```
Predicted summary Text
manchester united face newcastle in the premier league on wednesday . louis van gaal's side currently sit two points clear of liverpool in fourth . the belgian duo took to the dance floor on monday night with some friends . the belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth .

Binary Sentence similarity/ Paraphrasing

Binary sentence similarity example Classify whether one sentence is a re-phrasing or similar to another sentence This is a sub-task of GLUE and based on MRPC - Binary Paraphrasing/ sentence similarity classification
``` t5 = nlu.load('en.t5.base')

Set the task on T5

t5['t5'].setTask('mrpc ')

define Data, add additional tags between sentences

data = [ ''' sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " ''' , ''' sentence1: I like to eat peanutbutter for breakfast sentence2: I like to play football. ''' ]

Predict on text data with T5

t5.predict(data) ``` | Sentence1 | Sentence2 | prediction| |------------|------------|----------| |We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said .| Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " . | equivalent | | I like to eat peanutbutter for breakfast| I like to play football | not_equivalent |

How to configure T5 task for MRPC and pre-process text

Set `.setTask('mrpc sentence1: ')` and prefix the second sentence with `sentence2: `

Example pre-processed input for T5 MRPC - Binary Paraphrasing/ sentence similarity

mrpc sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11",
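Conceptually, the prefixing that `setTask` applies can be sketched as a plain string builder (a hypothetical helper for illustration, not part of the NLU API):

```python
def build_mrpc_input(sentence1: str, sentence2: str) -> str:
    # T5 consumes a single string: the task prefix plus the tagged sentence pair
    return f"mrpc sentence1: {sentence1} sentence2: {sentence2}"

build_mrpc_input("I like to eat peanutbutter for breakfast",
                 "I like to play football.")
# → 'mrpc sentence1: I like to eat peanutbutter for breakfast sentence2: I like to play football.'
```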

Regressive Sentence similarity/ Paraphrasing

Measures how similar two sentences are, on a scale from 0 to 5, with 21 classes representing a regressive label. This is a sub-task of GLUE, based on STSB (Regressive Semantic Sentence Similarity).
```python
t5 = nlu.load('en.t5.base')

# Set the task on T5
t5['t5'].setTask('stsb ')

# Define data; add the additional tags between sentences
data = [
    ''' sentence1: What attributes would have made you highly desirable in ancient Rome? sentence2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?' ''',
    ''' sentence1: What was it like in Ancient rome? sentence2: What was Ancient rome like? ''',
    ''' sentence1: What was live like as a King in Ancient Rome?? sentence2: What was Ancient rome like? '''
]

# Predict on text data with T5
t5.predict(data)
```
| sentence1 | sentence2 | prediction |
|-----------|-----------|------------|
| What attributes would have made you highly desirable in ancient Rome? | How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER? | 0 |
| What was it like in Ancient rome? | What was Ancient rome like? | 5.0 |
| What was live like as a King in Ancient Rome?? | What is it like to live in Rome? | 3.2 |

How to configure T5 task for stsb and pre-process text

Set `.setTask('stsb sentence1: ')` and prefix the second sentence with `sentence2: `

Example pre-processed input for T5 STSB - Regressive semantic sentence similarity

stsb sentence1: What attributes would have made you highly desirable in ancient Rome? sentence2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?',

Grammar Checking

Grammar checking with T5 example: judges whether a sentence is grammatically acceptable. Based on CoLA (Binary Grammatical Sentence Acceptability Classification).
```python
pipe = nlu.load('grammar_correctness')

# Set the task on T5
pipe['t5'].setTask('cola sentence: ')

# Define data
data = ['Anna and Mike is going skiing and they is liked is', 'Anna and Mike like to dance']

# Predict on text data with T5
pipe.predict(data)
```

| sentence | prediction |
|----------|------------|
| Anna and Mike is going skiing and they is liked is | unacceptable |
| Anna and Mike like to dance | acceptable |

Document Normalization

Document Normalizer example: the DocumentNormalizer extracts content from HTML or XML documents, applying either data cleansing via an arbitrary number of custom regular expressions or data extraction according to the configured parameters.
```python
pipe = nlu.load('norm_document')
data = '<html><head><title>Example</title></head><body><p>This is an example of a simple HTML page with one paragraph.</p></body></html>'
df = pipe.predict(data, output_level='document')
df
```

| text | normalized_text |
|------|-----------------|
| `<html><head><title>Example</title>...` | Example This is an example of a simple HTML page with one paragraph. |
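Under the hood, this kind of normalization amounts to regex-based tag removal. A minimal standalone sketch (plain Python for intuition, not the actual NLU implementation):

```python
import re

def strip_markup(document: str) -> str:
    # Drop anything that looks like an HTML/XML tag, then collapse whitespace
    text = re.sub(r"<[^>]+>", " ", document)
    return re.sub(r"\s+", " ", text).strip()

strip_markup("<p>This is an example of a simple HTML page with one paragraph.</p>")
# → 'This is an example of a simple HTML page with one paragraph.'
```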

Word Segmenter

Word Segmenter example: the WordSegmenter segments languages without rule-based tokenization, such as Chinese, Japanese, or Korean.
```python
pipe = nlu.load('ja.segment_words')

# Japanese for "Donald Trump and Angela Merkel don't share many opinions"
ja_data = ['ドナルド・トランプとアンゲラ・メルケルは多くの意見を共有していません']
df = pipe.predict(ja_data, output_level='token')
df
```
token
ドナルド
トランプ
アンゲラ
メルケル
多く
意見
共有
ませ
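For intuition about what segmentation involves, here is a toy greedy longest-match segmenter — nothing like the trained WordSegmenter model above, just the naive dictionary baseline it improves on:

```python
def greedy_segment(text, vocab):
    # Greedy longest-match-first segmentation against a toy vocabulary;
    # characters not in the vocabulary fall through as single-char tokens.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"ドナルド", "トランプ", "多く", "意見", "共有"}
greedy_segment("ドナルド・トランプ", vocab)  # → ['ドナルド', '・', 'トランプ']
```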

Named Entity Extraction (NER) in Various Languages

NLU now supports NER for over 60 languages, including Korean, Japanese, Chinese, and many more!
```python
# Extract named Chinese entities
pipe = nlu.load('zh.ner')

# Chinese for 'Donald Trump and Angela Merkel dont share many opinions'
zh_data = ['唐纳德特朗普和安吉拉·默克尔没有太多意见']
df = pipe.predict(zh_data, output_level='document')
df
# Output: [唐纳德, 安吉拉]

# Now translate [唐纳德, 安吉拉] back to English with NLU!
translate_pipe = nlu.load('zh.translate_to.en')
en_entities = translate_pipe.predict(['唐纳德', '安吉拉'])
# Output:
```

| Translation | Chinese |
|-------------|---------|
| Donald | 唐纳德 |
| Angela | 安吉拉 |

New NLU Notebooks

NLU 1.1.0 New Notebooks for new features

NLU 1.1.0 New Classifier Training Tutorials

Binary Classifier training Jupyter tutorials

Multi Class text Classifier training Jupyter tutorials

NLU 1.1.0 New Medium Tutorials

Installation

```bash
# PyPI
pip install nlu pyspark==2.4.7

# Conda: install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu
```

Additional NLU resources

submitted by CKL-IT to LanguageTechnology [link] [comments]

[25/M] Directionless in an IT career, how do I grow from this witch?

I'm going to cry my heart out, so you've been warned.
Long story, short.

Long story, long.
Chapter 0: The beginning.
I was great at computers from when I was young. I was no Chintu developing applications and having investors wrestle to reach me, but I did some basic static HTML pages, could figure my way out in fixing computer and internet issues, etc. This got me the prestigious stature of "geek" and "gizmo" in a household where being able to surf the internet was akin to cavemen discovering fire. Then on, I decided I wanted to study computer science despite being from a school (rather, board) that did not even have Computers as a subject in 8th, 9th and 10th.
Come 11th, I wanted to take up Computer Science and take it up, I did. The first chapter (and I kid you not) was about introduction to computer science, where we had to rut what a peripheral device is and what a non-peripheral device is. The class was basically a teacher highlighting contents of the textbook that would fetch us marks. I nope'd out. Being the cream student, I had an option to switch to Electronics because the demand for electronics was so low that they had only 1 section with 63 people and were really looking to make that an even number. After a few trial classes, I realized electronics is fun too and I ditched my long term love to study electronics.
I guess somewhere, there was always a zeal to want to learn computers but I did not know the sources. I knew I could "learn C/C++" but what would I do with that was something I never knew. I consulted a couple of teachers in my college regarding this but what they'd suggest is for me to learn it to get good marks, improve my score and get into a good engineering college. 🙄 Their assumption was that the real CS happens in engineering college and 12th does not matter.
I stuck to electronics, scored fairly decent marks and in engineering, I opted for CS.
Chapter 1: University. (Can skip)
For some reason, I thought of University as a magic box where a dumb person goes in and comes out as a "coder, rider, provider". But alas, life is no TikTok and I realized it on day 1 when in my Science and Humanities department, I was the only one who did not know a line of code. There were kids who'd be inducted into Mechanical, Electronics, Civil departments and 99% of them knew how to write some code and I was caught off guard by this. I thought I could wing it and yet again focused only on scoring marks. This is what I was taught my whole life and it did not help that the Director drew a correlation that every kid who scores an 8+ GPA lands a job that pays them at least 12lpa.
Focusing on marks, studying what "looked" important and with a goal to maintain a 8+ GPA, I strictly adhered to rules that would help me achieve my goals. 3 years of this, you could ask me to write an API and I would first see if this was "in syllabus" or not. In the 3rd year, we had something called Practice School. This was a term that we borrowed from some IIT / NIT, but proudly wore it on our chests like it was some discovery we secretively made. Practice School is a fancy term for unpaid internship where you could either work for a company, or work under a professor to do some research work for some minor credit. For some weird reason, I was asked about what activity I did apart from academics in every interview that I attended. Did they really expect me to do anything apart from study and score marks? 😲 \s. So, obviously, I got a great internship in a very good company called the company of friends, located in the Boys hostel.
With an industry internship that went flying while jamming to "Hum Toh Udd Gaye, by Ritviz", I was left with no choice but to get a research internship. Thanks to my face, communication and luck I could convince one professor that it was me who discovered gravity and that I have some secretive potentially mind-blowing scientific research going on that would shock Stephen Hawking. I was a Research Assistant to a professor with initials MDD (which co-incidentally also stands for Major Depressive Disorder).
Chapter 2: Research Assistant. (Can skip)
Being a research assistant, my job was to build apps to capture data, propagate these apps to a set of users and generate datasets. Not bragging, but I could learn the technology he wanted and build apps very quickly. It was not production quality as this was the first "project" I was working on, but it was there. It could house about 100-150 users who actively used the application to log data. Spending more time on this, I neglected studies a bit. My grades were still the same thanks to the easing up of the portions and subjects. I absolutely loved what I was doing and the fact that I could see a weekly impact when I released a new version of the app was something that gave me immense thrill. The professor, too, was extremely impressed by my efforts and gave me a couple of interns to "manage" in order to churn more apps. This was fun, we experimented with multiple frameworks, presented our "research" work to a couple of potential "investors" and this experience improved my communication, presentation, documentation, coding and every other skill I could think of. In one of the monthly "appraisals" scheduled by the professor, I asked him how "industry-ready" I was and he gave me compliments like I was one of the many forms of Lord Vishnu. I was pretty satisfied and I could nail interviews (is what I thought).
Chapter 3: Placement Season (Please dont skip)
As 3rd year came to an end, placement season began. If placement season was a mood, it would be the mood associated with "winter is coming". The first company that opened up doors to an interview was Uber. With a pay package that equates to my family's 2-year pay, they came in with a bang. The first round was an online round. As I read the questions, I could physically feel my hair jump and fall off and the ones remaining grey themselves in order to fool my body that we're old now and death was only a matter of time right now. I could solve 1 question, but most people could solve 2. I discussed this online and found out that Uber notoriously asks difficult questions and that makes sense because they're paying a huge salary.
I was not aiming for such a huge salary, so I was fine. After this came Intuit, Microsoft, GS, AWS, HP, Cisco, Myntra, Sabre, Shell, Infosys and I could not clear even one of the first coding rounds. Sometimes I got 2 questions right, sometimes I got 1. But all the times I never got the interview.
I was genuinely depressed and realized that it was time to up my DSA game. This game isn't new to me. I was "preparing for placements" by referring to sources like HackerRank (which was the go-to choice of more than 90% of the recruiters). I reduced the time spent on this because I was convinced that my practical experience would be valued. I restarted my practice and one fine day a small company came to campus. They asked the simplest coding questions that just tested whether you could translate the logic to code. I could and I got in, after 38 rejections and 1 interview. Pretty much a TWICH-isq company.
Chapter 4: The work (Please dont, thx)
The company that I got into typically trains all the employees for 6-8 months before giving them work. But thanks to my practical experience, I was one of the 10% of people who were offered a role to join immediately out of college for a 2.5x increase in pay. I thought my luck was changing and apna time aayega ("my time will come"). I joined the company. Next month, I will have completed 3 years in this company. The most "development" work I have done is adding 3-4 minor "adapters" to the existing product and expanding support. Apart from that, I aided migration to Jira, Github, set up CI/CD pipelines, got the Wiki culture, etc. It's a very old fashioned place but what I got going for me is that my team is not rigid in their mindset. In the 3 years that I am here, my salary has increased by a grand total of 7% (not annually, overall).
The ACTUAL question.
I am restarting my algo and DS journey. But is there anything else I can do to get a job? Should I look at changing fields? If yes, how. I love management, CX, etc. I do love coding too, but I am not the competitive coder and that makes me believe that I am the impostor Among Us. I am not looking to go abroad as I am the sole breadwinner in my family and I can barely sustain with my present salary.
Thanks for reading if you did.
Sorry for making this Quora-isque, lordships of Reddit.
Thanks,
Regards,
Bye.
submitted by YentaSaawa to india [link] [comments]

Looking for help with choosing the appropriate method

Good daytime, dear stats-friends! <3 As a complete amateur in statistics I'm looking for advice on selecting the appropriate analysis for my dataset and purposes.
- I have some demographic variables (Gender (3 options), Urban or rural resident (2 options), Relationship Status (5 options), number of certificates in their profession, etc.)
- A questionnaire with 50 items: 38 is binary Yes/No; the rest asks respondents to select one of the 9 roles to describe themselves in different social situations. (*the roles always remain the same)
I have punched in the results of all 100 people who participated in SPSS and assigned a value to all of my responses.
What do I need to identify from this data? First and foremost:
  1. How people responded yes/no to different questions across the "male/female/other-please-specify" groups (How many men said 'yes' to owning a pet? How many women said 'yes' to being self-employed?)
  2. How many 'yes' responses in total, across all 38 questions, did singles give?
  3. Whether there is a relationship between any of the 9 roles and the number of certificates (do people with a higher number of certificates choose the role 'professional' more often across all situations?)
  4. Whether the 9 roles relate to each other somehow? (If people choose the role "friend" in situation Nr.1, do they also likely choose the role "father" in situation Nr.5?)
Answer to any of these would help me to progress greatly because I have been stuck with youtube tutorials and quora since yesterday, and I'm not even sure if I'm looking in the right direction.. ʕ⊙ᴥ⊙ʔ
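To make question 1 concrete, here is roughly what I'm after, sketched in Python with made-up column names (my real data lives in SPSS, so the variable names below are hypothetical):

```python
import pandas as pd

# Hypothetical export of the SPSS file, one row per respondent
df = pd.DataFrame({
    "gender":   ["male", "male", "female", "female", "other"],
    "owns_pet": ["yes",  "no",   "yes",    "yes",    "no"],
})

# Question 1: how did each gender answer a given yes/no item?
table = pd.crosstab(df["gender"], df["owns_pet"])
print(table.loc["female", "yes"])  # → 2
```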
Thank you very much in advance! <3
submitted by posh-magpie to AskStatistics [link] [comments]

Need help learning exploratory data analysis, preprocessing, classical (non-DL) methods in NLP, and ensembling.

Hi. I have a machine learning course in college. However, there is a project component I really need to get working on right now even though the course has just started. It will be a Kaggle contest, but also with elements where we have to justify our experimentation process. A key rule is that there is a blanket ban on all Deep Learning methods (though embeddings are allowed).

My problem statement is going to be a reduced-dataset version of the Quora Insincere Questions problem on Kaggle.

I have a fair bit of experience with Deep Learning based methods, thanks to the deeplearning.ai specialization, but almost know nothing of the classical methods like SVM, Kernels etc and I also know nothing about preprocessing, exploratory data analysis and other methods.
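For context, my rough understanding is that the classical baseline for this kind of task is TF-IDF features plus a linear model — something like the sketch below (toy data, not the contest solution):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the Quora Insincere Questions data
texts  = ["why is the sky blue", "how do plants grow",
          "why are those people so stupid", "what a dumb question you idiot"]
labels = [0, 0, 1, 1]  # 0 = sincere, 1 = insincere

# TF-IDF unigrams/bigrams feeding a linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
clf.predict(["how do birds fly"])
```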

Where should I learn these things from? Waiting for it to be covered in the regular course is not an option. At all.

Thanks,
pAkOdA
submitted by pakodanomics to learnmachinelearning [link] [comments]

[Question] Hardware recommendations for R

Hello!
Recently a close friend asked for my help on buying a new laptop, since she knows I'm into computer hardware. Naturally, my first question was her main use cases. She told me that the productivity application she uses the most is R for statistics. I'm not fully aware of the type of processing this program does, and while I read some stuff, like it loads the datasets in RAM and therefore benefits from larger amounts of it, or that it mainly uses a single thread unless you use it with 'snow' because it doesn't do parallelism very well by default, I'm still not completely confident on what it would benefit the most from.
The laptop she used is good (main specs listed below), but she has been feeling the need for an upgrade. I've found out that there is a benchmark to measure your CPU's effectiveness for R (although more specifically, I read it measures your CPU's number crunching ability, which would apply for R), called benchmarkme, but after looking for it for quite a while, I couldn't find a database of benchmarked CPUs to make up the hierarchy. Therefore, I can only guess what the program prioritizes (as in single thread performance mainly).
I was wondering if any of you had any decent benchmark page or resource so I can compare the different SKUs available today in consumer laptops by their R statistics ability.
Her old laptop relevant specs are:
CPU: I7 4702MQ (4 core 8 thread mobile CPU @ 2.2 GHz base clock)
RAM: 8 GB DDR4 (I assume a very slow speed, like most laptops, although she didn't mention it).
No discrete GPU.
No SSD.
1 TB HDD @ 5400 RPM.
If you can't provide any benchmarks for me to look at, just verbal knowledge of what R benefits from is very very welcome. Single thread performance? Multi thread performance? Up to X amount of cores? Is more than 16GB of RAM actually necessary? Does it get help by the GPU in any way (like CUDA acceleration)?
EDIT: Thank all of you for the input!
submitted by Coaris to statistics [link] [comments]

The Data Incubator: An In-Depth Review

Introduction

Hello, my name is Alexander and I'm a recent graduate of The Data Incubator program.
I know before I joined the program datascience was an invaluable tool in helping me make my final decision to attend (a big shout out to all members who responded to my private messages) and this in-depth (read: way too long) review is my small attempt to give back. I also believe that my perspective is somewhat unique in that most of the existing great reviews out there are from a fresh grad/post-doc perspective while mine is one from someone who is already an experienced professional looking to transition into data science.

About Me

I'm formerly trained in computer science and have been a Senior Software Engineer for over 15 years. I started my career with a deep love of operating systems and UNIX (I'm getting old, I remember installing RedHat from a dozen or so floppy disks) so the first half of my career has been spent writing fairly low-level/high performant (well, sometimes) code mainly in C/C++. So think storage and network drivers, boot code, deep packet inspection, and just general platform work.
However in 2016, I read "The Great A.I. Awakening" in the NYT and was completely blown away by it. The only time I have ever heard of anything related to neural networks was the venerable perceptron algorithm which I knew was used on some CPUs for branch prediction. But I had no idea how far the AI community had come with deep learning and vowed that I wanted to be part of the action too! Since then I've taken numerous online MOOCs on machine learning and now consider Andrew Ng to be one of my closest friends (disclaimer: I have never met Professor Ng).

The Data Incubator (TDI)

If you aren't familiar with TDI's Fellowship program, it is considered by many to be one of the premier data science bootcamps in the country (US anyway). It is an eight week program that is supposed to not only teach you the foundations of data science but also help you land a job as well through their ever expanding partner network. Their main competitor is probably Insight but they are also battling an entire cottage industry of multi-week camps such as Springboard and Metis to name a few.

Admissions

What makes TDI somewhat unique compared to other bootcamps is their non-trivial admission process, which is broken down into three rounds:
My guess is the majority of folks get past the first round provided you have a graduate degree from a reputable university but get rejected in the second round during the project challenge phase. They claim their acceptance rate is 2-3% which is about right: My cohort I think had ~4k applicants with a little less than 36 attending.
The project challenge is actually broken down into two or three subprojects with each subproject covering areas of probability, statistics, and basic data science (mostly dataset handling and EDA, not modeling). I would say knowledge of Python is pretty much required to get through the challenges intact.
This is how it works: TDI will send you a link to a real, midsized dataset (at least a few gigabytes) and ask you to perform some in-depth EDA about it. For the stats/probs part, they will ask you to write some code to simulate an experiment and then ask various basic probability questions about your results. So I would say if you are looking to "book-up" for the admissions process you should be fairly comfortable with Python, pandas/numpy, web scraping, and SQL. Obviously, challenge questions will vary with each cohort but my guess is they are all similar with respect to the skillsets you need to do them. You have a few days to finish it and can submit as many times as you want, i.e. you can work on some, submit, work on another part, submit, redo the first section, submit, etc.
I talked to several classmates about the challenge project and I think on average most folks said they spent at least 20 hours working on it - so be prepared. The admission process says it takes a few hours to finish but that is just not realistic unless you happen to be not only fluent in the above technologies (I was) but also familiar with the dataset in question (obviously not).
Frankly, I found all the challenge problems to be a lot of fun! I got to not only flex my data science muscles but also learn a few things along the way. However, I must admit that if I hadn't gotten accepted into the program it would have been a heavy investment on my part with very little gain (I literally didn't sleep one Friday night coding one of the challenges up, read: the wife was not happy).
If you make it pass the first two rounds, next is your capstone pitch which consists of a stand alone short video of yourself explaining what you want to do for your capstone project as well as a separate video preso further explaining it. It's typically very high level though; some candidates (including yours truly) had alpha/beta-ish projects from other courses as a basis which gave us a clear advantage while others were still in the incubation stage (literally a single page with scribbles on it that vaguely resembled "Look out, Data Science!").
Note that this round is mainly about gauging your personality, how you present in front of a crowd under a time crunch, and how articulate you are when talking about a technical topic. My main advice here would be to practice your pitch and have a few sensible slides to work off of. Please note that you do not get to share your desktop but rather have to send your slides to everyone over a chatroom, which means you can't drive the whole process as you normally would in a formal presentation.
If all goes well, you're in! Congrats!

Fellow vs. Scholar

During the admission's process, you can apply to be a Fellow or a Scholar but what does that really mean since both are part of the Fellowship program?
Fellows attend TDI tuition free but have to agree to interview with TDI's partner network for a period of time before being able to interview with any company of their choosing and have to attend in-person and thus can not be online.
Scholars on the other hand have to pay a tuition fee but are not tied to TDI's partner network. They can also attend online and are eligible for a 50% refund if they land a job with a partner.
However, after the admissions process, everyone is treated as a Fellow, i.e. there is no distinction during and after the program. It's purely an admission distinction only, and in fact the faculty at TDI treat everyone as Fellows - that includes partner meetings, projects, you name it. Again, there is no distinction once you start the program.
I attended the program as an Online Fellow since I worked full-time and was not going to leave my current job without another one in-hand.

Online vs In-Person

One aspect about the TDI Fellowship that really stands out is that it was designed to be accessible for online students since its inception. It's one of the major reasons why I applied in the first place and why I think the Insight program is a bit behind the times in this regard.
But that begs the question: Do you lose anything by being an Online Fellow instead of attending In-Person? Yes, there are a few drawbacks:

Location, Location, Location!

TDI is a self-styled WeWork company so all the classes are held in shared workspaces (read: at any given moment that location's wireless connection may just drop). At the time of this writing, the main office locations are in New York, San Francisco and D.C. Everyone else is online. Note that your daily lecture may be given from any of these sites based on what resident TDI DS is teaching it.
Here's the thing: If you do decide to attend in-person, your location will have a huge impact on your placement as the bulk of TDI partners are located in New York and San Francisco (which to some extent is to be expected). If you are willing to relocate though, then this may not be too much of a big deal. But for those looking for jobs in their local metropolitan area, you are most likely on your own. Don't get me wrong, TDI has partners worldwide (seriously they do) but there is definitely a high concentration of them that bookend the US Coasts.

Onboarding

Before you officially start your cohort, TDI has a 12-day onboarding program that they recommend you work on as well as a homework project that you must complete and submit before attending class. So be prepared to start coding on day one after accepting the Fellowship.
The 12-day program is a crash course in data structures and algorithms, probability/statistics, and Python. Take it seriously. One of the biggest mistakes many Fellows made was not going through the 12-day program in earnest and working on their day-one homework assignment late in the game. I'm telling you as someone who knew about 95% of the 12-day program that I still needed a refresher on a few things: When is the last time you did any kind of dynamic programming? When is the last time you had to write quicksort from scratch? You get the idea.
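(For the record, this is the kind of warm-up I mean — e.g. quicksort from scratch in Python:)

```python
def quicksort(xs):
    # Classic divide-and-conquer: pick a pivot, partition, recurse
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger  = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

quicksort([5, 2, 9, 1, 5])  # → [1, 2, 5, 5, 9]
```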

A Day In the Life of a Fellow

Each day the course follows the following outline:

Coding Challenges

Every morning you have an hour to do a coding challenge. They are mandatory and vary wildly in quality. One thing that really bothered me throughout the course was the fact that the coding challenges are somewhat random both in topic and difficulty. I also believe that HackerRank problems are generally less useful than, say, LeetCode, which groups coding interview questions by company, which is key if you are trying to find a common thread of topics across industries to study (and also somewhat motivating knowing full well that you may see that exact problem in an actual interview). There were many times while struggling on one particular HackerRank "Hard" problem where I was like no one (not even Google) is gonna ask me this.
The resident data scientists will go over the solution afterwards; though the reference solutions are sorely lacking in detail and occasionally flat out ridiculous, i.e. the solution will focus on brevity instead of explainability. But overall, I do think the coding challenges were good practice and gets you in the mode of what a job interview could be like.

Lectures

Every day there is a lecture on a particular aspect of a major overarching topic for that week. So one week it may be on machine learning while another week may be dedicated to Apache Spark. Since they are only an hour each, it can sometimes feel very overwhelming, or counter intuitively very underwhelming, depending on your existing background on the topic and the course material itself. Most of the material is driven from a bunch of loosely coupled Jupyter notebooks which is good and bad - I found there was absolutely no excuse to have to open up multiple notebooks for an hour long lecture. I think that is just lazy and the notebooks should be re-organized accordingly. But I admit it's a relatively minor grievance in the grand scheme of things.
As for quality, again, it varies a lot. For example, I got a lot out of the Apache Spark and MapReduce lecture series and miniprojects since I have never worked with either of those technologies before and was very eager to learn. However ironically, I didn't get that much out of the machine learning ones since I already knew most of the material.
Overall, I think the lectures were OK. It's just very hard to teach advanced subjects in one hour chunks and it shows.

Mini-Projects

The mini-projects are nothing short of fantastic! Seriously, if there is one aspect of the program that I think they got right it's this one. They are challenging, realistic, and attempt to really test your understanding of the subject matter they cover. They are also as a result a lot of work and sometimes a bit frustrating too (and this is coming from someone who finished all of them a month early).
Every Saturday that week's mini-project is due and is auto-graded on a 0.0 to 1.0 scale. You need to get a 0.9 or higher on every mini-project to graduate the program. If you fall behind then you lose access to the CRM until you catch up. One thing they made clear is that this isn't to punish the Fellow. Rather, it's to ensure you understand the underlying course material - and I agree with them. Completing these mini-projects not only gives you a sense of accomplishment but actually makes you feel like a data scientist!

Capstone

Throughout the course you will be working on your capstone. The capstone can be the one you pitched during the admission process or can be sponsored by a TDI partner. I know what you're thinking: Of course, I'm going to do one sponsored by a TDI partner since that will allow me to get in the door for a interview and land that dream job! Well, yes and no.
I had two online colleagues that did sponsored projects and were treated pretty poorly by their partners. One person finished the capstone and the partner didn't even show up to watch the person pitch it during Pitch Night (I'm pretty sure they didn't even get an interview to boot). Ironically, another was treated fairly well up until he actually finished the project (he did a fantastic job too) only to find out that his partner wasn't really interested in hiring him full-time. My advice is to research the "sponsor" first and try to gauge if there is a post-capstone process in place.
In general though, I would pick a capstone you feel somewhat passionate about - either in its subject matter or its methodology. Remember, this project is mainly for you in that it gives you something to talk about in an interview when asked what kind of DS have you done outside of saving passengers on the Titanic (drum roll please)!

Job Interview Lectures

These were obviously less useful for me since I have been in the industry for many years and have gone through several interviews in my career. There were a few things they stressed during the lecture that I strongly disagreed with (outside of maybe finance, I would never ever wear a suit and tie to an interview - not happening) but that's for another day.
Overall, the job presos were presented well, and I did learn how to write a proper cover letter (though a few of my more experienced colleagues debated whether anyone actually reads them). I am also very happy with my updated resume.

Pitch Night

Pitch Night is exactly what it sounds like: Fellows pitch their capstones to a few TDI partners in order to both sell themselves and indirectly the quality of the TDI program. I participated in it and I thought the experience was positive overall. I did have the feeling that Pitch Night is more about TDI showing off their product (read: me) more so than about Fellows getting actual job interviews. But to be fair, that might have been more to do with the group of partners that showed up than the actual format of the program itself (read: selection bias).
If you do happen to be one of the lucky souls who gets voted to do Pitch Night, I encourage you to do it. The process is a bit nerve-racking, but the TDI staff really excelled at making sure you were ready for it.

Partner Panels

A Partner Panel is where a TDI partner is invited to one of the WeWork office spaces scattered across the country to give some background about their company, how they hire, what it's like to be a data scientist, etc. Usually, an in-person Fellow is the panel "lead" and is responsible for introducing the partner and asking the first set of questions.
I found that the quality varied widely - some partners were really prepared and made me want to interview with them. Others, not so much. But even more disappointing was the fact that Online Fellows had practically zero interaction with these folks which put us at a disadvantage. I also think that this process could be improved a lot by formalizing the partner side more, i.e. require them to follow a certain format and answer a few standard questions right off the bat. But overall I think they were generally positive experiences and I did learn a lot by just listening to partners answer other people's questions.

The "CRM"

TDI's partner network is encapsulated by their internal CRM website that allows bidirectional communication between Fellows and Partners. Fellows can spam Partners with their CV/resumes and cover letters begging for an interview while Partners can peruse Fellow's resumes/CVs and contact them directly.
The good: TDI has built a fairly large partner network and it is ever growing. Also TDI's reputation as far as I can tell is pretty good within the industry, e.g. there are some partners who only hire TDI graduates believe it or not!
The bad: The CRM is simply not kept up to date. So there were many, many partners in the CRM that were either listed as inactive or unresponsive. Worse still, there were some partners who were listed as "active" who really weren't. Obviously, some of this stuff is out of TDI's control but it is disheartening to spend hours crafting a cover letter the stuff of legends only to find out the company isn't hiring.
The ugly: The site itself I found pretty awful in layout and design. Seriously, Wordpress is their friend. I also thought even simple things were missing, like complete descriptions of what the company does, whether they have hired TDI Fellows before, and any interviewing tips you should know (you can ask for this, but I think it should just be baked into the CRM as a free service for Fellows).

Conclusion/TLDR

TDI is overall a good program and definitely helped me transition into data science. But your results will vary a lot depending on your location, your background, and the number of partners involved in your particular cohort. I think the TDI staff is excellent but there are numerous places where they could improve the Fellowship's overall daily flow as well as make it a bit more personal (especially for Online folks).

FAQ

You can, but it will be difficult. I did the program this way but benefited from the fact that I knew the basics of all the topics being discussed and I already worked from home twice a week. The latter allowed me the flexibility to attend all of the mid-afternoon lectures and participate in my capstone study group. Lectures are usually at noon EST so I could just use my lunch hour to watch them. There were a few Online Fellows who did something similar. Some were successful; some still struggled to balance everything and fell behind (but did eventually complete the program).
But the workload is a lot so be prepared to work long nights and weekends. I lost every night and my entire weekend for a few weeks which can be really tough if you have a family (read: I do).
Probably not. Based on talking to a lot of folks who took the program in past cohorts, most were still looking for jobs months after the program ended. Again, it's really the luck of the draw when it comes to the number of partners who are participating and how many of them are actually hiring.
I would say generally speaking, yes! Particularly if you have no formal background in DS whatsoever (like yours truly). It gives you an instant network to work with between TDI's and your fellow classmates. Moreover, if things go well, you will at least land a few interviews off the bat and get some practice in. All good experiences for landing that first DS job!
It depends. I believe in general, for smaller companies, you absolutely do have an advantage as a Fellow over a random applicant off the ether since you usually get to talk directly to a hiring manager. For larger companies though, I think you are treated like everyone else as most of the contacts are either someone in HR or a corporate wide recruiter.
It varies wildly depending again on your location, your existing experience, and the industry you are working in. You already knew that so this answer is not going to be very satisfying. However, it is the ground truth (literally).
I forgot where I read this (either here or Quora) but this isn't universally true. There is some truth to it, particularly for start-ups and small outfits where resources are by definition limited. But for medium to large enterprises, a finder's fee is fairly typical in the industry and really has nothing to do with a certain position's salary range.
Honestly, if you can afford it, I would advocate that you just take the course. Also, TDI is heavily biased towards having a PhD for the tuition-free Fellowship program so if you only have a Master's degree keep that in mind.
As I said earlier, it's not more prestigious - everyone is a Fellow once admitted, in the eyes of both the staff and prospective employers. It's simply a matter of cost.
Unfortunately, no there isn't. Apparently, this has to do with their partner confidentiality agreements (at least that was my impression after inquiring).

TDI Alum Slack Channel

I've started a TDI Alumni Slack channel which is invite only. Please PM me for details.
submitted by pisymbol to datascience [link] [comments]

Start learning programming: "Here are the best platforms for you"

Step-by-step help for you:
Platforms: Node.js, Frontend Development, iOS, Android, IoT & Hybrid Apps, Electron, Cordova, React Native, Xamarin, Linux, Containers, OS X, Command-Line, Screensavers, watchOS, JVM, Salesforce, Amazon Web Services, Windows, IPFS, Fuse, Heroku
Programming Languages: JavaScript (Promises, Standard Style, Must Watch Talks, Tips, Network Layer, Micro npm Packages, Mad Science npm Packages, Maintenance Modules, npm, AVA, ESLint), Swift (Education, Playgrounds), Python, Rust, Haskell, PureScript, Go, Scala, Ruby (Events), Clojure, ClojureScript, Elixir, Elm, Erlang, Julia, Lua, C, C/C++, R, D, Common Lisp, Perl, Groovy, Dart, Java (RxJava), Kotlin, OCaml, Coldfusion, Fortran, .NET, PHP, Delphi, Assembler, AutoHotkey, AutoIt, Crystal, TypeScript
Front-end Development: ES6 Tools, Web Performance Optimization, Web Tools, CSS (Critical-Path Tools, Scalability, Must-Watch Talks, Protips), React (Relay), Web Components, Polymer, Angular 2, Angular, Backbone, HTML5, SVG, Canvas, KnockoutJS, Dojo Toolkit, Inspiration, Ember, Android UI, iOS UI, Meteor, BEM, Flexbox, Web Typography, Web Accessibility, Material Design, D3, Emails, jQuery (Tips), Web Audio, Offline-First, Static Website Services, A-Frame VR (virtual reality), Cycle.js, Text Editing, Motion UI Design, Vue.js, Marionette.js, Aurelia, Charting, Ionic Framework 2, Chrome DevTools
Back-end Development: Django, Flask, Docker, Vagrant, Pyramid, Play1 Framework, CakePHP, Symfony (Education), Laravel (Education), Rails (Gems), Phalcon, Useful .htaccess Snippets, nginx, Dropwizard, Kubernetes, Lumen
Computer Science: University Courses, Data Science, Machine Learning (Tutorials), Speech and Natural Language Processing (Spanish), Linguistics, Cryptography, Computer Vision, Deep Learning - Neural networks (TensorFlow, Deep Vision), Open Source Society University, Functional Programming, Static Analysis & Code Quality, Software-Defined Networking
Big Data: Big Data, Public Datasets, Hadoop, Data Engineering, Streaming
Theory: Papers We Love, Talks, Algorithms, Algorithm Visualizations, Artificial Intelligence, Search Engine Optimization, Competitive Programming, Math
Books: Free Programming Books, Free Software Testing Books, Go Books, R Books, Mind Expanding Books, Book Authoring
Editors: Sublime Text, Vim, Emacs, Atom, Visual Studio Code
Gaming: Game Development, Game Talks, Godot (game engine), Open Source Games, Unity (game engine), Chess, LÖVE (game engine), PICO-8 (fantasy console)
Development Environment: Quick Look Plugins (OS X), Dev Env, Dotfiles, Shell, Command-Line Apps, ZSH Plugins, GitHub (Browser Extensions, Cheat Sheet), Git Cheat Sheet & Git Flow, Git Tips, Git Add-ons, SSH, FOSS for Developers
Entertainment: Podcasts, Email Newsletters
Databases: Database, MySQL, SQLAlchemy, InfluxDB, Neo4j, Doctrine (PHP ORM), MongoDB
Media: Creative Commons Media, Fonts, Codeface (text editor fonts), Stock Resources, GIF, Music, Open Source Documents, Audio Visualization
Learn: CLI Workshoppers (interactive tutorials), Learn to Program, Speaking, Tech Videos, Dive into Machine Learning, Computer History
Security: Application Security, Security, CTF (Capture The Flag), Malware Analysis, Android Security, Hacking, Honeypots, Incident Response
Content Management System: Umbraco, Refinery CMS
Miscellaneous: JSON, Discounts for Student Developers, Slack Communities, Conferences, GeoJSON, Sysadmin, Radio, Awesome, Analytics, Open Companies, REST, Selenium, Endangered Languages, Continuous Delivery, Services Engineering, Free for Developers, Bitcoin, Answers (Stack Overflow, Quora, etc.), Sketch (OS X design app), Places to Post Your Startup, PCAPTools, Remote Jobs, Boilerplate Projects, Readme, Tools, Styleguides, Design and Development Guides, Software Engineering Blogs, Self Hosted, FOSS Production Apps, Gulp, AMA (Ask Me Anything, Answers), Open Source Photography, OpenGL, Productivity, GraphQL, Transit, Research Tools, Niche Job Boards, Data Visualization, Social Media Share Links, JSON Datasets, Microservices, Unicode Code Points, Internet of Things, Beginner-Friendly Projects, Bluetooth Beacons, Programming Interviews, Ripple (open source distributed settlement network), Katas, Tools for Activism, TAP (Test Anything Protocol), Robotics, MQTT ("Internet of Things" connectivity protocol), Hacking Spots, For Girls, Vorpal (Node.js CLI framework), OKR Methodology (goal setting & communication best practices), Vulkan, LaTeX (typesetting language), Network Analysis, Economics (an economist's starter kit)
A few more resources:
submitted by Programming-Help to Programming_Languages [link] [comments]

In honor of Hanukkah, let's celebrate the Jews who played a crucial role in the founding of the Effective Altruism movement

🕎🕎🕎 HAPPY HANUKKAH 🕎🕎🕎
I found this piece posted as an anonymous answer on Quora. My intent in posting this is to celebrate Jewish EAs, not to be in any way anti-Semitic or political. Without further ado:
It is worth pointing out that many of the founders and prominent members of the effective altruism movement are ethnic Jews. The philosophical foundations of EA are usually traced back to Peter Singer’s 1971 essay “Famine, Affluence, and Morality,” which argued that it is immoral to spend money on luxuries when we could instead use that money to save lives. Peter Singer is an atheist utilitarian philosopher of Jewish descent whose grandparents died in the Holocaust. Even though Singer penned his argument in the 1970s, the EA community did not come about until the 21st century. There were a few key factors which together led to the formation of the EA movement as we know it today.
The first factor was the creation of the charity evaluator GiveWell, which analyzes different global health interventions in order to find the most cost-effective donation opportunities. GiveWell was founded in 2006 by hedge-fund analysts Holden Karnofsky and Elie Hassenfeld, both (ethnically) Jewish.
The second factor is the community formed around the website LessWrong. This website is dedicated to the “art of human rationality,” i.e. how to form accurate beliefs and effectively achieve your goals. It is easy to see the idea of rigorously analyzing charities for cost-effectiveness would appeal to this crowd. The LessWrong community also took a particular interest in reducing catastrophic risks that threaten human extinction, mainly focusing on risks due to advanced AI. Global catastrophic risks (including AI risk) remain a key focus area of EA to this day. Oh, and by the way, the founder of LessWrong is Eliezer Shlomo Yudkowsky. Enough said.
The third factor is the emergence of Giving What We Can at Oxford University. To the best of my knowledge, the key people involved in this project were gentiles, but I could be wrong.
Finally, the philanthropist funder who is most closely connected to the EA movement is Facebook co-founder Dustin Moskovitz. After reading Peter Singer’s book The Life You Can Save, he became convinced that he should use his fortune to most effectively help those in need. He reached out to GiveWell for advice on how to use his money. This collaboration resulted in the Open Philanthropy Project, which has donated over $928,000,000 as of November 2019.
Now, to answer the headline question [What proportion of effective altruists are Jewish?], I will draw on the 2018 Effective Altruism community survey and the 2019 Slate Star Codex survey. Slate Star Codex (SSC) is a blog that is connected to the LessWrong and effective altruism communities (and also happens to be written by a secular Jew).
The EA survey asked people for their religious affiliation. In total, 41 of 2,607 respondents (1.57%) identified as religious Jews. If we restrict the results to respondents who claimed to be religious at all, we get 41 out of 387, or 10.59%. I would guess that the number of secular Jews in the community is much higher than the number of practicing Jews. The EA movement as a whole tends toward irreligion, with a majority identifying as atheist, agnostic, or non-religious. Unfortunately, the EA survey does not provide any further helpful information for identifying the proportion of secular Jews.
Luckily, the SSC survey does provide that info. It asked separate questions for “Religious Denomination” and “Religious Background”. It turns out that 12.3% of SSC readers who currently practice a religion are Jewish. Furthermore, 9.6% of SSC readers have a Jewish family background, whether or not they are practicing.
However, not all SSC readers are effective altruists. So I downloaded the public SSC dataset and restricted my query to those who answered “Yes” when asked whether they identify as an effective altruist (other options were “No” and “Sorta”). The results are stunning: 18.89% of religious EA SSC readers are Jewish, and 13.94% of all EA SSC readers have a Jewish family background. Note that this is not necessarily representative of the entire EA movement, as SSC tends to attract a certain type of person. Nonetheless, it is pretty interesting.
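The restriction step described above can be sketched in pandas. The column names, answer labels, and the tiny stand-in table below are assumptions for illustration, not the real SSC survey export:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the public SSC survey export;
# the real file and its column names may differ.
df = pd.DataFrame({
    "EffectiveAltruist": ["Yes", "No", "Sorta", "Yes", "Yes", "No"],
    "ReligiousBackground": ["Jewish", "Christian", "None", "Jewish", "None", "Jewish"],
})

# Restrict to respondents who answered "Yes" to identifying as an EA...
ea = df[df["EffectiveAltruist"] == "Yes"]

# ...then compute the share of those with a Jewish family background.
share = (ea["ReligiousBackground"] == "Jewish").mean()
print(f"{share:.2%}")
```

On the actual survey export, the same boolean filter followed by `.mean()` is all that is needed to reproduce percentages like the 13.94% figure quoted above.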
If I had to guess why so many Jews are involved with the EA movement, I would hypothesize that it is because the EA movement tends to attract intelligent, well-educated, and middle/upper class individuals. Jews tend to score highly on all of these metrics.
submitted by FriendlyLimit5 to EffectiveAltruism [link] [comments]

Relearning modern C++ (or give up and go to python)

Hi all, sorry for the length of this post. I'm curious to get back into programming and had a few questions.
For background: I took courses in C++ in college many years ago (did pretty well) but took a very different turn in my career path. Anyways, it was mostly learning to code linked lists, red-black trees, big-Os, that sorta thing.
I've been interested in getting back to programming for a few small ideas I had, and looked at my old college notebooks and had a small heart attack (dear god please don't make me write bubble sort again!).
I have been doing some Python programming with my kids (first Bryson Payne's book, then Python for Kids, which we're doing now). It seems like the Python libraries are full of great stuff, much of which has C/C++ under the hood. I was contemplating just going forward with that and taking some more advanced courses myself, but I have to say I do miss C++, for maybe no good reason other than that I liked it a lot in the past.
Anyways, I then watched a pretty inspirational talk by Bjarne. He also mentions in it that running a dataset in Python took 3 or 4 days while his C++ code could run it in 10 minutes. Yikes! I thought there was better optimization in Python. Anyways, I got his book Programming: Principles and Practice Using C++ and I'm still early into it. Some things are the same but clearly many things have changed (smart pointers? huh?). I'm happy to take the time to learn it. I'm not in a rush. This is more of a hobby. I do like the idea of knowing what's going on under the hood. My question is, is C++ beyond my scope?
I found this post to be helpful but I had a few lingering questions in no particular order.
  1. If I stick with C++ how necessary will it be to learn CMake in order to build my projects? Even in college for bigger projects, I just compiled everything at the command line. Is this no longer very feasible?
  2. Is there a reasonable equivalent in C++ to Python's data science stack? I have a few little data sets to analyze. I've seen a few posts about possible alternatives. Will these be much more complicated to learn than NumPy and SciPy? Would I, in other words, be shooting myself in the foot trying to do this in C++?
  3. Also looking into a small webapp at some point. Is Wt a reasonable alternative to use instead of say python/django?
  4. How many of these libraries have been updated to use the new "modern" C++ style with smart pointers and whatnot, or are most of these projects still using the older style, so you end up with a semi-balkanized project with old and new pointers (excuse the pun) being used side by side?
I guess my question is, if you only had time to learn one language and were more of a dabbler in 2018 would you all as C++ programmers still think it's worth the investment? Or would I save myself a lot of headache by sticking with python?
submitted by DocInLA to cpp [link] [comments]

random forest and CART classification

I can't understand one thing about the random forest.
Could you please help me with an example?
I have doubts about random forest used for classification and for regression. As a classification:
Suppose we need to classify whether a patient has a disease or not based on many variables.
We take into consideration many variables such as age, weight, heartbeat, height, etc. My questions are:
1) When a new patient comes, how does the model determine whether he has the disease or not? Is the model trained first with a dataset, and then you see where it went wrong (it predicted "disease" when in reality the patient was not sick)?
What happens to models that have made a mistake? Are they discarded?
2) I don't understand how it works, for example how a sample propagates towards the leaves.
In this example we see that we start with a root node
https://qph.fs.quoracdn.net/main-qimg-a450d2b8d87a4368f234e034591c205c — if x1 is greater than 0.5, we go to the next node, where a new condition is imposed ("x2 is greater than 0.5")...
but wasn't the variable x1? https://www.quora.com/How-does-random-forest-work-for-regression-1
I'm talking about CART classification... what's the point of that? Why do we need to trace a discrimination line?
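To make question 1 concrete, here is a minimal scikit-learn sketch. The patient data and the "disease" rule are synthetic assumptions for illustration: the forest is trained once on labeled data, trees that make mistakes are kept rather than discarded, and a new patient simply receives the majority vote of all the trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical patient data: columns stand in for age, weight, heart rate, height.
X = rng.normal(size=(200, 4))
# Toy labeling rule standing in for "has the disease" (an assumption).
y = (X[:, 0] + X[:, 2] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is fit on a bootstrap sample of the training set; trees that err
# are NOT discarded -- at prediction time every tree votes and the majority wins.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# A new patient is passed down every tree; the forest returns the majority vote.
new_patient = [[0.5, -0.2, 1.1, 0.3]]
print(clf.predict(new_patient))   # predicted class label (0 = healthy, 1 = disease)
print(clf.score(X_test, y_test))  # accuracy on held-out patients
```

Held-out accuracy (rather than throwing away "wrong" trees) is how you see where the model went wrong: compare predictions on patients it never saw during training against their true labels.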
submitted by luchins to AskStatistics [link] [comments]

How long would it take you (professional-level Data Scientist) to complete these projects?

I've seen other questions about time-to-completion and the answers are usually "couple hours or a couple days". I think this is a better way of getting a frame of reference for the "level" of competency I should aim for while learning.
Here are the projects: https://www.quora.com/What-are-some-good-data-science-projects
I'm assuming that these projects are at an amateur level (correct me if I'm wrong). Is there any common project/dataset out there that, if someone got a certain level of accuracy and completed in a certain amount of time, would make you think "this guy has the chops to work with me"? Perhaps not the resume (or the "rest of the package" that would result in a job offer), but simply a level of skill that's up to industry standards.
Edit: these are great, incredibly informative answers. Please keep them coming!
submitted by pretysmitty to datascience [link] [comments]

2k Subs from r/writing and r/teenagers users

  1. CaesarNaples2
  2. copypastapublishin
  3. u_ButterflyLaunch
  4. Chatbots
  5. SoftwareEngineering
  6. Bauhaus
  7. craziness
  8. MaterialDesign
  9. astrapocalypse
  10. tumblr
  11. spoilersoftheuniverse
  12. dontridetwice
  13. Noearthsociety
  14. OldSchoolCool
  15. discordian
  16. publishcopypasta
  17. sorcerytheofspectacle
  18. copypasta
  19. ZHCSubmissions
  20. The_Dennis
  21. weirdwritingweekend
  22. radioheadfanfic
  23. oklahoma
  24. writers
  25. 00AG9603
  26. SocialMediaMarketing
  27. Art
  28. writingcirclejerk
  29. writing
  30. HFY
  31. discordia
  32. SubChats
  33. gameideas
  34. offmychest
  35. raisedbynarcissists
  36. WritingPrompts
  37. DestructiveReaders
  38. literature
  39. Jokes
  40. circlejerk
  41. circlejerkcopypasta
  42. u_GallowBoob
  43. wholesomebpt
  44. MadeMeSmile
  45. fakeprehistoricporn
  46. NatureIsFuckingLit
  47. aww
  48. nextfuckinglevel
  49. interestingasfuck
  50. Wellthatsucks
  51. pics
  52. oddlysatisfying
  53. holdmycosmo
  54. BetterEveryLoop
  55. whitepeoplegifs
  56. BlackPeopleTwitter
  57. WhitePeopleTwitter
  58. madlads
  59. Damnthatsinteresting
  60. memes
  61. Eyebleach
  62. cosplaygirls
  63. StoppedWorking
  64. thisismylifenow
  65. photoshopbattles
  66. wholesomememes
  67. birding
  68. XboxOneGamers
  69. zoology
  70. coins
  71. gardening
  72. AskReddit
  73. news
  74. mildlyinteresting
  75. woodworking
  76. TrueChristian
  77. Christianity
  78. Showerthoughts
  79. botany
  80. whatsthisplant
  81. Agronomy
  82. science
  83. Bass
  84. DCcomics
  85. NZXT
  86. H3VR
  87. miband
  88. patientgamers
  89. Doom
  90. ThisAmericanLife
  91. Massdrop
  92. tf2
  93. ElgatoGaming
  94. emacs
  95. u_awkisopen
  96. Spyro
  97. AHatInTime
  98. Portal
  99. Huawei
  100. soylent
  101. JonTron
  102. TopGear
  103. Windows10
  104. naut
  105. gaming
  106. RESissues
  107. FirstPersonSoda
  108. titlegore
  109. dogecoin
  110. tf2circlejerk
  111. NotTimAndEric
  112. Enhancement
  113. texas
  114. iamverysmart
  115. trees
  116. sailormoon
  117. bleachshirts
  118. facepalm
  119. thatHappened
  120. cringepics
  121. pitbulls
  122. dbz
  123. rareinsults
  124. Brawlhalla
  125. dankmemes
  126. PornhubComments
  127. marvelstudios
  128. GamePhysics
  129. gifsthatendtoosoon
  130. PrequelMemes
  131. comedyheaven
  132. me_irl
  133. MurderedByWords
  134. FortNiteBR
  135. ClashOfClans
  136. oddlyterrifying
  137. therewasanattempt
  138. creepyPMs
  139. AskOuija
  140. woahdude
  141. animalsdoingstuff
  142. trashpandas
  143. MealPrepSunday
  144. OldManDog
  145. vaxxhappened
  146. lewronggeneration
  147. iphone
  148. ExpectationVsReality
  149. mlem
  150. BeAmazed
  151. instant_regret
  152. nononono
  153. blackmagicfuckery
  154. yesyesyesyesno
  155. Whatcouldgowrong
  156. TumblrInAction
  157. legaladvice
  158. dogpictures
  159. toofers
  160. gatekeeping
  161. dontdeadopeninside
  162. mildlyinfuriating
  163. assholedesign
  164. Instagram
  165. uselessredcircle
  166. WTF
  167. Badfaketexts
  168. GarlicBreadMemes
  169. whatisthisthing
  170. Advice
  171. 13or30
  172. SwordOrSheath
  173. WtWFotMJaJtRAtCaB
  174. shittyfoodporn
  175. bigboye
  176. FellowKids
  177. DiWHY
  178. NotHowDrugsWork
  179. Incorgnito
  180. im14andthisisdeep
  181. WhatsWrongWithYourDog
  182. IDontWorkHereLady
  183. DesignPorn
  184. ShitCosmoSays
  185. delusionalcraigslist
  186. Delightfullychubby
  187. fatlogic
  188. engrish
  189. CozyPlaces
  190. mildlypenis
  191. rarepuppers
  192. blunderyears
  193. LetsNotMeet
  194. h3h3productions
  195. cringe
  196. pugs
  197. Pareidolia
  198. natureismetal
  199. StarWars
  200. marvelmemes
  201. splatoon
  202. just2good
  203. smashbros
  204. cursedimages
  205. AnimalCrossing
  206. TIHI
  207. Twitter
  208. GoCommitDie
  209. amiibo
  210. lego
  211. technicallythetruth
  212. disney
  213. inthesoulstone
  214. NintendoSwitch
  215. Marvel
  216. discordapp
  217. SpidermanPS4
  218. ShovelKnight
  219. NotMyJob
  220. CrappyDesign
  221. PhotoshopRequest
  222. Bossfight
  223. thanosdidnothingwrong
  224. northernireland
  225. unexpectedpawnee
  226. shittymobilegameads
  227. Valefisk
  228. UnexpectedPsych
  229. paydaytheheist
  230. CommunismMemes
  231. unexpectedyogscast
  232. softwaregore
  233. the_revolupun
  234. HiTMAN
  235. SuddenlyCommunist
  236. cardsagainsthumanity
  237. BitLifeApp
  238. Punny
  239. onejob
  240. HistoryMemes
  241. Unexpected
  242. TwoSentenceHorror
  243. cavetown
  244. LucidDreaming
  245. Journaling
  246. Leathercraft
  247. InfinityTrain
  248. anime
  249. LifeProTips
  250. apexlegends
  251. Guitar
  252. skateboarding
  253. gifsthatkeepongiving
  254. Magic
  255. SootHouse
  256. Roku
  257. PewdiepieSubmissions
  258. hypnosis
  259. FORTnITE
  260. psych
  261. nonononoyes
  262. Cinemagraphs
  263. toptalent
  264. INEEEEDIT
  265. AnimalsBeingDerps
  266. AnimalsBeingJerks
  267. lifehacks
  268. FastWorkers
  269. UNBGBBIIVCHIDCTIICBG
  270. Aquariums
  271. politics
  272. TheOnion
  273. worldnews
  274. nottheonion
  275. fakehistoryporn
  276. gay_irl
  277. sweden
  278. Music
  279. boottoobig
  280. fakealbumcovers
  281. PoliticalHumor
  282. videos
  283. lgbt
  284. SuddenlyGay
  285. emojipasta
  286. gifs
  287. trashy
  288. ihadastroke
  289. sequence
  290. AteTheOnion
  291. EarthPorn
  292. Fireteams
  293. BOLC
  294. SkincareAddiction
  295. SkincareAddicts
  296. ApparentlyArt
  297. youdontsurf
  298. trippinthroughtime
  299. funny
  300. AdviceAnimals
  301. dankchristianmemes
  302. DeepFriedMemes
  303. FunnyandSad
  304. Instagramreality
  305. CatsStandingUp
  306. Calgary
  307. japanpics
  308. thebachelor
  309. weddingplanning
  310. AirBnB
  311. BigBrother
  312. femalefashionadvice
  313. meirl
  314. cats
  315. MakeupAddiction
  316. AsianBeauty
  317. japan
  318. UpliftingNews
  319. SweatyPalms
  320. absolutelynotmeirl
  321. 2meirl42meirl4meirl
  322. suspiciouslyspecific
  323. Catloaf
  324. confusing_perspective
  325. todayilearned
  326. meormyson
  327. teenagers
  328. jailbreak
  329. suicidebywords
  330. terriblefacebookmemes
  331. insanepeoplefacebook
  332. hitmanimals
  333. totallynotrobots
  334. AbruptChaos
  335. unixporn
  336. MemeHunter
  337. MonsterHunter
  338. ngpluscreations
  339. opus_magnum
  340. horizon
  341. KerbalSpaceProgram
  342. Horror_Game_Videos
  343. NoMansSkyTheGame
  344. ConsoleKSP
  345. metalgearsolid
  346. destiny2
  347. applehelp
  348. Naruto
  349. pkmntcgcollections
  350. pokemoncardcollectors
  351. KidsAreFuckingStupid
  352. AnimalsBeingBros
  353. fasting
  354. funnyvideos
  355. sports
  356. CFB
  357. techsupport
  358. iastate
  359. 3amjokes
  360. Stargate
  361. Gloryhammer
  362. buildmeapc
  363. cfbcirclejerk
  364. dadjokes
  365. AmITheAngel
  366. ShitPoliticsSays
  367. findareddit
  368. 1star
  369. TheMonkeysPaw
  370. CFL
  371. PRTwitter
  372. FloridaMan
  373. HaveWeMet
  374. civ
  375. buildapc
  376. CollegeBasketball
  377. antiMLM
  378. backpacks
  379. computers
  380. unpopularopinion
  381. boomershumor
  382. ComedySuicide
  383. CitiesSkylines
  384. sabaton
  385. Blackops4
  386. Ligue1
  387. WomensSoccer
  388. soccer
  389. anonymousgoals
  390. MindHunter
  391. okbuddyretard
  392. IncelTears
  393. freefolk
  394. HydroHomies
  395. Fuckthealtright
  396. beholdthemasterrace
  397. iamatotalpieceofshit
  398. holdmyvodka
  399. maybemaybemaybe
  400. facebookwins
  401. fuckwasps
  402. PeopleFuckingDying
  403. HoldMyKibble
  404. PublicFreakout
  405. holdmyjuicebox
  406. awfuleverything
  407. agedlikemilk
  408. LateStageCapitalism
  409. HongKong
  410. television
  411. youseeingthisshit
  412. hmmm
  413. Moviesinthemaking
  414. askscience
  415. FFXIVhousingmarket
  416. HumansBeingBros
  417. combinedgifs
  418. blackpeoplegifs
  419. nostalgia
  420. u_MyNameGifOreilly
  421. wholesomegifs
  422. UnexpectedlyWholesome
  423. barkour
  424. RedditPotluck
  425. Awwducational
  426. holdmybeer
  427. IdiotsInCars
  428. MadeMeCry
  429. reactiongifs
  430. holdmyredbull
  431. HighQualityGifs
  432. greece
  433. ereader
  434. KGBTR
  435. lost
  436. instantkarma
  437. JusticeServed
  438. Anarcho_Capitalism
  439. IntellectualDarkWeb
  440. selfpublish
  441. FFVIIRemake
  442. neoliberal
  443. samharris
  444. HouseOfCards
  445. PowerTV
  446. justneckbeardthings
  447. scifiwriting
  448. YangForPresidentHQ
  449. Gamingcirclejerk
  450. Libertarian
  451. aznidentity
  452. daverubin
  453. Cyberpunk
  454. litrpg
  455. postapocalyptic
  456. books
  457. movies
  458. moviescirclejerk
  459. badscience
  460. EnoughCommieSpam
  461. Destiny
  462. starterpacks
  463. JRPG
  464. worldjerking
  465. PokemonCirclejerk
  466. tulsi
  467. pokemon
  468. PokemonSwordAndShield
  469. legendofkorra
  470. COMPLETEANARCHY
  471. poetasters
  472. listentothis
  473. bookscirclejerk
  474. Poetry
  475. PoliticalCompass
  476. doggohate
  477. poetry_critics
  478. OCPoetry
  479. TrueFilm
  480. tipofmytongue
  481. ShittyPoetry
  482. pathologic
  483. Jalopy
  484. TheGreatWarChannel
  485. britishproblems
  486. ShitAmericansSay
  487. OCPoetryCirclejerk
  488. Nonprofit_Jobs
  489. FunkoPopsCircleJerk
  490. tipofmyjoystick
  491. ilikthebred
  492. totalwar
  493. shittywritingprompts
  494. AskScienceFiction
  495. AskSocialScience
  496. IASIP
  497. toledo
  498. TopMindsOfReddit
  499. ShittyFanTheories
  500. 40kLore
  501. circlebroke2
  502. ScenesFromAHat
  503. Lovecraft
  504. TrumpCriticizesTrump
  505. changemyview
  506. The_Mueller
  507. PetMice
  508. Dreams
  509. esist
  510. EnoughTrumpSpam
  511. RationalizeMyView
  512. RussiaDenies
  513. Impeach_Trump
  514. forwardsfromgrandma
  515. AskVet
  516. AsABlackMan
  517. u_BlueLadybug92
  518. pointlesslygendered
  519. learntodraw
  520. learnart
  521. ICanDrawThat
  522. Animemes
  523. redditgetsdrawn
  524. KeepWriting
  525. CongratsLikeImFive
  526. writingprompt
  527. learnHentaiDrawing
  528. AskRedditAfterDark
  529. reddithelp
  530. askatherapist
  531. roadtrip
  532. whatsthisbug
  533. DragonPrince
  534. miraculousladybug
  535. relationship_advice
  536. Writer
  537. u_Peterwynmosey
  538. selfpromotion
  539. LitWorkshop
  540. shortstories
  541. Veganism
  542. flashfiction
  543. vegan
  544. environment
  545. WritersGroup
  546. classicwow
  547. fantasywriters
  548. Philza
  549. Unextexted
  550. PowerMetal
  551. scrivener
  552. thesims
  553. malehairadvice
  554. celestegame
  555. bald
  556. FlowScape
  557. Minecraft
  558. MetalPlaylists
  559. characterdrawing
  560. spotify
  561. DivinityOriginalSin
  562. dndmemes
  563. criticalrole
  564. TurkeyJerky
  565. wonderdraft
  566. worldbuilding
  567. skyrimmods
  568. explainlikeimfive
  569. Dell
  570. grammar
  571. AskLEO
  572. Cooking
  573. workout
  574. videogames
  575. bookshelf
  576. jazzcirclejerk
  577. Swimming
  578. ApplyingToCollege
  579. ImaginaryDragons
  580. blurrypicturesofcats
  581. BetaReaders
  582. self
  583. CasualConversation
  584. Sat
  585. learnmath
  586. teefies
  587. Cursed_Images
  588. guitars
  589. usa
  590. SupermodelCats
  591. SampleSize
  592. Neverbrokeabone
  593. IBO
  594. Stoicism
  595. wallstreetbets
  596. restofthefuckingowl
  597. indianpeoplefacebook
  598. starcitizen
  599. loseit
  600. Fitness
  601. AbsoluteUnits
  602. CircleofTrust
  603. disneyvacation
  604. fpv
  605. niceguys
  606. LitecoinMarkets
  607. bonehurtingjuice
  608. AskDocs
  609. litecoin
  610. ComedyCemetery
  611. NamFlashbacks
  612. FashionReps
  613. THE_PACK
  614. aviation
  615. surrealmemes
  616. piano
  617. theisle
  618. Blep
  619. braces
  620. quityourbullshit
  621. youngartists
  622. ToolBand
  623. youtubehaiku
  624. Catholicism
  625. LouderWithCrowder
  626. ketorecipes
  627. Conservative
  628. Screenwriting
  629. beerporn
  630. GameOfThronesMemes
  631. Filmmakers
  632. WoT
  633. Muppets
  634. steampunk
  635. mutemath
  636. n64
  637. overthegardenwall
  638. gamemusic
  639. Denver
  640. miniSNESmods
  641. SteinsGateMemes
  642. animegifs
  643. SUBREDDITNAME
  644. flowers
  645. chillhop
  646. rush
  647. Watches
  648. RussianTrollSpotting
  649. GreenBayPackers
  650. StardewValley
  651. pokemontrades
  652. LosAngelesRams
  653. Eve
  654. EvilLeagueOfEvil
  655. Anglicanism
  656. UniversityofKansas
  657. Arkansas
  658. Archery
  659. Iteration110Cradle
  660. Crunchyroll
  661. evangelion
  662. Fantasy
  663. malefashionadvice
  664. Audi
  665. Piracy
  666. 52book
  667. suggestmeabook
  668. booksuggestions
  669. whiteknighting
  670. sololeveling
  671. audible
  672. emulation
  673. kindle
  674. Aliexpress
  675. brakebills
  676. cremposting
  677. Kuwait
  678. brandonsanderson
  679. ACMilan
  680. TenseiSlime
  681. Isekai
  682. TheDarkTower
  683. BlackClover
  684. Ask_Politics
  685. redrising
  686. Mistborn
  687. magicbuilding
  688. deathnote
  689. KingkillerChronicle
  690. BrentWeeks
  691. 3dshacks
  692. 3DS
  693. HunterXHunter
  694. Stormlight_Archive
  695. fuckmoash
  696. manga
  697. VinlandSaga
  698. DrStone
  699. AskHistorians
  700. de
  701. EuropeMeta
  702. europe
  703. Twitch
  704. history
  705. SpiceandWolf
  706. hoi4
  707. ParadoxExtra
  708. Steel_Division
  709. Warthunder
  710. RPClipsGTA
  711. hoi4modding
  712. OldWorldBlues
  713. GrandTheftAutoV
  714. dayz
  715. metro
  716. logic
  717. squirrels
  718. NoStupidQuestions
  719. printSF
  720. whatstheword
  721. earrumblersassemble
  722. Coffee
  723. rickandmorty
  724. BrandNewSentence
  725. PoliticalVideo
  726. poker
  727. DAE
  728. AskMen
  729. AskWomen
  730. notliketheothergirls
  731. depression
  732. sixwordstories
  733. DougysDramatics
  734. TownofSalemgame
  735. omegle
  736. slaythespire
  737. infp
  738. haiku
  739. leagueoflegends
  740. bridgeporn
  741. KaynMains
  742. MandelaEffect
  743. ArcherFX
  744. gtaonline
  745. everyfuckingthread
  746. SummerReddit
  747. BuyItForLife
  748. dearwhitepeople
  749. HipHopImages
  750. FragileWhiteRedditor
  751. answers
  752. Scrubs
  753. spoiledboomermemes
  754. Negareddit
  755. CasualUK
  756. ExplainMyDownvotes
  757. Ijustwatched
  758. AskUK
  759. 90sHipHop
  760. HelpMeFind
  761. samsung
  762. STARonFox
  763. androidapps
  764. That70sshow
  765. ENLIGHTENEDCENTRISM
  766. EmpireTV
  767. iMPC
  768. twosentencepitch
  769. seinfeld
  770. KingOfTheHill
  771. malcolminthemiddle
  772. TheSimpsons
  773. juxtaposition
  774. AussieHipHop
  775. settlethisforme
  776. dropship
  777. OnMyBlock
  778. JustUnsubbed
  779. TheWalkingDeadGame
  780. PS4
  781. TTVreborn
  782. fujix
  783. tennis
  784. TrueOffMyChest
  785. AMA
  786. germany
  787. criterion
  788. exjw
  789. survivor
  790. LadyGaga
  791. acappella
  792. hiphopheads
  793. horror
  794. videography
  795. moviecritic
  796. arcadefire
  797. indiemovies
  798. horrorlit
  799. stephenking
  800. badroommates
  801. CatTraining
  802. relationships
  803. Dogtraining
  804. WhatsWrongWithYourCat
  805. Frugal_Jerk
  806. tifu
  807. glee
  808. DoesAnybodyElse
  809. PubTips
  810. Gifted
  811. Reincarnation
  812. therapy
  813. spirituality
  814. 23andme
  815. FondantHate
  816. mildly_ace
  817. PaintedWolves
  818. aaaaaaacccccccce
  819. thelongdark
  820. 1200isjerky
  821. DetroitBecomeHuman
  822. asexuality
  823. RBNMusic
  824. gachagaming
  825. EpicSeven
  826. inspirobot
  827. NotKenM
  828. GameSale
  829. Feic
  830. dragonblaze
  831. FinalBlade
  832. MemeTemplatesOfficial
  833. gamedev
  834. shittydarksouls
  835. rpg_gamers
  836. AccidentalComedy
  837. grandorder
  838. GOtrades
  839. FGOmemes
  840. reddeadredemption2
  841. MemeEconomy
  842. questions
  843. IntegralFactor
  844. HIMYM
  845. DragaliaLost
  846. community
  847. PickAnAndroidForMe
  848. comedynecromancy
  849. AndroidGaming
  850. EmulationOnAndroid
  851. darksouls3
  852. Granblue_en
  853. Kanye
  854. Magisk
  855. ageofmagic
  856. soanamnesis
  857. JusticeReturned
  858. askpsychology
  859. Bad_Cop_No_Donut
  860. JimSterling
  861. openttd
  862. RocketLeagueFriends
  863. AntiJokes
  864. fifthworldproblems
  865. RocketLeague
  866. buildapcforme
  867. Animesuggest
  868. asoiaf
  869. DnDBehindTheScreen
  870. whitepeoplewritingPOC
  871. menwritingwomen
  872. holdmyfeedingtube
  873. heroscape
  874. plothelp
  875. HireAnEditor
  876. EverythingAvian
  877. NLSSCircleJerk
  878. northernlion
  879. darksouls
  880. classicalmusic
  881. TooAfraidToAsk
  882. HollowKnight
  883. Psychonaut
  884. medicine
  885. metacanada
  886. furry
  887. Stellaris
  888. atheism
  889. mildlyhalo
  890. islam
  891. AmItheAsshole
  892. halo
  893. guns
  894. darkestdungeon
  895. dwarffortress
  896. ContagiousLaughter
  897. Markiplier
  898. SS13
  899. Judaism
  900. RimWorld
  901. college
  902. bindingofisaac
  903. Sikh
  904. ExplainLikeImCalvin
  905. hearthstone
  906. Rainbow6
  907. videogamedunkey
  908. etymology
  909. DrewDurnil
  910. StarWarsBattlefront
  911. HollowKnightMemes
  912. SmashBrosUltimate
  913. shittysuperpowers
  914. ExplainAFilmPlotBadly
  915. Borderlands
  916. customhearthstone
  917. FalloutMiami
  918. ShittySpaceXIdeas
  919. asktransgender
  920. Touhou_NSFW
  921. SRSBusiness
  922. roguelikes
  923. SciFiAndFantasy
  924. WouldYouRather
  925. godtiersuperpowers
  926. cookingforbeginners
  927. danganronpa
  928. BoostForReddit
  929. oneplus
  930. ShingekiNoKyojin
  931. attackontitan
  932. learnprogramming
  933. aggies
  934. SiegeAcademy
  935. assassinscreed
  936. clep
  937. HomeworkHelp
  938. BoneAppleTea
  939. mythology
  940. woooosh
  941. Cringetopia
  942. Panera
  943. Dreadfort
  944. fatpeoplestories
  945. BrazillianSigma
  946. dankmeme
  947. Romania
  948. leanpeoplecirclejerk
  949. NoFap
  950. vegancirclejerk
  951. newsubreddits
  952. leangains
  953. footballmanagergames
  954. toastme
  955. RoastMe
  956. IAmA
  957. Needafriend
  958. EducativeVideos
  959. Rateme
  960. truerateme
  961. WatchPeopleDieInside
  962. rant
  963. RedditInReddit
  964. intrusivethoughts
  965. raimimemes
  966. comedyhomicide
  967. springfieldMO
  968. wildlifephotography
  969. GamersRiseUp
  970. lotrmemes
  971. spiderversedailymemes
  972. RedditWritesSeinfeld
  973. ChoosingBeggars
  974. quotes
  975. PokemonLetsGo
  976. BreakUps
  977. DrakeAndJoshTwitter
  978. chadsriseup
  979. Lithuaniakittens
  980. greentext
  981. Pathfinder_RPG
  982. Grimdawn
  983. rational
  984. pathbrewer
  985. Superbowl
  986. dndnext
  987. starbrewer
  988. octopathtraveler
  989. WizardofLegend
  990. EternalCardGame
  991. duelyst
  992. pokemongo
  993. gamegrumps
  994. stevenuniverse
  995. VentGrumps
  996. bipolar
  997. HotPeppers
  998. wicked_edge
  999. imbringingchili
  1000. badMovies
  1001. mallninjashit
  1002. Buttcoin
  1003. thesuperbowl
  1004. Atlanta
  1005. savageworlds
  1006. fermentation
  1007. seriouseats
  1008. Truckers
  1009. cafe
  1010. iamveryculinary
  1011. beer
  1012. tea
  1013. reptiles
  1014. The_Wilders
  1015. WeWantPlates
  1016. badwomensanatomy
  1017. tacos
  1018. whatsthisbird
  1019. EnoughLibertarianSpam
  1020. Pizza
  1021. Skookum
  1022. badparking
  1023. GWAR
  1024. whiteeurope
  1025. castiron
  1026. drunkencookery
  1027. DixieFood
  1028. heavyvinyl
  1029. knifeclub
  1030. ShitWehraboosSay
  1031. subnautica
  1032. legaladviceofftopic
  1033. Metal
  1034. sewing
  1035. cade
  1036. marijuanaenthusiasts
  1037. snakes
  1038. SubredditDrama
  1039. TreesSuckingOnThings
  1040. messwithtexas
  1041. conspiratard
  1042. PanicHistory
  1043. SteamController
  1044. GunsAreCool
  1045. witcher
  1046. Sovereigncitizen
  1047. stopdrinking
  1048. zen
  1049. PixelArt
  1050. shittyaskreddit
  1051. ProgrammerHumor
  1052. IGN
  1053. earwax
  1054. Anxiety
  1055. BSG
  1056. IWantToLearn
  1057. pcgaming
  1058. SmarterEveryDay
  1059. youtube
  1060. Spanish
  1061. Bloodstained
  1062. Cosmere
  1063. DolphinEmulator
  1064. Bandnames
  1065. death
  1066. dragonquest
  1067. UrbanFantasyWriters
  1068. urbanfantasy
  1069. blurb_help
  1070. weirdwest
  1071. blogs
  1072. redditisfun
  1073. Virginia
  1074. DnDGreentext
  1075. lfg
  1076. switcharoo
  1077. securityguards
  1078. AskLE
  1079. Games
  1080. cosplay
  1081. newtothenavy
  1082. lmhc
  1083. CCW
  1084. FreeKarma4You
  1085. askphilosophy
  1086. writingCollaborations
  1087. writingColaborations
  1088. lonely
  1089. pinkfloyd
  1090. StanleyKubrick
  1091. coaxedintoasnafu
  1092. EDAnonymous
  1093. EDanonymemes
  1094. japanesemusic
  1095. rupaulsdragrace
  1096. TwoXChromosomes
  1097. penpals
  1098. selfimprovement
  1099. amiugly
  1100. TheReportOfTheWeek
  1101. TomAndJerryMemes
  1102. TMJ
  1103. starwarscomics
  1104. Hull
  1105. dyspraxia
  1106. InstLife
  1107. StoriesByGrapefruit
  1108. SpotifyPlaylists
  1109. CatsNamedToothless
  1110. ACRebellion
  1111. FuturamaWOTgame
  1112. Tenagra
  1113. gameofthrones
  1114. reddit.com
  1115. APStudents
  1116. USC
  1117. uofm
  1118. BelmontUniversity
  1119. nyu
  1120. Theatre
  1121. acting
  1122. StopGaming
  1123. ucla
  1124. musicals
  1125. UniversityOfMichigan
  1126. Northwestern
  1127. playwriting
  1128. singing
  1129. uchicago
  1130. scifi
  1131. Gundam
  1132. fuckthesepeople
  1133. StardustCrusaders
  1134. fireemblem
  1135. ShitPostCrusaders
  1136. misleadingthumbnails
  1137. StrangerThings
  1138. sbubby
  1139. DMAcademy
  1140. DarkMatter
  1141. rpg
  1142. ihavesex
  1143. chernobyl
  1144. projecteternity
  1145. startrek
  1146. MisreadSprites
  1147. Sekiro
  1148. whatsapp
  1149. SkyrimTogether
  1150. nosleep
  1151. southafrica
  1152. u_sarcasonomicon
  1153. NoSleepOOC
  1154. ComedicNosleep
  1155. tvtropes
  1156. nova
  1157. ancient_technologies
  1158. AlternateHistory
  1159. netsecstudents
  1160. Escritoire
  1161. security
  1162. FlavorsOfBleach
  1163. reddeadredemption
  1164. KSU
  1165. whowouldwin
  1166. mrbungle
  1167. gwent
  1168. GalaxyS8
  1169. asoiafcirclejerk
  1170. darksouls4
  1171. thalassophobia
  1172. OTMemes
  1173. bloodborne
  1174. writerchat
  1175. 1Cat1Chair1YrApart
  1176. writingVOID
  1177. dogsong
  1178. readerchat
  1179. Megaten
  1180. kpophelp
  1181. tolkienfans
  1182. lotr
  1183. godot
  1184. WorldOfYs
  1185. Diablo
  1186. Metroid
  1187. atlantis
  1188. Daggerfall
  1189. zelda
  1190. Xenoblade_Chronicles
  1191. CamilleMains
  1192. gamemaker
  1193. Maya
  1194. ProPresenter
  1195. ZBrush
  1196. techtheatre
  1197. valve
  1198. tomorrow
  1199. australia
  1200. Geelong
  1201. WarhammerChampions
  1202. dosgaming
  1203. russian
  1204. Hematology
  1205. rstats
  1206. RStudio
  1207. ukpolitics
  1208. lancasteruni
  1209. electronicmusic
  1210. EDM
  1211. TruAnarchy
  1212. arabfunny
  1213. dataisugly
  1214. AnarchyChess
  1215. shittyaskscience
  1216. Anarchism
  1217. ShitRedditSays
  1218. outside
  1219. SelfAwarewolves
  1220. InternationalDev
  1221. math
  1222. trap
  1223. LegalAdviceUK
  1224. ChristianUniversalism
  1225. bestof
  1226. rtms
  1227. transgendercirclejerk
  1228. WormMemes
  1229. eroticauthors
  1230. firstworldproblems
  1231. YouShouldKnow
  1232. fffffffuuuuuuuuuuuu
  1233. Mommit
  1234. WritingHub
  1235. DarK
  1236. SimplePrompts
  1237. unintentionalASMR
  1238. typewriters
  1239. MOONMOON_OW
  1240. FocusRS
  1241. fireTV
  1242. MeatlessMealPrep
  1243. recipes
  1244. pittsburgh
  1245. vegetarian
  1246. 1200isplenty
  1247. motorcycles
  1248. TuckedInPuppies
  1249. bingingwithbabish
  1250. circumcision
  1251. zizek
  1252. ifyoulikeblank
  1253. animecirclejerk
  1254. boltsoccer
  1255. iosgaming
  1256. TheArmory
  1257. TestingRange
  1258. destroyautomodskarma
  1259. Stjordal
  1260. tabletennis
  1261. PokemonPlaza
  1262. Nimiq
  1263. robotics
  1264. TheLabourPartyUK
  1265. writingdaily
  1266. violinist
  1267. simbot
  1268. Sitar
  1269. AppleWatchFitness
  1270. dubai
  1271. MySingingMonsters
  1272. biblereading
  1273. Dateline48Hours
  1274. CarAV
  1275. BuffHydra
  1276. hyptoheicla
  1277. Recruitment
  1278. IHopeYouDieSlowly
  1279. flappygolf
  1280. mechanicalpencils
  1281. BoundaryBreak
  1282. Nokia
  1283. feckingbirds
  1284. gramps
  1285. eliaszjm
  1286. CryptoWorth
  1287. Agario
  1288. WorldWar3Community
  1289. plebflair
  1290. Netflix_jp
  1291. GarlicMarket
  1292. Mydaily3
  1293. automodspamtest
  1294. Lexus
  1295. running
  1296. DraftEPL
  1297. youtrip
  1298. Frontend
  1299. DebateEvolution
  1300. businessanalysis
  1301. survivalheroes
  1302. PRAllStars
  1303. Cruise
  1304. ethdev
  1305. MrLove
  1306. CryptoCurrency
  1307. LesbianActually
  1308. georgeharrison
  1309. NavCoin
  1310. solvecare
  1311. IsTodayOppositeDay
  1312. LitecoinCashMarkets
  1313. MemriTVmemes
  1314. SkrillexDev
  1315. pesmobile
  1316. RMTK
  1317. DisneyPlus
  1318. disciplineplanned
  1319. AgeofMan
  1320. corporatesim
  1321. afkarena
  1322. TestSubReddit_
  1323. Sovreignty
  1324. Droplet_coin
  1325. datasets
  1326. PokeMoonSun
  1327. mogeko
  1328. pan
  1329. CODZombies
  1330. Vanced
  1331. collapse
  1332. yumenikki
  1333. Lightbulb
  1334. deadcells
  1335. HotlineMiami
  1336. CPUCS
  1337. scoobandshag
  1338. Skullgirls
  1339. TrueDeemo
  1340. katawashoujo
  1341. dontstarve
  1342. killingfloor
  1343. theydidthemath
  1344. swordfighting
  1345. Fighters
  1346. DDLC
  1347. KingdomHearts
  1348. FATErpg
  1349. morbidquestions
  1350. BatmanArkham
  1351. TheStoryExperiment
  1352. RepostHallOfFame
  1353. redditrequest
  1354. HiTopFilmsCirclejerk
  1355. pcmasterrace
  1356. DC_Cinematic
  1357. Beatmatch
  1358. DJs
  1359. ContemporaryArt
  1360. TheAdventureZone
  1361. whatcarshouldIbuy
  1362. HildaTheSeries
  1363. badhistory
  1364. adventuretime
  1365. classicalfencing
  1366. Fencing
  1367. Portland
  1368. TrumpNicknames
  1369. DnD
  1370. webfiction
  1371. webserials
  1372. massachusetts
  1373. CapeCod
  1374. freelanceWriters
  1375. travel
  1376. uofi
  1377. pagan
  1378. SuicideWatch
  1379. renfaire
  1380. TameImpala
  1381. Bastille
  1382. neuroscience
  1383. happycowgifs
  1384. the1975
  1385. lorde
  1386. GodofWar
  1387. wolfalice
  1388. DavidBowie
  1389. formula1
  1390. WriteWithMe
  1391. creepypasta
  1392. scarystories
  1393. batman
  1394. Business_Ideas
  1395. StoryWriting
  1396. FanFiction
  1397. camphalfblood
  1398. dcuonline
  1399. venting
  1400. Spiderman
  1401. transformers
  1402. ZooTycoon
  1403. rct
  1404. comicbooks
  1405. Xcom
  1406. dragonage
  1407. Knife_Swap
  1408. EDCexchange
  1409. thedivision
  1410. AskPhysics
  1411. ReadingFantasy
  1412. TheDivision_LFG
  1413. AnthemTheGame
  1414. fnki
  1415. PinkFloydCircleJerk
  1416. Muse
  1417. LetsTalkMusic
  1418. RWBY
  1419. beatlescirclejerk
  1420. RogerWaters
  1421. beatles
  1422. underratedsongs
  1423. u_SiggetSpagget
  1424. Slazo
  1425. shortscarystories
  1426. Infuriating
  1427. Songwriting
  1428. tf2memes
  1429. pyrocynical
  1430. UnexpectedMulaney
  1431. findasubreddit
  1432. DunderMifflin
  1433. lionking
  1434. iamveryrandom
  1435. extremelyinfuriating
  1436. aves
  1437. MonsterProm
  1438. piercing
  1439. deaf
  1440. GotG
  1441. drawing
  1442. flicks
  1443. makemychoice
  1444. DreamInterpretation
  1445. genesysrpg
  1446. JaneTheVirginCW
  1447. whatsthatbook
  1448. crazyexgirlfriend
  1449. NameThatSong
  1450. happy
  1451. HPSlashFic
  1452. HPfanfiction
  1453. ACPocketCamp
  1454. harrypotter
  1455. brooklynninenine
  1456. iZombie
  1457. 6teenfans
  1458. oban
  1459. shield
  1460. TellTaleBatmanSeries
  1461. glasgow
  1462. BSL
  1463. PersonOfInterest
  1464. telltale
  1465. Scotland
  1466. ArtCrit
  1467. comics
  1468. OutOfTheLoop
  1469. shittyreactiongifs
  1470. GalaxyNote8
  1471. GCSE
  1472. bulgaria
  1473. theydidntdothemath
  1474. ShouldIbuythisgame
  1475. Steam
  1476. RealmDefenseTD
  1477. EmpireWarriorsTD
  1478. popheads
  1479. titanfolk
  1480. Journalism
  1481. bobdylan
  1482. snes
  1483. BernieSanders
  1484. unitedkingdom
  1485. martialarts
  1486. sex
  1487. mountandblade
  1488. BDSMAdvice
  1489. eating_disorders
  1490. autism
  1491. selfharm
  1492. trans
  1493. loseitnarwhals
  1494. ftm
  1495. Writing_exercises
  1496. confession
  1497. HardToSwallowPills
  1498. confessions
  1499. fiction
  1500. u_SentientScribble
  1501. Writeresearch
  1502. ancientgreece
  1503. ArmsandArmor
  1504. HistoryWhatIf
  1505. JapaneseHistory
  1506. ChineseHistory
  1507. mushroomkingdom
  1508. AskScienceDiscussion
  1509. PS4Deals
  1510. biglittlelies
  1511. GreatXboxDeals
  1512. cyberpunkgame
  1513. HomeNetworking
  1514. ancestors
  1515. yakuzagames
  1516. RebelGalaxy
  1517. cemu
  1518. DQBuilders
  1519. YAlit
  1520. RatchetAndClank
  1521. YAwriters
  1522. wacom
  1523. visualnovels
  1524. digitalnomad
  1525. solotravel
  1526. laptops
  1527. sleep
  1528. MassEffectAndromeda
  1529. TinyHouses
  1530. Seaofthieves
  1531. tattoos
  1532. nhaa
  1533. Dreamtheater
  1534. pathofexile
  1535. Wildfire
  1536. Firefighting
  1537. help
  1538. legal
  1539. zombies
  1540. Achievement_Hunter
  1541. audiobooks
  1542. VoiceActing
  1543. ACX
  1544. UnearthedArcana
  1545. DungeonWorld
  1546. DnDHomebrew
  1547. DnD5CommunityRanger
  1548. BehaviorAnalysis
  1549. starwarsspeculation
  1550. saltierthancrait
  1551. StarWarsLeaks
  1552. gurps
  1553. auxlangs
  1554. socialjustice101
  1555. AskARussian
  1556. FalseFriends
  1557. chvrches
  1558. Mapona
  1559. UniversalScript
  1560. StateofRiodeJaneiro
  1561. tatu
  1562. nudism
  1563. languagelearning
  1564. ForwardsfromCortez
  1565. bisexual
  1566. brasil
  1567. AskAnAmerican
  1568. geographynow
  1569. civilengineering
  1570. conlangs
  1571. mapmaking
  1572. AskEurope
  1573. genderfluid
  1574. neography
  1575. BoardwalkEmpire
  1576. PlasticSurgery
  1577. golf
  1578. breakingbad
  1579. CreditCards
  1580. DanLeBatardShow
  1581. comedy
  1582. USCivilWar
  1583. psychedelicrock
  1584. technology
  1585. musicindustry
  1586. mlb
  1587. madmen
  1588. LawSchool
  1589. thesopranos
  1590. WeAreTheMusicMakers
  1591. melbourne
  1592. KindVoice
  1593. podcasts
  1594. itookapicture
  1595. photocritique
  1596. matlab
  1597. Android
  1598. gaybros
  1599. askgaybros
  1600. TheLastAirbender
  1601. bicycling
  1602. Frisson
  1603. AskCulinary
  1604. AndroidQuestions
  1605. FittitBuddy
  1606. gainit
  1607. lolgrindr
  1608. bodyweightfitness
  1609. Malazan
  1610. gaymers
  1611. outlining
  1612. AskGameMasters
  1613. Standup
  1614. StandUpWorkshop
  1615. BurningWheel
  1616. osr
  1617. TheAffair
  1618. bladesinthedark
  1619. RPGdesign
  1620. bookclub
  1621. CrusaderKings
  1622. skyrim
  1623. DarkSouls2
  1624. Fallout
  1625. CPTSD
  1626. Kirby
  1627. falloutlore
  1628. Warmwetpussy
  1629. DarksoulsLore
  1630. Vocaloid
  1631. aspergers
  1632. Bakugan
  1633. rpghorrorstories
  1634. FinalFantasy
  1635. justbeginning
  1636. ExplainBothSides
  1637. ComedyNecrophilia
  1638. masseffect
  1639. nier
  1640. fictionalpsychology
  1641. yugiohFM
  1642. FalloutMods
  1643. shamelessplug
  1644. newreddits
  1645. YuGiOhMemes
  1646. Glitch_in_the_Matrix
  1647. OCD
  1648. thelastofus
  1649. supergirlTV
  1650. medical
  1651. USPS
  1652. Autobody
  1653. dogs
  1654. lookatmydog
  1655. TheSilphRoad
  1656. crafting
  1657. service_dogs
  1658. Plumbing
  1659. electricians
  1660. personalfinance
  1661. glasses
  1662. disability
  1663. StraightTalk
  1664. providence
  1665. Baking
  1666. pestcontrol
  1667. AskHR
  1668. PCSX2
  1669. MovieSuggestions
  1670. verizon
  1671. hoarding
  1672. jobs
  1673. frontierfios
  1674. interiordecorating
  1675. amateurradio
  1676. HomeImprovement
  1677. cars
  1678. Insurance
  1679. BlueStacks
  1680. resumes
  1681. Bankruptcy
  1682. homeless
  1683. euphoria
  1684. MovieDetails
  1685. spreadsheets
  1686. VisitingIceland
  1687. EliteDangerous
  1688. askTO
  1689. OnePiece
  1690. Polish
  1691. duolingo
  1692. sony
  1693. tech
  1694. playstation
  1695. AskWomenOver30
  1696. AskMechanics
  1697. alberta
  1698. hungary
  1699. Guildwars2
  1700. HTML
  1701. FrancaisCanadien
  1702. VideoEditing
  1703. apple
  1704. Vive
  1705. TowerofGod
  1706. uAlberta
  1707. WaitingForATrain
  1708. Edmonton
  1709. WoodenPotatoes
  1710. weightwatchers
  1711. Overwatch
  1712. VRGaming
  1713. ArtistLounge
  1714. skyrimvr
  1715. Blogging
  1716. AskPhotography
  1717. Sneakers
  1718. Entrepreneur
  1719. SideProject
  1720. IndieDev
  1721. Accordion
  1722. povertyfinance
  1723. Watchexchange
  1724. javascript
  1725. fintech
  1726. work
  1727. papermoney
  1728. streetwear
  1729. portugal
  1730. BostonBruins
  1731. bettafish
  1732. mustdashe
  1733. indiegames
  1734. stocks
  1735. NewportFolkFestival
  1736. Saxophonics
  1737. gamedesign
  1738. Blind
  1739. jewelry
  1740. LatvianJokes
  1741. KindleFreebies
  1742. reviewcircle
  1743. pens
  1744. design_critiques
  1745. vinyl
  1746. knives
  1747. redditdev
  1748. css
  1749. skiing
  1750. gamejolt
  1751. IndieGaming
  1752. ShinyPokemon
  1753. DataHoarder
  1754. SomeSayImADreamer
  1755. GuessTheMovie
  1756. BokuNoHeroAcademia
  1757. fatestaynight
  1758. DiscordRP
  1759. discordroleplay
  1760. dating_advice
  1761. dating
  1762. seduction
  1763. galaxynote10
  1764. patreon
  1765. 13ReasonsWhy
  1766. Endgame
  1767. Watchmen
  1768. AskComicbooks
  1769. AskLGBT
  1770. theumbrellaacademy
  1771. RPI
  1772. Maplestory
  1773. SwitchHaxing
  1774. swordartonline
  1775. summonerschool
  1776. datealive
  1777. AzureLane
  1778. RenektonMains
  1779. fairytail
  1780. poshmark
  1781. SCP
  1782. boburnham
  1783. forhonor
  1784. pumparum
  1785. SummonSign
  1786. LovecraftianWriting
  1787. libraryofshadows
  1788. huntersbell
  1789. youngpeopleyoutube
  1790. touhou
  1791. ProjectMonika
  1792. solarpunk
  1793. raining
  1794. walmart
  1795. FindMeADistro
  1796. redheads
  1797. RedheadGifs
  1798. TalesFromRetail
  1799. retrobattlestations
  1800. heinlein
  1801. linuxquestions
  1802. Breath_of_the_Wild
  1803. kdenlive
  1804. xfce
  1805. 3Dprinting
  1806. GIMP
  1807. Fedora
  1808. IndianFood
  1809. VLC
  1810. Telegram
  1811. linux4noobs
  1812. linuxmasterrace
  1813. Ubuntu
  1814. KitchenConfidential
  1815. linux
  1816. snapchat
  1817. LightNovels
  1818. 2007scape
  1819. LearnJapanese
  1820. MobiusFF
  1821. paypal
  1822. mobilelegends
  1823. arenaofvalor
  1824. MobileLegendsGame
  1825. OnmyojiArena
  1826. vainglorygame
  1827. musicsuggestions
  1828. starbucks
  1829. Inktober
  1830. Warframe
  1831. warframeclanrecruit
  1832. dishonored
  1833. elderscrollslegends
  1834. Blizzard
  1835. Firewatch
  1836. Smite
  1837. Tribes
  1838. RecruitCS
  1839. Shen
  1840. furry_irl
  1841. socialskills
  1842. Beastars
  1843. depression_help
  1844. pdX1
  1845. Got7
  1846. TyrannyGame
  1847. PuzzleAndDragons
  1848. ADHD
  1849. mentalhealth
  1850. xbox360
  1851. entitledparents
  1852. u_MarchingKestrel001
  1853. Parenting
  1854. Paranormal
  1855. occult
  1856. DankMemesFromSite19
  1857. anime_irl
  1858. AceAttorney
  1859. GalaxyNote9
  1860. sciencefiction
  1861. mealtimevideos
  1862. Chiraqology
  1863. trippieredd
  1864. KendrickLamar
  1865. brockhampton
  1866. VietNam
  1867. MacMiller
  1868. KidCudi
  1869. OzzyOsbourne
  1870. RnBHeads
  1871. asaprocky
  1872. Jcole
  1873. rap
  1874. readyplayerone
  1875. vcu
  1876. chicago
  1877. liluzivert
  1878. Fidlar
  1879. rock
  1880. LSD
  1881. shrooms
  1882. Barber
  1883. rocksmith
  1884. makinghiphop
  1885. Green
  1886. tytonreddit
  1887. democrats
  1888. LibertarianSocialism
  1889. MNLeft
  1890. Grandchase
  1891. mushroomID
  1892. Permaculture
  1893. conspiracy
  1894. MushroomGrowers
  1895. treedibles
  1896. toolporn
  1897. ireland
  1898. Epilepsy
  1899. MMJ
  1900. worststory
  1901. canoecamping
  1902. Xiaomi
  1903. mac
  1904. macbookrepair
  1905. MechanicAdvice
  1906. Horticulture
  1907. DIY
  1908. Morocco
  1909. MotherEarth
  1910. PostCollapse
  1911. rawdenim
  1912. NoTillGrowery
  1913. fitbit
  1914. cannabiscultivation
  1915. data
  1916. humanure
  1917. IsItBullshit
  1918. VanLife
  1919. CampingGear
  1920. Doesthisexist
  1921. vandwellers
  1922. dragonballfighterz
  1923. Vagante
  1924. RivalsOfAether
  1925. CubeWorld
  1926. titanfall
  1927. Vaporwave_wallpapers
  1928. BattleRite
  1929. gamedetectives
  1930. hyperlightdrifter
  1931. creepy
  1932. OverwatchLFT
  1933. RandomActsofCards
  1934. headphones
  1935. russia
  1936. france
  1937. r4r
  1938. Denmark
  1939. PipeTobacco
  1940. Legoyoda
  1941. lifeisstrange
  1942. thepromisedneverland
  1943. catbellies
  1944. ChildrenFallingOver
  1945. RespectTheHyphen
  1946. deutschememes
  1947. titantiersuperpowers
  1948. shittynosleep
  1949. AAAAAAAAAAAAAAAAA
  1950. fivenightsatfreddys
  1951. crappyoffbrands
  1952. LodedDiper
  1953. testing4756
  1954. pettyrevenge
  1955. battlefront
  1956. RandomThoughts
  1957. Persona5
  1958. DevilMayCry
  1959. gay
  1960. InsanePeopleQuora
  1961. ARG
  1962. Deltarune
  1963. roommates
  1964. FindASub
  1965. vita
  1966. explainlikeIAmA
  1967. Megaman
  1968. XenobladeChronicles2
  1969. Acceleracers
  1970. askmusicians
  1971. ask
  1972. valkyria
  1973. recordingmusic
  1974. AskGamers
  1975. screenshots
  1976. creepyencounters
  1977. Hair
  1978. problems
  1979. airsoft
  1980. HappyWars
  1981. wartrade
  1982. Warframetrading
  1983. sweatcoin
  1984. centurylink
  1985. Terraria
  1986. ForFashion
  1987. ChineseLanguage
  1988. LanguageExchange
  1989. slav
  1990. slavs_squatting
  1991. ForHonorRomans
  1992. Norse
  1993. learnIcelandic
  1994. ForHonorSamurai
  1995. pantsinyourpants
  1996. Aphantasia
  1997. medievaldoctor
  1998. PenmanshipPorn
  1999. crochet
  2000. curlyhair
submitted by CaesarNaples2 to copypastapublishin

Applied AI Course - YouTube
February 4, 2021 - question on Quora - YouTube
How to Make a Questions & Answers, Q&A, Forum Website like ...
Quora Question Similarity Prediction Algorithm using ...
Answering most asked Data Science Quora questions - Part 1 ...
Abhishek Thakur - Is That a Duplicate Quora Question ...
3 Types of Data Science Interview Questions - YouTube
quoras: A Python API for Quora Data Collection to Increase ...
How to download iris dataset from UCI dataset and ...

Our first dataset is related to the problem of identifying duplicate questions. An important product principle for Quora is that there should be a single question page for each logically distinct question. As a simple example, the queries “What is the most populous state in the USA?” and “Which state in the United States has the most people?” should not exist separately on Quora because the intent behind both is identical. Having a canonical page for each logically distinct query ...

Tackling the Quora Questions dataset — Richard Townsend, Mar 3, 2017:
Semantic similarity is basically deciding how similar two documents are to each other, and assessing it is quite useful for things like identifying duplicate posts, semi-supervised labelling, whether two news articles are talking about the same thing, and lots of other applications.

Quora Question Pairs consists of over 400k question pairs based on actual quora.com questions. Each pair contains a binary value indicating whether the two questions are paraphrases or not. The training-dev-test splits for this dataset are provided in the source below. (Source: Semantic Sentence Matching with Densely-connected Recurrent and Co-attentive Information.)

The Quora dataset is composed of questions posed on the Quora question-answering site. It is the only dataset which provides sentence-level and word-level answers at the same time. Moreover, the questions in the dataset are authentic, which is much more realistic for question-answering systems. We test the performance of a state-of-the-art question-answering system on the dataset and compare it with human performance to establish an upper bound.

We will be using the Quora Question Pairs dataset (Quora Question Pairs: Detecting Text Similarity using Siamese Networks — Aadit Kapoor, Aug 17, 2020).

This dataset includes all English-language questions and answers within the Quora Topic “Cars & Automobiles” created in 2020. The set encompasses more than 95,000 questions and 280,000 answers, with detailed metadata to help you surface the most relevant and authoritative discussions from Quora’s expert community. The dataset is updated monthly.
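The excerpts above frame duplicate detection as a semantic-similarity problem. As a rough illustrative baseline — plain word overlap, nothing like the Siamese networks mentioned above, with example pairs partly invented for contrast — a Jaccard score over token sets can be sketched in a few lines:

```python
# Word-overlap (Jaccard) similarity between two questions -- a toy
# baseline sketch, not Quora's production approach.
import re

def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard_similarity(q1, q2):
    """|intersection| / |union| of the two token sets, in [0, 1]."""
    a, b = tokens(q1), tokens(q2)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# The duplicate pair quoted from Quora's announcement, plus an
# unrelated question for contrast.
pair_dup = ("What is the most populous state in the USA?",
            "Which state in the United States has the most people?")
pair_diff = ("What is the most populous state in the USA?",
             "How do I learn to play the guitar?")

print(jaccard_similarity(*pair_dup))   # noticeably higher ...
print(jaccard_similarity(*pair_diff))  # ... than this
```

Note how low even the duplicate pair scores (around 0.3): the two questions share intent but little vocabulary, which is exactly why the work excerpted here reaches for learned representations such as Siamese networks rather than surface overlap.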
The dataset first appeared in the Kaggle competition Quora Question Pairs and consists of approximately 400,000 pairs of questions along with a column indicating if the question pair is considered a duplicate. After you complete this project, you can read about Quora’s approach to this problem in this blog post. Good luck!
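For reference, the Kaggle competition scored submissions with binary log loss over the predicted duplicate probabilities. A self-contained version for checking a model's output (the labels and probabilities below are made-up toy values):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy, the metric used to rank Kaggle submissions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Toy labels and predicted duplicate probabilities
loss = log_loss([1, 0, 1], [0.9, 0.2, 0.6])
```

Lower is better; confident wrong predictions are penalized heavily, so clipping the probabilities away from exactly 0 and 1 matters in practice.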


