1. Describe how the techniques described in ICS207 could be used to spot students who cheat on take home tests. - PowerPoint PPT Presentation

About This Presentation
Title:

1. Describe how the techniques described in ICS207 could be used to spot students who cheat on take home tests.

Description:

... whether it is safe to take the drugs Viagra and Phentermine at the same time. ... 'Viagra Phentermine' returns a billion online pharmacies, adding Safety, ... – PowerPoint PPT presentation

Number of Views:174
Avg rating:3.0/5.0
Slides: 22
Provided by: michaelj54
Learn more at: https://ics.uci.edu
Category:

less

Transcript and Presenter's Notes

Title: 1. Describe how the techniques described in ICS207 could be used to spot students who cheat on take home tests.


1
  • 1. Describe how the techniques described in
    ICS207 could be used to spot students who cheat
    on take home tests.
  • What enhancements would be necessary if instead
    of copying word for word, students replaced some
    words with synonyms?
  • Discuss the precision/recall trade-off in this
    case.
  • If the penalty for copying were severe e.g.,
    expulsion from the university vs. get no credit
    for copied work, would that effect how the system
    might work?

2
  • Store each answer to each question. (To improve
    precision, create a separate index per question).
    Compute TF-IDF weights of terms
  • Treat each answer as query and see what answers
    are most similar using cosine similarity.
  • Return the N pairs most similar, or within a
    certain threshold.
  • (or cluster the documents)
  • Have the professor or TA judge the results. High
    recall catches all cheaters but has many false
    alarms. High precision returns few false alarms
    but lets more cheaters escape punishment.
  • Severe punishment may decrease the probability of
    cheating but also increases the importance of not
    having a false alarm.
  • Consider using LSA to deal with synonyms, or
    thesaurus, but be careful. Consider using
    phrases of 2-3 words.

3
  • 2 .Give an example of a stemming algorithm.
  • Why are such techniques used as part of IR
    systems?
  • Describe how the technique is included as part of
    the indexing, query interpretation and matching
    processes.
  • Porter stemmer finds root forms of terms
  • (.AEIOU.)ED -gt /1 // removed past tense ed
    ending
  • Stem documents before indexing
  • Stem queries before retrieval
  • Matching No change

4
  • 3. Why is the simple frequency of a index's
    occurrence in a document insufficient as a
    predictor of its utility?
  • Give an example of a particular index and a
    particular corpus that demonstrates this.
  • Give the formula for one term weighting scheme,
    and identify these two factors in the formula.

5
  • Use alone, term frequency favors common words in
    the corpus.
  • Doesnt take into account the discriminability of
    terms
  • Need a way to balance term frequency in the
    document with frequency in all documents such as
    term frequency divided by document frequency
  • wkdfkd / (Dk/NDoc)
  • where
  • wkd the weight for a keyword (k) in a document
    (d)
  • fkd the frequency of a keyword (k) in a
    document (d)
  • NDoc the total number of documents in the
    corpus
  • Dk the number of documents containing the
    keyword (k).
  • Probably should normailize for document length
    and use log of Document frequency
  • Not Zipf isnt document specific, so doesnt
    really apply here. Any variant of TF-IDF would be
    fine

6
Superheated air almost certainly seeped through a
breach in space shuttle Columbia's left wing and
possibly its wheel compartment during the craft's
fiery descent from orbit, an investigative board
has found. Investigators also disclosed Thursday
that Columbia had begun to have trouble,
including spikes in temperature readings in its
left wing, while it was well off the California
coast, much farther west than initially thought.
Sensors had noticed an unusual heat buildup of
about 30 degrees inside Columbia's left wheel
well minutes before the spacecraft disintegrated
over Texas on Feb. 1. But the accident
investigating board said Thursday that heat
damage from a missing tile would not be
sufficient to cause the unusual temperature
increases. Instead, the board, making its first
significant conclusion in the case, determined
that the temperature spikes were caused by the
presence inside Columbia of plasma, or
superheated air caused by the craft's re-entry
into the Earth's atmosphere, with temperatures of
roughly 2,000 degrees. It said investigators
were studying where a breach might have occurred
to allow plasma to seep inside the wheel
compartment or elsewhere in Columbia's left wing.
"We won't speculate about what it means or how
important it is," said retired Adm. Harold
Gehman, who heads the investigating board. "We
will keep the process open. We won't get into a
guessing game at this time." The board did not
specify whether such a breach could be the result
of a structural tear in Columbia's aluminum frame
or a hole from debris striking the spacecraft.
The board also did not indicate when the breach
occurred during the shuttle's 16-day mission.
The board visited the Marshall Space Flight
Center in Huntsville, Ala., on Friday to spend
two days learning about work performed on the
shuttle there. Investigators previously have
focused on an unusually large chunk of foam
insulation that broke off Columbia's external
fuel tank on liftoff on Jan. 16. Video footage
showed it struck the shuttle's left wing,
including its toughened leading edge and the
thermal tiles covering the landing gear door.
Gehman also said in Huntsville that rigid foam
insulation that fell off the external tank on
liftoff was still a possible factor. The
announcement focused renewed attention on
possible catastrophic failures inside the wheel
compartment that may have contributed to the Feb.
1 breakup that killed seven astronauts.
Officials are not sure where a breach might have
opened in Columbia's skin, James Hartsfield, a
spokesman for the National Aeronautics and Space
Administration, said. But he said the leading
edge or elsewhere on the left wing, the fuselage
or the left landing gear door were prime
candidates. "Any of those could be potential
causes for the temperature change we saw,"
Hartsfield said. "They do not and have not
pinpointed any general location as to where that
plasma flow would have to originate." In an
unusual plea for assistance, NASA urged the
public to share with them any photographs or
videotapes of Columbia's descent from California
to eastern Texas. Some members of the public
already have handed over images, "but more
material will help the investigation of the
Columbia accident," the agency said. The board's
announcement didn't surprise those experts who
long have believed that a mysterious failure of
sensors within Columbia's left wing indicated
that super-hot plasma had penetrated the shuttle.
"I think there was a substantial hole in the
wing," said Steven P. Schneider, an associate
professor at Purdue University's Aerospace
Sciences Laboratory. "That would not be at all
surprising. All the sensors in the wing failed or
gave bad readings" by the time ground controllers
lost contact with Columbia, he said. The board
also dismissed suggestions that Columbia's left
landing gear was improperly lowered as it raced
through Earth's atmosphere at more than 12,000
miles per hour. NASA disclosed earlier Thursday
that a sensor indicated the gear was down just 26
seconds before Columbia's destruction. If
Columbia's gear was lowered at that speed -- and
in those searing temperatures as the shuttle
descended over Texas from about 40 miles up --
the heat and rushing air would have sheared off
Columbia's tires and led quickly to the
spacecraft's tumbling destruction, experts said.
Officials said they were confident that the
unusual sensor reading was wrong. Tires are
supposed to remain raised until the shuttle is
about 200 feet over the runway and flying at 345
mph. Two other sensors in the same wheel
compartment indicated the gear was still properly
raised, they said. While Columbia's piloting
computers began almost simultaneously firing
thrusters in an attempt to keep the shuttle's
wings level, officials said, a mysterious
disruption in the air flowing near the left wing
was not serious enough to suggest the shuttle's
gear might be down. The investigating board
concluded that its research "does not support the
scenario of an early deployment of the left
gear." NASA also confirmed that searchers near
Hemphill, Texas, about 140 miles northeast of
Houston, recovered what is believed to be one of
Columbia's radial tires. A spokesman was not
immediately sure which of the shuttle's tires was
found. The tire was blackened and had a massive
split across its tread, but it was not known
whether the tire was damaged aboard Columbia or
when it struck the ground.
7
TF-IDF columbia-0.422 shuttle-0.26 gear-0.235
wing-0.225 plasma-0.186 breach-0.169 wheel-0.156
compartment-0.142 sensor-0.139 tire-0.134
temperature-0.13 unusual-0.107 hartsfield-0.098
spacecraft-0.095 huntsville-0.095 superheat-0.089
spikes-0.087 Frequency 4 the 3 left 3 in
2 wing 2 wheel 2 well 2 thursday 2 that
2 its 2 heat 2 had 2 from 2 columbia's
2 board 2 an 2 a
8
  • 4. What is the clustering hypothesis?
  • How can this hypothesis be exploited by an IR
    system?
  • Van Rijsbergens cluster hypothesis, closely
    associated documents tend to be relevant to the
    same request. The cluster hypothesis suggests
    that if we first compute a measure of distance X
    among all the documents, we could cluster those
    documents that seem to be close to one another or
    seem to about about the same topic and retrieve
    them all in response to a query that matched any
    of them.
  • Thanks to Jon Froehlich

9
  • 5. How does a linear support vector machine
    differ from a perceptron in classification?
  • Why is this difference important?
  • Linear support vector machine find the optimal
    dividing line by maximizing the margin between
    the positive and negative examples
  • Perceptron finds any separation.
  • SVM has less variance with different training
    sets and initial conditions and the optimal
    solution is more often a better generalization
    than a random one.

10
(No Transcript)
11
  • 6. Define precision and recall precisely.
  • What foolproof method could an IR system use to
    achieve perfect recall?
  • What would you expect the precision of this
    method to be?
  • Retr a set of retrieved documents
  • Rel a set of relevant documents
  • N number of documents
  • Recall Retr ? Rel / Rel
  • Precision Retr ? Rel / Retr
  • Return all documents to get perfect recall.
  • Precision of this method is Rel/N which can be
    close to 0
  • Note define terms in formulae

12
  • 7. What is a reasonable heuristic for identifying
    the outstanding papers in an area of research
    with which you are unfamiliar?
  • The documents with the most citations

13
  • 8.Use any general web search system to determine
    whether it is safe to take the drugs Viagra and
    Phentermine at the same time. What query did you
    use at first? What was your final query? Why
    isnt a general-purpose search engine ideally
    suited for this problem? Would relevance
    feedback help here? Would relevance feedback be
    effective if it only used positive feedback in
    this case?
  • Lots of answers.
  • Viagra Phentermine returns a billion online
    pharmacies, adding Safety, interaction, doesnt
    help. Subtracting sale, -buy, -purchase, -order
    gets closer. Negative feedback could help rules
    these
  • Just using one drug helps because there isnt a
    interaction, and documents tend to list what is
    not safe instead of what is safe.
  • drug interactions get you to specialized site
    with a database instead of a search engine.
  • Its easier to find something that exists than to
    verify it doesnt with search engine technology.

14
  • 9 .What problems does Latent Semantic Analysis
    address? Would this approach be better suited
    for queries of an encyclopedia or a newspaper?
    Why?
  • It helps with synonyms such as car auto
    automobile rental and leasing and homonyms
    such as tire
  • Its expensive to index but cheap to retrieve so
    better for collections that dont change much,
    such as encyclopedia.

15
  • 11. Pick a topic that interests you. Find an
    authority for that topic. Find a hub. Argue that
    the site is a hub or any authority based on the
    number of links to or from that site. Hint in
    Google the query linkhttp//www.uci.edu will
    find web sites that link to http//www.uci.edu

16
Latent Semantic Analysis in google.com and
picked the top 7 links. I typed each of the 10
links in google.com and found out the number of
links to it. Then I opened the 7 links and
counted the links from it by counting the
occurrence of string hrefhttp
So I think http//citeseer.nj.nec.com/deerwester90
indexing.html is an authority for this topic.
And http//www.upmf-grenoble.fr/sciedu/blemaire/l
sa.html is a hub Thanks to Qi Zhong
17
  • 11 Would a rule-based learner or collaborative
    filtering be more appropriate for filtering your
    e-mail? Why?
  • Rule based It learns from YOUR behavior
  • Collaborative Compares you to others, but you
    are the only one reading your mail, so cant help
    filter. (It might do okay at SPAM since many
    people read that, but SPAM is in the eye of the
    beholder, some people do want to help smuggle
    17M from Nigeria

18
  • 12 . Im interested in going to the machine
    learning conference this year. I typed Machine
    learning conference in to Google and it returned
    the following result
  • .1998 International Machine Learning
    Conference.ICML-2001.ICML2002 -
    Sydney.ICML-2000.ECML/PKDD-2002.ICML-99.ECML/P
    KDD-2001 Home.ICML 2003.International
    Conferences on Machine Learning and Applications
  • However, most people are more interested in the
    future conference than the past conference. Why
    is the 2003 conference ninth instead of first?
  • More links to older conference 172 172 vs 162.
  • Possibly more important links to it

19
13.  How could this happen, i.e., why did
google rank Microsoft first to the query Go to
hell? (although this is no longer first
"Microoft" gets you to Microsoft even though
the home page obviously doesnt contain the term
"Microoft" or Go to hell.   Microsoft tops
Google hell search rankings By John
Lettice Posted 18/09/2002 at 1730
GMT   Weirdisms in Google's ranking system are
sent to entertain us occasionally, and as we've
had this one from a couple of people today, we
suspect it's new. But don't blame us if it's not,
we lead sheltered lives sometimes. Currently,
if you type "go to hell" into Google, then
Microsoft Corporation, Where do you want to go
today? comes first.
20
  • Some people created web pages with the link text
    being Go to Hell and the destination microsoft.
    E.g.,
  • lta hrefhttp//www.microsoft.comgtGo To Helllt/agt
  • Microsoft has many links pointing to it, so its
    pagerank score is high
  • Google indexes terms in the link as well as terms
    in the page.

21
Final comments
  • This practice exam is a random sample of
    questions drawn from the distribution of possible
    exam questions. Topics not on this exam, but in
    the study guide for papers and study guide for
    the text can appear on the final exam.
  • Pay attention to instructions
  • USE THE SUBJECT indicated.
  • Re Practice Exam for ICS 207" so I can find it
    easily
  • How long did it take?
  • Final exam emailed at 6am PST March 16, due
    1159PM March 17 and posted on web site.
  • Final report format email by March 9 5 pages max
Write a Comment
User Comments (0)
About PowerShow.com